Internal and external validity of regression studies

Goal of this study

  • Internal versus External Validity
  • Testing for Heteroskedasticity

A statistical analysis has internal validity if the statistical inferences made about causal effects are valid for the population being studied.

An analysis has external validity if its inferences and conclusions are valid not only for the study's population but can also be generalized to other populations and settings.

Internal and External Validity of Regression

  • Internal validity is satisfied if the statistical inferences about the causal effects are valid for the population being studied.
  • External validity requires that the inferences and conclusions can be generalized from the population and setting studied to other populations and settings.
  • Example: Think about the California test score example.
  • Is the slope, \(\beta_{STR}\), unbiased and consistent?
  • Is this a valid estimate for NY, WV, NM, . . . ?

Threats to internal validity

  1. Omitted Variable Bias

  2. Misspecification of Functional Form of Regression Model

  3. Measurement Error

  4. Missing Data and Sample Selection

  5. Simultaneous Causality

In each case, OLS assumption #1 is violated: \(E(u_i \mid X_{1i}, X_{2i}, \dots, X_{ki}) \neq 0.\)

Omitted variable bias

  • If you have the data for omitted variables:
    • Be specific about your coefficient of interest.
    • Use a priori reasoning for adding variables.
    • Use statistical tests for questionable variables (t and F); a short sketch follows this list.
    • Provide all the potential specifications in tabular form.
  • What if you don’t have the data:
    • Panel data (next chapter)
    • IV method (chapter 12)
    • Randomized control experiments
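A minimal sketch of putting the t and F advice into practice, assuming the caschool data that is loaded later in this document (base_mod and extra_mod are illustrative names):

Code
# compare a base specification with one that adds a questionable control
# (uses the caschool data and variables testscr, str, el_pct, meal_pct)
base_mod  <- lm(testscr ~ str + el_pct, data = caschool)
extra_mod <- lm(testscr ~ str + el_pct + meal_pct, data = caschool)
summary(extra_mod)          # t statistics, including the added coefficient
anova(base_mod, extra_mod)  # F test comparing the two specifications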

Misspecification of Functional Form of Regression Model

  • For continuous dependent variable: Use methods discussed in chapter 8 to modify functional forms.
  • For discrete or binary dependent variable: Chapter 11 (we will get there. . . ).

Suppose the true model is \(Y_i = X_i^2 + u_i\) but one estimates \(Y_i=\beta_0+\beta_1X_{i}+u_i\).

Code

# set seed for reproducibility
set.seed(5)
library(tidyverse)
# simulate data set
X <- runif(100, -5, 5)
Y <- X^2 + rnorm(100)
df <- data.frame(X, Y)
Code
# estimate the regression function
ms_mod <- lm(Y ~ X)
ms_mod
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##       9.017        0.435

Plot this linear fit

Code
ggplot(df) + aes(X, Y) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
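One way to repair this particular misspecification, following the chapter 8 approach of modifying the functional form (a sketch using the simulated df above; quad_mod is an illustrative name), is to add a quadratic term:

Code
# re-estimate with a quadratic term so the fitted form matches Y = X^2 + u
quad_mod <- lm(Y ~ X + I(X^2), data = df)
quad_mod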

Measurement Error in \(X\)

General regression model: \(Y_i = \beta_0 + \beta_1X_{i} + u_{i}\). What you would like to measure is \(X_i\), but what you actually measure is \(\overset{\sim}{X}_i\). The measurement error is then \(X_i - \overset{\sim}{X}_i\).

  • Regression model with measurement error: \(\overset{\sim}{X}_i\) is used instead of \(X_i\). Collecting the error terms as \(v_i = \beta_1 (X_i - \overset{\sim}{X}_i)+u_i\), the model becomes

\[\begin{align*} Y_i =& \, \beta_0 + \beta_1 \overset{\sim}{X}_i + \underbrace{\beta_1 (X_i - \overset{\sim}{X}_i) + u_i}_{=v_i} \\ Y_i =& \, \beta_0 + \beta_1 \overset{\sim}{X}_i + v_i \end{align*}\]

Here \(\overset{\sim}{X}_i\) and \(v_i\) are correlated, which causes the slope estimator \(\hat{\beta}_1\) to be biased.

The classical measurement error model assumes that the measurement error, \(w_i\), has zero mean and is uncorrelated with both the true variable, \(X_i\), and the error term of the population regression model, \(u_i\):

\[\begin{equation} \overset{\sim}{X}_i = X_i + w_i, \ \ \rho_{w_i,u_i}=0, \ \ \rho_{w_i,X_i}=0 \end{equation}\]

Under these assumptions,

\[\begin{equation} \widehat{\beta}_1 \xrightarrow{p} \frac{\sigma_{X}^2}{\sigma_{X}^2 + \sigma_{w}^2} \beta_1 \tag{9.1} \end{equation}\]

Since \(\sigma_{X}^2, \sigma_{w}^2 > 0\), the factor \(\sigma_{X}^2/(\sigma_{X}^2 + \sigma_{w}^2)\) is smaller than 1, so \(\widehat{\beta}_1\) is biased toward zero (attenuation bias).

  1. If there is no measurement error, \(\sigma_{w}^2=0\), so that \(\widehat{\beta}_1 \xrightarrow{p} \beta_1\).

  2. If \(\sigma_{w}^2 \gg \sigma_{X}^2\), we have \(\widehat{\beta}_1 \xrightarrow{p} 0\).

This is the case if the measurement error is so large that there essentially is no information on \(X\) in the data that can be used to estimate \(\beta\).

Suppose \((X, Y)\) are jointly normally distributed:

\[\begin{align} (X, Y) \sim \mathcal{N}\left[\begin{pmatrix}50\\ 100\end{pmatrix},\begin{pmatrix}10 & 5 \\ 5 & 10 \end{pmatrix}\right] \tag{9.3} \end{align}\]

Because the joint distribution is normal, the population regression of \(Y\) on \(X\) is linear with slope \(\sigma_{XY}/\sigma_X^2 = 5/10 = 0.5\):

\[\begin{align*} Y_i =& \, 100 + 0.5 (X_i - 50) \\ =& \, 75 + 0.5 X_i. \tag{9.4} \end{align*}\]

The observed regressor is \(\overset{\sim}{X_i} = X_i + w_i\) with \(w_i \overset{i.i.d.}{\sim} \mathcal{N}(0,10)\), where \(w_i\) is independent of \(X_i\).

Code
# set seed
set.seed(1)

# load the package 'mvtnorm' and simulate bivariate normal data
library(mvtnorm)
dat <- data.frame(
  rmvnorm(1000, c(50, 100), 
          sigma = cbind(c(10, 5), c(5, 10))))

# set columns names
colnames(dat) <- c("X", "Y")

Plots

Code
# estimate the model (without measurement error)
noerror_mod <- lm(Y ~ X, data = dat)

# estimate the model (with measurement error in X)
dat$X <- dat$X + rnorm(n = 1000, sd = sqrt(10))
error_mod <- lm(Y ~ X, data = dat)

# print estimated coefficients to console
noerror_mod$coefficients
## (Intercept)           X 
##      76.300       0.476
error_mod$coefficients
## (Intercept)           X 
##      87.276       0.255
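As a quick check against equation (9.1): with \(\beta_1 = 0.5\), \(\sigma_X^2 = 10\), and \(\sigma_w^2 = 10\) from the example above, the predicted probability limit of the slope is \(0.5 \cdot 10/(10 + 10) = 0.25\), close to the estimated 0.255.

Code
# predicted plim of the slope under classical measurement error (eq. 9.1)
0.5 * 10 / (10 + 10)
## [1] 0.25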

Plot in ggplot2

Code
# add a second regressor with additional measurement error on top of X
dat$X1 <- dat$X + rnorm(n = 1000, sd = sqrt(10))

# scatter of (X, Y) with the two fitted lines:
# red uses X, blue uses the noisier X1
ggplot(dat) +
  aes(X, Y) +
  geom_smooth(aes(X, Y), method = "lm", se = FALSE, color = "red") +
  geom_smooth(aes(X1, Y), method = "lm", se = FALSE, color = "blue") +
  geom_point()

Missing Data and Sample Selection

  1. Missing at random: There is no bias, but it reduces our sample size.
  2. Missing regressor values: Same as above.
  3. Missing \(Y\) due to selection process (sample selection bias): For an example of this type of bias, think about why older baseball players are, on average, better than younger ones (see the small selection sketch after this list).
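A minimal simulation sketch of this selection mechanism (the variable names and the cutoff are illustrative assumptions, not part of the text): keeping only observations with large \(Y\) attenuates the estimated slope.

Code
# simulate a population where the true slope is 1
set.seed(123)
x <- rnorm(2000)
y <- 1 + x + rnorm(2000)

# sample selection on Y: only units with y above a cutoff are observed
sel <- y > 1
coef(lm(y ~ x))            # full sample: slope close to 1
coef(lm(y[sel] ~ x[sel]))  # selected sample: slope biased toward zero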

Simultaneous Causality

So far we have assumed that changes in the independent variable \(X\) cause changes in the dependent variable \(Y\). When the reverse is also true, we say that there is simultaneous causality between \(X\) and \(Y\). This reverse causality induces correlation between \(X\) and the error term in the population regression of interest, so the coefficient on \(X\) is estimated with bias, as the sketch below illustrates.
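A toy simulation sketch of this bias (all numbers are chosen purely for illustration): \(Y\) and \(X\) are determined jointly, and OLS applied to the equation for \(Y\) does not recover the true coefficient of 0.5.

Code
# illustrative structural system: Y = 1 + 0.5 X + u  and  X = 2 + 0.8 Y + v
set.seed(42)
n <- 1000
u <- rnorm(n)
v <- rnorm(n)

# reduced form obtained by solving the two equations simultaneously
Y <- (2 + 0.5 * v + u) / (1 - 0.5 * 0.8)
X <- 2 + 0.8 * Y + v

coef(lm(Y ~ X))  # slope on X is biased away from the true value 0.5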

Summary

Threats to Internal Validity of a Regression Study

The five primary threats to internal validity of a multiple regression study are:

  • Omitted variables

  • Misspecification of functional form

  • Errors in variables (measurement errors in the regressors)

  • Sample selection

  • Simultaneous causality

All these threats lead to a violation of the first least squares assumption, so that \(E(u_i\vert X_{1i},\dots ,X_{ki}) \neq 0\) and the OLS estimator is biased and inconsistent.

Furthermore, if one does not adjust for heteroskedasticity and/or serial correlation, incorrect standard errors may be a threat to internal validity of the study.
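A minimal sketch of such an adjustment, assuming the lmtest and sandwich packages and the caschool data that are loaded in the next section (robust_mod is an illustrative name): heteroskedasticity-robust (HC1) standard errors for a simple test score regression.

Code
# heteroskedasticity-robust standard errors for an illustrative specification
library(lmtest)
library(sandwich)

robust_mod <- lm(testscr ~ str + el_pct + meal_pct, data = caschool)
coeftest(robust_mod, vcov. = vcovHC(robust_mod, type = "HC1"))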

Uploaded data and required R packages

Data sets can be downloaded from Stock and Watson. We are working here with the California school test score data used in Chapters 4-9, along with the Massachusetts school data (mcas.dta). For further details, one may visit Nishant Yonzan's page.

Code
library(haven)  # to read data from Stata/SPSS files we use the haven package

caschool<- read_dta("data/caschool.dta")
maschool <- read_dta("data/mcas.dta")
Code
library(qwraps2)  # a collection of wrapper functions for reproducible reports
options(dplyr.width = Inf)  # force R to print all columns

# CA school mean and sd
ca_mean <- t(
  summarise(caschool,
    "Test Score" = mean(testscr),
    "Student-Teacher Ratio" = mean(str),
    "% English Learners" = mean(el_pct),
    "% Receiving Lunch Subsidy" = mean(meal_pct),
    "Average District Income ($)" = mean(avginc * 1000),
    count = n()
  )
)

ca_sd <- t(
  summarise(caschool,
    "Test Score" = sd(testscr),
    "Student-Teacher Ratio" = sd(str),
    "% English Learners" = sd(el_pct),
    "% Receiving Lunch Subsidy" = sd(meal_pct),
    "Average District Income ($)" = sd(avginc * 1000),
    count = n()
  )
)
Code
# MA school mean and sd
ma_mean <- t(
  summarise(maschool,
    "Test Score" = mean(totsc4),
    "Student-Teacher Ratio" = mean(tchratio),
    "% English Learners" = mean(pctel),
    "% Receiving Lunch Subsidy" = mean(lnch_pct),
    "Average District Income ($)" = mean(percap * 1000),
    count = n()
  )
)

ma_sd <- t(
  summarise(maschool,
    "Test Score" = sd(totsc4),
    "Student-Teacher Ratio" = sd(tchratio),
    "% English Learners" = sd(pctel),
    "% Receiving Lunch Subsidy" = sd(lnch_pct),
    "Average District Income ($)" = sd(percap * 1000),
    count = n()
  )
)
# Table using kable
library(knitr)
library(kableExtra)

kable(
  cbind(ca_mean, ca_sd, ma_mean, ma_sd),
  align = "c",
  digits = 1,
  "html",
  booktabs = TRUE
) %>%
  column_spec(2:5, width = "4em") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  add_header_above(c(" ", "Average" = 1, "SD" = 1, "Average" = 1, "SD" = 1)) %>%
  add_header_above(c(" ", "California" = 2, "Massachusetts" = 2)) %>%
  row_spec(5, hline_after = TRUE)
                               California          Massachusetts
                               Average     SD      Average     SD
Test Score                       654.2     19.1      709.8     15.1
Student-Teacher Ratio             19.6      1.9       17.3      2.3
% English Learners                15.8     18.3        1.1      2.9
% Receiving Lunch Subsidy         44.7     27.1       15.3     15.1
Average District Income ($)    15316.6   7225.9    18746.8   5807.6
count                            420.0    420.0      220.0    220.0

Multiple Regression

Code
# load packages
library(lmtest)
library(sandwich)

Create a binary variable HiEL (high share of English learners), plus squares and cubes of the student-teacher ratio and district income variables.

For CA

Code
caschool$hiel <- ifelse(caschool$el_pct > median(caschool$el_pct), 1, 0)
caschool$str_squared <- caschool$str^2
caschool$str_cubed <- caschool$str^3
caschool$inc_squared <- caschool$avginc^2
caschool$inc_cubed <- caschool$avginc^3

For MA

Code
maschool$hiel = ifelse(maschool$pctel > median(maschool$pctel), 1, 0)
maschool$str_squared = maschool$tchratio^2
maschool$str_cubed = maschool$tchratio^3
maschool$inc_squared = maschool$percap^2
maschool$inc_cubed = maschool$percap^3

Rename the MA variables to match the CA variable names

Code
maschool$testscr = maschool$totsc4
maschool$str = maschool$tchratio
maschool$el_pct = maschool$pctel
maschool$meal_pct = maschool$lnch_pct
maschool$avginc = maschool$percap

Regression

Regression models for CA

Code
CA_mod3<-lm(testscr~str+el_pct+meal_pct+avginc+inc_squared+inc_cubed, data=caschool)
CA_mod4<-lm(testscr~str+str_squared+str_cubed+el_pct+meal_pct+avginc+inc_squared+inc_cubed, data=caschool)
CA_mod5 <- lm(
testscr ~ str + meal_pct + hiel + hiel*str + avginc + inc_squared + inc_cubed,
data = caschool
)

Regression for MA

Code

MA_mod3 = lm(
testscr ~ str + el_pct + meal_pct + avginc + inc_squared + inc_cubed,
data = maschool
)
MA_mod4 = lm(
testscr ~ str + str_squared + str_cubed + el_pct + meal_pct + avginc + inc_squared + inc_cubed,
data = maschool
)
MA_mod5 = lm(
testscr ~ str + meal_pct + hiel + hiel*str + avginc + inc_squared + inc_cubed,
data = maschool
)

Comparison

Code

library(huxtable)

CA_MA <- huxreg(
  "CA(1)" = CA_mod3, "CA(2)" = CA_mod4, "CA(3)" = CA_mod5,
  "MA(1)" = MA_mod3, "MA(2)" = MA_mod4, "MA(3)" = MA_mod5,
  coefs = c(
    'STR' = 'str', 'STR^2' = 'str_squared', 'STR^3' = 'str_cubed',
    '% English learners' = 'el_pct',
    '% Eligible for free lunch' = 'meal_pct',
    'HiEL (Binary, % English learners > median)' = 'hiel',
    'HiEL X STR' = 'str:hiel', 'District income' = 'avginc',
    'District income^2' = 'inc_squared', 'District income^3' = 'inc_cubed',
    'Intercept' = '(Intercept)'
  ),
  statistics = c('N' = 'nobs', 'R^2' = 'r.squared'),
  stars = c(`*` = 0.1, `**` = 0.05)
)

CA_MA
                                              CA(1)        CA(2)        CA(3)        MA(1)        MA(2)        MA(3)
STR                                          -0.508 **    55.618 **    -0.342       -0.641 **    12.426       -1.018 **
                                             (0.228)      (24.903)     (0.318)      (0.262)      (14.309)     (0.350)
STR^2                                                     -2.915 **                              -0.680
                                                          (1.266)                                (0.781)
STR^3                                                      0.050 **                               0.011
                                                          (0.021)                                (0.014)
% English learners                           -0.205 **    -0.196 **                 -0.437       -0.434
                                             (0.032)      (0.032)                   (0.279)      (0.279)
% Eligible for free lunch                    -0.410 **    -0.412 **    -0.454 **    -0.582 **    -0.587 **    -0.709 **
                                             (0.030)      (0.030)      (0.029)      (0.077)      (0.079)      (0.072)
HiEL (Binary, % English learners > median)                              6.924                                 -12.561
                                                                       (9.100)                                (9.439)
HiEL X STR                                                             -0.612                                   0.799
                                                                       (0.463)                                 (0.548)
District income                              -1.063 *     -0.913       -0.597       -3.067       -3.382       -3.867 *
                                             (0.630)      (0.631)      (0.644)      (2.267)      (2.327)      (2.320)
District income^2                             0.073 **     0.067 **     0.051 *      0.164 *      0.174 *      0.184 **
                                             (0.026)      (0.026)      (0.026)      (0.087)      (0.089)      (0.088)
District income^3                            -0.001 **    -0.001 **    -0.001 *     -0.002 **    -0.002 **    -0.002 **
                                             (0.000)      (0.000)      (0.000)      (0.001)      (0.001)      (0.001)
Intercept                                   687.009 **   330.079 **   682.559 **   744.025 **   665.496 **   759.914 **
                                             (6.874)     (162.334)     (8.303)     (19.437)     (84.143)     (20.802)
N                                           420          420          420          220          220          220
R^2                                           0.809        0.812        0.802        0.685        0.687        0.686

** p < 0.05; * p < 0.1.