Internal and external validity of regression studies

Goal of this study

  • Internal versus External Validity
  • Testing for Heteroskedasticity

A statistical analysis has internal validity if the statistical inferences made about causal effects are valid for the population being studied.

An analysis has external validity if its inferences and conclusions are valid not only for the study's population but can also be generalized to other populations and settings.

Internal and External Validity of Regression

  • Internal validity is satisfied if the statistical inferences about the causal effects are valid for the population being studied.
  • External validity requires that the inferences and conclusions can be generalized from the population and setting studied to other populations and settings.
  • Example: Think about the California test score example.
  • Is the slope, \(\beta_{STR}\), unbiased and consistent?
  • Is this a valid estimate for NY, WV, NM, . . . ?

Threats to internal validity

  1. Omitted Variable Bias

  2. Misspecification of Functional Form of Regression Model

  3. Measurement Error

  4. Missing Data and Sample Selection

  5. Simultaneous Causality

In each case, OLS assumption #1 is violated: \(E(u_i \mid X_{1i}, X_{2i}, \dots, X_{ki}) \neq 0.\)

Omitted variable bias

  • If you have the data for omitted variables:
    • Be specific about your coefficient of interest.
    • Use a priori reasoning for adding variables.
    • Use statistical tests for questionable variables (t and F); a short sketch follows this list.
    • Provide all the potential specifications in tabular form.
  • What if you don’t have the data:
    • Panel data (next chapter)
    • IV method (chapter 12)
    • Randomized control experiments
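A minimal sketch of putting the t and F advice into practice, assuming the caschool data that is loaded later in this document (base_mod and extra_mod are illustrative names):

Code
# compare a base specification with one that adds a questionable control
# (uses the caschool data and variables testscr, str, el_pct, meal_pct)
base_mod  <- lm(testscr ~ str + el_pct, data = caschool)
extra_mod <- lm(testscr ~ str + el_pct + meal_pct, data = caschool)
summary(extra_mod)          # t statistics, including the added coefficient
anova(base_mod, extra_mod)  # F test comparing the two specifications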

Misspecification of Functional Form of Regression Model

  • For continuous dependent variable: Use methods discussed in chapter 8 to modify functional forms.
  • For discrete or binary dependent variable: Chapter 11 (we will get there. . . ).

Suppose the true model is \(Y_i = X_i^2 + u_i\) but one estimates \(Y_i=\beta_0+\beta_1X_{i}+u_i\).

Code

# set seed for reproducibility
set.seed(5)
library(tidyverse)
# simulate data set
X <- runif(100, -5, 5)
Y <- X^2 + rnorm(100)
df <- data.frame(X, Y)
Code
# estimate the regression function
ms_mod <- lm(Y ~ X)
ms_mod
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##       9.017        0.435

Plot this linear fit

Code
ggplot(df) + aes(X, Y) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
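One way to repair this particular misspecification, following the chapter 8 approach of modifying the functional form (a sketch using the simulated df above; quad_mod is an illustrative name), is to add a quadratic term:

Code
# re-estimate with a quadratic term so the fitted form matches Y = X^2 + u
quad_mod <- lm(Y ~ X + I(X^2), data = df)
quad_mod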

Measurement Error in \(X\)

General regression model: \(Y_i = \beta_0 + \beta_1X_{i} + u_{i}\). What you would like to measure is \(X_i\), but what you actually measure is \(\overset{\sim}{X}_i\). The measurement error is then \(X_i - \overset{\sim}{X}_i\).

  • Regression model with measurement error: \(\overset{\sim}{X}_i\) is used instead of \(X_i\). Collecting the error terms as \(v_i = \beta_1 (X_i - \overset{\sim}{X}_i)+u_i\), the model becomes

\[\begin{align*} Y_i =& \, \beta_0 + \beta_1 \overset{\sim}{X}_i + \underbrace{\beta_1 (X_i - \overset{\sim}{X}_i) + u_i}_{=v_i} \\ Y_i =& \, \beta_0 + \beta_1 \overset{\sim}{X}_i + v_i \end{align*}\]

Here \(\overset{\sim}{X}_i\) and \(v_i\) are correlated, which causes the slope estimator \(\hat{\beta}_1\) to be biased.

The classical measurement error model assumes that the measurement error, \(w_i\), has zero mean and is uncorrelated with both the true variable, \(X_i\), and the error term of the population regression model, \(u_i\):

\[\begin{equation} \overset{\sim}{X}_i = X_i + w_i, \ \ \rho_{w_i,u_i}=0, \ \ \rho_{w_i,X_i}=0 \end{equation}\]

Under these assumptions,

\[\begin{equation} \widehat{\beta}_1 \xrightarrow{p} \frac{\sigma_{X}^2}{\sigma_{X}^2 + \sigma_{w}^2} \beta_1 \tag{9.1} \end{equation}\]

Since \(\sigma_{X}^2, \sigma_{w}^2 > 0\), the factor \(\sigma_{X}^2/(\sigma_{X}^2 + \sigma_{w}^2)\) is smaller than 1, so \(\widehat{\beta}_1\) is biased toward zero (attenuation bias).

  1. If there is no measurement error, \(\sigma_{w}^2=0\), so that \(\widehat{\beta}_1 \xrightarrow{p} \beta_1\).

  2. If \(\sigma_{w}^2 \gg \sigma_{X}^2\), we have \(\widehat{\beta}_1 \xrightarrow{p} 0\).

This is the case if the measurement error is so large that there essentially is no information on \(X\) in the data that can be used to estimate \(\beta\).

Suppose \((X, Y)\) are jointly normally distributed:

\[\begin{align} (X, Y) \sim \mathcal{N}\left[\begin{pmatrix}50\\ 100\end{pmatrix},\begin{pmatrix}10 & 5 \\ 5 & 10 \end{pmatrix}\right] \tag{9.3} \end{align}\]

Because the joint distribution is normal, the population regression of \(Y\) on \(X\) is linear with slope \(\sigma_{XY}/\sigma_X^2 = 5/10 = 0.5\):

\[\begin{align*} Y_i =& \, 100 + 0.5 (X_i - 50) \\ =& \, 75 + 0.5 X_i. \tag{9.4} \end{align*}\]

The observed regressor is \(\overset{\sim}{X_i} = X_i + w_i\) with \(w_i \overset{i.i.d.}{\sim} \mathcal{N}(0,10)\), where \(w_i\) is independent of \(X_i\).

Code
# set seed
set.seed(1)

# load the package 'mvtnorm' and simulate bivariate normal data
library(mvtnorm)
dat <- data.frame(
  rmvnorm(1000, c(50, 100), 
          sigma = cbind(c(10, 5), c(5, 10))))

# set columns names
colnames(dat) <- c("X", "Y")

Plots

Code
# estimate the model (without measurement error)
noerror_mod <- lm(Y ~ X, data = dat)

# estimate the model (with measurement error in X)
dat$X <- dat$X + rnorm(n = 1000, sd = sqrt(10))
error_mod <- lm(Y ~ X, data = dat)

# print estimated coefficients to console
noerror_mod$coefficients
## (Intercept)           X 
##      76.300       0.476
error_mod$coefficients
## (Intercept)           X 
##      87.276       0.255
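As a quick check against equation (9.1): with \(\beta_1 = 0.5\), \(\sigma_X^2 = 10\), and \(\sigma_w^2 = 10\) from the example above, the predicted probability limit of the slope is \(0.5 \cdot 10/(10 + 10) = 0.25\), close to the estimated 0.255.

Code
# predicted plim of the slope under classical measurement error (eq. 9.1)
0.5 * 10 / (10 + 10)
## [1] 0.25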

Plot in ggplot2

Code
# add a second regressor with additional measurement error on top of X
dat$X1 <- dat$X + rnorm(n = 1000, sd = sqrt(10))

# scatter of (X, Y) with the two fitted lines:
# red uses X, blue uses the noisier X1
ggplot(dat) +
  aes(X, Y) +
  geom_smooth(aes(X, Y), method = "lm", se = FALSE, color = "red") +
  geom_smooth(aes(X1, Y), method = "lm", se = FALSE, color = "blue") +
  geom_point()

Missing Data and Sample Selection

  1. Missing at random: There is no bias, but it reduces our sample size.
  2. Missing regressor values: Same as above.
  3. Missing \(Y\) due to selection process (sample selection bias): For an example of this type of bias, think about why older baseball players are, on average, better than younger ones (see the small selection sketch after this list).
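A minimal simulation sketch of this selection mechanism (the variable names and the cutoff are illustrative assumptions, not part of the text): keeping only observations with large \(Y\) attenuates the estimated slope.

Code
# simulate a population where the true slope is 1
set.seed(123)
x <- rnorm(2000)
y <- 1 + x + rnorm(2000)

# sample selection on Y: only units with y above a cutoff are observed
sel <- y > 1
coef(lm(y ~ x))            # full sample: slope close to 1
coef(lm(y[sel] ~ x[sel]))  # selected sample: slope biased toward zero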

Simultaneous Causality

So far we have assumed that changes in the independent variable \(X\) cause changes in the dependent variable \(Y\). When the reverse is also true, we say that there is simultaneous causality between \(X\) and \(Y\). This reverse causality induces correlation between \(X\) and the error term in the population regression of interest, so the coefficient on \(X\) is estimated with bias, as the sketch below illustrates.
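A toy simulation sketch of this bias (all numbers are chosen purely for illustration): \(Y\) and \(X\) are determined jointly, and OLS applied to the equation for \(Y\) does not recover the true coefficient of 0.5.

Code
# illustrative structural system: Y = 1 + 0.5 X + u  and  X = 2 + 0.8 Y + v
set.seed(42)
n <- 1000
u <- rnorm(n)
v <- rnorm(n)

# reduced form obtained by solving the two equations simultaneously
Y <- (2 + 0.5 * v + u) / (1 - 0.5 * 0.8)
X <- 2 + 0.8 * Y + v

coef(lm(Y ~ X))  # slope on X is biased away from the true value 0.5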

Summary

Threats to Internal Validity of a Regression Study

The five primary threats to internal validity of a multiple regression study are:

  • Omitted variables

  • Misspecification of functional form

  • Errors in variables (measurement errors in the regressors)

  • Sample selection

  • Simultaneous causality

All these threats lead to a violation of the first least squares assumption, so that \(E(u_i\vert X_{1i},\dots ,X_{ki}) \neq 0\) and the OLS estimator is biased and inconsistent.

Furthermore, if one does not adjust for heteroskedasticity and/or serial correlation, incorrect standard errors may be a threat to internal validity of the study.
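A minimal sketch of such an adjustment, assuming the lmtest and sandwich packages and the caschool data that are loaded in the next section (robust_mod is an illustrative name): heteroskedasticity-robust (HC1) standard errors for a simple test score regression.

Code
# heteroskedasticity-robust standard errors for an illustrative specification
library(lmtest)
library(sandwich)

robust_mod <- lm(testscr ~ str + el_pct + meal_pct, data = caschool)
coeftest(robust_mod, vcov. = vcovHC(robust_mod, type = "HC1"))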

Uploaded data and required R packages

Data sets can be downloaded from Stock and Watson. We are working here with the California school test score data used in Chapters 4-9, along with the Massachusetts school data (mcas.dta). For further details, one may visit Nishant Yonzan's page.

Code
library(haven)  # to read data from Stata/SPSS files we use the haven package

caschool<- read_dta("data/caschool.dta")
maschool <- read_dta("data/mcas.dta")
Code
library(qwraps2)  # a collection of wrapper functions for reproducible reports
options(dplyr.width = Inf)  # force R to print all columns

# CA school mean and sd
ca_mean <- t(
  summarise(caschool,
    "Test Score" = mean(testscr),
    "Student-Teacher Ratio" = mean(str),
    "% English Learners" = mean(el_pct),
    "% Receiving Lunch Subsidy" = mean(meal_pct),
    "Average District Income ($)" = mean(avginc * 1000),
    count = n()
  )
)

ca_sd <- t(
  summarise(caschool,
    "Test Score" = sd(testscr),
    "Student-Teacher Ratio" = sd(str),
    "% English Learners" = sd(el_pct),
    "% Receiving Lunch Subsidy" = sd(meal_pct),
    "Average District Income ($)" = sd(avginc * 1000),
    count = n()
  )
)
Code
# MA school mean and sd
ma_mean <- t(
  summarise(maschool,
    "Test Score" = mean(totsc4),
    "Student-Teacher Ratio" = mean(tchratio),
    "% English Learners" = mean(pctel),
    "% Receiving Lunch Subsidy" = mean(lnch_pct),
    "Average District Income ($)" = mean(percap * 1000),
    count = n()
  )
)

ma_sd <- t(
  summarise(maschool,
    "Test Score" = sd(totsc4),
    "Student-Teacher Ratio" = sd(tchratio),
    "% English Learners" = sd(pctel),
    "% Receiving Lunch Subsidy" = sd(lnch_pct),
    "Average District Income ($)" = sd(percap * 1000),
    count = n()
  )
)
# Table using kable
library(knitr)
library(kableExtra)

kable(
  cbind(ca_mean, ca_sd, ma_mean, ma_sd),
  align = "c",
  digits = 1,
  "html",
  booktabs = TRUE
) %>%
  column_spec(2:5, width = "4em") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  add_header_above(c(" ", "Average" = 1, "SD" = 1, "Average" = 1, "SD" = 1)) %>%
  add_header_above(c(" ", "California" = 2, "Massachusetts" = 2)) %>%
  row_spec(5, hline_after = TRUE)
                               California          Massachusetts
                               Average     SD      Average     SD
Test Score                       654.2     19.1      709.8     15.1
Student-Teacher Ratio             19.6      1.9       17.3      2.3
% English Learners                15.8     18.3        1.1      2.9
% Receiving Lunch Subsidy         44.7     27.1       15.3     15.1
Average District Income ($)    15316.6   7225.9    18746.8   5807.6
count                            420.0    420.0      220.0    220.0

Multiple Regression

Code
# load packages
library(lmtest)
library(sandwich)

Create a binary variable HiEL (high share of English learners), plus squares and cubes of the student-teacher ratio and district income variables.

For CA

Code
caschool$hiel <- ifelse(caschool$el_pct > median(caschool$el_pct), 1, 0)
caschool$str_squared <- caschool$str^2
caschool$str_cubed <- caschool$str^3
caschool$inc_squared <- caschool$avginc^2
caschool$inc_cubed <- caschool$avginc^3

For MA

Code
maschool$hiel = ifelse(maschool$pctel > median(maschool$pctel), 1, 0)
maschool$str_squared = maschool$tchratio^2
maschool$str_cubed = maschool$tchratio^3
maschool$inc_squared = maschool$percap^2
maschool$inc_cubed = maschool$percap^3

Rename the MA variables to match the CA variable names

Code
maschool$testscr = maschool$totsc4
maschool$str = maschool$tchratio
maschool$el_pct = maschool$pctel
maschool$meal_pct = maschool$lnch_pct
maschool$avginc = maschool$percap

Regression

Regression models for CA

Code
CA_mod3<-lm(testscr~str+el_pct+meal_pct+avginc+inc_squared+inc_cubed, data=caschool)
CA_mod4<-lm(testscr~str+str_squared+str_cubed+el_pct+meal_pct+avginc+inc_squared+inc_cubed, data=caschool)
CA_mod5 <- lm(
testscr ~ str + meal_pct + hiel + hiel*str + avginc + inc_squared + inc_cubed,
data = caschool
)

Regression for MA

Code

MA_mod3 = lm(
testscr ~ str + el_pct + meal_pct + avginc + inc_squared + inc_cubed,
data = maschool
)
MA_mod4 = lm(
testscr ~ str + str_squared + str_cubed + el_pct + meal_pct + avginc + inc_squared + inc_cubed,
data = maschool
)
MA_mod5 = lm(
testscr ~ str + meal_pct + hiel + hiel*str + avginc + inc_squared + inc_cubed,
data = maschool
)

Comparison

Code

library(huxtable)

CA_MA <- huxreg(
  "CA(1)" = CA_mod3, "CA(2)" = CA_mod4, "CA(3)" = CA_mod5,
  "MA(1)" = MA_mod3, "MA(2)" = MA_mod4, "MA(3)" = MA_mod5,
  coefs = c(
    'STR' = 'str', 'STR^2' = 'str_squared', 'STR^3' = 'str_cubed',
    '% English learners' = 'el_pct',
    '% Eligible for free lunch' = 'meal_pct',
    'HiEL (Binary, % English learners > median)' = 'hiel',
    'HiEL X STR' = 'str:hiel', 'District income' = 'avginc',
    'District income^2' = 'inc_squared', 'District income^3' = 'inc_cubed',
    'Intercept' = '(Intercept)'
  ),
  statistics = c('N' = 'nobs', 'R^2' = 'r.squared'),
  stars = c(`*` = 0.1, `**` = 0.05)
)

CA_MA
                                              CA(1)        CA(2)        CA(3)        MA(1)        MA(2)        MA(3)
STR                                          -0.508 **    55.618 **    -0.342       -0.641 **    12.426       -1.018 **
                                             (0.228)      (24.903)     (0.318)      (0.262)      (14.309)     (0.350)
STR^2                                                     -2.915 **                              -0.680
                                                          (1.266)                                (0.781)
STR^3                                                      0.050 **                               0.011
                                                          (0.021)                                (0.014)
% English learners                           -0.205 **    -0.196 **                 -0.437       -0.434
                                             (0.032)      (0.032)                   (0.279)      (0.279)
% Eligible for free lunch                    -0.410 **    -0.412 **    -0.454 **    -0.582 **    -0.587 **    -0.709 **
                                             (0.030)      (0.030)      (0.029)      (0.077)      (0.079)      (0.072)
HiEL (Binary, % English learners > median)                              6.924                                 -12.561
                                                                       (9.100)                                (9.439)
HiEL X STR                                                             -0.612                                   0.799
                                                                       (0.463)                                 (0.548)
District income                              -1.063 *     -0.913       -0.597       -3.067       -3.382       -3.867 *
                                             (0.630)      (0.631)      (0.644)      (2.267)      (2.327)      (2.320)
District income^2                             0.073 **     0.067 **     0.051 *      0.164 *      0.174 *      0.184 **
                                             (0.026)      (0.026)      (0.026)      (0.087)      (0.089)      (0.088)
District income^3                            -0.001 **    -0.001 **    -0.001 *     -0.002 **    -0.002 **    -0.002 **
                                             (0.000)      (0.000)      (0.000)      (0.001)      (0.001)      (0.001)
Intercept                                   687.009 **   330.079 **   682.559 **   744.025 **   665.496 **   759.914 **
                                             (6.874)     (162.334)     (8.303)     (19.437)     (84.143)     (20.802)
N                                           420          420          420          220          220          220
R^2                                           0.809        0.812        0.802        0.685        0.687        0.686

** p < 0.05; * p < 0.1.