A statistical analysis has internal validity if the statistical inferences made about causal effects are valid for the population considered.
An analysis is said to have external validity if its inferences and conclusions are valid not only for the study's population but can also be generalized to other populations and settings.
Omitted Variable Bias
Misspecification of Functional Form of Regression Model
Measurement Error
Missing Data and Sample Selection
Simultaneous Causality
In each case, the first OLS assumption is violated: \(E(u_i\vert X_{1i}, X_{2i}, \dots,X_{ki})\neq 0.\)
Suppose the true model is \(Y_i = X_i^2 + u_i\), but one estimates the linear model \(Y_i=\beta_0+\beta_1X_{i}+u_i\).
# set seed for reproducibility
set.seed(5)
library(tidyverse)

# simulate data from the quadratic model
X <- runif(100, -5, 5)
Y <- X^2 + rnorm(100)
df <- data.frame(X, Y)

# estimate the (misspecified) linear regression function
ms_mod <- lm(Y ~ X, data = df)
ms_mod
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 9.017 0.435
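For comparison, here is a minimal sketch of estimating the correctly specified quadratic model on the same simulated data and plotting both fits (the object name quad_mod is chosen here for illustration):

# estimate the correctly specified quadratic model on the same data
quad_mod <- lm(Y ~ I(X^2), data = df)
quad_mod

# compare the misspecified linear fit (red) with the quadratic fit (blue)
ggplot(df, aes(X, Y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  geom_smooth(method = "lm", formula = y ~ I(x^2), se = FALSE, color = "blue")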
General regression model: \(Y_i = \beta_0 + \beta_1X_{i} + u_{i}\). What you would like to measure is \(X_i\), but what you actually measure is \(\overset{\sim}{X}_i\). The measurement error is then \(X_i - \overset{\sim}{X}_i\).
Rewriting the model in terms of \(\overset{\sim}{X}_i\) and defining \(v_i = \beta_1 (X_i - \overset{\sim}{X}_i)+u_i\) gives
\[\begin{align*} Y_i =& \, \beta_0 + \beta_1 \overset{\sim}{X}_i + \underbrace{\beta_1 (X_i - \overset{\sim}{X}_i) + u_i}_{=v_i} \\ Y_i =& \, \beta_0 + \beta_1 \overset{\sim}{X}_i + v_i \end{align*}\]
where \(\overset{\sim}{X}_i\) and \(v_i\) are correlated, which causes the slope estimator \(\hat{\beta}_1\) to be biased.
The classical measurement error model assumes that the measurement error, \(w_i\), has zero mean and that it is uncorrelated with the true value, \(X_i\), and with the error term of the population regression model, \(u_i\): \[\begin{equation} \overset{\sim}{X}_i = X_i + w_i, \ \ \rho_{w_i,u_i}=0, \ \ \rho_{w_i,X_i}=0 \end{equation}\] Under these assumptions it can be shown that \[\begin{equation} \widehat{\beta}_1 \xrightarrow{p} \frac{\sigma_{X}^2}{\sigma_{X}^2 + \sigma_{w}^2} \beta_1 \tag{9.1} \end{equation}\]
Since \(\sigma_{X}^2, \sigma_{w}^2 > 0\), the factor \(\frac{\sigma_{X}^2}{\sigma_{X}^2 + \sigma_{w}^2}\) is smaller than 1, so \(\hat{\beta}_1\) is biased toward zero (attenuation bias).
If there is no measurement error, \(\sigma_{w}^2=0\), so that \(\widehat{\beta}_1 \xrightarrow{p} \beta_1\).
If \(\sigma_{w}^2 \gg \sigma_{X}^2\), we have \(\widehat{\beta}_1 \xrightarrow{p} 0\).
This is the case if the measurement error is so large that there essentially is no information on \(X\) in the data that can be used to estimate \(\beta\).
As an illustration, suppose that \((X, Y)\) are jointly normally distributed,
\[\begin{align} (X, Y) \sim \mathcal{N}\left[\begin{pmatrix}50\\ 100\end{pmatrix},\begin{pmatrix}10 & 5 \\ 5 & 10 \end{pmatrix}\right], \tag{9.3} \end{align}\]
so that the population regression function is
\[\begin{align} E(Y_i \vert X_i) =& \, 100 + 0.5 (X_i - 50) \\ =& \, 75 + 0.5 X_i. \tag{9.4} \end{align}\]
Instead of \(X_i\) we observe \(\overset{\sim}{X}_i = X_i + w_i\), where \(w_i \overset{i.i.d.}{\sim} \mathcal{N}(0,10)\) is independent of \(X_i\).
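The code below uses a data set `dat` containing draws from the joint distribution in (9.3). A minimal sketch of how it can be generated (the sample size of 1000 matches the rnorm() calls below; the seed value is arbitrary):

# draw 1000 observations from the bivariate normal distribution in (9.3)
set.seed(1)

dat <- data.frame(
  MASS::mvrnorm(n = 1000,
                mu = c(50, 100),
                Sigma = matrix(c(10, 5, 5, 10), ncol = 2))
)
colnames(dat) <- c("X", "Y")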
# estimate the model using the correctly measured X
noerror_mod <- lm(Y ~ X, data = dat)

# add classical measurement error and estimate the model using the
# error-ridden regressor X1
dat$X1 <- dat$X + rnorm(n = 1000, sd = sqrt(10))
error_mod <- lm(Y ~ X1, data = dat)
# print estimated coefficients to console
noerror_mod$coefficients
## (Intercept) X
## 76.300 0.476
error_mod$coefficients
## (Intercept) X1
## 87.276 0.255
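The attenuation matches the prediction of equation (9.1): with \(\sigma_{X}^2 = 10\) and \(\sigma_{w}^2 = 10\),
\[\widehat{\beta}_1 \xrightarrow{p} \frac{10}{10+10} \cdot 0.5 = 0.25,\]
which is close to the slope estimate of 0.255 obtained with the error-ridden regressor, while the estimate based on the correctly measured \(X\) is close to the true value of 0.5.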
# plot the data and both fitted regression lines:
#   red  - regression of Y on the correctly measured X
#   blue - regression of Y on the error-ridden X1
ggplot(dat, aes(X, Y)) +
  geom_point() +
  geom_smooth(aes(X, Y), method = "lm", se = FALSE, color = "red") +
  geom_smooth(aes(X1, Y), method = "lm", se = FALSE, color = "blue")
Simultaneous Causality Bias
So far we have assumed that changes in the independent variable \(X\) are responsible for changes in the dependent variable \(Y\). When the reverse is also true, we say that there is simultaneous causality between \(X\) and \(Y\). This reverse causality leads to correlation between \(X\) and the error term in the population regression of interest, so that the coefficient on \(X\) is estimated with bias.
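A minimal simulation sketch of this bias (the coefficient values 0.5 are arbitrary illustration choices, not taken from the text): both equations of the simultaneous system are solved for their reduced form, and OLS on the equation for \(Y\) is biased because \(X\) is correlated with \(u\).

# simultaneous system:  Y = b1 * X + u   and   X = g1 * Y + v
set.seed(123)
n  <- 1000
b1 <- 0.5   # true causal effect of X on Y
g1 <- 0.5   # reverse effect of Y on X
u  <- rnorm(n)
v  <- rnorm(n)

# reduced-form solutions of the two simultaneous equations
Y <- (b1 * v + u) / (1 - b1 * g1)
X <- (g1 * u + v) / (1 - b1 * g1)

# OLS is biased upward here because X is positively correlated with u
lm(Y ~ X)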
The five primary threats to internal validity of a multiple regression study are:
Omitted variables
Misspecification of functional form
Errors in variables (measurement errors in the regressors)
Sample selection
Simultaneous causality
All these threats cause \(E(u_i\vert X_{1i},\dots ,X_{ki}) \neq 0\), a violation of the first least squares assumption, so that the OLS estimator is biased and inconsistent.
Furthermore, if one does not adjust for heteroskedasticity and/or serial correlation, incorrect standard errors may be a threat to the internal validity of the study.
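As an illustration of such an adjustment, here is a minimal sketch of computing heteroskedasticity-robust standard errors with the sandwich and lmtest packages, applied to the ms_mod regression from above (these packages are not used elsewhere in this analysis):

# heteroskedasticity-robust (HC1) standard errors for ms_mod
library(lmtest)
library(sandwich)

coeftest(ms_mod, vcov. = vcovHC(ms_mod, type = "HC1"))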
Data sets can be downloaded from the Stock and Watson companion website. We are working here with the California Test Score data used in Chapters 4-9, together with the comparable Massachusetts (MCAS) data. For further details, one may visit Nishant Yonzan's page.
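A minimal sketch of loading the two data sets, assuming the Stock and Watson Stata files have been downloaded to the working directory (the file names are assumptions):

# read the California and Massachusetts district-level data
library(haven)
caschool <- read_dta("caschool.dta")   # California test score data
maschool <- read_dta("mcas.dta")       # Massachusetts (MCAS) data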
library(qwraps2) ## A collection of wrapper functions for reproducible reports
options(dplyr.width = Inf) # to force R to print all values
# CA school mean and sd
ca_mean <-
t(
summarise(caschool,
"Test Score" = mean(testscr),
"Student-Teacher Ratio" = mean(str),
"% English Learners" = mean(el_pct),
"% Receiving Lunch Subsidy" = mean(meal_pct),
"Average District Income ($)" = mean(avginc*1000),
count=n()
)
)
ca_sd <-
t(
summarise(caschool,
"Test Score" = sd(testscr),
"Student-Teacher Ratio" = sd(str),
"% English Learners" = sd(el_pct),
"% Receiving Lunch Subsidy" = sd(meal_pct),
"Average District Income ($)" = sd(avginc*1000),
count=n()
))
# MA school mean and sd
ma_mean <-
t(
summarise(maschool,
"Test Score" = mean(totsc4),
"Student-Teacher Ratio" = mean(tchratio),
"% English Learners" = mean(pctel),
"% Receiving Lunch Subsidy" = mean(lnch_pct),
"Average District Income ($)" =
mean(percap*1000),
count=n()
)
)
ma_sd <-
t(
summarise(maschool,
"Test Score" = sd(totsc4),
"Student-Teacher Ratio" = sd(tchratio),
"% English Learners" = sd(pctel),
"% Receiving Lunch Subsidy" = sd(lnch_pct),
"Average District Income ($)" = sd(percap*1000),
count=n()
)
)
## Table using kable:
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
kable(
cbind(ca_mean,ca_sd,ma_mean,ma_sd),
align = "c",
digits = 1,
"html",
booktabs = TRUE
) %>%
column_spec(
2:5, width = "4em") %>%
kable_styling(
bootstrap_options = "striped", full_width = FALSE) %>%
add_header_above(
c(" ", "Average" = 1, "SD"=1, "Average" = 1, "SD" = 1)) %>%
add_header_above(
c(" " , "California" = 2, "Massachusetts" = 2)) %>%
row_spec(5,hline_after = TRUE)
|                             | California Average | California SD | Massachusetts Average | Massachusetts SD |
|-----------------------------|-------------------:|--------------:|----------------------:|-----------------:|
| Test Score                  | 654.2              | 19.1          | 709.8                 | 15.1             |
| Student-Teacher Ratio       | 19.6               | 1.9           | 17.3                  | 2.3              |
| % English Learners          | 15.8               | 18.3          | 1.1                   | 2.9              |
| % Receiving Lunch Subsidy   | 44.7               | 27.1          | 15.3                  | 15.1             |
| Average District Income ($) | 15316.6            | 7225.9        | 18746.8               | 5807.6           |
| count                       | 420.0              | 420.0         | 220.0                 | 220.0            |
Create the binary variable HiEL (equal to 1 if the district's percentage of English learners is above the median) as well as squares and cubes of the student-teacher ratio and district income variables; a sketch of this step is given below.
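A minimal sketch, assuming the Massachusetts variables are first renamed to match the California names used in the model formulas below (the renaming step is an assumption); the HiEL cutoff is the median share of English learners, as stated in the regression table's variable label:

# derived regressors for the California data
caschool <- caschool %>%
  mutate(hiel        = as.numeric(el_pct > median(el_pct, na.rm = TRUE)),
         str_squared = str^2,
         str_cubed   = str^3,
         inc_squared = avginc^2,
         inc_cubed   = avginc^3)

# derived regressors for the Massachusetts data, renaming variables so
# that the same model formulas can be used
maschool <- maschool %>%
  rename(testscr = totsc4, str = tchratio,
         el_pct = pctel, meal_pct = lnch_pct, avginc = percap) %>%
  mutate(hiel        = as.numeric(el_pct > median(el_pct, na.rm = TRUE)),
         str_squared = str^2,
         str_cubed   = str^3,
         inc_squared = avginc^2,
         inc_cubed   = avginc^3)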
CA_mod3 <- lm(
  testscr ~ str + el_pct + meal_pct + avginc + inc_squared + inc_cubed,
  data = caschool
)
CA_mod4 <- lm(
  testscr ~ str + str_squared + str_cubed + el_pct + meal_pct + avginc + inc_squared + inc_cubed,
  data = caschool
)
CA_mod5 <- lm(
  testscr ~ str + meal_pct + hiel + hiel * str + avginc + inc_squared + inc_cubed,
  data = caschool
)
MA_mod3 <- lm(
  testscr ~ str + el_pct + meal_pct + avginc + inc_squared + inc_cubed,
  data = maschool
)
MA_mod4 <- lm(
  testscr ~ str + str_squared + str_cubed + el_pct + meal_pct + avginc + inc_squared + inc_cubed,
  data = maschool
)
MA_mod5 <- lm(
  testscr ~ str + meal_pct + hiel + hiel * str + avginc + inc_squared + inc_cubed,
  data = maschool
)
library(huxtable)
CA_MA<-huxreg("CA(1)"=CA_mod3, "CA(2)"=CA_mod4, "CA(3)"=CA_mod5,
"MA(1)"=MA_mod3, "MA(2)"=MA_mod4, "MA(3)"=MA_mod5,
coefs = c('STR'='str', 'STR^2'='str_squared', 'STR^3'='str_cubed' ,
'% English learners'='el_pct',
'% Eligible for free lunch'='meal_pct',
'HiEL (Binary, % English learners > median)'='hiel',
'HiEL X STR'='str:hiel','District income'='avginc',
'District income^2'='inc_squared', 'District income^3'='inc_cubed',
'Intercept'='(Intercept)'),
statistics = c('N' = 'nobs', 'R^2' = 'r.squared'),
stars = c(`*` = 0.1, `**` = 0.05)
)
CA_MA
|                                            | CA(1)      | CA(2)      | CA(3)      | MA(1)      | MA(2)      | MA(3)      |
|--------------------------------------------|------------|------------|------------|------------|------------|------------|
| STR                                        | -0.508 **  | 55.618 **  | -0.342     | -0.641 **  | 12.426     | -1.018 **  |
|                                            | (0.228)    | (24.903)   | (0.318)    | (0.262)    | (14.309)   | (0.350)    |
| STR^2                                      |            | -2.915 **  |            |            | -0.680     |            |
|                                            |            | (1.266)    |            |            | (0.781)    |            |
| STR^3                                      |            | 0.050 **   |            |            | 0.011      |            |
|                                            |            | (0.021)    |            |            | (0.014)    |            |
| % English learners                         | -0.205 **  | -0.196 **  |            | -0.437     | -0.434     |            |
|                                            | (0.032)    | (0.032)    |            | (0.279)    | (0.279)    |            |
| % Eligible for free lunch                  | -0.410 **  | -0.412 **  | -0.454 **  | -0.582 **  | -0.587 **  | -0.709 **  |
|                                            | (0.030)    | (0.030)    | (0.029)    | (0.077)    | (0.079)    | (0.072)    |
| HiEL (Binary, % English learners > median) |            |            | 6.924      |            |            | -12.561    |
|                                            |            |            | (9.100)    |            |            | (9.439)    |
| HiEL X STR                                 |            |            | -0.612     |            |            | 0.799      |
|                                            |            |            | (0.463)    |            |            | (0.548)    |
| District income                            | -1.063 *   | -0.913     | -0.597     | -3.067     | -3.382     | -3.867 *   |
|                                            | (0.630)    | (0.631)    | (0.644)    | (2.267)    | (2.327)    | (2.320)    |
| District income^2                          | 0.073 **   | 0.067 **   | 0.051 *    | 0.164 *    | 0.174 *    | 0.184 **   |
|                                            | (0.026)    | (0.026)    | (0.026)    | (0.087)    | (0.089)    | (0.088)    |
| District income^3                          | -0.001 **  | -0.001 **  | -0.001 *   | -0.002 **  | -0.002 **  | -0.002 **  |
|                                            | (0.000)    | (0.000)    | (0.000)    | (0.001)    | (0.001)    | (0.001)    |
| Intercept                                  | 687.009 ** | 330.079 ** | 682.559 ** | 744.025 ** | 665.496 ** | 759.914 ** |
|                                            | (6.874)    | (162.334)  | (8.303)    | (19.437)   | (84.143)   | (20.802)   |
| N                                          | 420        | 420        | 420        | 220        | 220        | 220        |
| R^2                                        | 0.809      | 0.812      | 0.802      | 0.685      | 0.687      | 0.686      |

** p < 0.05; * p < 0.1.