Regression

OLS, panel data, IV, DiD, and event studies

OLS Basics

Simple Linear Regression

* Simple regression
reg income education

* Multiple regression
reg income education age female

* With categorical variables (factor notation)
reg income education i.region i.year

* Interactions
reg income education##female    // Full interaction
reg income c.education#c.age    // Continuous interaction

library(fixest)

# Simple regression
model <- feols(income ~ education, data = data)
summary(model)

# Multiple regression
model <- feols(income ~ education + age + female, data = data)

# With categorical variables
model <- feols(income ~ education + factor(region) + factor(year),
               data = data)

# Interactions
model <- feols(income ~ education * female, data = data)
model <- feols(income ~ education:age, data = data)

Interpreting Coefficients

Coefficient Interpretation

Continuous X: A one-unit increase in X is associated with a beta-unit change in Y, holding other variables constant.
Binary X: The difference in Y between X=1 and X=0, holding other variables constant.
Log Y: A one-unit increase in X is associated with a (beta*100)% change in Y.
Log X: A 1% increase in X is associated with a (beta/100)-unit change in Y.
Log-Log: A 1% increase in X is associated with a beta% change in Y (elasticity).

Post-Estimation

reg income education age female

* Predicted values
predict yhat

* Residuals
predict resid, residuals

* Test coefficient equals zero
test education = 0

* Test multiple coefficients
test education age

* Linear combination of coefficients
lincom education + 2*age

model <- lm(income ~ education + age + female, data = data)

# Predicted values
data$yhat <- predict(model)

# Residuals
data$resid <- residuals(model)

# Test coefficient equals zero (from summary)
summary(model)$coefficients

# F-test for multiple coefficients
library(car)
linearHypothesis(model, c("education = 0", "age = 0"))

# Linear combination
library(multcomp)
glht(model, linfct = c("education + 2*age = 0"))

Standard Errors

Always Cluster When in Doubt

If observations within groups (states, firms, individuals over time) are correlated, standard OLS standard errors are too small. Cluster at the level of treatment variation or the highest level of aggregation that makes sense.

Types of Standard Errors

* Heteroskedasticity-robust (HC1)
reg income education, robust

* Clustered by state
reg income education, cluster(state)

* Two-way clustering
reg income education, cluster(state year)

* With reghdfe (recommended for fixed effects)
reghdfe income education, absorb(state) cluster(state)

library(fixest)

# Heteroskedasticity-robust
model <- feols(income ~ education, data = data, vcov = "hetero")

# Clustered by state
model <- feols(income ~ education, data = data, vcov = ~state)

# Two-way clustering
model <- feols(income ~ education, data = data, vcov = ~state + year)

# With fixed effects
model <- feols(income ~ education | state, data = data, vcov = ~state)

Clustering Rules of Thumb

Cluster at the level of treatment assignment (e.g., if policy varies by state, cluster by state)
With few clusters (<30), consider wild bootstrap or other small-sample corrections
Never cluster at a finer level than treatment variation

Panel Data and Fixed Effects

Fixed Effects Regression

Fixed effects control for all time-invariant characteristics of each unit (observed and unobserved).

* Declare panel structure
xtset state year

* Entity fixed effects
xtreg income policy, fe

* Better: use reghdfe for speed and flexibility
* Install: ssc install reghdfe
reghdfe income policy, absorb(state)

* Two-way fixed effects (entity + time)
reghdfe income policy, absorb(state year)

* With clustered standard errors
reghdfe income policy, absorb(state year) cluster(state)

library(fixest)

# Entity fixed effects
model <- feols(income ~ policy | state, data = data)

# Two-way fixed effects (entity + time)
model <- feols(income ~ policy | state + year, data = data)

# With clustered standard errors
model <- feols(income ~ policy | state + year,
               data = data, vcov = ~state)

# View results
summary(model)
etable(model)

When to Use Fixed Effects

Use FE When

Unobserved confounders are time-invariant
You have panel data with within-unit variation
You want to control for "all things about units"

Don't Use FE When

Your treatment doesn't vary within units
You care about time-invariant effects
Confounders are time-varying

Instrumental Variables

Two-Stage Least Squares (2SLS)

* 2SLS: Y = outcome, X = endogenous, Z = instrument
* Syntax: ivregress 2sls Y controls (X = Z)

ivregress 2sls income age female (education = distance_to_college)

* First-stage F-statistic (check relevance)
estat firststage

* With robust standard errors
ivregress 2sls income age (education = distance), robust

* With fixed effects
ivreghdfe income age (education = distance), absorb(state) cluster(state)

library(fixest)

# 2SLS with fixest
# Syntax: Y ~ controls | FE | endogenous ~ instrument
model <- feols(income ~ age + female | 0 | education ~ distance_to_college,
               data = data)
summary(model)

# Check first-stage F-statistic
fitstat(model, "ivf")

# With fixed effects
model <- feols(income ~ age | state | education ~ distance,
               data = data, vcov = ~state)

First-Stage F-Statistic

The F-statistic tests instrument relevance. Rule of thumb: F > 10 suggests the instrument is strong enough. Weak instruments (F < 10) lead to biased estimates and unreliable inference.

Complete IV Example: Protest Size and Voting

Research question: Does protest size affect voting? Problem: larger protests might occur in places that would vote differently anyway. Solution: use weather (rain) as an instrument—rain reduces protest size but shouldn't directly affect voting.

* First stage: Does rain affect protest size?
reg protest rain, robust
eststo first_stage

* 2SLS: Effect of protest on voting
ivregress 2sls vote (protest = rain), robust
eststo iv_model

* Reduced form: Direct effect of rain on voting
reg vote rain, robust
eststo reduced_form

* Export results
esttab first_stage iv_model reduced_form ///
    using "iv_results.tex", replace ///
    b(3) se(3) star(* 0.10 ** 0.05 *** 0.01) ///
    stats(N r2 F, fmt(%9.0fc %9.3f %9.2f))

library(fixest)
library(modelsummary)

# First stage: Does rain affect protest size?
first_stage <- feols(protest ~ rain, data = data, vcov = "hetero")

# 2SLS: Effect of protest on voting
iv_model <- feols(vote ~ 1 | 0 | protest ~ rain, data = data, vcov = "hetero")

# Reduced form: Direct effect of rain on voting
reduced_form <- feols(vote ~ rain, data = data, vcov = "hetero")

# Export results
modelsummary(
  list("First Stage" = first_stage,
       "2SLS" = iv_model,
       "Reduced Form" = reduced_form),
  stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01),
  output = "iv_results.tex"
)

Common IV Pitfalls

Watch Out For

Weak instruments: F < 10 means your IV estimates are badly biased toward OLS
Invalid exclusion: If the instrument affects Y directly, your estimate is biased
LATE vs ATE: IV estimates effects only for "compliers" - this may not generalize

Difference-in-Differences

Difference-in-differences compares changes over time between a treated group and a control group. The key assumption is that both groups would have followed parallel trends in the absence of treatment.

Real-World Example: No-Fault Divorce and Female Suicide

Research Question

Did the introduction of no-fault divorce laws affect female suicide rates? U.S. states adopted these laws at different times between 1969-1985, creating a natural experiment with staggered treatment timing.

The analysis requires setting up several key variables:

Treatment year — When each state adopted no-fault divorce
Post indicator — Whether the current year is after treatment for that state
Time to treatment — Years before/after treatment (for event studies)

Basic 2x2 DiD

* Generate treatment indicator
gen treated = (state == "CA")
gen post = (year >= 2015)
gen treat_post = treated * post

* Basic DiD regression
reg outcome treated post treat_post, cluster(state)

* The coefficient on treat_post is the DiD estimate

* Better: Two-way fixed effects
reghdfe outcome treat_post, absorb(state year) cluster(state)

* With controls (like the divorce study)
reghdfe asmrs treat_post pcinc asmrh cases, absorb(stfips year) cluster(stfips)

library(fixest)

# Generate treatment indicator
data <- data %>%
  mutate(
    treated = state == "CA",
    post = year >= 2015,
    treat_post = treated * post
  )

# Basic DiD regression
model <- feols(outcome ~ treated + post + treat_post,
               data = data, vcov = ~state)

# Two-way fixed effects (preferred)
model <- feols(outcome ~ treat_post | state + year,
               data = data, vcov = ~state)

# The treat_post coefficient is the DiD estimate
summary(model)

Testing Parallel Trends

The key assumption is that treated and control groups would have followed parallel trends absent treatment. Check pre-treatment trends:

* Create year dummies interacted with treatment
* (Omit the year before treatment as reference)
reghdfe outcome ib2014.year#treated, ///
    absorb(state year) cluster(state)

* Plot coefficients
coefplot, keep(*#*) vertical ///
    yline(0) xline(4.5, lpattern(dash)) ///
    title("Event Study: Pre-trends Check")

library(fixest)

# Create relative time variable
data <- data %>%
  mutate(rel_time = year - 2015)  # 2015 = treatment year

# Event study regression (omit t=-1 as reference)
model <- feols(outcome ~ i(rel_time, treated, ref = -1) | state + year,
               data = data, vcov = ~state)

# Plot
iplot(model, main = "Event Study: Pre-trends Check")

Event Studies

Event studies show dynamic treatment effects over time, both before (to check parallel trends) and after treatment. They're essential for:

Testing parallel trends: Pre-treatment coefficients should be near zero
Understanding dynamics: Does the effect grow, shrink, or stay constant over time?
Detecting anticipation: Did units respond before official treatment?

Setting Up the Variables

With staggered treatment (different units treated at different times), you need to create a "time to treatment" variable:

* Create treatment year variable (example: no-fault divorce)
gen treatment_year = .
replace treatment_year = 1970 if state == "CA"
replace treatment_year = 1971 if state == "FL"
* ... etc for each treated state
* Leave as missing for never-treated states (controls)

* Create post-treatment indicator
gen treat_post = (year >= treatment_year) & !missing(treatment_year)

* Create time-to-treatment variable
gen time_to_treat = year - treatment_year
replace time_to_treat = 0 if missing(treatment_year)  // Never-treated = 0

* Stata can't handle negative factor values, so shift
summ time_to_treat
gen shifted_ttt = time_to_treat - r(min)

# Create treatment year variable
data <- data %>%
  mutate(
    treatment_year = case_when(
      state == "CA" ~ 1970,
      state == "FL" ~ 1971,
      # ... etc
      TRUE ~ NA_real_  # Never-treated states
    ),
    # Post-treatment indicator
    treat_post = year >= treatment_year & !is.na(treatment_year),
    # Time to treatment
    time_to_treat = year - treatment_year,
    time_to_treat = if_else(is.na(time_to_treat), 0, time_to_treat)
  )

Basic Event Study

* Create relative time to treatment
gen rel_time = year - treatment_year

* Create dummies for each relative time period
* (typically bin endpoints and omit t=-1)
forvalues t = -5/5 {
    gen rel_`t' = (rel_time == `t')
}
replace rel_m5 = (rel_time <= -5)  // Bin early periods
replace rel_5 = (rel_time >= 5)    // Bin late periods

* Regression (omit t=-1 as reference)
reghdfe outcome rel_m5-rel_m2 rel_0-rel_5, ///
    absorb(state year) cluster(state)

* Plot
coefplot, keep(rel_*) vertical yline(0) ///
    xline(3.5, lpattern(dash)) ///
    title("Event Study")

library(fixest)

# Create relative time variable
data <- data %>%
  mutate(rel_time = year - treatment_year)

# Event study with fixest (automatically handles binning)
model <- feols(
  outcome ~ i(rel_time, ref = -1) | state + year,
  data = data,
  vcov = ~state
)

# Plot with confidence intervals
iplot(model,
      main = "Event Study",
      xlab = "Years Relative to Treatment",
      ylab = "Effect on Outcome")

# Add reference line at zero and treatment time
abline(h = 0, lty = 2)
abline(v = -0.5, lty = 2, col = "red")

Staggered DiD

When different units are treated at different times, standard two-way fixed effects can be biased. Use modern estimators:

* Install: ssc install did_multiplegt
* Or: ssc install csdid

* Callaway and Sant'Anna (2021)
csdid outcome, ivar(state) time(year) gvar(treatment_year) agg(simple)

* Sun and Abraham (2021)
* Install: ssc install eventstudyinteract
eventstudyinteract outcome rel_m5-rel_5, ///
    cohort(treatment_year) control_cohort(never_treated) ///
    absorb(state year) cluster(state)

# Callaway and Sant'Anna (2021)
library(did)

model <- att_gt(
  yname = "outcome",
  tname = "year",
  idname = "state",
  gname = "treatment_year",  # Year first treated (0 if never)
  data = data
)

# Aggregate to simple ATT
agg <- aggte(model, type = "simple")
summary(agg)

# Event study aggregation
agg_es <- aggte(model, type = "dynamic")
ggdid(agg_es)

# Sun and Abraham (2021)
library(fixest)
model <- feols(
  outcome ~ sunab(treatment_year, year) | state + year,
  data = data,
  vcov = ~state
)
iplot(model)

ADVANCEDWhen to Worry About Staggered Timing

Not required for 14.33. Standard TWFE is problematic when: (1) treatment timing varies across units, AND (2) treatment effects change over time. Use Callaway-Sant'Anna, Sun-Abraham, or similar methods in these cases.

← Applied Micro Next: Advanced →