Research and Communication in Economics

Causal Inference

From naive comparisons to credible research designs

The Question

Let's start with a simple question: Does going to MIT affect wages?

This seems straightforward. Just compare people who went to MIT to people who didn't, right? Unfortunately, it's much harder than it sounds. Let's work through why.

The Goal of Causal Inference

We want to know: if we took a specific person and randomly chose whether they attend MIT or not, how would their wages differ? This is fundamentally different from asking how MIT graduates compare to non-MIT graduates.

The Potential Outcomes Framework

To think clearly about causation, we need to formalize what we mean by a "treatment effect."

Two Potential Worlds

For any individual, consider two hypothetical scenarios:

Y(1): Outcome if Treated

Sarah's wages if she attends MIT

Y(0): Outcome if Not Treated

Sarah's wages if she doesn't attend MIT

The causal effect for Sarah is: Y(1) - Y(0)

The Fundamental Problem

We can never observe both Y(1) and Y(0) for the same person. Sarah either goes to MIT or she doesn't. We observe one reality, not both. The unobserved outcome is called the counterfactual.

Example: Sarah's Potential Outcomes

Scenario               Y(1): MIT             Y(0): No MIT            Effect
Sarah (attended MIT)   $120,000 (observed)   ??? (counterfactual)    ???

If Sarah had the same ability and work ethic, but went to a state school instead, would she earn $100,000? $80,000? $120,000? We simply don't know.

Average Treatment Effect

Since we can't measure individual effects, we focus on averages across many people:

  • ATE = E[Y(1) - Y(0)] = Average effect for everyone
  • ATT = E[Y(1) - Y(0) | D=1] = Average effect on those who actually got treated
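To make the notation concrete, here is a toy simulation in Python (all numbers invented). Because we generate both potential outcomes, we can compute the ATE directly, something impossible with real data, where only one of Y(1) or Y(0) is ever observed per person:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Both potential outcomes, in $000s (invented numbers)
y0 = rng.normal(65, 10, n)        # wages without MIT
y1 = y0 + 15                      # wages with MIT: true effect is +15 for everyone

d = rng.binomial(1, 0.5, n)       # treatment indicator
y_obs = np.where(d == 1, y1, y0)  # all we ever see in real data

ate = (y1 - y0).mean()            # computable only because this is a simulation
print(round(ate, 2))              # 15.0
```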

Naive Approaches (That Fail)

Before diving into solutions, let's understand why the obvious approaches don't work.

Approach 1: Before vs. After

Compare wages while at MIT vs. after graduation:

Wages during MIT: ~$0 (undergrads don't get paid)

Wages after MIT: ~$85,000

"Effect" of MIT: +$85,000!

Why This Fails

This comparison conflates the effect of MIT with the effect of time (aging, gaining experience, entering the labor market). Anyone's wages would increase from college to post-graduation, regardless of where they went to school.

Approach 2: MIT Grads vs. Non-MIT Grads

Compare wages of MIT graduates to wages of people who didn't go to MIT:

MIT graduates: ~$125,000 average salary

Non-MIT graduates: ~$65,000 average salary

"Effect" of MIT: +$60,000!

Why This Fails: Selection Bias

MIT students aren't a random sample. They're selected based on high test scores, strong work ethic, ambitious personalities, wealthy family backgrounds, etc. These same traits would lead to higher wages even without MIT. We can't tell how much of the $60,000 gap is MIT vs. pre-existing differences.

[Diagram: Ability, work ethic, and family background feed into both MIT attendance (selection into treatment) and wages (direct effect on outcome); the arrow from MIT attendance to wages is the causal effect we want.]

The problem: ability affects both MIT attendance AND wages directly

Approach 3: "Controlling For" Observables

A natural response: "What if we compare people with the same SAT scores? Then we're holding ability constant."

Stata:

* "Controlling for" SAT score
reg wages attended_mit sat_score

R:

# "Controlling for" SAT score
lm(wages ~ attended_mit + sat_score, data = df)

Python:

# "Controlling for" SAT score
import statsmodels.formula.api as smf
model = smf.ols('wages ~ attended_mit + sat_score', data=df).fit()

Why This Still Fails

Even among people with identical SAT scores, MIT students are systematically different from non-MIT students:

  • Who applies: A student with a 1550 SAT who applies to MIT has different ambitions than someone with a 1550 who doesn't apply
  • Who gets in: MIT admits students based on essays, recommendations, extracurriculars—all things correlated with future success
  • Who attends: Among admits, those who choose MIT over Harvard or Stanford may be different in unobservable ways

Consider two students who both scored 1520 on the SAT:

Alice (MIT)

  • Founded a robotics club
  • Published research in high school
  • Family values elite education
  • Strong peer network

Bob (State School)

  • Got a high SAT score but didn't apply
  • Preferred staying close to home
  • Less focused on prestige
  • Different career aspirations

Same SAT score, but Alice would likely earn more than Bob regardless of where they went to college. The things that made Alice apply to and attend MIT are the same things that will make her successful in her career.

Selection on Unobservables

You can only control for things you can measure. But the decision to attend MIT is driven by motivation, ambition, family expectations, risk tolerance, career goals—most of which we can't observe. Adding more controls helps, but can never fully solve the problem.

The Gold Standard: Random Assignment

What if we could randomly assign students to MIT vs. other schools?

Why randomization works: If treatment is randomly assigned, then on average, the treated and control groups are identical in all characteristics (observed and unobserved). Any difference in outcomes must be due to treatment.
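Under the same kind of invented data-generating process, a sketch of why this works: with random assignment, ability is balanced across groups, so the raw difference in means recovers the true effect without any controls.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

ability = rng.normal(0, 1, n)
treat = rng.binomial(1, 0.5, n)   # randomized: independent of ability
true_effect = 15.0
wages = 65 + true_effect * treat + 12 * ability + rng.normal(0, 5, n)

# The naive difference in means is now unbiased for the ATE
diff = wages[treat == 1].mean() - wages[treat == 0].mean()
print(round(diff, 1))  # close to 15
```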

Why We Can't Randomize MIT Attendance

MIT is selective, and nobody is going to let us randomly assign students to go there. This is why we need quasi-experimental methods—strategies that approximate random assignment using naturally occurring variation.

Difference-in-Differences (DiD)

DiD is the workhorse of applied microeconomics. The idea: compare changes over time between treated and control groups.

Recommendation for 14.33

For your research project, I recommend a difference-in-differences approach. Here's why:

  • Straightforward analysis: The regression setup is intuitive and easy to implement
  • Testable assumptions: You can validate parallel trends by examining pre-treatment periods
  • Flexible data requirements: Many different kinds of panel data can support a DiD design
  • This is real economics: DiD is the bread and butter of applied micro—professors at top schools publish DiD papers in top journals all the time

The Setup

Suppose we want to estimate the effect of raising the minimum wage on employment. Some states raised their minimum wage in 2015, while neighboring states kept theirs the same:

  • Treatment group: States that raised the minimum wage
  • Control group: States that didn't raise it
  • Before period: Employment before the policy change
  • After period: Employment after the policy change

The DiD Estimator

DiD = (Y_treatment,after - Y_treatment,before) - (Y_control,after - Y_control,before)

[Figure: average employment for the treatment and control groups, before and after the policy; the DiD estimate is the post-period gap between the treated group's actual outcome and the path implied by the control group's trend.]
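The estimator is just arithmetic on four group means. A minimal sketch with made-up employment numbers:

```python
# Hypothetical average employment (thousands of jobs) in each cell
treat_before, treat_after = 100.0, 103.0   # treated states: +3
ctrl_before, ctrl_after = 98.0, 104.0      # control states: +6

# DiD: treated change minus control change
did = (treat_after - treat_before) - (ctrl_after - ctrl_before)
print(did)  # -3.0: treated states grew by 3 less than controls
```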

The Parallel Trends Assumption

Critical Assumption

DiD requires that absent treatment, the treatment group would have followed the same trend as the control group. This is untestable because we can't observe the counterfactual.

In the minimum wage example, we're assuming that if treated states had not raised their minimum wage, their employment would have evolved the same way as the control states' employment.

Pre-trends validation: We can check whether employment (or other outcomes) in the two groups moved together in the pre-treatment period. Parallel pre-trends don't prove the assumption, but they are strong evidence that the control group provides a reasonable counterfactual.

If pre-trends aren't parallel: This suggests the groups were too different to begin with. If treated states were already on a different employment trajectory before the policy change, any post-policy difference might just reflect pre-existing differences, not the effect of the minimum wage.

Balance on observables: It can also be useful to check whether the treatment and control groups are similar on observable characteristics before treatment. If they look similar on things we can measure (demographics, industry mix, prior employment trends), it's more plausible they'd follow similar trajectories.

The fundamental limitation: We can never prove that the groups would not have diverged absent the treatment. But if they had been following parallel trajectories for many time periods before treatment, a reasonable person could assume they would have continued on parallel trajectories afterward.

What We Can Check

While we can't prove parallel trends, we can check if trends were parallel before treatment:

Stata:

* Check pre-trends with an event study
gen time_to_treat = year - treatment_year
forvalues k = -5/-1 {
    gen pre`=abs(`k')' = (time_to_treat == `k') * treated_group
}
forvalues k = 0/5 {
    gen post`k' = (time_to_treat == `k') * treated_group
}

reghdfe wages pre5-pre2 post0-post5, absorb(school year) cluster(school)

* If pre-period coefficients are near zero, parallel trends is plausible
R:

# Check pre-trends with an event study
pacman::p_load(fixest, dplyr)

# IMPORTANT: Never-treated units don't have a "time to treatment"
# Set them to -1000 and exclude with ref = c(-1, -1000)
df <- df %>%
  mutate(
    time_to_treat = ifelse(is.na(treatment_year), -1000, year - treatment_year),
    ever_treated = !is.na(treatment_year)
  )

# Event study regression
# ref = c(-1, -1000) excludes both the reference period and never-treated
es <- feols(wages ~ i(time_to_treat, ever_treated, ref = c(-1, -1000)) |
            school + year, data = df, cluster = ~school)

# Plot coefficients
iplot(es)

# If pre-period coefficients are near zero, parallel trends is plausible
Python:

# Check pre-trends with an event study
import pyfixest as pf

# Never-treated units have no "time to treatment"; code them as the
# reference period (-1). Since ever_treated = 0 for them, all of their
# interaction dummies are zero anyway, so they serve purely as controls.
df['time_to_treat'] = (df['year'] - df['treatment_year']).fillna(-1).astype(int)
df['ever_treated'] = df['treatment_year'].notna().astype(int)

# Event study regression (ref=-1 omits the period just before treatment)
model = pf.feols(
    'wages ~ i(time_to_treat, ever_treated, ref=-1) | school + year',
    data=df, vcov={'CRV1': 'school'}
)

# Plot coefficients
model.iplot()

# If pre-period coefficients are near zero, parallel trends is plausible

Other Key Assumptions

No differential contemporaneous shocks: DiD assumes there's no other shock that hits only the treated group at the same time as treatment. If treated states also experienced a major factory closure at the same time as the minimum wage increase, we couldn't separate the effect of the minimum wage from the effect of the factory closure.

Treatment is often a "package": A minimum wage increase may come bundled with other labor reforms. DiD estimates the combined effect of the whole package—you can't separately identify the effect of one component versus another.

Regression Discontinuity (RD)

RD exploits sharp cutoffs in treatment assignment. If treatment jumps discontinuously at a threshold, units just above and below the cutoff are essentially randomly assigned.

SAT Score Example

Suppose MIT admits students who score above 1500 on the SAT. Consider two students:

  • Alice: SAT = 1502 (admitted to MIT)
  • Bob: SAT = 1498 (not admitted to MIT)

Alice and Bob are nearly identical—separated by a trivial difference in test performance. If we compare their wages, we're getting close to a true causal effect.
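A toy simulation (invented relationship) of the RD logic: wages rise smoothly with SAT plus a discrete jump at the 1500 cutoff, and comparing mean wages in a narrow window on either side approximately recovers that jump.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

sat = rng.uniform(1300, 1600, n)
admitted = (sat >= 1500).astype(float)
jump = 20.0                                   # true discontinuity, $000s
wages = 40 + 0.05 * (sat - 1400) + jump * admitted + rng.normal(0, 5, n)

# Compare means within +/- 10 SAT points of the cutoff
w = 10
above = wages[(sat >= 1500) & (sat < 1500 + w)].mean()
below = wages[(sat >= 1500 - w) & (sat < 1500)].mean()
print(round(above - below, 1))  # roughly the jump of 20, plus a small slope bias
```

Real RD estimators (like rdrobust, below) fit local regressions on each side instead of raw window means, which removes the slope bias.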

[Figure: wages ($000s) plotted against SAT score from 1400 to 1550; the fitted line jumps at the 1500 admission cutoff, and the size of the jump is the RD effect.]

RD Assumptions

Continuity

All other factors that affect wages vary smoothly across the cutoff. There's no other reason for a jump at exactly 1500.

No Manipulation

Students can't precisely control their score to land just above the cutoff. If they could, those just above would be systematically different from those just below.

Testing the Assumptions

Stata:

* Check for manipulation: McCrary density test
rddensity sat_score, c(1500) plot

* Check for jumps in covariates (should be smooth)
rdrobust family_income sat_score, c(1500)
rdrobust parental_education sat_score, c(1500)

* Main RD estimate
rdrobust wages sat_score, c(1500)
R:

# Check for manipulation: density test
pacman::p_load(rddensity)
density_test <- rddensity(X = df$sat_score, c = 1500)
summary(density_test)
rdplotdensity(density_test, X = df$sat_score)

# Check for jumps in covariates (should be smooth)
pacman::p_load(rdrobust)
rdrobust(y = df$family_income, x = df$sat_score, c = 1500)

# Main RD estimate
rdrobust(y = df$wages, x = df$sat_score, c = 1500)
Python:

# RD analysis in Python
from rdrobust import rdrobust
from rddensity import rddensity

# Check for manipulation: density test
density_test = rddensity(X=df['sat_score'], c=1500)
print(density_test)

# Check for jumps in covariates (should be smooth)
rd_covariate = rdrobust(y=df['family_income'], x=df['sat_score'], c=1500)
print(rd_covariate)

# Main RD estimate
rd_result = rdrobust(y=df['wages'], x=df['sat_score'], c=1500)
print(rd_result)

Limitation: RD estimates a local effect—the effect for students right at the cutoff. This might not generalize to students with much higher or lower scores.

Instrumental Variables (IV)

IV uses an external source of variation that affects treatment but has no direct effect on the outcome. This "instrument" isolates the causal effect.

MIT Parent as an Instrument?

Consider using "having a parent who went to MIT" as an instrument for attending MIT yourself.

Requirements for a Valid Instrument

  1. Relevance: The instrument must affect treatment (testable)
  2. Exclusion: The instrument must only affect the outcome through treatment (not testable)
  3. Independence: The instrument must be as-if randomly assigned

Checking Relevance (First Stage)

Does having an MIT parent increase the probability of attending MIT? Almost certainly yes:

Stata:

* First stage: parent_mit → attended_mit
reg attended_mit parent_mit controls
* Look for F-statistic > 10 (rule of thumb)

R:

# First stage: parent_mit → attended_mit
first_stage <- lm(attended_mit ~ parent_mit + controls, data = df)
summary(first_stage)
# Look for F-statistic > 10 (rule of thumb)

Python:

# First stage: parent_mit → attended_mit
import statsmodels.formula.api as smf
first_stage = smf.ols('attended_mit ~ parent_mit + controls', data=df).fit()
print(first_stage.summary())
# Look for F-statistic > 10 (rule of thumb)

Legacy admissions are common at elite schools, so this first stage would likely be strong.

The Exclusion Restriction Problem

Why This Instrument FAILS

Having an MIT parent affects your wages through many channels besides your own MIT attendance:

  • Genetics: MIT parents may pass on high-ability genes
  • Environment: Growing up with MIT parents means more educational resources, intellectual stimulation
  • Networks: MIT parents have connections that help their children's careers regardless of college
  • Wealth: MIT parents tend to be wealthy, providing financial advantages

The instrument affects wages directly, not just through MIT attendance. The exclusion restriction is violated.

[Diagram: MIT Parent → Attended MIT → Wages satisfies relevance, but MIT Parent → Wages directly through genetics, wealth, and networks violates the exclusion restriction.]
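A simulation (made-up magnitudes) shows how badly this can go. The simplest IV estimate, the Wald ratio, divides the reduced-form effect of the instrument on wages by the first-stage effect on attendance; when the instrument also shifts wages directly, that ratio lands far from the true effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

parent_mit = rng.binomial(1, 0.1, n)                # the (invalid) instrument
attend = rng.binomial(1, 0.05 + 0.30 * parent_mit)  # strong first stage
true_effect = 15.0
direct = 10.0      # wealth/network channel: exclusion restriction violated
wages = 60 + true_effect * attend + direct * parent_mit + rng.normal(0, 10, n)

first_stage = attend[parent_mit == 1].mean() - attend[parent_mit == 0].mean()
reduced_form = wages[parent_mit == 1].mean() - wages[parent_mit == 0].mean()
iv_estimate = reduced_form / first_stage
print(round(iv_estimate, 1))  # roughly 15 + 10/0.30, i.e. about 48, not 15
```

The direct channel gets divided by the first stage too, so even a modest exclusion violation inflates the estimate dramatically.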

What Would Make a Good Instrument?

A valid instrument needs to be:

  1. Correlated with MIT attendance
  2. Plausibly random (not chosen based on ability or ambition)
  3. Only affecting wages through MIT

Some possibilities that economists have tried for college quality questions (though most are debatable):

  • Geographic proximity: Distance to college (Card, 1995)—but does distance really have no direct effect on outcomes?
  • Lottery-based admissions: Some programs use lotteries among qualified applicants—rare but credible when available
  • Policy changes: Changes in financial aid that affect some students but not others

Caveat: Good instruments are hard to find. Most proposed instruments have plausible exclusion restriction violations. This is why IV is less common in student projects than DiD.

Key insight: A good instrument is rare because you need something that affects college choice but has no other connection to labor market success. Most things that affect whether you go to MIT also affect your wages directly.

Summary: Choosing a Design

Method   Key Assumption                 Testable?                          MIT Example
RCT      None (randomization)           N/A                                Not feasible
DiD      Parallel trends                Partially (pre-trends)             Scholarship policy change
RD       No manipulation, continuity    Partially (density, covariates)    SAT cutoff for admission
IV       Exclusion restriction          No (only relevance)                MIT parent fails; need better instrument

The Bottom Line

There is no perfect method. Every causal inference strategy requires assumptions that cannot be fully tested. The goal is to find plausible sources of variation and be transparent about what could go wrong. The best research combines multiple approaches and tests robustness.

Found something unclear or have a suggestion? Email [email protected].