Regression
OLS, panel data, IV, DiD, and event studies
Every regression command follows the same pattern: reg Y X, options. Fixed effects? Add absorb(). Clustered SEs? Add cluster(). IV? Wrap the endogenous variable. Once you understand the basic syntax, everything else is just variations on a theme.
OLS Basics
Ordinary Least Squares (OLS) finds the line that minimizes the sum of squared differences between observed values and predicted values. It gives you the "best-fitting" linear relationship between your outcome variable Y and your predictors X. The slope coefficient tells you: holding everything else constant, how much does Y change when X increases by one unit?
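That minimization has a closed-form solution: β̂ = (X′X)⁻¹X′y, the "normal equations." A small numpy sketch (with made-up numbers, not from this guide) confirms that this solution really does minimize the sum of squared residuals:

```python
import numpy as np

# Tiny made-up dataset: outcome vs. one predictor, plus an intercept column
education = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
income = np.array([20.0, 24.0, 31.0, 33.0, 40.0])
X = np.column_stack([np.ones_like(education), education])

# Normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ income)

def ssr(beta):
    """Sum of squared residuals for a candidate [intercept, slope]."""
    resid = income - X @ beta
    return resid @ resid

print(beta_hat)  # intercept ~ 0.2, slope ~ 2.45
# Any perturbed coefficient vector has a larger SSR
print(ssr(beta_hat + np.array([0.5, 0.1])) > ssr(beta_hat))  # True
```

In practice you run the commands below rather than solving the normal equations by hand, but this is all they are doing under the hood.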
Simple Linear Regression
* Simple regression
reg income education
* Multiple regression
reg income education age female
* With categorical variables (factor notation)
reg income education i.region i.year
* Interactions
reg income c.education##i.female // Full interaction (main effects + interaction)
reg income c.education#c.age // Continuous-by-continuous interaction
pacman::p_load(fixest)
# Simple regression
model <- feols(income ~ education, data = data)
summary(model)
# Multiple regression
model <- feols(income ~ education + age + female, data = data)
# With categorical variables
model <- feols(income ~ education + factor(region) + factor(year),
data = data)
# Interactions
model <- feols(income ~ education * female, data = data)  # Main effects + interaction
model <- feols(income ~ education:age, data = data)       # Interaction term only
import statsmodels.formula.api as smf
import pandas as pd
# Simple regression
model = smf.ols('income ~ education', data=data).fit()
print(model.summary())
# Multiple regression
model = smf.ols('income ~ education + age + female', data=data).fit()
# With categorical variables
model = smf.ols('income ~ education + C(region) + C(year)', data=data).fit()
# Interactions
model = smf.ols('income ~ education * female', data=data).fit()  # Main effects + interaction
model = smf.ols('income ~ education:age', data=data).fit()       # Interaction term only
Interpreting Coefficients
Coefficient Interpretation
- Continuous X: A one-unit increase in X is associated with a beta-unit change in Y, holding other variables constant.
- Binary X: The difference in Y between X=1 and X=0, holding other variables constant.
- Log Y: A one-unit increase in X is associated with approximately a (beta*100)% change in Y (the approximation is accurate for small beta).
- Log X: A 1% increase in X is associated with approximately a (beta/100)-unit change in Y.
- Log-Log: A 1% increase in X is associated with approximately a beta% change in Y (an elasticity).
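The log rules above are first-order approximations. For a log outcome, the exact percent change from a one-unit increase in X is (e^β − 1) × 100%, which tracks β × 100 only when β is small. A quick check with made-up coefficients:

```python
import math

# Log-Y model: log(income) = a + beta * education
for beta in [0.02, 0.10, 0.50]:
    approx = beta * 100                  # rule-of-thumb percent change
    exact = (math.exp(beta) - 1) * 100   # exact percent change in income
    print(f"beta={beta:.2f}: approx {approx:.0f}%, exact {exact:.1f}%")
# beta=0.02: approx 2%, exact 2.0%
# beta=0.10: approx 10%, exact 10.5%
# beta=0.50: approx 50%, exact 64.9%
```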
Post-Estimation
After running a regression, you often want to do more than just look at coefficients. Predicted values show what your model expects for each observation—useful for identifying outliers or visualizing fit. Residuals are the difference between actual and predicted values—patterns in residuals can reveal model problems. Hypothesis tests let you ask specific questions: Is this coefficient different from zero? Are two coefficients equal? What's the combined effect of multiple variables?
reg income education age female
* Predicted values
predict yhat
* Residuals
predict resid, residuals
* Test coefficient equals zero
test education = 0
* Test multiple coefficients
test education age
* Linear combination of coefficients
lincom education + 2*age
model <- lm(income ~ education + age + female, data = data)
# Predicted values
data$yhat <- predict(model)
# Residuals
data$resid <- residuals(model)
# Test coefficient equals zero (from summary)
summary(model)$coefficients
# F-test for multiple coefficients
pacman::p_load(car)
linearHypothesis(model, c("education = 0", "age = 0"))
# Linear combination
pacman::p_load(multcomp)
glht(model, linfct = c("education + 2*age = 0"))
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age + female', data=data).fit()
# Predicted values
data['yhat'] = model.predict()
# Residuals
data['resid'] = model.resid
# Test coefficient equals zero (from summary)
print(model.summary())
# F-test for multiple coefficients (joint hypothesis)
print(model.f_test("(education = 0), (age = 0)"))
# Linear combination
print(model.t_test("education + 2*age = 0"))
Standard Errors
Standard errors measure how precisely you've estimated each coefficient. A coefficient of 10 with a standard error of 2 is very different from a coefficient of 10 with a standard error of 50—the first is precisely estimated, the second is essentially noise. Standard errors determine your confidence intervals and p-values, so getting them right is crucial for inference.
The challenge is that standard OLS assumes errors are independent across observations. In real data, this is often violated: observations within the same state are correlated, the same person observed over time is correlated, etc. When errors are correlated, standard OLS standard errors are too small, leading to false confidence in your results.
Always Cluster When in Doubt
If observations within groups (states, firms, individuals over time) are correlated, standard OLS standard errors are too small. Cluster at the level of treatment variation or the highest level of aggregation that makes sense.
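A small simulation (illustrative numbers, not from the text) makes the point: when the regressor varies only at the state level and errors share a state-level component, the default OLS standard error is several times too small relative to the clustered one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative setup: 40 states x 25 people; the "policy" regressor
# is constant within state, and errors share a state-level shock
rng = np.random.default_rng(0)
n_states, n_per = 40, 25
state = np.repeat(np.arange(n_states), n_per)
policy = rng.normal(size=n_states)[state]   # varies only across states
shock = rng.normal(size=n_states)[state]    # correlated errors within state
y = 1.0 + 0.5 * policy + shock + rng.normal(size=n_states * n_per)
df = pd.DataFrame({'y': y, 'policy': policy, 'state': state})

naive = smf.ols('y ~ policy', data=df).fit()
clustered = smf.ols('y ~ policy', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']})

# The clustered SE is several times larger than the overconfident default
print(naive.bse['policy'], clustered.bse['policy'])
```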
Types of Standard Errors
Here are the main options for computing standard errors, from simplest to most robust:
* Heteroskedasticity-robust (HC1)
reg income education, robust
* Clustered by state
reg income education, cluster(state)
* Two-way clustering (requires reghdfe)
reghdfe income education, noabsorb cluster(state year)
* With reghdfe (recommended for fixed effects)
reghdfe income education, absorb(state) cluster(state)
pacman::p_load(fixest)
# Heteroskedasticity-robust
model <- feols(income ~ education, data = data, vcov = "hetero")
# Clustered by state
model <- feols(income ~ education, data = data, vcov = ~state)
# Two-way clustering
model <- feols(income ~ education, data = data, vcov = ~state + year)
# With fixed effects
model <- feols(income ~ education | state, data = data, vcov = ~state)
import statsmodels.formula.api as smf
# Heteroskedasticity-robust (HC1)
model = smf.ols('income ~ education', data=data).fit(cov_type='HC1')
# Clustered by state
model = smf.ols('income ~ education', data=data).fit(
cov_type='cluster', cov_kwds={'groups': data['state']})
# For fixed effects with clustering, use linearmodels
from linearmodels.panel import PanelOLS
data = data.set_index(['state', 'year'])
model = PanelOLS.from_formula(
'income ~ education + EntityEffects',
data=data
).fit(cov_type='clustered', cluster_entity=True)
Clustering Rules of Thumb
- Cluster at the level of treatment assignment (e.g., if policy varies by state, cluster by state)
- With few clusters (<30), consider wild bootstrap or other small-sample corrections
- Never cluster at a finer level than treatment variation
Panel Data and Fixed Effects
Fixed Effects Regression
Fixed effects control for all time-invariant characteristics of each unit (observed and unobserved).
For example, suppose you have state-year panel data and want to compare each state to itself over time rather than to other states. State fixed effects do exactly this: by including a dummy for each state, California in 2020 is compared to California in 2015, not to Texas. This "differences out" all the permanent differences between states.
* Declare panel structure
xtset state year
* Entity fixed effects
xtreg income policy, fe
* Better: use reghdfe for speed and flexibility
* Install: ssc install reghdfe
reghdfe income policy, absorb(state)
* Two-way fixed effects (entity + time)
reghdfe income policy, absorb(state year)
* With clustered standard errors
reghdfe income policy, absorb(state year) cluster(state)
pacman::p_load(fixest)
# Entity fixed effects
model <- feols(income ~ policy | state, data = data)
# Two-way fixed effects (entity + time)
model <- feols(income ~ policy | state + year, data = data)
# With clustered standard errors
model <- feols(income ~ policy | state + year,
data = data, vcov = ~state)
# View results
summary(model)
etable(model)
from linearmodels.panel import PanelOLS
import pandas as pd
# Set multi-index for panel data
data = data.set_index(['state', 'year'])
# Entity fixed effects
model = PanelOLS.from_formula(
'income ~ policy + EntityEffects',
data=data
).fit()
# Two-way fixed effects (entity + time)
model = PanelOLS.from_formula(
'income ~ policy + EntityEffects + TimeEffects',
data=data
).fit()
# With clustered standard errors
model = PanelOLS.from_formula(
'income ~ policy + EntityEffects + TimeEffects',
data=data
).fit(cov_type='clustered', cluster_entity=True)
print(model.summary)
What Do Fixed Effects Actually Do?
Unit fixed effects control for all characteristics of a unit that don't change over time. To see why this matters, consider estimating the effect of education on wages.
You might worry that race and sex affect both education and wages. So you add them as controls:

wageᵢₜ = β₀ + β₁educationᵢₜ + β₂raceᵢ + β₃sexᵢ + εᵢₜ
But what about other time-invariant characteristics you can't measure? Family background, innate ability, childhood neighborhood, personality traits—all of these might affect both education and wages. You can't observe them, so you can't control for them.
Individual fixed effects solve this. By including a dummy for each person, you control for everything about that person that doesn't change over time—observed or unobserved:

wageᵢₜ = αᵢ + β₁educationᵢₜ + εᵢₜ
The αᵢ absorbs race, sex, family background, innate ability, and every other time-invariant characteristic. You don't need to list them individually—the fixed effect captures them all. The cost is that you can no longer estimate the effect of time-invariant variables (β₂ and β₃ disappear), but you've eliminated a whole class of omitted variable bias.
Time fixed effects work the same way but for time periods. They control for anything that affects all units in a given period—recessions, policy changes, seasonal patterns:

wageᵢₜ = δₜ + β₁educationᵢₜ + εᵢₜ
With both unit and time fixed effects (two-way fixed effects), your estimate of β₁ comes purely from within-unit changes over time, after removing any time trends common to all units. This is the workhorse specification for causal inference with panel data.
TWFE Powers DiD and Event Studies
Difference-in-differences and event studies—the most common designs in applied microeconomics—are almost always estimated using two-way fixed effects. The unit fixed effects control for permanent differences between treated and control groups. The time fixed effects control for common shocks that hit everyone. What's left is the treatment effect: how treated units changed relative to control units, relative to the time trend. When you see a DiD or event study paper, you're almost certainly looking at a TWFE regression.
The Limitation
Fixed effects only control for time-invariant confounders. If something changes over time and affects both your treatment and outcome (a time-varying confounder), fixed effects won't help. You'd need to control for it directly or find a different identification strategy.
Instrumental Variables
Suppose you want to know: does more education cause higher wages? You run a regression of wages on years of schooling and find a positive coefficient. But does this mean education causes higher wages?
The problem is that people who get more education might be different in ways that also affect their wages. Maybe they're smarter, more motivated, or from wealthier families. These traits affect both education and wages. So when you see that educated people earn more, you can't tell if it's because:
- Education actually makes people more productive (the causal effect you want), or
- Smart/motivated people both get more education and earn more (a confound)
OLS mixes these together. Your coefficient captures both the true effect of education and the fact that "education" is partly a proxy for ability. This is called omitted variable bias—ability affects both X and Y, but you can't measure it.
The IV Solution
Instrumental variables solve this by finding something that affects education but has no direct effect on wages. This "instrument" creates variation in education that's unrelated to ability.
A classic example: distance to the nearest college. People who grow up closer to a college tend to get more education (it's cheaper and easier to attend). But distance to a college doesn't directly make you earn more—it only affects wages through its effect on education.
The logic: if we compare people who differ only in how close they grew up to a college, any wage difference must be due to their different education levels—not ability, since distance to college isn't correlated with ability.
Two-Stage Least Squares (2SLS)
In practice, we implement this in two stages:
- First stage: Predict education using only the instrument (distance to college). This gives us "clean" variation in education that's driven by geography, not ability.
- Second stage: Use these predicted education values to estimate the effect on wages. Since the predictions are based only on distance, they're uncorrelated with ability.
* 2SLS: Y = outcome, X = endogenous, Z = instrument
* Syntax: ivregress 2sls Y controls (X = Z)
ivregress 2sls income age female (education = distance_to_college)
* First-stage F-statistic (check relevance)
estat firststage
* With robust standard errors
ivregress 2sls income age (education = distance), robust
* With fixed effects
ivreghdfe income age (education = distance), absorb(state) cluster(state)
pacman::p_load(fixest)
# 2SLS with fixest
# Syntax: Y ~ controls | FE | endogenous ~ instrument
# (with no fixed effects, drop the FE part)
model <- feols(income ~ age + female | education ~ distance_to_college,
data = data)
summary(model)
# Check first-stage F-statistic
fitstat(model, "ivf")
# With fixed effects
model <- feols(income ~ age | state | education ~ distance,
data = data, vcov = ~state)
from linearmodels.iv import IV2SLS
import pandas as pd
# 2SLS: Y = outcome, X = endogenous, Z = instrument
model = IV2SLS.from_formula(
'income ~ 1 + age + female + [education ~ distance_to_college]',
data=data
).fit()
print(model.summary)
# First-stage results (includes the F-statistic)
print(model.first_stage)
# With robust standard errors
model = IV2SLS.from_formula(
'income ~ 1 + age + [education ~ distance]',
data=data
).fit(cov_type='robust')
First-Stage F-Statistic
The F-statistic tests instrument relevance. Rule of thumb: F > 10 suggests the instrument is strong enough. Weak instruments (F < 10) lead to biased estimates and unreliable inference.
Complete IV Example: Protest Size and Voting
Research question: Does protest size affect voting? Problem: larger protests might occur in places that would vote differently anyway. Solution: use weather (rain) as an instrument—rain reduces protest size but shouldn't directly affect voting.
* First stage: Does rain affect protest size?
reg protest rain, robust
eststo first_stage
* 2SLS: Effect of protest on voting
ivregress 2sls vote (protest = rain), robust
eststo iv_model
* Reduced form: Direct effect of rain on voting
reg vote rain, robust
eststo reduced_form
* Export results
esttab first_stage iv_model reduced_form ///
using "iv_results.tex", replace ///
b(3) se(3) star(* 0.10 ** 0.05 *** 0.01) ///
stats(N r2 F, fmt(%9.0fc %9.3f %9.2f))
pacman::p_load(fixest, modelsummary)
# First stage: Does rain affect protest size?
first_stage <- feols(protest ~ rain, data = data, vcov = "hetero")
# 2SLS: Effect of protest on voting
iv_model <- feols(vote ~ 1 | protest ~ rain, data = data, vcov = "hetero")
# Reduced form: Direct effect of rain on voting
reduced_form <- feols(vote ~ rain, data = data, vcov = "hetero")
# Export results
modelsummary(
list("First Stage" = first_stage,
"2SLS" = iv_model,
"Reduced Form" = reduced_form),
stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01),
output = "iv_results.tex"
)
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS
from stargazer.stargazer import Stargazer
# First stage: Does rain affect protest size?
first_stage = smf.ols('protest ~ rain', data=data).fit(cov_type='HC1')
# 2SLS: Effect of protest on voting
iv_model = IV2SLS.from_formula(
'vote ~ 1 + [protest ~ rain]', data=data
).fit(cov_type='robust')
# Reduced form: Direct effect of rain on voting
reduced_form = smf.ols('vote ~ rain', data=data).fit(cov_type='HC1')
# Export results (stargazer supports statsmodels results, so the
# linearmodels IV model is reported separately)
stargazer = Stargazer([first_stage, reduced_form])
print(stargazer.render_latex())  # or .render_html()
print(iv_model.summary)
Common IV Pitfalls
Watch Out For
- Weak instruments: with F < 10, your IV estimates can be badly biased toward OLS
- Invalid exclusion: If the instrument affects Y directly, your estimate is biased
- LATE vs ATE: IV estimates effects only for "compliers" - this may not generalize
Difference-in-Differences
Difference-in-differences compares changes over time between a treated group and a control group. The key assumption is that both groups would have followed parallel trends in the absence of treatment.
The DiD Logic in One Table
| | Before Treatment | After Treatment | Change |
|---|---|---|---|
| Treated (got the policy) | A | B | B - A |
| Control (no policy) | C | D | D - C |
| DiD Effect (causal effect) | | | (B - A) - (D - C) |
Why this works: The change in the control group (D - C) tells us what would have happened to the treated group without treatment. Subtracting it removes secular trends.
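You can check the table arithmetic against the regression version: with made-up cell means, the interaction coefficient in y ~ treated + post + treated:post reproduces (B − A) − (D − C) exactly.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up cell means: A=10 (treated before), B=15 (treated after),
# C=8 (control before), D=11 (control after)
did_by_hand = (15.0 - 10.0) - (11.0 - 8.0)   # (B - A) - (D - C) = 2.0

# One observation per cell; the interaction coefficient is the DiD estimate
df = pd.DataFrame({
    'y':       [10.0, 15.0, 8.0, 11.0],
    'treated': [1, 1, 0, 0],
    'post':    [0, 1, 0, 1],
})
model = smf.ols('y ~ treated + post + treated:post', data=df).fit()
print(did_by_hand, model.params['treated:post'])  # both 2.0
```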
Real-World Example: No-Fault Divorce and Female Suicide
Research Question
Did the introduction of no-fault divorce laws affect female suicide rates? U.S. states adopted these laws at different times between 1969 and 1985, creating a natural experiment with staggered treatment timing.
The analysis requires setting up several key variables:
- Treatment year — When each state adopted no-fault divorce
- Post indicator — Whether the current year is after treatment for that state
- Time to treatment — Years before/after treatment (for event studies)
Basic 2x2 DiD
* Generate treatment indicator
gen treated = (state == "CA")
gen post = (year >= 2015)
gen treat_post = treated * post
* Basic DiD regression
reg outcome treated post treat_post, cluster(state)
* The coefficient on treat_post is the DiD estimate
* Better: Two-way fixed effects
reghdfe outcome treat_post, absorb(state year) cluster(state)
* With controls (like the divorce study)
reghdfe asmrs treat_post pcinc asmrh cases, absorb(stfips year) cluster(stfips)
pacman::p_load(fixest)
# Generate treatment indicator
data <- data %>%
mutate(
treated = state == "CA",
post = year >= 2015,
treat_post = treated * post
)
# Basic DiD regression
model <- feols(outcome ~ treated + post + treat_post,
data = data, vcov = ~state)
# Two-way fixed effects (preferred)
model <- feols(outcome ~ treat_post | state + year,
data = data, vcov = ~state)
# The treat_post coefficient is the DiD estimate
summary(model)
import pandas as pd
from linearmodels.panel import PanelOLS
# Generate treatment indicator
data['treated'] = (data['state'] == 'CA').astype(int)
data['post'] = (data['year'] >= 2015).astype(int)
data['treat_post'] = data['treated'] * data['post']
# Basic DiD regression
import statsmodels.formula.api as smf
model = smf.ols('outcome ~ treated + post + treat_post', data=data).fit(
cov_type='cluster', cov_kwds={'groups': data['state']})
# Two-way fixed effects (preferred)
data_panel = data.set_index(['state', 'year'])
model = PanelOLS.from_formula(
'outcome ~ treat_post + EntityEffects + TimeEffects',
data=data_panel
).fit(cov_type='clustered', cluster_entity=True)
print(model.summary)
Testing Parallel Trends
The key assumption is that treated and control groups would have followed parallel trends absent treatment. Check pre-treatment trends:
* Create year dummies interacted with treatment
* (Omit the year before treatment as reference)
reghdfe outcome ib2014.year#c.treated, ///
absorb(state year) cluster(state)
* Plot coefficients
coefplot, keep(*#*) vertical ///
yline(0) xline(4.5, lpattern(dash)) ///
title("Event Study: Pre-trends Check")
pacman::p_load(fixest, ggfixest, dplyr)
# Create relative time variable
# Set never-treated units to -1000 (outside data range)
data <- data %>%
mutate(
rel_time = year - treatment_year,
rel_time = if_else(is.na(rel_time), -1000, rel_time),
treated = !is.na(treatment_year)
)
# Event study regression
# ref = c(-1, -1000) excludes both t=-1 and never-treated (-1000)
model <- feols(
outcome ~ i(rel_time, treated, ref = c(-1, -1000)) | state + year,
data = data, vcov = ~state
)
# Plot with ggiplot (publication-ready)
p <- ggiplot(model) +
labs(title = "Event Study: Pre-trends Check",
x = "Years Relative to Treatment",
y = "Coefficient Estimate") +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_minimal()
Event Studies
Event studies show dynamic treatment effects over time, both before (to check parallel trends) and after treatment. They're essential for:
- Testing parallel trends: Pre-treatment coefficients should be near zero
- Understanding dynamics: Does the effect grow, shrink, or stay constant over time?
- Detecting anticipation: Did units respond before official treatment?
R users: See the fixest walkthrough for comprehensive documentation on event studies with feols().
Reading Event Study Plots
Event study coefficients tell a story. Here's how to interpret common patterns:
1. Testing Parallel Trends
Pre-treatment coefficients (before t=0) should be statistically indistinguishable from zero. This suggests treated and control groups were on similar trajectories before treatment.
2. Understanding Effect Dynamics
How the treatment effect evolves over time can reveal important information about the mechanism. Does the policy take time to work? Does the effect persist or fade?
3. Detecting Anticipation Effects
Sometimes units respond before the official treatment date—perhaps because the policy was announced in advance, or because they anticipated it. This shows up as non-zero coefficients just before t=0.
What to Do About Anticipation
- Redefine treatment timing: If a law was announced 6 months before implementation, use the announcement date as t=0
- Report honestly: Acknowledging anticipation is better than hiding it—it may be substantively interesting
- Check the mechanism: Anticipation can help you understand how the policy works (e.g., through expectations vs. direct effects)
Setting Up the Variables
With staggered treatment (different units treated at different times), you need to create a "time to treatment" variable:
* Create treatment year variable (example: no-fault divorce)
gen treatment_year = .
replace treatment_year = 1970 if state == "CA"
replace treatment_year = 1971 if state == "FL"
* ... etc for each treated state
* Leave as missing for never-treated states (controls)
* Create post-treatment indicator
gen treat_post = (year >= treatment_year) & !missing(treatment_year)
* Create time-to-treatment variable
gen time_to_treat = year - treatment_year
replace time_to_treat = 0 if missing(treatment_year) // Placeholder for never-treated (safe only when interacted with a treated dummy)
* Stata can't handle negative factor values, so shift
summ time_to_treat
gen shifted_ttt = time_to_treat - r(min)
# Create treatment year variable
# Never-treated states have NA for treatment_year
data <- data %>%
mutate(
treatment_year = case_when(
state == "CA" ~ 1970,
state == "FL" ~ 1971,
# ... etc for each treated state
TRUE ~ NA_real_ # Never-treated: leave as NA
),
# Binary: ever treated vs never treated
treated = !is.na(treatment_year),
# Post-treatment indicator
treat_post = year >= treatment_year & treated,
# Time to treatment (for event studies)
# IMPORTANT: Set never-treated to -1000 (outside data range)
# Then exclude with ref = c(-1, -1000) in fixest
time_to_treat = if_else(is.na(treatment_year), -1000, year - treatment_year)
)
Basic Event Study
* Create relative time to treatment
gen rel_time = year - treatment_year
* Create dummies for each relative time period
* (typically bin endpoints and omit t=-1)
forvalues t = 0/5 {
    gen rel_`t' = (rel_time == `t')
}
forvalues t = 2/5 {
    gen rel_m`t' = (rel_time == -`t') // "m" = minus (hyphens aren't allowed in names)
}
replace rel_m5 = (rel_time <= -5) // Bin early periods
replace rel_5 = (rel_time >= 5) & !missing(rel_time) // Bin late periods (exclude never-treated)
* Regression (omit t=-1 as reference)
reghdfe outcome rel_m2-rel_m5 rel_0-rel_5, ///
absorb(state year) cluster(state)
* Plot
coefplot, keep(rel_*) vertical yline(0) ///
xline(3.5, lpattern(dash)) ///
title("Event Study")
pacman::p_load(fixest, ggfixest, ggplot2, dplyr)
# Create relative time variable
# IMPORTANT: Never-treated units don't have a meaningful "time to treatment"
# Set them to -1000 (outside data range), then exclude with ref = c(-1, -1000)
# See: https://lrberge.github.io/fixest/articles/fixest_walkthrough.html
data <- data %>%
mutate(
rel_time = year - treatment_year,
rel_time = if_else(is.na(rel_time), -1000, rel_time),
treated = !is.na(treatment_year)
)
# Optional: bin endpoints to avoid sparse cells (but keep -1000 for never-treated)
data <- data %>%
mutate(rel_time = case_when(
rel_time == -1000 ~ -1000, # Keep never-treated at -1000
rel_time <= -6 ~ -6, # Bin early periods
rel_time >= 6 ~ 6, # Bin late periods
TRUE ~ rel_time
))
# Event study with fixest
# ref = c(-1, -1000) excludes BOTH t=-1 (reference) AND t=-1000 (never-treated)
model <- feols(
outcome ~ i(rel_time, treated, ref = c(-1, -1000)) | state + year,
data = data,
vcov = ~state
)
# Plot with ggiplot (publication-ready)
p <- ggiplot(model) +
labs(title = "Event Study",
x = "Years Relative to Treatment",
y = "Effect on Outcome") +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = -0.5, linetype = "dashed", color = "red") +
theme_minimal()
A Note on Staggered Timing
Recent research has found that DiD estimates can be biased when there is both (A) staggered treatment timing and (B) heterogeneous treatment effects. You don't need to know anything about this for 14.33. If you're interested, ask your TA!
Regression Tables
Regression tables present your main results. The standard format in economics shows multiple specifications (columns) with the same outcome variable, progressively adding controls or changing samples. This shows readers how your results change as you adjust the model—stable coefficients across specifications build confidence in your findings.
Never Copy Numbers by Hand
Export tables directly from your code to LaTeX or Word. Manual copying introduces errors and makes your work non-reproducible. Use estout/esttab in Stata, etable/modelsummary in R, or stargazer in Python.
Basic Regression Tables
Standard elements include: coefficient estimates, standard errors (in parentheses below the coefficient), significance stars, the number of observations, R-squared or other fit statistics, and notes explaining what controls are included.
* Run regressions and store results
eststo clear
eststo: reg income treatment
eststo: reg income treatment age female
eststo: reg income treatment age female education
* Create table
esttab using "regression_table.tex", ///
se star(* 0.10 ** 0.05 *** 0.01) ///
label r2 ///
title("Effect of Treatment on Income") ///
mtitles("(1)" "(2)" "(3)") ///
addnotes("Standard errors in parentheses") ///
replace
* Common options:
* se - show standard errors (not t-stats)
* star(...) - significance stars
* label - use variable labels
* r2 - show R-squared
* b(3) se(3) - 3 decimal places
* drop(...) - hide certain coefficients
# ============================================
# Option 1: etable (from fixest) - simplest
# ============================================
pacman::p_load(fixest)
# Run regressions with feols
m1 <- feols(income ~ treatment, data = data)
m2 <- feols(income ~ treatment + age + female, data = data)
m3 <- feols(income ~ treatment + age + female + education, data = data)
# Quick console table
etable(m1, m2, m3)
# Export to LaTeX
etable(m1, m2, m3, tex = TRUE, file = "regression_table.tex")
# ============================================
# Option 2: modelsummary - most flexible
# ============================================
pacman::p_load(modelsummary)
# Works with any model type (lm, feols, glm, etc.)
m1 <- lm(income ~ treatment, data = data)
m2 <- lm(income ~ treatment + age + female, data = data)
m3 <- lm(income ~ treatment + age + female + education, data = data)
# Create table
modelsummary(
list("(1)" = m1, "(2)" = m2, "(3)" = m3),
stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01),
gof_omit = "AIC|BIC|Log",
title = "Effect of Treatment on Income"
)
# Export to LaTeX, Word, or HTML
modelsummary(list(m1, m2, m3), output = "table.tex")
modelsummary(list(m1, m2, m3), output = "table.docx")
modelsummary(list(m1, m2, m3), output = "table.html")
import statsmodels.formula.api as smf
from stargazer.stargazer import Stargazer
# Run regressions
m1 = smf.ols("income ~ treatment", data=data).fit()
m2 = smf.ols("income ~ treatment + age + female", data=data).fit()
m3 = smf.ols("income ~ treatment + age + female + education", data=data).fit()
# Create table with stargazer
stargazer = Stargazer([m1, m2, m3])
stargazer.title("Effect of Treatment on Income")
print(stargazer.render_latex())
# Or manual approach
from statsmodels.iolib.summary2 import summary_col
print(summary_col([m1, m2, m3], stars=True).as_latex())
R Table Packages: Which to Use?
- etable (fixest): Best for quick tables with feols() models. Built into fixest, so no extra dependencies. Great for console output and fast LaTeX export.
- modelsummary: Most flexible and modern. Works with any model type, exports to LaTeX/Word/HTML/PNG, and has excellent customization. Recommended for publication tables.
Customizing Tables
Default table output rarely matches what you need for publication. You'll want to: rename variables, add rows showing which controls or fixed effects are included, report the mean of the dependent variable, and drop unimportant coefficients.
* Add mean of dependent variable
eststo clear
eststo: reg income treatment age female
estadd local controls "Yes"
sum income if e(sample)
estadd scalar mean_y = r(mean)
* Include in table
esttab using "table.tex", ///
se star(* 0.10 ** 0.05 *** 0.01) ///
scalars("mean_y Mean of Dep. Var." "controls Controls") ///
label replace
* Rename coefficients in output
esttab, rename(treatment "Treatment Effect")
# ----- etable (fixest) customization -----
etable(m1, m2, m3,
drop = "Constant", # Hide intercept
dict = c(treatment = "Treatment Effect"), # Rename
extralines = list(
"_^Controls" = c("No", "Yes", "Yes"),
"_^Mean of DV" = rep(round(mean(data$income), 2), 3)
),
tex = TRUE, file = "table.tex"
)
# ----- modelsummary customization -----
pacman::p_load(modelsummary, tibble)
rows <- tribble(
~term, ~"(1)", ~"(2)", ~"(3)",
"Controls", "No", "Yes", "Yes",
"Mean of DV", sprintf("%.2f", mean(data$income)),
sprintf("%.2f", mean(data$income)),
sprintf("%.2f", mean(data$income))
)
modelsummary(list(m1, m2, m3),
add_rows = rows,
coef_rename = c("treatment" = "Treatment Effect"),
coef_omit = "Intercept"
)
# Add custom rows with stargazer
stargazer = Stargazer([m1, m2])
stargazer.add_line("Controls", ["No", "Yes"])
stargazer.add_line("Mean of DV", [f"{data['income'].mean():.2f}"] * 2)
stargazer.rename_covariates({"treatment": "Treatment Effect"})
print(stargazer.render_latex())
Table Checklist
- Do all columns have clear headers?
- Are standard errors in parentheses (not t-stats)?
- Is the dependent variable clearly labeled?
- Are significance levels defined in a note?
- Is sample size reported for each column?
- Are controls indicated (even if not shown)?
Found something unclear or have a suggestion? Email [email protected].