Basic Regression
OLS, standard errors, and fixed effects
Every regression command follows the same pattern: reg Y X, options. Fixed effects? Add absorb(). Clustered SEs? Add cluster(). Once you understand the basic syntax, everything else is just variations on a theme.
OLS Basics
What Is OLS, Really?
Ordinary Least Squares is just finding the best-fitting line through your data. It's not magic. Here's the idea:
Suppose you have data on education and wages. You want to fit a line through these points. But which line? There are infinite possibilities. OLS picks the specific line that minimizes the sum of squared vertical distances from each point to the line. In other words, it finds the coefficients that make your predictions as close as possible to the actual data.
In one dimension (one X variable), this is literally the "line of best fit" you might draw by eye through a scatter plot. In higher dimensions (multiple X variables), it's the same idea—the "plane" (or hyperplane) that minimizes prediction errors.
OLS Algorithm (Conceptual)
- Guess some coefficients (β₀, β₁, β₂, ...)
- Make predictions: Ŷ = β₀ + β₁X₁ + β₂X₂ + ...
- Calculate errors: The vertical distance from each actual Y to your prediction Ŷ
- Square them: Errors are squared so negative and positive errors both "count"
- Sum them: Add up all squared errors
- Repeat: Try different coefficients until you find the ones that make this sum as small as possible
In practice, you don't try coefficients randomly—there's a mathematical formula that solves for the optimal coefficients directly. But conceptually, OLS is just optimization: pick coefficients to minimize prediction error.
The result gives you the "best-fitting" linear relationship between your outcome variable Y and your predictors X. The slope coefficient tells you: holding everything else constant, how much does Y change when X increases by one unit?
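The direct formula mentioned above — the "normal equations" — can be sketched in a few lines of numpy on made-up data. (Real statistical software uses more numerically stable decompositions such as QR, but the answer is the same.)

```python
import numpy as np

# Made-up data: income as a linear function of education plus noise
rng = np.random.default_rng(0)
education = rng.uniform(8, 20, size=200)
income = 5.0 + 2.0 * education + rng.normal(0, 3, size=200)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(education), education])

# Normal equations: beta = (X'X)^(-1) X'y
beta = np.linalg.solve(X.T @ X, X.T @ income)
print(beta)  # roughly [5.0, 2.0]

# Any perturbation of beta increases the sum of squared errors
def sse(b):
    resid = income - X @ b
    return resid @ resid

assert sse(beta) <= sse(beta + np.array([0.1, 0.0]))
assert sse(beta) <= sse(beta + np.array([0.0, -0.1]))
```

The two assertions at the end are the conceptual definition of OLS in action: no other coefficients produce a smaller sum of squared errors.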
Simple Linear Regression
* Simple regression
reg income education
* Multiple regression
reg income education age female
* With categorical variables (factor notation)
reg income education i.region i.year
* Interactions
reg income education##female // Full interaction
reg income c.education#c.age // Continuous interaction
pacman::p_load(fixest)
# Simple regression
model <- feols(income ~ education, data = data)
summary(model)
# Multiple regression
model <- feols(income ~ education + age + female, data = data)
# With categorical variables
model <- feols(income ~ education + factor(region) + factor(year),
data = data)
# Interactions
model <- feols(income ~ education * female, data = data)
model <- feols(income ~ education:age, data = data)
import statsmodels.formula.api as smf
import pandas as pd
# Simple regression
model = smf.ols('income ~ education', data=data).fit()
print(model.summary())
# Multiple regression
model = smf.ols('income ~ education + age + female', data=data).fit()
# With categorical variables
model = smf.ols('income ~ education + C(region) + C(year)', data=data).fit()
# Interactions
model = smf.ols('income ~ education * female', data=data).fit()
model = smf.ols('income ~ education:age', data=data).fit()
Interpreting Coefficients
Coefficient Interpretation
- Continuous X: A one-unit increase in X is associated with a beta-unit change in Y, holding other variables constant.
- Binary X: The difference in Y between X=1 and X=0, holding other variables constant.
- Log Y: A one-unit increase in X is associated with approximately a (β×100)% change in Y. The exact figure is (e^β − 1)×100%, which matters when β is large.
- Log X: A 1% increase in X is associated with approximately a (β/100)-unit change in Y.
- Log-Log: A 1% increase in X is associated with approximately a β% change in Y (an elasticity).
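The log-Y rule is an approximation that deteriorates as the coefficient grows; the exact percentage change is (e^β − 1)×100. A quick arithmetic check (the β values here are arbitrary):

```python
import math

# Log-Y model: the exact percentage change in Y for a one-unit
# increase in X is (e^beta - 1) * 100, not beta * 100
for beta in [0.02, 0.10, 0.50]:
    approx = beta * 100
    exact = (math.exp(beta) - 1) * 100
    print(f"beta={beta:4.2f}  approx={approx:5.1f}%  exact={exact:5.1f}%")
# beta=0.02: 2.0% vs 2.0% -- approximation is fine
# beta=0.50: 50.0% vs 64.9% -- approximation is badly off
```

For small coefficients the shortcut is harmless; at β = 0.5 it understates the true effect by about 15 percentage points, so report the exact figure for large coefficients.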
Post-Estimation
After running a regression, you often want to do more than just look at coefficients. Predicted values show what your model expects for each observation—useful for identifying outliers or visualizing fit. Residuals are the difference between actual and predicted values—patterns in residuals can reveal model problems. Hypothesis tests let you ask specific questions: Is this coefficient different from zero? Are two coefficients equal? What's the combined effect of multiple variables?
reg income education age female
* Predicted values
predict yhat
* Residuals
predict resid, residuals
* Test coefficient equals zero
test education = 0
* Test multiple coefficients
test education age
* Linear combination of coefficients
lincom education + 2*age
model <- lm(income ~ education + age + female, data = data)
# Predicted values
data$yhat <- predict(model)
# Residuals
data$resid <- residuals(model)
# Test coefficient equals zero (from summary)
summary(model)$coefficients
# F-test for multiple coefficients
pacman::p_load(car)
linearHypothesis(model, c("education = 0", "age = 0"))
# Linear combination
pacman::p_load(multcomp)
glht(model, linfct = c("education + 2*age = 0"))
import statsmodels.formula.api as smf
model = smf.ols('income ~ education + age + female', data=data).fit()
# Predicted values
data['yhat'] = model.predict()
# Residuals
data['resid'] = model.resid
# Test coefficient equals zero (from summary)
print(model.summary())
# F-test for multiple coefficients (joint hypothesis)
print(model.f_test("(education = 0), (age = 0)"))
# Linear combination
print(model.t_test("education + 2*age = 0"))
Standard Errors
Standard errors measure how precisely you've estimated each coefficient. A coefficient of 10 with a standard error of 2 is very different from a coefficient of 10 with a standard error of 50—the first is precisely estimated, the second is essentially noise. Standard errors determine your confidence intervals and p-values, so getting them right is crucial for inference.
The challenge is that standard OLS assumes errors are independent across observations. In real data, this is often violated: observations within the same state are correlated, the same person observed over time is correlated, etc. When errors are correlated, standard OLS standard errors are too small, leading to false confidence in your results.
Always Cluster When in Doubt
If observations within groups (states, firms, individuals over time) are correlated, standard OLS standard errors are too small. Cluster at the level of treatment variation or the highest level of aggregation that makes sense.
Types of Standard Errors
Here are the main options for computing standard errors, from simplest to most robust:
* Heteroskedasticity-robust (HC1)
reg income education, robust
* Clustered by state
reg income education, cluster(state)
* Two-way clustering (requires reghdfe)
reghdfe income education, noabsorb cluster(state year)
* With reghdfe (recommended for fixed effects)
reghdfe income education, absorb(state) cluster(state)
pacman::p_load(fixest)
# Heteroskedasticity-robust
model <- feols(income ~ education, data = data, vcov = "hetero")
# Clustered by state
model <- feols(income ~ education, data = data, vcov = ~state)
# Two-way clustering
model <- feols(income ~ education, data = data, vcov = ~state + year)
# With fixed effects
model <- feols(income ~ education | state, data = data, vcov = ~state)
import statsmodels.formula.api as smf
import pyfixest as pf
# Heteroskedasticity-robust (HC1)
model = smf.ols('income ~ education', data=data).fit(cov_type='HC1')
# Clustered by state
model = smf.ols('income ~ education', data=data).fit(
cov_type='cluster', cov_kwds={'groups': data['state']})
# For fixed effects with clustering, use pyfixest
model = pf.feols('income ~ education | state', data=data,
vcov={'CRV1': 'state'})
Clustering Rules of Thumb
- Cluster at the level of treatment assignment (e.g., if policy varies by state, cluster by state)
- With few clusters (<30), consider wild bootstrap or other small-sample corrections
- Never cluster at a finer level than treatment variation
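To see the mechanics behind these rules, here is a numpy-only sketch that computes both conventional and cluster-robust standard errors by hand on simulated data with a strong group-level error component. It omits the finite-sample corrections that reghdfe and fixest apply, so the numbers are illustrative, not what those packages would report exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
G, m = 30, 10                       # 30 clusters, 10 observations each
groups = np.repeat(np.arange(G), m)

# x varies only at the group level; errors have a big group component
x = np.repeat(rng.normal(size=G), m)
e = np.repeat(rng.normal(0, 2, size=G), m) + rng.normal(0, 1, size=G * m)
y = 1.0 + 0.5 * x + e

X = np.column_stack([np.ones(G * m), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
u = y - X @ beta

# Conventional OLS variance: sigma^2 (X'X)^(-1)
n, k = X.shape
v_ols = (u @ u) / (n - k) * XtX_inv

# Cluster-robust "sandwich": sum of X_g' u_g u_g' X_g over clusters
meat = np.zeros((k, k))
for g in range(G):
    idx = groups == g
    s = X[idx].T @ u[idx]
    meat += np.outer(s, s)
v_cl = XtX_inv @ meat @ XtX_inv

se_ols = np.sqrt(v_ols[1, 1])
se_cl = np.sqrt(v_cl[1, 1])
print(f"conventional SE: {se_ols:.3f}   clustered SE: {se_cl:.3f}")
```

Because x only varies across groups and the errors are correlated within groups, the clustered standard error comes out several times larger than the conventional one — exactly the "too small" problem described above.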
Panel Data and Fixed Effects
Fixed Effects Regression
Fixed effects control for all time-invariant characteristics of each unit (observed and unobserved).
Why does this help? Consider state fixed effects: by including a dummy for each state, you compare each state to itself over time. California in 2020 is compared to California in 2015, not to Texas. This "differences out" all the permanent differences between states.
* Declare panel structure
xtset state year
* Entity fixed effects
xtreg income policy, fe
* Better: use reghdfe for speed and flexibility
* Install: ssc install reghdfe
reghdfe income policy, absorb(state)
* Two-way fixed effects (entity + time)
reghdfe income policy, absorb(state year)
* With clustered standard errors
reghdfe income policy, absorb(state year) cluster(state)
pacman::p_load(fixest)
# Entity fixed effects
model <- feols(income ~ policy | state, data = data)
# Two-way fixed effects (entity + time)
model <- feols(income ~ policy | state + year, data = data)
# With clustered standard errors
model <- feols(income ~ policy | state + year,
data = data, vcov = ~state)
# View results
summary(model)
etable(model)
import pyfixest as pf
# Entity fixed effects
model = pf.feols('income ~ policy | state', data=data)
# Two-way fixed effects (entity + time)
model = pf.feols('income ~ policy | state + year', data=data)
# With clustered standard errors
model = pf.feols('income ~ policy | state + year', data=data,
vcov={'CRV1': 'state'})
model.summary()
What Do Fixed Effects Actually Do?
Unit fixed effects control for all characteristics of a unit that don't change over time. To see why this matters, consider estimating the effect of education on wages.
You might worry that race and sex affect both education and wages. So you add them as controls:

wageᵢ = β₀ + β₁educationᵢ + β₂raceᵢ + β₃sexᵢ + εᵢ
But what about other time-invariant characteristics you can't measure? Family background, innate ability, childhood neighborhood, personality traits—all of these might affect both education and wages. You can't observe them, so you can't control for them.
Individual fixed effects solve this. By including a dummy for each person, you control for everything about that person that doesn't change over time—observed or unobserved:

wageᵢₜ = β₁educationᵢₜ + αᵢ + εᵢₜ
The αᵢ absorbs race, sex, family background, innate ability, and every other time-invariant characteristic. You don't need to list them individually—the fixed effect captures them all. The cost is you can no longer estimate the effect of time-invariant variables (β₂ and β₃ disappear), but you've eliminated a whole class of omitted variable bias.
Time fixed effects work the same way but for time periods. They control for anything that affects all units in a given period—recessions, policy changes, seasonal patterns:

wageᵢₜ = β₁educationᵢₜ + αᵢ + δₜ + εᵢₜ
With both unit and time fixed effects (two-way fixed effects), your estimate of β₁ comes purely from within-unit changes over time, after removing any time trends common to all units. This is the workhorse specification for causal inference with panel data.
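Mechanically, there is a useful way to see what the unit fixed effect does: demeaning every variable within unit (the "within transformation") and running OLS on the demeaned data gives exactly the same slope as including a dummy for every unit — an instance of the Frisch–Waugh–Lovell theorem. A numpy sketch on simulated panel data, where x is deliberately correlated with the unit effect:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 50, 8                              # 50 units, 8 periods
unit = np.repeat(np.arange(N), T)

alpha = rng.normal(0, 2, size=N)          # unit fixed effects
x = rng.normal(size=N * T) + alpha[unit]  # x correlated with alpha
y = 0.7 * x + alpha[unit] + rng.normal(size=N * T)

# For contrast: pooled OLS (no fixed effects) is biased upward here
xc, yc = x - x.mean(), y - y.mean()
print("pooled:", (xc @ yc) / (xc @ xc))   # noticeably above 0.7

# (1) LSDV: regress y on x plus a dummy for every unit
D = (unit[:, None] == np.arange(N)[None, :]).astype(float)
X_lsdv = np.column_stack([x, D])
beta_lsdv = np.linalg.lstsq(X_lsdv, y, rcond=None)[0][0]

# (2) Within transformation: demean x and y by unit, then simple OLS
def demean(v):
    means = np.bincount(unit, weights=v) / T
    return v - means[unit]

xd, yd = demean(x), demean(y)
beta_within = (xd @ yd) / (xd @ xd)

print(beta_lsdv, beta_within)             # identical up to floating point
assert abs(beta_lsdv - beta_within) < 1e-8
```

The pooled estimate is biased because x is correlated with αᵢ; the fixed-effects estimate recovers (approximately) the true slope of 0.7. Packages like reghdfe, fixest, and pyfixest use this within transformation internally, which is why they can handle thousands of fixed effects without ever building the dummy matrix.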
TWFE Powers DiD and Event Studies
Difference-in-differences and event studies—the most common designs in applied microeconomics—are almost always estimated using two-way fixed effects. The unit fixed effects control for permanent differences between treated and control groups. The time fixed effects control for common shocks that hit everyone. What's left is the treatment effect: how treated units changed relative to control units, relative to the time trend. When you see a DiD or event study paper, you're almost certainly looking at a TWFE regression.
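In the textbook 2×2 case — two groups, two periods, balanced panel — the TWFE coefficient on the treatment dummy is algebraically identical to the difference-in-differences of the four cell means. A simulated check (numpy only; the effect sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 100, 2
unit = np.repeat(np.arange(N), T)
period = np.tile(np.arange(T), N)
treated_unit = (np.arange(N) < 50).astype(float)  # first 50 units treated

# Treatment turns on for treated units in period 1; true effect is 1.5
d = treated_unit[unit] * (period == 1)
y = (2.0 * treated_unit[unit] + 1.0 * (period == 1)
     + 1.5 * d + rng.normal(size=N * T))

# TWFE: y on d plus unit dummies and a period dummy
U = (unit[:, None] == np.arange(N)[None, :]).astype(float)
P = (period == 1).astype(float)
X = np.column_stack([d, P, U])
beta_twfe = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Difference-in-differences of the four cell means
tr, post = treated_unit[unit] == 1, period == 1
did = (y[tr & post].mean() - y[tr & ~post].mean()) \
    - (y[~tr & post].mean() - y[~tr & ~post].mean())

print(beta_twfe, did)  # equal up to floating point
assert abs(beta_twfe - did) < 1e-7
```

The exact equivalence holds only in this balanced 2×2 setting; with staggered treatment timing the TWFE estimate becomes a more complicated weighted average, but the 2×2 intuition is the foundation.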
The Limitation
Fixed effects only control for time-invariant confounders. If something changes over time and affects both your treatment and outcome (a time-varying confounder), fixed effects won't help. You'd need to control for it directly or find a different identification strategy.
Regression Tables
Regression tables present your main results. The standard format in economics shows multiple specifications (columns) with the same outcome variable, progressively adding controls or changing samples. This shows readers how your results change as you adjust the model—stable coefficients across specifications build confidence in your findings.
Never Copy Numbers by Hand
Export tables directly from your code to LaTeX or Word. Manual copying introduces errors and makes your work non-reproducible. Use estout/esttab in Stata, etable/modelsummary in R, or stargazer in Python.
Basic Regression Tables
Standard elements include: coefficient estimates, standard errors (in parentheses below the coefficient), significance stars, the number of observations, R-squared or other fit statistics, and notes explaining what controls are included.
* Run regressions and store results
eststo clear
eststo: reg income treatment
eststo: reg income treatment age female
eststo: reg income treatment age female education
* Create table
esttab using "regression_table.tex", ///
se star(* 0.10 ** 0.05 *** 0.01) ///
label r2 ///
title("Effect of Treatment on Income") ///
mtitles("(1)" "(2)" "(3)") ///
addnotes("Standard errors in parentheses") ///
replace
* Common options:
* se - show standard errors (not t-stats)
* star(...) - significance stars
* label - use variable labels
* r2 - show R-squared
* b(3) se(3) - 3 decimal places
* drop(...) - hide certain coefficients
# ============================================
# Option 1: etable (from fixest) - simplest
# ============================================
pacman::p_load(fixest)
# Run regressions with feols
m1 <- feols(income ~ treatment, data = data)
m2 <- feols(income ~ treatment + age + female, data = data)
m3 <- feols(income ~ treatment + age + female + education, data = data)
# Quick console table
etable(m1, m2, m3)
# Export to LaTeX
etable(m1, m2, m3, tex = TRUE, file = "regression_table.tex")
# ============================================
# Option 2: modelsummary - most flexible
# ============================================
pacman::p_load(modelsummary)
# Works with any model type (lm, feols, glm, etc.)
m1 <- lm(income ~ treatment, data = data)
m2 <- lm(income ~ treatment + age + female, data = data)
m3 <- lm(income ~ treatment + age + female + education, data = data)
# Create table
modelsummary(
list("(1)" = m1, "(2)" = m2, "(3)" = m3),
stars = c('*' = 0.10, '**' = 0.05, '***' = 0.01),
gof_omit = "AIC|BIC|Log",
title = "Effect of Treatment on Income"
)
# Export to LaTeX, Word, or HTML
modelsummary(list(m1, m2, m3), output = "table.tex")
modelsummary(list(m1, m2, m3), output = "table.docx")
modelsummary(list(m1, m2, m3), output = "table.html")
import statsmodels.formula.api as smf
from stargazer.stargazer import Stargazer
# Run regressions
m1 = smf.ols("income ~ treatment", data=data).fit()
m2 = smf.ols("income ~ treatment + age + female", data=data).fit()
m3 = smf.ols("income ~ treatment + age + female + education", data=data).fit()
# Create table with stargazer
stargazer = Stargazer([m1, m2, m3])
stargazer.title("Effect of Treatment on Income")
print(stargazer.render_latex())
# Or manual approach
from statsmodels.iolib.summary2 import summary_col
print(summary_col([m1, m2, m3], stars=True).as_latex())
R Table Packages: Which to Use?
- etable (fixest): Best for quick tables with feols() models. Built into fixest, so no extra dependencies. Great for console output and fast LaTeX export.
- modelsummary: Most flexible and modern. Works with any model type, exports to LaTeX/Word/HTML/PNG, and has excellent customization. Recommended for publication tables.
Customizing Tables
Default table output rarely matches what you need for publication. You'll want to: rename variables, add rows showing which controls or fixed effects are included, report the mean of the dependent variable, and drop unimportant coefficients.
* Add mean of dependent variable
eststo clear
eststo: reg income treatment age female
estadd local controls "Yes"
sum income if e(sample)
estadd scalar mean_y = r(mean)
* Include in table
esttab using "table.tex", ///
se star(* 0.10 ** 0.05 *** 0.01) ///
scalars("mean_y Mean of Dep. Var." "controls Controls") ///
label replace
* Rename coefficients in output
esttab, rename(treatment "Treatment Effect")
# ----- etable (fixest) customization -----
etable(m1, m2, m3,
drop = "Constant", # Hide intercept
dict = c(treatment = "Treatment Effect"), # Rename
extralines = list(
"_^Controls" = c("No", "Yes", "Yes"),
"_^Mean of DV" = rep(round(mean(data$income), 2), 3)
),
tex = TRUE, file = "table.tex"
)
# ----- modelsummary customization -----
pacman::p_load(modelsummary, tibble)
rows <- tribble(
~term, ~"(1)", ~"(2)", ~"(3)",
"Controls", "No", "Yes", "Yes",
"Mean of DV", sprintf("%.2f", mean(data$income)),
sprintf("%.2f", mean(data$income)),
sprintf("%.2f", mean(data$income))
)
modelsummary(list(m1, m2, m3),
add_rows = rows,
coef_rename = c("treatment" = "Treatment Effect"),
coef_omit = "Intercept"
)
# Add custom rows with stargazer
stargazer = Stargazer([m1, m2])
stargazer.add_line("Controls", ["No", "Yes"])
stargazer.add_line("Mean of DV", [f"{data['income'].mean():.2f}"] * 2)
stargazer.rename_covariates({"treatment": "Treatment Effect"})
print(stargazer.render_latex())
Table Checklist
- Do all columns have clear headers?
- Are standard errors in parentheses (not t-stats)?
- Is the dependent variable clearly labeled?
- Are significance levels defined in a note?
- Is sample size reported for each column?
- Are controls indicated (even if not shown)?