Research and Communication in Economics

Stata Session 2

Reshaping, merging, loops, and regression

Building on what you know

Today's topics build directly on Session 1. Reshaping and merging are how you transform data into the format you need. Loops are just a way to avoid copy-paste. Regression commands might look intimidating, but they're just functions that take variables as inputs. If you understood summarize last week, you can understand regress today.

What You'll Learn

  • How to reshape data between "wide" and "long" formats
  • How to merge multiple datasets together
  • Using loops to avoid repetitive code
  • Working with locals and temporary files
  • Running OLS, fixed effects, and IV regressions

Download Complete Scripts

Complete session scripts are available for download in Stata (.do), R (.R), and Python (.py). A live quiz is available to test your understanding during class.

Reshaping Data

Data comes in two basic shapes: wide and long. Understanding the difference—and knowing how to convert between them—is essential for data analysis.

Wide vs. Long Format

Consider GDP data for three countries over three years:

Wide Format

Each row is a country. Years are columns.

country   gdp2018   gdp2019   gdp2020
USA       20.5      21.4      20.9
China     13.9      14.3      14.7
Japan     5.0       5.1       5.0

3 rows (one per country)

Long Format

Each row is a country-year. Year is a variable.

country   year   gdp
USA       2018   20.5
USA       2019   21.4
USA       2020   20.9
China     2018   13.9
...       ...    ...

9 rows (one per country-year)

When to Use Each Format

  • Long format: Usually needed for regression, plotting, and most analysis. This is the "tidy" format.
  • Wide format: Sometimes easier for data entry or viewing, but rarely what you want for analysis.

Rule of thumb: If you're unsure, you probably want long format.

Reshaping: Wide → Long

🤔 Predict: If you have wide data with 3 countries and columns gdp2018, gdp2019, gdp2020, how many rows will you have after reshaping to long?
* Start with wide data: country, gdp2018, gdp2019, gdp2020

* Reshape wide to long
* Syntax: reshape long [stub], i([id var]) j([new time var])
reshape long gdp, i(country) j(year)

* What this does:
*   - i(country): country is the ID variable (stays as is)
*   - j(year): creates a new variable "year" from the suffixes
*   - gdp: the stub - looks for gdp2018, gdp2019, etc.

* Check the result
list, clean
# Start with wide data: country, gdp2018, gdp2019, gdp2020

# Reshape wide to long using pivot_longer
df_long <- df %>%
  pivot_longer(
    cols = starts_with("gdp"),     # Columns to reshape
    names_to = "year",             # New column for the names
    names_prefix = "gdp",          # Remove "gdp" prefix
    values_to = "gdp"              # New column for the values
  ) %>%
  mutate(year = as.integer(year))  # Convert year to integer

# Check the result
print(df_long)
# Start with wide data: country, gdp2018, gdp2019, gdp2020

# Reshape wide to long using melt
df_long = df.melt(
    id_vars=['country'],           # Keep these as is
    var_name='year',               # Name for the variable column
    value_name='gdp'               # Name for the value column
)

# Clean up: extract year from "gdp2018" -> 2018
df_long['year'] = df_long['year'].str.replace('gdp', '').astype(int)

# Check the result
print(df_long)

Reshaping: Long → Wide

* Start with long data: country, year, gdp

* Reshape long to wide
reshape wide gdp, i(country) j(year)

* This creates: country, gdp2018, gdp2019, gdp2020
# Start with long data: country, year, gdp

# Reshape long to wide using pivot_wider
df_wide <- df_long %>%
  pivot_wider(
    names_from = year,
    values_from = gdp,
    names_prefix = "gdp"
  )

# This creates: country, gdp2018, gdp2019, gdp2020
# Start with long data: country, year, gdp

# Reshape long to wide using pivot
df_wide = df_long.pivot(
    index='country',
    columns='year',
    values='gdp'
).reset_index()

# This creates: country, 2018, 2019, 2020
# Rename columns if you want the "gdp" prefix
df_wide.columns = ['country'] + [f'gdp{y}' for y in df_wide.columns[1:]]

Merging Data

Real research almost always requires combining data from multiple sources. This is called merging (Stata) or joining (R/Python).

The Concept

Imagine you have:

  • Dataset A: People's ages (id, age)
  • Dataset B: People's incomes (id, income)

You want to combine them into one dataset with id, age, AND income. The key (the variable that links them) is id.

🤔 Predict: What happens if person 5 is in Dataset A but not in Dataset B? What should their income be in the merged data?

Merge Types

💡 Understanding Merge Notation:

In Stata, the master dataset is the one already in memory (the one you're working with), and the using dataset is the one you're bringing in with the merge command. The notation m:1 means "many-to-one": many rows in the master match one row in the using dataset. For example, if your master has 10,000 individual people and your using has 50 state-level rows, that's m:1—many people per state, one state row each. In R, the first argument to merge() plays the role of the master, and the second is the using.

Type   Stata        R                          When to Use
1:1    merge 1:1    merge(...)                 Each row in A matches exactly one row in B
m:1    merge m:1    merge(..., all.x = TRUE)   Many rows in A match one row in B (e.g., people to states)
1:m    merge 1:m    merge(..., all.y = TRUE)   One row in A matches many rows in B
m:m    merge m:m    (avoid)                    Almost never correct! Usually indicates a problem.
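To make the m:1 idea (and the unmatched-person question above) concrete, here is a tiny pure-Python sketch of a left join on toy data. This is not a real merge tool, and all the names and numbers are made up for illustration:

```python
# Toy m:1 left join: many people per state, one unemployment row per state.
people = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "CA"},
    {"id": 3, "state": "TX"},
    {"id": 4, "state": "ZZ"},  # no match in the state table
]
state_unemp = {"CA": 4.1, "TX": 3.8}  # hypothetical unemployment rates

merged = []
for p in people:
    row = dict(p)
    # Left join: keep every person; unmatched people get None (missing)
    row["unemployment"] = state_unemp.get(p["state"])
    merged.append(row)

for row in merged:
    print(row)
```

The merge commands below do exactly this bookkeeping for you, and additionally report which rows matched and which did not.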

Merging: Step by Step

* Example: Merge person data with state-level unemployment

* Step 1: Load the "master" dataset (person-level)
use person_data.dta, clear
count  // Check: 10,000 people

* Step 2: Merge with "using" dataset (state-level)
* m:1 because many people live in each state
merge m:1 state using state_unemployment.dta

* Step 3: CHECK THE MERGE RESULTS
tab _merge

* _merge values:
*   1 = in master only (person with no matching state)
*   2 = in using only (state with no people)
*   3 = matched (what you want)

* Step 4: Decide what to do with unmatched
* Usually keep only matched:
keep if _merge == 3
drop _merge
# Example: Merge person data with state-level unemployment

# Step 1: Load datasets
person_data <- read_csv("person_data.csv")
state_data <- read_csv("state_unemployment.csv")

# Check dimensions
nrow(person_data)  # 10,000 people
nrow(state_data)   # 50 states

# Step 2: Merge
# all.x = TRUE keeps all people, adds state data where matched
merged <- merge(person_data, state_data, by = "state", all.x = TRUE)

# Step 3: Check for unmatched
sum(is.na(merged$unemployment))  # How many have missing state data?

# Step 4: Remove unmatched if needed
merged <- merged %>% filter(!is.na(unemployment))
# Example: Merge person data with state-level unemployment

# Step 1: Load datasets
person_data = pd.read_csv("person_data.csv")
state_data = pd.read_csv("state_unemployment.csv")

# Check dimensions
len(person_data)  # 10,000 people
len(state_data)   # 50 states

# Step 2: Merge
# merge with indicator to check matching
merged = person_data.merge(
    state_data,
    on='state',
    how='left',
    indicator=True
)

# Step 3: Check for unmatched
print(merged['_merge'].value_counts())

# Step 4: Remove unmatched if needed
merged = merged[merged['_merge'] == 'both']
merged = merged.drop(columns=['_merge'])
⚠️ Always Check Your Merge!
  • Did the number of observations change unexpectedly?
  • How many observations matched vs. didn't match?
  • Do unmatched observations make sense (e.g., states with no people)?

Most merge bugs are silent—your code runs but gives wrong answers. Checking _merge is how you catch them.

Loops and Locals

Loops let you write code once and run it many times. If you ever find yourself copy-pasting code and just changing one number, you should use a loop instead.

The forvalues Loop

🎯 Real scenario: You're creating a categorical variable for number of children (0, 1, 2, 3+). You could write 4 separate lines... or use a loop.

Without a loop (bad):

* This is repetitive and error-prone
replace kids_cat = 0 if nchild == 0
replace kids_cat = 1 if nchild == 1
replace kids_cat = 2 if nchild == 2
replace kids_cat = 3 if nchild == 3
# This is repetitive and error-prone
df$kids_cat[df$nchild == 0] <- 0
df$kids_cat[df$nchild == 1] <- 1
df$kids_cat[df$nchild == 2] <- 2
df$kids_cat[df$nchild == 3] <- 3
# This is repetitive and error-prone
df.loc[df['nchild'] == 0, 'kids_cat'] = 0
df.loc[df['nchild'] == 1, 'kids_cat'] = 1
df.loc[df['nchild'] == 2, 'kids_cat'] = 2
df.loc[df['nchild'] == 3, 'kids_cat'] = 3

With a loop (good):

* Much cleaner!
forvalues i = 0/3 {
    replace kids_cat = `i' if nchild == `i'
}
# Much cleaner!
for (i in 0:3) {
    df$kids_cat[df$nchild == i] <- i
}
# Much cleaner!
for i in range(4):
    df.loc[df['nchild'] == i, 'kids_cat'] = i
💡 What happens when this runs (Stata):
  1. First iteration: i = 0 → runs replace kids_cat = 0 if nchild == 0
  2. Second iteration: i = 1 → runs replace kids_cat = 1 if nchild == 1
  3. Third iteration: i = 2 → runs replace kids_cat = 2 if nchild == 2
  4. Fourth iteration: i = 3 → runs replace kids_cat = 3 if nchild == 3

The backticks `i' get replaced with the current value of the loop variable.

How to Build a Loop

  1. Write the code for one value first (e.g., the i = 0 case)
  2. Make sure it works
  3. Then wrap it in a loop, replacing the number with the loop variable

This way, if something breaks, you know whether it's the code or the loop.
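The same build-one-value-first workflow in miniature, as a pure-Python sketch with toy records standing in for a dataset:

```python
# Toy data: each record has a number of children
records = [{"nchild": n} for n in [0, 2, 1, 3, 2]]

# Step 1: write the code for ONE value first, and check that it works
for r in records:
    if r["nchild"] == 0:
        r["kids_cat"] = 0

# Step 3: once that works, wrap it in a loop over all the values,
# replacing the hard-coded 0 with the loop variable i
for i in range(4):
    for r in records:
        if r["nchild"] == i:
            r["kids_cat"] = i

print([r["kids_cat"] for r in records])  # → [0, 2, 1, 3, 2]
```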

Loop Variations

🤔 Before running this code, predict: What will forvalues i = 5(5)100 print? (Hint: the middle number is the step size.)
* Count from 0 to 3
forvalues i = 0/3 {
    display `i'
}

* Count by 5s from 5 to 100
forvalues i = 5(5)100 {
    display `i'
}

* Loop over a list of values (not just numbers)
foreach v in income age education {
    summarize `v'
}
# Count from 0 to 3
for (i in 0:3) {
    print(i)
}

# Count by 5s from 5 to 100
for (i in seq(5, 100, by = 5)) {
    print(i)
}

# Loop over a list of values (not just numbers)
for (v in c("income", "age", "education")) {
    print(summary(df[[v]]))
}
# Count from 0 to 3
for i in range(4):
    print(i)

# Count by 5s from 5 to 100
for i in range(5, 101, 5):
    print(i)

# Loop over a list of values (not just numbers)
for v in ['income', 'age', 'education']:
    print(df[v].describe())
Click to reveal answer: What does 5(5)100 print?

It prints: 5, 10, 15, 20, 25, ... 95, 100. The syntax is start(step)end, so this starts at 5, goes up by 5 each time, and stops at 100.

💡 foreach vs forvalues:
  • forvalues = loop over numbers (0, 1, 2, 3 or 5, 10, 15...)
  • foreach = loop over a list of anything (variable names, file names, strings)

Quick Check: Loops

Question: In Stata, what values will forvalues i = 0/3 iterate through?

Understanding Locals

A local (in Stata) or variable (in R/Python) stores a value temporarily. Think of it as giving a nickname to something you'll use repeatedly.

🎯 Why use locals?
  • Change once, update everywhere: If you store your control variables in a local, you only need to change one line to add/remove a control
  • Store results: Save a coefficient from one regression to use in another calculation
  • Make code readable: `controls' is clearer than a long list of variables

In Stata, you define a local by its bare name, but to use it you wrap the name in a backtick and an apostrophe: `name'

* Create a local
local myvar = 7

* Use the local
display `myvar'

* Locals are useful for storing lists
local controls "age education income"
reg wage `controls'

* Or for storing results
reg wage education
local coef = _b[education]
display "The coefficient was `coef'"
# Create a variable
myvar <- 7

# Use the variable
print(myvar)

# Variables are useful for storing lists
controls <- c("age", "education", "income")
model <- lm(wage ~ age + education + income, data = df)

# Or for storing results
model <- lm(wage ~ education, data = df)
coef_val <- coef(model)["education"]
print(paste("The coefficient was", coef_val))
# Create a variable
myvar = 7

# Use the variable
print(myvar)

# Variables are useful for storing lists
controls = ['age', 'education', 'income']
import statsmodels.formula.api as smf
model = smf.ols('wage ~ age + education + income', data=df).fit()

# Or for storing results
model = smf.ols('wage ~ education', data=df).fit()
coef_val = model.params['education']
print(f"The coefficient was {coef_val}")
⚠️ CRITICAL STATA QUIRK: Locals Only Last One "Run"

If you highlight local x = 5 and run it, then separately highlight display `x' and run it, Stata says "x not found". Why? Locals disappear at the end of each "run". Solution: Highlight BOTH lines and run them together, or put them in a do-file. (R and Python don't have this issue.)

Preserve and Restore

Sometimes you want to temporarily modify data, do something, then go back to the original:

* Save the current state
preserve

* Do something that changes the data
keep if state == "CA"
save california_only.dta, replace

* Go back to the full dataset
restore

* The full dataset is back!
# Make a copy before modifying
df_backup <- df

# Do something that changes the data
df_ca <- df[df$state == "CA", ]
saveRDS(df_ca, "california_only.rds")

# The original is still in df_backup
# (or just use df_ca and keep df unchanged)
# Make a copy before modifying
df_backup = df.copy()

# Do something that changes the data
df_ca = df[df['state'] == 'CA']
df_ca.to_csv('california_only.csv', index=False)

# The original is still in df_backup
# (or just filter without reassigning to keep df unchanged)

Regression

This section covers the main regression commands you'll use in applied economics research. The syntax is similar across languages: you specify a dependent variable (Y), independent variables (X), and options like standard error types.

Basic OLS Regression

Stata syntax:

reg wage education age experience, robust

  • reg: the command
  • wage: Y (the dependent variable)
  • education age experience: X (the independent variables)
  • , robust: options
* Simple regression
reg wage education

* Add control variables
reg wage education age experience

* Heteroskedasticity-robust standard errors
reg wage education age experience, robust

* Cluster standard errors by state
reg wage education age experience, cluster(state)
# Simple regression
model <- lm(wage ~ education, data = df)
summary(model)

# Add control variables
model <- lm(wage ~ education + age + experience, data = df)

# Heteroskedasticity-robust standard errors
pacman::p_load(sandwich, lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Cluster standard errors by state
pacman::p_load(clubSandwich)
coef_test(model, vcov = vcovCR(model, cluster = df$state, type = "CR0"))
import statsmodels.formula.api as smf

# Simple regression
model = smf.ols('wage ~ education', data=df).fit()
print(model.summary())

# Add control variables
model = smf.ols('wage ~ education + age + experience', data=df).fit()

# Heteroskedasticity-robust standard errors
model_robust = smf.ols('wage ~ education + age + experience', data=df).fit(
    cov_type='HC1'
)

# Cluster standard errors by state
model_cluster = smf.ols('wage ~ education + age + experience', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']}
)
💡 Which standard errors should I use?
Situation                                    Standard Errors      Stata Option
Cross-sectional data, simple models          Robust (HC1)*        , robust
Panel data (observations grouped by unit)    Clustered by unit    , cluster(state)
Workers grouped by firm                      Clustered by firm    , cluster(firm)
DiD with state-level treatment               Clustered by state   , cluster(state)

Rule of thumb: Cluster at the level of treatment assignment or the level where errors might be correlated.

*What is HC1? HC1 stands for "Heteroskedasticity-Consistent type 1." Stata's , robust option uses HC1 by default. R and Python require you to specify it explicitly. For most economics papers, HC1 is standard.
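For intuition, here is a from-scratch numpy sketch of what the HC1 "sandwich" estimator computes, on simulated data. In practice you would use the library options shown above, not this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# Heteroskedastic errors: variance grows with |x|
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + np.abs(x))

X = np.column_stack([np.ones(n), x])        # design matrix with intercept
k = X.shape[1]

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS coefficients
e = y - X @ beta                             # residuals

bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (e**2)[:, None])           # X' diag(e^2) X
V = (n / (n - k)) * bread @ meat @ bread     # HC1: finite-sample scaling n/(n-k)
se_hc1 = np.sqrt(np.diag(V))
print(se_hc1)
```

The n/(n - k) factor is the only difference between HC1 and the original White (HC0) estimator; it is a small-sample correction.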

Interactions

🎯 When to use interactions: When you think the effect of X on Y depends on another variable.

Example: Does the return to education differ by gender? Add education##gender to test whether the education coefficient is different for men vs. women.

* # adds just the interaction
reg wage education gender#married

* ## adds the interaction AND main effects
reg wage education gender##married

* For continuous variables, use c. prefix
reg wage c.education##c.experience
# : adds just the interaction
model <- lm(wage ~ education + gender:married, data = df)

# * adds the interaction AND main effects
model <- lm(wage ~ education + gender*married, data = df)

# For continuous variables
model <- lm(wage ~ education * experience, data = df)
# : adds just the interaction
model = smf.ols('wage ~ education + gender:married', data=df).fit()

# * adds the interaction AND main effects
model = smf.ols('wage ~ education + gender*married', data=df).fit()

# For continuous variables
model = smf.ols('wage ~ education * experience', data=df).fit()
💡 Interpreting education##female coefficients:
  • education = effect of education for males (the reference group)
  • 1.female = wage gap between females and males (at 0 years of education)
  • 1.female#c.education = how much more/less females gain from each year of education compared to males

Total effect of education for females = education + 1.female#c.education
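That total effect is a linear combination of two coefficients, and its standard error follows the usual variance-of-a-sum rule. A sketch with hypothetical numbers (every value below is made up for illustration, not real output):

```python
import math

# Hypothetical regression output (NOT real estimates):
b_edu = 0.080     # education (effect for males)
b_int = -0.020    # 1.female#c.education (interaction)
var_edu = 0.0004  # Var(b_edu)
var_int = 0.0009  # Var(b_int)
cov_ei = -0.0002  # Cov(b_edu, b_int)

# Total effect of education for females = b_edu + b_int
total = b_edu + b_int

# SE of a sum: sqrt(Var(a) + Var(b) + 2*Cov(a, b))
se_total = math.sqrt(var_edu + var_int + 2 * cov_ei)

print(round(total, 4), round(se_total, 4))  # → 0.06 0.03
```

In Stata, the lincom command computes such combinations and their standard errors for you.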

Quick Check: Standard Errors

Question: You're analyzing wage data where workers are grouped by firm. What type of standard errors should you use?

Fixed Effects

🎯 What fixed effects do: Control for all time-invariant characteristics of a group.

State fixed effects control for everything about a state that doesn't change over time (geography, culture, history).
Year fixed effects control for everything that affects all states equally in a given year (recessions, federal policy).

Two-way fixed effects (TWFE) = state FE + year FE. This is the workhorse of diff-in-diff.

* Using areg with the absorb() option to add fixed effects
* (absorbs the FE instead of reporting their coefficients,
*  which is usually what you want)
areg wage education age, absorb(state)

* For MANY fixed effects, use reghdfe (faster)
ssc install reghdfe  // run once to install
ssc install ftools   // dependency

reghdfe wage education age, absorb(state year)

* Two-way fixed effects with clustering
reghdfe wage education, absorb(state year) cluster(state)
# Using fixest package (fastest option)
pacman::p_load(fixest)

# One-way fixed effects
model <- feols(wage ~ education + age | state, data = df)

# Two-way fixed effects
model <- feols(wage ~ education + age | state + year, data = df)

# With clustered standard errors
model <- feols(wage ~ education | state + year,
               cluster = ~state, data = df)
# Using pyfixest (similar to R's fixest)
import pyfixest as pf

# One-way fixed effects
model = pf.feols('wage ~ education + age | state', data=df)

# Two-way fixed effects
model = pf.feols('wage ~ education + age | state + year', data=df)

# With clustered standard errors
model = pf.feols('wage ~ education | state + year',
                 vcov={'CRV1': 'state'}, data=df)

When to Use High-Dimensional Fixed Effects

Use reghdfe (Stata), fixest (R), or pyfixest (Python) whenever you have many fixed effects (hundreds or thousands of groups). They're much faster than including dummy variables, and they handle multiple sets of fixed effects efficiently.

Instrumental Variables

🎯 When to use IV: When your key independent variable is endogenous (correlated with the error term).

Classic example: Does education increase wages? Problem: unobserved ability affects both education AND wages.
Solution: Find an instrument (like quarter of birth) that affects education but has no direct effect on wages.

⚠️ A valid instrument must satisfy two conditions:
  1. Relevance: The instrument affects the endogenous variable (check the first-stage F-statistic)
  2. Exclusion: The instrument affects Y only through the endogenous variable (not directly)

Condition 1 is testable. Condition 2 is not—you must argue it's plausible.
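Checking relevance just means running the first stage and testing the instrument; with a single instrument, the first-stage F is the squared t-statistic on that instrument. A from-scratch numpy sketch on simulated data (in practice, your IV command reports this statistic for you):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
z = rng.normal(size=n)                               # instrument
ability = rng.normal(size=n)                         # unobserved confounder
educ = 0.5 * z + 0.5 * ability + rng.normal(size=n)  # endogenous regressor

# First stage: regress the endogenous variable on the instrument
Z = np.column_stack([np.ones(n), z])
g = np.linalg.lstsq(Z, educ, rcond=None)[0]
e = educ - Z @ g
sigma2 = (e @ e) / (n - 2)
var_g = sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]
t = g[1] / np.sqrt(var_g)
F = t**2                                             # first-stage F, one instrument
print(F)
```

A common rule of thumb is a first-stage F above 10, though recent work argues for stricter thresholds.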

* Install ivreghdfe
ssc install ivreghdfe  // run once

* IV regression syntax: (endogenous = instruments)
* Example: education is endogenous, quarter_of_birth is the instrument
ivreghdfe wage (education = quarter_of_birth) age experience

* With fixed effects
ivreghdfe wage (education = quarter_of_birth) age, absorb(state year)
# Using fixest for IV
pacman::p_load(fixest)

# IV regression: education instrumented by quarter_of_birth
model <- feols(wage ~ age + experience |
               education ~ quarter_of_birth, data = df)

# With fixed effects
model <- feols(wage ~ age | state + year |
               education ~ quarter_of_birth, data = df)
# Using pyfixest for IV
import pyfixest as pf

# IV regression: education instrumented by quarter_of_birth
model = pf.feols('wage ~ age + experience | education ~ quarter_of_birth',
                 data=df)

# With fixed effects
model = pf.feols('wage ~ age | state + year | education ~ quarter_of_birth',
                 data=df)

Exporting Results

* Install estout for publication tables
ssc install estout

* Store results
eststo clear
eststo m1: reg wage education, robust
eststo m2: reg wage education age, robust
eststo m3: reg wage education age experience, robust

* Create a nice table
esttab m1 m2 m3, se r2 label ///
    title("Wage Regressions") ///
    mtitles("(1)" "(2)" "(3)")

* Export to LaTeX
esttab m1 m2 m3 using "tables/wage_regs.tex", replace ///
    se r2 label booktabs
# Using modelsummary
pacman::p_load(modelsummary)

# Run regressions
m1 <- lm(wage ~ education, data = df)
m2 <- lm(wage ~ education + age, data = df)
m3 <- lm(wage ~ education + age + experience, data = df)

# Create table
modelsummary(list("(1)" = m1, "(2)" = m2, "(3)" = m3),
             stars = TRUE,
             gof_omit = "IC|Log")

# Export to LaTeX
modelsummary(list(m1, m2, m3),
             output = "tables/wage_regs.tex")
# Using stargazer-style output with pyfixest
import pyfixest as pf

# Run regressions
m1 = pf.feols('wage ~ education', data=df, vcov='HC1')
m2 = pf.feols('wage ~ education + age', data=df, vcov='HC1')
m3 = pf.feols('wage ~ education + age + experience', data=df, vcov='HC1')

# Create table
pf.etable([m1, m2, m3])

# Export to LaTeX
pf.etable([m1, m2, m3], type='tex', file='tables/wage_regs.tex')

Practice Exercise

Try this exercise to cement what you've learned. Work through it step by step—don't just read the answer.

Exercise: Wage Regression with Loops

You have wage data with variables wage, education, age, experience, and state. Write code to:

  1. Create a local/variable called controls containing age experience
  2. Run a regression of wage on education plus the controls, with robust standard errors
  3. Run the same regression with state fixed effects and clustered standard errors
  4. Use a loop to summarize all three independent variables (education, age, experience)
Click to see solution (Stata)
* 1. Create local with controls
local controls "age experience"

* 2. OLS with robust SEs
reg wage education `controls', robust

* 3. With state FE and clustered SEs
reghdfe wage education `controls', absorb(state) cluster(state)

* 4. Summarize each variable in a loop
foreach v in education age experience {
    summarize `v'
}
Click to see solution (R)
pacman::p_load(fixest, lmtest, sandwich)

# 1. Vector of controls
controls <- c("age", "experience")

# 2. OLS with robust SEs (build the formula from the controls vector)
f <- reformulate(c("education", controls), response = "wage")
m1 <- lm(f, data = df)
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))

# 3. With state FE and clustered SEs
m2 <- feols(wage ~ education + age + experience | state,
            cluster = ~state, data = df)

# 4. Summarize each variable in a loop
for (v in c("education", "age", "experience")) {
    print(summary(df[[v]]))
}
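Click to see solution (Python)

A sketch using statsmodels only. The toy DataFrame below stands in for your real data, and the C(state) dummies give state fixed effects, which is fine with a handful of states (with many groups, use pyfixest instead):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data so the solution runs end to end (replace with your real dataset)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'education': rng.integers(8, 21, size=n),
    'age': rng.integers(18, 66, size=n),
    'experience': rng.integers(0, 41, size=n),
    'state': rng.choice(['CA', 'TX', 'NY'], size=n),
})
df['wage'] = 5 + 1.2 * df['education'] + 0.1 * df['age'] + rng.normal(size=n)

# 1. List of controls
controls = ['age', 'experience']

# 2. OLS with robust SEs (build the formula from the controls list)
rhs = ' + '.join(['education'] + controls)
m1 = smf.ols(f'wage ~ {rhs}', data=df).fit(cov_type='HC1')
print(m1.params['education'])

# 3. With state FE (as dummies) and clustered SEs
groups = df['state'].astype('category').cat.codes  # integer cluster codes
m2 = smf.ols(f'wage ~ {rhs} + C(state)', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': groups}
)

# 4. Summarize each variable in a loop
for v in ['education', 'age', 'experience']:
    print(df[v].describe())
```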

Session Complete!

You've learned the essentials of reshaping, merging, loops, locals, and regression. The key takeaways:

  • Loops eliminate copy-paste code. Write it once, run it many times.
  • Locals store values you'll reuse. Change once, update everywhere.
  • Reshape to long format for analysis, and always check _merge (or its equivalent) after a merge; merge bugs are silent.
  • Standard errors should be clustered at the level of treatment or correlation.
  • Fixed effects control for time-invariant unobservables.


Found something unclear or have a suggestion? Email [email protected].