Stata Session 2
Reshaping, merging, loops, and regression
Today's topics build directly on Session 1. Reshaping and merging are how you transform data into the format you need. Loops are just a way to avoid copy-paste. Regression commands might look intimidating, but they're just functions that take variables as inputs. If you understood summarize last week, you can understand regress today.
What You'll Learn
- How to reshape data between "wide" and "long" formats
- How to merge multiple datasets together
- Using loops to avoid repetitive code
- Working with locals and temporary files
- Running OLS, fixed effects, and IV regressions
Reshaping Data
Data comes in two basic shapes: wide and long. Understanding the difference—and knowing how to convert between them—is essential for data analysis.
Wide vs. Long Format
Consider GDP data for three countries over three years:
Wide Format
Each row is a country. Years are columns.
| country | gdp2018 | gdp2019 | gdp2020 |
|---|---|---|---|
| USA | 20.5 | 21.4 | 20.9 |
| China | 13.9 | 14.3 | 14.7 |
| Japan | 5.0 | 5.1 | 5.0 |
3 rows (one per country)
Long Format
Each row is a country-year. Year is a variable.
| country | year | gdp |
|---|---|---|
| USA | 2018 | 20.5 |
| USA | 2019 | 21.4 |
| USA | 2020 | 20.9 |
| China | 2018 | 13.9 |
| ... | ... | ... |
9 rows (one per country-year)
When to Use Each Format
- Long format: Usually needed for regression, plotting, and most analysis. This is the "tidy" format.
- Wide format: Sometimes easier for data entry or viewing, but rarely what you want for analysis.
Rule of thumb: If you're unsure, you probably want long format.
Reshaping: Wide → Long
Quick check: If your wide data has three countries and the columns gdp2018, gdp2019, gdp2020, how many rows will you have after reshaping to long? (Answer: 9, one row per country-year.)
* Start with wide data: country, gdp2018, gdp2019, gdp2020
* Reshape wide to long
* Syntax: reshape long [stub], i([id var]) j([new time var])
reshape long gdp, i(country) j(year)
* What this does:
* - i(country): country is the ID variable (stays as is)
* - j(year): creates a new variable "year" from the suffixes
* - gdp: the stub - looks for gdp2018, gdp2019, etc.
* Check the result
list, clean
# Start with wide data: country, gdp2018, gdp2019, gdp2020
# Reshape wide to long using pivot_longer
df_long <- df %>%
pivot_longer(
cols = starts_with("gdp"), # Columns to reshape
names_to = "year", # New column for the names
names_prefix = "gdp", # Remove "gdp" prefix
values_to = "gdp" # New column for the values
) %>%
mutate(year = as.integer(year)) # Convert year to integer
# Check the result
print(df_long)
# Start with wide data: country, gdp2018, gdp2019, gdp2020
# Reshape wide to long using melt
df_long = df.melt(
id_vars=['country'], # Keep these as is
var_name='year', # Name for the variable column
value_name='gdp' # Name for the value column
)
# Clean up: extract year from "gdp2018" -> 2018
df_long['year'] = df_long['year'].str.replace('gdp', '').astype(int)
# Check the result
print(df_long)
Reshaping: Long → Wide
* Start with long data: country, year, gdp
* Reshape long to wide
reshape wide gdp, i(country) j(year)
* This creates: country, gdp2018, gdp2019, gdp2020
# Start with long data: country, year, gdp
# Reshape long to wide using pivot_wider
df_wide <- df_long %>%
pivot_wider(
names_from = year,
values_from = gdp,
names_prefix = "gdp"
)
# This creates: country, gdp2018, gdp2019, gdp2020
# Start with long data: country, year, gdp
# Reshape long to wide using pivot
df_wide = df_long.pivot(
index='country',
columns='year',
values='gdp'
).reset_index()
# This creates: country, 2018, 2019, 2020
# Rename columns if you want the "gdp" prefix
df_wide.columns = ['country'] + [f'gdp{y}' for y in df_wide.columns[1:]]
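To make the row-count logic concrete, here's a minimal, self-contained pandas sketch of the full round trip, using the (made-up) GDP numbers from the tables above:

```python
import pandas as pd

# Wide data matching the GDP table above
df = pd.DataFrame({
    'country': ['USA', 'China', 'Japan'],
    'gdp2018': [20.5, 13.9, 5.0],
    'gdp2019': [21.4, 14.3, 5.1],
    'gdp2020': [20.9, 14.7, 5.0],
})

# Wide -> long: 3 rows become 9 (one per country-year)
df_long = df.melt(id_vars=['country'], var_name='year', value_name='gdp')
df_long['year'] = df_long['year'].str.replace('gdp', '').astype(int)
print(len(df), '->', len(df_long))  # 3 -> 9

# Long -> wide: back to 3 rows (one per country)
df_wide = df_long.pivot(index='country', columns='year', values='gdp').reset_index()
print(len(df_wide))  # 3
```

The round trip is lossless here because country-year uniquely identifies each GDP value.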
Merging Data
Real research almost always requires combining data from multiple sources. This is called merging (Stata) or joining (R/Python).
The Concept
Imagine you have:
- Dataset A: People's ages (id, age)
- Dataset B: People's incomes (id, income)
You want to combine them into one dataset with id, age, AND income. The key (the variable that links them) is id.
Merge Types
In Stata, the master dataset is the one already in memory (the one you're working with), and the using dataset is the one you're bringing in with the merge command. The notation m:1 means "many-to-one": many rows in the master match one row in the using dataset. For example, if your master has 10,000 individual people and your using has 50 state-level rows, that's m:1—many people per state, one state row each. In R, the first argument to merge() plays the role of the master, and the second is the using.
| Type | Stata | R | When to Use |
|---|---|---|---|
| 1:1 | merge 1:1 | merge(...) | Each row in A matches exactly one row in B |
| m:1 | merge m:1 | merge(..., all.x = TRUE) | Many rows in A match one row in B (e.g., people to states) |
| 1:m | merge 1:m | merge(..., all.y = TRUE) | One row in A matches many rows in B |
| m:m | merge m:m | — | Almost never correct! Usually indicates a problem. |
Merging: Step by Step
* Example: Merge person data with state-level unemployment
* Step 1: Load the "master" dataset (person-level)
use person_data.dta, clear
count // Check: 10,000 people
* Step 2: Merge with "using" dataset (state-level)
* m:1 because many people live in each state
merge m:1 state using state_unemployment.dta
* Step 3: CHECK THE MERGE RESULTS
tab _merge
* _merge values:
* 1 = in master only (person with no matching state)
* 2 = in using only (state with no people)
* 3 = matched (what you want)
* Step 4: Decide what to do with unmatched
* Usually keep only matched:
keep if _merge == 3
drop _merge
# Example: Merge person data with state-level unemployment
# Step 1: Load datasets
person_data <- read_csv("person_data.csv")
state_data <- read_csv("state_unemployment.csv")
# Check dimensions
nrow(person_data) # 10,000 people
nrow(state_data) # 50 states
# Step 2: Merge
# all.x = TRUE keeps all people, adds state data where matched
merged <- merge(person_data, state_data, by = "state", all.x = TRUE)
# Step 3: Check for unmatched
sum(is.na(merged$unemployment)) # How many have missing state data?
# Step 4: Remove unmatched if needed
merged <- merged %>% filter(!is.na(unemployment))
# Example: Merge person data with state-level unemployment
# Step 1: Load datasets
person_data = pd.read_csv("person_data.csv")
state_data = pd.read_csv("state_unemployment.csv")
# Check dimensions
len(person_data) # 10,000 people
len(state_data) # 50 states
# Step 2: Merge
# merge with indicator to check matching
merged = person_data.merge(
state_data,
on='state',
how='left',
indicator=True
)
# Step 3: Check for unmatched
print(merged['_merge'].value_counts())
# Step 4: Remove unmatched if needed
merged = merged[merged['_merge'] == 'both']
merged = merged.drop(columns=['_merge'])
After every merge, check:
- Did the number of observations change unexpectedly?
- How many observations matched vs. didn't match?
- Do unmatched observations make sense (e.g., states with no people)?
Most merge bugs are silent—your code runs but gives wrong answers. Checking _merge is how you catch them.
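You can also make pandas fail loudly instead of silently: the validate argument to merge raises an error if the key structure isn't what you claimed. A small sketch with made-up data (the variable names are hypothetical):

```python
import pandas as pd

people = pd.DataFrame({'id': [1, 2, 3], 'state': ['CA', 'CA', 'NY']})
states = pd.DataFrame({'state': ['CA', 'NY', 'TX'],
                       'unemployment': [5.1, 4.3, 3.9]})

merged = people.merge(
    states,
    on='state',
    how='left',
    validate='m:1',   # raises MergeError if 'state' is not unique in states
    indicator=True,   # adds a _merge column: 'both', 'left_only', 'right_only'
)

# Explicit check, analogous to "tab _merge" in Stata
assert (merged['_merge'] == 'both').all(), "some people did not match a state"
print(merged['_merge'].value_counts())
```

If someone accidentally duplicates a state row upstream, validate='m:1' turns a silent wrong answer into an immediate error.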
Loops and Locals
Loops let you write code once and run it many times. If you ever find yourself copy-pasting code and just changing one number, you should use a loop instead.
The forvalues Loop
Without a loop (bad):
* This is repetitive and error-prone
replace kids_cat = 0 if nchild == 0
replace kids_cat = 1 if nchild == 1
replace kids_cat = 2 if nchild == 2
replace kids_cat = 3 if nchild == 3
# This is repetitive and error-prone
df$kids_cat[df$nchild == 0] <- 0
df$kids_cat[df$nchild == 1] <- 1
df$kids_cat[df$nchild == 2] <- 2
df$kids_cat[df$nchild == 3] <- 3
# This is repetitive and error-prone
df.loc[df['nchild'] == 0, 'kids_cat'] = 0
df.loc[df['nchild'] == 1, 'kids_cat'] = 1
df.loc[df['nchild'] == 2, 'kids_cat'] = 2
df.loc[df['nchild'] == 3, 'kids_cat'] = 3
With a loop (good):
* Much cleaner!
forvalues i = 0/3 {
replace kids_cat = `i' if nchild == `i'
}
# Much cleaner!
for (i in 0:3) {
df$kids_cat[df$nchild == i] <- i
}
# Much cleaner!
for i in range(4):
df.loc[df['nchild'] == i, 'kids_cat'] = i
What happens at each iteration:
- First iteration: i = 0 → runs replace kids_cat = 0 if nchild == 0
- Second iteration: i = 1 → runs replace kids_cat = 1 if nchild == 1
- Third iteration: i = 2 → runs replace kids_cat = 2 if nchild == 2
- Fourth iteration: i = 3 → runs replace kids_cat = 3 if nchild == 3
The backticks `i' get replaced with the current value of the loop variable.
How to Build a Loop
- Write the code for one value first (e.g., the i = 0 case)
- Make sure it works
- Then wrap it in a loop, replacing the number with the loop variable
This way, if something breaks, you know whether it's the code or the loop.
Loop Variations
Quick check: What does forvalues i = 5(5)100 print? (Hint: the middle number is the step size.)
* Count from 0 to 3
forvalues i = 0/3 {
display `i'
}
* Count by 5s from 5 to 100
forvalues i = 5(5)100 {
display `i'
}
* Loop over a list of values (not just numbers)
foreach v in income age education {
summarize `v'
}
# Count from 0 to 3
for (i in 0:3) {
print(i)
}
# Count by 5s from 5 to 100
for (i in seq(5, 100, by = 5)) {
print(i)
}
# Loop over a list of values (not just numbers)
for (v in c("income", "age", "education")) {
print(summary(df[[v]]))
}
# Count from 0 to 3
for i in range(4):
print(i)
# Count by 5s from 5 to 100
for i in range(5, 101, 5):
print(i)
# Loop over a list of values (not just numbers)
for v in ['income', 'age', 'education']:
print(df[v].describe())
Answer:
It prints: 5, 10, 15, 20, 25, ... 95, 100. The syntax is start(step)end, so this starts at 5, goes up by 5 each time, and stops at 100.
foreach vs. forvalues:
- forvalues = loop over numbers (0, 1, 2, 3 or 5, 10, 15, ...)
- foreach = loop over a list of anything (variable names, file names, strings)
Quick Check: Loops
Question: In Stata, what values will forvalues i = 0/3 iterate through? (Answer: 0, 1, 2, and 3.)
Understanding Locals
A local (in Stata) or variable (in R/Python) stores a value temporarily. Think of it as giving a nickname to something you'll use repeatedly.
- Change once, update everywhere: if you store your control variables in a local, you only need to change one line to add or remove a control
- Store results: save a coefficient from one regression to use in another calculation
- Make code readable: `controls' is clearer than a long list of variables
In Stata, you create a local without quotes but use it wrapped in backtick and apostrophe: `name'
* Create a local
local myvar = 7
* Use the local
display `myvar'
* Locals are useful for storing lists
local controls "age education income"
reg wage `controls'
* Or for storing results
reg wage education
local coef = _b[education]
display "The coefficient was `coef'"
# Create a variable
myvar <- 7
# Use the variable
print(myvar)
# Variables are useful for storing lists
controls <- c("age", "education", "income")
model <- lm(wage ~ age + education + income, data = df)
# Or for storing results
model <- lm(wage ~ education, data = df)
coef_val <- coef(model)["education"]
print(paste("The coefficient was", coef_val))
# Create a variable
myvar = 7
# Use the variable
print(myvar)
# Variables are useful for storing lists
controls = ['age', 'education', 'income']
import statsmodels.formula.api as smf
model = smf.ols('wage ~ age + education + income', data=df).fit()
# Or for storing results
model = smf.ols('wage ~ education', data=df).fit()
coef_val = model.params['education']
print(f"The coefficient was {coef_val}")
A common Stata gotcha: if you highlight local x = 5 and run it, then separately highlight display `x' and run it, Stata says "x not found". Why? Locals disappear at the end of each "run". Solution: highlight BOTH lines and run them together, or put them in a do-file. (R and Python don't have this issue.)
Preserve and Restore
Sometimes you want to temporarily modify data, do something, then go back to the original:
* Save the current state
preserve
* Do something that changes the data
keep if state == "CA"
save california_only.dta, replace
* Go back to the full dataset
restore
* The full dataset is back!
# Make a copy before modifying
df_backup <- df
# Do something that changes the data
df_ca <- df[df$state == "CA", ]
saveRDS(df_ca, "california_only.rds")
# The original is still in df_backup
# (or just use df_ca and keep df unchanged)
# Make a copy before modifying
df_backup = df.copy()
# Do something that changes the data
df_ca = df[df['state'] == 'CA']
df_ca.to_csv('california_only.csv', index=False)
# The original is still in df_backup
# (or just filter without reassigning to keep df unchanged)
Regression
This section covers the main regression commands you'll use in applied economics research. The syntax is similar across languages: you specify a dependent variable (Y), independent variables (X), and options like standard error types.
Basic OLS Regression
* Simple regression
reg wage education
* Add control variables
reg wage education age experience
* Heteroskedasticity-robust standard errors
reg wage education age experience, robust
* Cluster standard errors by state
reg wage education age experience, cluster(state)
# Simple regression
model <- lm(wage ~ education, data = df)
summary(model)
# Add control variables
model <- lm(wage ~ education + age + experience, data = df)
# Heteroskedasticity-robust standard errors
pacman::p_load(sandwich, lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
# Cluster standard errors by state
pacman::p_load(clubSandwich)
coef_test(model, vcov = vcovCR(model, cluster = df$state, type = "CR0"))
import statsmodels.formula.api as smf
# Simple regression
model = smf.ols('wage ~ education', data=df).fit()
print(model.summary())
# Add control variables
model = smf.ols('wage ~ education + age + experience', data=df).fit()
# Heteroskedasticity-robust standard errors
model_robust = smf.ols('wage ~ education + age + experience', data=df).fit(
cov_type='HC1'
)
# Cluster standard errors by state
model_cluster = smf.ols('wage ~ education + age + experience', data=df).fit(
cov_type='cluster', cov_kwds={'groups': df['state']}
)
| Situation | Standard Errors | Stata Option |
|---|---|---|
| Cross-sectional data, simple models | Robust (HC1)* | , robust |
| Panel data (observations grouped by unit) | Clustered by unit | , cluster(state) |
| Workers grouped by firm | Clustered by firm | , cluster(firm) |
| DiD with state-level treatment | Clustered by state | , cluster(state) |
Rule of thumb: Cluster at the level of treatment assignment or the level where errors might be correlated.
*What is HC1? HC1 stands for "Heteroskedasticity-Consistent type 1." Stata's , robust option uses HC1 by default. R and Python require you to specify it explicitly. For most economics papers, HC1 is standard.
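To see what robust inference changes (and what it doesn't), here's a sketch with simulated heteroskedastic data; in statsmodels you pass cov_type='HC1' to .fit(). The data and variable names are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data where the error variance grows with x (heteroskedasticity)
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 0.5 + 0.5 * x, n)
df = pd.DataFrame({'y': y, 'x': x})

classical = smf.ols('y ~ x', data=df).fit()
robust = smf.ols('y ~ x', data=df).fit(cov_type='HC1')

# The coefficients are identical; only the standard errors change
print(classical.params['x'], robust.params['x'])
print(classical.bse['x'], robust.bse['x'])
```

This is the general pattern: robust and clustered options never move the point estimates, only the uncertainty around them.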
Interactions
Example: Does the return to education differ by gender? Add education##gender to test whether the education coefficient is different for men vs. women.
* # adds just the interaction
reg wage education gender#married
* ## adds the interaction AND main effects
reg wage education gender##married
* For continuous variables, use c. prefix
reg wage c.education##c.experience
# : adds just the interaction
model <- lm(wage ~ education + gender:married, data = df)
# * adds the interaction AND main effects
model <- lm(wage ~ education + gender*married, data = df)
# For continuous variables
model <- lm(wage ~ education * experience, data = df)
# : adds just the interaction
model = smf.ols('wage ~ education + gender:married', data=df).fit()
# * adds the interaction AND main effects
model = smf.ols('wage ~ education + gender*married', data=df).fit()
# For continuous variables
model = smf.ols('wage ~ education * experience', data=df).fit()
Interpreting the education##female coefficients:
- education = effect of education for males (the reference group)
- 1.female = wage gap between females and males (at 0 years of education)
- 1.female#c.education = how much more/less females gain from each year of education compared to males

Total effect of education for females = education + 1.female#c.education
Quick Check: Standard Errors
Question: You're analyzing wage data where workers are grouped by firm. What type of standard errors should you use? (Answer: standard errors clustered by firm.)
Fixed Effects
State fixed effects control for everything about a state that doesn't change over time (geography, culture, history).
Year fixed effects control for everything that affects all states equally in a given year (recessions, federal policy).
Two-way fixed effects (TWFE) = state FE + year FE. This is the workhorse of diff-in-diff.
* Use areg with absorb() to add fixed effects
* (absorbs the FE without reporting their coefficients, which is usually what you want)
areg wage education age, absorb(state)
* For MANY fixed effects, use reghdfe (faster)
ssc install reghdfe // run once to install
ssc install ftools // dependency
reghdfe wage education age, absorb(state year)
* Two-way fixed effects with clustering
reghdfe wage education, absorb(state year) cluster(state)
# Using fixest package (fastest option)
pacman::p_load(fixest)
# One-way fixed effects
model <- feols(wage ~ education + age | state, data = df)
# Two-way fixed effects
model <- feols(wage ~ education + age | state + year, data = df)
# With clustered standard errors
model <- feols(wage ~ education | state + year,
cluster = ~state, data = df)
# Using pyfixest (similar to R's fixest)
import pyfixest as pf
# One-way fixed effects
model = pf.feols('wage ~ education + age | state', data=df)
# Two-way fixed effects
model = pf.feols('wage ~ education + age | state + year', data=df)
# With clustered standard errors
model = pf.feols('wage ~ education | state + year',
vcov={'CRV1': 'state'}, data=df)
When to Use High-Dimensional Fixed Effects
Use reghdfe (Stata), fixest (R), or pyfixest (Python) whenever you have many fixed effects (hundreds or thousands of groups). They're much faster than including dummy variables, and they handle multiple sets of fixed effects efficiently.
Instrumental Variables
Classic example: Does education increase wages? Problem: unobserved ability affects both education AND wages.
Solution: Find an instrument (like quarter of birth) that affects education but has no direct effect on wages.
- Relevance: The instrument affects the endogenous variable (check the first-stage F-statistic)
- Exclusion: The instrument affects Y only through the endogenous variable (not directly)
Condition 1 is testable. Condition 2 is not—you must argue it's plausible.
* Install ivreghdfe (plus ivreg2, which it depends on)
ssc install ivreghdfe // run once
ssc install ivreg2 // dependency
* IV regression syntax: (endogenous = instruments)
* Example: education is endogenous, quarter_of_birth is the instrument
ivreghdfe wage (education = quarter_of_birth) age experience
* With fixed effects
ivreghdfe wage (education = quarter_of_birth) age, absorb(state year)
# Using fixest for IV
pacman::p_load(fixest)
# IV regression: education instrumented by quarter_of_birth
model <- feols(wage ~ age + experience |
education ~ quarter_of_birth, data = df)
# With fixed effects
model <- feols(wage ~ age | state + year |
education ~ quarter_of_birth, data = df)
# Using pyfixest for IV
import pyfixest as pf
# IV regression: education instrumented by quarter_of_birth
model = pf.feols('wage ~ age + experience | education ~ quarter_of_birth',
data=df)
# With fixed effects
model = pf.feols('wage ~ age | state + year | education ~ quarter_of_birth',
data=df)
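Since relevance is testable, you can always run the first stage yourself and look at its F-statistic (packages like ivreghdfe and fixest also report it automatically). A sketch with simulated data, where z stands in for something like quarter of birth; all names and numbers here are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: z shifts education; unobserved ability affects both
# education and wage (the endogeneity problem IV is meant to solve)
rng = np.random.default_rng(1)
n = 1000
ability = rng.normal(size=n)
z = rng.integers(1, 5, n)                      # instrument, e.g. quarter of birth
education = 10 + 0.5 * z + ability + rng.normal(size=n)
wage = 5 + 1.0 * education + 2 * ability + rng.normal(size=n)
df = pd.DataFrame({'wage': wage, 'education': education, 'z': z})

# First stage: regress the endogenous variable on the instrument
first_stage = smf.ols('education ~ z', data=df).fit()
print(first_stage.fvalue)  # common rule of thumb: F > 10 suggests a strong instrument
```

The exclusion restriction, by contrast, cannot be checked in code: no regression output tells you whether z affects wage only through education.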
Exporting Results
* Install estout for publication tables
ssc install estout
* Store results
eststo clear
eststo m1: reg wage education, robust
eststo m2: reg wage education age, robust
eststo m3: reg wage education age experience, robust
* Create a nice table
esttab m1 m2 m3, se r2 label ///
title("Wage Regressions") ///
mtitles("(1)" "(2)" "(3)")
* Export to LaTeX
esttab m1 m2 m3 using "tables/wage_regs.tex", replace ///
se r2 label booktabs
# Using modelsummary
pacman::p_load(modelsummary)
# Run regressions
m1 <- lm(wage ~ education, data = df)
m2 <- lm(wage ~ education + age, data = df)
m3 <- lm(wage ~ education + age + experience, data = df)
# Create table
modelsummary(list("(1)" = m1, "(2)" = m2, "(3)" = m3),
stars = TRUE,
gof_omit = "IC|Log")
# Export to LaTeX
modelsummary(list(m1, m2, m3),
output = "tables/wage_regs.tex")
# Using stargazer-style output with pyfixest
import pyfixest as pf
# Run regressions
m1 = pf.feols('wage ~ education', data=df, vcov='HC1')
m2 = pf.feols('wage ~ education + age', data=df, vcov='HC1')
m3 = pf.feols('wage ~ education + age + experience', data=df, vcov='HC1')
# Create table
pf.etable([m1, m2, m3])
# Export to LaTeX
pf.etable([m1, m2, m3], type='tex', file='tables/wage_regs.tex')
Practice Exercise
Try this exercise to cement what you've learned. Work through it step by step—don't just read the answer.
Exercise: Wage Regression with Loops
You have wage data with variables wage, education, age, experience, and state. Write code to:
- Create a local/variable called controls containing age experience
- Run a regression of wage on education plus the controls, with robust standard errors
- Run the same regression with state fixed effects and clustered standard errors
- Use a loop to summarize all three independent variables (education, age, experience)
Click to see solution (Stata)
* 1. Create local with controls
local controls "age experience"
* 2. OLS with robust SEs
reg wage education `controls', robust
* 3. With state FE and clustered SEs
reghdfe wage education `controls', absorb(state) cluster(state)
* 4. Summarize each variable in a loop
foreach v in education age experience {
summarize `v'
}
Click to see solution (R)
pacman::p_load(fixest, lmtest, sandwich)
# 1. Vector of controls
controls <- c("age", "experience")
# 2. OLS with robust SEs
m1 <- lm(wage ~ education + age + experience, data = df)
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))
# 3. With state FE and clustered SEs
m2 <- feols(wage ~ education + age + experience | state,
cluster = ~state, data = df)
# 4. Summarize each variable in a loop
for (v in c("education", "age", "experience")) {
print(summary(df[[v]]))
}
Session Complete!
You've learned the essentials of reshaping, merging, loops, and regression. The key takeaways:
- Loops eliminate copy-paste code. Write it once, run it many times.
- Locals store values you'll reuse. Change once, update everywhere.
- Reshape to long for analysis; after every merge, check how many observations matched.
- Standard errors should be clustered at the level of treatment or correlation.
- Fixed effects control for time-invariant unobservables.
Ready to Test Your Knowledge?
Answer questions about loops, locals, and regression commands
Take the Quiz

Found something unclear or have a suggestion? Email [email protected].