Research and Communication in Economics

Stata Session 2

Reshaping, merging, loops, and regression

Building on what you know

Today's topics build directly on Session 1. Reshaping and merging are how you transform data into the format you need. Loops are just a way to avoid copy-paste. Regression commands might look intimidating, but they're just functions that take variables as inputs. If you understood summarize last week, you can understand regress today.

What You'll Learn

  • How to reshape data between "wide" and "long" formats
  • How to merge multiple datasets together
  • Using loops to avoid repetitive code
  • Working with locals and temporary files
  • Running OLS, fixed effects, and IV regressions

Download Complete Scripts

Complete session scripts are available for download in Stata (.do), R (.R), and Python (.py). A live quiz is available to test your understanding during class.

Reshaping Data

Data comes in two basic shapes: wide and long. Understanding the difference—and knowing how to convert between them—is essential for data analysis.

Wide vs. Long Format

Consider GDP data for three countries over three years:

Wide Format

Each row is a country. Years are columns.

country   gdp2018   gdp2019   gdp2020
USA       20.5      21.4      20.9
China     13.9      14.3      14.7
Japan     5.0       5.1       5.0

3 rows (one per country)

Long Format

Each row is a country-year. Year is a variable.

country   year   gdp
USA       2018   20.5
USA       2019   21.4
USA       2020   20.9
China     2018   13.9
...       ...    ...

9 rows (one per country-year)

When to Use Each Format

  • Long format: Usually needed for regression, plotting, and most analysis. This is the "tidy" format.
  • Wide format: Sometimes easier for data entry or viewing, but rarely what you want for analysis.

Rule of thumb: If you're unsure, you probably want long format.

Reshaping: Wide → Long

🤔 Predict: If you have wide data with 3 countries and columns gdp2018, gdp2019, gdp2020, how many rows will you have after reshaping to long?
* Start with wide data: country, gdp2018, gdp2019, gdp2020

* Reshape wide to long
* Syntax: reshape long [stub], i([id var]) j([new time var])
reshape long gdp, i(country) j(year)

* What this does:
*   - i(country): country is the ID variable (stays as is)
*   - j(year): creates a new variable "year" from the suffixes
*   - gdp: the stub - looks for gdp2018, gdp2019, etc.

* Check the result
list, clean
# Start with wide data: country, gdp2018, gdp2019, gdp2020

# Reshape wide to long using pivot_longer
df_long <- df %>%
  pivot_longer(
    cols = starts_with("gdp"),     # Columns to reshape
    names_to = "year",             # New column for the names
    names_prefix = "gdp",          # Remove "gdp" prefix
    values_to = "gdp"              # New column for the values
  ) %>%
  mutate(year = as.integer(year))  # Convert year to integer

# Check the result
print(df_long)
# Start with wide data: country, gdp2018, gdp2019, gdp2020

# Reshape wide to long using melt
df_long = df.melt(
    id_vars=['country'],           # Keep these as is
    var_name='year',               # Name for the variable column
    value_name='gdp'               # Name for the value column
)

# Clean up: extract year from "gdp2018" -> 2018
df_long['year'] = df_long['year'].str.replace('gdp', '').astype(int)

# Check the result
print(df_long)

Reshaping: Long → Wide

* Start with long data: country, year, gdp

* Reshape long to wide
reshape wide gdp, i(country) j(year)

* This creates: country, gdp2018, gdp2019, gdp2020
# Start with long data: country, year, gdp

# Reshape long to wide using pivot_wider
df_wide <- df_long %>%
  pivot_wider(
    names_from = year,
    values_from = gdp,
    names_prefix = "gdp"
  )

# This creates: country, gdp2018, gdp2019, gdp2020
# Start with long data: country, year, gdp

# Reshape long to wide using pivot
df_wide = df_long.pivot(
    index='country',
    columns='year',
    values='gdp'
).reset_index()

# This creates: country, 2018, 2019, 2020
# Rename columns if you want the "gdp" prefix
df_wide.columns = ['country'] + [f'gdp{y}' for y in df_wide.columns[1:]]

Merging Data

Real research almost always requires combining data from multiple sources. This is called merging (Stata) or joining (R/Python).

The Concept

Imagine you have:

  • Dataset A: People's ages (id, age)
  • Dataset B: People's incomes (id, income)

You want to combine them into one dataset with id, age, AND income. The key (the variable that links them) is id.

🤔 Predict: What happens if person 5 is in Dataset A but not in Dataset B? What should their income be in the merged data?

Merge Types

💡 Understanding Merge Notation:

In Stata, the master dataset is the one already in memory (the one you're working with), and the using dataset is the one you're bringing in with the merge command. The notation m:1 means "many-to-one": many rows in the master match one row in the using dataset. For example, if your master has 10,000 individual people and your using has 50 state-level rows, that's m:1—many people per state, one state row each. In R, the first argument to merge() plays the role of the master, and the second is the using.

Type   Stata        R                          When to Use
1:1    merge 1:1    merge(...)                 Each row in A matches exactly one row in B
m:1    merge m:1    merge(..., all.x = TRUE)   Many rows in A match one row in B (e.g., people to states)
1:m    merge 1:m    merge(..., all.y = TRUE)   One row in A matches many rows in B
m:m    merge m:m    (avoid)                    Almost never correct! Usually indicates a problem.
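To make the m:1 idea (and the unmatched-person question above) concrete, here is a tiny pure-Python sketch of a left join on toy data. This is not a real merge tool, and all the names and numbers are made up for illustration:

```python
# Toy m:1 left join: many people per state, one unemployment row per state.
people = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "CA"},
    {"id": 3, "state": "TX"},
    {"id": 4, "state": "ZZ"},  # no match in the state table
]
state_unemp = {"CA": 4.1, "TX": 3.8}  # hypothetical unemployment rates

merged = []
for p in people:
    row = dict(p)
    # Left join: keep every person; unmatched people get None (missing)
    row["unemployment"] = state_unemp.get(p["state"])
    merged.append(row)

for row in merged:
    print(row)
```

The merge commands below do exactly this bookkeeping for you, and additionally report which rows matched and which did not.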

Merging: Step by Step

* Example: Merge person data with state-level unemployment

* Step 1: Load the "master" dataset (person-level)
use person_data.dta, clear
count  // Check: 10,000 people

* Step 2: Merge with "using" dataset (state-level)
* m:1 because many people live in each state
merge m:1 state using state_unemployment.dta

* Step 3: CHECK THE MERGE RESULTS
tab _merge

* _merge values:
*   1 = in master only (person with no matching state)
*   2 = in using only (state with no people)
*   3 = matched (what you want)

* Step 4: Decide what to do with unmatched
* Usually keep only matched:
keep if _merge == 3
drop _merge
# Example: Merge person data with state-level unemployment

# Step 1: Load datasets
person_data <- read_csv("person_data.csv")
state_data <- read_csv("state_unemployment.csv")

# Check dimensions
nrow(person_data)  # 10,000 people
nrow(state_data)   # 50 states

# Step 2: Merge
# all.x = TRUE keeps all people, adds state data where matched
merged <- merge(person_data, state_data, by = "state", all.x = TRUE)

# Step 3: Check for unmatched
sum(is.na(merged$unemployment))  # How many have missing state data?

# Step 4: Remove unmatched if needed
merged <- merged %>% filter(!is.na(unemployment))
# Example: Merge person data with state-level unemployment

# Step 1: Load datasets
person_data = pd.read_csv("person_data.csv")
state_data = pd.read_csv("state_unemployment.csv")

# Check dimensions
len(person_data)  # 10,000 people
len(state_data)   # 50 states

# Step 2: Merge
# merge with indicator to check matching
merged = person_data.merge(
    state_data,
    on='state',
    how='left',
    indicator=True
)

# Step 3: Check for unmatched
print(merged['_merge'].value_counts())

# Step 4: Remove unmatched if needed
merged = merged[merged['_merge'] == 'both']
merged = merged.drop(columns=['_merge'])
⚠️ Always Check Your Merge!
  • Did the number of observations change unexpectedly?
  • How many observations matched vs. didn't match?
  • Do unmatched observations make sense (e.g., states with no people)?

Most merge bugs are silent—your code runs but gives wrong answers. Checking _merge is how you catch them.

Loops and Locals

Loops let you write code once and run it many times. If you ever find yourself copy-pasting code and just changing one number, you should use a loop instead.

The forvalues Loop

🎯 Real scenario: You're creating a categorical variable for number of children (0, 1, 2, 3+). You could write 4 separate lines... or use a loop.

Without a loop (bad):

* This is repetitive and error-prone
replace kids_cat = 0 if nchild == 0
replace kids_cat = 1 if nchild == 1
replace kids_cat = 2 if nchild == 2
replace kids_cat = 3 if nchild == 3
# This is repetitive and error-prone
df$kids_cat[df$nchild == 0] <- 0
df$kids_cat[df$nchild == 1] <- 1
df$kids_cat[df$nchild == 2] <- 2
df$kids_cat[df$nchild == 3] <- 3
# This is repetitive and error-prone
df.loc[df['nchild'] == 0, 'kids_cat'] = 0
df.loc[df['nchild'] == 1, 'kids_cat'] = 1
df.loc[df['nchild'] == 2, 'kids_cat'] = 2
df.loc[df['nchild'] == 3, 'kids_cat'] = 3

With a loop (good):

* Much cleaner!
forvalues i = 0/3 {
    replace kids_cat = `i' if nchild == `i'
}
# Much cleaner!
for (i in 0:3) {
    df$kids_cat[df$nchild == i] <- i
}
# Much cleaner!
for i in range(4):
    df.loc[df['nchild'] == i, 'kids_cat'] = i
💡 What happens when this runs (Stata):
  1. First iteration: i = 0 → runs replace kids_cat = 0 if nchild == 0
  2. Second iteration: i = 1 → runs replace kids_cat = 1 if nchild == 1
  3. Third iteration: i = 2 → runs replace kids_cat = 2 if nchild == 2
  4. Fourth iteration: i = 3 → runs replace kids_cat = 3 if nchild == 3

The backticks `i' get replaced with the current value of the loop variable.

How to Build a Loop

  1. Write the code for one value first (e.g., the i = 0 case)
  2. Make sure it works
  3. Then wrap it in a loop, replacing the number with the loop variable

This way, if something breaks, you know whether it's the code or the loop.
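The same build-one-value-first workflow in miniature, as a pure-Python sketch with toy records standing in for a dataset:

```python
# Toy data: each record has a number of children
records = [{"nchild": n} for n in [0, 2, 1, 3, 2]]

# Step 1: write the code for ONE value first, and check that it works
for r in records:
    if r["nchild"] == 0:
        r["kids_cat"] = 0

# Step 3: once that works, wrap it in a loop over all the values,
# replacing the hard-coded 0 with the loop variable i
for i in range(4):
    for r in records:
        if r["nchild"] == i:
            r["kids_cat"] = i

print([r["kids_cat"] for r in records])  # → [0, 2, 1, 3, 2]
```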

Loop Variations

🤔 Before running this code, predict: What will forvalues i = 5(5)100 print? (Hint: the middle number is the step size.)
* Count from 0 to 3
forvalues i = 0/3 {
    display `i'
}

* Count by 5s from 5 to 100
forvalues i = 5(5)100 {
    display `i'
}

* Loop over a list of values (not just numbers)
foreach v in income age education {
    summarize `v'
}
# Count from 0 to 3
for (i in 0:3) {
    print(i)
}

# Count by 5s from 5 to 100
for (i in seq(5, 100, by = 5)) {
    print(i)
}

# Loop over a list of values (not just numbers)
for (v in c("income", "age", "education")) {
    print(summary(df[[v]]))
}
# Count from 0 to 3
for i in range(4):
    print(i)

# Count by 5s from 5 to 100
for i in range(5, 101, 5):
    print(i)

# Loop over a list of values (not just numbers)
for v in ['income', 'age', 'education']:
    print(df[v].describe())
Click to reveal answer: What does 5(5)100 print?

It prints: 5, 10, 15, 20, 25, ... 95, 100. The syntax is start(step)end, so this starts at 5, goes up by 5 each time, and stops at 100.

💡 foreach vs forvalues:
  • forvalues = loop over numbers (0, 1, 2, 3 or 5, 10, 15...)
  • foreach = loop over a list of anything (variable names, file names, strings)

Quick Check: Loops

Question: In Stata, what values will forvalues i = 0/3 iterate through?

Understanding Locals

A local (in Stata) or variable (in R/Python) stores a value temporarily. Think of it as giving a nickname to something you'll use repeatedly.

🎯 Why use locals?
  • Change once, update everywhere: If you store your control variables in a local, you only need to change one line to add/remove a control
  • Store results: Save a coefficient from one regression to use in another calculation
  • Make code readable: `controls' is clearer than a long list of variables

In Stata, you define a local by its bare name, but to use it you wrap the name in a backtick and an apostrophe: `name'

* Create a local
local myvar = 7

* Use the local
display `myvar'

* Locals are useful for storing lists
local controls "age education income"
reg wage `controls'

* Or for storing results
reg wage education
local coef = _b[education]
display "The coefficient was `coef'"
# Create a variable
myvar <- 7

# Use the variable
print(myvar)

# Variables are useful for storing lists
controls <- c("age", "education", "income")
model <- lm(wage ~ age + education + income, data = df)

# Or for storing results
model <- lm(wage ~ education, data = df)
coef_val <- coef(model)["education"]
print(paste("The coefficient was", coef_val))
# Create a variable
myvar = 7

# Use the variable
print(myvar)

# Variables are useful for storing lists
controls = ['age', 'education', 'income']
import statsmodels.formula.api as smf
model = smf.ols('wage ~ age + education + income', data=df).fit()

# Or for storing results
model = smf.ols('wage ~ education', data=df).fit()
coef_val = model.params['education']
print(f"The coefficient was {coef_val}")
⚠️ CRITICAL STATA QUIRK: Locals Only Last One "Run"

If you highlight local x = 5 and run it, then separately highlight display `x' and run it, Stata says "x not found". Why? Locals disappear at the end of each "run". Solution: Highlight BOTH lines and run them together, or put them in a do-file. (R and Python don't have this issue.)

Preserve and Restore

Sometimes you want to temporarily modify data, do something, then go back to the original:

* Save the current state
preserve

* Do something that changes the data
keep if state == "CA"
save california_only.dta, replace

* Go back to the full dataset
restore

* The full dataset is back!
# Make a copy before modifying
df_backup <- df

# Do something that changes the data
df_ca <- df[df$state == "CA", ]
saveRDS(df_ca, "california_only.rds")

# The original is still in df_backup
# (or just use df_ca and keep df unchanged)
# Make a copy before modifying
df_backup = df.copy()

# Do something that changes the data
df_ca = df[df['state'] == 'CA']
df_ca.to_csv('california_only.csv', index=False)

# The original is still in df_backup
# (or just filter without reassigning to keep df unchanged)

Regression

This section covers the main regression commands you'll use in applied economics research. The syntax is similar across languages: you specify a dependent variable (Y), independent variables (X), and options like standard error types.

Basic OLS Regression

Stata syntax:

reg wage education age experience, robust

  • reg: the command
  • wage: Y (the dependent variable)
  • education age experience: X (the independent variables)
  • , robust: options
* Simple regression
reg wage education

* Add control variables
reg wage education age experience

* Heteroskedasticity-robust standard errors
reg wage education age experience, robust

* Cluster standard errors by state
reg wage education age experience, cluster(state)
# Simple regression
model <- lm(wage ~ education, data = df)
summary(model)

# Add control variables
model <- lm(wage ~ education + age + experience, data = df)

# Heteroskedasticity-robust standard errors
pacman::p_load(sandwich, lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Cluster standard errors by state
pacman::p_load(clubSandwich)
coef_test(model, vcov = vcovCR(model, cluster = df$state, type = "CR0"))
import statsmodels.formula.api as smf

# Simple regression
model = smf.ols('wage ~ education', data=df).fit()
print(model.summary())

# Add control variables
model = smf.ols('wage ~ education + age + experience', data=df).fit()

# Heteroskedasticity-robust standard errors
model_robust = smf.ols('wage ~ education + age + experience', data=df).fit(
    cov_type='HC1'
)

# Cluster standard errors by state
model_cluster = smf.ols('wage ~ education + age + experience', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']}
)
💡 Which standard errors should I use?
Situation                                    Standard Errors      Stata Option
Cross-sectional data, simple models          Robust (HC1)*        , robust
Panel data (observations grouped by unit)    Clustered by unit    , cluster(state)
Workers grouped by firm                      Clustered by firm    , cluster(firm)
DiD with state-level treatment               Clustered by state   , cluster(state)

Rule of thumb: Cluster at the level of treatment assignment or the level where errors might be correlated.

*What is HC1? HC1 stands for "Heteroskedasticity-Consistent type 1." Stata's , robust option uses HC1 by default. R and Python require you to specify it explicitly. For most economics papers, HC1 is standard.
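For intuition, here is a from-scratch numpy sketch of what the HC1 "sandwich" estimator computes, on simulated data. In practice you would use the library options shown above, not this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# Heteroskedastic errors: variance grows with |x|
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + np.abs(x))

X = np.column_stack([np.ones(n), x])        # design matrix with intercept
k = X.shape[1]

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS coefficients
e = y - X @ beta                             # residuals

bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (e**2)[:, None])           # X' diag(e^2) X
V = (n / (n - k)) * bread @ meat @ bread     # HC1: finite-sample scaling n/(n-k)
se_hc1 = np.sqrt(np.diag(V))
print(se_hc1)
```

The n/(n - k) factor is the only difference between HC1 and the original White (HC0) estimator; it is a small-sample correction.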

Interactions

🎯 When to use interactions: When you think the effect of X on Y depends on another variable.

Example: Does the return to education differ by gender? Add education##gender to test whether the education coefficient is different for men vs. women.

* # adds just the interaction
reg wage education gender#married

* ## adds the interaction AND main effects
reg wage education gender##married

* For continuous variables, use c. prefix
reg wage c.education##c.experience
# : adds just the interaction
model <- lm(wage ~ education + gender:married, data = df)

# * adds the interaction AND main effects
model <- lm(wage ~ education + gender*married, data = df)

# For continuous variables
model <- lm(wage ~ education * experience, data = df)
# : adds just the interaction
model = smf.ols('wage ~ education + gender:married', data=df).fit()

# * adds the interaction AND main effects
model = smf.ols('wage ~ education + gender*married', data=df).fit()

# For continuous variables
model = smf.ols('wage ~ education * experience', data=df).fit()
💡 Interpreting education##female coefficients:
  • education = effect of education for males (the reference group)
  • 1.female = wage gap between females and males (at 0 years of education)
  • 1.female#c.education = how much more/less females gain from each year of education compared to males

Total effect of education for females = education + 1.female#c.education
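That total effect is a linear combination of two coefficients, and its standard error follows the usual variance-of-a-sum rule. A sketch with hypothetical numbers (every value below is made up for illustration, not real output):

```python
import math

# Hypothetical regression output (NOT real estimates):
b_edu = 0.080     # education (effect for males)
b_int = -0.020    # 1.female#c.education (interaction)
var_edu = 0.0004  # Var(b_edu)
var_int = 0.0009  # Var(b_int)
cov_ei = -0.0002  # Cov(b_edu, b_int)

# Total effect of education for females = b_edu + b_int
total = b_edu + b_int

# SE of a sum: sqrt(Var(a) + Var(b) + 2*Cov(a, b))
se_total = math.sqrt(var_edu + var_int + 2 * cov_ei)

print(round(total, 4), round(se_total, 4))  # → 0.06 0.03
```

In Stata, the lincom command computes such combinations and their standard errors for you.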

Quick Check: Standard Errors

Question: You're analyzing wage data where workers are grouped by firm. What type of standard errors should you use?

Fixed Effects

🎯 What fixed effects do: Control for all time-invariant characteristics of a group.

State fixed effects control for everything about a state that doesn't change over time (geography, culture, history).
Year fixed effects control for everything that affects all states equally in a given year (recessions, federal policy).

Two-way fixed effects (TWFE) = state FE + year FE. This is the workhorse of diff-in-diff.

* Using areg with the absorb() option to add fixed effects
* (absorbs the FE instead of reporting their coefficients,
*  which is usually what you want)
areg wage education age, absorb(state)

* For MANY fixed effects, use reghdfe (faster)
ssc install reghdfe  // run once to install
ssc install ftools   // dependency

reghdfe wage education age, absorb(state year)

* Two-way fixed effects with clustering
reghdfe wage education, absorb(state year) cluster(state)
# Using fixest package (fastest option)
pacman::p_load(fixest)

# One-way fixed effects
model <- feols(wage ~ education + age | state, data = df)

# Two-way fixed effects
model <- feols(wage ~ education + age | state + year, data = df)

# With clustered standard errors
model <- feols(wage ~ education | state + year,
               cluster = ~state, data = df)
# Using pyfixest (similar to R's fixest)
import pyfixest as pf

# One-way fixed effects
model = pf.feols('wage ~ education + age | state', data=df)

# Two-way fixed effects
model = pf.feols('wage ~ education + age | state + year', data=df)

# With clustered standard errors
model = pf.feols('wage ~ education | state + year',
                 vcov={'CRV1': 'state'}, data=df)

When to Use High-Dimensional Fixed Effects

Use reghdfe (Stata), fixest (R), or pyfixest (Python) whenever you have many fixed effects (hundreds or thousands of groups). They're much faster than including dummy variables, and they handle multiple sets of fixed effects efficiently.

Instrumental Variables

🎯 When to use IV: When your key independent variable is endogenous (correlated with the error term).

Classic example: Does education increase wages? Problem: unobserved ability affects both education AND wages.
Solution: Find an instrument (like quarter of birth) that affects education but has no direct effect on wages.

⚠️ A valid instrument must satisfy two conditions:
  1. Relevance: The instrument affects the endogenous variable (check the first-stage F-statistic)
  2. Exclusion: The instrument affects Y only through the endogenous variable (not directly)

Condition 1 is testable. Condition 2 is not—you must argue it's plausible.
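Checking relevance just means running the first stage and testing the instrument; with a single instrument, the first-stage F is the squared t-statistic on that instrument. A from-scratch numpy sketch on simulated data (in practice, your IV command reports this statistic for you):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
z = rng.normal(size=n)                               # instrument
ability = rng.normal(size=n)                         # unobserved confounder
educ = 0.5 * z + 0.5 * ability + rng.normal(size=n)  # endogenous regressor

# First stage: regress the endogenous variable on the instrument
Z = np.column_stack([np.ones(n), z])
g = np.linalg.lstsq(Z, educ, rcond=None)[0]
e = educ - Z @ g
sigma2 = (e @ e) / (n - 2)
var_g = sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]
t = g[1] / np.sqrt(var_g)
F = t**2                                             # first-stage F, one instrument
print(F)
```

A common rule of thumb is a first-stage F above 10, though recent work argues for stricter thresholds.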

* Install ivreghdfe
ssc install ivreghdfe  // run once

* IV regression syntax: (endogenous = instruments)
* Example: education is endogenous, quarter_of_birth is the instrument
ivreghdfe wage (education = quarter_of_birth) age experience

* With fixed effects
ivreghdfe wage (education = quarter_of_birth) age, absorb(state year)
# Using fixest for IV
pacman::p_load(fixest)

# IV regression: education instrumented by quarter_of_birth
model <- feols(wage ~ age + experience |
               education ~ quarter_of_birth, data = df)

# With fixed effects
model <- feols(wage ~ age | state + year |
               education ~ quarter_of_birth, data = df)
# Using pyfixest for IV
import pyfixest as pf

# IV regression: education instrumented by quarter_of_birth
model = pf.feols('wage ~ age + experience | education ~ quarter_of_birth',
                 data=df)

# With fixed effects
model = pf.feols('wage ~ age | state + year | education ~ quarter_of_birth',
                 data=df)

Exporting Results

* Install estout for publication tables
ssc install estout

* Store results
eststo clear
eststo m1: reg wage education, robust
eststo m2: reg wage education age, robust
eststo m3: reg wage education age experience, robust

* Create a nice table
esttab m1 m2 m3, se r2 label ///
    title("Wage Regressions") ///
    mtitles("(1)" "(2)" "(3)")

* Export to LaTeX
esttab m1 m2 m3 using "tables/wage_regs.tex", replace ///
    se r2 label booktabs
# Using modelsummary
pacman::p_load(modelsummary)

# Run regressions
m1 <- lm(wage ~ education, data = df)
m2 <- lm(wage ~ education + age, data = df)
m3 <- lm(wage ~ education + age + experience, data = df)

# Create table
modelsummary(list("(1)" = m1, "(2)" = m2, "(3)" = m3),
             stars = TRUE,
             gof_omit = "IC|Log")

# Export to LaTeX
modelsummary(list(m1, m2, m3),
             output = "tables/wage_regs.tex")
# Using stargazer-style output with pyfixest
import pyfixest as pf

# Run regressions
m1 = pf.feols('wage ~ education', data=df, vcov='HC1')
m2 = pf.feols('wage ~ education + age', data=df, vcov='HC1')
m3 = pf.feols('wage ~ education + age + experience', data=df, vcov='HC1')

# Create table
pf.etable([m1, m2, m3])

# Export to LaTeX
pf.etable([m1, m2, m3], type='tex', file='tables/wage_regs.tex')

Practice Exercise

Try this exercise to cement what you've learned. Work through it step by step—don't just read the answer.

Exercise: Wage Regression with Loops

You have wage data with variables wage, education, age, experience, and state. Write code to:

  1. Create a local/variable called controls containing age experience
  2. Run a regression of wage on education plus the controls, with robust standard errors
  3. Run the same regression with state fixed effects and clustered standard errors
  4. Use a loop to summarize all three independent variables (education, age, experience)
Click to see solution (Stata)
* 1. Create local with controls
local controls "age experience"

* 2. OLS with robust SEs
reg wage education `controls', robust

* 3. With state FE and clustered SEs
reghdfe wage education `controls', absorb(state) cluster(state)

* 4. Summarize each variable in a loop
foreach v in education age experience {
    summarize `v'
}
Click to see solution (R)
pacman::p_load(fixest, lmtest, sandwich)

# 1. Vector of controls
controls <- c("age", "experience")

# 2. OLS with robust SEs (build the formula from the controls vector)
f <- reformulate(c("education", controls), response = "wage")
m1 <- lm(f, data = df)
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))

# 3. With state FE and clustered SEs
m2 <- feols(wage ~ education + age + experience | state,
            cluster = ~state, data = df)

# 4. Summarize each variable in a loop
for (v in c("education", "age", "experience")) {
    print(summary(df[[v]]))
}
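Click to see solution (Python)

A sketch using statsmodels only. The toy DataFrame below stands in for your real data, and the C(state) dummies give state fixed effects, which is fine with a handful of states (with many groups, use pyfixest instead):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data so the solution runs end to end (replace with your real dataset)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    'education': rng.integers(8, 21, size=n),
    'age': rng.integers(18, 66, size=n),
    'experience': rng.integers(0, 41, size=n),
    'state': rng.choice(['CA', 'TX', 'NY'], size=n),
})
df['wage'] = 5 + 1.2 * df['education'] + 0.1 * df['age'] + rng.normal(size=n)

# 1. List of controls
controls = ['age', 'experience']

# 2. OLS with robust SEs (build the formula from the controls list)
rhs = ' + '.join(['education'] + controls)
m1 = smf.ols(f'wage ~ {rhs}', data=df).fit(cov_type='HC1')
print(m1.params['education'])

# 3. With state FE (as dummies) and clustered SEs
groups = df['state'].astype('category').cat.codes  # integer cluster codes
m2 = smf.ols(f'wage ~ {rhs} + C(state)', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': groups}
)

# 4. Summarize each variable in a loop
for v in ['education', 'age', 'experience']:
    print(df[v].describe())
```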

Session Complete!

You've learned the essentials of reshaping, merging, loops, locals, and regression. The key takeaways:

  • Loops eliminate copy-paste code. Write it once, run it many times.
  • Locals store values you'll reuse. Change once, update everywhere.
  • Reshape to long format for analysis, and always check _merge (or its equivalent) after a merge; merge bugs are silent.
  • Standard errors should be clustered at the level of treatment or correlation.
  • Fixed effects control for time-invariant unobservables.


Found something unclear or have a suggestion? Email [email protected].