Research and Communication in Economics

Introduction to Stata

From setup to regression — a comprehensive introduction

What You'll Learn

  • How to set up a research project with proper folder structure
  • How to explore a dataset you've never seen before
  • How to select the observations and variables you need
  • How to create new variables from existing ones
  • How to import data from CSV files
  • How to reshape data between "wide" and "long" formats
  • How to merge multiple datasets together
  • Using loops to avoid repetitive code
  • Working with locals and temporary files
  • Running OLS, fixed effects, and IV regressions

Download Complete Scripts

Run the full session code in your preferred language:

Part 1: Setup, Exploration, Data Import

Stata .do R .R Python .py

Part 2: Reshaping, Merging, Loops, Regression

Stata .do R .R Python .py

A note before we start: Learning to code is like learning a new language—it takes practice, and everyone makes mistakes. If something doesn't work, that's normal! Read error messages carefully, check for typos, and remember that debugging is a core skill, not a sign you're doing something wrong.

Module 1: Introduction to Stata

Imagine you just got a new dataset. Before doing any analysis, you need to understand what you have. How many observations? What variables? What do they look like? This module teaches you how to answer these questions.

Step 1: Setting Up Your Script

Every Stata do-file should start the same way. Think of this as "clearing the stage" before your performance:

Before running this code, predict: What does clear all do? Why might we want to close any open log file before starting?
* ========================================
* SETUP - Run this at the start of every script
* ========================================

* Step 1: Clear everything from memory
* (This ensures you're starting fresh, not using old data)
clear all

* Step 2: Close any open log file
* (capture means "try this, but don't error if it fails")
capture log close

* Step 3: Don't pause after each screenful of output
set more off

* Step 4: Set your working directory
* IMPORTANT: Change this path to YOUR project folder!
* To find your path: On Mac, right-click your project folder in Finder
* → 'Get Info' → copy the path next to 'Where'.
* On Windows, open the folder in Explorer → click the address bar → copy.
* Use forward slashes (/) in Stata even on Windows.
cd "/Users/yourname/Dropbox/my_project"

* Step 5: Start a log file (records everything you do)
* "replace" overwrites the old log if it exists
log using my_analysis.log, replace
# ========================================
# SETUP - Run this at the start of every script
# ========================================

# Step 1: Clear environment (remove all objects)
rm(list = ls())

# Step 2: Set working directory
# IMPORTANT: Change this path to YOUR project folder!
setwd("/Users/yourname/Dropbox/my_project")

# Step 3: Load required packages
pacman::p_load(tidyverse, haven)
# ========================================
# SETUP - Run this at the start of every script
# ========================================

# Step 1: Import required packages
import pandas as pd
import numpy as np
import os

# Step 2: Set working directory
# IMPORTANT: Change this path to YOUR project folder!
os.chdir("/Users/yourname/Dropbox/my_project")

Why This Matters

Starting fresh (clear all) prevents a common bug: accidentally using variables from a previous session. The log file creates a record of everything you did—invaluable when you need to remember your steps months later or when something goes wrong.

Step 2: First Look at Your Data

You've downloaded a dataset. Now what? Let's say it's Census data with information about people. Before diving into analysis, answer these questions:

  • How many observations? (Are there 100 people? 1 million?)
  • What variables exist? (Income? Age? Education?)
  • What do the values look like? (Is age in years? Months? Categories?)
Predict: If you have a variable called sex with values 1 and 2, what might those numbers mean? How would you find out?
* Load the data
use mydata.dta, clear

* ----- EXPLORATION COMMANDS -----

* browse: Opens a spreadsheet view
* USE THIS FIRST to see what your data looks like
browse

* describe: Shows variable names, types, and labels
* This tells you WHAT variables you have
describe

* summarize: Shows mean, sd, min, max for numeric variables
* This tells you the RANGE of your data
summarize

* summarize specific variables
summarize age income

* summarize with detail: adds percentiles
* Useful for seeing the full distribution
summarize income, detail

* tabulate: Counts observations by category
* ESSENTIAL for categorical variables
tabulate sex

* See the numeric codes behind the labels
tabulate sex, nolabel
# Load the data
df <- read_dta("mydata.dta")

# ----- EXPLORATION COMMANDS -----

# View(): Opens a spreadsheet view
# USE THIS FIRST to see what your data looks like
View(df)

# glimpse(): Shows variable names and types
# This tells you WHAT variables you have
glimpse(df)

# summary(): Shows summary stats for all variables
summary(df)

# Summary for specific variables
summary(df$age)
summary(df$income)

# Detailed summary with percentiles
pacman::p_load(psych)
describe(df$income)

# table(): Counts observations by category
# ESSENTIAL for categorical variables
table(df$sex)

# With percentages
prop.table(table(df$sex))
# Load the data
df = pd.read_stata("mydata.dta")

# ----- EXPLORATION COMMANDS -----

# head(): Shows first few rows
# USE THIS FIRST to see what your data looks like
df.head(10)

# info(): Shows variable names and types
# This tells you WHAT variables you have
df.info()

# describe(): Shows summary stats for numeric variables
df.describe()

# Summary for specific variables
df[['age', 'income']].describe()

# Detailed summary with percentiles
df['income'].describe(percentiles=[.1, .25, .5, .75, .9])

# value_counts(): Counts observations by category
# ESSENTIAL for categorical variables
df['sex'].value_counts()

# With percentages
df['sex'].value_counts(normalize=True)
What the output tells you:
  • describe output shows variable names, types (string vs numeric), and labels (human-readable descriptions)
  • summarize output shows Obs (count of non-missing), Mean, Std. Dev., Min, and Max
  • tabulate output shows how many observations fall into each category

Never Manually Edit Data

Stata's Data Editor lets you click on cells and change values. Never do this. Why? Because there's no record of what you changed. If you need to make changes, do it in code so your work is reproducible.
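The same rule applies in any language: make corrections in a script, not by hand. A minimal sketch in Python (the id, the sentinel value -999, and the data are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical example: the codebook says income == -999 means "missing".
# Recode it in a script, never in a point-and-click editor, so the
# change is documented and reproducible.
df = pd.DataFrame({"id": [1041, 1042, 1043],
                   "income": [52000, -999, 61000]})

# Replace the sentinel with a proper missing value
df["income"] = df["income"].replace(-999, np.nan)
print(df)
```

Because the fix lives in code, rerunning the script from the raw data reproduces it exactly; a manual edit in a data editor leaves no trace.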

Step 3: Selecting Your Sample

Your dataset might have millions of observations, but your research question might focus on a specific group. For example, if you're studying women's labor force participation, you might want to:

  • Keep only women (drop men)
  • Keep only working-age adults (ages 25-54)
  • Drop observations with missing income
Predict: What's the difference between keep if age >= 30 and drop if age < 30? Do they give the same result?
* ----- SELECTING OBSERVATIONS -----

* Keep only women (where sex == 2)
* "==" means "is equal to" (double equals for comparison)
keep if sex == 2

* Check how many observations remain
count

* Keep only ages 25-54
* "&" means AND - both conditions must be true
keep if age >= 25 & age <= 54

* Alternative: drop observations you don't want
* "|" means OR - either condition triggers the drop
drop if age < 25 | age > 54

* ----- SELECTING VARIABLES -----

* Drop variables you don't need
* This makes your dataset smaller and easier to work with
drop year serial pernum

* Keep only specific variables
keep id age income sex education
# ----- SELECTING OBSERVATIONS -----

# Keep only women (where sex == 2)
df <- df %>% filter(sex == 2)

# Check how many observations remain
nrow(df)

# Keep only ages 25-54
df <- df %>% filter(age >= 25 & age <= 54)

# Alternative: drop observations you don't want
df <- df %>% filter(!(age < 25 | age > 54))

# ----- SELECTING VARIABLES -----

# Drop variables you don't need
df <- df %>% select(-year, -serial, -pernum)

# Keep only specific variables
df <- df %>% select(id, age, income, sex, education)
# ----- SELECTING OBSERVATIONS -----

# Keep only women (where sex == 2)
df = df[df['sex'] == 2]

# Check how many observations remain
len(df)

# Keep only ages 25-54
df = df[(df['age'] >= 25) & (df['age'] <= 54)]

# Alternative: drop observations you don't want
df = df[~((df['age'] < 25) | (df['age'] > 54))]

# ----- SELECTING VARIABLES -----

# Drop variables you don't need
df = df.drop(columns=['year', 'serial', 'pernum'])

# Keep only specific variables
df = df[['id', 'age', 'income', 'sex', 'education']]

Check Your Work!

After every keep or drop, run count (Stata) or nrow(df) (R) or len(df) (Python) to verify you have the expected number of observations. A common mistake is accidentally dropping everything!
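You can go one step further and make the check automatic. A sketch in Python (the data and cutoffs are made up):

```python
import pandas as pd

# After each sample restriction, verify the row count is plausible.
df = pd.DataFrame({"age": [17, 25, 40, 54, 60], "sex": [2, 2, 1, 2, 2]})

df = df[(df["age"] >= 25) & (df["age"] <= 54)]
print(len(df))  # check after every restriction

# An assert stops the script immediately if something went wrong --
# far better than discovering an empty dataset three scripts later.
assert len(df) > 0, "sample restriction dropped every observation!"
```

Stata's equivalent is `assert _N > 0` after a `keep`/`drop`, and R's is `stopifnot(nrow(df) > 0)`.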

Step 4: Creating New Variables

Raw data rarely has exactly the variables you need. You'll often need to create:

  • Indicator variables (0/1): Is this person married? Employed? Over 65?
  • Categorical variables: Group ages into bins (18-24, 25-34, 35-44, etc.)
  • Transformed variables: Log of income, income in thousands, etc.
Predict: In Stata, what's the difference between gen married = 1 if marst == 1 and gen married = (marst == 1)? What happens to observations where marst != 1 in each case?
Critical Warning About Indicator Variables: If you write gen married = 1 if marst == 1, Stata creates married = 1 for married people, but married = missing (not 0!) for everyone else. This silently breaks your regressions because observations with missing values get dropped. Always use the two-step method with replace, or the safer one-step method shown below.
* ----- INDICATOR VARIABLES (0/1) -----

* Method 1: One step (WRONG - creates missing values!)
gen married = 1 if marst == 1
* Problem: married is MISSING (not 0) when marst != 1
tab married, missing  // You'll see missing values

* Method 2: Two steps (CORRECT)
* (capture drop clears the earlier attempt -- gen errors if the
* variable already exists)
capture drop married
gen married = 1 if marst == 1
replace married = 0 if marst != 1
* Now married is 0 or 1 for everyone

* Method 3: One step (BEST - cleaner code)
capture drop married
gen married = (marst == 1)
* The expression (marst == 1) evaluates to 1 if true, 0 if false

* ----- CATEGORICAL VARIABLES -----

* Group number of children: 0, 1, 2, 3, 4+
gen kids_cat = nchild if nchild <= 4
replace kids_cat = 4 if nchild > 4

* Create dummy variables from categorical
* This creates Kids_1 ... Kids_5, numbered in category order
* (so Kids_1 is the dummy for kids_cat == 0)
tab kids_cat, gen(Kids_)

* ----- TRANSFORMED VARIABLES -----

* Log of income (useful for skewed distributions)
* Add 1 to handle zeros (ln(0) is undefined)
gen ln_income = ln(income + 1)

* Income in thousands
gen income_k = income / 1000
# ----- INDICATOR VARIABLES (0/1) -----

# Method 1: Using ifelse (explicit)
df <- df %>% mutate(married = ifelse(marst == 1, 1, 0))

# Method 2: Using as.integer (cleaner)
df <- df %>% mutate(married = as.integer(marst == 1))

# ----- CATEGORICAL VARIABLES -----

# Group number of children: 0, 1, 2, 3, 4+
df <- df %>% mutate(
  kids_cat = case_when(
    nchild <= 4 ~ nchild,
    nchild > 4 ~ 4
  )
)

# Create dummy variables from categorical
df <- df %>% mutate(
  Kids_0 = as.integer(kids_cat == 0),
  Kids_1 = as.integer(kids_cat == 1),
  Kids_2 = as.integer(kids_cat == 2),
  Kids_3 = as.integer(kids_cat == 3),
  Kids_4 = as.integer(kids_cat == 4)
)

# ----- TRANSFORMED VARIABLES -----

# Log of income
df <- df %>% mutate(ln_income = log(income + 1))

# Income in thousands
df <- df %>% mutate(income_k = income / 1000)
# ----- INDICATOR VARIABLES (0/1) -----

# Simple boolean comparison (returns True/False, converted to 1/0)
df['married'] = (df['marst'] == 1).astype(int)

# ----- CATEGORICAL VARIABLES -----

# Group number of children: 0, 1, 2, 3, 4+
df['kids_cat'] = df['nchild'].clip(upper=4)

# Create dummy variables from categorical
dummies = pd.get_dummies(df['kids_cat'], prefix='Kids')
df = pd.concat([df, dummies], axis=1)

# ----- TRANSFORMED VARIABLES -----

# Log of income
df['ln_income'] = np.log(df['income'] + 1)

# Income in thousands
df['income_k'] = df['income'] / 1000

Language Comparison

Task              Stata            R (tidyverse)      Python (pandas)
Create variable   gen x = ...      mutate(x = ...)    df['x'] = ...
Modify variable   replace x = ...  mutate(x = ...)    df['x'] = ...

Module 2: Project Organization

Before we dive into loading data, let's set up a proper project structure. A well-organized project saves hours of confusion later and makes your work reproducible.

Why organization matters NOW: It's much easier to start with good habits than to reorganize a messy project later. The 15 minutes you spend setting this up will save you hours of debugging and searching for files.

The Standard Structure

my_project/
├── master.do              # Runs everything (or master.R / main.py)
├── README.md
│
├── build/                 # Data preparation
│   ├── input/             # Raw data (NEVER edit these!)
│   ├── code/              # Scripts to clean data
│   └── output/            # Cleaned data
│
├── analysis/              # Your analysis
│   ├── code/              # Regression scripts, etc.
│   └── output/
│       ├── tables/
│       └── figures/
│
└── paper/                 # Your writeup
    └── draft.tex

The Golden Rule: Raw Data is READ-ONLY

Never modify files in your input/ folder. If you need to make changes, write code that reads the raw data and saves a cleaned version to output/. This way you can always reproduce your work from scratch.
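A minimal sketch of this workflow in Python (the folder names follow the structure above; the file names and the -999 sentinel are illustrative):

```python
import os
import pandas as pd

os.makedirs("build/input", exist_ok=True)
os.makedirs("build/output", exist_ok=True)

# Pretend this file came from a download
# (normally you would never create a raw file in code)
pd.DataFrame({"id": [1, 2], "income": [50000, -999]}).to_csv(
    "build/input/raw.csv", index=False)

# Clean: read the raw file, fix problems, save the result SOMEWHERE ELSE
df = pd.read_csv("build/input/raw.csv")
df = df[df["income"] >= 0]                     # drop sentinel values
df.to_csv("build/output/clean.csv", index=False)
# build/input/raw.csv is untouched, so the cleaning is fully reproducible
```

The script only ever reads from `input/` and writes to `output/`; deleting everything in `output/` and rerunning reproduces it exactly.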

The Master Script

Your master.do file should run your entire project from start to finish. Anyone should be able to run it and reproduce all your results.

/*==============================================================================
    Master Do-File: My Research Project
    Author: Your Name
    Date: February 2026
==============================================================================*/

clear all
set more off

* Set the project root (CHANGE THIS!)
global root "/Users/yourname/Dropbox/my_project"

* Define paths
global build    "$root/build"
global analysis "$root/analysis"

* Run the build scripts
do "$build/code/01_clean_census.do"
do "$build/code/02_clean_policy.do"
do "$build/code/03_merge.do"

* Run the analysis
do "$analysis/code/01_summary_stats.do"
do "$analysis/code/02_main_regression.do"
do "$analysis/code/03_robustness.do"
#==============================================================================
#    Master Script: My Research Project
#    Author: Your Name
#    Date: February 2026
#==============================================================================

rm(list = ls())

# Set the project root (CHANGE THIS!)
root <- "/Users/yourname/Dropbox/my_project"

# Define paths
build <- file.path(root, "build")
analysis <- file.path(root, "analysis")

# Run the build scripts
source(file.path(build, "code", "01_clean_census.R"))
source(file.path(build, "code", "02_clean_policy.R"))
source(file.path(build, "code", "03_merge.R"))

# Run the analysis
source(file.path(analysis, "code", "01_summary_stats.R"))
source(file.path(analysis, "code", "02_main_regression.R"))
source(file.path(analysis, "code", "03_robustness.R"))
"""
Master Script: My Research Project
Author: Your Name
Date: February 2026
"""

import os
import subprocess

# Set the project root (CHANGE THIS!)
ROOT = "/Users/yourname/Dropbox/my_project"

# Define paths
BUILD = os.path.join(ROOT, "build")
ANALYSIS = os.path.join(ROOT, "analysis")

# Run the build scripts
exec(open(os.path.join(BUILD, "code", "01_clean_census.py")).read())
exec(open(os.path.join(BUILD, "code", "02_clean_policy.py")).read())
exec(open(os.path.join(BUILD, "code", "03_merge.py")).read())

# Run the analysis
exec(open(os.path.join(ANALYSIS, "code", "01_summary_stats.py")).read())
exec(open(os.path.join(ANALYSIS, "code", "02_main_regression.py")).read())
exec(open(os.path.join(ANALYSIS, "code", "03_robustness.py")).read())

Number Your Scripts

Prefix scripts with numbers (01_, 02_, etc.) so it's clear what order they run in. This also keeps them sorted correctly in your file browser.
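Zero-padding matters because file browsers sort names character by character, not numerically. A quick demonstration in Python:

```python
# Lexicographic sorting puts "10_" before "2_" (the character '0' sorts
# before '_'). Padding to two digits ("02_") keeps run order and sort
# order in sync. File names here are illustrative.
unpadded = ["2_merge.do", "10_robustness.do", "1_clean.do"]
padded = ["02_merge.do", "10_robustness.do", "01_clean.do"]

print(sorted(unpadded))  # 10_ jumps ahead of 2_
print(sorted(padded))    # correct order
```

So if you expect more than nine scripts, start with `01_`, not `1_`.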

Quick Check: Project Organization

Question: You download census data from the web. Where should you save it?

Module 3: Reading in Data

Real-world data comes in many formats: CSV files from websites, Excel spreadsheets from collaborators, Stata files from data archives. This module teaches you how to get data into your software.

Importing CSV Files

CSV (Comma-Separated Values) is the most common format for sharing data. It's just a text file where values are separated by commas.

What's in a CSV file? Open one in a text editor (not Excel) and you'll see something like:
name,age,income
Alice,32,50000
Bob,45,75000
Carol,28,42000
The first row is usually variable names. Each subsequent row is one observation.
* Import CSV with first row as variable names
import delimited "mydata.csv", varnames(1) clear

* Immediately check what you got
browse       // Look at the data
describe     // Check variable types
summarize    // Check value ranges

* Common issue: numbers imported as strings
* WHY? If your CSV has a comma in any number (like '1,234'), Stata reads
* the entire column as text. Same if there's any non-numeric character
* ('$50', 'N/A'). You'll know this happened if `summarize income` says
* 'no observations' or shows nothing.
* Fix by destringing (converting to numeric)
* The ignore(",") option strips the commas so the conversion succeeds
destring income, replace ignore(",")
# Import CSV (read_csv from tidyverse is best)
df <- read_csv("mydata.csv")

# Immediately check what you got
glimpse(df)    # Check variable types
summary(df)    # Check value ranges

# Note: read_csv automatically detects types
# It's smarter than base R's read.csv()
# Import CSV
df = pd.read_csv("mydata.csv")

# Immediately check what you got
df.head()     # Look at first rows
df.info()     # Check variable types
df.describe() # Check value ranges

# Note: pandas usually detects types correctly
# You can specify types if needed:
# df = pd.read_csv("mydata.csv", dtype={'id': str})
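To see type detection in action, you can read the tiny CSV from the start of this section directly from a string (a self-contained sketch; `io.StringIO` stands in for a file on disk):

```python
import io
import pandas as pd

# The example CSV from above, as a string
csv_text = """name,age,income
Alice,32,50000
Bob,45,75000
Carol,28,42000
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # name is text; age and income are detected as numeric
print(df.shape)   # 3 observations, 3 variables
```

The first row became the variable names, and each subsequent row became one observation, exactly as described above.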

Importing Stata Files

Many economics datasets are distributed as Stata .dta files. These are convenient because they preserve variable labels and value labels.

* Open Stata file
use "mydata.dta", clear

* Open specific variables only (saves memory for large files)
use age income education using "mydata.dta", clear

* Open only observations meeting a condition
use if age >= 25 using "mydata.dta", clear
# Need the haven package
pacman::p_load(haven)

# Read Stata file
df <- read_dta("mydata.dta")

# Note: haven preserves variable labels
# Access them with:
attr(df$income, "label")
# pandas can read Stata files directly
df = pd.read_stata("mydata.dta")

# Note: unlike Stata, pandas does NOT keep variable labels on read_stata
# To see the labels, open the file with a StataReader instead:
with pd.io.stata.StataReader("mydata.dta") as reader:
    labels = reader.variable_labels()  # dict mapping variable -> label

Saving Your Data

After cleaning and creating variables, save your work so you don't have to redo it.

* Save as Stata file
save "mydata_clean.dta", replace

* Export as CSV
export delimited "mydata_clean.csv", replace
# Save as RDS (R's native format)
saveRDS(df, "mydata_clean.rds")

# Save as Stata file
write_dta(df, "mydata_clean.dta")

# Save as CSV
write_csv(df, "mydata_clean.csv")
# Save as pickle (Python's native format)
df.to_pickle("mydata_clean.pkl")

# Save as Stata file
# (write_index=False avoids saving the row index as an extra variable)
df.to_stata("mydata_clean.dta", write_index=False)

# Save as CSV
df.to_csv("mydata_clean.csv", index=False)

Practice Quiz: Basics

Test your understanding of the basics before moving on:

Q1: In Stata, what's wrong with this code?

gen employed = 1 if empstat == 1
Show Answer

This creates employed = 1 for employed people, but employed = missing for everyone else (not 0!). You need either:
gen employed = 1 if empstat == 1
replace employed = 0 if empstat != 1
Or: gen employed = (empstat == 1)

Q2: You download census data from the web. Where should you save it in your project folder?

Show Answer

build/input/. Raw data goes in the input folder. Never modify files in this folder—if you need to make changes, write code that reads the raw data and saves a cleaned version to build/output/.

Q3: You import a CSV and the income variable shows up as a string instead of a number. How do you fix this in Stata?

Show Answer

Use destring income, replace to convert the string to a numeric variable. This is a common issue when CSVs have numbers formatted with commas or other text characters.

Q4: You run keep if age >= 18 but then count shows 0 observations. What probably happened?

Show Answer

Most likely, age is stored as a string (text), not a number. The comparison age >= 18 doesn't work as expected on strings. Use destring age, replace first to convert it to numeric.

Building on what you know

The topics below build directly on the basics above. Reshaping and merging are how you transform data into the format you need. Loops are just a way to avoid copy-paste. Regression commands might look intimidating, but they're just functions that take variables as inputs. If you understood summarize above, you can understand regress below.


Reshaping Data

Data comes in two basic shapes: wide and long. Understanding the difference—and knowing how to convert between them—is essential for data analysis.

Wide vs. Long Format

Consider GDP data for three countries over three years:

Wide Format

Each row is a country. Years are columns.

country   gdp2018   gdp2019   gdp2020
USA          20.5      21.4      20.9
China        13.9      14.3      14.7
Japan         5.0       5.1       5.0

3 rows (one per country)

Long Format

Each row is a country-year. Year is a variable.

country   year    gdp
USA       2018   20.5
USA       2019   21.4
USA       2020   20.9
China     2018   13.9
...        ...    ...

9 rows (one per country-year)

When to Use Each Format

  • Long format: Usually needed for regression, plotting, and most analysis. This is the "tidy" format.
  • Wide format: Sometimes easier for data entry or viewing, but rarely what you want for analysis.

Rule of thumb: If you're unsure, you probably want long format.

Reshaping: Wide to Long

Predict: If you have wide data with 3 countries and columns gdp2018, gdp2019, gdp2020, how many rows will you have after reshaping to long?
* Start with wide data: country, gdp2018, gdp2019, gdp2020

* Reshape wide to long
* Syntax: reshape long [stub], i([id var]) j([new time var])
reshape long gdp, i(country) j(year)

* What this does:
*   - i(country): country is the ID variable (stays as is)
*   - j(year): creates a new variable "year" from the suffixes
*   - gdp: the stub - looks for gdp2018, gdp2019, etc.

* Check the result
list, clean
# Start with wide data: country, gdp2018, gdp2019, gdp2020

# Reshape wide to long using pivot_longer
df_long <- df %>%
  pivot_longer(
    cols = starts_with("gdp"),     # Columns to reshape
    names_to = "year",             # New column for the names
    names_prefix = "gdp",          # Remove "gdp" prefix
    values_to = "gdp"              # New column for the values
  ) %>%
  mutate(year = as.integer(year))  # Convert year to integer

# Check the result
print(df_long)
# Start with wide data: country, gdp2018, gdp2019, gdp2020

# Reshape wide to long using melt
df_long = df.melt(
    id_vars=['country'],           # Keep these as is
    var_name='year',               # Name for the variable column
    value_name='gdp'               # Name for the value column
)

# Clean up: extract year from "gdp2018" -> 2018
df_long['year'] = df_long['year'].str.replace('gdp', '').astype(int)

# Check the result
print(df_long)

Reshaping: Long to Wide

* Start with long data: country, year, gdp

* Reshape long to wide
reshape wide gdp, i(country) j(year)

* This creates: country, gdp2018, gdp2019, gdp2020
# Start with long data: country, year, gdp

# Reshape long to wide using pivot_wider
df_wide <- df_long %>%
  pivot_wider(
    names_from = year,
    values_from = gdp,
    names_prefix = "gdp"
  )

# This creates: country, gdp2018, gdp2019, gdp2020
# Start with long data: country, year, gdp

# Reshape long to wide using pivot
df_wide = df_long.pivot(
    index='country',
    columns='year',
    values='gdp'
).reset_index()

# This creates: country, 2018, 2019, 2020
# Rename columns if you want the "gdp" prefix
df_wide.columns = ['country'] + [f'gdp{y}' for y in df_wide.columns[1:]]

Merging Data

Real research almost always requires combining data from multiple sources. This is called merging (Stata) or joining (R/Python).

The Concept

Imagine you have:

  • Dataset A: People's ages (id, age)
  • Dataset B: People's incomes (id, income)

You want to combine them into one dataset with id, age, AND income. The key (the variable that links them) is id.

Predict: What happens if person 5 is in Dataset A but not in Dataset B? What should their income be in the merged data?

Merge Types

Understanding Merge Notation:

In Stata, the master dataset is the one already in memory (the one you're working with), and the using dataset is the one you're bringing in with the merge command. The notation m:1 means "many-to-one": many rows in the master match one row in the using dataset. For example, if your master has 10,000 individual people and your using has 50 state-level rows, that's m:1—many people per state, one state row each. In R, the first argument to merge() plays the role of the master, and the second is the using.

Type   Stata       R                          When to Use
1:1    merge 1:1   merge(...)                 Each row in A matches exactly one row in B
m:1    merge m:1   merge(..., all.x = TRUE)   Many rows in A match one row in B (e.g., people to states)
1:m    merge 1:m   merge(..., all.y = TRUE)   One row in A matches many rows in B
m:m    merge m:m   --                         Almost never correct! Usually indicates a problem.

Merging: Step by Step

* Example: Merge person data with state-level unemployment

* Step 1: Load the "master" dataset (person-level)
use person_data.dta, clear
count  // Check: 10,000 people

* Step 2: Merge with "using" dataset (state-level)
* m:1 because many people live in each state
merge m:1 state using state_unemployment.dta

* Step 3: CHECK THE MERGE RESULTS
tab _merge

* _merge values:
*   1 = in master only (person with no matching state)
*   2 = in using only (state with no people)
*   3 = matched (what you want)

* Step 4: Decide what to do with unmatched
* Usually keep only matched:
keep if _merge == 3
drop _merge
# Example: Merge person data with state-level unemployment

# Step 1: Load datasets
person_data <- read_csv("person_data.csv")
state_data <- read_csv("state_unemployment.csv")

# Check dimensions
nrow(person_data)  # 10,000 people
nrow(state_data)   # 50 states

# Step 2: Merge
# all.x = TRUE keeps all people, adds state data where matched
merged <- merge(person_data, state_data, by = "state", all.x = TRUE)

# Step 3: Check for unmatched
sum(is.na(merged$unemployment))  # How many have missing state data?

# Step 4: Remove unmatched if needed
merged <- merged %>% filter(!is.na(unemployment))
# Example: Merge person data with state-level unemployment

# Step 1: Load datasets
person_data = pd.read_csv("person_data.csv")
state_data = pd.read_csv("state_unemployment.csv")

# Check dimensions
len(person_data)  # 10,000 people
len(state_data)   # 50 states

# Step 2: Merge
# merge with indicator to check matching
merged = person_data.merge(
    state_data,
    on='state',
    how='left',
    indicator=True
)

# Step 3: Check for unmatched
print(merged['_merge'].value_counts())

# Step 4: Remove unmatched if needed
merged = merged[merged['_merge'] == 'both']
merged = merged.drop(columns=['_merge'])
Always Check Your Merge!
  • Did the number of observations change unexpectedly?
  • How many observations matched vs. didn't match?
  • Do unmatched observations make sense (e.g., states with no people)?

Most merge bugs are silent—your code runs but gives wrong answers. Checking _merge is how you catch them.
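A toy example makes the point concrete (the ids, states, and rates below are made up). Person 5 lives in a state with no unemployment record, so a left merge gives them a missing value rather than silently dropping them:

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 5], "state": ["CA", "CA", "ZZ"]})
states = pd.DataFrame({"state": ["CA", "NY"], "unemployment": [4.1, 3.8]})

# indicator=True adds a _merge column, just like Stata's _merge
merged = people.merge(states, on="state", how="left", indicator=True)
print(merged["_merge"].value_counts())          # 2 matched, 1 left_only
print(merged[merged["_merge"] == "left_only"])  # the unmatched person
```

Without the indicator, person 5's missing unemployment value would sit quietly in the data and bias any analysis that uses it.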

Loops and Locals

Loops let you write code once and run it many times. If you ever find yourself copy-pasting code and just changing one number, you should use a loop instead.

The forvalues Loop

Real scenario: You're creating a categorical variable for number of children (0, 1, 2, 3+). You could write 4 separate lines... or use a loop.

Without a loop (bad):

* This is repetitive and error-prone
replace kids_cat = 0 if nchild == 0
replace kids_cat = 1 if nchild == 1
replace kids_cat = 2 if nchild == 2
replace kids_cat = 3 if nchild == 3
# This is repetitive and error-prone
df$kids_cat[df$nchild == 0] <- 0
df$kids_cat[df$nchild == 1] <- 1
df$kids_cat[df$nchild == 2] <- 2
df$kids_cat[df$nchild == 3] <- 3
# This is repetitive and error-prone
df.loc[df['nchild'] == 0, 'kids_cat'] = 0
df.loc[df['nchild'] == 1, 'kids_cat'] = 1
df.loc[df['nchild'] == 2, 'kids_cat'] = 2
df.loc[df['nchild'] == 3, 'kids_cat'] = 3

With a loop (good):

* Much cleaner!
forvalues i = 0/3 {
    replace kids_cat = `i' if nchild == `i'
}
# Much cleaner!
for (i in 0:3) {
    df$kids_cat[df$nchild == i] <- i
}
# Much cleaner!
for i in range(4):
    df.loc[df['nchild'] == i, 'kids_cat'] = i
What happens when this runs (Stata):
  1. First iteration: i = 0 → runs replace kids_cat = 0 if nchild == 0
  2. Second iteration: i = 1 → runs replace kids_cat = 1 if nchild == 1
  3. Third iteration: i = 2 → runs replace kids_cat = 2 if nchild == 2
  4. Fourth iteration: i = 3 → runs replace kids_cat = 3 if nchild == 3

The backticks `i' get replaced with the current value of the loop variable.

How to Build a Loop

  1. Write the code for one value first (e.g., the i = 0 case)
  2. Make sure it works
  3. Then wrap it in a loop, replacing the number with the loop variable

This way, if something breaks, you know whether it's the code or the loop.

Loop Variations

Before running this code, predict: What will forvalues i = 5(5)100 print? (Hint: the middle number is the step size.)
* Count from 0 to 3
forvalues i = 0/3 {
    display `i'
}

* Count by 5s from 5 to 100
forvalues i = 5(5)100 {
    display `i'
}

* Loop over a list of values (not just numbers)
foreach v in income age education {
    summarize `v'
}
# Count from 0 to 3
for (i in 0:3) {
    print(i)
}

# Count by 5s from 5 to 100
for (i in seq(5, 100, by = 5)) {
    print(i)
}

# Loop over a list of values (not just numbers)
for (v in c("income", "age", "education")) {
    print(summary(df[[v]]))
}
# Count from 0 to 3
for i in range(4):
    print(i)

# Count by 5s from 5 to 100
for i in range(5, 101, 5):
    print(i)

# Loop over a list of values (not just numbers)
for v in ['income', 'age', 'education']:
    print(df[v].describe())
Click to reveal answer: What does 5(5)100 print?

It prints: 5, 10, 15, 20, 25, ... 95, 100. The syntax is start(step)end, so this starts at 5, goes up by 5 each time, and stops at 100.

foreach vs forvalues:
  • forvalues = loop over numbers (0, 1, 2, 3 or 5, 10, 15...)
  • foreach = loop over a list of anything (variable names, file names, strings)

Quick Check: Loops

Question: In Stata, what values will forvalues i = 0/3 iterate through?

Understanding Locals

A local (in Stata) or variable (in R/Python) stores a value temporarily. Think of it as giving a nickname to something you'll use repeatedly.

Why use locals?
  • Change once, update everywhere: If you store your control variables in a local, you only need to change one line to add/remove a control
  • Store results: Save a coefficient from one regression to use in another calculation
  • Make code readable: `controls' is clearer than a long list of variables

In Stata, you create a local without quotes but use it wrapped in backtick and apostrophe: `name'

* Create a local
local myvar = 7

* Use the local
display `myvar'

* Locals are useful for storing lists
local controls "age education income"
reg wage `controls'

* Or for storing results
reg wage education
local coef = _b[education]
display "The coefficient was `coef'"
# Create a variable
myvar <- 7

# Use the variable
print(myvar)

# Variables are useful for storing lists
controls <- c("age", "education", "income")
model <- lm(wage ~ age + education + income, data = df)

# Or for storing results
model <- lm(wage ~ education, data = df)
coef_val <- coef(model)["education"]
print(paste("The coefficient was", coef_val))
# Create a variable
myvar = 7

# Use the variable
print(myvar)

# Variables are useful for storing lists
controls = ['age', 'education', 'income']
import statsmodels.formula.api as smf
model = smf.ols('wage ~ age + education + income', data=df).fit()

# Or for storing results
model = smf.ols('wage ~ education', data=df).fit()
coef_val = model.params['education']
print(f"The coefficient was {coef_val}")

CRITICAL STATA QUIRK: Locals Only Last One "Run"

If you highlight local x = 5 and run it, then separately highlight display `x' and run it, Stata says "x not found". Why? Locals disappear at the end of each "run". Solution: Highlight BOTH lines and run them together, or put them in a do-file. (R and Python don't have this issue.)

Preserve and Restore

Sometimes you want to temporarily modify data, do something, then go back to the original:

* Save the current state
preserve

* Do something that changes the data
keep if state == "CA"
save california_only.dta, replace

* Go back to the full dataset
restore

* The full dataset is back!
# Make a copy before modifying
df_backup <- df

# Do something that changes the data
df_ca <- df[df$state == "CA", ]
saveRDS(df_ca, "california_only.rds")

# The original is still in df_backup
# (or just use df_ca and keep df unchanged)
# Make a copy before modifying
df_backup = df.copy()

# Do something that changes the data
df_ca = df[df['state'] == 'CA']
df_ca.to_csv('california_only.csv', index=False)

# The original is still in df_backup
# (or just filter without reassigning to keep df unchanged)

Regression

This section covers the main regression commands you'll use in applied economics research. The syntax is similar across languages: you specify a dependent variable (Y), independent variables (X), and options like standard error types.

Basic OLS Regression

Stata syntax:
reg wage education age experience, robust
Here reg is the command, wage is Y (the dependent variable), education age experience are the X's (independent variables), and everything after the comma — here robust — is an option.
* Simple regression
reg wage education

* Add control variables
reg wage education age experience

* Heteroskedasticity-robust standard errors
reg wage education age experience, robust

* Cluster standard errors by state
reg wage education age experience, cluster(state)
# Simple regression
model <- lm(wage ~ education, data = df)
summary(model)

# Add control variables
model <- lm(wage ~ education + age + experience, data = df)

# Heteroskedasticity-robust standard errors
pacman::p_load(sandwich, lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Cluster standard errors by state
pacman::p_load(clubSandwich)
coef_test(model, vcov = vcovCR(model, cluster = df$state, type = "CR0"))
import statsmodels.formula.api as smf

# Simple regression
model = smf.ols('wage ~ education', data=df).fit()
print(model.summary())

# Add control variables
model = smf.ols('wage ~ education + age + experience', data=df).fit()

# Heteroskedasticity-robust standard errors
model_robust = smf.ols('wage ~ education + age + experience', data=df).fit(
    cov_type='HC1'
)

# Cluster standard errors by state
model_cluster = smf.ols('wage ~ education + age + experience', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']}
)
Which standard errors should I use?

  Situation                                   Standard Errors      Stata Option
  Cross-sectional data, simple models         Robust (HC1)*        , robust
  Panel data (observations grouped by unit)   Clustered by unit    , cluster(state)
  Workers grouped by firm                     Clustered by firm    , cluster(firm)
  DiD with state-level treatment              Clustered by state   , cluster(state)

Rule of thumb: Cluster at the level of treatment assignment or the level where errors might be correlated.

*What is HC1? HC1 stands for "Heteroskedasticity-Consistent type 1." Stata's , robust option uses HC1 by default. R and Python require you to specify it explicitly. For most economics papers, HC1 is standard.
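To make the robust option less of a black box, here is a numpy-only sketch (with simulated, made-up data) that computes HC1 standard errors by hand and compares them with the classical homoskedastic formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
# Error variance grows with x -> heteroskedasticity by construction
e = rng.normal(0, 1 + x, n)
y = 2 + 3 * x + e

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
k = X.shape[1]

# Classical variance: s^2 (X'X)^{-1}, valid only under homoskedasticity
XtX_inv = np.linalg.inv(X.T @ X)
s2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC1 "sandwich": n/(n-k) * (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
meat = X.T @ (X * resid[:, None] ** 2)
vcov_hc1 = (n / (n - k)) * XtX_inv @ meat @ XtX_inv
se_hc1 = np.sqrt(np.diag(vcov_hc1))

print("classical SE:", se_classical)
print("HC1 SE:      ", se_hc1)
```

Under heteroskedasticity the two formulas disagree — which is exactly why the choice of standard errors matters for your t-statistics even though the coefficients are identical.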

Interactions

When to use interactions: When you think the effect of X on Y depends on another variable.

Example: Does the return to education differ by gender? Interacting education with gender (e.g., c.education##i.gender) tests whether the education coefficient is different for men vs. women.

* # adds just the interaction
reg wage education gender#married

* ## adds the interaction AND main effects
reg wage education gender##married

* For continuous variables, use c. prefix
reg wage c.education##c.experience
# : adds just the interaction
model <- lm(wage ~ education + gender:married, data = df)

# * adds the interaction AND main effects
model <- lm(wage ~ education + gender*married, data = df)

# For continuous variables
model <- lm(wage ~ education * experience, data = df)
# : adds just the interaction
model = smf.ols('wage ~ education + gender:married', data=df).fit()

# * adds the interaction AND main effects
model = smf.ols('wage ~ education + gender*married', data=df).fit()

# For continuous variables
model = smf.ols('wage ~ education * experience', data=df).fit()
Interpreting c.education##i.female coefficients:
  • education = effect of education for males (the reference group)
  • 1.female = wage gap between females and males (at 0 years of education)
  • 1.female#c.education = how much more/less females gain from each year of education compared to males

Total effect of education for females = education + 1.female#c.education
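A quick way to convince yourself of this arithmetic is to simulate data where the slopes are known. The sketch below (numpy only, made-up coefficients) builds the ## design matrix by hand and recovers the male and female education effects:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
educ = rng.uniform(8, 20, n)
female = rng.integers(0, 2, n)
# True model: slope 2 for males, slope 3 for females, gap -1 at educ = 0
wage = 5 + 2 * educ - 1 * female + 1 * female * educ + rng.normal(0, 1, n)

# Design matrix: intercept, educ, female, female x educ (the ## expansion)
X = np.column_stack([np.ones(n), educ, female, female * educ])
b = np.linalg.lstsq(X, wage, rcond=None)[0]

b_educ, b_fem, b_inter = b[1], b[2], b[3]
print("effect of education for males:  ", round(b_educ, 2))
print("effect of education for females:", round(b_educ + b_inter, 2))
```

The "main effect" coefficient is the slope for the reference group (males), and adding the interaction coefficient gives the slope for females — exactly the sum described above.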

Quick Check: Standard Errors

Question: You're analyzing wage data where workers are grouped by firm. What type of standard errors should you use?

Fixed Effects

What fixed effects do: Control for all time-invariant characteristics of a group.

State fixed effects control for everything about a state that doesn't change over time (geography, culture, history).
Year fixed effects control for everything that affects all states equally in a given year (recessions, federal policy).

Two-way fixed effects (TWFE) = state FE + year FE. This is the workhorse of diff-in-diff.

* Using areg with absorb() to add fixed effects
* (absorb() soaks up the FE coefficients instead of displaying them,
*  which is usually what you want)
areg wage education age, absorb(state)

* For MANY fixed effects, use reghdfe (faster)
ssc install reghdfe  // run once to install
ssc install ftools   // dependency

reghdfe wage education age, absorb(state year)

* Two-way fixed effects with clustering
reghdfe wage education, absorb(state year) cluster(state)
# Using fixest package (fastest option)
pacman::p_load(fixest)

# One-way fixed effects
model <- feols(wage ~ education + age | state, data = df)

# Two-way fixed effects
model <- feols(wage ~ education + age | state + year, data = df)

# With clustered standard errors
model <- feols(wage ~ education | state + year,
               cluster = ~state, data = df)
# Using pyfixest (similar to R's fixest)
import pyfixest as pf

# One-way fixed effects
model = pf.feols('wage ~ education + age | state', data=df)

# Two-way fixed effects
model = pf.feols('wage ~ education + age | state + year', data=df)

# With clustered standard errors
model = pf.feols('wage ~ education | state + year',
                 vcov={'CRV1': 'state'}, data=df)

When to Use High-Dimensional Fixed Effects

Use reghdfe (Stata), fixest (R), or pyfixest (Python) whenever you have many fixed effects (hundreds or thousands of groups). They're much faster than including dummy variables, and they handle multiple sets of fixed effects efficiently.
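What absorb()/feols actually do under the hood is a within (demeaning) transformation, which by the Frisch-Waugh theorem yields the same slope as including a dummy for every group — just without inverting a huge matrix. A numpy sketch with simulated data:

```python
import numpy as np

# Simulated panel: 50 states, wage depends on education plus a state effect
rng = np.random.default_rng(4)
n, n_states = 2000, 50
state = rng.integers(0, n_states, n)
alpha = rng.normal(0, 3, n_states)                  # state fixed effects
educ = rng.uniform(8, 20, n) + 0.3 * alpha[state]   # educ correlated with the FE
wage = 1.5 * educ + alpha[state] + rng.normal(0, 1, n)

# Approach 1: a dummy variable for every state (slow with many FEs)
D = np.eye(n_states)[state]
X_dum = np.column_stack([educ, D])
b_dum = np.linalg.lstsq(X_dum, wage, rcond=None)[0][0]

# Approach 2: demean within state (what absorb()/feols do)
def demean(v):
    means = np.bincount(state, weights=v) / np.bincount(state)
    return v - means[state]

b_within = demean(educ) @ demean(wage) / (demean(educ) @ demean(educ))

print(b_dum, b_within)  # identical up to floating-point error
```

Both recover the true slope of 1.5 even though education is correlated with the state effects — which is the whole point of fixed effects.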

Instrumental Variables

When to use IV: When your key independent variable is endogenous (correlated with the error term).

Classic example: Does education increase wages? Problem: unobserved ability affects both education AND wages.
Solution: Find an instrument (like quarter of birth) that affects education but has no direct effect on wages.

A valid instrument must satisfy two conditions:
  1. Relevance: The instrument affects the endogenous variable (check the first-stage F-statistic)
  2. Exclusion: The instrument affects Y only through the endogenous variable (not directly)

Condition 1 is testable. Condition 2 is not—you must argue it's plausible.

* Install ivreghdfe
ssc install ivreghdfe  // run once

* IV regression syntax: (endogenous = instruments)
* Example: education is endogenous, quarter_of_birth is the instrument
ivreghdfe wage (education = quarter_of_birth) age experience

* With fixed effects
ivreghdfe wage (education = quarter_of_birth) age, absorb(state year)
# Using fixest for IV
pacman::p_load(fixest)

# IV regression: education instrumented by quarter_of_birth
model <- feols(wage ~ age + experience |
               education ~ quarter_of_birth, data = df)

# With fixed effects
model <- feols(wage ~ age | state + year |
               education ~ quarter_of_birth, data = df)
# Using pyfixest for IV
import pyfixest as pf

# IV regression: education instrumented by quarter_of_birth
model = pf.feols('wage ~ age + experience | education ~ quarter_of_birth',
                 data=df)

# With fixed effects
model = pf.feols('wage ~ age | state + year | education ~ quarter_of_birth',
                 data=df)
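To see the mechanics behind these commands, here is 2SLS done by hand on simulated (entirely made-up) data, where unobserved ability biases OLS upward but an instrument recovers the true effect of 0.5:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
ability = rng.normal(0, 1, n)                  # unobserved confounder
z = rng.normal(0, 1, n)                        # instrument (think: quarter of birth)
educ = 12 + 1.0 * z + 1.0 * ability + rng.normal(0, 1, n)    # relevance holds
wage = 3 + 0.5 * educ + 2.0 * ability + rng.normal(0, 1, n)  # true effect = 0.5

def ols_slope(x, y):
    X = np.column_stack([np.ones(n), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# OLS is biased: ability raises both education and wages
b_ols = ols_slope(educ, wage)

# 2SLS: first stage predicts educ from z, second stage uses the fitted values
g = np.linalg.lstsq(np.column_stack([np.ones(n), z]), educ, rcond=None)[0]
educ_hat = g[0] + g[1] * z
b_iv = ols_slope(educ_hat, wage)

print("OLS estimate: ", round(b_ols, 2))   # biased away from 0.5
print("2SLS estimate:", round(b_iv, 2))    # close to 0.5
```

(Note: doing the two stages manually like this gives correct point estimates but wrong standard errors — that's why in practice you always use a dedicated IV command like ivreghdfe or feols.)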

Exporting Results

* Install estout for publication tables
ssc install estout

* Store results
eststo clear
eststo m1: reg wage education, robust
eststo m2: reg wage education age, robust
eststo m3: reg wage education age experience, robust

* Create a nice table
esttab m1 m2 m3, se r2 label ///
    title("Wage Regressions") ///
    mtitles("(1)" "(2)" "(3)")

* Export to LaTeX
esttab m1 m2 m3 using "tables/wage_regs.tex", replace ///
    se r2 label booktabs
# Using modelsummary
pacman::p_load(modelsummary)

# Run regressions
m1 <- lm(wage ~ education, data = df)
m2 <- lm(wage ~ education + age, data = df)
m3 <- lm(wage ~ education + age + experience, data = df)

# Create table
modelsummary(list("(1)" = m1, "(2)" = m2, "(3)" = m3),
             stars = TRUE,
             gof_omit = "IC|Log")

# Export to LaTeX
modelsummary(list(m1, m2, m3),
             output = "tables/wage_regs.tex")
# Using stargazer-style output with pyfixest
import pyfixest as pf

# Run regressions
m1 = pf.feols('wage ~ education', data=df, vcov='HC1')
m2 = pf.feols('wage ~ education + age', data=df, vcov='HC1')
m3 = pf.feols('wage ~ education + age + experience', data=df, vcov='HC1')

# Create table
pf.etable([m1, m2, m3])

# Export to LaTeX
pf.etable([m1, m2, m3], type='tex', file='tables/wage_regs.tex')

Practice Exercise

Try this exercise to cement what you've learned. Work through it step by step—don't just read the answer.

Exercise: Wage Regression with Loops

You have wage data with variables wage, education, age, experience, and state. Write code to:

  1. Create a local/variable called controls containing age experience
  2. Run a regression of wage on education plus the controls, with robust standard errors
  3. Run the same regression with state fixed effects and clustered standard errors
  4. Use a loop to summarize all three independent variables (education, age, experience)
Click to see solution (Stata)
* 1. Create local with controls
local controls "age experience"

* 2. OLS with robust SEs
reg wage education `controls', robust

* 3. With state FE and clustered SEs
reghdfe wage education `controls', absorb(state) cluster(state)

* 4. Summarize each variable in a loop
foreach v in education age experience {
    summarize `v'
}
Click to see solution (R)
pacman::p_load(fixest, lmtest, sandwich)

# 1. Vector of controls
controls <- c("age", "experience")

# 2. OLS with robust SEs
m1 <- lm(wage ~ education + age + experience, data = df)
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))

# 3. With state FE and clustered SEs
m2 <- feols(wage ~ education + age + experience | state,
            cluster = ~state, data = df)

# 4. Summarize each variable in a loop
for (v in c("education", "age", "experience")) {
    print(summary(df[[v]]))
}
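Click to see solution (Python)

A Python solution along the same lines — a sketch using statsmodels formulas, with a small simulated DataFrame standing in for the wage data (the variable names match the exercise, but the numbers are made up), and state dummies via C(state) in place of pyfixest's absorb:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the wage data described in the exercise
rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    'education': rng.uniform(8, 20, n),
    'age': rng.uniform(20, 60, n),
    'experience': rng.uniform(0, 30, n),
    'state': rng.choice(['CA', 'NY', 'TX'], n),
})
state_fe = df['state'].map({'CA': 0.0, 'NY': 1.0, 'TX': 2.0})
df['wage'] = (2 + 1.0 * df['education'] + 0.1 * df['age']
              + 0.2 * df['experience'] + state_fe + rng.normal(0, 2, n))

# 1. A list of controls (the "local" in Stata terms)
controls = ['age', 'experience']
rhs = ' + '.join(['education'] + controls)

# 2. OLS with robust (HC1) standard errors
m1 = smf.ols(f'wage ~ {rhs}', data=df).fit(cov_type='HC1')
print(m1.params['education'])

# 3. State fixed effects (as dummies) with clustered standard errors
m2 = smf.ols(f'wage ~ {rhs} + C(state)', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']})
print(m2.params['education'])

# 4. Summarize each independent variable in a loop
for v in ['education'] + controls:
    print(df[v].describe())
```

With only a handful of states, dummies are fine; with hundreds of groups you would switch to pyfixest as shown earlier in this section.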

Tutorial Complete!

You've learned the essentials of Stata, from setup to regression. The key takeaways:

  • Setup & exploration: Always start with clear all, then describe, summarize, and tabulate.
  • Project organization separates raw data → cleaned data → analysis → output.
  • Data import: import delimited reads CSV files; use loads Stata's native .dta files.
  • Reshaping & merging transform data into the format your analysis needs.
  • Loops eliminate copy-paste code. Write it once, run it many times.
  • Locals store values you'll reuse. Change once, update everywhere.
  • Standard errors should be clustered at the level of treatment or correlation.
  • Fixed effects control for time-invariant unobservables.

Ready to Test Your Knowledge?

Answer questions about everything covered in this tutorial

Take the Quiz

Found something unclear or have a suggestion? Email [email protected].