Research and Communication in Economics

Data Fundamentals

Load, clean, reshape, and merge your data

Data wrangling is 80% of the work

The glamorous part of empirical economics is running regressions. But most of your time will be spent cleaning, reshaping, and merging data. This is normal. Professional economists spend most of their coding time on data prep too. Once you get comfortable with these operations, the analysis becomes straightforward.

Think of Data as Excel Spreadsheets

The datasets we work with in Stata and R are just tables—exactly like Excel spreadsheets. Each row is an observation (a person, a country-year, a transaction), and each column is a variable (age, GDP, price). If you've used Excel, you already understand the basic structure. The difference is that instead of clicking and dragging, we write code to manipulate these tables—which makes our work reproducible and scalable to millions of rows.

Choosing Your Unit of Analysis

Before you write any code, ask yourself: What is one row in my final dataset? This is your unit of analysis, and it determines everything else.

Common Units of Analysis in Economics

  • Individual — One row per person (survey data, experiments)
  • Individual-year — One row per person per year (labor panels)
  • State-year — One row per state per year (policy studies)
  • Country-year — One row per country per year (macro/trade)
  • Firm — One row per company (cross-sectional)
  • Firm-quarter — One row per company per quarter (finance panels)
  • Transaction — One row per purchase/event (high-frequency data)

Your research question determines your unit of analysis. Consider these examples:

Question: Do minimum wage increases affect employment?

Unit: State-year

Policy varies by state and year. Outcome (employment) is also measured at state-year level.

Question: Does education affect individual earnings?

Unit: Individual-year

Track individuals over time in panel data (e.g., NLSY, PSID) to see how education affects earnings trajectories.

Question: Do lower BAC limits reduce drunk driving fatalities?

Unit: State-year

Policy (BAC limit) varies by state and year. Fatalities are counted per state-year.

Question: Does job training affect re-employment?

Unit: Individual-month

Track each person's employment status over time after treatment.

The Data Rarely Comes at the Right Level

You'll often need to aggregate raw data to match your unit of analysis. For example:

  • You want state-year data, but your raw data has one row per car crash. You need to count crashes per state-year.
  • You want individual-level data, but you have multiple survey responses per person. You need to collapse to one row per person.
  • You want firm-year data, but you have daily stock prices. You need to aggregate to annual averages.

The Aggregating Data section below shows how to do this.

Reading & Writing Files

Loading Data

The first step in any analysis is loading your data into memory. Different file formats require different commands:

* Load Stata dataset
use "data.dta", clear

* Load CSV file
import delimited "data.csv", clear

* Load Excel file
import excel "data.xlsx", sheet("Sheet1") firstrow clear
pacman::p_load(haven, readr, readxl)

# Load Stata dataset
data <- read_dta("data.dta")

# Load CSV file
data <- read_csv("data.csv")

# Load Excel file
data <- read_excel("data.xlsx", sheet = "Sheet1")
import pandas as pd

# Load Stata dataset
data = pd.read_stata("data.dta")

# Load CSV file
data = pd.read_csv("data.csv")

# Load Excel file
data = pd.read_excel("data.xlsx", sheet_name="Sheet1")

Loading Specific Variables or Observations

For large datasets, you can load only what you need:

* Load only specific variables
use income age education using "data.dta", clear

* Load only observations meeting a condition
use "data.dta" if year == 2020, clear

* Combine both
use income age using "data.dta" if state == "CA", clear
# Load specific columns from CSV
data <- read_csv("data.csv", col_select = c(income, age, education))

# Filter while reading (with data.table for large files)
pacman::p_load(data.table)
data <- fread("data.csv", select = c("income", "age"))[year == 2020]
# Load specific columns
data = pd.read_csv("data.csv", usecols=["income", "age", "education"])

# Load and filter (for very large files, use chunks)
data = pd.read_stata("data.dta", columns=["income", "age"])
data = data[data["year"] == 2020]

Saving Data

After cleaning or transforming your data, save it so you don't have to re-run everything next time:

* Save as Stata dataset
save "cleaned_data.dta", replace

* Export to CSV
export delimited "cleaned_data.csv", replace
# Save as Stata dataset
write_dta(data, "cleaned_data.dta")

# Export to CSV
write_csv(data, "cleaned_data.csv")
# Save as Stata dataset
data.to_stata("cleaned_data.dta", write_index=False)

# Export to CSV
data.to_csv("cleaned_data.csv", index=False)

Save Intermediate Files

Save your data after each major step with numbered prefixes. When something breaks at step 5, you can start from step 4 instead of re-running everything:

  • 01_imported.dta
  • 02_cleaned.dta
  • 03_merged.dta

Getting Your Data: Best Practices

Before you can load data, you need to get it. How you acquire your data matters for reproducibility.

The Golden Rule: Raw Data is READ-ONLY

Never modify your original data files. Ever. If you need to clean or transform data, write code that reads the raw file and saves a new cleaned version. This way you can always trace back to the original source.

Download Programmatically When Possible

The best way to get data is to download it with code. This makes your entire pipeline reproducible—anyone can run your script and get the same data.

* Example: Download FARS data (fatal crash data from NHTSA)
* Each year's data is a separate zip file

local years "2000 2001 2002 2003 2004 2005"

foreach year of local years {
    * Download the zip file
    copy "https://static.nhtsa.gov/nhtsa/downloads/FARS/`year'/National/FARS`year'NationalCSV.zip" ///
         "data/raw/FARS`year'.zip", replace

    * Unzip it
    unzipfile "data/raw/FARS`year'.zip", replace
}

* Now you have reproducible data acquisition!
# Example: Download FARS data (fatal crash data from NHTSA)
pacman::p_load(tidyverse)

years <- 2000:2005

for (year in years) {
  url <- glue::glue(
    "https://static.nhtsa.gov/nhtsa/downloads/FARS/{year}/National/FARS{year}NationalCSV.zip"
  )
  destfile <- glue::glue("data/raw/FARS{year}.zip")

  # Download
  download.file(url, destfile, mode = "wb")

  # Unzip
  unzip(destfile, exdir = glue::glue("data/raw/FARS{year}"))
}

# Now you have reproducible data acquisition!
# Example: Download FARS data (fatal crash data from NHTSA)
import requests
import zipfile
from pathlib import Path

years = range(2000, 2006)

for year in years:
    url = f"https://static.nhtsa.gov/nhtsa/downloads/FARS/{year}/National/FARS{year}NationalCSV.zip"
    zip_path = Path(f"data/raw/FARS{year}.zip")

    # Download
    response = requests.get(url)
    zip_path.write_bytes(response.content)

    # Unzip
    with zipfile.ZipFile(zip_path, 'r') as z:
        z.extractall(f"data/raw/FARS{year}")

# Now you have reproducible data acquisition!

Why Programmatic Downloads Matter

  • Reproducibility: Anyone can run your code and get the exact same data
  • Documentation: The URL in your code shows exactly where the data came from
  • Automation: Easy to update when new data is released
  • Version control: Your git history shows when you downloaded the data

When You Can't Download Programmatically

Sometimes data requires manual download (login required, CAPTCHA, web forms). In these cases, document thoroughly:

Create a DATA_SOURCES.md File

# Data Sources

## FARS (Fatality Analysis Reporting System)
- **Source:** NHTSA
- **URL:** https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars
- **Downloaded:** 2026-02-04
- **Files:** FARS2000.zip through FARS2008.zip
- **Archive:** https://web.archive.org/web/20260204/https://www.nhtsa.gov/...

## State Policy Data
- **Source:** Alcohol Policy Information System (APIS)
- **URL:** https://alcoholpolicy.niaaa.nih.gov/
- **Downloaded:** 2026-02-04
- **Notes:** Required selecting "Blood Alcohol Concentration Limits"
  and exporting to CSV. See screenshots in docs/data_download/

Archive Your Sources with archive.org

Websites change. Data gets moved or deleted. Before you start a project, archive your data sources:

  1. Go to web.archive.org
  2. Paste the URL of your data source
  3. Click "Save Page" to create an archived version
  4. Save the archive URL in your DATA_SOURCES.md

This way, even if the original website disappears, you (and anyone replicating your work) can see exactly where your data came from.

Pro Tip: Screenshot the Download Process

For data that requires navigating a web interface, take screenshots of each step. Save them in a docs/data_download/ folder. Future you (and your replicators) will thank you.

Data Cleaning

Inspecting Your Data

Before doing anything else, look at your data. How many observations? What variables exist? Are there missing values? Any obvious errors? These commands give you a quick overview:

* Overview of dataset
describe

* Summary statistics
summarize

* First few rows
list in 1/10

* Check for missing values
misstable summarize

* Detailed info on specific variables
codebook income education
# Overview of dataset
str(data)
glimpse(data)

# Summary statistics
summary(data)

# First few rows
head(data, 10)

# Check for missing values
colSums(is.na(data))
# Overview of dataset
data.info()
data.dtypes

# Summary statistics
data.describe()

# First few rows
data.head(10)

# Check for missing values
data.isna().sum()

Creating and Modifying Variables

Most analysis requires creating new variables—transformations like log income, indicators like "high earner," or cleaning up invalid values. Here are the essential operations:

* Create new variable
gen log_income = log(income)

* Conditional assignment
gen high_earner = (income > 100000)

* Modify variable
replace age = . if age < 0

* Drop observations
drop if missing(income)

* Keep only certain variables
keep id income age
pacman::p_load(dplyr)

data <- data %>%
  mutate(
    # Create new variable
    log_income = log(income),
    # Conditional assignment
    high_earner = income > 100000,
    # Modify variable
    age = if_else(age < 0, NA_real_, age)
  ) %>%
  # Drop observations
  filter(!is.na(income)) %>%
  # Keep only certain variables
  select(id, income, age)
import numpy as np

# Create new variable
data["log_income"] = np.log(data["income"])

# Conditional assignment
data["high_earner"] = data["income"] > 100000

# Modify variable
data.loc[data["age"] < 0, "age"] = np.nan

# Drop observations
data = data.dropna(subset=["income"])

# Keep only certain variables
data = data[["id", "income", "age"]]

Recoding Categorical Variables

Often you need to convert continuous variables into categories (age groups, education levels) or simplify existing categories. This is called recoding:

* Recode numeric values into categories
recode education (1/8 = 1 "Less than HS") ///
                 (9/11 = 2 "Some HS") ///
                 (12 = 3 "HS Graduate") ///
                 (13/15 = 4 "Some College") ///
                 (16/20 = 5 "College+"), gen(educ_cat)

* Simple recode
recode age (18/29 = 1) (30/49 = 2) (50/64 = 3) (65/100 = 4), gen(age_group)
data <- data %>%
  mutate(
    educ_cat = case_when(
      education <= 8 ~ "Less than HS",
      education <= 11 ~ "Some HS",
      education == 12 ~ "HS Graduate",
      education <= 15 ~ "Some College",
      education >= 16 ~ "College+"
    ),
    age_group = cut(age, breaks = c(18, 30, 50, 65, 100),
                    labels = c("18-29", "30-49", "50-64", "65+"),
                    right = FALSE)
  )
# Using pd.cut for numeric ranges
data["age_group"] = pd.cut(data["age"],
                           bins=[18, 30, 50, 65, 100],
                           labels=["18-29", "30-49", "50-64", "65+"])

# Using np.select for complex recoding
conditions = [
    data["education"] <= 8,
    data["education"] <= 11,
    data["education"] == 12,
    data["education"] <= 15,
    data["education"] >= 16
]
choices = ["Less than HS", "Some HS", "HS Graduate", "Some College", "College+"]
# Note: default="" required for numpy 1.25+ when choices are strings
data["educ_cat"] = np.select(conditions, choices, default="")

Reshaping Data

Long Data is (Almost) Always Better

Most commands expect long format: multiple rows per unit, one column for values. If you're copy-pasting variable names with years (income_2018, income_2019...), you probably need to reshape to long.

Wide vs. Long Format

Wide Format
id  income_2020  income_2021
1   50000        52000
2   60000        61000

One row per unit. Each time period is a separate column.

Long Format
id  year  income
1   2020  50000
1   2021  52000
2   2020  60000
2   2021  61000

Multiple rows per unit. Two columns (id + year) together identify each row.

Key Insight: Multiple Identifier Columns

In long format, a single column no longer uniquely identifies each row. Instead, two or more columns together form the unique identifier:

  • Panel data: country + year together identify each observation
  • Individual-time: person_id + date together identify each observation
  • Repeated measures: subject_id + treatment_round together identify each observation

This is why Stata's reshape command requires both i() (the unit identifier) and j() (the new time/category variable).

Reshaping Wide to Long

Before running this code, predict: If you have wide data with 100 people and columns for income_2020, income_2021, income_2022, how many rows will you have after reshaping to long?
Click for answer

300 rows. Each of the 100 people will have 3 rows (one for each year). The data goes from 100 x 4 columns to 300 x 3 columns.

* i() = unit identifier, j() = what becomes the new column
reshape long income_, i(id) j(year)
rename income_ income
pacman::p_load(tidyr)

data_long <- data %>%
  pivot_longer(
    cols = starts_with("income_"),
    names_to = "year",
    names_prefix = "income_",
    values_to = "income"
  ) %>%
  mutate(year = as.numeric(year))
# Melt wide to long
data_long = pd.melt(
    data,
    id_vars=["id"],
    value_vars=["income_2020", "income_2021"],
    var_name="year",
    value_name="income"
)
# Clean up year column
data_long["year"] = data_long["year"].str.replace("income_", "").astype(int)

Reshaping Long to Wide

Sometimes you need the opposite—converting long data to wide format (e.g., for certain graph commands or exporting).

* i() = unit identifier, j() = column whose values become new variable names
reshape wide income, i(id) j(year)

* This creates: income2020, income2021, etc.
pacman::p_load(tidyr)

data_wide <- data_long %>%
  pivot_wider(
    id_cols = id,
    names_from = year,
    names_prefix = "income_",
    values_from = income
  )
# Pivot long to wide
data_wide = data_long.pivot(
    index="id",
    columns="year",
    values="income"
).reset_index()

# Rename columns
data_wide.columns = ["id"] + [f"income_{y}" for y in data_wide.columns[1:]]

Aggregating Data

Often your raw data comes at a finer level than your unit of analysis. You need to aggregate (collapse) it to match.

A Real Example: Crash Data → State-Year Counts

Suppose you're studying whether lower BAC limits reduce drunk driving fatalities. Your raw data is the FARS database, which has one row per fatal crash:

Raw Data: Crash-Level (one row per crash)
crash_id   state    year    drunk_driver    fatalities
1001       MA       2010    1               1
1002       MA       2010    0               2
1003       MA       2010    1               1
1004       CA       2010    0               3
1005       CA       2010    1               1
1006       MA       2011    1               2
...

~30,000 fatal crashes per year × 20 years = ~600,000 rows

But your unit of analysis is state-year because BAC policies vary at the state-year level. You need to aggregate:

Aggregated Data: State-Year Level (one row per state-year)
state    year    total_crashes    drunk_crashes    total_fatalities
MA       2010    3                2                4
CA       2010    2                1                4
MA       2011    1                1                2
...

50 states × 20 years = 1,000 rows

* Start with crash-level data
use "fars_crashes.dta", clear

* Collapse to state-year level
collapse (count) total_crashes=crash_id ///
         (sum) drunk_crashes=drunk_driver ///
         (sum) total_fatalities=fatalities, ///
         by(state year)

* Now you have one row per state-year
* Ready to merge with policy data
pacman::p_load(dplyr)

# Start with crash-level data
crashes <- read_dta("fars_crashes.dta")

# Collapse to state-year level
state_year <- crashes %>%
  group_by(state, year) %>%
  summarize(
    total_crashes = n(),
    drunk_crashes = sum(drunk_driver, na.rm = TRUE),
    total_fatalities = sum(fatalities, na.rm = TRUE),
    .groups = "drop"
  )

# Now you have one row per state-year
import pandas as pd

# Start with crash-level data
crashes = pd.read_stata("fars_crashes.dta")

# Collapse to state-year level
state_year = crashes.groupby(["state", "year"]).agg(
    total_crashes=("crash_id", "count"),
    drunk_crashes=("drunk_driver", "sum"),
    total_fatalities=("fatalities", "sum")
).reset_index()

# Now you have one row per state-year

Always Check Your Aggregation

After aggregating, verify the results make sense:

  • Row count: Do you have the expected number of state-years?
  • Totals: Does sum(total_crashes) equal the original row count?
  • Unique IDs: Is each state-year now unique?

Group Calculations with egen

The egen command ("extensions to generate") creates new variables based on functions applied across observations:

* Mean across all observations
egen income_mean = mean(income)

* Mean within groups (by state)
bysort state: egen income_mean_state = mean(income)

* Other useful egen functions:
bysort state: egen income_max = max(income)
bysort state: egen income_sd = sd(income)
bysort state: egen income_count = count(income)
bysort state: egen income_p50 = pctile(income), p(50)
pacman::p_load(dplyr)

# Add group-level statistics while keeping all rows
data <- data %>%
  group_by(state) %>%
  mutate(
    income_mean_state = mean(income, na.rm = TRUE),
    income_max = max(income, na.rm = TRUE),
    income_sd = sd(income, na.rm = TRUE),
    income_count = n(),
    income_p50 = median(income, na.rm = TRUE)
  ) %>%
  ungroup()
# Add group-level statistics while keeping all rows
data["income_mean_state"] = data.groupby("state")["income"].transform("mean")
data["income_max"] = data.groupby("state")["income"].transform("max")
data["income_sd"] = data.groupby("state")["income"].transform("std")
data["income_count"] = data.groupby("state")["income"].transform("count")
data["income_p50"] = data.groupby("state")["income"].transform("median")

gen sum vs. egen sum

In Stata, gen sum() creates a running total, while egen sum() creates the group total. Always read the help file—function names can be misleading!

Collapsing to Group Level

Use collapse to aggregate data to a higher level (e.g., individual → state, state-year → state):

* Collapse to state-level means
collapse (mean) income age, by(state)

* Multiple statistics
collapse (mean) income_mean=income ///
         (sd) income_sd=income ///
         (count) n=income, by(state)

* Different statistics for different variables
collapse (mean) income (max) tax_rate (sum) population, by(state year)
# Collapse to state level
state_data <- data %>%
  group_by(state) %>%
  summarize(
    income_mean = mean(income, na.rm = TRUE),
    income_sd = sd(income, na.rm = TRUE),
    n = n()
  )

# Different statistics for different variables
state_year_data <- data %>%
  group_by(state, year) %>%
  summarize(
    income = mean(income, na.rm = TRUE),
    tax_rate = max(tax_rate, na.rm = TRUE),
    population = sum(population, na.rm = TRUE),
    .groups = "drop"
  )
# Collapse to state level
state_data = data.groupby("state").agg(
    income_mean=("income", "mean"),
    income_sd=("income", "std"),
    n=("income", "count")
).reset_index()

# Different statistics for different variables
state_year_data = data.groupby(["state", "year"]).agg(
    income=("income", "mean"),
    tax_rate=("tax_rate", "max"),
    population=("population", "sum")
).reset_index()

egen vs. collapse

  • egen — Adds a new column but keeps all rows. Use when you need the group statistic alongside individual data.
  • collapse — Reduces the dataset to one row per group. Use when you want aggregated data only.

Merging Datasets

Merging combines two datasets that share a common identifier (like person ID or state-year).

Types of Joins

The most important thing to understand is what happens to rows that don't match:

Visual diagram of join types showing inner, left, right, and full outer joins with Venn diagrams

Credit: datacourses.com

  • Inner join: Keep only rows that match in both datasets
  • Left join: Keep all rows from the left (master) dataset
  • Right join: Keep all rows from the right (using) dataset
  • Full outer join: Keep all rows from both datasets

Merge Types by Relationship

Know Your Merge Type

  • 1:1 — Each row matches exactly one row (person → demographics)
  • m:1 — Many rows match one (students → school characteristics)
  • 1:m — One row matches many (firm → firm-year panel)
  • m:mAlmost always wrong! If you think you need this, reshape first.

When Merges Create Duplicates

Merges can increase your row count. This is sometimes expected and sometimes a bug.

Expected: m:1 or 1:m Merges

When merging individual data with aggregate data, rows should duplicate:

  • 100 students + 5 schools → still 100 rows (each student gets their school's data)
  • 50 states + 10 years of policy data → 500 state-year rows

Unexpected: Accidental Duplicates

If your row count increases unexpectedly, you probably have duplicate keys:

  • Two "John Smith" entries in one dataset → each gets merged twice
  • Forgot that some IDs appear multiple times
* BEFORE merging: Check for unexpected duplicates
duplicates report person_id      // Should say "all unique" for 1:1
duplicates report state          // OK to have duplicates for m:1

* Count rows before and after
count
local n_before = r(N)

merge m:1 state using "state_data.dta"

count
local n_after = r(N)

* If n_after > n_before in a m:1 merge, something is wrong!
di "Rows before: `n_before'"
di "Rows after: `n_after'"
if `n_after' > `n_before' {
    di as error "WARNING: Row count increased - check for duplicate keys!"
}
# BEFORE merging: Check for duplicates
individuals %>% count(person_id) %>% filter(n > 1)  # Should be empty for 1:1
state_data %>% count(state) %>% filter(n > 1)       # Should be empty

# Count rows before and after
n_before <- nrow(individuals)

merged <- merge(individuals, state_data, by = "state", all.x = TRUE)

n_after <- nrow(merged)

# Check if row count changed unexpectedly
if (n_after > n_before) {
  warning("Row count increased! Check for duplicate keys in state_data")
  # Find the problem
  state_data %>% count(state) %>% filter(n > 1)
}
# BEFORE merging: Check for duplicates
print(individuals.groupby("person_id").size().loc[lambda x: x > 1])  # Should be empty
print(state_data.groupby("state").size().loc[lambda x: x > 1])       # Should be empty

# Count rows before and after
n_before = len(individuals)

merged = individuals.merge(state_data, on="state", how="left")

n_after = len(merged)

# Check if row count changed unexpectedly
if n_after > n_before:
    print("WARNING: Row count increased! Check for duplicate keys")
    print(state_data.groupby("state").size().loc[lambda x: x > 1])

The Row Count Rule

  • 1:1 merge: Row count should stay exactly the same (or decrease if using inner join)
  • m:1 merge: Row count should stay the same as master dataset
  • 1:m merge: Row count will increase (one master row → many using rows)
  • If row count increases when you didn't expect it, you have duplicate keys somewhere

Example: Merging Individual Data with State Data

Suppose you have individual-level data and want to add state characteristics:

individuals.dta (master)
person_id  state    income
1          MA       50000
2          MA       60000
3          CA       70000
4          NY       55000
state_data.dta (using)
state    min_wage  population
MA       15.00    7000000
CA       15.50    39500000
TX       7.25     29500000
* Load master data
use "individuals.dta", clear

* Merge: many individuals per state → m:1
merge m:1 state using "state_data.dta"

* ALWAYS check the results
tab _merge

/*
   _merge |      Freq.
----------+------------
        1 |          1    ← Only in master (NY - no state data)
        2 |          1    ← Only in using (TX - no individuals)
        3 |          3    ← Matched (MA, CA)
*/

* Investigate unmatched
list if _merge == 1   // NY had no state data
list if _merge == 2   // TX had no individuals

* Keep matched only (after understanding why others didn't match)
keep if _merge == 3
drop _merge
pacman::p_load(dplyr, haven)

individuals <- read_dta("individuals.dta")
state_data <- read_dta("state_data.dta")

# Left merge: keep all individuals, add state data where available
merged <- merge(individuals, state_data, by = "state", all.x = TRUE)

# Check for unmatched
merged %>% filter(is.na(min_wage))  # NY had no state data

# Inner merge: keep only matched
merged <- merge(individuals, state_data, by = "state")
individuals = pd.read_stata("individuals.dta")
state_data = pd.read_stata("state_data.dta")

# Left join: keep all individuals, add state data where available
merged = individuals.merge(state_data, on="state", how="left", indicator=True)

# Check merge results
print(merged["_merge"].value_counts())

# Check for unmatched
print(merged[merged["min_wage"].isna()])  # NY had no state data

# Inner join: keep only matched
merged = individuals.merge(state_data, on="state", how="inner")

What To Do After Checking _merge

After running tab _merge, ask yourself: Why didn't these match? Common reasons:

  • Typos in merge keys ("NY" vs "New York")
  • Different time periods (state data goes 2000–2010, individual data goes 1995–2015)
  • Data errors or missing values in the key variable

Investigate with list state if _merge == 1 before deciding whether to keep if _merge == 3. Always document why you dropped unmatched observations.

Pre-Merge Data Cleaning

Merge variables must be exactly identical in both datasets—same name, same values, same type. Most merge problems come from subtle differences that require string cleaning.

* Check if your variable is actually numeric or string
browse, nolabel   // Shows actual values, not labels

* Common string cleaning operations:

* 1. Remove periods: "D.C." → "DC"
replace state = subinstr(state, ".", "", .)

* 2. Remove leading/trailing spaces: " MA " → "MA"
replace state = trim(state)

* 3. Extract substring: "Washington DC" → "DC"
replace state = substr(state, -2, 2)  // Last 2 characters

* 4. Convert case: "ma" → "MA"
replace state = upper(state)

* 5. Rename to match other dataset
rename statefip state

* 6. Convert string to numeric (or vice versa)
destring state_code, replace
tostring state_fips, replace
pacman::p_load(stringr)

# Common string cleaning operations:

# 1. Remove periods
data$state <- str_replace_all(data$state, "\\.", "")

# 2. Remove leading/trailing spaces
data$state <- str_trim(data$state)

# 3. Extract substring
data$state <- str_sub(data$state, -2, -1)  # Last 2 characters

# 4. Convert case
data$state <- str_to_upper(data$state)

# 5. Rename to match other dataset
data <- data %>% rename(state = statefip)

# 6. Convert types
data$state_code <- as.numeric(data$state_code)
data$state_fips <- as.character(data$state_fips)
# Common string cleaning operations:

# 1. Remove periods
data["state"] = data["state"].str.replace(".", "", regex=False)

# 2. Remove leading/trailing spaces
data["state"] = data["state"].str.strip()

# 3. Extract substring
data["state"] = data["state"].str[-2:]  # Last 2 characters

# 4. Convert case
data["state"] = data["state"].str.upper()

# 5. Rename to match other dataset
data = data.rename(columns={"statefip": "state"})

# 6. Convert types
data["state_code"] = pd.to_numeric(data["state_code"])
data["state_fips"] = data["state_fips"].astype(str)

Labels Can Be Deceiving

In Stata, when you see "Washington" in the data browser, it might actually be stored as the number 53 with a label attached. Use browse, nolabel or tab var, nolabel to see the true values. Merges use the actual values, not the labels!

Common Merge Problems

Watch Out For

  • Duplicates on key: If your key isn't unique when it should be, you'll get unexpected results. Always run duplicates report first.
  • Case sensitivity: "MA" ≠ "ma" in Stata. Use upper() or lower() to standardize.
  • Trailing spaces: "MA " ≠ "MA". Use trim() to remove.
  • Different variable types: String "1" ≠ numeric 1. Convert with destring or tostring.
  • Labels hiding values: What looks like "California" might be stored as 6. Check with nolabel.

Appending (Stacking) Datasets

Use append to stack datasets vertically (same variables, different observations):

* Append multiple years of data
use "data_2020.dta", clear
append using "data_2021.dta"
append using "data_2022.dta"
# Bind rows from multiple dataframes
data_all <- bind_rows(
  read_dta("data_2020.dta"),
  read_dta("data_2021.dta"),
  read_dta("data_2022.dta")
)
# Concatenate multiple dataframes
data_all = pd.concat([
    pd.read_stata("data_2020.dta"),
    pd.read_stata("data_2021.dta"),
    pd.read_stata("data_2022.dta")
], ignore_index=True)

Critical Stata Bugs That Will Break Your Analysis

Stata has some behaviors that can silently corrupt your analysis.

Missing Values Are Infinity

In Stata, missing values (.) are treated as larger than any number. This means comparisons like if x > 100 will include missing values.

* This INCLUDES missing values (dangerous!)
gen high_income = (income > 100000)

* This is what you actually want
gen high_income = (income > 100000) if !missing(income)

* Or equivalently
gen high_income = (income > 100000 & income < .)

* The ordering is: all numbers < . < .a < .b < ... < .z
* So . > 999999999 evaluates to TRUE
# R handles this more intuitively - NA propagates
high_income <- income > 100000
# Returns NA where income is NA, not TRUE

# But still be explicit about NA handling
high_income <- income > 100000 & !is.na(income)
# Python/pandas handles this intuitively - NaN propagates
high_income = data["income"] > 100000
# Returns NaN where income is NaN, not True

# But still be explicit about NA handling
high_income = (data["income"] > 100000) & data["income"].notna()
Why this matters: If you're creating indicator variables or filtering data, missing values will silently be included whenever you use >, >=, or !=. Your sample sizes will be wrong, and your estimates will be biased.

egen group Silently Top-Codes

When you use egen group() to create a numeric ID from string or categorical variables, Stata uses an int by default. If you have more unique values than an int can hold (~32,000), it silently repeats the maximum value.

* DANGEROUS: If you have >32,767 unique beneficiaries,
* later IDs will all be assigned the same number!
egen bene_id = group(beneficiary_name)

* SAFE: Explicitly use a long integer
egen long bene_id = group(beneficiary_name)

* Or for very large datasets (>2 billion unique values)
egen double bene_id = group(beneficiary_name)

* Always verify you got the right number of groups
distinct beneficiary_name
assert r(ndistinct) == r(N)  // This will fail if top-coded
# R doesn't have this problem - factors work fine
data$bene_id <- as.numeric(factor(data$beneficiary_name))

# Or use dplyr
data <- data %>%
  mutate(bene_id = as.numeric(factor(beneficiary_name)))
# Python doesn't have this problem - uses 64-bit integers by default
data["bene_id"] = pd.factorize(data["beneficiary_name"])[0] + 1

# Verify
assert data["bene_id"].nunique() == data["beneficiary_name"].nunique()
Why this matters: If you're creating panel identifiers, observations from different individuals will be assigned the same ID. Your fixed effects will be wrong, standard errors will be wrong, and you'll have no indication anything went wrong.

Numeric Precision

Stata's float type only has ~7 digits of precision. Large IDs stored as floats will be silently corrupted.

* float has ~7 digits of precision
* This can cause merge failures on IDs!
gen float id = 1234567890
* id is now 1234567936 (corrupted!)

* Always use long or double for IDs
gen long id = 1234567890      // integers up to ~2 billion
gen double id = 1234567890    // integers up to ~9 quadrillion

* Check if a variable might have precision issues
summarize id
* If max is > 16777216 and storage type is float, you have a problem

* Fix: compress optimizes storage types automatically
compress
# R uses double precision by default, so this is less of an issue
# But be careful with very large integers
id <- 1234567890  # Fine in R

# For extremely large integers, use bit64 package
pacman::p_load(bit64)
big_id <- as.integer64("9223372036854775807")
# Python uses 64-bit integers by default, so this is less of an issue
id = 1234567890  # Fine in Python

# pandas also handles large integers well
data["id"] = pd.to_numeric(data["id"], downcast=None)  # Keep full precision

# Check dtypes
print(data.dtypes)

Sorting Instability

Stata's sort command is unstable—when there are ties, the order of tied observations is random and can change between runs. This causes reproducibility issues.

* DANGEROUS: Order of tied observations is random
sort state
gen observation_number = _n  // Different every time you run!

* SAFE: Break ties with a unique identifier
sort state id
gen observation_number = _n  // Now reproducible

* ALTERNATIVE: Use stable sort (preserves existing order for ties)
sort state, stable

* Best practice: Always sort by enough variables to eliminate ties
* or use stable when you don't care about tie-breaking
# R's order() is stable by default (preserves original order for ties)
data <- data[order(data$state), ]

# To be explicit about tie-breaking:
data <- data[order(data$state, data$id), ]

# dplyr::arrange() is also stable
pacman::p_load(dplyr)
data <- data %>% arrange(state, id)
# pandas sort_values() has kind='stable' option
data = data.sort_values('state', kind='stable')

# Better: explicitly break ties
data = data.sort_values(['state', 'id'])

# Note: default sort is 'quicksort' which is unstable
Why this matters: If you create variables based on sort order (like _n for observation numbers or by: gen first = _n == 1), your results will differ across runs, making your analysis non-reproducible.

Variable Abbreviation

By default, Stata allows you to abbreviate variable names. This is dangerous—your code can silently operate on the wrong variable if you add new variables later.

* Suppose you have: income, age, education
summarize inc  // Works! Stata abbreviates to "income"

* Later, you add: income_spouse
summarize inc  // ERROR: "inc" is ambiguous (income vs income_spouse)
* Or worse: no error, wrong variable used silently

* SOLUTION: Disable variable abbreviation at the start of every do-file
set varabbrev off

* Now Stata requires exact variable names
summarize income  // Must use full name
# R does NOT allow variable abbreviation by default
# You must use exact column names
summary(data$income)  # Must be exact

# Partial matching only happens with $ on lists, not data frames
# This is generally not a problem in R
# Python does NOT allow column abbreviation
# You must use exact column names
data['income'].describe()  # Must be exact

# Autocomplete in IDEs helps with long names
# No risk of silent abbreviation errors
Why this matters: Code that works today can silently break when new variables are added to your dataset. Put set varabbrev off at the top of every do-file.

Defensive Coding Habits

  • Put set varabbrev off at the top of every do-file
  • Always use if !missing(x) with inequality comparisons
  • Always use egen long or egen double for group IDs
  • Sort by enough variables to eliminate ties, or use sort ..., stable
  • Run compress after loading data to optimize storage types
  • Use assert liberally to catch problems early

Practice Exercise

Test your understanding with this data wrangling challenge.

Scenario

You have two datasets:

  1. students.csv: One row per student with columns student_id, school_id, test_score
  2. schools.csv: One row per school with columns school_id, school_type (public/private), total_enrollment

Task: Create a dataset where each row is a student, with their school's characteristics added.

Questions to answer before coding:

  1. What type of merge is this? (1:1, m:1, 1:m, m:m)
  2. After the merge, how many rows should you have?
  3. What should you check before merging?
Click to see answers + code

Answers:

  1. m:1 — Many students per school, one school record per school
  2. Same number of rows as students.csv (school info gets duplicated to each student)
  3. Check that school_id is unique in schools.csv: duplicates report school_id
* Stata solution
import delimited "students.csv", clear
count  // Note: N students

merge m:1 school_id using "schools.dta"
tab _merge  // All should match (or investigate non-matches)

count  // Should equal original N students
keep if _merge == 3
drop _merge

Found something unclear or have a suggestion? Email [email protected].