Research and Communication in Economics

Automation

Loops, functions, and working with large datasets

As your projects grow, you'll find yourself repeating similar operations: cleaning 50 data files, running regressions on 20 outcomes, generating the same graph for different subsamples. Automation turns hours of tedious work into seconds of computer time—and eliminates the copy-paste errors that inevitably creep into manual work.

Loops and Iteration

Why Use Loops?

Imagine you have 20 outcome variables (income, hours, wages, savings, ...) and need to run the same regression for each one. You could copy-paste the regression command 20 times, changing the outcome variable each time. But if you later decide to add a control variable, you'd have to edit all 20 copies. With a loop, you write the regression once, wrap it in a loop, and let the computer run all 20 versions. Change the control? Edit one line.

A loop executes the same code multiple times, each time with different values. Instead of writing summarize income, summarize age, summarize education separately, you write one loop that summarizes each variable in a list. This is more concise, easier to modify, and less error-prone.

Basic Loops

Stata:

* Loop over numbers
forvalues i = 1/10 {
    di "Iteration `i'"
}

* Loop over values in a list
foreach var in income age education {
    summarize `var'
}

* Loop over variables matching a pattern
foreach var of varlist inc* {
    replace `var' = . if `var' < 0
}

* Loop over files
local files: dir "data/" files "*.csv"
foreach f of local files {
    di "Processing `f'"
    import delimited "data/`f'", clear
    * ... do something ...
}

R:

pacman::p_load(readr, dplyr)

# Loop over numbers
for (i in 1:10) {
  print(paste("Iteration", i))
}

# Loop over values in a vector
for (var in c("income", "age", "education")) {
  print(summary(data[[var]]))
}

# Loop over files
files <- list.files("data/", pattern = "*.csv", full.names = TRUE)
for (f in files) {
  print(paste("Processing", f))
  df <- read_csv(f)
  # ... do something ...
}

# Better: collect into a list and combine
data_list <- list()
for (f in files) {
  data_list[[f]] <- read_csv(f)
}
all_data <- bind_rows(data_list)

Python:

import pandas as pd

# Loop over numbers
for i in range(1, 11):
    print(f"Iteration {i}")

# Loop over values in a list
for var in ['income', 'age', 'education']:
    print(df[var].describe())

# Loop over columns matching a pattern
for col in df.columns:
    if col.startswith('inc'):
        df.loc[df[col] < 0, col] = None

# Loop over files
import glob
files = glob.glob("data/*.csv")
for f in files:
    print(f"Processing {f}")
    df_temp = pd.read_csv(f)
    # ... do something ...

# Better: use list comprehension and concat
all_data = pd.concat([pd.read_csv(f) for f in files])

Running Multiple Regressions

A common task: run the same specification with different outcome variables, or run the same regression for different subsamples. Loops let you do this systematically, ensuring you apply exactly the same controls and standard errors to each model. This is especially useful for robustness checks—showing your result holds across different outcomes or subgroups.

Stata:

* eststo/esttab require the estout package: ssc install estout

* Run same regression for different outcomes
eststo clear
foreach outcome in income hours wages {
    eststo `outcome': reg `outcome' treatment age female, robust
}
esttab, se star(* 0.1 ** 0.05 *** 0.01)

* Run same regression for different subgroups
eststo clear
levelsof state, local(states)
foreach s of local states {
    eststo `s': reg income treatment if state == "`s'", robust
}
esttab

R:

pacman::p_load(fixest, purrr, dplyr)

# Run same regression for different outcomes
outcomes <- c("income", "hours", "wages")
models <- map(outcomes, ~ feols(
  as.formula(paste(.x, "~ treatment + age + female")),
  data = data,
  vcov = "hetero"
))
names(models) <- outcomes
etable(models)

# Run same regression for different subgroups
states <- unique(data$state)
models <- map(states, ~ feols(
  income ~ treatment,
  data = filter(data, state == .x),
  vcov = "hetero"
))
names(models) <- states
etable(models)

Python:

import pyfixest as pf

# Run same regression for different outcomes
outcomes = ['income', 'hours', 'wages']
models = {}
for outcome in outcomes:
    models[outcome] = pf.feols(f'{outcome} ~ treatment + age + female',
                               data=df, vcov='HC1')
pf.etable(list(models.values()))

# Run same regression for different subgroups
states = df['state'].unique()
models = {}
for s in states:
    models[s] = pf.feols('income ~ treatment',
                         data=df[df['state'] == s], vcov='HC1')
pf.etable(list(models.values()))

Storing Results

Often you'll want to collect the coefficient, standard error, and sample size from each regression into a dataset you can then plot or export.

Stata:

* Grab the list of states and save the analysis data
* BEFORE clearing it from memory
levelsof state, local(states)
tempfile analysis_data
save `analysis_data'

* Create empty dataset to store results
* (50 rows is enough for one row per state)
clear
set obs 50
gen state = ""
gen beta = .
gen se = .
gen n = .
save "results.dta", replace

* Loop and store
local row = 1
foreach s of local states {
    use `analysis_data', clear
    reg income treatment if state == "`s'", robust

    use "results.dta", clear
    replace state = "`s'" in `row'
    replace beta = _b[treatment] in `row'
    replace se = _se[treatment] in `row'
    replace n = e(N) in `row'
    save "results.dta", replace

    local row = `row' + 1
}

R:

pacman::p_load(dplyr, tibble)

# Store results in a data frame
results_list <- list()
for (s in unique(data$state)) {
  model <- lm(income ~ treatment, data = filter(data, state == s))

  results_list[[s]] <- tibble(
    state = s,
    beta = coef(model)["treatment"],
    se = sqrt(vcov(model)["treatment", "treatment"]),
    n = nobs(model)
  )
}
results <- bind_rows(results_list)

# View results
print(results)

# Or use broom for tidy model output
pacman::p_load(broom)
results <- data %>%
  group_by(state) %>%
  group_modify(~ tidy(lm(income ~ treatment, data = .x))) %>%
  ungroup()

Python:

import pandas as pd
import statsmodels.formula.api as smf

# Store results in a list, then convert to DataFrame
results = []
for s in df['state'].unique():
    model = smf.ols('income ~ treatment',
                    data=df[df['state'] == s]).fit(cov_type='HC1')
    results.append({
        'state': s,
        'beta': model.params['treatment'],
        'se': model.bse['treatment'],
        'n': int(model.nobs)
    })

results_df = pd.DataFrame(results)
print(results_df)

# Or use a list comprehension for cleaner code
results_df = pd.DataFrame([
    {'state': s,
     'beta': smf.ols('income ~ treatment', data=df[df['state'] == s])
             .fit(cov_type='HC1').params['treatment']}
    for s in df['state'].unique()
])

Functions and Programs

Don't Repeat Yourself (DRY)

If you're copy-pasting code and changing small things, write a function instead. This reduces errors and makes your code easier to maintain.

Writing Functions

Stata:

* Define a program (Stata's version of a function)
capture program drop my_regression
program define my_regression
    * The syntax line defines what inputs your program expects:
    * - varlist means "a list of variable names"
    * - outcome(string) means "a required option called 'outcome' that takes text"
    * - [controls(string)] means "an optional 'controls' option" (square brackets = optional)
    * When you call the program, Stata parses your inputs according to this template
    syntax varlist, outcome(string) [controls(string)]

    * Parse variable list
    local treatment: word 1 of `varlist'

    * Run regression
    if "`controls'" != "" {
        reg `outcome' `treatment' `controls', robust
    }
    else {
        reg `outcome' `treatment', robust
    }

    * Store and display results
    di "Treatment effect: " _b[`treatment']
end

* Use the program
my_regression treatment, outcome(income) controls(age female)

R:

pacman::p_load(broom, dplyr)

# Define a function
run_regression <- function(data, outcome, treatment, controls = NULL) {

  # Build formula
  if (is.null(controls)) {
    formula <- as.formula(paste(outcome, "~", treatment))
  } else {
    formula <- as.formula(paste(outcome, "~", treatment, "+",
                                paste(controls, collapse = " + ")))
  }

  # Run regression
  model <- lm(formula, data = data)

  # Return tidy results
  broom::tidy(model, conf.int = TRUE)
}

# Use the function
run_regression(data, "income", "treatment", c("age", "female"))

# Apply to multiple outcomes
outcomes <- c("income", "hours", "wages")
results_list <- list()
for (outcome in outcomes) {
  results_list[[outcome]] <- run_regression(data, outcome, "treatment")
}
results <- bind_rows(results_list, .id = "outcome")

Python:

import pandas as pd
import statsmodels.formula.api as smf

# Define a function
def run_regression(data, outcome, treatment, controls=None):
    """Run OLS regression with robust standard errors."""

    # Build formula
    if controls is None:
        formula = f'{outcome} ~ {treatment}'
    else:
        formula = f'{outcome} ~ {treatment} + {" + ".join(controls)}'

    # Run regression
    model = smf.ols(formula, data=data).fit(cov_type='HC1')

    # Return results as dict
    return {
        'outcome': outcome,
        'beta': model.params[treatment],
        'se': model.bse[treatment],
        'pval': model.pvalues[treatment],
        'n': int(model.nobs)
    }

# Use the function
result = run_regression(df, 'income', 'treatment', ['age', 'female'])

# Apply to multiple outcomes
outcomes = ['income', 'hours', 'wages']
results_df = pd.DataFrame([
    run_regression(df, outcome, 'treatment') for outcome in outcomes
])

Modular Code Structure

Rather than one giant script, split your pipeline into small scripts and use a master script to run them in order, so the whole project can be reproduced with a single command.

Stata:

* master.do - runs everything
clear all
set more off

* Set paths
global root "/Users/me/Dropbox/project"
global build "$root/build"
global analysis "$root/analysis"

* Run scripts in order
do "$build/code/01_import.do"
do "$build/code/02_clean.do"
do "$analysis/code/01_summary_stats.do"
do "$analysis/code/02_regressions.do"

di "Master script complete!"

R:

# master.R - runs everything
rm(list = ls())

# Set paths
root <- "/Users/me/Dropbox/project"
build <- file.path(root, "build")
analysis <- file.path(root, "analysis")

# Load packages
pacman::p_load(tidyverse, haven, fixest)

# Run scripts in order
source(file.path(build, "code", "01_import.R"))
source(file.path(build, "code", "02_clean.R"))
source(file.path(analysis, "code", "01_summary_stats.R"))
source(file.path(analysis, "code", "02_regressions.R"))

cat("Master script complete!\n")

Python:

# main.py - runs everything
from pathlib import Path

# Set paths
ROOT = Path("/Users/me/Dropbox/project")
BUILD = ROOT / "build"
ANALYSIS = ROOT / "analysis"

# Run scripts in order
exec(open(BUILD / "code" / "01_import.py").read())
exec(open(BUILD / "code" / "02_clean.py").read())
exec(open(ANALYSIS / "code" / "01_summary_stats.py").read())
exec(open(ANALYSIS / "code" / "02_regressions.py").read())

# Or better: import modules
# from build.code import import_data, clean_data
# from analysis.code import summary_stats, regressions

print("Master script complete!")

Working with Large Datasets

Memory Management

Stata:

* Check current memory usage
memory

* Raise the memory ceiling if needed
set max_memory 16g

* Compress data to use less memory
compress

* Keep only needed variables
keep id year treatment outcome

* Drop unnecessary observations early
drop if year < 2000

R:

pacman::p_load(dplyr, pryr)

# Check object sizes
object.size(data)
pryr::mem_used()

# Remove objects you don't need
rm(temp_data, old_results)
gc()  # Force garbage collection

# Select only needed columns early
data <- data %>% select(id, year, treatment, outcome)

# Filter rows early
data <- data %>% filter(year >= 2000)

Python:

import sys

# Check object sizes
print(f"DataFrame size: {sys.getsizeof(df) / 1e6:.1f} MB")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# Remove objects you don't need
del temp_data, old_results
import gc
gc.collect()

# Select only needed columns early
df = df[['id', 'year', 'treatment', 'outcome']]

# Filter rows early
df = df[df['year'] >= 2000]

# Optimize dtypes to reduce memory
# (integer downcasting only works if the column has no missing values)
df['year'] = df['year'].astype('int16')
df['treatment'] = df['treatment'].astype('int8')

Efficient Data Formats

Stata:

* Stata's native .dta format is usually fine
* For very large data, consider:

* 1. Use compress to shrink variable storage
compress

* 2. Convert strings to numeric where possible
encode state_name, gen(state_num)
drop state_name

* 3. Save in chunks if needed
preserve
keep if year <= 2010
save "data_pre2010.dta", replace
restore
keep if year > 2010
save "data_post2010.dta", replace

R:

# Use efficient file formats

# Parquet: columnar format, very fast, good compression
pacman::p_load(arrow)
write_parquet(data, "data.parquet")
data <- read_parquet("data.parquet")

# fst: extremely fast for R-to-R
pacman::p_load(fst)
write_fst(data, "data.fst")
data <- read_fst("data.fst")

# data.table: faster than dplyr for large data
pacman::p_load(data.table)
dt <- fread("data.csv")  # Much faster than read_csv
dt[year >= 2000, .(mean_income = mean(income)), by = state]

Python:

import pandas as pd

# Use efficient file formats

# Parquet: columnar format, very fast, good compression
df.to_parquet("data.parquet")
df = pd.read_parquet("data.parquet")

# Feather: fast binary format
df.to_feather("data.feather")
df = pd.read_feather("data.feather")

# Pickle: Python's native serialization (fast but Python-only)
df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")

# For very large data: use pyarrow or dask
import pyarrow.parquet as pq
table = pq.read_table("data.parquet")

# Dask for out-of-memory data
import dask.dataframe as dd
ddf = dd.read_csv("big_data.csv")
result = ddf[ddf['year'] >= 2000].groupby('state')['income'].mean().compute()

Processing in Chunks

If the data are too large to load at once, process them piece by piece and combine only the summaries.

Stata:

* Process data year by year
forvalues y = 2000/2020 {
    use "big_data.dta" if year == `y', clear

    * Process this year's data
    collapse (mean) income, by(state)
    gen year = `y'

    * Append to results
    if `y' == 2000 {
        save "results.dta", replace
    }
    else {
        append using "results.dta"
        save "results.dta", replace
    }
}

R:

# Read and process in chunks
pacman::p_load(readr, dplyr)

# Summarize each chunk to sums and counts, not means:
# a mean of chunk means is wrong when group sizes differ across chunks
process_chunk <- function(chunk, pos) {
  chunk %>%
    group_by(state, year) %>%
    summarize(total = sum(income), n_obs = n(), .groups = "drop")
}

# Read in chunks of 100,000 rows
chunked <- read_csv_chunked(
  "big_data.csv",
  DataFrameCallback$new(process_chunk),
  chunk_size = 100000
)

# Combine chunk-level sums into exact group means
results <- chunked %>%
  group_by(state, year) %>%
  summarize(mean_income = sum(total) / sum(n_obs), .groups = "drop")

# Or use arrow for out-of-memory data
pacman::p_load(arrow)
ds <- open_dataset("data_folder/")  # Can be larger than RAM
ds %>%
  filter(year >= 2000) %>%
  group_by(state) %>%
  summarize(mean_income = mean(income)) %>%
  collect()  # Only materializes the result

Python:

import pandas as pd

# Read and process in chunks
# Summarize each chunk to sums and counts, not means:
# a mean of chunk means is wrong when group sizes differ across chunks
pieces = []
for chunk in pd.read_csv("big_data.csv", chunksize=100000):
    pieces.append(chunk.groupby(['state', 'year'])['income'].agg(['sum', 'count']))

# Combine chunk-level sums into exact group means
combined = pd.concat(pieces).groupby(level=[0, 1]).sum()
final = combined['sum'] / combined['count']

# Or use dask for out-of-memory data
import dask.dataframe as dd

ddf = dd.read_csv("big_data.csv")  # Lazy - doesn't load into memory
result = (ddf[ddf['year'] >= 2000]
          .groupby('state')['income']
          .mean()
          .compute())  # Only materializes the final result

# Or use pyarrow datasets
import pyarrow.dataset as ds
dataset = ds.dataset("data_folder/", format="parquet")
table = dataset.to_table(filter=ds.field('year') >= 2000)
df = table.to_pandas()

Speed Tips

  • Filter early: Drop unneeded rows and columns before processing
  • Sort before combining: In Stata, sort (or xtset) your data before merging
  • Avoid loops when possible: Vectorized operations are much faster, as the sketch after this list shows
  • Use specialized packages: reghdfe in Stata and fixest in R are optimized for regressions on large data
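
As a rough illustration of the vectorization point, here is a minimal Python sketch (the data and the 500,000 top-code threshold are invented for the example). Both versions top-code income at the same cutoff, but the single vectorized clip call replaces the explicit row-by-row loop and is typically orders of magnitude faster on large data.

import numpy as np
import pandas as pd

# Invented example data: one million simulated incomes
df = pd.DataFrame({'income': np.random.lognormal(10, 1, 1_000_000)})

# Slow: an explicit Python-level loop over every row
top_coded = []
for value in df['income']:
    top_coded.append(min(value, 500_000))

# Fast: one vectorized call does the same work in a fraction of the time
df['income_tc'] = df['income'].clip(upper=500_000)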

Acknowledgements

This tutorial draws on materials and inspiration from many sources.
