Research and Communication in Economics

Stata Session 1

Your first steps with data: setup, organization, exploring, and importing

What You'll Learn

  • How to set up a research project with proper folder structure
  • How to explore a dataset you've never seen before
  • How to select the observations and variables you need
  • How to create new variables from existing ones
  • How to import data from CSV files

Download Complete Scripts

Run the full session code in your preferred language:

  • Stata (.do)
  • R (.R)
  • Python (.py)

A note before we start: Learning to code is like learning a new language: it takes practice, and everyone makes mistakes. If something doesn't work, that's normal! Read error messages carefully, check for typos, and remember that debugging is a core skill, not a sign you're doing something wrong.

Module 1: Introduction to Stata

Imagine you just got a new dataset. Before doing any analysis, you need to understand what you have. How many observations? What variables? What do they look like? This module teaches you how to answer these questions.

Step 1: Setting Up Your Script

Every Stata do-file should start the same way. Think of this as "clearing the stage" before your performance:

🤔 Before running this code, predict: What does clear all do? Why might we want to close any open log file before starting?
* ========================================
* SETUP - Run this at the start of every script
* ========================================

* Step 1: Clear everything from memory
* (This ensures you're starting fresh, not using old data)
clear all

* Step 2: Close any open log file
* (capture means "try this, but don't error if it fails")
capture log close

* Step 3: Don't pause after each screenful of output
set more off

* Step 4: Set your working directory
* IMPORTANT: Change this path to YOUR project folder!
* To find your path: On Mac, right-click your project folder in Finder
* → 'Get Info' → copy the path next to 'Where'.
* On Windows, open the folder in Explorer → click the address bar → copy.
* Use forward slashes (/) in Stata even on Windows.
cd "/Users/yourname/Dropbox/my_project"

* Step 5: Start a log file (records everything you do)
* "replace" overwrites the old log if it exists
log using my_analysis.log, replace
# ========================================
# SETUP - Run this at the start of every script
# ========================================

# Step 1: Clear environment (remove all objects)
rm(list = ls())

# Step 2: Set working directory
# IMPORTANT: Change this path to YOUR project folder!
setwd("/Users/yourname/Dropbox/my_project")

# Step 3: Load required packages
pacman::p_load(tidyverse, haven)
# ========================================
# SETUP - Run this at the start of every script
# ========================================

# Step 1: Import required packages
import pandas as pd
import numpy as np
import os

# Step 2: Set working directory
# IMPORTANT: Change this path to YOUR project folder!
os.chdir("/Users/yourname/Dropbox/my_project")

Why This Matters

Starting fresh (clear all) prevents a common bug: accidentally using variables from a previous session. The log file creates a record of everything you did, which is invaluable when you need to remember your steps months later or when something goes wrong.

Step 2: First Look at Your Data

You've downloaded a dataset. Now what? Let's say it's Census data with information about people. Before diving into analysis, answer these questions:

  • How many observations? (Are there 100 people? 1 million?)
  • What variables exist? (Income? Age? Education?)
  • What do the values look like? (Is age in years? Months? Categories?)
🤔 Predict: If you have a variable called sex with values 1 and 2, what might those numbers mean? How would you find out?
* Load the data
use mydata.dta, clear

* ----- EXPLORATION COMMANDS -----

* browse: Opens a spreadsheet view
* USE THIS FIRST to see what your data looks like
browse

* describe: Shows variable names, types, and labels
* This tells you WHAT variables you have
describe

* summarize: Shows mean, sd, min, max for numeric variables
* This tells you the RANGE of your data
summarize

* summarize specific variables
summarize age income

* summarize with detail: adds percentiles
* Useful for seeing the full distribution
summarize income, detail

* tabulate: Counts observations by category
* ESSENTIAL for categorical variables
tabulate sex

* See the numeric codes behind the labels
tabulate sex, nolabel
# Load the data
df <- read_dta("mydata.dta")

# ----- EXPLORATION COMMANDS -----

# View(): Opens a spreadsheet view
# USE THIS FIRST to see what your data looks like
View(df)

# glimpse(): Shows variable names and types
# This tells you WHAT variables you have
glimpse(df)

# summary(): Shows summary stats for all variables
summary(df)

# Summary for specific variables
summary(df$age)
summary(df$income)

# Detailed summary with percentiles
pacman::p_load(psych)
describe(df$income)

# table(): Counts observations by category
# ESSENTIAL for categorical variables
table(df$sex)

# With percentages
prop.table(table(df$sex))
# Load the data
df = pd.read_stata("mydata.dta")

# ----- EXPLORATION COMMANDS -----

# head(): Shows first few rows
# USE THIS FIRST to see what your data looks like
df.head(10)

# info(): Shows variable names and types
# This tells you WHAT variables you have
df.info()

# describe(): Shows summary stats for numeric variables
df.describe()

# Summary for specific variables
df[['age', 'income']].describe()

# Detailed summary with percentiles
df['income'].describe(percentiles=[.1, .25, .5, .75, .9])

# value_counts(): Counts observations by category
# ESSENTIAL for categorical variables
df['sex'].value_counts()

# With percentages
df['sex'].value_counts(normalize=True)
💡 What the output tells you:
  • describe output shows variable names, types (string vs numeric), and labels (human-readable descriptions)
  • summarize output shows Obs (count of non-missing), Mean, Std. Dev., Min, and Max
  • tabulate output shows how many observations fall into each category

Never Manually Edit Data

Stata's Data Editor lets you click on cells and change values. Never do this. Why? Because there's no record of what you changed. If you need to make changes, do it in code so your work is reproducible.
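If you do spot a value that needs correcting, make the change in code with a comment recording why. A minimal pandas sketch (the id and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1041, 1042], "age": [29, 23]})

# Respondent 1042's age was mistyped at data entry (hypothetical reason).
# Fixing it in code leaves a record; editing the cell by hand would not.
df.loc[df["id"] == 1042, "age"] = 32
```

Anyone rerunning your script gets the same correction, and the comment explains where it came from.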

Step 3: Selecting Your Sample

Your dataset might have millions of observations, but your research question might focus on a specific group. For example, if you're studying women's labor force participation, you might want to:

  • Keep only women (drop men)
  • Keep only working-age adults (ages 25-54)
  • Drop observations with missing income
🤔 Predict: What's the difference between keep if age >= 30 and drop if age < 30? Do they give the same result?
* ----- SELECTING OBSERVATIONS -----

* Keep only women (where sex == 2)
* "==" means "is equal to" (double equals for comparison)
keep if sex == 2

* Check how many observations remain
count

* Keep only ages 25-54
* "&" means AND - both conditions must be true
keep if age >= 25 & age <= 54

* Alternative: drop observations you don't want
* "|" means OR - either condition triggers the drop
drop if age < 25 | age > 54

* ----- SELECTING VARIABLES -----

* Drop variables you don't need
* This makes your dataset smaller and easier to work with
drop year serial pernum

* Keep only specific variables
keep id age income sex education
# ----- SELECTING OBSERVATIONS -----

# Keep only women (where sex == 2)
df <- df %>% filter(sex == 2)

# Check how many observations remain
nrow(df)

# Keep only ages 25-54
df <- df %>% filter(age >= 25 & age <= 54)

# Alternative: drop observations you don't want
df <- df %>% filter(!(age < 25 | age > 54))

# ----- SELECTING VARIABLES -----

# Drop variables you don't need
df <- df %>% select(-year, -serial, -pernum)

# Keep only specific variables
df <- df %>% select(id, age, income, sex, education)
# ----- SELECTING OBSERVATIONS -----

# Keep only women (where sex == 2)
df = df[df['sex'] == 2]

# Check how many observations remain
len(df)

# Keep only ages 25-54
df = df[(df['age'] >= 25) & (df['age'] <= 54)]

# Alternative: drop observations you don't want
df = df[~((df['age'] < 25) | (df['age'] > 54))]

# ----- SELECTING VARIABLES -----

# Drop variables you don't need
df = df.drop(columns=['year', 'serial', 'pernum'])

# Keep only specific variables
df = df[['id', 'age', 'income', 'sex', 'education']]

Check Your Work!

After every keep or drop, run count (Stata), nrow(df) (R), or len(df) (Python) to verify you have the expected number of observations. A common mistake is accidentally dropping everything!
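One subtlety behind the keep/drop prediction earlier: the two forms can disagree once missing values enter. In Stata, missing counts as larger than any number, so both keep if age >= 30 and drop if age < 30 retain missing ages; in pandas, NaN fails every comparison, so the two filters differ. A quick sketch with toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, 30, np.nan, 47]})

# NaN >= 30 is False, so the missing-age row is dropped
kept = df[df["age"] >= 30]

# NaN < 30 is also False, so negating keeps the missing-age row
not_dropped = df[~(df["age"] < 30)]

len(kept)         # 2
len(not_dropped)  # 3
```

Whichever form you use, decide explicitly what should happen to missing values, and verify with a row count.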

Step 4: Creating New Variables

Raw data rarely has exactly the variables you need. You'll often need to create:

  • Indicator variables (0/1): Is this person married? Employed? Over 65?
  • Categorical variables: Group ages into bins (18-24, 25-34, 35-44, etc.)
  • Transformed variables: Log of income, income in thousands, etc.
🤔 Predict: In Stata, what's the difference between gen married = 1 if marst == 1 and gen married = (marst == 1)? What happens to observations where marst != 1 in each case?
⚠️ Critical Warning About Indicator Variables: If you write gen married = 1 if marst == 1, Stata creates married = 1 for married people, but married = missing (not 0!) for everyone else. This silently breaks your regressions because observations with missing values get dropped. Always use the two-step method with replace, or the safer one-step method shown below.
* ----- INDICATOR VARIABLES (0/1) -----

* Method 1: One step without replace (WRONG - creates missing values!)
gen married = 1 if marst == 1
* Problem: married is MISSING (not 0) when marst != 1
tab married, missing  // You'll see missing values
drop married  // remove the bad version before trying again

* Method 2: Two steps (CORRECT)
gen married = 1 if marst == 1
replace married = 0 if marst != 1
* Now married is 0 or 1 for everyone
drop married  // remove it again so the next method can re-create it

* Method 3: One step (BEST - cleaner code)
gen married = (marst == 1)
* The expression (marst == 1) evaluates to 1 if true, 0 if false

* ----- CATEGORICAL VARIABLES -----

* Group number of children: 0, 1, 2, 3, 4+
gen kids_cat = nchild if nchild <= 4
replace kids_cat = 4 if nchild > 4 & !missing(nchild)
* (!missing() guards the recode: Stata treats missing as larger than any number)

* Create dummy variables from categorical
* This creates Kids_1 through Kids_5, one per value of kids_cat
* (Kids_1 flags kids_cat == 0, Kids_2 flags kids_cat == 1, and so on)
tab kids_cat, gen(Kids_)

* ----- TRANSFORMED VARIABLES -----

* Log of income (useful for skewed distributions)
* Add 1 to handle zeros (ln(0) is undefined)
gen ln_income = ln(income + 1)

* Income in thousands
gen income_k = income / 1000
# ----- INDICATOR VARIABLES (0/1) -----

# Method 1: Using ifelse (explicit)
df <- df %>% mutate(married = ifelse(marst == 1, 1, 0))

# Method 2: Using as.integer (cleaner)
df <- df %>% mutate(married = as.integer(marst == 1))

# ----- CATEGORICAL VARIABLES -----

# Group number of children: 0, 1, 2, 3, 4+
df <- df %>% mutate(
  kids_cat = case_when(
    nchild <= 4 ~ nchild,
    nchild > 4 ~ 4
  )
)

# Create dummy variables from categorical
df <- df %>% mutate(
  Kids_0 = as.integer(kids_cat == 0),
  Kids_1 = as.integer(kids_cat == 1),
  Kids_2 = as.integer(kids_cat == 2),
  Kids_3 = as.integer(kids_cat == 3),
  Kids_4 = as.integer(kids_cat == 4)
)

# ----- TRANSFORMED VARIABLES -----

# Log of income
df <- df %>% mutate(ln_income = log(income + 1))

# Income in thousands
df <- df %>% mutate(income_k = income / 1000)
# ----- INDICATOR VARIABLES (0/1) -----

# Simple boolean comparison (returns True/False, converted to 1/0)
df['married'] = (df['marst'] == 1).astype(int)

# ----- CATEGORICAL VARIABLES -----

# Group number of children: 0, 1, 2, 3, 4+
df['kids_cat'] = df['nchild'].clip(upper=4)

# Create dummy variables from categorical
dummies = pd.get_dummies(df['kids_cat'], prefix='Kids')
df = pd.concat([df, dummies], axis=1)

# ----- TRANSFORMED VARIABLES -----

# Log of income
df['ln_income'] = np.log(df['income'] + 1)

# Income in thousands
df['income_k'] = df['income'] / 1000
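One caveat on the pandas indicator: it has the mirror image of Stata's missing-value trap. A comparison against NaN is False, so (df['marst'] == 1) silently codes a missing marital status as 0. If your data can contain missings, blank those rows out explicitly (toy data below; marst values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"marst": [1, 2, np.nan]})

# NaN == 1 evaluates to False, so the missing row would become 0, not missing.
# Use float so the column can hold NaN after we fix this.
df["married"] = (df["marst"] == 1).astype(float)

# Restore missingness where marst itself is missing
df.loc[df["marst"].isna(), "married"] = np.nan
```

After the fix, married is 1/0 only where marst is observed, matching Stata's two-step recode.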

Language Comparison

Task               Stata              R (tidyverse)      Python (pandas)
Create variable    gen x = ...        mutate(x = ...)    df['x'] = ...
Modify variable    replace x = ...    mutate(x = ...)    df['x'] = ...

Module 2: Project Organization

Before we dive into loading data, let's set up a proper project structure. A well-organized project saves hours of confusion later and makes your work reproducible.

Why organization matters NOW: It's much easier to start with good habits than to reorganize a messy project later. The 15 minutes you spend setting this up will save you hours of debugging and searching for files.

The Standard Structure

my_project/
├── master.do              # Runs everything (or master.R / main.py)
├── README.md
│
├── build/                 # Data preparation
│   ├── input/             # Raw data (NEVER edit these!)
│   ├── code/              # Scripts to clean data
│   └── output/            # Cleaned data
│
├── analysis/              # Your analysis
│   ├── code/              # Regression scripts, etc.
│   └── output/
│       ├── tables/
│       └── figures/
│
└── paper/                 # Your writeup
    └── draft.tex

The Golden Rule: Raw Data is READ-ONLY

Never modify files in your input/ folder. If you need to make changes, write code that reads the raw data and saves a cleaned version to output/. This way you can always reproduce your work from scratch.
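In code, the rule means every cleaning script reads from input/ and writes to output/, never back to input/. A minimal pandas sketch of the pattern (file names are made up; the first lines just fabricate a stand-in raw file so the example runs end to end):

```python
import os
import pandas as pd

# Stand-in for a downloaded raw file (illustration only)
os.makedirs("build/input", exist_ok=True)
os.makedirs("build/output", exist_ok=True)
pd.DataFrame({"id": [1, 2, 3],
              "income": [50000, None, 42000]}).to_csv(
    "build/input/census_raw.csv", index=False)

# The actual pattern: read raw, clean, save elsewhere.
# The file in build/input/ is never modified.
raw = pd.read_csv("build/input/census_raw.csv")
clean = raw.dropna(subset=["income"])
clean.to_csv("build/output/census_clean.csv", index=False)
```

Because the raw file is untouched, deleting everything in build/output/ and rerunning the script reproduces the cleaned data exactly.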

The Master Script

Your master.do file should run your entire project from start to finish. Anyone should be able to run it and reproduce all your results.

/*==============================================================================
    Master Do-File: My Research Project
    Author: Your Name
    Date: February 2026
==============================================================================*/

clear all
set more off

* Set the project root (CHANGE THIS!)
global root "/Users/yourname/Dropbox/my_project"

* Define paths
global build    "$root/build"
global analysis "$root/analysis"

* Run the build scripts
do "$build/code/01_clean_census.do"
do "$build/code/02_clean_policy.do"
do "$build/code/03_merge.do"

* Run the analysis
do "$analysis/code/01_summary_stats.do"
do "$analysis/code/02_main_regression.do"
do "$analysis/code/03_robustness.do"
#==============================================================================
#    Master Script: My Research Project
#    Author: Your Name
#    Date: February 2026
#==============================================================================

rm(list = ls())

# Set the project root (CHANGE THIS!)
root <- "/Users/yourname/Dropbox/my_project"

# Define paths
build <- file.path(root, "build")
analysis <- file.path(root, "analysis")

# Run the build scripts
source(file.path(build, "code", "01_clean_census.R"))
source(file.path(build, "code", "02_clean_policy.R"))
source(file.path(build, "code", "03_merge.R"))

# Run the analysis
source(file.path(analysis, "code", "01_summary_stats.R"))
source(file.path(analysis, "code", "02_main_regression.R"))
source(file.path(analysis, "code", "03_robustness.R"))
"""
Master Script: My Research Project
Author: Your Name
Date: February 2026
"""

import os
import subprocess
import sys

# Set the project root (CHANGE THIS!)
ROOT = "/Users/yourname/Dropbox/my_project"

# Define paths
BUILD = os.path.join(ROOT, "build")
ANALYSIS = os.path.join(ROOT, "analysis")

# Run each script in its own process; check=True stops the run if one fails
def run_script(path):
    subprocess.run([sys.executable, path], check=True)

# Run the build scripts
run_script(os.path.join(BUILD, "code", "01_clean_census.py"))
run_script(os.path.join(BUILD, "code", "02_clean_policy.py"))
run_script(os.path.join(BUILD, "code", "03_merge.py"))

# Run the analysis
run_script(os.path.join(ANALYSIS, "code", "01_summary_stats.py"))
run_script(os.path.join(ANALYSIS, "code", "02_main_regression.py"))
run_script(os.path.join(ANALYSIS, "code", "03_robustness.py"))

Number Your Scripts

Prefix scripts with numbers (01_, 02_, etc.) so it's clear what order they run in. This also keeps them sorted correctly in your file browser.

Quick Check: Project Organization

Question: You download census data from the web. Where should you save it?

Module 3: Reading in Data

Real-world data comes in many formats: CSV files from websites, Excel spreadsheets from collaborators, Stata files from data archives. This module teaches you how to get data into your software.

Importing CSV Files

CSV (Comma-Separated Values) is the most common format for sharing data. It's just a text file where values are separated by commas.

💡 What's in a CSV file? Open one in a text editor (not Excel) and you'll see something like:
name,age,income
Alice,32,50000
Bob,45,75000
Carol,28,42000
The first row is usually variable names. Each subsequent row is one observation.
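You can see this structure directly with Python's standard csv module. Note that everything, including the numbers, arrives as text, which is why imported columns sometimes need a numeric conversion afterwards:

```python
import csv
import io

# The same tiny file as above, as a string
raw = "name,age,income\nAlice,32,50000\nBob,45,75000\nCarol,28,42000\n"

# DictReader uses the first row as field names, one dict per observation
rows = list(csv.DictReader(io.StringIO(raw)))
rows[0]  # {'name': 'Alice', 'age': '32', 'income': '50000'} - all strings
```

Import tools like Stata's import delimited or pandas' read_csv do this parsing for you and then guess which columns are numeric.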
* Import CSV with first row as variable names
import delimited "mydata.csv", varnames(1) clear

* Immediately check what you got
browse       // Look at the data
describe     // Check variable types
summarize    // Check value ranges

* Common issue: numbers imported as strings
* WHY? If your CSV has a comma in any number (like '1,234'), Stata reads
* the entire column as text. Same if there's any non-numeric character
* ('$50', 'N/A'). You'll know this happened if `summarize income` says
* 'no observations' or shows nothing.
* Fix by destringing (converting to numeric)
destring income, replace
# Import CSV (read_csv() comes from readr, loaded with tidyverse)
df <- read_csv("mydata.csv")

# Immediately check what you got
glimpse(df)    # Check variable types
summary(df)    # Check value ranges

# Note: read_csv() detects column types automatically
# and is faster than base R's read.csv()
# Import CSV
df = pd.read_csv("mydata.csv")

# Immediately check what you got
df.head()     # Look at first rows
df.info()     # Check variable types
df.describe() # Check value ranges

# Note: pandas usually detects types correctly
# You can specify types if needed:
# df = pd.read_csv("mydata.csv", dtype={'id': str})
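The "numbers imported as strings" problem from the Stata panel has a pandas counterpart. If the file writes 1,234 with a thousands separator, pandas reads the column as text; the thousands argument to read_csv fixes it at import (toy data below):

```python
import io
import pandas as pd

# Incomes written with thousands separators, quoted so the comma isn't a delimiter
csv_text = 'name,income\nAlice,"1,234"\nBob,"50,000"\n'

bad = pd.read_csv(io.StringIO(csv_text))                  # income comes in as text
good = pd.read_csv(io.StringIO(csv_text), thousands=",")  # income parsed as numbers

bad["income"].dtype   # object (strings)
good["income"].dtype  # int64
```

As in Stata, the tell-tale symptom is a numeric-looking column that describe() refuses to summarize.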

Importing Stata Files

Many economics datasets are distributed as Stata .dta files. These are convenient because they preserve variable labels and value labels.

* Open Stata file
use "mydata.dta", clear

* Open specific variables only (saves memory for large files)
use age income education using "mydata.dta", clear

* Open only observations meeting a condition
use if age >= 25 using "mydata.dta", clear
# Need the haven package
pacman::p_load(haven)

# Read Stata file
df <- read_dta("mydata.dta")

# Note: haven preserves variable labels
# Access them with:
attr(df$income, "label")
# pandas can read Stata files directly
df = pd.read_stata("mydata.dta")

# Note: read_stata does not attach variable labels to the DataFrame.
# To see them, open the file as a StataReader instead:
with pd.read_stata("mydata.dta", iterator=True) as reader:
    labels = reader.variable_labels()  # dict mapping variable name -> label

Saving Your Data

After cleaning and creating variables, save your work so you don't have to redo it.

* Save as Stata file
save "mydata_clean.dta", replace

* Export as CSV
export delimited "mydata_clean.csv", replace
# Save as RDS (R's native format)
saveRDS(df, "mydata_clean.rds")

# Save as Stata file
write_dta(df, "mydata_clean.dta")

# Save as CSV
write_csv(df, "mydata_clean.csv")
# Save as pickle (Python's native format)
df.to_pickle("mydata_clean.pkl")

# Save as Stata file (write_index=False keeps the pandas index out of the file)
df.to_stata("mydata_clean.dta", write_index=False)

# Save as CSV
df.to_csv("mydata_clean.csv", index=False)

Practice Quiz

Test your understanding with these questions:

Q1: In Stata, what's wrong with this code?

gen employed = 1 if empstat == 1
Answer:

This creates employed = 1 for employed people, but employed = missing for everyone else (not 0!). You need either:
gen employed = 1 if empstat == 1
replace employed = 0 if empstat != 1
Or: gen employed = (empstat == 1)

Q2: You download census data from the web. Where should you save it in your project folder?

Answer:

build/input/. Raw data goes in the input folder. Never modify files in this folder; if you need to make changes, write code that reads the raw data and saves a cleaned version to build/output/.

Q3: You import a CSV and the income variable shows up as a string instead of a number. How do you fix this in Stata?

Answer:

Use destring income, replace to convert the string to a numeric variable. This is a common issue when CSVs have numbers formatted with commas or other text characters.

Q4: You run keep if age >= 18 but then count shows 0 observations. What probably happened?

Answer:

Most likely, age is stored as a string (text), not a number. The comparison age >= 18 doesn't work as expected on strings. Use destring age, replace first to convert it to numeric.

Found something unclear or have a suggestion? Email [email protected].