Stata Session 1
Your first steps with data: setup, organization, exploring, and importing
What You'll Learn
- How to set up a research project with proper folder structure
- How to explore a dataset you've never seen before
- How to select the observations and variables you need
- How to create new variables from existing ones
- How to import data from CSV files
Module 1: Introduction to Stata
Imagine you just got a new dataset. Before doing any analysis, you need to understand what you have. How many observations? What variables? What do they look like? This module teaches you how to answer these questions.
Step 1: Setting Up Your Script
Every Stata do-file should start the same way. Think of this as "clearing the stage" before your performance:
Think About It: What does clear all do? Why might we want to close any open log file before starting?
* ========================================
* SETUP - Run this at the start of every script
* ========================================
* Step 1: Clear everything from memory
* (This ensures you're starting fresh, not using old data)
clear all
* Step 2: Close any open log file
* (capture means "try this, but don't error if it fails")
capture log close
* Step 3: Don't pause after each screenful of output
set more off
* Step 4: Set your working directory
* IMPORTANT: Change this path to YOUR project folder!
* To find your path: On Mac, right-click your project folder in Finder
* → 'Get Info' → copy the path next to 'Where'.
* On Windows, open the folder in Explorer → click the address bar → copy.
* Use forward slashes (/) in Stata even on Windows.
cd "/Users/yourname/Dropbox/my_project"
* Step 5: Start a log file (records everything you do)
* "replace" overwrites the old log if it exists
log using my_analysis.log, replace
# ========================================
# SETUP - Run this at the start of every script
# ========================================
# Step 1: Clear environment (remove all objects)
rm(list = ls())
# Step 2: Set working directory
# IMPORTANT: Change this path to YOUR project folder!
setwd("/Users/yourname/Dropbox/my_project")
# Step 3: Load required packages
pacman::p_load(tidyverse, haven)
# ========================================
# SETUP - Run this at the start of every script
# ========================================
# Step 1: Import required packages
import pandas as pd
import numpy as np
import os
# Step 2: Set working directory
# IMPORTANT: Change this path to YOUR project folder!
os.chdir("/Users/yourname/Dropbox/my_project")
Why This Matters
Starting fresh (clear all) prevents a common bug: accidentally using variables from a previous session. The log file creates a record of everything you did, which is invaluable when you need to remember your steps months later or when something goes wrong.
Step 2: First Look at Your Data
You've downloaded a dataset. Now what? Let's say it's Census data with information about people. Before diving into analysis, answer these questions:
- How many observations? (Are there 100 people? 1 million?)
- What variables exist? (Income? Age? Education?)
- What do the values look like? (Is age in years? Months? Categories?)
Think About It: If you see a variable sex with values 1 and 2, what might those numbers mean? How would you find out?
* Load the data
use mydata.dta, clear
* ----- EXPLORATION COMMANDS -----
* browse: Opens a spreadsheet view
* USE THIS FIRST to see what your data looks like
browse
* describe: Shows variable names, types, and labels
* This tells you WHAT variables you have
describe
* summarize: Shows mean, sd, min, max for numeric variables
* This tells you the RANGE of your data
summarize
* summarize specific variables
summarize age income
* summarize with detail: adds percentiles
* Useful for seeing the full distribution
summarize income, detail
* tabulate: Counts observations by category
* ESSENTIAL for categorical variables
tabulate sex
* See the numeric codes behind the labels
tabulate sex, nolabel
# Load the data
df <- read_dta("mydata.dta")
# ----- EXPLORATION COMMANDS -----
# View(): Opens a spreadsheet view
# USE THIS FIRST to see what your data looks like
View(df)
# glimpse(): Shows variable names and types
# This tells you WHAT variables you have
glimpse(df)
# summary(): Shows summary stats for all variables
summary(df)
# Summary for specific variables
summary(df$age)
summary(df$income)
# Detailed summary with percentiles
pacman::p_load(psych)
describe(df$income)
# table(): Counts observations by category
# ESSENTIAL for categorical variables
table(df$sex)
# With percentages
prop.table(table(df$sex))
# Load the data
df = pd.read_stata("mydata.dta")
# ----- EXPLORATION COMMANDS -----
# head(): Shows first few rows
# USE THIS FIRST to see what your data looks like
df.head(10)
# info(): Shows variable names and types
# This tells you WHAT variables you have
df.info()
# describe(): Shows summary stats for numeric variables
df.describe()
# Summary for specific variables
df[['age', 'income']].describe()
# Detailed summary with percentiles
df['income'].describe(percentiles=[.1, .25, .5, .75, .9])
# value_counts(): Counts observations by category
# ESSENTIAL for categorical variables
df['sex'].value_counts()
# With percentages
df['sex'].value_counts(normalize=True)
- describe output shows variable names, types (string vs numeric), and labels (human-readable descriptions)
- summarize output shows Obs (count of non-missing), Mean, Std. Dev., Min, and Max
- tabulate output shows how many observations fall into each category
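If your file ships only numeric codes (a bare CSV instead of a labeled .dta), you can attach readable labels yourself in pandas. A minimal sketch with made-up data, assuming a hypothetical codebook where 1 = male and 2 = female:

```python
import pandas as pd

# Toy data standing in for a census extract
# (the coding is an assumption: suppose the codebook says 1 = male, 2 = female)
df = pd.DataFrame({'sex': [1, 2, 2, 1, 2]})

# Map the numeric codes to readable labels, then tabulate
sex_labels = {1: 'male', 2: 'female'}
counts = df['sex'].map(sex_labels).value_counts()
print(counts)
```

The codes stay untouched in the data; the mapping lives in your code, where it is documented and reproducible.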
Never Manually Edit Data
Stata's Data Editor lets you click on cells and change values. Never do this. Why? Because there's no record of what you changed. If you need to make changes, do it in code so your work is reproducible.
Step 3: Selecting Your Sample
Your dataset might have millions of observations, but your research question might focus on a specific group. For example, if you're studying women's labor force participation, you might want to:
- Keep only women (drop men)
- Keep only working-age adults (ages 25-54)
- Drop observations with missing income
Think About It: What's the difference between keep if age >= 30 and drop if age < 30? Do they give the same result?
* ----- SELECTING OBSERVATIONS -----
* Keep only women (where sex == 2)
* "==" means "is equal to" (double equals for comparison)
keep if sex == 2
* Check how many observations remain
count
* Keep only ages 25-54
* "&" means AND - both conditions must be true
keep if age >= 25 & age <= 54
* Alternative: drop observations you don't want
* "|" means OR - either condition triggers the drop
drop if age < 25 | age > 54
* ----- SELECTING VARIABLES -----
* Drop variables you don't need
* This makes your dataset smaller and easier to work with
drop year serial pernum
* Keep only specific variables
keep id age income sex education
# ----- SELECTING OBSERVATIONS -----
# Keep only women (where sex == 2)
df <- df %>% filter(sex == 2)
# Check how many observations remain
nrow(df)
# Keep only ages 25-54
df <- df %>% filter(age >= 25 & age <= 54)
# Alternative: drop observations you don't want
df <- df %>% filter(!(age < 25 | age > 54))
# ----- SELECTING VARIABLES -----
# Drop variables you don't need
df <- df %>% select(-year, -serial, -pernum)
# Keep only specific variables
df <- df %>% select(id, age, income, sex, education)
# ----- SELECTING OBSERVATIONS -----
# Keep only women (where sex == 2)
df = df[df['sex'] == 2]
# Check how many observations remain
len(df)
# Keep only ages 25-54
df = df[(df['age'] >= 25) & (df['age'] <= 54)]
# Alternative: drop observations you don't want
df = df[~((df['age'] < 25) | (df['age'] > 54))]
# ----- SELECTING VARIABLES -----
# Drop variables you don't need
df = df.drop(columns=['year', 'serial', 'pernum'])
# Keep only specific variables
df = df[['id', 'age', 'income', 'sex', 'education']]
Check Your Work!
After every keep or drop, run count (Stata) or nrow(df) (R) or len(df) (Python) to verify you have the expected number of observations. A common mistake is accidentally dropping everything!
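That check can even be automated. A small sketch in Python, using made-up data, that stops the script if a filter accidentally empties the sample:

```python
import pandas as pd

# Made-up data: four people, one too young and one too old for the sample
df = pd.DataFrame({'age': [17, 25, 40, 60], 'sex': [2, 2, 1, 2]})

n_before = len(df)
df = df[(df['age'] >= 25) & (df['age'] <= 54)]
print(f"kept {len(df)} of {n_before} observations")

# Fail loudly if a typo in the condition wiped out the whole sample
assert len(df) > 0, "all observations dropped -- check your filter!"
```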
Step 4: Creating New Variables
Raw data rarely has exactly the variables you need. You'll often need to create:
- Indicator variables (0/1): Is this person married? Employed? Over 65?
- Categorical variables: Group ages into bins (18-24, 25-34, 35-44, etc.)
- Transformed variables: Log of income, income in thousands, etc.
Think About It: What's the difference between gen married = 1 if marst == 1 and gen married = (marst == 1)? What happens to observations where marst != 1 in each case?
Warning: With gen married = 1 if marst == 1, Stata creates married = 1 for married people, but married = missing (not 0!) for everyone else. This silently breaks your regressions because observations with missing values get dropped. Always use the two-step method with replace, or the safer one-step method shown below.
* ----- INDICATOR VARIABLES (0/1) -----
* Method 1: One step, no replace (WRONG - creates missing values!)
gen married = 1 if marst == 1
* Problem: married is MISSING (not 0) when marst != 1
tab married, missing // You'll see missing values
* Method 2: Two steps (CORRECT)
gen married = 1 if marst == 1
replace married = 0 if marst != 1
* Now married is 0 or 1 for everyone
* Method 3: One step (BEST - cleaner code)
gen married = (marst == 1)
* The expression (marst == 1) evaluates to 1 if true, 0 if false
* ----- CATEGORICAL VARIABLES -----
* Group number of children: 0, 1, 2, 3, 4+
gen kids_cat = nchild if nchild <= 4
replace kids_cat = 4 if nchild > 4
* Create dummy variables from categorical
* This creates Kids_1, Kids_2, Kids_3, Kids_4, Kids_5
tab kids_cat, gen(Kids_)
* ----- TRANSFORMED VARIABLES -----
* Log of income (useful for skewed distributions)
* Add 1 to handle zeros (ln(0) is undefined)
gen ln_income = ln(income + 1)
* Income in thousands
gen income_k = income / 1000
# ----- INDICATOR VARIABLES (0/1) -----
# Method 1: Using ifelse (explicit)
df <- df %>% mutate(married = ifelse(marst == 1, 1, 0))
# Method 2: Using as.integer (cleaner)
df <- df %>% mutate(married = as.integer(marst == 1))
# ----- CATEGORICAL VARIABLES -----
# Group number of children: 0, 1, 2, 3, 4+
df <- df %>% mutate(
kids_cat = case_when(
nchild <= 4 ~ nchild,
nchild > 4 ~ 4
)
)
# Create dummy variables from categorical
df <- df %>% mutate(
Kids_0 = as.integer(kids_cat == 0),
Kids_1 = as.integer(kids_cat == 1),
Kids_2 = as.integer(kids_cat == 2),
Kids_3 = as.integer(kids_cat == 3),
Kids_4 = as.integer(kids_cat == 4)
)
# ----- TRANSFORMED VARIABLES -----
# Log of income
df <- df %>% mutate(ln_income = log(income + 1))
# Income in thousands
df <- df %>% mutate(income_k = income / 1000)
# ----- INDICATOR VARIABLES (0/1) -----
# Simple boolean comparison (returns True/False, converted to 1/0)
df['married'] = (df['marst'] == 1).astype(int)
# ----- CATEGORICAL VARIABLES -----
# Group number of children: 0, 1, 2, 3, 4+
df['kids_cat'] = df['nchild'].clip(upper=4)
# Create dummy variables from categorical
dummies = pd.get_dummies(df['kids_cat'], prefix='Kids')
df = pd.concat([df, dummies], axis=1)
# ----- TRANSFORMED VARIABLES -----
# Log of income
df['ln_income'] = np.log(df['income'] + 1)
# Income in thousands
df['income_k'] = df['income'] / 1000
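One aside on the log transform above: NumPy also has log1p, which computes log(x + 1) directly and is more accurate when income is very close to zero. A quick sketch confirming the two forms agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [0.0, 1000.0, 50000.0]})

ln_a = np.log(df['income'] + 1)   # the explicit version used above
ln_b = np.log1p(df['income'])     # equivalent, numerically safer near zero

assert np.allclose(ln_a, ln_b)
print(ln_b.round(2).tolist())
```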
Language Comparison
| Task | Stata | R (tidyverse) | Python (pandas) |
|---|---|---|---|
| Create variable | gen x = ... | mutate(x = ...) | df['x'] = ... |
| Modify variable | replace x = ... | mutate(x = ...) | df['x'] = ... |
Module 2: Project Organization
Before we dive into loading data, let's set up a proper project structure. A well-organized project saves hours of confusion later and makes your work reproducible.
The Standard Structure
my_project/
├── master.do          # Runs everything (or master.R / main.py)
├── README.md
│
├── build/             # Data preparation
│   ├── input/         # Raw data (NEVER edit these!)
│   ├── code/          # Scripts to clean data
│   └── output/        # Cleaned data
│
├── analysis/          # Your analysis
│   ├── code/          # Regression scripts, etc.
│   └── output/
│       ├── tables/
│       └── figures/
│
└── paper/             # Your writeup
    └── draft.tex
The Golden Rule: Raw Data is READ-ONLY
Never modify files in your input/ folder. If you need to make changes, write code that reads the raw data and saves a cleaned version to output/. This way you can always reproduce your work from scratch.
The Master Script
Your master.do file should run your entire project from start to finish. Anyone should be able to run it and reproduce all your results.
/*==============================================================================
Master Do-File: My Research Project
Author: Your Name
Date: February 2026
==============================================================================*/
clear all
set more off
* Set the project root (CHANGE THIS!)
global root "/Users/yourname/Dropbox/my_project"
* Define paths
global build "$root/build"
global analysis "$root/analysis"
* Run the build scripts
do "$build/code/01_clean_census.do"
do "$build/code/02_clean_policy.do"
do "$build/code/03_merge.do"
* Run the analysis
do "$analysis/code/01_summary_stats.do"
do "$analysis/code/02_main_regression.do"
do "$analysis/code/03_robustness.do"
#==============================================================================
# Master Script: My Research Project
# Author: Your Name
# Date: February 2026
#==============================================================================
rm(list = ls())
# Set the project root (CHANGE THIS!)
root <- "/Users/yourname/Dropbox/my_project"
# Define paths
build <- file.path(root, "build")
analysis <- file.path(root, "analysis")
# Run the build scripts
source(file.path(build, "code", "01_clean_census.R"))
source(file.path(build, "code", "02_clean_policy.R"))
source(file.path(build, "code", "03_merge.R"))
# Run the analysis
source(file.path(analysis, "code", "01_summary_stats.R"))
source(file.path(analysis, "code", "02_main_regression.R"))
source(file.path(analysis, "code", "03_robustness.R"))
"""
Master Script: My Research Project
Author: Your Name
Date: February 2026
"""
import os
import subprocess
import sys
# Set the project root (CHANGE THIS!)
ROOT = "/Users/yourname/Dropbox/my_project"
# Define paths
BUILD = os.path.join(ROOT, "build")
ANALYSIS = os.path.join(ROOT, "analysis")
# Run the build scripts
# (each script runs in its own fresh interpreter, mirroring Stata's do)
subprocess.run([sys.executable, os.path.join(BUILD, "code", "01_clean_census.py")], check=True)
subprocess.run([sys.executable, os.path.join(BUILD, "code", "02_clean_policy.py")], check=True)
subprocess.run([sys.executable, os.path.join(BUILD, "code", "03_merge.py")], check=True)
# Run the analysis
subprocess.run([sys.executable, os.path.join(ANALYSIS, "code", "01_summary_stats.py")], check=True)
subprocess.run([sys.executable, os.path.join(ANALYSIS, "code", "02_main_regression.py")], check=True)
subprocess.run([sys.executable, os.path.join(ANALYSIS, "code", "03_robustness.py")], check=True)
Number Your Scripts
Prefix scripts with numbers (01_, 02_, etc.) so it's clear what order they run in. This also keeps them sorted correctly in your file browser.
Quick Check: Project Organization
Question: You download census data from the web. Where should you save it?
Module 3: Reading in Data
Real-world data comes in many formats: CSV files from websites, Excel spreadsheets from collaborators, Stata files from data archives. This module teaches you how to get data into your software.
Importing CSV Files
CSV (Comma-Separated Values) is the most common format for sharing data. It's just a text file where values are separated by commas.
name,age,income
Alice,32,50000
Bob,45,75000
Carol,28,42000
The first row is usually variable names. Each subsequent row is one observation.
* Import CSV with first row as variable names
import delimited "mydata.csv", varnames(1) clear
* Immediately check what you got
browse // Look at the data
describe // Check variable types
summarize // Check value ranges
* Common issue: numbers imported as strings
* WHY? If your CSV has a comma in any number (like '1,234'), Stata reads
* the entire column as text. Same if there's any non-numeric character
* ('$50', 'N/A'). You'll know this happened if `summarize income` says
* 'no observations' or shows nothing.
* Fix by destringing (converting to numeric)
destring income, replace
# Import CSV (read_csv from tidyverse is best)
df <- read_csv("mydata.csv")
# Immediately check what you got
glimpse(df) # Check variable types
summary(df) # Check value ranges
# Note: read_csv automatically detects types
# It's smarter than base R's read.csv()
# Import CSV
df = pd.read_csv("mydata.csv")
# Immediately check what you got
df.head() # Look at first rows
df.info() # Check variable types
df.describe() # Check value ranges
# Note: pandas usually detects types correctly
# You can specify types if needed:
# df = pd.read_csv("mydata.csv", dtype={'id': str})
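The numbers-imported-as-strings problem described in the Stata block has a pandas counterpart. A sketch with made-up messy values, using str.replace plus to_numeric (the rough analogue of destring):

```python
import pandas as pd

# Simulated messy CSV column: thousands separators and an 'N/A'
# force pandas to read the whole column as strings
df = pd.DataFrame({'income': ['1,234', '50,000', 'N/A']})

# Strip the commas, then coerce to numeric; anything that still
# isn't a number (like 'N/A') becomes NaN instead of raising an error
df['income'] = pd.to_numeric(
    df['income'].str.replace(',', '', regex=False),
    errors='coerce',
)
print(df['income'].tolist())
```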
Importing Stata Files
Many economics datasets are distributed as Stata .dta files. These are convenient because they preserve variable labels and value labels.
* Open Stata file
use "mydata.dta", clear
* Open specific variables only (saves memory for large files)
use age income education using "mydata.dta", clear
* Open only observations meeting a condition
use if age >= 25 using "mydata.dta", clear
# Need the haven package
pacman::p_load(haven)
# Read Stata file
df <- read_dta("mydata.dta")
# Note: haven preserves variable labels
# Access them with:
attr(df$income, "label")
# pandas can read Stata files directly
df = pd.read_stata("mydata.dta")
# Note: pandas does NOT attach Stata's variable labels to the DataFrame.
# If you need them, read the file through a StataReader:
# with pd.io.stata.StataReader("mydata.dta") as reader:
#     labels = reader.variable_labels()
df.columns.to_list() # Variable names
Saving Your Data
After cleaning and creating variables, save your work so you don't have to redo it.
* Save as Stata file
save "mydata_clean.dta", replace
* Export as CSV
export delimited "mydata_clean.csv", replace
# Save as RDS (R's native format)
saveRDS(df, "mydata_clean.rds")
# Save as Stata file
write_dta(df, "mydata_clean.dta")
# Save as CSV
write_csv(df, "mydata_clean.csv")
# Save as pickle (Python's native format)
df.to_pickle("mydata_clean.pkl")
# Save as Stata file (write_index=False keeps the row index out of the file)
df.to_stata("mydata_clean.dta", write_index=False)
# Save as CSV
df.to_csv("mydata_clean.csv", index=False)
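A cheap habit after saving: read the file straight back in and confirm the round trip didn't change anything. A sketch using a temporary folder and made-up data:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'income': [50000, 42000]})

# Write to a temp file, read it back, and compare
path = os.path.join(tempfile.mkdtemp(), "mydata_clean.csv")
df.to_csv(path, index=False)
df2 = pd.read_csv(path)

assert df.equals(df2), "round trip changed the data"
print("round trip OK")
```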
Practice Quiz
Test your understanding with these questions:
Q1: In Stata, what's wrong with this code?
gen employed = 1 if empstat == 1
Answer:
This creates employed = 1 for employed people, but employed = missing for everyone else (not 0!). You need either:
gen employed = 1 if empstat == 1
replace employed = 0 if empstat != 1
Or: gen employed = (empstat == 1)
Q2: You download census data from the web. Where should you save it in your project folder?
Answer:
build/input/. Raw data goes in the input folder. Never modify files in this folderβif you need to make changes, write code that reads the raw data and saves a cleaned version to build/output/.
Q3: You import a CSV and the income variable shows up as a string instead of a number. How do you fix this in Stata?
Answer:
Use destring income, replace to convert the string to a numeric variable. This is a common issue when CSVs have numbers formatted with commas or other text characters.
Q4: You run keep if age >= 18 but then count shows 0 observations. What probably happened?
Answer:
Most likely, age is stored as a string (text), not a number. The comparison age >= 18 doesn't work as expected on strings. Use destring age, replace first to convert it to numeric.
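The failure mode in Q4 is easy to reproduce in pandas, where the symptom is an error rather than an empty sample. A sketch with made-up data:

```python
import pandas as pd

# age accidentally imported as text
df = pd.DataFrame({'age': ['9', '18', '40']})

# Comparing text to a number fails outright in pandas
try:
    df[df['age'] >= 18]
except TypeError:
    print("comparison failed: age is stored as strings")

# Convert to numeric first (the pandas analogue of Stata's destring)
df['age'] = pd.to_numeric(df['age'])
adults = df[df['age'] >= 18]
print(len(adults))  # 2
```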
Found something unclear or have a suggestion? Email [email protected].