Stata Session 1
Basics, importing data, reshaping, and merging
What You'll Learn
- Stata basics: browsing, exploring, and summarizing data
- Making sample selections and creating new variables
- Reading in data from CSV and other formats
- Reshaping data between wide and long formats
- Merging datasets together
Introduction
We start with the basics: opening, browsing, and exploring data; making sample selections; and creating and labeling new variables.
Setup
Every script should start with setup commands:
* Clear anything in memory
clear all
* Close any open log file
capture log close
* Run the do file without stopping
set more off
* Change the working directory (CHANGE THIS!)
cd "/Users/yourname/Dropbox/my_project"
* Open a log file
log using my_analysis.log, replace
# Clear environment
rm(list = ls())
# Set working directory (CHANGE THIS!)
setwd("/Users/yourname/Dropbox/my_project")
# Load required packages
library(tidyverse)
library(haven) # for reading Stata files
# Import required packages
import pandas as pd
import numpy as np
import os
# Set working directory (CHANGE THIS!)
os.chdir("/Users/yourname/Dropbox/my_project")
Browsing and Exploring Data
* Open the data
use mydata.dta, clear
* Open the data browser (like a spreadsheet view)
browse
* Describe variables: names, labels, and data size
describe
* Summary stats: mean, sd, min, max
summarize
* Summary stats for specific variables
sum age income
* Detailed summary with percentiles
sum income, detail
* Tabulate: count observations by category
tabulate gender
* See the actual numeric values (not labels)
tab gender, nolabel
# Load the data
df <- read_dta("mydata.dta")
# View the data (opens viewer)
View(df)
# See structure and variable types
str(df)
glimpse(df)
# Summary statistics
summary(df)
# Summary for specific variables
summary(df$age)
summary(df$income)
# Detailed summary
library(psych)
describe(df$income)
# Tabulate: count by category
table(df$gender)
# With percentages
prop.table(table(df$gender))
# Load the data
df = pd.read_stata("mydata.dta")
# View first rows
df.head()
# See structure and variable types
df.info()
df.dtypes
# Summary statistics
df.describe()
# Summary for specific variables
df[['age', 'income']].describe()
# Detailed summary with percentiles
df['income'].describe(percentiles=[.1, .25, .5, .75, .9])
# Tabulate: count by category
df['gender'].value_counts()
# With percentages
df['gender'].value_counts(normalize=True)
Never Manually Edit Data
In Stata, the Data Editor lets you manually change values. Never do this. In R/Python, don't modify data outside your script. Any changes must be in code for reproducibility.
Making Sample Selections
* Keep only women (where sex == 2)
keep if sex == 2
* Keep ages 30-40
keep if age >= 30 & age <= 40
* Alternatively, drop observations
drop if age < 30 | age > 40
* Drop unnecessary variables
drop year serial pernum
# Keep only women (where sex == 2)
df <- df %>% filter(sex == 2)
# Keep ages 30-40
df <- df %>% filter(age >= 30 & age <= 40)
# Alternatively, drop observations
df <- df %>% filter(!(age < 30 | age > 40))
# Drop unnecessary variables
df <- df %>% select(-year, -serial, -pernum)
# Keep only women (where sex == 2)
df = df[df['sex'] == 2]
# Keep ages 30-40
df = df[(df['age'] >= 30) & (df['age'] <= 40)]
# Alternatively, drop observations
df = df[~((df['age'] < 30) | (df['age'] > 40))]
# Drop unnecessary variables
df = df.drop(columns=['year', 'serial', 'pernum'])
Creating New Variables
* Create an indicator variable (0/1)
* Method 1: Two steps
gen married = 1 if marst == 1
replace married = 0 if marst != 1
* Method 2: One step (cleaner)
gen married = (marst == 1)
* Create a categorical variable
gen kids_cat = nchild if nchild <= 4
replace kids_cat = 4 if nchild > 4
* Create multiple indicators from a categorical variable
tab kids_cat, gen(Kids_)
# Create an indicator variable (0/1)
df <- df %>% mutate(married = ifelse(marst == 1, 1, 0))
# Or more simply
df <- df %>% mutate(married = as.integer(marst == 1))
# Create a categorical variable
df <- df %>% mutate(
kids_cat = case_when(
nchild <= 4 ~ nchild,
nchild > 4 ~ 4
)
)
# Create dummy variables from categorical
df <- df %>% mutate(
Kids_0 = as.integer(kids_cat == 0),
Kids_1 = as.integer(kids_cat == 1),
Kids_2 = as.integer(kids_cat == 2),
Kids_3 = as.integer(kids_cat == 3),
Kids_4 = as.integer(kids_cat == 4)
)
# Create an indicator variable (0/1)
df['married'] = (df['marst'] == 1).astype(int)
# Create a categorical variable
df['kids_cat'] = df['nchild'].clip(upper=4)
# Create dummy variables from categorical
dummies = pd.get_dummies(df['kids_cat'], prefix='Kids')
df = pd.concat([df, dummies], axis=1)
Language Comparison: Creating Variables
- Stata: uses gen and replace (modifies the data in place)
- R: uses mutate() from the tidyverse (creates/modifies columns)
- Python: direct assignment, e.g. df['col'] = ...
Reading Data
Real-world data often comes in messy formats. This section covers how to import data and clean it up.
Importing CSV Files
* Import CSV with first row as variable names
import delimited "mydata.csv", varnames(1) clear
* Look at what we got
browse
describe
summarize
# Import CSV
df <- read_csv("mydata.csv")
# Look at what we got
View(df)
glimpse(df)
summary(df)
# Import CSV
df = pd.read_csv("mydata.csv")
# Look at what we got
df.head()
df.info()
df.describe()
Fixing Data Types
Problem: A numeric variable is read as a string (text).
* Check why taxrate is a string
browse taxrate
* Ah, some values have "0a" (footnote markers)
* Fix it: strip the footnote marker from every value
replace taxrate = subinstr(taxrate, "a", "", .)
* Convert string to numeric
destring taxrate, replace
* Now we can calculate statistics
summarize taxrate
# Check why taxrate is character
class(df$taxrate)
unique(df$taxrate)
# Ah, some values have "0a" (footnote markers)
# Fix it: replace non-numeric characters
df <- df %>% mutate(
taxrate = str_replace(taxrate, "a$", ""),
taxrate = as.numeric(taxrate)
)
# Now we can calculate statistics
summary(df$taxrate)
# Check why taxrate is object type
df['taxrate'].dtype
df['taxrate'].unique()
# Ah, some values have "0a" (footnote markers)
# Fix it: replace non-numeric characters
df['taxrate'] = df['taxrate'].str.replace('a', '', regex=False)
df['taxrate'] = pd.to_numeric(df['taxrate'])
# Now we can calculate statistics
df['taxrate'].describe()
Reshaping
Data comes in two shapes: wide (one row per unit, multiple columns for time periods) and long (multiple rows per unit, one row per time period). You need to know how to convert between them.
Wide Format
| state | emp_2009 | emp_2010 | emp_2011 |
|---|---|---|---|
| CA | 0.65 | 0.63 | 0.64 |
| TX | 0.68 | 0.67 | 0.69 |
One row per state, multiple year columns
Long Format
| state | year | emp |
|---|---|---|
| CA | 2009 | 0.65 |
| CA | 2010 | 0.63 |
| CA | 2011 | 0.64 |
| TX | 2009 | 0.68 |
| ... | ... | ... |
Multiple rows per state, one year column
Long Data is (Almost) Always Better
For regression analysis and most statistical work, you want long format. Wide format is mainly useful for display purposes or specific matrix operations.
Reshaping Wide to Long
* Current: one row per state, columns emp_2009, emp_2010, emp_2011
* Goal: multiple rows per state, one "year" column
reshape long emp_, i(state) j(year)
* Clean up the variable name
rename emp_ emp
* Check the result
browse
# Current: one row per state, columns emp_2009, emp_2010, emp_2011
# Goal: multiple rows per state, one "year" column
df_long <- df %>%
pivot_longer(
cols = starts_with("emp_"),
names_to = "year",
names_prefix = "emp_",
values_to = "emp"
) %>%
mutate(year = as.integer(year))
# Check the result
head(df_long)
# Current: one row per state, columns emp_2009, emp_2010, emp_2011
# Goal: multiple rows per state, one "year" column
df_long = pd.melt(
df,
id_vars=['state'],
value_vars=['emp_2009', 'emp_2010', 'emp_2011'],
var_name='year',
value_name='emp'
)
# Extract year from column name
df_long['year'] = df_long['year'].str.extract(r'(\d+)', expand=False).astype(int)
# Check the result
df_long.head()
Reshaping Long to Wide
* Convert back to wide format
reshape wide emp, i(state) j(year)
# Convert back to wide format
df_wide <- df_long %>%
pivot_wider(
names_from = year,
values_from = emp,
names_prefix = "emp_"
)
# Convert back to wide format
df_wide = df_long.pivot(
index='state',
columns='year',
values='emp'
).add_prefix('emp_').reset_index()
Aggregating Data
* Calculate mean BY STATE (keeps all rows)
bysort state: egen emp_avg_state = mean(emp)
* Collapse to state means (reduces to one row per state)
collapse (mean) emp (max) taxrate, by(state)
# Calculate mean BY STATE (keeps all rows)
df <- df %>%
group_by(state) %>%
mutate(emp_avg_state = mean(emp, na.rm = TRUE)) %>%
ungroup()
# Collapse to state means (reduces to one row per state)
df_collapsed <- df %>%
group_by(state) %>%
summarize(
emp = mean(emp, na.rm = TRUE),
taxrate = max(taxrate, na.rm = TRUE)
)
# Calculate mean BY STATE (keeps all rows)
df['emp_avg_state'] = df.groupby('state')['emp'].transform('mean')
# Collapse to state means (reduces to one row per state)
df_collapsed = df.groupby('state').agg({
'emp': 'mean',
'taxrate': 'max'
}).reset_index()
Quick Check: Reshaping
You have state-level unemployment data in wide format:
Wide format (50 rows):
| state | unemp_2015 | unemp_2016 | ... | unemp_2024 |
|---|---|---|---|---|
| AL | 6.1 | 5.8 | ... | 3.2 |
| AK | 6.5 | 6.9 | ... | 4.1 |
| AZ | 6.0 | 5.4 | ... | 3.5 |
| ... (50 states total) | ... | ... | ... | ... |
Long format (? rows):
| state | year | unemp |
|---|---|---|
| AL | 2015 | 6.1 |
| AL | 2016 | 5.8 |
| AL | ... | ... |
| AL | 2024 | 3.2 |
| AK | 2015 | 6.5 |
| ... | ... | ... |
How many rows will the long format have? (50 states × 10 years)
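To check the arithmetic, here is a toy pandas example (made-up numbers, scaled down to 3 states and 2 year columns): melting wide data always yields one row per state-year combination.

```python
import pandas as pd

# Toy wide data: 3 states, 2 year columns (same structure as the quiz table).
wide = pd.DataFrame({
    'state': ['AL', 'AK', 'AZ'],
    'unemp_2015': [6.1, 6.5, 6.0],
    'unemp_2016': [5.8, 6.9, 5.4],
})

long = pd.melt(wide, id_vars='state', var_name='year', value_name='unemp')

# Rows in long = (number of states) x (number of year columns)
print(len(long))  # 6
```

The same logic applies to the quiz data: the row count of the long format is the number of states times the number of year columns.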
Merging
Real research requires combining data from multiple sources. Merging is one of the most important (and error-prone) operations in data work.
The Golden Rule of Merging
The merging variable must be identical in both datasets: same name, same values, same storage type (numeric vs. string). If the types differ, Stata refuses to merge; if the values differ in format (one dataset has "California" and the other has "CA"), those rows will simply fail to match.
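A small pandas sketch (with made-up data) illustrates the rule: when the key values are spelled differently, nothing matches until you harmonize them.

```python
import pandas as pd

# Hypothetical mini-datasets: one spells out state names, the other abbreviates.
survey = pd.DataFrame({'state': ['California', 'Texas'],
                       'income': [45000, 51000]})
policies = pd.DataFrame({'state': ['CA', 'TX'],
                         'min_wage': [16.00, 7.25]})

# The key values never coincide, so an inner merge matches zero rows.
print(len(survey.merge(policies, on='state')))   # 0

# Recode the key to a common format, then merge again.
survey['state'] = survey['state'].map({'California': 'CA', 'Texas': 'TX'})
print(len(survey.merge(policies, on='state')))   # 2
```

The merge runs without error either way; the only symptom of a bad key is the missing matches, which is why you should always check match counts after merging.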
Types of Merges
* One-to-one merge: each row matches exactly one row
merge 1:1 id using other_data.dta
* Many-to-one merge: many rows match one row
* (e.g., many individuals in each state match one state-level observation)
merge m:1 state using state_data.dta
* One-to-many merge: one row matches many rows
merge 1:m state using individual_data.dta
# Load the other dataset
other_data <- read_dta("other_data.dta")
state_data <- read_dta("state_data.dta")
# One-to-one merge (inner join by default)
df_merged <- df %>% inner_join(other_data, by = "id")
# Many-to-one merge
# (e.g., many individuals in each state match one state-level observation)
df_merged <- df %>% left_join(state_data, by = "state")
# One-to-many merge
df_merged <- state_data %>% left_join(df, by = "state")
# Load the other dataset
other_data = pd.read_stata("other_data.dta")
state_data = pd.read_stata("state_data.dta")
# One-to-one merge
df_merged = df.merge(other_data, on='id', how='inner')
# Many-to-one merge
# (e.g., many individuals in each state match one state-level observation)
df_merged = df.merge(state_data, on='state', how='left')
# One-to-many merge
df_merged = state_data.merge(df, on='state', how='left')
Checking Your Merge
* After merging, Stata creates _merge variable:
* _merge == 1: only in master data
* _merge == 2: only in using data
* _merge == 3: matched (in both)
* Check how the merge went
tab _merge
* See which observations didn't match
tab state if _merge == 1
tab state if _merge == 2
* Usually you want to keep only matched observations
keep if _merge == 3
drop _merge
# Use indicator to check merge quality
df_merged <- df %>%
full_join(state_data, by = "state") %>%
mutate(
in_master = !is.na(some_var_from_df),
in_using = !is.na(some_var_from_state_data)
)
# Check how many matched
table(df_merged$in_master, df_merged$in_using)
# Keep only matched
df_merged <- df %>% inner_join(state_data, by = "state")
# Use indicator to check merge quality
df_merged = df.merge(
state_data,
on='state',
how='outer',
indicator=True
)
# Check how many matched
print(df_merged['_merge'].value_counts())
# left_only: only in master
# right_only: only in using
# both: matched
# Keep only matched
df_merged = df_merged[df_merged['_merge'] == 'both']
df_merged = df_merged.drop(columns=['_merge'])
Fixing Merge Variables
* Remove periods from state names
replace state = subinstr(state, ".", "", .)
* Remove extra spaces
replace state = trim(state)
* Extract last 2 characters (e.g., "DC" from "Washington DC")
replace state = substr(state, -2, 2)
# Remove periods from state names
df <- df %>% mutate(state = str_replace_all(state, "\\.", ""))
# Remove extra spaces
df <- df %>% mutate(state = str_trim(state))
# Extract last 2 characters
df <- df %>% mutate(state = str_sub(state, -2, -1))
# Remove periods from state names
df['state'] = df['state'].str.replace('.', '', regex=False)
# Remove extra spaces
df['state'] = df['state'].str.strip()
# Extract last 2 characters
df['state'] = df['state'].str[-2:]
Never Do m:m Merges
A many-to-many merge (merge m:m) is almost always a mistake. In Stata, merge m:m matches rows pair-by-pair in sort order, which gives essentially arbitrary results; in R and pandas, a many-to-many join returns every combination of matching rows within each key, which is rarely what you want. If you think you need m:m, you probably need to reshape or collapse first.
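A toy pandas example (made-up data) shows how duplicate keys on both sides multiply rows; pandas' validate option can be used as a guard against accidental many-to-many merges.

```python
import pandas as pd

# Both sides have two rows for CA -- the key is unique in neither dataset.
left = pd.DataFrame({'state': ['CA', 'CA', 'TX'], 'x': [1, 2, 3]})
right = pd.DataFrame({'state': ['CA', 'CA', 'TX'], 'y': [10, 20, 30]})

# pandas performs the many-to-many join silently: each CA row on the left
# pairs with each CA row on the right (2 x 2 = 4), plus one TX match.
merged = left.merge(right, on='state')
print(len(merged))  # 5

# Asking pandas to validate the merge catches the problem instead.
try:
    left.merge(right, on='state', validate='m:1')
except pd.errors.MergeError as err:
    print('merge is not many-to-one:', err)
```

Stating the expected relationship with validate='1:1', 'm:1', or '1:m' mirrors what Stata's merge syntax enforces by default, and turns a silent row explosion into an immediate error.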
Quick Check: Merging
Question 1: You want to merge these two datasets on state. What type of merge should you use?
Individual survey (1000 rows):
| person_id | state | income |
|---|---|---|
| 1 | CA | 45000 |
| 2 | CA | 62000 |
| 3 | TX | 51000 |
| ... (1000 people total) | ... | ... |
State policies (50 rows):
| state | min_wage |
|---|---|
| AL | 7.25 |
| AK | 11.73 |
| CA | 16.00 |
| ... (50 states) | ... |
Question 2: Which merge type should you never use?
Question 3: You merge these datasets using merge m:1 state using policies.dta. How many rows will the result have?
Master: state-year panel (100 rows):
| state | year | unemp |
|---|---|---|
| AL | 2023 | 2.8 |
| AL | 2024 | 3.1 |
| AK | 2023 | 4.2 |
| ... (50 states × 2 years = 100) | ... | ... |
Using: state policies (50 rows):
| state | min_wage |
|---|---|
| AL | 7.25 |
| AK | 11.73 |
| AZ | 14.35 |
| ... (50 states) | ... |
Next Steps
You've learned the fundamentals of data manipulation. In Session 2, we'll cover loops, project organization, and regression commands.