Project Organization
Structure your project for reproducibility and collaboration
Why Organization Matters
A well-organized project:
- Saves time — you can find files months later without searching
- Prevents errors — clear separation between raw and processed data
- Enables collaboration — others can understand and run your code
- Supports reproducibility — you (or a referee) can recreate all results
The Golden Rule
Anyone should be able to run one script and reproduce all your results, starting from raw data. This includes future you, who will have forgotten everything.
Project Setup
Put Everything in Dropbox
Before doing anything else, create your project folder in Dropbox. This gives you automatic backups and version history. When you accidentally delete code, you can right-click the file and restore it from any point in the last 30-180 days.
One-Command Setup Script
We provide a script that creates a well-organized project structure. Open your terminal and run:
Mac/Linux:
cd ~/Dropbox
curl -fsSL https://theodorecaputi.com/teaching/14.33/files/setup.sh | bash -s my_1433_project
Windows PowerShell:
cd ~\Dropbox
irm https://theodorecaputi.com/teaching/14.33/files/setup.ps1 | iex; Setup-Project "my_1433_project"
Folder Structure
The Standard Economics Project
my_project/
├── master.do # Runs everything (master.R or main.py)
├── README.md # Describe your project
│
├── build/ # Data preparation
│ ├── input/ # Raw data (NEVER modify these!)
│ ├── code/ # Scripts to clean data
│ └── output/ # Cleaned data
│
├── analysis/ # Your analysis
│ ├── code/ # Regression scripts, etc.
│ └── output/
│ ├── tables/
│ └── figures/
│
└── paper/ # Your writeup
├── draft.tex
└── figures/ # Symlinks to analysis/output/figures
What Goes Where
| Folder | Contains | Rule |
|---|---|---|
| `build/input/` | Raw data exactly as downloaded | Never modify — treat as read-only |
| `build/output/` | Cleaned datasets ready for analysis | Created by code, deletable |
| `analysis/output/` | Tables, figures, estimates | Created by code, deletable |
If you need to fix something in raw data, write code to do it. This creates a record of every change. Manually editing a CSV loses that history forever.
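For example, suppose the raw data spells a state name inconsistently. A minimal sketch of fixing it in code rather than by hand (the data and the typo are hypothetical; in practice you would read from `build/input/` and write to `build/output/`):

```python
import pandas as pd

# Hypothetical raw data with a known typo in the state name
raw = pd.DataFrame({"state": ["Massachusetts", "Massachusets", "Vermont"]})

# Fix it in code, so every correction to the raw data is documented and repeatable
fixes = {"Massachusets": "Massachusetts"}
clean = raw.assign(state=raw["state"].replace(fixes))

print(clean["state"].tolist())
```

The raw file stays untouched; the corrected version is written to `build/output/`, and the `fixes` dictionary is a permanent record of what changed.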
The Master Script
Your master script is the single entry point that runs your entire project. It should:
- Set paths relative to a project root
- Run all build scripts in order
- Run all analysis scripts in order
/*==============================================================================
Master Do-File: My Research Project
Author: Your Name
Date: February 2026
Instructions: Change the path below, then run this entire file.
==============================================================================*/
clear all
set more off
* ============ CHANGE THIS PATH ============
global root "/Users/yourname/Dropbox/my_project"
* ==========================================
* Define paths (don't change these)
global build "$root/build"
global analysis "$root/analysis"
* Run the build scripts (data preparation)
do "$build/code/01_import_census.do"
do "$build/code/02_import_policy.do"
do "$build/code/03_clean.do"
do "$build/code/04_merge.do"
* Run the analysis
do "$analysis/code/01_summary_stats.do"
do "$analysis/code/02_main_regression.do"
do "$analysis/code/03_robustness.do"
do "$analysis/code/04_figures.do"
di "Done! All results in $analysis/output/"
#==============================================================================
# Master Script: My Research Project
# Author: Your Name
# Date: February 2026
#
# Instructions: Change the path below, then run this entire file.
#==============================================================================
rm(list = ls())
# ============ CHANGE THIS PATH ============
root <- "/Users/yourname/Dropbox/my_project"
# ==========================================
# Define paths (don't change these)
build <- file.path(root, "build")
analysis <- file.path(root, "analysis")
# Run the build scripts (data preparation)
source(file.path(build, "code", "01_import_census.R"))
source(file.path(build, "code", "02_import_policy.R"))
source(file.path(build, "code", "03_clean.R"))
source(file.path(build, "code", "04_merge.R"))
# Run the analysis
source(file.path(analysis, "code", "01_summary_stats.R"))
source(file.path(analysis, "code", "02_main_regression.R"))
source(file.path(analysis, "code", "03_robustness.R"))
source(file.path(analysis, "code", "04_figures.R"))
cat("Done! All results in", file.path(analysis, "output"), "\n")
"""
Master Script: My Research Project
Author: Your Name
Date: February 2026
Instructions: Change the path below, then run this entire file.
"""
from pathlib import Path
# ============ CHANGE THIS PATH ============
ROOT = Path("/Users/yourname/Dropbox/my_project")
# ==========================================
# Define paths (don't change these)
BUILD = ROOT / "build"
ANALYSIS = ROOT / "analysis"
# Run the build scripts (data preparation)
exec((BUILD / "code" / "01_import_census.py").read_text())
exec((BUILD / "code" / "02_import_policy.py").read_text())
exec((BUILD / "code" / "03_clean.py").read_text())
exec((BUILD / "code" / "04_merge.py").read_text())
# Run the analysis
exec((ANALYSIS / "code" / "01_summary_stats.py").read_text())
exec((ANALYSIS / "code" / "02_main_regression.py").read_text())
exec((ANALYSIS / "code" / "03_robustness.py").read_text())
exec((ANALYSIS / "code" / "04_figures.py").read_text())
print(f"Done! All results in {ANALYSIS / 'output'}")
Number Your Scripts
Prefix scripts with numbers (01_, 02_, etc.) so the run order is obvious. This also keeps them sorted correctly in your file browser.
Working with Paths
Hardcoded paths like "/Users/john/Desktop/data.csv" break when anyone else runs your code. Use relative paths from your project root instead.
* BAD: Hardcoded path (breaks on other computers)
use "/Users/john/Desktop/project/data/census.dta", clear
* GOOD: Use globals set in master.do
use "$build/output/census_clean.dta", clear
* Save output using globals
save "$analysis/output/regression_results.dta", replace
graph export "$analysis/output/figures/event_study.pdf", replace
# BAD: Hardcoded path (breaks on other computers)
df <- read_csv("/Users/john/Desktop/project/data/census.csv")
# GOOD: Use variables set in master.R
df <- read_csv(file.path(build, "output", "census_clean.csv"))
# Save output using path variables
write_csv(results, file.path(analysis, "output", "regression_results.csv"))
ggsave(file.path(analysis, "output", "figures", "event_study.pdf"), p, width = 10, height = 6)
# BAD: Hardcoded path (breaks on other computers)
df = pd.read_csv("/Users/john/Desktop/project/data/census.csv")
# GOOD: Use variables set in main.py
df = pd.read_csv(BUILD / "output" / "census_clean.csv")
# Save output using path variables
results.to_csv(ANALYSIS / "output" / "regression_results.csv")
fig.savefig(ANALYSIS / "output" / "figures" / "event_study.pdf")
Best Practices
Code Style
Well-formatted code is easier to read and debug. Use section headers to organize your scripts, and include comments that explain why you're doing something (not just what):
* GOOD: Clear sections with headers
*==============================================================================
* SECTION 1: LOAD AND CLEAN DATA
*==============================================================================
use "$build/input/census_raw.dta", clear
* Keep only working-age adults
keep if age >= 25 & age <= 64
* Create outcome variable
gen employed = (empstat == 1)
label var employed "Currently employed"
*==============================================================================
* SECTION 2: CREATE ANALYSIS SAMPLE
*==============================================================================
* Drop if missing key variables
drop if missing(income, education, age)
* Document sample size
count
di "Final sample: " r(N) " observations"
# GOOD: Clear sections with headers
#==============================================================================
# SECTION 1: LOAD AND CLEAN DATA
#==============================================================================
df <- read_csv(file.path(build, "input", "census_raw.csv"))
# Keep only working-age adults
df <- df %>%
filter(age >= 25, age <= 64)
# Create outcome variable
df <- df %>%
mutate(employed = as.integer(empstat == 1))
#==============================================================================
# SECTION 2: CREATE ANALYSIS SAMPLE
#==============================================================================
# Drop if missing key variables
df <- df %>%
drop_na(income, education, age)
# Document sample size
cat("Final sample:", nrow(df), "observations\n")
# GOOD: Clear sections with headers
#==============================================================================
# SECTION 1: LOAD AND CLEAN DATA
#==============================================================================
df = pd.read_csv(BUILD / "input" / "census_raw.csv")
# Keep only working-age adults
df = df[(df['age'] >= 25) & (df['age'] <= 64)]
# Create outcome variable
df['employed'] = (df['empstat'] == 1).astype(int)
#==============================================================================
# SECTION 2: CREATE ANALYSIS SAMPLE
#==============================================================================
# Drop if missing key variables
df = df.dropna(subset=['income', 'education', 'age'])
# Document sample size
print(f"Final sample: {len(df)} observations")
Debugging Tips
- Check your data often — After each transformation, verify the result is what you expected
- Print counts — After filters or drops, print how many observations remain
- Test on a subset — Run code on 1000 rows before processing millions
- Read error messages — They usually tell you exactly what's wrong
Here's what defensive coding looks like in practice:
* Before and after every major step:
di "Before cleaning: " _N " observations"
drop if age < 0 | age > 120
di "After cleaning: " _N " observations"
* Check for unexpected values
tab state, missing
summarize income, detail
assert income >= 0 | missing(income)
# Before and after every major step:
cat("Before cleaning:", nrow(df), "observations\n")
df <- df %>% filter(age >= 0, age <= 120)
cat("After cleaning:", nrow(df), "observations\n")
# Check for unexpected values
table(df$state, useNA = "ifany")
summary(df$income)
stopifnot(all(df$income >= 0 | is.na(df$income)))
# Before and after every major step:
print(f"Before cleaning: {len(df)} observations")
df = df[(df['age'] >= 0) & (df['age'] <= 120)]
print(f"After cleaning: {len(df)} observations")
# Check for unexpected values
print(df['state'].value_counts(dropna=False))
print(df['income'].describe())
assert ((df['income'] >= 0) | df['income'].isna()).all()
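The "test on a subset" tip can be as simple as a flag at the top of a script (the 1,000-row cap and the data are illustrative):

```python
import pandas as pd

TEST_MODE = True  # flip to False for the full run

# Illustrative stand-in for a large file; in practice you could instead pass
# nrows=1000 to pd.read_csv() when TEST_MODE is on
df = pd.DataFrame({"age": range(5000)})

if TEST_MODE:
    df = df.head(1000)  # develop on a small slice before processing everything

print(len(df))
```

Once the logic works on the slice, flip the flag and run the full dataset.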
Save Intermediate Files
Save your data after each major cleaning step with numbered prefixes:
* After each major cleaning step, save the data
save "$build/output/01_imported.dta", replace
* ... cleaning code ...
save "$build/output/02_cleaned.dta", replace
* ... more cleaning ...
save "$build/output/03_merged.dta", replace
# After each major cleaning step, save the data
saveRDS(df, file.path(build, "output", "01_imported.rds"))
# ... cleaning code ...
saveRDS(df, file.path(build, "output", "02_cleaned.rds"))
# ... more cleaning ...
saveRDS(df, file.path(build, "output", "03_merged.rds"))
# After each major cleaning step, save the data
df.to_pickle(BUILD / "output" / "01_imported.pkl")
# ... cleaning code ...
df.to_pickle(BUILD / "output" / "02_cleaned.pkl")
# ... more cleaning ...
df.to_pickle(BUILD / "output" / "03_merged.pkl")
Why This Matters
When something breaks at step 5, you can start debugging from step 4 instead of re-running everything from scratch. This saves hours of debugging time, especially with large datasets that take time to process.
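Concretely, resuming is a single load call. This round-trip sketch uses a temporary directory standing in for `build/output/` (the filename matches the hypothetical steps above):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Round-trip demo: step 2 saves an intermediate file...
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "02_cleaned.pkl"
    pd.DataFrame({"age": [30, 40]}).to_pickle(path)  # end of step 2

    # ...and when step 3 breaks, you restart from here instead of step 1
    df = pd.read_pickle(path)

print(df.shape)
```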
Project Checklist
Before submitting any project, verify:
- ☐ Running `master.do` reproduces all results from raw data
- ☐ All paths use globals/variables, not hardcoded paths
- ☐ Raw data in `build/input/` is untouched
- ☐ README explains how to run the project
- ☐ No temporary or test files left in folders
- ☐ Scripts are numbered in run order
Found something unclear or have a suggestion? Email [email protected].