Project Organization
Structure your project for reproducibility and collaboration
Why Organization Matters
A well-organized project:
- Saves time — you can find files months later without searching
- Prevents errors — clear separation between raw and processed data
- Enables collaboration — others can understand and run your code
- Supports reproducibility — you (or a referee) can recreate all results
The Golden Rule
Anyone should be able to run one script and reproduce all your results, starting from raw data. This includes future you, who will have forgotten everything.
Project Setup
Put Everything in Dropbox
Before doing anything else, create your project folder in Dropbox. This gives you automatic backups and version history. When you accidentally delete code, you can right-click the file and restore it from any point in the last 30-180 days.
One-Command Setup Script
We provide a script that creates a well-organized project structure. Open your terminal and run:
Mac/Linux:
cd ~/Dropbox
curl -fsSL https://theodorecaputi.com/teaching/14.33/files/setup.sh | bash -s my_1433_project
Windows PowerShell:
cd ~\Dropbox
irm https://theodorecaputi.com/teaching/14.33/files/setup.ps1 | iex; Setup-Project "my_1433_project"
Folder Structure
The Standard Economics Project
my_project/
├── master.do # Runs everything (master.R or main.py)
├── README.md # Describe your project
│
├── build/ # Data preparation
│ ├── input/ # Raw data (NEVER modify these!)
│ ├── code/ # Scripts to clean data
│ └── output/ # Cleaned data
│
├── analysis/ # Your analysis
│ ├── code/ # Regression scripts, etc.
│ └── output/
│ ├── tables/
│ └── figures/
│
└── paper/ # Your writeup
├── draft.tex
└── figures/ # Symlinks to analysis/output/figures
What Goes Where
| Folder | Contains | Rule |
|---|---|---|
| `build/input/` | Raw data exactly as downloaded | Never modify — treat as read-only |
| `build/output/` | Cleaned datasets ready for analysis | Created by code, deletable |
| `analysis/output/` | Tables, figures, estimates | Created by code, deletable |
If you need to fix something in raw data, write code to do it. This creates a record of every change. Manually editing a CSV loses that history forever.
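For example, suppose the raw data spells a state name inconsistently. A minimal sketch of fixing it in code rather than by hand (the data and the typo are hypothetical; in practice you would read from `build/input/` and write to `build/output/`):

```python
import pandas as pd

# Hypothetical raw data with a known typo in the state name
raw = pd.DataFrame({"state": ["Massachusetts", "Massachusets", "Vermont"]})

# Fix it in code, so every correction to the raw data is documented and repeatable
fixes = {"Massachusets": "Massachusetts"}
clean = raw.assign(state=raw["state"].replace(fixes))

print(clean["state"].tolist())
```

The raw file stays untouched; the corrected version is written to `build/output/`, and the `fixes` dictionary is a permanent record of what changed.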
The Master Script
Your master script is the single entry point that runs your entire project. It should:
- Set paths relative to a project root
- Run all build scripts in order
- Run all analysis scripts in order
/*==============================================================================
Master Do-File: My Research Project
Author: Your Name
Date: February 2026
Instructions: Change the path below, then run this entire file.
==============================================================================*/
clear all
set more off
* ============ CHANGE THIS PATH ============
global root "/Users/yourname/Dropbox/my_project"
* ==========================================
* Define paths (don't change these)
global build "$root/build"
global analysis "$root/analysis"
* Run the build scripts (data preparation)
do "$build/code/01_import_census.do"
do "$build/code/02_import_policy.do"
do "$build/code/03_clean.do"
do "$build/code/04_merge.do"
* Run the analysis
do "$analysis/code/01_summary_stats.do"
do "$analysis/code/02_main_regression.do"
do "$analysis/code/03_robustness.do"
do "$analysis/code/04_figures.do"
di "Done! All results in $analysis/output/"
#==============================================================================
# Master Script: My Research Project
# Author: Your Name
# Date: February 2026
#
# Instructions: Change the path below, then run this entire file.
#==============================================================================
rm(list = ls())
# ============ CHANGE THIS PATH ============
root <- "/Users/yourname/Dropbox/my_project"
# ==========================================
# Define paths (don't change these)
build <- file.path(root, "build")
analysis <- file.path(root, "analysis")
# Run the build scripts (data preparation)
source(file.path(build, "code", "01_import_census.R"))
source(file.path(build, "code", "02_import_policy.R"))
source(file.path(build, "code", "03_clean.R"))
source(file.path(build, "code", "04_merge.R"))
# Run the analysis
source(file.path(analysis, "code", "01_summary_stats.R"))
source(file.path(analysis, "code", "02_main_regression.R"))
source(file.path(analysis, "code", "03_robustness.R"))
source(file.path(analysis, "code", "04_figures.R"))
cat("Done! All results in", file.path(analysis, "output"), "\n")
"""
Master Script: My Research Project
Author: Your Name
Date: February 2026
Instructions: Change the path below, then run this entire file.
"""
from pathlib import Path
# ============ CHANGE THIS PATH ============
ROOT = Path("/Users/yourname/Dropbox/my_project")
# ==========================================
# Define paths (don't change these)
BUILD = ROOT / "build"
ANALYSIS = ROOT / "analysis"
# Run the build scripts (data preparation)
exec((BUILD / "code" / "01_import_census.py").read_text())
exec((BUILD / "code" / "02_import_policy.py").read_text())
exec((BUILD / "code" / "03_clean.py").read_text())
exec((BUILD / "code" / "04_merge.py").read_text())
# Run the analysis
exec((ANALYSIS / "code" / "01_summary_stats.py").read_text())
exec((ANALYSIS / "code" / "02_main_regression.py").read_text())
exec((ANALYSIS / "code" / "03_robustness.py").read_text())
exec((ANALYSIS / "code" / "04_figures.py").read_text())
print(f"Done! All results in {ANALYSIS / 'output'}")
Number Your Scripts
Prefix scripts with numbers (01_, 02_, etc.) so the run order is obvious. This also keeps them sorted correctly in your file browser.
Working with Paths
Hardcoded paths like "/Users/john/Desktop/data.csv" break when anyone else runs your code. Use relative paths from your project root instead.
* BAD: Hardcoded path (breaks on other computers)
use "/Users/john/Desktop/project/data/census.dta", clear
* GOOD: Use globals set in master.do
use "$build/output/census_clean.dta", clear
* Save output using globals
save "$analysis/output/regression_results.dta", replace
graph export "$analysis/output/figures/event_study.pdf", replace
# BAD: Hardcoded path (breaks on other computers)
df <- read_csv("/Users/john/Desktop/project/data/census.csv")
# GOOD: Use variables set in master.R
df <- read_csv(file.path(build, "output", "census_clean.csv"))
# Save output using path variables
write_csv(results, file.path(analysis, "output", "regression_results.csv"))
ggsave(file.path(analysis, "output", "figures", "event_study.pdf"), p, width = 10, height = 6)
# BAD: Hardcoded path (breaks on other computers)
df = pd.read_csv("/Users/john/Desktop/project/data/census.csv")
# GOOD: Use variables set in main.py
df = pd.read_csv(BUILD / "output" / "census_clean.csv")
# Save output using path variables
results.to_csv(ANALYSIS / "output" / "regression_results.csv")
fig.savefig(ANALYSIS / "output" / "figures" / "event_study.pdf")
Best Practices
Code Style
Well-formatted code is easier to read and debug. Use section headers to organize your scripts, and include comments that explain why you're doing something (not just what):
* GOOD: Clear sections with headers
*==============================================================================
* SECTION 1: LOAD AND CLEAN DATA
*==============================================================================
use "$build/input/census_raw.dta", clear
* Keep only working-age adults
keep if age >= 25 & age <= 64
* Create outcome variable
gen employed = (empstat == 1)
label var employed "Currently employed"
*==============================================================================
* SECTION 2: CREATE ANALYSIS SAMPLE
*==============================================================================
* Drop if missing key variables
drop if missing(income, education, age)
* Document sample size
count
di "Final sample: " r(N) " observations"
# GOOD: Clear sections with headers
#==============================================================================
# SECTION 1: LOAD AND CLEAN DATA
#==============================================================================
df <- read_csv(file.path(build, "input", "census_raw.csv"))
# Keep only working-age adults
df <- df %>%
filter(age >= 25, age <= 64)
# Create outcome variable
df <- df %>%
mutate(employed = as.integer(empstat == 1))
#==============================================================================
# SECTION 2: CREATE ANALYSIS SAMPLE
#==============================================================================
# Drop if missing key variables
df <- df %>%
drop_na(income, education, age)
# Document sample size
cat("Final sample:", nrow(df), "observations\n")
# GOOD: Clear sections with headers
#==============================================================================
# SECTION 1: LOAD AND CLEAN DATA
#==============================================================================
df = pd.read_csv(BUILD / "input" / "census_raw.csv")
# Keep only working-age adults
df = df[(df['age'] >= 25) & (df['age'] <= 64)]
# Create outcome variable
df['employed'] = (df['empstat'] == 1).astype(int)
#==============================================================================
# SECTION 2: CREATE ANALYSIS SAMPLE
#==============================================================================
# Drop if missing key variables
df = df.dropna(subset=['income', 'education', 'age'])
# Document sample size
print(f"Final sample: {len(df)} observations")
Debugging Tips
- Check your data often — After each transformation, verify the result is what you expected
- Print counts — After filters or drops, print how many observations remain
- Test on a subset — Run code on 1000 rows before processing millions
- Read error messages — They usually tell you exactly what's wrong
Here's what defensive coding looks like in practice:
* Before and after every major step:
di "Before cleaning: " _N " observations"
drop if age < 0 | age > 120
di "After cleaning: " _N " observations"
* Check for unexpected values
tab state, missing
summarize income, detail
assert income >= 0 | missing(income)
# Before and after every major step:
cat("Before cleaning:", nrow(df), "observations\n")
df <- df %>% filter(age >= 0, age <= 120)
cat("After cleaning:", nrow(df), "observations\n")
# Check for unexpected values
table(df$state, useNA = "ifany")
summary(df$income)
stopifnot(all(df$income >= 0 | is.na(df$income)))
# Before and after every major step:
print(f"Before cleaning: {len(df)} observations")
df = df[(df['age'] >= 0) & (df['age'] <= 120)]
print(f"After cleaning: {len(df)} observations")
# Check for unexpected values
print(df['state'].value_counts(dropna=False))
print(df['income'].describe())
assert ((df['income'] >= 0) | df['income'].isna()).all()
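The "test on a subset" tip can be as simple as a flag at the top of a script (the 1,000-row cap and the data are illustrative):

```python
import pandas as pd

TEST_MODE = True  # flip to False for the full run

# Illustrative stand-in for a large file; in practice you could instead pass
# nrows=1000 to pd.read_csv() when TEST_MODE is on
df = pd.DataFrame({"age": range(5000)})

if TEST_MODE:
    df = df.head(1000)  # develop on a small slice before processing everything

print(len(df))
```

Once the logic works on the slice, flip the flag and run the full dataset.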
Save Intermediate Files
Save your data after each major cleaning step with numbered prefixes:
* After each major cleaning step, save the data
save "$build/output/01_imported.dta", replace
* ... cleaning code ...
save "$build/output/02_cleaned.dta", replace
* ... more cleaning ...
save "$build/output/03_merged.dta", replace
# After each major cleaning step, save the data
saveRDS(df, file.path(build, "output", "01_imported.rds"))
# ... cleaning code ...
saveRDS(df, file.path(build, "output", "02_cleaned.rds"))
# ... more cleaning ...
saveRDS(df, file.path(build, "output", "03_merged.rds"))
# After each major cleaning step, save the data
df.to_pickle(BUILD / "output" / "01_imported.pkl")
# ... cleaning code ...
df.to_pickle(BUILD / "output" / "02_cleaned.pkl")
# ... more cleaning ...
df.to_pickle(BUILD / "output" / "03_merged.pkl")
Why This Matters
When something breaks at step 5, you can start debugging from step 4 instead of re-running everything from scratch. This saves hours of debugging time, especially with large datasets that take time to process.
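Concretely, resuming is a single load call. This round-trip sketch uses a temporary directory standing in for `build/output/` (the filename matches the hypothetical steps above):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Round-trip demo: step 2 saves an intermediate file...
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "02_cleaned.pkl"
    pd.DataFrame({"age": [30, 40]}).to_pickle(path)  # end of step 2

    # ...and when step 3 breaks, you restart from here instead of step 1
    df = pd.read_pickle(path)

print(df.shape)
```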
Project Checklist
Before submitting any project, verify:
- ☐ Running `master.do` reproduces all results from raw data
- ☐ All paths use globals/variables, not hardcoded paths
- ☐ Raw data in `build/input/` is untouched
- ☐ README explains how to run the project
- ☐ No temporary or test files left in folders
- ☐ Scripts are numbered in run order
Found something unclear or have a suggestion? Email [email protected].