Finding a Project
How to turn an idea into a feasible research project
The Project Triangle
A viable empirical project requires three things that all fit together. Missing any one of them means you don't have a project.
Question
Does X cause Y? The relationship should be interesting for policy, business, or scientific understanding.
Data
Can you actually measure X and Y? For coursework, you'll need publicly accessible data.
Identification
What's the source of quasi-exogenous variation? Why did some units get treated and others didn't?
1. The Question
Your research question should take the form "Does X cause Y?"
What Makes a Good Question?
Someone should care about the answer. Would policymakers change their behavior based on your findings? Would firms make different decisions?
You'll spend weeks or months on this. If you're not curious about the answer, you'll lose motivation. Pick a topic you find fascinating.
Can you realistically answer this with available data and methods in the time you have?
2. Identification
This is the hardest part. You need variation in X (your treatment) that is quasi-exogenous—not driven by the same factors that affect Y.
Diff-in-Diff
Most students will use a difference-in-differences design. The key is to think about policies that changed in some places but not others. (See the Causal Inference page for the theory behind DiD.)
The Formula
Think: "Some [states/cities/countries] adopted [policy X] at different times. Did this affect [outcome Y]?"
What You Need for a DiD / Event Study Project
For a diff-in-diff or event study to work, you need two things:
- A treatment that affects some units but not others. A policy that rolled out to some states but not others, or at different times, gives you a treatment group and a control group.
- Data on both groups, before and after the treatment. You need panel data (repeated observations over time) covering treated and untreated units in both the pre- and post-treatment periods.
This means your treatment should have happened some time ago—you need enough post-treatment data to measure effects. Most successful projects involve policies introduced in the 2000s or 2010s, not the 2020s. A policy from 2023 gives you very little post-treatment data to work with.
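To make these data requirements concrete, here is a minimal sketch of the simplest two-group, two-period DiD calculation in Python. The panel is invented for illustration (the states, years, and outcome values are all hypothetical), and the column names are assumptions rather than anything a real dataset provides.

```python
import numpy as np
import pandas as pd

# Hypothetical state-year panel: A and B adopt a policy in 2010, C and D never do.
panel = pd.DataFrame({
    "state":   ["A", "A", "B", "B", "C", "C", "D", "D"],
    "year":    [2008, 2012] * 4,
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "outcome": [10.0, 9.0, 12.0, 11.0, 11.0, 12.0, 9.0, 10.0],
})
panel["post"] = (panel["year"] >= 2010).astype(int)

# Mean outcome in each of the four (treated, post) cells.
cell_means = panel.groupby(["treated", "post"])["outcome"].mean()

# Change over time in each group, then the difference between those changes.
change_treated = cell_means.loc[(1, 1)] - cell_means.loc[(1, 0)]
change_control = cell_means.loc[(0, 1)] - cell_means.loc[(0, 0)]
did_estimate = change_treated - change_control

# The same number is the interaction coefficient in a regression of the
# outcome on treated, post, and treated x post.
treated = panel["treated"].to_numpy(dtype=float)
post = panel["post"].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(panel)), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(X, panel["outcome"].to_numpy(), rcond=None)

print(did_estimate, coef[3])
```

If any of the four cells were empty (for example, no pre-period data for the treated states), the calculation could not be done, which is why the panel must cover both groups in both periods.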
Turning a Treatment into a Project
Once you've identified a treatment (policy) that interests you, ask yourself three questions:
- Can I find data on the treatment? You need to know when each state adopted the policy. Check NCSL, academic papers, or policy databases. If no one has compiled the dates, you might have to do it yourself—that's a lot of work.
- What outcome might this policy affect? Think about what the policy was designed to change, or what it might unintentionally affect. Texting bans → traffic fatalities. Minimum wage → employment. Marijuana legalization → drug use, crime, traffic accidents, etc.
- Can I find data on the outcome? The outcome must exist in a dataset that covers multiple states and years. FARS for traffic deaths, BRFSS for health behaviors, CPS for employment, etc.
If you can answer "yes" to all three, you have a viable project. Pick something you genuinely find interesting—you'll spend weeks with this data.
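As a rough sketch of how the treatment data and outcome data fit together once you have both, the Python snippet below merges a table of adoption years onto an outcome panel. Everything here is hypothetical: the state labels, adoption years, fatality counts, and column names are placeholders, not values from NCSL or FARS.

```python
import pandas as pd

# Hypothetical adoption years (in practice, compiled from a policy database or a paper).
adoption = pd.DataFrame({"state": ["A", "B"], "adopt_year": [2010, 2012]})

# Hypothetical outcome panel covering two adopters and one never-adopter (state C).
outcomes = pd.DataFrame({
    "state": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "year": [2009, 2011, 2013] * 3,
    "fatalities": [50, 47, 45, 80, 82, 76, 30, 31, 29],
})

# A left merge keeps never-adopters; their adopt_year is missing (NaN).
# validate= raises an error if a state appears twice in the adoption table.
panel = outcomes.merge(adoption, on="state", how="left", validate="many_to_one")

# Treatment indicator: 1 in years at or after adoption.
# Comparisons against NaN are False, so never-adopters stay at 0.
panel["treat_post"] = (panel["year"] >= panel["adopt_year"]).astype(int)

print(panel)
```

Checking that the merged panel still has one row per state-year, and that never-adopters were not silently dropped, is a cheap way to catch problems before running any regression.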
Regression Discontinuity
RD exploits situations where treatment is determined by whether a continuous variable crosses a threshold. Units just above and just below the cutoff should be similar in everything except treatment status, so a jump in the outcome at the cutoff can be attributed to the treatment.
What you need to find:
- A cutoff rule that determines treatment (age, test score, income threshold)
- Data with the running variable (the variable that determines treatment)
- Enough observations close to the cutoff, on both sides, to compare
Examples: Age 21 and alcohol access, age 65 and Medicare eligibility, test score thresholds for scholarships or program eligibility
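The sketch below illustrates the idea on simulated data, using Python and NumPy. It is not a full RD analysis: the data-generating process, the true jump of 2.0, and the 5-year bandwidth are all assumptions chosen for the illustration, and real projects typically rely on a dedicated RD package with data-driven bandwidth selection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the running variable is age, and treatment switches on at 65.
n = 4000
cutoff = 65.0
age = rng.uniform(55, 75, n)
treated = (age >= cutoff).astype(float)
true_jump = 2.0  # effect built into the simulation (an assumption)
y = 10 + 0.3 * (age - cutoff) + true_jump * treated + rng.normal(0, 1, n)

# Keep observations within a bandwidth of the cutoff (5 years is arbitrary here).
bandwidth = 5.0
near = np.abs(age - cutoff) <= bandwidth
x = age[near] - cutoff  # center the running variable at the cutoff
d = treated[near]

# Local linear regression with a separate slope on each side of the cutoff.
X = np.column_stack([np.ones(x.size), d, x, d * x])
coef, *_ = np.linalg.lstsq(X, y[near], rcond=None)
rd_estimate = coef[1]  # estimated jump in the outcome at the cutoff

print(round(rd_estimate, 2))
```

The number of observations inside the bandwidth (`near.sum()`) is what "enough observations close to the cutoff" refers to: with too few, the estimated jump becomes very noisy.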
Instrumental Variables
IV addresses endogeneity by finding a variable (the "instrument") that affects treatment but has no direct effect on the outcome. The instrument creates quasi-random variation in treatment that we can use to estimate causal effects.
What you need to find:
- An instrument that affects treatment but not the outcome directly
- Relevance: The instrument must actually affect treatment (testable; a common rule of thumb is a first-stage F-statistic above 10)
- Exclusion: The instrument must only affect the outcome through treatment (requires a convincing argument)
Examples: Distance to college as instrument for education, draft lottery numbers for military service, weather as instrument for attendance
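To see what relevance buys you, here is a minimal two-stage least squares sketch on simulated data in Python. The data-generating process (an unobserved confounder `u`, a true effect of 1.5, and an instrument that is valid by construction) is an assumption of the simulation; with real data the exclusion restriction cannot be verified this way.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # unobserved confounder
d = 0.8 * z + u + rng.normal(size=n)        # treatment depends on z and u
true_effect = 1.5                           # built into the simulation
y = true_effect * d + 2.0 * u + rng.normal(size=n)


def ols(X, target):
    """Return OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta, target - X @ beta


const = np.ones(n)

# Naive OLS of y on d is biased because u drives both d and y.
beta_ols, _ = ols(np.column_stack([const, d]), y)

# First stage: regress treatment on the instrument.
Z = np.column_stack([const, z])
pi, v = ols(Z, d)
sigma2 = v @ v / (n - 2)
se_pi = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])
first_stage_F = (pi[1] / se_pi) ** 2        # with one instrument, F equals t squared

# Second stage: regress the outcome on the fitted treatment.
d_hat = Z @ pi
beta_iv, _ = ols(np.column_stack([const, d_hat]), y)

print(round(beta_ols[1], 2), round(beta_iv[1], 2), round(first_stage_F, 1))
```

Running the two stages by hand gives the right point estimate but not the right standard errors, and the F-statistic here assumes homoskedastic errors, so an actual project would use a dedicated IV routine.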
3. Data
For 14.33, you'll probably need publicly available data. Here are the key sources by field:
Health Data
BRFSS ↗State-level health behaviors (smoking, drinking, exercise)
YRBS ↗Health behaviors among high school students
NHIS ↗Health status, access to care, insurance coverage
FARS ↗Every fatal traffic accident in the US
CDC WONDER ↗Mortality, natality, vital statistics
MEPS ↗Medical expenditure, utilization, insurance
Labor & Demographics
IPUMS ↗Harmonized Census, ACS, CPS microdata
ACS ↗American Community Survey—demographics, employment
CPS ↗Current Population Survey—monthly employment
NLSY ↗Longitudinal—follows individuals over time
PSID ↗Panel Study of Income Dynamics
QCEW ↗Quarterly employment & wages by industry
Crime, Education, Economics
UCR/NIBRS crime statistics by jurisdiction
NCES ↗Education statistics—schools, enrollment, IPEDS
FRED ↗Federal Reserve economic data (macro, finance)
BLS ↗Prices, employment, productivity, wages
ICPSR ↗Social science data archive (vast collection)
World Bank ↗International development indicators
Policy Databases
You'll often need to identify when policies were enacted. These databases track state-level policy variation:
Comprehensive gun law database by state-year
NCSL ↗State legislation database (search by topic)
Tax Policy Center ↗State and federal tax data
DOL Minimum Wage ↗Historical state minimum wages
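Files in .dta or .sas7bdat format can be read in R with the haven package and in Python with pandas; both read most statistical formats directly. Files in .rds format are native to R (readRDS), and Python needs a third-party package such as pyreadr to read them. When in doubt, convert to CSV.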
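As a small illustration of the "convert to CSV" advice, the Python sketch below writes a tiny Stata file, reads it back with pandas, and saves it as CSV. The data and file names are made up; with a real download you would skip the `to_stata` step and point `pd.read_stata` at your file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for a downloaded Stata file.
df = pd.DataFrame({"state": ["A", "B"], "year": [2010, 2011], "rate": [1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    dta_path = os.path.join(tmp, "example.dta")
    csv_path = os.path.join(tmp, "example.csv")

    df.to_stata(dta_path, write_index=False)   # create a small .dta file to read
    from_stata = pd.read_stata(dta_path)       # read the Stata file
    from_stata.to_csv(csv_path, index=False)   # convert it to CSV
    from_csv = pd.read_csv(csv_path)           # CSV opens in any language or tool

# SAS files can be read similarly with pd.read_sas(path, format="sas7bdat").
print(from_csv)
```

Converting loses Stata-specific metadata such as variable and value labels, so it is worth keeping the original file alongside the CSV.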
Putting It Together: Examples
Let's walk through some project ideas and see why they work or don't work.
"Does poverty affect educational attainment?"
Question: Great, with huge policy implications
Data: Census/ACS has income & education
Identification: No quasi-random variation in poverty
Problem: Poor families differ from rich families in countless ways (neighborhood, parenting, genetics, schools). What's the quasi-random variation in income? Without it, any correlation is hopelessly confounded.
Not feasible without finding exogenous variation (lottery winners? plant closures? inheritance timing?).
"Do doctors make worse decisions when stressed?"
Question: Fascinating, since medical errors kill thousands
Data: Need physician + patient + ER data
Identification: ER crowding varies quasi-randomly
Problem: You'd need matched data on physician schedules, patient outcomes, and ER conditions. This data exists in hospitals but is not publicly available.
Not feasible for coursework. Great idea for a faculty member with hospital data access.
"Do texting bans reduce traffic fatalities?"
Question: Policy-relevant, since states are debating this
Data: FARS + state law databases
Identification: DiD with staggered adoption
Why it works: Different states adopted texting bans at different times. Compare changes in fatalities before/after adoption, relative to states that hadn't yet adopted.
Feasible project. This is exactly how published papers on this topic are designed.
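With staggered adoption, a useful first step is to line states up by years since adoption rather than by calendar year. The Python sketch below does this on a made-up panel; the states, adoption years, and fatality counts are hypothetical, and the averages it prints are a descriptive first look, not a causal estimate.

```python
import pandas as pd

# Hypothetical panel: state A adopts a ban in 2010, state B in 2012.
panel = pd.DataFrame({
    "state": ["A"] * 4 + ["B"] * 4,
    "year": [2009, 2010, 2011, 2012] * 2,
    "adopt_year": [2010] * 4 + [2012] * 4,
    "fatalities": [50, 46, 45, 44, 80, 81, 80, 75],
})

# Event time: years relative to each state's own adoption year
# (negative values are pre-treatment, zero is the adoption year).
panel["event_time"] = panel["year"] - panel["adopt_year"]

# Average outcome at each event time, pooling states that adopted in different years.
by_event_time = panel.groupby("event_time")["fatalities"].mean()

print(by_event_time)
```

In 2010 and 2011, state B has not yet adopted, so it can serve as a comparison for state A in those years; that is the "relative to states that hadn't yet adopted" logic in the text above.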
"Does Medicare eligibility improve health outcomes?"
Question: Huge, worth trillions of dollars
Data: NHIS, BRFSS, MEPS by age
Identification: RD at age 65
Why it works: People just under and just over age 65 are nearly identical, but one group has Medicare and the other doesn't. Classic RD design.
Feasible project. Card, Dobkin, and Maestas (2008) used exactly this approach.
How to Start
My Recommendation: Start with Identification
I recommend starting with an identification strategy: find a policy or treatment that was implemented in some places but not others. You can find these either in the news or in existing academic papers.
Then, work outward from there:
- Find treatment data: You need to know when and where the policy was implemented. Implementation dates are sometimes published alongside existing papers, or compiled by organizations like NCSL.
- Find outcome data: Look for data on an outcome at the same or more granular level as the treatment, covering both before and after the treatment. For example, if the policy varies at the county level, you need county-level (or more granular) measures of employment, health, education, etc.
- Form a question: Let the available data shape your research question. You don't need to start with a grand question—start with what's feasible and let the question emerge from what you can actually measure.
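When the outcome data is more granular than the treatment, it usually has to be aggregated to the treatment's level before the two can be combined. Here is a minimal Python sketch with hypothetical county-level counts rolled up to the state-year level; the column names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical county-level outcome data for a policy that varies by state.
county = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "county": ["A1", "A2", "B1", "B2"],
    "year": [2015] * 4,
    "employed": [900, 100, 300, 340],
    "population": [1500, 500, 1000, 600],
})

# Sum the counts within each state-year, then compute the rate from the totals.
state_year = county.groupby(["state", "year"], as_index=False)[
    ["employed", "population"]
].sum()
state_year["emp_rate"] = state_year["employed"] / state_year["population"]

print(state_year)
```

Summing counts and then dividing weights each county by its population; averaging the county rates directly would give small counties the same weight as large ones.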
That said, you can start from any corner of the triangle:
Start with a Question
What topic interests you? Then ask: where might I find quasi-random variation in my treatment? What data captures this?
Start with Data
Browse IPUMS, BRFSS, or FARS. What variables are there? What policies changed over the time period covered? What comparisons seem credible?
Start with Identification
Read the news for policy changes. Marijuana laws? Minimum wage? Paid leave? Then ask: what outcomes might be affected, and is there data?
Pro Tip: Read Published Papers
The best way to learn what makes a viable project is to read papers in your area of interest. Pay attention to: What's the identification strategy? Where did they get the data? Could you do something similar with a different policy or outcome?
Past 14.33 Papers
These papers from past 14.33 students are published in the UEA Journal:
- Income Gaps and Political Divides: Local Inequality's Effect on Local Election Polarization (pp. 9-36)
- The Impact of Autism Therapy Coverage Mandates on Outcomes for Autistic Children (pp. 37-62)
- Assessing the Impact of Project Green Light: A Google Sustainability Initiative to Reduce City Air Pollution (pp. 63-88)
- Assessing the Impact of Police Officer Line-of-Duty Deaths and Force Size on Civilian Arrests (pp. 87-100)
- An Empirical Analysis of the 3-Points-for-a-Win Rule in the English Premier League (pp. 129-147)
- A Rising Tide for All or Wave for One?: The Effect of Charter School Competition on District Achievement (pp. 11-32)
- Post-Earnings Announcement Drift (pp. 46-72)