Finding a Project

How to turn an idea into a feasible research project

The Project Triangle

A viable empirical project requires three things that all fit together. Missing any one of them means you don't have a project.

Question

Does X cause Y? The relationship should be interesting for policy, business, or scientific understanding.

Data

Can you actually measure X and Y? For coursework, you'll need publicly accessible data.

Identification

What's the source of quasi-exogenous variation? Why were some units treated and others not?

1. The Question

Your research question should take the form "Does X cause Y?", where X is a treatment, policy, or exposure and Y is the outcome you care about.

What Makes a Good Question?

1. It matters for policy, business, or science. Someone should care about the answer. Would policymakers change their behavior based on your findings? Would firms make different decisions?
2. You're genuinely interested in it. You'll spend weeks or months on this. If you're not curious about the answer, you'll lose motivation. Pick a topic you find fascinating.
3. It's feasible. Can you realistically answer this with available data and methods in the time you have?

2. Identification

This is the hardest part. You need variation in X (your treatment) that is quasi-exogenous—not driven by the same factors that affect Y.

The key question: Why were some units treated and others not? If the answer involves factors that also affect the outcome, you have selection bias and no identification.

Diff-in-Diff

Most students will use a difference-in-differences design. The key is to think about policies that changed in some places but not others. (See the Causal Inference page for the theory behind DiD.)

The Formula

Think: "Some [states/cities/countries] adopted [policy X] at different times. Did this affect [outcome Y]?"

What You Need for a DiD / Event Study Project

For a diff-in-diff or event study to work, you need two things:

A treatment that affects some units but not others. A policy that rolled out to some states but not others, or at different times, gives you a treatment group and a control group.
Data on both groups, before and after the treatment. You need panel data (repeated observations over time) covering treated and untreated units in both the pre- and post-treatment periods.

This means your treatment should have happened some time ago—you need enough post-treatment data to measure effects. Most successful projects involve policies introduced in the 2000s or 2010s, not the 2020s. A policy from 2023 gives you very little post-treatment data to work with.
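To make the two requirements concrete, here is a minimal sketch of the simplest two-group, two-period DiD calculation. The units, groups, and outcome values are invented for illustration, and `cell_mean` and `did_estimate` are hypothetical helpers, not part of any library. A real project would estimate this with a regression (fixed effects, clustered standard errors), but the arithmetic underneath is the same.

```python
from statistics import mean

# Toy panel: (unit, group, period, outcome). All values are invented.
rows = [
    ("A", "treated", "pre", 10.0),
    ("A", "treated", "post", 8.0),
    ("B", "treated", "pre", 12.0),
    ("B", "treated", "post", 9.0),
    ("C", "control", "pre", 11.0),
    ("C", "control", "post", 10.5),
    ("D", "control", "pre", 9.0),
    ("D", "control", "post", 8.5),
]

def cell_mean(rows, group, period):
    """Average outcome for one group in one period."""
    values = [y for _, g, p, y in rows if g == group and p == period]
    if not values:
        raise ValueError(f"no observations for {group}/{period}")
    return mean(values)

def did_estimate(rows):
    """(Change in treated group) minus (change in control group)."""
    treated_change = cell_mean(rows, "treated", "post") - cell_mean(rows, "treated", "pre")
    control_change = cell_mean(rows, "control", "post") - cell_mean(rows, "control", "pre")
    return treated_change - control_change

estimate = did_estimate(rows)
print(f"DiD estimate: {estimate:.2f}")
```

Notice that the function fails if any of the four cells (treated/control by pre/post) is empty. That is the data requirement above in code form: no pre-period or no control group means no estimate.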

Turning a Treatment into a Project

Once you've identified a treatment (policy) that interests you, ask yourself three questions:

Can I find data on the treatment? You need to know when each state adopted the policy. Check NCSL, academic papers, or policy databases. If no one has compiled the dates, you might have to do it yourself—that's a lot of work.
What outcome might this policy affect? Think about what the policy was designed to change, or what it might unintentionally affect. Texting bans → traffic fatalities. Minimum wage → employment. Marijuana legalization → drug use, crime, traffic accidents, etc.
Can I find data on the outcome? The outcome must exist in a dataset that covers multiple states and years. FARS for traffic deaths, BRFSS for health behaviors, CPS for employment, etc.

If you can answer "yes" to all three, you have a viable project. Pick something you genuinely find interesting—you'll spend weeks with this data.
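In practice, the three questions boil down to two tables you can join: adoption dates by state, and outcomes by state and year. The sketch below shows one way that join might look. The state codes, years, and outcome values are all invented placeholders, and `build_panel` is a hypothetical helper.

```python
# Hypothetical adoption years; None means the state never adopted in the sample.
adoption_year = {"AA": 2009, "BB": 2012, "CC": None}

# Hypothetical state-year outcomes (say, deaths per 100,000). Values are invented.
outcomes = [
    ("AA", 2008, 12.1), ("AA", 2009, 11.8), ("AA", 2010, 11.0),
    ("BB", 2008, 10.4), ("BB", 2009, 10.6), ("BB", 2010, 10.5),
    ("CC", 2008, 9.9), ("CC", 2009, 9.7), ("CC", 2010, 9.8),
]

def build_panel(outcomes, adoption_year):
    """Attach treatment indicators to each state-year outcome row."""
    panel = []
    for state, year, y in outcomes:
        if state not in adoption_year:
            raise KeyError(f"no treatment information for state {state!r}")
        adopted = adoption_year[state]
        panel.append({
            "state": state,
            "year": year,
            "outcome": y,
            "ever_treated": adopted is not None,
            # post = 1 once the policy is in effect in this state
            "post": adopted is not None and year >= adopted,
            # years since adoption (negative before adoption), for event studies
            "event_time": None if adopted is None else year - adopted,
        })
    return panel

panel = build_panel(outcomes, adoption_year)
for row in panel:
    print(row)
```

If the join raises a `KeyError`, you are missing treatment data for a state that appears in the outcome data, which is worth knowing before you run any regressions.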

Regression Discontinuity

RD exploits situations where treatment is determined by whether a continuous variable crosses a threshold. Units just above and just below the cutoff should be comparable in everything except treatment status, provided they cannot precisely control which side of the cutoff they land on.

What you need to find:

A cutoff rule that determines treatment (age, test score, income threshold)
Data with the running variable (the variable that determines treatment)
Enough observations close to the cutoff, on both sides, to compare

Examples: Age 21 and alcohol access, age 65 and Medicare eligibility, test score thresholds for scholarships or program eligibility
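As a rough illustration of what an RD estimate computes, the sketch below fits a separate straight line on each side of the cutoff, using only observations within a bandwidth, and takes the gap between the two fitted values at the cutoff. The data are simulated with a built-in jump of 3 so the result can be checked; `fit_line` and `rd_estimate` are hypothetical helpers, and the bandwidth of 5 is an arbitrary choice rather than a recommendation.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    if len(xs) < 2:
        raise ValueError("need at least two points to fit a line")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("running variable has no variation on this side")
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b

def rd_estimate(data, cutoff, bandwidth):
    """Jump in the fitted outcome at the cutoff, using points within the bandwidth."""
    # Center the running variable so each intercept is the fitted value at the cutoff.
    below = [(x - cutoff, y) for x, y in data if cutoff - bandwidth <= x < cutoff]
    above = [(x - cutoff, y) for x, y in data if cutoff <= x <= cutoff + bandwidth]
    if len(below) < 2 or len(above) < 2:
        raise ValueError("not enough observations on both sides of the cutoff")
    a_below, _ = fit_line(*zip(*below))
    a_above, _ = fit_line(*zip(*above))
    return a_above - a_below

# Simulated data: a smooth trend in age plus a jump of 3 at age 65 (no noise).
data = [(age, 2.0 + 0.5 * (age - 65) + (3.0 if age >= 65 else 0.0))
        for age in range(60, 71)]

jump = rd_estimate(data, cutoff=65, bandwidth=5)
print(f"Estimated jump at the cutoff: {jump:.2f}")
```

With real data, try several bandwidths and check that the estimate does not swing wildly, and plot the outcome against the running variable before trusting any number.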

Instrumental Variables

IV addresses endogeneity by finding a variable (the "instrument") that affects treatment but has no direct effect on the outcome. The instrument creates quasi-random variation in treatment that we can use to estimate causal effects.

What you need to find:

An instrument that affects treatment but not the outcome directly
Relevance: The instrument must actually affect treatment (testable: first-stage F > 10)
Exclusion: The instrument must only affect the outcome through treatment (requires a convincing argument)

Examples: Distance to college as instrument for education, draft lottery numbers for military service, weather as instrument for attendance
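For a binary instrument, the IV estimate reduces to a ratio: the instrument's effect on the outcome divided by its effect on treatment (often called the Wald estimator). Here is a minimal sketch with invented data; `wald_iv` is a hypothetical helper. It covers only the arithmetic. It does not test instrument strength, and no code can verify the exclusion restriction.

```python
from statistics import mean

def wald_iv(z, d, y):
    """Wald estimator for binary instrument z, treatment d, and outcome y.

    Returns (iv_estimate, first_stage), where first_stage is the difference
    in mean treatment between z == 1 and z == 0.
    """
    if not (len(z) == len(d) == len(y)):
        raise ValueError("z, d, and y must have the same length")
    if set(z) != {0, 1}:
        raise ValueError("instrument must be binary with both values present")
    d1 = mean(di for zi, di in zip(z, d) if zi == 1)
    d0 = mean(di for zi, di in zip(z, d) if zi == 0)
    y1 = mean(yi for zi, yi in zip(z, y) if zi == 1)
    y0 = mean(yi for zi, yi in zip(z, y) if zi == 0)
    first_stage = d1 - d0
    if first_stage == 0:
        raise ValueError("instrument does not move treatment in this sample")
    reduced_form = y1 - y0
    return reduced_form / first_stage, first_stage

# Invented data: instrument, treatment take-up, outcome.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 1, 0, 0, 0]
y = [5, 6, 5, 2, 4, 2, 3, 3]

iv_estimate, first_stage = wald_iv(z, d, y)
print(f"First stage: {first_stage:.2f}, IV estimate: {iv_estimate:.2f}")
```

A small first stage makes the ratio unstable, because you are dividing by a number close to zero. That is the intuition behind checking instrument relevance before anything else.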

3. Data

For 14.33, you'll probably need publicly available data. Here are the key sources by field:

Health Data

BRFSS: State-level health behaviors (smoking, drinking, exercise)
YRBS: Health behaviors among high school students
NHIS: Health status, access to care, insurance coverage
FARS: Every fatal traffic accident in the US
CDC WONDER: Mortality, natality, vital statistics
MEPS: Medical expenditure, utilization, insurance

Labor & Demographics

IPUMS: Harmonized Census, ACS, CPS microdata
ACS: American Community Survey (demographics, employment)
CPS: Current Population Survey (monthly employment)
NLSY: Longitudinal survey that follows individuals over time
PSID: Panel Study of Income Dynamics
QCEW: Quarterly employment & wages by industry

Crime, Education, Economics

FBI Crime Data: UCR/NIBRS crime statistics by jurisdiction
NCES: Education statistics (schools, enrollment, IPEDS)
FRED: Federal Reserve economic data (macro, finance)
BLS: Prices, employment, productivity, wages
ICPSR: Social science data archive (vast collection)
World Bank: International development indicators

Policy Databases

You'll often need to identify when policies were enacted. These databases track state-level policy variation:

RAND Firearm Laws: Comprehensive gun law database by state-year
NCSL: State legislation database (search by topic)
Tax Policy Center: State and federal tax data
DOL Minimum Wage: Historical state minimum wages

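Don't let file formats limit you. Most statistical formats can be read outside the software that produced them. In R, the haven package reads Stata (.dta) and SAS (.sas7bdat) files, and base R's readRDS reads .rds. In Python, pandas reads .dta and .sas7bdat directly (read_stata, read_sas), while .rds needs a third-party package such as pyreadr. When in doubt, convert to CSV.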
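Here is one way you might wrap this in Python: pick the pandas reader from the file extension and, if needed, write a CSV copy. `pd.read_stata`, `pd.read_sas`, `pd.read_csv`, and `DataFrame.to_csv` are real pandas functions; the helper names (`reader_name_for`, `load_table`, `convert_to_csv`) and the file names are hypothetical. This sketch deliberately leaves out .rds, which pandas cannot read on its own.

```python
from pathlib import Path

# File extension -> name of the pandas reader function that handles it.
PANDAS_READERS = {
    ".dta": "read_stata",      # Stata
    ".sas7bdat": "read_sas",   # SAS data set
    ".xpt": "read_sas",        # SAS transport file
    ".csv": "read_csv",
}

def reader_name_for(path):
    """Return the pandas reader name for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in PANDAS_READERS:
        raise ValueError(
            f"no pandas reader configured for {suffix!r}; "
            "convert the file to CSV in its original software first"
        )
    return PANDAS_READERS[suffix]

def load_table(path):
    """Load a data file into a DataFrame using the matching pandas reader."""
    import pandas as pd  # imported lazily so the lookup above works without pandas
    return getattr(pd, reader_name_for(path))(path)

def convert_to_csv(path):
    """Write a CSV copy next to the original file and return its path."""
    out_path = Path(path).with_suffix(".csv")
    load_table(path).to_csv(out_path, index=False)
    return out_path

print(reader_name_for("survey.dta"))
```

Keep the original file even after converting. CSV drops variable labels and value labels, and you will want those when you read the codebook.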

Putting It Together: Examples

Let's walk through some project ideas and see why they work or don't work.

✘ "Does poverty affect educational attainment?"

Question (✔): Great, with huge policy implications
Data (✔): Census/ACS has income & education
Identification (MISSING): No quasi-random variation in poverty

Problem: Poor families differ from rich families in countless ways (neighborhood, parenting, genetics, schools). What's the quasi-random variation in income? Without it, any correlation is hopelessly confounded.

Not feasible without finding exogenous variation (lottery winners? plant closures? inheritance timing?).

✘ "Do doctors make worse decisions when stressed?"

Question (✔): Fascinating, since medical errors kill thousands
Data (MISSING): Need physician + patient + ER data
Identification (✔): ER crowding varies quasi-randomly

Problem: You'd need matched data on physician schedules, patient outcomes, and ER conditions. This data exists in hospitals but is not publicly available.

Not feasible for coursework. Great idea for a faculty member with hospital data access.

✔ "Do texting bans reduce traffic fatalities?"

Question (✔): Policy-relevant, and states are debating this
Data (✔): FARS + state law databases
Identification (✔): DiD with staggered adoption

Why it works: Different states adopted texting bans at different times. Compare changes in fatalities before/after adoption, relative to states that hadn't yet adopted.

Feasible project. This is exactly how published papers on this topic are designed.

✔ "Does Medicare eligibility improve health outcomes?"

Question (✔): Huge, worth trillions of dollars
Data (✔): NHIS, BRFSS, MEPS by age
Identification (✔): RD at age 65

Why it works: People just under and just over age 65 are nearly identical, but one group is eligible for Medicare and the other is not. Classic RD design.

Feasible project. Card, Dobkin, and Maestas (2008) used exactly this approach.

How to Start

My Recommendation: Start with Identification

I personally recommend starting with an identification strategy—find a policy or treatment that was implemented in some places but not others. You can find these either in the news or from existing academic papers.

Then, work outward from there:

Find treatment data: You need to know when and where the policy was implemented. Implementation dates are sometimes published alongside existing papers, or compiled by organizations like NCSL.
Find outcome data: Look for data on an outcome at the same or more granular level as the treatment, covering both before and after the treatment. For example, if the policy varies at the county level, you need county-level (or more granular) measures of employment, health, education, etc.
Form a question: Let the available data shape your research question. You don't need to start with a grand question—start with what's feasible and let the question emerge from what you can actually measure.
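If your outcome data is more granular than the treatment, you can aggregate it up to the treatment's level. Below is a minimal sketch that collapses county-year counts to state-year rates. The state and county codes, counts, and populations are invented, and `aggregate_to_state_year` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical county-year records: (state, county, year, deaths, population).
county_rows = [
    ("AA", "001", 2010, 12, 100_000),
    ("AA", "002", 2010, 3, 50_000),
    ("AA", "001", 2011, 9, 100_000),
    ("AA", "002", 2011, 6, 50_000),
    ("BB", "001", 2010, 20, 400_000),
]

def aggregate_to_state_year(rows):
    """Sum counts within each state-year, then compute a rate per 100,000 people."""
    totals = defaultdict(lambda: [0, 0])  # (state, year) -> [deaths, population]
    for state, _county, year, deaths, population in rows:
        totals[(state, year)][0] += deaths
        totals[(state, year)][1] += population

    rates = {}
    for key, (deaths, population) in totals.items():
        if population <= 0:
            raise ValueError(f"non-positive population for {key}")
        rates[key] = 100_000 * deaths / population
    return rates

state_rates = aggregate_to_state_year(county_rows)
for (state, year), rate in sorted(state_rates.items()):
    print(state, year, f"{rate:.1f}")
```

The sketch sums counts first and divides afterwards, rather than averaging county rates, so that large counties carry their proper weight. Going the other way is not possible: state-level outcomes cannot be split into counties, which is why the outcome must be at least as granular as the treatment.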

That said, you can start from any corner of the triangle:

Start with a Question

What topic interests you? Then ask: where might I find quasi-random variation in my treatment? What data captures this?

Start with Data

Browse IPUMS, BRFSS, or FARS. What variables are there? What policies changed over the time period covered? What comparisons seem credible?

Start with Identification

Read the news for policy changes. Marijuana laws? Minimum wage? Paid leave? Then ask: what outcomes might be affected, and is there data?

Pro Tip: Read Published Papers

The best way to learn what makes a viable project is to read papers in your area of interest. Pay attention to: What's the identification strategy? Where did they get the data? Could you do something similar with a different policy or outcome?

Past 14.33 Papers

These papers from past 14.33 students are published in the UEA Journal:
