Research and Communication in Economics

Getting Started

Install your tools, set up a project, and write your first script

This guide will get you from zero to running your first regression. The goal isn't just to get code working—it's to set up habits that will make your work reproducible. Reproducibility means anyone (including future you) can run your code from scratch and get the same results. This is non-negotiable in empirical economics.

Stata vs. R vs. Python: Which Should I Use?

This tutorial provides code in Stata, R, and Python. You can choose any of them, but the two course tutorials will be taught in Stata. Here's how they compare:

Stata

  • Not free — requires a license (MIT provides one)
  • Often faster for large datasets
  • Simpler syntax for common econometrics tasks
  • One dataset in memory at a time

R

  • Free and open-source
  • Excellent for visualization (ggplot2)
  • tidyverse makes data wrangling intuitive
  • Multiple datasets in memory

Python

  • Free and open-source
  • General-purpose — useful beyond statistics
  • Excellent for data manipulation (pandas)
  • Weaker for statistics and publication tables

My Recommendation

Use R or Stata throughout your project unless you're working with very large datasets (100+ million observations). For big data, use Python or SQL for data manipulation, then switch to R or Stata for analysis and creating publication-ready output. R and Stata have much better support for econometric methods and producing the tables you'll need for papers.
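The hand-off described above can be sketched in a few lines of pandas: collapse the raw data to the level you will analyze, then write a .dta file that Stata or R can open. The dataset, column names, and output file name below are made up for illustration, and the file is written to a temporary folder only so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up transaction-level data standing in for a very large file
transactions = pd.DataFrame({
    "state": ["MA", "MA", "NY", "NY", "NY"],
    "year": [2020, 2021, 2020, 2020, 2021],
    "sales": [10.0, 12.0, 7.0, 9.0, 11.0],
})

# Collapse to one row per state-year before handing off
collapsed = (
    transactions
    .groupby(["state", "year"], as_index=False)
    .agg(total_sales=("sales", "sum"), n_obs=("sales", "size"))
)

# Write a .dta file that Stata (use) or R (haven::read_dta) can open
out_path = Path(tempfile.gettempdir()) / "collapsed_sales.dta"
collapsed.to_stata(out_path, write_index=False)
print(collapsed)
```

In a real project the output would go into the project folder rather than a temporary directory, so the analysis script can find it with a relative path.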

The Stata Mental Model

Stata is unusual: it assumes you're working with one dataset at a time. Commands like summarize or regress automatically operate on "the data" without you specifying which data. If you need to work with a second dataset, you save the current one, then use the new one. This feels strange if you're used to Python or R, but it makes simple analyses very concise.
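For contrast, here is a minimal pandas sketch of the other model: each dataset is a named object, several can sit in memory at once, and every call states which one it acts on. The data frames `cars` and `prices` and their values are invented for illustration.

```python
import pandas as pd

# Two small, made-up datasets held in memory at the same time
cars = pd.DataFrame({"make": ["A", "B", "C"], "mpg": [22, 17, 30]})
prices = pd.DataFrame({"make": ["A", "B", "C"], "price": [4099, 4749, 3799]})

# Every operation names the dataset it acts on
mean_mpg = cars["mpg"].mean()
mean_price = prices["price"].mean()

# Combining the two is a single call; neither is saved or cleared first
merged = cars.merge(prices, on="make")

print(mean_mpg, mean_price)
print(merged)
```

The cost of this flexibility is verbosity: where Stata's summarize mpg is enough, pandas needs the data frame name every time.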

For this class, any of the three works fine. The toggle at the top of the sidebar lets you switch between Stata, R, and Python code examples.

Installation

Stata

MIT provides Stata licenses for students:

  1. Go to econ-help.mit.edu
  2. Look under Desktop and find Department Wide StataNow/SE License
  3. Log in with your MIT credentials
  4. Download Stata (StataNow) for your OS (Windows, Mac, or Linux)
  5. Download the PDF with the license information
  6. Install Stata, then copy the license information from the PDF into the registration prompt to activate it

Stata is also available via the MIT IST Stata page, but that appears to be for Athena systems only.

R and RStudio

R is free and open-source. You may also want RStudio, which provides a nicer interface.

Option 1: Graphical installers

  1. Download R from CRAN
  2. Download RStudio from Posit

Option 2: Command line (Mac)

If you have Homebrew installed, this is faster:

# Install Homebrew first if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install R and RStudio
brew install --cask r
brew install --cask rstudio

Option 3: Command line (Windows)

If you have Chocolatey or Scoop:

# Using Chocolatey
choco install r.project r.studio

# Or using Scoop
scoop bucket add extras
scoop install r rstudio

Python

Python is free and open-source. We recommend using Anaconda or Miniconda for scientific computing.

Option 1: Anaconda (recommended for beginners)

Download from anaconda.com. Anaconda includes Python, Jupyter notebooks, and most data science packages pre-installed.

Option 2: Command line (Mac)

# Using Homebrew
brew install python

# Or install Miniconda (lighter than Anaconda)
brew install --cask miniconda

Option 3: Command line (Windows)

# Using Chocolatey
choco install python anaconda3

# Or using Scoop (anaconda3 lives in the extras bucket)
scoop install python
scoop bucket add extras
scoop install anaconda3
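Once Python is installed, a quick check confirms which interpreter you are actually running. This is a minimal sketch; the minimum version below is a placeholder for illustration, not a course requirement.

```python
import platform
import sys

# Report which interpreter is running and its version
print("Python executable:", sys.executable)
print("Python version:", platform.python_version())

# Hypothetical minimum version for illustration; the course does not specify one
MIN_VERSION = (3, 8)
version_ok = sys.version_info >= MIN_VERSION
print("Meets assumed minimum:", version_ok)
```

If the executable path points somewhere unexpected (for example, a system Python instead of your Anaconda install), your terminal is probably picking up a different installation than you intended.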

Installing Packages

Out of the box, Stata, R, and Python can do basic statistics. But for modern econometrics—high-dimensional fixed effects, robust standard errors, publication-quality tables—you need additional packages written by the community. Think of packages as pre-written code that adds new commands to your toolkit. You install them once, then load them at the start of each script.

Stata

In Stata, you install packages using ssc install for packages from the Statistical Software Components archive:

* Install commonly used packages
ssc install estout      // For regression tables
ssc install reghdfe     // Fast fixed effects regression
ssc install ftools      // Required by reghdfe
ssc install coefplot    // For coefficient plots
ssc install binscatter  // For binned scatter plots

R

In R, we recommend using the pacman package, which automatically installs packages if they're not already installed:

# Install pacman if you don't have it, then load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, haven, fixest, modelsummary)

Python

In Python, use pip to install packages from the command line:

# Install commonly used packages
pip install pandas numpy matplotlib seaborn
pip install statsmodels linearmodels
pip install openpyxl  # For Excel files

Then at the top of your Python scripts:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

Put Imports at the Top of Every Script

In R, use pacman::p_load(). In Python, put your import statements at the top. This ensures anyone running your code knows exactly what packages are needed.
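Python has no direct equivalent of pacman::p_load(), but a script can at least report which packages are missing before it fails halfway through. A minimal sketch, assuming the package list shown; the helper name `missing_packages` is made up for illustration.

```python
import importlib.util

# Import names of the packages a script needs (this list is illustrative)
REQUIRED = ["pandas", "numpy", "statsmodels"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

Note that the import name and the pip name sometimes differ, so treat the suggested pip command as a starting point rather than a guarantee.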

Interface Basics

The Stata Interface

When you open Stata, you'll see four main windows:

  • Command window — Type commands here and press Enter to run them
  • Results window — Output from your commands appears here
  • Variables window — Shows all variables in the current dataset
  • Review window — History of commands you've run (click to re-run)

You can type commands directly, but it's better to write them in a do-file (a script). Go to File → New → Do-file to create one. This makes your work reproducible.

Working Directory

The working directory is the folder your program looks in by default. Set it at the start of every script:

Stata

* Set working directory
cd "/Users/yourname/Dropbox/my_1433_project"

* Check it worked
pwd

* Now use relative paths
use "build/output/cleaned_data.dta", clear

R

# Set working directory
setwd("/Users/yourname/Dropbox/my_1433_project")

# Check it worked
getwd()

# Now use relative paths
pacman::p_load(haven)
data <- read_dta("build/output/cleaned_data.dta")

Python

import os
import pandas as pd

# Set working directory
os.chdir("/Users/yourname/Dropbox/my_1433_project")

# Check it worked
print(os.getcwd())

# Now use relative paths
data = pd.read_stata("build/output/cleaned_data.dta")
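The same idea can be sketched with Python's pathlib: keep the project root in one variable and build every other path from it, so a collaborator only has to change a single line. The folder names mirror the example above; `PROJECT_ROOT` is an illustrative name and simply defaults to the current directory here.

```python
from pathlib import Path

# The only line a collaborator would need to change (defaults to the current folder)
PROJECT_ROOT = Path.cwd()

# Build every other path relative to the root
data_path = PROJECT_ROOT / "build" / "output" / "cleaned_data.dta"

print("Project root:", PROJECT_ROOT)
print("Data file:   ", data_path)
print("File exists: ", data_path.exists())
```

Printing whether the file exists is a cheap way to catch a wrong working directory before a later command fails with a less obvious error.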

Getting Help

When you encounter an unfamiliar command or forget the syntax, use the built-in help system:

Stata

* Get help on a command
help regress

* Search for commands
search panel data

R

# Get help on a function
?lm
help(lm)

# Search for functions
??regression

Python

import pandas as pd

# Get help on a function
help(pd.read_csv)

# Or in Jupyter/IPython (this syntax does not work in a plain .py script)
pd.read_csv?

# Search documentation online
# https://pandas.pydata.org/docs/
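When the full help page is more than you need, Python's standard inspect module can show just a function's parameters and the first line of its documentation. A minimal sketch, using json.dumps from the standard library as a stand-in; the same calls should work on functions such as pd.read_csv.

```python
import inspect
import json

# Show a function's parameters without opening the full help page
print(inspect.signature(json.dumps))

# Print only the first line of the docstring as a quick reminder
first_line = inspect.getdoc(json.dumps).splitlines()[0]
print(first_line)
```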

Your First Script

Always Use Scripts

Never type commands directly into the console for real work. Write everything in a .do file (Stata), a .R file (R), or a .py file (Python). This is the #1 habit that separates beginners from competent researchers.

The test: Close everything, reopen, run your script. Does it work? If not, your analysis isn't reproducible.

Create a new file called test.do, test.R, or test.py:

Stata

/*==============================================================================
    test.do
    My first Stata script
==============================================================================*/

clear all
set more off

* Load built-in dataset
sysuse auto, clear

* Look at the data
describe
summarize price mpg

* Simple regression
regress price mpg weight

* Save a graph
scatter price mpg
graph export "my_first_graph.png", replace

di "Script completed successfully!"

R

#===============================================================================
#   test.R
#   My first R script
#===============================================================================

# Load packages
pacman::p_load(tidyverse)

# Load built-in dataset
data(mtcars)

# Look at the data
str(mtcars)
summary(mtcars[, c("mpg", "hp", "wt")])

# Simple regression
model <- lm(mpg ~ hp + wt, data = mtcars)
summary(model)

# Save a graph
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "MPG vs Weight", x = "Weight (1000 lbs)", y = "MPG")
ggsave("my_first_graph.png", p, width = 8, height = 6)

cat("Script completed successfully!\n")

Python

#===============================================================================
#   test.py
#   My first Python script
#===============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Load an example dataset (seaborn's mpg data, downloaded from GitHub; needs internet)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
cars = pd.read_csv(url)

# Look at the data (info() prints its own output, so no print() is needed)
cars.info()
print(cars[["mpg", "horsepower", "weight"]].describe())

# Simple regression (rows with missing horsepower are dropped automatically)
model = smf.ols("mpg ~ horsepower + weight", data=cars).fit()
print(model.summary())

# Save a graph
plt.figure(figsize=(8, 6))
plt.scatter(cars["weight"], cars["mpg"])
plt.xlabel("Weight")
plt.ylabel("MPG")
plt.title("MPG vs Weight")
plt.savefig("my_first_graph.png")

print("Script completed successfully!")

Run the entire script (not line by line!) and verify it works.
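For the Python version, the "close everything, reopen, run" test can be approximated by launching the script in a brand-new interpreter process, which starts with no variables or imports left over from earlier work. A minimal sketch, assuming the script is named test.py as above; the helper name `runs_cleanly` is made up for illustration.

```python
import subprocess
import sys
from pathlib import Path


def runs_cleanly(script):
    """Run a script in a fresh Python process; True if it exits without error."""
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Show the error message the script produced
        print(result.stderr)
    return result.returncode == 0


script = Path("test.py")
if script.exists():
    print("Reproducible run:", runs_cleanly(script))
else:
    print("test.py not found in", Path.cwd())
```

A clean exit only shows that the script runs from scratch; whether it produces the same results each time is a separate check.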
