Getting Started
Install your tools, set up a project, and write your first script
Stata vs. R vs. Python: Which Should I Use?
This tutorial provides code in Stata, R, and Python. You can choose any of them, but the two course tutorials will be taught in Stata. Here's how they compare:
Stata
- Not free — requires a license (MIT provides one)
- Often faster for large datasets
- Simpler syntax for common econometrics tasks
- One dataset in memory at a time
R
- Free and open-source
- Excellent for visualization (ggplot2)
- tidyverse makes data wrangling intuitive
- Multiple datasets in memory
Python
- Free and open-source
- General-purpose — useful beyond statistics
- Excellent for data manipulation (pandas)
- Weaker for statistics and publication tables
My Recommendation
Python excels at data manipulation but is relatively weak for statistical analysis and especially for creating publication-ready tables. If you're comfortable with Python, consider using it for data cleaning and preparation, then switching to R or Stata for the actual analysis and reporting.
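As a minimal sketch of that hand-off, the snippet below cleans a small table in pandas and writes it as a .dta file, which Stata can open with use and R can open with haven::read_dta(). The column names, the values, and the temporary output folder are all made up for illustration; in a real project the file would go in build/output/.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical raw data; in a real project this would be read from build/input/
raw = pd.DataFrame({
    "state": ["MA", "NY", None, "CA"],
    "income": ["52000", "61000", "48000", "not reported"],
})

# Clean in pandas: drop rows with no state, convert income to a number
# (values that cannot be parsed become missing)
clean = raw.dropna(subset=["state"]).copy()
clean["income"] = pd.to_numeric(clean["income"], errors="coerce")

# Write a Stata-format file for the analysis step in Stata or R
# (a temporary folder is used here only so the sketch runs anywhere)
out_path = Path(tempfile.mkdtemp()) / "cleaned_data.dta"
clean.to_stata(out_path, write_index=False)
print(f"Wrote {len(clean)} rows to {out_path}")
```

Passing write_index=False keeps the pandas row index out of the saved file, so the .dta contains only the real variables.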
The Stata Mental Model
Stata is unusual: by default, it assumes you're working with one dataset at a time. Commands like summarize or regress automatically operate on "the data" without you specifying which data. If you need to work with a second dataset, you save the current one, then use the new one. (Stata 16 and later can also hold extra datasets in frames, but most commands still act on the current one.) This feels strange if you're used to Python or R, but it makes simple analyses very concise.
For this class, any of the three works fine. Code examples are given for Stata, R, and Python, in that order.
Installation
Stata
MIT provides Stata licenses for students. Visit the MIT IST Stata page to download and install.
R and RStudio
R is free and open-source. You'll also want RStudio, which provides a much nicer interface.
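Option 1: Graphical installers
Download the R installer from CRAN (https://cran.r-project.org), then the RStudio Desktop installer from Posit (https://posit.co/download/rstudio-desktop/). Install R first.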
Option 2: Command line (Mac)
If you have Homebrew installed, this is faster:
# Install Homebrew first if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install R and RStudio
brew install --cask r
brew install --cask rstudio
Option 3: Command line (Windows)
If you have Chocolatey or Scoop:
# Using Chocolatey
choco install r.project r.studio
# Or using Scoop
scoop bucket add extras
scoop install r rstudio
Installing Packages
Stata, R, and Python all extend their functionality through packages (in Stata, these are distributed as "ado files"). Here's how to install them.
Stata
In Stata, use ssc install to install packages from the Statistical Software Components (SSC) archive:
* Install commonly used packages
ssc install estout // For regression tables
ssc install reghdfe // Fast fixed effects regression
ssc install ftools // Required by reghdfe
ssc install coefplot // For coefficient plots
ssc install binscatter // For binned scatter plots
R
In R, we recommend the pacman package. Its p_load() function installs any packages that are missing and then loads them:
# Install pacman if you don't have it, then load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, haven, fixest, modelsummary)
Python
In Python, use pip to install packages from the command line:
# Install commonly used packages
pip install pandas numpy matplotlib seaborn
pip install statsmodels linearmodels
pip install openpyxl # For Excel files
Then at the top of your Python scripts:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
Put Imports at the Top of Every Script
In R, use pacman::p_load(). In Python, put your import statements at the top. This ensures anyone running your code knows exactly what packages are needed.
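Python has no built-in equivalent of pacman::p_load(), but a script can at least check for its packages up front and fail with a clear message. The following is a rough sketch using only the standard library; the helper name missing_packages and the REQUIRED list are illustrative, and the list uses standard-library modules as stand-ins so the sketch runs anywhere.

```python
import importlib.util

# Packages this script needs. These are standard-library stand-ins;
# a real script would list names such as "pandas" or "statsmodels".
REQUIRED = ["json", "csv", "pathlib"]


def missing_packages(names):
    """Return the top-level package names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # Stop early with an actionable message instead of failing mid-analysis
    raise ImportError("Install these packages first: " + ", ".join(missing))
print("All required packages are available.")
```

Unlike p_load(), this sketch only reports what is missing; it does not install anything.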
Project Setup
Put Everything in Dropbox
Before doing anything else, create your project folder in Dropbox. This gives you automatic backups and version history. When you accidentally delete code, you can right-click the file and restore an earlier version from the last 30 to 180 days, depending on your Dropbox plan.
One-Command Setup Script
We provide a script that creates a well-organized project structure. Open your terminal and run:
Mac/Linux:
cd ~/Dropbox
curl -fsSL https://theodorecaputi.com/teaching/14.33/files/setup.sh | bash -s my_1433_project
Windows PowerShell:
cd ~\Dropbox
irm https://theodorecaputi.com/teaching/14.33/files/setup.ps1 | iex; Setup-Project "my_1433_project"
What Gets Created
my_1433_project/
├── master.do              # Runs everything (Stata)
├── master.R               # Runs everything (R)
├── README.md
│
├── build/                 # Data preparation
│   ├── input/             # Raw data goes here (NEVER edit these)
│   ├── code/              # Scripts to clean data
│   └── output/            # Cleaned data
│
├── analysis/              # Your analysis
│   ├── code/              # Regression scripts, etc.
│   └── output/
│       ├── tables/
│       └── figures/
│
└── paper/                 # Your writeup
    └── draft.tex
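If you cannot run the setup script, the same layout can be created by hand or with a few lines of code. The sketch below is not the course's setup script, only an approximation that builds the folders and empty placeholder files listed above; the function name create_project is made up, and the demo writes into a temporary folder rather than Dropbox.

```python
import tempfile
from pathlib import Path


def create_project(root):
    """Create the folder layout listed above under `root` and return the root path."""
    root = Path(root)
    folders = [
        "build/input",
        "build/code",
        "build/output",
        "analysis/code",
        "analysis/output/tables",
        "analysis/output/figures",
        "paper",
    ]
    for folder in folders:
        # exist_ok=True makes the function safe to run more than once
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Empty placeholders; files that already exist keep their contents
    for name in ["master.do", "master.R", "README.md", "paper/draft.tex"]:
        (root / name).touch(exist_ok=True)
    return root


# Demo in a temporary folder; in practice you would pass a path inside Dropbox
demo_root = create_project(Path(tempfile.mkdtemp()) / "my_1433_project")
print(sorted(p.name for p in demo_root.iterdir()))
```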
The Golden Rule
Never modify raw data. Files in build/input/ are sacred. All changes happen through code, with results saved to build/output/. This means you can always reproduce your results.
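One way to follow this rule in code is to make every cleaning step read from build/input/ and write a new file to build/output/. Here is a small sketch using only the standard library; the file name survey.csv, the income column, and the helper clean_raw_file are all hypothetical, and the demo runs in a temporary folder.

```python
import csv
import tempfile
from pathlib import Path


def clean_raw_file(raw_path, out_path):
    """Copy a raw CSV to a new file, dropping rows with a blank `income`.

    Returns the number of rows kept. The raw file is only ever read.
    """
    raw_path, out_path = Path(raw_path), Path(out_path)
    if raw_path.resolve() == out_path.resolve():
        raise ValueError("Refusing to overwrite the raw file")
    with raw_path.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    kept = [row for row in rows if row["income"].strip() != ""]
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)


# Demo: build a tiny fake raw file, then clean it into build/output/
project = Path(tempfile.mkdtemp())
raw_file = project / "build" / "input" / "survey.csv"
raw_file.parent.mkdir(parents=True)
raw_file.write_text("id,income\n1,52000\n2,\n3,48000\n")
clean_file = project / "build" / "output" / "survey_clean.csv"
n_kept = clean_raw_file(raw_file, clean_file)
print(n_kept, "rows kept; the raw file is unchanged")
```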
Interface Basics
Working Directory
The working directory is the folder your program looks in by default. Set it at the start of every script:
Stata:
* Set working directory
cd "/Users/yourname/Dropbox/my_1433_project"
* Check it worked
pwd
* Now use relative paths
use "build/output/cleaned_data.dta", clear
R:
# Set working directory
setwd("/Users/yourname/Dropbox/my_1433_project")
# Check it worked
getwd()
# Now use relative paths
library(haven)
data <- read_dta("build/output/cleaned_data.dta")
Python:
import os
import pandas as pd
# Set working directory
os.chdir("/Users/yourname/Dropbox/my_1433_project")
# Check it worked
print(os.getcwd())
# Now use relative paths
data = pd.read_stata("build/output/cleaned_data.dta")
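A mistyped path is a common reason relative paths fail later in a script. As an optional safeguard, the Python sketch below changes the working directory only after confirming that the folder looks like the project root. The helper name set_project_dir and the choice of build and analysis as marker folders are assumptions based on the project layout described earlier.

```python
import os
from pathlib import Path


def set_project_dir(path, expected=("build", "analysis")):
    """Change into `path` after checking it contains the expected subfolders."""
    path = Path(path).expanduser()
    missing = [name for name in expected if not (path / name).is_dir()]
    if missing:
        # Fail right away rather than at the first relative-path read
        raise FileNotFoundError(f"{path} is missing folders: {missing}")
    os.chdir(path)
    return Path.cwd()
```

A script would then start with a call such as set_project_dir("~/Dropbox/my_1433_project") and use relative paths from there.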
Getting Help
Stata:
* Get help on a command
help regress
* Search for commands
search panel data
R:
# Get help on a function
?lm
help(lm)
# Search for functions
??regression
Python:
# Get help on a function
help(pd.read_csv)
# Or in Jupyter/IPython
pd.read_csv?
# Search documentation online
# https://pandas.pydata.org/docs/
Your First Script
Always Use Scripts
Never type commands directly into the console for real work. Write everything in a .do file (Stata), .R file (R), or .py file (Python). This is the #1 habit that separates beginners from competent researchers.
The test: Close everything, reopen, run your script. Does it work? If not, your analysis isn't reproducible.
Create a new file called test.do, test.R, or test.py:
/*==============================================================================
test.do
My first Stata script
==============================================================================*/
clear all
set more off
* Load built-in dataset
sysuse auto, clear
* Look at the data
describe
summarize price mpg
* Simple regression
regress price mpg weight
* Save a graph
scatter price mpg
graph export "my_first_graph.png", replace
di "Script completed successfully!"
#===============================================================================
# test.R
# My first R script
#===============================================================================
# Load packages
library(tidyverse)
# Load built-in dataset
data(mtcars)
# Look at the data
str(mtcars)
summary(mtcars[, c("mpg", "hp", "wt")])
# Simple regression
model <- lm(mpg ~ hp + wt, data = mtcars)
summary(model)
# Save a graph
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "MPG vs Weight", x = "Weight (1000 lbs)", y = "MPG")
ggsave("my_first_graph.png")
cat("Script completed successfully!\n")
#===============================================================================
# test.py
# My first Python script
#===============================================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# Load an example dataset from the web (seaborn's mpg data; needs internet access)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
cars = pd.read_csv(url)
# Look at the data (info() prints its own report, so no print() is needed)
cars.info()
print(cars[["mpg", "horsepower", "weight"]].describe())
# Simple regression (rows with missing horsepower are dropped automatically)
model = smf.ols("mpg ~ horsepower + weight", data=cars).fit()
print(model.summary())
# Save a graph
plt.figure(figsize=(8, 6))
plt.scatter(cars["weight"], cars["mpg"])
plt.xlabel("Weight")
plt.ylabel("MPG")
plt.title("MPG vs Weight")
plt.savefig("my_first_graph.png")
print("Script completed successfully!")
Run the entire script (not line by line!) and verify it works.
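For Python scripts, one way to approximate the "close everything, reopen, run" test is to launch the script in a brand-new interpreter, so nothing left over from an earlier session can help it succeed. The helper below is a sketch with a made-up name, runs_cleanly; it treats an exit code of 0 as success.

```python
import subprocess
import sys


def runs_cleanly(script_path):
    """Run a Python script in a fresh interpreter; return True if it exits with code 0."""
    result = subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,  # collect output instead of printing it live
        text=True,
    )
    if result.returncode != 0:
        # Show the error so you can see what the fresh run was missing
        print(result.stderr)
    return result.returncode == 0
```

Calling runs_cleanly("test.py") from the project folder returns False if the script depends on variables, imports, or a working directory that only existed in your interactive session.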