Getting Started
Install your tools, set up a project, and write your first script
This guide will get you from zero to running your first regression. The goal isn't just to get code working—it's to set up habits that will make your work reproducible. Reproducibility means anyone (including future you) can run your code from scratch and get the same results. This is non-negotiable in empirical economics.
Stata vs. R vs. Python: Which Should I Use?
This tutorial provides code in Stata, R, and Python. You can choose any of them, but the two course tutorials will be taught in Stata. Here's how they compare:
Stata
- Not free — requires a license (MIT provides one)
- Often faster for large datasets
- Simpler syntax for common econometrics tasks
- One dataset in memory at a time
R
- Free and open-source
- Excellent for visualization (ggplot2)
- tidyverse makes data wrangling intuitive
- Multiple datasets in memory
Python
- Free and open-source
- General-purpose — useful beyond statistics
- Excellent for data manipulation (pandas)
- Fewer ready-made econometrics routines and publication-table tools than R or Stata
My Recommendation
Use R or Stata throughout your project unless you're working with very large datasets (100+ million observations). For big data, use Python or SQL for data manipulation, then switch to R or Stata for analysis and creating publication-ready output. R and Stata have much better support for econometric methods and producing the tables you'll need for papers.
The Stata Mental Model
Stata is unusual: it assumes you're working with one dataset at a time. Commands like summarize or regress automatically operate on "the data" without you specifying which data. If you need to work with a second dataset, you save the current one, then use the new one. (Stata 16 and later can hold additional datasets in memory as frames, but most Stata code still follows the one-dataset pattern.) This feels strange if you're used to Python or R, but it makes simple analyses very concise.
For this class, any of the three works fine. Code examples in this guide appear in Stata, R, and Python, in that order.
Installation
Stata
MIT provides Stata licenses for students:
- Go to econ-help.mit.edu
- Look under Desktop and find Department Wide StataNow/SE License
- Log in with your MIT credentials
- Download Stata (StataNow) for your OS (Windows, Mac, or Linux)
- Download the PDF with the license information
- Install Stata, then copy the license information from the PDF into the registration prompt to activate it
Stata is also available via the MIT IST Stata page, but that appears to be for Athena systems only.
R and RStudio
R is free and open-source. You may also want RStudio, which provides a nicer interface.
Option 1: Graphical installers
Download the R installer from CRAN (cran.r-project.org) and the RStudio Desktop installer from Posit (posit.co). Install R first, then RStudio.
Option 2: Command line (Mac)
If you have Homebrew installed, this is faster:
# Install Homebrew first if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install R and RStudio
brew install --cask r
brew install --cask rstudio
Option 3: Command line (Windows)
If you have Chocolatey or Scoop:
# Using Chocolatey
choco install r.project rstudio
# Or using Scoop
scoop bucket add extras
scoop install r rstudio
Python
Python is free and open-source. We recommend using Anaconda or Miniconda for scientific computing.
Option 1: Anaconda (recommended for beginners)
Download from anaconda.com. Anaconda includes Python, Jupyter notebooks, and most data science packages pre-installed.
Option 2: Command line (Mac)
# Using Homebrew
brew install python
# Or install Miniconda (lighter than Anaconda)
brew install --cask miniconda
Option 3: Command line (Windows)
# Using Chocolatey
choco install python anaconda3
# Or using Scoop (anaconda3 lives in the extras bucket)
scoop install python
scoop bucket add extras
scoop install anaconda3
Installing Packages
Out of the box, Stata, R, and Python can do basic statistics. But for modern econometrics (high-dimensional fixed effects, robust standard errors, publication-quality tables) you need additional packages written by the community. Think of packages as pre-written code that adds new commands to your toolkit. You install them once. In R and Python you then load them at the start of each script; in Stata, installed commands are available automatically.
Stata
In Stata, use ssc install to install packages from the Statistical Software Components (SSC) archive:
* Install commonly used packages
ssc install estout // For regression tables
ssc install reghdfe // Fast fixed effects regression
ssc install ftools // Required by reghdfe
ssc install coefplot // For coefficient plots
ssc install binscatter // For binned scatter plots
R
In R, we recommend using the pacman package, which automatically installs packages if they're not already installed:
# Install pacman if you don't have it, then load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, haven, fixest, modelsummary)
Python
In Python, use pip to install packages from the command line:
# Install commonly used packages
pip install pandas numpy matplotlib seaborn
pip install statsmodels linearmodels
pip install openpyxl # For Excel files
Then at the top of your Python scripts:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
Put Imports at the Top of Every Script
In R, use pacman::p_load(). In Python, put your import statements at the top. This way, anyone running your code can see exactly which packages it needs.
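One way to make the dependency list explicit in Python is to check, before any real work starts, that every required package can be found. The sketch below is an illustration rather than part of the course materials: the helper name missing_packages and the REQUIRED list are assumptions, chosen to match the imports shown above.

```python
import importlib.util

# Hypothetical list: the top-level import names this script depends on.
REQUIRED = ["pandas", "numpy", "statsmodels"]


def missing_packages(names):
    """Return the names in `names` that Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report the result up front, before any analysis code runs.
missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

In a real script you might stop immediately when the list is non-empty (for example with raise SystemExit), so that a collaborator sees a clear message instead of an error halfway through the analysis.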
Interface Basics
The Stata Interface
When you open Stata, you'll see four main windows:
- Command window — Type commands here and press Enter to run them
- Results window — Output from your commands appears here
- Variables window — Shows all variables in the current dataset
- Review window — History of commands you've run (click to re-run)
You can type commands directly, but it's better to write them in a do-file (a script). Go to File → New → Do-file to create one. This makes your work reproducible.
Working Directory
The working directory is the folder your program looks in by default. Set it at the start of every script:
* Set working directory
cd "/Users/yourname/Dropbox/my_1433_project"
* Check it worked
pwd
* Now use relative paths
use "build/output/cleaned_data.dta", clear
# Set working directory
setwd("/Users/yourname/Dropbox/my_1433_project")
# Check it worked
getwd()
# Now use relative paths
pacman::p_load(haven)
data <- read_dta("build/output/cleaned_data.dta")
import os
import pandas as pd
# Set working directory
os.chdir("/Users/yourname/Dropbox/my_1433_project")
# Check it worked
print(os.getcwd())
# Now use relative paths
data = pd.read_stata("build/output/cleaned_data.dta")
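A variation on the Python version above, offered as a sketch rather than the course's prescribed approach: instead of changing the working directory, keep the project root in one variable and build every path from it with pathlib. The function name project_path is hypothetical; the folder names reuse the example above.

```python
from pathlib import Path

# The only machine-specific line in the script (example path from above).
PROJECT_ROOT = Path("/Users/yourname/Dropbox/my_1433_project")


def project_path(*parts, root=PROJECT_ROOT):
    """Join path components onto the project root."""
    return Path(root).joinpath(*parts)


# Build the path to the cleaned data without hard-coding separators.
data_file = project_path("build", "output", "cleaned_data.dta")
print(data_file)
```

The resulting Path object can be passed straight to pd.read_stata(). A collaborator then only has to edit the PROJECT_ROOT line to run the script on their own machine.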
Getting Help
When you encounter an unfamiliar command or forget the syntax, use the built-in help system:
* Get help on a command
help regress
* Search for commands
search panel data
# Get help on a function
?lm
help(lm)
# Search for functions
??regression
# Get help on a function
import pandas as pd
help(pd.read_csv)
# Or in Jupyter/IPython only (this syntax is not valid in a .py script)
pd.read_csv?
# Search documentation online
# https://pandas.pydata.org/docs/
Your First Script
Always Use Scripts
Never type commands directly into the console for real work. Write everything in a .do file (Stata), .R file (R), or .py file (Python). This is the #1 habit that separates beginners from competent researchers.
The test: Close everything, reopen, run your script. Does it work? If not, your analysis isn't reproducible.
Create a new file called test.do, test.R, or test.py:
/*==============================================================================
test.do
My first Stata script
==============================================================================*/
clear all
set more off
* Load built-in dataset
sysuse auto, clear
* Look at the data
describe
summarize price mpg
* Simple regression
regress price mpg weight
* Save a graph
scatter price mpg
graph export "my_first_graph.png", replace
di "Script completed successfully!"
#===============================================================================
# test.R
# My first R script
#===============================================================================
# Load packages
pacman::p_load(tidyverse)
# Load built-in dataset
data(mtcars)
# Look at the data
str(mtcars)
summary(mtcars[, c("mpg", "hp", "wt")])
# Simple regression
model <- lm(mpg ~ hp + wt, data = mtcars)
summary(model)
# Save a graph
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "MPG vs Weight", x = "Weight (1000 lbs)", y = "MPG")
ggsave("my_first_graph.png", p, width = 8, height = 6)
cat("Script completed successfully!\n")
#===============================================================================
# test.py
# My first Python script
#===============================================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# Load an example dataset (seaborn's mpg data, downloaded from GitHub; needs internet)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
cars = pd.read_csv(url)
# Look at the data
cars.info()  # info() prints its report itself, so no print() is needed
print(cars[["mpg", "horsepower", "weight"]].describe())
# Simple regression (rows with missing horsepower are dropped automatically)
model = smf.ols("mpg ~ horsepower + weight", data=cars).fit()
print(model.summary())
# Save a graph
plt.figure(figsize=(8, 6))
plt.scatter(cars["weight"], cars["mpg"])
plt.xlabel("Weight")
plt.ylabel("MPG")
plt.title("MPG vs Weight")
plt.savefig("my_first_graph.png")
print("Script completed successfully!")
Run the entire script (not line by line!) and verify it works.
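The "close everything, reopen, run" test can be partly automated. The sketch below is an illustration, not part of the course materials. It runs a Python script in a fresh interpreter, so nothing left over from an interactive session can hide a missing import or an undefined variable. The function name runs_from_scratch is hypothetical.

```python
import subprocess
import sys


def runs_from_scratch(script):
    """Run `script` in a new Python process; return True if it exits cleanly."""
    result = subprocess.run(
        [sys.executable, str(script)],  # same interpreter, brand-new session
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Show the error output so the failure is easy to diagnose.
        print(result.stderr)
    return result.returncode == 0


if __name__ == "__main__":
    # Check every script named on the command line.
    for script in sys.argv[1:]:
        status = "ok" if runs_from_scratch(script) else "FAILED"
        print(f"{script}: {status}")
```

Saved under a name of your choosing (say check_run.py, an assumed filename), this could be invoked as python check_run.py test.py. It only checks that the script finishes without an error; whether the results are the same on every run is still yours to verify.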