Getting Started | Intro to Data Analysis for Economics

Stata vs. R vs. Python: Which Should I Use?

This tutorial provides code in Stata, R, and Python. You can choose any of them, but the two course tutorials will be taught in Stata. Here's how they compare:

Stata

Not free — requires a license (MIT provides one)
Often faster for large datasets
Simpler syntax for common econometrics tasks
One dataset in memory at a time

R

Free and open-source
Excellent for visualization (ggplot2)
tidyverse makes data wrangling intuitive
Multiple datasets in memory

Python

Free and open-source
General-purpose — useful beyond statistics
Excellent for data manipulation (pandas)
Weaker for statistics and publication tables

My Recommendation

Python excels at data manipulation but is relatively weak for statistical analysis and especially for creating publication-ready tables. If you're comfortable with Python, consider using it for data cleaning and preparation, then switching to R or Stata for the actual analysis and reporting.

The Stata Mental Model

Stata is unusual: it assumes you're working with one dataset at a time. Commands like summarize or regress automatically operate on "the data" without you specifying which data. If you need to work with a second dataset, you save the current one, then use the new one. This feels strange if you're used to Python or R, but it makes simple analyses very concise.

For this class, any of the three works fine. The toggle at the top of the sidebar lets you switch between Stata, R, and Python code examples.

Installation

Stata

MIT provides Stata licenses for students. Visit the MIT IST Stata page to download and install.

R and RStudio

R is free and open-source. You'll also want RStudio, which provides a much nicer interface.

Option 1: Graphical installers

Download R from CRAN
Download RStudio from Posit

Option 2: Command line (Mac)

If you have Homebrew installed, this is faster:

# Install Homebrew first if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install R and RStudio
brew install --cask r
brew install --cask rstudio

Option 3: Command line (Windows)

If you have Chocolatey or Scoop:

# Using Chocolatey
choco install r.project r.studio

# Or using Scoop
scoop bucket add extras
scoop install r rstudio

Installing Packages

Stata, R, and Python all extend their functionality through packages (in Stata, user-written packages are distributed as "ado files"). Here's how to install them.

Stata

In Stata, use ssc install to install packages from the Statistical Software Components (SSC) archive:

* Install commonly used packages
ssc install estout      // For regression tables
ssc install reghdfe     // Fast fixed effects regression
ssc install ftools      // Required by reghdfe
ssc install coefplot    // For coefficient plots
ssc install binscatter  // For binned scatter plots

R

In R, we recommend using the pacman package, which automatically installs packages if they're not already installed:

# Install pacman if you don't have it, then load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, haven, fixest, modelsummary)

Python

In Python, use pip to install packages from the command line:

# Install commonly used packages
pip install pandas numpy matplotlib seaborn
pip install statsmodels linearmodels
pip install openpyxl  # For Excel files

Then at the top of your Python scripts:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

Put Imports at the Top of Every Script

In R, load packages with pacman::p_load() at the top of the script. In Python, put your import statements at the top. Anyone running your code can then see exactly which packages it needs.
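One way to make the required packages explicit in Python is to list them once at the top of the script and check that each is installed before anything else runs. The following is a minimal sketch, not part of the course materials. The names `REQUIRED` and `missing_packages` are made up for illustration, and the package list simply mirrors the imports shown above.

```python
import importlib.util

# Packages this script depends on (illustrative list; adjust to your project).
REQUIRED = ["pandas", "numpy", "statsmodels"]


def missing_packages(names):
    """Return the top-level package names that cannot be found on this machine."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only looks packages up without importing them, so it runs quickly and fails early with a readable message instead of an ImportError halfway through the analysis.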

Project Setup

Put Everything in Dropbox

Before doing anything else, create your project folder in Dropbox. This gives you automatic backups and version history. If you accidentally delete code, you can right-click the file and restore an earlier version from the last 30 to 180 days, depending on your Dropbox plan.

One-Command Setup Script

We provide a script that creates a well-organized project structure. Open your terminal and run:

Mac/Linux:

cd ~/Dropbox
curl -fsSL https://theodorecaputi.com/teaching/14.33/files/setup.sh | bash -s my_1433_project

Windows PowerShell:

cd ~\Dropbox
irm https://theodorecaputi.com/teaching/14.33/files/setup.ps1 | iex; Setup-Project "my_1433_project"

What Gets Created

my_1433_project/
├── master.do              # Runs everything (Stata)
├── master.R               # Runs everything (R)
├── README.md
│
├── build/                 # Data preparation
│   ├── input/             # Raw data goes here (NEVER edit these)
│   ├── code/              # Scripts to clean data
│   └── output/            # Cleaned data
│
├── analysis/              # Your analysis
│   ├── code/              # Regression scripts, etc.
│   └── output/
│       ├── tables/
│       └── figures/
│
└── paper/                 # Your writeup
    └── draft.tex
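
If you cannot run the setup script, the same folders can be created by hand or with a few lines of code. The sketch below is an assumption about what such a helper could look like, not the course's actual setup script. `FOLDERS` and `create_layout` are hypothetical names, and the sketch creates only the directories from the tree above, not master.do, master.R, README.md, or draft.tex.

```python
from pathlib import Path

# Folders from the tree above, relative to the project root.
FOLDERS = [
    "build/input",
    "build/code",
    "build/output",
    "analysis/code",
    "analysis/output/tables",
    "analysis/output/figures",
    "paper",
]


def create_layout(root):
    """Create the project folders under root; existing folders are left alone."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    return root
```

For example, `create_layout(Path.home() / "Dropbox" / "my_1433_project")` would build the layout inside your Dropbox folder. Running it a second time is harmless because existing folders are not touched.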

The Golden Rule

Never modify raw data. Files in build/input/ are sacred. All changes happen through code, with results saved to build/output/. This means you can always reproduce your results.
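
As a rough illustration of this rule, a cleaning script reads from build/input/, applies its changes in code, and writes the result to build/output/. The sketch below uses only Python's standard library. The function name `clean_raw_file` and the `income` column are hypothetical, chosen only to make the example concrete.

```python
import csv
from pathlib import Path


def clean_raw_file(raw_path, clean_path):
    """Copy a raw CSV to clean_path, dropping rows with a blank 'income' field.

    The raw file is only ever opened for reading and is never overwritten.
    Returns the number of rows kept.
    """
    raw_path, clean_path = Path(raw_path), Path(clean_path)
    if raw_path.resolve() == clean_path.resolve():
        raise ValueError("Refusing to overwrite the raw file.")

    # Read the raw data (read-only).
    with raw_path.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = [row for row in reader if (row.get("income") or "").strip() != ""]

    # Write the cleaned data to a separate location.
    clean_path.parent.mkdir(parents=True, exist_ok=True)
    with clean_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call such as `clean_raw_file("build/input/survey.csv", "build/output/survey_clean.csv")` (file names invented here) leaves the raw file byte-for-byte unchanged, so rerunning the script always rebuilds the same cleaned file.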

Interface Basics

Working Directory

The working directory is the folder your program looks in by default. Set it at the start of every script:

Stata:

* Set working directory
cd "/Users/yourname/Dropbox/my_1433_project"

* Check it worked
pwd

* Now use relative paths
use "build/output/cleaned_data.dta", clear

R:

# Set working directory
setwd("/Users/yourname/Dropbox/my_1433_project")

# Check it worked
getwd()

# Now use relative paths
library(haven)
data <- read_dta("build/output/cleaned_data.dta")

Python:

import os
import pandas as pd

# Set working directory
os.chdir("/Users/yourname/Dropbox/my_1433_project")

# Check it worked
print(os.getcwd())

# Now use relative paths
data = pd.read_stata("build/output/cleaned_data.dta")
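
A relative path like build/output/cleaned_data.dta only resolves correctly if the working directory really is the project folder, so it can help to verify that right after changing directory. The helper below is a sketch under that assumption. `set_project_dir` is a hypothetical name, and checking for a build/ folder is just one possible sanity check based on the project layout described earlier.

```python
import os
from pathlib import Path


def set_project_dir(project_root):
    """Change into project_root after confirming it looks like the project folder."""
    root = Path(project_root).expanduser()
    if not (root / "build").is_dir():
        raise FileNotFoundError(f"{root} does not contain a build/ folder")
    os.chdir(root)
    return Path.cwd()
```

Calling `set_project_dir("~/Dropbox/my_1433_project")` at the top of a script either switches to the project folder or stops immediately with a clear error, rather than failing later when the first data file cannot be found.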

Getting Help

Stata:

* Get help on a command
help regress

* Search for commands
search panel data

R:

# Get help on a function
?lm
help(lm)

# Search for functions
??regression

Python:

# Get help on a function
help(pd.read_csv)

# Or in Jupyter/IPython
pd.read_csv?

# Search documentation online
# https://pandas.pydata.org/docs/

Your First Script

Always Use Scripts

Never type commands directly into the console for real work. Write everything in a .do file (Stata), .R file (R), or .py file (Python). This is the #1 habit that separates beginners from competent researchers.

The test: Close everything, reopen, run your script. Does it work? If not, your analysis isn't reproducible.
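
For a Python script, one way to approximate this test without closing anything is to run the script in a brand-new interpreter, which starts with no variables or imports left over from an interactive session. This is a minimal sketch, and `runs_from_scratch` is a hypothetical helper name.

```python
import subprocess
import sys


def runs_from_scratch(script_path):
    """Run a Python script in a fresh interpreter; True if it exits without error."""
    result = subprocess.run([sys.executable, str(script_path)])
    return result.returncode == 0
```

If `runs_from_scratch("test.py")` returns False, the script depends on something that only existed in your interactive session, such as a variable defined by hand or a package loaded earlier.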

Create a new file called test.do, test.R, or test.py:

/*==============================================================================
    test.do
    My first Stata script
==============================================================================*/

clear all
set more off

* Load built-in dataset
sysuse auto, clear

* Look at the data
describe
summarize price mpg

* Simple regression
regress price mpg weight

* Save a graph
scatter price mpg
graph export "my_first_graph.png", replace

di "Script completed successfully!"

#===============================================================================
#   test.R
#   My first R script
#===============================================================================

# Load packages
library(tidyverse)

# Load built-in dataset
data(mtcars)

# Look at the data
str(mtcars)
summary(mtcars[, c("mpg", "hp", "wt")])

# Simple regression
model <- lm(mpg ~ hp + wt, data = mtcars)
summary(model)

# Save a graph
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "MPG vs Weight", x = "Weight (1000 lbs)", y = "MPG")
ggsave("my_first_graph.png")

cat("Script completed successfully!\n")

#===============================================================================
#   test.py
#   My first Python script
#===============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Load an example dataset (seaborn's mpg dataset, downloaded from GitHub;
# requires an internet connection)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
cars = pd.read_csv(url)

# Look at the data
cars.info()  # info() prints its report directly and returns None
print(cars[["mpg", "horsepower", "weight"]].describe())

# Simple regression (rows with missing values are dropped by the formula interface)
model = smf.ols("mpg ~ horsepower + weight", data=cars).fit()
print(model.summary())

# Save a graph
plt.figure(figsize=(8, 6))
plt.scatter(cars["weight"], cars["mpg"])
plt.xlabel("Weight")
plt.ylabel("MPG")
plt.title("MPG vs Weight")
plt.savefig("my_first_graph.png")

print("Script completed successfully!")

Run the entire script (not line by line!) and verify it works.
