Age-Sex Standardized Poisson-Based Interrupted Time Series Analysis
Introduction
This tutorial describes how to perform an age-sex standardized interrupted time series analysis based upon Poisson regression using the R programming language.
When would you use Poisson-Based Interrupted Time Series Analysis (PBITSA)?
You Have Time Series Data of an Intervention/Policy. An interrupted time series analysis is a quasi-experimental method used to evaluate the impact of a policy or intervention by comparing trends in an outcome before and after the intervention is put in place.
You Want to Compare Rates Before/After a Specific Point. The most common form of interrupted time series analysis is based upon linear regression. This works if your outcome is a continuous variable (or is closely approximated by one). For example, you might look at the concentration of a particular air pollutant before and after some policy change. There are plenty of tutorials on this type of interrupted time series analysis available elsewhere, but in short, you create a dummy variable for whether or not the observation occurred after the interruption, and you regress the outcome on time, the interruption dummy, and an interaction between time and the interruption dummy, as sketched below.
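As a minimal sketch, assuming a data frame df with a continuous outcome, a time index time, and a 0/1 post-interruption dummy post (all hypothetical names), the linear version is:

## Minimal linear ITS sketch; 'df', 'outcome', 'time', and 'post' are hypothetical names
its_linear <- lm(outcome ~ time + post + time:post, data = df)
summary(its_linear)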
Your Outcome is a Count Variable. But what if your outcome isn’t a simple continuous variable? What if it’s a count variable? Say that you know the annual count of how many new lung cancer cases were found (lung cancer incidence) and how many people died from lung cancer (lung cancer mortality), and you want to know if these outcomes were affected by a certain policy. This variable is not continuous (you can’t have 5.5 lung cancer mortalities in a year), and the meaning of the count variable depends upon the base population size (6 lung cancer mortalities in a year might not sound like a lot, but it is if there are only 100 people in the population you’re studying). This is when you use PBITSA.
When would you want to age-sex standardize?
If age and sex affect your outcome and the age and sex distribution of the population is changing over time, then you need to take that into account by age-sex standardization. Let's say that we want to measure the impact of a policy on hard-candy consumption and that, over the study period, the population is becoming older. Older people eat more hard candy than younger people, so even if the within-age-group rates stay the same, the overall prevalence of hard-candy consumption would increase, giving you a spurious result. Age-sex standardization adjusts for changes to the age and sex distribution of the population over time.
Depending upon your research question, there might be other things you need to standardize throughout your study period. Say that you're looking at prostate cancer. Since prostate cancer affects only males, you shouldn't adjust for sex, but you still might want to standardize by age group.
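For reference, direct standardization computes the standardized rate as a weighted sum of the stratum-specific rates:

$$\text{ASR} = \sum_i r_i\, w_i, \qquad w_i = \frac{n_i}{\sum_j n_j}$$

where $r_i$ is the rate observed in age-sex stratum $i$ and $w_i$ is that stratum's share of the reference population (so the weights sum to 1).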
The Current Study
This tutorial is based upon my MPH Dissertation, which I completed at University College Cork in Cork, Republic of Ireland, in 2017-2018. My dissertation explored the effect of Ireland’s 2004 Comprehensive Workplace Smoking Ban on lung cancer incidence and mortality. Count data was available for lung cancer incidence and mortality, and so, to perform an interrupted time series analysis, I had to use PBITSA.
The Data
1994-2015 count data was available for lung cancer incidence and mortality such that every observation was identified by a combination of year X age-group X sex and lung cancer incidence and mortality were given as raw counts. That is, one observation could be that there were 126 lung cancer incidences and 68 lung cancer mortalities among females aged 50-54 in 2004, and another could be that there were 269 lung cancer incidences and 45 lung cancer mortalities among males aged 85+ in 2009.
For this analysis to work, you AT LEAST need population estimates in the same categories as those for your count variable (for this study, population data by year X age-group X sex).
Depending on the analysis you want to run, you might want to adjust for some confounders. For this analysis, active smoking and air quality are potential confounders. I had access to micro-data on smoking prevalence (proportion of active smokers) and smoking frequency (average number of cigarettes per day), and so I could aggregate that to either overall smoking prevalence/frequency per year or year X age-group X sex level smoking prevalence/frequency. Air quality data was only sparsely available (readings at different sites on different dates), and so I averaged it all together by year to create overall annual Particulate Matter 2.5 (PM2.5) concentration. In the main analysis (shown here), we include overall smoking prevalence and frequency as confounders in the model.
First, we have to do a bit of data cleaning.
## Housekeeping
rm(list=ls())
## Install packages
if(!require("pacman")) install.packages("pacman")
pacman::p_load(MASS, plyr, foreign, data.table, lubridate, rJava, WordR, ReporteRs, fakeR, car,
flextable, ggplot2, dplyr, devtools, formula.tools, stringr, rlang, lmtest)
install_github("tlcaputi/tlcPack"); library(tlcPack)
## Set working directory where the data is
setwd("C:/Users/tcapu/Google Drive/UCC/manuscript/SmokingBanLungCancer/analyze")
y <- readRDS("./input/data_new.rds")
y <- y %>%
rename(
lungMortality = lungMortalityCases,
lungIncidence = lungIncidenceCases,
brainMortality = brainMortalityCases,
brainIncidence = brainIncidenceCases,
smoke_rate = total_smoking_rate,
smoke_frequency = total_smoking_frequency,
age_sex_smoke_rate = age_sex_smoking_rate,
age_sex_smoke_frequency = age_sex_smoking_frequency,
smoke_price = real_price
)
y <- y %>%
mutate(lungMortality = ifelse(lungMortality=='..', 0, lungMortality)) %>%
mutate(
year = as.numeric(year),
population = as.numeric(population),
lungMortality = as.numeric(lungMortality),
lungIncidence = as.numeric(lungIncidence),
brainMortality = as.numeric(brainMortality),
brainIncidence = as.numeric(brainIncidence),
int2004 = ifelse(year<2004, "preban", "postban"),
year_from_baseline = year - min(year)
) %>% filter(sex!="Both sexes", year <=2015)
y <- y %>% mutate(
lungIncidence = ifelse(is.na(lungIncidence), 0, lungIncidence),
lungMortality = ifelse(is.na(lungMortality), 0, lungMortality),
brainIncidence = ifelse(is.na(brainIncidence), 0, brainIncidence),
brainMortality = ifelse(is.na(brainMortality), 0, brainMortality)
)
y <- y %>% group_by(year, sex) %>% mutate(totalPopulation = sum(population)) %>%
ungroup()
y <- y %>% mutate(
lungIncidenceRate = lungIncidence/population,
lungMortalityRate = lungMortality/population,
brainIncidenceRate = brainIncidence/population,
brainMortalityRate = brainMortality/population,
weighting = population/totalPopulation
) %>% mutate(
lungIncidenceSumUp = lungIncidenceRate * weighting,
lungMortalitySumUp = lungMortalityRate * weighting,
brainIncidenceSumUp = brainIncidenceRate * weighting,
brainMortalitySumUp = brainMortalityRate * weighting
)
y <- y %>% group_by(year) %>% mutate(
lungIncidence_ASR = sum(lungIncidenceSumUp),
lungMortality_ASR = sum(lungMortalitySumUp),
brainIncidence_ASR = sum(brainIncidenceSumUp),
brainMortality_ASR = sum(brainMortalitySumUp)
) %>% ungroup()
## Now we select only the variables we need for analysis
y <- y %>%
select(
sex, # Male or Female
year, # Year
year_from_baseline, # Year since 1994
ageGroup, # 5-year age groups from 40 to 85 and then 85+
population, # number of Irish people in the sex X age group
pm2p5, # Yearly particulate matter 2.5 concentration
pm2p5_lag3, # 3-year lagged particulate matter 2.5 concentration
smoke_rate, # Annual Irish smoking prevalence
smoke_frequency, # Annual avg. number of cigs per day
lungMortality # Cases of lung mortality within age X sex X year
) %>% arrange(ageGroup, sex, year) %>% ungroup()
Skipping install of 'tlcPack' from a github remote, the SHA1 (4d8015e0) has not changed since last install.
Use `force = TRUE` to force installation
And now we can take a look.
y %>% head()
sex | year | year_from_baseline | ageGroup | population | pm2p5 | pm2p5_lag3 | smoke_rate | smoke_frequency | lungMortality |
---|---|---|---|---|---|---|---|---|---|
Female | 1994 | 0 | 40-44 years | 116.5 | 30.20692 | NA | 0.2465582 | 3.782245 | 5 |
Female | 1995 | 1 | 40-44 years | 118.6 | 29.08168 | NA | 0.2428202 | 3.760822 | 9 |
Female | 1996 | 2 | 40-44 years | 120.4 | 27.95643 | NA | 0.2390823 | 3.739400 | 9 |
Female | 1997 | 3 | 40-44 years | 124.0 | 26.83119 | 30.20692 | 0.2353444 | 3.717978 | 5 |
Female | 1998 | 4 | 40-44 years | 126.3 | 25.70594 | 29.08168 | 0.2316065 | 3.696555 | 10 |
Female | 1999 | 5 | 40-44 years | 128.2 | 24.58070 | 27.95643 | 0.2278686 | 3.675133 | 6 |
The Model
For simplicity, in this tutorial, we’ll focus on lung cancer mortality. You could apply the same technique to lung cancer incidence just by changing the left-hand side variable name.
Basic (Linear) Model
The basic model is pretty straightforward. You predict lung cancer incidence or mortality using a linear term for time, a dummy variable for whether the observation falls before or after the interruption, an interaction term between time and the dummy variable, and additional terms for confounders.
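Written out, the basic model is:

$$Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t \times D_t) + \gamma^\top Z_t + \varepsilon_t$$

where $t$ is time, $D_t$ is the post-interruption dummy, and $Z_t$ holds the confounders; $\beta_2$ captures the immediate level change and $\beta_3$ captures the change in slope.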
The Poisson Model
Recall that a Poisson model is not linear; it's log-linear. That means that we are essentially predicting the natural logarithm of the outcome, rather than the outcome itself.
The outcome variable we are investigating is the count of either lung cancer incidence or mortality. But, of course, the count itself doesn't tell you much without knowing the whole population; it's like having the numerator without the denominator. So you have to adjust for the population using an offset term.
An offset term is a term whose model coefficient is fixed at exactly 1. Essentially, what we are saying here is that, before we model everything else, we need to account for a certain known quantity on the right-hand side of the model. Without the offset term, we are predicting the natural logarithm of the count variable. With the offset on the right-hand side (equivalent to a negative offset on the left-hand side), we are predicting:

$$\log(E[\text{count}]) - \log(\text{population}) = \beta^\top X$$

Recall from high school math:

$$\log(a) - \log(b) = \log\!\left(\frac{a}{b}\right)$$

Therefore, the left-hand side of the model is now equivalent to the natural logarithm of the count over the population, which is the rate:

$$\log\!\left(\frac{E[\text{count}]}{\text{population}}\right) = \beta^\top X$$
For the mathematically/statistically inclined, the final PBITSA model is formulated as follows:

$$\log(E[Y_t]) = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t \times D_t) + \gamma^\top Z_t + \log(\text{population}_t)$$

where $Y_t$ is the case count, $t$ is years from baseline, $D_t$ is the post-interruption dummy, and $Z_t$ holds the confounders (and, later, the lag terms). The code below first fits the base model without the time and interruption terms; those are added in the sections that follow.
basefit_formula <- as.formula("lungMortality ~ ageGroup + sex + offset(log(population)) + smoke_rate + smoke_frequency")
basefit <- glm(basefit_formula, data = y, family = poisson)
summary(basefit)
Call:
glm(formula = basefit_formula, family = poisson, data = y)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.3074 -1.1027 -0.0426 0.9688 4.8095
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.96358 0.13177 -30.080 < 2e-16 ***
ageGroup45-49 years 0.91204 0.06946 13.131 < 2e-16 ***
ageGroup50-54 years 1.83876 0.06327 29.063 < 2e-16 ***
ageGroup55-59 years 2.51632 0.06112 41.169 < 2e-16 ***
ageGroup60-64 years 3.11410 0.06000 51.900 < 2e-16 ***
ageGroup65-69 years 3.60885 0.05945 60.702 < 2e-16 ***
ageGroup70-74 years 3.97289 0.05924 67.059 < 2e-16 ***
ageGroup75-79 years 4.25016 0.05925 71.730 < 2e-16 ***
ageGroup80-84 years 4.37474 0.05971 73.269 < 2e-16 ***
ageGroup85+ years 4.29703 0.06070 70.792 < 2e-16 ***
sexMale 0.65510 0.01093 59.921 < 2e-16 ***
smoke_rate 2.27386 0.40570 5.605 2.09e-08 ***
smoke_frequency 0.01430 0.05090 0.281 0.779
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 46216.5 on 439 degrees of freedom
Residual deviance: 1047.4 on 427 degrees of freedom
AIC: 3642.6
Number of Fisher Scoring iterations: 4
Autocorrelation
We should consider whether there is significant autocorrelation in the residuals of the model. A common rule of thumb is to look for a Durbin-Watson test statistic between 1.5 and 2.5: if it's in that range, you can proceed without lag terms; if not, you should consider testing for significant lag terms.
Here, we test 1-, 2-, and 3-year lag terms directly through a stepwise regression and add every significant lag term to the model as a confounder.
Because we include lagged values of the outcome as covariates, the linear predictor becomes:

$$\log(E[Y_t]) = \beta^\top X_t + \sum_{j=1}^{3} \delta_j\, Y_{t-j}$$

where only the lags retained by the stepwise selection enter the final model.
durbinWatsonTest(basefit, max.lag=3)
lag_names <- c()
y$age_sex_group <- paste0(y$ageGroup, y$sex)
for(k in unique(y$age_sex_group)){
for (j in 1:3){
nam <- paste0("lungMortality_lag", j)
y[y$age_sex_group==k,nam] <- lag(as.numeric(unlist(y[y$age_sex_group==k, "lungMortality"])), j)
lag_names <- unique(c(lag_names, nam)) ## unique() avoids duplicate lag names across groups
}
}
mod_w_lag_formula <- as.formula(paste(c(as.character(basefit_formula), lag_names), collapse="+"))
mod_w_lag <- lm(mod_w_lag_formula, data = na.omit(y))
step1 <- step(mod_w_lag)
lags <- grep("lag", names(step1$coefficients), value=T)
if(length(lags)>0){
formula_w_lags <- as.formula(paste(c(as.character(basefit_formula), lags), collapse="+"))
} else {
formula_w_lags <- basefit_formula
}
model_w_lags <- glm(formula_w_lags, data = y, family = poisson)
summary(model_w_lags)
lag Autocorrelation D-W Statistic p-value
1 0.4639009 1.059313 0
2 0.4329246 1.113317 0
3 0.4101562 1.142450 0
Alternative hypothesis: rho[lag] != 0
Start: AIC=1873.52
lungMortality ~ ageGroup + sex + offset(log(population)) + smoke_rate +
smoke_frequency + lungMortality_lag1 + lungMortality_lag2 +
lungMortality_lag3
Df Sum of Sq RSS AIC
<none> 48352 1873.5
- sex 1 367.9 48720 1874.4
- smoke_frequency 1 779.1 49131 1877.6
- smoke_rate 1 1425.4 49777 1882.6
- ageGroup 9 4984.7 53337 1892.8
- lungMortality_lag3 1 3520.0 51872 1898.2
- lungMortality_lag1 1 4415.0 52767 1904.7
- lungMortality_lag2 1 5089.0 53441 1909.5
Call:
glm(formula = formula_w_lags, family = poisson, data = y)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.1888 -0.9825 0.0378 0.9864 3.5148
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.9708867 0.1360732 -29.182 < 2e-16 ***
ageGroup45-49 years 0.9432878 0.0753907 12.512 < 2e-16 ***
ageGroup50-54 years 1.8024521 0.0697899 25.827 < 2e-16 ***
ageGroup55-59 years 2.4349953 0.0698428 34.864 < 2e-16 ***
ageGroup60-64 years 2.9768871 0.0740942 40.177 < 2e-16 ***
ageGroup65-69 years 3.4006834 0.0800485 42.483 < 2e-16 ***
ageGroup70-74 years 3.7351003 0.0845788 44.161 < 2e-16 ***
ageGroup75-79 years 4.0287955 0.0843779 47.747 < 2e-16 ***
ageGroup80-84 years 4.2312856 0.0763421 55.425 < 2e-16 ***
ageGroup85+ years 4.2368000 0.0712594 59.456 < 2e-16 ***
sexMale 0.5290185 0.0216730 24.409 < 2e-16 ***
smoke_rate 2.2479488 0.4717015 4.766 1.88e-06 ***
smoke_frequency 0.0243775 0.0519945 0.469 0.6392
lungMortality_lag1 0.0002108 0.0004110 0.513 0.6080
lungMortality_lag2 0.0007068 0.0004067 1.738 0.0823 .
lungMortality_lag3 0.0009069 0.0004060 2.233 0.0255 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 40400.92 on 379 degrees of freedom
Residual deviance: 792.05 on 364 degrees of freedom
(60 observations deleted due to missingness)
AIC: 3047.7
Number of Fisher Scoring iterations: 4
Reconsidering the Interruption
In our study, we need to decide which “interruption” we want to use. There is a latency period between smoke exposure and lung cancer, but we don’t know exactly what it is. Here, we test all possible breakpoints between 2004 and 2010, and we use the most significant model (i.e., the model with the largest-magnitude, most negative z-value associated with the time-interruption interaction term).
In your analysis, you may have to make similar or different adjustments (depending upon your research question).
## Perform a GLM and return only the interaction term
glm_int <- function(formula, data, family){
mod <- glm(formula, data, family=family)
variables <- all.vars(mod$terms)
mod_coeff <- data.frame(summary(mod)$coefficients)
int_variable <- grep("year_from_baseline:int", rownames(mod_coeff), value=T)
mod_coeff$rownames <- rownames(mod_coeff)
z_value <- mod_coeff[grep("year_from_baseline:int", rownames(mod_coeff)),"z.value"]
mat <- as.data.frame(matrix(NA, nrow=1, ncol=2))
names(mat) <- c("int_year", "z_val")
mat$int_year <- as.numeric(gsub("year_from_baseline:int|postban", "", int_variable))
mat$z_val <- as.numeric(z_value)
return(mat)
}
## Take a list of glm_int objects and return the most significant negative value.
most_significant_value <- function(list_of_models){
data <- Reduce(function(x,y) {rbind.data.frame(x, y)}, list_of_models)
which_is_min <- which(data$z_val==min(data$z_val, na.rm=T))
min_year <- data[which_is_min, "int_year"]
if(min(data$z_val, na.rm=T) > -1.96) warning("No significant results!")
print(paste0(min_year, ": z-value is ", min(data$z_val, na.rm=T)))
return(as.numeric(min_year))
}
for(i in 2003:2010){
interruption_name <- paste0("int", i)
values <- ifelse(y$year < i, "preban", "postban")
y[,interruption_name] <- factor(values, levels=c("preban", "postban"))
}
# Test models with interruption points between 0 and 6 years after the ban
int2004_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2004")), y, family=poisson)
int2005_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2005")), y, family=poisson)
int2006_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2006")), y, family=poisson)
int2007_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2007")), y, family=poisson)
int2008_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2008")), y, family=poisson)
int2009_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2009")), y, family=poisson)
int2010_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2010")), y, family=poisson)
models_to_test <- list(
int2004_model,
int2005_model,
int2006_model,
int2007_model,
int2008_model,
int2009_model,
int2010_model
)
int <- most_significant_value(models_to_test)
interruption <- paste0("int", int)
print(interruption)
[1] "2006: z-value is-2.39163967446975"
[1] "int2006"
Now let's run the model.
Results
First, remember that all of these coefficients are on a natural log scale. You need to exponentiate them in order to get the relevant rate ratios (RRs); a short sketch of this follows the model output below. The row labelled int2006postban corresponds to the level or “immediate” shift occurring right after the ban. The row labelled year_from_baseline:int2006postban corresponds to the slope or “gradual” change occurring after the ban.
final_model_formula <- as.formula(paste0(formula_w_lags, "+ year_from_baseline *", interruption))
final_model <- glm(final_model_formula, data = y, family = poisson)
print(as.character(final_model_formula))
print(summary(final_model))
[1] "lungMortality ~ ageGroup + sex + offset(log(population)) + smoke_rate + smoke_frequency +
lungMortality_lag1 + lungMortality_lag2 + lungMortality_lag3 + year_from_baseline * int2006"
Call:
glm(formula = final_model_formula, family = poisson, data = y)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.2709 -0.9116 0.0089 0.9092 3.5758
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.7339134 0.2378258 -15.700 <2e-16 ***
ageGroup45-49 years 0.9414212 0.0753927 12.487 <2e-16 ***
ageGroup50-54 years 1.7958452 0.0698201 25.721 <2e-16 ***
ageGroup55-59 years 2.4222705 0.0699616 34.623 <2e-16 ***
ageGroup60-64 years 2.9571266 0.0743581 39.769 <2e-16 ***
ageGroup65-69 years 3.3736395 0.0805245 41.896 <2e-16 ***
ageGroup70-74 years 3.7029407 0.0852492 43.437 <2e-16 ***
ageGroup75-79 years 3.9969385 0.0850150 47.015 <2e-16 ***
ageGroup80-84 years 4.2079389 0.0766959 54.865 <2e-16 ***
ageGroup85+ years 4.2219335 0.0714159 59.118 <2e-16 ***
sexMale 0.5183965 0.0219768 23.588 <2e-16 ***
smoke_rate 0.1317331 1.5021274 0.088 0.9301
smoke_frequency 0.0916101 0.0678737 1.350 0.1771
lungMortality_lag1 0.0002723 0.0004151 0.656 0.5119
lungMortality_lag2 0.0007950 0.0004076 1.951 0.0511 .
lungMortality_lag3 0.0009748 0.0004115 2.369 0.0178 *
year_from_baseline -0.0007664 0.0068449 -0.112 0.9109
int2006postban 0.1128675 0.0623443 1.810 0.0702 .
year_from_baseline:int2006postban -0.0104927 0.0043872 -2.392 0.0168 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 40400.92 on 379 degrees of freedom
Residual deviance: 782.04 on 361 degrees of freedom
(60 observations deleted due to missingness)
AIC: 3043.7
Number of Fisher Scoring iterations: 4
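As noted above, the coefficients are on the log scale. Here is a minimal sketch of converting them to rate ratios; using Wald intervals via confint.default is my illustrative choice, not something from the original analysis.

## Exponentiate coefficients to obtain rate ratios with Wald 95% CIs
## (confint.default gives Wald intervals; an illustrative choice)
rr_table <- exp(cbind(RR = coef(final_model), confint.default(final_model)))
round(rr_table, 3)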
Subgroup Analyses
We can also compute this same model for a certain subgroup; we just need to make sure we take out the covariate in the model that corresponds with the subgroup. For example, it doesn't make much sense to “adjust for” sex if we're only looking at females, or to “adjust for” age if we're only looking at one age group. (R will also throw an error if a factor covariate has only one level in the subset.)
## Sex Group
sex_final_formula <- as.formula(
paste0("lungMortality ~ ageGroup + offset(log(population)) + smoke_rate + smoke_frequency + ",
paste(lags, collapse="+"),
" + year_from_baseline * ", interruption)
)
for (i in unique(y$sex)){
print(i)
temp_mod <- glm(sex_final_formula, data = y[y$sex == i, ], family = poisson) ## fit within the subgroup only
print(summary(temp_mod))
}
## Age Groups
age_final_formula <- as.formula(
paste0("lungMortality ~ sex + offset(log(population)) + smoke_rate + smoke_frequency + ",
paste(lags, collapse="+"),
" + year_from_baseline * ", interruption)
)
# for (i in unique(y$ageGroup)){
# print(i)
# temp_mod <- glm(age_final_formula, data = y[y$ageGroup == i, ], family = poisson)
# print(summary(temp_mod))
# }
[1] "Female"
Call:
glm(formula = sex_final_formula, family = poisson, data = y)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.7861 -1.0362 -0.1191 1.0275 6.3739
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.7479632 0.2375117 -15.780 < 2e-16 ***
ageGroup45-49 years 0.8766539 0.0753350 11.637 < 2e-16 ***
ageGroup50-54 years 1.5875764 0.0691470 22.959 < 2e-16 ***
ageGroup55-59 years 2.0240121 0.0675392 29.968 < 2e-16 ***
ageGroup60-64 years 2.2919754 0.0681191 33.647 < 2e-16 ***
ageGroup65-69 years 2.4613264 0.0698749 35.225 < 2e-16 ***
ageGroup70-74 years 2.6313596 0.0712050 36.955 < 2e-16 ***
ageGroup75-79 years 2.9272426 0.0707465 41.377 < 2e-16 ***
ageGroup80-84 years 3.4139383 0.0682269 50.038 < 2e-16 ***
ageGroup85+ years 3.6589453 0.0672914 54.375 < 2e-16 ***
smoke_rate 1.4226511 1.4993838 0.949 0.343
smoke_frequency 0.0804042 0.0678472 1.185 0.236
lungMortality_lag1 0.0024542 0.0004047 6.064 1.33e-09 ***
lungMortality_lag2 0.0033937 0.0003950 8.593 < 2e-16 ***
lungMortality_lag3 0.0036391 0.0003960 9.189 < 2e-16 ***
year_from_baseline 0.0072249 0.0068221 1.059 0.290
int2006postban 0.2447550 0.0621743 3.937 8.26e-05 ***
year_from_baseline:int2006postban -0.0255733 0.0043457 -5.885 3.99e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 40400.9 on 379 degrees of freedom
Residual deviance: 1341.7 on 362 degrees of freedom
(60 observations deleted due to missingness)
AIC: 3601.4
Number of Fisher Scoring iterations: 4
[1] "Male"
Call:
glm(formula = sex_final_formula, family = poisson, data = y)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.7861 -1.0362 -0.1191 1.0275 6.3739
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.7479632 0.2375117 -15.780 < 2e-16 ***
ageGroup45-49 years 0.8766539 0.0753350 11.637 < 2e-16 ***
ageGroup50-54 years 1.5875764 0.0691470 22.959 < 2e-16 ***
ageGroup55-59 years 2.0240121 0.0675392 29.968 < 2e-16 ***
ageGroup60-64 years 2.2919754 0.0681191 33.647 < 2e-16 ***
ageGroup65-69 years 2.4613264 0.0698749 35.225 < 2e-16 ***
ageGroup70-74 years 2.6313596 0.0712050 36.955 < 2e-16 ***
ageGroup75-79 years 2.9272426 0.0707465 41.377 < 2e-16 ***
ageGroup80-84 years 3.4139383 0.0682269 50.038 < 2e-16 ***
ageGroup85+ years 3.6589453 0.0672914 54.375 < 2e-16 ***
smoke_rate 1.4226511 1.4993838 0.949 0.343
smoke_frequency 0.0804042 0.0678472 1.185 0.236
lungMortality_lag1 0.0024542 0.0004047 6.064 1.33e-09 ***
lungMortality_lag2 0.0033937 0.0003950 8.593 < 2e-16 ***
lungMortality_lag3 0.0036391 0.0003960 9.189 < 2e-16 ***
year_from_baseline 0.0072249 0.0068221 1.059 0.290
int2006postban 0.2447550 0.0621743 3.937 8.26e-05 ***
year_from_baseline:int2006postban -0.0255733 0.0043457 -5.885 3.99e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 40400.9 on 379 degrees of freedom
Residual deviance: 1341.7 on 362 degrees of freedom
(60 observations deleted due to missingness)
AIC: 3601.4
Number of Fisher Scoring iterations: 4
Producing the Figure
Now it’s time to produce the figure.
So far, we've estimated the effect of the intervention and tested it for statistical significance. Now we want to visualize how the lung cancer mortality rate has changed and the difference between what we would expect with and without the smoking ban (adjusting for our confounders).
To do that, we need two versions of the data:
- The actual data that we used in our analysis
- A “counterfactual” dataset, representing what we would expect had there never been a smoking ban, created by setting the interruption variable to “preban” for all years.
Standardized Prediction Rates
Fortunately, R has a very helpful function, predict, that uses the model and all the confounders in our dataset to predict values over a given dataset. We compute this for our original data and for our counterfactual data.
Still, these predicted values are counts within year X sex X age groups, and we need rates (count/population) by year. We get these by converting each group's predicted count to a rate and then taking a population-weighted mean (weighted.mean) of those rates within each year, which accounts for differences in the population size of the different sex X age groups. These rates are age-and-sex standardized.
Observed Rates
We also want to capture the actual observed rates of lung cancer mortality (not adjusting for our confounders). This is pretty easy: for each year, just take the total number of cases and divide it by the total population.
counterfactual <- y
counterfactual[,interruption] <- "preban"
y <- y %>% mutate(
crude_rate_orig = (lungMortality)/population,
with_smoking_ban = predict(final_model, newdata = y, type = "response"),
without_smoking_ban = predict(final_model, newdata = counterfactual, type = "response"),
with_smoking_ban_rate = predict(final_model, newdata = y, type = "response")/population,
without_smoking_ban_rate = predict(final_model, newdata = counterfactual, type = "response")/population
)
## aggregate dataset from age X sex X year observations to year observations to create a figure
fig_dataset <- y %>% group_by(year) %>%
transmute(
crude_rate = sum(lungMortality)/sum(population),
crude_rate_wtd = weighted.mean(crude_rate_orig, population, na.rm=T),
with_smoking_ban_rate = weighted.mean(with_smoking_ban_rate, population, na.rm=T),
without_smoking_ban_rate = weighted.mean(without_smoking_ban_rate, population, na.rm=T),
without_smoking_ban_rate_direct = sum(without_smoking_ban)/sum(population),
with_smoking_ban_rate_direct = sum(with_smoking_ban)/sum(population)
) %>% ungroup()
fig_dataset <- fig_dataset %>%
select(year, crude_rate, crude_rate_wtd, with_smoking_ban_rate, without_smoking_ban_rate,
with_smoking_ban_rate_direct, without_smoking_ban_rate_direct) %>%
unique()
post_interruption <- fig_dataset %>% filter(year >= int - 1)
## Set parameters for Jupyter Notebook
par(mar=c(5.1,5.1,4.1,2.1))
options(repr.plot.width=8.5, repr.plot.height=5.5)
plot(fig_dataset$year, fig_dataset$crude_rate_wtd, xlab="Year", ylab="Lung Cancer Mortality\n(per 1,000 people aged 40+)",
ylim=c(min(c(fig_dataset$without_smoking_ban_rate,fig_dataset$with_smoking_ban_rate,fig_dataset$with_smoking_ban_rate_direct),na.rm=T),
max(c(fig_dataset$without_smoking_ban_rate,fig_dataset$with_smoking_ban_rate,fig_dataset$with_smoking_ban_rate_direct),na.rm=T))
)
lines(fig_dataset$year, fig_dataset$with_smoking_ban_rate, type="l",col="blue")
lines(post_interruption$year, post_interruption$without_smoking_ban_rate, type="l",lty="dashed",col="red")
abline(v=int,col="black")
abline(v=2004,lty="dashed",col="darkgreen")
legend("bottomleft",legend=c("Predicted Rate Without Smoking Ban","Standardized-Adjusted Rate",
paste0(int," Modeled Interruption"),"2004 Smoking Ban", "Observed Crude Rate"),
col=c("red","blue","black","darkgreen","black"),lty=c(2,1,1,2, NA),pch=c(NA,NA,NA,NA, 1))
Interpreting the Figure
Based upon this figure, we may want to calculate a few numbers, such as the average number of mortalities avoided per year. I find it useful to print a summary.
Finding the 95% confidence interval is not immediately obvious. Here, we bootstrap the confidence interval.
## Calculate fitted and observed values and estimate differences using bootstrapping
complete <- y %>% ungroup() %>% filter(!is.na(without_smoking_ban), !is.na(with_smoking_ban))
fitted_values <- as.numeric(complete$without_smoking_ban)
observed_values <- as.numeric(complete$with_smoking_ban)
## Set seed to replicate
set.seed(1234)
B <- replicate(1000, expr={
ix <- sample(1:length(observed_values), length(observed_values), replace=T)
(sum(fitted_values[ix])-sum(observed_values[ix]))
})
### Create Summary
## Rounding function
r2 <- function(x) round(x,2)
## Creates a string that looks like a confidence interval
ci95 <- function(a, b, c) paste0(r2(a), " (95%CI ", r2(b), " to ", r2(c), ")")
post_interruption_cases <- r2(sum(y[y$year >=int, "lungMortality"],na.rm=T))
total_difference <- r2(sum(y$without_smoking_ban - y$with_smoking_ban, na.rm=T))
total_difference_lo95 <- r2(quantile(B, 0.025, na.rm=T))
total_difference_hi95 <- r2(quantile(B, 0.975, na.rm=T))
percent_cases_avoided <- ci95(
total_difference/sum(y[y$year >=int, ]$without_smoking_ban, na.rm=T) * 100,
total_difference_lo95/sum(y[y$year >=int, ]$without_smoking_ban, na.rm=T) * 100,
total_difference_hi95/sum(y[y$year >=int, ]$without_smoking_ban, na.rm=T) * 100
)
total_cases_avoided <- ci95(
total_difference,
total_difference_lo95,
total_difference_hi95
)
yearly_cases_avoided <- ci95(
total_difference/(max(y$year - int + 1, na.rm=T)),
total_difference_lo95/(max(y$year - int + 1, na.rm=T)),
total_difference_hi95/(max(y$year - int + 1, na.rm=T))
)
numSaved <- paste0("The total number of cases avoided is ",total_cases_avoided,".")
numSavedYr <- paste0("The number of cases avoided per year is ",yearly_cases_avoided,".")
intYear <- paste0("The interruption year is ",int)
postIntDeaths <- paste0("The total number of post-interruption cases is ",post_interruption_cases,".")
pctPostIntDeaths <- paste0("The ratio of cases avoided to expected post-interruption cases (*100) is ",percent_cases_avoided,".")
## Print summary
print(intYear)
print(numSaved)
print(numSavedYr)
print(postIntDeaths)
print(pctPostIntDeaths)
[1] "The interruption year is 2006"
[1] "The total number of cases avoided is 1125.09 (95%CI 959.92 to 1329.53)."
[1] "The number of cases avoided per year is 112.51 (95%CI 95.99 to 132.95)."
[1] "The total number of post-interruption cases is 17548."
[1] "The ratio of cases avoided to post-interruption cases (*100) is 6.03 (95%CI 5.14 to 7.12)."
Conclusion
…and that’s it! You’ve completed the analysis.
What if you wanted to compare two populations – one that experienced the intervention and one that didn’t?
Let’s say that we wanted to compare Ireland, which has a comprehensive workplace smoking ban, to Marlboro-land, a country that does not. It’s very easy! Just create a new category for “country” so that now your observations are identified by year, age-group, sex, and country. When you create your interruption term, just make sure you code it as follows:
- Ireland Pre-Interruption = “Pre-Interruption”
- Ireland Post-Interruption = “Post-Interruption”
- Marlboro-Land Pre-Interruption = “Pre-Interruption”
- Marlboro-Land Post-Interruption = “Pre-Interruption”
The intuition here is that Marlboro-land never has the policy, so it’s always “pre” the interruption.
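A minimal sketch of that coding in R, assuming a combined data frame z with country and year columns (both hypothetical names):

## 'z', 'country', and 'year' are hypothetical; only Ireland ever switches
## to "postban", while Marlboro-land stays "preban" throughout.
z$int2004 <- factor(
ifelse(z$country == "Ireland" & z$year >= 2004, "postban", "preban"),
levels = c("preban", "postban")
)

You would then interact this term with time exactly as before; the comparator country contributes to the pre-interruption trend in all years.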
Additional Information
My code for this project will soon be available on my GitHub Page.
Acknowledgements
Special thanks to the faculty at University College Cork who helped me with this project and to the following data providers who made my analysis possible:
- National Cancer Registry Ireland (Cancer Incidence Data)
- Central Statistics Office Ireland (Population and Cancer Mortality Data)
- Ireland Environmental Protection Agency (Air Quality Data)
- Ireland Health Services Executive (Active Smoking Data)
Recommended Citation
Caputi TL. Age-Sex Standardized Poisson-Based Interrupted Time Series Analysis. 2019. Accessed on DAY MONTH YEAR. https://www.theodorecaputi.com/standardized_poisson_itsa.
Have questions?
I am available at tcaputi [at] gmail.com.
About Me
My name is Theodore Caputi. I received a Bachelor of Science degree from the Wharton School at the University of Pennsylvania in 2017 and a Master of Public Health from University College Cork in 2018.