# Age-Sex Standardized Poisson-Based Interrupted Time Series Analysis

## Introduction

This tutorial describes how to perform an age-sex standardized interrupted time series analysis based upon Poisson regression using the R programming language.

When would you use Poisson-Based Interrupted Time Series Analysis (PBITSA)?

1. You Have Time Series Data of an Intervention/Policy. An interrupted time series analysis is a quasi-experimental method used to evaluate the impact of a policy or intervention by comparing trends in an outcome before and after the intervention is put in place.

2. You Want to Compare Rates Before/After a Specific Point. The most common form of interrupted time series analysis is based upon linear regression. This works if your outcome is a continuous variable (or closely approximated by one). For example, you might look at the concentration of a particular air pollutant before and after some policy change. There are plenty of tutorials on this type of interrupted time series analysis available elsewhere, but in short, you create a dummy variable for whether or not the observation occurred after the interruption, and you regress the outcome on time, the interruption dummy, and an interaction between time and the interruption dummy.

3. Your Outcome is a Count Variable. But what if your outcome isn’t a simple continuous variable? What if it’s a count variable? Say that you know the annual count of how many new lung cancer cases were found (lung cancer incidence) and how many people died from lung cancer (lung cancer mortality), and you want to know if these outcomes were affected by a certain policy. This variable is not continuous (you can’t have 5.5 lung cancer mortalities in a year), and the meaning of the count variable depends upon the base population size (6 lung cancer mortalities in a year might not sound like a lot, but it is if there are only 100 people in the population you’re studying). This is when you use PBITSA.

When would you want to age-sex standardize?

If age and sex affect your outcome and the age and sex distribution of the population is changing over time, then you need to take that into account through age-sex standardization. Let’s say that we want to measure the impact of a policy on hard-candy consumption and that, over the study period, the population is becoming older. Older people eat more hard candy than younger people, so even if the within-age-group rates stay the same, the overall prevalence of hard-candy consumption would increase – giving you a spurious result. Age-sex standardization adjusts for changes to the age and sex distribution of the population over time.
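Formally, a directly standardized rate is a weighted average of the group-specific rates. In notation consistent with the code later in this tutorial (with groups g indexed by age and sex; the symbols here are my own shorthand):

```latex
\mathrm{ASR} = \sum_{g} w_{g}\, r_{g},
\qquad
r_{g} = \frac{\text{cases}_{g}}{\text{population}_{g}},
\qquad
w_{g} = \frac{\text{population}_{g}}{\sum_{g'} \text{population}_{g'}}
```

Each group’s rate contributes in proportion to its weight in the reference population, so the overall figure is no longer driven by shifts between groups.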

Depending upon your research question, there might be other things you need to standardize throughout your study period. Say that you’re looking at prostate cancer. Since prostate cancer affects males, you shouldn’t adjust for sex – but you still might want to standardize for age-groups.

## The Current Study

This tutorial is based upon my MPH Dissertation, which I completed at University College Cork in Cork, Republic of Ireland, in 2017-2018. My dissertation explored the effect of Ireland’s 2004 Comprehensive Workplace Smoking Ban on lung cancer incidence and mortality. Count data was available for lung cancer incidence and mortality, and so, to perform an interrupted time series analysis, I had to use PBITSA.

## The Data

Count data for lung cancer incidence and mortality was available for 1994-2015, with every observation identified by a combination of year X age-group X sex and incidence and mortality given as raw counts. That is, one observation could be that there were 126 lung cancer incidences and 68 lung cancer mortalities among females aged 50-54 in 2004, and another could be that there were 269 lung cancer incidences and 45 lung cancer mortalities among males aged 85+ in 2009.

For this analysis to work, you AT LEAST need population estimates in the same categories as those for your count variable (for this study, population data by year X age-group X sex).

Depending on the analysis you want to run, you might want to adjust for some confounders. For this analysis, active smoking and air quality are potential confounders. I had access to micro-data on smoking prevalence (proportion of active smokers) and smoking frequency (average number of cigarettes per day), and so I could aggregate that to either overall smoking prevalence/frequency per year or year X age-group X sex level smoking prevalence/frequency. Air quality data was only sparsely available (readings at different sites on different dates), and so I averaged it all together by year to create overall annual Particulate Matter 2.5 (PM2.5) concentration. In the main analysis (shown here), we include overall smoking prevalence and frequency as confounders in the model.

First, we have to do a bit of data cleaning.

## Housekeeping
rm(list=ls())

## Install packages
if(!require("pacman")) install.packages("pacman")
pacman::p_load(MASS, plyr, foreign, data.table, lubridate, rJava, WordR, ReporteRs,
  fakeR, car, flextable, ggplot2, dplyr, devtools, formula.tools, stringr, rlang, lmtest)

## formula.tools lets us paste() formulas together with character strings later on
devtools::install_github("tlcaputi/tlcPack"); library(tlcPack)

## Set working directory where the data is and read it into a data frame called y (data import not shown)

y <- y %>%
rename(
lungMortality = lungMortalityCases,
lungIncidence = lungIncidenceCases,
brainMortality = brainMortalityCases,
brainIncidence = brainIncidenceCases,
smoke_rate = total_smoking_rate,
smoke_frequency = total_smoking_frequency,
age_sex_smoke_rate = age_sex_smoking_rate,
age_sex_smoke_frequency = age_sex_smoking_frequency,
smoke_price = real_price
)

y <- y %>%
mutate(lungMortality = ifelse(lungMortality=='..', 0, lungMortality)) %>%
mutate(
year = as.numeric(year),
population = as.numeric(population),
lungMortality = as.numeric(lungMortality),
lungIncidence = as.numeric(lungIncidence),
brainMortality = as.numeric(brainMortality),
brainIncidence = as.numeric(brainIncidence),
int2004 = ifelse(year<2004, "preban", "postban"),
year_from_baseline = year - min(year)
) %>% filter(sex!="Both sexes", year <=2015)

y <- y  %>% mutate(
lungIncidence = ifelse(is.na(lungIncidence), 0, lungIncidence),
lungMortality = ifelse(is.na(lungMortality), 0, lungMortality),
brainIncidence = ifelse(is.na(brainIncidence), 0, brainIncidence),
brainMortality = ifelse(is.na(brainMortality), 0, brainMortality)
)

y <- y %>% group_by(year, sex) %>% mutate(totalPopulation = sum(population)) %>%
ungroup()

y <- y %>% mutate(
lungIncidenceRate = lungIncidence/population,
lungMortalityRate = lungMortality/population,
brainIncidenceRate = brainIncidence/population,
brainMortalityRate = brainMortality/population,
weighting = population/totalPopulation
) %>% mutate(
lungIncidenceSumUp = lungIncidenceRate * weighting,
lungMortalitySumUp = lungMortalityRate * weighting,
brainIncidenceSumUp = brainIncidenceRate * weighting,
brainMortalitySumUp = brainMortalityRate * weighting
)

y <- y %>% group_by(year) %>% mutate(
lungIncidence_ASR = sum(lungIncidenceSumUp),
lungMortality_ASR = sum(lungMortalitySumUp),
brainIncidence_ASR = sum(brainIncidenceSumUp),
brainMortality_ASR = sum(brainMortalitySumUp)
) %>% ungroup()

## Now we select only the variables we need for analysis
y <- y %>%
select(
sex, # Male or Female
year, # Year
year_from_baseline, # Year since 1994
ageGroup, # 5-year age groups from 40 to 85 and then 85+
population, # number of Irish people in the sex X age group
pm2p5, # Yearly particulate matter 2.5 concentration
pm2p5_lag3, # 3-year lagged particulate matter 2.5 concentration
smoke_rate, # Annual Irish smoking prevalence
smoke_frequency, # Annual avg. number of cigs per day
lungMortality # Cases of lung mortality within age X sex X year
) %>% arrange(ageGroup, sex, year) %>% ungroup()




And now we can take a look.

y %>% head()

| sex | year | year_from_baseline | ageGroup | population | pm2p5 | pm2p5_lag3 | smoke_rate | smoke_frequency | lungMortality |
|---|---|---|---|---|---|---|---|---|---|
| Female | 1994 | 0 | 40-44 years | 116.5 | 30.20692 | NA | 0.2465582 | 3.782245 | 5 |
| Female | 1995 | 1 | 40-44 years | 118.6 | 29.08168 | NA | 0.2428202 | 3.760822 | 9 |
| Female | 1996 | 2 | 40-44 years | 120.4 | 27.95643 | NA | 0.2390823 | 3.739400 | 9 |
| Female | 1997 | 3 | 40-44 years | 124.0 | 26.83119 | 30.20692 | 0.2353444 | 3.717978 | 5 |
| Female | 1998 | 4 | 40-44 years | 126.3 | 25.70594 | 29.08168 | 0.2316065 | 3.696555 | 10 |
| Female | 1999 | 5 | 40-44 years | 128.2 | 24.58070 | 27.95643 | 0.2278686 | 3.675133 | 6 |

## The Model

For simplicity, in this tutorial, we’ll focus on lung cancer mortality. You could apply the same technique to lung cancer incidence just by changing the left-hand side variable name.

Basic (Linear) Model

The basic model is pretty straightforward. You are predicting lung cancer incidence or mortality using a linear term for time, a dummy variable for whether the observation is before/after the interruption, an interaction term between time and the dummy variable, and other variables for confounders.
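The model just described can be written as follows (my notation: Y_t is the outcome at time t, D_t is the post-interruption dummy, and X_t collects any confounders):

```latex
Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t \times D_t) + \gamma X_t + \varepsilon_t
```

Here β2 captures the immediate “level” change at the interruption and β3 captures the change in slope afterward.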

The Poisson Model

Recall that a Poisson model is not linear – it’s log-linear. That means that we are essentially predicting the natural logarithm of the outcome, rather than the outcome itself.

The outcome variable we are investigating is the natural logarithm of the count of either lung cancer incidence or mortality. But, of course, the count itself doesn’t tell you much without knowing the whole population – it’s like having a numerator without a denominator. So you have to adjust for the population using an offset term.

An offset is a right-hand-side term whose coefficient is fixed at exactly 1 rather than estimated. Essentially, what we are saying is that, before we model everything else, we need to account for a known quantity on the right-hand side of the model. Without the offset term, we are predicting the natural logarithm of the count variable:

log(count) = β0 + β1X1 + … + βkXk

With log(population) as an offset on the right-hand side (equivalent to a negative offset on the left-hand side), we are predicting:

log(count) − log(population) = β0 + β1X1 + … + βkXk

Recall from high school math:

log(a) − log(b) = log(a/b)

Therefore, the left-hand side of the model is now equivalent to the natural logarithm of the count over the population, which is the rate.

For the mathematically/statistically inclined, the final PBITSA model is formulated as follows:
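In equation form, and consistent with the R formulas used below (D_t is the post-ban dummy added in a later step, and the lag terms enter only after the autocorrelation check):

```latex
\log \mathbb{E}[\text{cases}_{it}] = \log(\text{population}_{it})
  + \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t \times D_t)
  + \boldsymbol{\gamma}^{\top} \mathbf{X}_{it}
```

where i indexes the age-sex group, the log-population term is the offset (coefficient fixed at 1), and X_it collects the age-group and sex dummies plus the smoking confounders.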

basefit_formula <- as.formula("lungMortality ~ ageGroup + sex + offset(log(population)) + smoke_rate + smoke_frequency")

basefit <- glm(basefit_formula, data = y, family = poisson)

summary(basefit)

Call:
glm(formula = basefit_formula, family = poisson, data = y)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-6.3074  -1.1027  -0.0426   0.9688   4.8095

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)         -3.96358    0.13177 -30.080  < 2e-16 ***
ageGroup45-49 years  0.91204    0.06946  13.131  < 2e-16 ***
ageGroup50-54 years  1.83876    0.06327  29.063  < 2e-16 ***
ageGroup55-59 years  2.51632    0.06112  41.169  < 2e-16 ***
ageGroup60-64 years  3.11410    0.06000  51.900  < 2e-16 ***
ageGroup65-69 years  3.60885    0.05945  60.702  < 2e-16 ***
ageGroup70-74 years  3.97289    0.05924  67.059  < 2e-16 ***
ageGroup75-79 years  4.25016    0.05925  71.730  < 2e-16 ***
ageGroup80-84 years  4.37474    0.05971  73.269  < 2e-16 ***
ageGroup85+ years    4.29703    0.06070  70.792  < 2e-16 ***
sexMale              0.65510    0.01093  59.921  < 2e-16 ***
smoke_rate           2.27386    0.40570   5.605 2.09e-08 ***
smoke_frequency      0.01430    0.05090   0.281    0.779
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 46216.5  on 439  degrees of freedom
Residual deviance:  1047.4  on 427  degrees of freedom
AIC: 3642.6

Number of Fisher Scoring iterations: 4


Autocorrelation

We should consider if there is significant autocorrelation in the residuals of the model. A common rule of thumb is to look for a Durbin-Watson test statistic between 1.5 and 2.5. If it’s in that range, you can proceed without lag terms – if not, you should consider testing for some significant lag terms.

Here, we test 1-, 2-, and 3-year lagged outcome terms directly through a stepwise regression, and we keep every lag term that survives selection as a covariate in the model.

With the lag terms included, the model conditions on recent values of the outcome, so the right-hand side gains terms of the form:

δ1·Y(t−1) + δ2·Y(t−2) + δ3·Y(t−3)

durbinWatsonTest(basefit, max.lag=3)

lag_names <- c()
y$age_sex_group <- paste0(y$ageGroup, y$sex)

## Create 1-, 2-, and 3-year lagged outcome variables within each age-sex group
for(k in unique(y$age_sex_group)){
  for (j in 1:3){
    nam <- paste0("lungMortality_lag", j)
    # dplyr::lag shifts the vector down by j years within the group
    y[y$age_sex_group==k, nam] <- lag(as.numeric(unlist(y[y$age_sex_group==k, "lungMortality"])), j)
    lag_names <- unique(c(lag_names, nam))
  }
}

## Add the lag terms to the base model and run stepwise selection
## (using a linear-model approximation)
mod_w_lag_formula <- as.formula(paste(c(basefit_formula, lag_names), collapse="+"))
mod_w_lag <- lm(mod_w_lag_formula, data = na.omit(y))

step1 <- step(mod_w_lag)
lags <- grep("lag", names(step1$coefficients), value=T)

if(length(lags) > 0){
  formula_w_lags <- as.formula(paste(c(as.character(basefit_formula), lags), collapse="+"))
} else {
  formula_w_lags <- basefit_formula
}

model_w_lags <- glm(formula_w_lags, data = y, family = poisson)
summary(model_w_lags)

lag Autocorrelation D-W Statistic p-value
1       0.4639009      1.059313       0
2       0.4329246      1.113317       0
3       0.4101562      1.142450       0
Alternative hypothesis: rho[lag] != 0

Start:  AIC=1873.52
lungMortality ~ ageGroup + sex + offset(log(population)) + smoke_rate +
    smoke_frequency + lungMortality_lag1 + lungMortality_lag2 + lungMortality_lag3

                     Df Sum of Sq   RSS    AIC
<none>                            48352 1873.5
- sex                 1     367.9 48720 1874.4
- smoke_frequency     1     779.1 49131 1877.6
- smoke_rate          1    1425.4 49777 1882.6
- ageGroup            9    4984.7 53337 1892.8
- lungMortality_lag3  1    3520.0 51872 1898.2
- lungMortality_lag1  1    4415.0 52767 1904.7
- lungMortality_lag2  1    5089.0 53441 1909.5

Call:
glm(formula = formula_w_lags, family = poisson, data = y)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-6.1888  -0.9825   0.0378   0.9864   3.5148

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)         -3.9708867  0.1360732 -29.182  < 2e-16 ***
ageGroup45-49 years  0.9432878  0.0753907  12.512  < 2e-16 ***
ageGroup50-54 years  1.8024521  0.0697899  25.827  < 2e-16 ***
ageGroup55-59 years  2.4349953  0.0698428  34.864  < 2e-16 ***
ageGroup60-64 years  2.9768871  0.0740942  40.177  < 2e-16 ***
ageGroup65-69 years  3.4006834  0.0800485  42.483  < 2e-16 ***
ageGroup70-74 years  3.7351003  0.0845788  44.161  < 2e-16 ***
ageGroup75-79 years  4.0287955  0.0843779  47.747  < 2e-16 ***
ageGroup80-84 years  4.2312856  0.0763421  55.425  < 2e-16 ***
ageGroup85+ years    4.2368000  0.0712594  59.456  < 2e-16 ***
sexMale              0.5290185  0.0216730  24.409  < 2e-16 ***
smoke_rate           2.2479488  0.4717015   4.766 1.88e-06 ***
smoke_frequency      0.0243775  0.0519945   0.469   0.6392
lungMortality_lag1   0.0002108  0.0004110   0.513   0.6080
lungMortality_lag2   0.0007068  0.0004067   1.738   0.0823 .
lungMortality_lag3   0.0009069  0.0004060   2.233   0.0255 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 40400.92  on 379  degrees of freedom
Residual deviance:   792.05  on 364  degrees of freedom
  (60 observations deleted due to missingness)
AIC: 3047.7

Number of Fisher Scoring iterations: 4


Reconsidering the Interruption

In our study, we need to decide which “interruption” we want to use. There is a latency period between smoking exposure and lung cancer, but we don’t know exactly what it is. Here, we test all possible breakpoints between 2004 and 2010, and we use the most significant model (i.e., the model with the largest-magnitude, most negative z-value associated with the time-interruption interaction term). In your analysis, you may have to make similar or different adjustments, depending upon your research question.

## Perform a GLM and return only the interaction term
glm_int <- function(formula, data, family){
  mod <- glm(formula, data, family=family)
  mod_coeff <- data.frame(summary(mod)$coefficients)
  int_variable <- grep("year_from_baseline:int", rownames(mod_coeff), value=T)
  z_value <- mod_coeff[grep("year_from_baseline:int", rownames(mod_coeff)), "z.value"]
  mat <- as.data.frame(matrix(NA, nrow=1, ncol=2))
  names(mat) <- c("int_year", "z_val")
  mat$int_year <- as.numeric(gsub("year_from_baseline:int|postban", "", int_variable))
  mat$z_val <- as.numeric(z_value)
  return(mat)
}

## Take a list of glm_int objects and return the most significant negative value.
most_significant_value <- function(list_of_models){
data <- Reduce(function(x,y) {rbind.data.frame(x, y)}, list_of_models)
which_is_min <- which(data$z_val==min(data$z_val, na.rm=T))
min_year <- data[which_is_min, "int_year"]
if(min(data$z_val, na.rm=T) > -1.96) warning("No significant results!")
print(paste0(min_year, ": z-value is", min(data$z_val, na.rm=T)))
return(as.numeric(min_year))
}

## Create candidate interruption dummies
for(i in 2003:2010){
  interruption_name <- paste0("int", i)
  values <- ifelse(y$year < i, "preban", "postban")
  y[,interruption_name] <- factor(values, levels=c("preban", "postban"))
}

# Test models with interruption points between 0 and 6 years after the ban
int2004_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2004")), y, family=poisson)
int2005_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2005")), y, family=poisson)
int2006_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2006")), y, family=poisson)
int2007_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2007")), y, family=poisson)
int2008_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2008")), y, family=poisson)
int2009_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2009")), y, family=poisson)
int2010_model <- glm_int(as.formula(paste0(formula_w_lags, "+year_from_baseline*int2010")), y, family=poisson)

models_to_test <- list(
  int2004_model, int2005_model, int2006_model, int2007_model,
  int2008_model, int2009_model, int2010_model
)

int <- most_significant_value(models_to_test)
interruption <- paste0("int", int)
print(interruption)

[1] "2006: z-value is-2.39163967446975"
[1] "int2006"

Now let’s run the model.

## Results

First, remember that all these coefficients are on a natural log scale. You need to exponentiate them in order to get the relevant rate ratio (RR). The row labelled int2006postban corresponds to the level or “immediate” shift occurring right after the ban. The row labelled year_from_baseline:int2006postban corresponds to the slope or “gradual” shift occurring after the ban.
final_model_formula <- as.formula(paste0(formula_w_lags, "+ year_from_baseline *", interruption))
final_model <- glm(final_model_formula, data = y, family = poisson)
print(as.character(final_model_formula))
print(summary(final_model))

[1] "lungMortality ~ ageGroup + sex + offset(log(population)) + smoke_rate + smoke_frequency + lungMortality_lag1 + lungMortality_lag2 + lungMortality_lag3 + year_from_baseline * int2006"

Call:
glm(formula = final_model_formula, family = poisson, data = y)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-6.2709  -0.9116   0.0089   0.9092   3.5758

Coefficients:
                                    Estimate Std. Error z value Pr(>|z|)
(Intercept)                       -3.7339134  0.2378258 -15.700   <2e-16 ***
ageGroup45-49 years                0.9414212  0.0753927  12.487   <2e-16 ***
ageGroup50-54 years                1.7958452  0.0698201  25.721   <2e-16 ***
ageGroup55-59 years                2.4222705  0.0699616  34.623   <2e-16 ***
ageGroup60-64 years                2.9571266  0.0743581  39.769   <2e-16 ***
ageGroup65-69 years                3.3736395  0.0805245  41.896   <2e-16 ***
ageGroup70-74 years                3.7029407  0.0852492  43.437   <2e-16 ***
ageGroup75-79 years                3.9969385  0.0850150  47.015   <2e-16 ***
ageGroup80-84 years                4.2079389  0.0766959  54.865   <2e-16 ***
ageGroup85+ years                  4.2219335  0.0714159  59.118   <2e-16 ***
sexMale                            0.5183965  0.0219768  23.588   <2e-16 ***
smoke_rate                         0.1317331  1.5021274   0.088   0.9301
smoke_frequency                    0.0916101  0.0678737   1.350   0.1771
lungMortality_lag1                 0.0002723  0.0004151   0.656   0.5119
lungMortality_lag2                 0.0007950  0.0004076   1.951   0.0511 .
lungMortality_lag3                 0.0009748  0.0004115   2.369   0.0178 *
year_from_baseline                -0.0007664  0.0068449  -0.112   0.9109
int2006postban                     0.1128675  0.0623443   1.810   0.0702 .
year_from_baseline:int2006postban -0.0104927  0.0043872  -2.392   0.0168 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 40400.92  on 379  degrees of freedom
Residual deviance:   782.04  on 361  degrees of freedom
  (60 observations deleted due to missingness)
AIC: 3043.7

Number of Fisher Scoring iterations: 4


Subgroup Analyses

We can also fit this same model for a certain subgroup – we just need to make sure we take out the term in the model that corresponds with the subgroup. For example, it doesn’t make much sense to “adjust for” sex if we’re only looking at females, or to “adjust for” age if we’re only looking at one age group (and, because the factor would have only one level within the subset, R would throw an error anyway).

## Sex Group
sex_final_formula <- as.formula(
  paste0("lungMortality ~ ageGroup + offset(log(population)) + smoke_rate + smoke_frequency + ",
    paste(lags, collapse="+"),
    " + year_from_baseline * ", interruption)
)

for (i in unique(y$sex)){
  print(i)
  temp_mod <- glm(sex_final_formula, data = filter(y, sex == i), family = poisson)
  print(summary(temp_mod))
}
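As noted above, the coefficients from any of these Poisson fits are log rate ratios. A minimal sketch of converting a fitted model into RRs with Wald 95% confidence intervals (the helper name `rr_table` is my own, not part of the tutorial's code):

```r
## Exponentiate coefficients and Wald 95% CIs to get rate ratios (RRs)
rr_table <- function(mod){
  est <- coef(mod)              # log-RR point estimates
  se  <- sqrt(diag(vcov(mod)))  # standard errors
  data.frame(
    RR   = exp(est),
    lo95 = exp(est - 1.96 * se),
    hi95 = exp(est + 1.96 * se)
  )
}

## e.g. rr_table(final_model) would show the post-ban slope RR per year
```

For instance, a year_from_baseline:int2006postban estimate of −0.0105 corresponds to an RR of exp(−0.0105) ≈ 0.990, i.e., roughly a 1% faster annual decline after the ban.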

## Age Groups

age_final_formula <- as.formula(
  paste0("lungMortality ~ sex + offset(log(population)) + smoke_rate + smoke_frequency + ",
    paste(lags, collapse="+"),
    " + year_from_baseline * ", interruption)
)

for (i in unique(y$ageGroup)){
  print(i)
  temp_mod <- glm(age_final_formula, data = filter(y, ageGroup == i), family = poisson)
  print(summary(temp_mod))
}

## Plot the observed rate against the modelled counterfactual. (The code that
## builds fig_dataset and post_interruption and opens the plot() call is
## truncated in the source; this fragment sets the y-axis limit and overlays.)
  max(c(fig_dataset$without_smoking_ban_rate, fig_dataset$with_smoking_ban_rate,
        fig_dataset$with_smoking_ban_rate_direct), na.rm=T)) )
lines(fig_dataset$year, fig_dataset$with_smoking_ban_rate, type="l", col="blue")
lines(post_interruption$year, post_interruption$without_smoking_ban_rate, type="l", lty="dashed", col="red")
abline(v=int, col="black")
abline(v=2004, lty="dashed", col="darkgreen")
legend("bottomleft",
  legend=c("Predicted Rate Without Smoking Ban", "Standardized-Adjusted Rate",
           paste0(int, " Modeled Interruption"), "2004 Smoking Ban", "Observed Crude Rate"),
  col=c("red","blue","black","darkgreen","black"),
  lty=c(2,1,1,2,NA), pch=c(NA,NA,NA,NA,1))

## Interpreting the Figure

Based upon this figure, we may want to calculate a few numbers, such as the average number of mortalities avoided per year. I find it useful to print a summary. Finding the 95% confidence interval is not immediately obvious; here, we bootstrap it.

## Calculate fitted and observed values and estimate differences using bootstrapping
fitted_values <- as.vector(
  filter(y, !is.na(without_smoking_ban), !is.na(with_smoking_ban)) %>%
    ungroup() %>% select(without_smoking_ban)
) %>% unlist() %>% as.numeric()

observed_values <- as.vector(
  filter(y, !is.na(without_smoking_ban), !is.na(with_smoking_ban)) %>%
    ungroup() %>% select(with_smoking_ban)
) %>% unlist() %>% as.numeric()

## Set seed to replicate
set.seed(1234)
B <- replicate(1000, expr={
  ix <- sample(1:length(observed_values), length(observed_values), replace=T)
  sum(fitted_values[ix]) - sum(observed_values[ix])
})

### Create Summary

## Rounding function
r2 <- function(x) round(x,2)

## Creates a string that looks like a confidence interval
ci95 <- function(a, b, c) paste0(r2(a), " (95%CI ", r2(b), " to ", r2(c), ")")

post_interruption_cases <- r2(sum(y[y$year >= int, "lungMortality"], na.rm=T))
total_difference <- r2(sum(y$without_smoking_ban - y$with_smoking_ban, na.rm=T))
total_difference_lo95 <- r2(quantile(B, 0.025, na.rm=T))
total_difference_hi95 <- r2(quantile(B, 0.975, na.rm=T))

percent_cases_avoided <- ci95(
  total_difference / sum(y[y$year >= int, ]$without_smoking_ban, na.rm=T) * 100,
  total_difference_lo95 / sum(y[y$year >= int, ]$without_smoking_ban, na.rm=T) * 100,
  total_difference_hi95 / sum(y[y$year >= int, ]$without_smoking_ban, na.rm=T) * 100
)

total_cases_avoided <- ci95(total_difference, total_difference_lo95, total_difference_hi95)

yearly_cases_avoided <- ci95(
  total_difference / (max(y$year - int + 1, na.rm=T)),
  total_difference_lo95 / (max(y$year - int + 1, na.rm=T)),
  total_difference_hi95 / (max(y$year - int + 1, na.rm=T))
)

numSaved <- paste0("The total number of cases avoided is ",total_cases_avoided,".")
numSavedYr <- paste0("The number of cases avoided per year is ",yearly_cases_avoided,".")
intYear <- paste0("The interruption year is ",int)
postIntDeaths <- paste0("The total number of post-interruption cases is ",post_interruption_cases,".")
pctPostIntDeaths <- paste0("The ratio of cases avoided to post-interruption cases (*100) is ",percent_cases_avoided,".")

## Print summary
print(intYear)
print(numSaved)
print(numSavedYr)
print(postIntDeaths)
print(pctPostIntDeaths)


[1] "The interruption year is 2006"
[1] "The total number of cases avoided is 1125.09 (95%CI 959.92 to 1329.53)."
[1] "The number of cases avoided per year is 112.51 (95%CI 95.99 to 132.95)."
[1] "The total number of post-interruption cases is 17548."
[1] "The ratio of cases avoided to post-interruption cases (*100) is 6.03 (95%CI 5.14 to 7.12)."


## Conclusion

…and that’s it! You’ve completed the analysis.

What if you wanted to compare two populations – one that experienced the intervention and one that didn’t?

Let’s say that we wanted to compare Ireland, which has a comprehensive workplace smoking ban, to Marlboro-land, a country that does not. It’s very easy! Just create a new category for “country” so that now your observations are identified by year, age-group, sex, and country. When you create your interruption term, just make sure you code it as follows:

• Ireland Pre-Interruption = “Pre-Interruption”
• Ireland Post-Interruption = “Post-Interruption”
• Marlboro-Land Pre-Interruption = “Pre-Interruption”
• Marlboro-Land Post-Interruption = “Pre-Interruption”

The intuition here is that Marlboro-land never has the policy, so it’s always “pre” the interruption.
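A minimal sketch of this coding, assuming a toy data frame with `country` and `year` columns (the variable names and years here are illustrative):

```r
## The control country never experiences the policy, so it stays "preban" forever
df <- data.frame(
  country = rep(c("Ireland", "Marlboro-land"), each = 4),
  year    = rep(c(2002, 2003, 2004, 2005), times = 2)
)

## Only Ireland switches to "postban" from 2004 onward
df$int2004 <- ifelse(df$country == "Ireland" & df$year >= 2004,
                     "postban", "preban")
df$int2004 <- factor(df$int2004, levels = c("preban", "postban"))
```

You would then also include `country` itself (and, if appropriate, its interaction with time) in the model so the two populations can have different baseline levels and trends.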

My code for this project will soon be available on my GitHub Page.

## Acknowledgements

Special thanks to the faculty at University College Cork who helped me with this project and to the following data providers who made my analysis possible:

• National Cancer Registry Ireland (Cancer Incidence Data)
• Central Statistics Office Ireland (Population and Cancer Mortality Data)
• Ireland Environmental Protection Agency (Air Quality Data)
• Ireland Health Services Executive (Active Smoking Data)

Caputi TL. Age-Sex Standardized Poisson-Based Interrupted Time Series Analysis. 2019. Accessed on DAY MONTH YEAR. https://www.theodorecaputi.com/standardized_poisson_itsa.

## Have questions?

I am available at tcaputi [at] gmail.com.