Cox regression is poorly suited to the analysis of huge data sets with many events (e.g., deaths). For instance, consider analyzing the mortality of the Swedish population aged 60–110 during the years 1968–2019, with more than four million deaths.

The obvious way to handle that situation is to tabulate the data and apply a piecewise constant hazard function. It is well known that any continuous function can be approximated arbitrarily well by a step function, simply by taking small enough steps.
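
The approximation claim can be illustrated with a small base R sketch. The Gompertz-type hazard and its parameter values are assumptions chosen for illustration, not estimates from the Swedish data. The step function takes the value of the hazard at each interval midpoint, and the largest gap shrinks as the steps get narrower.

```r
# Gompertz-type hazard used purely as an illustrative continuous function
# (the parameter values are made up, not estimated from any data).
h <- function(t) 0.0001 * exp(0.09 * t)

# Largest absolute gap between h and a step function on [60, 100) whose
# value on each step is h evaluated at the midpoint of that step.
max_gap <- function(width) {
    starts <- seq(60, 100 - width, by = width)
    grid <- seq(60, 100 - 1e-8, length.out = 4001)  # evaluation points
    idx <- findInterval(grid, starts)               # step containing each point
    step_value <- h(starts + width / 2)[idx]
    max(abs(h(grid) - step_value))
}

# Gaps for step widths of 10, 5, 1, and 0.5 years.
gaps <- sapply(c(10, 5, 1, 0.5), max_gap)
round(gaps, 5)
```

With one-year steps, as in the tabular data used here, the gap is already small compared with the level of the hazard itself.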

1 Tabular data

The data sets swepop and swedeaths in eha contain age- and sex-specific yearly information on population size and number of deaths, respectively. Both cover the full Swedish population during the years 1968–2019.

The first few rows of each:

head(swepop)
##        age   sex year   pop id
## 1.1968   0   men 1968 59280  1
## 2.1968   0 women 1968 56134  2
## 3.1968   1   men 1968 62298  3
## 4.1968   1 women 1968 58722  4
## 5.1968   2   men 1968 62602  5
## 6.1968   2 women 1968 59126  6
head(swedeaths)
##        age   sex year deaths id
## 1.1968   0   men 1968    783  1
## 2.1968   0 women 1968    566  2
## 3.1968   1   men 1968    103  3
## 4.1968   1 women 1968     66  4
## 5.1968   2   men 1968     48  5
## 6.1968   2 women 1968     31  6

The odd row names and the column id are created by the function reshape, which was used to transform the original tables from wide to long format. In the original data, downloaded from Statistics Sweden, the population size refers to the last day, December 31, of the given year, but here it is the average of that value and the corresponding value for the previous year. This gives an estimate of the number of person-years, so each cell of the data holds both a number of occurrences and an exposure time. This information allows us to fit proportional hazards survival models. We start by joining the two data sets and removing irrelevant columns:

dat <- swepop[, c("age", "sex", "year", "pop")]
dat$deaths <- swedeaths$deaths
rownames(dat) <- 1:NROW(dat) # Simplify rownames.
head(dat)
##   age   sex year   pop deaths
## 1   0   men 1968 59280    783
## 2   0 women 1968 56134    566
## 3   1   men 1968 62298    103
## 4   1 women 1968 58722     66
## 5   2   men 1968 62602     48
## 6   2 women 1968 59126     31
tail(dat)
##       age   sex year  pop deaths
## 10499  98   men 2019  596    314
## 10500  98 women 2019 2121    846
## 10501  99   men 2019  320    230
## 10502  99 women 2019 1308    638
## 10503 100   men 2019  368    248
## 10504 100 women 2019 1768   1005

We note that the age column ends with age == 100, which in fact means age >= 100. There are in total 4729403 observed deaths during the years 1968–2019, or 90950 deaths per year on average. There are 101 age groups, two sexes, and 52 years, in all 10504 cells (rows in the data frame).
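
The person-year approximation behind the pop column, described earlier as an average of two consecutive year-end counts, can be sketched as follows. The counts are invented for one hypothetical age and sex cell; they are not taken from Statistics Sweden.

```r
# Hypothetical year-end (December 31) population counts for one age-sex cell;
# the numbers are invented for illustration only.
yearend <- c("1967" = 1000, "1968" = 1040, "1969" = 1010)

# Person-years lived in year y are approximated by the mean of the counts
# at the end of year y - 1 and at the end of year y.
n <- length(yearend)
exposure <- (yearend[-n] + yearend[-1]) / 2
names(exposure) <- names(yearend)[-1]
exposure
```

The first year in the series only serves as a starting value, so three year-end counts yield exposure for two calendar years.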

1.1 Poisson regression

Assuming a piecewise constant hazards model on the 101 age groups, we can fit a proportional hazards model by Poisson regression, utilizing the fact that the two likelihood functions are identical. In R, we use glm.

fit.glm <- glm(deaths ~ offset(log(pop)) + I(year - 2000) + sex + 
                 factor(age), data = dat, family = poisson)
summary(fit.glm)$coefficients[2:3, ]
##                Estimate Std. Error z value Pr(>|z|)
## I(year - 2000) -0.01565  0.0000315  -496.9        0
## sexmen          0.45523  0.0009324   488.2        0

The 101 coefficients corresponding to the intercept and the age factor can be used to estimate the hazard function: the intercept, -5.7268, is the log of the hazard in the age interval 0-1, and the rest are differences from that value, so we can reconstruct the baseline hazard by

lhaz <- coefficients(fit.glm)[-(2:3)]
n <- length(lhaz)
lhaz[-1] <- lhaz[-1] + lhaz[1]
haz <- exp(lhaz)

and plot the result, see Figure 1.1.

oldpar <- par(las = 1, lwd = 1.5, mfrow = c(1, 2))
plot(0:(n-1), haz, type = "s", main = "log(hazards)", 
     xlab = "Age", ylab = "", log = "y")
plot(0:(n-1), haz, type = "s", main = "hazards", 
     xlab = "Age", ylab = "Deaths / Year")

Figure 1.1: Age-specific mortality, Sweden 1968-2019. Poisson regression.
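
The link between Poisson regression and occurrence/exposure rates used in this section can be checked on a tiny table. The sketch below uses made-up counts for three age groups. With age group as the only covariate, the hazards fitted by glm should coincide with deaths divided by person-years in each group.

```r
# Small made-up table: deaths and person-years in three age groups.
tab <- data.frame(age    = factor(c("60-70", "70-80", "80-90")),
                  deaths = c(12, 30, 45),
                  pop    = c(1000, 800, 300))

# Poisson regression with log exposure as offset and age group as the only
# covariate, i.e. a piecewise constant hazard without other covariates.
fit0 <- glm(deaths ~ offset(log(pop)) + age, data = tab, family = poisson)

# Fitted hazards: exponentiate the intercept and intercept + age effects.
b <- coef(fit0)
haz_glm <- unname(exp(c(b[1], b[1] + b[-1])))

# Occurrence/exposure rates computed directly from the table.
haz_oe <- tab$deaths / tab$pop

cbind(haz_glm, haz_oe)
```

The reconstruction of haz_glm follows the same steps as the reconstruction of the baseline hazard from fit.glm above, only without the year and sex terms.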

1.2 The tpchreg function

While it is straightforward in principle to fit the model with glm and Poisson regression, it takes some effort to get it right. That is the reason for the creation of the function tpchreg (“Tabular Piecewise Constant Hazards REGression”). With it, the “Poisson analysis” is performed by

fit <- tpchreg(oe(deaths, pop) ~ I(year - 2000) + sex, 
               time = age, last = 101, data = dat)

Note:

  • The function oe (“occurrence/exposure”) takes two arguments, the first is the number of events (deaths in our example), and the second is exposure time, or person years.

  • The argument time is the variable that defines the time intervals. It can be either character, like c("0-1", "1-2", …, "100-101"), or numeric (as here). If numeric, each value is the start of the corresponding interval, and the next start is the end of the previous interval. This leaves the endpoint of the last interval undefined; if it is not given by the argument last (see below), it is chosen so that the last interval has length one.

  • The argument last closes the last interval if it is not already closed, see above. The exact value of last matters only for plotting and for the calculation of the restricted mean survival time (RMST), see the summary result below.
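
The interval convention described in the notes above can be spelled out in a few lines of base R. This is only an illustration of the convention, not the code used inside tpchreg.

```r
# Numeric interval starts, as in the age column of the Swedish data.
starts <- 0:100
last <- 101   # closes the final, otherwise open, interval

# Each interval ends where the next one starts; the final one ends at 'last'.
ends <- c(starts[-1], last)
intervals <- paste(starts, ends, sep = "-")

head(intervals, 3)   # "0-1" "1-2" "2-3"
tail(intervals, 2)   # "99-100" "100-101"
```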

summary(fit)
## Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
## I(year - 2000)       -5.511    -0.016     0.984     0.000    0.000 
## sex                                                          0.000 
##            women      0.503     0         1 (reference)
##              men      0.497     0.455     1.577     0.001
## 
## Events                    4729403 
## Total time at risk        457210264 
## Max. log. likelihood      -18984719 
## LR test statistic         477840.10 
## Degrees of freedom        2 
## Overall p-value           0
## 
## Restricted mean survival:  81.84 in (0, 101]

The restricted mean survival time is defined as the integral of the survivor function over the given time interval. Note that if the lower limit of the interval is larger than zero, it gives the conditional restricted mean survival time, given survival to the lower endpoint.
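
For a piecewise constant hazard, the integral in this definition has a closed form on each interval. The sketch below uses made-up cut points and hazard levels, compares the closed form with numerical integration, and then computes a conditional version. It is a minimal illustration and not necessarily how the package computes the value.

```r
# Made-up piecewise constant hazard on (0, 4]: cut points and hazard levels.
cuts <- c(0, 1, 2, 4)
haz <- c(0.1, 0.3, 0.6)

width <- diff(cuts)
cumhaz_start <- c(0, cumsum(haz * width))[seq_along(haz)]  # H at interval starts
surv_start <- exp(-cumhaz_start)                            # S at interval starts

# With a constant hazard h on an interval of length w, the integral of S
# over that interval is S(start) * (1 - exp(-h * w)) / h.
pieces <- surv_start * (1 - exp(-haz * width)) / haz
rmst <- sum(pieces)

# Numerical check: integrate the survivor function interval by interval.
surv <- function(t) {
    j <- findInterval(t, cuts, rightmost.closed = TRUE)
    exp(-(cumhaz_start[j] + haz[j] * (t - cuts[j])))
}
rmst_num <- sum(mapply(function(a, b) integrate(surv, a, b)$value,
                       cuts[-length(cuts)], cuts[-1]))

# Conditional RMST on (1, 4], given survival to time 1.
rmst_cond <- sum(pieces[-1]) / surv_start[2]

c(rmst = rmst, rmst_num = rmst_num, rmst_cond = rmst_cond)
```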

Graphs of the hazards and the log(hazards) functions are shown in Figure 1.2.

oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit, fn = "haz", log = "y", main = "log(hazards)", 
     xlab = "Age")
plot(fit, fn = "haz", log = "", main = "hazards", 
     xlab = "Age", ylab = "Deaths / Year")

Figure 1.2: Age-specific mortality, Sweden 1968-2019. ‘tpch’ regression. Baseline refers to women and the year 2000.

The results are the same as with glm and Poisson regression, but they are obtained with much less effort.

2 Tabulating standard survival data

Sometimes you have a large data file in classical, individual form, suitable for Cox regression with coxreg, but its sheer size makes the analysis impractical, or even impossible. The remedy is to tabulate the data and assume a piecewise constant hazard function, which brings us back to the method of the previous section, that is, to tpchreg.
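
As a rough illustration of what tabulation involves, the following base R sketch splits each individual's follow-up at the cut points and sums events and exposure per age interval. The three individuals and the cut points are made up, covariates are left out, and this is not the code used by the package, only the idea behind it.

```r
# Three made-up individuals: age at entry, age at exit, and death indicator.
ind <- data.frame(enter = c(60, 61.5, 63),
                  exit  = c(64.2, 62.5, 66),
                  event = c(TRUE, FALSE, TRUE))
cuts <- c(60, 62, 64, 66)   # age intervals 60-62, 62-64, 64-66

lower <- cuts[-length(cuts)]
upper <- cuts[-1]

# Exposure: overlap between each individual's follow-up and each interval.
overlap <- function(lo, up) pmax(0, pmin(ind$exit, up) - pmax(ind$enter, lo))
exposure <- mapply(function(lo, up) sum(overlap(lo, up)), lower, upper)

# Events: a death is counted in the interval that contains the exit age.
events <- mapply(function(lo, up) sum(ind$event & ind$exit > lo & ind$exit <= up),
                 lower, upper)

tab <- data.frame(age = paste(lower, upper, sep = "-"),
                  event = events, exposure = exposure)
tab
```

The total exposure in the table equals the total follow-up time of the three individuals, so no information on time at risk is lost, only the exact timing within each interval.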

The helper function is toTpch, and we illustrate its use on the oldmort data frame:

head(oldmort[, c("enter", "exit", "event", "sex", "civ", "birthdate")])
##   enter  exit event    sex       civ birthdate
## 1 94.51 95.81  TRUE female     widow      1765
## 2 94.27 95.76  TRUE female unmarried      1766
## 3 91.09 91.95  TRUE female     widow      1769
## 4 89.01 89.59  TRUE female     widow      1771
## 5 90.00 90.21  TRUE female     widow      1770
## 6 88.43 89.76  TRUE female     widow      1772
oldmort$birthyear <- floor(oldmort$birthdate) - 1800
om <- toTpch(Surv(enter, exit, event) ~ sex + civ + birthyear, 
             cuts = seq(60, 100, by = 2), data = oldmort)
head(om)
##      sex       civ birthyear   age event exposure
## 1   male unmarried        -2 60-62     0    0.578
## 2 female unmarried        -2 60-62     0    4.109
## 3   male   married        -2 60-62     0   17.366
## 4 female   married        -2 60-62     0   14.129
## 5   male     widow        -2 60-62     0    0.148
## 6 female     widow        -2 60-62     0    5.805

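Note two things: the covariate birthyear is birthdate rounded down to whole years and centered at 1800, which keeps the number of distinct values small enough for tabulation, and the result contains the new variables age, event, and exposure.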

Now we can run tpchreg as before:

fit3 <- tpchreg(oe(event, exposure) ~ sex + civ + 
                  birthyear, time = age, data = om)
summary(fit3)
## Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
## sex                                                          0.000 
##             male      0.406     0         1 (reference)
##           female      0.594    -0.245     0.783     0.047
## civ                                                          0.000 
##        unmarried      0.080     0         1 (reference)
##          married      0.530    -0.397     0.672     0.081
##            widow      0.390    -0.258     0.773     0.079
## birthyear             2.114    -0.006     0.994     0.004    0.150 
## 
## Events                    1971 
## Total time at risk         37824 
## Max. log. likelihood      -7265.5 
## LR test statistic         43.45 
## Degrees of freedom        4 
## Overall p-value           8.34423e-09
## 
## Restricted mean survival:  12.65 in (60, 100]

And the hazards graphs are shown in Figure 2.1.

oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit3, fn = "haz", log = "y", main = "log(hazards)", 
     xlab = "Age", ylab = "log(Deaths / Year)", col = "blue")
plot(fit3, fn = "haz", log = "", main = "hazards", 
     xlab = "Age", ylab = "Deaths / Year", col = "blue")

Figure 2.1: Old age mortality, Skellefteå 1860-1880.

The plots of the survivor and cumulative hazards functions are “smoother”, see Figure 2.2.

oldpar <- par(mfrow = c(1, 2), las = 1, lwd = 1.5)
plot(fit3, fn = "cum", log = "y", main = "Cum. hazards", 
     xlab = "Age", col = "blue")
plot(fit3, fn = "sur", log = "", main = "Survivor function", 
     xlab = "Age", col = "blue")

Figure 2.2: Old age mortality, Skellefteå 1860-1880. Cumulative hazards and survivor functions.

par(oldpar)

For comparison, we fit a Cox regression to the original data.

fit4 <- coxreg(Surv(enter, exit, event) ~ sex + civ + I(birthdate - 1800), 
               data = oldmort)
summary(fit4)
## Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
## sex                                                          0.000 
##             male      0.406     0         1 (reference)
##           female      0.594    -0.244     0.783     0.047
## civ                                                          0.000 
##        unmarried      0.080     0         1 (reference)
##          married      0.530    -0.397     0.673     0.081
##            widow      0.390    -0.259     0.772     0.079
## I(birthdate - 18      2.602    -0.005     0.995     0.004    0.212 
## 
## Events                    1971 
## Total time at risk         37824 
## Max. log. likelihood      -13557 
## LR test statistic         42.79 
## Degrees of freedom        4 
## Overall p-value           1.14378e-08

The corresponding graphs are shown in Figure 2.3.

oldpar <- par(mfrow = c(1, 2), lwd = 1.5, las = 1)
plot(fit4, main = "Cumulative hazards", xlab = "Age", 
     col = "blue")
plot(fit4, main = "Survivor function", xlab = "Age", 
     fn = "surv", col = "blue")

Figure 2.3: Old age mortality, Skellefteå 1860-1880. Cox regression with original data.
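
The close agreement between the tabulated analysis and Cox regression seen above can also be explored on simulated data. The sketch below is an assumption-laden illustration: the data are simulated with a true log hazard ratio of 0.5, coxph from the survival package stands in for coxreg, and the tabulated model is fitted with glm rather than tpchreg.

```r
library(survival)   # recommended package, normally installed with R

set.seed(1)
n <- 2000
x <- rbinom(n, 1, 0.5)                      # binary covariate
time <- rexp(n, rate = 0.1 * exp(0.5 * x))  # true log hazard ratio is 0.5
status <- time < 10                         # administrative censoring at 10
time <- pmin(time, 10)

# Cox regression on the individual data.
b_cox <- coef(coxph(Surv(time, status) ~ x))

# Tabulate events and exposure by covariate value and time interval.
cuts <- 0:10
cells <- expand.grid(j = 1:10, x = 0:1)
cells$event <- cells$exposure <- 0
for (i in seq_len(nrow(cells))) {
    lo <- cuts[cells$j[i]]
    up <- cuts[cells$j[i] + 1]
    keep <- x == cells$x[i]
    cells$exposure[i] <- sum(pmax(0, pmin(time[keep], up) - lo))
    cells$event[i] <- sum(status[keep] & time[keep] > lo & time[keep] <= up)
}

# Piecewise constant hazards model fitted by Poisson regression.
b_pch <- coef(glm(event ~ offset(log(exposure)) + x + factor(j),
                  data = cells, family = poisson))["x"]

c(cox = unname(b_cox), pch = unname(b_pch))
```

The 20 cells carry essentially the same information about the regression coefficient as the 2000 individual records, which is what makes tabulation attractive for very large data sets.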

3 References