# Survival Analysis

The goal of **Survival Analysis** is to identify the time until the event of interest occurs.
This event can be anything but most of the time, it represents death.

Survival Analysis may be used to explore large datasets, for example the time from surgery to death or from start of treatment to progression.

## Getting Started

In the following example we will use the `survival`

package which you can install and load like this:

`install.packages("survival")`

library(survival)

For better visualization, we will also be using the `survminer`

package:

`install.packages("survminer")`

library(survminer)

Next, you can load the dataset that you wish to analyse.
In this course, we will be using Acute Myelogenous Leukaemia survival data.
It is included in the `survival`

package and may be loaded using the `data()`

function:

`data(cancer, package = "survival")`

"*Survival in patients with Acute Myelogenous Leukaemia.*
*The question at the time was whether the standard course of chemotherapy should be extended ('maintenance') for additional cycles.*"

- time: survival or censoring time
- status: censoring status
- x: maintenance chemotherapy given? (factor)

Use R's `help()`

function to discover more about a dataset:

`help(cancer, package = "survival")`

## Censoring

Data that is incomplete is referred to as *censoring*.
Although there are other forms of censoring, *right-censoring* is the most typical in survival statistics.
Reasons for censoring can be: The patient was either forgotten for follow-up, dropped out of the study, or the "event" did not occur until the research was over.

To summarize, the `cens`

variable holds data indicating whether a person in the program died.
You can count both censored and uncensored people with the `table()`

function.

In our example dataset, the variable *status* indicates whether a person was censored.

`# Count censored and uncensored data`

num_cens <- table(aml$status)

num_cens

output:

0 1

5 18

The result shows that 5 people were censored (`=0`

), whereas 18 were uncensored, i.e. died (`=1`

).

`# Create barplot of censored and uncensored data`

barplot(num_cens)

The outcome of our calculation will then be visually shown:

## Kaplan-Meier Estimate

The **Kaplan-Meier estimate** is a non-parametric statistic used to calculate the survival function from lifetime data.
This statistic indicates the probability that a certain patient will live through a given time `t`

.
The Kaplan-Meier estimator is `1`

at `t = 0`

and decreases to `0`

as `t`

approaches infinity.

### Creating a Survival Object

`Surv`

objects are created using the `Surv()`

function.
The main arguments of this function are `time`

(follow-up time) and `event`

(status indicator, normally 0 = alive/censored, 1 = dead).
Each subject will have one entry that is the survival time, followed by a `+`

if the subject was censored.

`# Create Surv object`

sobj <- Surv(aml$time, aml$status)

# Look at 10 first elements

sobj[1:10]

output:

[1] 9 13 13+ 18 23 28+ 31 34 45+ 48

# Look at the summary

summary(sobj)

output:

time status

Min. : 5.00 Min. :0.0000

1st Qu.: 12.50 1st Qu.:1.0000

Median : 23.00 Median :1.0000

Mean : 29.48 Mean :0.7826

3rd Qu.: 33.50 3rd Qu.:1.0000

Max. :161.00 Max. :1.0000

# Look at the structure

str(sobj)

output:

'Surv' num [1:23, 1:2] 9 13 13+ 18 23 28+ 31 34 45+ 48 ...

- attr(*, "dimnames")=List of 2

..$ : NULL

..$ : chr [1:2] "time" "status"

- attr(*, "type")= chr "right"

### Creating a Kaplan-Meier Plot

The `survfit()`

function can generate survival curves based on the *Kaplan-Meier estimate*.
It requires the previously created *survival object*.

`# Calculate the Kaplan-Meier estimate`

km <- survfit(Surv(time, status) ~ 1, data = aml)

The `1`

after the tilde (`~`

) indicates that we want to construct a single survival function for all observations.

`# Look at the structure`

str(km)

output:

List of 16

$ n : int 23

$ time : num [1:18] 5 8 9 12 13 16 18 23 27 28 ...

$ n.risk : num [1:18] 23 21 19 18 17 15 14 13 11 10 ...

$ n.event : num [1:18] 2 2 1 1 1 0 1 2 1 0 ...

$ n.censor : num [1:18] 0 0 0 0 1 1 0 0 0 1 ...

$ surv : num [1:18] 0.913 0.826 0.783 0.739 0.696 ...

$ std.err : num [1:18] 0.0643 0.0957 0.1099 0.1239 0.1379 ...

$ cumhaz : num [1:18] 0.087 0.182 0.235 0.29 0.349 ...

$ std.chaz : num [1:18] 0.0615 0.0912 0.1053 0.119 0.1328 ...

$ type : chr "right"

$ logse : logi TRUE

$ conf.int : num 0.95

$ conf.type: chr "log"

$ lower : num [1:18] 0.805 0.685 0.631 0.58 0.531 ...

$ upper : num [1:18] 1 0.996 0.971 0.942 0.912 ...

$ call : language survfit(formula = Surv(aml$time, aml$status) ~ 1)

- attr(*, "class")= chr "survfit"

### Visualizing a Kaplan-Meier Plot

The `ggsurvplot()`

function in the `survminer`

package that we loaded before allows us to plot survival curves.

To create a simple curve, just insert the `survfit`

object that we created earlier into `ggsurvplot()`

.

`# Create a Kaplan-Meier curve`

ggsurvplot(fit = km)

This will give you the following output:

`ggsurvplot()`

provides several options for customizing your curve.
Further information may be found here or by using the `help()`

function in your R console.

### Number at Risk

A risk table displaying the total number of patients under surveillance can also be included to the plot.
This can be achieved with the `risk.table`

argument.

`# Add the risk table and visualize the median survival time`

ggsurvplot(km, risk.table = TRUE, surv.median.line = "hv")

## The Weibull Model

The **Weibull model** is similar to the *Kaplan-Meier estimate*.
While the Kaplan-Meier estimate is useful for looking at data, the Weibull model is helpful for more advanced analysis such as adjusting covariates and forming inferences.
Another difference is that within the Kaplan-Meier curve you see the little "steps," whereas the Weibull model can smoothen the curve.

The Weibull distribution can match several distribution shapes. It defines the probability connected with continuous data, much like the normal distribution does. Nonetheless, it can also model skewed data, which does not follow a normal distribution.

### Creating a Weibull Model

To create a Weibull model in R, you need the `survreg()`

function instead of `survfit()`

.
However, it requires the same arguments.

`# Compute a Weibull model`

wb <- survreg(Surv(time, status) ~ 1, data = aml)

### Calculating Measures

The `predict()`

function allows us to calculate the point in time at which n % of patients would still live.

`# Calculate the time point at which 90 % survive `

predict(wb, type = "quantile", p = 1 - 0.9, newdate = data.frame(1))

output:

[1] 4.905592

This means that 90 % of patients survive more than 4 days.

### Creating a Weibull Curve

Unfortunately, using `ggsurvplot()`

for creating a Weibull curve will not work, because it is not a step function.
Instead, we have to follow these steps:

- Create a sequence of numbers for the survival probability between 0 and 1
- Calculate the time point for each probability using
`predict()`

with a vector of quantiles instead of a single value - Generate a data frame

`# Create a vector from .99 to .01 in steps of .01`

surv <- seq(.99, .01, by = -.01)

# Calculate time points for each probability

t <- predict(wb, type = "quantile", p = 1 - surv, newdata = data.frame(1))

# Create a data frame

surv_wb <- data.frame(time = t, surv = surv)

# Call head on surv_wb

head(surv_wb)

output:

time surv

1 0.5755697 0.99

2 1.0879602 0.98

3 1.5820251 0.97

4 2.0662605 0.96

5 2.5445740 0.95

6 3.0192307 0.94

- We may now visualize data using the data frame we have constructed.
For this, we need the
`ggsurvplot_df()`

function.

`# Plot the data frame`

ggsurvplot_df(fit = surv_wb, surv.geom = geom_line)

`ggsurvplot_df()`

only takes data frames of the format that we specified.
The argument `surv.geom`

specifies that our model generates a smooth function with `geom_line`

instead of `geom_step`

.

Now that we have completed graphing our data, we have obtained a smooth curve:

### Weibull Model with Covariates

For further analysis, you may want to construct different survival curves with covariates.
This makes it easier to compare several hypothetical circumstances ("*imaginary patients*").
As an example, you could discuss the question of who will survive longer — patients with a larger or smaller tumour size.

In this chapter, we will continue to use the *Acute Myelogenous Leukaemia survival data* and discuss the potential impact of a maintenance chemotherapy (factor `x`

) on the survival curve.
We will go through how to create a Weibull model with covariates step by step using our example.

**Compute the Weibull model including the covariates.**

`wbmod <- survreg(Surv(time, status) ~ x, data = aml)`

**Decide on covariate combinations for which to compute the survival curves ("***imaginary patients*").

`# Create newdat`

newdat <- expand.grid(

x = levels(aml$x)

)

# Look at newdat

newdat

output:

x

1 Maintained

2 Nonmaintained

**Compute the survival curves for the***imaginary patients*.

This is similar to what we have done before.
The only difference is that we set `newdata`

to our new data (`newdat`

) that we created in step 2.

`# Create a sequence between .99 and 0.01`

surv <- seq(.99, .01, by = -.01)

# Calculate a time point for each probability

t <- predict(wbmod, type = "quantile", p = 1 - surv, newdata = newdat)

# Look at the dimension of t

dim(t)

output:

[1] 2 99

This means, we have 99 columns and 2 rows. We cannot plot the survival curves in this format, therefore we have to reshape them.

**Create a**`data.frame`

with survival curve information.

`# Bind the columns to the survival functions`

surv_wbmod_wide <- cbind(newdat, t)

# Install & load the reshape2 package to use the melt function

install.packages("reshape2")

library(reshape2)

# Transform data from wide to long format with melt()

surv_wbmod <- melt(surv_wbmod_wide, id.vars = c("x"), variable.name = "surv_id", value.name = "time")

# Create the column "surv"

surv_wbmod$surv <- surv[as.numeric(surv_wbmod$surv_id)]

# Add variables to the data frame (required by the plotting function, but not for us)

surv_wbmod[c("upper", "lower", "std.err", "strata")] <- NA

# Look at the structure of the data frame

str(surv_wbmod)

output:

'data.frame': 198 obs. of 8 variables:

$ x : Factor w/ 2 levels "Maintained","Nonmaintained": 1 2 1 2 1 2 1 2 1 2 ...

$ surv_id: Factor w/ 99 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...

$ time : num 1.601 0.632 2.781 1.098 3.848 ...

$ surv : num 0.99 0.99 0.98 0.98 0.97 0.97 0.96 0.96 0.95 0.95 ...

$ upper : logi NA NA NA NA NA NA ...

$ lower : logi NA NA NA NA NA NA ...

$ std.err: logi NA NA NA NA NA NA ...

$ strata : logi NA NA NA NA NA NA ...

**Plot the Weibull model.**

`# Create a smooth survival curve and specify the color according to the covariate`

ggsurvplot_df(surv_wbmod, surv.geom = geom_line, color = "x", legend.title = NULL)

If you have more covariates than in our case, you can set the `linetype`

argument in `ggsurvplot_df()`

to the other covariate.

## Other Distributions

In addition to the Weibull distribution, there are several other distributions. In most cases, we would select the Weibull distribution, since it is more flexible. However, the decision for a particular distribution is more a question of philosophy.

You can select the distribution to use with the `survreg()`

function by using the `dist`

argument.
For example, you could use exponential or log-normal distributions:

`# Compute an exponential distribution`

survreg(Surv(time, status) ~ horTh, data = aml, dist = "exponential")

# Compute a lognormal distribution

survreg(Surv(time, status) ~ horTh, data = aml, dist = "lognormal")

By default, `dist`

is set to `weibull`

.

### Comparing Weibull and Log-Normal Model

In this chapter, we will be comparing the Weibull to the log-normal model using the AML data.
Make sure that the `survival`

, `survminer`

and `reshape2`

packages as well as the `aml`

data are loaded.

**Compute the Weibull and the log-normal model.**

`# Compute Weibull model`

wbmod <- survreg(Surv(time, status) ~ x, data = aml)

# Compute log-normal model

lnmod <- survreg(Surv(time, status) ~ x, data = aml, dist = "lognormal")

**Compute the survival functions for both models.**

`# Create a new dataset with the two levels of maintenance chemotherapy`

newdat <- data.frame(x = levels(aml$x))

# Create a sequence between .99 and 0.01

surv <- seq(.99, .01, by = -.01)

# Compute both survival curves with predict()

wbt <- predict(wbmod, type = "quantile", p = 1 - surv, newdata = newdat)

lnt <- predict(lnmod, type = "quantile", p = 1 - surv, newdata = newdat)

**Add the correct survival probabilities to the data frame.**

`# Bind the columns to the survival functions`

surv_wbmod_wide <- cbind(newdat, wbt)

surv_lnmod_wide <- cbind(newdat, lnt)

# Merge the survival functions and add the column "dist"

surv_wide <- rbind(surv_wbmod_wide, surv_lnmod_wide)

surv_wide$dist <- c("weibull", "weibull", "lognormal", "lognormal")

# Melt data.frame from wide to long format

surv_long <- melt(surv_wide, id.vars = c("x", "dist"), variable.name = "surv_id", value.name = "time")

# Add the column "surv"

surv_long$surv <- surv[as.numeric(surv_long$surv_id)]

# Add columns upper, lower, std.err, and strata contianing NA values

surv_long[, c("upper", "lower", "std.err", "strata")] <- NA

**Plot the survival curves.**

`# Create a smooth survival curve`

# The linetype should correspond to the maintenance chemotherapy

# The color should correspond to the distribution

ggsurvplot_df(surv_long, surv.geom = geom_line, linetype = "x", color = "dist", legend.title = NULL)

## The Cox Model

The Cox model allows the analysis of the influence of variables on the time-to-event outcome.
In contrast to the *Weibull model*, it is a semi-parametric model.
This indicates that we are less strict about the time-to-event distribution.

The Cox model is also known as the *proportional hazards model* since, like the *Weibull model*, it assumes proportional hazards.
That is why survival curves cannot cross.

In the following example, we will again use our dataset `aml`

.
We want to know if the maintenance chemotherapy is associated with the patients' survival time.
Make sure you have loaded the `survival`

and `survminer`

packages as well as our dataset.

### Computing a Cox Model

Constructing a Cox model is quite similar to the Weibull model.
Simply replace the `survreg()`

function with `coxph()`

.

`# Compute a Cox model`

cxmod <- coxph(Surv(time, status) ~ x, data = aml)

Because the Cox model is semi-parametric, there is no intercept in the coefficients as there is in the Weibull model. In contrast to the Weibull model, negative coefficient values indicate a positive influence on the duration time.

`# Look at the model coefficient`

coef(cxmod)

output:

xNonmaintained

0.9155326

The coefficient implies that nonmaintained patients have a reduced survival probability.

### Visualizing a Cox Model

The method for visualizing a Cox model is almost identical to that for Weibull models. We will go through them again step by step.

**Compute the Cox model.**

`cxmod <- coxph(Surv(time, status) ~ x, data = aml)`

**Decide on covariate combinations.**

`# Create newdat`

newdat <- expand.grid(

x = levels(aml$x)

)

# Rename the rownames (remember that we have 2 rows)

rownames(newdat) <- letters[1:2]

# Look at newdat

newdat

output:

x

a Maintained

b Nonmaintained

**Compute the survival curves.**

We can now compute the survival curves using the `survfit()`

function.
Since we don't need confidence intervals, we set `conf.type`

to `none`

.

`# Compute the survival curves`

cxsf <- survfit(cxmod, data = aml, newdata = newdat, conf.type = "none")

# Look at the structure

str(cxsf)

output:

List of 11

$ n : int 23

$ time : num [1:18] 5 8 9 12 13 16 18 23 27 28 ...

$ n.risk : num [1:18] 23 21 19 18 17 15 14 13 11 10 ...

$ n.event : num [1:18] 2 2 1 1 1 0 1 2 1 0 ...

$ n.censor: num [1:18] 0 0 0 0 1 1 0 0 0 1 ...

$ surv : num [1:18, 1:2] 0.951 0.898 0.869 0.841 0.811 ...

..- attr(*, "dimnames")=List of 2

.. ..$ : NULL

.. ..$ : chr [1:2] "a" "b"

$ cumhaz : num [1:18, 1:2] 0.0504 0.1081 0.1403 0.1737 0.2101 ...

$ std.err : num [1:18, 1:2] 0.0402 0.0666 0.0801 0.0937 0.1081 ...

$ logse : logi TRUE

$ std.chaz: num [1:18, 1:2] 0.0402 0.0666 0.0801 0.0937 0.1081 ...

$ call : language survfit(formula = cxmod, newdata = newdat, conf.type = "none", data = aml)

- attr(*, "class")= chr [1:2] "survfitcox" "survfit"

**Create a**`data.frame`

with survival curve information.

To generate the data frame with the desired information about the survival curve, we use the `surv_summary()`

function of the `survminer`

package.

`surv_summary()`

provides a data frame containing a neat summary from `survfit()`

results, containing columns like `time`

and `surv`

.

`# Create the data frame`

surv_cxmod0 <- surv_summary(cxsf)

# Call head on surv_cxmod0

head(surv_cxmod0)

output:

time n.risk n.event n.censor surv std.err upper lower strata

1 5 23 2 0 0.9508567 0.04022921 NA NA a

2 8 21 2 0 0.8975825 0.06663029 NA NA a

3 9 19 1 0 0.8690765 0.08013356 NA NA a

4 12 18 1 0 0.8405707 0.09374603 NA NA a

5 13 17 1 1 0.8105393 0.10813703 NA NA a

6 16 15 0 1 0.8105393 0.10813703 NA NA a

We add another column to the data frame that contains information about the covariate `x`

(maintenance chemotherapy).
To do this, we use the `cbind()`

function to choose the row of `newdat`

that corresponds to the correct variable for each row in the data frame and combine it with the survival curve information.

`# Add the colum with info about x`

surv_cxmod <- cbind(surv_cxmod0,

newdat[as.character(surv_cxmod0$strata), ])

**Plot the Cox model.**

To plot the Cox model, we can use the `ggsurvplot_df()`

function that we used before to visualize the Weibull model.

`ggsurvplot_df(surv_cxmod, color = "x", legend.title = NULL, censor = FALSE)`

As a result, we get the Cox model with two survival curves: `a`

representing maintained patients, and `b`

representing nonmaintained patients:

As our coefficient predicted, the survival curve of the maintained patients show that they have a better outcome.