Survival Analysis | Statistics for Applied Epidemiology | Tutorial 11

By MarinStatsLectures-R Programming & Statistics

Summary

## Key takeaways

- **Log-Rank Test Limitation**: The log-rank test does not allow us to adjust for confounders; it tests the difference in survival between two age groups ignoring other factors. [01:10], [01:32]
- **Kaplan-Meier Can't Handle Numerics**: Kaplan-Meier can't incorporate numeric explanatory variables; it can only stratify on categorical variables, requiring a new curve for each combination like 12 curves for three variables. [01:54], [02:27]
- **Cox Model Baseline Hazard Unspecified**: In Cox regression, the baseline hazard is unspecified and varies over time, so we can't calculate absolute hazards but can estimate hazard ratios by exponentiating coefficients. [05:07], [06:12]
- **Hazard Ratio Interpretation**: A hazard ratio of 1.69 for over 40 means their instantaneous risk of death at a given time is 1.69 times that of under 40; if less than 1, lower risk and longer median survival. [09:24], [09:46]
- **Confounder Check with Likelihood Ratio**: Use a likelihood ratio test to see if adding mismatch improves model fit; p=0.5 means no improvement, so exclude it as it's neither confounder nor predictor. [13:56], [14:47]
- **Proportional Hazards Assumption Check**: Check proportional hazards with log-log plots looking for parallel lines, Schoenfeld residuals for a flat line, or an x-time interaction; crossing lines violate as hazards switch. [20:15], [21:49]

Topics Covered

  • Log-rank ignores confounders
  • Kaplan-Meier can't handle continuous variables
  • Cox models hazard ratios, not absolute hazards
  • Proportional hazards must remain constant
  • Check assumptions with log-log plots

Full Transcript

hi and thanks for joining me for tutorial 11 in this session we're going to finish up our discussion of survival analysis including the kaplan-meier

method and Cox proportional hazards regression so in the last session we left off talking about kaplan-meier curves we talked about how the log-rank

test can be used to compare survival curves between two groups with the null hypothesis that there's no difference in survival between the two groups compared and the alternative hypothesis that the

survival curves are different the log-rank test is sort of like a chi-squared test it compares the observed number of deaths for each group versus the expected if there was no

relationship between survival and the explanatory variable so let's revisit the example from the last tutorial where we were comparing the survival curves

for those under and over age 40 just by looking at the survival curves between the two groups it appears that there's a difference with those over 40 having

shorter survival than those under 40 but we can statistically test if there is a significant difference in survival for the two age groups with the log-rank

test however the log-rank test does not allow us to adjust for confounders this tests the difference in survival between the two age groups ignoring other

factors and there may be other factors at play that we aren't controlling for here so let's look at the output from the log-rank test since the p-value of

the log-rank test is less than point zero five we reject the null hypothesis and have evidence to believe that the survival functions are different for the two age groups ignoring other factors
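
A rough sketch of how a log-rank test like this might be run with R's survival package; the data frame dat and the variables time, died, and over40 are illustrative placeholders, not taken from the video.

```r
# Log-rank test comparing the survival curves of the two age groups
# (hypothetical data frame 'dat' with follow-up time, an event indicator,
# and an over-40 indicator)
library(survival)

survdiff(Surv(time, died) ~ over40, data = dat)
# a p-value below 0.05 suggests the survival functions differ between
# the age groups, ignoring all other variables
```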

those over 40 appear to have shorter survival time compared to those under 40 ignoring other factors but there are several limitations to using the

kaplan-meier method the first is that it can't incorporate numeric explanatory variables it can only incorporate categorical X's so we can stratify on

those categories of that variable that we want to adjust for and fit survival curves for each category but we can't include many explanatory variables

because we would need a new curve for each group for example if we wanted to include three variables like over versus under 40 male versus female and high

medium versus low dose we would have 12 survival functions also since kaplan-meier is non parametric there is no simpler form to neatly summarize the

relationship between x and y with regression we have a slope beta 1 so we can boil down all observations to this one number which is meant to represent

the entire line but with kaplan-meier there's no way to simplify the relationship we need the full table or graph so what if we want to control for

other factors that may confound the association between age and survival including continuous variables and what if we want to summarize the relationship

between age and survival in a simpler form kaplan-meier is not a regression model so we need to use another method that allows us to summarize the

relationship between age and survival controlling for potential confounders this brings us to Cox proportional hazards regression so a kaplan-meier

analysis is a good starting point it's useful for initial exploration between variables and survival and it's also good for simpler datasets where you don't need to adjust for many variables
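
A minimal sketch of the Kaplan-Meier analysis being described, again assuming placeholder names (dat, time, died, over40) and R's survival package.

```r
# Kaplan-Meier curves stratified on the categorical age group;
# each additional categorical variable multiplies the number of curves
library(survival)

km <- survfit(Surv(time, died) ~ over40, data = dat)
summary(km)    # estimated survival at each event time, by group

plot(km, col = c("blue", "red"),
     xlab = "Time", ylab = "Survival probability")
legend("topright", legend = c("under 40", "over 40"),
       col = c("blue", "red"), lty = 1)
```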

but if we want to control for other factors that may confound an association we need to use Cox proportional hazards regression so a Cox proportional hazards

model is just another type of regression model like linear logistic or Poisson regression and you can incorporate many explanatory variables including numeric

explanatory variables it is a semi parametric method it does not assume a constant hazard the hazard is allowed to vary over time for example the hazard of

relapse after treatment is allowed to change over time so this time we're modeling the log hazard at time T to get the log hazard ratio and ultimately the

hazard ratio recall that the hazard means the probability that you die now given that you're alive it doesn't have much of a

useful interpretation on its own but relative hazards or hazard ratios do so that's ultimately what we're trying to get at you'll see in the Cox regression

equation that in place of where the intercept is in linear regression we have the log baseline hazard the baseline hazard is the hazard at time T

for observations when all predictors equal zero an important property of the baseline hazard is that it's unspecified this is because the baseline hazard is

allowed to vary over time so it doesn't have a fixed value we actually don't know what the value of the baseline hazard is it's unspecified because it

varies over time so it won't be in your output so because we don't know what that baseline hazard is we can't take all of our values from our regression

output and plug it into our regression formula to calculate someone's hazard of an event at a particular time because we don't know what that baseline hazard is
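
For reference, the Cox regression equation being described can be written out as follows; this is standard notation rather than anything shown verbatim in the video.

```latex
% log hazard at time t as a function of the predictors:
\log h(t \mid x_1,\dots,x_k) = \log h_0(t) + \beta_1 x_1 + \dots + \beta_k x_k
% equivalently, h(t \mid x) = h_0(t)\exp(\beta_1 x_1 + \dots + \beta_k x_k),
% so \exp(\beta_j) is the hazard ratio for a one-unit increase in x_j,
% while the baseline hazard h_0(t) is left unspecified
```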

this is in contrast to previous models that we've discussed in the course so we could estimate the mean Y with linear regression for those of a given age gender and ethnicity for example with

logistic regression we could estimate the odds or probability and with Poisson regression we could estimate the rate for participants with a given set of X

values but with Cox regression we can't estimate the hazard of an event at a given time for people with a given set of X values because we don't know the

value of the baseline hazard what we can do with Cox regression is estimate the hazard ratios from our coefficients in order to compare the hazards between

groups so a hazard ratio is like an odds ratio we're always comparing one group to another with Cox regression we're comparing the hazard in a group to the

baseline hazard but we don't know what the value of that baseline hazard is so the Cox proportional hazards model gives us the advantage that we can model the ratio of the hazard of

experiencing an event at a given time between groups but we can't estimate the hazard of an event at a particular time so in some ways it's not technically

correct but you can think of the baseline hazard as sort of like an intercept it's the hazard at time T for observations when all predictors equal

zero but we don't label it as an intercept because it varies over time and therefore we don't know what its value is we can compare two groups or

the difference for a one unit increase in X but we can't actually calculate someone's hazard at a particular time because we don't have that baseline

hazard so with Cox proportional hazards regression we exponentiate the coefficients to get hazard ratios a hazard ratio is the relative hazard

between groups the hazard ratio is interpreted in a similar way as the rate ratio or odds ratio if it's less than 1 then the group you're interested in has

a lower hazard or a lower instantaneous risk of death compared to the reference group we can also say that the group of interest has a longer median survival

compared to the reference group if the hazard ratio is less than 1 if the hazard ratio is greater than 1 the group you're interested in has a higher instantaneous risk of dying or

experiencing the event at a given time compared to the reference group the coefficients in our output with Cox proportional hazards regression tell us the difference in the log hazard

function between two groups if X is categorical or the change in the log hazard function for a one unit change in

X if X is continuous so now we're going to return to our example where we're comparing the relationship between age so over age 40 versus under age 40 and

survival we fit a simple Cox proportional hazards model with age as the sole explanatory variable and hazard of death as the outcome the coefficient

for over 40 is 0.52 with a p-value of 0.02 which suggests that there is a significant relationship between age and survival
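
A hedged sketch of fitting this simple Cox model in R (placeholder names dat, time, died, over40); the coef and exp(coef) columns of the summary are the beta and hazard ratio being discussed here.

```r
# Simple Cox proportional hazards model with age group as the only
# explanatory variable
library(survival)

cox1 <- coxph(Surv(time, died) ~ over40, data = dat)
summary(cox1)
# coef      = beta1, the difference in log hazard between the groups
# exp(coef) = the hazard ratio, reported with a 95% confidence interval

exp(coef(cox1))      # hazard ratio directly
exp(confint(cox1))   # 95% confidence interval on the hazard ratio scale
```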

this tells us that at a given point in time the log instantaneous risk of an event which is death in this case is 0.52 higher for those over 40 compared to

those under 40 so those over 40 have a higher instantaneous risk of dying at a given time compared to those under 40

so we exponentiate this coefficient to get the hazard ratio which is shown right beside the coefficient and you can also see this below along with its 95%

confidence interval so you would exponentiate the beta 1 to get 1.69 this tells us that the hazard or instantaneous risk of experiencing the

event which is death in this case at a given time for the over-40 group is 1.69 times the instantaneous risk of death at a given time for the under 40

group ignoring all other variables if we interpret the confidence interval for the hazard ratio this tells us that at a given point in time the hazard of death for those over 40 is between one point

one and two point five nine times the hazard of death for those under 40 so based on this simple model do we have evidence to suggest that people who are

over 40 have a greater risk of death at a given time compared to those under 40 yes the p value associated with the slope is less than point zero five and the confidence interval for the hazard

ratio does not include one we can estimate that at a given instant in time someone who is over 40 is 69 percent more likely to die than someone who is

under 40 ignoring all other variables so these are all of the tests of significance for the Cox model we have one beta or coefficient for the single

variable in our model these tests are testing whether any of the betas are not equal to zero so basically if our model is better than nothing like the model f

statistic with linear regression so they're all calculating this in a different way than the p-value associated with beta 1 which is why we have slightly different p-values but they're

all less than point zero 5 and are similar to the p value associated with beta 1 because we just have that one variable in our model so what about

possible confounders we know that there are likely other factors that could impact whether or not you experience death that are also related to age let's

say that we want to adjust for mismatch level but before we do that what do we have to do first we have to think through whether or not it makes sense

conceptually you should then also compare those variables statistically so use bivariate plots summaries or tests to examine the association between

mismatched level and age as well as mismatch level and death I'm not going to do these steps in the interest of time but you should always think through if a variable makes sense as a

confounder and should also see if this conceptual understanding is reflected in bivariate relationships between the potential confounder and the primary explanatory variable as well as the

potential confounder and the outcome variable this is particularly important when we're working with logs as we have been with logistic Poisson and Cox

regression this is because even small changes in the beta one can have a large confounding influence when beta 1 is logged so we don't want to rely

exclusively on the 10% change rule and then the next step would be to add mismatch to our model and what are we

looking for now a change in beta 1 so if age is our main explanatory variable of interest and we're wanting to know if there is truly an independent relationship between age and death we

look at the change in the age coefficient so let's compare our beta 1 of 0.52 to our previous model in our

previous model beta 1 was also approximately 0.52 so there's essentially no change in beta 1 so now let's look at the standard error

associated with beta 1 what happened to that with our model including mismatch the standard error is about 0.22 in our model without

mismatch the standard error was also about 0.22 so there was little to no change in the standard error so based on

this what do we think mismatch doesn't seem to be doing much it doesn't look like it's acting as a confounder and since the standard error didn't decrease

it doesn't look like a predictor now we could test to see whether adding mismatch makes the model fit better and how could we do that the likelihood

ratio test so we compare two models using the likelihood ratio test just like we did with linear logistic and Poisson regression so this answers the

question is the full model significantly better than the reduced model the null hypothesis is that there's no difference between models the alternative

hypothesis is that the full model explains more it's better so if we find that there's no difference between the two models which one should we go with

the reduced model because we want to go with simpler models whenever possible
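
A sketch of this comparison in R, assuming the same placeholder names plus a mismatch variable: fit the model with and without mismatch, compare beta 1 and its standard error, then run the likelihood ratio test with anova().

```r
# Reduced model (age group only) versus full model (age group + mismatch)
library(survival)

cox1 <- coxph(Surv(time, died) ~ over40, data = dat)
cox2 <- coxph(Surv(time, died) ~ over40 + mismatch, data = dat)

summary(cox1)$coefficients   # beta1 and its standard error, unadjusted
summary(cox2)$coefficients   # beta1 and its standard error, adjusted

# likelihood ratio test: a large p-value means the full model is not a
# significant improvement, so we keep the reduced model
anova(cox1, cox2)
```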

so let's look at the results from the likelihood ratio test we can see that our p-value is quite high at 0.5 which is greater than point zero five so based on this we fail to reject the null hypothesis and conclude that the model

that includes mismatch does not increase the model's predictive power so it doesn't look like mismatch is a confounder or that it makes our model better so we should

probably exclude it from our model what if we had instead found here that the likelihood ratio test was significant that our p-value is less than point zero

five even though our previous analyses suggested that it wasn't a confounder this would mean that mismatch would be another significant predictor so it's

improving our ability to predict the outcome but it's not confounding the relationship between x and y so if we're trying to build an effect size model like we are here and we're only

concerned about the relationship between our primary x and y we might not want to keep it in the model if we're only interested in true confounders but if we want to increase

the precision of our estimate of the effect we might decide to keep it in or if it was something like age or gender where you think theoretically it could be important you could argue to keep it in for face

validity so it's possible that you could have many explanatory variables that are significant predictors but not confounders but where you draw the line of what to include or exclude in your

model will depend on your research question and other considerations but if you were fitting a predictive model how would your approach be different then well you wouldn't be looking for a

change in beta one because you don't have a primary explanatory variable you're just looking for X's that predict your outcome so if a variable was found

to be a predictor with a likelihood ratio test in a predictive model you would want to keep it in your model but if not you would exclude it so now

let's turn to the Cox regression assumptions similar to previous regression models the first assumption is that individuals are independent of

one another also the events are independent of one another so if one person experiences an event that's not going to increase or decrease the

likelihood of that person or other people experiencing the event the second assumption which is unique to Cox proportional hazards regression compared to other regression models is that

censoring is not informative this just means that people who stayed in your study are no different from those who were lost to follow-up if you only have a small number of people who are lost to

follow-up then it makes less of a difference but that's one of the assumptions that we're making here the next one is more of a property of the model rather than an assumption but

that's that the baseline hazard is unspecified so we don't know what that is our model doesn't tell us that we're also assuming that the X values don't

change over time so if you were a smoker at the beginning of the study when we measured it we assume that you're going to maintain your status as a smoker throughout the study there are extensions of the Cox model

that allow us to account for variables that change over time such as time updated cox models but we won't talk about these here another property is

that the log hazard rate is a linear function of the x's so just like logistic regression where the log odds was a linear function of the x's or

Poisson regression where the log rate was a linear function of the x's so we can check for this with residual plots like we did with linear regression which

we'll discuss shortly and then the final assumption is the proportional hazards assumption which is probably the assumption that we are most concerned

about so this assumption means that the hazard ratio is the same regardless of whether you're looking at time 1 2 3 4

etc it doesn't mean that the hazard stays the same the hazard can vary but the relative difference between groups compared is constant over time

so therefore the curves are not going to cross because the hazard ratio is going to stay the same let's look at an example of what I mean by this this is

based on one of Mike's figures that he uses to explain the proportional hazards assumption recall that Cox regression allows the hazard to change over time so

the instantaneous risk of an event like relapse after treatment can change over time but the main assumption or

limitation on this is that hazards must be proportional the hazard ratio is constant so the relative difference between two groups being compared must

stay constant over time so for example in this figure we are comparing the hazard of an event for Group A and B we can see that the hazard for Group A

changes over time and the hazard for Group B changes over time but the relative hazard or hazard ratio is constant the hazards are proportional

the relative difference between groups a and B is constant over time the relative risk of death between the two groups is the same at all times whether it's time one two or three a person in Group B is about twice as likely to die as a person in Group A
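
A toy numeric illustration of this idea, with made-up hazards rather than anything from the video: the hazards themselves change over time, but their ratio stays fixed at 2.

```r
# hypothetical hazards for Group A at times 1, 2, 3
hazard_A <- c(0.10, 0.05, 0.20)
# under proportional hazards Group B's hazard is always twice Group A's
hazard_B <- 2 * hazard_A
hazard_B / hazard_A   # the hazard ratio is 2 at every time point
```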

so how can we check the proportional

hazard assumption the first way is to assess the log log plot so the log of survival versus the log of time and what

are we looking for we are looking for convergence divergence or crossing of the hazard functions which would violate

the assumption so we're looking for parallel curves in our log log plot if we see these megaphoning out or coming

together especially if they cross then that's a clear violation because if you see that those two curves are crossing what is that telling you the hazards are

not proportional because they cross at one point the hazard was higher in one group than the other and then at some point they switched if you see these megaphoning out or coming together it's

sort of a judgment call as to whether or not the assumption is violated but if they cross that's a clear violation you would be building an inappropriate model

because the relationship between those groups is different over time the relative hazard changes so giving one hazard ratio is incorrect

you're not representing that relationship properly if they converge a little bit or diverge a little bit you have to make a judgment call so let's

take a look at our example on the left are our original kaplan-meier curves that we plotted and the log log plot is on the right when we look at the log log

plot we see that the lines cross so the proportional hazards assumption is violated and it's really not appropriate to use a Cox proportional hazards model in this situation because the hazard was

higher in one group than the other and then at some point they switched
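
A sketch of the log-log plot in R with the same placeholder names: plot.survfit's fun = "cloglog" option plots log(-log(survival)) against log(time), and roughly parallel curves support the proportional hazards assumption.

```r
# log-log plot for checking proportional hazards
library(survival)

km <- survfit(Surv(time, died) ~ over40, data = dat)
plot(km, fun = "cloglog", col = c("blue", "red"),
     xlab = "Time (log scale)", ylab = "log(-log(survival))")
# crossing curves, as in this example, indicate a clear violation
```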

another way that we can check the proportional hazards assumption is schoenfeld's test this is kind of like a goodness of fit test in some ways we haven't talked about goodness-of-fit tests very much in this course but the null hypothesis is that

the proportional hazards assumption is met and the alternative hypothesis is that the proportional hazards assumption is not met if we reject the null this

suggests that the proportional hazards assumption is not met so you're actually wanting to fail to reject the null hypothesis so you're looking for a high

p-value in this case you can also calculate the correlation between schoenfeld's residuals and time and if this is positive this suggests that the log hazard ratio increases over time and

if this is negative this suggests that the log hazard ratio decreases over time you can also plot the schoenfeld residuals to check this assumption you don't need to know much about the

details about this other than that you want to see a flat line when you plot this especially in the middle but some curving of the tails is okay
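
A sketch of the Schoenfeld residual check in R using cox.zph() from the survival package, with the placeholder model from earlier.

```r
# Schoenfeld residual test of proportional hazards
library(survival)

cox1 <- coxph(Surv(time, died) ~ over40, data = dat)
ph_check <- cox.zph(cox1)

ph_check          # test per term; a high p-value supports the assumption
plot(ph_check)    # scaled Schoenfeld residuals versus time;
                  # we want to see a roughly flat line
```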

the third way that you can check the proportional hazards assumption is to add an x times time interaction so whatever your primary explanatory variable is times

time interaction term and if there is a significant interaction with time this indicates that the hazard ratio for a given X is dependent on the time which

violates the proportional hazards assumption if this interaction is significant this suggests that the effect of this variable changes with

time so that's saying the hazard ratio varies over time and therefore the proportional hazards assumption is violated
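
One way to sketch the x-by-time interaction check in R is the tt() time-transform argument of coxph(); this assumes over40 is coded as a 0/1 numeric indicator, and the names remain placeholders.

```r
# x-by-time interaction: a significant tt(over40) term suggests the
# hazard ratio for age group changes with time, violating the
# proportional hazards assumption
library(survival)

cox_tt <- coxph(Surv(time, died) ~ over40 + tt(over40),
                data = dat,
                tt = function(x, t, ...) x * t)
summary(cox_tt)
```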

so there are lots of different ways that you can check this assumption if the proportional hazards assumption is violated there are a few potential solutions to address this the first is

to stratify and fit separate cox models for different levels of x with this example you could fit separate models for each of the mismatched groups and

then estimate the hazard ratio for each of these respective groups another option is to fit an extended cox model which allows the covariates or

coefficients to vary over time and the third option is to include an x times time interaction term in your model and then estimate the hazard ratio for your

primary X which is age in this case for different time periods to check linearity just like linear regression you plot the residuals and are looking

for a smooth red line we reviewed how to do this in tutorial 1 so I'm not going to discuss this here but this is also a good way to check if there are any outliers that could be impacting your

data another option is to fit a model with X and X squared and then compare this to a model with only X to see if the model improves
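
A sketch of that comparison in R, using a hypothetical continuous age variable purely for illustration.

```r
# linearity check: does adding a squared term improve the model?
library(survival)

cox_lin  <- coxph(Surv(time, died) ~ age, data = dat)
cox_quad <- coxph(Surv(time, died) ~ age + I(age^2), data = dat)

# a small p-value here suggests the log hazard is not linear in age
anova(cox_lin, cox_quad)
```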

so just like we did with linear regression we use the same ways to check for linearity the next tutorial is our final tutorial for the semester and will be a review of the

course material to help you prepare for the final exam thanks for watching our video
