PART I

Biostatistics for the interventionist

Tim Collier1, Stuart Pocock1
1. Medical Statistics Department, London School of Hygiene & Tropical Medicine, London, UK


Summary

This chapter will address some of the fundamental statistical issues that arise in the analysis of cardiovascular clinical trials. Examples from recently published cardiovascular clinical trials are presented throughout to illustrate the statistical issues covered.

Most clinical trials are interested in assessing the efficacy of a new treatment versus some standard treatment or placebo. Three key questions arise: (i) is the new treatment more effective than the comparator? (ii) if yes, how much more effective? and (iii) how confident can we be in our estimate of effectiveness? These three questions correspond to the subjects of significance testing, effect estimation and confidence intervals, respectively. In this chapter we will cover the interpretation of p-values and their link to confidence intervals, and consider both relative (e.g., risk ratio) and absolute (e.g., difference in risk) measures of treatment effect.

The appropriate statistical test or method of analysis to be used is principally determined by the design, in particular the nature of the outcome of interest. Three types of trial outcome are covered in this chapter: binary (e.g., dead or alive at one-year); time-to-event (e.g., time to hospitalisation) and quantitative (e.g., in-stent stenosis [mm]).

We also cover statistical issues relating to the analysis of non-inferiority trials, sample size and power, baseline covariate adjustment, and the analyses of secondary endpoints, subgroups and composite endpoints.

Significance tests and p-values

In observational and non-randomised studies any differences observed between groups could be due to bias, confounding or chance. When a randomised controlled trial is well designed and carefully conducted, bias and confounding are notably reduced if not eliminated. Any difference observed between the groups is then due either to a genuine treatment effect or to pure chance. The purpose of significance testing is to assess the strength of evidence for a real treatment difference.

Although the appropriate statistical test to use varies with the outcome of interest (e.g., binary, quantitative, etc.), the underlying principle is the same. At the heart of the significance test are the null and alternative hypotheses. For a superiority trial involving two treatment groups (e.g., a new versus standard treatment) the null hypothesis will be that the two treatments are equal in their effect on the outcome of interest. The alternative hypothesis will be that the two treatments are not equally effective. Generally the alternative hypothesis will be two-sided, which means that we consider the possibility that the new treatment could be better or worse than the comparator. From the statistical test we calculate a p-value which is a measure of the strength of evidence against the null hypothesis. The p-value is the probability of obtaining a treatment difference at least as great (in either direction if a two-sided test) as that actually observed if the null hypothesis was true. The smaller the p-value, the stronger is the evidence against the null hypothesis.
For example, in the SPIRIT IV Trial [1] 3,687 patients were randomly assigned (in a ratio of 2:1) to receive either everolimus-eluting stents or paclitaxel-eluting stents without routine follow-up angiography. The primary endpoint for the trial was target-lesion failure at one year. The results are shown in Table 1.

The null hypothesis being tested is that everolimus-eluting stents and paclitaxel-eluting stents are equally effective with respect to the primary endpoint. Even if this were the case (i.e., the null hypothesis was true) we would be most unlikely to observe exactly the same percentage of patients experiencing the primary endpoint in each treatment group. The significance test addresses the question: if the null hypothesis were true, what would be the probability of getting a difference as great as (or greater than) the one we have observed, i.e., 4.2% versus 6.8%? Given that the primary endpoint (target-lesion failure at one year) is a binary (yes or no) outcome variable, and that there are reasonably large numbers of events in each group, the appropriate statistical test to use is a chi-square test.

The chi-square test compares the observed number of events in each group with the number of events you would expect to see if the two treatments were equally effective. Since in total (i.e., combining both groups) 182 out of 3,611 patients had a primary endpoint, the overall risk of an event is 5.0%. Under the null hypothesis the expected number of events is calculated by multiplying the number of patients by the overall risk. The question is, “are the data compatible with the true risk of the primary endpoint occurring being 5% in each group”? For example, in the everolimus group we would expect to see 2,416 x 5% = 121 events. In fact 101 events were observed, 20 less than expected if the two treatments were equally effective. If the expected number of events is small, e.g., less than 5, then Fisher’s exact test should be used rather than the chi-square test.
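For readers who wish to reproduce this calculation, here is a minimal sketch in Python (using the scipy library; the trial itself published no code) applying the chi-square test to the SPIRIT IV counts above.

```python
# Chi-square test for the SPIRIT IV primary endpoint, using the counts above.
from scipy.stats import chi2_contingency

# Rows: everolimus, paclitaxel. Columns: event, no event.
table = [[101, 2416 - 101],
         [81, 1195 - 81]]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")  # reported in the paper as p=0.001
print(f"expected events, everolimus: {expected[0][0]:.1f}")  # ~122 under the null
# (the chapter's figure of 121 uses the overall risk rounded to 5%)
```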

For SPIRIT IV the p-value from the chi-square test is p=0.001. The smaller the p-value obtained from the test, the greater the evidence to contradict the null hypothesis and therefore the stronger the evidence that there is a real treatment difference. A p-value of 0.001 means that the probability of getting a difference at least as great as 4.2% vs. 6.8% if the treatments were in truth equally effective is one in a thousand. The test therefore gives strong evidence that everolimus-eluting stents reduce the risk of the primary endpoint compared to paclitaxel-eluting stents.

For time-to-event outcomes, e.g., time to death, p-values are usually calculated using either a log-rank test or a likelihood ratio test from a Cox proportional hazards model. For quantitative outcomes, e.g., in-stent late loss, p-values are usually calculated using a two-sample t-test. These different tests are covered in more detail below under the relevant sections. However, the underlying principle of significance testing and the interpretation of the p-value are the same for each of these statistical tests.

INTERPRETING P-VALUES

Some common mistakes are made regarding the interpretation of p-values. A p-value provides a measure of the strength of the evidence against the null hypothesis and is not absolute proof that a treatment works.
Whereas a small p-value provides evidence against the null hypothesis, a large p-value should not be interpreted as proving that the null hypothesis is true. In SPIRIT IV a major secondary endpoint was the composite of cardiac death or target-vessel myocardial infarction at 12 months. This endpoint was experienced by 2.2% of patients in the everolimus group and 3.3% of patients in the paclitaxel group with p=0.09 and was therefore not statistically significant at the traditional 5% significance level. The correct interpretation of this test is that there is insufficient evidence of a treatment difference with regard to this outcome. It could be that the trial was not powered for this endpoint or, related to this, that the treatment effect was smaller than anticipated.

Traditionally the null hypothesis is rejected if the p-value is less than or equal to some pre-specified significance level, most commonly 0.05.

Often too much emphasis is placed on whether p is greater than or less than 0.05. Even when two treatments are truly identical in their effect on an outcome there is, by definition, a 1 in 20 chance of obtaining a p-value less than or equal to 0.05. This is known as the Type-1 error or false positive rate. It is possible that a statistically significant (p<0.05) result can hang on a single event. For example, consider a trial with 500 patients in each of two treatment groups which observes 15 events in one group versus 6 events in the other. This would produce a p-value of 0.047. Transferring a single event from one group to the other (14 versus 7) results in a p-value of 0.12. So be very cautious of p-values close to 0.05.
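This fragility can be checked directly. The sketch below (a hypothetical trial, not one of the studies discussed in this chapter) repeats the uncorrected chi-square test before and after moving a single event between groups.

```python
# Moving a single event between groups shifts p across the 0.05 boundary
# in the hypothetical 500-versus-500 patient trial described above.
from scipy.stats import chi2_contingency

for events_a, events_b in [(15, 6), (14, 7)]:
    table = [[events_a, 500 - events_a],
             [events_b, 500 - events_b]]
    _, p, _, _ = chi2_contingency(table, correction=False)
    print(f"{events_a} vs {events_b} events: p = {p:.3f}")
# 15 vs 6 events: p = 0.047
# 14 vs 7 events: p = 0.123
```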

It is good practice to report the precise value of the p-value rather than to use conventions such as p<0.05 or “NS” (to indicate p>0.05), which tend to focus attention on arbitrary boundaries. Reporting the precise p-value helps weigh up the strength of evidence, though once the evidence against the null hypothesis is overwhelming, e.g., p-values less than 0.001 or 0.0001, these can be reported as p<0.001 or p<0.0001 without much loss of information.

ESTIMATING THE TREATMENT EFFECT

As well as assessing the evidence for whether a real treatment difference exists, clinical trials are also interested in estimating the direction and magnitude of the difference. By “direction” of the difference we mean “does the treatment under investigation do better or worse than the control”, and by “magnitude” of the difference we mean how much better or worse.
A number of different measures of treatment effect are commonly used in cardiovascular trials. The measure of effect to use is determined predominantly by the type of outcome used in the trial e.g., binary outcomes – risk ratios; time-to-event outcomes – hazard or rate ratios; quantitative outcomes – difference in means. Here we will focus on treatment effect estimates for binary outcomes before turning to time-to-event and quantitative outcomes later.

Relative treatment effects

For binary outcomes, such as cardiac death at 30 days, estimates of the risk ratio (sometimes called the relative risk) are typically presented. The risk in each group is calculated by dividing the number of patients experiencing the outcome by the total number of patients. This might be expressed as either a proportion or a percentage. To calculate the risk ratio the risk in the active treatment group is divided by the risk in the control group. Hence a risk ratio less than or greater than one indicates that the treatment looks better or worse, respectively, compared to the control group, with a risk ratio of one meaning that there is no difference in risk. The further the risk ratio is from one the greater the treatment difference – though this is not symmetrical about one. A risk ratio of 0.5 means a halving of risk, whilst a risk ratio of 2 means a doubling of risk.
In SPIRIT IV the percentage of patients experiencing the primary endpoint was 4.2% (101/2,416) and 6.8% (81/1,195) in the everolimus and paclitaxel groups respectively. The risk ratio for everolimus compared to paclitaxel is therefore 4.2% ÷ 6.8% = 0.62. As this ratio is less than one it indicates that the everolimus-eluting stents have performed better than the paclitaxel-eluting stents as regards the primary endpoint. Sometimes the treatment difference might be described in terms of the relative risk reduction, which is calculated as 100 x (1 – risk ratio). For SPIRIT IV we have a relative risk reduction of 100 x (1 – 0.62) = 38%. The risk in the everolimus group is 38% lower than that in the paclitaxel group.

Some trials will report odds ratios rather than risk ratios. To calculate the risk of an event we take the number of patients with the event and divide by the total number of patients; to calculate the odds of an event we divide by the number of patients without the event. So in SPIRIT IV the odds of the event are 101 ÷ 2,315 = 0.04 and 81 ÷ 1,114 = 0.07 in the everolimus and paclitaxel groups respectively. The odds ratio is therefore 0.04 ÷ 0.07 = 0.60. Odds ratios are always further from one than risk ratios (Figure 1) though the difference will be less marked when the risk is low. Here, since the outcome is quite rare (5% overall), the odds ratio (0.60) and risk ratio (0.62) are quite similar.

Absolute treatment effect and number needed to treat

As well as reporting the relative risk it is also important to report the absolute risk reduction. The absolute risk reduction is calculated by subtracting the risk in one group from that in the other. So for SPIRIT IV the absolute risk reduction at 12 months is 6.8% - 4.2% = 2.6%. This makes the treatment benefit seem much more modest than the 38% relative risk reduction seen above. Note that whilst for a risk ratio a value of one is equivalent to no difference, for an absolute risk difference the value 0 means no difference.

From the absolute risk reduction we can work out the number needed to treat (NNT). This is the average number of patients who would need to be treated to prevent one additional event – the bigger the treatment difference the smaller the NNT. The NNT is calculated as 100 ÷ absolute percentage reduction. For SPIRIT IV the NNT is 100 ÷ 2.6 = 38.5. That is we would need to treat 39 patients with everolimus-eluting stents (compared to treating them with paclitaxel-eluting stents) to prevent one primary endpoint occurring within 12 months of follow-up.
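The arithmetic behind all of these effect measures can be collected in a few lines; the sketch below reproduces the SPIRIT IV figures quoted above.

```python
# Relative and absolute effect measures for the SPIRIT IV primary endpoint.
events_t, n_t = 101, 2416    # everolimus
events_c, n_c = 81, 1195     # paclitaxel

risk_t, risk_c = events_t / n_t, events_c / n_c     # 4.2% and 6.8%
rr = risk_t / risk_c                                # risk ratio, ~0.62
rrr = 100 * (1 - rr)                                # relative risk reduction, ~38%
odds_ratio = (events_t / (n_t - events_t)) / (events_c / (n_c - events_c))  # ~0.60
arr = 100 * (risk_c - risk_t)                       # absolute risk reduction, ~2.6%
nnt = 100 / arr                                     # number needed to treat, ~38.5
print(f"RR {rr:.2f}, RRR {rrr:.0f}%, OR {odds_ratio:.2f}, ARR {arr:.1f}%, NNT {nnt:.1f}")
```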

ESTIMATING UNCERTAINTY WITH CONFIDENCE INTERVALS

Treatment estimates from clinical trials are always subject to random error and are only estimates of the true treatment difference. If we were to repeat a trial a number of times using precisely the same methodology our results would be slightly different each time. To allow us to express the uncertainty in our treatment estimate we calculate a confidence interval. A confidence interval allows us to see a range of plausible values for the true treatment effect which are consistent with what we have observed in our trial.

Typically we calculate a 95% confidence interval (95% CI). In the long run five per cent of 95% CIs will not contain the true difference. Whenever we calculate a 95% CI there is a 2.5% chance that the truth lies below, and a 2.5% chance that the true effect lies above the range. The larger the study (in terms of number of patients or number of events) the more precisely we can estimate the truth and therefore the narrower will be the 95% CI.

In SPIRIT IV the risk ratio for the primary endpoint was 0.62 with a 95% confidence interval ranging from 0.46 to 0.82. That is, whilst the best estimate is 0.62, the data from the trial is consistent with a true treatment effect as optimistic as 0.46 (a 54% relative risk reduction) or as pessimistic as 0.82 (an 18% relative risk reduction).
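The chapter does not detail how this interval was computed. A standard large-sample approach works on the log scale using the standard error of the log risk ratio and, as the sketch below shows, it reproduces the published interval (the trial's own method may have differed in detail).

```python
# Approximate 95% CI for the SPIRIT IV risk ratio via the log-RR standard error.
import math

a, n1 = 101, 2416   # events / patients, everolimus
b, n2 = 81, 1195    # events / patients, paclitaxel

rr = (a / n1) / (b / n2)
se_log_rr = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")  # ~0.62 (0.46 to 0.82)
```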

Link between p-values and confidence intervals

It is helpful to be aware of the link between p-values and confidence intervals. When a 95% CI for a relative risk excludes one then the p-value from the significance test will be less than 0.05. The further the entire interval is away from one the smaller the p-value will be. Conversely when a 95% CI includes one, we know, without carrying out a significance test, that the p-value will be greater than 0.05. A 99% confidence interval which does not include one, indicates a statistically significant result at the 1% significance level i.e., a p-value <0.01.

Figure 1 shows the results from SPIRIT IV for (A) the primary endpoint and (B) one of the main secondary endpoints. As the results are presented on the relative risk scale (RR and OR) the vertical line of no treatment effect is at 1. Looking at (A) we see that the 95% CI for the risk ratio does not cross the vertical line (i.e., does not include 1) and therefore will be statistically significant at the 5% level. As the upper limit (0.82) is some distance from 1.0 we could correctly guess that the p-value will be considerably smaller than 0.05. Looking at (B), we see that the 95% CI for the risk ratio crosses the line of no-effect and therefore will not be statistically significant at the 5% level. We can see here that the estimates and 95% CIs for the risk ratio (RR) and odds ratio (OR) are slightly different. As pointed out above the odds ratio will always be further from the null value than the risk ratio, though this is not so marked when the outcome is rare.

Recall that for measures of absolute risk reduction and for difference in means it is the value 0, rather than 1, which indicates no treatment effect. Therefore, when looking at the 95% CI for estimates of an absolute difference, we need to note whether or not the interval contains 0.

FOCUS BOX 1 Significance testing, estimation and confidence intervals
  • Significance testing, estimating treatment effects and confidence intervals are at the heart of statistical inference
  • The p-values from significance tests do not prove that a treatment works (or doesn’t work), but rather provide a measure of the strength of evidence against the null hypothesis being true
  • Estimates of relative and absolute treatment benefit as well as the underlying risk should be considered
  • Confidence intervals should always be calculated to provide a range of plausible values for the true treatment effect

Time-to-event outcomes

Many cardiovascular trials study time to a primary outcome using a statistical approach called “survival analysis”. Such trials will normally present Kaplan Meier survival plots showing cumulative survival or incidence over the follow-up period, p-values from a log-rank or likelihood ratio test, and report incidence rates and rate ratios or hazard ratios. They may also report and compare cumulative incidence rates at specific time points e.g., the 1-year survival rate. The most commonly used statistical model for the analysis of time-to-event outcomes is the Cox proportional hazards model.

Survival analysis takes into account the fact that patients will have unequal lengths of follow-up. Some patients will be lost to follow-up or withdraw from the trial, whilst others will have shorter follow-up simply because they were randomised later. Patients are said to be “at risk” of the event of interest while they are being followed. Subjects who do not experience the outcome of interest during the trial are censored at the end of the follow-up period.

The JUPITER Trial [2] investigated the efficacy of rosuvastatin, 20 mg per day, versus placebo in 17,802 apparently healthy subjects with low density lipoprotein cholesterol levels less than 130 mg per decilitre and high-sensitivity C-reactive protein levels equal to or greater than 2 mg per litre. Participants were followed for the first occurrence of venous thromboembolism (a composite of pulmonary embolism or deep-vein thrombosis). The median follow-up time was 1.9 years with the longest follow-up time being 5 years. A median follow-up time of 1.9 years means that half the patients were observed for at least 1.9 years. The results are shown in Table 2.

The rate of venous thromboembolism was 0.18 per 100 patient years in the rosuvastatin group compared to 0.32 per 100 patient years in the placebo group. The rate in each group is obtained by dividing the number of patients with an event by the total patient follow-up time. As this is a very rare outcome (experienced by <1% of participants in the study) the rates have been presented per 100 patient years. A rate of 1.0 per 100 patient years means that if we followed 100 patients for a year we would expect to observe one event.
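As an illustration, the sketch below recovers these rates from the event counts given in Table 2. The patient-year denominators are not reported in the text, so they are back-calculated from the published rates and should be treated as approximations.

```python
# Rates per 100 patient-years for JUPITER, from the event counts in Table 2.
# The patient-year denominators are back-calculated from the published rates
# (0.18 and 0.32 per 100 patient-years), so they are approximate assumptions.
events = {"rosuvastatin": 34, "placebo": 60}
patient_years = {"rosuvastatin": 18900, "placebo": 18750}   # assumed

rates = {g: 100 * events[g] / patient_years[g] for g in events}
print(rates)                                      # ~0.18 and ~0.32 per 100 py
print(rates["rosuvastatin"] / rates["placebo"])   # rate ratio ~0.56, near the HR 0.57
```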

The significance test asks how likely a difference as large as this (0.18 versus 0.32 per 100 patient years) would be to arise by chance alone. The p-value calculated from a likelihood ratio test using a Cox proportional hazards model is p=0.007. Although this statistical test is different to that used for the binary outcome above, the underlying principle is the same. The null hypothesis is that rosuvastatin is no better or worse than placebo and the p-value is the probability of getting a result as or more extreme than that we have observed if the null hypothesis were true. The p-value is very small (p=0.007), providing strong evidence against the null hypothesis; in other words, there is strong evidence that rosuvastatin really is more effective than placebo in preventing venous thromboembolism.

The most commonly presented estimate of the treatment effect in trials with a time-to-event outcome is a hazard ratio (HR), estimated using a Cox proportional hazards model. The hazard or hazard rate is the instantaneous probability of the outcome of interest; that is the probability that a patient experiences the outcome at a specific point in time given that they have survived to that point in time. The HR can be thought of as the hazard rate in the treatment group divided by the hazard rate in the control group averaged over the follow-up period. As with the risk ratio a HR of one means no treatment difference and a value less than or greater than one indicates that the treatment is better than or worse than the comparison group, respectively.

For JUPITER the HR for the venous thromboembolism was 0.57 with 95% CI 0.37 to 0.86. That is the hazard of having a venous thromboembolism in the rosuvastatin group is on average 43% lower than that in the placebo group at any point over the follow-up period. The 95% CI is entirely below one indicating a statistically significant (p<0.05) treatment benefit from being on rosuvastatin. The data are consistent with the true reduction in hazard being as great as 63% or as small as 14%. Since the upper bound of the 95% CI is considerably below one we could deduce that the p-value was considerably less than 0.05.

KAPLAN-MEIER PLOTS

The most commonly used method of displaying results for time-to-event outcomes is the Kaplan-Meier (KM) survival plot. Such plots require careful reading and interpretation [3].

KM plots display either cumulative survival (lines descending from 1 towards 0) or cumulative incidence (lines ascending from 0 towards 1) over the follow-up period. For cumulative survival the plots display the proportion (or percentage) of patients who have not yet experienced the event of interest. At randomisation all patients are free of the outcome and so the plot starts at 1 (if displaying as a proportion) or 100% (if displaying as a percentage). Over the follow-up period the number of patients who experience the event accumulates and therefore the proportion surviving descends towards 0. Jumps or steps in the plots occur when there is an event. For cumulative incidence (failure) the plots display the proportion or percentage of patients who have experienced the outcome of interest. At randomisation no patients have yet had the outcome and so the plot starts at 0. Events accumulate over time and the plot ascends towards 1 or 100%.
It is good practice to present the number of patients at risk (i.e., still being followed in the study) at various time-points in a table below the KM plot. Most studies will also report a summary measure of the length of follow-up time, e.g., the median follow-up time. Patients who are lost to follow-up or reach the end of the follow-up period without experiencing the event of interest are censored at the time they stop being followed. Such censorings do not lead to a step or jump in the survival estimate but do mean that these patients are no longer at risk of the event as far as the trial is concerned. Patients who experience one or more events stop contributing to the analysis at the time of their first event. Therefore, as we move from left to right across the KM curve fewer patients are at risk. This means that events occurring later in the follow-up (towards the right hand end of the survival curve) result in larger steps. For example, 1 event among 1,000 patients (0.1%) causes a smaller jump than 1 event among 100 patients (1%).
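The mechanics just described – steps at events, silent exits at censorings, a shrinking risk set – can be captured in a few lines of code. The following is a minimal illustrative Kaplan-Meier estimator (not production survival-analysis code), applied to five hypothetical patients.

```python
# A minimal Kaplan-Meier estimator: each event steps the survival estimate
# down by a factor (1 - 1/number at risk); censorings leave no step.
def kaplan_meier(times, events):
    """times: follow-up times; events: 1 = event, 0 = censored."""
    # Sort by time; at tied times process events before censorings.
    pairs = sorted(zip(times, events), key=lambda te: (te[0], -te[1]))
    at_risk, survival, curve = len(pairs), 1.0, [(0.0, 1.0)]
    for t, e in pairs:
        if e == 1:
            survival *= 1 - 1 / at_risk
            curve.append((t, survival))
        at_risk -= 1          # event or censoring: the patient leaves the risk set
    return curve

# Five hypothetical patients: events at years 1 and 4, censorings at 2, 3 and 5.
print(kaplan_meier([1, 2, 3, 4, 5], [1, 0, 0, 1, 0]))
# [(0.0, 1.0), (1, 0.8), (4, 0.4)] - the later event causes the bigger step
```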

When reading and interpreting a KM plot it is important to look carefully at the scale and range of the vertical axis and to check whether there is a break in the axis. Interrupting or cutting the y-axis can lead to a visual exaggeration of the difference between the two groups. In particular watch out for survival plots going down from 1 to 0 but which have a break in the vertical axis.

Figure 2A shows the KM cumulative incidence plot for JUPITER up to 4.5 years of follow-up. The number of patients at risk at baseline and at 6-month intervals thereafter is shown below the horizontal (“Years of Follow-up”) axis. 8,901 patients were randomised into each group and were therefore at risk at baseline. By two years of follow-up there are around 4,000 patients at risk in each group, less than 50% of the number randomised. We can infer from this that the median follow-up time is less than 2 years (in fact it is reported as 1.9 years in the paper). At four years of follow-up just 6% of those randomised are still at risk. So although the eye is naturally drawn to the right hand side of the graph where the difference between the two lines looks greatest we should be aware that as we move from left to right there will be increasing uncertainty about the estimates. Some KM plots display the uncertainty with the use of confidence intervals or standard error bars.

In Figure 2A we see a clear separation between the two lines, with a higher cumulative incidence in the placebo group, although this separation does not begin to take place until around 1 year of follow-up. However, we should note that the vertical (“Cumulative Incidence”) axis runs from 0 to a maximum of just 0.06 (i.e., 6%) which tends to visually exaggerate the difference between the two groups (this is dealt with nicely in Figure 2B). If this were presented as cumulative survival on a scale from 1 down to 0 the divergence would be much less obvious. This is not a criticism of plots such as this - we just need to be observant to get a correct interpretation.

Taking the scale into account we note that overall the outcome is rare (<1% at 2 years) and that the absolute difference in risk is quite small. Reading from the graph it appears that after 3 years of follow-up the cumulative incidence is around 0.007 (i.e., 0.7%) in the rosuvastatin group and around 0.012 (i.e., 1.2%) in the placebo group. This gives us an absolute difference of around 0.5% in the 3 year cumulative incidence of venous thromboembolism. Dividing 0.7% by 1.2% gives us an approximate estimate for the 3 year risk ratio of 0.58 – which closely matches the HR of 0.57.

Figure 2B shows the KM plot for the primary efficacy endpoint of the PLATO trial [4]. This figure displays the cumulative incidence (%) of the primary endpoint on two different vertical axis scales. For the main plot the full vertical axis scale is used, extending from 0% to 100%. The picture we get from this plot is that the event rate is quite low, reaching around just 10% at 12 months. There is a difference between the two groups but it appears to be quite small and it would be quite difficult to estimate the difference from this plot. The second plot, inset within the main plot area, displays the same data, but with the vertical axis extending from 0% to just 13%. The effect this has is to zoom in on the two lines enabling us to see the difference between the groups more clearly. With the aid of a ruler we can estimate that the KM cumulative incidence at 12 months is about 9.8% in the ticagrelor group compared to about 11.5% in the clopidogrel group.

The slight disadvantage of this plot is that it can take some time to work out that the plots are displaying the same data on different scales. Visual perception is also affected by the ‘squeezing’ of the horizontal axis on the inset plot.

The KM plots also give us an indication of how the risk or hazard of the event changes over time. In Figure 2A the slope (gradient) of each line is reasonably constant, indicating that the risk of the outcome did not change very much over time. In Figure 2B the slopes are much steeper in the first month of follow-up indicating an early high risk period. After the first month the lines become less steep indicating that if a patient survives the first month the risk of having the event is much reduced. For example, the risk of having an event by 1 month in the clopidogrel group is about 5.5%. This increases by a further 6% to around 11.5% over the next 11 months.

A SIMPLE STATISTICAL TEST

Before moving on to quantitative outcomes it is worthwhile describing a very simple statistical test [5] that can be used by the inquisitive reader for trials such as JUPITER or PLATO i.e., an RCT with two treatment groups, roughly equal randomisation and where the outcome of interest is a clinical event. We can calculate a simple statistic by dividing the difference in the number of events by the square root of the sum of the number of events. Under the null hypothesis this test statistic (z) approximately follows a standardised normal deviate (a normal distribution with mean zero and standard deviation one). Some critical values for z are shown in Table 3 below. The statistic can be derived using just a basic calculator or even just paper and a pencil.
So for JUPITER (see Table 2) the difference in the number of events is 60-34 = 26. The sum of the number of events is 60+34 = 94 and the square root of 94 is approximately 9.7. Dividing 26 by 9.7 we get a test statistic of 2.68. Looking at Table 3 our test statistic 2.68 falls between the values 2.58 and 2.81. The p-value therefore lies somewhere between 0.01 and 0.005. In fact the p-value reported from the Cox PH model was 0.007. This simple test works best when the event rate is relatively low, as indeed it was in JUPITER. For the secondary endpoint of unprovoked venous thromboembolism there were 19 and 31 events in the rosuvastatin and placebo groups respectively. Try out this simple test and see what you get! You can find the answer at the end of this chapter.
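For those who prefer a function to a calculator, the sketch below implements the simple test, converting z to a two-sided p-value with the normal distribution rather than with Table 3.

```python
# The simple test: z = (difference in events) / sqrt(total events),
# with a two-sided p-value from the standard normal distribution.
from math import erf, sqrt

def simple_test(events_a, events_b):
    z = abs(events_a - events_b) / sqrt(events_a + events_b)
    p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return z, p

z, p = simple_test(60, 34)          # JUPITER primary endpoint
print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.68, p = 0.007
# Exercise: try simple_test(19, 31) for unprovoked venous thromboembolism.
```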

FOCUS BOX 2 Time-to-event outcomes
  • Many trials study time to the first occurrence of an event of interest
  • In such trials, patients are not all followed for the same period of time
  • The most common model used for time-to-event data is the Cox proportional hazards model, with the hazard ratio as the measure of treatment effect
  • Kaplan-Meier plots display the cumulative proportion of patients surviving or failing over the follow-up period

Quantitative outcomes

Clinical trials sometimes study quantitative outcomes such as blood pressure or in-stent late loss. For such outcomes it is most usual to compare the mean levels in the treatment and control groups. P-values, treatment effect estimates and confidence intervals can be obtained using a two-sample t-test. The two-sample t-test assumes that the data come from a normal distribution and that the variances in the two groups are similar – though a variation of the two-sample t-test allows this second assumption to be relaxed. It is also assumed that the observations are independent of one another.

The key information required to carry out a two-sample t-test is the number of patients, the mean value and the standard deviation (SD) in each group. The mean value is a summary measure of the centre of a distribution and is calculated by summing all values and dividing by the number of observations. The SD is a summary measure of the variability of the sample – roughly speaking, the average deviation of individual values about the mean. For a given mean value a smaller SD indicates less variability. When data are normally distributed approximately 95% of values lie within plus/minus two standard deviations of the mean.

In the SPIRIT III Trial [6] 1,002 patients were randomised 2:1 to receive everolimus- or paclitaxel-eluting stents, respectively. The primary endpoint was angiographic in-segment late loss at eight months, measured in a pre-specified subset of patients; the results are shown in Table 4. The mean and SD of in-segment late loss was 0.14 (SD, 0.41) mm in the everolimus group compared to 0.28 (SD, 0.48) mm in the paclitaxel group. These measurements come from 301 and 134 patients in the everolimus and paclitaxel groups, respectively.

SIGNIFICANCE TEST

The null hypothesis for a two-sample t-test is that the means are the same in the two groups and therefore that the difference between the two means is zero. In SPIRIT III the p-value from the two-sample t-test is p=0.004 providing fairly strong evidence that mean in-segment late loss is not the same in the everolimus-eluting and paclitaxel-eluting stent groups. The p-value is derived from the t-statistic.

The t-statistic is calculated by dividing the difference in means by the standard error (SE) of the difference in means. The formula (assuming unequal variances) for the SE of the difference in means is given below.

SE = √(SD1²/N1 + SD2²/N2)

N1 and N2 are the number of subjects and SD1 and SD2 are the standard deviations in the two groups. A slightly different formula is used when assuming the variances are equal. We can see from this formula that keeping SD1 and SD2 constant the greater the number of patients in the study the smaller will be the SE. A smaller SE provides more power to detect a treatment difference.

All the information we need to calculate the t-statistic is provided in Table 4 . Putting these numbers into this formula we find that the SE = 0.048. The t-statistic for the test of no difference in means is therefore (0.28-0.14)/0.048 = 2.91. Under the null hypothesis this test statistic follows a t-distribution. As the sample size here is large we can use the standard normal deviate (z-distribution); referring back to Table 3 our value of 2.91 yields a p-value slightly less than 0.005.

TREATMENT EFFECT ESTIMATE AND CONFIDENCE INTERVAL

The treatment effect estimate from a two-sample t-test is simply the difference in the mean values. In SPIRIT III the difference in mean in-segment late loss is 0.28 – 0.14 = 0.14 mm. That is, the mean late loss in the everolimus group is 0.14 mm less than that in the paclitaxel group.

The 95% CI extends from 0.05 to 0.23 mm meaning that the data are consistent with the true difference being as small as 0.05 mm or as great as 0.23 mm. This confidence interval does not include 0, indicating that we have a statistically significant result at the 5% level.

The limits of the 95% CI are calculated as the mean difference ± 1.96 x SE. So for SPIRIT III the 95% CI is 0.14 ± 1.96 x 0.048 = 0.05 to 0.23. We can use the value 1.96 here as the sample size is reasonably large. For smaller samples we would have to refer to the t-distribution using the appropriate degrees of freedom. The degrees of freedom for a two-sample t-test is the total number of patients minus two.
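Everything in this section can be reproduced from the summary statistics in Table 4, as the sketch below shows. It uses the unequal-variance SE formula given earlier and the normal approximation for the p-value, so small rounding differences from the published figures are to be expected.

```python
# Reproducing the SPIRIT III comparison of means from Table 4 summary statistics.
from math import erf, sqrt

n1, mean1, sd1 = 301, 0.14, 0.41    # everolimus
n2, mean2, sd2 = 134, 0.28, 0.48    # paclitaxel

se = sqrt(sd1**2 / n1 + sd2**2 / n2)           # unequal-variance SE, ~0.048
diff = mean2 - mean1                           # 0.14 mm
t = diff / se                                  # ~2.9
p = 2 * (1 - 0.5 * (1 + erf(t / sqrt(2))))     # normal approximation, ~0.003-0.004
print(f"difference {diff:.2f} mm, SE {se:.3f}, t = {t:.2f}, p = {p:.3f}")
print(f"95% CI: {diff - 1.96 * se:.2f} to {diff + 1.96 * se:.2f} mm")   # 0.05 to 0.23
```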

The t-test is reasonably robust to departures from normality – by which we mean it copes well if the assumption of normality is slightly violated. However, if the sample distribution appears to be very skewed, then either the data should be transformed (e.g., using a natural logarithm transformation) or a non-parametric statistical method should be used.

One of the features of the normal (bell-shaped) distribution is its symmetry about the median value. The median is the middle of the distribution – with 50% of values above and below. A skewed distribution is not symmetric about the median; it has an excess of either large (positively skewed) or small (negatively skewed) values. Because of the symmetry of the normal distribution the mean and median values are the same. This is not so for skewed distributions; in a positively skewed distribution the mean is greater than the median i.e., the mean is not at the middle of the distribution. For this reason non-parametric tests are usually based on a comparison of medians and use rankings rather than actual values.

Where a trial has a quantitative outcome which has also been measured at baseline, multivariable linear regression or analysis of covariance should be used. This allows the baseline measurement to be included as a covariate in the model. This is statistically much more efficient and can lead to considerable gains in power and therefore precision.
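The sketch below illustrates the idea on simulated data (all variable names and numbers are invented for illustration): regressing the follow-up value on treatment and the baseline value shrinks the standard error of the treatment effect compared with ignoring baseline.

```python
# Schematic ANCOVA on simulated data: the follow-up value is regressed on
# treatment and the baseline value. All names and numbers are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2022)
n = 200
baseline = rng.normal(140, 15, n)                         # e.g., baseline SBP
treat = rng.integers(0, 2, n)                             # 1:1 randomisation
followup = 0.7 * baseline - 5 * treat + rng.normal(0, 10, n)

df = pd.DataFrame({"followup": followup, "treat": treat, "baseline": baseline})
adjusted = smf.ols("followup ~ treat + baseline", data=df).fit()
unadjusted = smf.ols("followup ~ treat", data=df).fit()
# Adjusting for the prognostic baseline value reduces the SE of the treatment effect.
print(adjusted.bse["treat"], unadjusted.bse["treat"])
```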

FOCUS BOX 3 Quantitative outcomes
  • Clinical trials analysing quantitative outcomes typically use the two-sample t-test to test for a difference in means
  • The key information required for the two-sample t-test is the mean value, standard deviation and number of observations in each group
  • Where baseline values of the outcome have been measured analysis of covariance or linear regression should be used
  • When the outcome is very skewed, non-parametric methods should be used

Non-inferiority trials

Non-inferiority (NI) trials are designed to demonstrate that a new treatment is “comparable” in efficacy to some active control - usually the established standard treatment for the condition being studied. Such trials have become increasingly common in recent years, in part due to the decreasing incremental benefits of new treatments and the necessity of having an active control when the use of a placebo would be ethically unacceptable. NI trials pose particular challenges in terms of design, conduct, analysis and interpretation [7, 8, 9].

When setting out to demonstrate the non-inferiority of a new treatment compared to an existing treatment in terms of efficacy, it is usually assumed that the new treatment carries some other advantage, such as ease of administration, e.g., once-weekly versus daily injections, or reduced cost, or fewer or less serious harmful side effects.

As it is not possible to prove the exact equivalence of two treatments, the goal of NI trials is to demonstrate that the new treatment is no worse than the existing treatment by a pre-specified difference. This pre-specified difference is called the margin of non-inferiority and is usually denoted as delta (Δ). The focus in NI trials tends to be on confidence intervals rather than hypothesis testing. Usually a one-sided 95% CI is calculated. If the upper limit of that interval is less than Δ i.e., the interval does not include and is entirely below Δ, then non-inferiority has been demonstrated. This approach ignores the lower limit of the CI – in a NI trial we are not interested in how much better the new treatment might be (though having demonstrated non-inferiority some trials may then test for superiority).

Figure 3 shows five (of many possible) scenarios for results from a non-inferiority trial. In scenarios A, B and C the CI does not include Δ, the margin of non-inferiority, and therefore non-inferiority has been established. In scenarios D and E the CI includes Δ and therefore non-inferiority has not been established. Note however, that failing to demonstrate non-inferiority does not mean that the treatment has been shown to be inferior. For scenario A the new treatment is superior to the active control as the CI does not include the treatment difference 0 and in scenario D the control is superior for the same reason. In scenarios B, C and E standard superiority tests would conclude that there was no evidence of a difference between the two groups i.e., the 95% CIs cross the null-value of zero. However, not having enough evidence to demonstrate a difference is not the same as demonstrating non-inferiority.

The choice of the margin of non-inferiority is one of the key issues in designing NI trials, and will usually involve a combination of statistical analysis of previous trials and clinical judgement [10, 11]. It is crucial that this margin is determined carefully to ensure that the new treatment will be superior to placebo and therefore a conservative margin is usually appropriate. For example, in a NI trial where the standard treatment has an absolute risk 5% lower than placebo, the margin of non-inferiority should be considerably smaller than 5%. This means that sample size requirements for NI trials will generally be larger than for standard superiority trials.

The choice as to which treatment to use as the active comparator is also a very important consideration. This should be the best available established treatment for the condition being studied. Use of lesser or inferior treatments as controls for NI trials could, over time, result in treatments that are no better than placebo – a phenomenon referred to as biocreep. NI trials should also closely follow the design of the earlier trials which demonstrated the superiority of the active control [7]. Special rigour is required in the conduct of these trials as issues such as non-adherence, treatment cross-over, misclassification of outcome and dropouts can all conspire to make the two groups more similar, making it easier to demonstrate non-inferiority [9].

The Resolute All Comers trial [12] was a non-inferiority trial comparing zotarolimus-eluting with everolimus-eluting stents in 2,292 patients with chronic, stable coronary artery disease or acute coronary syndromes. The goal was to demonstrate that zotarolimus-eluting stents were not inferior to everolimus-eluting stents. The primary outcome was target-lesion failure at 12 months. The trial used a pre-specified non-inferiority margin of 3.5%. This meant that the upper limit of the one-sided 95% confidence interval for the risk difference had to be less than 3.5%. Table 5 shows the results for the primary clinical outcome at 12 months.

The observed 12 month risk was 8.2% in the zotarolimus group and 8.3% in the everolimus group – compared to the anticipated rate of 8% used for the sample size calculations. The absolute risk difference (8.2%-8.3%) was -0.1%, very slightly in favour of the zotarolimus group. A favourable point estimate alone, however, is not enough to demonstrate non-inferiority; we must look at the confidence interval. The upper limit of the one-sided 95% CI was +1.8%. At very worst, then, the absolute difference between the new and existing treatments is estimated to be just 1.8%. This is considerably less than the pre-specified non-inferiority margin of 3.5% and therefore non-inferiority can be claimed. The p-value for non-inferiority was pni<0.001. The null hypothesis for this test is that zotarolimus is inferior to everolimus by at least the margin of 3.5%.
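The sketch below reconstructs this non-inferiority check on the risk difference scale. The group sizes are not given in the text, so roughly equal groups of about 1,146 patients are assumed (2,292 randomised 1:1); the trial's own analysis may have differed in detail.

```python
# Reconstructing the Resolute All Comers non-inferiority check.
# Group sizes (~1,146 each) are assumed from 1:1 randomisation of 2,292 patients.
from math import erf, sqrt

p_z, n_z = 0.082, 1146   # zotarolimus: observed 12 month risk
p_e, n_e = 0.083, 1146   # everolimus
delta = 0.035            # pre-specified non-inferiority margin

diff = p_z - p_e
se = sqrt(p_z * (1 - p_z) / n_z + p_e * (1 - p_e) / n_e)
upper = diff + 1.645 * se                  # upper limit of one-sided 95% CI
z_ni = (delta - diff) / se                 # test of H0: true difference >= delta
p_ni = 1 - 0.5 * (1 + erf(z_ni / sqrt(2)))
print(f"risk difference {100*diff:.1f}%, upper limit {100*upper:.1f}%")  # -0.1%, +1.8%
print(f"p for non-inferiority = {p_ni:.4f}")                             # <0.001
```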

Note that the p-value and 95% CI shown in Table 5 (taken from the main results table) do not relate to the test of non-inferiority, but are the 95% CI and p-value for a standard superiority test. The p-value of p=0.94, which in a superiority test would not provide evidence of any difference between the treatments, cannot be used to claim non-inferiority.

FOCUS BOX 4 Non-inferiority trials
  • The goal of NI trials is to demonstrate that a new treatment is no worse than the existing treatment by a pre-specified difference
  • The focus in NI trials tends to be on confidence intervals rather than hypothesis testing
  • The choice of the margin of non-inferiority should involve a combination of statistical reasoning and clinical judgement
  • The control group should be the best available established treatment for the condition being studied

Sample size and power

Sample size and power are issues related more to the design than to the analysis of clinical trials and so are dealt with only briefly here.

Calculating the sample size is one of the key responsibilities of a trial statistician. By sample size we mean the number of patients that a trial needs to randomise in order to demonstrate superiority or non-inferiority. How the sample size was determined should be reported in the methods section of a medical research paper [13]. In practice the precise details will often be reported in a study methods or trial design manuscript.

The power of the study is the probability of obtaining a significant treatment effect if the null hypothesis is false, i.e., if there is a real treatment effect. A treatment difference may truly exist, but if the sample size is too small, or there are too few events, the trial may fail to detect that difference. All else being equal, the larger the study the greater the power. Power is often expressed as a percentage; a power of 90% means that the probability of getting a significant result is 0.9 (given the treatment really works). The probability of not rejecting the null hypothesis when it is in truth false is called the Type-2 error rate. This is simply 100% minus the power. For example, when the power is 90% the Type-2 error rate is 10%.

Larger studies not only provide greater power but also provide more reliable (i.e., closer to the truth) and more precise (i.e., narrower confidence intervals) estimates of the treatment effect. As we saw above, for a given treatment effect a narrower confidence interval equates to a smaller p-value.

In arriving at a sample size the statistician will have had to estimate a number of different unknown factors e.g., the risk or event rate in the control group, the treatment effect and, for quantitative outcomes, a measure of variability. These estimates may be based on data from other trials or earlier phases of the drug or product development, but inevitably involve some (educated) guesswork. Issues such as loss to follow-up and non-adherence also need to be considered and factored into the sample size calculation.

Figure 4 shows the sample size requirements for a superiority trial with a binary outcome, with power set at 90% and a Type-1 error rate of 0.05. The horizontal axis shows various levels of relative risk reduction, the vertical axis the total number of patients required. There are three separate lines for different levels of risk in the control group.

From the figure we can see that holding everything else constant, a smaller treatment effect will require a larger sample size. Similarly, holding all else equal, the lower the risk in the control group the greater the number of patients required.

WORKED EXAMPLE: SAMPLE SIZE CALCULATION FOR A BINARY OUTCOME

To calculate the sample size required to demonstrate a difference in two proportions we need to estimate (i) the proportion of patients with an event in the control group (Pc) and (ii) the anticipated treatment effect i.e., the risk ratio (RR). Multiplying Pc by RR gives us the expected proportion in the treatment group (Pt). We also need to decide on the Type-1 and Type-2 error rate – usually called α and β. The Type-1 error rate is typically set at 5%. A Type-2 error rate of 10% equates to a power of 90%. The number of patients required in each group Ng is then calculated as follows:

Ng = f(α,β) x [Pc(1 - Pc) + Pt(1 - Pt)] ÷ (Pc - Pt)²

Fixing α=0.05, the function f(α,β) is equal to 10.51 for β=0.1 (90% power) and 7.85 for β=0.2 (80% power).

Suppose we anticipate that the risk in the control group Pc will be 0.10 and that the treatment effect will be a RR of 0.7, i.e., a 30% relative risk reduction. Multiplying 0.10 by 0.7 gives an expected risk in the treatment group of 0.07. Going with α=0.05 and β=0.1 we get the following:

Ng = 10.51 x [0.10 x 0.90 + 0.07 x 0.93] ÷ (0.10 - 0.07)² = 10.51 x 0.1551 ÷ 0.0009 = 1,811.2

Finding 0.2 of a patient is awkward so we round up 1,811.2 to 1,812. Remember that Ng is the number required in each group so we need to double this to get the total number of patients required, i.e., 2 x 1,812 = 3,624.

Most studies experience some loss to follow-up. If we randomise exactly 3,624 patients and 5% drop out, the power of the trial will fall below 90%. To allow for a 5% loss to follow-up we divide the number needed by 0.95. So for our study the number required to be randomised would be 3,624 ÷ 0.95 = 3,815. As this is an odd number we could round up to 3,816 to allow an equal number of patients in each group.
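The whole worked example can be wrapped in a short function, shown below, using the rounded value f(α,β) = 10.51 from above.

```python
# The worked example as a function, using the rounded f(alpha, beta) = 10.51.
from math import ceil

def n_per_group(p_control, risk_ratio, f=10.51):
    """f = 10.51 gives 90% power, f = 7.85 gives 80% power (two-sided alpha 0.05)."""
    p_treat = p_control * risk_ratio
    return (f * (p_control * (1 - p_control) + p_treat * (1 - p_treat))
            / (p_control - p_treat) ** 2)

n_g = n_per_group(0.10, 0.7)       # 1,811.2 -> round up to 1,812 per group
total = 2 * ceil(n_g)              # 3,624 patients in total
randomised = ceil(total / 0.95)    # 3,815 -> 3,816 for equal group sizes
print(f"{n_g:.1f} per group; {total} in total; {randomised} allowing 5% dropout")
```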

Once a trial has been conducted post hoc power calculations should not be carried out. The key information is found in the confidence interval – wide CIs indicate a lack of power. However it can be interesting to consider how educated the “guesswork” was – how well did the investigators do in estimating the risk in the control group, treatment benefit etc.?

The sample size calculations for SPIRIT IV for superiority testing of the primary endpoint of ischaemia-driven target lesion failure at one year assumed a rate of 8.2% in the paclitaxel-eluting stent (PES) control group. The anticipated treatment effect was a 35% relative risk reduction i.e., a risk ratio of 0.65. From this the expected one year rate in the everolimus-eluting stent (EES) group was calculated as 8.2% x 0.65 = 5.3%. Using these two percentages it was calculated that 3,690 patients randomised 2:1 (EES:PES) would give 90% power with a 2-sided alpha (Type-1 error rate) of 0.05. Remember from above that a Type-1 error is the probability of rejecting the null hypothesis given it is true. The sample size calculations also allowed for a 5% loss of patients to follow-up. Note that as this study used 2:1 randomisation the simple sample size formula above is not appropriate.

So what actually happened in SPIRIT IV? 3,687 patients were randomised, just three fewer than proposed. 3,611 patients completed the one-year follow-up period – an actual loss to follow-up of 2.1% compared to the 5% assumed for the sample size calculation. The observed rate was 6.8% compared to the 8.2% assumed in the PES group, and 4.2% observed versus 5.3% assumed in the EES group; so slightly lower than expected in both groups. The observed risk ratio was 4.2/6.8 = 0.62, which is very close to the anticipated treatment effect of 0.65. So all things considered the sample size calculations seem to have been carried out rather well. That the trial was adequately powered can be seen in the relatively narrow 95% confidence interval of 0.46 to 0.82 for the risk ratio.

FOCUS BOX 5 Sample size and power
  • Determining the number of patients required to demonstrate a treatment difference is called the sample size calculation
  • The power of a study is the probability of rejecting the null hypothesis if the treatment really works
  • A Type-2 error is the probability of not detecting a real treatment difference
  • All else being equal, larger trials provide more power and precision
  • A smaller treatment effect will require a larger sample size, all else being equal

Adjusting for baseline covariates

When should randomised controlled trials adjust for baseline covariates, how should the covariates be selected, what statistical methods should be used and which result should be emphasised? These are just some of the questions that surround this complex subject. We will touch briefly on some of the key issues here.

By baseline covariates we mean either quantitative (e.g., age, SBP, BMI) or categorical (e.g., sex, previous myocardial infarction, diabetes) variables collected before randomisation and which affect the primary outcome. They are also known as baseline prognostic variables. The emphasis on before randomisation is important – trials should not adjust for variables that are measured post randomisation, and which may therefore be affected by treatment. Nor is there anything to be gained (and perhaps something to be lost) in adjusting for variables which are unrelated to the outcome of interest.
When carrying out baseline covariate adjustment the interest is not in estimating the effect of those covariates but in improving the estimation of the treatment effect. The improvement can be both in terms of obtaining unbiased and more precise estimates – though this is not guaranteed for any given trial.

The goal of randomisation in a clinical trial is to produce two (or more) comparable groups in order to enable an unbiased estimation of the true treatment effect. In theory, as each patient has an equal chance of being in either group, the two groups should be balanced in terms of all baseline covariates. However, in practice imbalances or baseline differences between the groups do occur, which can lead to biased estimates of the treatment effect. This is of particular concern when the imbalance favours the treatment. Adjustment for such imbalances may be desirable in order to obtain a reliable unbiased estimate of the treatment effect.

We should say at this point that the practice of using significance tests to detect baseline differences and presenting p-values in baseline tables in the context of randomised controlled trials is inappropriate [14, 15]. Providing that the randomisation process has been carried out appropriately any imbalance will simply be due to random chance – the null hypothesis is true – and therefore testing whether there is evidence against the null hypothesis makes no sense. Furthermore, focussing on baseline differences in terms of p-values is unhelpful in that even small, statistically non-significant, imbalances in covariates which are strongly related to the outcome of interest may influence the result, whereas larger, statistically significant imbalances in variables that are unrelated to the outcome, may not matter at all [15, 16]. Baseline tables showing the distribution of prognostic variables in the two groups are important, but any imbalances should be discussed from a clinical perspective.

As well as obtaining unbiased estimates of the treatment effect, a further motivation for adjusting for baseline covariates is to increase the precision of the estimated treatment effect i.e., narrower confidence intervals and smaller p-values. Gains in precision mean increased statistical power for a given treatment effect and can therefore potentially lead to reduced sample size requirements [17, 18, 19]. However, the gain in precision or power will only be found when the covariates being adjusted for are related to the outcome of interest.

The effect of adjusting for baseline covariates also varies with the statistical method being used. When the statistical method being used is non-linear, e.g., logistic regression or Cox proportional hazards model, the unadjusted (sometimes called crude or unconditional) and adjusted (sometimes called conditional or stratified) estimates of treatment effect have different interpretations [20]. Even if there is no imbalance, adjusted estimates from such models will tend to be further from the null value than unadjusted estimates. This is not the case when a linear method, e.g., multivariable linear regression, is being used to adjust for baseline covariates.

Since most trials collect a large array of baseline information, deciding which baseline variables to adjust for is not straightforward. Allowing the choice to be driven by statistical significance after the blind has been broken is clearly wrong, and would result in biased treatment estimates. Details of the statistical methods to be used for analysis of the primary endpoint should be clearly pre-specified in the statistical analysis plan (SAP) or study protocol – this should include details of any planned covariate adjustment [21]. A decision on which covariates to include therefore needs to be made at the design stage and should be based on current knowledge of prognostic variables – perhaps gained from early phases of the treatment development. Where a trial has used stratified randomisation to ensure balance in prognostic factors then the stratifying factors should usually be included as covariates in the model. Including many covariates in a model may lead to unstable treatment estimates. The principle should be to keep things as simple as possible - a few judiciously chosen prognostic variables will generally serve best.

When both adjusted and unadjusted results are reported the emphasis should clearly be placed on whichever analysis was declared as the principal analysis in the SAP or study protocol. It is therefore essential that the SAP clearly and unambiguously spells out the plan for covariate adjustment. It is rare for adjusted and unadjusted results to lead to different conclusions. Where they do, it is likely that the treatment effect is of borderline significance (p-value near 0.05) and then it is important to bear in mind comments made earlier in this chapter about emphasising such an arbitrary cut-point.

Many trials simply carry out and report unadjusted analyses. None of the trials referred to so far in this chapter (JUPITER, SPIRIT III, SPIRIT IV or PLATO) report making any baseline covariate adjustments. The EMPHASIS-Heart Failure study [22] investigated the effects of eplerenone in patients with chronic heart failure and mild symptoms. 2,737 patients were randomised to eplerenone or placebo and followed up for a median of 21 months. The primary endpoint was death from cardiovascular causes or hospitalisation for heart failure. The study protocol and statistical analysis plan pre-specified the use of Cox proportional hazards models adjusting for 13 baseline covariates: age, eGFR, ejection fraction, BMI, haemoglobin value, heart rate, SBP, diabetes, history of hypertension, previous myocardial infarction, atrial fibrillation and left bundle-branch block or QRS duration greater than 130 msec. Both adjusted and unadjusted hazard ratios were presented in the main results table with the focus throughout the paper being on the adjusted results. Results for the primary endpoint and its two component events are shown in Table 6.

What is immediately obvious is that the adjusted and unadjusted hazard ratios are in close agreement – which is what we would expect given that this is an RCT. For the primary endpoint the adjusted HR and its 95% CI is 0.63 (0.54-0.74) compared to 0.66 (0.56-0.78) when unadjusted. We can also see that for the primary endpoint and both its components the adjusted hazard ratios are further from 1 than the unadjusted, and there has been a small gain in precision.

FOCUS BOX 6 Baseline covariate adjustment
  • Baseline covariates are variables collected before randomisation and which are associated with the primary outcome
  • Adjustment for prognostic baseline variables can yield more precise estimates of the treatment effect and greater statistical power
  • The choice of which covariates to adjust for should be clearly pre-specified in the SAP and should not be based on statistical tests for baseline imbalance

Secondary endpoints and subgroups – The problem of multiple significance testing

Most clinical trials report results not only for a primary endpoint (or sometimes co-primary endpoints) but also for a series of secondary or tertiary outcomes. For example, in addition to their primary endpoints, SPIRIT IV pre-specified two major secondary endpoints and three additional secondary outcomes whilst PLATO pre-specified one principal secondary efficacy endpoint and six additional secondary endpoints.

In addition, trials often investigate and report treatment effects across a series of subgroups of patients. Subgroup analyses investigate whether the overall treatment effect is consistent across groups of patients defined by baseline characteristics, e.g., does the treatment work equally well in male and female patients, or in patients with and without diabetes? JUPITER and SPIRIT IV both present results for the primary endpoint across 12 subgroups of patients, whilst PLATO investigated 33 subgroups, 25 of which were pre-specified and 8 post hoc.

Furthermore, trials may also report results for the primary endpoint at different time-points in the follow-up period e.g., at 30 days and 12 months.

Consequently the results section of a trial publication can be filled with p-values. The statistical issue this raises is the problem of multiple significance testing, often referred to as the problem of multiplicity. The probability of getting a false positive result (Type-1 error), i.e., attaining a statistically significant difference by chance alone, increases with the number of tests carried out. The more tests that are carried out, the greater the likelihood of a spurious positive result. For example, if we carry out significance tests for five independent outcomes when the null hypothesis is in fact true for all of them, the probability of getting a statistically significant result for at least one of the five tests is 1 − 0.95^5 ≈ 0.23, almost 1 in 4. Multiple testing of secondary endpoints and subgroup analyses can therefore lead to misleading conclusions and overstated claims. Such analyses require careful consideration in reporting and interpretation [13, 15, 16, 23, 24, 25, 26].
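
The arithmetic behind this is simple enough to check in a couple of lines (a minimal sketch; the tests are assumed independent, which is optimistic for a real trial):

```python
# P(at least one false positive among k independent tests at the 5% level)
for k in (1, 2, 5, 10, 20):
    print(k, round(1 - 0.95**k, 3))
# k=5 gives 0.226 -- "almost 1 in 4", as noted above
```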

How are we to interpret the situation where there is no evidence of benefit in terms of the key primary endpoint, but there is a positive result for one of the secondary outcomes? Or what if there is no evidence of benefit for the primary endpoint except in a pre-specified subgroup of patients? Or what if there is a positive result for the primary endpoint, but a subgroup analysis suggests there is no treatment benefit for men or patients with diabetes? How do we distinguish between real and spurious results? Unfortunately there is no simple answer.

A number of different approaches are used to deal with multiple testing, including pre-specifying the primary endpoint and a small number of key secondary endpoints, or using a statistical adjustment such as the Bonferroni correction. Whilst post hoc analyses are of particular concern, the fact that a secondary endpoint or subgroup analysis has been pre-specified in a statistical analysis plan does not provide absolution from the problem of multiple testing: one could simply pre-specify many endpoints and subgroups, and the risk of a false positive result would still be inflated. The Bonferroni method adjusts (inflates) p-values in order to maintain the overall Type-1 error rate at no more than 5% (or some similar level), as sketched below. However, when outcomes are correlated, as is to be expected in a clinical trial, the method is conservative: it inflates the Type-2 error rate, i.e., the probability of missing a real treatment difference. It also treats all outcomes equally, whereas investigators are generally most interested in the primary outcome and a few key secondary outcomes.
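
A minimal sketch of the Bonferroni correction (the p-values below are hypothetical):

```python
def bonferroni(p_values, alpha=0.05):
    """Multiply each p-value by the number of tests, capping at 1."""
    k = len(p_values)
    return [(p, min(1.0, p * k), p * k <= alpha) for p in p_values]

for p, p_adj, significant in bonferroni([0.001, 0.02, 0.04, 0.30, 0.75]):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={significant}")
```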

INTERACTION TESTS FOR SUBGROUPS

For subgroup analyses one should not simply carry out separate significance tests in each subgroup. It is important not only to present estimates and confidence intervals for the treatment effect in each group, but also to carry out an interaction test [15, 23, 24]. An interaction test is a statistical test of homogeneity of treatment effect across the subgroups. The null hypothesis is that the treatment effect is the same (homogeneous) in each group; the alternative is that it differs (heterogeneous). As with other significance tests, a small p-value provides evidence against the null hypothesis. However, as trials are usually powered to detect a treatment effect across all patients, the interaction test, which compares treatment effects between subgroups, will usually lack power.
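
In practice the interaction test is usually obtained by adding a treatment-by-subgroup product term to the regression model. The sketch below does this with logistic regression on simulated data (all names and numbers are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 4000
treat = rng.integers(0, 2, n)
diabetes = rng.integers(0, 2, n)
# simulate a treatment that works in non-diabetics but not in diabetics
lin = -2.0 + np.log(0.5) * treat * (1 - diabetes)
y = (rng.random(n) < 1 / (1 + np.exp(-lin))).astype(int)
df = pd.DataFrame({"y": y, "treat": treat, "diabetes": diabetes})

# "treat * diabetes" expands to treat + diabetes + treat:diabetes
fit = smf.logit("y ~ treat * diabetes", data=df).fit(disp=0)
print("interaction p-value:", fit.pvalues["treat:diabetes"])
```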

A common method of graphically presenting results of subgroup analyses is by use of a forest plot. Figure 5 shows this for a subset of the subgroup analyses performed in SPIRIT IV. The overall estimate of relative risk and 95% CI is shown, along with the estimates and 95% CIs in each of various subgroups. It is not always possible to deduce whether the effect is different in the subgroups by looking at the estimates and CIs and so it is good practice to present the interaction p-values (as described above).
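
A forest plot of this kind is straightforward to draw; the sketch below uses matplotlib, with the two diabetes subgroup estimates taken from the text and the remaining values invented purely for illustration:

```python
import matplotlib.pyplot as plt

labels = ["Overall", "No diabetes", "Diabetes", "Male", "Female"]
rr = [0.62, 0.47, 0.94, 0.60, 0.66]   # point estimates (relative risks)
lo = [0.46, 0.32, 0.59, 0.42, 0.40]   # lower 95% CI limits
hi = [0.82, 0.68, 1.49, 0.86, 1.09]   # upper 95% CI limits

y = range(len(labels))[::-1]          # plot top to bottom
plt.errorbar(rr, y,
             xerr=[[r - l for r, l in zip(rr, lo)],
                   [h - r for h, r in zip(hi, rr)]],
             fmt="s", color="black", capsize=3)
plt.axvline(1.0, linestyle="--", color="grey")  # null value: RR = 1
plt.yticks(y, labels)
plt.xscale("log")                     # ratio measures are best shown on a log scale
plt.xlabel("Relative risk (95% CI)")
plt.tight_layout()
plt.show()
```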

In weighing up the credibility of a statistically significant subgroup effect, it is also important to consider evidence from other similar trials (has such an effect been noted elsewhere?) and the biological plausibility of the interaction. More weight can be given to a finding where the investigators not only pre-specified the subgroup analysis but also anticipated the direction of the effect. The number of subgroups investigated should also be considered – it is therefore good practice for investigators to report the number of subgroup analyses carried out (not just reported) and to specify whether they were pre-specified or post hoc.

SPIRIT IV carried out interaction tests for 12 subgroups, and in one there was a statistically significant result. The RR (95% CI) for the primary endpoint was 0.47 (0.32-0.68) among patients without diabetes and 0.94 (0.59-1.49) among patients with diabetes. Note that it is not sufficient to look at the 95% CIs in each group separately and conclude that there is a treatment benefit in one group but not the other; the formal comparison is the interaction test. Here the interaction p-value was 0.02, i.e., evidence against the null hypothesis that the treatment effect is the same in the two groups. The investigators concluded that there was not a significant treatment benefit among patients with diabetes, suggesting that the mechanism by which the treatment works might differ in patients with insulin resistance. However, given that this was one of 12 subgroup analyses, and that the interaction p-value (p=0.02) does not provide overwhelming evidence, this finding could well be due to chance.

In PLATO, 33 subgroup analyses were carried out, 25 of which were pre-specified. The investigators reported that for just three of these subgroups there appeared to be significant heterogeneity for the primary endpoint: the benefit of ticagrelor over clopidogrel was attenuated in patients weighing less than the median weight for their sex, in patients not taking lipid-lowering drugs at baseline, and in patients enrolled in North America (interaction p-values 0.04, 0.04 and 0.045, respectively). The investigators divided the 43 participating countries into four regions: Asia/Australia, Central/South America, Europe/Middle East/Africa and North America. In North American patients the HR (95% CI) was 1.25 (0.93-1.67), a non-significant increase in hazard in the ticagrelor group. The investigators reported that no apparent explanation had been found, though subsequent analyses have suggested that a difference in aspirin prescribing in the United States may be a factor [27]. However, given that 33 subgroup analyses were carried out, 8 (24%) of which were exploratory, i.e., not pre-specified, and that the interaction tests were of borderline significance (all p≥0.04), these findings could well be statistical flukes.

FOCUS BOX 7 Secondary endpoints and subgroup analyses
  • The probability of finding a statistically significant result by chance increases with the number of significance tests carried out
  • Most emphasis should be placed on the results of the primary endpoint
  • Results from secondary or tertiary endpoints and subgroup analyses should be interpreted with caution
  • Subgroup analyses should be pre-specified, and results of an interaction test should always be presented – though bear in mind that such tests lack power
  • Significant subgroup interactions should be considered in the light of external evidence, biological plausibility and number of subgroups investigated

Composite endpoints

Many cardiovascular trials define their primary endpoint as a composite of two or more related events, e.g., myocardial infarction, stroke or cardiac death. The individual elements are generally referred to as the component events and together they make up the composite endpoint (CEP). The main motivation for using CEPs is to increase the event rate, thereby reducing the sample size required for a given power, as illustrated below. CEPs can also give a fuller estimate of the net benefit of a treatment and may reduce the problem of multiple testing. However, the analysis and interpretation of CEPs is not straightforward [28, 29, 30, 31].
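
To see why a higher event rate helps, the sketch below uses Schoenfeld's well-known approximation for the number of events needed to detect a given hazard ratio, together with a deliberately crude conversion from events to patients (assumes scipy is available; the event rates are hypothetical):

```python
import math
from scipy.stats import norm

def events_required(hr, alpha=0.05, power=0.9):
    """Schoenfeld approximation, 1:1 randomisation, two-sided alpha."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 4 * z**2 / math.log(hr)**2

e = events_required(0.75)
print("events needed:", round(e))      # about 508
# crude patients estimate: events / (annual event rate x 3-year follow-up)
for annual_rate in (0.03, 0.06, 0.09):
    print(annual_rate, "->", round(e / (annual_rate * 3)), "patients")
```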

Conventional methods of analysis of CEPs typically treat each component as being of equal importance and analyse the time to the first event. A patient's follow-up for the endpoint ends at their first event, and any subsequent events are ignored. This means that more frequent, and perhaps less clinically important, non-fatal events occurring early in the follow-up period can dominate the primary endpoint. It also follows that more clinically important events, e.g., cardiac death, are completely ignored in the primary endpoint if they are preceded by some other non-fatal event. For example, in EMPHASIS-HF the primary CEP was defined as cardiovascular death or hospitalisation for worsening heart failure. During the follow-up period 332 patients died from cardiovascular causes, yet only 189 (57%) of these deaths counted in the CEP analysis; the remaining 143 were preceded by a non-fatal hospitalisation for heart failure.
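
The sketch below shows, on a toy dataset, how a conventional time-to-first-event composite is constructed, and why a death preceded by a hospitalisation never reaches the analysis (column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "t_cv_death": [np.nan, 400.0, np.nan, 250.0],   # day of CV death, if any
    "t_hf_hosp":  [120.0,  90.0,  np.nan, np.nan],  # day of first HF hospitalisation
    "t_censor":   [730.0, 730.0,  365.0,  730.0],   # end of follow-up
})
first = df[["t_cv_death", "t_hf_hosp"]].min(axis=1)  # earliest component event
df["time"] = first.fillna(df["t_censor"])            # censoring time if no event
df["event"] = first.notna().astype(int)
print(df)
# The second patient dies at day 400, but the hospitalisation at day 90 is
# their first event: the death never contributes to the composite analysis.
```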

Care is needed, therefore, when interpreting the results of a trial with a CEP. A statistically significant positive result for a CEP that includes death does not necessarily mean that the treatment saves lives. It is also important to see the trial's results broken down into the component events. Is the positive overall result being driven by a frequent component of lesser clinical importance? And is the overall treatment effect for the composite consistent across the components?

The primary endpoint in SPIRIT IV, target-lesion failure, was defined as the composite of cardiac death, target-vessel myocardial infarction, or ischaemia-driven target-lesion revascularisation. The one-year results are shown in Table 7.

Note that the number of patients with an event in each component sums to more than the number of patients with the composite endpoint. For example, in the everolimus group 10 patients experienced cardiac death, 44 a target-vessel MI and 61 a target-lesion revascularisation. These three components sum to 115, whereas the number of patients with the primary CEP is 101. This means that some patients experienced more than one of the components, e.g., a target-lesion revascularisation followed by a cardiac death. However, from the results provided we cannot deduce how many events from each component are counted in the CEP; it is possible, for example, that none of the cardiac deaths are included because each was preceded by a non-fatal event.

In Table 7 we see that the RR for the CEP was 0.62 with a p-value of 0.001. Does this mean that everolimus-eluting stents save lives compared to paclitaxel-eluting stents? Looking at the component results, the first thing to note is that there were considerably more non-fatal (195) than fatal (15) events, so the non-fatal components are likely to dominate the CEP result. The risk of cardiac death was very similar (0.4%) in both groups, with a RR of 0.99 and a p-value of 1.00 to two decimal places: there is no evidence of a benefit in terms of cardiac death. There is, however, evidence of a treatment benefit for the two non-fatal components: RR=0.62, p=0.04 for target-vessel myocardial infarction and RR=0.55, p=0.001 for target-lesion revascularisation. The statistically significant result for the primary CEP is thus driven by the clinically less serious non-fatal components.

In PLATO, patients admitted to hospital with an acute coronary syndrome were randomised to ticagrelor or clopidogrel. The primary endpoint was a composite of death from vascular causes, myocardial infarction or stroke. The results for the primary composite endpoint and each component are shown in Table 8.

We can again deduce that some patients experienced more than one of the components, since the numbers for each component add up to considerably more than the number with the primary CEP (e.g., for ticagrelor 353 + 504 + 125 = 982).

The HR for the primary CEP is 0.84, 95% CI 0.77-0.92, p<0.001, giving strong evidence of a benefit of ticagrelor over clopidogrel. Is this benefit seen consistently across the components? The results broken down by component suggest that ticagrelor is superior with regard to death from vascular causes (HR 0.79, 95% CI 0.69 to 0.91, p=0.001) and myocardial infarction (HR 0.84, 95% CI 0.75 to 0.95, p=0.005). However, for stroke there is no evidence of any benefit, the HR actually being (non-significantly) greater than one. As stroke was the least common endpoint, the positive results for death from vascular causes and myocardial infarction outweighed the slightly negative result for stroke and produced an overall significant result for the CEP.

An alternative approach to the analysis of CEPs is the win ratio [32]. This method, introduced in 2012, takes into account the varying severity and relative clinical priorities of the components of the CEP, such that the most serious events get top priority. Unlike conventional methods, the win ratio allows different types of component events, e.g., time-to-event, quantitative, binary, categorical or recurrent events, to be combined in a single CEP. For example, it is possible to combine time to death with number of hospitalisations and change in blood pressure. Whereas time-to-event methods consider only the first event in a CEP, the win ratio considers all events, meaning that it typically has greater power for a given sample size, or requires fewer patients for a given power.

Motivated by the Finkelstein-Schoenfeld (FS) test [33], the win ratio provides an estimate and confidence interval for the treatment effect. In brief, the approach compares outcomes for every possible treatment-control pair of patients, i.e., every patient from the treatment group is compared with every patient from the control group. The outcomes are evaluated in descending order of priority over the shared follow-up time of each pair of patients to determine whether one patient can be declared the “winner”. If there is no winner, the pair are said to be “tied”.

To illustrate the concept, Figure 6 shows hypothetical outcomes for 5 patient pairs from a trial with a CEP of death and number of cardiovascular hospitalisations. Considering pair 1 and starting with the most severe endpoint, death, we see that patient B died first, and therefore A is declared the winner; we do not need to evaluate the number of hospitalisations. Considering pair 2 and starting with death, we see that although patient B died, the time of death was beyond the shared follow-up time for this pair. The pair therefore remain “tied” on death and we move to the next component, number of hospitalisations. Within the shared follow-up, patient B experienced 2 hospitalisations compared to 1 for patient A, and therefore A is declared the winner. Considering pair 3, neither patient dies and so we move to hospitalisations. Although patient A experiences 2 hospitalisations, the second is outside the shared follow-up. Within the shared follow-up both experience 1 hospitalisation, and the pair therefore remain tied, i.e., no winner can be declared. We leave you to evaluate pairs 4 and 5; you can find the answers at the end of this chapter. The win ratio is calculated by dividing the total number of “wins” in the intervention group by the total number of “wins” in the control group. A win ratio greater than 1 therefore means that the intervention is performing better than the control.
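
A minimal sketch of this pairwise logic in Python, for a two-level hierarchy of death then hospitalisations (the data, class and function names are all hypothetical, and a real analysis would also need a confidence interval, which is omitted here):

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Optional

@dataclass
class Patient:
    followup: float                      # total follow-up time (days)
    death: Optional[float] = None        # time of death, if observed
    hosps: List[float] = field(default_factory=list)  # hospitalisation times

def compare(a: Patient, b: Patient) -> int:
    """+1 if a wins, -1 if b wins, 0 if tied, within the shared follow-up."""
    shared = min(a.followup, b.followup)
    da = a.death if a.death is not None and a.death <= shared else None
    db = b.death if b.death is not None and b.death <= shared else None
    if da is not None or db is not None:            # decide on death first
        if da is None:
            return +1                               # only b died -> a wins
        if db is None:
            return -1                               # only a died -> b wins
        return 0 if da == db else (+1 if da > db else -1)  # later death wins
    ha = sum(t <= shared for t in a.hosps)          # hosps within shared time
    hb = sum(t <= shared for t in b.hosps)
    return 0 if ha == hb else (+1 if ha < hb else -1)      # fewer hosps wins

def win_ratio(treated: List[Patient], controls: List[Patient]) -> float:
    wins = losses = 0
    for a, b in product(treated, controls):         # every treatment-control pair
        r = compare(a, b)
        wins += (r == +1)
        losses += (r == -1)
    return wins / losses                            # > 1 favours the treatment

# Pair 2 from Figure 6 in miniature: B dies only after the shared follow-up,
# so the pair is tied on death and decided on hospitalisations (1 v 2).
a = Patient(followup=300, hosps=[100])
b = Patient(followup=400, death=380, hosps=[50, 150])
print(compare(a, b))   # +1: A wins
```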

The win ratio method has become increasingly popular since its introduction in 2012 and has been used in a number of cardiovascular clinical trials. The EMPULSE trial [34] evaluated empagliflozin versus placebo in patients hospitalised for acute heart failure. The primary outcome was a four-level hierarchical CEP of (i) all-cause death, (ii) number of heart failure events, (iii) time to first heart failure event, and (iv) a five-point or greater difference in change from baseline in the Kansas City Cardiomyopathy Questionnaire Total Symptom Score (KCCQ-TSS) at 90 days, analysed using the win ratio. Figure 7 shows the percentage of wins in each group, and of ties, at each level of the hierarchy, plus the overall percentages. The overall percentage of wins was 53.9% in the empagliflozin group and 39.7% in the placebo group, yielding a win ratio of 53.9/39.7 = 1.36, 95% CI 1.09 to 1.68, p=0.0054.

FOCUS BOX 8 Composite endpoints
  • CEPs are commonly used in cardiovascular clinical trials to increase power
  • Results of trials involving CEPs need to be interpreted with caution
  • In addition to the primary results for the CEP investigators should report results for the individual components of the composite
  • Consider whether an overall positive result is being driven by the least clinically important component of the CEP
  • The win ratio takes into account clinical importance of the component events and enables different types of endpoints to be combined

Answer to simplest test question

The difference in the number of events is 31 − 19 = 12. The sum of the number of events is 31 + 19 = 50, and the square root of 50 is approximately 7 (since 7² = 49).

So the z statistic is 12 ÷ 7 ≈ 1.7. Looking at the critical values for z in Table 3, we see that our test statistic is slightly greater than 1.64 (p=0.1) and somewhat less than 1.96 (p=0.05). So we could estimate that the p-value lies between 0.05 and 0.1, probably closer to 0.1 than to 0.05. The reported p-value was 0.09.
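
For completeness, the same calculation without rounding (a two-line check, assuming scipy is available):

```python
from math import sqrt
from scipy.stats import norm

z = (31 - 19) / sqrt(31 + 19)
p = 2 * (1 - norm.cdf(z))          # two-sided p-value
print(round(z, 2), round(p, 2))    # 1.7 and 0.09, matching the reported value
```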

Answer to win ratio test question

Pair 4. Neither patient dies during follow-up so the pair are tied on death. Both patients experience 1 CV hospitalisation, but within their shared follow-up it is 0 v 1 hospitalisation for A v B respectively. Therefore, A wins on hospitalisations.

Pair 5. Patient B dies but the time of death is beyond their shared follow-up – so the pair are tied on death. Within the shared follow-up time we have 2 v 1 hospitalisations for A v B respectively. Therefore, B wins on hospitalisations.

Personal perspective – Stuart Pocock

Many cardiologists need to acquire a better understanding of statistical methods, especially as they relate to the design and interpretation of clinical trials. The subject is often badly taught, and people end up intimidated by too much technical detail and too little insight into what really matters. This chapter attempts to dispel those fears, with a direct account of the fundamental issues, illustrated by topical examples from recent trials in interventional cardiology.

We start with what significance tests are really for, aiming to instil the idea that they convey the strength of evidence as to whether a treatment benefit (or harm) is genuine. We wish to bury the misguided notion that P<0.05 is a “magic cut-off” that “proves” that a treatment is effective.
We then move on to the various methods of estimating the magnitude of treatment difference (e.g. hazard ratio for time-to-event outcomes) and how to express the inevitable uncertainty in an estimate of treatment effect using confidence intervals.

No amount of sophisticated analysis can rescue a trial that is poorly designed. The fundamental design needs are: random allocation of treatments, other measures to avoid bias (e.g. appropriate blinding, compliance with intended treatment and complete follow-up), and making trials big enough and long enough to achieve their goals, using statistical power calculations.

Lastly, we emphasise the problems of “slicing and dicing” the data in lots of different ways and “cherry picking” the result you like most. Thus, subgroup analyses require correct analysis (e.g. interaction tests) and cautious interpretation, in a spirit of hypothesis generation.

A pre-specified statistical analysis plan, giving priorities among outcomes and defining a few subgroups of key interest, can alleviate some of the problems of over interpreting post hoc findings.

Overall, we hope this chapter will help cardiologists to grasp the essentials of statistics, and lead to wiser interpretations of trial evidence.
