Introduction

The use of well-defined outcomes to assess clinical trial results is of major importance. Ideally, endpoints in clinical trials for novel antibiotic agents should be objective, reproducible, have a high internal and external validity, and be clinically meaningful: a direct measure of how patients feel, function, and survive [1, 2]. Unfortunately, at the moment, there is a lack of universal, well-accepted endpoints, particularly for severe hospital-acquired infections. This has resulted in inconsistencies in how trials are designed and reported, raising questions about internal validity and making interpretations across trials difficult. A recent Delphi process (see definition in Table 1) to define standardized endpoints for trials on antibiotic therapy for bloodstream infections (BSIs) identified that no well-validated primary endpoints existed; mortality or clinical cure, at hospital discharge or up to 12 weeks after treatment, were the most common primary endpoints [3]. As another example, the US Food and Drug Administration (FDA) recommends all-cause mortality as the primary efficacy endpoint for trials on hospital-acquired pneumonia (HAP) and ventilator-associated pneumonia (VAP) [4], whereas the European Medicines Agency recommends clinical outcome at the test of cure visit for these types of infectious syndromes.

Table 1 Glossary of important concepts

At the same time, both cure and mortality endpoints have challenges associated with them, especially in critically ill patients. First, in this group of patients, cure is very difficult to define since clinical signs and symptoms may vary due to the infectious process studied, but also because of many concurrent adverse events during their stay in the intensive care unit (ICU) [5]. Second, all-cause mortality in critically ill patients is often related to the underlying illnesses and severity of disease [6]. And since trials often provide inadequate data on the “standard of care”, especially for the treatment of organ dysfunctions related to severe infections [7], it is difficult to associate death with treatment failure of the infection of interest. As a result of these endpoint issues, clinical trials often become less pragmatic [8] and exclude patients with a high risk of dying, or patients in whom underlying conditions may explain, at least partly, the risk of death, even though these new antibiotics, once available for clinical use, will also be administered to these patients. Finally, non-infection-related deaths bias the potential effectiveness of the treatment under study towards non-inferiority, which is especially relevant given that randomized clinical trials (RCTs) for antimicrobial therapies are generally non-inferiority trials (Table 1).

A further challenge is the sample size required to detect a clinically meaningful difference in the main outcome of interest. In trials examining new therapeutic options for severe infectious diseases, it is not possible to compare a new treatment to placebo for ethical reasons. The added value of a new treatment needs to be compared to the standard of care, which often results in marginal differences. Furthermore, it may not be feasible or ethical to recruit patients for whom most benefit can be expected, hence reducing the marginal differences even further. Consequently, a non-inferiority trial needs to be considered. However, a standard non-inferiority margin (Table 1) of 10% is often considered too large from a clinical point of view, especially if survival rates or cure rates are high due to the exclusion of patients at high risk of dying, which often occurs as noted above. This makes the choice for and definition of the key endpoint in this patient group critical.

In this article, we review and discuss the current practices and challenges regarding endpoints in RCTs evaluating new therapies for severe infections and propose novel approaches which could improve internal and external validity of RCT outcomes. This is based on the expertise that was fostered within the STAT-Net group through systematic reviews [9], re-analyses of clinical trial data [10], and testing of new analytical methods and study designs, which will be reported in more detail soon.

Currently applied endpoints

Current primary endpoints in RCTs of severe infections include all-cause mortality, attributable mortality (Table 1), improvement of clinical parameters or specific biomarkers, microbiological eradication, antibiotic- or organ-failure-free days, and quality of life evaluations [11]. Advantages and disadvantages of these different endpoints will be discussed below and are summarized in Table 2.

Table 2 Advantages and disadvantages of clinical endpoints in randomized clinical trials evaluating antibiotic effectiveness in critically ill patients

Mortality

Mortality is the most robust outcome criterion: it is the most severe outcome and can be measured objectively. RCTs in critical care have traditionally reported ICU-, hospital-, or 28-day-mortality, partly as a regulatory requirement, partly in an attempt to balance the time needed for a drug to show its effects and the time in which other disease processes could obscure the effect. The magnitude of the effect depends on the timeliness of initiation and appropriateness of empirical antibiotic therapy [12]. The attributable risk of death has been studied for VAP [6, 13, 14], HAP [15], catheter-associated urinary tract infections [16], and nosocomial BSIs [17] using various methodologies [18]. In HAP/VAP, the most important indication for antibiotic treatment in the ICU, the attributable mortality (Table 1), reported by well-designed epidemiological studies, is around 3–10% when compared to treated controls [6, 13].

However, nowadays, trials are becoming less pragmatic, losing the match between the trial setting and the setting to which their results will be applied [8]. The application of, for example, increasingly restrictive inclusion criteria has reduced reported all-cause mortality for hospital-acquired infections from 30 to 10–15% [18]. Restrictive criteria are required to ensure that the treatment can be adequately assessed; i.e. to prevent a sizeable proportion of patients dropping out within 48 h due to death. This does, however, mean that RCTs only include part of the real-life population for whom the drug could be of benefit, and could result in new drugs being prescribed off-label after approval [19]. Moreover, recently published epidemiological trends have shown that all-cause mortality rates could become even lower in future trials due to a combination of better recognition and improved standard of care [19,20,21]. This would make it even more difficult for a new treatment to demonstrate higher efficacy when focusing on mortality endpoints [22]. This emphasizes the importance of selecting an endpoint that is sensitive enough to capture relevant added benefit for the trial patients, as well as patients expected to be treated with the new drug. In Fig. 1a, we show how a small difference in mortality between the intervention and control arm influences the required sample size for a clinical trial in a non-inferiority setting; with 22% mortality in the control arm versus 20% in the intervention arm, and a non-inferiority margin of 10%, more than 370 patients should be included to have a power of 80% to be able to conclude non-inferiority. Halving of the non-inferiority margin, to a more clinically acceptable 5%, would almost triple this requirement.
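The arithmetic behind such sample sizes can be sketched with the usual normal-approximation formula for a non-inferiority comparison of two proportions. The sketch below is illustrative only: the one-sided significance level of 2.5%, the equal allocation, the unpooled variance, and the function name are our assumptions and not necessarily the settings used for Fig. 1a, so the results are expected to be close to, but not identical with, the numbers quoted above.

```python
from math import ceil
from statistics import NormalDist


def ni_sample_size_per_arm(p_control, p_new, margin, alpha=0.025, power=0.80):
    """Approximate number of patients per arm for a non-inferiority trial.

    p_control, p_new: anticipated mortality in the control and new-treatment arm.
    margin: non-inferiority margin on the absolute risk difference.
    alpha: one-sided significance level (assumed here, not taken from the article).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the two arms (unpooled).
    variance = p_control * (1 - p_control) + p_new * (1 - p_new)
    # Distance between the anticipated difference and the margin.
    distance = margin - (p_new - p_control)
    return ceil((z_alpha + z_beta) ** 2 * variance / distance ** 2)


wide = ni_sample_size_per_arm(0.22, 0.20, margin=0.10)
narrow = ni_sample_size_per_arm(0.22, 0.20, margin=0.05)
print("10% margin:", 2 * wide, "patients in total")
print(" 5% margin:", 2 * narrow, "patients in total")
print("ratio:", round(narrow / wide, 2))
```

Under these assumptions, halving the margin from 10% to 5% multiplies the required sample size by almost three, which matches the pattern described for Fig. 1a.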

Fig. 1

These graphs show the impact of the chosen endpoint, effect size, and non-inferiority margins on the required sample size. Scenario 1: required sample size for a non-inferiority trial with a 28-day mortality endpoint with an estimated mortality difference of 0% (green circle) or 2% (blue square) and a non-inferiority margin of 10% (dashed line) or 5% (solid line) (a). Scenario 2: required sample size for a superiority trial for a difference in antibiotic-free days of 7 days (green circle) (b). Scenario 3: required sample size for a superiority trial for a probability of 66% (green circle) to have a better outcome in the intervention arm (DOOR/RADAR composite endpoint) (c). All simulations are based on a power of 80%.

Clinical cure

Clinical cure, i.e. investigator’s assessment of clinical response, is the primary endpoint used in the vast majority of studies conducted before 2010 for severe infections such as HAP/VAP [23]. This endpoint can be more sensitive than mortality to assess treatment efficacy, especially in the context of low mortality rates. However, as a consensual definition of clinical cure is still lacking, the appreciation of cure by clinicians remains subjective, raising reliability and reproducibility issues. Indeed, clinical improvement is sometimes very hard to establish in severely ill patients [2], and this lack of objectivity can result in variability between centers in ascertaining cure, resulting in bias on the endpoint, consequently diluting or masking a potential treatment effect. Using an adjudication committee with a pre-planned charter for adjudication may be a solution to circumvent such potential bias in the assessment of clinical cure, but variability in diagnostic criteria can also impact clinical cure rates. A recent study showed, for example, that only 27.6% of the infection-related complications in mechanically ventilated patients are related to VAP [24]. Therefore, absence of clinical cure is related to VAP in only one-quarter of the cases. Finally, while mortality is usually assessed at day 28 or at 1 month, the variability of timepoints used to assess clinical cure may impact study results [9].

Microbiological cure

Microbiological cure is a more objective endpoint, but it requires multiple, serial samples, which need to be adequately and reproducibly cultured. In particular, time to negativity of blood cultures and decrease in bacterial count of quantitative cultures from respiratory samples in VAP are frequently used in clinical practice. The main challenge is that the proportion of patients with a confirmed pathogen at baseline varies per infection type, but can be less than 50%, and repeated microbiological sampling is often not feasible or systematically performed. In most of the HAP/VAP studies conducted for approval by the FDA before 2010, clinical cure without respiratory samples was also classified as microbiological cure. Moreover, in case of doubt of clinical success, additional sampling is done, i.e. the most critically ill patients are sampled more often, creating measurement bias. Finally, the time to eradicate the organism is not always correlated with clinical response or increased mortality. For instance, Dennesen et al. and Shorr et al. failed to demonstrate a relationship between microbiological clearance and mortality for severe infections such as HAP/VAP [25, 26]. This underlines how microbiological cure as an endpoint could readily lead to overestimation of clinical improvements and in that way disregards patients’ wellbeing.

Antibiotic- or organ-failure-free days

Endpoints measuring antibiotic- or organ-failure-free days are among the most frequently used endpoints in non-registration RCTs, comparing antibiotic strategies in the ICU setting [27, 28]. This endpoint considers both mortality and other endpoints such as antimicrobial exposure or mechanical ventilation within one measure; free days are days without exposure between randomization and the pre-determined end of follow-up or death, whereby patients who die early have a higher chance of receiving a lower score. This strategy increases the chance of observing differences in outcome between treatment arms. If, for example, a new treatment increases the number of antibiotic-free days by 7 days, only around 30 patients would be required to have a power of 80% to detect this difference in a clinical trial (Fig. 1b). A disadvantage of this endpoint is, however, that patients with completely different infectious processes could receive the same score. For example, a patient under continuous antibiotic therapy and still alive at day 28 could receive the same score as another patient who died at day 7, because neither of them had any antibiotic-free days.
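The ambiguity described in this example can be made concrete with a short calculation. The sketch below assumes the definition given above (days without antibiotic exposure between randomization and death or the end of a 28-day follow-up); the function name and the two patients are hypothetical, and other trials may score patients who die differently.

```python
def antibiotic_free_days(days_on_antibiotics, death_day=None, follow_up=28):
    """Days alive and without antibiotics within the follow-up window.

    death_day is None for patients who are alive at the end of follow-up.
    """
    observed_days = follow_up if death_day is None else min(death_day, follow_up)
    exposed_days = min(days_on_antibiotics, observed_days)
    return observed_days - exposed_days


# Patient A: alive at day 28, on continuous antibiotic therapy throughout.
patient_a = antibiotic_free_days(days_on_antibiotics=28)
# Patient B: died at day 7 while still receiving antibiotics.
patient_b = antibiotic_free_days(days_on_antibiotics=7, death_day=7)
# Patient C: alive at day 28 after an 8-day course.
patient_c = antibiotic_free_days(days_on_antibiotics=8)

print(patient_a, patient_b, patient_c)
```

Patients A and B receive the same score although their clinical courses differ completely, whereas patient C is clearly separated from both.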

Safety

Safety endpoints are always measured in addition to other clinical endpoints. A new anti-infective therapy should be at least as safe as the comparator. However, in critically ill patients, adverse events are frequent and could be due to many other drugs and invasive procedures. As such, it is almost impossible to causally attribute adverse events to the drug under study [5]. Given that treatment-limiting adverse events are part of the definition of failure on the clinical response endpoint, it is important to be able to causally assess the adverse event. Incorrect attribution of serious adverse events could jeopardize drug development and registration. Since the detection and attribution of serious adverse events is so inaccurate and variable, and most RCTs are not specifically powered to detect these events, the chance of detecting any differences in safety endpoints between treatments is often very low.

Emerging antimicrobial resistance

The prevalence of multi-drug-resistant organisms (MDROs)-associated infections has increased over the years [29], and as such has increased complexities even further. If patients are randomized before microbiological results confirm the susceptibility profile of the causative pathogen, it is impossible to distinguish resistant from susceptible infections prior to randomization. Consequently, the impact of empirical treatment of infections caused by MDROs will be diluted, and a non-inferiority trial (Table 1) would be the most feasible design, with the risk of including very few target patients. This trial therefore has a risk of providing little relevant value, especially if the overall patient group has a very high response rate, as this makes the often chosen non-inferiority margins (Table 1) around 10% clinically unacceptable. If the trial is focused on MDRO infections such that patients are randomized only after microbiological results are known, three major caveats need to be taken into account: a fully powered RCT is unlikely to be feasible as patients are rare; empirical therapy will dilute the impact of the new therapy; and possibly the optimal time frame for highest attainable impact has already passed. These challenges are present regardless of the type of endpoint studied. It does, however, make it critical that the most sensitive, relevant endpoint is chosen.

Possible solutions (and their potential shortcomings)

As outlined above, there is no consensus about the most appropriate endpoint for RCTs of antibiotic therapies in severely ill patients. Mortality endpoints are not very sensitive, while clinical cure endpoints are subjective, microbiological cure is not viable, and antibiotic- or ventilation-free days are ambiguous (Table 2). It will not be easy to reach consensus, but Delphi processes like the one recently published on endpoints for BSIs [3] are a step in the right direction. In that review, the authors concluded that it is unlikely for one endpoint to capture all relevant information and proposed to use composite endpoints. In this article, we will discuss three promising strategies. First, composite endpoints, which are used, for example, in several ongoing, open-label RCTs evaluating antibiotic therapies that combine mortality and clinical endpoints (NCT02634411, NCT02365493, NCT02575495). Second, empirical treatment trials aiming to show non-inferiority have become standard, but, given the importance of assessing patients infected with MDROs, a hierarchical nested design, combining a non-inferiority trial with a nested superiority trial (Table 1), should be considered. Finally, up-to-date statistical methods like competing event analyses (Table 1) and multistate models (Table 1) could provide more insight into how timing of events influences effect measures.

Composite endpoints

Combining multiple endpoints into one composite effect measure would overcome the need to have co-primary endpoints and thus avoids the issue of multiplicity and the consequent need to adjust p values. It could reduce the required sample size even further, as the effect difference could be larger. The International Council on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) has written the ICH E9 guideline [30], a guidance document about the principles of statistical methods applied to RCTs for marketing applications. In this document, composite endpoints are regarded as “a useful strategy”. In addition, defining a single endpoint that includes an efficacy component as well as a safety component is consistent with FDA’s philosophy of defining an endpoint based on how a patient feels, functions, and survives. Finally, separate analysis of the components of a composite endpoint as secondary outcome measures could preserve the possibility of comparing old and new trials.

However, construction of such an endpoint is challenging and interpretation can be misleading, especially when an intervention affects the distinct endpoints differently [31]. For example, length of stay may be reduced by a treatment only because mortality is increased. Furthermore, an inherent limitation of the conventional reporting of composite endpoints is that it emphasizes each patient’s first event, which is often the outcome of lesser importance. Therefore, Pocock et al. have suggested the win ratio as a new effect measure that takes the different priorities of the components into account [32]. Pairs are composed of patients from the new and comparator treatment, within randomization strata, and are subsequently grouped into winners/losers based on whether the treated pair member experienced the most/least favorable event first (Table 3).
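The pairwise logic of the win ratio can be illustrated as follows. This is a minimal sketch under simplifying assumptions of ours: every patient is followed for the same fixed period without censoring, the hierarchy has only two components (death first, then clinical failure), and every patient on the new treatment is compared with every control patient instead of forming pairs within randomization strata. The data and all names are hypothetical.

```python
from itertools import product
from typing import NamedTuple, Optional


class Patient(NamedTuple):
    death_day: Optional[int]    # None = alive at the end of follow-up
    failure_day: Optional[int]  # None = no clinical failure observed


def compare_component(t_new, t_ctrl):
    """+1 if the new-treatment patient fares better, -1 if worse, 0 if tied."""
    if t_new is None and t_ctrl is None:
        return 0
    if t_new is None:
        return 1
    if t_ctrl is None:
        return -1
    return (t_new > t_ctrl) - (t_new < t_ctrl)  # a later event is better


def compare_pair(new, ctrl):
    """Walk down the hierarchy; the first component that differs decides."""
    for component in ("death_day", "failure_day"):
        result = compare_component(getattr(new, component), getattr(ctrl, component))
        if result != 0:
            return result
    return 0


def win_ratio(new_arm, control_arm):
    results = [compare_pair(n, c) for n, c in product(new_arm, control_arm)]
    wins, losses = results.count(1), results.count(-1)
    if losses == 0:
        raise ValueError("win ratio is undefined when there are no losses")
    return wins / losses


new_arm = [Patient(None, None), Patient(None, 10), Patient(20, 5)]
control_arm = [Patient(None, 8), Patient(12, 4), Patient(None, None)]
print(round(win_ratio(new_arm, control_arm), 2))
```

Because death is evaluated before clinical failure, a pair is decided on clinical failure only when survival does not distinguish the two patients, which is how the priorities of the components enter the effect measure.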

Table 3 Evans et al. proposed a way to utilize composite endpoints: the desirability of outcome ranking (DOOR) and the response adjusted for the duration of antibiotic risk (RADAR) [33]

Evans et al. recently proposed another way to utilize composite endpoints: the desirability of outcome ranking (DOOR) and the response adjusted for the duration of antibiotic risk (RADAR) [33]. For DOOR, a composite score is designed by assigning higher ranks to patients with better overall clinical outcomes; clinical success, clinical benefit with adverse event (AE), clinical failure without AE, clinical failure with AE, and death can be components of this composite endpoint. For RADAR, rankings are based on the duration of antibiotic use, which is tailored for trials evaluating optimal antibiotic use, and this can be combined with DOOR. In the end, DOOR/RADAR distributions can be compared between treatment arms. The generic version of DOOR is very similar to the win ratio (Table 3). This composite score could greatly reduce the required sample size to detect superiority for a novel drug. In a recent re-analysis of clinical trial data, looking at shorter duration of antimicrobial therapy for intra-abdominal infections, a DOOR/RADAR of 66% was found [34], meaning that patients in the intervention arm had a 66% chance of getting a better DOOR/RADAR than the controls. In Fig. 1c, we show that enrolment of 75 patients would already be sufficient to detect this difference.
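How such a probability is obtained from ranked outcomes can be sketched in a few lines. The example assumes the five outcome categories listed above, counts ties as half a win, and omits the RADAR step of breaking ties by duration of antibiotic use; the patients and the function name are hypothetical and are not taken from the re-analysis in [34].

```python
from itertools import product

# Rank 1 is the most desirable outcome, rank 5 the least desirable.
DOOR_RANK = {
    "clinical success": 1,
    "clinical benefit with AE": 2,
    "clinical failure without AE": 3,
    "clinical failure with AE": 4,
    "death": 5,
}


def door_probability(intervention, control):
    """Probability that a random intervention patient has a more desirable
    outcome than a random control patient, with ties counted as one half."""
    wins = ties = 0
    for outcome_new, outcome_ctrl in product(intervention, control):
        rank_new, rank_ctrl = DOOR_RANK[outcome_new], DOOR_RANK[outcome_ctrl]
        if rank_new < rank_ctrl:
            wins += 1
        elif rank_new == rank_ctrl:
            ties += 1
    return (wins + 0.5 * ties) / (len(intervention) * len(control))


intervention = ["clinical success", "clinical success",
                "clinical benefit with AE", "death"]
control = ["clinical success", "clinical failure without AE",
           "clinical failure with AE", "death"]
print(door_probability(intervention, control))
```

A value of 50% would indicate no difference between the arms; in this toy data set the probability is about 66%.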

While Molina and Cisneros state that this design provides valuable information and could be included as a co-primary analysis [35], Phillips et al. warn that considerable evaluation of this method is necessary, as it can easily be manipulated through choice of the categories [36]. It is therefore of key importance that hierarchical endpoints of this kind are clearly defined and published in the statistical analysis plan before unblinding and analyzing the data. Importantly, interpretation of the outcomes and the difference in outcomes between the arms would require pre-trial discussions; it is a novel endpoint and a clinically meaningful difference needs to be determined.

Components of composite endpoints

Composite endpoints seem to be a promising strategy; however, their specific components will depend on many factors, including whether it is a pivotal or pragmatic trial, or what type of indication is targeted. In two recent systematic reviews [3, 9], it is proposed to combine mortality and microbiological endpoints for BSIs, and mortality and clinical cure for HAP/VAP, respectively. However, both groups agree more debate is required to improve definitions of clinical cure and to finalize discussions on the importance of patient-centered outcomes versus microbiological or clinical response, as well as the role of safety endpoints.

Hierarchical nested designs

A hierarchical nested design is proposed by Huque et al. in order to overcome the problems associated with RCTs specifically focused on treatment of infections caused by MDROs [37]. In this design, the primary endpoint (described as a dichotomous one in Huque et al. [37], i.e. mortality or clinical/microbiological cure) is first tested for non-inferiority in the subgroup of patients who have infections caused by pathogens susceptible to the control drug. Then, once non-inferiority is confirmed, patients with infections caused by resistant pathogens can be compared using a superiority test on the same endpoint. However, since this design is powered for the non-inferiority comparison, the probability of achieving superiority for the patients with infections caused by MDROs is small, given they are likely to be a small proportion of the overall population. Therefore, more sensitive, non-mortality endpoints should be considered in this type of trial for testing the superiority hypothesis in the resistant subgroup [2, 38, 39]. So far, no RCT has implemented a hierarchical nested design to determine treatment efficacy for infections caused by MDROs.

Statistical advancements in analyzing endpoints

In addition to the complexities related to study design and selection of the most appropriate endpoints, there are also some issues that need to be considered in the analytical phase. The high underlying mortality rate of critically ill patients makes an analytical evaluation of non-mortality endpoints challenging. Harhay et al. have already emphasized that special statistical methods are needed when evaluating a non-mortal clinical endpoint [38]. All non-mortal clinical endpoints are measured over time, i.e. days from randomization, such as clinical outcome at the test of cure visit. Thus, time plays a crucial role in the interpretation and analysis of the data. The main statistical challenge, when observing non-mortal clinical endpoints, is the way patients who die are handled. For instance, when cure is the endpoint of interest, patients who die before being cured, from, e.g., their underlying disease, will have zero chance of cure (Fig. 2, patients 2 and 9) as compared to patients who are simply lost to follow-up and still have a chance of cure (Fig. 2, patient 5). Death prevents observing cure as the primary endpoint and is therefore a competing event for cure [40, 41]. Competing risk methods take these effects into account and provide an appropriate estimation of how the cure probability develops over time. These methods have recently been applied for RCTs in similar settings: for instance, Ayzac et al. studied the impact of prone positioning on the incidence of VAP using a cumulative incidence function that accounts for death as a competing event [42].
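The difference between a competing risks estimate and a naive analysis that treats deaths as lost to follow-up can be shown with a small nonparametric calculation. The sketch below implements the standard cumulative incidence estimator for the first event per patient; the ten records are hypothetical, are only loosely modeled on the situation described above, and are not the data behind Fig. 2.

```python
CENSORED, CURE, DEATH = 0, 1, 2


def cumulative_incidence(records, cause):
    """Cumulative incidence of `cause` over time, accounting for competing events.

    records: list of (day, status) giving each patient's first event or censoring.
    Returns a list of (day, cumulative incidence) at each day with an event.
    """
    at_risk = len(records)
    event_free = 1.0  # probability of being neither cured nor dead so far
    incidence = 0.0
    curve = []
    for day in sorted({d for d, _ in records}):
        statuses = [s for d, s in records if d == day]
        n_cause = statuses.count(cause)
        n_any = sum(1 for s in statuses if s != CENSORED)
        if n_any:
            incidence += event_free * n_cause / at_risk
            event_free *= 1 - n_any / at_risk
            curve.append((day, incidence))
        at_risk -= len(statuses)
    return curve


records = [(10, CURE), (4, DEATH), (12, CURE), (15, CURE), (8, CENSORED),
           (9, CURE), (30, CENSORED), (14, CURE), (6, DEATH), (11, CURE)]

cure_curve = cumulative_incidence(records, CURE)
# Naive analysis: deaths are recoded as censored, as if cure were still possible.
naive_records = [(d, CENSORED if s == DEATH else s) for d, s in records]
naive_curve = cumulative_incidence(naive_records, CURE)

print("competing risks:", round(cure_curve[-1][1], 3))
print("naive:          ", round(naive_curve[-1][1], 3))
```

In this example the naive analysis reports a higher cure probability than the competing risks estimate, because patients who died are treated as if they could still be cured later.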

Fig. 2

An illustration of the follow-up time over 30 days for ten patients with cure as the primary endpoint. On the x-axis, time from infection is displayed in days. Death can happen early in time (e.g., patients 2 and 9) preventing a patient from being cured, but death can also be observed after cure (patients 6 and 10)

In addition, underlying all-cause mortality often remains high, even shortly after experiencing the non-mortality clinical endpoint. This requires a deeper understanding of the event dynamics after reaching clinical cure. It can be achieved by a multistate model, an extension of the competing risks model, where a non-mortality endpoint, such as cure, acts as an intermediate event. Such a model is appropriate to examine the whole time-dynamic process of the probability to be cured and alive. In particular, it accounts for the fact that patients might die from an underlying illness where the infection may or may not be contributory, either before or after the scheduled test of cure. In this model, efficacy can be evaluated at several time points simultaneously, comparing a short- with a long-term effect, potentially looking at signs and symptoms to assess whether or not a patient is improving during the entire course of treatment.
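A very reduced illustration of this idea is to track, day by day, the proportion of patients in each of three states: infected, cured and alive, and dead. The sketch below assumes complete follow-up of every patient to day 30, in which case these state occupation probabilities are simple proportions; with censoring, a dedicated multistate estimator would be needed. The patients and names are hypothetical.

```python
from collections import Counter

STATES = ("infected", "cured and alive", "dead")


def state_at(day, cure_day, death_day):
    """State of one patient on a given day; None means the event never occurred."""
    if death_day is not None and death_day <= day:
        return "dead"
    if cure_day is not None and cure_day <= day:
        return "cured and alive"
    return "infected"


def state_occupation(patients, days):
    """Proportion of patients in each state on each requested day."""
    occupation = {}
    for day in days:
        counts = Counter(state_at(day, cure, death) for cure, death in patients)
        occupation[day] = {state: counts[state] / len(patients) for state in STATES}
    return occupation


# (cure_day, death_day) for ten hypothetical patients followed for 30 days.
patients = [(10, None), (None, 4), (12, None), (15, None), (9, 20),
            (None, None), (14, None), (None, 6), (11, 25), (13, None)]

occupation = state_occupation(patients, days=(7, 14, 30))
for day, probabilities in occupation.items():
    print(day, probabilities)
```

In this example the probability of being cured and alive is higher at day 14 than at day 30, because two patients die after having been cured; a single assessment at a fixed test of cure visit would not show this.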

Safety endpoints

Finally, it is not just about efficacy of the new treatment, as side effects should also be considered. Traditionally, side effects of new therapies include short-term, individual effects, involving one or more organ systems. However, side effects of antimicrobial prescribing can have a far wider reach, as any use of antibiotics will inevitably have an impact on the flora of the individual patient and the microbial ecology in the environment. Therefore, collateral damage at the individual and population level, in the short and the long term, should be considered in RCTs evaluating new antimicrobial therapies. Secondary safety endpoints that could capture this wider context at an individual level include endogenous resistance development, impact on the microbiome, i.e. colonization with MDROs after treatment, incidence of Clostridium difficile infections or super-infections. At the population level, trends in resistance proportions or colonization rates could be important indicators. In the field of pivotal RCTs, this is still uncharted territory, and as such it is very difficult to provide any practical guidance on how to implement these types of safety endpoints in future trials. However, it is clear that assessment of this collateral damage is becoming more and more critical, and including some measure of collateral damage should be considered during the design phase of new RCTs.

Role of the clinician

As stressed previously, the commonly applied 10% non-inferiority margin for mortality is far higher than the 3–5% which would be acceptable to clinicians. Therefore, clinicians should participate in the discussion about applied non-inferiority margins in clinical trials performed in their hospitals. Secondly, it is clear that mortality should no longer be used as a single primary outcome measure to evaluate new treatment for sepsis, especially in non-inferiority trials. A panel of clinicians, pharmacists, methodologists and patients should be involved in order to come to a clinically meaningful hierarchy of endpoints, which could be proposed to regulatory agencies. The Delphi process is one of the possible ways to come to a consensus on new endpoints. This effort has already been initiated for circulatory failure in sepsis [22] and BSI [3], and is ongoing for HAP/VAP by our group [9].

Conclusion

Overall, there is a wide range of possible endpoints for evaluating new antibiotic therapies for patients with severe infections. In registration trials, mortality and clinical response tend to be primary endpoints. Non-registration trials have more frequently used microbiological response and antibiotic-free days. However, all of these endpoints suffer from drawbacks, as summarized in Table 2, and are inadequate in the light of improved clinical management and the increasing prevalence of MDROs. There is also a lack of consensus about which endpoint should be primary, and, as such, there is an urgent need to discuss and develop new endpoints that can capture added value of antibiotics in these specific circumstances. These endpoints need to be of importance from the patients’ perspective and represent the real-world setting. Currently, composite endpoints, hierarchical nested designs, and competing risks analyses seem to be the most promising new tools for designing and analyzing clinical trials in this area.

For composite endpoints, the applicability of the win ratio and/or DOOR/RADAR should be thoroughly tested for different categorization of objective and more subjective endpoints. If a hierarchical nested design is used, the likelihood of being able to achieve superiority on the resistant pathogen subgroup should be an important consideration in the selection of the endpoint, suggesting that more sensitive endpoints, like composite scores, would be most appropriate. The performance of nested non-inferiority/superiority trials wherein the most applicable endpoint can be tested in each of the subgroups could further improve this design, but this should be tested for performance. As an additional clinical trial requirement, sensitivity analyses should be considered that assess the whole time-dynamic process of the probability to be cured and alive, without applying stringent time limits to the test for cure or determination of life status, using a multistate framework.

The current discussions on the most appropriate endpoints for RCTs in severely ill patients leave developers in an uncertain position when designing new trials. First of all, if consensus is not reached, future reviews and meta-analyses will not be able to bring together information on new therapies, as the clinical data and evaluated endpoints will differ. Secondly, it postpones discussions about how to handle the transition period from old to new endpoints. In this era of rising resistance levels, slowly refilling antibiotic development pipelines, and an aging population, we need to make sure that RCTs effectively, and in a valid way, determine the added benefit of new antibiotic agents, especially for severely ill patients. Regulatory authorities, pharmaceutical companies, and clinical investigators need to agree soon on the most appropriate clinical endpoints for this vulnerable patient group in order to make sure that new, effective antibiotic agents will become available for medical prescription efficiently and that unnecessary deaths can be avoided.