Non-inferiority Trial Design (REFERENCE SERIES)
Why I don't like them. A discussion prior to the US OPC Trial release, which is...unfortunately...a non-inferiority design.
First off - I’m not a statistician. If you are a statistician and something should be tweaked or improved - as always, let me know. Undergrad in Engineering with plenty of math and stats, plus dedicated medical stat classes at MDACC in training, but it’s been a while.
Why I don’t love non-inferiority trials
I want to have a brief discussion on non-inferiority trials. I think I’ve made 3 prior comments on the blog regarding non-inferiority trial design so I think I need to spend a quick moment to explain my perspective. It is important to understand as we head towards the publication of the US randomized OPC Proton vs. Photon trial as it is…. unfortunately (my opinion), a non-inferiority trial.
Back in the day (I guess I’m now considered old so I can use phrases like that), I can’t remember quoting a non-inferiority trial. Since around 2000, they have become far more popular as we pushed to shorter treatment cycles.
(Before I begin, I just want to be clear that the following is an example. I think stories often work better and this helps explain my perspective. In the end, I’m for hypofractionation, we have a lot of data from many sources. And as I’ve said previously, I use a 7250 / 29 fraction approach as standard. But it makes a good example)
It’s a catchy name, “non-inferiority”, but I want to go back through the math and statistics of this type of trial design. I’ll use a classic non-inferiority trial - one that essentially helped standardize 70 Gy in 28 fractions for prostate cancer - as an example case: the Randomized Phase III Noninferiority Study Comparing Two Radiotherapy Fractionation Schedules in Patients With Low-Risk Prostate Cancer (ref 1). It’s a good example for a few reasons which we will discuss.
It looked at low risk prostate cancer treated to 73.8 Gy (lower than I have treated since 2001) and compared it to 70 Gy in 28 fractions. The conclusions state what many non-inferiority trials state - the shorter, quicker course was not inferior to the longer traditional treatment.
But here is the rub and it is twofold. First on the math / statistics side and second on the broader approach perspective. The math part many might not grasp if you don’t have a math background or haven’t taken dedicated statistics classes - I’m by no means a statistician but I was fortunate to have a lot of math in undergrad and dedicated stats courses at MDACC.
For this example, the trial was designed with pre-set criteria to show that 5 year disease free survival - using a standard but rather VERY weak definition of failure, the Phoenix definition (2 above nadir) - was within 7.65% of the standard results. That corresponds to a hazard ratio for failure of less than 1.52. So if we anticipate a 15% failure rate, this trial would confirm non-inferiority by its basic design if the new shorter course achieved a failure rate of anything less than 22.65%.
(I say Phoenix is weak because it is. I’ve said that for 20 yrs. Clearest proof is in PSMA scans which today will pick up failures I’d guess pretty routinely at about 1/2 of that scope of rise. Today, I commonly will not wait out a full 2 point rise before restaging. The argument then becomes is that clinically relevant - that’s a post for a different day)
You see, the reason we run non-inferiority trials with this type of downside leeway is that you need increasingly higher patient numbers the tighter you make the range. So they try to choose something “reasonable”, where they think clinical differences are minimal. But consider the patient perspective: this is the assumption that has been made behind closed doors “for you”: 85% cancer cure at 5 yrs vs. 78% cancer cure rate doesn’t matter “significantly.” You know, it is the “same.”
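To put rough numbers on that trade-off, here is a minimal sketch using a standard Schoenfeld-type approximation for the number of events a 1:1 randomized non-inferiority trial needs. The one-sided alpha of 0.05, the 90% power, and the assumption that the two arms truly perform the same are my illustrative choices, not values taken from any specific protocol, and the function name is hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def events_needed(hr_margin, alpha=0.05, power=0.90):
    """Approximate number of events for a 1:1 non-inferiority comparison.

    Assumes the true hazard ratio is 1 (the arms really are the same) and
    uses the approximation: events = 4 * (z_alpha + z_beta)^2 / ln(margin)^2.
    alpha is one-sided. These defaults are illustrative assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hr_margin) ** 2)


# Tighter hazard ratio margins demand many more events (and so more patients).
for margin in (1.52, 1.30, 1.15):
    print(f"margin HR {margin:.2f}: about {events_needed(margin)} events")
```

The required events scale with one over the square of the log of the margin, which is why trial designers are pulled toward generous margins.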
This trial also serves as a good example because the results ultimately are very similar and illustrate a possible second misinterpretation. People will go back and look at the exact results. They will say, “in the shorter arm the DFS was 86.3% and in the standard arm it was 85.3% and look at the highly significant p value.” The assumption is the treatments are the same or shorter might actually be better.
Statistically you have made assumptions which are not implied by the results. The p value does not test for equal results or compare those values of 86.3% and 85.3%; it tests whether the hazard ratio for failure is less than 1.52. The trial isn’t designed to test for anything beyond that initial pre-determined difference, which they thought it likely would pass. In this trial for example, the unfavorable end of the confidence interval is about 10% more failures in the shorter treatment arm. I think many people directly link the absolute numbers to the p value, whereas in reality the p value considers the confidence interval relative to the pre-defined inferiority hazard ratio.
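A rough sketch of that logic, for readers who like to see it spelled out. The numbers below are hypothetical and are not the trial's data; the standard error shortcut (2 divided by the square root of the number of events) is a common approximation for 1:1 randomization, and the function name is my own.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def noninferiority_test(hr, events, hr_margin, alpha=0.05):
    """One-sided test of H0: true HR >= hr_margin (new arm is inferior).

    Uses the approximation se(log HR) ~ 2 / sqrt(events) for 1:1 allocation.
    Returns (p_value, upper_confidence_bound, non_inferior).
    """
    se = 2 / sqrt(events)
    z = (log(hr) - log(hr_margin)) / se
    p_value = NormalDist().cdf(z)  # small p means the margin is ruled out
    upper = exp(log(hr) + NormalDist().inv_cdf(1 - alpha) * se)
    return p_value, upper, upper < hr_margin


# Hypothetical result: observed HR 0.95 with 200 events, margin 1.52.
p, upper, ok = noninferiority_test(hr=0.95, events=200, hr_margin=1.52)
print(f"p = {p:.4f}, upper bound = {upper:.2f}, non-inferior = {ok}")
```

Note what the p value is compared against: the margin, not a hazard ratio of 1. A very small p value here says only that the upper bound sits below 1.52, and that upper bound can still be above 1.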
Therein lies the “marketing” (my term) within a non-inferiority trial. You set up the entire background as an argument for “good enough”. And that is my second issue with them. It is one of a broad perspective arguing that we’ve currently maximized the primary goal. It indirectly argues that increased cure with limited toxicity is not statistically available. As I’ve said previously, I don’t believe that is the case for any cancer we treat today.
Further, I’m an engineer. I like things as clean as possible. “Equivalence” with soft endpoints of fewer trips and more throughput for the center should fall a distant second behind improving tumor control and lessening treatment toxicity. And, to me, that makes it closer to marketing.
Back to the example. From this trial alone, we do not know whether, if we ran it over and over again, the result would remain at 86% or be closer to 80%. But in this case we CAN find out. We can look at other trials and feel reasonably confident it likely lies around 85% (ref 2, 3 as just two examples). So this trial contributes, but from this single non-inferiority trial you can’t make that statement. From this trial, we know only that it is very likely better than 78%.
Here’s a table simply showing we have a lot of data to choose from. If you want to focus on the proton / IMRT aspect of the table, please READ MY THOUGHTS HERE where I discuss the COMPARRE Trial.
So, in this example, this trial of 70 Gy comes from a non-inferiority comparison to a relatively low dose prostate treatment with a pretty generous definition of failure.
And then ironically in this trial, Gr2 and Gr3 toxicities for both bladder and bowel were significantly higher in the shorter treatment course arm (with about 45% more complications in the shorter arm). By most standards, we would consider this a negative result, and I’m certain it is not explained to patients as “probably about equal cure rates and worse toxicity” in our risk / benefit discussion.
(Again, we have additional data so please separate the example from a broad context of prostate cancer fractionation trials.)
Now consider an SBRT non-inferiority trial against 70 Gy with a similar possible slide in expectations. If 70 Gy really does obtain a “true” value of 84% - which is within this trial’s conclusion and supported by the p value - then SBRT might only need to hit 78% to pass a non-inferiority trial. And slowly over time, we can slide towards lower outcomes being acceptable.
If you want to read more on non-inferiority trial design here is a medical reference covering the topic: Non-inferiority statistics and equivalence studies. (ref 4)
Why is this context important?
Head and neck trials are NOT prostate trials. We won’t have trial after trial of 500+ or even 3000+ patients to consider. The data from the trials will have to be stronger when viewed individually. Non-inferiority works in prostate (I still don’t love it there) largely because we can look at the broader evidence and generally discard the trial I used as an example, where it basically was a worse outcome.
And so when I look forward to the US OPC data, I anticipate that these issues and types of conversations will suddenly surface rather broadly throughout our field. I anticipate large blocks of physicians will argue quite strongly against the trial, in part due directly to the non-inferiority trial design. And now, as a reader here, you might understand that discussion a little better.
Let’s look closer at the US OPC Trial Structure:
The primary endpoint of the trial is progression free survival at 3 yrs (ref 5).
And the power / stats calcs:
The 3-year PFS rate for the IMRT arm is assumed to be 80% based on Ang et al, preliminary data from RTOG 1016, and the MD Anderson experience with OPC. A 9-percentage-point noninferiority margin will be used, similar to the one used in RTOG 1016. The corresponding HR is 1.535, based on the assumption that the time-to-event follows an exponential distribution.
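For anyone who wants to check where 1.535 comes from, here is a minimal sketch of the conversion. Under an exponential time-to-event assumption, survival at a fixed time is exp(-hazard x time), so the hazard ratio is simply the ratio of the logs of the two survival fractions. The function name is my own; the 80% and 9-point inputs come from the quoted design.

```python
from math import log


def margin_to_hazard_ratio(control_survival, margin):
    """Convert an absolute survival margin into a hazard ratio margin.

    Assumes exponential time-to-event in both arms, so at any fixed time
    HR = ln(S_experimental) / ln(S_control).
    """
    worst_acceptable = control_survival - margin
    return log(worst_acceptable) / log(control_survival)


# 80% assumed 3-year PFS for IMRT with a 9-percentage-point margin.
print(round(margin_to_hazard_ratio(0.80, 0.09), 3))  # 1.535
```

In other words, the design treats a true 3-year PFS as low as 71% as the boundary of what counts as non-inferior.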
If you think my “marketing” statement is not correct, consider this:
IF PFS returns at 72% for PBT and 80% for IMRT, that result still sits inside the pre-set 9-point margin, and if the confidence interval is tight enough the p value will be significant - meaning protons are non-inferior with respect to disease control. Statistically this is how we currently define the “same” outcomes - again, I don’t like it, but we are where we are.
But don’t kid yourself - that will be a negative trial result in our community - whether that is correct is debatable. But as you can see in the trial design document I referenced, this trial probably had input of over 100 scientists. Yet it is very likely that the broader radiation oncology community will disagree with “equivalent treatment outcomes” in many “validated” results. Hence, Marketing.
I’ll guesstimate that today any difference of more than 2% will be highly contested / criticized, even if it clearly meets the predefined hurdle. Statistically you really can’t make a judgement either way. And that is a shame.
And that is the issue with non-inferiority - in prostate we want to believe so we do. In this example, the math might ultimately say non-inferior, but there are many “positive” proton trial outcomes from this study that could be interpreted very differently from the mathematical answer.
And if it does demonstrate benefit, naysayers without access to proton therapy will be lining up to nullify any positive result in part due to conflicts of interest in our healthcare system. It would simply require too much change too quickly to accept a benefit to protons. And the “non-inferiority” aspect adds a clear path to undercut results. It is due to this avenue that I even question the trial development process due to potential conflicts of interest.
At the bottom of this article, I have included a full list of secondary toxicity outcomes for the trial. The “proton camp” does get the benefit of being able to list a ton of stuff and then point to one or two or three metrics and say - THERE! But from a science standpoint the argument is weaker than if there was a single prospective primary metric that was chosen and validated. It’s a shame really, especially when we struggle to run critical trials in the US.
When I read about the Journey From Clinical Trial Concept to Activation (ref 6), I get a sense of frustration with the process. The intent was to prove an advantage in a toxicity metric, and yet due to a variety of “shareholder” inputs, we ended with a non-inferiority design.
To give you flavor, here are quotes from the referenced article - it really is a highly recommended read:
When the concept for the U19-supported clinical trial comparing IMPT with IMRT for OPC was developed, by consensus the main outcome of interest was the cumulative incidence of late-onset grade ≥3 treatment-related toxicity (scored according to the National Cancer Institute’s Common Terminology Criteria for Adverse Events [CTCAE]) during the 2 years after completion of radiation therapy.
However, the initially proposed primary endpoint (rate of grade ≥3 treatment-associated toxicity at 2 years) was met with resistance even after numerous discussions with NRG Oncology’s oversight committee.
The major point of contention was use of an endpoint based on the CTCAE scale for the phase III portion of the trial, because of a perceived lack of objectivity
Despite the finding of a significant difference in this composite endpoint at 3 months and 1 year, the NRG Oncology reviewers maintained that this endpoint was still too subjective and that even positive results would not conclusively show the superiority of IMPT.
To say that we have conflicts of interest in the US healthcare system is a vast understatement. To me, the end result of this study being a non-inferiority trial rather than picking a clear metric and proving benefit speaks directly to those issues in our system. It appears the trial designers attempted to push for a better type of structure although, in the end, it wasn’t allowed. Now we deal with what we have.
And what we will have is a non-inferiority result. In protons. Where facilities cost, at minimum, 4x the cost of an IMRT machine. I’m not even sure what non-inferior means with that type of cost difference.
And that is why I recently wrote about the importance and the need for the data out of Europe. They just don’t seem to have near the conflicts of interest that have invaded medicine in the US. They share resources and develop national policies. In recent years across a variety of cancers and on a variety of issues, their systems seem better able to answer the scientific questions at hand in the most robust fashion.
I’ve been in a proton world for 4 years now and it seems the future is quite interesting for our field. I do think the trial will show benefit - I’m less clear on the magnitude. The data will be the final answer. But if protons demonstrate benefit, the non-inferiority aspect will contribute to the US treatment trends moving / changing at a fraction of the speed the data suggests which makes one question the “journey”.
SECONDARY OBJECTIVES FOR THE US OPC TRIAL:
I. Disease-related outcomes (2-year progression-free survival, patterns of failure, 2-year overall survival, 2-year [yr] distant metastasis free survival, and second primary cancers). (Phase III)
II. Patient Reported Outcome (PRO) measures of symptoms using MD Anderson Symptom Inventory (MDASI), MD Anderson Dysphagia Inventory (MDADI), Functional Assessment of Cancer Therapy-Head and Neck (FACT-HN), Xerostomia and Health Questionnaire (European Quality of Life 5-Dimension three level scale [EQ-5D-3L]), work status (Work Productivity and Activity Impairment: Specific Health Problem [WPAI: SHP]). (Phase III)
III. Physician reported toxicity using Common Terminology Criteria for Adverse Events (CTCAE)-4.0. (Phase III)
IV. Quality-Adjusted-Life-Years (QALY) comparison between IMPT and IMRT. (Phase III)
V. Cost-benefit economic analysis of treatment. (Phase III)
VI. To determine whether specific molecular profiles are associated with overall or progression-free survival. (Phase III)
VII. To investigate associations between changes in serum biomarkers or human papillomavirus (HPV)-specific cellular immune responses measured at baseline and three months with overall or progression-free survival. (Phase III)
VIII. To bank peripheral blood at time of enrollment, weeks 2, 4, and 6 during treatment and during follow up visits for 2 years to explore the ability of circulating markers to predict outcome. (Phase III)
IX. To bank head and neck tissues to explore the ability of tissue-based markers to predict outcome. (Phase III)
X. To bank peripheral blood and tissues for future interrogations. (Phase III)
XI. Acute side effects of radiation therapy will be assessed. (Phase III)
REFERENCES:
1. Randomized Phase III Noninferiority Study Comparing Two Radiotherapy Fractionation Schedules in Patients With Low-Risk Prostate Cancer
https://ascopubs.org/doi/10.1200/jco.2016.67.0448
2. Ultra-hypofractionated versus conventionally fractionated radiotherapy for prostate cancer: 5-year outcomes of the HYPO-RT-PC randomised, non-inferiority, phase 3 trial
https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(19)31131-6/fulltext
3. Conventional versus hypofractionated high-dose intensity-modulated radiotherapy for prostate cancer: 5-year outcomes of the randomised, non-inferiority, phase 3 CHHiP trial
https://www.thelancet.com/journals/lanonc/article/PIIS1470-2045(16)30102-4/fulltext
4. Non-inferiority statistics and equivalence studies
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7808096/
5. Intensity-Modulated Proton Beam Therapy or Intensity-Modulated Photon Therapy in Treating Patients With Stage III-IVB Oropharyngeal Cancer
https://clinicaltrials.gov/ct2/show/NCT01893307
6. Comparing Intensity-Modulated Proton Therapy With Intensity-Modulated Photon Therapy for Oropharyngeal Cancer: The Journey From Clinical Trial Concept to Activation
https://doi.org/10.1016/j.semradonc.2017.12.002