AI enters the Prostate Landscape: Part 2
What are the possible issues and how could the analysis be strengthened?
Last week, we looked at a new paper: Artificial Intelligence Predictive Model for Hormone Therapy Use in Prostate Cancer
Today, in Part 2, we attempt to look at its impact, examine use cases from my clinical perspective, and consider where and how additional data might strengthen the case for this approach. You see, I view the recent paper as a setup to the release of a test that one can order. That is conjecture, but that’s my guess. The associated 2022 paper (which we’ll also review) is quite interesting to evaluate, so we’ll use it to compare and contrast their approaches.
As reviewed last week, I’m still trying to grapple with where this fits. What role might it have?
As always, opinion of one. If you see something different or think I’m off base (which certainly will happen at some point) please comment below or email. Always looking to learn / get better.
Can AI make ADT recommendations more precisely?
This current paper or “second version release” (my words) of this approach focuses on ADT.
The model used baseline data to provide a binary output: a given patient will likely benefit from ADT, or not.
In a recent study, a multimodal artificial intelligence (MMAI) system leveraging digital histopathology and clinical data from five NRG Oncology phase 3 clinical trials, termed the MMAI Prostate Prognostic Model, was used to develop and validate prognostic models that consistently outperformed NCCN risk groups to determine which men with localized prostate cancer would benefit from ADT.17
In this study, we extend this approach by adapting the MMAI Prostate Prognostic Model to develop and test a predictive model on the basis of “deep learning” that has the potential to be used to identify which patients would benefit from ADT.
So this new publication seeks to tweak the prior approach, focusing on the question of ADT recommendations. And so I went to my database for context.
My Context / My Data:
Here is a “live look” at my database and we’ll focus on unfavorable intermediate risk (UIR) - the favorable group is simply 2 men on ADT (given before they saw me) and both continue to do well at early time points:
Aug 9th, 2023: ~24% of UIR men on ADT; 76% NOT on ADT (over 50 men). For those not on ADT in the UIR group:
- At 1 yr post-treatment: mean PSA 0.87, median 0.53
- At 2 yrs: mean PSA 0.55, median 0.37
- 7250 cGy(RBE) / 29 fractions continues to outperform 7920 cGy(RBE) / 44 fractions
- Mean follow-up: 20 months
Follow-up is short and certainly things can shift, but this appears to be on track for well above 90% disease free survival based on my review of historical PSA kinetic data. For comparison, this is consistent with the SBRT consortium data, where the median PSA was a bit higher at ~0.95 at 1 year and essentially the same at ~0.37 at 2 years - a population that ultimately had a 7% total failure rate.
To me, this experience illustrates that clinicians have a very reasonable chance to eliminate ADT in the majority of intermediate risk men - even unfavorable intermediate risk men - simply by integrating good clinical risk markers and giving good dose (I infrequently use Decipher in IR disease). And honestly, now there is clear support for doing just that. In fact, now I have a “validated” metric demonstrating NCCN oversteps on ADT in unfavorable risk disease two out of three times.
And previously on this site, we’ve documented that prostate cancer specific mortality only begins to show a visible difference (at least at current trial patient numbers) when PSA failure rates exceed roughly 20% - likely closer to 30% - in the first 5 years. Current treatments generate a fraction of those events. It will be much harder, magnitudes more difficult, to see this effect in current treatment populations. Simple math.
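To make that “simple math” concrete, here is a rough power-calculation sketch showing how the required trial size balloons as the event rate falls. Every number below is my illustrative assumption - not a figure from this paper or from 9408:

```python
# A rough sketch of the "simple math": per-arm trial size needed to detect a
# mortality difference as the underlying event rate falls. All numbers here
# are my illustrative assumptions, not figures from the paper or from 9408.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def n_per_arm(control_rate, relative_reduction, alpha=0.05, power=0.8):
    """Approximate per-arm N to detect a difference in event proportions."""
    treated_rate = control_rate * (1 - relative_reduction)
    effect = proportion_effectsize(control_rate, treated_rate)
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                        power=power, alternative="two-sided")

# Historical-era failure rates (~20-30%) vs a modern, well-dosed cohort (~5%),
# each with a hypothetical 30% relative reduction from ADT:
for rate in (0.30, 0.20, 0.05):
    print(f"control event rate {rate:.0%}: ~{n_per_arm(rate, 0.30):,.0f} per arm")
```

With these assumed inputs, the well-dosed cohort needs several times the patients per arm to see the same relative effect - which is the whole point.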
I’m not arguing my database is unique or magical. I think you can get here via a number of paths. I have been able to get there with protons, and it was pretty straightforward. But SIB and SBRT, as I discussed in a prior post HERE (Can we De-escalate better), also appear to be very viable paths to achieving great kinetics - albeit I believe the SBRT path is a narrower road to travel, with more nuance and risk.
What would traditional approaches recommend?
Low risk - 35% of men in the study. Would anyone recommend ADT? No.
Intermediate risk - 55% of men in the study, with ~42% having favorable risk disease (so ~23% of the entire cohort in the favorable category). Some might use Decipher to parse this subset. I ran a quick Twitter poll and most are not using additional tests to escalate treatment intensity in favorable risk disease - I agree with that.
So about 32% of this trial is unfavorable intermediate risk disease and 10% is high risk. From a top-down level, then, about 40-45% of this patient population is likely to be considered broadly for ADT usage. In my clinical practice patterns, I’d treat about 18% of the men in RTOG 9408 (remember, the AI model recommends ADT for ~33% of RTOG 9408 patients).
(And yes there are real issues with stage migration when you look back 25 years, but I think that as well speaks to the need for validation in a far more current dataset.)
We mentioned last week the number needed to treat of 10 for the test across the entire 9408 cohort, based on a ~10% improvement in DM rate in the model’s “ADT required” prediction group. I think you can see from the practice patterns above that it is potentially much more complicated than that, depending upon your own patterns and the intended use application - at least until we see more data.
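To put rough numbers on that interplay, here is a back-of-the-envelope sketch using the cohort shares above and the ~10% absolute DM reduction. The “tests per changed decision” framing at the end is purely my own illustration, not an analysis from the paper:

```python
# Back-of-the-envelope numbers: cohort composition from the paper, NNT from
# the reported ~10% absolute DM reduction, and a crude "tests per changed
# decision" figure based on my own practice pattern (my illustration only).
low, intermediate, high = 0.35, 0.55, 0.10
fav_ir = intermediate * 0.42          # ~23% of the whole cohort
unfav_ir = intermediate * 0.58        # ~32% of the whole cohort
print(f"FIR {fav_ir:.0%}, UIR {unfav_ir:.0%}, UIR + HR {unfav_ir + high:.0%}")

arr = 0.10                            # absolute risk reduction in DM
print(f"NNT in the model-positive group: {1 / arr:.0f}")

# The test only changes management where it disagrees with your baseline.
# If the model says "treat" for ~33% and I would already treat ~18%, the
# shifted decisions are at most a small slice of all patients tested:
model_treat, my_treat = 0.33, 0.18
print(f"tests per changed decision: ~{1 / (model_treat - my_treat):.0f}")
```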
I’m not arguing any particular view is correct, just pointing out perspectives - and those perspectives translate into how often a potential test might impact your clinical decisions. And yes, one can argue that we don’t have great predictive markers for that hazard rate effect seen with ADT across different doses and different grades, but consider this: no one treats low risk with ADT and most everyone uses it in high risk - and for good reason. So we have some broad predictors of outcome, semantics aside. (And if this test changes that, then wow!!)
And so with that context, here is what I personally would have liked to have seen to help justify that level of shift in my clinical practice:
What would I like to see next?
Risk Group and Gleason KM Curves
When viewed in the context of my practice patterns, I read the paper and immediately came back to the kicker in the trial data - it didn’t correlate with risk group or Gleason score - it appears to measure something different. Here is a subset of that table from the data supplement.
Not what one would guess - at least, not what I would have guessed. You’d expect it to correlate better with Gleason score or risk group, but it appears to break down quite evenly. In other words, it appears to be rather independent of these factors, yet we don’t get much other data on this important point - at least none that I can see.
To me, the obvious next missing piece is a set of KM curves for low risk disease - overall, model-negative, and model-positive - showing, as they did for the entire cohort, the effect on distant metastasis rate and prostate cancer specific mortality. And then the same, repeated for intermediate and high risk.
Beyond that, the same type of structured analysis performed for Gleason score.
My guess is that while these would still parse the cohort, it will be more of a blended picture of hazard rate vs. absolute benefit, where the number needed to treat shows some variance across subgroups. But I would love to see more evidence of this independence in prediction - that is where much of the value would lie.
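Mechanically, the subgroup curves described above are straightforward to produce if one has the patient-level data. Here is a minimal sketch - all column names (risk_group, model_positive, time_to_dm, dm_event) are my hypothetical placeholders, not the authors’ variables:

```python
# A sketch of the subgroup analysis described above: within each NCCN risk
# group, KM curves for model-positive vs model-negative patients.
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

df = pd.read_csv("trial_patients.csv")  # hypothetical per-patient export

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, risk in zip(axes, ["low", "intermediate", "high"]):
    sub = df[df["risk_group"] == risk]
    for flag, grp in sub.groupby("model_positive"):
        kmf = KaplanMeierFitter()
        kmf.fit(grp["time_to_dm"], event_observed=grp["dm_event"],
                label="model positive" if flag else "model negative")
        kmf.plot_survival_function(ax=ax)
    ax.set_title(f"{risk} risk: distant metastasis")
plt.tight_layout()
plt.show()
```

The same loop, swapped to prostate cancer specific mortality and then to Gleason strata, would cover the rest of the structured analysis.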
And while this apparent independence was presented as a critical finding, to me it needs to be judged quite conservatively at this time. A primary reason stated by the authors for selecting 9408 was to focus on the intermediate risk cohort, which clearly lessens its value for high risk cohorts (only 9.7% of the validation trial had high risk disease).
then validated using data from NRG Oncology/Radiation Therapy Oncology Group (RTOG) 9408, a clinical trial that randomly assigned men to treatment with radiotherapy plus or minus 4 months of ADT; this trial consisted mostly of men with intermediate-risk prostate cancer, defined as a Gleason score of 7 or a Gleason score of 6 or less with a prostate-specific antigen (PSA) of 10 to 20 ng/ml or clinical stage T2b and not high risk
Proof that it can’t be simplified
Beyond those additional analyses, I’d like to see two other comparisons illustrating the benefit / value of the model’s complexity:
The model is 87% histology based - slides plus Gleason components. How much does the remaining 13% add? This model attempts to give a global “total” assessment, but it doesn’t consider items like cores, MRI findings, pre-treatment PSA kinetics, or other factors we might learn are important in the future.
Along these lines, I’d like to see just the AI slide read of the data removing human Gleason scoring information (ie use just the 37.3% image / slide computer analysis). If this performed to a similar degree one could argue that there is no need for MD review of the slides.
Each of these steps - proving it cannot be further simplified - speaks to the value it provides. Perhaps the value lies in the magical AI mixture, or perhaps it largely lies within the image review. Based on what I’ve seen, I can’t answer that.
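In form, the two comparisons above are simple ablation studies: hold the endpoint fixed, drop feature groups, and compare discrimination. A minimal sketch - feature lists, column names, and the Cox fit are all my hypothetical stand-ins for whatever refit the authors would actually perform:

```python
# Ablation sketch: compare C-index across shrinking feature sets.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

df = pd.read_csv("model_features.csv")  # hypothetical; numeric encodings assumed

feature_sets = {
    "full: image + Gleason + clinical": ["image_score", "gleason", "psa", "t_stage", "age"],
    "histology only: image + Gleason": ["image_score", "gleason"],
    "AI slide read only": ["image_score"],
}

for name, cols in feature_sets.items():
    cph = CoxPHFitter()
    cph.fit(df[cols + ["time_to_dm", "dm_event"]],
            duration_col="time_to_dm", event_col="dm_event")
    # concordance_index expects higher scores = longer survival, so negate
    c = concordance_index(df["time_to_dm"],
                          -cph.predict_partial_hazard(df[cols]),
                          df["dm_event"])
    print(f"{name}: C-index {c:.3f}")
```

If the “AI slide read only” row performed comparably to the full model, that would directly support the pathology-replacement argument below.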
To me, so many great comparisons are available to demonstrate proof of concept and help define value.
(if you agree, like or comment below. If you disagree or these are buried in another reference, please comment.)
A Look Backwards: Initial Publication of the MMAI Model
In big, simple terms, this is the second application of this AI approach to THIS dataset. The first application was released in 2022 (Reference 17 in the current paper).
I’ll paraphrase what I think I read. It uses 5 trials to build a deep learning model aimed at improving risk group stratification - ie, can a deep learning model do better than traditional risk stratification (LR, FIR, UIR, HR, VHR)?
Patients were taken from randomized trials 9202, 9408, 9413, 9910, and 0126. A cohort of 80% of this total patient group was utilized for model creation and 20% was reserved for validation - a blended mix, but one that importantly includes 9408 in both the creation and the validation of the model.
And the answer was YES! - the deep learning multi-modal model does in fact do a better job than traditional risk stratification. And across a number of clinically important outcomes - including overall survival. Kudos!!
An impressive result demonstrating broad improvement!
The current paper does it differently - well, kind of. The exact same 5 randomized trials are used, but this time, rather than an 80% / 20% split across all datasets, they simply reserve one entire trial for validation - 9408.
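In code terms, the structural difference between the two validation schemes is easy to see. A sketch - the pooled file and “trial” column are hypothetical stand-ins, not the authors’ pipeline:

```python
# Two validation schemes side by side.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("pooled_trials.csv")  # hypothetical: 9202, 9408, 9413, 9910, 0126

# Paper 1 (2022): a random 80/20 split of the pooled cohort, so 9408
# patients land on BOTH sides of the split.
train_2022, val_2022 = train_test_split(df, test_size=0.20, random_state=0)

# Paper 2 (2023): reserve one entire trial for validation.
train_2023 = df[df["trial"] != "9408"]
val_2023 = df[df["trial"] == "9408"]

print(val_2022["trial"].value_counts())  # a mix of all five trials, incl. 9408
print(val_2023["trial"].unique())        # ["9408"] only
```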
And as described above, they sought to tweak the model - attempting to optimize the deep learning model to answer a clinical question - is ADT required based on predicting time to development of distant metastatic disease?
A Question of “Validation”?
My main question then becomes, is this in fact really a “validation” as described? I’m less certain it is. Here’s why.
From paper 1, in 2022, we KNOW that the AI runs successfully on this dataset, which includes patients from RTOG 9408 - that is the dataset where the initial premise was demonstrated, ie paper 1. And secondly, we know that the 20% reserved for validation, which included 9408 patients, helped validate the method as it demonstrated benefit. In a real way, to me, the first paper ensures consistency / proof of concept in the dataset.
To me, therefore, a far cleaner and much purer “validation” would be to obtain a dataset NOT in the original deep learning AI cohort - and separate from the “adapted model for ADT” - to “validate” the study.
Here, in a real way, the “deck” of patients was simply reshuffled and the AI tweaked. But tweaked to a known previously demonstrated outcome metric. Look at the 2022 paper results - a similar AI model in this same dataset was 13.85% better than risk groups at predicting distant metastatic disease. And we know that risk stratification predicts distant metastatic disease. So while the model was “rebuilt”, we knew (at least on some level) that 9408 data worked. Prostate cancer is that consistent.
Further, last week we discussed the clearly visible difference in the prostate cancer specific mortality curves, which, to me, demonstrates confounding bias in the 9408 dataset in favor of ADT. And then remember that the current model (2023 paper) did NOT work for metastasis free survival or for overall survival.
Below are curves from a secondary analysis of 9408 data (Ref).
And they seem odd. Prostate cancer specific mortality (PCSM) does what it shouldn’t do - it separates immediately. Meanwhile, all-cause mortality stays identical for about 3 years (which, given the early PCSM separation, implies a clear difference in NON-prostate-cancer mortality) before separating at 3 years - a time when the PCSM curves do NOT appear to be separating further. These are graphs I simply would not hang a hat upon. Perhaps trending correctly, but likely, at a minimum, overstating benefits in any analysis. It is 25 year old data with clear confounding on some level - at least from my perspective.
This issue is, on some level, included in the discussion:
A concern with any model is the possibility of overfitting and failure to validate. This cannot be overstated, and independent validation remains necessary to prove the performance of a model. In the specific case of predictive models, which aim to identify those patients who derive greater or lesser relative benefit, this almost always should be performed within the context of a randomized trial of the treatment of interest to avoid confounding and bias between groups.
And there is an entire paragraph dedicated to discussing the difference between DM prediction and metastasis free survival - I encourage you to read it. Here is just one part of that full paragraph.
However, they are suboptimal end points for development of prostate cancer–specific predictive models for localized disease. This is because 78% of deaths in the validation cohort were not from prostate cancer, and only 12% of events in the MFS end point were from metastatic events. Thus, the strongest prediction models for MFS and overall survival would be driven by variables associated with death from nonprostate cancer causes (i.e., comorbid conditions).
And while these speak to some of the issues, they really do not speak to the reshuffling of a known deck - a deck where a functional result has already been demonstrated. Nor do they speak to the unknown confounders in the 9408 dataset that show a pretty clear and significant difference in prostate cancer specific mortality within months of diagnosis - a finding I believe is completely inconsistent with the natural history of intermediate risk prostate cancer treated with definitive radiation.
Statisticians? Feel free to comment below.
Is this Yes or No Analysis “Predictive” as written?
Here I wonder if I’m missing something, but I’ll ask the questions and perhaps someone smarter can step in and help.
Below is a nice description of prognostic vs. predictive from the authors via an X-torial. Their words:
Prognostic: Relative benefit (HR) of ADT is about the same, absolute benefit changes
Predictive: Relative benefit different by biomarker score, absolute benefit changes
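One standard way to formalize that distinction - my framing, not the authors’ - is a treatment-by-biomarker interaction term in a Cox model:

```latex
% Cox model with treatment T, biomarker B, and their interaction:
\[
  h(t \mid T, B) \;=\; h_0(t)\,\exp\!\big(\beta_1 T + \beta_2 B + \beta_3\, TB\big)
\]
% Prognostic:  beta_3 = 0   -> the treatment hazard ratio exp(beta_1) is the
%              same at every biomarker value; only absolute risk moves with B.
% Predictive:  beta_3 != 0  -> the treatment hazard ratio exp(beta_1 + beta_3 B)
%              changes with the biomarker, ie relative benefit differs by score.
```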
As analyzed and presented, this paper divides the distribution into two groups - a YES group and a NO group. But it is labeled as “predictive” - ie, the relative benefit changes - both in the paper and in the post-release descriptions.
Honestly, I don’t understand how the model as presented - positive or negative, just 2 values - can be “predictive”. If it had been presented as 10 bins, and then demonstrated 10 different levels of separation from low to high (mentally picture 10 KM curves from high model prediction score to low), that would have been predictive. But as presented, I see a hazard rate: risk of metastatic disease in “needs ADT” / risk of metastatic disease in “doesn’t need ADT”.
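To make the 10-bin version concrete, here is a sketch of what I’d want plotted - hypothetical score and column names; ideally each bin would be further split by treatment arm so we could watch the hazard ratio move:

```python
# Bin the continuous model score into deciles and draw one KM curve per bin.
# A single yes/no threshold collapses exactly this kind of gradient.
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

df = pd.read_csv("validation_cohort.csv")  # hypothetical export
df["decile"] = pd.qcut(df["model_score"], q=10, labels=False)

fig, ax = plt.subplots()
for decile, grp in df.groupby("decile"):
    kmf = KaplanMeierFitter()
    kmf.fit(grp["time_to_dm"], event_observed=grp["dm_event"],
            label=f"decile {decile + 1}")
    kmf.plot_survival_function(ax=ax, ci_show=False)
ax.set_title("Distant metastasis by model-score decile")
plt.show()
```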
And Finally, the Changing State of Salvage:
At the time of 9408, salvage was simply ADT forever. Today we have many options, and over the next decade they will likely multiply. With good salvage approaches, there is less need to assign all patients upfront toxicity versus the more conservative approach of simply doing good salvage treatment. Yes, one or two percent might truly need ADT, but at what cost? I didn’t even dive into the fact that metastasis free survival trended against ADT even when the model predicted benefit. After all, what are the goals? Is this only for unfavorable intermediate cases, or is it being presented to parse patients from low risk to high risk? Offer easy salvage to the few who fail, or give ADT to many more for less salvage? What is the most appropriate “endpoint”? Not easy to define from my perspective. But I think this speaks to the difficulty of moving a number needed to treat analysis from modeling into the clinic, and that is why I think it is important to consider.
My Current View:
As of today, I wonder if this isn’t potentially more of a pathology replacement approach than a new clinical test to be added at the end of the day. I mean, I guess at this point, it tells me to consider T-stage, age, and PSA a little bit. Ok, mental note filed. But I already do that. And beyond that, I look at cores, consider the MRI findings, and look not just at the absolute PSA level but at the pattern of the PSA and PSA kinetics leading up to diagnosis.
But consider that point:
A possible replacement for Gleason Score via computer histology interpretation.
That is an amazing statement. And I wonder based on the approach in this publication if that isn’t where they should focus. From my perspective, paper 2 is actually one step farther from that approach than paper 1. And that is why I’d like to see more breakdowns and likely - as they imply - more validation on current / modern patients.
The authors present the limitations and speak to them directly; I have simply tried to add a further level of detail.
A concern with any model is the possibility of overfitting and failure to validate. This cannot be overstated, and independent validation remains necessary to prove the performance of a model.
Back to my clinic:
Using this tool across the spectrum of intermediate risk disease - ie, consistent with the premise for this approach - would increase my ADT utilization in FIR to 1 in 3 men from essentially 0%, and in UIR an additional 10-15% of men would be treated. Further, it assumes the benefit would be the same in a cohort with a fraction of the failures. I think that needs to be demonstrated.
If you come from one of two scenarios, you might see this from a very different perspective and I appreciate that. If you can’t / aren’t achieving strong PSA kinetics with your approach, then evaluating a historically lower risk cohort (like FIR) for treatment intensification makes sense. If you recommend short term ADT for your entire unfavorable risk cohort (per NCCN), then this tool might assist you in reducing your ADT usage to 1/3rd in that subset of patients.
Me? I think you can get there other ways - with dose and careful outcome measurements in a high volume setting, integrating features that speak to risk (including, among others: MRI findings, core data, pre-treatment PSA kinetics). In fact, in my clinic this would lead to higher ADT usage (nearly doubling it) when, on the contrary, I’m looking for ways to reduce that extra treatment toxicity in close coordination with my patients after we discuss risks and benefits. And here on this Substack, I have laid out why and how I believe one can get there.
In Summary:
Any work that offers this many questions and makes one ponder what it might bring down the road for cancer patients is potentially strong, valuable work. From my perspective today, I judge the 2022 work and particularly the component centering on possible replacement of the reading of slides to be the more foundational work. I’d like to see the additional comparisons and deeper looks into this model to determine its benefit, if any, over the far simpler substitution of Gleason scoring.
Likely this type of “pathology substitution” would be much harder in the marketplace - patterns of care perhaps argue for something akin to a better Decipher-type test. This is exactly where I see this angling in the later publication. I understand that the test would likely include a “number” for the AI predicted risk and then a simplified YES or NO to help answer an ADT question (think Oncotype or Decipher). But, to me, the data should first demonstrate the clearly predictive nature - ie, 10 different curves of risk outcomes - and only later present the simplified view of an ADT recommendation. I don’t believe the former was clearly presented in either paper.
Kudos to the authors for the work! I hope our industry will continue to push dose and improve outcomes via radiotherapy alone, making this type of approach needed in as limited a patient population as possible. I believe we have two levers: this type of approach to define what lies in the “black box”, and improved radiation dose / approaches that more simply shrink that box. I tend to emphasize the latter.
In the interim, I think this data pretty clearly supports dropping ADT in many unfavorable risk patients. Certainly I think this work should be used to remove the uniform recommendation for ADT in UIR patients in the NCCN Guidelines - at the least, changing the + to a +/-. (Which again speaks to the strength of the work - using it to help improve standard of care guidelines.)
I’d be happy to enroll patients in a randomized trial for UIR and HR patients to validate this tool with the following caveats. For the standard arm I think facilities that have shown good results avoiding consistent ADT usage in intermediate risk disease should be the target institutions. And secondly, the standard arm should be allowed to use clinical discretion in the use of ADT in the trial - not strictly protocol driven - thereby demonstrating the extra benefit in the experimental arm (assuming this is an orderable test format). That trial would answer a ton from my perspective and more clearly define a number needed to treat analysis.
In the end, I land somewhere between a very minimal use case and a histology replacement / improvement approach - a laughably wide spectrum. And so ultimately I land at needing to see more data, for which I have tried to outline both the approach and rationale above.
I started this two part series encouraging people to read the article first and make their own assessment. If you did, I’d be interested in hearing where your assessment differs. Feel free to comment or reach out. It is a complex topic, and one likely seen from a number of perspectives. This took me weeks of thought, and it is very possible that with more views and different perspectives, my thoughts will evolve and change - I’m happy to consider additional views.
Addendum: I realized this is LIVE today, apparently based on these data - both trials are highlighted on the site.
Company site is at: Artera.ai. At least now, you can read that site with more context than most. (I’m honestly surprised this is this far along on the commercial side based on this data - if there is more I missed, please post below).
As always, thanks for following along as we push for better. I look forward to seeing more data out of this project. Next week, we’ll turn back to the business side of things and look at the new ASTRO ROCR model. Until then.
Click here to share the whole site. One click for nearly 50 articles - what a bargain! And it is completely free!!