Wednesday, April 03, 2019

How IBM Watson Overpromised and Underdelivered on AI Health Care

My friend Phil Shaffer, a fellow retired nuclear radiologist, is an avid poster on Aunt Minnie. His AM post today about AI in general and Watson in particular is worthy of a wider audience, and here you are. It is based on an article in IEEE Spectrum: How IBM Watson Overpromised and Underdelivered on AI Health Care. This is a cautionary tale for all who have anything to do with AI... If IBM stumbled in this venue, if IBM could fall victim to hype and hubris...
  
Well, we all knew that. Big hype, zero output.

I wouldn't bother to post this non-news if it were not for the other questions it brings up.

IBM’s bold attempt to revolutionize health care began in 2011. The day after Watson thoroughly defeated two human champions in the game of Jeopardy!, IBM announced a new career path for its AI quiz-show winner: It would become an AI doctor. IBM would take the breakthrough technology it showed off on television—mainly, the ability to understand natural language—and apply it to medicine. Watson’s first commercial offerings for health care would be available in 18 to 24 months, the company promised.

In fact, the projects that IBM announced that first day did not yield commercial products. In the eight years since, IBM has trumpeted many more high-profile efforts to develop AI-powered medical technology—many of which have fizzled, and a few of which have failed spectacularly. The company spent billions on acquisitions to bolster its internal efforts, but insiders say the acquired companies haven’t yet contributed much. And the products that have emerged from IBM’s Watson Health division are nothing like the brilliant AI doctor that was once envisioned: They’re more like AI assistants that can perform certain routine tasks.

In part, says one expert quoted in the article, IBM is suffering from its ambition: It was the first company to make a major push to bring AI to the clinic. But it also earned ill will and skepticism by boasting of Watson’s abilities. “They came in with marketing first, product second, and got everybody excited,” he says. “Then the rubber hit the road. This is an incredibly hard set of problems, and IBM, by being first out, has demonstrated that for everyone else.”

The diagnostic tool, for example, wasn’t brought to market because the business case wasn’t there, says Ajay Royyuru, IBM’s vice president of health care and life sciences research. “Diagnosis is not the place to go,” he says. “That’s something the experts do pretty well. It’s a hard task, and no matter how well you do it with AI, it’s not going to displace the expert practitioner.” (Not everyone agrees with Royyuru: A 2015 report on diagnostic errors from the National Academies of Sciences, Engineering, and Medicine stated that improving diagnoses represents a “moral, professional, and public health imperative.”)

In many attempted applications, Watson’s NLP struggled to make sense of medical text—as have many other AI systems. “We’re doing incredibly better with NLP than we were five years ago, yet we’re still incredibly worse than humans,” says Yoshua Bengio, a professor of computer science at the University of Montreal and a leading AI researcher. In medical text documents, Bengio says, AI systems can’t understand ambiguity and don’t pick up on subtle clues that a human doctor would notice.

Both efforts, Watson for Oncology at Memorial Sloan Kettering and the project at MD Anderson, have received strong criticism. One excoriating article about Watson for Oncology alleged that it provided useless and sometimes dangerous recommendations (IBM contests these allegations). More broadly, Mark Kris, the Memorial Sloan Kettering oncologist who led the effort to train Watson, says he has often heard the critique that the product isn’t “real AI.” And the MD Anderson project failed dramatically: A 2016 audit by the University of Texas found that the cancer center spent $62 million on the project before canceling it. A deeper look at these two projects reveals a fundamental mismatch between the promise of machine learning and the reality of medical care—between “real AI” and the requirements of a functional product for today’s doctors.

Watson learned fairly quickly how to scan articles about clinical studies and determine the basic outcomes. But it proved impossible to teach Watson to read the articles the way a doctor would. “The information that physicians extract from an article, that they use to change their care, may not be the major point of the study,” Kris says. Watson’s thinking is based on statistics, so all it can do is gather statistics about main outcomes, explains Kris. “But doctors don’t work that way.”

At MD Anderson, researchers put Watson to work on leukemia patients’ health records—and quickly discovered how tough those records were to work with. Yes, Watson had phenomenal NLP skills. But in these records, data might be missing, written down in an ambiguous way, or out of chronological order.

In a final blow to the dream of an AI superdoctor, researchers realized that Watson can’t compare a new patient with the universe of cancer patients who have come before to discover hidden patterns.

If an AI system were to base its advice on patterns it discovered in medical records—for example, that a certain type of patient does better on a certain drug—its recommendations wouldn’t be considered evidence based, the gold standard in medicine. Without the strict controls of a scientific study, such a finding would be considered only correlation, not causation.
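
The reasoning is easy to see with a toy example. Here is a short, purely illustrative simulation (mine, not from the article; the setup and numbers are invented) in which a drug has no effect at all, yet the records make it look beneficial because healthier patients are more likely to receive it:

import random

random.seed(0)

# Toy model: the drug does NOTHING, but milder (lower-severity)
# patients are more likely to receive it, so a naive comparison of
# the records makes the drug look highly effective.
patients = []
for _ in range(100_000):
    severity = random.random()              # 0 = mild, 1 = severe
    got_drug = random.random() > severity   # milder patients get the drug more often
    recovered = random.random() > severity  # recovery depends ONLY on severity
    patients.append((got_drug, recovered))

def recovery_rate(group):
    return sum(recovered for _, recovered in group) / len(group)

treated = [p for p in patients if p[0]]
untreated = [p for p in patients if not p[0]]
print(f"recovery with drug:    {recovery_rate(treated):.1%}")    # about 67%
print(f"recovery without drug: {recovery_rate(untreated):.1%}")  # about 33%

The drug appears to double the recovery rate while doing nothing at all; only a controlled trial, which breaks the link between severity and treatment, can tell this apart from a genuinely effective therapy. That is exactly why pattern-mining of records is held to the correlation-not-causation standard.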

The question this raises in my mind is: Why?

It seemed so intuitive that this would work. Why doesn't it?

One thing that happens when you try to apply computers to any problem is that you must first break the task down and understand completely how humans do it. I think what we are seeing is that there was a very incomplete understanding of how humans process this information. Starting from that naive understanding, IBM brazenly predicted success. And failed. Miserably.
  
Another important point is that much of our scientific effort is reported as statistical differences, derived from controlled experiments. But this is NOT the way that medicine works. There is another level, as Luke Oakden-Rayner has pointed out.

He points out - convincingly - that experiments are NOT clinical performance.

Medical AI today is assessed with performance testing; controlled laboratory experiments that do not reflect real-world safety.

Performance is not outcomes! Good performance in laboratory experiments rarely translates into better clinical outcomes for patients, or even better financial outcomes for healthcare systems.
Humans are probably to blame. We act differently in experiments than we do in practice, because our brains treat these situations differently.

Even fully autonomous systems interact with humans, and are not protected from these problems. We know all of this because of one of the most expensive, unintentional experiments ever undertaken. At a cost of hundreds of millions of dollars per year, the US government paid people to use previous generation AI in radiology. It failed, and possibly resulted in thousands of missed cancer diagnoses compared to best practice, because we had assumed that laboratory testing was enough.

The unintentional experiment he references is breast CAD (computer-aided detection for mammography).

He recounts how the initial studies suggested that CAD would find 20% more cancers; subsequent VERY LARGE studies, however, showed (in one case) a 20% increase in biopsies for an increase in cancers found from 4.15 per 1,000 to 4.20 per 1,000 (p = NS).
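
To put those numbers in perspective, here is some back-of-the-envelope arithmetic on the figures quoted above (my own calculation, purely illustrative):

# Figures quoted above: detection went from 4.15 to 4.20 cancers
# per 1,000 screens (not statistically significant), while biopsies
# rose by about 20%.
without_cad = 4.15 / 1000   # cancers detected per screen, no CAD
with_cad = 4.20 / 1000      # cancers detected per screen, with CAD

absolute_gain = with_cad - without_cad
relative_gain = absolute_gain / without_cad
print(f"absolute gain: {absolute_gain * 1000:.2f} extra cancers per 1,000 screens")
print(f"relative gain: {relative_gain:.1%}")                          # about 1.2%
print(f"screens per extra cancer detected: {1 / absolute_gain:,.0f}")  # about 20,000

Roughly a 1% relative improvement in detection, statistically indistinguishable from noise, bought with about 20% more biopsies.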

His diagnosis:

People are weird. It turns out that if you run an experiment with doctors being asked to review cases with CAD, they get more vigilant. If you give them CAD and make them use it clinically, they get less vigilant than if you never gave it to them in the first place.
There are a range of things going on here, but the most important is probably the laboratory effect. As several studies have shown [5, 6], when people are doing laboratory studies (i.e., controlled experiments) they behave differently than when they are treating real patients. The latter study concluded:

“Retrospective laboratory experiments may not represent either expected performance levels or inter-reader variability during clinical interpretations of the same set of mammograms”

Which really says it all.

He goes on to say that when people use computers, they overvalue the computer's input and undervalue the other evidence:
This effect has been implicated in several recent deaths in partially self-driving cars – it has been shown that even trained safety drivers are unable to remain vigilant in autonomous cars that work most of the time.

This effect has also been directly cited as a possible reason for the failure of mammography CAD. One particularly interesting study showed that using CAD resulted in worse sensitivity (less cancers picked up) when the CAD feedback contained more inaccuracies [8]. On the surface this didn’t make a lot of sense, since CAD was never meant to be used to exclude cases; it was approved to highlight additional areas of concern, and the radiologists were supposed to use their own judgement for the remainder of the image. Instead, we find that radiologists are reassured by a lack of highlighted regions (or by dismissing incorrectly highlighted regions) and become less vigilant.

I’ve heard many supporters of CAD claim that the reason for the negative results in clinical studies is that “people just aren’t using the CAD as it was intended,” which is both accurate and absurdly naive as far as defenses go. Yes, radiologists become less vigilant when they use CAD. It is not surprising, and it is not unexpected. It is inevitable and unavoidable, simply the cost that comes with working alongside humans. 

There you go. Some food for thought.  
