Tech giant OpenAI has touted its AI-powered transcription tool, Whisper, as having “near human-level robustness and accuracy.”
But Whisper has a major flaw: It is prone to making up chunks of text, or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. These experts said some of the made-up text — known in the industry as hallucinations — can include racist commentary, violent rhetoric and even imagined medical treatments.
Experts said such fabrications are problematic because Whisper is used in many industries around the world to translate and transcribe interviews, generate text in popular consumer technologies and create subtitles for videos.
What’s more worrying, they say, is the rush by medical centers to adopt Whisper-based tools to transcribe patients’ consultations with doctors, despite OpenAI’s warnings that the tool should not be used in “high-risk” areas.
It’s difficult to determine the full extent of the problem, but researchers and engineers say they’ve encountered Whisper’s hallucinations frequently in the course of their work. A University of Michigan researcher conducting a study of public meetings, for example, said he found hallucinations in 8 out of every 10 audio transcripts he inspected, before he started trying to improve the model.
A machine learning engineer said he initially discovered hallucinations in about half of the more than 100 hours of Whisper transcripts he analyzed. A third developer reported finding hallucinations in almost every one of the 26,000 transcripts he created with Whisper.
Problems persist even in short, well-recorded audio samples. A recent study by computer scientists discovered 187 hallucinations in more than 13,000 clear audio clips they examined.
This trend would result in tens of thousands of faulty transcriptions across millions of recordings, the researchers said.
Such errors could have “very serious consequences,” especially in hospital settings, said Alondra Nelson, who led the White House Office of Science and Technology Policy for the Biden administration until last year.
“No one wants a misdiagnosis,” said Nelson, a professor at the Institute for Advanced Study in Princeton, New Jersey. “There should be a higher bar.”
Whisper is also used to create closed captioning for the deaf and hard of hearing – a population at particular risk for faulty transcriptions. That’s because deaf and hard-of-hearing people have no way of identifying fabrications “hidden among all this other text,” said Christian Vogler, who is deaf and directs Gallaudet University’s Technology Access Program.
OpenAI urged to fix the problem
The prevalence of such hallucinations has led experts, advocates, and former OpenAI employees to call on the federal government to consider regulating AI. At a minimum, they said, OpenAI must fix the flaw.
“It seems solvable if the company is willing to prioritize it,” said William Saunders, a San Francisco-based research engineer who left OpenAI in February over concerns about the company’s direction. “It’s problematic if you put this out there and people are overconfident about what it can do and integrate it into all these other systems.”
An OpenAI spokesperson said the company continually studies how to reduce hallucinations and appreciates the researchers’ findings, adding that OpenAI incorporates feedback into model updates.
While most developers assume transcription tools misspell words or make other mistakes, engineers and researchers said they had never seen another AI-powered transcription tool hallucinate as much as Whisper.
Whisper hallucinations
The tool is integrated into some versions of OpenAI’s flagship chatbot, ChatGPT, and is a built-in offering in Oracle’s and Microsoft’s cloud computing platforms, which serve thousands of businesses around the world. It is also used to transcribe and translate text into multiple languages.
In the last month alone, a recent version of Whisper has been downloaded more than 4.2 million times from the open source AI platform HuggingFace. Sanchit Gandhi, a machine learning engineer, said Whisper is the most popular open source speech recognition model and is integrated into everything from call centers to voice assistants.
Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short excerpts obtained from TalkBank, a research repository hosted at Carnegie Mellon University. They determined that nearly 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.
In one example they discovered, a speaker said: “He, the boy, was going to, I’m not sure exactly, take the umbrella.”
But the transcription software added: “He took a big piece of the cross, a very small piece … I’m sure he didn’t have a terrorist knife, so he killed a number of people.”
A commenter in another recording described “two other girls and a lady.” Whisper made up an additional comment about race, adding “two other girls and a lady, uh, who were black.”
In a third transcript, Whisper invented a nonexistent drug called “hyperactivated antibiotics.”
Researchers aren’t sure why Whisper and similar tools hallucinate, but software developers have said the hallucinations tend to occur amid pauses, background noises or music.
OpenAI has recommended in its online publications against using Whisper in “decision-making contexts, where lapses in accuracy can lead to pronounced flaws in the results.”
Transcription of doctor appointments
This warning has not stopped hospitals or medical centers from using speech-to-text models, including Whisper, to transcribe what is said during doctor visits so that medical providers can spend less time taking notes or writing reports.
More than 30,000 clinicians and 40 health systems, including the Mankato Clinic in Minnesota and Children’s Hospital Los Angeles, have started using a Whisper-based tool developed by Nabla, which has offices in France and the United States.
That tool was fine-tuned on medical language to transcribe and summarize patient interactions, said Martin Raison, Nabla’s chief technology officer.
Company officials said they were aware that Whisper could be hallucinating and were mitigating the problem.
It’s impossible to compare Nabla’s AI-generated transcript to the original recording because Nabla’s tool erases the original audio for “data security reasons,” Raison said.
Nabla said the tool has been used to transcribe around 7 million medical visits.
Saunders, the former OpenAI engineer, said erasing the original audio could be worrisome if transcripts aren’t double-checked or clinicians can’t access the recording to verify they are correct.
“You can’t catch errors if you remove the ground truth,” he said.
Nabla said no model is perfect and theirs currently requires medical providers to quickly edit and approve transcribed notes, but that could change.
Privacy issues
Because patients’ meetings with their doctors are confidential, it’s unclear how AI-generated transcripts affect them.
A California state lawmaker, Rebecca Bauer-Kahan, said she took one of her children to the doctor earlier this year and refused to sign a form provided by the health network that asked her permission to share audio from the consultation with vendors including Microsoft Azure, the cloud computing system run by OpenAI’s largest investor. Bauer-Kahan didn’t want such intimate medical conversations shared with tech companies, she said.
“The release was very specific that for-profit companies would be allowed to have this,” said Bauer-Kahan, a Democrat who represents part of suburban San Francisco in the state Assembly. “I said to myself, ‘Absolutely not.’”
John Muir Health spokesman Ben Drew said the health system complies with state and federal privacy laws.
Schellmann reported from New York.
This story was produced in partnership with the Pulitzer Center’s AI Accountability Network, which also partially supported the Whisper academic study.