“It’s startling and disturbing,” says Ashique KhudaBukhsh, an assistant professor at Rochester Institute of Technology who researched the problem with collaborators Krithika Ramesh and Sumeet Kumar at the Indian School of Business in Hyderabad.
Automated captions are not available on YouTube Kids, the version of the service aimed at children. But many families use the standard version of YouTube, where the captions can be seen. Pew Research Center reported in 2020 that 80 percent of parents of children 11 or younger said their child watched YouTube content; more than 50 percent of children did so daily.
KhudaBukhsh hopes the study will draw attention to a phenomenon that he says has gotten little notice from tech companies and researchers and that he dubs “inappropriate content hallucination”—when algorithms add unsuitable material not present in the original content. Think of it as the flip side to the common observation that autocomplete on smartphones often filters adult language to a ducking annoying degree.
YouTube spokesperson Jessica Gibby says the company recommends that children under 13 use YouTube Kids, where automated captions cannot be seen. On the standard version of YouTube, she says, the feature improves accessibility. “We are continually working to improve automatic captions and reduce errors,” she says. Alafair Hall, a spokesperson for Pocket.watch, a children’s entertainment studio that publishes Ryan’s World content, says in a statement the company is “in close and immediate contact with our platform partners such as YouTube who work to update any incorrect video captions.” The operator of the Rob the Robot channel could not be reached for comment.
Inappropriate hallucinations are not unique to YouTube or video captions. One WIRED reporter found that a transcript of a phone call processed by startup Trint rendered Negar, a woman’s name of Persian origin, as a variant of the N-word, even though it sounds distinctly different to the human ear. Trint CEO Jeffrey Kofman says the service has a profanity filter that automatically redacts “a very small list of words.” The particular spelling that appeared in WIRED’s transcript was not on that list, Kofman said, but it will be added.
“The benefits of speech-to-text are undeniable, but there are blind spots in these systems that can require checks and balances,” KhudaBukhsh says.
Those blind spots can seem surprising to humans who make sense of speech in part by understanding the broader context and meaning of a person’s words. Algorithms have improved their ability to process language but still lack a capacity for fuller understanding—something that has caused problems for other companies relying on machines to process text. One startup had to revamp its adventure game after it was found to sometimes describe sexual scenarios involving minors.
Machine learning algorithms “learn” a task by processing large amounts of training data—in this case audio files and matching transcripts. KhudaBukhsh says that YouTube’s system likely inserts profanities sometimes because its training data included primarily speech by adults, and less from children. When the researchers manually checked examples of inappropriate words in captions, they often appeared with speech by children or people who appeared not to be native English speakers. Previous studies have found that transcription services from Google and other major tech companies make more errors for non-white speakers and fewer errors for standard American English, compared with regional US dialects.
Rachael Tatman, a linguist who coauthored one of those earlier studies, says a simple blocklist of words not to use on kids’ YouTube videos would address many of the worst examples found in the new research. “That there’s apparently not one is an engineering oversight,” she says.
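To make the idea of a blocklist concrete, here is a minimal sketch in Python. It is not YouTube's or Trint's code. The function name `redact_caption`, the mask string, and the two mild placeholder terms are all assumptions made for illustration. The sketch replaces any whole word that appears on the list and leaves longer words that merely contain a listed term untouched.

```python
import re

# Hypothetical placeholder terms; a real list would be curated by the platform.
BLOCKLIST = frozenset({"darn", "heck"})


def redact_caption(caption, blocklist=BLOCKLIST, mask="[__]"):
    """Return the caption with every blocklisted whole word replaced by mask."""

    def replace(match):
        word = match.group(0)
        # Compare case-insensitively so "Darn" and "darn" are both caught.
        return mask if word.lower() in blocklist else word

    # Match runs of letters and apostrophes so only whole words are checked.
    return re.sub(r"[A-Za-z']+", replace, caption)


if __name__ == "__main__":
    print(redact_caption("Oh heck, that is a Darn big heckler"))
```

A filter like this only catches exact spellings that are already on the list, which is the gap Kofman describes: a variant that is not listed passes through until someone adds it.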