As artificial intelligence advances, the demand for huge datasets is growing too, and some companies appear to be sourcing that data illegally. As reported by Proof News, companies such as Apple, Nvidia, Anthropic, and Salesforce used subtitles from YouTube videos to train generative AI models.
The dataset, called YouTube Subtitles, reportedly included video transcripts from educational and online learning channels such as Khan Academy, MIT, and Harvard. The Wall Street Journal, NPR, and the BBC also had their videos used to train AI, as did “The Late Show With Stephen Colbert,” “Last Week Tonight With John Oliver,” and “Jimmy Kimmel Live.”
The illegal data market
According to Proof News, Apple used data from the Wall Street Journal, NPR, and the BBC to train its OpenELM model, released in April just before its WWDC event. Bloomberg, Databricks, and Anthropic also used the dataset to train AI models. Salesforce used the Pile to build an AI model it said was for “academic and research” purposes, but later released it for public use in 2022. The model has been downloaded over 85,000 times.
But what is the ‘Pile’ and why does its misuse matter? YouTube Subtitles is part of a larger compilation called the Pile, which was released by the nonprofit EleutherAI. The Pile also includes material from Wikipedia and the European Parliament, and it is generally accessible to anyone with internet access and the know-how to find it. However, its misuse can expose sensitive or personal data. The creation of the dataset may also have violated YouTube’s terms of service, which prohibit using “automated means” to access the platform’s videos.
A spokesperson for Anthropic confirmed that the Pile had been used to train Claude, Anthropic’s generative AI assistant. Representatives for Nvidia, Apple, Bloomberg, and Databricks declined to comment on their use of the Pile. EleutherAI did not respond to Proof News’ request for comment.
The safety nets
A case against EleutherAI was voluntarily dismissed by the plaintiffs. The Pile has since been removed from its official download site, but it is still available on file-sharing services.
Early reports suggested that YouTube Subtitles, which was published in 2020, also contained subtitles from more than 12,000 videos that have since been deleted from YouTube.
