Using its speech-recognition software “Whisper,” OpenAI is said to have transcribed about 1 million hours of YouTube video for use as training data for ChatGPT, without informing Google, YouTube's parent company Alphabet, or the creators of the clips. This was reported by the New York Times, citing several sources at OpenAI, at Google and close to the two companies.
According to the report, OpenAI ran out of training material at the end of 2021. To put it bluntly: the internet had been used up. The AI startup had already fed all the publicly available English-language text it could find into its training data. Current generative AI models tend to deliver better results the more pre-processed training material they are given.
So, according to the Times, OpenAI began converting the audio of YouTube videos into text on a massive scale with its Whisper tool. The video platform itself also offers automatically generated subtitles, and other programs such as Adobe Premiere can now produce high-quality transcriptions as well. According to the newspaper, which is itself in a legal dispute with OpenAI over the alleged use of its content to train AI, Google was well aware of what OpenAI was doing.
Google did not intervene
However, Google took no action against OpenAI, because it was itself already using content from YouTube and from other services such as Google Docs to train its own AI models. According to the New York Times, Google was aware that it might be violating the rights of video creators in doing so. A wave of lawsuits and other complaints is currently under way, especially in the United States, against the use of copyrighted material for AI training without appropriate licensing agreements. According to the Times, the U.S. Copyright Office alone received more than 10,000 submissions on the matter from individuals, companies and other organizations last year.
In recent years, several tech companies, including Google and Facebook, have changed their terms of service so that users must consent to the content they create being used as AI training material before they can use the services. At the same time, these companies bar other companies from accessing that data and using it for their own services. OpenAI is said to have invoked the US legal doctrine of “fair use” in internal discussions before the YouTube operation.
Licensing is still the exception
As AI faces growing legal constraints, including the EU's AI Act, some companies are now explicitly signing licensing agreements with data sources. One example is Reddit, which struck a deal with Google ahead of its IPO. Under the agreement, Reddit user data is made available to Google for US$60 million per year.
While 1 million hours of YouTube video, or more than 114 years of playback time, may seem like a lot at first glance, the figure has to be put into perspective, because the platform continues to grow rapidly. In 2019, Google reported that about 500 hours of video were being uploaded to YouTube every minute. That figure has probably risen significantly since then. Even on the basis of those old numbers, more than a million hours of new or newly edited material accumulates in just over 33 hours. OpenAI is therefore likely to have processed only a very small portion of the total content. The question that remains unanswered is what criteria were used for the selection.
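The two figures in this comparison can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are chosen for illustration, a year is assumed to have 365 days, and the upload rate is the 2019 figure quoted above, not a current one.

```python
# Rough check of the playback-time and upload-rate comparison.
# Assumptions: 365-day years; 500 hours uploaded per minute (2019 figure).
HOURS_PER_YEAR = 24 * 365

transcribed_hours = 1_000_000        # amount reportedly transcribed
upload_hours_per_minute = 500        # Google's 2019 upload figure

# How long it would take to play back the transcribed material.
playback_years = transcribed_hours / HOURS_PER_YEAR

# How long YouTube needs to receive the same amount of new video.
minutes_to_match = transcribed_hours / upload_hours_per_minute
hours_to_match = minutes_to_match / 60

print(f"Playback time: {playback_years:.1f} years")
print(f"Time for uploads to match: {hours_to_match:.1f} hours")
```

With these assumptions, the result is a little over 114 years of playback time and a little over 33 hours of uploads, in line with the figures in the text.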