OpenAI, the start-up behind ChatGPT, signed a licensing deal on Monday (29 April) with the Financial Times (FT) allowing it access to archived features and news articles.
The FT’s articles will be used as training data for OpenAI’s GenAI chatbot, allowing it to answer prompts using data from the articles. Any article mentioned in ChatGPT’s response will be linked so its user can easily view the source material.
ChatGPT is able to generate human-like text and understand prompts due to the large swathes of training data it has ingested.
This training data has been central to numerous copyright lawsuits filed by authors against OpenAI.
Authors, such as Mona Awad, claim that their work has been used to train ChatGPT without their consent or knowledge. According to court documents, OpenAI did admit that “written works, plays and articles” were the most valuable form of training data.
Long-form text can help ChatGPT reliably respond to a wider range of prompts and improve its language comprehension.
Prior to ChatGPT’s release in 2022, OpenAI published a paper in 2018 entitled Improving Language Understanding by Generative Pre-Training.
The paper detailed that OpenAI’s first large language model, GPT-1, was trained using BookCorpus, an online database containing more than 11,000 self-published novels.
In 2020, OpenAI published a second paper, entitled Language Models are Few-Shot Learners, stating that GPT-3 – the model that would later underpin ChatGPT – had been trained on Common Crawl, a large archive of scraped web pages, alongside databases of published books, but it did not specify what copyrighted material was included in this data.
The copyright lawsuits have pushed OpenAI to take data transparency more seriously, and the start-up has since signed numerous licensing deals with major news publishers such as the Associated Press, Le Monde and Prisa Media.
While these deals give OpenAI access to a wide variety of long-form text to train ChatGPT, Emma Christy, graduate analyst at GlobalData, explained the symbiotic nature of licensing journalists’ work.
“Licensing partnerships between AI developers and news publishers, such as that between the Financial Times and OpenAI, can only be positive for the future of AI chatbots and AI ethics,” said Christy.
“It is important that news publishers are remunerated fairly by AI developers for the use of their materials,” she added.
“Access to real-time, high-quality news articles for training will in turn ensure chatbots are trained on reliable sources. Such partnerships, both now and in the future, will help to reduce both lawsuits and friction in integrating AI across industries,” Christy said.
Despite its licensing deals, OpenAI has continued to face backlash from news publishers.
In January 2024, the New York Times (NYT) sued OpenAI, accusing it of using “millions” of NYT articles to train ChatGPT without consent and stating that ChatGPT’s responses often reproduced text from NYT articles verbatim – despite the fact that those articles sit behind a paywall.
The NYT stated in the lawsuit that it was losing subscribers and ad revenue due to some of its content becoming available on ChatGPT.
Additionally, eight local daily US newspapers filed a lawsuit against OpenAI on 30 April alleging that their articles had also been used to train ChatGPT and that they had not been fairly paid for their work.
The newspapers, including the Chicago Tribune, Orlando Sentinel and Denver Post, stated that they had a legal right to be compensated for the use of their work by ChatGPT.
Frank Pine, executive editor of MediaNews Group and Tribune Publishing, which own the affected papers, said that OpenAI was stealing the work of journalists to build its business at the media’s expense.
“They pay their engineers and programmers, they pay for servers and processors, they pay for electricity, and they definitely get paid from their astronomical valuations,” Pine stated, “but they don’t want to pay for the content without which they would have no product at all. That is not fair use, and it is not fair. It needs to stop.”
Despite its ongoing legal troubles with news publishers, tech industry insiders believe that OpenAI’s licensing deals could shield it from future copyright lawsuits.
Luke Budka, AI director at marketing tech company Definition, stated that he believed many more licensing deals would be signed by OpenAI. Budka explained that the deals could help OpenAI fight online misinformation generated by ChatGPT.
“AI companies want real-time training data so we will see more of these deals spring up,” he said.
“Not every publisher is going to get a deal though – they will need to be able to offer data the AI company won’t be able to get elsewhere,” he said.
“OpenAI has said users will start to get access to real-time news reporting globally, including attribution and links, in order to help combat disinformation in elections – so they need relationships that can help facilitate this,” he added.
Licensing deals could help OpenAI create transparency in its mass of training data, helping it easily trace the sources of information that have been used to create ChatGPT.
Sara Saab, VP of product at data platform Prolific, stated that OpenAI’s existing licensing deals marked a pivot in data provenance for AI chatbots and large language models.
“Data provenance is becoming increasingly vital in AI, highlighting the importance of transparently indicating where an AI model’s data originates in a reliable and trustworthy way,” said Saab.
Saab predicts that ethically sourced and well-compensated data will become a necessity for future AI tools.
“Facilitating the easier tracing of AI training data back to representative and diverse human groups is crucial, ensuring a broad spectrum of human experiences are captured in our AI systems,” Saab concluded.