Novelists Sue OpenAI for Copyright Infringement Over Books Used as Training Data

Avram Piltch

June 30, 2023 at 5:06 AM·4 min read

A number of visual artists have filed suit over the use of their images as training data for text-to-image generators. Now, two well-known novelists have filed their own class-action suit against OpenAI, accusing the company behind ChatGPT and Bing Chat of copyright infringement because it allegedly used their books as training data. This appears to be the first lawsuit filed over the use of text (as opposed to images or code) as training data.

In the lawsuit, filed in the United States District Court for the Northern District of California, plaintiffs Paul Tremblay and Mona Awad allege that OpenAI and its subsidiaries committed copyright infringement, violated the Digital Millennium Copyright Act and also ran afoul of California and common law restrictions on unfair competition. The writers are represented by Joseph Saveri Law Firm and Matthew Butterick, the same team that is behind recent lawsuits filed against Stability AI and GitHub (over GitHub Copilot).

The complaint alleges that Tremblay's novel The Cabin at the End of the World and two of Awad's novels, 13 Ways of Looking at a Fat Girl and Bunny, were used as training data for GPT-3.5 and GPT-4. Though OpenAI has not disclosed that the copyrighted novels are in its training data (which is kept secret), the plaintiffs conclude that they must be because ChatGPT was able to provide detailed plot summaries and answer questions about the books, a feat which, they argue, would require it to have access to the complete texts.

"Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act," the complaint says.

All three books also carry copyright management information (CMI) such as ISBN and copyright registration numbers. The Digital Millennium Copyright Act (DMCA) states that removing or falsifying CMI is illegal and, since ChatGPT's output does not contain that information, the plaintiffs allege that OpenAI is guilty of violating the DMCA on top of regular copyright infringement.

Though the lawsuit has only two plaintiffs right now, the lawyers are seeking class-action status, which would allow other authors who have had copyrighted works used by OpenAI to also collect damages. The lawyers are seeking monetary damages, court costs and an injunction forcing OpenAI to change its software and business practices around copyrighted material.

We reached out to Butterick for comment on the lawsuit and he referred us to his website, LLM Litigation, which has a detailed explanation of the plaintiffs' position and why they are suing.

"We’ve filed a class-action lawsuit against OpenAI challenging ChatGPT and its underlying large language models, GPT-3.5 and GPT-4, which remix the copyrighted works of thousands of book authors—and many others—without consent, compensation, or credit," the lawyers write.

They also criticize the concept of generative AI, writing that "'Generative artificial intelligence' is just human intelligence, repackaged and divorced from its creators."

Like Saveri and Butterick's lawsuit against Stability AI for using copyrighted images as training data, this one hinges on the belief that grabbing text from the open Internet to power an LLM is not fair use. That's a question that has not yet been answered in court.

In a 2006 case, Field v. Google, writer Blake Field sued the search engine for caching his work and making the cached versions available via search. However, a U.S. district court dismissed the suit, holding that Google's caching of the data was fair use. Judge Robert C. Jones wrote that holding documents in cache is a transformative use (one of four factors used to determine fair use) and that it doesn't harm the potential market for the work (another factor). So simply storing copyrighted data on its server in the form of a cache did not make Google liable.

However, using a copyrighted creative work as training data is quite a bit different from indexing content for search. One could argue that if the LLM is able to repeat key details from the book, it is harming the market for those works and it is not truly transformative. On the other hand, if a human writes a plot summary of a book, that generally doesn't run afoul of copyright law. Ultimately, these questions are going to be decided by lawsuits like this one.

OpenAI isn't the only company that's using copyrighted materials for training or even output. Google SGE, the company's new search experience, often plagiarizes whole sentences and paragraphs word-for-word from copyrighted articles. What happens in this suit could have a much wider impact on the generative AI industry.
