Novelists Sue OpenAI for Copyright Infringement Over Books Used as Training Data

 Gavel in front of ChatGPT Screen

A number of visual artists have filed suit over the use of their images as training data for text-to-image generators. Now, two well-known novelists have filed their own class-action suit against OpenAI, accusing the company behind ChatGPT, whose models also power Bing Chat, of copyright infringement because it allegedly used their books as training data. This appears to be the first lawsuit filed over the use of text (as opposed to images or code) as training data.

In the lawsuit, filed in the United States District Court for the Northern District of California, plaintiffs Paul Tremblay and Mona Awad allege that OpenAI and its subsidiaries committed copyright infringement, violated the Digital Millennium Copyright Act and also ran afoul of California and common law restrictions on unfair competition. The writers are represented by the Joseph Saveri Law Firm and Matthew Butterick, the same team that is behind recent lawsuits filed against Stability AI and GitHub (over GitHub Copilot).

The complaint alleges that Tremblay's novel The Cabin at the End of the World and two of Awad's novels, 13 Ways of Looking at a Fat Girl and Bunny, were used as training data for GPT-3.5 and GPT-4. Though OpenAI has not disclosed whether the copyrighted novels are in its training data (which is kept secret), the plaintiffs conclude that they must be because ChatGPT was able to provide detailed plot summaries and answer questions about the books, a feat that, they argue, would require it to have had access to the complete texts.

"Because the OpenAI Language Models cannot function without the expressive information extracted from Plaintiffs’ works (and others) and retained inside them, the OpenAI Language Models are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act," the complaint says.

All three books also carry copyright management information (CMI) such as ISBN and copyright registration numbers. The Digital Millennium Copyright Act (DMCA) states that removing or falsifying CMI is illegal and, since ChatGPT's output does not contain that information, the plaintiffs allege that OpenAI violated the DMCA on top of committing regular copyright infringement.

Though the lawsuit has only two plaintiffs right now, the lawyers are seeking class-action status, which would allow other authors whose copyrighted works were used by OpenAI to also collect damages. The lawyers are seeking monetary damages, court costs and an injunction forcing OpenAI to change its software and business practices around copyrighted material.

We reached out to Butterick for comment on the lawsuit, and he referred us to his website, LLM Litigation, which has a detailed explanation of the plaintiffs' position and why they are suing.

"We’ve filed a class-action lawsuit against OpenAI challenging ChatGPT and its underlying large language models, GPT-3.5 and GPT-4, which remix the copyrighted works of thousands of book authors—and many others—without consent, compensation, or credit," the lawyers write.

They also criticize the concept of generative AI, writing that "'Generative artificial intelligence' is just human intelligence, repackaged and divorced from its creators."

Like Saveri and Butterick's lawsuit against Stability AI for using copyrighted images as training data, this one hinges on the belief that grabbing text from the open Internet to power an LLM is not fair use. That's a question that has not yet been answered in court.

In a 2006 case, Field v. Google, writer Blake Field sued the search engine for caching his work and making the cached versions available via search. However, a U.S. district court dismissed the suit, holding that Google's caching of the data was fair use. Judge Robert C. Jones wrote that holding documents in cache is a transformative use (a consideration under one of the four factors used to determine fair use) and that it doesn't harm the potential market for the work (another factor). So simply storing copyrighted data on its servers in the form of a cache did not make Google liable.

However, using a copyrighted creative work as training data is quite a bit different from indexing content for search. One could argue that if the LLM is able to repeat key details from the book, it is harming the market for those works and is not truly transformative. On the other hand, if a human writes a plot summary of a book, that generally doesn't run afoul of copyright law. Ultimately, these questions are going to be decided by lawsuits like this one.

OpenAI isn't the only company that's using copyrighted materials for training or even output. Google SGE, the company's new search experience, often plagiarizes whole sentences and paragraphs word-for-word from copyrighted articles. What happens in this suit could have a much wider impact on the generative AI industry.