Authors Sue OpenAI Claiming Mass Copyright Infringement of Hundreds of Thousands of Novels

Another lawsuit has been filed against OpenAI over its unauthorized collection of information across the web to train its artificial intelligence chatbot, this time by authors who say ChatGPT infringes on copyrights to their novels.

The proposed class action, filed in San Francisco federal court on Wednesday, alleges that OpenAI “relied on harvesting mass quantities” of copyright-protected works “without consent, without credit, and without compensation.” It seeks a court declaration that the company infringed on writers’ works when it illegally downloaded copies of novels to train its AI system and that ChatGPT’s answers constitute infringement.


Generative AI companies are being bombarded by legal challenges over the material used to train their AI systems as courts wrestle with whether the practice qualifies as fair use. OpenAI is facing a proposed class action claiming the billions of lines of computer code that its AI technology analyzes to generate its own code qualify as copyright infringement, in addition to a suit filed Wednesday targeting its automated copying of personal data from hundreds of millions of people. Getty Images has also sued Stability AI, the company behind AI art generator Stable Diffusion, for copyright infringement.

As evidence of infringement, the suit filed by authors points to ChatGPT generating summaries of their novels when prompted. They argue that’s “only possible if ChatGPT was trained on Plaintiffs’ copyrighted works.”

And because the AI system can’t function without the information extracted from the material, the software programs known as large language models that power ChatGPT “are themselves infringing derivative works, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act,” the suit says. A derivative is a work based upon a preexisting, copyright-protected work.

The authors take issue with OpenAI illegally downloading hundreds of thousands of books to train its AI system. In June 2018, the company revealed that it fed GPT-1 — the first iteration of its large language model — a collection of over 7,000 novels on BookCorpus, which was assembled by a team of AI researchers.

“They copied the books from a website called Smashwords.com that hosts unpublished novels that are available to readers at no cost,” states the complaint. “Those novels, however, are largely under copyright. They were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.”

Later versions of OpenAI’s large language models were trained on larger quantities of copyright-protected works, according to the complaint. In a 2020 paper introducing GPT-3, the company disclosed that 15 percent of its training dataset came from “two internet-based books corpora” that it simply called “Books1” and “Books2.” While it never revealed what works were part of those datasets, the authors claim they came from “notorious shadow library websites,” like Library Genesis, Z-Library, Sci-Hub and Bibliotik.

“These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called ‘Books3’ includes a recreation of the Bibliotik collection and contains nearly 200,000 books,” writes the authors’ lawyer Joseph Saveri, who also represents programmers in the proposed class action against OpenAI and Microsoft.

OpenAI no longer discloses information about the sources of its dataset, “[g]iven both the competitive landscape and the safety implications of large-scale models like GPT-4,” the company said last year.

The suit, which seeks to represent a nationwide class of hundreds of thousands of authors in the U.S., was brought by Paul Tremblay and Mona Awad. Tremblay wrote the novel The Cabin at the End of the World, which was adapted by M. Night Shyamalan into Knock at the Cabin. The complaint alleges direct copyright infringement, vicarious copyright infringement, violations of the Digital Millennium Copyright Act, unjust enrichment and negligence, among other claims.

OpenAI and Microsoft, which owns part of the AI company, didn’t immediately respond to requests for comment.

In a May hearing before the House Judiciary Subcommittee on Courts, Intellectual Property and the Internet examining the intersection of AI and copyright law, key players in Hollywood argued in favor of legislation to bar the rampant, unpermitted collection of their works to train AI systems. “The rapid introduction of generative AI systems is seen as an existential threat to the livelihood and continuance of our creative professions unless immediate steps are taken on legal interpretive and economic fronts to address these emerging issues,” said Ashley Irwin, president of the Society of Composers and Lyricists, at the hearing. “It’s essential to prioritize policies and regulations to safeguard the intellectual property and copyright of creators and preserve the diverse and dynamic U.S. cultural landscape.”

Irwin stressed that AI firms should be required to secure creators’ consent before using their works to train AI programs, to compensate them at fair market rates for any new work subsequently created, and to provide proper credit.
