OpenAI’s Sora Shines a Spotlight on The Need for ‘Ethically Sourced’ AI | Commentary

Peter Csathy

April 18, 2024 at 9:00 AM

OpenAI’s cinematic-quality AI video generator Sora, and the power of what it represents, shook Hollywood just weeks ago. Its shocking quality certainly elevates the issue of what AI means for future Hollywood productions. But Sora also, once again, puts the spotlight on the fundamental issue of AI “training” on copyrighted works without consent.

Of course, when asked, OpenAI — like most generative AI companies — never comes right out and says that’s what it does. The company simply says that it trains Sora on “publicly available” works. While that sounds innocuous enough, it really isn’t. If it were, why would the company be so cagey about it? When directly asked whether Sora trained on YouTube videos, OpenAI’s CTO Mira Murati deflected. “I’m actually not sure about that,” she said.

We now know that it’s what we’ve suspected all along. “Publicly available” simply means that the food OpenAI uses for training its AI (because that’s what it is to OpenAI’s voracious AI pet) is content accessible online, much of which is, of course, copyrighted. Thanks to some intrepid journalistic digging by The New York Times, it’s clear now that OpenAI trained its GPT-4 large language model (LLM) on transcripts of over one million hours of YouTube videos, all without payment or consent.

And here’s a big tell about how OpenAI itself really feels about what it’s doing. The name for its internal speech-recognition program that takes YouTube videos and transcribes them into text for training purposes is Whisper, as in, let’s keep things on the down low. I’m no linguist, but it certainly seems like an admission of some sort to me. (OpenAI did not respond to a request for comment from TheWrap.)

Apparently, even YouTube, the copyright-infringing OG, agrees. YouTube doesn’t precisely couch the issue in those terms, of course, perhaps because Google is reportedly training its own AI using YouTube videos. Earlier this month, CEO Neal Mohan said that if Sora was trained on YouTube videos without consent, that would violate the platform’s terms of service. That’s a rich claim coming from YouTube, since YouTube built its initial base by enabling users to upload any videos they wanted, including copyrighted videos like SNL’s notorious “Lazy Sunday” that blew the lid off the platform, without securing licenses or compensating rights holders. One could say that the U.S. copyright laws and notices were the relevant “terms of service” at the time.

Given that inconvenient truth, some would say YouTube’s position sounds a bit, shall we say, hypocritical. But putting that aside for the moment, Mohan has a point. Why should OpenAI, or any other AI developer, be able to feed off the works of others in order to build its value as a tool (or whatever you call generative AI)? And even more pointedly, where are the creators in this equation?

Apart from trying to dodge those specific questions, OpenAI, predictably, tries to turn the tables on Google and contend that what it does is effectively no different from what Google itself did when it vacuumed up millions of copyrighted works for its “books” project in order to make them searchable online. At the time, the Second Circuit Court of Appeals blessed Google’s actions as being a “fair use,” in a seminal case that is always cited by those in tech who feel that creative works should be considered fodder for some kind of higher calling of infinite progress.

But Google showcased only snippets of those books in its search results – not the whole enchilada. There was no market substitution here. Once a user found the copyrighted work through search, they still would need to actually go out and buy the real thing. That’s a fundamentally different proposition than Sora’s. Sora doesn’t call attention to other copyrighted works and build new channels of monetization for them. Sora, instead, competes directly with them (at least it will when it becomes widely available).

Anyway, if OpenAI were so confident in the righteousness of its position, why be so cagey about it? Because it isn’t. Generative AI tech without content is essentially useless. We know that, and they know that. That means artists and creators of those creative works should be compensated. You can’t re-use my article here without my permission simply because it’s been posted. And that basic fact doesn’t change simply because you’ve sucked millions of works into your training vortex. It’s not just about the outputs generated by AI (that’s a separate copyright matter). It’s about the inputs as well.

At a minimum, it’s hard to argue against confronting OpenAI’s opacity about what’s really going on head on. All of us (creators and consumers alike), for a whole host of reasons, deserve to know precisely what OpenAI uses in its training data sets.

That kind of transparency is precisely what President Joe Biden’s Executive Order about AI calls for. Congress finally took Biden’s hint when, just last week, U.S. Rep. Adam Schiff introduced “The Generative AI Copyright Disclosure Act.” Following the European Union’s own historic legislation on the subject, Schiff’s act would require anyone who uses a data set for AI training to send the U.S. Copyright Office a notice that includes “a sufficiently detailed summary of any copyrighted works used.”

Essentially, this is a call for “ethically sourced” AI and transparency so that consumers can make their own choices. Think of it like nutritional labeling on food products for consumer safety reasons. “Trust and safety” logically should apply here too, and artists certainly agree. Two weeks ago, leading musicians like Billie Eilish penned an open letter urging the tech community to knock it off and stop training its LLMs on their work without consent or compensation. So the heat is most definitely on, and it’s up to the creative community to keep the issue on the front burner.

So let’s first pull back the curtain on what’s really going on in the AI sausage factory via demands for transparency. Then we can all confront the copyright legal issues head on, with a reality we all understand. To infringe, or not to infringe (because it’s fair use)? That is the question, and it’s a question winding through the federal courts right now that will ultimately find its way to the U.S. Supreme Court.

And when it does, my prediction is that even this wacky court will ultimately find a way to protect artists in the most basic of ways. Following its surprising (to many) recent decision in the Andy Warhol Prince copyright case, in which it defined a new kind of “direct harm to creator” exception to fair use, it will rule in favor of creators. It will reject Big Tech’s efforts to train their LLMs on copyrighted content without consent or compensation, properly finding that AI’s raison d’être in those circumstances is to build new systems to compete directly with creators: in other words, market substitution.

Simply because something is “publicly available” doesn’t mean that you can take it. It’s both morally and legally wrong. I’m an IP lawyer and welcome a healthy debate on that subject. But for god’s sake, be transparent about what you’re doing.

Reach out to Peter at peter@creativemedia.biz. For those of you interested in learning more, sign up for his “the brAIn” newsletter, visit his firm Creative Media at creativemedia.biz, and follow him on Threads @pcsathy.

The post OpenAI’s Sora Shines a Spotlight on The Need for ‘Ethically Sourced’ AI | Commentary appeared first on TheWrap.
