
AI’s Data Debacle: When Training Becomes Stealing?


The use of copyrighted materials for training large language models and other generative AI systems is a contentious issue. AI companies like OpenAI, Google and Anthropic argue that ingesting broad data is necessary to create beneficial AI tools. However, artists, creators and rights holders argue this constitutes unlawful use and exploitation of their work.

As AI rapidly advances by ingesting massive datasets, a burning ethical debate rages: Should companies freely consume copyrighted creative works to fuel AI, or must they obtain explicit consent and compensate rights holders for commoditizing intellectual property?

The conflict reached a new level recently when Sony Music sent letters to 700 AI companies and streaming platforms, demanding they stop using Sony’s music and lyrics to train AI systems without permission, credit or compensation. Unauthorized use of copyrighted content typically violates intellectual property laws unless the use is licensed by the rights holder or falls under a legal exception such as fair use.

OpenAI recently faced backlash for an AI voice model that mimicked Scarlett Johansson’s voice without her consent. Johansson revealed she had declined an offer from OpenAI to voice the ChatGPT 4.0 system – highlighting the broader issue of deepfakes and protecting individuals’ name, image and likeness rights in the AI age.

Despite this, the release of a voice eerily similar to hers led to public outcry and legal action. Johansson stated: “When I heard the released demo, I was shocked, angered and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference.” OpenAI has now paused its voiceover tool, admitting in a blog post that the voice “acted in an unintended way… inconsistent with our ethics.”

Responding to growing backlash, OpenAI announced plans for a Media Manager tool intended to allow rights holders to specify whether their works can be included in AI training data, set to launch in 2025.

However, by placing the burden on creators to protect their rights, this opt-out approach fundamentally disrespects intellectual property rights by assuming unconsented use is acceptable until explicitly denied – an unfair and impractical stance. Many believe OpenAI’s Media Manager merely attempts to mitigate backlash and lawsuits. Skepticism abounds regarding whether works will actually be removed from datasets or if this is just a PR move to avoid future legal claims while keeping prior training data intact.

The 2025 timeline means OpenAI will have had years since its launch to use creators’ works without consent, raising questions about outputs already derived from pre-Media Manager training data that are proliferating online. This is particularly concerning given that OpenAI has repeatedly issued statements promising to respect creators, reflecting intense pressure from copyright infringement cases.

The Ethical Approach: Opt-In

Critics of the opt-out model argue the ethical approach is an explicit opt-in system where creators proactively grant licenses and set terms. Only then can creators’ rights and labor be truly respected, with transparency around permitted training data.

While an output like Johansson’s voice suggests the training data included her vocals or an intentionally similar voice, the outputs of models trained on large datasets often don’t directly reveal the specific works or rights involved. A keyboard sound generated by a model trained on Jimmy Page’s guitar and Ringo Starr’s drums may not sound like a Led Beatles hybrid.

This opt-in approach raises questions about handling derivative AI outputs that closely mimic original works or infringe individuals’ NIL rights: Will AI systems train only on opted-in, permitted content? How would this impact previously generated synthetic data?

Untrained Models and Industry Practices

Developers decide what data to include every time an AI model gets trained. Eventually, models become obsolete, replaced by new versions – some freshly trained, some building on prior iterations. AI has great potential to assist creators with tasks like marketing, rights management, and even aiding the creative process itself.

For society’s benefit, the focus should be developing AI tools respecting creators’ intellectual property while supporting human ingenuity and expanding avenues for creativity. There is risk, however, of funneling all individuality into a handful of elite AI systems, narrowing rather than expanding humanity’s innovative potential.

The Democratization of Creativity or Exploitation?

Generative music AIs like Udio and Suno claim they will “democratize” music while using copyrighted tracks to train, as highlighted by Ed Newton-Rex, Founder of Fairly Trained. The notion of “democratizing creativity” through AI has been heavily criticized, wrongly implying art is a natural expression rather than a skill honed over thousands of hours of practice and experimentation.

AI risks stripping away incentives for humans to innovate and weakening artists’ and audiences’ sense of identity by reductively duplicating the essence of what people create – the pillars underpinning cultural evolution and exploring the human condition. Framing AI as a democratizing force follows a simplistic, rules-based view overlooking art’s nuanced complexity.

The Core Ethical Question

At the heart of the battle around AI data practices is the question: Who gets to use and commoditize the ideas created by others? This is a tangible, present-day conflict over rights to the intellectual property generated by human intellect. As generative AI ingests broader datasets, the risk of exploiting creators without approval or compensation grows to a point where no arbiter could identify how the balance was irrevocably tipped – a paradigm shift too advanced to reverse or restore creators’ lost ground.

For artists and rights holders, an “opt-in” system cleanly protects creators by requiring explicit consent and negotiated licensing terms before any AI training occurs.

The stakes are high in this battle over data appropriation and rights violations, and the outcomes will govern the relationship between human intellect and artificial intelligence as this socio-technical revolution unfolds.


