A new bill wants to reveal what’s really inside AI training data

April 10, 2024

A new bill would compel tech companies to disclose any copyrighted materials that are used to train their AI models.

The Generative AI Copyright Disclosure bill from Rep. Adam Schiff (D-CA) would require anyone making a training dataset for AI to submit reports on its contents to the Register of Copyrights. The reports should include a detailed summary of the copyrighted material in the dataset and the URL for the dataset if it’s publicly available. The requirement would also apply to any changes made to the dataset.

Companies would have to submit a report “not later than 30 days” before the AI model that used the training dataset is released to the public. The bill would not apply retroactively to existing AI platforms unless changes are made to their training datasets after it becomes law.

Schiff’s bill hits on an issue artists, authors, and other creators have been complaining about since the rise of generative AI: that AI models are often trained on copyrighted material without permission. Copyright and AI have always been tricky to navigate, especially as the question of how much AI models change or mimic protected content has not been settled. Artists and authors have turned to lawsuits to assert their rights.

Developers of AI models claim their models are trained on publicly available data, but the sheer amount of information means they don’t know specifically which data is copyrighted. Companies have said any copyrighted materials fall under fair use. Meanwhile, many of these companies have begun offering legal cover to some customers if they find themselves sued for copyright infringement.

Schiff’s bill garnered support from industry groups like the Writers Guild of America (WGA), the Recording Industry Association of America (RIAA), the Directors Guild of America (DGA), the Screen Actors Guild – American Federation of Television and Radio Artists (SAG-AFTRA), and the Authors Guild. Notably absent from the list of supporters is the Motion Picture Association (MPA), which normally backs moves to protect copyrighted work from piracy. (Disclosure: The Verge’s editorial staff is unionized with the Writers Guild of America, East.)

Other groups have sought to bring more transparency to training datasets. The group Fairly Trained wants to add labels to AI models whose developers can prove they asked for permission to use copyrighted data.
