Regulations governing training material for generative artificial intelligence
Legal use of copyrighted material is crucial for training artificial intelligence (AI), but it also poses significant copyright risks for AI developers. During training, developers must feed large amounts of text data into AI models to improve performance, a process that inevitably involves copyrighted works.
Although various countries have not definitively determined whether copying copyrighted materials for AI training constitutes copyright infringement or can be claimed as fair use, regulatory oversight of training material for generative AI is increasingly stringent.
This article briefly introduces the latest regulatory trends in the EU, Japan, the US and China regarding training material for generative AI, and provides corresponding compliance advice.
Increasing regulatory stringency
Although the EU Directive on Copyright in the Digital Single Market provides specific copyright exceptions for text and data mining (TDM), it still imposes many restrictions on TDM for commercial purposes, including requirements that the mined content be lawfully accessed and that the right holder has not expressly reserved its use in an appropriate manner.
Additionally, building on the above-mentioned directive, the EU passed the world’s first comprehensive AI regulation, the Artificial Intelligence Act, on 13 March 2024. It further stipulates that providers of general-purpose AI models must put in place policies to respect rights reservations made by copyright holders under the directive’s TDM provisions, and must draw up and publish sufficiently detailed summaries of the content used to train their models. It is evident that the EU aims to enhance the transparency and compliance of AI technology, ensuring that AI systems are developed and used in a way that respects copyright law and protects the rights of copyright owners.
Against this legislative backdrop, Google recently faced a EUR250 million (USD267 million) fine from the French Competition Authority for unauthorised use of copyrighted content from French publishers and news agencies to train its AI product, Bard.
Japan was once considered a paradise for machine learning after a 2018 amendment to its Copyright Act added a copyright exception for computer-based information analysis. In 2024, however, the Agency for Cultural Affairs clarified, through a draft Approach to AI and Copyright, that not all uses of copyrighted works in machine learning fall within the exception, adding further limitations to it.
There is no definitive conclusion on fair use of AI training material in the US. However, in Thomson Reuters v Ross Intelligence, the first US case to consider whether using third-party copyrighted material to train generative AI constitutes fair use, the court analysed the fair use factors as applied to AI training, including the commercial nature of the use, its transformative character, and the potential for market substitution.
Although this case has not yet been concluded, the judge’s analysis of the elements of fair use reflects the cautious and detailed consideration of the relationship between copyright law and generative AI by the US judiciary.
The Interim Measures for the Management of Generative Artificial Intelligence Services, which came into effect in China on 15 August 2023, also require providers of generative AI services to comply with certain requirements for acquiring training material. Article 7 explicitly stipulates that providers should use data and basic models from legal sources when conducting activities such as pre-training and optimisation training, and they must not infringe upon the lawful IP rights of others.
Compliance advice
Against this regulatory backdrop, unauthorised use of copyrighted works for AI training is likely to expose generative AI enterprises to copyright infringement risks. The authors therefore suggest that, in acquiring and using training material, such enterprises take the following measures to manage risks effectively.
First, if an enterprise collects training material itself, it should favour text data from low-risk sources (such as material already in the public domain or open-source databases) and ensure the legitimacy of those sources, for example by not circumventing technical protection measures or obtaining pirated copyrighted content. It should also check whether copyright holders have declared a prohibition on crawling their content for AI training.
Second, enterprises can enter into licensing agreements with copyright holders, which effectively reduces infringement risk while also improving the quality of training data. When purchasing training databases from third parties, it is crucial to ensure the vendor provides a clear chain of title and authorisation documents, and warrants the legality of the content’s copyright.
Furthermore, enterprises should establish a regular audit mechanism for the training material library to systematically screen out and eliminate high-risk content. For AI models that accept user-input content, it is recommended to separate proprietary databases from libraries of user-uploaded third-party material, to make supervision more efficient and comprehensive.
Finally, enterprises should keep records of the sources and use of training material. If disputes over training material arise later, an enterprise can then demonstrate that it acquired the material legally and compliantly and fulfilled its duty of care, by providing transparency reports and explaining the sources of its training data, thereby minimising its liability.
Estella Chen is a partner at Han Kun Law Offices. She can be contacted by phone at +86 10 8525 5541 and by email at estella.chen@hankunlaw.com
Zhao Minxi is an associate at Han Kun Law Offices. She can be contacted by phone at +86 10 8524 5830 and by email at minxi.zhao@hankunlaw.com
Chao Xin also contributed to this article.