C4C’s Perspective on the EU AI Act: Copyright in Real Life is Messy and AI Discussions Are Not Helping

Cross-posted from C4C’s LinkedIn page – see the original article

The trilogue black box effect combined with an end of term

The AI Act has reached that black box moment referred to as the trilogue, where the three EU institutions enter a room to make a deal. Two elements might affect this process, and not necessarily in a way that allows for a reasonable and proportionate outcome:

  • One, as the terms of this European Commission and European Parliament are coming to an end, there is a bigger risk that a deal is made at all costs. Add to that the EU's love of “setting the standard” for the rest of the world (whatever that could possibly mean) and you sense that the AI Act might be negotiated with a misplaced sense of urgency.
  • Two, the creative industries are increasingly voicing concerns about the threats AI could pose to their current business model, a situation which, especially in Europe, tends to generate knee-jerk reactions by policy makers at the expense of careful assessments.

The copyright creep into the AI Act discussion

The AI Act as initially proposed did not contain any copyright references. And this is absolutely justified, as the Directive on Copyright in the Digital Single Market had just been adopted and contains two provisions covering text and data mining (TDM): Article 3, covering research organisations and cultural heritage institutions, and Article 4, covering all other uses. TDM is defined in a sufficiently broad manner to cover many known machine learning processes (“any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”).

This was confirmed in an answer by Commissioner Thierry Breton to a parliamentary question from MEP Emmanuel Maurel, where he stated that the Copyright Directive applied to AI and that hence “the creation of art works by AI does not deserve specific legislative intervention”.

Nobody knows what is copyrighted and what isn’t: do not ask AI to deliver transparency on an untransparent status

So what was added by the European Parliament?

Under Article 28b, the current Parliament position requires providers of generative AI models to document and share a “sufficiently detailed” summary of the use of training data protected under copyright law.

Simple, no? Actually, absolutely not. No one knows what is copyrighted and what is not. Copyright is not vested in a work through a deliberate act like a registration: it is bestowed on any creation that meets the requirements of copyright laws, and those requirements may vary from one country to another. One of those criteria is originality, a threshold that has led to many lengthy court cases and that is in no shape or form something a web crawler or automated tool could identify.

Of course, when you feed the whole Harry Potter series into an AI model's training data, you probably know it is copyrighted. But what about a drawing made by a child and posted proudly by one of their parents on social media, or a poem they wrote at school and got a perfect grade for? That is likely to be worthy of copyright too. Or not. We just don’t know.

There is no register of copyrighted works and hence there is no way to list separately which of the elements in your data set are copyrighted. For this reason, any transparency obligation that creates a subset requirement for copyrighted works is a recipe for compliance failure through no fault of the entity trying to comply, and hence for legal uncertainty.

Or as aptly stated by the COMMUNIA Association: “AI developers also should not be expected to know which of their training materials are copyrightable. Introducing a specific requirement for this category of data adds legal complexity that is not needed nor advisable”.

And that is a very different compliance requirement to one that would try to enable the reservation right given to rightholders under the commercial TDM provision of Article 4 of the Copyright Directive, as explained below.

Generative AI does not necessarily use all the data in a data set to train a model or generate something

Machine learning is about collecting huge data sets, cleaning them up, chopping them into small parts referred to as tokens, splitting them into training and test data, and training a model on them so that it can generate output in response to a prompt. Depending on the prompt, different parts of what the model has learned can be relevant while others might be completely disregarded.
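To make the preparation steps described above more concrete, here is a minimal sketch in Python of cleaning, tokenising and splitting a toy data set. It is an illustration only: the function names, the sample texts and the word-level tokeniser are assumptions made for this example, and real pipelines are far larger and typically use subword tokenisers.

```python
import random
import re


def clean(text: str) -> str:
    """Collapse whitespace and lowercase; real pipelines do far more."""
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text: str) -> list[str]:
    """Split into word and punctuation tokens (a stand-in for subword tokenisers)."""
    return re.findall(r"\w+|[^\w\s]", text)


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction as test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sample texts, invented for this sketch.
documents = [
    "  A poem written   at school. ",
    "A drawing posted on social media!",
    "Some forum post, source unknown.",
    "Product manual, version 2.",
    "Another short text.",
]

tokenized = [tokenize(clean(doc)) for doc in documents]
train, test = train_test_split(tokenized)
```

Note that none of these steps records whether an item is protected by copyright: the data itself simply does not carry that information.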

When it comes to generative AI, as explained by Dr Andres Guadamuz, Reader in Intellectual Property Law at the University of Sussex, “the most important takeaway from the perspective of a legal analysis is that a generative AI does not reproduce the inputs exactly, even if you ask for a specific one” and “style and a ‘look and feel’ are not copyrightable” as “copyright protects the expression of an idea, not the idea itself”.

For those that think they hold copyright, a mechanism exists on paper: maybe it should be made to function in reality

The Copyright Directive has created a reservation right (also referred to as opt-out) for rightholders wishing to express that their content should not be used in commercial TDM activities.
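The Directive does not prescribe a single technical format for expressing this reservation. Purely as an illustration, and assuming a rightholder signals it through a robots.txt file (one possible machine-readable signal, not an agreed standard for TDM opt-outs), a crawler-side check might be sketched as follows. The user agent name `ExampleTDMBot`, the URL and the helper `may_mine` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which a site reserves its content
# against one (invented) text and data mining crawler.
ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt does not exclude this user agent from the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = may_mine(ROBOTS_TXT, "ExampleTDMBot", "https://example.org/poem.html")
allowed = may_mine(ROBOTS_TXT, "OtherBot", "https://example.org/poem.html")
```

A check like this only tells a crawler whether a reservation was expressed for a given location. It says nothing about whether the content there is actually protected by copyright.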

The conclusion to this debate has been perfectly summarised by Assistant Professor João Pedro Quintais from the Institute for Information Law (IViR): “the type of transparency that is useful is one that allows copyright holders to access datasets in order to exercise their opt-outs. It is unclear how the present text would enable that, since it imposes a requirement that cannot be met in practice”.

More practically: remove copyright references from the AI Act, so that a horizontal measure does not get polluted by sector-specific measures, and start working on a practical implementation of the TDM measures in the Copyright Directive, with the input of all relevant stakeholders.