Last week saw the release of two court decisions in cases addressing the use of copyrighted material to train artificial intelligence (AI) platforms, Bartz et al. v. Anthropic and Kadrey et al. v. Meta. We asked the Chefs for their thoughts on these decisions and their potential impacts on publishers and authors.
Roy Kaufman
In recent years, I have been asked some version of the following question by publishers: "as we develop our AI licensing strategy, should we wait to see if the courts grant us clarity?" My answer has been: "fair use is fact dependent. If you wait five-plus years for the end results of motions for discovery, summary judgment, trials on the merits, and the inevitable appeals, we still will not have bright-line clarity."
Of the 40-plus AI training cases in the US, we now have preliminary decisions in three, and we have moved backwards in terms of clarity. To vastly oversimplify the results: training AI is non-transformative infringement (Thomson Reuters v. Ross); training is transformative and mostly fair use, with a major caveat (Bartz); or training is mostly not fair use, but was fair use in this case because the lawyers did not plead correctly (Kadrey).
I could criticize and nit-pick these decisions for pages. I am especially concerned with the Bartz court's lack of market harm analysis, given the importance of market harm as set forth by the Supreme Court in the Andy Warhol case. Both Bartz and Kadrey also ignored long-standing precedent about the importance of licensing to fair use analysis.
As a practical matter, though, the only way out of this morass of uncertainty is for parties to collaborate, whether through "licenses," "content feeds," or some other formulation that enables parties to agree and advance AI. Decisions that deny publishers the ability to enforce rights in, and issue licenses for, content will disincentivize the creation and posting of new materials. This is not good for anyone. If an AI firm wants to train only on what it can find for "free," that material will increasingly be AI-generated, and the result will be poorer quality AI.
As noted in the Bartz case:
“Over time, Anthropic came to value most highly for its data mixes books like the ones Authors had written, and it valued them because of the creative expressions they contained. Claude’s customers wanted Claude to write as accurately and as compellingly as Authors. So, it was best to train the LLMs underlying Claude on works just like the ones Authors had written, with well-curated facts, well-organized analyses, and captivating narratives — above all with “good writing” of the kind “an editor would approve of.”… Anthropic could have trained its LLMs without using such books or any books at all. That would have required spending more on, say, staff writers to create competing exemplars of good writing, engineers to revise bad exemplars into better ones, energy bills to power more rounds of training and fine-tuning, and so on. Having canonical texts to draw upon helped.”
Good content matters. It matters for training, and it matters even more for use in agentic AI and RAG models, neither of which was at issue in these lawsuits and both of which would be subject to a different copyright analysis. Human-authored content is critical, and licensing supports its creation and use.
Rick Anderson
Both of Judge Alsup’s rulings make sense to me.
First, as to the "fair use" nature of using copyrighted texts to train AI large language models: it seems clear to me that such applications represent a transformative use of the copyrighted content. Using these texts to train a language model is a radically different use from that for which the texts were designed and intended. Furthermore, this use does not result in a product that competes in any way with the original works in the marketplace, and therefore has no impact on the copyright holders' ability to sell copies or access; nor does it result in anything that could reasonably be characterized as a derivative work of the original. I can't fault the court's finding with regard to the fair-use argument.
I’m also heartened to see that the court decided Anthropic should be held accountable for illegally downloading millions of books from websites that had stolen them. By downloading those books en masse, Anthropic had participated in a massive piracy scheme, perpetuating and expanding the illegal proliferation of unauthorized duplicates of copyrighted material. Copyright holders were entirely right to call “foul” on that behavior, and I’m grateful to see the court distinguishing carefully between fair and unfair uses within the same project by the same company. Going forward, this kind of thoughtful discrimination will help to create greater clarity in the always-murky arena of intellectual property rights and AI.
Todd Carpenter
The two judgments last week are simply the first forays into what will be a very active legal battle, one likely to rage for many years to come. Several dozen more cases have been filed, and we can expect many more rulings in the coming months. A core aspect of fair use cases is that they turn on the very specific circumstances of each case, so it won't be surprising to see nuanced differences among the early decisions, at least until the cases reach the Supreme Court some years from now. If you've seen one fair use case, you've seen only one fair use case.
I find the nuance in the Bartz v. Anthropic opinion compelling. From a copyright perspective, AI systems engage in three acts that touch on copyright. The first is the sourcing of the training content. Given that Anthropic has been deemed to have used pirated content sites, such as Books3, there is a real risk of this being proved in court, and, if proven, it could expose the company to a significant pool of liability, running into the hundreds of millions of dollars. "The downloaded pirated copies used to build a central library were not justified by a fair use," Judge Alsup wrote. "Every factor points against fair use."
Here I expect that the many LLM developers who similarly started with pirated content should be concerned. The ruling could fundamentally change the marketplace if training on unlicensed content is found to be infringing. It is easy to presume that most LLM developers were using unlicensed, copyrighted material, at least until a sizable market for AI licensing began developing in about 2023. Last month, Nick Clegg all but acknowledged this when he claimed that requiring consent would kill AI. On its face, this is a rather insane statement when one considers the billions of dollars flowing into AI development, while the licensing costs for content are still measured in the tens of millions of dollars. Publishers have made a strong case that training was done using copyrighted content; Tim O'Reilly made this point notably a few weeks ago, but there are many others. If a large model developer were found liable for using 'pirated' copies of copyrighted works, each work would be eligible for compensation, and those penalties could bankrupt many companies.
The second act is the training work involved in building an LLM, that is, encoding the knowledge of the ingested books into the model. This was deemed transformative in both cases and can reasonably be understood in those terms. If this were the extent of what LLMs were doing with content, one might consider these judgments simple wins.
The third touchpoint to copyright is the output of LLM systems. An LLM could be used to create a completely new work in the style of a previous work, which is widely recognized as transformative. However, the same system could just as easily be used to produce a passably similar copy, a follow-on book that picks up where the first left off, fanfiction, or an image close enough to the original to pass with an average consumer. It seems that in both cases the judges were so enamored of the transformative outputs that are possible that they ignored the possibility of near-exact replication. There are a variety of examples and interpretations of transformative use related to the output itself. Since LLMs are not simply indexing content or creating search functionality, but generating new content based on the original content, one must consider the actual outputs in a fair use determination. Some of the outputs might be new and novel, fitting into established fair use buckets of transformation. However, based on the rich case law, just as many likely are not. It is therefore surprising that the cases so far have revolved only around the training, not the outputs, since the outputs are the real purpose of these systems, not simply the training of the model. One response is that the judges were reacting to the cases put before them. I expect this final issue will come to the forefront in other cases.
What does all this imply for what happens next? First, let's acknowledge that any definitive resolution of these questions is many years in the future. It will take years for these various cases to wind their way through the court system. Very likely, these cases will end up before the Supreme Court, because there is so much at stake for everyone involved. The problem is that an entire ecosystem of real-world applications will be developed, deployed, and adopted or discarded before the courts come to a resolution.
My expectation is, and has been for some time, that these cases will drive the AI vendor community (or at least those with the resources) to seek content agreements with the publishing world as quickly and efficiently as possible. The risks of being found out of compliance are so great that AI companies will want to hedge their bets. In some ways, they are trying to build the next billion-dollar business on a bed of sand, one court ruling away from being deemed illegitimate or, worse, a significant liability. Another possible approach might be to revive the notion of an industry-wide settlement, something akin to the Google Books Settlement that collapsed many years ago in the Authors Guild v. Google case, though this would face many of the same challenges that the Books Rights Registry faced as a solution. There is some chatter that a settlement was always the preferred outcome, but one challenge with a settlement is that with each loss in court, the resulting settlement becomes ever more costly.
Discussion
Thank you for sharing your interpretations of these rulings. I've been thinking about Todd's point that AI vendors will likely pursue agreements with publishers, and fast. When it comes to RAG tools that rely on access to a broad database in order to answer researchers' questions, I wonder what the impact of these agreements will be.
Will we see more siloing, so that some tools will only return documents from publishers covered by the agreements?
Or will they (and I think this is more realistic) pursue agreements like the ones they have with Semantic Scholar, which right now is the one tool that most of the start-ups use? Will Semantic Scholar be the next big disseminator of scholarly content, then? Will that finally disintermediate the big databases? And what about smaller publishers: how do we make sure their content is (legally) drawn upon by these tools so that researchers get the full picture?