Anthropic Wins Major Fair Use Victory for AI Training
June 28, 2025
A groundbreaking legal decision has sent shockwaves through the AI industry. In what may be the most significant copyright ruling for artificial intelligence to date, Judge William Alsup of the Northern District of California issued a summary judgment in Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson v. Anthropic PBC that fundamentally clarifies the boundaries of fair use in AI training.
The ruling is a mixed victory for Anthropic—and by extension, the entire AI industry. While the court found that using copyrighted books to train large language models constitutes fair use, it also held that building permanent digital libraries of pirated content is not protected, even when used for transformative purposes.
The Facts: From Piracy to Purchase
The case reveals fascinating details about Anthropic's data collection practices from its earliest days. Founded by ex-OpenAI researchers in February 2021, the company initially relied heavily on pirated content to build its training datasets.
According to court documents, Anthropic co-founder Ben Mann downloaded Books3 in early 2021—a notorious collection of 196,640 pirated books. The company didn't stop there. By June 2021, Mann had downloaded at least five million books from Library Genesis (LibGen), and in July 2022, Anthropic added at least two million more from the Pirate Library Mirror (PiLiMi). All of these sources were known to contain unauthorized copies of copyrighted works.
But here's where the story gets interesting: Anthropic eventually changed course. In February 2024, the company hired Tom Turvey, former head of partnerships for Google's book-scanning project, with an ambitious mission—obtain "all the books in the world" while avoiding as much "legal/practice/business slog" as possible.
Turvey's team embarked on a massive legitimate book-buying spree, spending millions of dollars to purchase print books, often in used condition. These books were then professionally scanned—stripped from their bindings, cut to size, and digitized into PDFs before the physical copies were discarded.
The Court's Nuanced Ruling
Judge Alsup's 32-page decision draws crucial distinctions that could define AI copyright law for years to come:
What Counts as Fair Use
AI Training: The court found that using copyrighted books to train LLMs is "spectacularly transformative" and constitutes fair use. The judge compared it to how humans learn from reading, noting that forcing people to pay "each time they read, each time they recall from memory, each time they later draw upon it when writing new things" would be unthinkable.
Purchased-and-Scanned Books: Converting legitimately purchased print books to digital format for internal use was also ruled fair use, though on narrower grounds—essentially treating it as format shifting rather than transformative use.
What Doesn't Count as Fair Use
Pirated Central Library: Building and maintaining a permanent digital library of millions of pirated books was ruled not fair use, even when some content was later used for transformative AI training. The court emphasized that Anthropic kept these pirated copies even after deciding not to use them for training.
Implications for the AI Industry
This ruling provides crucial clarity for AI companies, but with important caveats:
The Good News for AI Companies
- Training on copyrighted material can be fair use when the process is truly transformative and doesn't produce infringing outputs
- Legitimate purchase and digitization of content for internal AI training purposes is likely protected
- No blanket liability for using copyrighted works in AI training datasets
The Warning Signals
- No special carveout for AI: As the court bluntly stated, "There is no carveout from the Copyright Act for AI companies"
- Piracy isn't excused by downstream fair use: Building convenience libraries of pirated content isn't protected just because you later use it transformatively
- Intent matters: The court considered whether companies actively sought out pirated content versus legitimately acquired materials
What Happens Next
While Anthropic won on the AI training question, the company still faces a jury trial over damages related to its pirated book library. This could include statutory damages for willful infringement, which the Copyright Act permits at up to $150,000 per work, meaning the potential financial exposure is substantial given the scale of the library.
The decision also sets up an interesting dynamic for other pending AI copyright cases. Companies such as OpenAI and Meta that have used similar datasets (Books3 was also part of Meta's LLaMA training data) are likely analyzing this ruling carefully.
The Broader Context
The Anthropic decision provides a framework but doesn't resolve all questions. Key issues remain:
- What constitutes "transformative" use in different AI contexts?
- How should fair use apply to other types of copyrighted content like images, videos, or code?
- What licensing models might emerge to serve both content creators and AI developers?
For now, AI companies have clearer guidance: training on copyrighted content can be fair use, but building pirate libraries is risky business. The smartest approach appears to be what Anthropic eventually adopted—invest in legitimate content acquisition, even if it's expensive.
As the AI industry continues to mature, we're likely to see more companies following Anthropic's later playbook: finding ways to obtain content legally while pushing the boundaries of what fair use allows in the transformative context of AI training.