Applied Intelligence
Module 7: Data Privacy and Compliance

Training Data and Licensing Implications

The training data problem

The previous section addressed who owns AI-generated code. This section addresses a different question: what obligations might AI-generated code inherit from the data used to train the model?

As of late 2025, more than 70 copyright lawsuits were pending against AI companies. The legal outcomes will take years to resolve, but enterprises cannot wait: code is being written today with AI tools trained on datasets of uncertain provenance. Understanding the risk landscape is essential for informed decision-making.

The litigation landscape

AI companies trained their models on massive datasets scraped from the internet, including copyrighted books, articles, images, and code. Content creators are suing.

Major pending cases

The New York Times v. OpenAI and Microsoft represents the most significant AI copyright case in the US. Filed in late 2023, the case survived most of OpenAI's motion to dismiss in April 2025. In May 2025, Judge Wang ordered OpenAI to preserve all ChatGPT logs, over 400 million users' worth, after the Times demanded access to 1.4 billion conversations. No trial date is set, but discovery is proceeding.

Authors Guild v. OpenAI consolidated 17 authors including George R.R. Martin, John Grisham, and Jodi Picoult into a class action. The court ordered OpenAI to produce communications about the deletion of its "Books1" and "Books2" training datasets, along with 20 million de-identified ChatGPT logs. A discovery status conference is scheduled for February 2026, with a fair use decision expected no earlier than summer 2026.

Getty Images v. Stability AI produced a notable UK ruling in November 2025. The UK High Court found that Stable Diffusion's model weights "never contained a copy" of Getty's images, rejecting the claim that training on copyrighted content makes the model itself infringing. However, older versions of Stable Diffusion did generate images with Getty watermarks, a clear trademark infringement. Getty is appealing the secondary copyright claims and says it will use the UK findings in its parallel US case.

Music industry lawsuits against Suno and Udio (AI music generators) resulted in settlements. Warner Music Group settled with both companies in November 2025, entering licensing deals for AI music platforms launching in 2026. Universal Music Group settled with Udio. Sony Music's cases remain active.

The Anthropic settlement: $1.5 billion

The Bartz v. Anthropic case produced the largest copyright settlement in U.S. history.

In June 2025, Judge William Alsup issued a split ruling. Training on legally acquired books qualified as fair use, "quintessentially transformative"; training on pirated copies from shadow libraries like LibGen did not.

The class certification in August 2025 expanded three authors into representatives for approximately 500,000 works. With statutory damages up to $150,000 per work, theoretical liability exceeded $70 billion.

The preliminary settlement, approved September 2025, requires Anthropic to pay $1.5 billion in four installments through September 2027, roughly $3,000 per title. Anthropic must destroy the pirated library copies and confirm destruction in writing. Critically, the settlement does not grant permission for future use or shield Anthropic from future lawsuits.

Thomson Reuters v. ROSS Intelligence

This February 2025 ruling was the first US court decision rejecting fair use for AI training.

ROSS Intelligence asked to license Westlaw content for its legal research AI. Thomson Reuters declined. ROSS then used a third party to create memos incorporating Westlaw headnotes. Judge Stephanos Bibas found ROSS infringed over 2,000 headnotes and rejected the fair use defense.

The ruling is on appeal to the Third Circuit. The case did not involve generative AI, making it factually distinct from the major training data lawsuits, but it established that AI training on copyrighted content is not automatically fair use.

The license inheritance question

Could code generated by an AI inherit licensing obligations from the code used to train the model?

The theory

If an AI model is trained on GPL-licensed code and generates output that is substantially similar to that training data, the output might qualify as a derivative work. Copyleft licenses like GPL require derivative works to be distributed under the same license. This creates a potential "taint" scenario: GPL fragments in AI output could theoretically require the entire generated codebase to be GPL-licensed.

As of late 2025, no court has ruled that AI output inherits training data licenses. The theory remains legally untested.

The November 2025 Munich Regional Court ruling in GEMA v. OpenAI established that if a model can reproduce an expression substantially identical to original training material, that memorization constitutes reproduction under German copyright law. This supports the derivative work theory in certain scenarios but does not address license inheritance directly.

The Copyright Office's May 2025 report concluded that "it is not possible to prejudge litigation outcomes" and that "some uses of copyrighted works for generative AI training will qualify as fair use, and some will not."

The legal community has not reached consensus.

Arguments against inheritance

  • AI models perform statistical generalization, not direct copying
  • GPL applies to "distribution of the work"; model parameters are not clearly a distribution of that work
  • Code fragments do not persist in weights the same way they exist in traditional software

Arguments for inheritance

  • Substantially similar output could constitute a derivative work regardless of the generation method
  • The Munich ruling suggests models can legally "contain" copies of training material when output is reproducible
  • Copyleft licenses are specifically designed to propagate to derivative works

Open source projects respond

Some open source projects have adopted explicit policies banning AI-generated contributions.

QEMU's policy

QEMU formally adopted a complete ban on AI-generated contributions in June 2025. The official policy states:

"Current QEMU project policy is to DECLINE any contributions which are believed to include or derive from AI generated content. This includes ChatGPT, Claude, Copilot, Llama and similar tools."

The primary rationale is Developer Certificate of Origin (DCO) compliance. Contributors must certify the contribution was "created by me" with clear copyright and license status. QEMU determined that AI-generated code cannot satisfy this requirement.
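
In practice, DCO compliance is attested with a Signed-off-by trailer on each commit, typically added with git commit -s. A hypothetical example (the name and change description are illustrative):

  git commit -s -m "hw/net: fix descriptor ring wraparound"

  Resulting trailer in the commit message:
  Signed-off-by: Jane Developer <jane@example.com>

Under DCO 1.1, that sign-off certifies the contribution "was created in whole or in part by me", an attestation QEMU concluded cannot truthfully be made for AI-generated output.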

The policy allows using AI for research, API documentation lookup, static analysis, and debugging provided those outputs are not included in contributions.

Other project policies

Gentoo Linux (April 2024) explicitly forbids AI-assisted contributions, citing copyright concerns, quality issues ("plausibly looking, but meaningless content"), and ethical concerns including environmental costs.

NetBSD presumes AI-generated code is "tainted" and requires prior written approval. As a BSD-licensed project, NetBSD is explicitly concerned about accidentally incorporating GPL code from AI training data, which would create license conflicts.

The Git project has forbidden AI-generated code entirely.

Creative Commons Technology Team (December 2025) will not accept AI-generated submissions until further notice, citing disputed productivity claims, review burden, and negative impact on skill development for programs like Google Summer of Code.

LLVM adopted a "human in the loop" approach rather than a ban. Contributors must disclose tool usage in PR descriptions or commit messages, be able to answer questions about their work, and maintain accountability for all submitted code.
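
For illustration, a disclosure under this kind of policy might read like the following commit message; the wording and format are hypothetical, not an official LLVM template:

  [ADT] Avoid redundant copy in SmallVector growth path

  Initial draft produced with an AI coding assistant (tool and
  version noted per project policy). The author reviewed and
  tested the change and can answer questions about every line.

The point is accountability: the human contributor, not the tool, answers for the code in review.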

Why projects ban AI code

The common threads across these policies:

  1. Provenance uncertainty: Projects cannot verify that AI output is free from license-incompatible training data
  2. DCO compliance: Standard contribution agreements require a personal attestation that AI generation cannot satisfy
  3. Quality concerns: AI-generated code imposes additional review burden
  4. License contamination risk: BSD-licensed projects face particular risk from GPL fragments in training data

Enterprise risk mitigation

Enterprises cannot wait for litigation to resolve. Practical risk management is necessary now.

Due diligence on AI vendors

When evaluating AI coding tools, ask:

  • What is the source of training data for the model?
  • Does the vendor maintain documentation of the training pipeline, including dataset provenance?
  • Have training data sources undergone legal review for consent, licensing, and copyright compliance?
  • Will inputs to the system be used as additional training data?

Vendor hesitancy or evasiveness on these questions is a significant red flag.

Detection and documentation

License scanning tools are emerging. FOSSA released snippet scanning tools in September 2025. Codacy announced GPL license scanners for AI-generated code.

Duplication detection filters are available. GitHub Copilot offers an optional filter preventing suggestions that match public code. Enable this in enterprise deployments.
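
Where a dedicated scanner is not yet in place, a lightweight pre-merge check can still route obvious cases to human review. The following minimal sketch in Python assumes a team convention of listing AI-assisted files in a manifest (the ai_files.txt name is hypothetical); it merely flags copyleft license phrases for review and is no substitute for commercial snippet scanning.

  # Minimal heuristic: flag copyleft license phrases in AI-assisted files.
  # A hit means "route to human review", not "definitely infringing".
  import pathlib
  import re
  import sys

  # Phrases that commonly appear in copyleft license headers.
  COPYLEFT_MARKERS = [
      re.compile(p, re.IGNORECASE)
      for p in (
          r"GNU General Public License",
          r"GPL-[23]\.0",
          r"GNU Lesser General Public License",
          r"GNU Affero",
      )
  ]

  def scan(path: pathlib.Path) -> list[str]:
      """Return the copyleft marker patterns found in one file."""
      try:
          text = path.read_text(errors="ignore")
      except OSError:
          return []
      return [m.pattern for m in COPYLEFT_MARKERS if m.search(text)]

  def main(manifest: str) -> int:
      flagged = False
      for line in pathlib.Path(manifest).read_text().splitlines():
          path = pathlib.Path(line.strip())
          if not line.strip() or not path.is_file():
              continue
          hits = scan(path)
          if hits:
              flagged = True
              print(f"{path}: possible copyleft header ({', '.join(hits)})")
      return 1 if flagged else 0  # non-zero exit fails the CI job

  if __name__ == "__main__":
      sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "ai_files.txt"))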

Documentation requirements are increasing. California's AB 2013 requires generative AI developers to disclose whether training datasets include copyrighted material. Colorado HB 23-1239 mandates detailed summaries of AI model training sources. The EU AI Act demands compliance reporting on training data and risk mitigation.

Review practices

Apply the same compliance standards to AI-generated code as human-written code:

  • IP infringement detection services can check AI output for copyright violations
  • Log AI-generated code for traceability in case of disputes (a sketch of one possible record format follows this list)
  • Enable vendor-provided duplication detection filters
  • Conduct code review with license contamination as an explicit focus area
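
As noted in the list above, a traceability log can be as simple as an append-only record per AI-assisted change. A minimal sketch follows, with hypothetical field and file names; real deployments would tie records to commit metadata or an existing audit system.

  import datetime
  import json

  def log_ai_contribution(log_path, commit_sha, file_path, tool,
                          tool_version, duplication_filter_on, reviewer):
      """Append one provenance record (JSON lines) per AI-assisted change."""
      record = {
          "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
          "commit": commit_sha,
          "file": file_path,
          "tool": tool,
          "tool_version": tool_version,
          "duplication_filter_enabled": duplication_filter_on,
          "reviewer": reviewer,
      }
      with open(log_path, "a", encoding="utf-8") as f:
          f.write(json.dumps(record) + "\n")

  # Example: record a reviewed, AI-assisted change (values illustrative).
  log_ai_contribution("ai_provenance.jsonl", "abc1234", "src/parser.py",
                      "vendor-assistant", "2025.10", True, "jdoe")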

Vendor indemnification

Microsoft's Copilot Copyright Commitment provides IP indemnification for paid Copilot Business and Enterprise users, provided required safety systems are enabled. OpenAI's Copyright Shield provides similar protections for enterprise customers. Anthropic indemnifies enterprise API customers with specified exclusions.

These protections require compliance with vendor terms. Modifications, combinations with non-vendor technology, and disabled safety features can void coverage.

The prudent path

The law is unsettled. Litigation will continue for years. Courts may ultimately establish that AI training on copyrighted content is fair use, that license inheritance does not apply, or both. They may rule otherwise.

Enterprises operating today should:

  1. Enable duplication detection and filtering in AI tools
  2. Document AI tool usage and output for audit trails
  3. Review vendor indemnification terms and maintain compliance
  4. Apply license scanning to AI-generated code in sensitive contexts
  5. Monitor legal developments for shifts requiring policy updates

The risk is not zero, but it is manageable. Documentation, detection, and vendor protections provide a defensible position while the law develops.
