CreativeIP.org: Creative Intellectual Property

How AI training data creates hidden copyright liability

24 pages · Training data law · CIP Standards Committee · Updated April 2026

Why ingesting copyrighted content into a training corpus does not extinguish the rights subsisting in it; how the UK Data (Use and Access) Act 2025 (DUA Act), EU AI Act Article 53 and US fair use jurisprudence interact; and what AI companies actually face when content carrying text and data mining (TDM) opt-outs enters their pipelines without consent.

What this paper covers

The legal status of training ingestion under each major jurisdiction
The treatment of works carrying machine-readable TDM opt-outs
The exposure created by inference outputs that derive from specific copyrighted sources
The litigation landscape as it stood at publication

Key findings

Ingesting a copyrighted work into an AI training corpus creates a reproduction, which engages the rights holder's exclusive rights under copyright law in all three jurisdictions examined here (UK, EU and US). The legal treatment varies: the UK Data (Use and Access) Act 2025 provides a statutory opt-out mechanism; EU AI Act Article 53 requires rights reservations to be honoured; and US fair use jurisprudence remains unsettled as NYT v. OpenAI and related cases are litigated.

Works carrying a machine-readable TDM opt-out, including those declared via cip.md, create heightened liability when ingested without consent. The opt-out operates as an explicit reservation of rights, which removes any implied-licence defence otherwise available to the AI operator.

Inference outputs that demonstrably derive from specific copyrighted sources engage derivative-work liability independently of the training-ingestion liability. This creates a dual exposure: one at the point of training and another at the point of output generation.

Jurisdictional comparison

United Kingdom: The Data (Use and Access) Act 2025 provides a statutory TDM exception with an opt-out mechanism. Content carrying a machine-readable opt-out signal is excluded from the exception. The CIP cip.md declaration satisfies the machine-readable requirement when deployed at the domain root with the CIP-Training-Ingestion: Prohibited field.
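
As an illustration of what "machine-readable" means in practice, the sketch below checks a cip.md declaration for the CIP-Training-Ingestion: Prohibited field named above. It is a minimal sketch, not the CIP specification. The only details taken from this paper are the field name and its value. The one-field-per-line "Name: Value" layout, the case-insensitive matching and the function name training_ingestion_prohibited are assumptions made for illustration.

```python
def training_ingestion_prohibited(cip_md_text: str) -> bool:
    """Return True if the declaration carries CIP-Training-Ingestion: Prohibited.

    Assumption (not from the CIP specification): each field sits on its own
    line in "Name: Value" form, and matching is case-insensitive.
    """
    for raw_line in cip_md_text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        name, separator, value = raw_line.strip().partition(":")
        if not separator:
            continue  # not a "Name: Value" line
        if name.strip().lower() == "cip-training-ingestion":
            return value.strip().lower() == "prohibited"
    # Field absent: this sketch treats the declaration as carrying no opt-out.
    return False


# Hypothetical declaration text, as it might be fetched from a domain root.
sample_declaration = "CIP-Training-Ingestion: Prohibited\n"
opted_out = training_ingestion_prohibited(sample_declaration)
```

A crawler could run a check like this before fetching any content from a domain. Treating a missing field as "no opt-out" is a design choice of this sketch, and a more cautious operator might treat a missing or malformed declaration as unknown instead.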

European Union: AI Act Article 53 and the Copyright Directive establish a rights-reservation regime. Rights holders who have made their reservation known — in a machine-readable format for online uses — retain full control over training use. The CIP framework's TDM opt-out field aligns directly with this requirement.

United States: The fair use defence under 17 U.S.C. § 107 remains the primary framework. Following NYT v. OpenAI/Microsoft and related litigation, the application of the four-factor test to AI training is actively being litigated. No statutory TDM exception or opt-out mechanism exists. The CIP framework recommends US operators treat TDM opt-out declarations as contractually binding even absent statutory force.

Implications for AI operators

AI operators should conduct a full audit of their training corpus to identify content carrying TDM opt-outs. Where opt-outs are identified, the operator faces a choice: remove the content from the corpus (expensive but legally safe), or continue use and accept the resulting liability exposure.
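
The audit step described above can be sketched as a partition of corpus records by source domain. This is a minimal sketch under stated assumptions: the CorpusRecord type, the audit_corpus function and the idea that each record retains its source URL are hypothetical, and the set of opted-out domains is taken as given rather than looked up through any CIP service.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CorpusRecord:
    """Hypothetical corpus entry; assumes provenance was kept at ingestion."""
    record_id: str
    source_url: str


def audit_corpus(records, opted_out_domains):
    """Partition records into opted-out, clear, and unknown-provenance groups."""
    opted_out = {domain.lower() for domain in opted_out_domains}
    result = {"opted_out": [], "clear": [], "unknown": []}
    for record in records:
        # hostname is None when the URL has no scheme or no network location.
        host = urlsplit(record.source_url).hostname
        if host is None:
            result["unknown"].append(record)
        elif host in opted_out:
            result["opted_out"].append(record)
        else:
            result["clear"].append(record)
    return result


# Illustrative data only; the domains and IDs are made up.
records = [
    CorpusRecord("r1", "https://publisher.example/articles/1"),
    CorpusRecord("r2", "https://open.example/post"),
    CorpusRecord("r3", "scraped-file-without-url"),
]
report = audit_corpus(records, {"publisher.example"})
```

Records in the opted_out group are the ones that force the remove-or-accept-liability choice. The unknown group is kept separate because content with no recoverable provenance cannot be shown to be free of an opt-out.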

The CIP framework provides the infrastructure for this audit through the Rights Registry and CDR system. Platform Certification at Level 2 and above requires documented evidence of rights-aware ingestion practices.
