TDM and Training Data Law
This specialist module analyses text and data mining (TDM) law as it applies to AI training. It covers three major jurisdictions (UK, EU, and US) and how their regimes interact through contract drafting and enforcement.
UK: Data (Use and Access) Act 2025
The UK DUA Act 2025 replaced the earlier proposed TDM exception (which would have allowed commercial TDM without consent) with a regime that treats non-commercial and commercial TDM differently:
- A TDM exception exists for non-commercial research purposes (carried over from CDPA 1988 s.29A).
- Commercial TDM (including AI training) is subject to a rights-holder opt-out mechanism.
- The opt-out must be expressed in a machine-readable form accessible to the AI operator.
- Content carrying a valid opt-out signal is excluded from the exception, so any use requires an explicit licence.
The CIP framework's cip.md declaration satisfies the machine-readable requirement when deployed at the domain root with CIP-Training-Ingestion: Prohibited. The Act does not prescribe a specific format, but the Government's technical guidance references "standardised machine-readable signals" — which cip.md implements.
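As an illustration, the check an AI operator's crawler might run against a fetched cip.md can be sketched as follows. Only the directive name CIP-Training-Ingestion and the value Prohibited come from the text above; the assumption that cip.md consists of simple `Key: Value` lines, and the function names, are hypothetical and not drawn from any published specification.

```python
def parse_cip_declaration(text: str) -> dict[str, str]:
    """Collect 'Key: Value' lines from a cip.md body (assumed layout)."""
    fields: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, markdown headings, and lines with no separator.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Keys are compared case-insensitively (an assumption of this sketch).
        fields[key.strip().lower()] = value.strip()
    return fields


def training_ingestion_prohibited(cip_text: str) -> bool:
    """Return True when the declaration says CIP-Training-Ingestion: Prohibited."""
    fields = parse_cip_declaration(cip_text)
    return fields.get("cip-training-ingestion", "").lower() == "prohibited"
```

A crawler built this way would fetch cip.md from the domain root, pass the body to `training_ingestion_prohibited`, and exclude the domain's content from training sets when the result is true.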
EU: AI Act Article 53 and Copyright Directive
The EU framework operates through two complementary instruments:
- Copyright Directive Article 4: A general TDM exception that also covers commercial purposes, subject to a rights-reservation mechanism. Works whose rights holders have expressly reserved their rights "in an appropriate manner" (Article 4(3)) fall outside the exception.
- AI Act Article 53: Imposes transparency and copyright-compliance obligations on general-purpose AI model providers, including a requirement under Article 53(1)(c) to "put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790".
The combined effect: AI operators must actively check for and respect rights reservations. The "appropriate" reservation for online content must be machine-readable. The CIP cip.md format satisfies this requirement.
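The "actively check" step can be pictured as a gate applied before ingestion. The sketch below is a simplified illustration, assuming the operator has already built a set of hosts that carry a rights reservation; the function name and the host-level granularity are assumptions made for the example, not requirements stated in either instrument.

```python
from urllib.parse import urlparse


def split_by_reservation(urls, reserved_hosts):
    """Partition URLs into (ingestible, excluded) by host-level reservation."""
    reserved = {host.lower() for host in reserved_hosts}
    ingestible, excluded = [], []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # URLs with no parseable host are excluded, because no
        # reservation check could be carried out for them.
        if not host or host in reserved:
            excluded.append(url)
        else:
            ingestible.append(url)
    return ingestible, excluded
```

Keeping the excluded list, rather than silently dropping those URLs, gives the operator a record it can point to when documenting its copyright-compliance policy.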
US: Fair use four-factor analysis
The US has no statutory TDM exception or opt-out mechanism. The primary legal framework is the fair use doctrine under 17 U.S.C. § 107, which requires analysis of four factors:
- Purpose and character of the use — is the AI training use "transformative"? Courts have reached different conclusions on this.
- Nature of the copyrighted work — creative works receive stronger protection than factual works.
- Amount and substantiality of the portion used — AI training typically ingests entire works, which weighs against fair use.
- Effect on the market for the original — does AI training (and AI-generated outputs) substitute for the original? This is the most contested factor.
The NYT v. OpenAI/Microsoft litigation (1:23-cv-11195, S.D.N.Y.) is the most significant active case. The April 2025 motion-to-dismiss ruling allowed most claims to proceed, and the case is now in expert discovery. No US federal court has yet issued a ruling that settles the fair use question for AI training.
Jurisdiction selection in contracts
The jurisdiction clause in a content licensing contract determines which TDM regime applies. This is a strategic drafting decision:
- UK jurisdiction: Gives the rights holder access to the DUA Act 2025 opt-out mechanism and statutory remedies.
- EU jurisdiction: Gives access to the AI Act Article 53 compliance framework and Copyright Directive rights-reservation regime.
- US jurisdiction: Exposes the claim to a fair use defence whose outcome is uncertain, but provides access to federal copyright statutory damages under 17 U.S.C. § 504(c).
The CIP framework recommends that UK and EU rights holders include a UK or EU jurisdiction clause in content licensing agreements, giving them access to the statutory opt-out mechanisms. US rights holders may prefer US jurisdiction for the statutory damages regime despite the fair use uncertainty.
Enforcement pathways
- UK: Statutory damages under DUA Act 2025 for breach of opt-out; additional damages under CDPA 1988 s.97(2); injunctive relief through the Intellectual Property Enterprise Court or High Court.
- EU: Injunctive relief through national courts implementing the Copyright Directive; Article 53 enforcement through national AI supervisory authorities; GDPR enforcement where personal data is involved.
- US: Statutory damages under 17 U.S.C. § 504(c) ($750–$30,000 per work, up to $150,000 for wilful infringement); actual damages and profits; injunctive relief through federal district courts.
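To show why the US statutory damages regime can matter at training-corpus scale, the per-work figures quoted above can be multiplied out. This is a rough sketch that uses only those figures; it ignores any other statutory adjustments or prerequisites and the court's discretion within the range, and the function name is illustrative.

```python
STATUTORY_MIN = 750        # minimum per work, 17 U.S.C. § 504(c)
STATUTORY_MAX = 30_000     # ordinary maximum per work
WILFUL_MAX = 150_000       # maximum per work for wilful infringement


def statutory_damages_range(works: int, wilful: bool = False) -> tuple[int, int]:
    """Return the (low, high) statutory damages exposure in US dollars."""
    if works < 0:
        raise ValueError("works must be non-negative")
    ceiling = WILFUL_MAX if wilful else STATUTORY_MAX
    return works * STATUTORY_MIN, works * ceiling


print(statutory_damages_range(100))               # (75000, 3000000)
print(statutory_damages_range(100, wilful=True))  # (75000, 15000000)
```

For 100 infringed works, the exposure runs from $75,000 to $3,000,000, rising to $15,000,000 at the top of the range if the infringement is found to be wilful.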
Collecting society coordination
Collecting societies can coordinate Training Data Dividend claims at scale through their existing mandates. The CIP framework supports this through:
- CDR records that identify collecting society membership (PRS, MCPS, DACS, etc.)
- Revenue waterfall routing that can direct Training Data Dividend payments through existing collecting society infrastructure
- Collective licensing negotiations that societies can conduct on behalf of their members using CDR-documented rights exposure data
For rights holders not represented by a collecting society, the Rights Registry provides direct enforcement support and Training Data Dividend distribution.
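The routing rule described in this section (pay through a collecting society where one is recorded, otherwise through the Rights Registry) can be sketched as below. The `CdrRecord` fields and function names are hypothetical, chosen for the example; the actual CDR schema is not specified in this module.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CdrRecord:
    """Hypothetical, minimal view of a CDR record for routing purposes."""
    work_id: str
    rights_holder: str
    collecting_society: Optional[str] = None  # e.g. "PRS", "MCPS", "DACS"


def dividend_route(record: CdrRecord) -> str:
    """Pick the payment channel for a Training Data Dividend."""
    # Fall back to the Rights Registry when no society is recorded.
    return record.collecting_society or "Rights Registry"


def group_by_route(records):
    """Map each payment channel to the work IDs routed through it."""
    routes: dict[str, list[str]] = {}
    for record in records:
        routes.setdefault(dividend_route(record), []).append(record.work_id)
    return routes
```

Grouping works by channel in this way would let a society receive a single consolidated claim for all of its members' works rather than one claim per work.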
Summary
Key Takeaways
- UK DUA Act 2025 provides a statutory TDM exception with an opt-out mechanism — machine-readable opt-out signals must be respected
- EU AI Act Article 53 and the Copyright Directive require rights reservation to be honoured — opt-out must be in machine-readable form for online uses
- US fair use analysis for AI training remains unsettled — NYT v. OpenAI and related cases are still in litigation
- Jurisdiction selection in contracts determines which TDM regime applies — this is a strategic drafting decision
- Enforcement pathways differ materially: UK statutory damages, EU injunctive relief through national courts, US damages through federal copyright litigation
- Collecting societies can coordinate Training Data Dividend claims at scale through existing mandates
Self-check
Check Your Understanding
- What does the UK DUA Act 2025 require for a valid TDM opt-out?
- How do EU AI Act Article 53 and the Copyright Directive interact on TDM?
- Why might a UK rights holder prefer UK jurisdiction in a content licensing agreement?