Most publishers I speak to want the same thing, usually phrased two different ways:

  • How do we get paid without spending years in court?

  • How do we stop being told ‘trust us’ by companies that have screwed us already, and have every incentive not to be trusted?

Last week we looked at what’s causing the LLM attribution crisis. This week, the fix.

The solution doesn’t involve a single lawsuit. Instead, it’s about market structure:
low-friction licensing at scale, paired with a way to measure use that’s reliable enough to pay against.

The legal backdrop still matters, because it shapes leverage. Thomson Reuters v. Ross Intelligence shows courts can reject fair use when copyrighted material is used to create a competing commercial tool.

The Bartz v. Anthropic settlement also fell, for the most part, in favour of writers and creators, whereas the Meta ruling did not. The New York Times reached its own settlement with OpenAI, which is great for the Times but largely useless for other publishers and journalists who lack the same clout or financial firepower.

Furthermore, the U.S. Copyright Office’s 2025 training report explicitly contemplates that some unauthorised training uses won’t qualify as fair use, especially where market substitution is plausible.

And if that wasn’t enough, here’s the real spicy meatball:
even favourable rulings won’t automatically create a payment pipeline for everyone who isn’t Reuters or The New York Times. A durable solution has to be designed as a product and a contract system, not a moral appeal that tries to tug at the heartstrings.

[Image: “The LLM Attribution Gap: Why AI Isn’t Crediting Your Work”. Created using NotebookLM.]

Start where the SSRC paper ends: disclosure as infrastructure

The SSRC AI Disclosures Project paper, “The Attribution Crisis in LLM Search Results”, doesn’t merely diagnose “ecosystem exploitation.” It recommends a transparent LLM search architecture based on fuller disclosure of search traces and citation logs.

That recommendation is quietly radical. It implies a shift away from arguments about what models might be doing toward evidence of what they did do, query by query.

And there’s a standards path for this already: OpenTelemetry has published semantic conventions for generative AI spans.
The point here is not to worship any one standard. It’s to establish a shared format for “what happened inside the box” that procurement teams, auditors, and rights-holders can validate.
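
To make that shared format concrete, here’s a minimal sketch using the OpenTelemetry Python SDK. The gen_ai.* attribute names come from the published GenAI semantic conventions; the retrieval.* attributes are hypothetical additions I’m using to illustrate the “receipt” idea, not part of any standard.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Standard SDK setup: print finished spans to the console for inspection.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("attribution-receipt-demo")

with tracer.start_as_current_span("chat example-model") as span:
    span.set_attribute("gen_ai.operation.name", "chat")          # semconv attribute
    span.set_attribute("gen_ai.request.model", "example-model")  # semconv attribute
    # Hypothetical attributes recording the answer's "receipt":
    span.set_attribute("retrieval.sources.fetched",
                       ["https://example.com/report", "https://example.com/wire"])
    span.set_attribute("retrieval.sources.cited",
                       ["https://example.com/report"])
```

Anyone holding that span, whether a procurement team, an auditor, or a rights-holder, can compute the attribution gap straight from it: sources fetched minus sources cited.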

If you can standardise the record of exactly which sources were fetched, and which of those made a statistically significant contribution to the answer, you can price consumption rather than relying on clicks alone.

If a source’s contribution isn’t statistically significant, it doesn’t qualify for a payout, similar in spirit to Spotify’s policy of only paying out on songs that reach 1,000 streams or more.
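
As a toy illustration of that gating rule (the scores and the threshold below are invented for the example, not a proposed standard):

```python
# Hypothetical per-source contribution scores from an attribution log.
contributions = {
    "https://example.com/investigation": 0.41,
    "https://example.com/explainer": 0.22,
    "https://example.com/passing-mention": 0.01,
}

# Invented cut-off, playing the role of Spotify's 1,000-stream floor.
PAYOUT_THRESHOLD = 0.05

eligible = {url: score for url, score in contributions.items()
            if score >= PAYOUT_THRESHOLD}
print(eligible)  # the passing mention drops out; the rest qualify for payment
```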

The business model: collective licence + usage-based payouts

The music analogy (ASCAP/BMI → iTunes/Spotify) isn’t perfect, and pretending it is will get you laughed out of any serious room.

I speak from experience.

It is, however, a useful analogy. It gives us a simple way to think about the problem, and it sets a precedent in the fight for creative rights. One part of the analogy certainly holds true: simple transactions at low cost beat bespoke, tedious negotiations when usage is high-volume and widely distributed.

Journalism needs a way for AI labs to license broadly without doing a thousand custom deals, and for freelancers and small publishers to get paid without hiring a negotiating team.

So the “better offer” has two legs:

  1. A low-cost, high-volume collective licence that AI labs can sign once and use widely.

  2. Usage-based payouts tied to measurable, attributable contribution, so payment follows value rather than brand power.
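
A minimal sketch of the second leg, assuming the per-source contribution scores already exist upstream (the fee and shares here are invented numbers):

```python
def distribute(licence_fee: float, contributions: dict[str, float]) -> dict[str, float]:
    """Split a collective licence fee pro-rata by measured contribution."""
    total = sum(contributions.values())
    return {source: licence_fee * share / total
            for source, share in contributions.items()}

# Payment follows measured value, not brand power: the freelancer is paid
# by the same rule as the large outlet.
print(distribute(10_000.00, {"Large outlet": 0.41,
                             "Regional paper": 0.22,
                             "Freelancer": 0.05}))
```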

This is not theoretical price discovery. The market is already experimenting with training licences in books; for example, the Authors Guild has described reported terms of a HarperCollins arrangement as roughly $5,000 per title, split between author and publisher for participants.

That isn’t journalism, and it doesn’t solve attribution. But it proves that “permissioned access” is becoming a commercial category.

In a different context, yet with striking similarity, Infactory CEO Brooke Hartley Moy told me at Web Summit Lisbon in November that “publishers need to think of themselves as data companies now, not traditional media companies”, because “the new web is made for bots and optimised for data points”.

(Side note: a new episode of the Around the Bloc podcast is coming out soon, where I chat with Brooke in depth about this problem. Subscribe to get first access to my conversation with her; we had a great discussion, and she’s fantastic!)

Why AI labs would opt in, even when they could ignore or opt out

AI labs will not adopt this out of kindness or some sense of morality. They’ll adopt it for three reasons: customer risk, product quality, and operational stability.

Customer risk shows up when enterprises start asking for provenance and audit trails as a procurement requirement. The SSRC paper explicitly frames transparency and disclosure as governance levers that can shape behaviour.

Product quality shows up in grounding: a system that can show exactly which sources informed an answer is easier to trust, sell, and debug than one that can’t.

Operational stability shows up in bot wars. Cloudflare has moved toward permission-based approaches for AI crawlers and announced “Pay Per Crawl,” positioning it as infrastructure for a new business model between content owners and crawlers.
You don’t need to love Cloudflare to see the signal: the internet’s plumbing is starting to treat AI access as something that can be negotiated, not assumed.
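
As a conceptual sketch of what “negotiated, not assumed” looks like at the edge (the crawler list and token check below are stand-ins for the idea, not Cloudflare’s actual mechanism; Pay Per Crawl is built around the HTTP 402 “Payment Required” status):

```python
AI_CRAWLERS = {"GPTBot", "ClaudeBot", "PerplexityBot"}  # illustrative user-agent substrings
LICENSED_TOKENS = {"demo-licence-token"}                # hypothetical licence registry

def gate_request(user_agent: str, licence_token: str | None) -> int:
    """Return an HTTP status code: 402 tells an unlicensed AI crawler to pay first."""
    if any(bot in user_agent for bot in AI_CRAWLERS) and licence_token not in LICENSED_TOKENS:
        return 402  # Payment Required: access is negotiated, not assumed
    return 200

assert gate_request("Mozilla/5.0 (human browser)", None) == 200
assert gate_request("GPTBot/1.1", None) == 402
assert gate_request("GPTBot/1.1", "demo-licence-token") == 200
```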

And the “bot traffic” numbers make the incentive clearer. Imperva/Thales research shows how large a share of web traffic is already automated; TollBit’s reported AI-bot growth suggests this isn’t slowing down.

A predictable licensing channel is cheaper than a permanent arms race.

A ruthless self-critique (no such thing as the ‘perfect’ solution, yet)

Any workable proposal needs to survive two tests: first, hostile readers; next, the reality of market forces.

“Micro-payments won’t add up.”
Fair. Articles aren’t songs, and journalism doesn’t work like music; it’s fragmented inputs compiled into an answer. That’s why payouts must be tied to measurable contribution and aggregated at scale; otherwise you create a system that costs more to run than it pays out.

“Providers will hide the data.”
Also fair. The SSRC paper explicitly raises concerns about selective disclosure.
That’s why the enforcement mechanism cannot simply be “please disclose.” It has to be embedded in contracts (enterprise buyers demand it), infrastructure (access controls), and, when needed, litigation leverage. And it needs to be incredibly simple to deploy.

“Synthetic data means you’ll matter less.”
Perhaps, but not always. Synthetic data can expand a seed set, but it doesn’t eliminate the value of the high-quality human reporting that keeps systems grounded and on which models continue to rely. The continued dependence on the open web is visible in the very existence of retrieval systems, and in the SSRC finding that models vary widely in how they browse and cite, implying that design choices still dominate outcomes.

“Watermarking is broken.”
That’s why the emphasis shifts to provenance, fingerprints, and verifiable trails rather than traditional watermarks. Watermarking embeds a signal in the text itself, which is fragile under edits; fingerprinting is mathematical, comparing sources and outputs for overlap and similarity to detect likely derivation. Think Shazam, but for text.
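
Here’s a toy version of that idea, using hashed word shingles and Jaccard overlap. Production systems use sturdier schemes (MinHash, simhash, embeddings), but the principle is the same:

```python
import hashlib

def fingerprint(text: str, k: int = 5) -> set[str]:
    """Hash every k-word shingle so a source can be referenced without storing its text."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1)))
    return {hashlib.sha256(s.encode()).hexdigest()[:16] for s in shingles}

def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two fingerprints; high overlap suggests derivation."""
    return len(a & b) / len(a | b) if a | b else 0.0

source = fingerprint("the exclusive report found that the council approved the deal in secret")
answer = fingerprint("an llm answer noting the council approved the deal in secret")
print(round(similarity(source, answer), 2))  # nonzero overlap flags likely reuse
```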

Where Writers’ Bloc fits in

Instead of making a legal, or even an emotional, appeal, I firmly believe in the power of the market and in creating the right incentives.

As the late, great Charlie Munger, billionaire investor and long-time business partner of Warren Buffett, always said: “Show me the incentive and I’ll show you the outcome.”

This applies to the publishing and journalism markets too.

Writers’ Bloc, one implementation path for this architecture, changes the incentive structure by rethinking the underlying mechanism: fingerprints logged in an auditable registry, monitoring for suspected misuse, and payout rails so writers, publishers, and rights-holders can be compensated when their work materially informs model outputs. (That’s our approach; there will be others, I’m certain.)

If you’re struggling with how to implement this at your company or as a freelancer, reply directly or book a short call, and I will personally DM you. I want to help my most active subscribers.

Appendix

Glossary of Terms (AI jargon-buster)

  • Attribution gap: Relevant URLs visited minus URLs cited (SSRC metric).

  • Ecosystem exploitation: SSRC framing for LLM systems consuming web content without adequate credit/reciprocity.

  • RAG (retrieval-augmented generation): Systems that fetch documents at answer time and use them to generate responses.

  • No Search / No citation / High-volume, low-credit: SSRC’s three documented patterns in LLM search behaviour.

  • Standardised telemetry: A shared, machine-readable “receipt format” for recording what was fetched/used during an AI answer.

  • OpenTelemetry GenAI semantic conventions: Published conventions for recording GenAI operations as traces/spans.

  • Trace / span: A trace is the full timeline of one request; spans are the steps within it.

  • Hashing for provenance / Stable Source ID: Using a cryptographic fingerprint to reference a source without reproducing its text.

  • Relevance score: A recorded indicator of how strongly a source contributed to an answer (enables usage-based payouts).

  • Membership inference / influence functions: Research techniques that attempt to infer training-set inclusion or training-data influence; often noisy at scale and not a universal proof mechanism.

  • CDN-level blocking: Using infrastructure providers (e.g., Cloudflare) to block or charge crawlers at the network edge.

  • Bad bots: Automated traffic used for malicious or unwanted activity; Imperva/Thales report bad bots as a large share of total web traffic.
