In July 2025, Cloudflare made a significant announcement: they would block AI crawlers through a permission-based flow, removing the free and unfettered access to web content that AI model training has enjoyed, and which has underpinned much of the current AI surge.
This action raised critical questions about the future of AI model training, IP rights for public data, and data access processes. For many, this looks like a watershed moment in the battle over unregulated AI, a seismic shift in the relationship between AI models and the data they depend on. For others, it feels like a business move rather than an ethical one, an opportunity to create revenue streams from thin air.
Today, we’re going to dive into Cloudflare’s decision, look at what it might affect, and discuss the implications this move has created for AI systems.
How LLM AI Models Work
Before we discuss Cloudflare’s approach, we should talk about how large language models (LLMs) work, and how these systems use public data for training.
The key innovation behind modern LLMs, such as OpenAI’s GPT models and Anthropic’s Claude, is the transformer architecture. Unlike previous approaches, which processed text one token at a time, transformers analyze entire sequences at once, generating a relative weight between every pair of terms. In other words, the architecture looks at a whole sequence and establishes the context between its constituent parts, capturing both the relationships between words (relational context) and the likelihood of connected material appearing together in context.
Let’s look at a simplified example using the sentence: “Nordic APIs is the largest community of API practitioners and experts.” When processed by an LLM, this text is first tokenized:
["Nordic", "APIs", "is", "the", "largest", "community", "of", "API", "practitioners", "and", "experts", "."]
From here, each token is converted into a high-dimensional embedding vector:
"Nordic" -> [0.12, -0.45, 0.88, ..., 0.05]
"APIs" -> [-0.33, 0.20, -0.10, ..., 0.47]
From here, the model passes these embeddings through a mathematical process to ascertain the relationship between queries (Q), keys (K), and values (V). A common method for doing this is scaled dot-product attention, an equation that assigns an attention score between words:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
Put simply, this would look at our sentence and determine that a word like “community” has high attention scores with “largest”, “API”, and “experts”, whereas “Nordic” would be tied extremely closely to “APIs”.
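As an illustration, here is a minimal NumPy implementation of scaled dot-product attention. In a real transformer, Q, K, and V are produced by learned projections of the token embeddings; here they are random placeholders matching the toy example above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # raw pairwise attention scores
    # Softmax over the key dimension, with max subtraction for stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted sum of the values

# 12 tokens with 8-dimensional vectors, matching the sketch above.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(12, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (12, 8)
```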
From here, we could ask the model to generate a similar sentence. Using the weights it has generated, alongside other components such as multi-head attention and feed-forward networks, a similar sentence could be computed.
Why This Is Important
The reason this is important is that these models necessarily depend on context. The more data you feed into this process, the more accurate it will be. The single sentence we provided will only generate so many permutations — and they will look almost identical outside of the use of synonyms. To make “derivative yet original” content (an arguable concept in and of itself), more data is necessary to add noise, complexity, and contextualization to our word dictionary.
For AI model providers, this has led to a sort of soft arms race: the more data is available, the better the model, and the easier it is to fine-tune it to specific outcomes.
In some cases, this data is intrinsic to the model provider. For instance, a social media service might train its own AI on its users’ data. In other cases, the use of data is much murkier, such as Meta’s alleged use of pirated books to train its Llama models.
For several years, AI companies have relied on publicly accessible data and web scraping to collect massive datasets for model training. These datasets have driven soaring complexity and capability in AI models, fueling breakthroughs across everything from medical research to generative poetry.
The core problem, however, is that none of this material is truly “new.” AI models are derivative in nature, and when the datasets include copyrighted materials or data for which the creators and owners did not give express consent, huge issues arise, including:
- Data privacy protections, especially in GDPR jurisdictions, are often circumvented, with data subjects unable to give explicit consent for their data to be used in training.
- Copyright concerns arise where models have been trained on copyrighted data, sometimes even generating allegedly infringing outputs as a result.
- Legal and ethical concerns surround the integrity of artists’ works and artists’ access to the market.
The Shifting Tide
With all of this context, you can see why Cloudflare would want to change the game. Right now, the data space is effectively a free-for-all in which Cloudflare pays to cache and serve content, services pay to host content, users often pay either directly or indirectly to generate content, and AI models use this data for free. A formal agreement would upend this system, creating a transparent ecosystem that balances AI innovation with creators’ rights and data protection.
The block, as implemented by Cloudflare, affects a few specific groups:
- AI companies like OpenAI, Anthropic, and Google that leverage large-scale data ingestion for their training models.
- Content creators and IP holders who often have their content absorbed into AI models without consent.
- API providers who leverage open or closed models to offer specific features and services.
How the AI Crawler Block Works
Cloudflare’s implementation uses a few systems to control AI access.
Firstly, it uses a user-agent filtering approach. This system identifies and blocks known AI crawler signatures, catching the lion’s share of above-board access. For providers who tag their crawlers as AI, this reroutes the crawl to another service or system custom-built for AI access and, failing that, blocks it outright.
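Cloudflare hasn’t published its exact rules, but a simplified sketch of user-agent filtering might look like the following. The crawler names are publicly documented AI crawler user agents; the routing actions and function are illustrative assumptions.

```python
# Minimal sketch of user-agent filtering; not Cloudflare's actual logic.
# GPTBot, ClaudeBot, and CCBot are publicly documented AI crawler agents;
# the routing decisions below are illustrative assumptions.
KNOWN_AI_CRAWLERS = {"GPTBot", "ClaudeBot", "CCBot"}

def route_request(user_agent: str, ai_endpoint_available: bool) -> str:
    """Decide how to handle a request based on its user-agent string."""
    if any(bot in user_agent for bot in KNOWN_AI_CRAWLERS):
        # Declared AI crawler: reroute to a purpose-built AI endpoint
        # if one exists; otherwise, block outright.
        return "reroute-to-ai-endpoint" if ai_endpoint_available else "block"
    return "serve"  # ordinary traffic passes through untouched

print(route_request("Mozilla/5.0 (compatible; GPTBot/1.0)", False))  # block
```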
Not all providers are above-board, however. For crawlers that don’t identify themselves as AI, rate limiting and request-pattern analysis can distinguish human from machine traffic, detecting abnormal behaviors indicative of automated scraping. This will capture a large number of systems that don’t declare themselves as AI.
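A toy version of that rate-based detection might look like the sliding-window check below. Real bot detection combines many more behavioral signals, and the thresholds here are invented for illustration.

```python
import time
from collections import defaultdict, deque

# Toy sliding-window rate check; thresholds are illustrative assumptions.
WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 50  # far beyond plausible human browsing

request_log: dict[str, deque] = defaultdict(deque)

def looks_like_scraper(client_ip: str) -> bool:
    """Flag clients whose request rate suggests automated scraping."""
    now = time.monotonic()
    window = request_log[client_ip]
    window.append(now)
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```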
Finally, a larger access control list (ACL) process will be implemented, giving creators a more nuanced system for controlling access by AI systems. This last system is the principal method by which “pay per crawl,” as it has often been termed, will be implemented, allowing traffic to be limited and gated.
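Cloudflare’s pay-per-crawl announcement described answering unpaid crawlers with HTTP 402 (Payment Required). The sketch below assumes a hypothetical header name and handler to show the general shape of that flow; it is not Cloudflare’s actual API.

```python
# Sketch of a pay-per-crawl gate. The HTTP 402 status comes from
# Cloudflare's announcement; the header name and pricing source here
# are hypothetical illustrations, not Cloudflare's actual interface.
PRICE_PER_CRAWL_USD = "0.01"  # illustrative price set by the creator

def handle_ai_crawl(crawler_id: str, has_paid: bool) -> tuple[int, dict]:
    """Return an HTTP status and headers for an identified AI crawler."""
    if has_paid:
        return 200, {}  # paid crawlers receive the content
    # Unpaid crawlers get a 402 plus the price required to proceed.
    return 402, {"X-Crawl-Price-USD": PRICE_PER_CRAWL_USD}
```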
Implications for Model Training
AI models need diverse data to scale their training and improve output quality. A shift away from the free-for-all that has dominated the AI landscape is a seismic one, with wide-ranging implications.
Firstly, when model training is not “free,” we are more likely to see smaller, higher-quality curated datasets. When you have to be pickier about your data, that data tends to be more curated and more valuable. We will also likely see fewer signs of AI model collapse, since curated datasets can filter out AI-generated content, which tends to be of lower quality and value than human-generated content.
This, paired with increased training costs, will also more than likely mean more expensive but more effective models in the short term. As AI providers shift away from free data, other solutions, such as synthetic data generation, may offset its reduced availability. That said, synthetic data could quickly accelerate model collapse, so it will likely be a temporary stopgap before providers pivot towards pay-per-crawl methods.
What This Means for IP Holders
For IP holders and creators, this is a huge win. Much like the data privacy and sovereignty laws (GDPR, CCPA, etc.) that forced companies to rethink their data storage and access processes, this sea change is likely to reshape the relationship between data and AI. While it won’t solve the problem entirely, it will create a barrier to simple data scraping, thereby returning value to data and focusing model training on more specific and actionable priorities.
This will also likely open up more legal avenues for securing data. We’ve already observed alleged data and copyright infringement by AI models at scale, but much of it goes unpunished because the legal status of scraped data remains an edge case. If there’s a pay-to-crawl system that a provider purposefully circumvents in order to access copyrighted material, that’s no longer a grey area: it’s a civil lawsuit that will be hard to defend against.
Implications for API Providers
API providers may be among the biggest beneficiaries of this change. APIs offer structured, governed access to data, which is exactly what AI companies will need following this change. Providers can monetize their APIs through rate limiting, access controls, and usage policies, and can control the quality of data flowing into these models. This will likely create a whole cottage industry of API-delivered training data that is clean, accurate, and copyright-free.
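As a rough sketch of what that governed access could look like, here is a hypothetical metered endpoint that checks an API key, enforces a per-key quota, and records usage for billing. All names, quotas, and records are illustrative assumptions.

```python
# Hypothetical metered training-data API: key check, quota, and usage
# accounting. Names, quotas, and billing details are illustrative.
MONTHLY_QUOTA = 100_000  # records per key per month

api_keys = {"key-abc": {"used": 0, "plan": "training-data-basic"}}

def fetch_training_records(api_key: str, count: int) -> list[dict]:
    """Serve curated records to a paying client, metering each request."""
    account = api_keys.get(api_key)
    if account is None:
        raise PermissionError("unknown API key")
    if account["used"] + count > MONTHLY_QUOTA:
        raise RuntimeError("monthly quota exceeded; upgrade the plan")
    account["used"] += count  # usage count feeds the billing pipeline
    return [{"id": i, "text": f"curated record {i}"} for i in range(count)]
```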
This also aligns more broadly with a shift towards API-first design, where APIs are replacing traditional scraping and manual integrations with smarter data provisioning and API access.
Implications for AI Orgs
Interestingly, this could benefit AI providers, even if it doesn’t seem so right off the bat. While access to free data has allowed these systems to generate strong models, they have often suffered from issues surrounding hallucinations and a lack of quality. When you’re training these models on billions of accessible web pages, it’s hard to identify what is quality and what isn’t.
While this will be more expensive for AI providers in the short term, the long-term result will likely be better control and more trust: things AI desperately needs to gain in 2025.
Big Ripple Effects
Ultimately, this move will be controversial with data access absolutists who think all public data should be available for AI to train upon. But for those who create the material and want to protect their creations, it is a huge win. Long term, this will be healthy for the industry and will lead to significant improvements, but in the short term, it sets up a contentious battle over ethics and data access in the coming decade.