Cloudflare makes it easy to block AI bots from scraping websites

Cloudflare, a global internet security firm that claims to protect about 20 percent of the world's web traffic, has launched an “easy button” for website owners who want to block AI services from accessing their content. The move comes at a time when demand for material used to train AI models has increased.

Cloudflare's core service, which acts as an Internet proxy, scans and filters web traffic before it reaches websites. On average, the firm says its network sees more than 57 million requests per second.

“To help preserve a safer internet for content creators, we've just launched a brand new 'easy button' to block all AI bots,” Cloudflare said in its announcement Wednesday. “We clearly hear that consumers don't want AI bots visiting their websites, and especially those who do so dishonestly.”

While some AI companies correctly identify their web-scraping bots and respect the website's stay-away instructions, not all of them are transparent about their activities.
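The "stay-away instructions" in question usually live in a site's robots.txt file. As a rough sketch of what a well-behaved crawler does before fetching a page, the Python below checks a hypothetical robots.txt with the standard library's `urllib.robotparser`. The file contents, the `may_fetch` helper, and the example.com URL are assumptions made for illustration; only the crawler names come from this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away two AI crawlers named in this article
# and allow every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt above permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Bytespider", "Mozilla/5.0"):
        print(agent, may_fetch(agent, "https://example.com/article"))
```

Nothing enforces this file. A crawler that never consults it, or that announces itself under a different name, is unaffected by it.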

The new simplified configuration is being made available to all Cloudflare customers, including its free tier customers.

Isolating AI bot activity

Along with its announcement, Cloudflare shared a wealth of information about the AI crawler activity it observes on its systems.

According to Cloudflare data, AI bots accessed about 39% of the top 1 million “Internet properties” using Cloudflare in June. However, only 2.98 percent of these properties took steps to block or challenge these requests. Cloudflare also mentions that “the higher-ranked (more popular) an internet property is, the more likely it is to be targeted by AI bots.”

Web crawlers operated by TikTok owner ByteDance, Amazon, Anthropic, and OpenAI were the most active, the firm said. The top crawler was ByteDance's Bytespider, which led in number of requests, scope of activity, and frequency of being blocked. GPTBot, operated by OpenAI and used to collect training data for products like ChatGPT, ranked second in both crawl activity and blocks.

Photo: Cloudflare

The web crawler for Perplexity, which has recently attracted controversy for its content crawling methods, was found to be visiting a percentage of the sites that Cloudflare protects.

Photo: Cloudflare

While website owners can implement their own rules to block well-known web crawlers, Cloudflare said that most of its customers who do so block only mainstream AI developers like OpenAI, Google, or Meta, but not the top crawlers from ByteDance or other companies.

AI vs. AI

Cloudflare's report highlights how some AI bot operators are resorting to spoofing tactics to circumvent blocking measures, trying to pass off their crawler activity as legitimate web traffic.

“Unfortunately, we have seen bot operators attempt to appear as if they are a real browser using a fake user agent,” Cloudflare wrote.
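To see why a fake user agent is effective, consider a filter that blocks requests purely on the name in the User-Agent header. The sketch below is an illustration only, not Cloudflare's method; the token list, the `blocked_by_user_agent` helper, and both header strings are hypothetical.

```python
# Lowercase name fragments for two crawlers mentioned in this article.
# The list is illustrative, not a complete or official blocklist.
AI_CRAWLER_TOKENS = ("bytespider", "gptbot")


def blocked_by_user_agent(user_agent: str) -> bool:
    """Naive filter: block only when the header names a listed AI crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_CRAWLER_TOKENS)


# Hypothetical header from a crawler that identifies itself honestly.
honest = "Mozilla/5.0 (compatible; GPTBot/1.0)"

# Hypothetical header from a crawler posing as a desktop browser.
spoofed = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)

print(blocked_by_user_agent(honest))   # True
print(blocked_by_user_agent(spoofed))  # False
```

Because the header is supplied by the client, a name-based rule catches only the crawlers that choose to identify themselves.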

As it turns out, AI is a key tool in the company's arsenal for preventing automated activity, whether from AI developers, search engines, or malicious attackers. Cloudflare said it uses a machine learning model to assign a “bot score” to each request made to a website protected by its services, with lower scores indicating that the request is more likely to come from a bot.

Drawing on Cloudflare's massive dataset on global Internet traffic, the model takes into account multiple signals, including the request's IP address, user agent, and behavior patterns, to determine the bot score.

Photo: Cloudflare

To illustrate this, Cloudflare said it looked at traffic from a specific bot known for its malicious behavior. The results were telling: all of the requests scored below 30 out of 100, with the majority falling in the bottom two bands, indicating a score of 9 or less. In other words, even with efforts to obfuscate its source, the bot's activity patterns gave it away, allowing Cloudflare to block it.
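A site acting on such a score might apply a simple cutoff. The following is a minimal sketch under stated assumptions: the `action_for_score` helper and its default threshold of 30 are hypothetical, chosen only to mirror the example above, and do not represent Cloudflare's actual rules or API.

```python
def action_for_score(score: int, block_below: int = 30) -> str:
    """Map a bot score to an action; lower scores are treated as more bot-like.

    The cutoff of 30 is an assumption taken from the example in this article,
    not a documented Cloudflare default.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    return "block" if score < block_below else "allow"


# Invented scores, used only to exercise the function.
for score in (4, 9, 29, 30, 85):
    print(score, action_for_score(score))
```

With these invented inputs, the first three requests are blocked and the last two are allowed.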

Protecting Web Content

Generative AI models rely on enormous volumes of existing content, much of it collected from around the web. For AI to keep providing current information, its developers need to keep gathering massive amounts of it.

Website owners and content creators are pushing back, with major publishers such as news organizations taking legal action against AI companies. In the aforementioned case of Perplexity, publications such as Forbes and Wired claim it is taking and republishing content without permission. Music publisher Sony warned more than 700 tech firms to stay away in May, and Warner Music Group followed suit this week.

The threat could be an existential one for publishers, should AI increasingly deliver information to users without citing the source. A recent study published by SparkToro CEO Rand Fishkin suggested that 60% of people searching for information on Google no longer visit the websites that offer it because Google's AI immediately provides short answers.

Edited by Ryan Ozawa.
