Cloudflare on Wednesday introduced a way for its web hosting customers to prevent AI bots from scraping website content and using the data, without permission, to train machine learning models.
It said in a statement that it did so based on customer aversion to AI bots and “to help preserve a safer internet for content creators.”
"We hear clearly that customers don't want AI bots visiting their websites, and especially those that do so dishonestly," it said. "To help, we've added a brand new one-click option to block all AI bots."
There is already a widely available way for website owners to block bots: the robots.txt file. When placed in the root directory of a website, automated web crawlers are expected to notice and obey the file's directives telling them to stay out.
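A robots.txt file that asks every crawler to stay away from the entire site is only a few lines long:

```
# Served from https://example.com/robots.txt
User-agent: *
Disallow: /
```

The wildcard `User-agent` matches all crawlers, and `Disallow: /` covers every path; compliance, however, is entirely voluntary on the crawler's part.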
Given the widespread belief that generative AI is premised on piracy, and the many lawsuits attempting to hold AI companies accountable, firms trafficking in laundered content have graciously allowed web publishers to opt out of the plundering.
Last August, OpenAI published guidance on how to block its GPTBot crawler using a robots.txt directive, possibly out of concern about content being scraped and used for AI training without permission. Google followed suit the following month. Also in September of last year, Cloudflare began offering a way to block rule-abiding AI bots, and 85 percent of customers, it claims, enabled the block.
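Those opt-outs take the form of per-crawler robots.txt entries. OpenAI's published token is GPTBot, and Google's token for AI training (as opposed to search indexing) is Google-Extended:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that each AI operator must be named individually, and the list of crawlers to block keeps growing.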
Now the network services biz aims to provide a stronger barrier to bot entry. It said the internet is "now flooded with these AI bots," which visit about 39 percent of the top one million web properties served by Cloudflare.
The problem is that robots.txt, like the Do Not Track header implemented in browsers fifteen years ago to declare a preference for privacy, can be ignored, usually without consequence.
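Do Not Track illustrates the weakness well: it travelled as a single request header that servers were free to, and generally did, ignore:

```
GET /article HTTP/1.1
Host: example.com
DNT: 1
```

robots.txt works the same way in spirit: a polite request with no enforcement mechanism behind it.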
And recent reports suggest that AI bots do just that. Amazon said last week that it was looking into evidence that bots operated by AI search outfit Perplexity, an AWS client, had crawled websites, including news sites, and reproduced their content without proper credit or authorization.
Amazon cloud customers are supposed to obey robots.txt, and Perplexity was accused of not doing so. Aravind Srinivas, CEO of the AI upstart, denied that his business was ignoring the file, though he acknowledged that third-party bots used by Perplexity had been seen scraping pages against webmasters' wishes.
Spoofing
"Sadly, we've observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent," Cloudflare said. "We've monitored this activity over time, and we're proud to say that our global machine learning model has always recognized this activity as a bot, even when operators lie about their user agent."
Cloudflare said its machine learning scoring system consistently ranked the Perplexity bot below 30 between June 14 and June 27, indicating that its traffic is "likely automated."
This bot detection method relies on digital fingerprinting, a technique commonly used to track people online and deny them privacy. Crawlers, like individual internet users, often stand out from the crowd based on technical details that can be gleaned from network interactions.

These bots tend to use the same tools and frameworks to automate their website visits. And with a network that sees an average of 57 million requests per second, Cloudflare has plenty of data with which to determine which of those fingerprints can be trusted.
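To make the idea concrete, here is a minimal sketch of fingerprinting, assuming only two illustrative signals (header ordering and the advertised TLS cipher list). This is not Cloudflare's model, which draws on far more signals at vastly greater scale; the `fingerprint` function and its inputs are hypothetical:

```python
import hashlib

def fingerprint(headers: dict, tls_ciphers: list) -> str:
    """Illustrative only: hash the header names in the order sent,
    plus the client's advertised TLS cipher list. Two clients that
    claim the same User-Agent can still produce distinct prints."""
    material = "|".join(headers.keys()) + "||" + ",".join(tls_ciphers)
    return hashlib.sha256(material.encode()).hexdigest()[:16]

# A browser and a bot sending the same User-Agent string, but with
# different header ordering and cipher suites, are distinguishable.
browser = fingerprint(
    {"Host": "example.com", "User-Agent": "Mozilla/5.0", "Accept": "*/*"},
    ["TLS_AES_128_GCM_SHA256", "TLS_AES_256_GCM_SHA384"],
)
bot = fingerprint(
    {"User-Agent": "Mozilla/5.0", "Host": "example.com", "Accept": "*/*"},
    ["TLS_AES_128_GCM_SHA256"],
)
assert browser != bot
```

The point is that the User-Agent string is just one claim among many, and the surrounding signals are much harder for a bot operator to counterfeit consistently.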
So here's what it comes down to: machine learning models that defend against bots being used to feed AI models, available even to free-tier customers. All users have to do is click the Block AI Scrapers and Crawlers toggle in the Security -> Bots menu for a given website.
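Customers who have access to Cloudflare's bot scores (an Enterprise Bot Management feature) can go beyond the toggle and write their own firewall rule against the score the article describes. A sketch, assuming the standard `cf.bot_management` rule fields:

```
(cf.bot_management.score lt 30 and not cf.bot_management.verified_bot)
```

Paired with a block action, this drops traffic scored below the "likely automated" threshold while exempting crawlers Cloudflare has verified as legitimate.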
"We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection," Cloudflare said. "We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule, and evolve our machine learning models to help keep the internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on." ®