Business

OpenAI and Anthropic are ignoring an established rule that prevents bots scraping online content

Published June 21, 2024, by Admin

The world’s top two AI startups are ignoring requests by media publishers to stop scraping their web content for free model training data, Business Insider has learned.

OpenAI and Anthropic have been found to be either ignoring or circumventing an established web rule, called robots.txt, that is meant to prevent automated scraping of websites.

TollBit, a startup aiming to broker paid licensing deals between publishers and AI companies, found several AI companies are acting in this way and informed certain large publishers in a Friday letter, which was reported earlier by Reuters. The letter did not include the names of any of the AI companies accused of skirting the rule.

OpenAI and Anthropic have stated publicly that they respect robots.txt and blocks to their specific web crawlers, GPTBot and ClaudeBot.

However, according to TollBit’s findings, such blocks are not being respected as claimed. AI companies, including OpenAI and Anthropic, are simply choosing to “bypass” robots.txt in order to retrieve or scrape all of the content from a given website or page.

A spokeswoman for OpenAI declined to comment beyond pointing BI to a corporate blog post from May, in which the company says it takes web crawler permissions “into account each time we train a new model.” A spokesperson for Anthropic did not respond to emails seeking comment.

Robots.txt is a short plain-text file that has been used since the mid-1990s as a way for websites to tell bot crawlers they don’t want their data scraped and collected. It was widely accepted as one of the unofficial rules supporting the web.
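To make the mechanism concrete, here is a minimal sketch of how a crawler that chooses to honor robots.txt might check it before fetching a page. It uses Python's standard `urllib.robotparser` module. The sample file contents, the `is_allowed` helper, and the `example.com` URL are illustrative assumptions, not taken from any publisher's actual robots.txt; only the crawler names GPTBot and ClaudeBot come from the reporting above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to block two AI crawlers
# while leaving the site open to every other bot.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    is_allowed is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    for bot in ("GPTBot", "ClaudeBot", "ExampleBot"):
        print(bot, is_allowed(SAMPLE_ROBOTS_TXT, bot, page))
```

The check runs entirely on the crawler's side. Nothing in the file technically stops a bot that skips this step, which is why the rule depends on voluntary compliance.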

With the rise of generative AI, startups and tech companies are racing to build the most powerful AI models. A key ingredient is high-quality data. The thirst for such training data has undermined robots.txt and the unofficial agreements supporting the use of this code.

OpenAI is behind the popular chatbot ChatGPT. The company’s largest investor is Microsoft. Anthropic is behind another relatively popular chatbot, Claude. Its largest investor is Amazon.

Both chatbots serve up answers to user questions in the tone of a human. Such answers are only possible because the AI models they are built on include massive amounts of written text and data scraped from the web, much of it under copyright or otherwise owned by creators.

Several tech companies last year argued to the US Copyright Office that nothing on the web should be considered under copyright when it comes to AI training data.

OpenAI has struck a few deals with publishers for access to content, including Axel Springer, which owns BI. The US Copyright Office is set to update its guidance on AI and copyright later this year.

Are you a tech employee or someone else with a tip or insight to share? Contact Kali Hays at khays@businessinsider.com or on secure messaging app Signal at +1-949-280-0267. Reach out using a non-work device.
