Google’s search dominance leaves sites little choice on AI scraping

Published

August 16, 2024

Admin

The convenient artificial intelligence answers Google now puts at the top of its search results come at a steep cost to websites that users would otherwise visit. But many site owners say they can’t afford to block Google’s AI from summarizing their content.

That’s because the Google tool that sifts through web content to come up with its AI answers is the same one that keeps track of web pages for search results, according to publishers. Blocking Alphabet Inc’s Google the way sites have blocked some of its AI competitors would also hamper a site’s ability to be discovered online.

Google’s dominance in search – which a US federal court ruled last week is an illegal monopoly – is giving it a decisive advantage in the brewing AI wars, which search startups and publishers say is unfair as the industry takes shape. The dilemma is particularly acute for publishers, which must choose between offering up their content for use by AI models that could make their sites obsolete, or disappearing from Google search, a top source of traffic.

“It becomes like an existential crisis for these companies,” said Joe Ragazzo, publisher of the news site Talking Points Memo. “These are two bad options. You drop out and you die immediately, or you partner with them and you probably just die slowly, because eventually they’re not going to need you either.”

Google said AI Overviews – the summaries displayed at the top of Google search – are part of its longstanding commitment to serve higher quality information and bolster opportunities for publishers and other businesses. “Every day, Google sends billions of clicks to sites across the web, and we intend for this long-established value exchange with websites to continue,” a Google spokesperson said in a statement. “With AI Overviews, people find Search more helpful and they’re coming back to search more, creating new opportunities for content to be discovered.”

Since its earliest days, Google has deployed a piece of software known as the Googlebot to visit or “crawl” millions of websites, building up a detailed repository of the global internet. That index has posed a daunting barrier to entry for companies that have sought to build rival search engines over the years – even ones with deep pockets, like Microsoft Corp.

The rise of generative AI has touched off a new wave of startups seeking to offer search products in which AI models deliver succinct answers to users’ questions. The chatbots’ popularity has sparked a panic within Google about the future of its search engine, which for so long seemed invincible. But before these startups can truly threaten the search giant’s business, they must crawl the web. And that’s no easy feat.

Being crawled costs website owners money, computing power and storage, so many publishers include a file that sets out rules for bots visiting their sites. The companies given the most leeway are usually Google and Microsoft’s Bing, which can drive traffic to sites through their search engines.
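The file in question is typically robots.txt. As a rough illustration of the pattern described above, the Python sketch below parses a hypothetical robots.txt that admits Googlebot and Bingbot and turns every other bot away. The rules, the example.com URL and the ExampleStartupBot name are invented for illustration and are not taken from any real site; only the standard-library urllib.robotparser module is real.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the two big search crawlers get full access,
# every other bot is asked to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(agent, url="https://example.com/articles/1"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # ExampleStartupBot is a made-up name standing in for a new entrant.
    for bot in ("Googlebot", "Bingbot", "ExampleStartupBot"):
        print(bot, "allowed" if allowed(bot) else "blocked")
```

Under these assumed rules, a newcomer's crawler is asked to stay out by default, while the incumbents are not.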

But search startups can’t promise such traffic before gaining traction – which is one reason why the young firms have begun striking deals to pay publishers to license content, said Alex Rosenberg, chief executive officer of Tako Inc., an AI startup.

“Now you have a bunch of tech companies that are paying for content, they’re paying for access to that because they need it to be able to compete in any kind of serious way,” Rosenberg said. “Whereas for Google, they don’t really have to do that.”

Amid a wave of deal-making between media companies and AI startups, Google has been a notable holdout. With the exception of a reported $60 million deal with Reddit Inc., Google has signaled to publishers behind closed doors that it is not interested in negotiating, according to two people with knowledge of the matter, who asked not to be identified because the information is private.

Media companies have little leverage in these conversations. Earlier this year, Google rolled out AI Overviews, in which the company uses AI to give succinct answers to some of users’ questions at the top of the search page. Publishers were immediately concerned about the impact the answers could have on their traffic but had no clear way to address those fears.

Google uses a separate crawler for some AI products, such as its chatbot Gemini. But its main crawler, the Googlebot, serves both AI Overviews and Google search. A company spokesperson said Googlebot governs AI Overviews because AI and the company’s search engine are deeply entwined. The spokesperson added that its search results page shows information in a variety of formats, including images and graphics. Google also said publishers can block specific pages or parts of pages from appearing in AI Overviews in search results – but that would likely also bar those snippets from appearing across all of Google’s other search features, including web link listings.

Many publishers, which often rely on search engines for at least half their traffic, aren’t willing to take the risk of minimizing their reach.

Google’s position “understates the significant risk this poses to content creators, particularly those who rely on search visibility for their livelihood,” said Marc McCollum, who heads up innovation at Raptive, which represents publishers and influencers. “By opting out, creators may inadvertently reduce their overall search presence, which could harm their ability to reach audiences and generate revenue.”

Kyle Wiens, the CEO of iFixit, a website that publishes free online repair guides for consumer electronics, said the site’s relationship with Google is “much more tenuous” than with other AI companies. “I can block ClaudeBot from indexing us without harming our business,” Wiens wrote in an email, referring to the bot from generative AI startup Anthropic. “But if I block Googlebot, we lose traffic and customers.”
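The asymmetry Wiens describes can be sketched with a robots.txt file, the crawler-rules file that sites publish. The hypothetical rules below turn away ClaudeBot while leaving every other crawler, Googlebot included, untouched. This is not iFixit's actual configuration; the rules, the helper name blocked_bots and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one AI crawler is excluded, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_bots(robots_txt, bots, url):
    """List the bots in `bots` that the given rules ask not to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


if __name__ == "__main__":
    url = "https://example.com/guide/1"  # placeholder URL
    print(blocked_bots(ROBOTS_TXT, ["ClaudeBot", "Googlebot"], url))
```

Adding a matching Disallow block for Googlebot is just as easy technically. The cost Wiens points to is commercial: the same crawler feeds the search listings that bring in traffic and customers.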

Google’s deal with Reddit, where millions of users engage in heated debates about niche topics, offers the company a treasure trove of information for AI models. The deal coincided with changes Google has made that boosted the presence of results from forums like Reddit in search results, driving huge gains in traffic to the social media site. A spokesperson for Reddit said improvements in product quality and speed have also contributed to the growth in traffic.

Search startup Perplexity is in talks with Reddit about licensing content, but the Google deal has set a rate that is hard for a startup to match, according to a person familiar with the matter. Google said the deal with Reddit is a far-reaching partnership that covers more than just training data. The spokesperson for Reddit declined to comment on business discussions. Perplexity declined to comment.

Other search startups have concluded that the data is simply out of reach.

“We would need 20 years of our current revenue just to pay Reddit,” said Vladimir Prelovac, founder of Kagi, a search startup. “That’s not even a possibility I’m entertaining.”

Small startups aren’t alone in their struggles. OpenAI recently launched SearchGPT, a test version of its wildly popular chatbot tailored for search. Yet popular websites including Amazon, Goodreads and Uniqlo have blocked the GPT crawler from their sites, according to public documentation, potentially spelling trouble for OpenAI’s ambitions in search. OpenAI has said sites may appear in its search results even if they choose to exclude their content from AI training.

Prelovac said at least half of Kagi’s costs go toward crawling and other sources of search data. A detailed index of the web is table stakes for a search engine that wants to give users a thorough view of the contents of the internet. Yet for companies seeking to answer users’ questions directly using AI, a model popularized by ChatGPT, the data takes on another level of importance, Prelovac said.

“Generative AI models on their own are not very smart,” Prelovac said. “In order to have any sort of high quality generative AI output, you need to have access to that same search index.”

The ubiquity of robots.txt files, which set guidelines for crawling, forces startups to make complex decisions, said Richard Socher, founder of search startup You.com. The files have not been found to be legally binding, so companies may crawl public data so long as no log-in or subscriber credentials are required, Socher said.

“When we do crawl, we try not to overly burden any website,” he said. “Any website that has a robots.txt file that allows only Google to crawl and nobody else essentially supports a Google search monopoly.”

Neeva, a search startup founded by former Googlers that was bought by Snowflake Inc. last year, advocated for “crawl neutrality” to make it easier for startups to build out their search indexes. In the wake of a landmark court ruling finding that Google monopolized the online search market, the Justice Department is considering seeking remedies including forcing the search giant to share more data with competitors and even breaking up the company, Bloomberg has reported. One proposal that’s attracted considerable attention is requiring Google to share the data it collects through the Googlebot, or to open up its famous search index to its rivals. The European Union’s Digital Markets Act already requires Google to share some search query data.

For Wiens, the iFixit CEO, the advantage that Google has over other AI companies because of its search empire is at the heart of antitrust issues for the company. “Splitting Google search from their AI work,” he said, “would deconflict things.”

Search engine DuckDuckGo said the technological shifts underway in search make “Google’s index related to antitrust concerns even more problematic.”

“The search indexes are extremely important in the age of generative AI,” said Kamyl Bazbaz, the senior vice-president of public affairs at DuckDuckGo.

Regardless of the outcome of the antitrust case, the changes underway in the search landscape underscore the importance for publishers of controlling their own destiny and not becoming overly reliant on any one tech platform – including Google, said TPM’s Ragazzo.

“Our belief is you have to form real relationships with readers,” Ragazzo said, “and that’s how you build a publication that can withstand different eras.” – Bloomberg
