The Role of Robots.txt in Modern AI and Web Governance

The robots.txt file, created in 1994 as part of the Robots Exclusion Protocol (REP), governs the behavior of web crawlers on the Internet. However, rapid advances in AI and shifting approaches to web governance are reshaping how robots.txt is understood and implemented.

The Birth of robots.txt

Robots.txt tells web crawlers which URLs they can and cannot access on a particular website. It has a foundational role in internet etiquette and governance.

Historical Context and Creation

Robots known as “web crawlers,” or simply “crawlers,” were introduced to find and index new and updated content. A crawler starts with a list of well-known domains, such as Wikipedia, visits each page, and records the links it finds there. It then follows those links to other sites and repeats the process, gradually building up an index of the web.
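
In code, that loop looks roughly like the minimal Python sketch below. This is an illustration only; the seed URL, page limit, and one-second delay are arbitrary example values, not settings used by any real crawler.

  # Minimal illustrative crawler: fetch a page, extract its links, repeat.
  import time
  import urllib.request
  from html.parser import HTMLParser
  from urllib.parse import urljoin

  class LinkExtractor(HTMLParser):
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(seed, max_pages=10):
      queue, seen = [seed], set()
      while queue and len(seen) < max_pages:
          url = queue.pop(0)
          if url in seen:
              continue
          seen.add(url)
          try:
              html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
          except Exception:
              continue  # skip pages that cannot be fetched
          parser = LinkExtractor()
          parser.feed(html)
          queue.extend(urljoin(url, link) for link in parser.links)
          time.sleep(1)  # be polite: pause between requests
      return seen

  print(crawl("https://en.wikipedia.org/wiki/Web_crawler"))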

In the early days of the internet, crawlers downloading website pages posed a significant challenge. Back then, internet connections were slow and expensive. This meant that even a few crawlers could quickly consume a website’s resources, especially for those hosting large archives, leading to high costs.

There was also no way to opt out if you decided you didn’t want these robots indexing your pages. To guide robots away from certain areas of the web, software engineer Martijn Koster proposed a standard for robot exclusion, which became known as the Robots Exclusion Protocol (REP).

The Social Contract of the Web

The REP asked web developers to add a text file named “robots.txt” to their domains to define which crawlers are not allowed to scan their websites. You can view this file by adding “/robots.txt” to the end of a domain name.

For instance, to access the robots.txt file for nDash, you’d visit “nDash.com/robots.txt.” A “Disallow” rule beneath each robot’s name tells that robot which parts of the website not to visit; “Disallow: /” tells it to stay off the site entirely.
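
As a simple illustration (not a copy of any real site’s file), a robots.txt might look like this:

  User-agent: *
  Disallow: /admin/
  Disallow: /drafts/

  User-agent: BadBot
  Disallow: /

Here every crawler is asked to stay out of the /admin/ and /drafts/ directories, while a crawler identifying itself as “BadBot” is asked not to visit the site at all.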

In a message to the WWW-Talk mailing list on February 25, 1994, Koster addressed the dual nature of robots on the web. He noted that robots, despite causing operational headaches at times, offer substantial advantages.

His message emphasized that the proposed standard’s purpose was to reduce these problems and potentially enhance the advantages that robots bring to the web. WWW-Talk is a public forum for technical exchanges among developers of World Wide Web software.

At this time, relatively few robots roamed the web, making it possible to create a list of all known robots in use. The standard didn’t require any server, client, or protocol changes; the robots’ creators were simply expected to respect these wishes as part of online etiquette.
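
That etiquette survives in modern tooling: a well-behaved crawler still checks robots.txt before fetching a page. Python’s standard library even ships a parser for the file; the sketch below, using a placeholder domain and bot name, shows the check a polite crawler performs.

  # Ask robots.txt whether a given bot may fetch a given URL.
  # The domain and user-agent here are placeholders for illustration.
  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("https://example.com/robots.txt")
  rp.read()  # fetch and parse the site's robots.txt

  if rp.can_fetch("ExampleBot", "https://example.com/some/page"):
      print("Allowed to crawl this page")
  else:
      print("robots.txt asks this bot to stay away")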

The Data Dilemma: AI and Web Governance

For many content marketing managers building new strategies around AI, the focus has shifted from search engine optimization (SEO) to handling the complexities of AI-driven data collection.

From Search Engines to AI Models

Robots.txt files were fit for purpose in their original role: guiding search engine crawlers so that websites weren’t overloaded with requests. The recent boom in generative AI, however, has raised a whole new set of challenges.

Today, one of the biggest challenges for content marketers and online publishers is AI companies scraping online data to train their models. The stakes are high: these models learn from existing content and are then typically used to generate new content designed to compete with that very content.

The Breaking Point

AI companies are hungry for training data to help them develop their language models. These models power interactive chatbots, voice assistants, and automated content generation. They also boost search engine functionality and debug code. Many organizations use data they have accessed online for free to train their models, which they then sell for a profit.

Theoretically, organizations can deter these bots from crawling their websites by disallowing them in their robots.txt files. In reality, this is not necessarily the case: crawler operators are not legally required to abide by robots.txt rules; the file is merely a social contract.

In addition, there are now so many AI crawlers on the web that blocking them all would be difficult. While OpenAI’s GPTBot is the most well-known, an increasing number of smaller bots are crawling the web in search of AI training data. As the social contract set out by robots.txt becomes increasingly strained and outdated, it’s clear that a new method of governance is required.

Should You Block AI Bots?

Deciding whether to block AI bots depends largely on the type of content you host and your overall web strategy. For many, the primary concern is the unauthorized use of content by AI to train models, which can then generate competing content.

This unauthorized data scraping can lead to potential revenue loss and dilution of brand uniqueness. For others, the presence of AI bots may actually enhance site visibility and audience engagement, particularly when these bots belong to major search engines or analytics services.

How to Block Unwanted AI Bots

Update Robots.txt with Precision

While the traditional use of robots.txt allows you to disallow specific bots, many modern AI bots do not necessarily adhere to these directives. Even so, regularly updating your robots.txt to spell out which bots are and are not allowed, based on their behavior and purpose, remains a worthwhile first step.
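
For instance, a robots.txt that asks known AI training crawlers to stay away might include directives like the following. GPTBot is OpenAI’s crawler, mentioned above; the other user-agent names are commonly cited examples and should be verified against each operator’s documentation before relying on them.

  User-agent: GPTBot
  Disallow: /

  User-agent: CCBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /

Each block names one crawler’s user-agent and asks it not to visit any path on the site; as noted above, compliance is voluntary.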

Implement More Robust Measures

Since robots.txt alone is often insufficient, consider the following measures:

  • CAPTCHA: Integrating CAPTCHA tests can effectively prevent bots from accessing your content. CAPTCHA challenges, especially those that are interactive or image-based, are hard for bots to bypass.
  • IP blocking and rate limiting: Detect and block IPs that exhibit bot-like behavior or exceed normal request rates (see the sketch after this list). Doing so can reduce the load on your servers from unwanted crawling.
  • Server-side analytics: Use advanced server-side analytics to detect unusual access patterns. These tools can help identify bot traffic based on behavior that differs from human users, such as high-speed data requests and navigating hidden links.
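
As a rough illustration of the rate-limiting and bot-detection ideas above, the Python sketch below tracks request timestamps per IP and flags clients that exceed a threshold or announce themselves with a known AI-crawler user-agent. The window, limit, and bot names are arbitrary assumptions for the example, not recommendations.

  # Illustrative per-IP rate limiter and user-agent check (not production code).
  import time
  from collections import defaultdict, deque

  WINDOW_SECONDS = 60                    # look at the last minute of traffic
  MAX_REQUESTS = 120                     # allow at most 120 requests per IP per window
  BLOCKED_AGENTS = ("GPTBot", "CCBot")   # example AI crawler user-agents

  request_log = defaultdict(deque)       # ip -> recent request timestamps

  def should_block(ip, user_agent):
      # Block declared AI crawlers outright.
      if any(bot.lower() in user_agent.lower() for bot in BLOCKED_AGENTS):
          return True
      # Drop timestamps outside the window, then check the request rate.
      now = time.time()
      log = request_log[ip]
      log.append(now)
      while log and now - log[0] > WINDOW_SECONDS:
          log.popleft()
      return len(log) > MAX_REQUESTS

  # Example usage inside a request handler:
  if should_block("203.0.113.7", "Mozilla/5.0 (compatible; GPTBot/1.0)"):
      print("Return HTTP 429 or 403 for this request")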

Legal and Technical Guidelines

Consult with legal and IT security professionals to align your bot management strategies with current laws and best practices. This strategy ensures you are not only protecting your content but also complying with data protection regulations.

Community and Collaboration

Engage with other web admins and industry groups to share insights and strategies for managing AI bots. Collective action and shared knowledge can lead to better defense mechanisms against malicious or unauthorized crawling.

Consequences and Responses

AI is shifting attitudes toward web crawling, and the diverse responses from web administrators offer insight into the current state of web governance.

Web Crawlers: The Good, The Bad, and The AI

There are many different types of web crawlers, each serving a different purpose. Search engines use general-purpose crawlers to collect information about web pages and index them, and incremental crawlers to keep that index updated. Marketers use focused crawlers to collect information about specific topics, while researchers use deep web crawlers to gather content that standard crawlers cannot reach and that ordinary search engines do not index.

AI-powered web crawlers such as GPTBot work similarly to general-purpose crawlers. The key difference is that instead of indexing the web, they use the content they collect to train machine learning (ML) models. The rise of AI crawlers in particular has raised concerns among web publishers, who are wary of the bots’ ability to scrape their original intellectual property (IP) and use it for unapproved purposes.

Resistance and Regulation

OpenAI’s web crawler, GPTBot, has been blocked by many of the most popular online publishing platforms, including major news companies like The New York Times, The Washington Post, and The Atlantic. One study showed that 633 of 1,160 surveyed news publishers (54.6%) had asked OpenAI, Google AI, or the non-profit Common Crawl to stop scanning their sites.

In 2023, The New York Times went as far as to sue OpenAI and Microsoft, claiming that millions of its articles were used to train automated chatbots that now compete with it. Instances like this have raised questions about whether robots.txt can continue to fulfill its purpose, and many organizations are calling for stronger measures to protect their IP.

Adapting Content Strategies for AI’s Impact on Web Governance

As AI continues to shape the internet, content marketing managers must take strategic steps to ensure their content remains visible online while also protecting its copyright.

Strategic Implications of AI for Content Visibility

Search engine results pages (SERPs) were once just lists of links. The integration of AI has significantly changed their layout by introducing visual aids, featured snippets, and search suggestions. These changes have affected content visibility by influencing the way users interact with content: users now expect short, concise blocks of content that answer their query without having to click through to links or interact with websites directly.

While this presents new obstacles, it also unlocks exciting possibilities for marketers willing to adapt. Organizations can differentiate by providing expert opinions, emphasizing unique insights, and offering deeper analyses. This richer content will pique user interest and encourage them to click through to learn more.

Protecting Copyright and Content in the AI Era

AI-based web scraping is constantly evolving, and regulations are changing with it. This new technology offers many benefits, but it also creates challenges. Many people are grappling with how to protect copyright and proprietary content on the web in the face of these challenges.

Content marketing managers are facing copyright issues as AI companies scrape data, and several options are currently available to safeguard content. Updating your robots.txt file can deter specific bots from scraping your website, though this method requires you to name each bot you’d like to block individually.

Adding a CAPTCHA test to your website can also help keep bots away from your content, and investing in tools that track visitor behavior and detect suspicious activity can further reduce scraping.

Robots.txt May No Longer Be Enough for Modern AI and Web Governance

The robots.txt file is a fundamental part of the Internet, shaping how information is discovered and indexed.

Many bots still follow the rules listed in robots.txt files. However, with AI, robots.txt is no longer as effective at protecting copyrights as it was in the early days.

As online etiquette continues to evolve, it is clear that a more advanced REP is required.

About the author

Aimee Pearcy is a tech journalist and B2B SaaS copywriter with over five years of experience. Check out her writer profile to learn how her experience can help level up your content strategy.