Workgroup:
ai-control
Internet-Draft:
draft-jimenez-tbd-robotstxt-update-00
Published:
November 2024
Intended Status:
Informational
Expires:
10 May 2025
Author:
J. Jimenez
Ericsson

Robots.txt update proposal

Abstract

This document proposes updates to the robots.txt standard to accommodate AI-specific crawlers, introducing a syntax for user-agent identification and policy differentiation. It aims to enhance the management of web content access by AI systems, distinguishing between training and inference activities.

About This Document

This note is to be removed before publishing as an RFC.

Status information for this document may be found at https://datatracker.ietf.org/doc/draft-jimenez-tbd-robotstxt-update/.

Discussion of this document takes place on the ai-control Working Group mailing list (mailto:ai-control@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/ai-control/. Subscribe at https://www.ietf.org/mailman/listinfo/ai-control/.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 10 May 2025.

Table of Contents

1. Introduction
  1.1. Terminology
  1.2. User-Agent Update
  1.3. Robots.txt Update
Acknowledgements
Normative References
Author's Address

1. Introduction

The current robots.txt standard is inadequate for filtering AI crawlers because it relies on matching user-agent names and offers only a limited syntax. It is difficult to differentiate crawlers based on the intended use of the gathered data, such as storage, indexing, training, or inference.

We submitted the following proposal to the AI-Control WS: https://www.ietf.org/slides/slides-aicontrolws-ai-robotstxt-00.pdf. Based on further discussion, the following text describes a possible solution to the problems identified at the WS.

1.1. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

This specification makes use of the following terminology:

Crawler:

A traditional web crawler. This term also covers crawlers operated by AI companies that do not use the gathered content to train any model, LLM or otherwise, because their purpose is purely real-time data retrieval for inference.

AI Crawler:

A specialized type of crawler employed by AI companies, which utilizes the gathered content exclusively for training purposes rather than for inference.

1.2. User-Agent Update

Crawlers are normally identified by the HTTP User-Agent request header, the source IP address of the request, or the reverse DNS hostname of that address.
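
Operators that rely on reverse DNS typically perform a forward-confirmed lookup. As a non-normative illustration, the Python sketch below shows the technique; the suffix ".example.com" is a placeholder for a crawler operator's published domain, not a value defined by this document.

  import socket

  def verify_crawler(ip, allowed_suffix=".example.com"):
      # Forward-confirmed reverse DNS: the PTR hostname must fall under
      # the operator's domain and must resolve back to the same address.
      try:
          hostname, _, _ = socket.gethostbyaddr(ip)
          if not hostname.endswith(allowed_suffix):
              return False
          _, _, addresses = socket.gethostbyname_ex(hostname)
          return ip in addresses
      except (socket.herror, socket.gaierror):
          return False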

A draft that defines a syntax for user-agents would be necessary. The syntax has to be extensible, so that not only AI crawlers but potentially other kinds of crawlers can use it. It should not be mandatory for clients to implement, as the syntax should remain backwards compatible.

An absolutely minimal syntax would be similar to what we already see in the wild: most AI companies append the characters "-ai" to the user-agent name to indicate that the crawler is used for ingesting content into an AI system, for example:

  User-agent: company1-ai
  User-agent: company2-ai
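
As a non-normative sketch, a training crawler following this convention would simply send the suffixed product token in its User-Agent request header; "company1-ai" is a placeholder name, not a registered identifier.

  import urllib.request

  # A hypothetical training crawler identifying itself with the "-ai" suffix.
  req = urllib.request.Request(
      "https://www.example.com/",
      headers={"User-Agent": "company1-ai"},
  )
  with urllib.request.urlopen(req) as resp:
      body = resp.read()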

Alternatively, we could reuse existing identifiers such as a URN namespace (e.g., urn:rob:...), CRIs, or cryptographically derived identifiers; there are dozens of options at the IETF, so it is a matter of choosing the right one.

The -ai syntax would indicate that the crawler using it gathers content for training. In this draft we treat inference as a separate process akin to normal web crawling, and thus as already covered.

This approach differs from draft-canel-robots-ai-control, as it does not require a new field in the robots.txt ABNF, such as the one shown below:

  User-Agent-Purpose: EXAMPLE-PURPOSE-1

1.3. Robots.txt Update

The ABNF of [RFC9309] should be updated to accommodate the new User-Agent syntax. If we continue with the -ai convention above, we could use regular expressions to express different policies for AI crawlers. For example (a non-normative matching sketch follows these examples):

  • Disallow all AI training:

  User-Agent: .*?-ai$
  Disallow: /

  • Allow training on /images but disallow training on /maps for all AI crawlers:

  User-Agent: .*?-ai$
  Allow: /images
  Disallow: /maps*

  • Allow /local for cohere-ai:

  User-Agent: cohere-ai
  Allow: /local
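
The following Python sketch is a non-normative illustration of how a parser could evaluate such regex-based groups. The group structure, rule ordering, and helper names are assumptions made for illustration, and the matching logic simplifies the longest-match rule of [RFC9309].

  import re

  # Illustrative groups: (user-agent pattern, [(verb, path prefix), ...]).
  # More specific agent patterns are listed first, since this sketch
  # checks groups in order and uses the first matching group.
  GROUPS = [
      (r"cohere-ai", [("Allow", "/local")]),
      (r".*?-ai$", [("Allow", "/images"),
                    ("Disallow", "/maps"),
                    ("Disallow", "/")]),
  ]

  def is_allowed(user_agent, path):
      for pattern, rules in GROUPS:
          if re.fullmatch(pattern, user_agent):
              # Longest matching path prefix wins, as in RFC 9309.
              best = max((r for r in rules if path.startswith(r[1])),
                         key=lambda r: len(r[1]), default=None)
              return best is None or best[0] == "Allow"
      return True  # no matching group: access is allowed by default

  assert not is_allowed("company1-ai", "/maps/tile.png")  # training blocked
  assert is_allowed("company1-ai", "/images/cat.jpg")     # training allowed
  assert is_allowed("examplebot", "/maps/tile.png")       # ordinary crawler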

This proposal also differs from the new control rules DisallowAITraining and AllowAITraining proposed in draft-canel-robots-ai-control. From a semantic perspective, it is problematic to create purpose-specific rules such as DisallowThisProperty and DisallowAnotherProperty that have the same meaning and effect as the existing verbs Disallow and Allow.

In our proposal the information about the agent's purpose is carried in the User-Agent itself, which makes it possible to filter out AI training agents using simple regular expressions and the existing semantics.

Acknowledgements

The author would like to thank Jari Arkko for his review and feedback on short notice.

Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC9309]
Koster, M., Illyes, G., Zeller, H., and L. Sassman, "Robots Exclusion Protocol", RFC 9309, DOI 10.17487/RFC9309, September 2022, <https://www.rfc-editor.org/rfc/rfc9309>.

Author's Address

Jaime Jimenez
Ericsson