Internet-Draft                robots-proposal              November 2024
Jimenez                                            Expires 10 May 2025
This document proposes updates to the robots.txt standard to accommodate AI-specific crawlers, introducing a syntax for user-agent identification and policy differentiation. It aims to enhance the management of web content access by AI systems, distinguishing between training and inference activities.¶
This note is to be removed before publishing as an RFC.¶
Status information for this document may be found at https://datatracker.ietf.org/doc/draft-jimenez-tbd-robotstxt-update/.¶
Discussion of this document takes place on the ai-control Working Group mailing list (mailto:ai-control@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/ai-control/. Subscribe at https://www.ietf.org/mailman/listinfo/ai-control/.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 10 May 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The current robots.txt standard is inadequate for filtering AI crawlers because it relies on matching user-agent names and offers only a limited syntax. In particular, it is difficult to differentiate crawlers based on the intended use of the gathered data, such as storage, indexing, training, or inference.¶
We submitted the following proposal to the AI-Control workshop: https://www.ietf.org/slides/slides-aicontrolws-ai-robotstxt-00.pdf. Based on further discussion, the text below describes a possible solution to the problems raised at the workshop.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This specification makes use of the following terminology:¶
Crawler: A traditional web crawler. This also covers crawlers operated by AI companies that do not use the gathered content to train any model, LLM or otherwise, because their purpose is purely real-time data integration for inference.¶
AI Crawler: A specialized type of crawler employed by AI companies that uses the gathered content exclusively for training rather than for inference.¶
Crawlers are normally identified by the HTTP User-Agent request header, the source IP address of the request, or its reverse DNS hostname.¶
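For instance, a crawler could identify itself with a User-Agent request header such as the one below; the product name and URL are invented placeholders:¶
User-Agent: Mozilla/5.0 (compatible; ExampleBot/1.0; +https://crawler.example/bot.html)¶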
A draft that defines a syntax for user-agents would be necessary. The syntax has to be extensible, so that not only AI crawlers but potentially other crawlers can use it. It should not be mandatory for clients to implement, so that the mechanism remains backwards compatible.¶
An absolutely minimal syntax would be similar to what we already see in the wild: most AI companies use the "-ai" characters at the end of the user-agent name to indicate that the crawler is used for ingesting the content into an AI system, for example:¶
User-agent: company1-ai
User-agent: company2-ai¶
Alternatively, we could reuse existing identifiers such as a URN namespace (e.g., urn:rob:...), CRIs, or cryptographically derived identifiers; there are dozens of options in the IETF, so it is a matter of choosing the right one.¶
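Purely as an illustration, a URN-based identifier could key a robots.txt group as shown below. The urn:rob namespace and the structure of the name are invented placeholders, and such identifiers would also require relaxing the product-token rule of RFC9309:¶
User-agent: urn:rob:example-org:ai-training
Disallow: /¶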
The "-ai" syntax would indicate that the crawler using it is interested in training. In this draft we treat inference as a separate process akin to normal web crawling, and thus already covered.¶
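For example, a hypothetical operator that runs both kinds of crawlers could distinguish them purely by name, with only the training crawler carrying the "-ai" suffix:¶
# Inference crawler, treated like any other web crawler
User-agent: company1
Allow: /

# AI training crawler of the same operator
User-agent: company1-ai
Disallow: /¶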
This approach differs from draft-canel-robots-ai-control, as it does not require a new field in the robots.txt ABNF such as the one shown below:¶
User-Agent-Purpose: EXAMPLE-PURPOSE-1¶
The RFC9309 ABNF should be updated to accommodate the new User-agent syntax; a sketch of such an update follows the examples below. If we continue with the "-ai" convention above, we could use regular expressions to indicate different policies to AI crawlers. For example:¶
Disallow all AI-training¶
User-agent: .*?-ai$
Disallow: /¶
For all agents that do AI training, allow training on /images but disallow training on /maps¶
User-agent: .*?-ai$
Allow: /images
Disallow: /maps*¶
Allow /local for cohere-ai¶
User-agent: cohere-ai
Allow: /local¶
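The following is a sketch of how the startgroupline and product-token rules of RFC9309 could be relaxed to admit such patterns. The agent-pattern and regex-pattern rule names are invented here, and the choice of regular-expression dialect is left open; this is an illustration rather than a complete grammar:¶
startgroupline = *WS "user-agent" *WS ":" *WS agent-pattern EOL
agent-pattern  = product-token / regex-pattern
regex-pattern  = 1*(UTF8-char-noctl)
                 ; interpreted as a regular expression matched against
                 ; the crawler's product token¶
A crawler that does not implement this extension would simply not match such a pattern literally and would fall back to any "*" group, which keeps the syntax backwards compatible for existing parsers.¶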
This proposal also differs from the new control rules DisallowAITraining and AllowAITraining proposed by draft-canel-robots-ai-control. From a semantic perspective, it is problematic to create purpose-specific rules such as DisallowThisProperty and DisallowAnotherProperty when they have the same meaning and effect as the existing verbs Disallow and Allow.¶
In our proposal the information about the agent's purpose is carried in the User-Agent itself, which makes it possible to filter out AI training agents using a simple regular expression and the existing semantics.¶
The author would like to thank Jari Arkko for his review and feedback on short notice.¶