ai-control                                                    J. Jimenez
Internet-Draft                                                  Ericsson
Intended status: Informational                           6 November 2024
Expires: 10 May 2025


                       Robots.txt update proposal
                 draft-jimenez-tbd-robotstxt-update-00

Abstract

   This document proposes updates to the robots.txt standard to
   accommodate AI-specific crawlers, introducing a syntax for user-
   agent identification and policy differentiation.  It aims to enhance
   the management of web content access by AI systems, distinguishing
   between training and inference activities.

About This Document

   This note is to be removed before publishing as an RFC.

   Status information for this document may be found at
   https://datatracker.ietf.org/doc/draft-jimenez-tbd-robotstxt-update/.

   Discussion of this document takes place on the ai-control Working
   Group mailing list (mailto:ai-control@ietf.org), which is archived
   at https://mailarchive.ietf.org/arch/browse/ai-control/.  Subscribe
   at https://www.ietf.org/mailman/listinfo/ai-control/.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 10 May 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Revised BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  User-Agent Update
     1.3.  Robots.txt Update
   Acknowledgements
   Normative References
   Author's Address

1.  Introduction

   The current robots.txt standard inadequately filters AI crawlers due
   to its reliance on a "user-agent name" based approach and its
   limited syntax.  It is difficult to differentiate based on the
   intended use of the data, such as storage, indexing, training, or
   inference.

   We submitted a proposal to the AI-Control workshop
   (https://www.ietf.org/slides/slides-aicontrolws-ai-robotstxt-00.pdf).
   Based on further discussion, the following text may describe a
   solution to the problems identified at the workshop.

1.1.  Terminology
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   This specification makes use of the following terminology:

   Crawler:  A traditional web crawler.  This also covers crawlers
      operated by AI companies that do not use the gathered content to
      train any model, LLMs or otherwise, because their purpose is
      purely real-time data integration for inference.

   AI Crawler:  A specialized type of crawler employed by AI companies
      that uses the gathered content exclusively for training purposes
      rather than for inference.

1.2.  User-Agent Update

   Crawlers are normally identified by the HTTP User-Agent request
   header, the source IP address of the request, or the reverse DNS
   hostname of that address.

   A draft that defines a syntax for user-agents would be necessary.
   The syntax has to be extensible, so that not only AI crawlers but
   potentially other crawlers can use it.  It should not be mandatory
   for clients to implement, as it should remain backwards compatible.

   An absolutely minimal syntax would be similar to what we already see
   in the wild: most AI companies append the characters "-ai" to the
   user-agent name to indicate that the crawler is used for ingesting
   content into an AI system, for example:

   User-agent: company1-ai
   User-agent: company2-ai

   Alternatively, we could reuse existing identifiers such as URN
   namespaces (https://www.iana.org/assignments/urn-namespaces/urn-
   namespaces.xhtml) (e.g., urn:rob:...), CRIs
   (https://datatracker.ietf.org/doc/html/draft-ietf-core-href-16), or
   cryptographically derived identifiers; there are dozens of options
   at the IETF, so it is a matter of choosing the right one.

   The "-ai" syntax would indicate that the crawler using it is
   interested in training.  In this draft we treat inference as a
   separate process akin to normal web crawling and thus already
   covered.  This approach differs from draft-canel-robots-ai-control,
   as it does not require a new field in the robots.txt ABNF, such as
   the one shown below:

   User-Agent-Purpose: EXAMPLE-PURPOSE-1

1.3.  Robots.txt Update

   The RFC 9309 ABNF
   (https://datatracker.ietf.org/doc/html/rfc9309#name-formal-syntax)
   should be updated to accommodate the new User-Agent syntax.  If we
   continue with the "-ai" convention above, we could use regular
   expressions to express different policies for AI crawlers.  For
   example:

   *  Disallow all AI training:

      User-Agent: .*?-ai$
      Disallow: /

   *  Allow all images for training but disallow training on /maps for
      all agents that do AI training:

      User-Agent: .*?-ai$
      Allow: /images
      Disallow: /maps*

   *  Allow /local for cohere-ai:

      User-Agent: cohere-ai
      Allow: /local

   This proposal also differs from the new control rules
   DisallowAITraining and AllowAITraining proposed by
   draft-canel-robots-ai-control
   (https://datatracker.ietf.org/doc/draft-canel-robots-ai-control/).
   From a semantic perspective, it is problematic to create
   purpose-specific rules such as DisallowThisProperty and
   DisallowAnotherProperty that have the same meaning and effect as the
   existing verbs Disallow and Allow.  In our proposal the information
   about the agent's purpose is carried in the User-Agent itself, which
   makes it possible to filter out AI training agents using a simple
   regular expression and the existing semantics, as the sketch below
   illustrates.
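   The following non-normative Python sketch shows how a robots.txt
   processor might evaluate the regex-based User-Agent lines proposed
   above.  Everything in it is an assumption for illustration: the
   function names, the flat (pattern, directive, prefix) rule
   representation, and the simplified precedence (all matching groups
   are merged and the longest matching path prefix wins, loosely
   following RFC 9309) are not part of this proposal.

   import re

   def group_matches(ua_pattern, user_agent):
       # Assumption: the User-Agent line of a group may carry an
       # anchored regular expression such as ".*?-ai$" instead of
       # the literal product token defined by RFC 9309.
       return re.fullmatch(ua_pattern, user_agent) is not None

   def is_allowed(rules, user_agent, path):
       # rules: (ua_pattern, directive, path_prefix) tuples as parsed
       # from robots.txt.  Simplification: all groups whose pattern
       # matches the crawler are merged, and the longest matching
       # path prefix decides the outcome.
       best_len = -1
       allowed = True  # everything is allowed by default
       for ua_pattern, directive, prefix in rules:
           if not group_matches(ua_pattern, user_agent):
               continue
           if path.startswith(prefix) and len(prefix) > best_len:
               best_len = len(prefix)
               allowed = (directive == "Allow")
       return allowed

   # Sample rules mirroring the examples above.
   rules = [
       (r".*?-ai$", "Allow", "/images"),
       (r".*?-ai$", "Disallow", "/maps"),
       (r"cohere-ai", "Allow", "/local"),
   ]

   print(is_allowed(rules, "company1-ai", "/maps/city"))     # False
   print(is_allowed(rules, "company1-ai", "/images/a.png"))  # True
   print(is_allowed(rules, "crawler42", "/maps/city"))       # True

   Because the purpose marker lives in the user-agent name itself, a
   single anchored pattern such as ".*?-ai$" addresses every AI
   training crawler without adding new directives to robots.txt.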
Acknowledgements

   The author would like to thank Jari Arkko for his review and
   feedback on short notice.

Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

Author's Address

   Jaime Jimenez
   Ericsson
   Email: jaime@iki.fi