Network Working Group G. Illyes
Internet-Draft Google LLC.
Intended status: Informational 18 October 2024
Expires: 21 April 2025
Robots Exclusion Protocol Extension for URI Level Control
draft-illyes-repext-02
Abstract
This document extends RFC9309 by specifying additional URI level
controls through an application level response header field and HTML
meta tags, both originally developed in 1996. Additionally, it moves
the response header field out of the experimental header space (i.e.
"X-") and defines how multiple header fields may be combined, which
was previously not possible.
About This Document
This note is to be removed before publishing as an RFC.
The latest revision of this draft can be found at
https://garyillyes.github.io/ietf-rep-ext/draft-illyes-repext.html.
Status information for this document may be found at
https://datatracker.ietf.org/doc/draft-illyes-repext/.
Source for this draft and an issue tracker can be found at
https://github.com/garyillyes/ietf-rep-ext.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 21 April 2025.
Illyes Expires 21 April 2025 [Page 1]
Internet-Draft REPext for URI level October 2024
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Conventions and Definitions
3. Specification
  3.1. Robots control
    3.1.1. Application Layer Response Header
    3.1.2. HTML meta element
    3.1.3. Robots controls rules
    3.1.4. Caching of values
4. Security Considerations
5. IANA Considerations
6. References
  6.1. Normative References
  6.2. Informative References
Acknowledgments
Author's Address
1. Introduction
While the Robots Exclusion Protocol enables service owners to control
how, if at all, automated clients known as crawlers may access the
URIs on their services as defined by [RFC8288], the protocol doesn't
provide controls on how the data returned by their service may be
used upon allowed access.
Use-case control has instead been left to URI level controls,
originally developed in 1996 and widely adopted since, which are
implemented in response headers or, in the case of HTML, as meta
tags. This document specifies these control tags and, in the case of
the response header field, brings it into compliance with [RFC9110].
Application developers are requested to honor these tags. The tags
are not, however, a form of access authorization.
2. Conventions and Definitions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
This specification uses the following terms from [RFC9651]:
Dictionary, List, String, Parameter.
3. Specification
3.1. Robots control
The URI level crawler controls are key-value pairs that can be
specified in two ways:
* as an application level response header structured field, as
specified by [RFC9651].
* in the case of HTML, as one or more meta tags, as defined by the
HTML specification.
3.1.1. Application Layer Response Header
The application level response header field "robots-tag" is a
structured field whose value is a dictionary containing lists of
rules applicable to either all accessors or specifically named ones.
For historical reasons, implementors should also support the
experimental field name, "x-robots-tag".
The value of the robots-tag field is a dictionary containing lists of
rules. The rules are specific to a single product token as defined
by [RFC9309] or a global identifier — "*". The global identifier may
be omitted. The product token is the first element of each list.
Duplicate product tokens must be merged and the rules deduplicated.
For example, the following response header field specifies "noindex"
and "nosnippet" rules for all accessors, but specifies no rules for
the product token "ExampleBot":
Robots-Tag: *;noindex;nosnippet, ExampleBot;
The global product identifier "*" in the value may be omitted; for
example, this field is equivalent to the previous example:
Robots-Tag: ;noindex;nosnippet, ExampleBot=;
The structured field in the examples is deserialized into the
following objects:
["*" = [["noindex", true], ["nosnippet", true]]], ["ExampleBot" = []]
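The mapping from the example field values to per-token rule lists can be illustrated with a short Python sketch. It is illustrative only: it is a lenient, hand-written splitter rather than a conforming [RFC9651] parser, and the function name parse_robots_tag is an assumption of this example, not something defined by this document. It treats an omitted product token as "*", merges duplicate product tokens, and deduplicates rules.

```python
def parse_robots_tag(value: str) -> dict[str, list[str]]:
    """Leniently parse a Robots-Tag field value into {product token: rules}.

    Assumption: members are separated by "," and elements by ";".
    This is not a full RFC 9651 structured field parser.
    """
    result: dict[str, list[str]] = {}
    for member in value.split(","):
        if not member.strip():
            continue  # ignore empty members such as a trailing ","
        parts = [part.strip() for part in member.split(";")]
        # The first element is the product token; an omitted token means "*".
        token = parts[0].rstrip("=").strip() or "*"
        # setdefault merges rules of duplicate product tokens into one list.
        rules = result.setdefault(token, [])
        for rule in parts[1:]:
            rule = rule.lower()  # rule values are case insensitive
            # Skip empty elements (e.g. a trailing ";") and duplicates.
            if rule and rule not in rules:
                rules.append(rule)
    return result
```

Under these assumptions, both example field values above parse to the same result, with "noindex" and "nosnippet" listed under "*" and an empty rule list under "ExampleBot".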
Implementors SHOULD impose a parsing limit on the field value to
protect their systems. The parsing limit MUST be at least 8
kibibytes [KiB].
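A parsing limit could be applied before any parsing takes place, as in the Python sketch below. This document does not say whether an over-long value should be truncated or ignored; the sketch assumes truncation, and the names MIN_PARSING_LIMIT and clamp_field_value are hypothetical.

```python
MIN_PARSING_LIMIT = 8 * 1024  # 8 KiB, the lower bound stated above


def clamp_field_value(raw: bytes, limit: int = MIN_PARSING_LIMIT) -> bytes:
    """Return at most `limit` bytes of a raw field value.

    Assumption: bytes beyond the limit are dropped rather than causing the
    whole field to be ignored.
    """
    if limit < MIN_PARSING_LIMIT:
        # A configured limit below 8 KiB would violate the lower bound.
        raise ValueError("parsing limit must be at least 8 KiB")
    return raw[:limit]
```

An implementation that prefers to ignore over-long values entirely could return an empty value instead of a truncated one.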
3.1.2. HTML meta element
For historical reasons, service owners may also specify robots-tag
rules in an HTML meta tag. In the meta tag, the name attribute
specifies the product token and the content attribute specifies the
comma separated robots-tag rules.
As with the header, the product token may be a global token,
"robots", which signifies that the rules apply to all requestors, or
a specific product token applicable to a single requestor. For
example:
<meta name="robots" content="noindex">
<meta name="examplebot" content="nosnippet">
Multiple robots meta elements may appear in a single HTML document.
Requestors must obey the sum of negative rules specific to their
product token and the global product token.
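The combination of global and product-specific meta elements can be sketched in Python with the standard library's html.parser module. The class and function names are hypothetical, and matching the name attribute case-insensitively is an assumption of this example. The sketch collects the union of rules from every meta element whose name is either "robots" or the requestor's own product token.

```python
from html.parser import HTMLParser

GLOBAL_TOKEN = "robots"  # the global token for meta elements


class RobotsMetaCollector(HTMLParser):
    """Collect rules from meta elements that apply to one product token."""

    def __init__(self, product_token: str) -> None:
        super().__init__()
        # Assumption: names are compared case-insensitively.
        self._names = {GLOBAL_TOKEN, product_token.lower()}
        self.rules: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        if name not in self._names:
            return  # addressed to a different product token
        content = attributes.get("content") or ""
        # The content attribute holds comma separated rules.
        self.rules.update(
            rule.strip().lower() for rule in content.split(",") if rule.strip()
        )


def meta_rules_for(html: str, product_token: str) -> set[str]:
    """Return the union of global and product-specific meta rules."""
    collector = RobotsMetaCollector(product_token)
    collector.feed(html)
    collector.close()
    return collector.rules
```

A requestor with no meta element of its own still receives the rules addressed to "robots", because the global token is always part of the set of names that the collector matches.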
3.1.3. Robots controls rules
The possible values of the rules are:
* noindex - instructs the parser to not store the served data in its
publicly accessible index.
* nosnippet - instructs the parser to not reproduce any stored data
as an excerpt snippet.
The values are case insensitive. Unsupported rules must be ignored.
Implementors may support other rules as specified in Section 2.2.4 of
[RFC9309].
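The handling of case and of unsupported rules can be sketched in Python as follows. The names SUPPORTED_RULES, supported_rules, and may_index are hypothetical, and the sketch assumes an implementation that supports only the two rules listed above.

```python
SUPPORTED_RULES = frozenset({"noindex", "nosnippet"})


def supported_rules(rules) -> set[str]:
    """Lowercase the given rules and drop any that are not supported."""
    return {rule.strip().lower() for rule in rules} & SUPPORTED_RULES


def may_index(rules) -> bool:
    """Return True unless a "noindex" rule is present in any letter case."""
    return "noindex" not in supported_rules(rules)
```

An implementation that supports further rules would add them to SUPPORTED_RULES; anything outside that set is ignored rather than treated as an error.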
3.1.4. Caching of values
The rules specified for a specific product token must be obeyed until
the rules have changed. Implementors MAY use standard cache control
as defined in [RFC9110] for caching robots-tag rules. Implementors
SHOULD refresh their caches within a reasonable time frame.
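One way to cache rules is to derive a freshness lifetime from the Cache-Control max-age directive of the response that carried them, as in the Python sketch below. It is a simplification: it ignores other directives such as no-store, and the 24 hour fallback in DEFAULT_TTL_SECONDS is an assumption of this example, not a value given by this document. All names are hypothetical.

```python
import re
import time
from typing import Optional

# Assumption: fallback lifetime used when the response carries no max-age.
DEFAULT_TTL_SECONDS = 24 * 60 * 60


def max_age_seconds(cache_control: Optional[str]) -> Optional[int]:
    """Extract the max-age directive from a Cache-Control value, if any."""
    if not cache_control:
        return None
    match = re.search(r"(?:^|,)\s*max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


class CachedRules:
    """Rules for one URI and product token, with a freshness lifetime."""

    def __init__(self, rules, cache_control: Optional[str] = None,
                 now: Optional[float] = None) -> None:
        self.rules = frozenset(rules)
        ttl = max_age_seconds(cache_control)
        self._ttl = DEFAULT_TTL_SECONDS if ttl is None else ttl
        # `now` can be injected for testing; otherwise use a monotonic clock.
        self._fetched_at = time.monotonic() if now is None else now

    def is_fresh(self, now: Optional[float] = None) -> bool:
        """Return True while the cached rules may still be used."""
        current = time.monotonic() if now is None else now
        return current - self._fetched_at < self._ttl
```

Once is_fresh returns False, the implementor would fetch the URI again and replace the cached rules with those from the new response.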
4. Security Considerations
The robots-tag is not a substitute for valid content security
measures. To control access to the URI paths in a robots.txt file,
users of the protocol should employ a valid security measure relevant
to the application layer on which the robots.txt file is served — for
example, in the case of HTTP, HTTP Authentication as defined in
[RFC9110].
The content of the robots-tag header field is not secure, private or
integrity-guaranteed, and due caution should be exercised when using
it. Use of Transport Layer Security (TLS) with HTTP ([RFC9110] and
[RFC2817]) is currently the only end-to-end way to provide such
protection.
In case of a robots-tag specified in an HTML meta element,
implementors should consider only the meta elements specified in the
head element of the HTML document, which is generally only accessible
to the service owner.
To protect against memory overflow attacks, implementors should
enforce a limit on how much data they will parse; see Section 3.1.1
for the lower limit.
5. IANA Considerations
TODO(illyes): https://www.rfc-editor.org/rfc/rfc9110.html#name-field-
name-registry
6. References
6.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997.
[RFC2817] Khare, R. and S. Lawrence, "Upgrading to TLS Within
HTTP/1.1", RFC 2817, DOI 10.17487/RFC2817, May 2000.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017.
[RFC8288] Nottingham, M., "Web Linking", RFC 8288,
DOI 10.17487/RFC8288, October 2017.
[RFC9110] Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
Ed., "HTTP Semantics", STD 97, RFC 9110,
DOI 10.17487/RFC9110, June 2022.
[RFC9309] Koster, M., Illyes, G., Zeller, H., and L. Sassman,
"Robots Exclusion Protocol", RFC 9309,
DOI 10.17487/RFC9309, September 2022.
[RFC9651] Nottingham, M. and P. Kamp, "Structured Field Values for
HTTP", RFC 9651, DOI 10.17487/RFC9651, September 2024.
6.2. Informative References
[KiB] "KibiByte", 14 October 2022.
Acknowledgments
TODO acknowledge.
Author's Address
Gary Illyes
Google LLC.
Brandschenkestrasse 110
CH-8002 Zürich
Switzerland
Email: garyillyes@google.com