repo: gemini-site action: commit revision: path_from: revision_from: d64d801613d9341ad72980073c27152351178ce0: path_to: revision_to:
commit d64d801613d9341ad72980073c27152351178ce0
Author: Solderpunk
Date: Sun Nov 22 17:29:59 2020 +0100

    Add robots.txt and lightweight subscription companion protocols.

diff --git a/docs/companion/index.gmi b/docs/companion/index.gmi
new file mode 100644
index 0000000000000000000000000000000000000000..9055b1f93c934b5da4e10c7b44eb22dde9095670
--- /dev/null
+++ b/docs/companion/index.gmi
@@ -0,0 +1,6 @@
+# Gemini companion specifications
+
+The following "companion specifications" describe optional practices which are not part of the core Gemini protocol specification, but which are intended to establish clear guidelines for "good citizenship" in Geminispace and to increase inter-operability of Gemini software via widely adopted conventions.
+
+=> robots.gmi robots.txt for Gemini
+=> subscription.gmi Subscribing to Gemini pages
diff --git a/docs/companion/robots.gmi b/docs/companion/robots.gmi
new file mode 100644
index 0000000000000000000000000000000000000000..46e63e5eceebe889a46927862ddcb7b89b977471
--- /dev/null
+++ b/docs/companion/robots.gmi
@@ -0,0 +1,48 @@
+# robots.txt for Gemini
+
+## Introduction
+
+This document describes an adaptation of the web's de-facto standard robots.txt mechanism for controlling access to Gemini resources by automated clients (hereafter "bots").
+
+Gemini server admins may use robots.txt to convey their desired bot policy in a machine-readable format.
+
+Authors of automated Gemini clients (e.g. search engine crawlers, web proxies, etc.) are strongly encouraged to check for such policies and to comply with them when found.
+
+Server admins should understand that it is impossible to *enforce* a robots.txt policy and must be prepared to use e.g. firewall rules to block access by misbehaving bots. This is equally true of Gemini and the web.
+
+## Basics
+
+Gemini server admins may serve a robot policy for their server at the URL with path /robots.txt, i.e. the server example.net should serve its policy at gemini://example.net/robots.txt.
+
+The robots.txt file should be served with a MIME media type of text/plain.
+
+The format of the robots.txt file is as per the original robots.txt specification for the web, i.e.:
+
+* Lines beginning with # are comments
+* Lines beginning with "User-agent:" indicate a user agent to which subsequent lines apply
+* Lines beginning with "Disallow:" indicate URL path prefixes which bots should not request
+* All other lines are ignored
+
+The only non-trivial difference between robots.txt on the web and on Gemini is that, because Gemini admins cannot easily learn which bots are accessing their site and why (because Gemini clients do not send a user agent), Gemini bots are encouraged to obey directives for "virtual user agents" according to their purpose/function. These are described below.
+
+Despite this difference, Gemini bots should still respect robots.txt directives aimed at a User-agent of *, and may also respect directives aimed at their own individual User-agent which they, e.g., prominently advertise at the Gemini page of any public services they provide.
+
+## Virtual user agents
+
+Below are definitions of various "virtual user agents", each of which corresponds to a common category of bot. Gemini bots should respect directives aimed at any virtual user agent which matches their activity. Obviously, it is impossible to come up with perfect definitions for these user agents which allow unambiguous categorisation of bots. Bot authors are encouraged to err on the side of caution and attempt to follow the "spirit" of this system, rather than the "letter". If a bot meets the definition of multiple virtual user agents and is not able to adapt its behaviour in a fine grained manner, it should obey the most restrictive set of directives arising from the combination of all applicable virtual user agents.
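As an illustration of the file format and the "most restrictive" rule described above, here is a minimal Python sketch of how a bot might apply a policy. It is not part of the specification. The function names `parse_robots` and `is_allowed` are hypothetical. Letting consecutive "User-agent:" lines share the "Disallow:" lines that follow them, and treating an empty "Disallow:" as disallowing nothing, are assumptions carried over from the original web convention. "Most restrictive" is read here as the union of every prefix disallowed for * and for each virtual user agent the bot matches.

```python
from typing import Dict, Iterable, List


def parse_robots(text: str) -> Dict[str, List[str]]:
    """Map each lower-cased user agent to its disallowed path prefixes."""
    rules: Dict[str, List[str]] = {}
    current: List[str] = []  # agents named by the current run of User-agent lines
    in_rules = False         # True once a Disallow line has followed that run
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            continue  # all other lines are ignored
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:  # a new record starts here
                current, in_rules = [], False
            current.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif key == "disallow":
            in_rules = True
            if value:  # assumption: an empty Disallow disallows nothing
                for agent in current:
                    rules[agent].append(value)
    return rules


def is_allowed(rules: Dict[str, List[str]], path: str, agents: Iterable[str]) -> bool:
    """Apply the union of the rules for * and for every matching virtual agent."""
    applicable = {"*"} | {agent.lower() for agent in agents}
    return not any(
        path.startswith(prefix)
        for agent in applicable
        for prefix in rules.get(agent, [])
    )
```

A bot that both indexes and archives would call `is_allowed(rules, path, ["indexer", "archiver"])` and skip any path for which the result is false.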
+
+### Archiving crawlers
+
+Gemini bots which fetch content in order to build public long-term archives of Geminispace, which will serve old Gemini content even after the original has changed or disappeared (analogous to archive.org's "Wayback Machine"), should respect robots.txt directives aimed at a User-agent of "archiver".
+
+### Indexing crawlers
+
+Gemini bots which fetch content in order to build searchable indexes of Geminispace should respect robots.txt directives aimed at a User-agent of "indexer".
+
+### Research crawlers
+
+Gemini bots which fetch content in order to study large-scale statistical properties of Geminispace (e.g. number of domains/pages, distribution of MIME media types, response sizes, TLS versions, frequency of broken links, etc.), without rehosting, linking to, or allowing search of any fetched content, should respect robots.txt directives aimed at a User-agent of "researcher".
+
+### Web proxies
+
+Gemini bots which fetch content in order to translate said content into HTML and publicly serve the result over HTTP(S) (in order to make Geminispace accessible from within a standard web browser) should respect robots.txt directives aimed at a User-agent of "webproxy".
diff --git a/docs/companion/subscription.gmi b/docs/companion/subscription.gmi
new file mode 100644
index 0000000000000000000000000000000000000000..aa737445bb0a20c1a8db350ae36cf38241a266b7
--- /dev/null
+++ b/docs/companion/subscription.gmi
@@ -0,0 +1,96 @@
+# Subscribing to Gemini pages
+
+## Introduction
+
+This document describes a convention whereby Gemini clients can "subscribe" to a regularly updated Gemini page (such as the index page of a gemlog) even in the absence of a full-fledged syndication technology like Atom or RSS. It is intended as a lightweight alternative to such technologies to lower the barriers to publishing serial content in Geminispace which can be easily followed without tedious regular manual checking of bookmarks. In particular, it is an explicit goal that a simple, manually-updated, human readable index page of the type content authors would likely create anyway should be able to be subscribed to without any special changes being necessary. Obviously, such a convention will be less powerful than more complicated technologies such as Atom and will not work as well as more complicated technologies in all conceivable use cases. Nevertheless, it is expected to function adequately for a wide range of reasonable use cases. Nothing in this convention prevents content authors from simultaneously publishing an Atom feed if they wish to. In fact, this convention can ease the generation of said feeds.
+
+The remainder of this document describes how to interpret a single text/gemini document as if it were an Atom feed with all required elements present. The convention is described this way to ensure it is possible for clients to support both this lightweight subscription convention and subscribing to Atom feeds with a simplified codebase and consistent UI, and to demonstrate how simple automatic generation of Atom feeds is possible. Simpler clients which support only this subscription convention are free to ignore Atom elements as they see fit.
+
+## Feed elements
+
+The URL from which the text/gemini document is fetched serves as the feed's required "id" element and the recommended "link" element.
+
+The contents of the first header line in the document beginning with a single # serve as the feed's required "title" element. For this reason, authors are encouraged to use titles which provide their own context, e.g. "Abelard Lindsay's gemlog" rather than "My gemlog" or "Gemlog index".
+
+If a header line beginning with ## occurs in the document after the first line beginning with a single # but before any non-empty, non-header lines, its contents may serve as the feed's optional "subtitle" element.
+
+A feed's required "updated" element should be set equal to the most recent value from all the associated entries' required "updated" elements. If no entries can be extracted from the document, then the feed is empty (which is permitted by the Atom standard), and the feed's "updated" element should be set equal to the time the document was fetched.
+
+## Entry elements
+
+A feed's entry elements are derived from a subset of its link lines, if any are present.
+
+Each link line where the URL is followed by a label whose first 10 characters correspond to a date in ISO 8601 format (i.e. YYYY-MM-DD) represents a single entry. Link lines which do not meet this criterion are ignored.
+
+An entry's required "id" element and required "link" element with rel="alternate" ("link" elements are optional in Atom entries in general, but this convention does not assign "content" elements to entries and therefore a rel="alternate" link becomes required) are both equal to the URL of the corresponding link line.
+
+An entry's required "updated" element is noon UTC on the day indicated by the 10 character date stamp at the beginning of the corresponding link line's label.
+
+An entry's required "title" element is derived from what remains of the corresponding link line's label after discarding the first whitespace-separated component (which necessarily includes the date stamp). Clients may simply take the entirety of the remainder, but some simple sanitisation may be attempted to account for the fact that users may e.g. use labels with a separator between date and title such as "1965-03-23 - Gemini 3 launch successful!".
+
+## Example
+
+The Gemini document below, served from gemini://gemini.jrandom.net/gemlog/:
+
+```
+# J. Random Geminaut's gemlog
+
+Welcome to my Gemlog, where you can read every Friday about my adventures in urban gardening and abstract algebra!
+
+## My posts
+
+=> bokashi.gmi 2020-11-20 - Early Bokashi composting experiments
+=> finite-simple-groups.gmi 2020-11-13 - Trying to get to grips with finite simple groups...
+=> balcony.gmi 2020-11-06 - I started a balcony garden!
+
+## Other gemlogs I enjoy
+
+=> gemini://example.com/foo/ Abelard Lindsay's gemlog
+=> gemini://example.net/bar/ Vladimir Harkonnen's gemlog
+=> gemini://example.org/baz/ Case Pollard's gemlog
+
+=> ../ Back to my homepage
+
+Thanks for stopping by!
+```
+
+may be interpreted as equivalent to the following Atom feed:
+
+```
+<?xml version="1.0" encoding="utf-8"?>
+<feed xmlns="http://www.w3.org/2005/Atom">
+
+  <title>J. Random Geminaut's gemlog</title>
+  <link href="gemini://gemini.jrandom.net/gemlog/"/>
+  <updated>2020-11-20T12:00:00Z</updated>
+  <id>gemini://gemini.jrandom.net/gemlog/</id>
+
+  <entry>
+    <title>Early Bokashi composting experiments</title>
+    <link rel="alternate" href="gemini://gemini.jrandom.net/gemlog/bokashi.gmi"/>
+    <id>gemini://gemini.jrandom.net/gemlog/bokashi.gmi</id>
+    <updated>2020-11-20T12:00:00Z</updated>
+  </entry>
+
+  <entry>
+    <title>Trying to get to grips with finite simple groups...</title>
+    <link rel="alternate" href="gemini://gemini.jrandom.net/gemlog/finite-simple-groups.gmi"/>
+    <id>gemini://gemini.jrandom.net/gemlog/finite-simple-groups.gmi</id>
+    <updated>2020-11-13T12:00:00Z</updated>
+  </entry>
+
+  <entry>
+    <title>I started a balcony garden!</title>
+    <link rel="alternate" href="gemini://gemini.jrandom.net/gemlog/balcony.gmi"/>
+    <id>gemini://gemini.jrandom.net/gemlog/balcony.gmi</id>
+    <updated>2020-11-06T12:00:00Z</updated>
+  </entry>
+
+</feed>
+```
+
+## Shortcomings
+
+The primary shortcoming of this convention is that it does not convey a time of day at which posts are made nor a timezone in which the date stamp is valid. This makes lightweight subscription a poor match for applications where multiple updates are expected each day and the relative order of updates (both within and across feed sources) is important, such as following breaking news headlines, weather updates, traffic conditions, etc. Such applications are strongly encouraged to instead implement more robust subscription technologies such as Atom or RSS.
+
+This shortcoming is not expected to have serious implications for a wide range of common and valuable activities in Geminispace which operate at "human scale". For example, this convention is perfectly viable for an individual reader using their local client to subscribe to ten or twenty hand-picked gemlogs which update every few days with non-time-critical content about people's daily lives, hobbies, opinions on the state of the world, recipes, photos, etc. It is very rarely important to read content like this which was written by Alice on Wednesday morning before that which was written by Bob on Wednesday evening, or to know exactly when each person wrote their posts. If the time of day is relevant to the post content, the author will surely mention it.
diff --git a/docs/index.gmi b/docs/index.gmi
--- a/docs/index.gmi
+++ b/docs/index.gmi
@@ -5,6 +5,7 @@
 => faq.gmi Project Gemini FAQ
 => specification.gmi Protocol specification
 => best-practices.gmi Best practices for Gemini implementers
+=> companion/ Companion specifications
 ## Resources for beginners
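To make the feed and entry rules from subscription.gmi in the commit above concrete, here is a minimal Python sketch of how a client might derive the feed title, entries, and "updated" value from a fetched page. It is an illustration under stated assumptions, not a reference implementation. The names `Entry`, `Feed`, `resolve`, `parse_entry` and `parse_feed` are hypothetical. Resolving relative link URLs against the page URL is assumed because the Atom example in the commit uses absolute URLs. The leading characters stripped from titles are just one possible form of the optional sanitisation. The optional "subtitle" element and preformatted blocks are not handled.

```python
import re
from datetime import datetime, timezone
from typing import List, NamedTuple, Optional
from urllib.parse import urljoin, urlsplit

LINK_LINE = re.compile(r"^=>[ \t]*(\S+)(?:[ \t]+(.*))?$")
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}")


class Entry(NamedTuple):
    id: str            # also serves as the rel="alternate" link
    updated: datetime  # noon UTC on the date in the label
    title: str


class Feed(NamedTuple):
    id: str
    title: Optional[str]
    updated: datetime
    entries: List[Entry]


def resolve(base: str, ref: str) -> str:
    """Resolve ref against a gemini:// base URL (assumed scheme)."""
    if urlsplit(ref).scheme:
        return ref  # already absolute
    # urljoin only resolves relative references for schemes it knows about,
    # so borrow "http" for the join and swap the scheme back afterwards.
    joined = urljoin("http" + base[len("gemini"):], ref)
    return "gemini" + joined[len("http"):]


def parse_entry(base: str, line: str) -> Optional[Entry]:
    match = LINK_LINE.match(line)
    if not match or not match.group(2):
        return None  # not a link line, or a link without a label
    url, label = match.group(1), match.group(2).strip()
    if not DATE_PREFIX.match(label):
        return None
    try:
        day = datetime.strptime(label[:10], "%Y-%m-%d")
    except ValueError:
        return None  # looks like a date but is not a valid one
    updated = day.replace(hour=12, tzinfo=timezone.utc)
    parts = label.split(None, 1)  # drop the first whitespace-separated component
    rest = parts[1] if len(parts) > 1 else ""
    title = rest.lstrip("-:| \t") or label  # assumption: simple separator clean-up
    return Entry(resolve(base, url), updated, title)


def parse_feed(url: str, text: str, fetched: datetime) -> Feed:
    title: Optional[str] = None
    entries: List[Entry] = []
    for line in text.splitlines():
        if title is None and line.startswith("#") and not line.startswith("##"):
            title = line[1:].strip()
        entry = parse_entry(url, line)
        if entry is not None:
            entries.append(entry)
    # An empty feed takes the fetch time as its "updated" value.
    updated = max((e.updated for e in entries), default=fetched)
    return Feed(url, title, updated, entries)
```

Calling `parse_feed` with the example page's URL, its text, and the fetch time yields three entries whose ids are the resolved gemini:// URLs. Links whose labels lack a date prefix, such as the "Other gemlogs I enjoy" list, are skipped.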