commit 8feea97f7c065bb9a6b77828cc1cd8e7bc4cf84d
parent b2dfd8f3b26f53382350fe87a088e855479c9445
Author: René Wagner <rwa@clttr.info>
Date: Sat, 12 Aug 2023 20:57:57 +0200
docs: robots.txt clarification
Diffstat:
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/serve/templates/documentation/indexing.gmi b/serve/templates/documentation/indexing.gmi
@@ -8,8 +8,6 @@
geminispace.info is a search engine for all content served over the Gemini Protocol. It can help you track down textual pages (e.g. `text/gemini`, `text/plain`, `text/markdown`) with content containing your search terms, but it can just as easily help you track down binary files (e.g., images, mp3s) which happen to be served over the Gemini protocol.
-The main purpose of geminispace.info is to index "native" Gemini content. We are not interested in indexing mirrors of news sites, wikipedia and such. We will most likely exclude thos capsules if no proper robots.txt is in place.
-
### What does geminispace.info index?
geminispace.info will only index content within Geminispace, and will neither follow nor index links out to other protocols, like HTTP or Gopher. We will only crawl outwards by following Gemini links found within `text/gemini` pages. If you return a `text/plain` mimetype for a page, Gemini links within it will not register with GUS (though the content of the `text/plain` page will itself get indexed).
@@ -29,8 +27,11 @@ geminispace.info checks for specific return codes like 31 PERMANENT REDIRECT and
When your capsule serves a permanent redirect for a resource, geminispace.info will not re-crawl that resource for at least a week.
### Controlling what geminispace.info indexes with a robots.txt
+To control crawling of your site, you can use a robots.txt file. Place it in your capsule's root directory such that a request for "robots.txt" will fetch it. It should be returned with a mimetype of `text/plain`.
+=> gemini://gemini.circumlunar.space/docs/companion/robots.gmi See the robots.txt companion spec for more details.
-To control crawling of your site, you can use a robots.txt file, Place it in your capsule's root directory such that a request for "robots.txt" will fetch it. It should be returned with a mimetype of `text/plain`.
+When interpreting a robots.txt, geminispace.info will use the first rule that matches the URI to be visited.
+Be sure to order your rules accordingly if you want to use catch-all rules with wildcards or the "Allow" rule, which is not specified in the companion spec.
geminispace.info obeys the following user-agents, listed in descending priority:
* gus
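Since the first match wins, rule order is significant. Below is a minimal sketch of that first-match evaluation, assuming glob-style patterns; the rule set and function name are illustrative, not the actual geminispace.info code:

```python
# Minimal sketch of first-match robots.txt evaluation.
# Illustrative only; not the actual geminispace.info code.
import fnmatch

# Rules in file order. The "Allow" must be sorted before the wildcard
# "Disallow", or the first-match pass would never reach it.
RULES = [
    ("allow", "/blog/*"),
    ("disallow", "/*"),
]

def is_allowed(path: str) -> bool:
    """Return True if `path` may be crawled.

    The first rule whose pattern matches the path wins;
    a path matched by no rule is allowed by default.
    """
    for verb, pattern in RULES:
        if fnmatch.fnmatch(path, pattern):
            return verb == "allow"
    return True

assert is_allowed("/blog/post.gmi")    # caught by the Allow rule first
assert not is_allowed("/mirror/news")  # falls through to the wildcard Disallow
```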
diff --git a/serve/templates/news.gmi b/serve/templates/news.gmi
@@ -2,6 +2,10 @@
## News
+### 2023-08-12 robots.txt clarification
+We've added some clarification on how geminispace.info parses robots.txt files:
+=> /documentation/indexing Indexing documentation
+
### 2023-07-30 pubnix & robots.txt
When fetching a robots.txt, geminispace.info is now aware of pubnix-style user dirs (domain.com/~joe and domain.com/users/joe), allowing for per-user robots settings.
The provisions for this feature have been in the code for quite a long time, but we didn't make use of it until now.
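For the pubnix-style user dirs mentioned above, the crawler has to map a URL inside a user dir to that user's own robots.txt. Here is a rough sketch of how such a resolution could look, assuming only the two path shapes named in the entry; this is illustrative, not the actual geminispace.info implementation:

```python
# Rough sketch: find the robots.txt URL governing a given Gemini URL,
# honoring pubnix-style user dirs. Illustrative, not the actual GUS code.
import re
from urllib.parse import urlparse

# The two user-dir shapes from the news entry above.
USER_DIR = re.compile(r"^(/~[^/]+|/users/[^/]+)/")

def robots_url_for(url: str) -> str:
    """Return the robots.txt URL whose rules cover `url`."""
    parts = urlparse(url)
    match = USER_DIR.match(parts.path)
    prefix = match.group(1) if match else ""
    return f"{parts.scheme}://{parts.netloc}{prefix}/robots.txt"

print(robots_url_for("gemini://example.org/~joe/gemlog/post.gmi"))
# gemini://example.org/~joe/robots.txt
print(robots_url_for("gemini://example.org/page.gmi"))
# gemini://example.org/robots.txt
```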