geminispace.info

gemini search engine
git clone https://git.clttr.info/geminispace.info.git

commit 429b9d4de04bf8b95967cd7ad15ac46f2d751cbc
parent 540e3751d94e548872537902632af06ec10f1f22
Author: Natalie Pendragon <natpen@natpen.net>
Date:   Sun, 15 Nov 2020 09:19:14 -0500

[serve] Update indexing documentation

Diffstat:
M serve/templates/documentation/indexing.gmi | 8 ++++++++
1 file changed, 8 insertions(+), 0 deletions(-)

diff --git a/serve/templates/documentation/indexing.gmi b/serve/templates/documentation/indexing.gmi
@@ -12,6 +12,14 @@ GUS is a search engine for all content served over the Gemini Protocol. It can h
 GUS will only index content within Geminispace, and will neither follow nor index links out to other protocols, like HTTP or Gopher. GUS will only crawl outwards by following Gemini links found within `text/gemini` pages. If you return a `text/plain` mimetype for a page, Gemini links within it will not register with GUS (though the content of the `text/plain` page will itself get indexed).
+Textual pages over 1MB in size will not be indexed.
+
+Please note that GUS' indexing has provisions for manually excluding content from it, which maintainers will typically use to exclude pages and domains that cause issues with index relevance or crawl success. GUS ends up crawling weird protocol experiments, proofs of concept, and whatever other bizarre bits of technical creativity folks put up in Geminispace, so it is a continual effort to keep the index healthy. Please don't take it personally if your content ends up excluded, and I promise we are continually working to make GUS indexing more resilient and scalable!
+
+### How often does GUS index?
+
+GUS currently tends to update its index a few times per month. The last updated date at the bottom of each page will tell you the last time this happened.
+
 ### Controlling what GUS indexes with a robots.txt
 To control crawling of your site, you can use a robots.txt file. Place it in your capsule's root directory such that a request for "robots.txt" will fetch it. It should be returned with a mimetype of `text/plain`.
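
As a rough illustration of the robots.txt mechanism described in the final paragraph of the diff, a minimal file might look like the sketch below. The user-agent token and the excluded path are placeholders for this example, not values prescribed by GUS.

```
# Example robots.txt, served from the capsule root with mimetype text/plain.
# "*" addresses all crawlers; "/private/" is a hypothetical path to exclude.
User-agent: *
Disallow: /private/
```

A crawler that honors robots.txt would then skip everything under the disallowed path while still crawling and indexing the rest of the capsule.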