{% include 'fragments/documentation-toc.gmi' %}

## Documentation: Indexing

geminispace.info is a search engine for all content served over the Gemini protocol. It can help you track down textual pages (e.g. `text/gemini`, `text/plain`, `text/markdown`) whose content contains your search terms, but it can just as easily help you track down binary files (e.g. images, mp3s) that happen to be served over the Gemini protocol.

### What does geminispace.info index?

geminispace.info will only index content within Geminispace, and will neither follow nor index links out to other protocols, like HTTP or Gopher. We will only crawl outwards by following Gemini links found within `text/gemini` pages. If you return a `text/plain` mimetype for a page, Gemini links within it will not register with GUS (though the content of the `text/plain` page will itself get indexed).
geminispace.info does not crawl capsules behind Onion links.

Textual pages over 10 MB in size will not be indexed.
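
The rules above can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual code: the function names, the reading of "10 MB" as 10 MiB, and the detection of Onion links by a `.onion` host suffix are all assumptions.

```python
from urllib.parse import urlparse

# Assumption: "10 MB" is read as 10 MiB; the real limit may be computed differently.
MAX_TEXT_BYTES = 10 * 1024 * 1024


def should_follow(url: str) -> bool:
    """Follow only absolute gemini:// links that do not point at .onion hosts.

    Relative links would need to be resolved against the page URL first.
    """
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "gemini" and bool(host) and not host.endswith(".onion")


def extract_links(mimetype: str, body: str) -> list[str]:
    """Return link targets, but only for text/gemini documents."""
    if mimetype.split(";")[0].strip().lower() != "text/gemini":
        return []  # e.g. text/plain: content is indexed, links are ignored
    links = []
    in_preformatted = False
    for line in body.splitlines():
        if line.startswith("```"):
            in_preformatted = not in_preformatted
            continue
        if not in_preformatted and line.startswith("=>"):
            fields = line[2:].split(maxsplit=1)
            if fields:
                links.append(fields[0])
    return links


def should_index_text(size_in_bytes: int) -> bool:
    """Textual pages over the size limit are skipped."""
    return size_in_bytes <= MAX_TEXT_BYTES
```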

Please note that there are provisions in place for manually excluding content from indexing, which maintainers will typically use to exclude pages and domains that cause issues with index relevance or crawl success. GUS ends up crawling weird protocol experiments, proofs of concept, and whatever other bizarre bits of technical creativity folks put up in Geminispace, so it is a continual effort to keep the index healthy. Please don't take it personally if your content ends up excluded, and I promise we are continually working to make GUS indexing more resilient and scalable!
=> filters list of filtered URIs

Currently, content of the following types in particular is excluded:
* mirrors of large websites like Wikipedia or the Go docs (it's just too much to add to the index in its current state)
* mirrors of news sites from the common web (too big and too frequently changing)

### Indexing and Redirects
geminispace.info checks for specific return codes like 31 PERMANENT REDIRECT and will save this information.
When your capsule serves a permanent redirect for a URI, geminispace.info will not re-crawl that URI for at least a week.
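
As a rough sketch of the one-week rule, assuming the crawler stores the last status code and crawl time per URI (the names below are hypothetical, and the behavior for other status codes is not described here and is simply assumed to be "eligible"):

```python
from datetime import datetime, timedelta

PERMANENT_REDIRECT = 31
MIN_RECRAWL_DELAY = timedelta(days=7)  # "at least a week"


def may_recrawl(last_status: int, last_crawled: datetime, now: datetime) -> bool:
    """Hold back URIs that answered with a permanent redirect for one week."""
    if last_status == PERMANENT_REDIRECT:
        return now - last_crawled >= MIN_RECRAWL_DELAY
    # Assumption: every other status is eligible for the next crawl.
    return True
```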

### Controlling what geminispace.info indexes with a robots.txt
To control crawling of your site, you can use a "robots.txt" file. Place it in your capsule's root directory such that a request for "robots.txt" will fetch it. It should be returned with a mimetype of `text/plain`.
=> gemini://geminiprotocol.net/docs/companion/robots.gmi See the robots.txt companion spec for more details.

When interpreting a robots.txt, geminispace.info will use the first line that matches the URI to be visited.
Keep your robots file as simple as possible: avoid empty lines, wildcards and the like, and stick to the rules defined in the companion spec.

geminispace.info obeys the following user-agents, listed in descending priority:
* gus
* indexer
* *
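
To make the first-match rule and the user-agent priority concrete, here is a minimal sketch. It is not the crawler's real parser: `parse_robots` and `is_allowed` are hypothetical names, rules are matched as plain path prefixes, and it assumes that only the highest-priority user-agent group present in the file is consulted. `Allow` lines are handled only so that first-match ordering is visible; whether the crawler honors them is an assumption.

```python
from urllib.parse import urlparse

AGENT_PRIORITY = ("gus", "indexer", "*")


def parse_robots(text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lower-cased user-agent to its ordered (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    collecting_agents = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not collecting_agents:  # a new group starts here
                current_agents = []
                collecting_agents = True
            agent = value.lower()
            current_agents.append(agent)
            groups.setdefault(agent, [])
        elif field in ("allow", "disallow"):
            collecting_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(robots_text: str, url: str) -> bool:
    """Apply the first matching rule of the highest-priority group present."""
    groups = parse_robots(robots_text)
    path = urlparse(url).path or "/"
    for agent in AGENT_PRIORITY:
        if agent in groups:
            for directive, prefix in groups[agent]:
                if prefix and path.startswith(prefix):
                    return directive == "allow"
            return True  # the chosen group has no rule for this path
    return True  # no applicable group at all
```

With a file that blocks everything for `*` but only `/drafts/` for `gus`, this sketch would let the crawler fetch everything outside `/drafts/`, because the `gus` group takes precedence over `*`.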

### How can I recognize geminispace.info requests?

You can identify us by looking for any requests to your site made by the following IP addresses:

* IPv4: 82.165.79.210
* IPv6: 2a02:247a:207:8e00:1::1
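
If your server software exposes the remote address of each request, a check along these lines could flag crawler traffic in your logs. It is a sketch using Python's standard `ipaddress` module; `is_crawler` is a hypothetical helper name, and unwrapping IPv4-mapped IPv6 addresses is an extra convenience that may not apply to your setup.

```python
import ipaddress

# The two addresses listed above.
CRAWLER_ADDRESSES = {
    ipaddress.ip_address("82.165.79.210"),
    ipaddress.ip_address("2a02:247a:207:8e00:1::1"),
}


def is_crawler(remote_addr: str) -> bool:
    """Return True if remote_addr is one of the crawler's addresses."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a valid IP address at all
    # Dual-stack sockets may report IPv4 peers as ::ffff:a.b.c.d
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return addr in CRAWLER_ADDRESSES
```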

### Does GUS keep my content forever?

No. After repeated failed attempts to connect to a page (e.g. because it moved, because the capsule got taken down, or because of a server error on your host), we will invalidate that page in our index after 1 month of unavailability, thus removing it from search results.
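
The invalidation rule could look roughly like this. It assumes the index records when a page first became unreachable and treats "1 month" as 30 days; both the field and the function name are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

UNAVAILABLE_LIMIT = timedelta(days=30)  # assumption: "1 month" taken as 30 days


def should_invalidate(unavailable_since: Optional[datetime], now: datetime) -> bool:
    """Invalidate a page once it has been unreachable for the whole limit."""
    if unavailable_since is None:
        return False  # the last crawl attempt succeeded
    return now - unavailable_since >= UNAVAILABLE_LIMIT
```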
