commit 163d97f92c49c9a576b0f0bb4035b0cb8763bc16
parent e1d7853bbc6abda3d8c3ec8f59768eb31458c136
Author: René Wagner <rwa@clttr.info>
Date: Fri, 18 Feb 2022 18:37:43 +0100
add info about redirect indexing
Diffstat:
3 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/README.md b/README.md
@@ -45,10 +45,3 @@ Now you'll have created `index.new` directory, rename it to `index`.
## Running the test suite
Run: `poetry run pytest`
-
-
-## Roadmap / TODOs
-
-- TODO: add functionality to create a mock index
-- TODO: exclude raw-text blocks from indexed content
-- TODO: strip control characters from logged output like URLs
diff --git a/serve/templates/documentation/indexing.gmi b/serve/templates/documentation/indexing.gmi
@@ -12,7 +12,7 @@ geminispace.info is a search engine for all content served over the Gemini Proto
geminispace.info will only index content within Geminispace, and will neither follow nor index links out to other protocols, like HTTP or Gopher. We will only crawl outwards by following Gemini links found within `text/gemini` pages. If you return a `text/plain` mimetype for a page, Gemini links within it will not register with GUS (though the content of the `text/plain` page will itself get indexed).
-Textual pages over 5 MB in size will not be indexed.
+Textual pages over 10 MB in size will not be indexed.
Please note that GUS' indexing has provisions for manually excluding content from it, which maintainers will typically use to exclude pages and domains that cause issues with index relevance or crawl success. GUS ends up crawling weird protocol experiments, proofs of concept, and whatever other bizarre bits of technical creativity folks put up in Geminispace, so it is a continual effort to keep the index healthy. Please don't take it personally if your content ends up excluded, and I promise we are continually working to make GUS indexing more resilient and scalable!
@@ -20,6 +20,10 @@ Currently, especially content of the following types is excluded:
- mirrors of large websites like Wikipedia or the Go-docs (it's just too much to add to the index in its current state)
- mirrors of news sites from the common web (too big and changing too frequently)
+### Indexing and Redirects
+geminispace.info checks for specific status codes such as 31 PERMANENT REDIRECT and saves this information.
+When your capsule serves a permanent redirect for a URL, geminispace.info will not re-crawl that URL for at least a week.
+
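The re-crawl rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual crawler code: the helper `should_recrawl`, the constant names, and the exact seven-day delay are assumptions derived from the "at least a week" wording.

```python
from datetime import datetime, timedelta

# Gemini status code for a permanent redirect.
PERMANENT_REDIRECT = 31
# Assumption: exactly one week, taken from "at least a week" in the text.
REDIRECT_RECRAWL_DELAY = timedelta(days=7)


def should_recrawl(last_status: int, last_crawled_at: datetime, now: datetime) -> bool:
    """Hypothetical helper: decide whether a URL is due for another crawl."""
    if last_status == PERMANENT_REDIRECT:
        # A permanently redirected URL is skipped until the delay has passed.
        return now - last_crawled_at >= REDIRECT_RECRAWL_DELAY
    # Other status codes are outside the scope of this sketch.
    return True
```

For example, a URL that answered with 31 three days ago would be skipped, while one that answered with 31 eight days ago would be due again.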
### Controlling what GUS indexes with a robots.txt
To control crawling of your site, you can use a robots.txt file. Place it in your capsule's root directory such that a request for "robots.txt" will fetch it. It should be returned with a mimetype of `text/plain`.
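To check how a robots.txt file would be interpreted before publishing it, one option is Python's standard `urllib.robotparser`. The sketch below is only an illustration: the sample rules, the `indexer` user agent, and the `example.org` URLs are assumptions, and it does not claim to reproduce how GUS itself evaluates robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for a capsule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; only the path of the URL is matched.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(ROBOTS_TXT, "indexer", "gemini://example.org/private/notes.gmi")
allowed = is_allowed(ROBOTS_TXT, "indexer", "gemini://example.org/index.gmi")
```

With these sample rules, `blocked` is `False` and `allowed` is `True`.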
diff --git a/serve/templates/news.gmi b/serve/templates/news.gmi
@@ -2,10 +2,14 @@
## News
+### 2022-02-08 oopsie
+So the last refactor went...erm...upside down. We had an outage for a few hours because of this.
+I rolled the changes back and will make another attempt at a (hopefully successful) refactor in the next few days *fingers crossed*
+
### 2022-02-06 filtering clients
I've blocked two IPs for making the same stupid requests again and again:
2001:41d0:302:2200::180
-::ffff:193.70.85.11
+193.70.85.11
### 2022-01-25 a year after
One year ago today, geminispace.info was set up. You can probably guess what happened: the cert for the capsule expired today... :-D