geminispace.info

gemini search engine
git clone https://git.clttr.info/geminispace.info.git

commit 09abe013da9735c2007595fdc835cf928d6798b7
parent fd3a662f11bf8d1c1a52332951984ada1487c507
Author: Natalie Pendragon <natpen@natpen.net>
Date:   Sat, 29 Feb 2020 08:13:22 -0500

Update README.md

Diffstat:
M README.md | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
@@ -4,7 +4,9 @@ Note that doing this currently requires you to perform a
 full crawl of Geminispace. With little content, and few
 people hacking on this, it's probably fine, but we should
 definitely keep tabs on this to ensure we're kind and
-respectful to content and server owners.
+respectful to content and server owners (I think the
+solution is that we need a way to create a mock index
+sooner than later).
 
 1. Get Python and [Poetry](https://python-poetry.org/docs/)
 2. Generate a local Geminispace index with `poetry run crawl`
@@ -22,3 +24,27 @@ Please send patches to [~natpen/gus@lists.sr.ht](mailto:~natpen/gus@lists.sr.ht)
 For an introduction to mailing list-based Git collaboration, see
 [this introduction](https://git-send-email.io/), as well as this guide to
 [mailing list etiquette](https://man.sr.ht/lists.sr.ht/etiquette.md).
+
+# Roadmap / TODOs
+
+- *general code cleanup*: most notably crawl.py. There are a lot
+  of hacks in there that I put in for expediency, but haven't
+  taken the time to address.
+- *improve the indexing*: currently, the url is prepended to
+  the page content, and everything is simply indexed with the
+  default indexer. I think a better solution would be to have
+  urls indexed with a url-specific indexer that doesn't do
+  things like, e.g., porter-stemming, which I assume the
+  default indexer is doing.
+- *extend the index to handle binary links in Geminispace*:
+  currently, there's a hack in the code to simply skip
+  anything that looks like a binary link. I think with the
+  above improvement to how indexing works, they could be
+  made very effectively searchable. Also in this vein,
+  binary links should be identified via their mime types
+  probably, instead of the suffix hack used now.
+- *add tests*: there aren't any yet!
+- *add functionality to create a mock index*: this would
+  be useful for local hacking on serve.py, so one does
+  not need to perform a real scrape of Geminispace to do
+  said hacking.
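
The roadmap item on binary links proposes identifying them by MIME type rather than by URL suffix. A minimal sketch of that idea follows, relying on the fact that a Gemini response header has the form `<STATUS> <META>`, where META carries the MIME type on a 2x success status. The function names `parse_gemini_header` and `is_binary_response` are hypothetical and do not come from crawl.py, and treating an empty META as `text/gemini` is an assumption about the protocol default.

```python
from typing import Tuple


def parse_gemini_header(header_line: str) -> Tuple[int, str]:
    """Split a Gemini response header '<STATUS> <META>' into its two parts."""
    status_text, _, meta = header_line.strip().partition(" ")
    return int(status_text), meta.strip()


def is_binary_response(header_line: str) -> bool:
    """Return True when a successful response declares a non-text MIME type."""
    status, meta = parse_gemini_header(header_line)
    if not 20 <= status <= 29:
        # Not a success response, so there is no body to classify.
        return False
    # META may carry parameters, e.g. "text/gemini; charset=utf-8".
    mime_type = meta.split(";", 1)[0].strip().lower()
    if not mime_type:
        # Assumption: an empty META is treated as the text/gemini default.
        mime_type = "text/gemini"
    return not mime_type.startswith("text/")


if __name__ == "__main__":
    for header in ("20 text/gemini; charset=utf-8\r\n", "20 image/png\r\n"):
        print(repr(header), is_binary_response(header))
```

A crawler could call something like this on the header it already receives, so a link served as `image/png` is classified correctly even when its URL has no recognizable suffix.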