Update README.md - geminispace.info

commit 09abe013da9735c2007595fdc835cf928d6798b7
parent fd3a662f11bf8d1c1a52332951984ada1487c507
Author: Natalie Pendragon <natpen@natpen.net>
Date:   Sat, 29 Feb 2020 08:13:22 -0500

Update README.md

Diffstat:
M README.md  | 28 +++++++++++++++++++++++++++-

1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
@@ -4,7 +4,9 @@ Note that doing this currently requires you to perform a
 full crawl of Geminispace. With little content, and few
 people hacking on this, it's probably fine, but we should
 definitely keep tabs on this to ensure we're kind and
-respectful to content and server owners.
+respectful to content and server owners (I think the
+solution is that we need a way to create a mock index
+sooner than later).
 
 1. Get Python and [Poetry](https://python-poetry.org/docs/)
 2. Generate a local Geminispace index with `poetry run crawl`
@@ -22,3 +24,27 @@ Please send patches to [~natpen/gus@lists.sr.ht](mailto:~natpen/gus@lists.sr.ht)
 For an introduction to mailing list-based Git collaboration,
 see [this introduction](https://git-send-email.io/), as well
 as this guide to [mailing list etiquette](https://man.sr.ht/lists.sr.ht/etiquette.md).
+
+# Roadmap / TODOs
+
+- *general code cleanup*: most notably crawl.py. There are a lot
+  of hacks in there that I put in for expediency, but haven't
+  taken the time to address.
+- *improve the indexing*: currently, the url is prepended to
+  the page content, and everything is simply indexed with the
+  default indexer. I think a better solution would be to have
+  urls indexed with a url-specific indexer that doesn't do
+  things like, e.g., porter-stemming, which I assume the
+  default indexer is doing.
+- *extend the index to handle binary links in Geminispace*:
+  currently, there's a hack in the code to simply skip
+  anything that looks like a binary link. I think with the
+  above improvement to how indexing works, they could be
+  made very effectively searchable. Also in this vein,
+  binary links should be identified via their mime types
+  probably, instead of the suffix hack used now.
+- *add tests*: there aren't any yet!
+- *add functionality to create a mock index*: this would
+  be useful for local hacking on serve.py, so one does
+  not need to perform a real scrape of Geminispace to do
+  said hacking.

	geminispace.info gemini search engine
	git clone https://git.clttr.info/geminispace.info.git