commit 09abe013da9735c2007595fdc835cf928d6798b7
parent fd3a662f11bf8d1c1a52332951984ada1487c507
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat, 29 Feb 2020 08:13:22 -0500
Update README.md
Diffstat:
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
@@ -4,7 +4,9 @@ Note that doing this currently requires you to perform a
full crawl of Geminispace. With little content, and few
people hacking on this, it's probably fine, but we should
definitely keep tabs on this to ensure we're kind and
-respectful to content and server owners.
+respectful to content and server owners (I think the
+solution is a way to create a mock index sooner rather
+than later).
1. Get Python and [Poetry](https://python-poetry.org/docs/)
2. Generate a local Geminispace index with `poetry run crawl`
@@ -22,3 +24,27 @@ Please send patches to [~natpen/gus@lists.sr.ht](mailto:~natpen/gus@lists.sr.ht)
For an introduction to mailing list-based Git collaboration,
see [this introduction](https://git-send-email.io/), as well
as this guide to [mailing list etiquette](https://man.sr.ht/lists.sr.ht/etiquette.md).
+
+# Roadmap / TODOs
+
+- *general code cleanup*: most notably crawl.py. There are a lot
+ of hacks in there that I put in for expediency, but haven't
+ taken the time to address.
+- *improve the indexing*: currently, the URL is prepended to
+ the page content, and everything is simply indexed with the
+ default indexer. I think a better solution would be to have
+ URLs indexed with a URL-specific indexer that doesn't do
+ things like Porter stemming, which I assume the default
+ indexer is doing.
+- *extend the index to handle binary links in Geminispace*:
+ currently, there's a hack in the code to simply skip
+ anything that looks like a binary link. I think with the
+ above improvement to how indexing works, they could be
+ made very effectively searchable. Also in this vein,
+ binary links should probably be identified via their MIME
+ types, instead of the suffix hack used now.
+- *add tests*: there aren't any yet!
+- *add functionality to create a mock index*: this would
+ be useful for local hacking on serve.py, so one does
+ not need to perform a real scrape of Geminispace to do
+ said hacking.
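
The MIME-type idea from the binary links item above could look roughly like the sketch below. It is not code from crawl.py: the function names `parse_gemini_header` and `is_binary_response` are hypothetical, and it assumes the crawler has the raw Gemini response header line (`<STATUS> <META>`) on hand. It also assumes that an empty META on a success response means `text/gemini`, and that anything outside `text/*` counts as binary.

```python
from typing import Optional, Tuple


def parse_gemini_header(header: str) -> Tuple[int, str]:
    """Split a Gemini response header line into (status, meta).

    Hypothetical helper: expects a line such as '20 text/gemini\r\n'.
    """
    parts = header.strip().split(None, 1)
    if not parts:
        raise ValueError("empty response header")
    status = int(parts[0])
    meta = parts[1].strip() if len(parts) > 1 else ""
    return status, meta


def is_binary_response(header: str) -> Optional[bool]:
    """Classify a response by its MIME type rather than by URL suffix.

    Returns True for a non-text success response, False for a text one,
    and None when the response is not a success (2x status), since META
    carries a MIME type only for success responses.
    """
    status, meta = parse_gemini_header(header)
    if not 20 <= status <= 29:
        return None
    # Drop parameters such as '; charset=utf-8'.
    mime_type = meta.split(";", 1)[0].strip().lower()
    # Assumption: an empty META defaults to text/gemini.
    if not mime_type:
        mime_type = "text/gemini"
    # Assumption: everything outside text/* is treated as binary.
    return not mime_type.startswith("text/")
```

One trade-off to keep in mind: the suffix hack can skip a link without contacting the server, while this check needs at least the response header, so it costs one request per link.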