geminispace.info

gemini search engine
git clone https://git.clttr.info/geminispace.info.git
Log (Feed) | Files | Refs (Tags) | README | LICENSE

commit 99c223c0adf5fb12fc683a0571e2cc5f412f949e
parent 77a824e53cb77beb01341f6570c502123132daeb
Author: Natalie Pendragon <natpen@natpen.net>
Date:   Sat,  9 May 2020 10:58:18 -0400

[crawl] Adjust link line regex to only match at beginning of line

The crawler was starting to run into errors on source code, which some
people are now hosting in Geminispace, and which sometimes has syntax
that includes `=>` in it. I suppose this could have happened in
non-code contexts as well, but this is the first time it seems to have
loudly broken the crawl.

This fixes it.

Also, it occurs to me that I think there is a "raw-text block" type of
construct in the Gemini spec now, so I should probably add a TODO to
refactor the extract_gemini_links function to exclude any links found
within such a block.
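As a rough, hypothetical sketch of that TODO (not the actual implementation), the function could track gemtext's preformatted-block toggle, where lines beginning with ``` switch raw-text mode on and off, and ignore link lines while inside such a block. The function name mirrors the real one, but `clean_links` and the `current_url` handling from the real code are omitted here:

```python
import re

def extract_gemini_links(content, current_url=None):
    # Sketch only: skip link lines that fall inside a preformatted
    # ("raw-text") block. In gemtext, a line starting with ``` toggles
    # preformatted mode on or off.
    links = []
    in_preformatted = False
    for line in content.splitlines():
        if line.startswith("```"):
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue
        m = re.match(r"=>\s(\S+)", line)
        if m:
            links.append(m.group(1))
    return links
```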

Diffstat:
M README.md   | 4 ++++
M gus/crawl.py | 4 ++--
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
@@ -44,4 +44,8 @@ as this guide to [mailing list etiquette](https://man.sr.ht/lists.sr.ht/etiquett
 solution will become increasingly unappealing as the amount of
 content, and thus amount of search hits, in Geminispace grows).
+- **exclude raw-text links**: I think there is a "raw-text block"
+  type of construct in the Gemini spec now, so I should probably
+  add a TODO to refactor the extract_gemini_links function to
+  exclude any links found within such a block.
 - **track freshness of content**
diff --git a/gus/crawl.py b/gus/crawl.py
@@ -100,8 +100,8 @@ def normalize_gemini_url(url):
     return url_normalized, host_normalized

 def extract_gemini_links(content, current_url):
-    link_pattern = "=>\s(\S+)"
-    links = re.findall(link_pattern, content)
+    link_pattern = "^=>\s(\S+)"
+    links = re.findall(link_pattern, content, re.MULTILINE)
     gemini_links = clean_links(links, current_url)
     return gemini_links
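The effect of the one-character change can be shown directly: the old pattern matches `=>` anywhere in a line, so code such as a lambda arrow produces a bogus "link", while the anchored pattern with `re.MULTILINE` matches only lines that begin with `=>`. The sample content here is made up for illustration:

```python
import re

# One real Gemini link line plus a source-code line that happens to
# contain "=>" mid-line (the case that broke the crawl).
content = (
    "=> gemini://example.com/docs Documentation\n"
    "let f = fun x => x + 1\n"
)

# Old pattern: matches "=>" anywhere, so it also captures "x" from the
# code line.
old_links = re.findall(r"=>\s(\S+)", content)

# New pattern: "^" anchors to line start; re.MULTILINE makes "^" match
# after every newline, not just at the start of the string.
new_links = re.findall(r"^=>\s(\S+)", content, re.MULTILINE)

print(old_links)  # ['gemini://example.com/docs', 'x']
print(new_links)  # ['gemini://example.com/docs']
```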