commit 99c223c0adf5fb12fc683a0571e2cc5f412f949e
parent 77a824e53cb77beb01341f6570c502123132daeb
Author: Natalie Pendragon <natpen@natpen.net>
Date: Sat, 9 May 2020 10:58:18 -0400
[crawl] Adjust link line regex to only match at beginning of line
The crawler was starting to run into errors on source code, which some
people are now hosting in Geminispace, and which sometimes has syntax
that includes `=>` in it. I suppose this could have happened in
non-code contexts as well, but this is the first time it seems to have
loudly broken the crawl.
This fixes it.
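
To illustrate the difference, here is a minimal sketch comparing the old and new patterns. The sample page and its URL are made up for the example; only the two regexes come from the crawler.

```python
import re

# Hypothetical sample page: one real link line plus a line of source code
# that merely contains "=>" in the middle of the line.
content = (
    "# Example page\n"
    "=> gemini://example.org/docs Documentation\n"
    "callback = (x) => x + 1\n"
)

# Old pattern: matches "=>" anywhere in the text.
unanchored = re.findall(r"=>\s(\S+)", content)

# New pattern: with re.MULTILINE, "^" matches at the start of every line,
# so only lines that begin with "=>" are treated as links.
anchored = re.findall(r"^=>\s(\S+)", content, re.MULTILINE)

print(unanchored)  # ['gemini://example.org/docs', 'x']
print(anchored)    # ['gemini://example.org/docs']
```

Without `re.MULTILINE`, `^` would only match at the very start of the content, so both parts of the change are needed together.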
Also, it occurs to me that I think there is a "raw-text block" type of
construct in the Gemini spec now, so I should probably add a TODO to
refactor the extract_gemini_links function to exclude any links found
within such a block.
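
A rough sketch of what that refactor might look like, assuming the spec's preformatted blocks are toggled by lines that start with three backticks. The function name and sample content are hypothetical, and this is not what the crawler does today.

```python
import re

# Same link pattern as the crawler, applied one line at a time.
LINK_PATTERN = re.compile(r"^=>\s(\S+)")

# Built from parts so this example can itself sit inside a fenced block.
FENCE = "`" * 3


def extract_links_outside_preformatted(content):
    """Collect link targets, skipping lines inside preformatted blocks."""
    links = []
    in_preformatted = False
    for line in content.splitlines():
        if line.startswith(FENCE):
            # A fence line toggles preformatted mode and is never a link.
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue
        match = LINK_PATTERN.match(line)
        if match:
            links.append(match.group(1))
    return links


# Hypothetical page with a link-like line inside a preformatted block.
sample = (
    "=> gemini://example.org/ Home\n"
    + FENCE + "\n"
    "=> gemini://example.org/not-a-link\n"
    + FENCE + "\n"
    "=> gemini://example.org/about About\n"
)

print(extract_links_outside_preformatted(sample))
# ['gemini://example.org/', 'gemini://example.org/about']
```

The result would still need to go through clean_links, as in extract_gemini_links.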
Diffstat:
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
@@ -44,4 +44,8 @@ as this guide to [mailing list etiquette](https://man.sr.ht/lists.sr.ht/etiquett
solution will become increasingly unappealing as the amount
of content, and thus amount of search hits, in Geminispace
grows).
+- **exclude raw-text links**: I think there is a "raw-text block"
+ type of construct in the Gemini spec now, so I should probably
+ add a TODO to refactor the extract_gemini_links function to
+ exclude any links found within such a block.
- **track freshness of content**
diff --git a/gus/crawl.py b/gus/crawl.py
@@ -100,8 +100,8 @@ def normalize_gemini_url(url):
return url_normalized, host_normalized
def extract_gemini_links(content, current_url):
- link_pattern = "=>\s(\S+)"
- links = re.findall(link_pattern, content)
+ link_pattern = "^=>\s(\S+)"
+ links = re.findall(link_pattern, content, re.MULTILINE)
gemini_links = clean_links(links, current_url)
return gemini_links