[serve] Add newest pages endpoint, revamp documentation and index - geminispace.info

commit c67268608f93486bba5ff7ce135549b73e98f5a7
parent 6df4e561eb1280bac4ddec89891a9851245604f4
Author: Natalie Pendragon <natpen@natpen.net>
Date:   Fri,  4 Sep 2020 07:50:54 -0400

[serve] Add newest pages endpoint, revamp documentation and index

Diffstat:
M serve/models.py  | 14 ++++++++++++++
M serve/templates/about.gmi  | 71 ++++++-----------------------------------------------------------------
A serve/templates/documentation/backlinks.gmi  | 19 +++++++++++++++++++
A serve/templates/documentation/indexing.gmi  | 32 ++++++++++++++++++++++++++++++++
A serve/templates/documentation/searching.gmi  | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
A serve/templates/fragments/documentation-toc.gmi  | 3 +++
M serve/templates/fragments/footer.gmi  | 2 ++
M serve/templates/index.gmi  | 15 +++++++++++++--
M serve/templates/newest_hosts.gmi  | 2 +-
A serve/templates/newest_pages.gmi  | 12 ++++++++++++
M serve/views.py  | 29 +++++++++++++++++++++++++++++

11 files changed, 181 insertions(+), 68 deletions(-)
diff --git a/serve/models.py b/serve/models.py
@@ -208,6 +208,20 @@ LIMIT 10
         return newest_hosts_query.execute()
 
 
+    def get_newest_pages(self):
+        newest_pages_query = Page.raw("""SELECT p.url, p.fetchable_url, MIN(c.timestamp) AS first_seen
+FROM page as p
+JOIN indexable_crawl AS ic
+ON ic.page_id == p.id
+JOIN crawl AS c
+ON c.page_id == p.id
+GROUP BY p.url
+ORDER BY first_seen DESC
+LIMIT 50
+""")
+        return newest_pages_query.execute()
+
+
     def get_search_suggestions(self, query):
         suggestions = []
         corrector = self.searcher.corrector("content")
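
Note on `get_newest_pages`: the query above takes each page's earliest crawl timestamp as `first_seen`, keeps only pages that have at least one `indexable_crawl` row, and returns the 50 most recent. Below is a minimal sketch of that behaviour against an in-memory SQLite database. The three-table schema is a pared-down assumption covering only the columns the query touches, not the project's real models, and the sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical, reduced schema: only the columns the query references.
SCHEMA = """
CREATE TABLE page (id INTEGER PRIMARY KEY, url TEXT, fetchable_url TEXT);
CREATE TABLE crawl (id INTEGER PRIMARY KEY, page_id INTEGER, timestamp TEXT);
CREATE TABLE indexable_crawl (id INTEGER PRIMARY KEY, page_id INTEGER);
"""

# Same statement as in the hunk above, reflowed for readability.
NEWEST_PAGES_SQL = """
SELECT p.url, p.fetchable_url, MIN(c.timestamp) AS first_seen
FROM page AS p
JOIN indexable_crawl AS ic ON ic.page_id == p.id
JOIN crawl AS c ON c.page_id == p.id
GROUP BY p.url
ORDER BY first_seen DESC
LIMIT 50
"""


def newest_pages(conn):
    """Return (url, fetchable_url, first_seen) tuples, newest first."""
    return conn.execute(NEWEST_PAGES_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, "gemini://a.example/", "gemini://a.example/"),
    (2, "gemini://b.example/", "gemini://b.example/"),
    (3, "gemini://c.example/", "gemini://c.example/"),
])
conn.executemany("INSERT INTO crawl (page_id, timestamp) VALUES (?, ?)", [
    (1, "2020-09-01 10:00:00"),
    (1, "2020-09-04 10:00:00"),  # a later re-crawl does not move first_seen
    (2, "2020-09-03 10:00:00"),
    (3, "2020-09-02 10:00:00"),  # crawled, but has no indexable_crawl row
])
conn.executemany("INSERT INTO indexable_crawl (page_id) VALUES (?)", [(1,), (2,)])

rows = newest_pages(conn)
for url, fetchable_url, first_seen in rows:
    print(first_seen[:10], url)
```

In this sketch page 3 is dropped by the inner join on `indexable_crawl`, and page 1 sorts after page 2 because `MIN(c.timestamp)` ignores its later re-crawl.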
diff --git a/serve/templates/about.gmi b/serve/templates/about.gmi
@@ -3,78 +3,19 @@
 
 ## About GUS
 
-GUS is a search engine for all content served over the Gemini Protocol. It can help you track down textual pages (e.g., `text/gemini`, `text/plain`, `text/markdown`) with content containing your search terms, but it can just as easily help you track down binary files (e.g., images, mp3s) which happen to be served over the Gemini protocol. GUS will only index content within Geminispace, and will not index links out to other protocols, like Http or Gopher.
-
-To control crawling of your site, you can use a robots.txt file, Place it in your document root directory such that a request for "robots.txt" will fetch it.
-
-GUS obeys User-agent of "gus" and "*". Additionally, you can identify the GUS by looking for any requests to your site made by the following IP addresses:
-
-* IPv6: 2604:a880:400:d0::17e4:b001
-* IPv4: 198.199.84.116
+GUS is a search engine for all content served over the Gemini Protocol. It provides a search interface, so you can look for content within Geminispace by keywords, content types, content sizes, and more. It also provides data on the size and characteristics of Geminispace itself.
 
 If you have questions about or ideas for GUS, please email me at vee@vnsf.xyz.
 
-## Advanced Searching
-
-### Filters
-
-To improve the quality of your search results, you can apply filters to constrain your search results in various dimensions. The currently implemented filters are:
-* content_type
-* domain
-* charset
-* size
-
-To filter by one of these, simply add it to your query followed by a colon, and the value you wish to filter by. Some examples of doing so follow.
-
-=> /search?content_type%3Aapplication/pdf application/pdf
-=> /search?content_type%3Aaudio audio
-=> /search?content_type%3Aimage/jpeg image/jpeg
-=> /search?content_type%3Ainput input
-
-=> /search?domain%3Acircumlunar domain:circumlunar
-=> /search?contextual%20domain%3Agus contextual domain:gus
-
-=> /search?computers%20content_type%3Agemini%20AND%20NOT%20charset%3AUS-ASCII computers content_type:gemini AND NOT charset:US-ASCII
-=> /search?NOT%20charset%3Anone NOT charset:none
-
-Note that size works slightly different than the other filters, as it is numeric. Typically, you will want to limit your search results to those less than, or greater than, a certain size.
-
-=> /search?computer%20AND%20size%3A%3E2000 computer AND size:>2000
-
-For further inspiration on how to use these filters, you can visit both GUS' list of known hosts, as well as GUS' list of known content_types and charsets on the statistics page. Note that there is some nuance to the charset values, due to the fact that specifying them is optional, and if one does not specify, there is a default of utf-8 - pages that do not specify a charset have an indexed charset value of "none".
-
-=> /known-hosts GUS Known Hosts (with list of domains)
-=> /statistics GUS statistics (with list of content_types)
-
-### Verbose Mode
-
-To allow greater insight into both how pages are ranking against each other, as well as when GUS crawled their content, you can enable verbose mode on any search results page. This will show the numerical score of each search result for the given query, the exact time that page was crawled, as well as its specified charset.
-
-There is a button at the top of each search results page to toggle verbose mode on or off, but you can also specify verbose mode manually in your URLs by utilizing the below pattern. Simply add a new "v" path component to the URL preceding the "search" path component. Below is an example:
-
-=> gemini://gus.guru/search?gemini
-=> gemini://gus.guru/v/search?gemini
-
-Note that verbose mode is sticky, and will persist between pages of results results, so you will need to manually toggle verbose mode off when you are finished with it.
-
-### Backlinks
-
-For a given page in Geminispace, backlinks are all the other pages in Geminispace that link to that page. When viewing GUS search results in verbose mode, a link to view each result's backlinks, if there any, will be provided.
-
-The URL structure for retrieving a certain URL's backlinks page is predictable, should you want to link directly to it in other contexts. All you need to do is URL encode the entire URL you want information on, then pass that as a query to gemini://gus.guru/backlinks. An example follows:
-
-=> gemini://gus.guru/backlinks?gus.guru
-
-Note the distinction between "internal" and "cross-capsule" backlinks. Internal backlinks are backlinks from within your own capsule to the given page. Cross-capsule backlinks are backlinks from other users' capsules. Note that the cross-capsule determination is slightly more advanced than purely checking if the hosts are different - it also takes into account different users on pubnixes, so, for example, gemini://foo.bar, gemini://foo.bar/~ronald, and gemini://foo.bar/~mcdonald would all be considered distinct capsules, as they are all presumably authored and maintained by distinct humans.
+### What's with the name?
 
-### Threads (coming soon!)
+GUS is both an acronym for Gemini Universal Search and a reference to Gus Grissom, one of the early astronauts involved in NASA's Gemini program.
 
-Oftentimes in Geminispace a post on someone's gemlog will generate a reply on someone else's gemlog. Sometimes many replies! Sometimes the replies generate their own replies! GUS Threads allow you to visualize and explore these threads within Geminispace. You can peruse threads freely, but you can also participate in them without needing any extra software on your end. Inside your reply post, simply link to the post you're replying to (which frankly most are already doing anyway!) and GUS will sort out the rest.
+### Documentation
 
-For those interested in more technical detail, what follows is a deeper description of how this functionality works, and how GUS determines which pages are eligible to participate in threads. The first important point is that some pages are indeed _not_ eligible to participate in threads - the point of threads is to capture connected discussion betwwen human authors, so there are rules to determine which pages seem likely to be gemlog posts. A lot of this logic is based on URL structure, and if you nest your post pages within a `gemlog`, `glog`, `log`, `glog`, `posts`, or one of several other similar URL components, GUS will opt the nested pages into threads. It works well in general cases, and I've also added special rules for atypical capsules as I've come across them (still feasible at the current size of Geminispace :), Gemlog Blue and The Boston Diaries.
+For more extensive documentation about how GUS works, please see the documentation pages.
 
-The next important piece of GUS Threads functionality is that, for all the eligible post pages in Geminispace at a given time, threads are constructed out of cross-capsule links between those pages (see above documentation on backlinks for more information about the cross-capsule distinction).
+{% include 'fragments/documentation-toc.gmi' %}
 
-One nuanced technical limitation is that a given page can only exist in a thread one time. And GUS keeps the "one time" that exists _latest_ in the thread.
 
 {% include 'fragments/footer.gmi' %}
diff --git a/serve/templates/documentation/backlinks.gmi b/serve/templates/documentation/backlinks.gmi
@@ -0,0 +1,19 @@
+{% include 'fragments/header.gmi' %}
+
+
+{% include 'fragments/documentation-toc.gmi' %}
+
+
+## Documentation: Backlinks
+
+### Backlinks
+
+For a given page in Geminispace, backlinks are all the other pages in Geminispace that link to that page. When viewing GUS search results in verbose mode, a link to view each result's backlinks, if there are any, will be provided.
+
+The URL structure for retrieving a certain URL's backlinks page is predictable, should you want to link directly to it in other contexts. All you need to do is URL encode the entire URL you want information on, then pass that as a query to gemini://gus.guru/backlinks. An example follows:
+
+=> gemini://gus.guru/backlinks?gus.guru
+
+Note the distinction between "internal" and "cross-capsule" backlinks. Internal backlinks are backlinks from within your own capsule to the given page. Cross-capsule backlinks are backlinks from other users' capsules. Note that the cross-capsule determination is slightly more advanced than purely checking if the hosts are different - it also takes into account different users on pubnixes, so, for example, gemini://foo.bar, gemini://foo.bar/~ronald, and gemini://foo.bar/~mcdonald would all be considered distinct capsules, as they are all presumably authored and maintained by distinct humans.
+
+{% include 'fragments/footer.gmi' %}
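
Note on the backlinks URL scheme described in this template: building such a link amounts to percent-encoding the whole target URL and appending it as the query string. A minimal sketch follows; the helper name `backlinks_url` is hypothetical, and the choice of `safe=""` (which also encodes `:` and `/`) is one reading of "URL encode the entire URL", not something the template specifies.

```python
from urllib.parse import quote

# Endpoint as given in the documentation text above.
BACKLINKS_ENDPOINT = "gemini://gus.guru/backlinks"


def backlinks_url(target_url):
    """Build a backlinks link for target_url (hypothetical helper)."""
    # safe="" percent-encodes every reserved character, including ":" and "/".
    return BACKLINKS_ENDPOINT + "?" + quote(target_url, safe="")


print(backlinks_url("gemini://gus.guru/"))
```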
diff --git a/serve/templates/documentation/indexing.gmi b/serve/templates/documentation/indexing.gmi
@@ -0,0 +1,32 @@
+{% include 'fragments/header.gmi' %}
+
+
+{% include 'fragments/documentation-toc.gmi' %}
+
+
+## Documentation: Indexing
+
+GUS is a search engine for all content served over the Gemini Protocol. It can help you track down textual pages (e.g., `text/gemini`, `text/plain`, `text/markdown`) with content containing your search terms, but it can just as easily help you track down binary files (e.g., images, mp3s) which happen to be served over the Gemini protocol.
+
+### What does GUS index?
+
+GUS will only index content within Geminispace, and will neither follow nor index links out to other protocols, like HTTP or Gopher. GUS will only crawl outwards by following Gemini links found within `text/gemini` pages. If you return a `text/plain` mimetype for a page, Gemini links within it will not register with GUS (though the content of the `text/plain` page will itself get indexed).
+
+### Controlling what GUS indexes with a robots.txt
+
+To control crawling of your site, you can use a robots.txt file. Place it in your capsule's root directory such that a request for "robots.txt" will fetch it. It should be returned with a mimetype of `text/plain`.
+
+GUS obeys User-agent of "gus" and "*".
+
+### How can I recognize GUS requests?
+
+You can identify GUS by looking for any requests to your site made by the following IP addresses:
+
+* IPv6: 2604:a880:400:d0::17e4:b001
+* IPv4: 198.199.84.116
+
+### Does GUS keep my content forever?
+
+No. After repeated failed attempts to connect to a page (e.g., because it moved, or because the capsule got taken down, or because of a server error on your host), GUS will eventually invalidate that page in its index, thus removing it from search results.
+
+{% include 'fragments/footer.gmi' %}
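
Note on the robots.txt section of this template: capsule authors may want to check how a rule set addressed to "gus" and "*" is interpreted before publishing it. The sketch below uses Python's standard `urllib.robotparser` for that. It shows how the standard-library parser reads the rules, which is an assumption about, not a statement of, how GUS itself evaluates them; the rules and the `example.org` URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one section for GUS and one for everyone else.
ROBOTS_TXT = """\
User-agent: gus
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if the stdlib parser lets `agent` fetch `url`."""
    # Only the path portion of the URL is matched against the rules,
    # so the gemini:// scheme does not affect the outcome.
    return parser.can_fetch(agent, url)


for agent, url in [
    ("gus", "gemini://example.org/private/notes.gmi"),
    ("gus", "gemini://example.org/index.gmi"),
    ("otherbot", "gemini://example.org/cgi-bin/run"),
]:
    print(agent, url, allowed(agent, url))
```

With this parser, an agent that has its own section is judged by that section alone, and the "*" section applies only to agents without one.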
diff --git a/serve/templates/documentation/searching.gmi b/serve/templates/documentation/searching.gmi
@@ -0,0 +1,50 @@
+{% include 'fragments/header.gmi' %}
+
+
+{% include 'fragments/documentation-toc.gmi' %}
+
+
+## Documentation: Searching
+
+### Filters
+
+To improve the quality of your search results, you can apply filters to constrain your search results in various dimensions. The currently implemented filters are:
+* content_type
+* domain
+* charset
+* size
+
+To filter by one of these, simply add it to your query followed by a colon, and the value you wish to filter by. Some examples of doing so follow.
+
+=> /search?content_type%3Aapplication/pdf application/pdf
+=> /search?content_type%3Aaudio audio
+=> /search?content_type%3Aimage/jpeg image/jpeg
+=> /search?content_type%3Ainput input
+
+=> /search?domain%3Acircumlunar domain:circumlunar
+=> /search?contextual%20domain%3Agus contextual domain:gus
+
+=> /search?computers%20content_type%3Agemini%20AND%20NOT%20charset%3AUS-ASCII computers content_type:gemini AND NOT charset:US-ASCII
+=> /search?NOT%20charset%3Anone NOT charset:none
+
+Note that size works slightly differently from the other filters, as it is numeric. Typically, you will want to limit your search results to those less than, or greater than, a certain size.
+
+=> /search?computer%20AND%20size%3A%3E2000 computer AND size:>2000
+
+For further inspiration on how to use these filters, you can visit both GUS' list of known hosts, as well as GUS' list of known content_types and charsets on the statistics page. Note that there is some nuance to the charset values, due to the fact that specifying them is optional, and if one does not specify, there is a default of utf-8 - pages that do not specify a charset have an indexed charset value of "none".
+
+=> /known-hosts GUS Known Hosts (with list of domains)
+=> /statistics GUS statistics (with list of content_types)
+
+### Verbose Mode
+
+To allow greater insight into both how pages are ranking against each other and when GUS crawled their content, you can enable verbose mode on any search results page. This will show the numerical score of each search result for the given query, the exact time that page was crawled, and its specified charset.
+
+There is a button at the top of each search results page to toggle verbose mode on or off, but you can also specify verbose mode manually in your URLs by utilizing the below pattern. Simply add a new "v" path component to the URL preceding the "search" path component. Below is an example:
+
+=> gemini://gus.guru/search?gemini
+=> gemini://gus.guru/v/search?gemini
+
+Note that verbose mode is sticky, and will persist between pages of results, so you will need to manually toggle verbose mode off when you are finished with it.
+
+{% include 'fragments/footer.gmi' %}
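
Note on the search links in this template: the filter examples and the verbose-mode pattern can be generated mechanically. The sketch below percent-encodes a query with `urllib.parse.quote` and optionally inserts the "v" path component before "search". The helper name `search_url` and its parameters are hypothetical; the host is the one used in the template's own examples.

```python
from urllib.parse import quote


def search_url(query, verbose=False, base="gemini://gus.guru"):
    """Build a GUS search link for `query` (hypothetical helper)."""
    # Verbose mode is selected by a "v" path component preceding "search".
    path = "/v/search" if verbose else "/search"
    # quote() leaves "/" unescaped by default, so a value such as
    # "application/pdf" stays readable, while ":" and ">" are encoded.
    return base + path + "?" + quote(query)


print(search_url("content_type:application/pdf"))
print(search_url("computer AND size:>2000"))
print(search_url("gemini", verbose=True))
```

The encoded query strings produced this way match the ones used in the link lines of the template, for example `computer%20AND%20size%3A%3E2000`.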
diff --git a/serve/templates/fragments/documentation-toc.gmi b/serve/templates/fragments/documentation-toc.gmi
@@ -0,0 +1,3 @@
+=> /documentation/searching Documentation: searching
+=> /documentation/indexing Documentation: indexing
+=> /documentation/backlinks Documentation: backlinks
diff --git a/serve/templates/fragments/footer.gmi b/serve/templates/fragments/footer.gmi
@@ -1,3 +1,5 @@
+> “If I cease searching, then, woe is me, I am lost. That is how I look at it - keep going, keep going come what may.” --- Vincent Van Gogh
+
 => /add-seed See any missing results? Let GUS know your Gemini URL exists.
 
 Index updated on: {{ index_modification_time|datetimeformat }}
diff --git a/serve/templates/index.gmi b/serve/templates/index.gmi
@@ -1,10 +1,21 @@
 {% include 'fragments/header.gmi' %}
 
-=> /about About GUS
-=> /statistics GUS Statistics
+
+## Geminispace Data
+
+=> /statistics Geminispace Statistics
 => /known-hosts Known Gemini Hosts
 => /known-feeds Known Gemini Feeds
+=> /newest-hosts Newest Gemini hosts
+=> /newest-pages Newest Gemini pages
+
+## Help and Documentation
+
+=> /about About GUS
 => /news GUS News
 => gemini://gemini.circumlunar.space Gemini Project information
 
+{% include 'fragments/documentation-toc.gmi' %}
+
+
 {% include 'fragments/footer.gmi' %}
diff --git a/serve/templates/newest_hosts.gmi b/serve/templates/newest_hosts.gmi
@@ -6,7 +6,7 @@
 Here are the ten most recently discovered Gemini hosts by GUS. Welcome to Geminispace!
 
 {% for host in newest_hosts %}
-{{ "=> {} {}: {}".format(host.domain, host.first_seen[:10], host.domain) }}
+{{ "=> //{} {}: {}".format(host.domain, host.first_seen[:10], host.domain) }}
 {% endfor %}
 
 {% include 'fragments/footer.gmi' %}
diff --git a/serve/templates/newest_pages.gmi b/serve/templates/newest_pages.gmi
@@ -0,0 +1,12 @@
+{% include 'fragments/header.gmi' %}
+
+
+## Newest Gemini Pages
+
+Here are the fifty most recently discovered Gemini pages by GUS.
+
+{% for page in newest_pages %}
+{{ "=> //{} {}: {}".format(page.fetchable_url, page.first_seen[:10], page.url) }}
+{% endfor %}
+
+{% include 'fragments/footer.gmi' %}
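
Note on the link line in this template: each row is rendered as a gemtext link whose label is the first ten characters of `first_seen` (the date part of the timestamp) followed by the page URL. The sketch below reproduces the format string in plain Python. It assumes `fetchable_url` is stored without a scheme, as a `host/path` string; this diff does not confirm that, and if the stored value already begins with `gemini://` the `//` prefix would produce a malformed link. The sample values are invented.

```python
def link_line(fetchable_url, first_seen, url):
    """Render one gemtext link line the way the template's format call does."""
    # first_seen[:10] keeps "YYYY-MM-DD" from a "YYYY-MM-DD HH:MM:SS" timestamp.
    return "=> //{} {}: {}".format(fetchable_url, first_seen[:10], url)


# Hypothetical row; fetchable_url is assumed to carry no scheme.
line = link_line(
    "example.org/log/post.gmi",
    "2020-09-04 07:50:54",
    "gemini://example.org/log/post.gmi",
)
print(line)
```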
diff --git a/serve/views.py b/serve/views.py
@@ -87,6 +87,14 @@ def newest_hosts(request):
     return Response(Status.SUCCESS, "text/gemini", body)
 
 
+@app.route("/newest-pages", strict_trailing_slash=False)
+def newest_pages(request):
+    body = render_template("newest_pages.gmi",
+                           newest_pages=gus.get_newest_pages(),
+                           index_modification_time=gus.statistics["index_modification_time"])
+    return Response(Status.SUCCESS, "text/gemini", body)
+
+
 @app.route("/known-feeds", strict_trailing_slash=False)
 def known_feeds(request):
     body = render_template("known_feeds.gmi",
@@ -109,6 +117,27 @@ def index(request):
     return Response(Status.SUCCESS, "text/gemini", body)
 
 
+@app.route("/documentation/searching", strict_trailing_slash=False)
+def documentation_searching(request):
+    body = render_template("documentation/searching.gmi",
+                           index_modification_time=gus.statistics["index_modification_time"])
+    return Response(Status.SUCCESS, "text/gemini", body)
+
+
+@app.route("/documentation/indexing", strict_trailing_slash=False)
+def documentation_indexing(request):
+    body = render_template("documentation/indexing.gmi",
+                           index_modification_time=gus.statistics["index_modification_time"])
+    return Response(Status.SUCCESS, "text/gemini", body)
+
+
+@app.route("/documentation/backlinks", strict_trailing_slash=False)
+def documentation_backlinks(request):
+    body = render_template("documentation/backlinks.gmi",
+                           index_modification_time=gus.statistics["index_modification_time"])
+    return Response(Status.SUCCESS, "text/gemini", body)
+
+
 @app.route("/news", strict_trailing_slash=False)
 def index(request):
     body = render_template("news.gmi",

	geminispace.info gemini search engine
	git clone https://git.clttr.info/geminispace.info.git