geminispace.info

gemini search engine
git clone https://git.clttr.info/geminispace.info.git

news.gmi (15947B)


{% include 'fragments/header.gmi' %}


## News
### 2024-04-25 update troubles
We had some trouble updating the index over the last few days. We crawled over 1.7 million(!) pages on a single capsule - useful stuff like "node_modules" directories pushed to Git repos served over Gemini.
When we tried to include all those pages in our whoosh FTS index, the VM simply froze - or sometimes the Python process was just OOM-killed.

We've removed the questionable pages and set up new excludes. The index is now up to date again.

### 2024-01-05 happy birthday :)
On this very day three years ago "geminispace.info" was born, and it was announced on the long-gone mailing list a few days later.
And still today, with some adjusted bits here and some additions there, geminispace.info is essentially still standing on the foundations that ~natpen built with GUS. Kudos to Natalie.

We had some ups and downs over the years, but mostly geminispace.info has been a reliable source of information for the Gemini community. At least we feel so, and we hope you do as well.

That being said, we'd like to ask the community to help fund the hosting costs of geminispace.info.
=> /about More information about donations can be found on our "About" page.

### 2023-12-23 IPv6 is back
geminispace.info is now available through IPv6 at the address 2a02:247a:207:8e00:1::1

### 2023-08-12 robots.txt clarification
We've added some clarification on how geminispace.info parses robots.txt files:
=> /documentation/indexing Indexing documentation

### 2023-07-30 pubnix & robots.txt
When fetching a robots.txt, geminispace.info is now aware of pubnix-style user dirs (domain.com/~joe and domain.com/users/joe), allowing for per-user robots settings.
The provisions for this feature have been in the code for quite a long time, but we didn't make use of them until now.
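To illustrate the idea of per-user robots settings, here is a minimal sketch of how the candidate robots.txt locations for a page could be derived. The function name `robots_candidates` and the order of the candidates are assumptions made for this example; the actual crawler code may look up and combine the files differently.

```python
from urllib.parse import urlsplit


def robots_candidates(url: str) -> list[str]:
    """Return robots.txt URLs that could apply to `url`, user-level first.

    Hypothetical helper: it only recognises the two pubnix layouts
    mentioned above (/~joe/... and /users/joe/...).
    """
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}"
    # Non-empty path segments, e.g. "/~joe/log/post.gmi" -> ["~joe", "log", "post.gmi"]
    segments = [s for s in parts.path.split("/") if s]

    candidates = []
    if segments and segments[0].startswith("~") and len(segments[0]) > 1:
        # domain.com/~joe style user dir
        candidates.append(f"{base}/{segments[0]}/robots.txt")
    elif len(segments) >= 2 and segments[0] == "users":
        # domain.com/users/joe style user dir
        candidates.append(f"{base}/users/{segments[1]}/robots.txt")
    # The capsule-wide robots.txt is always a candidate.
    candidates.append(f"{base}/robots.txt")
    return candidates
```

For example, `robots_candidates("gemini://domain.com/~joe/log/post.gmi")` yields the per-user file under `/~joe/` followed by the capsule-wide file.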

### 2023-07-29 robots.txt
Dear capsule pilots (especially those who run mirrors or gateways of bloated websites): Please put a robots.txt in place.
=> gemini://geminiprotocol.net/docs/companion/robots.gmi robots.txt companion spec at geminiprotocol.net

### 2023-06-07 server switch
Welcome to our brand-new production instance sporting Debian 12.

### 2023-03-04 twtxt
The "known feeds" page now includes twtxt.txt feeds.
=> known-feeds check the "Known feeds" page

### 2023-02-10
We now provide a list of URIs that are currently excluded from crawling and indexing. This should improve transparency about what geminispace.info is doing. At the moment no reason is given as to why a specific exclude is in place. We might add this in the future.
=> documentation/filters list of excluded URIs

### 2023-01-29 updated TLS certificate
geminispace.info now uses an updated certificate that uses X.509 version 3.
I hope this improves compatibility with clients, as the previously used X.509 v1 seems to be losing support in some implementations.

### 2023-01-27 update delay
We had some issues with the crawler getting stuck in an "infinite maze" that should never have been crawled. This is solved for the moment and the index is up to date again.
Additionally, there is some intermittent trouble with name resolution. I have no clue what causes this. If someone has experience in debugging name resolution on Linux (Debian), I'd gratefully accept any advice.

### 2023-01-05
I've made some adjustments to the raw database for some major performance improvements. This helps mostly when we update the index or restart the server; it does not affect searching on geminispace.info.

Due to the price increases announced by our hoster, I'm thinking about hosting geminispace.info on a spare RasPi at home. The downside would be that geminispace.info would change its IP every 24 hours - so IP-based blocking of the crawler would be impossible.
Do you think this is acceptable? Feedback welcome.

### 2023-01-01
Happy new year everyone, hope you're all doing well.
Our provider (netcup.de) announced a price increase - so maybe we are going to migrate to another provider some day in June. We need at least 2 vCPUs (4 would be better), 8 GB of RAM and at least 100 GB of block storage. Suggestions welcome.

### 2022-12-18
After some small adjustments to the indexing, I'm confident we can postpone the need for a major rewrite by quite a few months.

### 2022-12-01 donations
As of today geminispace.info has received donations that sum up to 82.78€. That's almost 8 months of hosting costs covered. Thank you very much.

### 2022-11-12
With the still ongoing increase in Gemini capsules, the current tech stack of geminispace.info hits its limits in one case or another. There are several options to improve the situation:
1. do a major rewrite and move to another tech stack (especially regarding data storage and full text search)
2. move to another implementation like the one tlgs.one uses
3. shut down the service
At the moment I have no motivation to put the required effort into options 1 and 2. We'll see if this changes in the next few months - the current contract for the VPS will end in July 2023.

### 2022-08-22 donations welcome
We've set up a way to send donations to help cover the ongoing costs of running geminispace.info.
=> about more information can be found on our About page

### 2022-08-18 duplicate results
Due to a small glitch in the crawler we had duplicate results in the dataset for a few weeks.
Thanks to a report from Acidus this has now been fixed and the duplicate entries have been removed.

Despite this, Gemini keeps growing organically. The raw data known to geminispace.info currently exceeds 10 GB, and we already exclude some high-traffic capsules like news or Wikipedia relays.

### 2022-07-21 crawling issues
We had some crawling issues over the last few days. In the end it turned out someone decided to serve huge video files over Gemini.
At the moment we process all files in memory, so the crawl simply got killed by the OOM killer once the size of the downloaded video hit the available memory.

We've worked around this by excluding the capsule in question from the crawl. A proper fix needs to be implemented in the future.
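One possible shape for such a fix, shown purely as a sketch and not as what geminispace.info actually implements, is to read each response in chunks and give up as soon as a size limit is exceeded. The helper name `read_capped` and its parameters are assumptions for this example.

```python
import io
from typing import BinaryIO, Optional


def read_capped(stream: BinaryIO, max_bytes: int,
                chunk_size: int = 64 * 1024) -> Optional[bytes]:
    """Read `stream` fully, or return None once it exceeds `max_bytes`.

    Hypothetical helper: memory use stays below max_bytes + chunk_size
    no matter how large the remote file is.
    """
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream reached within the limit.
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > max_bytes:
            # Too large to index; the caller can skip this URL.
            return None


# Usage with an in-memory stand-in for a network response:
small = read_capped(io.BytesIO(b"# hello gemini\n"), max_bytes=1024)
huge = read_capped(io.BytesIO(b"\0" * 10_000), max_bytes=1024, chunk_size=256)
```

Here `small` holds the full body, while `huge` is `None` because the stream is abandoned after a little more than 1024 bytes have been read.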

### 2022-06-07 suggestions disabled
The poor performance on "no result" searches was caused by some misbehaviour when trying to compute suggestions for alternate search terms, which eventually led to an exception.
I've disabled suggestions for empty search results for the moment. Suggestions will come back once I've sorted this out.

### 2022-05-16
There's currently an issue with search queries that lead to no results (i.e. geminispace.info can't find a page that matches the criteria):
These searches take a very long time until a "no results" page is returned; sometimes they even fail with a "42 TEMPORARY FAILURE".
Any search for a known pattern will return results within seconds, so if a search takes more than 20 seconds, expect that geminispace.info does not know about a page that matches your criteria.
We are looking into this.

### 2022-05-13 speeding up crawling
Our crawl engine is now multi-threaded. This means that multiple requests are made in parallel and the overall crawl time is greatly reduced.
Additionally, the crawling order is now more random, which should avoid requesting huge numbers of pages from a single capsule in a short time.
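As a rough illustration of these two ideas together, the following sketch shuffles the URL queue and fetches it with a small thread pool. The names `crawl_all` and `fetch`, and the worker count, are assumptions for this example and are not taken from the crawler's code. Shuffling only makes bursts against one capsule less likely; it does not guarantee a minimum delay per host.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional


def crawl_all(urls: Iterable[str], fetch: Callable[[str], str],
              workers: int = 4, seed: Optional[int] = None) -> dict[str, str]:
    """Fetch all URLs in parallel, in randomised order.

    `fetch` is a caller-supplied function (hypothetical here) that
    retrieves one URL and returns its body.
    """
    queue = list(urls)
    # Randomise the order so pages of one capsule are spread over the crawl.
    random.Random(seed).shuffle(queue)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in queue order, so zip pairs them correctly.
        bodies = pool.map(fetch, queue)
        return dict(zip(queue, bodies))
```

A real crawler would additionally need error handling and timeouts inside `fetch`; those are left out to keep the sketch short.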

### 2022-05-09 memory usage issue solved?
It seems like we've finally solved our memory issue. In the end it may have been a small parameter for whoosh which ended up loading the whole index into RAM.
At first glance the change didn't cause any performance drain; the system even seems more responsive now, maybe because the high memory pressure was causing overhead before.

### 2022-03-27 Debian update
The server running geminispace.info has been updated to Debian 11.3 without any issues. :)

### 2022-03-25 improved indexing speed
Some small tweaks to the indexing process and the removal of old, now defunct capsules which we still tried to crawl have dramatically reduced the time needed for a complete update.

### 2022-03-20 dependency hell
We had an outage due to a dependency upgrade that hit late. `markupsafe`, which is not used by geminispace.info directly but rather is a dependency of `jinja`, shipped a breaking change in a minor release which caused some trouble for various people. We were just late to the party.
It's worked around for the moment; I'll have a look at it later.

### 2022-03-19 TLS config update
geminispace.info now allows more variants of TLS ciphers, which hopefully will allow us to crawl even more capsules.

### 2022-03-05 monitoring
geminispace.info is now monitored by shit.cx (and I will be alerted if something goes wrong). Big thanks to Jon for providing this service.
=> gemini://status.shit.cx shit.cx status monitoring.

### 2022-02-08 oopsie
So the last refactor went...erm...upside down. We had an outage for a few hours because of this.
I rolled the changes back and will make another attempt at a (hopefully successful) refactor in the next few days *fingers crossed*

### 2022-02-06 filtering clients
I've blocked two IPs for doing stupid requests again and again:
* 2001:41d0:302:2200::180
* 193.70.85.11

### 2022-01-25 a year after
Today one year ago geminispace.info was set up. You can probably guess what happened: the cert for the capsule expired today... :-D
A new cert is in place which now lasts for ten years...

### 2021-12-29
geminispace.info is performing pretty well at the moment, it's reasonably fast and very reliable.
There was no need for me to hack around anything, although a few optimizations are still open. I'm going to tackle these todos next year.

### 2021-11-24
The "newest-hosts" page now shows the 30 newest hosts instead of only 10.

### 2021-09-15
I'm currently quite happy with the reliability and performance of the crawl and indexing processes.
So I removed some older excludes; you should expect to see a whole lot more indexed pages after the next crawl.
We'll have to see if I regret this change... ;)

### 2021-08-18
geminispace.info is now powered by Debian 11 Bullseye :)

### 2021-08-10
I just pushed a small fix that allows searching for backlinks without giving the otherwise mandatory scheme. The scheme is now added automatically.
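The behaviour described here can be pictured with a small sketch like the one below. The function name `with_scheme` and the choice of "gemini" as the default are assumptions for illustration; the fix in the code base may be implemented differently.

```python
def with_scheme(uri: str, default_scheme: str = "gemini") -> str:
    """Return `uri` unchanged if it names a scheme, else prefix a default one."""
    uri = uri.strip()
    if "://" in uri:
        # A scheme is already present, e.g. gemini://example.org/
        return uri
    return f"{default_scheme}://{uri}"
```

With this, a backlink query for `example.org/docs/` (a placeholder host) would be treated the same as one for `gemini://example.org/docs/`.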

### 2021-08-07
I pushed a small change to production to ensure that URIs added to the seed requests include the scheme. This was mandatory before, but due to a recent change we no longer crawl schemeless URIs as per spec.
If you added your capsule in the last few days without a scheme, this is now fixed and the capsule should now be included in the index.

### 2021-07-20
Thanks to a contribution from Hannu Hartikainen, geminispace.info is again able to honor the user-agents "gus", "indexer" and "*" in robots.txt.
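As a hedged illustration of what honoring several user-agents can mean, the sketch below uses Python's standard `urllib.robotparser` and allows a path only if none of the three agents is denied. Both the function name `may_crawl` and this "deny if any agent is denied" policy are assumptions for the example; it is not a description of the contributed code.

```python
from urllib.robotparser import RobotFileParser

# The three user-agents named above.
CRAWLER_AGENTS = ("gus", "indexer", "*")


def may_crawl(robots_txt: str, path: str, agents=CRAWLER_AGENTS) -> bool:
    """Return True only if every agent in `agents` may fetch `path`.

    Assumed policy: a Disallow rule for any of the agents blocks the crawl.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() falls back to the "*" section when an agent has no own section.
    return all(parser.can_fetch(agent, path) for agent in agents)


example = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: indexer
Disallow: /archive/
"""
```

With the `example` rules, `/index.gmi` may be crawled, while `/archive/old.gmi` (blocked for "indexer") and `/cgi-bin/search` (blocked for "*") may not.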

### 2021-07-11
The revamped data store seems to work fine so far.
Unfortunately I had to disable the "newest hosts" and "newest pages" pages, as the data is currently not available. I'll add them back later, but before that I'd like to have the cleanup mechanism implemented to get rid of old data from capsules that are no longer available.

### 2021-07-10
I finally managed to analyze the index process. In the end it turned out to be an issue with calculating the backlink counters, and with an adapted query indexing is fast again.
Obviously I was horribly wrong all the time blaming the slow VPS.

Unfortunately this is only a small step in the major overhaul of GUS.

### 2021-07-04
More trouble along the way. Although the VPS hosting geminispace.info has 8 GB of RAM and does not run other services, the index update got OOM-killed. :(
It seems that due to the continued growth of Gemini we are hitting the same problems Natalie hit a few months ago on GUS. I'm currently unsure about the next steps.

### 2021-06-26
The last reindex took almost ten days to complete, as I triggered a complete index. This was necessary after the cleanup, as there is currently no incremental cleanup of the search index implemented.
The design of GUS - which clearly was never meant to index such a huge number of capsules - and the slow VPS are not helping to keep the index up to date. Unfortunately we are currently stuck with the VPS.
Currently there is no progress to report on the coding side. I'm busy with various other things, and late in the evening I can't be bothered to tackle some of the obvious tasks to improve GUS. If you are interested in helping to improve GUS/geminispace.info, feel free to comment on one of the issues or drop me a mail.
=> https://todo.sr.ht/~rwa/geminispace.info/ issues and todos of geminispace.info

### 2021-06-16
I've done some manual cleanup of the base data over the last few days. This decreased the raw data size from over 3 GB to roughly 2 GB. Unfortunately a new mirror of godocs came online...another thing we need to exclude for the moment.

### 2021-05-25
geminispace.info is now aware of more than 1000 capsules. Unfortunately this number is somewhat misleading: some of the capsules may already be gone, but GUS lacks a mechanism for invalidating old data.
I'll probably start with some manual cleanup in the next few days, so don't worry if the numbers go down.

### 2021-05-12
We are back on track with crawl and index, everything is up-to-date again.
I had to add another news mirror and a Wikipedia mirror to the exclude list. The current implementation can't handle such a huge amount of information well.

### 2021-05-08
Obviously this didn't work as expected. For whatever reason indexing fails repeatedly on one page or another with a mysterious sqlite error. It may take a few days until I find enough time to search for the cause of this error.
If you are familiar with peewee and sqlite or have come across this issue before, let me know.

### 2021-05-05
The index is currently a few days behind. It will hopefully catch up during the day.
From now on I will exclude any sort of news or YouTube mirrors from the crawl without further notice.
For the sake of transparency I may add a section which mentions what is excluded and why. But this is not a high priority for me.

### 2021-04-27
There are currently some issues during crawl that sometimes lead to an interruption. So it may take more than the usual 3 days until new content is discovered.
This will eventually be solved when the migration to PostgreSQL is done; unfortunately I'm quite busy with real life currently, so it may take some time.

### 2021-04-14
I started working on migrating the backing database to PostgreSQL instead of SQLite.
This may take a while, but it will eventually solve some of the problems that currently occur around crawling and indexing.

### 2021-03-19
Not sure if I can keep the update schedule set to every 3 days.
The current crawl has been running for more than 24 hours now and it's still not finished.

### 2021-03-08
The shady workaround is now in place - index updates won't block searches anymore.
This is even more important with the ongoing growth of geminispace - as of today there are more than 750 capsules we know about.

### 2021-03-06
I'm currently working on a workaround to avoid the index update blocking search requests.
Unfortunately I broke the index during this...need to be more careful when doing maintenance.

### 2021-02-26
I've made some adjustments to how GUS/geminispace.info uses robots.txt.
Previously we tried to honor the settings for the *, indexer and gus user-agents. That didn't work out well with the available Python libraries for robots parsing, and GUS ended up crawling files it wasn't intended to.
We now only use the settings for * and indexer, no special handling for GUS anymore. All indexers unite. ;)

### 2021-02-02
The first fully unattended index update happened last night.
There are still some rough edges to be cleaned up, but we are on the way to having up-to-date search results without manual intervention.

### 2021-01-29
geminispace.info has just been announced on the gemini mailing list.

### 2021-01-25
geminispace.info is going public! Yeah! :)