Wiby is amazing and one of my favorite ways to discover the more unique side of the internet. I love to kill time by clicking "Surprise me", and half the RSS feeds I subscribe to came from it!
Thank you so much for making Wiby! I have truly enjoyed what it's brought!!
The very first search brought up something very, very interesting for me [0] (bonus points for looks, of course). I love and miss the old internets. Happy to know there is Wiby! (I use Millionshort and Marginalia.)
Thank you. My index is puny compared to yours, and that is because I only index the pages submitted by guests and don't go any further. The index is in the tens of thousands. The hyperlink crawling feature was added and tested out specifically for the release, as I understand that some people will want to use it heavily to build up a much larger index. My computers were super cheap and handle my puny index well, because it's puny. I have no idea how well my approach compares to others in terms of performance.
This is really amazing, and it allows niche communities to build up "useful links" that can be easily searched, WITHOUT having to use outside search engines.
>There are several forms to control the search engine. There is no central form linking everything together, just a collection of different folders that you can rename if you want.
Maybe a good addition would be an /admin/ page that links to all of the admin functions.
This is awesome - thanks for sharing. I just so happen to have a domain left lying around for the Searx instance I never spun up, so I might end up using it for a weekend project with this.
Love Wiby, been playing with it awhile now. I especially enjoy the "surprise me" button from time to time.
The guide linked in the README is comprehensive and a good read; it also gives instructions on how to scale and distribute load.
If the author can answer: how big did the fulltext table become for x entries on wiby.me, and what is a common response time at N searches per minute for this dataset?
Would you offer a /traffic or /stats page within /about/? DuckDuckGo shows traffic, though not index stats.
I don't see it in the windex schema yet, but it would be interesting to know what the actual hit rate is on a search corpus and how many clickthroughs there are for any given search term. Answering these kinds of questions adds computation and record keeping, though.
Thanks for open-sourcing this; it's an interesting mix of languages and tools!
>how big did the fulltext table become for x entries on wiby.me
I want Wiby to consist mainly of human-submitted pages, so for 99% of the index, only the pages submitted by users are indexed and no further crawling is done. However, I recognized that lacking the ability to crawl through links would make it less useful for others, so I added in the crawling capability to my liking and tested it accordingly. I imagine others might want to depend heavily on hyperlink crawling for their use case, but there is a tradeoff in the quality of the pages that get indexed and the resources they require.
>and what is a common response time at N searches per minute for this dataset?
Hard to say exactly, as I haven't run many benchmarks, but my goal is to keep multi-word queries to within about a second. Single-word queries are very fast. My 4 computers handle hundreds of thousands of queries per day because Wiby is being barraged by a nasty spam botnet with thousands of constantly changing IPs. If I don't keep them in check, they will eventually eat up all the available CPU.
>Would you offer a /traffic or /stats page within /about/? DuckDuckGo shows traffic, though not index stats.
Probably not on mine since I don't get enough traffic for it to be of that much interest to me. I privately use goaccess to get a general idea of daily traffic.
I like this approach as a possible basis for a personal search engine that only contains stuff I have been looking at. For that it would be helpful to have some kind of browser extension that can autosubmit everything in my history (a rough sketch of the submission side follows this comment). Ideally that extension would also autoaccept every submission, so that it can work fully in the background without my intervention.
Also helpful would be a whitelist/blacklist feature: say, Wikipedia and Stack Overflow may always be autoaccepted, certain other sites may always be rejected, and the rest go through the regular review process.
Then I could use that as my default search engine and branch out when I don't find what I am looking for. For that it would also be cool if there were a way to search Wiby and another search engine in parallel and display, say, 5 results from each.
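To make the autosubmit idea concrete, here is a minimal sketch in C with libcurl that posts every URL read from stdin (e.g. a dump of browser history) to a self-hosted instance's submission form. The endpoint path (/submit/) and the form field name (url) are assumptions on my part; check the actual submission form in the repo for the real path and field name.

    /* Sketch: batch-submit URLs to a self-hosted Wiby instance.
       Hypothetical endpoint and field name; verify against the real form. */
    #include <stdio.h>
    #include <string.h>
    #include <curl/curl.h>

    int main(void) {
        char line[2048];
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        /* One URL per line on stdin */
        while (fgets(line, sizeof(line), stdin)) {
            line[strcspn(line, "\r\n")] = '\0';   /* strip the newline */
            if (line[0] == '\0')
                continue;

            char *esc = curl_easy_escape(curl, line, 0);  /* URL-encode */
            if (!esc)
                continue;
            char post[4300];
            snprintf(post, sizeof(post), "url=%s", esc);
            curl_free(esc);

            curl_easy_setopt(curl, CURLOPT_URL, "http://localhost/submit/");
            curl_easy_setopt(curl, CURLOPT_POSTFIELDS, post);
            if (curl_easy_perform(curl) != CURLE_OK)
                fprintf(stderr, "failed to submit: %s\n", line);
        }
        curl_easy_cleanup(curl);
        return 0;
    }

The whitelist/blacklist part would then live in the review step, auto-approving submissions whose host matches a trusted list and rejecting known-bad hosts before anything reaches the manual queue.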
For what it's worth, I ended up putting Marginalia Search behind Cloudflare to deal with what I assume is the same group. At worst I saw 30k queries per hour.
My unsubstantiated hunch based on looking at the types of queries, which at least for me were over-specified as all hell and within the sphere of pharmaceuticals, e-shopping and the like, is that they're gambling on the search engine being backed by Google or Bing, and they're effectively trying to poison their typeahead suggestion data.
I'd guess they're just aiming their gatling gun at whatever sites have an OpenSearch specification, without much oversight.
It's also crossed my mind that it might be some sketchy law firm looking for DMCA violations, since a fair bunch of the queries looked like they were after various forms of contraband. It seems weird that they'd use a botnet, though. Most of the IPs seemed to be enterprise routers with public-facing admin pages and the like. Does not seem above board at all.
With this code you will start out with a blank index and will have to start making submissions to your search engine to build the index, but you can search the results as soon as the pages get crawled. The video demo provides a practical example.
The internet, somewhat ironically, really needs a search engine that works in the current day. You can't find anything anymore. It's like Google has been un-invented.
Hopefully some day soon the internet will be searchable again.
There would be a lot of trust required to use the data for anything, but things like Common Crawl save a lot of time. Does Wiby support starting with that?
A group of people would have to band together to share the same table (windex) with each other. Personally I am interested in seeing people try to cultivate their own niche indexes instead of working towards a common one.
You can certainly change the crawler's database connection from "localhost" to an IP address on a different machine, but I am unsure how that works with that type of proxy (I had to look up what SOCKS5 is). Sounds like it can work, though.
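For the "different machine" part, a sketch of what that change might look like, assuming the crawler connects through the MySQL C API; the host, credentials, and database name below are all placeholders:

    /* Sketch: point the crawler's database connection at a remote host
       instead of localhost. All connection details here are placeholders. */
    #include <stdio.h>
    #include <mysql.h>

    int main(void) {
        MYSQL *con = mysql_init(NULL);
        if (con == NULL)
            return 1;

        /* was: "localhost" */
        if (mysql_real_connect(con, "192.0.2.10", "crawler", "secret",
                               "wiby", 3306, NULL, 0) == NULL) {
            fprintf(stderr, "connect failed: %s\n", mysql_error(con));
            mysql_close(con);
            return 1;
        }

        /* ... crawl and write to the windex table as usual ... */
        mysql_close(con);
        return 0;
    }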
Well, what I meant was: say I have a server with the IP 13.223.12.212 and I want to run the crawler there, but I would like to fetch the actual websites from the IP 23.215.23.15 (i.e. my proxy, SOCKS5 being one of several protocols for doing this).
If you get what I mean :P
I assume it's possible if I just change some of the curl options in the crawler code.
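If the crawler fetches pages through libcurl's easy interface, a single option may be enough. A minimal sketch; the proxy port is a guess (1080 is the SOCKS default), and where exactly this slots into the crawler's fetch code is something to verify:

    /* Sketch: route a libcurl fetch through a SOCKS5 proxy. */
    #include <stdio.h>
    #include <curl/curl.h>

    int main(void) {
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
        /* socks5h:// also resolves hostnames on the proxy side */
        curl_easy_setopt(curl, CURLOPT_PROXY, "socks5h://23.215.23.15:1080");
        /* If the proxy needs credentials:
           curl_easy_setopt(curl, CURLOPT_PROXYUSERPWD, "user:pass"); */

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(res));

        curl_easy_cleanup(curl);
        return res == CURLE_OK ? 0 : 1;
    }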