Scraping 6 million HN posts (out of a total of 40 million, 6M = some ~3 years of posts) and embedding them with openAI's text embedding models.
Making these full-text and fuzzy searchable with Postgres+pgvector, using reciprocal rank fusion to merge the two rankings. Very rapid query times on a 6 year old intel NUC; a kind of hn.algolia.com search only with added semantic search, not just text search.
Reducing the dimensionality for average per-user and per-submissoon vectors to 3D with UMAP, to see where my posts are located, in "hn discussion space". (expectedly, some users such occupy very specific clusters in discussion-space. There's a "dang telling you to please keep it intellectually stimulating" corner)
I should slap a frontend on the latter, host it somewhere not inside my own house and submit it, honestly.
Having this natively here would be pretty useful. There's so much good stuff that "disappears" in the barrage of submissions, that can be hard to surface via simplistic search.
I'd most definitely want it to mainly link to the source comments/threads, no regurgitation necessary.
UI could be anything really as I'm not picky, self-hosted would be a huge plus.
Example queries could look like: "(what's a) setup for X?" (where X might be "on-prem ETL" for instance), "troubleshooting Y", "books about Z" or "obscure web finds" (this one's tricky because it's subjective and perhaps implicit). I guess many more inspirations can be found in the Ask section. Pretty sure a lot of questions get repeated.
I'd already be happy if the results were "ontologically obvious" close to my query, effectively searching phrases modulo synonyms.
With search as it currently is, I have to rephrase my query because I can't conjure the exact wording to exclude the many false positives (or include any true positives ;)), usually resorting to site: queries on other search engines or giving up.
Also, I should really have curated better myself these past years... HN is a cool place, but it, like other feed-like sites, surely has a short memory.
Similarly, imagine the info that is stored in Discord chat history, never to be retrieved again unless the right users come together at the right time and the info is repeated (as if we're back to oral history, except this time with perfect recordings).
Hope this helped somewhat, trying to keep it brief.
Making these full-text and fuzzy searchable with Postgres+pgvector, using reciprocal rank fusion to merge the two rankings. Very rapid query times on a 6 year old intel NUC; a kind of hn.algolia.com search only with added semantic search, not just text search.
Reducing the dimensionality for average per-user and per-submissoon vectors to 3D with UMAP, to see where my posts are located, in "hn discussion space". (expectedly, some users such occupy very specific clusters in discussion-space. There's a "dang telling you to please keep it intellectually stimulating" corner)
I should slap a frontend on the latter, host it somewhere not inside my own house and submit it, honestly.