Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun (blog.wilsonl.in)
520 points by wilsonzlin 8 months ago | 159 comments



This is impressive work, especially for a one man show!

One thing that stood out to me was the graph of the sentiment analysis over time. I hadn't seen something like that before, and it was interesting to see it for Rust. What were the most positive topics over time? And were there topics that saw very sudden drops?

I also found this sentence interesting, as it rings true to me about social media: "there seems to be a lot of negative sentiment on HN in general." It would be cool to see a comparison of sentiment across social media platforms and across time!


Thanks! Yeah I'd like to dive deeper into the sentiment aspect. As you say it'd be interesting to see some overview, instead of specific queries.

The negative sentiment stood out to me mostly because I was expecting a more "clear-cut" sentiment graph: largely neutral-positive, with spikes in the positive direction around positive posts and negative around negative posts. However, for almost all my queries, the sentiment was almost always negative. Even positive posts apparently attracted a lot of negativity (according to the model and my approach, both of which could be wrong). It's something I'd like to dive deeper into, perhaps in a future blog post.


The sentiment issue is a curious one to me. For example, a lot of humans I interact with that are not devs take my direct questioning or critical responses to be "negative" when there is no negative intent at all. Pointing out something doesn't work or anything that the dev community encounters on a daily basis isn't an immediate negative sentiment but just pointing out the issues. Is it a meme-like helicopter parent constantly doling out praise positive so that anything differing shows negativity? Not every piece of art needs to be hung on the fridge door, and providing constructive criticism for improvement is oh so often framed as negative. That does the world no favors.

Essentially, I'm not familiar with HuggingFace or any models in this regard. But if they are trained from the socials, then it seems skewed from the start to me.

Also, fully aware that this comment will probably be viewed as negative based on stated assumptions.

edit: reading further down the comments, clearly I'm not the first with these sentiments.


Speaking from experience, debate is easily misread as negative arguing by outsiders, even though all involved parties are enjoying challenging each other's ideas.


You may be right, a more tailored classifier for HN comments specifically may be more accurate. It'd be interesting to consider the classes: would it still be simply positive/negative? Perhaps constructive/unconstructive? Usefulness? Something more along the lines of HN guidelines?


Just one point of note: people are FAR more likely to respond to, and write about, something negative than something positive. I don’t know the exact numbers, but negativity just engages people more. People just don’t pick up the pen to write about how good something is nearly as often.


Every helicopter gets a trophy


wait, the parents get a trophy?


I did something related for my ChillTranslator project, which translates spicy HN comments into calm variations. It has a GGUF model that runs easily and quickly, but it's early days. I worked with a much smaller set of data: I used LLMs to make calm variations and an algo to pick the closest, least spicy one as synthetic training data, then used Phi 2. I used Detoxify to find spicy comments; since OpenAI's sentiment analysis is free, I use it to verify that Detoxify has correctly identified them, then generate a calm pair.

I do worry that HN could implode or degrade if it can't keep a good balance in the comments and posts that people come here for. Maybe I can use your sentiment data to mine faster and generate more pairs.

I've only done an initial end-to-end test so far (which works!). The model is not yet as high quality as I'd like, but I haven't used Phi 3 on it yet and I've only used a very small fine-tune dataset. The file is here though: https://huggingface.co/lukestanley/ChillTranslator

I've had no feedback from anyone on it, though I did have a 404 in my Show HN post!


Anecdotally, I think anyone who reads HN for a while will realize it to be a negative, cynical place.

Posts written in sweet syrupy tones wouldn’t do well here, and jokes are in short supply or outright banned. Most people here also seem to be men. There’s always someone shooting you down. And after a while, you start to shoot back.


(Without wanting to sound negative or cynical) I don’t think it is, but maybe I haven’t been here long enough to notice. It skews towards technical and science and technology-minded people, which makes it automatically a bit ‘cynical’, but I feel like 95% of commenters are doing so at least in good faith. The same cannot be said of many comparable discussion forums or social media websites.

Jokes are also not banned; I see plenty on here. Low-effort ones and chains of unfunny wordplay or banter seem to be frowned upon though. And that makes it cleaner.


I've been here a hot minute and I agree with you. Lots of good faith. Lots of personal anecdotes presumably anchored in experience. Some jokes are really funny, just not reddit-style. Similarly, no slashdot quips generally, such as "first post" or "i, for one, welcome our new HN sentiment mapping robot overlords." Sometimes things get downvoted that shouldn't, but most of the flags I see are well deserved, and I vouch for ones that I think are not flag-worthy


I wonder how much of a person's impression of this is formed by their browsing habits.

As a parent comment mentions, big threads can be a bit of a mess, but usually only for the first couple of hours. Comments made in the spirit of HN tend to bubble up, and off-topic, rude comments and bad jokes tend to percolate down over the course of hours. Also, a number of threads that tend to spiral get manually detached, which takes time to clean up.

Someone who isn't somewhat familiar with how HN works, and who is consistently early to stories that attract a lot of comments, is reading an almost entirely different site than someone who just catches up at the end of the day.


Some of the more negative threads will get flagged and detached, and by the end of the day a casual browse through the comments isn't even going to come across them. E.g. something about the situation in the Middle East is going to attract a lot of attention.


I think it's the engineering mindset. You're always trying to figure out what's wrong with an idea, because you might be the poor bastard that ends up having to build it. Less costly all round if you can identify the flaw now, not halfway through sprint 7. After a while it bleeds into everything you do.


> Anecdotally, I think anyone who reads HN for a while will realize it to be a negative, cynical place.

Sure, sometimes. But usually it's

Truth seeking > group thinking

There's a fine line between critical and cynical. Sometimes that line gets crossed. Sometimes the ambiguity of text-only comms clouds the water.


> Anecdotally, I think anyone who reads HN for a while will realize it to be a negative, cynical place.

I don't think this is particularly unique to HN. Anonymous forums tend to attract contrarian assholes. Perhaps this place is more, erm, poorly socially adapted than the general population, but I don't see it as very far outside the norm, apart from the average wealth of the posters.


Really? Mmm, I think HN is a place with people of above-average intelligence. People who understand that their opinion is not the only one. I rarely have issues with people here. Might also be because we are all in the same bubble here.


It's so interesting that in Likert scale surveys I tend to see huge positivity bias/agreement bias, but comments tend to be critical/negative. I think there is something related to the format of feedback that skews the graph in general.

On HN, my theory is that positivity is the upvotes, and negativity/criticality is the discussion.

Personally, my contribution to your effort is that I would love to see a tool that could do this analysis for me over a dataset/corpus of my choosing. The code is nice, but it is a bit beyond me to follow in your footsteps.


Great work! Would you consider adding support for search-via-url, e.g. https://hn.wilsonl.in/?q=sentiment+analysis. It would enable sharing and bookmarks of stable queries.


Thanks for the suggestion, I've just added the feature:

https://hn.wilsonl.in/s/sentiment%20analysis


It will be a deep dive into the most essential of HN staples, the nitpick


[flagged]


Lol what a typical comment for today's HN. Condescending ("just plain wrong") with a jab ("this isn't a hugbox") placed in just to remind you that not only are you perceived to be wrong but you've provoked anger. No proof to provoke the jab, no feedback to help fix what you perceive as wrong sentiment analysis. Just thoughtless condescension and anger. Why is the sentiment wrong? Is this a data analysis trap the OP fell into? Nah let's insult the OP instead.

In my experience having run a bunch of different sentiment models on HN comments, HN comments tend to place around neutral to slightly negative as a whole, even when I perceive the thread to be okay. However I've noticed a huge bump in negative sentiment on large HN threads. I generally find that absolute sentiment doesn't work in most corpuses because the model reflects its training set's sentiment labels. I generally find relative sentiment to be a lot more useful. I have yet to do a temporal sentiment analysis on HN but I have a suspicion that it's gotten more negative over time. I agree with another poster that I think HN needs to be careful to not become so negative that it just becomes an anger echo.

Relative sentiment between topics on this site is something I've done, and the results are the obvious ones. Crypto threads are by-and-large negative; most political and news related threads are also highly negative.
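
The relative-sentiment idea in the comment above could be sketched roughly as follows. This is only an illustration, assuming each comment already carries a score in [-1, 1] from some sentiment model; the function name, topic names, and scores are all hypothetical and not taken from the commenter's own analysis.

```python
from collections import defaultdict
from statistics import mean


def relative_sentiment(scored_comments):
    """Return each topic's mean sentiment minus the corpus-wide mean.

    scored_comments: iterable of (topic, score) pairs, where score is
    whatever a sentiment model produced (assumed here to lie in [-1, 1]).
    """
    scored_comments = list(scored_comments)
    # Corpus-wide baseline: absorbs the model's overall bias on this corpus.
    baseline = mean(score for _, score in scored_comments)
    by_topic = defaultdict(list)
    for topic, score in scored_comments:
        by_topic[topic].append(score)
    return {topic: mean(scores) - baseline for topic, scores in by_topic.items()}


# Hypothetical scores, for illustration only.
sample = [
    ("crypto", -0.6), ("crypto", -0.4),
    ("rust", -0.1), ("rust", 0.1),
]
deltas = relative_sentiment(sample)
for topic, delta in sorted(deltas.items()):
    print(f"{topic}: {delta:+.2f}")
```

Subtracting the baseline means a topic only reads as "negative" when it is more negative than the site as a whole, which sidesteps the question of whether the model's absolute scale suits HN.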


Cynicism is perceived as more intelligent [0]. I personally find the HN brand of discussion to be difficult to bs my way into. But no matter your level of competency you can always find something to criticize and feel you've contributed. I wonder if academia or even "more intelligent" discussion in general would be counted as more negative.

[0] https://journals.sagepub.com/doi/pdf/10.1177/014616721878319...


As someone who is not an academic myself, but likes to listen to podcasts where academics discuss issues with each other, I often find that the conversations feel contentious, and sometimes they are, but the vast majority of the time the academics themselves feel like they're having a perfectly cordial and productive conversation. So I do think there is something to the idea that academic discussion comes across as being negative.


HN definitely has a negative valence.

Sure, there's the 20% of comments that are outright rude, or tie everything back to their pet grievance (job satisfaction, government surveillance, the existence of JS).

But beyond that, the technical conversation has a negative, critical edge. A lot of comments come from the angle "You did something wrong by...", or only reply to correct.

There are still golden comments, and most personal anecdotes are treated respectfully, but it makes for an intimidating environment.


Whoosh, I was making a point by styling my comment in a way that would be perceived as negative by sentiment analysis.

Good job doing a whole psychoanalysis based on what's basically a joke, though.


Heh did I miss the joke? That was a whoops indeed! Sentiment is hard on the internet ;)

> Good job doing a whole psychoanalysis based on what's basically a joke, though.

Guess there's still some work to be done on that positive sentiment replying eh? :)


That one was intentional ;)


> sentiment across social media platforms and across time!

Also time zones and weekday/weekend.


I actually did a blog post a few months ago where I analyzed HN commenter sentiment across AI, blockchain, remote work and Rust. The final graph at the very end of the post is the relevant one on this topic!

https://openpipe.ai/blog/hn-ai-crypto


Thanks, the sentiment in these graphs seems more positive in comparison. Did you run the sentiment on the whole corpus? What did that look like?


It's really unfortunate the HN API does not provide votes on comments: I wonder if and how sentiment analysis would change if they were weighted by votes/downvotes?

My unsupported take is that engineers are mostly critical, but will +1 positive feedback instead of repeating it, as they might for criticism :)
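
A minimal sketch of what vote-weighting might look like, purely hypothetical since (as noted above) the HN API does not expose comment votes. The weighting rule (each comment counts once for itself plus once per net upvote) and the sample numbers are assumptions chosen for illustration.

```python
def weighted_mean_sentiment(items):
    """Mean sentiment where each comment counts once plus once per upvote.

    items: iterable of (sentiment, net_votes) pairs. Negative net votes
    are clamped to zero so a downvoted comment still counts once.
    """
    total = 0.0
    weight_sum = 0.0
    for sentiment, votes in items:
        weight = 1 + max(votes, 0)
        total += sentiment * weight
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no comments given")
    return total / weight_sum


# Hypothetical thread: one heavily upvoted positive comment, three critical ones.
thread = [(0.8, 40), (-0.5, 0), (-0.5, 1), (-0.5, 0)]
unweighted = sum(s for s, _ in thread) / len(thread)
weighted = weighted_mean_sentiment(thread)
print(f"unweighted={unweighted:+.3f} weighted={weighted:+.3f}")
```

In this made-up thread the unweighted mean is negative while the weighted mean is positive, which is the effect the comment speculates about: agreement expressed as upvotes never shows up in the comment text.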


Crypto, I imagine, is in that bucket


HN is a pretty toxic place indeed.


> HN is a pretty toxic place indeed

This may be a personal style difference, but I find HN to be the least toxic of all social media I’ve tried. LinkedIn would be my example of ultra toxicity – the aggressive positivity there is unbearable. At least on HN people tell you what they think and even use a constructive, decently argued approach to doing so.

HN to me feels like a good technical discussion where people tear apart ideas instead of each other.

But yeah if you put a lot of ego into your ideas, HN must be an awful place to visit.


I agree, HN is much less toxic than about any other place on the internet.


How did you get from negative sentiment to toxicity? Are those the same to you?

It may be a cultural thing, but I think many people see negative sentiment as a constructive tool and a demonstration of trust and respect among people who recognize each other as robust and capable peers.

Avoiding it is something you do with people who you believe need special delicacy: whether because they've told you so, because they intimidate you, or because you sense something pitiable and fragile about them.

If you can trust that it's given in good faith, and by the guidelines of HN you are asked to, negative sentiment should be seen as an expression that someone thinks you're a fully capable adult and peer. Personally, I deeply appreciate that it's generally so comfortably shared and received here and would never include "toxicity" in one of my critiques of HN.

It's a surprising thing to read someone say!

(Unless you're thinking of the nastiness that can surface on flamewar topics, but there are numerous means by which those get downranked and displaced, and they're otherwise sparse and easy to avoid.)


Negative sentiment is more general than toxicity in my understanding, but it does include it. The fact that the study found HN consistently negative does not surprise me. One of the ways HN is negative (the most disruptive, and the one which makes me post here less often) is indeed toxic comments. But I am still here (in the comments no less), so the benefit still outweighs the pain.


Perhaps... it can be toxic if you dip into the comments sometimes... Otherwise the content and links are the stuff of gold!


Links are indeed the best. It is hard not to click on the comments however, which is a roll of the dice.


Good example of data engineering/MLops for people who aren't familiar.

I'd suggest using HDBSCAN to generate hierarchical clusters for the points, then use a model to generate names for interior clusters. That'll make it easy to explore topics out to the leaves, as you can just pop up refinements based on the connectivity to the current node using the summary names.

The groups need more distinct coloring, which I think having clusters could help with. The individual article text size should depend on how important or relevant the article is, either in general or based on the current search. If you had more interior cluster summaries that'd also help cut down on some of the text clutter, as you could replace multiple posts with a group summary until more zoomed in.



Ooo thanks for this


Thanks for the great pointers! I didn't get the time to look into hierarchical clustering unfortunately but it's on my TODO list. Your comment about making the map clearer is great and something I think there's a lot of low-hanging approaches for improving. Another thing for the TODO list :)


Amazing work, I'm impressed by the scope of your project!

I must say though, whether it's jina or bge-3/flag, the embeddings (and tokenizer?) do not do a good job on tech topics. It's fine for natural words, but searching for tech concepts like "xaml", "simd", etc. causes it to fall back to tokenizing the inputs and try to grab similar sounding words.

Also, just some constructive feedback: it would be nice if there were some way to stop it from showing the same "hn leaderboard" of results when there are no results because a topic is too niche. I get a lot of "Stephen Hawking has died" when searching for words the embeddings aren't familiar with.

Edit: I'm not so sure how well the sentiment analysis is working. I had the feeling that there was too much negative sentiment that didn't match up to reality, so I tried looking up things HN would feel overwhelmingly positive about like "Mr Rogers", I mean, who could feel negatively about him? The results show some serious negative spikes. Look up "Carter" and there's a massive negative peak associated with the passing of Rosalynn Carter. It was an HN submission talking about all the wonderful things the Carters did.

Also, I think the "popularity over time" needs to be scaled by the median number of votes a story got that month/year, because the trend lines just go up and up if you plot strictly the number of posts. Look at the popularity of "diesel" and you'll see what I mean - this is a term that peaked ten years ago! Or perhaps it should be some sort of keyword incidence rate or number of items with a cosine similarity index of less than x from the query rather than post score, maybe?
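
One way to read the "keyword incidence rate" suggestion above is to plot the share of each period's posts that match, rather than the raw count. The sketch below assumes posts are available as (year, title) pairs and uses a plain substring match in place of the site's embedding similarity; the function name and the sample data are hypothetical.

```python
from collections import Counter


def popularity_share(posts, matches):
    """Return {year: fraction of that year's posts for which matches(title) is true}.

    posts: iterable of (year, title) pairs.
    matches: predicate on a title (a stand-in for a similarity threshold).
    """
    totals = Counter()
    hits = Counter()
    for year, title in posts:
        totals[year] += 1
        if matches(title):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical data: the raw count rises (2 -> 3) while the share falls.
posts = (
    [(2014, "Diesel emissions scandal")] * 2 + [(2014, "Other")] * 8
    + [(2023, "Diesel engines")] * 3 + [(2023, "Other")] * 57
)
share = popularity_share(posts, lambda title: "diesel" in title.lower())
for year, fraction in share.items():
    print(year, f"{fraction:.2%}")
```

Dividing by the period total removes the effect of overall site growth, so a term that peaked years ago no longer appears to trend upward simply because more is posted every year.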

Edit2: The dynamic "click a post to remove and recalculate similarity threshold" is awesome.


How does one tell programmatically that any given embedding model doesn't recognize a term or word?


Here's a great tool that does almost exactly the same thing for any dataset: https://github.com/enjalot/latent-scope

Obviously the scale of OP's project adds a lot of interesting complexity, this tool cannot handle that, but it's great for medium-sized datasets.


I'd like to see an analysis of the rise of self promotion on HN.

I define self promotion on HN as a "Show HN: I ..." post vs "Show HN: Something ..."

Examples from the top 100 right now

* "Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun"

* "Show HN: Browser-based knitting (pattern) software"

These are not self promotional titles. The subjects are the exploration and the software respectively.

* "Show HN: I built a non-linear UI for ChatGPT"

* "Show HN: I created 3,800+ Open Source React Icons"

These are self promotional titles. The subject of each is "I"

My own simple check, just via Algolia search results for titles that start with "Show HN: I", gave these results for years starting April 1st. Each bar is the count divided by the total number of results for that year.

    2023 ****************************************
    2022 ***********************************
    2021 ***************************
    2020 **************************************
    2019 *************************
    2018 *************
    2017 *******
    2016 **********
    2015 ********
    2014 ************
    2013 *********************
    2012 *****************
    2011 *********
    2010 ***
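
The check described above could be reproduced along these lines once titles have been fetched. This is a sketch under assumptions: titles are already grouped by year in a dict, the denominator is taken to be all "Show HN:" titles for that year, and the regex requires a standalone "I" (so it is slightly stricter than a plain prefix match, which would also catch titles like "Show HN: Instagram ..."). The helper names are hypothetical.

```python
import re

SHOW_HN = re.compile(r"^Show HN:\s*")
# "I" followed by a word boundary: matches "I built", "I'm", but not "Instagram".
FIRST_PERSON = re.compile(r"^Show HN:\s*I\b")


def first_person_share(titles_by_year):
    """Return {year: fraction of that year's Show HN titles that start with "I"}."""
    shares = {}
    for year, titles in titles_by_year.items():
        show = [t for t in titles if SHOW_HN.match(t)]
        if show:
            first_person = sum(1 for t in show if FIRST_PERSON.match(t))
            shares[year] = first_person / len(show)
    return shares


def bar(share, width=40):
    """Render a share in [0, 1] as a row of asterisks."""
    return "*" * round(share * width)


# The four example titles quoted above, grouped under a placeholder key.
titles_by_year = {
    "sample": [
        "Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun",
        "Show HN: Browser-based knitting (pattern) software",
        "Show HN: I built a non-linear UI for ChatGPT",
        "Show HN: I created 3,800+ Open Source React Icons",
    ]
}
shares = first_person_share(titles_by_year)
for year, share_value in shares.items():
    print(year, bar(share_value))
```

Note that `bar` here scales the raw share to 40 columns, whereas the chart above may have been scaled relative to its largest year.
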
I feel like maybe I grew up in a time when, generally, self promotion was considered a bad character trait. Your actions are supposed to be what promotes you; calling attention to them is not. But I feel that culture is changing.

I wonder if the rise in self promotion (assuming there is a rise) has to do with social media etc...

I perceive a similar rise on Youtube but I have no data, just a feeling from the number of youtube recommendations for videos of "I....."


Your definition of self promotion is a bit different from what I usually think. I usually consider self promotion to be someone promoting something that that same person did. Both of your non-self-promotion examples would be self promotion under my definition.

So what you consider to be self promotion vs non-self-promotion, I consider to be self promotion with a title that very clearly indicates that vs self promotion with a title that less clearly indicates that. However, the "Show HN" phrase is only used for self promotion I think, so even without the "I", anyone familiar with the convention will know it's self promotion.


> However, the "Show HN" phrase is only used for self promotion I think, so even without the "I", anyone familiar with the convention will know it's self promotion.

I think that's an extremely cynical view though a common one. I've never thought of "Show HN" as self promotion if it doesn't include "I" unless I go through to the actual product/library/post and find it full of self promotion. I agree with you that a post that doesn't include "I" can be self promotion but I don't think it always is even if the person made/worked on it.

"Show HN: XYZ, an LLM library in Rust" to me is informational. Its point is, more often than not, to inform people of something they might get use out of. I know that's true when I've posted something like that. Its meaning is "here's a useful resource that was just created". Sure I get pleasure from knowing I helped people with something but I'm not trying to promote myself, I'm trying to promote the library/post/info.

"Show HN: I made an LLM Library in Rust" to me is self promotional. It might be useful to others but its intent was clearly self promotion given the subject is "I", not the library/post/product.


>Show HN is for sharing your personal work and has special rules.

https://news.ycombinator.com/newsfaq.html


Interestingly, the special rules are more in favor of the more "self-promotional" variants

> On topic: things people can run on their computers or hold in their hands. For hardware, you can post a video or detailed article. For books, a sample chapter is ok.

> Off topic: blog posts, sign-up pages, newsletters, lists, and other reading material. Those can't be tried out, so can't be Show HNs. Make a regular submission instead.

https://news.ycombinator.com/showhn.html


Show HN is defined in the rules (as the sibling comment quotes) as something someone made to be shared, i.e. self promotion, regardless of whether they used "I" in the title. Your definition seems more arbitrary than what Hacker News itself intends.


Every Show HN has to be created by its author, so I’m not sure what is self promoting about making the implicit explicit.

They are all “look, I made something cool, what do you think?”


This is talked about a lot in Walter Isaacson's biography of Einstein, so people have been observing this trend for a long time (e.g. the Germans accusing Einstein of doing self promotion, the US having celebrity culture in contrast). Maybe it's cyclical.


I think this is easily the coolest post I've seen on HN this year


It was not obvious at first glance to me, but the actual app is here: https://hn.wilsonl.in/


I'm curious if the link to the landing page was intentionally near the end. Only the people who actually read it would go to the site.

(That's not a dig, I think it's a good idea.)


1) it doesn’t appear that search links are shareable or have the query terms in them

2) are you embedding the search phrases word by word? And using the same model as the documents used? Because I searched for „lead generation“ which any decent non-unigram embedding should understand, but I got results for lead poisoning.


I found myself and my post there! Nice


A modern recommendation for UMAP is Parametric UMAP (https://umap-learn.readthedocs.io/en/latest/parametric_umap....), which instead trains a small Keras MLP to perform the dimensionality reduction down to 2D by minimizing the UMAP loss. The advantage is that this model is small and can be saved and reused to predict on unknown new data (a traditionally trained UMAP model is large), and training is theoretically much faster because GPUs are GPUs.

The downside is that the implementation in the Python UMAP package isn't great and creates/pushes the whole expanded node/edge dataset to the GPU, which means you can only train it on about 100k embeddings before going OOM.

The UMAP -> HDBSCAN -> AI cluster labeling pipeline that's all unsupervised is so useful that I'm tempted to figure out a more scalable implementation of Parametric UMAP.


It exists in cuML with a fast GPU implementation. Not sure why cuML is so poorly known though…


I'll give that a look: the feature set of GPU-accelerated ops seems just up my alley for this pipeline: https://github.com/rapidsai/cuml

EDIT: looking through the docs, it's just GPU-accelerated UMAP, not a parametric UMAP which trains a NN model. That's easy to work around though, by training a new NN model to predict the reduced dimensionality values and minimizing RMSE.


Tested it out and the UMAP implementation with this library is very very fast compared to Parametric UMAP: running it on 100k embeddings took about 7 seconds when the same pipeline on the same GPU took about a half-hour. I will definitely be playing around with it more.


Yeah we advise Graphistry users to keep GPU umap training sets to < 100k rows, and instead focus on doing careful sampling within that, and multiple models for going beyond that. It'd be more accessible for teams if we could raise the limit, but quality wise, it's generally fine. Security logs, customer activity, genomes, etc.

RAPIDS umap is darn impressive tho. Instead of focusing on improving further, it did the job. Our bottleneck shifted to optimizing the ingest pipeline to feed umap, so we released cu_cat as a GPU-accelerated automated feature engineering library to get all that data into umap. RAPIDS cudf helps take care of the intermediate IO and wrangling in-between.

Downstream, we generally stopped doing DBSCAN, despite it being so pretty. We replace it with cugraph/GFQL on the umap similarity graph, to avoid quality issues we see in practice, and then visually & interactively investigate the similarity graph in pygraphistry. Once you can see the k-nn similarity edges, and lack thereof, you realize why scatter plot clusterings (visual or algorithmic) are so misleading to analysts, and you treat them with more caution. There is a variety of umap contenders nowadays, but with this pipeline, we haven't felt the need to go beyond. That's a multi-year testament to Leland and team.

The result is we can now umap and interactively visualize most real world large datasets, database query results, and LLM embeddings that pygraphistry & louie.ai users encounter in seconds. Many years to get here, and now it is so easy!


From a quick glance, it appears that it's because the implementation pushes the entire graph (all edges) to the GPU. Sampling of edges during training could alleviate this.
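
A rough sketch of the edge-sampling idea, independent of any particular framework: draw a fresh random batch of edges for each training step so that only that batch has to be moved to the GPU. The function name and the toy edge list are hypothetical, and uniform sampling is used only for simplicity; this is not how the umap package is implemented, and a weight-aware scheme may well be preferable in practice.

```python
import random


def sample_edge_batches(edges, batch_size, num_batches, seed=None):
    """Yield one list of randomly chosen edges per training step.

    Only the yielded batch would need to be transferred to the GPU,
    instead of the whole edge list at once.
    """
    rng = random.Random(seed)
    for _ in range(num_batches):
        # Sample without replacement within a batch; batches are independent.
        yield rng.sample(edges, min(batch_size, len(edges)))


# Hypothetical k-NN graph edges as (source, target, weight) triples.
edges = [(i, (i + 1) % 1000, 1.0) for i in range(1000)]
batches = list(sample_edge_batches(edges, batch_size=64, num_batches=5, seed=0))
print(len(batches), len(batches[0]))
```

Peak device memory then scales with `batch_size` rather than with the total number of edges, at the cost of noisier gradient estimates per step.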


Indeed, TensorFlow likes pushing everything to the GPU by default whereas many PyTorch DL implementations encourage feeding data from the CPU to the GPU as needed with a DataLoader.

There have been attempts at a PyTorch port of Parametric UMAP (https://github.com/lmcinnes/umap/issues/580) but nothing as good.


Looks like there is a little motion on this topic:

https://github.com/lmcinnes/umap/pull/1103


This is a surprisingly big endeavour for what looks like an exploratory hobby project. Not to minimize the achievement, very cool, I'm just surprised by how much was invested into it.

They used 150 GPUs and developed two custom systems (db-rpc and queued) for inter-server communication, and this was just to compute the embeddings, there's a lot of other work and computation surrounding it.

I'm curious about the context of the project, and how someone gets this kind of funding and time for such research.

PS: Having done a lot of similar work professionally (mapping academic paper and patent landscapes), I'm not sure if 150 GPUs were really needed. If you end up just projecting to 2D and clustering, I think that traditional methods like bag-of-words and/or topic modelling would be much easier and cheaper, and the difference in quality would be unnoticeable. You can also use author and comment-thread graphs for similar results.


Hey, thanks for the kind words. I wasn't able to mention the costs in the post (might follow up in the future) but it was in the hundreds of dollars, so it was reasonably accessible as a hobby project. The GPUs were surprisingly cheap, and I only scaled up that far mostly because I was impatient :) The entire cluster only ran for a few hours.

Do you have any links to your work? They sound interesting and I'd like to read more about them.


"Hundreds of dollars" sounds a bit painful as an EU engineer and entrepreneur :), but I guess it's all relative. We would think twice about investing this much manpower and compute for such an exploratory project even in a commercial setting if it was not directly funded by a client.

But your technical skill is obvious and very impressive.

If you want to read more, my old bachelor's thesis is somewhat related, from when we only had word embeddings and document embeddings were quite experimental still: https://ad-publications.cs.uni-freiburg.de/theses/Bachelor_J...

I've done a lot of follow-up work in my startup Scitodate, which includes large-scale graph and embedding analysis, but we haven't published most of it for now.


A golf membership can cost thousands of euros. Any hobby costs money.


Thanks for sharing, I'll have a read, looks very relevant and interesting!


As an EU-based engineer, you wouldn't do this, it's a massive GDPR violation (failure to notify data subjects of data processing), which does actually have extraterritoriality, although I somehow doubt that the information commissioners are going to be coming after OP.


Processing comments on a forum being a violation of the GDPR? That's crazy, the OP is neither the data controller (HN is) nor a data processor on behalf of the controller. If you post your data in public, it's not a GDPR violation for people to use it for things.


The author is definitely very skilled. I find it interesting they submit posts on HN but haven’t commented since 2018! And then embarked on this project.

As far as funding/time, one possibility is they are between endeavors/employment and it’s self funded as they have had a successful career or business financially. They were very efficient at the GPU utilization so it probably didn’t cost that much.


Thanks! Haha yeah I'm trying to get into the habit of writing about and sharing the random projects I do more often. And yeah the cost was surprisingly low (in the hundreds of dollars), so it was pretty accessible as a hobby project.


(1) Definitely you could use a cheaper embedding and still get pretty good results

(2) I apply classical ML (say probability calibrated SVM) to embeddings like that and get good results for classification and clustering at speeds over 100x fine-tuning an LLM.


I didn't think the OP used LLMs? They did use a BERT based sentiment classifier but that's not an LLM.

My HN recommender works fine just using decision trees and XGBoost FWIW. I'll bet SVM would work great too.


Some of the SBERT models now are based on T5 and newer architectures so there's not. The FlagEmbedding model that the author uses

https://huggingface.co/BAAI/bge-base-en-v1.5

is described as an "LLM" by the people who created it. It can be used in the SBERT framework.

I tried quite a few models for my RSS feed recommender (applied after taking the embedding) and SVM came out ahead of everything else. Maybe with parameter tuning XGBoost would do better but it was not a winner for me.

If you look at the literature

https://arxiv.org/abs/2405.00704

you find that the fashionable LLMs are not world-beating at many tasks and actually you can do very well at sentiment analysis applying the LSTM to unpooled BERT output.


> Some of the SBERT models now are based on T5 and newer architectures so there's not. The FlagEmbedding model that the author uses

Oh thanks! Right I had heard about T5 based embeddings but didn't realize it was basically an LLM.

> I tried quite a few models for my RSS feed recommender (applied after taking the embedding) and SVM came out ahead of everything else. Maybe with parameter tuning XGBoost would do better but it was not a winner for me.

XGBoost worked the best for me but maybe I should retry with other techniques.

> you find that the fashionable LLMs are not world-beating at many tasks and actually you can do very well at sentiment analysis applying the LSTM to unpooled BERT output.

Definitely. Use the right tool for the right job. LLMs are probably massive overkill here. My non-LLM based embeddings work just fine for my own recommender so shrug.


Are you applying an embedding to titles on HN, comment full-text or something else?

When it comes to titles I have a model that gets an AUC around 0.62 predicting if an article will get >10 votes and a much better one (AUC 0.72 or so) that predicts if an article that got > 10 votes will get a comment/vote ratio > 0.5, which is roughly the median. Both of these are bag-of-words and didn't improve when using an embedding. If I go back to that problem I'm expecting to try some kind of stacking (e.g. there are enough New York Times articles submitted to HN that I can train a model just for NYT articles.)
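
For readers unfamiliar with the metric, here is a minimal sketch of how an AUC like the ones quoted above can be computed. It uses the rank-sum identity: AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The labels and scores are made up for illustration and are not the commenter's data or model.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Equals the probability that a randomly chosen positive example is scored
    higher than a randomly chosen negative one, counting ties as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (made up): 1 = story got more than 10 votes, 0 = it did not.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(auc(labels, scores))
```

An AUC of 0.5 means the scores carry no ranking information, so 0.62 is a modest but real signal. The quadratic loop is fine for a sketch; a sort-based version would be preferable on large datasets.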

Also I have heard the sentiment that "BERT is not an LLM" a lot from commenters on HN but every expert source I've seen seems to treat BERT as an LLM. It is in this category in Wikipedia for instance

https://en.wikipedia.org/wiki/Category:Large_language_models

and

https://www.google.com/search?client=firefox-b-1-e&q=is+bert...

gives an affirmative answer in 8 cases out of 10, one of which denies it is a language model at all on a technicality that has since been overthrown.


> We can see that in this case, where perhaps the X axis represents "more cat" and Y axis "more dog", using the euclidean distance (i.e. physical distance length), a pitbull is somehow more similar to a Siamese cat than a "dog", whereas intuitively we'd expect the opposite. The fact that a pitbull is "very dog" somehow makes it closer to a "very cat". Instead, if we take the angle distance between lines (i.e. cosine distance, or 1 minus angle), the world makes sense again.

Typically the vectors are normalized, instead of what's shown in this demonstration.

When using normalized vectors, the euclidean distance measures the distance between the two end points of the respective vectors, while the cosine similarity measures the length of one vector projected onto the other (the cosine distance is 1 minus that).
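
A small sketch of the point about normalization, using made-up 2-D coordinates in the spirit of the quoted example (they are not the article's actual figures). For unit vectors the two measures are tied together by the identity d² = 2 − 2·cos θ, so after normalizing, Euclidean distance and cosine similarity always rank neighbours the same way.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical points: x axis = "more cat", y axis = "more dog".
dog = [0.2, 1.0]
pitbull = [2.0, 6.0]
siamese = [5.0, 3.0]

# Raw vectors: Euclidean distance puts the pitbull nearer the Siamese cat.
print(euclidean_distance(pitbull, siamese) < euclidean_distance(pitbull, dog))

# Cosine similarity puts the pitbull nearer the dog.
print(cosine_similarity(pitbull, dog) > cosine_similarity(pitbull, siamese))

# After normalizing, Euclidean distance agrees with cosine similarity.
n_dog, n_pit, n_cat = normalize(dog), normalize(pitbull), normalize(siamese)
print(euclidean_distance(n_pit, n_dog) < euclidean_distance(n_pit, n_cat))
```

All three comparisons print True for these points.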


The issue with normalization is that you lose a degree of freedom - which when you're visualizing, effectively means losing a dimension. Normalized 2d vectors are really just 1d vectors; if you want to show a 2d relationship, now you have to use 3d vectors (so that you have 2 degrees of freedom again).


This is amazing, the amount of skill and knowledge involved is very impressive.


Thank you for the kind words!


This is wild. I've been creating my own dataset of trending articles and ironically this is how I came across your post. I'm doing a similar project for my uni thesis.

I set out with similar hypotheses and goals like you (on a slightly different scale though, haha) but I've been completely stuck on the interactive map part. Definitely getting a lot of pointers from how you handled this!

Maybe one key difference in approach is that I've put more emphasis on trying to extract key topics as keywords.

For ex:

article (title): "Useful Uses of cat"

keywords: ['Software design', 'Contraction', 'Code changes', 'Modularity', 'Ease of extension']

My hypothesis is this will be a faster search solution than using the embeddings, but potentially not as accurate. Not that far yet to really prove this though.

Would love to hear what you think! Any other cool ideas on what could be done with the keywords? I explain my process a bit more here if interested: https://hackernews-demo.streamlit.app/#data-aggregation-meth...


This search engine is amazing. I was looking for an old story about curing acid reflux by some exercise, Google/DDG/Kagi/HN's Algolia were completely useless, this found it first hit. Well done, this is the HN search engine I've always wanted.

Is it possible to keep it up to date?


Very nice. Since HN data spawns so many such fun projects, there should be a monthly or weekly updated zip file or torrent with this data, which hackers can just download instead of writing a scraper and starting from scratch all the time.


It is very easy to get this dataset directly from HN API. Let me just post it here:

Table definition:

    CREATE TABLE hackernews_history
    (
        update_time DateTime DEFAULT now(),
        id UInt32,
        deleted UInt8,
        type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
        by LowCardinality(String),
        time DateTime,
        text String,
        dead UInt8,
        parent UInt32,
        poll UInt32,
        kids Array(UInt32),
        url String,
        score Int32,
        title String,
        parts Array(UInt32),
        descendants Int32
    )
    ENGINE = ReplacingMergeTree(update_time) ORDER BY id;
    
A shell script:

    BATCH_SIZE=1000

    TWEAKS="--optimize_trivial_insert_select 0 --http_skip_not_found_url_for_globs 1 --http_make_head_request 0 --engine_url_skip_empty_files 1 --http_max_tries 10 --max_download_threads 1 --max_threads $BATCH_SIZE"

    rm -f maxitem.json
    wget --no-verbose https://hacker-news.firebaseio.com/v0/maxitem.json

    clickhouse-local --query "
        SELECT arrayStringConcat(groupArray(number), ',') FROM numbers(1, $(cat maxitem.json))
        GROUP BY number DIV ${BATCH_SIZE} ORDER BY any(number) DESC" |
    while read ITEMS
    do
        echo $ITEMS
        clickhouse-client $TWEAKS --query "
            INSERT INTO hackernews_history SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/{$ITEMS}.json')"
    done

It takes a few hours to download the data and fill the table.
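
For anyone who wants the batching logic outside ClickHouse, here is a rough Python sketch of what the `numbers(...) GROUP BY number DIV ${BATCH_SIZE}` query above produces: item IDs from 1 to the max item, bucketed by integer division and emitted newest bucket first. The function names `id_batches` and `url_glob` are made up for this example, and the ascending order inside each batch is an assumption (the SQL does not guarantee one).

```python
def id_batches(max_item, batch_size=1000):
    """Yield lists of item IDs, newest batch first.

    Mirrors the grouping in the shell script: IDs 1..max_item are bucketed by
    integer division (id // batch_size), and buckets are emitted in
    descending order.
    """
    last_bucket = max_item // batch_size
    for bucket in range(last_bucket, -1, -1):
        start = max(bucket * batch_size, 1)  # item IDs start at 1
        stop = min((bucket + 1) * batch_size - 1, max_item)
        if start <= stop:
            yield list(range(start, stop + 1))


def url_glob(ids):
    """Build the {a,b,c} glob form that the script passes to url()."""
    joined = ",".join(str(i) for i in ids)
    return "https://hacker-news.firebaseio.com/v0/item/{" + joined + "}.json"


for batch in id_batches(2500):
    print(len(batch), batch[0], batch[-1])
```

Note that the first bucket (IDs 1 to 999) is one item short of a full batch, because bucket 0 would otherwise start at ID 0.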


May I hijack this thread for a related q. I love the public up-to-date hn dataset.

I saw the recursive CTE blog post, but it doesn't seem to work on your HN dataset

https://play.clickhouse.com/play?user=play#V0lUSCBSRUNVUlNJV...

Are recursive ctes disabled on this instance or am i doing something wrong?


Done, and now it works perfectly.


what was broken?


This is unclear to me, I will ask the author.


The reason is trivial - I disabled the new feature flag on the playground service long ago (when it was in development). I will enable it back and send an example.


While trying the script, I am getting the following error -

<Trace> ReadWriteBufferFromHTTP: Failed to make request to 'https://hacker-news.firebaseio.com/v0/item/40298680.json'. Error: Timeout: connect timed out: 216.239.32.107:443. Failed at try 3/10. Will retry with current backoff wait is 200/10000 ms.

I googled with no luck. I was wondering if you have a solution for it.


It makes many requests in parallel, and that's why some of them could be retried. It logs every retry, e.g., "Failed at try 3/10". It will throw an error only if it fails all ten tries. The number of retries is defined in the script.

Example of how it should work:

    $ ch -q "SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/40298680.json')" --format Vertical
    Row 1:
    ──────
    by:     octopoc
    id:     40298680
    parent: 40297716
    text:   Oops, thanks. I guess Marx was being referenced? I had thought Marx was English but apparently he was German-Jewish[1]<p>[1] <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Karl_Marx" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Karl_Marx</a>
    time:   1715179584
    type:   comment
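
The retry behaviour described above (log every failed try, raise only after the last one) can be sketched generically as follows. This is not ClickHouse's implementation: the doubling backoff, the 100 ms starting wait, and the names `fetch_with_retries` and `fetch` are assumptions for illustration. Only the 10 tries and the 10000 ms cap echo values visible in the script and the log line.

```python
import time


def fetch_with_retries(fetch, max_tries=10, initial_backoff_ms=100,
                       max_backoff_ms=10_000, sleep=time.sleep):
    """Call fetch() until it succeeds or max_tries attempts have failed.

    `fetch` is any zero-argument callable that raises OSError (for example
    TimeoutError) on a transient failure. The wait doubles after each
    failure, capped at max_backoff_ms.
    """
    backoff_ms = initial_backoff_ms
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except OSError as exc:
            if attempt == max_tries:
                raise  # every try failed: only now does the caller see an error
            print(f"Failed at try {attempt}/{max_tries}: {exc}. "
                  f"Retrying in {backoff_ms} ms.")
            sleep(backoff_ms / 1000)
            backoff_ms = min(backoff_ms * 2, max_backoff_ms)
```

Passing `sleep` as a parameter makes the function easy to exercise without actually waiting.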


Also, a proof that it is updated in real-time: https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...


There is a public dataset of Hacker News posts on BigQuery, but it unfortunately has only been updated up to November 2022: https://news.ycombinator.com/item?id=19304326


I have a daily updated dataset that has the HN data split out by months. I've published it on my web page, but it’s served from my home server so I don’t want to link to it directly. Each month is about 30 MB of compressed CSV. I’ve wanted to torrent it, but don’t know how to get enough seeders since each month will produce a new torrent file (unless I’m mistaken). If you’re interested, send me a message. My email is mrpatfarrell. Use gmail for the domain.


As a starting point, that project has Apache Arrow files. I don't know if they'll update them though.

https://github.com/wilsonzlin/hackerverse/releases/tag/datas...

The comments text table is 13 GB, to give you an idea. Can definitely be processed on a laptop.


I very much support this idea. Put them on ipfs and/or torrents. Put them on HuggingFace.


I’ve had this same thought but was unsure what the licensing for the data would be.


that's a nice idea


Absolutely wonderful project and even more so the writeup!

Feedback: on my iOS phone, once you select a dot on the map, there is no way to unselect it. Preview card of some articles takes full screen, so I can’t even click to another dot. Maybe add a “cross” icon for the preview card or make that when you tap outside of a card, it hides whole card strip?


Thank you! And thanks for raising that issue. I've pushed a fix that should hopefully mitigate this for you: it's possible to unselect, card images are hidden on mobile, and the invisible results area around a card (caused by the tallest card stretching the results area) should no longer intercept map touches. Let me know if it helps!


I'm.. shocked there's been 40 million posts. Wow.

Really neat work

edit: Also had no idea HN went back to 2006. https://news.ycombinator.com/item?id=1

edit2: PG wrote this? https://news.ycombinator.com/item?id=487171


An HN "item" is not just posts but everything: posts, comments, the parts of a poll, etc.

Still an impressive number


Awesome visualisation, and great write-up. On mobile (in portrait), a lot of longer titles get culled as their origin scrolls off, with half of it still off the other side of the screen - wonder if it'd be worth keeping on rendering them until the entire text field is off screen (especially since you've already got a bounding box for them).

While using it I stumbled upon [1], which reflects your comments on comment sentiment.

This also reminded me of [2] (for which the site itself had rotted away, incidentally) - analysing HN users' similarity by writing style.

[1] https://minimaxir.com/2014/10/hn-comments-about-comments/ [2] https://news.ycombinator.com/item?id=33755016


Thanks for the kind words, and raising that problem --- I've added it as an issue to fix.

Thanks for sharing that article, it was an interesting read. It was cool how deep the analysis went with a few simple statistical methods.


What a great read! Thanks for taking the time and effort to provide the insight into your process


Very nice project and documented really well. I learned a lot reading the post. The examples of the improved HN search are pretty awesome.

Any idea why password reuse is so far away from security? That was the only oddity of the map for me.


Worth trying Cagra (Raft)/CuVS and Lucene-CuVS for the vector search. (https://github.com/SearchScale/lucene-cuvs)


Really love the island map! But the automatic zooming on the map doesn't seem very relevant. E.g. try typing "openai" - I can't see anything related to that query in that part of the map


Indeed I've long been intrigued by the idea of rendering such clustering maps more like geographic maps for better readability.

It would be cool to have analogous continents, countries, sub-regions, roads, different-sized settlements, and significant landmarks... This version looks great at the highest zoom level, but rapidly becomes hard to interpret as you zoom in, same as most similar large embedding or graph visualizations.


Ok I just noticed there is a region "OpenAI" in the north-west, but for some reason it zooms in somewhere close to "Apple" (middle of the island) when I type the query


Thanks! Yeah sometimes there are one or two "far" away results which make the auto zoom seem strange. It's something I'd like to tune, perhaps zooming to where most but not all results are.


Often embeddings are not so good for comparing similarity of text. A cross-encoder might be a good alternative, perhaps as a second-pass, since you already have the embeddings. https://www.sbert.net/docs/pretrained_cross-encoders.html Pairwise, this can be quite slow, but as a second pass, it might be much higher quality. Obviously this gets into LLM's territory, but the language models for this can be small and more reliable than cosine on embeddings.
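
A minimal sketch of the two-pass idea: a cheap cosine pass over the stored embeddings picks a short list, and a slower pairwise scorer reorders only that list. The `pair_score` argument is a hypothetical stand-in for a real cross-encoder model, and the default cut-offs are arbitrary; none of this is taken from the project's code.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_text, query_vec, docs, pair_score, first_pass_k=50, final_k=10):
    """Two-stage retrieval.

    docs: list of (text, embedding) pairs.
    pair_score: hypothetical callable (query_text, doc_text) -> float standing
        in for a cross-encoder; higher means more relevant.
    """
    # Stage 1: cheap cosine similarity over every stored embedding.
    candidates = sorted(
        docs, key=lambda d: cosine(query_vec, d[1]), reverse=True
    )[:first_pass_k]
    # Stage 2: expensive pairwise scoring, but only on the short list.
    reranked = sorted(
        candidates, key=lambda d: pair_score(query_text, d[0]), reverse=True
    )
    return [text for text, _ in reranked[:final_k]]
```

The cost of the second pass grows with `first_pass_k` rather than with the size of the corpus, which is what makes the slow pairwise model affordable.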


It would be cool to see yearly changes of UMAP, by different years or the overall evolution in pseudotime on the embedding. Such a cool side project!


Would be cool to see member similarity. Finding like-minded commenters/posters may help discover content that would be of interest.


We implemented member similarity in our hacker read app: https://apps.apple.com/in/app/hacker-read/id6479697844

Once you register on iOS, you can also log in through the webapp: https://hn.garglet.com

probably not ready for a hacker news hug of death yet, but you can try.


Reminds me of a similar project a few months ago whose purpose was to unmask alt accounts. It wasn’t well received as I recall.


Accidental dating app.


> Accidental dating app.

Possibly the greatest indicator of social startup success.


A suggestion for analysis:

Compare topics/sentiment etc. by number of users and by number of posts.

Are some topics dominated by a few prolific posters? Positively or negatively.

Also, how does one separate negative/positive sentiment from criticism/advocacy?

How hard is it to detect positive criticism, or enthusiastic endorsement of an acknowledged bad thing?


Adding a subscribe feature to get an email with the most recent posts in a topic/community would be really cool. One of my favorite parts of HN is the weekly digest I get in my inbox; it would be awesome if that were tailored to me.

What you've built is really impressive. I'm excited to see where this goes!


Thanks! Yeah if there's enough interested users I'd love to turn this into a live service. Would an email subscription to a set of communities you pick be something you'd be interested in?


I made something very similar a few weeks ago. I also included usernames with the average of their comments: https://tomthe.github.io/hackmap/


As a novice, is there a benefit to using custom Node as the downloader? When I did my download of the 40 million Hacker News api items I used "curl --parallel".

What I would like to figure out is the easiest way to go from the API straight into a parquet file.


I think your curl approach would work just as well if not better. My instinct was to reach for Node.js out of familiarity, but curl is fast and, given the IDs are sequential, something like `parallel curl ::: $(seq 1 $max_id)` would be pretty simple and fast. I did end up needing more logic though so Node.js did ultimately come in handy.

As for the Arrow file, I'm not sure unfortunately. I imagine there are some difficulties because the format is columnar, so it probably wants a batch of rows (when writing) instead of one item at a time.
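
To illustrate the "batch of rows" point, here is a small sketch that regroups one-at-a-time API items into column-oriented batches. It uses only plain Python; the function name `to_column_batches` and the batch size are assumptions, and each yielded batch would still need to be handed to a columnar writer (from a library such as pyarrow) to end up in an Arrow or Parquet file.

```python
def to_column_batches(rows, columns, batch_size=10_000):
    """Group an iterable of row dicts into columnar batches.

    Each yielded batch is a dict mapping column name -> list of values, which
    is the shape a columnar writer generally wants. Missing keys become None.
    """
    batch = {name: [] for name in columns}
    count = 0
    for row in rows:
        for name in columns:
            batch[name].append(row.get(name))
        count += 1
        if count == batch_size:
            yield batch
            batch = {name: [] for name in columns}
            count = 0
    if count:
        yield batch  # final partial batch


# Made-up items shaped like HN API responses.
items = [{"id": i, "type": "comment"} for i in range(1, 6)]
for batch in to_column_batches(items, ["id", "type", "title"], batch_size=2):
    print(batch)
```

Because the function is a generator, it can consume a stream of downloaded items without holding the whole dataset in memory.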


A long term side project of mine is to try to build a recommendation algorithm trained on HN data.

I trained a model to predict whether a given post will reach the front page, get flagged, etc. I collected over 1,000 RSS feeds and rank the RSS entries with my ranking models.

I submit the high ranking entries on HN to test out my models and I can reach the front page consistently sometimes having multiple entries on the front page at a given time.

I also experiment with user->content recommendation, for that I use comment data for modeling interactions between users and entries, which seems to work fine.

The only problem I have is that I get a lot of 'out of distribution' content in my RSS feeds, which causes my ranking models to get 'confused'. For this I trained models to predict whether a given entry belongs on HN or not. On top of that I have some tagging models trained on data I scraped from lobste.rs and hand annotated.

I have been working on this on and off for the last 2 years or so. This account is not my main, just one I created for testing.

AMA


Could you explain more about what you mean by modeling interactions between comments and entities?


did you find if submitted entries are more likely to reach the frontpage depending on the title or the content?

i.e. do HN users upvote more based on the title of the article or on actually reading them?


I tried making an LLM generate different titles for a given article and compared their ranking scores. There seems to be a lot of variation in the ranking scores based on the way the title is worded. Titles that are more likely to generate 'outrage' seem to get ranked higher, but at the same time that increases the is_hn_flagged score, which tries to predict if an entry will get flagged.


This is pretty great.

Feature request : Is it possible to show in the graph how famous the topic / sub topic / article is ?

So that we can do an educated exploration in the graph around what was upvoted and what was not ?


Thanks! Do you mean within the sentiment/popularity analysis graph? Or the points and topics within the map?


Points and topics within the map.


I can’t tell from the documentation on GitHub: does the API expose the flagged/dead posts? It would be interesting to see statistics on what’s been censored lately.


I couldn't help but notice that Hy is on the map but Clojure isn't.

Am I out of touch?

https://hylang.org


HN submissions and comments are very different on weekends (and US holidays). Your data could explore and quantify this in some very interesting ways!


This is super cool! Both the writeup and the app. It'd be great if the search results linked to the HN story so we can check out the comments.


I'm impressed with the map component in canvas. It's very smooth, dynamic zoom and google-maps like.

Gonna dig more into it.

Exemplary Show HN! We need more of this.


Where is lisp?! I thought it was a verifiable (urban) legend around these parts that this forum is obsessed with lisp..?


Maybe lisp is so niche that even a rather small interest makes HN relatively lispy?


Very cool! I was hoping to be able to navigate to the HN post from the map though? Is that possible?


AI is the most popular topic (by far) that I could find. Is there anything more popular?


If anybody found this interesting and would like some further reading, the paper below employed a similar strategy to analyse inauthentic content/disinformation on Twitter.

https://files.casmconsulting.co.uk/message-based-community-d...

If you would like to read about my largely unsuccessful recreation of the paper, you can do so here - https://dfworks.xyz/blog/partygate/


Getting "Argo tunnel error" on the page


Thanks for the heads up, just fixed this.


This is the type of content I'm here for.


Very cool project. Thanks for sharing it!


If you have a blog, add an RSS feed :)


I tried to fetch his RSS too! :)

Turns out, there's only 1 post so far on his blog.

Hoping for more! This one is great.


Truly, amazing work! Not only because of the final results, but also because of the whole process it took the author to bring this to life. If I could upvote this by giving points from my karma, I wouldn't hesitate to easily give a hundred points. Without a doubt, I would classify this on par with "40k HN comments mentioning books, extracted using deep learning" (https://news.ycombinator.com/item?id=28595967), which is the highest-voted "Show HN" project related to hacker news so far with 1359 points.

I'm not in the ML/AI arena yet, so I couldn't fully understand the second half of the article except for having a general idea about Embeddings and their potential, but the first part is what interests me as a software engineer.

Following are some of the challenges the author came across and overcame, with the full source code published.

Downloading HN database

> There's also a maxitem.json API, which gives the largest ID. As of this writing, the max item ID is over 40 million. Even with a very nice and low 10 ms mean response time, this would take over 4 days to crawl, so we need some parallelism.

> I've exported the HN crawler [1] (in TypeScript) to its own project, if you're ever in need to fetch HN items.
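
The arithmetic behind the "over 4 days" figure is easy to check. The sketch below assumes every request takes exactly the mean response time and that a fixed number of requests is always in flight; the concurrency of 1000 is a made-up value for illustration, not the author's setting, and real throughput would also be bounded by rate limits and bandwidth.

```python
def crawl_days(num_items, mean_response_ms, concurrency=1):
    """Estimated wall-clock days to fetch num_items, assuming every request
    takes mean_response_ms and `concurrency` requests are always in flight."""
    total_seconds = num_items * mean_response_ms / 1000 / concurrency
    return total_seconds / 86_400


# Sequential crawl of 40 million items at 10 ms each: about 4.63 days.
print(round(crawl_days(40_000_000, 10), 2))

# Same crawl with a hypothetical 1000 requests in flight: about 6.7 minutes.
print(round(crawl_days(40_000_000, 10, concurrency=1000) * 24 * 60, 1))
```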

Fetching and parsing linked URLs' HTML for metadata and text

> For text posts and comments, the answer is simple. However, for the vast majority of link posts, this would mean crawling those pages being linked to. So I wrote up a quick Rust service [2] to fetch the URLs linked to and parse the HTML for metadata (title, picture, author, etc.) and text. This was CPU-intensive so an initial Node.js-based version was 10x slower and a Rust rewrite was worthwhile. Fortunately, other than that, it was mostly smooth and painless, likely because HN links are pretty good (responsive servers, non-pathological HTML, etc.).

Recovering missing/dead links

> A lot of content even on Hacker News suffers from the well-known link rot: around 200K resulted in a 404, DNS lookup failure, or connection timeout, which is a sizable "hole" in the dataset that would be nice to mend. Fortunately, the Internet Archive has an API that we can use to programmatically fetch archived copies of these pages. So, as a final push for a more "complete" dataset, I used the Wayback API to fetch the last few thousands of articles, some dating back years, which was very annoying because IA has very, very low rate limits (around 5 per minute).

Finding a cost-effective cloud provider for GPUs

> Fortunately, I discovered RunPod, a provider of machines with GPUs that you can deploy your containers onto, at a cost far cheaper than major cloud providers. They also have more cost-effective GPUs like RTX 4090, while still running in datacenters with fast Internet connections. This made scaling up a price-accessible option to mitigate the inference time required.

This is the type of content that makes HN stand out from the crowd.

_____________________________

1. https://github.com/wilsonzlin/crawler-toolkit-hn/

2. https://github.com/wilsonzlin/hackerverse/tree/master/crawle...


how much did you pay to generate those embeddings?


excellent work


Related a month ago:

A Peek inside HN: Analyzing ~40M stories and comments

https://news.ycombinator.com/item?id=39910600


“Cloud Computing” “us-east-1 down”

This gave me a belly laugh.



