Hacker News

I have to admit this is over my head. But sometimes being stupid is also what works, and I've always abided by the old adage that "it's better to have a tractor that works than a limo that you can't get out of the garage". I realize that in this case the problem may just have been a pretext for demonstrating HyperLogLog, but in case someone is looking for a simple solution to this very problem, maybe this naïve solution could work (use at your own peril!):

So I'd create an association table that records the user, the time, and the object viewed, plus a field on the object record with the rolled-up count. The problem with duplicate views is solved by wrapping the insert and the counter increment in the same transaction: if the insert fails because it violates the unique index, the increment is rolled back too.
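A minimal sketch of that transaction pattern (using SQLite so the example is self-contained; the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, seen_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE post_views (
  post_id INTEGER,
  user_id INTEGER,
  seen_at TEXT,
  UNIQUE (post_id, user_id)   -- one row per (post, user): duplicates are rejected
);
INSERT INTO posts (id) VALUES (1);
""")

def record_view(conn, post_id, user_id):
    try:
        with conn:  # one transaction: both statements commit, or neither does
            conn.execute(
                "INSERT INTO post_views (post_id, user_id, seen_at) "
                "VALUES (?, ?, datetime('now'))",
                (post_id, user_id),
            )
            conn.execute(
                "UPDATE posts SET seen_count = seen_count + 1 WHERE id = ?",
                (post_id,),
            )
    except sqlite3.IntegrityError:
        pass  # duplicate view: the unique index rejected the insert, counter untouched

record_view(conn, 1, 42)
record_view(conn, 1, 42)  # duplicate, ignored
record_view(conn, 1, 7)
print(conn.execute("SELECT seen_count FROM posts WHERE id = 1").fetchone()[0])  # prints 2
```

The same shape works in Postgres with `BEGIN`/`COMMIT` (or `INSERT ... ON CONFLICT DO NOTHING` plus a conditional increment).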

I probably missed something?




I felt like I didn't understand the blog post very well, either.

It spends a lot of time building a benchmark framework and measuring things. At the end, hstore is the choice supported by the data. But instead the author writes:

> If we go strictly with the data, the best way looks to be the hstore-powered solution, but I think the HLL is probably the right choice.

?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.

I'm also confused by the benchmark. The autocannon script has:

    // 10% of the time, get *all* the users
    request.method = 'GET'
    request.path = `/posts/${postId}/seen-by/users`

Does this imply there's a path it can hit to get the list of users who saw a post? HLL is a probabilistic data structure, so it doesn't support that with 100% accuracy. You could reverse engineer it from the list of users, but the performance will vary with the number of users, which didn't seem to be captured by the benchmark. I tried to look at the source code, but the GitLab URL was 404ing for me.
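To illustrate why: an HLL keeps only a fixed array of register maxima, so there is no member list to return, only a cardinality estimate. A toy Python sketch of the core idea (this is not the blog's code; a Postgres deployment would use an HLL extension instead):

```python
import hashlib
import math

P = 10             # 2**P = 1024 registers; more registers -> lower error
M = 1 << P
registers = [0] * M

def h64(item: str) -> int:
    """Stable 64-bit hash of an item."""
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

def add(item: str) -> None:
    h = h64(item)
    idx = h & (M - 1)   # low P bits choose a register
    w = h >> P          # remaining 64-P bits give the "rank" (leading-zero count + 1)
    rho = (64 - P) - w.bit_length() + 1 if w else (64 - P) + 1
    registers[idx] = max(registers[idx], rho)  # max is idempotent: duplicates are free

def estimate() -> float:
    alpha = 0.7213 / (1 + 1.079 / M)  # bias correction constant for M >= 128
    e = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if e <= 2.5 * M and zeros:        # small-range (linear counting) correction
        e = M * math.log(M / zeros)
    return e

for user_id in range(10_000):
    add(f"user-{user_id}")
    add(f"user-{user_id}")  # a repeat view changes nothing

print(round(estimate()))    # close to 10000, but only an estimate
```

With 1024 registers the typical relative error is about 1.04/sqrt(1024), roughly 3%, and nothing in the structure lets you recover which users were added.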


> ?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.

Just guessing: the benchmark tells you the time complexity. It doesn't tell you the space complexity. The author is optimizing between the time and space complexity of the solutions, with the time-complexity benchmarks as one input. (Also, at a scale the benchmark doesn't reach, space complexity starts to affect time complexity, as large datasets can no longer stay hot in the disk cache.)


That sounds like a bad benchmark, then? I mean, it's still interesting, but a synthetic benchmark that fails to map to real use is... I guess actually of negative value, since it's misleading.


Yeah, this is probably the reason and the author even elaborates on this in the sentences following the bit quoted by cldellow:

> Even though the data says hstore, knowing that posts will be seen by more and more people over time, I might choose the HLL solution for an actual implementation. It's far less likely to pose a bloated row problem, [...]


Just posting here too, but yup this is exactly what I was trying to convey.

Hstore might have been the fastest, but the way I'm using it, or the scale the use case could grow to, might not work out.

Bigger benchmarks could have been done! Maybe a multi-part post would have been better, so I could split the methodology apart from the results/blabbering about the approach!


Yup this is it! You’ve said it much better than I did, thank you.


I don't think I'm getting 404s, but the repo's gitlab url does keep bouncing me to a login page.

> You need to sign in or sign up before continuing.

Has gitlab always enforced that, even for public repos? Or is the linked repo not public?


Hey sorry for that, we've changed the links to a public Github repo, so everyone can take a look or submit a PR: https://github.com/VADOSWARE/supabase-seen-by


I can follow the links on the post now - thanks for updating!


The linked repo isn't public; you can check through the author's public repos on GitLab.


Hey, sorry the post wasn't better written!

I tried to cover all the basic or obvious approaches (counter, assoc table) so that people could see the complexity kind of ramp up.

You’re right that an assoc table is a very workable solution, along with the partitioning capabilities of postgres these days.


Unless I knew there was going to be a huge boatload of data, I'd go for an association table too. That's the cleanest thing in terms of the DB.



