I have to admit this is over my head. But sometimes being stupid is also what works, and I've always abided by the old adage that "it's better to have a tractor that works than a limo that you can't get out of the garage". I realize that in this case the problem was maybe just a pretext for demonstrating HyperLogLog, but in case someone is looking for a simple solution to this very problem, maybe this naïve solution could work (use at your own peril!):
So I'd create an associated table that records the user, the time, and the object viewed, plus a field on the object record holding the rolled-up count. The problem with duplicate views is solved by wrapping the insert and the increment of the counter in the same transaction: if the insert fails because it violates the unique index, the increment fails too.
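Something like this in Postgres (a minimal sketch; the table and column names are just illustrative, not from the post):

    -- One row per (post, user) view; the primary key doubles as the
    -- unique index that rejects duplicate views.
    CREATE TABLE post_views (
        post_id   bigint NOT NULL,
        user_id   bigint NOT NULL,
        viewed_at timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (post_id, user_id)
    );

    CREATE TABLE posts (
        id         bigint PRIMARY KEY,
        seen_count bigint NOT NULL DEFAULT 0
    );

    -- Both statements in one transaction: if the INSERT violates the
    -- primary key, the whole transaction aborts and the counter is
    -- never bumped.
    BEGIN;
    INSERT INTO post_views (post_id, user_id) VALUES (42, 7);
    UPDATE posts SET seen_count = seen_count + 1 WHERE id = 42;
    COMMIT;

In practice you'd probably use INSERT ... ON CONFLICT DO NOTHING and only bump the counter when a row was actually inserted, so repeat views don't abort a transaction every time.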
I felt like I didn't understand the blog post very well, either.
It spends a lot of time building a benchmark framework and measuring things. At the end, hstore is the choice supported by the data, but instead the author writes:
> If we go strictly with the data, the best way looks to be the hstore-powered solution, but I think the HLL is probably the right choice.
?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.
I'm also confused by the benchmark. The autocannon script has:
    // 10% of the time, get *all* the users
    request.method = 'GET'
    request.path = `/posts/${postId}/seen-by/users`
Does this imply there's a path it can hit to get the list of users who saw a post? HLL is a probabilistic data structure, so it can't support that with 100% accuracy. You could try to reverse-engineer the set by testing every known user against the sketch, but the performance of that would vary with the number of users, which didn't seem to be captured by the benchmark. I tried to look at the source code, but the GitLab URL was 404ing for me.
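For what it's worth, this is roughly what the HLL side looks like with the postgresql-hll extension (I'm assuming that's the implementation the post uses; the table and names below are mine). The structure only ever answers "about how many distinct users?", never "which users?":

    CREATE EXTENSION IF NOT EXISTS hll;

    CREATE TABLE post_seen_hll (
        post_id bigint PRIMARY KEY,
        seen    hll NOT NULL DEFAULT hll_empty()
    );

    -- Record a view: hash the user id into the sketch.
    UPDATE post_seen_hll
       SET seen = hll_add(seen, hll_hash_bigint(7))
     WHERE post_id = 42;

    -- Approximate distinct-viewer count (roughly 2% error at the
    -- default register settings). There is no operation that
    -- enumerates the user ids back out of the sketch.
    SELECT hll_cardinality(seen)
      FROM post_seen_hll
     WHERE post_id = 42;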
> ?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.
Just guessing: the benchmark tells you the time complexity. It doesn't tell you the space complexity. The author is optimizing between the time- and space-complexity of the solutions, with the time-complexity benchmarks as an input. (Also, at a scale the benchmark doesn't reach, space-complexity starts to affect time-complexity, as large datasets become less able to be hot in disk cache.)
That sounds like it's a bad benchmark, then? I mean, it's still interesting, but a synthetic benchmark that fails to map to real use is... I guess actually of negative value, since it's misleading.
Yeah, this is probably the reason, and the author even elaborates on it in the sentences following the bit quoted by cldellow:
> Even though the data says hstore, knowing that posts will be seen by more and more people over time, I might choose the HLL solution for an actual implementation. It's far less likely to pose a bloated row problem, [...]
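To make the bloat concrete: in the hstore approach every viewer id lives inside the row itself, something like this (my own sketch of the idea, not necessarily the post's exact schema):

    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE post_seen_hstore (
        post_id bigint PRIMARY KEY,
        seen_by hstore NOT NULL DEFAULT ''::hstore
    );

    -- Record a view: user 7's id becomes a key. Concatenation
    -- overwrites an existing key, so repeat views are deduplicated.
    UPDATE post_seen_hstore
       SET seen_by = seen_by || hstore('7', '1')
     WHERE post_id = 42;

    -- Distinct viewer count = number of keys.
    SELECT array_length(akeys(seen_by), 1)
      FROM post_seen_hstore
     WHERE post_id = 42;

Unlike the fixed-size HLL column, this value grows linearly with the number of viewers; once it passes a couple of kilobytes Postgres moves it into TOAST storage, and every new view rewrites the whole thing.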
Just posting here too, but yup this is exactly what I was trying to convey.
Hstore might have been the fastest, but the way I'm using it, or the scale the use case could grow to, might not work out.
Bigger benchmarks could have been done! Maybe a multi-part post would have been better, so I could split the methodology from the results and my blabbering about the approach!
> So I'd create an associated table that records the user, the time, and the object viewed, plus a field on the object record holding the rolled-up count. The problem with duplicate views is solved by wrapping the insert and the increment of the counter in the same transaction: if the insert fails because it violates the unique index, the increment fails too.
I probably missed something?