I have to admit this is over my head. But sometimes being stupid is also what works, and I've always abided by the old adage that "it's better to have a tractor that works than a limo that you can't get out of the garage". I realize that in this case the problem was maybe just a pretext for demonstrating HyperLogLog, but in case someone is looking for a simple solution to this very problem, maybe this naïve solution could work (use at your own peril!):
So I'd create an associated table that records the user, the time, and the object viewed, plus a field on the object record holding the rolled-up count. The problem with duplicate views is solved by wrapping the insert and the increment of the counter in the same transaction: if the insert fails because it violates the unique index, the increment fails too.
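Something like this in Postgres (a minimal sketch; the table and column names are just illustrative, not from the post):

    -- One row per (post, user) view; the primary key doubles as the
    -- unique index that rejects duplicate views.
    CREATE TABLE post_views (
        post_id   bigint NOT NULL,
        user_id   bigint NOT NULL,
        viewed_at timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (post_id, user_id)
    );

    CREATE TABLE posts (
        id         bigint PRIMARY KEY,
        seen_count bigint NOT NULL DEFAULT 0
    );

    -- Both statements in one transaction: if the INSERT violates the
    -- primary key, the whole transaction aborts and the counter is
    -- never bumped.
    BEGIN;
    INSERT INTO post_views (post_id, user_id) VALUES (42, 7);
    UPDATE posts SET seen_count = seen_count + 1 WHERE id = 42;
    COMMIT;

In practice you'd probably use INSERT ... ON CONFLICT DO NOTHING and only bump the counter when a row was actually inserted, so repeat views don't abort a transaction every time.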
I felt like I didn't understand the blog post very well, either.
It spends a lot of time building a benchmark framework and measuring things. At the end, hstore is the choice supported by the data, but instead the author writes:
> If we go strictly with the data, the best way looks to be the hstore-powered solution, but I think the HLL is probably the right choice.
?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.
I'm also confused by the benchmark. The autocannon script has:
    // 10% of the time, get *all* the users
    request.method = 'GET'
    request.path = `/posts/${postId}/seen-by/users`
Does this imply there's a path it can hit to get the list of users who saw a post? HLL is a probabilistic data structure, so it can't support that with 100% accuracy. You could try to reverse-engineer the set by testing every known user against the sketch, but the performance of that would vary with the number of users, which didn't seem to be captured by the benchmark. I tried to look at the source code, but the GitLab URL was 404ing for me.
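For what it's worth, this is roughly what the HLL side looks like with the postgresql-hll extension (I'm assuming that's the implementation the post uses; the table and names below are mine). The structure only ever answers "about how many distinct users?", never "which users?":

    CREATE EXTENSION IF NOT EXISTS hll;

    CREATE TABLE post_seen_hll (
        post_id bigint PRIMARY KEY,
        seen    hll NOT NULL DEFAULT hll_empty()
    );

    -- Record a view: hash the user id into the sketch.
    UPDATE post_seen_hll
       SET seen = hll_add(seen, hll_hash_bigint(7))
     WHERE post_id = 42;

    -- Approximate distinct-viewer count (roughly 2% error at the
    -- default register settings). There is no operation that
    -- enumerates the user ids back out of the sketch.
    SELECT hll_cardinality(seen)
      FROM post_seen_hll
     WHERE post_id = 42;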
> ?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.
Just guessing: the benchmark tells you the time complexity. It doesn't tell you the space complexity. The author is optimizing between the time- and space-complexity of the solutions, with the time-complexity benchmarks as an input. (Also, at a scale the benchmark doesn't reach, space-complexity starts to affect time-complexity, as large datasets become less able to be hot in disk cache.)
That sounds like it's a bad benchmark, then? I mean, it's still interesting, but a synthetic benchmark that fails to map to real use is... I guess actually of negative value, since it's misleading.
Yeah, this is probably the reason, and the author even elaborates on it in the sentences following the bit quoted by cldellow:
> Even though the data says hstore, knowing that posts will be seen by more and more people over time, I might choose the HLL solution for an actual implementation. It's far less likely to pose a bloated row problem, [...]
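To make the bloat concrete: in the hstore approach every viewer id lives inside the row itself, something like this (my own sketch of the idea, not necessarily the post's exact schema):

    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE post_seen_hstore (
        post_id bigint PRIMARY KEY,
        seen_by hstore NOT NULL DEFAULT ''::hstore
    );

    -- Record a view: user 7's id becomes a key. Concatenation
    -- overwrites an existing key, so repeat views are deduplicated.
    UPDATE post_seen_hstore
       SET seen_by = seen_by || hstore('7', '1')
     WHERE post_id = 42;

    -- Distinct viewer count = number of keys.
    SELECT array_length(akeys(seen_by), 1)
      FROM post_seen_hstore
     WHERE post_id = 42;

Unlike the fixed-size HLL column, this value grows linearly with the number of viewers; once it passes a couple of kilobytes Postgres moves it into TOAST storage, and every new view rewrites the whole thing.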
Just posting here too, but yup this is exactly what I was trying to convey.
Hstore might have been the fastest, but the way I'm using it, or the scale the use case could grow to, might not work out.
Bigger benchmarks could have been done! Maybe a multi-part post would have been better, so I could split the methodology from the results and my blabbering about the approach!
> So I'd create an associated table that records the user, the time, and the object viewed, plus a field on the object record holding the rolled-up count. The problem with duplicate views is solved by wrapping the insert and the increment of the counter in the same transaction: if the insert fails because it violates the unique index, the increment fails too.
I probably missed something?