Implementing "seen by" functionality with Postgres (supabase.com)
197 points by kiwicopple on July 18, 2022 | 52 comments



This is a bit of a weird post. The author sets up benchmarks, shows that hstore appears to work best, then suggests using HyperLogLog despite it performing worse. The stated reason is that it scales better, but the author didn't really discuss how HLL works, so I'm not sure why it scales better other than that it has a cool name.


Looks like I could have been much clearer, apologies!

The main idea behind picking HLL was that it was a good compromise: it keeps operations fast, comes close to the forget-nothing nature of the assoc table, and avoids the should-you-really-do-that nature of storing possibly hundreds of thousands or millions of numeric key/values in a single hstore column.
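
For concreteness, the hstore approach boils down to cramming every viewer into a single column on the post row, roughly like this (simplified sketch, illustrative table/column names, not the exact schema):

    CREATE EXTENSION IF NOT EXISTS hstore;

    ALTER TABLE posts ADD COLUMN seen_by hstore NOT NULL DEFAULT ''::hstore;

    -- record a view: set/overwrite the key for this user id
    UPDATE posts
    SET seen_by = seen_by || hstore('123', '1')
    WHERE id = 42;

    -- count distinct viewers = number of keys
    SELECT array_length(akeys(seen_by), 1) FROM posts WHERE id = 42;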

There’s definitely a whole ‘nother post in there on pushing it to its limits, but it was already long enough, I thought!


I agree; why not do the work to show that it scales better? It violates scientific integrity to have a hypothesis, run the experiment, find that your hypothesis was wrong, and then say it is probably right regardless.


They also point out that HLL doesn't have false positives but it does have incorrect counts (I don't really see how that's different tbh), and later say it has the same "absolute correctness" as an association table...


Since the author works for an organization that is built on Postgres I imagine this started out as a content marketing idea for Postgres. Somewhere along the way it went off the rails and became a piece about HLL technology (which is independent of the underlying DB tech).

I do like articles like this that attempt to solve real world challenges though! Keep up the good work.


I have to admit this is over my head. But sometimes being stupid is also what works, and I've always abided by the old adage that "it's better to have a tractor that works than a limo that you can't get out of the garage". I realize that in this case maybe the problem was just a pretext for demonstrating HyperLogLog, but in case someone is looking for a simple solution to this very problem, maybe this naïve solution could work (use at your own peril!):

So I'd create an associated table that records the user, time and object viewed, and a field with the rolled up count on the object record. The problem with duplicate views is solved by wrapping the insert and the upping of the counter in the same transaction: if the insert fails because of a breach of the unique index then the upping fails too.
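
Something like this, roughly (untested sketch, made-up names):

    CREATE TABLE post_views (
        post_id   bigint NOT NULL REFERENCES posts (id),
        user_id   bigint NOT NULL REFERENCES users (id),
        viewed_at timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (post_id, user_id)  -- a duplicate view breaches this
    );

    BEGIN;
    INSERT INTO post_views (post_id, user_id) VALUES (42, 123);
    UPDATE posts SET seen_count = seen_count + 1 WHERE id = 42;
    COMMIT;  -- if the INSERT breached the key, the transaction aborts and nothing sticks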

I probably missed something?


I felt like I didn't understand the blog post very well, either.

It spends a lot of time building a benchmark framework and measuring things. At the end, hstore is the choice supported by the data. But instead the author writes:

> If we go strictly with the data, the best way looks to be the hstore-powered solution, but I think the HLL is probably the right choice.

?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.

I'm also confused by the benchmark. The autocannon script has:

    // 10% of the time, get *all* the users
    request.method = 'GET'
    request.path = `/posts/${postId}/seen-by/users`

Does this imply there's a path it can hit to get the list of users who saw a post? HLL is a probabilistic data structure, so it doesn't support that with 100% accuracy. You could reverse engineer it from the list of users, but the performance will vary with the number of users, which didn't seem to be captured by the benchmark. I tried to look at the source code, but the GitLab URL was 404ing for me.


> ?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.

Just guessing: the benchmark tells you the time complexity. It doesn't tell you the space complexity. The author is optimizing between the time- and space-complexity of the solutions, with the time-complexity benchmarks as an input. (Also, at a scale the benchmark doesn't reach, space-complexity starts to affect time-complexity, as large datasets become less able to be hot in disk cache.)


That sounds like it's a bad benchmark, then? I mean, it's still interesting, but a synthetic benchmark that fails to map to real use is... I guess actually of negative value, since it's misleading.


Yeah, this is probably the reason and the author even elaborates on this in the sentences following the bit quoted by cldellow:

> Even though the data says hstore, knowing that posts will be seen by more and more people over time, I might choose the HLL solution for an actual implementation. It's far less likely to pose a bloated row problem, [...]


Just posting here too, but yup this is exactly what I was trying to convey.

Hstore might have been the fastest, but the way I am using it, or the scale this use case could grow to, might not work out.

Bigger benchmarks could have been done! Maybe a multi-part post would have been better, so I could have split apart the methodology and the results/blabbering about the approach!


Yup this is it! You’ve said it much better than I did, thank you.


I don't think I'm getting 404s, but the repo's gitlab url does keep bouncing me to a login page.

> You need to sign in or sign up before continuing.

Has gitlab always enforced that, even for public repos? Or is the linked repo not public?


Hey sorry for that, we've changed the links to a public Github repo, so everyone can take a look or submit a PR: https://github.com/VADOSWARE/supabase-seen-by


I can follow the links on the post now - thanks for updating!


The linked repo isn’t public; you can check through the author’s public repos on GitLab.


Hey, sorry the post wasn’t better written!

I tried to cover all the basic or obvious approaches (counter, assoc table) so that people could see the complexity kind of ramp up.

You’re right that an assoc table is a very workable solution, along with the partitioning capabilities of postgres these days.


Unless I knew there was going to be a huge boatload of data, I'd go for an association table too. That's the cleanest thing in terms of the DB.


If the goal is to count distinct users who have seen something, but also to efficiently retrieve the actual set of users who have seen something — then why wasn't a roaring bitmap (https://pgxn.org/dist/pg_roaringbitmap/) considered as an option? Presuming user IDs are non-sparse nonnegative integers (as SERIAL/BIGSERIAL pkeys usually are), a roaring bitmap should do everything hstore does, just a lot more efficiently.
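
Something like this, as a sketch (function names as I remember them from the pg_roaringbitmap docs; double-check before relying on them):

    CREATE EXTENSION IF NOT EXISTS roaringbitmap;

    ALTER TABLE posts ADD COLUMN seen_by roaringbitmap
        NOT NULL DEFAULT rb_build('{}'::integer[]);

    -- record a view (user ids are 32-bit ints inside the bitmap)
    UPDATE posts SET seen_by = rb_add(seen_by, 123) WHERE id = 42;

    -- exact distinct-viewer count
    SELECT rb_cardinality(seen_by) FROM posts WHERE id = 42;

    -- and unlike HLL, the actual user ids can be read back out
    SELECT rb_to_array(seen_by) FROM posts WHERE id = 42;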


I just wanted to say thanks for linking. I feel like I'm always on the search for new PG storage extensions, even if I'm not able to use them everywhere (sad to see AWS RDS wasn't in the list of vendors that include/support it).


The docs/paper for roaring bitmaps say it's meant for large datasets, but I must have missed where "large" is defined. In practice, how large does a set have to be to take advantage of roaring bitmaps?


The essential consideration is that roaring bitmaps are compressed — trading increased CPU overhead per read/write, for lower size-on-disk => more rows fetched per IOP + better cache-coherency. So think of the point where the CPU overhead of compression/decompression for every read/write becomes worth those gains.

Most developers make this kind of judgement every day: when deciding whether to e.g. gzip a file before SCPing it somewhere; or whether to apply e.g. lz4 to the data being written to a KV store like Redis. IMHO that same intuition applies here.

See also: the heuristic Redis uses to decide whether to back a set with a hashtable, vs. a sorted list — sits around 512 elements: https://redis.com/ebook/part-2-core-concepts/01chapter-9-red...

---

Mind you, the compression in roaring bitmaps is very efficient, has extremely low start-up costs, and is effectively "random access" rather than streaming; these low costs are why roaring bitmaps are preferred over other compressed-bitmap approaches.

And also, it's not like, at small data sizes, even a "large" (hundreds/thousands of ops) CPU penalty would really be much of a penalty, either — because, with small data, you wouldn't be paying it very often! And within a "heavyweight" RDBMS like Postgres, that does a bunch of things when you make a request to it, these sorts of small CPU costs usually get lost entirely under the static overhead of query planning. (Postgres takes advantage of this property in a lot of other ways as well.)

It's only when you're trying to do something at scale, but where dataset size isn't the scaling factor (example: running a core DNS server — trillions of queries, but against only a small amount of data) that a roaring bitmap would be an explicit "wrong choice" vs. a sorted-list or b-tree or hashtable-backed presence set.


> deciding whether to e.g. gzip a file before SCPing it somewhere

Off topic, but to not waste time on such trivia in the future, add

  Compression yes
to your ~/.ssh/config.


Not a good idea if you're frequently sending things that are already compressed, though. Which many formats today are.

(Also, I was using `scp` as a general abstraction; personally, I'm not often pushing things around between VMs directly, but rather between VMs and object storage; where part of the question is not just about the transport encoding, but rather whether it's worth it to have the data be compressed-at-rest in the object store. Usually depends on how many times you're going to be downloading it.)


Author here, a roaring bitmap could absolutely have been used!

The concept is similar and it absolutely should have been mentioned.

> Presuming user IDs are non-sparse nonnegative integers (as SERIAL/BIGSERIAL pkeys usually are),

I personally wanted to build a method that worked easily for textual content, but with that constraint (which is valid, and is what I ended up doing in practice) a roaring bitmap definitely works!

I want to give this a try in the future


A way to do this for read-heavy applications is to use the separate table to store who has viewed the post, but then use a trigger to update a seen-count column. When you write a (unique) row to the table, the trigger counts all the views and updates the posts table. Reads incur no extra overhead, only recording views does. You can queue/process the views at your leisure, or pause them altogether as a circuit breaker.
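
Roughly, as an untested sketch (illustrative names):

    CREATE OR REPLACE FUNCTION bump_seen_count() RETURNS trigger AS $$
    BEGIN
        UPDATE posts
        SET seen_count = (SELECT count(*) FROM post_views WHERE post_id = NEW.post_id)
        WHERE id = NEW.post_id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER post_views_bump_seen_count
        AFTER INSERT ON post_views
        FOR EACH ROW EXECUTE FUNCTION bump_seen_count();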


I expected to find that solution in the article. Or, if someone doesn't like triggers, you can just insert a row into the association table; if it fails with a duplicate-key error, do nothing, otherwise increase the counter.
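
In a single statement that could look something like this (sketch, illustrative names):

    WITH ins AS (
        INSERT INTO post_views (post_id, user_id)
        VALUES (42, 123)
        ON CONFLICT DO NOTHING
        RETURNING 1
    )
    UPDATE posts
    SET seen_count = seen_count + 1
    WHERE id = 42
      AND EXISTS (SELECT 1 FROM ins);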


This is a great tip, thanks for sharing — reducing read overhead to a minimum and even delaying the processing is fantastic.

This would have been great to include as a sort of better assoc table approach


Sidetracking a bit: does anybody know what tool was used to generate the table diagrams?


If it's a tool I'd love to know, too. However it very well might just be a hand-crafted vector.


These ones are custom (we use figma)


Haha, I love the start as a form of riffing on bizarre things you come across in the wild.

The year is 2022. You're head DBA at the hot new social site, SupaRate... Your startup is seeing eye-boggling growth because everyone loves sharing their positive AND negative ratings restricted to UNSIGNED TINYINTs.

Why Unsigned? Because! How do you know if it's positive or negative? There's a VARCHAR(256) field for that, duh.


Can anyone else see the source? I am getting a 404 and would like to see the SQL statements that implement these tests.

>not to mention a lot of concurrent database updates (which isn't necessarily Postgres's forte to begin with)

Isn't PG a multithreaded DB server, with the default row store tuned for OLTP with row-level locking rather than OLAP? I could see this being said for SQLite but not PG.


I can't see the source either.

Re the concurrent updates - yes, Postgres is multi-process. However, the hstore and HLL approaches are going to have a lot of contention at the row level. The association table wouldn't have that contention, since each user would get their own row.

The hstore/HLL approaches _could_ be sharded on a hash of the user's ID. In fact, this would work really well for HLL since I think you can union HLL structures quickly. But I don't think any of this was addressed in the article...
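
(To sketch what I mean by sharding plus unioning, using the postgresql-hll functions -- this isn't from the article:)

    CREATE EXTENSION IF NOT EXISTS hll;

    CREATE TABLE post_seen_hll (
        post_id bigint NOT NULL,
        shard   int    NOT NULL,  -- e.g. user_id % 16
        seen    hll    NOT NULL DEFAULT hll_empty(),
        PRIMARY KEY (post_id, shard)
    );

    -- record a view in the user's shard
    INSERT INTO post_seen_hll AS t (post_id, shard, seen)
    VALUES (42, 123 % 16, hll_add(hll_empty(), hll_hash_bigint(123)))
    ON CONFLICT (post_id, shard)
    DO UPDATE SET seen = hll_add(t.seen, hll_hash_bigint(123));

    -- approximate distinct viewers = cheap union across shards
    SELECT hll_cardinality(hll_union_agg(seen))
    FROM post_seen_hll
    WHERE post_id = 42;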


>Re the concurrent updates - yes, Postgres is multi-process. However, the hstore and HLL approaches are going to have a lot of contention at the row level. The association table wouldn't have that contention, since each user would get their own row.

Each process is an OS-level thread; again, saying that concurrent updates are not PG's forte seems pretty ridiculous just because some of your approaches use in-row storage rather than the typical relational design. The statement seems unnecessary and may give someone with less experience the wrong idea about PG's concurrency capabilities.


Hey sorry for that, we've changed the links to a public Github repo, so everyone can take a look or submit a PR: https://github.com/VADOSWARE/supabase-seen-by


The ending was weird. After all that test setup and data crunching it became a bit of "I feel this is better".


If you like to jump to the comments before reading the post (like me), this is an exploration of a fairly common situation: calculating the total number of people who have "viewed/seen" a blog post.

tldr: use either HyperLogLog [0] (if you can install extensions on your Postgres database), or hstore [1] if you can't.

HyperLogLog is a probabilistic data structure which can "estimate" the number of viewers, so it scales very well for Twitter/FB-scale platforms.

[0] HyperLogLog: https://en.wikipedia.org/wiki/HyperLogLog

[1] hstore: https://www.postgresql.org/docs/current/hstore.html


About 500 rps max? That seems pretty low; I can easily do something like 25k rps with 1 CPU using ClickHouse for the same scenario. Am I missing something?


Nope you're not missing anything -- sometimes postgres isn't the absolute best solution to a problem.

Part of the setup was to get to that "v1" as fast as possible, and what the imaginary team member already had available was Postgres. Even with that setup I made sure to leave this in there:

> (It will be a recurring theme but this is a spot where we probably don't necessarily want to use stock Postgres but instead want to use tools like Citus Columnar Storage, ZedStore, or an external choice like ClickHouse).

Also note that there are some optimizations that are not discussed here -- an unlogged table for the HLL itself, for example, triggered/delayed computation, etc. In general, Postgres isn't the greatest for OLAP workloads, but it is quite flexible.

The code is open though: https://github.com/VADOSWARE/supabase-seen-by

Lots to improve upon there, but I tried to make sure the code was very easy to understand and easy to extend -- would love to have a Clickhouse version :)


So how does the HyperLogLog work exactly? It seems like the solution of choice, but I'd like to know more about the internals.


Basically, hash all your data items, keep track of the maximum number of leading 0 bits seen in a hash so far.

This is a decent proxy (but not without error) for how many distinct items you've seen so far since large numbers of leading zeros should be uncommon. Finding this means you probably saw a lot of other things to get to that point (intuitively, about N leading zeroes are expected after seeing 2^N items).

This is actually the same thing that a proof of work cryptocurrency does to control difficulty: change target number of leading zeros so miners have to do more/less work to find a match.

Of course, you could get "lucky" with a single counter and the resolution isn't great, so HLL first separates the values into buckets which are estimated separately and then combined to give a more robust estimate.


Hey, I wrote the article -- the way I think of HyperLogLog and similar (but different) approaches like bloom filters and count-min sketches is that the key insight is hashing.

The fact that we can reduce some piece of data to a hash, then work on the distribution/entropy of those hashes in number space in many different ways, is what makes these data structures work.

I think of it this way -- if you have a hashing algo that turns “someone@example.com” into “7395718184927”, you can really easily count with relative certainty how often you see that email address by keeping a sparse numeric associative array, since getting the same exact value again produces the same hash. Obviously doing that alone isn't SUPER useful, because you just get the strict cardinality (same as just checking equality with the other keys). But if you choose a weak enough, fast hashing function (or multiple, in the case of a count-min sketch), some values will collide in what they hash to while others won't -- so there will be some amount of error -- and you can control that error to suit your needs, on a scale from everything landing in one bucket to exactly cardinality-many buckets.

Here’s an actual proper guide (CS lectures, as you might expect!) I found after a quick search, which you might find useful:

https://williams-cs.github.io/cs358-f21/lectures/lecture10/l...

(IMO focus on the algorithm description)

And here’s some old HN discussion:

https://news.ycombinator.com/item?id=7508658

To really try and understand the concept I found it useful a while back to actually try and build it:

https://vadosware.io/post/countmin-sketch-in-haskell/

http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf (This paper has a really good visual depiction)

I find the count-min sketch to be much more approachable


I don't have a good enough grasp of HLL to summarize accurately, so for convenience here is the original paper[0], and hopefully HN comes through with a more digestible explanation.

The blog post uses Citus' Postgres extension[1], which uses an augmented variant of the original algorithm to reduce the size of the data structure so that it is fixed-size.

[0] HLL paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

[1] postgresql-hll: https://github.com/citusdata/postgresql-hll
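
For a feel of the API, basic usage of the extension looks roughly like this (a sketch against illustrative tables; see the extension README for the full interface):

    -- approximate distinct viewers of a post, built from raw view events
    SELECT hll_cardinality(hll_add_agg(hll_hash_bigint(user_id)))
    FROM post_view_events
    WHERE post_id = 42;

    -- or keep one pre-aggregated hll per post and grow it as views come in
    UPDATE post_seen
    SET seen = hll_add(seen, hll_hash_bigint(123))
    WHERE post_id = 42;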


If I understand correctly, these numbers are the result of users viewing posts uniformly at random? While the post is a nice overview of options, I doubt the numbers generalize to the much more realistic scenario of a power-law distribution of view counts across posts.


I thought of that! I cover it very crudely, and the generation code isn’t as nice as I would have liked, but I did try to pick certain posts to receive a disproportionate share of the views (and also to make sure posts were not generated equally by everyone).

I’d love to have some feedback on the attached repo!


> We won't have false positives (like with a bloom filter) -- we'll have a degree of error (i.e. the actual count may be 1000, but the HLL reports 1004).

Isn’t what the article is describing a false positive? I expected an undercount.


The mentioned Postgres extension, citus/postgresql-hll, hasn't been updated for about a year.

Is it really reliable to use it in production?


Have there been issues that have gone unaddressed?


When we used it (granted, years ago), it would crash occasionally.

Looking at GitHub, there's at least one open issue around crashes based on how large of an HLL you use.


If a simple implementation provides 80% of the benefit - just use the simple implementation.


The author suggests a data store without explicitly explaining why; the reason being that the benchmarks focused on time, not space at scale, while the context assumes the need to scale. At scale, HyperLogLog appears to be a good choice for managing space demands.

This paper covers analysis of HyperLogLog (HLL):

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

TLDR: Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.

Source:

https://en.wikipedia.org/wiki/HyperLogLog



