At Heap, we allow users to send custom event properties through our API. Since we don't know what properties users will send in advance, we need to use something like JSONB to store them.
Maybe I'm missing something, but I think of optional columns as nullable (but declared) values. It sounds like you use JSONB to store arbitrarily declared values. If so, then I'm still confused by how you're able to hoist values out of JSONB data to save on perf and space. That implies those values weren't that arbitrary to begin with.
My bad. Right now, we use jsonb to store all of the properties of events besides the user_id, the event_id, and the time of the event. We have lots of built-in properties that are stored on some events but not others (for example, the text of a button that someone clicks on). Storing these properties in jsonb seemed like a good idea because, at the time, we didn't know about any of its downsides. The issues brought up in Dan's post are things we wish we had known before deciding to use jsonb.
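Roughly, the table we're describing looks like this (a sketch for illustration, not our exact schema):

    CREATE TABLE events (
        user_id  bigint,
        event_id bigint,
        time     timestamptz,
        data     jsonb  -- every other property, built-in and custom
    );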
We are currently planning to move all of the built-in fields into true Postgres columns instead of keeping them in jsonb. All of the custom properties will remain in jsonb.
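A minimal sketch of what that migration could look like, using a made-up built-in property (the column and key names here are hypothetical):

    -- Promote a built-in property to a real column.
    ALTER TABLE events ADD COLUMN target_text text;

    -- Backfill from the blob; '->>' extracts a key as text.
    UPDATE events SET target_text = data->>'target_text';

    -- Optionally drop the now-redundant key
    -- ('-' deletes a jsonb key on Postgres 9.5+).
    UPDATE events SET data = data - 'target_text';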
The user-provided data is arbitrary but not random. Certain keys like "name", "email", and "browser" will show up very often, and might even be included automatically by the client-side libraries and therefore be present in almost all rows. There may be other keys that appear only very rarely, say "my_super_special_id", and you want to be able to support those as well.
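For example, both of these queries have to work, even though one key appears in nearly every row and the other in almost none (table and key names are made up):

    -- Common, auto-collected key:
    SELECT count(*) FROM events WHERE data->>'browser' = 'Chrome';

    -- Rare, user-defined key must be queryable too:
    SELECT count(*) FROM events WHERE data->>'my_super_special_id' = '42';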
That's perfectly reasonable. But then you also let them query on those arbitrary custom properties, and that's where the performance issues are? If so, that's a fairly hard problem to solve.
Taking the well-defined subset of searchable properties and making them columns, as described in the article, is really the best solution.
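To make the payoff concrete, here's a before/after with a hypothetical property name:

    -- Before: the key lives inside jsonb, where the planner
    -- has no statistics on it.
    SELECT * FROM events WHERE data->>'email' = 'alice@example.com';

    -- After: the same filter on a real column, with real per-column
    -- statistics and an ordinary btree index available.
    SELECT * FROM events WHERE email = 'alice@example.com';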
Our schema is essentially (user_id, event_id, time, data), where data is a JSONB blob that contains every other piece of information about an event. Currently, we get a row estimate of one for pretty much every query. We've been able to work around the lack of statistics by using a very specific indexing strategy (discussed in a talk Dan gave[0]) that gives the planner very few options, and additionally by turning off nested loop joins.
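To illustrate the problem and the workaround (the event definition and index names here are invented): Postgres keeps no statistics on keys inside a jsonb column, so filters on them get a flat default estimate.

    -- The planner guesses rows=1 for this, no matter how common
    -- the event actually is.
    EXPLAIN SELECT * FROM events WHERE data @> '{"type": "click"}';

    -- Workaround: a partial index per event definition, which leaves
    -- the planner essentially one access path for matching queries.
    CREATE INDEX events_click_idx ON events (user_id)
        WHERE data @> '{"type": "click"}'::jsonb;

    -- And disable nested loop joins, which bad estimates otherwise favor.
    SET enable_nestloop = off;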
We are planning on pulling out the most common properties that we store in the data column, which will give us proper statistics on all of those fields. I am currently experimenting with what new indexing strategies we will be able to use thanks to better statistics.
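Once a property lives in a real column, a plain ANALYZE gives the planner actual distribution data, e.g. (hypothetical column name):

    -- Refresh statistics for the table.
    ANALYZE events;

    -- The planner now sees real distribution data for the column.
    SELECT n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename = 'events' AND attname = 'browser';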