
I don't quite get this. How fast would running this query be:

   SELECT loyaltyMemberID
   FROM table
   WHERE gender = x
   AND (age = y OR censor = z)
Why the random complexity with individual unions and a group? Of course that's going to be dog slow.

Sure, the filters can be arbitrary but with an ORM it's really really simple to build them up from your app code. The Django ORM with Q objects is particularly great at this.

Obviously I'm armchairing hard here but it smells like over engineering from this post alone. Stuff like this is bread and butter SQL.

Edit: I've just read the query in the post again and I really can't understand why you would write it like that. Am I missing something here?

Seems like a fundamental misunderstanding of SQL rather than a particularly hard problem to solve.




> Stuff like this is bread and butter SQL.

Ten or fifteen years ago, sure - a DBA would look at a query plan and figure out how to do it properly. Worst case, you'd slap a materialized view in and query that.

But this is 2018! Programmers don't want to treat the database as anything but one big key value store ;)


If you aren't implementing your own tabular database on top of your existing tabular database, you aren't 2018 enough :)


I'm changing the game by code generating a tabular database on top of an eventually consistent store. 2020 here we come!!!


Rows by any other name would smell as sweet...


better make sure it's all immutable too.


oh boy do I have news for you https://www.cockroachlabs.com/


not an eventually consistent store though.


Yeah, sadly, this is not too much of an exaggeration. I've worked on teams that insisted they needed DynamoDB, because, well, Dynamo is for "Big Data", and they certainly wouldn't work somewhere that had "Small Data"! Replace the buzzwords/products as applicable; you could actually probably just scramble them and it'd work just as well, since someone out there thinks "RabbitMQ means Web Scale", etc.

SQL databases are amazing, robust examples of engineering. They are your friends and they're the appropriate choice for the vast majority of software. They are not outmoded or passe. Though I acknowledge there is a separate use case for K-V stores, I almost want to make policy preventing their use just because I know so many developers will abuse them badly and then stare back at you blankly during the semi-annual massive downtime event, muttering something like "Well, it's based on research at Google, so I'm sure there's a way to recover the data..."


Every fool knows that MongoDB is web scale. (Not as web scale as /dev/null, but MongoDB has a better logo.)


I disagree. /dev/null's logo is way cooler.

https://www.nasa.gov/sites/default/files/cygx1_ill.jpg


I think this is a case for the return of the traditional "sysadmin", as "devops"/"SRE" is now the role of unblocking deploying a solution instead of questioning its complexity/fitness.


While I agree, I also think that overt gating and approval processes create a high tension dynamic that frequently breaks down, whether it's ops v. dev, security v. dev, or others. It's easy for people to get their pride wounded, and they end up encouraged to find workarounds to the process. The simple answers to this are pretty much imaginary, unfortunately.


Sure, of course. I was hoping to point out that there's an increasingly overlooked value in having someone question complexity. The "no, you don't need React" of the frontend dev or the "our data is actually relational" of the back end dev.


If your “SRE” team is only “unblocking deploying a solution” then I’m sad to say they are an operations team who has rebranded themselves to appear more relevant.


That's most "SRE" -- it's a title arms race in that field between the underqualified and those that wish to convey they know how to do more than write system scripts in DSLs


Question, though...

If your columns have types, how can you encrypt them using custom keys for each one? Is it possible?

I want the keys to reside on the client and search for encrypted data. Basically single row lookups at a time.


The column type would have to be the type of the encrypted value. The type of the unencrypted data could not be enforced by the DB and you would have to rely on code doing the correct thing.

I am however extremely wary of doing it that way. I don't know your requirements of course.


Here is the thing - encrypted stuff is just a weird encoded string. So I can’t really use columns normally.

What I really need is just a huge table with two fields: “token”, “content”

And the token is basically the primary key but encrypted with whatever encryption.

You could even do foreign keys this way.

Hmm I suddenly have an idea. What about a layer above the database that basically enforces foreign keys and joins in this way to support end to end encryption? The content would reference ENCRYPTED foreign keys. Only clients would decrypt stuff.


>What I really need is just a huge table with two fields: “token”, “content”

Sounds more like a key value store and less like a relational database. Although you can store key value data in a relational db of course, there may be a better tool for the job.


You'll be interested in CryptDB.


Are you sure you aren't better off searching for hashed data? The only thing it doesn't really support is non-exact matches.


Also, how do you index that?


Do you have any recommendations for resources to learn best database practices? I'm currently designing my first database and I'm not sure what information is worth storing (like calculations) and how to choose which data to group in tables.


SQL Antipatterns is good as another commenter recommended. But my favourite book on the topic is Markus Winand's SQL Performance Explained. Most of it is online here: https://use-the-index-luke.com/ but I recommend buying it since it's tiny and worth its weight in gold.

It's short, so you actually read it and possibly reread it. It's to the point. It has pretty pictures. And it has directly applicable advice.


I'm going to go out on a limb and suggest you start writing the application without a database first.

My early education on databases always seemed to follow a "how do we make a database do this?" rationale rather than "what data do we need to store to support these features?", which I think leads to a software design that is too strongly coupled with the database. Software modules end up dependent on database features, or table structure, and refactoring or switching data stores becomes more costly.

Instead, start with a simple in-memory data store - a list of objects with some interface for accessing them will probably be your starting point. Add some basic serialisation/deserialisation features (CSV, JSON, etc) when you get past initial testing and require some persistent data. Then, once you have your API in place and your software design is stabilising, you should be able to map that data to a database fairly easily:

* The primary structure maps to your main table

* Child structures become additional tables, with foreign keys

* Data used to lookup records can be indexed for better performance
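
Concretely, that mapping might look something like this rough Postgres sketch (table and column names are purely illustrative, not taken from the post):

    CREATE TABLE orders (                 -- primary structure -> main table
        id          bigserial PRIMARY KEY,
        customer    text NOT NULL,
        created_at  timestamptz NOT NULL DEFAULT now()
    );

    CREATE TABLE order_items (            -- child structure -> additional table with a foreign key
        id        bigserial PRIMARY KEY,
        order_id  bigint NOT NULL REFERENCES orders (id),
        sku       text NOT NULL,
        quantity  integer NOT NULL
    );

    -- data used to look up records -> indexed for better performance
    CREATE INDEX idx_orders_customer ON orders (customer);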

Beyond that, you should profile/benchmark your application to find what needs to be optimised, and then investigate whether your software design or your data store should be doing the optimisation.

Let your software's features influence the design of your database. Don't let the database's features influence the design of your software.


I read SQL Antipatterns[0] - Avoiding the Pitfalls of Database Programming by Bill Karwin a while ago and really liked it.

[0] https://pragprog.com/book/bksqla/sql-antipatterns


Ok, first you need to decide: is this going to be a purely transactional database (for business processes) or do you also plan to do data analysis straight inside this database (meaning you won’t be extracting, transforming and loading data into another database and analyzing it there).

If it’s transactional, I recommend keeping calculations only if you need to access summarized data frequently. For example, if you are tracking inventory by storing the history of transactions that occur into and out of inventory, it’s trivial to find out how much of each item you have in stock at any point in time by doing a sum of the change in quantities for each item type up to that point.

If you were usually interested in the “current” count, it would be expensive to perform this sum every time, so instead you could keep a separate table with the running total of inventory per item and refer to that. Keep this table up to date through the use of triggers on insert events. (Note that your log of inventory transactions would thus be an immutable stream of events.)
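
A minimal Postgres sketch of that trigger approach (table, column, and function names here are assumptions, not from the post):

    CREATE TABLE inventory_transactions (
        id          bigserial PRIMARY KEY,
        item_id     bigint NOT NULL,
        qty_change  integer NOT NULL,     -- positive into stock, negative out
        created_at  timestamptz NOT NULL DEFAULT now()
    );

    CREATE TABLE inventory_totals (
        item_id  bigint PRIMARY KEY,
        on_hand  integer NOT NULL DEFAULT 0
    );

    -- Keep the running total in sync on every insert into the transaction log.
    CREATE FUNCTION apply_inventory_change() RETURNS trigger AS $$
    BEGIN
        INSERT INTO inventory_totals (item_id, on_hand)
        VALUES (NEW.item_id, NEW.qty_change)
        ON CONFLICT (item_id)
        DO UPDATE SET on_hand = inventory_totals.on_hand + EXCLUDED.on_hand;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_apply_inventory_change
        AFTER INSERT ON inventory_transactions
        FOR EACH ROW EXECUTE PROCEDURE apply_inventory_change();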

An example of something not worth storing is derived data that is a combination of separate columns in a table. For most queries it’s probably trivial to be lazy and wait to perform such a calculation until you actually need it. If you still want to have a ready made “table” that has all the computations you want already entered as columns for easy querying, use a view. If you find yourself making liberal use of views, you’re on the right track.

If you are using a separate data warehouse for data analysis, then precompute and denormalize as much as you can.

*Disclaimer: most of what I’m talking about is from a Postgres perspective.


I’d suggest learning the fundamental normal forms - once you understand them database design is very straightforward.

https://en.wikipedia.org/wiki/Database_normalization


You can thank full stack developers for that


If they don't know how to look at a query plan and craft a solution can you really call them full stack?


The stack has to bottom out somewhere and it's usually at the application code


If that’s where they bottom out they should be called “mid stack” rather than full stack.

If you are claiming to be full stack you better be prepared to go all the way.


I would describe myself full stack under that definition. Most "full stack" people I know that sit in my Uni courses have mostly learned Java EE + Oracle DB or Javascript + /dev/null^w^w MongoDB. Most of them would probably not be able to construct a relational database or libc from scratch.

Granted, such knowledge isn't immediately useful, since it isn't something I or anyone is likely to do, but it grants insight into systems. I know roughly what a query optimizer does and what it can, and more importantly, can't do.

When you know a system you can optimize for it. When you don't know a system you can only follow someone else's advice on how to optimize for it.


> /dev/null^w^w MongoDB

Nitpick: That's actually either three ^w or just one ^w, depending on how your WORDCHARS is set up. :)


I'm sadly not that much of a Vim expert. Hacker News lacks formatting for strikethrough, and I wasn't sure how a ~~/dev/null~~ would be interpreted.


How much quantum mechanics do you really need to know to code the next Uber for hamsters?


Isn't the current quest in quantum mechanics (to find a grand unified theory) to find an abstraction that does leak? At the moment it's too self contained and doesn't explain anything about the macroscopic world.

In any case, having a basic understanding of the next level up (the electron) has proved quite useful to my career, otherwise I wouldn't know how turning things off then on again affects the machines I'm working with.


Them, or the manager that hired them? If no one up the chain brings on a DBA what are they supposed to do? I hear ya. But this is as much a symptom of naive (and budget stretched) leadership as it is of the hands on deck.


Your solution costs 150K$/year while theirs took a weekend and saves 50K$/year...


Allow me to clarify.

1) It took a weekend to complete. The friction was building for far longer. There's a cost to that, esp. if it affects customer satisfaction and retention. They didn't refactor for fun, did they :) How many dev teams aren't so lucky? Is this article a no-choice outlier, or a best practice?

2) My comment wasn't directed at the article but on another comment that blamed the developers. These problems should be owned by ownership / leadership / management more and engineers less.

3) That said, hire a DBA? I don't think that's necessary.


I call myself full stack and I can tune a database. Decent database design comes first though.


>Seems like a fundamental misunderstanding of SQL rather than a particularly hard problem to solve.

Without knowing the rest of their stack, or what their data ingestion looks like, I think your query is oversimplified. If they are doing a union, then it's likely they aren't querying one table, but they are querying multiple tables. The article mentions that individual customers had as many as 500 million rows. Likely each customer has their own set of data they also pipe into the system. Next their own custom query language may support more complex algebra than standard equality.

IMO, the article doesn't sufficiently describe the problem for us to understand why their solution works. To you and me there are 100 other solutions they could have tried that seem simpler than the one they presented.

It's less likely that they overengineered - we are probably just underinformed.


>It's less likely that they overengineered - we are probably just underinformed.

Based on 15 years in software companies in the valley, it's much more likely that this is over-engineered. Nearly every decision I've seen chasing technology hype has been based on ignorance of existing solutions.


The user data is most likely in rows instead of columns. Instead of having

    id, name, age, gender
    1213, fake, 60, female
they would have

    property_id, user_id, value
    1 (assume age), 1213,    60
    2 (gender),     1213,    female
This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.


Entity attribute value anti pattern - this has been well known for at least 20 years. It can be tempting when you want to design a "flexible" system but really needs to be used sparingly. I was BI team lead on a product where the architect insisted that it be used on every entity (>300) as you never knew when you might want to add some bit of data. It led to some interesting (multi-page) sqls and the project ultimately failed. This was one of the reasons. Slow performance and often runtime errors when expected data wasn't present and the application layer couldn't cope. It was a good learning experience. https://mikesmithers.wordpress.com/2013/12/22/the-anti-patte...


We have this as a "meta" field (because MySQL is balls at adding new columns to big tables without multi-hour downtime) with some arcane nonsense format. Totally unqueryable with any efficiency.


The EAV pattern has trade-offs you need to compensate for (performance). Production systems that use EAV have flat tables and heavy caching to be flexible with acceptable performance.


You could argue that there are cases for it. Datomic[0] is basically EAV on steroids.

https://www.datomic.com/


Oh gosh this pattern. The first time I encountered it was in my first job where we used Magento. Super flexible. Also super slow. Does anyone have any advice how to make a db design like this work faster? Generally I thought when data is arranged like this it might be a prime candidate for document based storage. But I'm no dba so I have no idea if that would be correct.


If you are using Postgres, the JSONB datatype will let you do exactly this while still using the full power of SQL. Simply create a column where you keep a JSON object full of random user properties, if flexibility is what you want. You can even index properties.
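
A rough sketch of what that looks like (Postgres; table and column names are just illustrative):

    -- A JSONB bag of arbitrary per-user properties.
    ALTER TABLE members ADD COLUMN properties jsonb NOT NULL DEFAULT '{}';

    -- One GIN index covers containment queries over all keys at once.
    CREATE INDEX idx_members_properties ON members USING gin (properties);

    -- "gender = female AND (age = 60 OR censor = z)" against the JSON bag.
    SELECT loyaltyMemberID
    FROM members
    WHERE properties @> '{"gender": "female"}'
      AND (properties @> '{"age": 60}' OR properties @> '{"censor": "z"}');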


Or just create ad hoc tables with user fields. Quite often it's not that a customer has n different fields for n entities, but a few that apply to the majority (like internal ERP ids, classification etc.). Put them in a few tables, index them, join them. If you don't want to parse internal DB descriptors, create a set of "schema" tables to build queries from.


The question is whether it can be stored like this while allowing for fast queries. For example, unless it changed recently, Postgres doesn't calculate statistics to help the query planner on jsonb fields.


JSONB columns should behave just like any datatype with a GIN index in the recent releases, to my knowledge.

Still, a JSON column will arguably be faster than a pure KV table since you can more efficiently query it, especially any non-JSON columns.


IIRC JSONB still has problems with index statistics.

So values in JSONB columns can be indexed nicely, but the statistics can be much worse than for non-JSONB columns, which can lead the query planner astray.


Postgres will let you create an index on an expression into the JSON column, so querying should still be very quick.
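
For example (assuming a jsonb column named properties; the query has to repeat the indexed expression):

    CREATE INDEX idx_members_age ON members ((properties->>'age'));

    SELECT loyaltyMemberID
    FROM members
    WHERE properties->>'age' = '60';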


The answer would make a great blog post.


This just isn't that hard. They don't have that much data. It is really late for me, but: put it all in memory and figure it out. These just aren't hard problems. DBAs have been solving performance issues with a clever index on the right column for 30+ years. Sorry if this is get-off-my-lawn-ish, but I have been on too many projects where I made a DB index and solved a major bottleneck. Too many new developers are ignorant of the nuances of RDBMS tuning. I am not even a DBA.


I am not a DBA, but I could see this crap when the NoSQL hype took off.

"Relational databases don't scale"

Well they worked fine for decades before Mongo's marketing claimed so.


> Does anyone have any advice how to make a db design like this work faster?

Normalize it properly. If this is not possible, ensure that both the userid and the property are indexed
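
Something along these lines, against the properties(property_id, user_id, value) table sketched upthread:

    -- Covers "all properties for this user" lookups.
    CREATE INDEX idx_properties_user ON properties (user_id, property_id);

    -- Covers "all users with this property = value" lookups.
    CREATE INDEX idx_properties_lookup ON properties (property_id, value, user_id);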


If what you actually want is data arranged like that, Datomic is probably a prime candidate

https://www.datomic.com/


If they are using some sort of middleware orm, which they may well be because of their model, they are most likely using an EAV[0] schema which, although flexible for writes, is horrendous for reads. The join plus pivot is a disaster on virtually any relational system.

[0]https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...


Hmm, that does seem probable. In fact that could make the SQL even more efficient as you'd only need a combined index on the 'prop' and 'value' columns, rather than N arbitrary combinations of indexes that may or may not be used.

Edit: Had some bad attempt at writing this query but it's rather late and it made no sense.


You would need to have a new join for each new property

    SELECT DISTINCT loyaltyMemberID
    from members as m
    INNER JOIN properties as p1 on m.id = p1.user_id
    INNER JOIN properties as p2 on m.id = p2.user_id
    INNER JOIN properties as p3 on m.id = p3.user_id
    AND (p1.prop = 'gender' AND p1.value = x)
    AND ((p2.prop = 'age' AND p2.value = y) OR (p3.prop = 'censor' AND p3.value = z))


There's no need for the extra joins, you can just do the one join and then filter everything in the WHERE clause:

  SELECT DISTINCT loyaltyMemberID
  from members as m
  INNER JOIN properties as p on m.id=p.user_id
  WHERE (p.prop='name' AND p.value = value) AND

  ...etc.


But how would you do exclusions with your approach?


I'm not sure what you're asking - could you give me an example of what you're envisioning that couldn't be satisfied with a combination of Boolean expressions in the WHERE clause ?


With this query approach how do you find people that have a prop1 but do not have a prop2?

If you get records back with prop1 then you have to remove those records from results based on another record.

There are multiple ways to accomplish this but it can't be done with a single join and simple where clause.


Sorry, I missed your reply. Yes, you are correct, in that case you would need to use an EXCEPT, sub-query, derived table, etc.
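
For example, with a correlated sub-query (names follow the EAV example upthread):

    SELECT DISTINCT m.loyaltyMemberID
    FROM members AS m
    INNER JOIN properties AS p1
        ON m.id = p1.user_id AND p1.prop = 'prop1'
    WHERE NOT EXISTS (
        SELECT 1
        FROM properties AS p2
        WHERE p2.user_id = m.id AND p2.prop = 'prop2'
    );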


Especially with partial indexes, I still feel like this structure will be significantly faster than the original UNION ALL ... GROUP BY on calculated fields.

And they mention in the post that most queries don't use that many fields.


Confession time: in my first job, I built something like this (and it worked pretty well in the sense that it was very flexible), but then I also had to do a 'select' based on IIRC 14 of such properties. I don't really recall the exact problem I had at first, but my solution was to create two separate (temporary) tables, select 7 of the properties into one and 7 into the other, run a select on both of those tables, then join the results in code. This ran at an acceptable speed (I must have done something so that adding criteria made the run time increase non-linearly - doing it on 14 was orders of magnitude slower than on 7).

Then years later I ran into the guy who had to do some work on it after I left that company. I must have scarred him pretty badly, because he remembered it enough to bring it up as pretty much the first topic after the obligatory 'hey so what are you up to nowadays'. When I think back about it now, it was a cringey solution - then again, this was at a company where nobody had ever heard of a 'database index' (or if they did, never mentioned or implemented them).


Christ do people do this??


This is a pretty popular pattern known as Entity-Attribute-Value [0]. It's used by many products where a) data model needs to be very flexible and allow new attributes without schema changes, or b) a typical entity has a large number of possible attributes that may or may not be set for all entities ("sparse" attributes). WordPress uses this to store post metadata, Magento uses this to store product attributes and most of other data, Drupal uses a variation of this to store all the posts and other content you create… I have too much experience with this model to be surprised.

[0]: https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...


Sounds like a great case for Postgres hstore (like OpenStreetMap does it)?


I mean, not really; there's no magic in Postgres' implementation of the pattern. They even spell it out for you:

"can be useful in various scenarios, such as rows with many attributes that are rarely examined"

They are querying these quite heavily, they aren't just random attributes they need to retrieve.


You're right, I think most OSM databases extract the most commonly used keys into dedicated, potentially indexed columns.


hstore querying is quite slow (and GIN indexes on hstores are pretty massive). I'd always go jsonb over hstores these days, but jsonb has the same indexing problem. JSON has a well-optimized/spec compliant serializer/deserializer in every language you can imagine as a baseline, whereas hstore does not.


i think that’s basically what it was designed for :p


Is `value` a string here?


I once implemented a variation of this where there was a column called 'data_type', the valid values were the various SQL data types, and in code I would do a switch() on the (string) value of that column and then cast the contents of the 'value' column based on that... Ah the folly of youth...


Depends

You could have

    property_id, user_id, value(string)
    1 (assume age), 1213,    60
    2 (gender),     1213,    female
or

    property_id, user_id, value_str, value_int
    1 (assume age), 1213,    null, 60
    2 (gender),     1213,    female, null
or have a mapping in the application to get the type of the property. Plenty of ways to handle it.


> This gives them the freedom to add more properties to the user without always having to add a column to the users table. When querying the database you'll have to do unions or joins.

I think you're right. Oh ... my ... god ...

I wish I could say this is the worst example of a database schema I've ever seen, but it isn't.

Technology cycle:

X gets invented -> idiots abuse it -> X "is bad" -> Y (strictly worse than X) is "so much better" -> idiots abuse it -> Y "is bad" -> ...


EAV is a valid pattern if the keys are dynamic. For example, in a CRM, the user might want to store properties of their clients that you haven't thought of. In our platform, we use different schemas for each company, so we can actually do an ADD COLUMN ..., but you don't want to do that if you have a multi-tenant DB :)


It makes sense if each client gets to add their own fields.


Using the built-in type for that purpose is going to work way better. This depends on the DB you're using, but it is generally referred to as a "JSON" field (why? Because they're a response to MongoDB, which calls it that). Oracle and SQL Server have very similar things.

In Mysql, it is JSON data type [1], in Postgres JSON/JSONB [2].

Creating indexes across them is doable, through a workaround (generated columns in MySQL, computed columns in SQL Server, or expression indexes in Postgres).
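
A MySQL 5.7 sketch of that workaround (table and column names are assumptions):

    -- Expose one JSON property as a generated column, then index it.
    ALTER TABLE members
        ADD COLUMN age INT UNSIGNED
            GENERATED ALWAYS AS (CAST(JSON_UNQUOTE(JSON_EXTRACT(properties, '$.age')) AS UNSIGNED)) VIRTUAL;

    CREATE INDEX idx_members_age ON members (age);

    SELECT loyaltyMemberID FROM members WHERE age = 60;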

And, frankly, in the worst case for indexing, these databases still perform comparably to key-value stores in speed (especially SQLite).

[1] https://dev.mysql.com/doc/refman/5.7/en/json.html#json-paths

[2] https://www.postgresql.org/docs/9.4/static/datatype-json.htm...


They may be generally a better option, but they have their own disadvantages. For example, JSONB fields in Postgres won't deduplicate keys, so if you have large keys, your table size will increase quite a bit (which also makes it harder to keep in memory).

Using an EAV model, you can have a "keys (id, key_name)" table, and then only use the IDs in the values table, reducing that waste.

By the way, you don't need views for indexing on JSONB fields, it's supported out of the box in PG.


At least in MySQL, json field types are rather new. MySQL 5.7 is not yet an option with AWS Aurora or Google cloud SQL even.

And I don't think you will necessarily get better performance with JSON fields vs an EAV model. Yes, you can index JSON fields by creating virtual (generated) columns, but that requires that you know the field ahead of time. With an EAV model, you can have your values table indexed and then join.

But I am excited to start using the json field types. In many cases, it will really simplify things over the traditional EAV stuff.


The UI they showed in the blog post looks like it has enough data available to generate that kind of query, too. Like, the ands/ors/nots are right there on the page, and the filters are already there getting translated to SQL as well; just mash them together and you get the same "algebra of sets" stuff right in the WHERE clause.

As it stands the SQL query is quite silly. It gets a list of every user ID that is included by each filter and compares which ones are in the filters you want and not the filters you don't want. Much better is to pass the filters into SQL, let it figure out which users match the filters you want and not the filters you don't, and just use that result.


Most Enterprise CRM like solutions store tables of customer-property-value instead of using one column per property.

This leads to lots of unions in advanced queries, and makes filtering harder. Some databases even calculate column block statistics to optimize these queries by doing less IO even for seeming table scans.

Why not one table with all customers and one column per property? There are a few reasons, having to do with anything from MySQL sucking at schema alters for really big tables, to expectations of Enterprise customers.


Can be tricky to use EAV data models with traditional ORMs.. this type of functionality can often be slow or require plugins, if implemented at all:

https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...


Wondering how well it would perform over EAV if they shoved the data into PostgreSQL with JSONB.


I think that JSONB may not be as performant as EAV. You don't need joins or unions, but if you are dealing with dynamic fields, you need to know the fields ahead of time and set indexes for them in JSONB. For EAV you just have to index your values table.


PostgreSQL can use multiple indexes so you don't need to worry about needing to know about the fields ahead of time.

Likewise you can get away with a full document GIN index.

I played around with some basic report stuff at work last year with the EAV data on my local machine; the report took ~7 seconds to run. I shoved the same data into PostgreSQL as JSONB, indexed it just as a full doc because I was lazy, and the same report took ~80ms.

Obviously this isn't 'proof'; my dataset was only 1.5m by 15m records. But with my limited knowledge I do believe it would perform better. I don't know how much better... but I think better...


I’m not the author of the post. Your comment assumes a well known schema. My understanding from the post is that this solution can join and filter on “custom” datasets of arbitrary schema that each of their customers upload.


I've never played with this, but couldn't you create a table based on the dataset that the customers upload, and let your database engine handle filtering those queries? From the looks of it, even if they were doing full table scans for each query, it'd still be faster than all those unions...


I think the point is that they don't know in advance what the query is, and they didn't think they had a good solution for optimizing all user-entered variants across the range of possible groupings, so they wanted a solution that was easier to optimize globally.

The general form of this is:

Select loyaltyMemberID from table WHERE (V1_1 = x_1 OR ... OR V1_n = x_n) AND (V2_1 = x_2_1 OR V2_2 = x_2_2 OR ... OR V2_n = x_2_n) AND ... AND (Vn_1 = x_n_1 OR ... OR Vn_n = x_n_n) (some of these n's should actually be m_i's but I was lazy)

There may be some ability to optimize this in a number of ways but optimizing one example is not optimizing the general form. I can easily see how technology change could be a cleaner solution.


> There may be some ability to optimize this in a number of ways but optimizing one example is not optimizing the general form.

I totally get that, but isn't that the point of the query optimizer within the database itself? Why are you trying to outwit it? It should select the right indexes, provided the columns are indexed, and "do the right thing(tm)". It might take a bit of cajoling but they seem pretty good at this. Postgres collects statistics about the distribution of values themselves within the table to guide its choice of index, so in theory it could rewrite the boolean logic to use a specific index if it's sure that it will eliminate a higher % of the rows than another plan.

In any case, it seems the SQL they posted is a bit off. Why nest each individual filter as a UNION? If you wanted to go down the UNION route couldn't you do each individual group as a UNION, with standard WHERE filters?
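
One reading of that, using the gender/age/censor example from the top of the thread (column names assumed):

    -- Each group is a plain SELECT with its filters OR'd in the WHERE clause;
    -- groups are then ANDed together with INTERSECT instead of UNION + GROUP BY.
    SELECT loyaltyMemberID FROM members WHERE gender = x
    INTERSECT
    SELECT loyaltyMemberID FROM members WHERE age = y OR censor = z;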


I blame ORMs. If you don't understand SQL and how databases work, you should not be allowed to use an ORM. If you know how databases work, you will in many cases not use an ORM except for the most simple CRUD operations.


I know how databases work and I use a (good) ORM almost exclusively for app work (SQL for exploration usually though). The benefits are huge.


I use both. I know the Django ORM and I know where its limits are. For getting data in, it saves a load of time. When the query gets complex it starts adding time, or making things impossible (multiple join conditions weren't possible unless it's been updated in the most recent version).


OK, the part about not using ORM if you know SQL is a bit of an exaggeration. At least when you know SQL you know when to use an ORM and when to not use it. If all you know is ORM then you will always use it, and ORM seems to lead to many developers not learning SQL


I think it depends... Who are we talking about here? Juniors, even intermediates, in my experience, haven't had enough time on the job to have learned enough to be writing raw SQL statements or query objects unless they're actively punching up on a daily basis. I am unfortunately talking from experience here.

What I am saying is, I really do not want a situation on my hands where the juniors that I work with, or most of the intermediates, and even a few of the seniors and leads, are writing raw SQL or query objects. Most of these folks have n years of experience in web and desktop application development and couldn't give you a passable answer to simple questions like, "What's a database index?" I know this isn't isolated to my current employer, or former employers, and I've seen it in other organizations where I've done some consulting on the side, and all of these folks I'm talking about here have largely worked elsewhere in the past, too. And this in itself leads to other third-order effects, like the "SQL wizards" who get asked all of the "tough" SQL / database questions.

I want to stress that I understand the point that you're making, and I do agree with it, and of course, so do many (all?) ORM authors themselves, but I think the advice is wrong and is prone to take you to a much worse situation. I think we have an obligation as people who do grok SQL and databases to gently introduce our less experienced co-workers to the idea that ORMs are not a panacea for all database interactions, but until the companies we work for have enough of an incentive to give us that sort of time and empowerment then I, for one, am going to recommend ORMs for everyone for everything unless they really, absolutely, demonstrably know what they're doing.


I do also understand your points and think we agree on most of them. I think that if a "developer" can't write SQL, I would not trust that one to set up the ORM correctly either. For basic usage, sure, they will get it to work and all is good. But when you want to join tables or run aggregate functions, the same people who write bad SQL could also write bad ORM code with N+1 queries. ORM has its place and optimized, beautiful SQL has its place; a craftsman knows which tool to use where and when to ask for help.

One of the problems, in my opinion, is that SQL isn't "cool" or hip and is seen by many as not important to learn, while the new fancy JavaScript-based language or framework which nobody uses and that will be replaced next week is much more important to learn.

Btw, get off my lawn :) /end old man rant


Yes, exactly!

> could also write bad ORM code with N+1 queries

Oh, they absolutely do, and when we're lucky they actually catch them on their own before they get to code review. Some folks reach for tools like Bullet [0] and that's great, but unfortunately, sometimes they treat that tooling like the Holy Gospel. They develop an over-reliance on them, as if those tools exist to offload critical thinking. Drives me crazy... in my experience, it's been hard to combat this type of thing, too. Between the pace of "agile" and the calculus of paying down technical debt versus mentoring versus progress, I don't really know why, but I haven't had a lot of long-term luck.

> One of the problems, in my opinion, is that SQL isn't "cool" or hip and by many seen as not important to learn.

I think you're really right about that. I happen to like writing SQL quite a bit and I take a little bit of pride in that I kind of sort of actually understand a little about what is going on in the database and even then I neglect that skill. I picked up copies of both "SQL Anti-Patterns" and "SQL Performance Explained" based on recommendations from this thread and am eager to get in to them this weekend. Still lots to learn... And, I have some SQL problems that I can see coming up over the horizon today and I hope this gives me the edge I need to start grappling with them sooner rather than later.

[0]: https://github.com/flyerhzm/bullet


> Edit: I've just read the query in the post again and I really can't understand why you would write it like that. Am I missing something here?

Oh I've seen this happen a lot. Somewhere along the line, often from a DBA, it is decided that sql in an app is evil and that everything must be in a stored proc. Then instead of some simple string concatenation you have to jump through hoops like this.


One of the things I've done to harden an app is to revoke all permissions other than EXEC on a particular schema, then make sure everything is done via parameterised stored procedures - no chance of SQL injection then.
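
Assuming SQL Server, the setup is roughly this (schema and role names are made up):

    -- The application role gets EXECUTE on the stored-procedure schema and nothing else.
    CREATE ROLE app_exec;
    GRANT EXECUTE ON SCHEMA::app TO app_exec;
    ALTER ROLE app_exec ADD MEMBER app_user;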


But that creates situations like this where you have to jump through hoops to solve simple problems. You solved one potential issue at the cost of creating many more.

> no chance of SQL injection then.

You know you can have sql injection attacks inside stored procedures? If you think stored procedures are a panacea then you don't understand the problem you're solving.


Perhaps I should have said "greatly reduced the risk of".

I'm not using something like Entity Framework and the CRUD apps I mostly wrote at work were well suited to a few simple sprocs.

The time it takes to write an ALTER script to change something pales in comparison to the two week change control process anyway...



I am building a rule engine quite similar to this. An AST parser runs over all the Python DSLs and generates the list of tables for INNER JOINs, then SELECTs all the tables' data out with filters in one pass, then runs all results through the Python code.

It's quite fun.


Age/gender are probably simple, but I'm guessing censor is probably derived from a transaction table. If they are letting users select arbitrary time ranges to filter the transactions, then you can't store a precomputed censor value for each user. But seeing that they are talking about caching, maybe a lot of stuff can be precomputed.


Damn, you're not kidding. I wonder why they needed more than one query here; plus, UNION is slowwwwwwww. They never mention how frequently this query needs to run either, only the amounts of data involved in some aspects of this table.


I was going to ask how to optimize the SQL in your post as it seems like the obvious/naive implementation of the query. If you're missing something, so am I. I can only imagine it was built up over time from googling specific terms that already missed the point, e.g. "rds mySQL union query"



