I like it, it's readable; unlike some SQL alternatives I've seen, it doesn't make me feel like I'm dumb and don't understand what a query even is.
I can't decide if it would be better or worse if it stuck more closely to SQL keywords. You use "from" and "select", but not "where", "order by", "group by". There's some danger of it being in an uncanny valley of SQLish, but I'm pretty sure I'd prefer just using those terms verbatim (including the space in "order by"... that style is less common in modern languages but it's not really that much harder to parse).
I'd like to see more examples of joins and composing SQL. Does this language make it easier to make more general SQL queries? Can I take two queries and squash them together in a reliable way? I feel like I end up with a lot of theme and variation in my queries, often involving optional filters.
I might even like a notion of encapsulation that could help this query language when it's embedded in other languages. Like if I could say, in the language itself, that a query has certain unbound variables (and not just ? or other placeholders). This language seems like it would be better for generating than SQL, and sometimes generation is just necessary (like in any application that supports data exploration), but for most common cases I'd hope to avoid that. Defining inputs and then making whole filter sections or other statements conditional on those inputs would help here.
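For comparison, the usual SQL workaround for an "optional filter" looks something like the following sketch (`:country` is just an illustrative bind-parameter placeholder, not any particular driver's syntax):

    -- when :country is not supplied (NULL), the filter collapses to a no-op
    SELECT id, name
    FROM employees
    WHERE (:country IS NULL OR country = :country);

Declaring that input in the query language itself, and making the whole filter clause conditional on it, is the kind of encapsulation I mean.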
Yup, I like a lot of things about the way this looks. In particular, I like how friendly this looks to be for things like auto complete (pretty annoying to need to practically type the entire sql query only to go back and fix up the columns in order to get autocomplete to work).
Specific things I'd like to see:
How do you handle column ambiguity? In the examples, they show a join of positions to employees on employee_id == id. But what happens when you have two columns with the same name that you are joining on (like employee_id to employee_id in some mapping table)?
Subqueries are pretty important in what I do, so what do those look like (perhaps covered by the "thinking about CTEs" section)?
How about opportunities for optimization hints? In T-SQL you can hint at which index the optimizer should prefer for a specific query.
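Roughly like this in T-SQL (IX_employees_country is a made-up index name, purely for illustration):

    SELECT *
    FROM employees WITH (INDEX(IX_employees_country))  -- table hint steering the optimizer to one index
    WHERE country = 'USA';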
Common SQL patterns would also be interesting. Like, how would you do keyset pagination?
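For reference, the usual SQL shape of keyset (seek) pagination, which I'd want to see expressed in the new language (Postgres-style row comparison; the :prev_* names are placeholder bind parameters):

    -- fetch the next page by seeking past the last (last_name, id) already shown,
    -- instead of using OFFSET
    SELECT id, last_name
    FROM employees
    WHERE (last_name, id) > (:prev_last_name, :prev_id)
    ORDER BY last_name, id
    LIMIT 20;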
Edit: Also, I'd like a discussion about null. SQL null handling rules are terrible. I understand them, I work with them, but at the same time, they are so different from other languages' concept of "null" that they are easy to trip over.
> SQL null handling rules are terrible. I understand them, I work with them, but at the same time, they are so different from other languages' concept of "null" that they are easy to trip over.
Could you elaborate? I'm really only versed in the MySQL accent, but I don't find anything unusual or unexpected about NULLs in MySQL. If there are any pitfalls that I should be aware of, I'd love to know about them here before my users start complaining about bugs.
In SQL, NULL does not mean "no value"; it means "unknown value". The existence of such a value introduces three-valued logic, because the expression "NULL = <anything>" is neither true nor false. This makes queries harder to understand without any benefit.
Additionally "unknown value" concept is not used consistently. Things like DISTINCT, or UNIQUE indexes (in some databases) treat NULL as single "no value".
The value in your if is not false, it's NULL. NULL=NULL behaves exactly like NULL=42, the value is NULL. Which is what the parent was trying to explain.
Not the person you replied to, but I don't think by “from other languages” he means other dialects of SQL.
Instead, I think other languages away from the database are being referred to - in many of those NULL is treated like any other value², for instance in Javascript¹ null==null is true and null!=null is false, and due to type coercion null on its own is “falsey”. Personally I have no problem with SQL's handling of NULL with one exception, and find that other languages treating it as a single value rather than an unknown feels odd.
The one thing that I have occasionally tripped over with NULL in SQL is the effect of “<val-or-var> NOT IN (<set>)” when NULL is one of the entries in <set> - it makes sense when you think about it because the IN operator can only return true or false and it can't definitively say the searched for value isn't equal to the unknown one(s)³ but this doesn't seem intuitive.
Some SQL dialects do handle NULL a little differently, more like languages such as JS. MS SQL Server can be made to do so with SET ANSI_NULLS OFF, which forces its ancient, non-standards-compliant behaviour⁴.
[1] quick & easy to test in your browser's console
[2] well, technically in JS I think null is specifically a null object reference, that being one of the differences between null and undefined
[3] more concretely, “var NOT IN (1, 2, NULL)” being equivalent to “var<>1 AND var<>2 AND var<>NULL”, which becomes “true AND true AND NULL”, which is NULL, since AND only yields TRUE when both operands are definitely TRUE.
[4] though note that this option is officially deprecated, as of at least 2016, and might be removed or just ignored in future versions
I think supporting variables and functions already solves most of my composability gripes with SQL.
Another problem that I have with composing SQL is that large queries quickly become unreadable, and error messages are also often not terribly helpful. I think having a more expressive type system would help with the error messages.
Do you have any plans on adding a type system to PRQL?
> Can I take two queries and squash them together in a reliable way? I feel like I end up with a lot of theme and variation in my queries, often involving optional filters.
That's essentially what SQL views do. Each view is a query and then you can treat it like a table and filter/join on it.
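For instance (a sketch reusing the thread's employees example):

    -- package the reusable part of the query as a view...
    CREATE VIEW usa_employees AS
    SELECT *
    FROM employees
    WHERE country = 'USA';

    -- ...then treat it like a table: filter, join, or aggregate on top of it
    SELECT title, COUNT(*) AS headcount
    FROM usa_employees
    GROUP BY title;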
Of course, then the problem becomes whether or not the query planner can see through the view to the underlying tables to optimize correctly.
> I can't decide if it would be better or worse if it stuck more closely to SQL keywords. You use "from" and "select", but not "where", "order by", "group by". There's some danger of it being in an uncanny valley of SQLish, but I'm pretty sure I'd prefer just using those terms verbatim (including the space in "order by"... that style is less common in modern languages but it's not really that much harder to parse)
I agree 100% here. As a SQL veteran, it would make the transition a lot easier if you used common SQL keywords like group by, order by, limit, etc.
e.g.
from employees
where country = "USA"
derive [
gross_salary: salary + payroll_tax,
gross_cost: gross_salary + benefits_cost
]
where gross_cost > 0
group by:[title, country] [
average salary,
sum salary,
average gross_salary,
sum gross_salary,
average gross_cost,
sum_gross_cost: sum gross_cost,
count,
]
order by:sum_gross_cost
where count > 200
limit 20
I like the flow direction compared to standard SQL. SQL is supposed to read like a sentence, I suppose, but I have many times looked at it and really wanted things to be in a more logical order.
My main suggestion would be to be a bit less terse and introduce a bit firmer formatting. I'm not a huge fan of the term "split" and feel like jazzing that up to "split over" or even just reviving "group by" would improve readability. Additionally, the aliasing could use work; I'd suggest reversing the assignment to something closer to `use salary + payroll_tax as gross_salary`. In terms of firm formatting, unless I'm missing something there isn't any reason to allow a filter statement before any aliases - so you could force two fixed positions for filter clauses, which would make it always legal to reference aliases in filters.
On the brief topic of form vs. flexibility: SQL is a thing that, when complex, is written by many people over the course of its lifetime - removing the ability to make bad decisions is better than enabling the ability to write simple things even simpler - those silly do-nothing queries like `SELECT * FROM customer WHERE deleted='f'` are written once[1] in a moment's time and never inspected again. The complex queries are what you want to optimize for.
1. If they even are - with ORMs available a lot of those dead simple queries just end up being done through an ORM.
> On the brief topic of form vs. flexibility. SQL is a thing that, when complex, is written by many people over the course of its lifetime - removing the ability to make bad decisions is better than enabling the ability to write simple things even simpler
Hallelujah! But, to your footnote, this is a major reason why I despise ORMs. In my mind they make writing simple code slightly easier, but they make complicated SQL statements - especially when you get some weird issue under load and you're trying to debug why your DB is falling over - a ton more difficult, and you spend so much time just battling your ORM.
On ORMs, the best use I see of them is for “transparent” queries that you don’t define.
Like fetching a record by id, or a single record and all of its related properties. Or a list of all the record in a table matching a simple filter.
That’s 98% of what we do against the DB, and I’m all for having it basically invisible.
Then let’s just bypass the ORM altogether the minute we think about joining or grouping things together. There are libs in most languages that help just sanitize queries, so it’s not really difficult.
With a middle ground like a micro ORM those transparent queries are barely visible anyway, literally a line or two of embedded sql strings. Especially micro ORMs that can handle dynamic filters. They're generally write once and only get looked at again when modifications are necessary, so they're not worth "optimizing" by adding the complexity of an ORM.
A common pattern seems to be over engineering these simple scenarios though. Someone decides that embedded sql is evil and needs to be extracted out of normal code, often to stored procs. Then these simple queries have enough friction that an ORM starts to look good, then you end up with an ORM generating simple queries dynamically in the same place that used to have a simple embedded string.
The fundamental problem with an ORM is that you're using a lower level language to compile to a higher level language. This is completely backwards. It's like having a framework in your assembly to generate Java code for you, so you don't have to bother with all that "weird" Java, and can just stay in your comfort zone.
Isn’t it more important that the query you write with the ORM is readable than the underlying SQL it spits out? Using an ORM I can get reusable parts of a query, while writing complex joins, I’m not sure why skipping that part is good?
In my experience, a lot of very semantically reasonable and readable code ends up with very penalizing SQL at the end, and it's a real challenge to then rewrite the whole thing into decent queries.
There can be parts of an app where a very bad query here and there is not important, but more often than not it creeps into key parts of the user experience, and it becomes very hard to untangle once it is important enough to thoroughly optimize, but also complex enough that the existing tests only cover a tiny portion of the important use cases (if you're reusing a bunch of query bits, you're probably working with a wide combination of inputs/outputs). I've seen literally weeks spent on trying to optimize ORM chained subqueries.
My experience is that the ORMs I've used most (LINQ and Ruby's Sequel) can produce far more efficient SQL than a human can, and if not, you change the code, just as you would have to if you wrote a slow SQL query.
Of course, I've not seen every query in existence so it's more than possible you've seen bad SQL from an ORM, but the untangling part would again fall to those skilled in the language of the ORM - unless the ORM can't produce efficient SQL in a particular case. And just as it would if the query was originally written in SQL, you'd need someone skilled in SQL to untangle that.
What would that case (where an ORM cannot produce efficient SQL) look like?
I have some serious doubts that an ORM can tune queries as well as a human, due to the fact that the ORM lacks one key piece of information that both me (and the database planning a query) can leverage - table statistics. An ORM can produce a query that will behave well in the best general circumstances, but as soon as you get into topics like subquery performance fencing, ORMs simply have no ability to compute optimality on the fly. One specific example I've seen is where multiple paths exist through the database to transit from one fact to another, with one path being more strictly optimal and the other path having more associations that may be needed - this can affect the join strategies you want to use, so if your ORM is anything more than "I'll essentially tell you the SQL but in a weird syntax" then there's a good chance it'll choose the wrong path.
But you write the code via the ORM, so I don't see the difference between you writing the SQL with knowledge of the table statistics and you writing the code via the ORM with knowledge of the table statistics.
I haven't got the chance to try Sequel; on the Ruby side I played more with plain ActiveRecord and querying layers like ransack (my predecessor on the job loved abstraction layers).
In general ORM queries become ugly at three to four levels of joins and/or excluding under non-trivial conditions (e.g. finding users that have not participated in a specific set of events). They will spit out something that works, but it will take a few orders of magnitude more time than an optimized query.
As you say, there is the option to play jenga with the ORM code to hit the right combination that produces a better output. But that feels like teaching a toddler to solve a puzzle that you already solved and are keeping the cheat sheet in your pocket. I personally don't see the beauty of it and would prefer to directly use the right SQL and call it a day.
On people skilled in SQL, you should have a few onboard anyway if you're doing more than basic CRUD on the DB, and it's easier to find than ORM gurus IMHO.
> I personally don't see the beauty of it and would prefer to directly use the right SQL and call it a day.
I think that's fair enough, there are enough ways to do things now that it should be possible to accommodate both.
> On people skilled in SQL, you should have a few onboard anyway if you're doing more than basic CRUD on the DB, and it's easier to find than ORM gurus IMHO.
I agree but I'm not sure there are more SQL gurus than those used to ORMs nowadays. Lately I've favoured using SQL but even 15 years ago most devs I knew couldn't use it well, I can't see devs used to Rails et al having the chops for it, sadly. What was once convenient easily becomes one's master.
> I haven't got the chance to try Sequel
If you get the chance, I think it's worth it. It's easy to drop into plain SQL without dumping the ORM, and I've never had a problem with the stuff it generates. It's a pity ActiveRecord gets all the love instead.
I'm not so sure it's always best to optimize for absolute performance; how you should code a solution to a specific problem is always dependent on its context IMO.
I work on a lot of smaller IT projects for SMEs: internal tools and platforms that are made on a small budget and thus end up having a tight deadline in order to not go over budget.
The vast majority of these projects are versions of CRUD apps for this and that, we tend to value readable code over performant hard-to-read code as it makes code review faster.
I think it’s all nice and fine if performance has no impact (which can be the case). For instance if your client can wait 30s for a report, no big deal.
Things go down when a query that used to take 4s now takes 30s as the product has taken off, handles 10x more data, and it’s not one user but a few hundreds having their request queued.
You won't have the luxury of not re-writing that part in (probably) hard-to-read performant code. It can be a huge enough effort to blow away your deadlines and budget, and to sour the relationship pretty hard if your team struggles with something they aren't used to doing at all.
In my experience neither the output or the input is readable when using an ORM to generate complex queries. For simple queries they're great. Writing raw SQL also makes it much easier to jump out into SQL-specific tooling to debug a query, then copy it straight back into the code once you're done.
> Isn’t it more important that the query you write with the ORM is readable than the underlying SQL it spits out?
I would look at the issue from a slightly different angle. Performance issues aside, I personally prefer either the ORM or the SQL depending on which is easier for the guy maintaining it to understand. Getting a row from the database and transforming it into an object? ORM. Generating a report on historical data across half a dozen tables? SQL.
I like the flow direction specifically for intellisense/autocomplete. I'm sure it would be easier to provide hints when the table name is known immediately.
FWIW the separate `group_by()` is one of my greatest design regrets with dplyr — I wish I had made `by` a parameter of `summarise()`, `mutate()`, `filter()` etc.
Is it too deeply entrenched to change? The number of times I have had a data.frame grouped when it wasn't supposed to be, I can count on my fingers. But the hours that I spent trying to figure it out must amount to a paycheck or two.
There's a lot of dplyr code out there, and a lot of people who know most every part of the tidyverse by heart; making breaking changes like this so far into a framework's life would cause a lot of unnecessary work re-coding old code, as well as requiring people to re-learn syntax.
IMO for such a small adjustment the benefits don't outweigh the costs.
Yeah, that's my current position. It's possible that we might be able to add it in optionally (by adding a new `.by` argument to summarise and friends), but just the analysis to determine how it would affect existing code is a lot of work.
Now this is actually nice, unlike the other suggestion posted today[1].
Maybe I'm just too used to non-standard extensions of our database but the SQL example could, at least for our db, be rewritten as
SELECT TOP 20
title,
country,
AVG(salary) AS average_salary,
SUM(salary) AS sum_salary,
AVG(gross_salary) AS average_gross_salary,
SUM(gross_salary) AS sum_gross_salary,
AVG(gross_cost) AS average_gross_cost,
SUM(gross_cost) AS sum_gross_cost,
COUNT(*) as count
FROM (
SELECT
title,
country,
salary,
(salary + payroll_tax) AS gross_salary,
(salary + payroll_tax + healthcare_cost) AS gross_cost
FROM employees
WHERE country = 'USA'
) emp
WHERE gross_cost > 0
GROUP BY title, country
HAVING count > 200
ORDER BY sum_gross_cost
This cuts down the repetition a lot, and can also help the optimizer in certain cases. Could do another nesting to get rid of the HAVING if needed.
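i.e. something along these lines (trimmed to one aggregate to keep it short):

    SELECT *
    FROM (
        SELECT
            title,
            country,
            SUM(gross_cost) AS sum_gross_cost,
            COUNT(*) AS count
        FROM (
            SELECT
                title,
                country,
                (salary + payroll_tax + healthcare_cost) AS gross_cost
            FROM employees
            WHERE country = 'USA'
        ) emp
        WHERE gross_cost > 0
        GROUP BY title, country
    ) agg
    WHERE count > 200
    ORDER BY sum_gross_cost;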
Still, think the PRQL looks very nice, especially with a "let" keyword as mentioned in another thread here.
with usa_employees as (
SELECT
title,
country,
salary,
(salary + payroll_tax) AS gross_salary,
(salary + payroll_tax + healthcare_cost) AS gross_cost
FROM employees
WHERE country = 'USA'
AND (salary + payroll_tax + healthcare_cost) > 0
)
select title,
country,
AVG(salary) AS average_salary,
SUM(salary) AS sum_salary,
AVG(gross_salary) AS average_gross_salary,
SUM(gross_salary) AS sum_gross_salary,
AVG(gross_cost) AS average_gross_cost,
SUM(gross_cost) AS sum_gross_cost,
COUNT(*) as emp_count
from usa_employees
group by title, country
having count(*) > 200
order by sum_gross_cost
limit 3
Readability is pretty similar to prql. It would really help in SQL if you could refer to column aliases so you don't have to repeat the expression.
To me those nested sub sub sub SQL queries come from a similar place as beginner coders who tend to make nested IF statements - a lack of experience with the language.
For very complicated stuff, SQL does become very hard to read compared to e.g. tidyverse + targets in R.
In some cases for removing repeating (intermediate) calculations, I generally find it easier to use a lateral join (in postgres), like
select
title,
country,
avg(salary) as average_salary,
sum(salary) as sum_salary,
avg(gross_salary) as average_gross_salary,
sum(gross_salary) as sum_gross_salary,
avg(gross_cost) as average_gross_cost,
sum(gross_cost) as sum_gross_cost,
count(*) as emp_count
from
employees,
lateral ( select
(salary + payroll_tax) as gross_salary,
(salary + payroll_tax + healthcare_cost) as gross_cost
) employee_ext
where
country = 'USA'
and gross_cost > 0
group by title, country
having count(*) > 200
order by sum_gross_cost
limit 3;
So now we have easily come up with three different ways of rewriting the query to avoid that duplication (which obviously was not a problem at all to begin with): subquery, CTE and lateral join. And there are also several more well-known ways (views, custom functions, computed columns etc), so is the whole premise for even inventing a "better" language than SQL then false? Or what am I missing?
It's also weird how people always argue for immutability and eliminating local state, when using procedural languages, but as soon as they switch to SQL, that actually works like this, they immediately want to introduce mutability and local state.
> so the whole premise now for even inventing a “better” language than SQL is then false?
I don't think anyone is using the above examples to try to invalidate PRQL, just suggesting the baseline for comparisons should account for all constructs available in the SQL standards and common implementations thereof.
> Or what am I missing.
The statement “I can do X better than <SQL example> with <something else>” does not properly show the benefit of <something else> if “I can do X better than <SQL example> with <another SQL example>” is also true (assuming <another SQL example> is actually agreed to be better, not for instance convoluted/confusing/long-winded/other so just replacing some problems with others).
If there's multiple ways to do the same thing that's usually a BAD thing in terms of language design. Especially if some approaches are just newbie traps that experts learn to avoid, or if deciding the best method is a really subtle context-dependent decision. The ideal design is that the language encourages the one obviously "good" way to do it.
I'm not aware of any general purpose programming language that doesn't have multiple ways to achieve a specific goal. Can you give an example of language with good design?
Column aliases would have saved me hundreds of hours over the course of my career. Sorely missing from standard SQL, and would make the need for PRQL less acute.
Snowflake lets you refer to column aliases, and it's great!
There's the slight issue of shadowing of table column names, which they resolve by preferring columns to aliases if both are named the same. So sometimes my aliases end up prefixed with underscores, but that's not a big deal.
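Something like this works there, as I understand it (reusing the thread's example columns):

    SELECT
        salary + payroll_tax          AS gross_salary,
        gross_salary + benefits_cost  AS gross_cost   -- later items can reuse the earlier alias
    FROM employees;
    -- caveat: if the table also had a real column named gross_salary,
    -- Snowflake would resolve the reference to the column, not the alias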
The trade-off is that a schema change (adding a column) unrelated to your query can modify its behavior.
Favoring aliases over columns instead has the potential to introduce irresolvable ambiguities as you can’t “qualify” a column alias with a SELECT list or subquery ID the way you can qualify a column by its table/view alias.
Not all database systems can optimize queries well over CTE boundaries. I believe this is still true for PostgreSQL (edit: no longer true, see below -- it was true a few years ago). So there's a potential performance hit for (the otherwise excellent advice of) writing with CTEs.
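For what it's worth, recent PostgreSQL versions let you control this per CTE (Postgres 12+, if I remember correctly):

    -- NOT MATERIALIZED asks the planner to inline the CTE into the outer query;
    -- MATERIALIZED forces the old optimization-fence behaviour
    WITH usa_employees AS NOT MATERIALIZED (
        SELECT * FROM employees WHERE country = 'USA'
    )
    SELECT title, COUNT(*)
    FROM usa_employees
    GROUP BY title;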
In Microsoft SQL cross apply can be used for this in even more situations and with less repetition:
select top(20)
title,
country,
...
avg(gross_salary) as average_gross_salary,
...
from employees
cross apply ( select
gross_salary = employees.salary + employees.payroll_tax, -- or "as .."
gross_cost = ...
) v -- some name required but don't need to use it if column names are unique
where ...
Very cool! A couple questions/suggestions off the top of my head:
1. Did you consider using a keyword like `let` for column declarations, e.g. `let gross_salary = salary + payroll_tax` instead of just `gross_salary = salary + payroll_tax`? It's nice to be able to scan for keywords along the left side of the window, even if it's a bit more verbose.
2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
> Did you consider using a keyword like `let` for column declarations
Yeah, the current design for that is not nice. Good point re the keyword scanning. I actually listed `let` as an option in the notes section. Kusto uses `extend`; dplyr uses `mutate`; pandas uses `assign`.
Awesome that you're responding to feedback like this!
Another suggestion around `let`: consider splitting it into two operations, one for creating a new column and one for modifying an existing one. E.g. called `let` and `set`. Those are in effect pretty different operations: you need to know which one is happening to know how many columns the table will have, and renaming a table column can, with your current system, change which operation is happening.
Splitting them into separate operations would make things easier on the reader: they can tell what's happening without having to know all the column names of the table. And it shouldn't really be harder for the writer, who ought to already know which they're doing.
I encountered something like this at my previous job. We had a DSL with an operation that could either create or modify a value. This made the code harder to read, because you had to have extra state in your head to know what the code was doing. When I rewrote the DSL (the rewrite was sorely needed for other reasons), I split the operation in two. I was worried people would have been too used to the old language, but in practice everyone was happy with it.
> Another suggestion around `let`: consider splitting it into two operations, for creating a new column and for modifying an existing one. E.g. called `let` and `set`.
Couldn’t we just not allow modifying an existing column? I.e. we would not allow
count = count + 1
But force the use of a new variable name:
new_count = count + 1
I think this makes for much more readable code, since the value of a variable does not depend on line number.
> 2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
I don't have an example on the Readme, but I was thinking of something like (toy example):
table newest_employees = (
from employees
sort tenure
take 50
)
from newest_employees
join salary [id]
select [name, salary]
Or were you thinking something more sophisticated? I'm keen to get difficult examples!
There, each variable can be referenced by downstream steps. Generally, the prior step is referenced. Without table variables, your language implicitly pipes the most recent one. With table references, you can explicitly pipe any prior one. That way, you can reference multiple prior steps for a join step.
I haven't thought through that fully, so there may be gotchas in compiling such an approach down to SQL, but you can already do something similar in SQL CTEs anyway, so it should probably work.
Maybe you'd like to check FunSQL.jl, my library for compositional construction of SQL queries. It also follows an algebraic approach and covers many analytical features of SQL including aggregates/window functions, recursive queries and correlated subqueries/lateral joins. One thing where it differs from dplyr and similar packages is how it separates aggregation from grouping (by modeling GROUP BY with a universal aggregate function).
FunSQL.jl requires Julia to run (obviously, as it is a Julia library) but it produces standard SQL, so Julia in this case is just an implementation language. I have re-implemented parts of FunSQL in Python and OCaml (the one I have ended up using) and have added a concrete syntax similar to what you have in PRQL.
from employees
define
salary + payroll_tax as gross_salary,
gross_salary + benefits_cost as gross_cost
where gross_cost > 0 and country = 'usa'
group by title, country
select
title,
country,
avg(salary) as average_salary,
sum(salary) as sum_salary,
avg(gross_salary) as average_gross_salary,
sum(gross_salary) as sum_gross_salary,
avg(gross_cost) as average_gross_cost,
sum(gross_cost) as sum_gross_cost,
count() as count
order by sum_gross_cost
where count > 200
limit 20
But, in my mind, the biggest difference between PRQL and FunSQL is the way FunSQL treats relations with `GROUP BY` - as just another kind of namespace, allowing you to defer specifying aggregates. A basic example:
from users as u
join (from comments group by user_id) as c on c.user_id = u.id
select
u.username,
c.count() as comment_count,
c.max(created_date) as comment_last_created_date
The `c` subrelation is grouped by `user_id` but it doesn't specify any aggregates - they are specified in the `select` below, so you have all selection logic co-located in a single place.
I think this approach is very powerful as it allows you to build reusable query fragments in isolation but then combine them into a single query which fully specifies what's being selected.
Writing an alternative syntax is straightforward. Perhaps prototype PRQL using xi's excellent FunSQL backend? This way it's working out of the gate. Once syntax+semantics are pinned down, writing another backend in the language of your choice would then be easier. Getting the backend correct is non-trivial work, and xi has done this already. Besides, we need a sandbox syntax anyway, so it might be fun to collaborate.
I like the explicit pipelining idea, seems much easier to reason about. Some comments:
I found the "# `|` can be used rather than newlines." a bit odd. So when using let, you're only transforming one column? I think the example would look weird with returns instead of |.
Depending on your intended target, it might help adoption if you stay closer to the naming conventions of that target. If you're targeting mainstream Java/Python/C#/Javascript etc. then functions need parentheses, "take 20" may be worse than slice, etc.
I think annotating microversions would get tiresome fast. I think the right way to think of this is that you put in a single version number like 1, and then only ever change that if you need to make backwards-incompatible changes that cannot be handled by clever hacks in the runtime.
Also I think you should try writing one or more native wrappers in your intended target languages to make sure it's easy to interface between the two, even if it means you'd have to use dots in that language.
I could imagine an end game where the ergonomics were so good that a database like Postgres ends up with a native PRQL frontend. Not sure you're there yet, though. IMHO SQL as a query language suffers from a) sometimes really bad ergonomics, b) it's hard to wrap in another programming language (also ergonomics), c) it has far too many concepts - it's not orthogonal.
> I think annotating microversions would get tiresome fast. I think the right way to think of this is that you put in a single version number like 1, and then only ever change that if you need to do backwards-compatible changes that cannot be handled by clever hacks in the runtime.
Thanks, good idea - I just changed this to remove the microversions. If we use SemVer, then before `1.0` we'd hold versions compatible within each 0.X, and after that within each major version.
IMO you are at the forefront of where query languages need to and will go.
Some programmers like you see that SQL ordering is backwards to human thinking, except in the simplest cases. But many people with practice and sunk costs in their SQL expertise will be resistant. The resistance usually wins the day.
But sometimes, a useful tool gets created by one person, and a rift is created in that resistance. Think John Resig creating jQuery, leading to LINQ and many other similar patterns. You could be that person for database query languages, but how do you ensure that?
Maybe imagine what made jQuery easy to adopt and indispensable for programmers: easy availability as a simple .js download; solved the problem of DOM differences between browsers. Good luck to you, and thanks for sharing.
wrt linq to sql: the difference is linq works by mapping objects to tables and dotnet primitives to sql types, often producing really poor queries as a result
This is a nice idea, especially given all the work people have done recently to make in-language querying nicer (Spark comes to mind).
My only gripe is the 'auto-generated' column names for aggregates. This seems like a recipe for disaster - what if there is already a column (as there almost certainly will be) named "sum_gross_cost"? The behavior also just seems rather unexpected and implicit. My suggestion would be simple syntax that lets you optionally give a name to a particular aggregate column:
...
filter gross_cost > 0
aggregate by:[title, country] [
average salary,
sum gross_salary,
average gross_cost,
let sum_gc = sum gross_cost,
count,
]
sort sum_gc
While it might seem a little uglier, it seems much more sustainable in the long run. If this is really too gross, I'd advocate some token other than underscore that is reserved for aggregation variables; perhaps `sum@gross_cost` or `sum#gross_cost`.
I'm quite opposed to the idea "from should be first".
I want to understand what exactly the query returns, not the implementation detail of the source of this data (that can later be changed).
Literally first example from page - I have no idea what is being returned:
from employees
filter country = "USA" # Each line transforms the previous result.
let gross_salary = salary + payroll_tax # This _adds_ a column / variable.
let gross_cost = gross_salary + benefits_cost # Variables can use other variables.
filter gross_cost > 0
aggregate by:[title, country] [ # `by` are the columns to group by.
average salary, # These are the calcs to run on the groups.
sum salary,
average gross_salary,
sum gross_salary,
average gross_cost,
sum gross_cost,
count,
]
sort sum_gross_cost # Uses the auto-generated column name.
filter count > 200
take 20
of course, similar things are happening to SQL too, with CTEs becoming more widespread and the "real" list of columns hidden somewhere inside, but it's still parseable
Quick, what is this query about? What's ironic is that I think you have it backwards: the columns are the implementation detail, not the table. The table is the context: you can't change that without having to change everything else. But columns are the last step, the selection after the filters, joins, etc. They can be changed at any time without affecting the logic.
This is... an odd choice. I'd assume I'm rarely looking at a query without the context to know why I would want those columns.
And the autocomplete story is backwards. Often I know what columns I want, but I'm not clear what table I need to get them from. Such that, if you made the suggestions in the from smarter, to only include tables that have the columns, I'd be much happier.
Just throwing in another point of anecdata onto this pile: "Often I know what columns I want, but I'm not clear what table I need to get them from" does not make sense to me. I don't relate at all to there being a global namespace of columns, rather than a namespace of tables, each with its own columns specific to its context.
I challenge this. I accept that there are ambiguities, but I assert that you can go really fast by just telling someone to fetch a few columns by name.
I further assert that if your database is filled with "Id" and "name" columns, instead of "department_name" and similar, you are probably as likely to mess up a join as to gain any benefit from the name being short. (And really, what advantage is there in short names nowadays?)
That all said. I worded my take too strongly. My point should have been that auto suggest should not be confined in either direction.
I think we have just done most of our data work in different environments.
When I'm trying to query stuff, the first question is "which service's database is that in?", so I can guess "user_service" (or whatever I think it is called), but I have no idea what they call anything in their schema, but now that the autocomplete system knows what table I'm interested in, it can help me figure that out.
I'm used to emacs, with a global namespace. Such that I'm used to searching all variables globally. Feels that searching all columns would be just as easy, all told.
That said, I want to be clear that I think both methods are valid and work.
Yeah, it's just that I don't know what the columns are called. The namespaces (databases and tables) are what guides me to the columns, not vice versa.
I can think back to projects / companies where this may have been different, basically just with fewer different distinct schemas, but it just isn't my recent experience.
(I appreciate your magnanimity in these later comments by the way.)
It's a query asking for the id, name, and author fields. Very straightforward, I have no idea how this is confusing.
> The table is the context: you can't change that without having to change everything else.
Except even in the provided single-table example this isn't true - you're getting subselected/CTEd results. No functional joins are demonstrated unfortunately.
For example:
from employees
left_join positions [id=employee_id]
...is equivalent to...
SELECT * FROM employees LEFT JOIN positions ON id = employee_id
No data is selected from positions in either example, and it's unclear why we're joining that table (other than just for the heck of it). It's not a workable example.
You restated the query; I was asking what it's about. Is it a query across publications? Or is it a query over news articles? That context changes everything: how the query is written, what it can be joined with, how it can be filtered, how it is used, etc. Putting the FROM clause first means that you immediately have context to understand the rest of the query.
What fields am I expecting in the resultset? Sure I have some context, I know it'll be about articles, but I have no idea what actual data I care about.
You're arguing that:
FROM articles
SELECT id, name, author
Is substantially superior to:
SELECT id, name, author
FROM articles
I don't see them as markedly different with such a small example.
HOWEVER, where the difference comes in is the "at a glance what am I getting in my resultset" data that is much easier to see in the latter, second easiest in the former, and not at all present in the linked article.
I think what this boils down to is what you (any reader, not you specifically) individually expect to need knowledge of when you're writing SQL. In most cases when sitting down to write a brand new piece of code to pull some data from the database, putting the list of tables involved in the query first matters most to some, whereas putting the list of fields to expect in the resultset matters most to others (I put myself in this camp).
For what it's worth, as I've stated elsewhere while I don't prefer it and would find it annoying to debug personally, I do recognize that the idea of SQL that allows you to list FROM / JOIN / etc. first is very appealing to some. What I think is completely off is the near-obfuscation of the examples in the linked article.
> I don't see them as markedly different with such a small example.
Of course not, because we're talking about a fundamental change in syntax that affects more than just two-line queries. PRQL puts the SELECT clause practically at the end of the query, after every join, filter, and aggregate. If you only care about the output, that's cool too: just look at the end. But if I want to understand a query in SQL, I have to read the entire thing backwards, clause by clause! By contrast, if the SELECT is at the end, that's not much of a problem.
Now you'll come back and say "but it's the same if you care about context: just look at the end!" And that's where we differ. I care more about writing, debugging, and understanding queries, whereas you think it's more important that the column names are up front, even if it makes understanding a nontrivial query much, much harder.
The big advantage of "from first" like we have in Kusto KQL (a database we use at Microsoft) is that it provides much better autocomplete (if I write the `from` it can easily autocomplete the projection).
If you want an interesting example of how a query language built for developer experience and autocompletion looks, definitely check it out!
That's interesting because it also explains why I was going to say I do like having from first. When trying to reason about a query, I mentally go through the following:
1. What tables are being pulled from? This speaks to the potential domain of the query.
2. What data is being selected (I can now know what is or isn't being pulled from the aforementioned tables...)
3. What operations, aggregations, groupings, etc. are being performed on the pulled data
Of course from vs select ordering is completely arguable, but my thinking process seems to follow that of the auto complete--in other words that my cognitive load of looking at the select statement is lessened when I know from what the columns are being selected.
It also follows (at least to me) the mental process of writing the query. First look at the tables, then decide what columns, then decide what functions to apply.
I said it in a sibling, but I feel this is somewhat missed. Auto complete that simply lists the tables is easier if from is first. But... Auto complete that helps me know what tables can give me my requested columns works the other direction.
You would think that but having used both I find writing Kusto/KQL much smoother, neater and faster and if I have to choose between writing a query in either one I'd pick KQL.
I understand this is just an opinion but it's an opinion held by everyone in my org who writes both.
(I can see the result type both by hovering on the query but also by just looking at the end of it - and in SQL most of the SELECTed items in complex queries are from subqueries anyway - at least in my use case)
Building for autocomplete is building for human understanding. If it is impossible for a computer to determine the context of your query, why would a human do much better?
Technically to be equivalent you need to wrap the second one in parentheses so you can use ToList() on it. Unfortunately a bit ugly. I'm not sure why they didn't add one more keyword to handle pipelining into other functions. Something like "feed", "into", or "pipe". Or just pluck the |> operator from F#.
I’ve been frustrated by toilets where I have to contort my body to reach the dispenser. Similarly, I’ve had dispensers intrude on the space where my legs would normally be and make it awkward to even just sit on the toilet.
Toilets are absolutely designed to make the dispenser placement convenient. You just don’t think about it because 95% of toilets get it right, so it just doesn’t bother you that much that it can be wrong.
In SQL, some decisions are right about 10% of the time and are annoying and awkward the other 90%.
That’s why the order matters. Because everything else got it right.
I'd agree if there was any way whatsoever of fixing this issue, but there simply isn't. The editor can't even begin to guess what you might want until you write your FROM.
Maybe in gigantic systems with more tables than makes sense. Realistically, all of the columns available in a database can be fit in memory with ease.
Then, the ide could basically fill my from out for me, based on what I'm asking for. Can even suggest what join I will need, if I list columns from multiple tables.
>Maybe in gigantic systems with more tables than makes sense. Realistically, all of the columns available in a database can be fit in memory with ease.
Every table has more than one column.
So there's always more columns to remember than tables, and tables are generally pretty easy (users, invoices, and so on).
I worked with systems that had like 500 tables, some of them with 20-50 columns.
You really want good intellisense in such an environment.
500 times 50 is still not a big number. And you could make decent statistical suggestions based on the columns currently in the query.
Good intelligent suggestions is, of course, helpful. And I agree that suggesting one of 500 is easier than the other. That said, neither is hard for a computer. And even asking friends what table I want will often be done with starting with the actual columns I want.
And one of the use cases is writing queries, which it helps immensely.
The best of both worlds would be to allow both orders.
Just automatically transform the query to the usual form after its execution.
I think it's quite a common convention in engineering - not just software - that the input to a process "goes in the top and out the bottom". We humans read top->bottom (regardless of left/right/vertical, I don't know any languages that write bottom up). Conventional voltage in circuit diagrams usually flow top to bottom. Gravity loads in schematics flow top to bottom. Chemical pathways are usually written top to bottom. And of course functions take arguments up top and return at the bottom, maybe with some short circuits. I think the only counter example of note is distillation columns.
Where is the data coming from? Employees table. What's coming out? 20 rows of sum_gross_cost.
What could improve this is function signatures. It's kind of nice to have the whole abstraction up top...like an abstract.
I agree that the columns of the results should be more obvious. But I am a proponent of "from should be first". I have never written a SQL query without thinking about the contents of a table or its relations. If it was my way, I would describe where the data I'm pulling from, then describe any filters/joins, then describe the columns that I'm interested in (last).
It's a fair sentiment, but it can be handled without losing directional flow and composability, some of the bigger advantages of reworking SQL.
One idea would be along the lines of a function prototype: a declaration, up front, about the columns and types that a query is expected to return. It's a good place to put documentation, it's redundant information which should protect against mistakes but not so redundant that it would be too taxing - the author should know what the query returns. The prototype would only be used for validation of column names and types.
Another idea would be requiring the last element in a query to be a projection, a bit like the return statement in a function body: here's what I'm returning out of the grand set of symbols available (e.g. via various joins) in scope from previous operations in the flow.
Without the `let` I would imagine having trouble reading it as well, I'm not sure if that would go away with familiarity but my instinct is that it's a useful addition.
This feels like an English-language thing. In English we tend to put our adjectives first, it feels natural, "Where is my red, round ball?", rather than some other languages (like German) where you put the noun first. Equivalent of "Where is my ball, red & round?"
While it inherently feels unnatural, I do agree with the others here that the context is actually easier to understand once over the initial discomfort.
_from_ is kind of one of the most important context about the data being returned. It provides the type information. Columns you select are just properties of that type.
In SQL, where _from_ is placed at the end, we are essentially writing the equivalent of 'property.object'; e.g.: name.person, age.person
Both CTEs and this idea address the same problem: poor readability of complex SQL queries. Compared to CTEs, the author takes the idea to split the complex query into parts to the next level.
To your point - a solid IDE will show you what's being processed at each line (or returned, if the cursor is on the last line) - in an autocomplete window or a side panel.
First, kudos because it takes courage to take on SQL in this way.
Second, this kind of reversed SQL (filter-first, select-last) is much easier to reason about than the original and keep in mind that I prefer to code complex queries in SQL than to build or translate them in the ORM of the project I'm working on.
Maybe a transpiler is an inevitable first step but I think that any SQL replacement should be itself the target of ORMs and run directly in the database CLI tools (psql / mysql ...) or IDEs (pgAdmin, MySQLAdmin, ...). What's the long term plan of the project?
> Second, this kind of reversed SQL (filter-first, select-last) is much easier to reason about than the original
Given that SQL clauses tend to be unambiguously terminated by the start of the next clause or the end of the statement, it surprises me that no engine has gone to accepting otherwise standard(-ish, as much as real DB vendor dialects are) SQL but without a mandated order of clauses.
And then combine that with dev tools that allow easy rearrangement of clauses, perhaps based on configured preferences so that you don’t even see the original if its not your preferred order, so that “Bob likes old-school SELECT FROM WHERE GROUP BY and Alice likes FROM WHERE GROUP BY SELECT” isn’t a problem.
I agree that integrating with the DB would allow much more from a lang. But PRQL is a bet that languages which start there (e.g Kusto) get lost because it requires changing DB, which is really hard. I worry EdgeDB may hit this issue too (but I'm really hoping it works, and they have an excellent team).
As I think you're suggesting — you could imagine a language starting out as a transpiler, and then over time DBs working with it directly, cutting out some of the impedance mismatch.
Malloy [1] is another point in space — it targets existing DBs through SQL queries but can also ask for schemas etc while developing.
I'm kinda surprised that the list of influences doesn't mention XQuery. Yes, it's not a relational query language... but it covers much of the same ground in practice, especially the part that they call "FLWOR expressions" (for/let/where/order/return) that operate on "tuple streams":
And it has grouping, windowing functions etc. I bet you could define a subset that is specifically tailored to the same use cases as SQL - basically, get rid of everything to do with elements and attributes, and only allow scalars and sequences (and maybe maps?). But otherwise keep XDM data types and their semantics.
shakti / K / kdb+ implements "real SQL", which is concise but readable, and could give you a few ideas. Here's a copy-paste from https://shakti.sh/ under document/sql.d (cannot deep link, unfortunately). The most magical aspects are automatic joins - both left joins and "foreign key chase" joins. The fk-chase joins, in particular, should be part of every query language, and can possibly be added in a backward-compatible way to existing SQL implementations.
example: TPC-H National Market Share Query 8 http://www.qdpma.com/tpch/TPCH100_Query_plans.html
what market share does supplier.nation BRAZIL have by order.year for order.customer.nation.region AMERICA and part.type STEEL?
real: select revenue avg supplier.nation=`BRAZIL by order.year from t where order.customer.nation.region=`AMERICA, part.type=`STEEL
ansi: select o_year,sum(case when nation = 'BRAZIL' then revenue else 0 end) / sum(revenue) as mkt_share from (
select extract(year from o_orderdate) as o_year, revenue, n2.n_name as nation
from t,part,supplier,orders,customer,nation n1,nation n2,region
where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and
c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name = 'AMERICA' and
s_nationkey = n2.n_nationkey and o_orderdate between date '1995-01-01' and date '1996-12-31' and p_type = 'STEEL') as all_nations
group by o_year order by o_year;
Thanks for the tip. That automatic "foreign key chasing" looks phenomenal. Byebye, much of that big tedious chunk in the middle of your 2nd example... Wish I had that for more of the SQL I write.
I see EdgeQL as an excellent replacement for SQL in OLTP settings — it has great language integration and a unified relational & typing approach. (Please correct me if this is mistaken though).
I wrote the PRQL proposal for analytical / OLAP queries, where the pipeline of transformations is more important, and relations and typing are relatively less important.
EdgeQL is getting support for generic partitioning/aggregating `GROUP` very soon [1], so we are giving some love to the analytical side of things too :-)
We definitely need more collective effort put into "Better SQL", so PRQL is a welcome sight!
1) SQL also allows you to define windowing reusably, like this
select
sum(blah) over window_abc,
avg(blah) over window_abc
from table_xyz
window window_abc as (partition by x order by y)
so that second example could be written somewhat less repetitively but it wouldn't change the whole point.
2) Sadly my main pain point with SQL for ETL is not possible to solve with a transpiler - SQL has exactly one target, so doing things like "I want these records to go to table A and those records to go to table B" is not possible with one query.
3) It would be cool to see how this does typically annoying and repetitive cases from analytics / data warehousing world. I'm thinking like SCD1/2 implementation. But I don't even know if mutation is there yet.
4) I would recommend investing in one canonical formatter, like Go has, so that there isn't an infinite number of ways the same query could be formatted for people to argue over.
EDIT:
5) Since this seems to be focused on analytics (by the choice of queries and Snowflake in examples), I want to highlight that someone suggested to use TPC-H (or TPC-DS) queries as a benchmark. It does sound like a good idea.
Awkward syntax — when developing the query, commenting out the final line of the SELECT list causes a syntax error because of how commas are handled, and we need to repeat the columns in the GROUP BY clause in the SELECT list.
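i.e. the classic annoyance (and the reason some people resort to leading commas):

    SELECT
        title,
        country,
        salary        -- comment this line out and the comma after "country" becomes a syntax error
    FROM employees;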
There are some SQL varieties that actually allow a hanging comma! Also, the provided examples seem comma-dependent, no?
As someone who writes a ton of analytical SQL, i think this would get super messy for long, complex queries with casting, case statements, windows functions, etc.
Most people just need to learn to write better SQL!
There’s a lot to like here. The ordering and the ability to write functions.
I’m not a big fan of the ternary operator. I think the ‘? :’ is hard to read and caters to programmers from system programming languages. A query language should cater to BI and stats people as they typically have a harder time learning syntax than a CS person has learning non-C-like syntax.
I’ve always liked Python's ‘val if condition else other_val’. It’s easy to read for newcomers, while the ‘? :’ is a devil to google for.
In the second example I don’t understand why there are both let statements and select statements for the same columns. I also don’t understand why select uses square brackets. Maybe I’m too used to sql but why not just not have brackets?
To be fair, lateral joins (cross/outer apply in mssql) can help with name aliasing and table functions give sql some reusability. I think the main pain points for sql are pivots and window functions.
A lot of the time you just want to transpose your result, but you have to choose an aggregate and handle null cases to force pivot to work the way you want it.
And a lot of the time you want to aggregate a window but keep the ids of the row so you avoid the having keyword altogether and go for row_number and dense_rank to get your aggregate results.
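A rough sketch of that pattern (a hypothetical orders table; a windowed aggregate standing in for GROUP BY/HAVING):

    -- attach a per-customer total to every row, keep the row ids,
    -- and filter on the aggregate without GROUP BY or HAVING
    SELECT id, customer_id, amount
    FROM (
        SELECT
            id,
            customer_id,
            amount,
            SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
        FROM orders
    ) t
    WHERE customer_total > 1000;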
If I were to write a query language, I would discard group by and having and make it easier to apply transpose and window functions.
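A minimal sketch of the window-function pattern described above, with hypothetical table and column names: attach the aggregate to every row and filter on row_number, keeping the row ids instead of collapsing them with GROUP BY and HAVING.

SELECT *
FROM (
  SELECT o.id, o.customer_id, o.amount,
         SUM(o.amount) OVER (PARTITION BY o.customer_id) AS customer_total,
         ROW_NUMBER() OVER (PARTITION BY o.customer_id ORDER BY o.amount DESC) AS rn
  FROM orders o
) t
WHERE rn = 1;  -- one row per customer: their largest order, with its id, plus the aggregate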
I love quality language proposals like this. I'm not so much in data processing/bigdata, but have had to interact with SQL a lot.
This syntax is lovely! It's more intuitively readable (and SQL is not that bad in that regard).
My feedback:
1. Lower case, underscored everything makes the terms a bit hard to differentiate.
Maybe set some classes of symbols in CamelCase, or add !@#$%& prefixes to them, to make it more readable.
2. I don't like using another language (SQL or PRQL for that matter) for db interaction; I like to write queries in the language that I'm developing in.
There are ORMs in this design space, but I'm a little fed up with them.
In Java there's jOOQ. Other less-OO-more-functional ORMs exist in Rust and Haskell land. These often have a code generation step: a library is generated that guarantees some type safety for a given schema version. Some are more SQL-like, some provide a different API.
PRQL diverges much more from SQL than these do, and for good reasons.
Maybe several languages could easily have libs like this building on top of JPQL?
3. Your solution is a bit like GraphQL in some regards, where a tool is needed to convert the query to SQL.
Tools like this exist, Hasura and the like; Hasura does a lot more. To me GraphQL has the huge advantage of serving a schema so that clients can be generated. I can interact with GraphQL in a type-safe fashion by generating a client in, say, Elm.
The generated client lib does not allow me to write syntax errors in my queries and ensures all type conversions are sound.
Maybe PRQL can also be a language like GraphQL in that regard, and provide a schema too.
4. JPQL.
It's close to SQL and improves on it, but I never found it enough of an improvement to justify the cost. I think your proposal is better. Still, I think JPQL deserves a mention as maybe one of the most widespread compile-to-SQL languages.
Awesome! Would love to see an implementation. I worked on something similar over the summer. It’s just relational algebra with pipes for composition. If you are interested, we could get an ANTLR grammar going and plug it into this basic execution engine to get a feel for the language.
I've always wondered why there aren't query languages that embrace algebraic data types and pattern matching. Seems like an obvious fit to me. There are many times when you'd want to model a table that has either this schema or that schema.
They can work well. In the project I'm working on the database uses algebraic datatype keys (i.e. tags and tag-dependent columns) to make the database faster and smaller than an equivalent relational schema, but the database is used via API rather than via a query language.
There was this professor of language who would say "Do you think the question ('are carpets furniture?') tells you something about the ambiguity of the word carpet, or do you think it tells you something about the ambiguity in the world?"
Similarly, I think joins are "tough" not because of the way SQL expresses them but because the logical possibilities of merging data from multiple tables are varied.
There is no such thing as a domain-agnostic SQL database that holds up under this kind of semantic scrutiny. I don't think that there ever could be.
If you are rolling a SQL schema for a home improvement contractor, it is extraordinarily unlikely that their specific business would expect any scenarios in which carpets are sometimes known as furniture.
Having a bounded context to operate within is what makes SQL magical for me. When people don't understand the business or simply the game around how you talk about the business, things start getting messy wrt joins.
The carpet discussion was simply to say that you can't take out all the complexity of a language if the domain it is meant to describe is complex. The language has a limit to how simple it can be.
I was not proposing a SQL database of carpets, or furniture, as a thought experiment.
SQL has effectively failed as a standard, despite its ubiquity. It's literally being aged out, which creates opportunities for PRQL etc. to fill pragmatic gaps.
e.g. the lack of default column aliasing from joins:
SELECT
A.id AS A__id,
A.name AS A__name,
B.id AS B__id,
B.name AS B__name
FROM A
LEFT JOIN B
ON A.other_id = B.other_id
When you could have:
SELECT
A.*,
B.*
FORMAT (TABLE__)
FROM A
LEFT JOIN B
ON A.other_id = B.other_id
IMO, this appears not to be something that solves the SQL learning curve, but rather something that improves the usability of a query language with tooling.
I don't think there is much that could be done to address left, right, inner, outer join semantics. It's just something you have to learn if you want to do a lot of SQL (though, you are likely only ever going to use left and inner joins).
Looks really nice. I've been scribbling away in a little notebook all the things I would do in "akdor's dream sql", and what you have here matches it pretty much exactly.
Wondering about generic use of `let` - you have let for col defns, but `func` for functions and a TODO for tables/CTEs - could/should `let` do the lot? (Like another commenter posted, this is how MS's M language, used in PowerQuery in PowerBI and Excel works). Could enable an escape from point-free for entire queries if taken to extreme generality, not sure if that's a good thing, maybe it could be?
Bikeshedding: even with some OCaml/F# experience, I find `f x y` harder to read than `f(x, y)`.
At the moment `let` is used to add a column as part of an existing pipeline. [1]
`func` is the start of new expressions / pipelines. And I just added a proposal for `table = `, which would be the same.
Does that make sense? Very open to more feedback...
[1] I just added `let` based on feedback here, it's better than it was, but not perfect, as it can be confused for a new pipeline given its use in other langs.
I like that everyone is trying to make something like SQL that reads more naturally to them. More alternatives are good! SQL is a widely accepted standard, with strictly defined and super broadly accepted semantics.
As someone who has written quite a few half-baked-for-general-use but fit-for-purpose SQL generator utilities over the years, I'll suggest that if you intend for a novel syntax to be a general SQL replacement then being isomorphic to SQL would massively increase usefulness and uptake:
1. novel syntax to SQL; check! Now novel syntax works with all the databases!
2. any valid SQL to novel syntax; a bit harder, but I'd start by using a SQL parser like https://github.com/pganalyze/libpg_query and translating the resulting AST into the novel syntax.
3. novel syntax to SQL back to novel syntax is idempotent; a nice side effect is a validator/formatter for "novel syntax"
4. SQL to novel syntax back to SQL is idempotent; a nice side effect is a validator/formatter for SQL, which would be awesome. (See also https://go.dev/blog/gofmt, which is where I learned this "round trip as formatter" trick.)
I don't mean for this to sound negative, and I know that 2, 3, and 4 are kind of hard. Thank you for building prql!
This actually looks like an improvement (and I like SQL). This feels closer to non-programmers, contrary to some other SQL "competitors" like that query language from InfluxDB.
I both love this (it's very well done!) and hate this, because I feel it misses the forest for the trees.
It's like that meme "Nobody says 'I want to be an Excel guru when I grow up.'"
SQL is just the means to an end - to alter or retrieve data in a system.
We shouldn't be coming up with nicer ways to handle autocompletion, or recursive semantics, or windowing functions.
SQL was created to be declarative and human-friendly, to be read aloud.
The next evolution of SQL should be natural language.
It should be bounded at the schema and enriched with as much context as it can, both about the domain of the data, the data itself, and its relationships within itself and to other schemas.
I just want to see all the people in my organization who haven't submitted their timesheet this week; all the orders in my ERP that are waiting for parts from a specific vendor; how much lift I'm getting from my targeted marketing campaign. I just want to delete any contact from CRM who hasn't responded to an email in the last 45 days!
I don't want SQL; I want something that translates natural language into a logical plan, and a physical plan; and a separate tool that allows me to express many, many natural language concepts for my bounded schema ... perhaps in a SQL-like way.
Like a metrics / definition store on steroids, or just a sea of rich aliases and computed columns and subqueries that can be composed without stressing over syntax or newlines or ordering or complex join conditions.
This only targets SELECT queries right? I guess there's not much to improve on for INSERT, UPDATE and DELETE queries.
BTW window definitions are reusable using the WINDOW clause, there's no need to define it over and over in SELECT.
SELECT
date,
CASE WHEN is_valid_price
THEN price_adjusted / LAG(price_adjusted, 1) OVER w - 1 + dividend_return
ELSE NULL
END AS return_total,
CASE WHEN is_valid_price
THEN price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER w - 1 + dividend_return
ELSE NULL
END AS return_usd,
CASE WHEN is_valid_price
THEN price_adjusted / LAG(price_adjusted, 1) OVER w - 1 + dividend_return - interest_rate / 252
ELSE NULL
END AS return_excess,
CASE WHEN is_valid_price
THEN price_adjusted_usd / LAG(price_adjusted_usd, 1) OVER w - 1 + dividend_return - interest_rate / 252
ELSE NULL
END AS return_usd_excess
FROM
prices
WINDOW
w AS (PARTITION BY sec_id ORDER BY date)
;
I am writing a language and CLI that mixes SQL with Python (spyql: https://github.com/dcmoura/spyql), and it is interesting to see that we are tackling some of the same problems:
- code/formula reuse, where we need to repeat logic across the query
- functionalities like `EXCEPT` and `REPLACE` modifiers for `SELECT *`, as in Google BigQuery (in most SQL databases it's frustrating when you have a large number of columns and you only want to hide or replace a couple of them; see the sketch below)
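For reference, the BigQuery modifiers look roughly like this (hypothetical column names); most other dialects have no equivalent:

SELECT * EXCEPT (internal_id, audit_ts)      -- hide two columns
FROM wide_table;

SELECT * REPLACE (ROUND(price, 2) AS price)  -- replace one column in place
FROM wide_table;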
I do think SQL is all over the place, and while not perfect, it's familiar and we've got used to expressing our queries the SQL way. In the end, since you are generating SQL to interact with databases, you would still have to understand SQL in order to optimise your queries (it might be challenging to get the perfect SQL query from PRQL, as you do not know statistics about the tables or which indexes are available).
With SPyQL I am taking a different approach that tries to extend simple SQL SELECT statements so that some of these annoying features are tackled. In addition, by using Python to define expressions and conditions you solve another problem typically present in databases: extensibility. By including an IMPORT clause in your query you can import any Python module, so the sky is the limit. You also get a simple and intuitive way to work with objects and hierarchical data (like JSON).
I do find the language you are proposing very readable and flexible, bringing several advantages over SQL. If you build a parser I would love to bring it to spyql :-) The issues I brought up earlier would not be a problem for spyql, since it is a tool for querying files and data streams on the command line.
>Compatible — PRQL transpiles to SQL, so it can be used with any database that uses SQL. Where possible PRQL can unify syntax across databases. PRQL should allow for a gradual onramp — it should be practical to mix SQL into a PRQL query where PRQL doesn't yet have an implementation.
Awesome.
I hate SQL so much, I know for personal projects this is gold. I imagine actually using it at work might draw some questions though
SPARQL. Representing human information in relational tables goes against how people actually think and use information. We humans think in tremendous numbers of nested hierarchies, and recursive hierarchy traversal is a nightmare in relational databases. A graph is the structure for data that works best, is most efficient, and actually reflects how things are connected in our brains.
I'm a big fan of SPARQL, but the one thing that would concern me about trying to use it outside of the SemWeb context is simply that it assumes data is stored in <S,P,O> triples. Legacy databases by and large are not, so you need an adapter to bridge the representations. And while I know some exist, I haven't really used them and am not sure about the performance impact.
You can get quite far mapping the triple concept to (PK, column, value) or (PK, FK, related-row) and transpiling from there.
(I played around with this some years back, not to the point where anything came out of it worthy of publishing, but enough to be pleasantly surprised how far 'quite far' turned out to be in practice)
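A hedged sketch of that mapping in plain SQL, with a hypothetical employees table: every row fans out into one <S,P,O> triple per column, with foreign keys becoming related-row predicates.

SELECT id AS subject, 'name' AS predicate, name AS object FROM employees
UNION ALL
SELECT id, 'salary', CAST(salary AS TEXT) FROM employees
UNION ALL
SELECT id, 'manager', CAST(manager_id AS TEXT) FROM employees;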
I'm really excited about languages that build on or are compiled to SQL, in the long-term (because I think it will take a very long time to build adoption).
The ones that particularly excite me are shorthands for SQL, even though their heavy use of symbols may be a detriment. One particular use case is easily defining static authorization policy-queries that are backed by database data plus request variables injected during evaluation.
I am not very excited by datalog/prolog-based languages because I think logic languages are too unnatural to ever go mainstream. But I'd be excited to be wrong or for logic languages to become more friendly.
In a way SQL is Prolog, and all reasoning about improving SQL should start from Prolog, because that is where SQL started from. The expressive power of both languages is theoretically the same, even though SQL is much more comprehensible. But then again, certain complex tasks turn SQL into a difficult-to-comprehend series of nested declarative operations on algebraic sets.
We don't need better SQL. We need a programming language with relational concepts built into its collections library and a syntax and type system that makes using it easy.
The fact that you have to go through all these intermediate layers to access your data is just stupid. Data should just be "there."
I hope PRQL has a better fate! Unfortunately, by deviating from SQL lexical conventions (using :, using [], etc.) we lose the ability to copy-paste SQL code from elsewhere.
I want a better SQL, but I also want some compatibility. Like typescript is for javascript.
The link says that dplyr is not able to use databases, but there is actually a database backend for it in the dbplyr package. See https://dbplyr.tidyverse.org/
> Unnecessary repetition — the calculations for each measure are repeated, despite deriving from a previous measure. The repetition in the WHERE clause obfuscates the meaning of the expression
In my own in-house SQL-like language I solve this simple issue by allowing previously defined columns to be reused:
select
salary + payroll_tax as _gross_salary,
_gross_salary + benefits_cost as gross_cost,
...etc...
The leading underscore means those columns are just temporaries, excluded from the output.
I do not believe the other issues are actually issues, and am quite surprised by the volume of interest in this SQL alternative. Can't be because Rust is mentioned, can it?
From dbplyr experience, folks want to be able to do stuff like `across(which(is.numeric), mean)`, which you can't do currently because dplyr doesn't know the column types (although it does maintain a list of the column names).
I wasn't thinking it'd be useful, but more that it's good to remain as free as possible from any assumptions about the schema of the data. I often work with tables that have unusually large schemas (> 100MB) and have seen some products' performance severely degrade as a function of schema complexity.
(But otherwise +1 to schema awareness during authoring.)
This is awesome, I really like the ML syntactic approach. I also really like the composability. I might just have to sign myself up as a contributor, because this checks off a huge chunk of my list of things that is wrong with SQL, and I'd love to see this succeed.
I know this is meant to transpile to SQL and so maybe this language is the wrong place to do this, but my biggest pet peeve with SQL is the ternary logic introduced by SQL nullability. I'm begging and pleading for this wart to go away, and I would love to see some algebraic sum types (Optional/Maybe, etc.) used in their place.
>PRQL is intended to be a modern, simple, declarative language for transforming data
It's not declarative. It's functional.
I believe that the approach that is followed by PRQL is more practical than SQL. We've implemented a similar approach in our visual ETL tool for non-technical people (https://easymorph.com) and it works wonderfully. Other cool things you can do with this approach (and can't with SQL):
* Modify existing columns without re-selecting the whole dataset
* Loops (iterations)
* Conditional IF/THEN/ELSE branching as a workflow statement
Love it. This reads so much better in my mind than SQL. I love how easy it is to abstract with functions.
My one suggestion: use a distinct symbol for assignment and equality, either :=/==, =/==, or even :=/=. I kinda like the Go way of doing things, := assigns and initializes, == is equality (not sure you have a need for = re-assignment, but maybe). But I would definitely warn against = for context-dependent assignment or equality comparison.
You've got "let" so I guess that is kind of a syntactical difference; I guess the trippy part is seeing "=" do equality comparisons.
from population
select country, rollup(city), count(*)
sort
Can represent this repetitive SQL query:
select country, city, count(*)
from population
group by country, rollup(city)
order by country, city
Information in group by is often redundant. You can tell which columns are measures vs dimensions by examining the 'aggregate' function - rollup or no function vs sum, count, avg. Order by can have a default to sort by all columns instead of naming them one by one.
Nice. Why OCaml though? I think using a more conventional language to construct queries could yield more adoption. It also seems that ORMs kind of exist to tackle a similar issue, at least in part.
It would definitely be interesting to have a TypeScript of sorts, but for SQL.
So a more practical and prettier syntax, like what I'm seeing here, that compiles to SQL queries.
Go to https://sqlframes.com/demo and in the code editor enter the following and execute (this example is taken from the first example on PRQL github page). It generates SQL, but it also computes and displays the results within the browser (though the data set below gives no results).
TypeScript is more verbose than JavaScript. While I love to use TypeScript I don't think I'd categorize it as prettier than JavaScript. And practical... well if you mean it is more maintainable then yes but if you mean faster to write then no.
I don't want a more verbose SQL I want a less verbose SQL!
I meant more the aspect that with TypeScript people prefer to write in it and then compile to JavaScript because there is a benefit. With PRQL compiling to SQL, we would reap the benefit of a more practical and better syntax. In both cases they bring benefits, just not in the same way.
Yup, I posted that yesterday too. I think Malloy is really interesting — compile to SQL but give more integrations to the DB, like schema-during-development. It has a proper team, led by Lloyd Tabb.
I would also take a look at MDX, which comes from Microsoft, for classical analytical processing workloads. I feel that it provides a nice way to think of data in terms of dimensions/attributes, hierarchies and measures, giving a good way to think about what to aggregate vs which fields to use to do that aggregation.
SQL feels a little more generic, and it would be nice for it to be a little more schema-aware, specifically for analytical workloads.
This is another in a series of these kinds of proposals that look excellent at first glance for perhaps the 75% case, but start getting syntactically messy when I want to customize the result set returned.
On the surface, they're always neat but when you start to dig into how you'd implement something in an RDBMS, it begins to fall apart.
Let's look at the example syntax:
from employees
filter country = "USA" # Each line transforms the previous result.
gross_salary = salary + payroll_tax # This _adds_ a column / variable.
gross_cost = gross_salary + healthcare_cost # Variable can use other variables.
filter gross_cost > 0
aggregate split:[title, country] [ # Split are the columns to group by.
average salary, # These are the calcs to run on the groups.
sum salary,
average gross_salary,
sum gross_salary,
average gross_cost,
sum gross_cost,
count,
]
sort sum_gross_cost # Uses the auto-generated column name.
filter count > 200
take 20
Where in here is it clearly stated which fields are returned?
In the original SQL it's right up front but here it's buried into the "aggregate" function, and I'm not clear that this isn't an oversight.
Another example that speaks to the "how do I implement this" side of the equation:
from employees
filter country = "USA" # Each line transforms the previous result.
gross_salary = salary + payroll_tax # This _adds_ a column / variable.
gross_cost = gross_salary + healthcare_cost # Variable can use other variables.
filter gross_cost > 0
Does this mean that the database must scan all records of the employee table in order to return the result before moving to the next step in the query? Must I index all fields? If not, how does a query planner prepare for this scenario?
The major tradeoff you make in most ORMs is exactly this: You lose out on being able to be explicit about how many queries are sent to the DB (and in many cases how efficient those queries are). Now this would become a language feature? What do I gain for that loss?
I'm not saying that SQL Syntax is perfect; far from it. I'm not seeing how this is an improvement.
I think if you want traction though, a proof of concept using an existing RDBMS would go a long way into providing evidence that this will work and is sufficiently thought out to deal with even the basics of what existing SQL databases have to. Query planning is hard, especially if you want it to be fast.
> Where in here is it clearly stated which fields are returned? In the original SQL it's right up front but here it's buried into the "aggregate" function, and I'm not clear that this isn't an oversight.
It's in the aggregate portion, like you said. Other example queries have a select portion. Why does it matter that it's not in the leading position like SQL?
> Does this mean that the database must scan all records of the employee table in order to return the result before moving to the next step in the query? Must I index all fields? If not, how does a query planner prepare for this scenario?
No, they are just describing how the statement is supposed to be interpreted by a human. I think you can basically just shuffle all the filter statements to the end and keep it logically equivalent.
This is a proposal for a "transpiles to SQL" language. So long as that transpilation is predictable, you cannot run into the sort of issues you are describing.
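As a hedged illustration of that predictability (one plausible transpilation, not necessarily the SQL PRQL actually emits), the first employees pipeline upthread compiles to a single conventional query: pre-aggregation filters become WHERE, post-aggregation filters become HAVING, and take becomes LIMIT.

SELECT title, country,
       AVG(salary) AS avg_salary, SUM(salary) AS sum_salary,
       AVG(salary + payroll_tax) AS avg_gross_salary,
       SUM(salary + payroll_tax) AS sum_gross_salary,
       AVG(salary + payroll_tax + healthcare_cost) AS avg_gross_cost,
       SUM(salary + payroll_tax + healthcare_cost) AS sum_gross_cost,
       COUNT(*) AS count
FROM employees
WHERE country = 'USA'
  AND salary + payroll_tax + healthcare_cost > 0
GROUP BY title, country
HAVING COUNT(*) > 200
ORDER BY sum_gross_cost
LIMIT 20;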
> It's in the aggregate portion, like you said. Other example queries have a select portion. Why does it matter that it's not in the leading position like SQL?
I don't mind it not being in the leading position. The author provided a very simple query and in that case it's not immediately apparent what fields to expect the resultset to contain when returned to the consumer.
This is a troubleshooting issue more than anything else. IMO placing the "selected fields" into the very centre of the query is distracting and obfuscates what is happening.
> This is a proposal for a "transpiles to SQL" language. So long as that transpilation is predictable, you cannot run into the sort of issues you are describing.
I think a good test of whether any transpiled language works well is to look at whether it could work on its own as a language. See: Typescript.
I don't think it's falling apart at all. Personally, I would require that what columns get returned be explicit (optionally with a * type syntax that you have to enable - the defaults should be safe and * has its risks). For one thing, you don't necessarily want to return all the columns you have aggregated. E.g., you may be running the equivalent of a HAVING clause on an aggregate column, so don't need the value returned.
"Each line transforms the previous result." - I assume this is referring to the order that transpilation happens, so you can read it top to bottom and understand the flow easily.
One thing I would like to see is how a recursive CTE might look.
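For context, the SQL being asked about is the standard recursive CTE form, e.g. walking a hypothetical employees org chart:

WITH RECURSIVE reports AS (
  SELECT employee_id, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL          -- start at the top of the hierarchy
  UNION ALL
  SELECT e.employee_id, e.manager_id, r.depth + 1
  FROM employees e
  JOIN reports r ON e.manager_id = r.employee_id
)
SELECT * FROM reports;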
Using COMMON TABLE EXPRESSIONS (CTEs) can greatly improve the readability of complex SQL queries. Adding "flow" just feels like a variation on sequential programming.
I am against "improving" SQL. Instead, I think a whole rethink of the engineering behind relational engines needs to occur. For example, why can't a relational database support both SQL and other languages simultaneously, instead of being so black-boxish?
It is a language created to read like natural language, to facilitate its adoption.
Not only developers use SQL; less technical users also use it, especially in the finance industry.
Now, do you use BASIC for professional software development? While some do, most don't. So why do we use SQL then?
An imperative language that compiles to SQL can be attractive, but unless the compilers come with good warnings, we may end up creating bloated database code.
Jeez I don’t know - SQL is the eternal language and people are always trying to replace it or come up with alternatives so they don’t have to learn it, but those efforts have always turned into their own obscure learning domains with the exact same problems people complained about in the first place (I have to learn this obscure syntax to get and shape data). Just learn SQL fundamentals and move forward.
Interesting, this language reminds me a lot of Ecto[1] (the DSL "ORM" for Elixir).
It took a bit of getting used to but I'm coming to really enjoy Ecto and especially now that sqlite is fully supported, I'm finding myself using it in places I normally wouldn't have.
Given that a lot of the time (most of the time?) SQL is written by applications, not humans, how about a syntax based on some data language (e.g. JSON)? It always seemed strange to me that the API to our databases is (SQL) strings... the order isn't even important. Something like YAML or JSON seems like a better fit? Then of course it would be trivial to build other languages (PRQL?) on top of that.
To the author, if he's reading this: I wish you luck! To get this adopted, try to make it modular so maybe it can be made a core module of PostgreSQL, MariaDB and MySQL. If you somehow get into those three, I wouldn't be surprised if Oracle, SQL Server and DB2 integrate it themselves just to keep up.
Ah, Presto could be another popular target.
It's a huge effort but something which could have colossal payback.
This language transpiles to SQL, so it can be implemented entirely client-side. No need for modules for the database engines, just wrappers round the clients.
I prefer "from first". When I write a SQL query, 99 times out of 100, I'll start with "select * from", then fill in the query, then go back and select the columns. After a few basic joins, "easy" column names have probably grown aliases or have been subsumed entirely. For me, "select * from" is automatic.
This is what I'm most interested in. This is where SQL becomes very awkward and repetitive.
select
customer,
( select sum(revenue) from orders where customer=accounts.customer) as revenue,
( select count(transactionid) from orders where customer=accounts.customer) as orders
from
accounts;
That's a very simple example, but it can get VERY wordy and complex, and in my dreams I'd be able to write something like:
getrevenue(cust) is select sum(revenue) from orders where customer=cust
getorders(cust) is select count(transactionid) from orders where customer=cust
select
customer,
getrevenue(customer),
getorders(customer)
from
accounts;
In a way, it's just dynamic sql without hacking strings together in unmaintainable ways.
My SQL life is seemingly dominated by correlated subqueries and self-joins where the join checks another row does _not_ exist (the row R with the largest column X is the one where there is no row R1 with a larger X1).
I don't believe SQL queries like these can be improved by a metalanguage. They _can_ be meaningfully improved by indentation, whitespace, careful naming and detailed comments.
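A minimal sketch of that anti-join pattern, with hypothetical names: the latest reading per device is the one for which no later reading exists.

SELECT r.*
FROM readings r
WHERE NOT EXISTS (
  SELECT 1
  FROM readings r1
  WHERE r1.device_id = r.device_id
    AND r1.recorded_at > r.recorded_at   -- no row R1 with a larger X1
);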
PartiQL[0] is an open-source library, a superset of SQL, that I really like. It supports querying nested structures inside columns, so if a column contains some JSON data you can use standard dot notation to query the nested data directly.
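Something like the following, assuming a hypothetical orders table whose shipping column holds JSON; the nested path navigation is PartiQL's extension, the rest is ordinary SQL:

SELECT o.id, o.shipping.address.city, o.shipping.address.zip
FROM orders o
WHERE o.shipping.carrier = 'UPS';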
This looks like a Turing-complete language if you add functions.
One benefit of current SQL is that it can be (and often is) rewritten to be executed more efficiently. This requires a simpler, declarative model - far from a programming language.
If you want simple 'forward declaration', take a look at the SQL WITH clause.
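A quick sketch of that forward-declaration style with WITH, using hypothetical tables: each named query can be referenced by the ones that follow.

WITH usa_orders AS (
  SELECT * FROM orders WHERE country = 'USA'
),
big_orders AS (
  SELECT * FROM usa_orders WHERE amount > 100
)
SELECT customer_id, COUNT(*) AS n
FROM big_orders
GROUP BY customer_id;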
\tangent What are today's data transformation needs (do they differ from Codd's inspiration?), how does relational algebra serve them, and how would you design an "SQL" around that? There's got to be a 10x leapfrog in benefit for some niche in there, and that's the gateway to adoption.
Could there be a tool that would translate PRQL to SQL? One could then write ~/my_scripts/closest_points.prql, run a command to get the SQL equivalent, and use that in existing SQL tools that do not currently support PRQL (like Postgres).
I think it would be worthwhile to develop a shorthand of the same thing, suitable for use at the command prompt. Something using symbols as synonyms for keywords. Less legible, but more useful in a future when shell tools understand this syntax.
I worked on a language at one of the big banks that looked very similar. The goal was to have a better-than-SQL language on top of Spark and I think we succeeded at that very well. Unfortunately, politics killed the language and platform.
Have you considered making this a frontend to Apache Arrow or DataFusion? It seems like Flink or ksqlDB could use this as an alternative syntax to produce the same physical queries. Love Kusto!!
Misses autocompletion opportunities: SELECT sucks because the tables follow the columns, while something like SELECT FROM <tables> COLUMNS <columns> would enable autocompletion.
For any who want this kind of pipelining way-of-writing-SQL that has the benefit of existing in live production databases today, I highly, highly recommend looking into Postgres' and Snowflake's LATERAL JOIN keyword.
The TL;DR is that they allow you to reuse annotated and aggregated columns in an incredibly elegant way. Compared to OP's proposal, you still do need to start the query with what columns you want to come out at the end, and normal SQL weirdnesses still apply - but it's far, far, easier when writing massive analytics queries to see the flow of variables from one stage to another.
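A hedged sketch of that style in Postgres syntax, borrowing the gross_salary / gross_cost example from earlier in the thread: each LATERAL step can reuse the columns computed by the previous one.

SELECT e.name, s.gross_salary, c.gross_cost
FROM employees e
CROSS JOIN LATERAL (SELECT e.salary + e.payroll_tax AS gross_salary) AS s
CROSS JOIN LATERAL (SELECT s.gross_salary + e.healthcare_cost AS gross_cost) AS c
WHERE e.country = 'USA'
  AND c.gross_cost > 0;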
"sort sum_gross_cost # Uses the auto-generated column name." ... seems like a huge landmine. Languages really should not have any implicit way of constructing identifiers (among other reasons it is not easily greppable).
You might consider using a syntax like `sum:gross_cost` which can function as a sort parameter and an aggregation, but is actually recognizable as an object instead of having an implicit transformation going on in the background. Like this:
Yes, I agree with the downsides of the existing approach, and you're absolutely right re "Languages really should not have any implicit way of constructing identifiers".
I'd still say it's fine if it implicitly names output columns, for convenience.
The fact that someone felt the need to add that comment hints at a design mistake. Synthesizing symbols is weird, unnecessary and is probably a violation of the principle of least surprise.
Otherwise I think PRQL has some value. Nice work. I strongly suspect that if SQL looked more like this there would be a lot more people willing to use the query language directly and perhaps fewer that are compelled to bury it under shifting layers of fragile abstractions.
There are plenty of good ideas in here, I really like it.
Here's a few things I can say from my experience (I have been working on a "better SQL" for almost 10 years now):
- Your "aggregate by" taking a set of calculations to be performed on the groups is an excellent idea: the fact that "group by X" isn't a separate statement from the "select" that describes the calculations, means you can keep the idea that each line is an operation applied to the result of the previous line.
- Be very careful with auto-generated column names, they will bite you when you implement "Go To Definition" or "Rename Symbol" tools.
- The use of "filter" instead of "where" is a missed opportunity to be easily understood by people who know SQL. I suspect that you wanted to avoid the WHERE/HAVING confusion, but I'm not sure it is worth it. I do appreciate "sort" instead of "order by" (it's good to have only one keyword).
- To support a space-based call convention `f a b` you will pay a very heavy price in the language grammar. Also, allowing optional arguments in this convention will prevent you from implementing first-class, higher-order or partial functions. This may be a price you have decided to pay, if not, look into the OCaml rules for optional arguments, they are very well thought-out (the general idea being, you must have at least one non-optional positional argument _after_ the optional ones).
- Having [X, Y] be your list syntax, and X be a shorthand notation for [X], works out pretty well in practice, so long as you have specific positions in your grammar where lists are expected, so there is no ambiguity between X-as-a-column and X-as-a-shorthand-for-[X]
- General syntax opinions: "from" is fine, your CTE syntax (non-point-free) is fine. Boolean operators should be usable as infix keywords (or prefix if unary), just like arithmetic ones, because that is the first thing users will attempt.
- Raw syntax: for the sake of your syntax highlighter, I recommend having asymmetrical delimiter pairs like [|SQL|] instead of symmetrical pairs like `SQL`.
In addition to the above, one of the issues I have with SQL in an analytics context is that it tends to "lose" the actual tables: since it operates by joining/filtering/aggregating tables into new tables, your "products" table from the database schema is quickly replaced by a CTE named "products5" which (in theory) contains the same lines as the original, plus some additional columns, but (in practice) some lines have been lost or duplicated in the middle. I called this the "vanishing schema" paradox[1]. Your design, with the "let" statements to add columns to a table, solves one portion of the problem, but I wonder if you would consider this worth solving more fully.
Absolutely. If I use a SQL db for my applications (I'm a software dev, for context), I generally write raw SQL rather than using an ORM. I find the long-term issues of an ORM not worth it compared to investing in and understanding SQL.
I'm also not having to learn a new library, in addition to the standard DB connection libraries, ~if~ when I switch a language or platform for some project.
I prefer to write SQL as most alternatives ether have runtime surprises or require more roundtrips to the database. I mostly work on line-of-business software, so if I was doing simple CRUD apps I might have a different opinion.
You kinda have to. Assuming that you're using an ORM, you still need to understand how it translates to SQL, and help it do the translation correctly.
Personally I've seen developers use the Django ORM and create applications with terrible performance. By tweaking the queries, you can help guide the ORM to generate better SQL, which in turn will affect your performance greatly.
We're currently facing a problem with a customer who has an application with terrible performance/scaling issues. The entire thing is very database heavy, but interaction is done solely via Hibernate. I have nothing against Hibernate, it's a fine ORM, but you need to understand it well enough that you can guide it towards better queries (which sometimes involves actually writing SQL). At some point you need to decide whether your time isn't better spent learning SQL directly, as that will always provide you with better access to the functionality provided by the database.
Yes, quite a lot I would imagine. I use it extensively at work and similarly sql heavy software companies in the past. That being said, I’ve also worked at places where they’ve avoided it like the plague – largely because few people were competent at it – and were moving away from relational DBs due to scale.
Of course. Large companies like Pepsi have teams of analysts that only write SQL. I applied for a programming job there a long time ago and didn’t follow up when they explained in the interview that’s the only language they used.
Why not use R's data.table? You can achieve literally the same pipeline style flow, internally it's written in C, it's not the fastest but gets the job done.
If you want real speed but no external package, kdb/q is probably better (has q-sql).
Good work on the effort etc., but I often don't see the value added by these toy projects apart from personal development and a bullet point on the author's CV.