Product looks good.. apart from the most important point from a dev point of view - the syntax!
They created a new query language but didn't change the syntax at all from SQL?? Why does every database product seem to think SELECT FROM is the be all end all? Ok they bolted on C# Linq, and you can reference other assemblies, but these are OLD ideas, and the combination/love-child of these ideas looks horrible.
For example writing SELECT before FROM is a massive productivity problem - mainly because the computer doesn't know what you are doing until you write the FROM table part, before it can finally auto-suggest column names for you.
Don't write SELECT col1, col2 FROM {table}, but how about FROM {table} Select col1, col2, etc.
Second problem with the syntax is it isn't modular: combine 4 tables together and get data from each and you have something like this:
SELECT a.col1, a.col2, b.col5, b.col1, a.col3, c.col5
FROM a
JOIN b on etc
JOIN c on etc
What if I want to remove the join to table B, I have to modify the syntax quite extensively. (First comment out parts of the select, then get rid of the from - but this breaks the 'join c', so I need to repeat this process again..).
What if each from-select was it's own piece of code - revolutionary I know.
FROM a
SELECT col1, col2, col3
JOIN
(FROM b
SELECT [3]col5, col1)
ON etc
JOIN
(FROM c
SELECT col5)
ON etc
Indenting is used to quickly show which table joined to which. If I want to just run table A, then I highlight the top 2 lines of code and run. Or for table B the 4th and 5th etc, you get the idea. The [3] syntax is used to tell SQL not to append the columns to table A, but insert starting from column 3.
This is what I thought of in 10 minutes, I'm sure there's some holes but already I like it better than the crappy SQL syntax I have to read every day at work :(
I think your chained, tab-indented from-select structure is the worst idea ever conceived of by mankind, and the only purpose I can think of for it is you want to encourage my carpel-tunnel to flare up again. So much extra verbosity added to the standard SQL of writing that query for no additional clarity. (And you still need the from! Just comment out one line and two columns, it's no harder than commenting out the weird syntax you've come up with.)
Which is not to say that SQL can't be improved upon, but the amount of improvement you'd need to make to make it worth learning a different query language is I think higher than the benefits you can get from improving SQL, especially if you're not going to radically remake it, because being 80% SQL is worse than being 0% SQL because you keep running up against the principle of least surprise not working.
Hilarious that your argument is your carpel-tunnel - Look at the code again and you'll see it's laid out so the computer can fill in the information for you = less typing. By typing the FROM first, the whole next SELECT statement can be auto added with a single key (perhaps `). Or you can write the SELECT yourself, but only have to type the first char of each column name before the IDE auto completes it.
The tab structure shows you at a glance what tables join to what (not uncommon to see this parent-child in XML for example, or you know, any other programming language ever).
Your last point is void - just look at C#, Java, C++, Go and so on. All reuse keywords for different purposes. It's not hard to switch context in your head depending on the language.
Having been one of the co-inventors of XQuery, I completely understand the benefit of placing the FROM in the beginning. However, there is a vast set of existing familiarity among users along SQL that we felt supporting this will help people to learn the language quicker than introducing a new syntax.
As to your join syntax: Conceptually, U-SQL is a functional lambda expression composition language. One of the major lambdas are rowset expressions. Thus in your example above a is by itself a rowset lambda, "a JOIN b ON comp" is a rowset lambda etc. What does this mean? It means that eventually, the language may allow you to just write:
a join b on a.id == b.id
instead of
SELECT * FROM a join b on a.id == b.id;
At that point the SELECT ... FROM clause becomes a projection and applier of scalar transforms and aggregators.
Today, U-SQL is not quite as flexible, but we do support this pattern already for Table-Valued Functions where you can say
You may be interested in Date & Darwen's work on The D data language[0] as presented in The Third Manfiesto[1]. It addresses these issues with SQL and many more.
I think that we as Open Source community can do better than that.
We just have to make a product out of sort, join, awk, grep - not meaning making a wrapper over them but rather taking inspiration.
I did a great amount of medium-to-big data crunching in command line, and it's usually okay.
There are still some pain points:
- Selecting by list of ids specified in one of columns.
- Joining two files together by field (you want them to be sorted on it obviously)
Actually U-SQL gives you the ability to map files to a virtual table (aka view :)).
CREATE VIEW V AS EXTRACT <insert schema here> FROM "/a/b/{*}.csv" USING Extractors.Csv();
If the data is more complex, you can write your own Extractors that for example handle the divergence in columns. Although we are planning on some features here in the future.
They created a new query language but didn't change the syntax at all from SQL?? Why does every database product seem to think SELECT FROM is the be all end all? Ok they bolted on C# Linq, and you can reference other assemblies, but these are OLD ideas, and the combination/love-child of these ideas looks horrible.
For example writing SELECT before FROM is a massive productivity problem - mainly because the computer doesn't know what you are doing until you write the FROM table part, before it can finally auto-suggest column names for you.
Don't write SELECT col1, col2 FROM {table}, but how about FROM {table} Select col1, col2, etc.
Second problem with the syntax is it isn't modular: combine 4 tables together and get data from each and you have something like this:
What if I want to remove the join to table B, I have to modify the syntax quite extensively. (First comment out parts of the select, then get rid of the from - but this breaks the 'join c', so I need to repeat this process again..).What if each from-select was it's own piece of code - revolutionary I know.
Indenting is used to quickly show which table joined to which. If I want to just run table A, then I highlight the top 2 lines of code and run. Or for table B the 4th and 5th etc, you get the idea. The [3] syntax is used to tell SQL not to append the columns to table A, but insert starting from column 3.This is what I thought of in 10 minutes, I'm sure there's some holes but already I like it better than the crappy SQL syntax I have to read every day at work :(