Introducing U-SQL, a language for big data processing

hacker_9 · on Sept 30, 2015

Product looks good.. apart from the most important point from a dev point of view - the syntax!

They created a new query language but didn't change the syntax at all from SQL?? Why does every database product seem to think SELECT FROM is the be all end all? Ok they bolted on C# Linq, and you can reference other assemblies, but these are OLD ideas, and the combination/love-child of these ideas looks horrible.

For example writing SELECT before FROM is a massive productivity problem - mainly because the computer doesn't know what you are doing until you write the FROM table part, before it can finally auto-suggest column names for you.

Don't write SELECT col1, col2 FROM {table}, but how about FROM {table} Select col1, col2, etc.

Second problem with the syntax is it isn't modular: combine 4 tables together and get data from each and you have something like this:

  SELECT a.col1, a.col2, b.col5, b.col1, a.col3, c.col5
  FROM a 
  JOIN b on etc
  JOIN c on etc

What if I want to remove the join to table B, I have to modify the syntax quite extensively. (First comment out parts of the select, then get rid of the from - but this breaks the 'join c', so I need to repeat this process again..).

What if each from-select was it's own piece of code - revolutionary I know.

  FROM a
  SELECT col1, col2, col3
  JOIN
      (FROM b 
      SELECT [3]col5, col1)
      ON etc
      JOIN
          (FROM c
          SELECT col5)
          ON etc

Indenting is used to quickly show which table joined to which. If I want to just run table A, then I highlight the top 2 lines of code and run. Or for table B the 4th and 5th etc, you get the idea. The [3] syntax is used to tell SQL not to append the columns to table A, but insert starting from column 3.

This is what I thought of in 10 minutes, I'm sure there's some holes but already I like it better than the crappy SQL syntax I have to read every day at work :(

cwyers · on Sept 30, 2015

I think your chained, tab-indented from-select structure is the worst idea ever conceived of by mankind, and the only purpose I can think of for it is you want to encourage my carpel-tunnel to flare up again. So much extra verbosity added to the standard SQL of writing that query for no additional clarity. (And you still need the from! Just comment out one line and two columns, it's no harder than commenting out the weird syntax you've come up with.)

Which is not to say that SQL can't be improved upon, but the amount of improvement you'd need to make to make it worth learning a different query language is I think higher than the benefits you can get from improving SQL, especially if you're not going to radically remake it, because being 80% SQL is worse than being 0% SQL because you keep running up against the principle of least surprise not working.

hacker_9 · on Sept 30, 2015

Hilarious that your argument is your carpel-tunnel - Look at the code again and you'll see it's laid out so the computer can fill in the information for you = less typing. By typing the FROM first, the whole next SELECT statement can be auto added with a single key (perhaps `). Or you can write the SELECT yourself, but only have to type the first char of each column name before the IDE auto completes it.

The tab structure shows you at a glance what tables join to what (not uncommon to see this parent-child in XML for example, or you know, any other programming language ever).

Your last point is void - just look at C#, Java, C++, Go and so on. All reuse keywords for different purposes. It's not hard to switch context in your head depending on the language.

M_Rys_Msft · on Oct 1, 2015

Having been one of the co-inventors of XQuery, I completely understand the benefit of placing the FROM in the beginning. However, there is a vast set of existing familiarity among users along SQL that we felt supporting this will help people to learn the language quicker than introducing a new syntax.

As to your join syntax: Conceptually, U-SQL is a functional lambda expression composition language. One of the major lambdas are rowset expressions. Thus in your example above a is by itself a rowset lambda, "a JOIN b ON comp" is a rowset lambda etc. What does this mean? It means that eventually, the language may allow you to just write:

a join b on a.id == b.id

instead of

SELECT * FROM a join b on a.id == b.id;

At that point the SELECT ... FROM clause becomes a projection and applier of scalar transforms and aggregators.

Today, U-SQL is not quite as flexible, but we do support this pattern already for Table-Valued Functions where you can say

TVF(42)

in addition to

SELECT * FROM TVF(42)

Good discussion though... keep it coming!

lgas · on Sept 30, 2015

You may be interested in Date & Darwen's work on The D data language[0] as presented in The Third Manfiesto[1]. It addresses these issues with SQL and many more.

[0]: https://en.wikipedia.org/wiki/D_(data_language_specification...

[2]: http://www.thethirdmanifesto.com1

baldfat · on Sept 30, 2015

Yes and No:

Yes the syntax of SQL is very different and isn't the best.

No the syntax of SQL and data languages are nice because I mentally KNOW that I am doing Data Management because of this vastly different syntax.

I agree a little more on the new syntax side of your argument, but like an old pair of jeans it is comfortable.

Lofkin · on Sept 30, 2015

Solved here: http://www.ibis-project.org/index.html

guard-of-terra · on Sept 30, 2015

So, it's a glorified awk over .csv files?

I think that we as Open Source community can do better than that. We just have to make a product out of sort, join, awk, grep - not meaning making a wrapper over them but rather taking inspiration.

I did a great amount of medium-to-big data crunching in command line, and it's usually okay. There are still some pain points:

- Selecting by list of ids specified in one of columns.

- Joining two files together by field (you want them to be sorted on it obviously)

joshuaellinger · on Sept 30, 2015

Interesting extensions.

The nice part of the syntax is that: 1. it is clear when you are going external. 2. procedural code can be written in C# -- big win there.

But it misses the two big issues in ETL that I always run into:

    1. You have a bunch files that are 95% the same but have 2-5 columns that are different.
    2. Someone sticks the wrong type of data a field.

For those issues, you'd be better off mapping a file (or a group of files) to a virtual table so you can specify the schema up front.

M_Rys_Msft · on Oct 1, 2015

Actually U-SQL gives you the ability to map files to a virtual table (aka view :)).

CREATE VIEW V AS EXTRACT <insert schema here> FROM "/a/b/{*}.csv" USING Extractors.Csv();

If the data is more complex, you can write your own Extractors that for example handle the divergence in columns. Although we are planning on some features here in the future.

Olap84 · on Sept 30, 2015

Interesting to see ETL extensions to (T-)SQL. I wonder if we will see anybody else offer some? (Come on Postgres!)

pella · on Sept 30, 2015

for PostgreSQL check the "Foreign data wrappers" + "IMPORT FOREIGN SCHEMA" ...

* http://blog.taadeem.net/english/2015/03/15/70_Shades_of_Post...

* https://wiki.postgresql.org/wiki/FDW?nocache=1

cbsmith · on Oct 1, 2015

Man, does this look a lot like Pig. Why not just use Pig?

M_Rys_Msft · on Oct 1, 2015

Because Pig's overall adoption is low. 80% of Hadoop jobs last time I checked were written in Hive.

Besides Pig's extensibility story is not that good.

But I am glad you like Pig, because Pig is supported in the system as well :).

cbsmith · on Oct 6, 2015

> Because Pig's overall adoption is low. 80% of Hadoop jobs last time I checked were written in Hive.

Yeah, but putting another name on it doesn't seem likely to change that, does it?

> Besides Pig's extensibility story is not that good.

Really? I consider it far more extensible than say Hive....