I don't think SQL can be more brittle than untyped object-oriented Python. The l...

dragonwriter · on Aug 15, 2023

> SQL is common for ETL exactly because it’s not brittle,

SQL is common for ETL because typically at least one end of an ETL operation, and sometimes both, is an RDBMS, for which SQL is the standard language. It has nothing to do with lack of brittleness.

loveparade · on Aug 15, 2023

I guess it's surprising then that both Hadoop/Hive and Spark, which were the originators of SQL for ETL, typically work on data lakes instead of RDBMSs. In fact, RDBMS support didn't come for a long time. The choice of SQL has nothing to do with RDBMSs. It's because SQL is a declarative language that's easy to parse and convert into a physical query plan that can be parallelized and optimized extremely well. Why is that? Because it's not a general-purpose imperative loosely typed brittle language like Python.

dragonwriter · on Aug 15, 2023

> Hadoop/Hive and Spark, which were the originators of SQL for ETL

They weren’t.

I guarantee you, before either of those existed, when Data Warehousing was often done with a different version/configuration of the same brand of RDBMS as the transactional store (the latter likely using something closer to a normalized schema, the former using a star or snowflake schema), using SQL for ETL was absolutely normal.

Which is why newer data warehousing / data lake systems support SQL even though they aren’t RDBMSs: a couple decades of RDBMS dominance made it the JavaScript of data storage.

> Because it’s not a general-purpose imperative loosely typed brittle language like Python.

It’s not general-purpose or imperative, but it’s just as much “loosely typed” as Python (both Python and SQL are strongly typed).
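To make the Python half of that point concrete, here is a minimal sketch. The helper names `add_values` and `try_add` are hypothetical and only for illustration. It shows that Python checks types at runtime and refuses to silently coerce a string into a number. How a given SQL engine handles the equivalent expression varies by engine and is not shown here.

```python
def add_values(a, b):
    """Add two values; Python checks operand types when this runs."""
    return a + b


def try_add(a, b):
    """Return the sum, or a short description of the type error raised."""
    try:
        return add_values(a, b)
    except TypeError as exc:
        # Python does not implicitly convert "1" to 1 (or 1 to "1").
        return f"TypeError: {exc}"


if __name__ == "__main__":
    print(try_add(1, 2))    # two ints add normally
    print(try_add("1", 1))  # mixing str and int is rejected
```

Types are attached to values rather than variables, so the check happens at runtime, but it is still a strict check.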

It’s not clear what concrete meaning “brittle” is supposed to have in this claim, so I can’t evaluate its accuracy.

Alanhlwang · on Aug 15, 2023

Definitely, I can jump into what we meant by brittle. We mainly meant that SQL scripts are hard to debug and undescriptive: you can't parametrize or customize the error messages you receive from transforms, and you can only execute one complete statement at a time, often a long chain of CTEs (which is a nightmare if it's a 400-line statement). Python makes debugging easier because it turns the approach from a declarative one into a procedural one, and you can even set breakpoints when you write your actual transformers in Python.
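As a rough sketch of that procedural style (not our actual transformer API; `TransformError`, `run_steps`, `parse_amount`, and `add_tax` are hypothetical names made up for this example), each step is a small named function, and a failure reports which step and which row broke:

```python
class TransformError(Exception):
    """Raised when a named step fails; the message names the step and row."""


def run_steps(rows, steps):
    """Apply each (name, function) step to every row, one step at a time."""
    for name, fn in steps:
        out = []
        for i, row in enumerate(rows):
            try:
                out.append(fn(row))
            except Exception as exc:
                # Custom, parametrized error message: step name, row index, row data.
                raise TransformError(
                    f"step '{name}' failed on row {i} ({row!r}): {exc}"
                ) from exc
        rows = out
    return rows


def parse_amount(row):
    """Convert the 'amount' field from text to float."""
    return {**row, "amount": float(row["amount"])}


def add_tax(row, rate=0.1):
    """Add a 'total' field; the 10% default rate is an arbitrary example value."""
    return {**row, "total": round(row["amount"] * (1 + rate), 2)}


STEPS = [("parse_amount", parse_amount), ("add_tax", add_tax)]

if __name__ == "__main__":
    print(run_steps([{"amount": "10"}], STEPS))
```

Because every step is an ordinary function, you can put a `breakpoint()` inside any of them and inspect the intermediate rows, which is harder to do in the middle of one long chained SQL statement.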