Hacker News new | past | comments | ask | show | jobs | submit login
Select * from cloud (steampipe.io)
790 points by kiyanwang on Sept 30, 2022 | hide | past | favorite | 282 comments



Since starting my career, the tech stack I use has changed numerous times. Technologies have come, gone, evolved. But except for a brief period in 2010 where we thought we could kill it, SQL is the one technology that has stayed around. It might be the only TECHNICAL knowledge I learned in the first year of my career that is still relavent. You're probably not making the wrong decision if you decide to build on top of SQL.


> except for a brief period in 2010 where we thought we could kill it

Unfortunately, we never actually tried to kill SQL. We tried to kill relational databases, with the death of SQL as a side effect. This proved to be a stupid idea - non-relational data stores have value of course, but so do relational databases.

SQL, as you point out, is 48 years old. It shows. The ergonomics are awful. It uses natural-language inspired syntax that has fallen completely out of favor (with good reason imho). The ordering of fields and tables precludes the use of autocomplete. It uses orthogonal syntax for identical constructions, i.e.

    INSERT ... (col1, col2, ...) VALUES (val1, val2, ...)
vs.

    UPDATE ... col1=val1, col2=val2, ...
And worst of all, it comes with such a woefully inadequate standard that half of the universally supported features are expressed in a completely different fashion for each vendor.

It is very common to argue that its longevity is due to its excellence as a language. I think it's longevity is due to its broad target audience, which includes plenty of people in less technical roles who are high on the org chart. Tech workers are happy to jump on new technology that improves ergonomics (or shiny new toys that are strictly more interesting because they are new). Executives who's primary way of interacting with the tech stack is querying a read only replica are much less open to learning a new language.


SQL expresses relational algebra [0] which is the language defining data relationships and manipulation. It hasn't changed in 50 years for the same reason why B-Trees and sorting algorithms haven't: it's foundational to computer science.

Database-specific "SQL extensions" are in my experience just administration commands, e.g. `VACUUM` or `CREATE EXTENSION` in Postgres. They help operate the DB, but have little to do with the actual data manipulation.

Killing SQL is like trying to kill computer science: you better come up with something better than a synonym.

[0] https://en.wikipedia.org/wiki/Relational_algebra


Nothing about relational algebra says you have to put your projection before you specify what relations are involved. SQL is a bad fit for modern tooling in ways that are a property of the language and are unrelated to it expressing relational algebra.

The notion that nothing has changed about B-Trees or sorting algorithms doesn't, I think, actually hold up... but this question is more like changing the API for a standard-ish B-Tree library than about modifying how the B-Tree itself works.


And that’s why Microsoft’s LINQ puts the entities (table names) first and then autocomplete just works. https://learn.microsoft.com/en-us/dotnet/csharp/linq/perform...

As an aside: LINQ is the only “ORM” that is truly worth using. It is integrated into the language and you can use the same expressions and query expressions are first class types that can be passed around as expression trees and are then only translated either as code if you’re using in memory lists, database specific SQL, MongoQuery etc based on the provider you pass the query to.


> LINQ is the only “ORM” that is truly worth using

I've been thinking a lot about ORMs and they almost usually fall flat at some level since they are not nearly as expressive as a first order language is. I haven't had experience with LINQ but I know the Django ORM is this way.

I'm thinking the best approach if I had time/money would be to develop a better query language that is close enough to SQL for people to learn, but then also translates back down to SQL. LINQ looks really close to this, but I'd really want it to be cross language.

One of the aspects of QUEL that I really liked was that if you couldn't describe what you wanted, you could program it out if you needed to. Say you had some complex ranking algorithm that was hard to write in SQL directly.


> I'm thinking the best approach if I had time/money would be to develop a better query language that is close enough to SQL for people to learn, but then also translates back down to SQL

Have a look at EdgeQL, the query language that powers edgedb.


LINQ is a full query language you can left join, right join, inner join limits, etc. It even maps C# functions to Sql functions. If you can express it with SQL, you can express it with LINQ.

Cross language will still be an issue. LINQ isn’t really an “ORM” it is a method to translate C# commands used to work over related collections to expression trees and those expression trees are interpreted/translated at runtime by a provider.

You need support from the language/runtime to treat expression trees as a first class type.

expression tree parser. It’s not for the faint of heart.


> If you can express it with SQL, you can express it with LINQ.

Not if it uses CTEs or other advanced SQL features. It can do basic select, join, group, sure. But that's just scratching the surface of how powerful SQL is. As for requiring language support: in Elixir, Ecto.Query essentially implements a LINQ-like DSL as a library (albeit with macros). See https://hexdocs.pm/ecto/Ecto.Query.html


From what I see a CTE is just an expression that you can reference.

In C#, if you did something like this

var seniorMales = from user in users where u.Age > 65 && u.Sex == “male” select u

seniorMales is just an IQueryable<User>

Later on you could say

var gaMales = seniorMales.Where(s => s.State == “GA”)

var floridaMale = seniorMales.Where(s => s.State ==“Florida”)

All three of those are just expressions that have not hit the database yet.

Then when you want to actually run the query and limit the number of rows, you can do

var result = floridaMales.Limit(20).ToList()

It would then create a query including a limit clause just as you would expect it to.

Now if you do

floridaMales.ToList().Limit(10)

It would return all of the rows in the table to the client and it would be limited on the client side (don’t do that).

Worse case, if LINQ can’t express in query syntax, a provider can add its own extensions in function syntax and the provider can still parse the expression tree.


Yes, I've been a professional C#/.NET developer before and understand LINQ and Entity Framework. The difference is that CTEs allow you to reference multiple tables at once and do inserts and updates on the intermediate result set. You can also use recursive CTEs to do hierarchical and graph-style queries, which you can't do with LINQ as far as I know.

LINQ also can't do window functions. Like I said, it covers the basics of SQL that most applications need. But as soon as you need to go beyond that you'll be writing custom SQL.


Never used C#, but in Java land we have a similar dsl jooq, which supports everything i’ve ever needed, including window functions https://www.jooq.org/doc/latest/manual/sql-building/column-e...


Yeah, there's nothing else quite like jOOQ as far as I know. I think SQLAlchemy in Python is the only other library that even comes close.


jooq is amazing and one of the things I really miss any time I leave Java land.


Then don't leave


> LINQ puts the entities (table names) first and then autocomplete just works

I grew up with SQL so LINQ for a long time just reads so weird to me. I won't admit how long it was before I realized this fact you stated.

It still reads weird to me, but I get it.


SQL does express relational algebra, but OP is arguing it expresses it poorly, in a cumbersome way. There's nothing foundational about how exactly SQL chooses to express relational algebra, you could imagine many different syntaxes and specific semantics.

I concur, SQL is a liability in many ways.


>Killing SQL is like trying to kill computer science

SQL is a language that expresses relational algebra, killing it is like killing any other language i.e. nearly impossible because there is still code sitting around written in it, thus needing competent programmers to maintain, competent programmers who will be asked to write more code and will choose to do it in SQL if they can get away with it, rinse and repeat and so forth.

Once a language has a certain installed base it may be eternal - what size that installed base is I don't know.


The installed base is kind red herring, because most of the worlds largest companies use Oracle to hold their data. So you'd have to get Oracle on board.

Could there be a better alternative, oh yes, most certainly.

E.g. Every company I worked for needed to do SQL migrations. Unfortunately that isn't a part of SQL. Could it be? Yes! Is it? No! So everyone re-invents the wheel...


The Third Manifesto[0] is 27 years old at this point. If it were possible to take seriously the proposition that SQL and relational algebra over databases were equivalent before its publication, it hasn't been since.

[0]: https://en.wikipedia.org/wiki/The_Third_Manifesto

I've been trying to get my hands on a corpus of Dataphor code, because Tutorial D is unsatisfactory for different reasons than SQL. What I've been able to glean of Dataphor is quite promising, but it isn't much.


You mean it butchers relational algebra; composition is the name of the game, and yet that's somehow incredibly cumbersome in SQL.


Db-specific extensions unfortunately aren't only admin commands. They include the whole PL/SQL thing, with concepts like triggers and stored procedures that should definitely be fundamental in SQL, but that nobody bothered to standardize, and eventually everybody implemented as they liked.


Stored procedures and even triggers are non-relational and have little to do with SQL proper. They are like Lua scripting inside nginx which are not directly related to anything HTTP.

On one hand, it might be nice to have completely portable triggers. On the other hand, the lack of standardization allowed to have approaches as differer as PL/SQL and T-SQL to emerge. It would be terrible to be stuck with some ancient COBOL-like SP / trigger syntax, required by the standard.


As someone who has worked across many DBs, I find the lack of a unified standard for procedures and triggers appalling and very time-consuming.

Granted, stored procedures and triggers aren't part of the relational model (strictly speaking). But they are the first layer built on top of the core - the one that packs multiple queries/statements together, introduces variables, and allows custom logic to be executed when the data changes.

The lack of standardization means that migrating from e.g. Oracle to Postgres/MySQL involves knowing the dialects and statements supported by each of them in order to migrate the PL/SQL or T-SQL logic - and, especially in the case of MySQL/MariaDB, some of those constructs may not be available at all.

That creates really a lot of friction when it comes to database migrations, especially for large databases. I've myself been working with this stuff for years, but I still have to regularly lookup how to write an IF or declare a variable for this or that DMBS, since we have such a proliferation of dialects and standards that one person can't keep all the variations in their mind.


Relational algebra is what makes SQL great: some actual strict math behind the design.

Ad-hoc syntax with hard-to-compose parts,is what makes SQL arbitrary and quirky. I would like a more algebraic syntax, with uniform, composable parts.


> SQL expresses relational algebra

Actually, no. Relational algebra is based on sets. SQL is not. Perhaps you are confusing SQL with QUEL, which was used by Postgres in its younger days? QUEL evolved out of Codd's original Alpha language.


How is SQL not based on sets? One of the big issues I see when people first come to SQL is not thinking in sets. Instead they think in individual records and loops, which is the exact opposite on how to think about SQL.


Sets are collections of unique elements, and indeed relations are sets of tuples. But SQL only sticks to the tuples part, foregoing sets. It's quite possible, and quite likely if you're not careful, to end up with duplicate entires in your results to an unwanted effect.

SQL is only relational-like. With experience and care you can craft your queries in such a way that you do get true sets (e.g. using DISTINCT), but I see beginners – and sometimes even experts in a rush – overlook this quirk in a complex query resulting in what might seem okay in limited testing, but blows up when there is more data than the bare minimum needed to test. A language that truly follows Codd's relational model would produce sets always.

I expect SQL is considered difficult to learn exactly because of this deviation. It's quite unintuitive, especially if you go in thinking that it is actually relational. Nothing you can't work around with sufficient knowledge and experience, but that added knowledge and experience required is where the difficulty no doubt stems from and undoubtedly scares many away out of frustration when it doesn't work like you think it should in the interim.


> It uses orthogonal syntax for identical constructions, i.e.

Regarding your complaint ... those aren't identical constructions.

Insert has to have the syntax it does because it lets you insert multiple rows all with different values in a single statement:

    INSERT ... (C1, C2) VALUES (v1, v2), (v3, v4), (v5, v6)
Update doesn't allow that. The syntax for each of the above makes sense for the specific intention of each statement.


They could have consolidated the syntax for both, though. No reason you couldn't have this instead:

    INSERT ... VALUES (C1=v1, C2=v2), (C1=v3, C2=v4), (C1=v5, C2=v6)
This would actually be more powerful because it would allow you to arbitrarily omit columns on a per-row basis, which you can't do with SQL today.


You can omit columns on a per-row basis with the DEFAULT keyword, it's just more cumbersome if your rows are very sparse. Your proposed syntax of:

  INSERT ... VALUES (C1=v1, C2=v2), (C3=v3, C4=v4), (C5=v5, C6=v6)
in standard SQL is

  INSERT ... (C1, C2, ... C6) VALUES (v1, v2, DEFAULT, DEFAULT, DEFAULT, DEFAULT), (DEFAULT, DEFAULT, v3, v4, DEFAULT, DEFAULT), ...


It would maybe be more powerful, but also less efficient. If you want to insert different sets of columns, you just use multiple INSERT statements. The point of the current syntax is to be able to efficiently bulk-insert many rows using the same set of columns. That set of columns needs to be validated and mapped to the internal offsets (or whatever) only once for inserting an arbitrarily large number of rows, instead of having to repeat that validation and preprocessing for each row.


And the point of gp is that SQL could do some work and sort these (a=b) values into groups that fit its inner requirements on its own. You can just omit a useful feature because it creates a minor inconvenience at a preparation phase, or you can include it. This is exactly these bad ergonomics some grand-commenter talked about.


What a bizarre use case. I highly doubt your scheme performs better than just using PREPARE AS INSERT to create a prepared statement and execution plan on the database side. Or are we still generating SQL using string interpolation in 2022?


Most people have an orm between them and the database.

Think bulk inserting into via the command why the format exists vs what you do in your language and/or framework.


If you call row.save() on one ORM object at a time, you'll get neither the performance benefits of prepared statements nor that of multi-row inserts. If you save many objects at a time, the ORM library will use prepared statements.


We're certainly still slurping up bulk SQL commands from raw text files in 2022. Kinda odd not to come across that use case.


I think it is pretty standard usage for ETL tools.

You are not inserting a few dozen records, you are inserting a several thousand records each query.

In fact, sometimes insert speed is measured in MB per second, not records per second.


They still aren't identical constructs, because the intended action is very different; multiple insertions result in multiple distinct records all with different values while multiple updates result in multiple records with the same new values

Having the syntax identical is inferior in every way except for the case where you want to build the string up programmatically, and that is a trivial problem to solve for the caller.

I think your proposal works better for an UPSERT keyword, as that is yet another different intended action.


And further, insert … (select …)


> Unfortunately, we never actually tried to kill SQL. We tried to kill relational databases, with the death of SQL as a side effect.

I don’t know who “we” is, but, before NoSQL tried to replace RDBMS(well, starting before, these things overlapped it), there were efforts to provide alternatives to SQL for RDBMSs, some examples:

Query-by-Example. https://en.wikipedia.org/wiki/Query_by_Example

the D class of data languages. https://en.wikipedia.org/wiki/D_(data_language_specification...


No mention of PRQL yet, it's an effort to tackle some of these shortcomings

https://prql-lang.org/


Which is plain terrible and unnecessary. SQL works, it's easy to understand, doesn't need any kind of abstractions or stupefying measures like PRQL.


I think your last sentence gets at the crux of it. I think SQL is truly great for that use case of analytical queries over a read only database / data warehouse / parquet / whatever (ie. "OLAP" use cases). I think it's less great for the "OLTP" use cases. It still works fine, which is why it sticks around, but other approaches are more convenient for doing lots of smaller inserts, updates, and lookups and fewer table scans and aggregations.


You are selling SQL way short. SQL has stuck around because, compared to its complexity, it lets you do incredibly powerful things.


Look at the analogy to general purpose programming languages.

Every single one, without exception, lets you do incredibly powerful things compared to its complexity.

This hasn't stopped the progression of general purpose programming languages, and I think the vast majority of developers would agree that there have been huge improvements in the last 50 years.


TIOBE [0] reckons that the top 10 programming languages are currently Python, C, Java, C++, C#, Visual Basic, JavaScript, "Assembly Language", SQL and PHP. The first five languages represent more than 50% of the users in TIOBE.

Apart from SQL, all of these languages are imperative, and most of them derive from C in some shape or form; K&R C was published in 1978, 44 years ago.

So, while I'd agree that there have certainly been huge incremental improvements in imperative programming in the last ~50 years, when we look at what people use on a daily basis, it's not really clear to me that the changes have been any greater in significance than the changes that SQL has gone through during the same period.

[0] https://www.tiobe.com/tiobe-index/


Yet every one of these “ancient” languages allows you to have human-readable common expressions, functions, sorts of classes, identifier scoping, heavy code reuse, flow parametrization, higher order operations (even C with some effort), libraries, frameworks, package management, interoperability, to name a few.

SQL is really good at its in-place one-time this-specific-case relational querying.

Things SQL is not even mediocre at: programming.


But aren't most people using ORM's? In which case the SQL code is more like a kind of byte code. SQL definitely is supper powerful, but it's also embedded deeply in all of the RDBMS's and tooling around them etc.

Lots of stuff untilizing SQL and the ability to do lots of cool stuff with SQL, doesn't actually mean that the Developer Experience or ergonomics around using SQL are good.


It's also great that much of the time if I come up with a crappy query... SQL knows how to optimize it for me. Easy to use, and fixes my stupids.


That's nothing to do with SQL, that's the database engine you're using (lexing, parsing, then) planning & optimising your query as best it can.

Much like C code say, it's not a feature of C the language that gcc, for example, can optimise your crappy code for whatever target platform.


I also feel like part of the staying power of SQL is because it's the lingua franca of the database. I wonder if it's time to build and popularize some lower level language that could be targeted by ORMs and others to speak to DB engines. If we had a more flexible assembly language we might get more freedom at the programmer language level.


MySQL has something in that direction, where some operations can be expressed as expression trees by using Protobuf where a client can put some arbitrary query language on top: https://dev.mysql.com/doc/internals/en/x-protocol-messages-m... it is still a bit limited, though (like no join ...)


C# already does that, it converts LINQ to expression trees that are then converted by providers to their target query language. I’ve used it on one project to pass around the same query expressions to SQL server and Mongo - you pass a type of IQueryable<T> which represents an expression tree.

For instance

var seniors = query(from u in Users where u.Age > 65 && u.Sex == “male” select u)

IEnumerable<Users> query(IQueryable<Users>)


I've been using SQL for ~25 years and don't think the ergonomics are nearly as bad as people say. In your example the insert is done that way for good reason.

I know people hate listing the fields before the tables, but when I deal with data that's how I tend to think. I need X, Y, Z, now where do I get it from and do I need to filter.

From a vendor standpoint, it can take time getting new standards implemented, but it's definitely not half. One of the big issues here is that MySQL was and still is in some ways woefully inadequate. It's shortcomings are also often make people think they are relational database shortcomings in general.


> The ergonomics are awful. It uses natural-language inspired syntax that has fallen completely out of favor

I think SQL is so popular and has staying power for the same reason that our ridiculous gregorian calendar has staying power.


The reason we can't replace SQL is that it is the lingua franca between many client tools and data sources. From programming languages, to Analytics tools, to ETL tools to bulk data transfer tools, everything knows how to define SQL queries. On the datastorage side, almost everything is standardizing on SQL as its API: RDBMSes, but also NoSQL storage (etc, hadoop + hive, etc), data virtualization layers (denodo, aws athena, steam pipe).

You can't change this by providing a better query language in one of the datasource products, or by providing a better ORM. Because of the network effects it will be very hard to come up with something better that is universally adopted. Many have tried though but it's always limited to a small stack: MDX, DAX, Linq, Dplyr, Graphql, etc.

There may be opportunities to replace SQL as soon as we need something that moves away from relational algebra. Currently you see al lot of adoption on graph storage and graph query languages in the Data Fabric space, as users need to build queries reasoning about relationships between different datasets in the enterprise.

The other reason Data Fabrics could offer an opportunity here is that they're basically adding an layer of abstraction between all data sources and data consumers, and they the possibility to translate SQL into something else, e.g. graphql.


> SQL, as you point out, is 48 years old. It shows.

That's irrelevant and a fallacy. Plenty of amazing things are old and have stood the test of time. Vim and SQL have empowered me my entire career.


> It uses orthogonal syntax for identical constructions

There is an alternative INSERT syntax matching the UPDATE SET one:

  INSERT INTO table SET col1='val1', col2='val2';


The SELECT syntax is the one that's backwards IMHO (and that order hurts autocomplete too).


I disagree, the people high on the org chart are clamouring for cloud-based low structure stores on microsoft azure cloud maintained in India, while we warn them that we can migrate at a giant cost for a net loss in efficiency.

SQL and dbs are fine, they work, we could discuss beautifying sql for sure, but frankly we shouldnt solving data structuring problems that were already solved 50 years ago: we have real business to support instead.


The ergonomics issue of SQL can be workaround by transpiler.

I do like SQL the syntax but it's hard to discard the ecosystem around SQL.

For instance, I made better-sql [1] recently which generate SQL from a language similar to GraphQL and EdgeDB query language.

[1] https://better-sql.surge.sh


I hope that those two syntaxes could be unified in some future version.

Also, surely allowing FROM and SELECT clauses to be swapped in order shouldn't stress out any parser too much.


FROM and all subsequent JOINs, to be precise. They all generate a single namespace which SELECT uses. As far as I understand the grammar, FROM isn’t even a separate clause, but is itself a separator between a columnset and an expression of the form `t1 JOIN t2 ON cond12 JOIN t3 ON cond123 JOIN …`, where t<n> may be either a table name, or an arbitrarily nested subexpression of the same form.


When I was first learning to code, I once told someone that working with SQL felt like trying to wrestle with a mainframe, and I stand by that assertion.


To my primitive brain, NoSQL always sounded like "I am going to makeup data model and integrity on the fly and let the user figure out how to handle it". No Thanks. But may be I don't get it. GraphQL honestly gives me the same vibes. Happy to be corrected.


> GraphQL honestly gives me the same vibes.

Not that this is particularly relevant to the discussion, but GraphQL is actually much closer to SQL than most NoSQL solutions because it's statically typed and the schema is statically defined. The GraphQL schema acts as the contract of what the API can serve to the client, much like a schema in a relational database.


"Replacements" for SQL have been around as long as SQL. Remember the OODBMS rage of the eighties and nineties?

https://en.wikipedia.org/wiki/Object_database

I also wanted to kill old, obsolete, crummy SQL when I first learned it as a teenager. It reminded me of FORTRAN or COBAL or something.

It took quite a lot of theory and maturity to understand why RDBMS is the right paradigm. SQL is imperfect, but it's good enough to not be worth replacing.

Whenever I've seen benchmarks of postgresql managing JSON versus mongodb, postgresql won (at least comparing apples-to-apples; postgresql is much more conservative by default, in terms of reliability and data consistency, than mongodb). There's literally no reason to going with a lot of NoSQL databases over using a SQL database as a key-value store.


> SQL is imperfect, but it's good enough to not be worth replacing.

Great way to put it.

> There's literally no reason to going with a lot of NoSQL databases over using a SQL database as a key-value store.

To be devil's advocate, the arguments I hear range more on ease of ops/scalability for noSQL, apparently some do it better than SQL implementations, and some drivers can make trivial operations easier as well. That said, I'd pick PostgreSQL for pretty much any new project I get to work on.


It turns out that most SQL databases, when used as a key-value store, shard better than most NoSQL databasing.

You run into issues with things like JOINs across shards, but there's literally no upside even there to NoSQL versus using a SQL databased in a disciplined way. I've designed systems with the same type of infinitely horizontally scalable KVS-type storage for all the changing state in SQL.

SQL also means you can do things like:

- Do local joins (if e.g. all the data for a user is in the same shard)

- Keeping a small set of relational data (not the stuff which needs to scale) and have one technology

- Use read replicas to scale some of the stuff which doesn't fit in the "infinitely horizontally scalable KVS" model

.... and so on. You can't not think about it, but if you do think about it, it's all upside and no downside.


> I also wanted to kill old, obsolete, crummy SQL when I first learned it as a teenager. It reminded me of FORTRAN or COBAL or something.

> It took quite a lot of theory and maturity to understand why RDBMS is the right paradigm.

You and the guy you are replying to are doing it again, conflating SQL and relational databases. :(


No. I'm not. I said:

1) "RDBMS is the right paradigm." That's high praise. RDBMS is exactly the Right Thing, with a trademark and all.

2) "SQL is imperfect, but it's good enough to not be worth replacing." That's definitely not high praise. It means that this implementation has all sorts of warts and annoyance. However, that doesn't rise over the bar of replacing.

As a teenager, I definitely did conflate the two. Perhaps that's where your misread came from?


It probably comes from 90% of discussion on SQL revolves around de-conflating the two. It’s pretty much impossible to complain about SQL without clarifying the difference, and even then. Really I think we should just permanently assume the other party has never conceived of such a difference, no matter how much they insist otherwise.

SQL is here to stay, if only because few can be convinced that another language could query the RDBMS directly (without first compiling down to SQL). Peolle also keep persisting in this belief it’s standardized, for god knows what reason


In abstract, I don't see a lot of downside to compiling down to SQL as an intermediate format. Or vice-versa, for that matter.

In practice:

1) Most attempts to do this historically have missed the point of an RDBMS (e.g. most ORMs).

2) SQL is cumbersome, but it's not quite cumbersome enough I'd want to bother with something else.

As a footnote, I find SQL-grade standards to be incredibly helpful. In 2022, I wish there were a strict standard, but in new domains, standards like this mean:

1) I can read code for virtually any database, with maybe gentle use of search engines, unless I get into really hairy corners. Ultimately, those, I can look up too.

2) If I write code for a new database, I don't need to learn anything; I can understand syntax with a few web searches.

3) For 90+% of SQL, I can write automatic scripts quickly and easily to translate between databases (when migrating, the remaining few percent, someone can do by hand, and it's a very tractable chore).

That's not the case jumping into a graph database or other stuff.

At the same time, they don't over-constrain things. If you want your database to have a BCD type, JavaScript stored routines, or some wonky form of virtual tables, you can.

Now, in 2022, we know enough to standardize all of this stuff, but I'm not sure we did when these tools were coming out.

As a footnote: I'm working in a domain with no standards, and if everyone could just pick JSON (or XML, or just about any one thing), we'd already have 50% of the benefit of a full standard. If people standardized a few nouns and verbs (e.g. 'user' versus 'user_id' versus 'actor' versus 'agent' versus ...), we'd be another 50% of the way there. Systems do different things, and I don't think going 100% of the way makes sense until we understand the domain better, but partial standardization is a huge win. I've been pushing hard for having partial standards, and hard for not having full standards yet.


Sometimes you just want to store data and worry about validation later. I'm in that situation now, with different tasks requiring different kinds of validation for the same data. Sometimes I want all the postal codes that users have entered, sometimes I want only the valid postal codes.


Just to be pedantic you can use SQL to query NoSQL databases, just look at the PartiQL ExecuteStatement API for DynamoDB. I assume your question is about why anyone would use a NoSQL databases instead of a relational database. The short answer is scale. If you have well define access patterns that rely only on a primary index, you can effectively scale to billions of QPS. If you want to make relational queries, and/or can’t know your access patterns in advance, use a relational database.


NoSQL is like simple assignment to storage and one layer of obvious indexing. Its lower level, and therefore much faster and predictable because there is no query planner guessing how to assemble things.

Tuning SQL on the other hand is a guessing game with the query planner. You can express a lot more but 1% of those expressions could down your DB.

I like both, but if you want performance NoSQL is the winner.


> if you want performance NoSQL is the winner.

Performance for which use cases?

It matters to clarify this because if you get rid of every use case SQL+RDBMS optimize for, NoSQL is obviously going to be the winner. A car with no safety measures and without most consumer-oriented features is probably going to be "faster" than one that has them.


I have plenty of beef with SQL, but your insert versus update example doesn't make sense to me. That's a perfectly reasonable difference; insert is always a dense operation while update is usually a sparse operation.


It's also the simplest use case of INSERT imaginable and fails to be applicable with a slightly more complex one:

  INSERT INTO table1 (col_1, col_2, col3)
  SELECT COUNT(col_6), col_4, col_5 FROM table2 INNER JOIN...


I think in principle we could allow

    INSERT INTO table1
        col_1 = COUNT(table2.col_6),
        col_2 = table2.col_4,
        col3 = table2.col_5
    FROM table2 INNER JOIN ...


You can always do

    update t set (x, y) = (n, m);
At least you can in more recent versions of Postgres, and I think it’s standard SQL.


Hey man if SQL is so great why are people always trying to kill it??!?!?

/s


Declarative vs Imperative Language.

A huge benefit of SQL is that it's a declarative language (you describe what the output should be); whereas with imperative languages like C/Python/Ruby you have to describe how to generate the output with specific instructions and procedures.


There's plenty of improvement to be had while still keeping it declarative.


I consider SQL imperative. You say exactly how to do what you want, not what do you want.

If we take as example a table with a single row:

In imperative (SQL):

INSERT INTO TABLE1 (name, wage) VALUES ('John','50k');

Then one day John gets a raise:

UPDATE TABLE TABLE1 SET wage = '60k' where name = 'John'

In declarative (pseudo code in yaml):

- table: TABLE1 - name: John wage: 50k

Then one day John gets a raise:

- table: TABLE1

  - name: John

    wage: 60k


It's sort of a weird hybrid monster really, some times it's the worst of both worlds.

You state what you want, yes - but you are forced to articulate it as a specific set of table navigations / logistics through the relational model, as if you were writing the implementation. Then, however, the database then may choose ignore those and do something else to resolve the data you asked for if it wants, if it can prove the outcome is equivalent.

For example I want all the ice creams bought by John. Even though the schema knows the foreign key relationship between "user" and "purchase" I have to tell it back to the database engine in my query. But even after doing that there's no requirement the database will actually implement the steps I was forced so ungraciously to specify. It may "optimise" them away and do something else.


You’re describing to the RDBMS what you wanted added to the database. The SQL/RDBMS takes care of how that’s done.

In C, you’d have to program how the data should be stored (data structure) and written to disk.


> In C, you’d have to program how the data should be stored (data structure) and written to disk.

Just like declaring the columns you're specifying values for and then the values for them?

If SQL was declarative, you wouldn't have an error on duplicate CREATE, there would be no CREATE OR REPLACE, and changing column types would not be an error. It would just 'table x should be like this, make it so'.

(Actual queries/projections I would say are declarative, just the nomenclature pretends they're not. (Select from join all sounds very imperative, but really you're just describing what you want, and have no say over how it's retrieved.))


Is there a formal specification of what “what” and “how” means in this context? I’m familiar with your description, but it appears subjective to me because the phrases “what the output should be” and “how to generate the output” are poorly defined.


You might notice that the only way to create a declarative language is to use an Imperative language.

Declarative languages are basically a DSL, which (hopefully) translate the desired steps into efficient instructions. Nonetheless, your cpu will execute imperative code at the end.

SQL is an example of a very well established and generally well done declarative language, but that doesn't mean that declarative languages are inherently better.


I think you're reading into something that isn't there. GP never said anything about declarative being better than imperative, merely that they are different approaches and SQL is declarative.

But even if they had, your argument that the CPU ends up running imperative code makes it better seems silly. The CPU ends up "running" machine code and a compiler has to turn the vast majority of imperative code into a different form. Does that make machine code better than assembly? better than C?

I don't think so. They are simply different levels of abstraction, each with their own pros and cons. Neither are "better" than the other. Each are "better" at some tasks and worse at others.

GP said there was a "benefit" with SQL being declarative, because when you want to abstractly request some arbitrarily structured data, it's beneficial in most cases to not have to know exactly how to find and retrieve that data. It's certainly not "better" if you need to ensure some specific bit format on the hard drive. But it's "better" if you want to succinctly express a query that is broadly reusable and understandable even to people who have no idea how the database works on the inside.


You're reading something into my comment I didn't want to say too

I just felt the need to point out that declarative languages are essentially always a DSL, wherever this DSL is actually performant and should be used depends on it's implementation.

Generally speaking, SQL is very well implemented so using any of the well established databases is probably a good choice. Nonetheless, few declarative languages come even close to SQLs efficient implementation so they're very rarely the answer.


There’s nothing stopping you from writing a compiler or interpreter in a non-imperative language like prolog.

Also performing the equivalent of a table join in a non-declarative query language is going to be a burdensome task. It’s certainly beneficial to let a query planner figure out the details for you, instead of iterating over all the rows of your data.


Which compiles it to bytecode, which is Imperatively executed on your cpu.

You didn't think that one through, did ya?

I'm not even sure where your outage comes from. DSLs aren't inherently bad either.


Design and analysis of hardware for high-performance Prolog

https://www.sciencedirect.com/science/article/pii/S074310669...


It means you can let the database engine determine the best way to get you what you are after. If you write imperative code, you are telling the DB exactly what to do and you don't really give it much ability to optimize your queries. Before SQL most DB engine queries were imperative and it was terrible... which is why we don't do it anymore.


People keep twisting themselves into knots trying to fit things into these labels, but the net effect is unrelated to the description.

Using a "declarative language":

  my_bucket = aws_s3_bucket(aws_region, bucket_name)
Using an "imperative language":

  if aws_s3_bucket.exists(aws_region, bucket_name):
    my_bucket = aws_s3_bucket.update(aws_region, bucket_name)
  else:
    my_bucket = aws_s3_bucket.create(aws_region, bucket_name)
I prefer the latter. In every DSL I've ever used, people end up needing to handle weird edge cases, and it's extremely hard to do that with a "declarative-only" DSL, so they end up adding imperative-ness to the DSL. Give me a regular "imperative" programming language and lots of convenience functions that do black magic behind the scenes, and I'll do regular programming when the black magic falls short.

This is basically why AWS CDK / Terraform CDK / Pulumi exist.


...and you have nice imperative race condition right there in your simple example.


bravo.


From a simplistic point of view, yes, but from a practical standpoint, no. IaC assumes a global lock on any resources being managed. As soon as that expectation is violated, everything starts breaking, regardless of how good the code is.


Furthermore, the only way to implement an imperative language is to use a declarative language.

Verilog and VDSL, you see, are declarative. They have to be.

It's not a better or worse thing. It's a domain thing.


Only if you limit yourself to current CPU architectures. Even GPUs right now are significantly more declarative than CPUs, and it definitely seems like some point in the future we could end up moving towards asynchronous circuits built with memristors.


I'm sure this will ruffle some feathres, but the technical skill that has been the most persistently valuable across the past 25 years of my career is object-oriented programming and design. It's not the only paradigm I use, but being able to organize large programs in terms of OOP has been incredibly valuable and hugely productive for me across several languages and domains.


On the contrary, I think that's a self-fulfilling perception. If assets are already organized as objects, then it'll continue to work well with oop. But if you take the approach from the get-go to use data and FP approaches, things can be quite nice. Examples:

- interface/trait based programming / structural subtyping (widely used in go/rust, increasingly in TS and python)

- terraform

- react

- aws lambda / cloud functions

- flow-based data processing (the whole of deep learning, spark/hadoop)

- and of course, anything declarative DSL based (SQL, jq and friends)

So I would counter that the more valuable skill is "how do I solve problems in terms of applying and composing transforms to data"

To clarify, since everyone has their own definition of OOP, and of the four pillars, Abstraction, Polymorphism, aren't at all unique to OOP, and Encapsulation is just Abstraction: the defining features of OOP are inheritance and poking-and-prodding-state into opaque objects. Inheritance is subsumed by interfaces / structural subtyping, and poking at state is contrasted with reactor patterns, event sourcing, persistent data structures, etc.

Oop really shines at the middlin-low level, in languages without a lifetime (state for things like IO resourcese, at the GUI widget level, and the (micro)service level, which is more like the original smalltalk sort of objects, in which case inheritance isn't a think.


> On the contrary, I think that's a self-fulfilling perception.

So is using SQL. My point, as well as the parent author's point, is that I've found a technology that has been a useful source of value for my entire career. I make no claims that there aren't other useful sources of value, but I've never for a second regretted the time I've spent getting better at writing object-oriented code.

I'm also quite comfortable writing functional and dataflow-oriented code, but I've never found those to be as exciting as some people seem to.


Fair enough!

And yeah, since OOP is everywhere, there's zero downside to getting better at it.

> I've never found those to be as exciting as some people seem to.

It's less about excitement (for me at least) and less about the hair pulling associated with reasoning about complex state. I find the more independent attributes an object has, the harder it gets to reason about, unless I can serialize the whole thing (which usually goes against encapsulation).

I'd say that's the number 1 thing I do to get "better at OOP": have a healthy suspicion towards any long-lived object I can't serialize/deserialize using primitive types.


React is implemented using OOP, which I think is great for libraries that are defining and defending boundaries.


That's like saying the Glasgow Haskell Compiler is implemented in part with C, therefore Haskell is imperative.

The React "way" is very much leaning into pure immutable data that flows through components which react to changes. Components aren't objects in the OOP sense. You don't call methods on them to update their internal state.


Never said React was imperative.


Hooks is not really OOP. The original OOP interface was abandoned because it lead to a very complicated lifecycle that was hard to reason about.


I'm referring to the implementation of React, not its API.


Can confirm, unless it's changed radically since I last looked at it, yes, everydamnthing in React ends up as an object representing a component, and I don't just mean in the way that JS functions are technically also objects.


Except the little detail that JavaScript is an OOP language in the linage of SELF, where even basic types have methods.


No, it's not.


It is, in large part, right down at what you might call the heart of the whole thing. Though I haven't read the React source code since a bit after Hooks were added so I guess maybe that's changed—though I doubt it.


Everyone has a different experience! Object oriented programming and design have had approximately no impact on my own career, while SQL knowledge has been indispensable. I don't think they're really in competition.


Tell me you've never worked on large FP codebases (as a point of comparison) without telling me you've never worked on large FP codebases ;)


That's true, I haven't.

Weird how many large OOP codebases there are to find oneself working on and how few large FP ones. :) Maybe that's just coincidental, or maybe not.


Should the inference be:

FP doesn't scale?

- or -

FP inherently creates small code bases?


Well, having worked on both, I was able to tell that you only worked on one since your stated opinion would likely be very different had you worked with both, so take from that what you may


> Weird how many large OOP codebases there are to find oneself working on and how few large FP ones.

How on earth is that weird when OOP is a hundred times more popular than FP?


The point is to ask why OOP is a hundred times more popular if, according to FP advocates, it's so much worse as a paradigm.


Inertia is a powerful thing. Look at how many people still use C and C++ (and Java!) despite there being arguably better alternatives.


I can see how can that be the case if one never works with products that rely heavily on having a database.


Yes, my career is possibly a bit unusual in that I've mostly not worked on heavily database-driven stuff. Mostly client applications, games, tools, and programming languages.


Just a slow-to-die 90’s fad in the end.


OOP is considerably older than '90s. Turbo Pascal 5.5 had objects in 1989 and was far from the first.

Anyway structs plus FP is no less object oriented. You just don't have to write the methods in the same code module. See NIM Method Call Syntax for example: https://nim-lang.org/docs/tut2.html#object-oriented-programm...

Yes, I know NIM is not purely functional. Use F# instead to get a bit closer to purity, it also does OOP quite nicely as does pretty much any functional language that has closures.

OOP is a tool just as Git is a tool.

See also: https://medium.com/extreme-programming/oop-vs-fp-182475457a0...


> FP is no less object oriented.

In fact, Erlang (and Elixir, I suppose) may be the only object-oriented programming language[1].

[1] https://www.infoq.com/interviews/johnson-armstrong-oop/


> OOP is considerably older than '90s.

Fad means popular (in a deragetory way). 90’s was its popularity heyday.

> See NIM Method Call Syntax for example: https://nim-lang.org/docs/tut2.html#object-oriented-programm...

So oop is when you `obj.methodName(args)` instead of `methodName(obj, args)`? Well if that’s all it is about then I can get behind it...

> OOP is a tool just as Git is a tool.

An iron maiden is just a tool. Don’t be hatin’ it’s just a tool what has it done to you?


I feel my abilities as a developer really took off when I took the time to really learn SQL and take advantage of the features that my database offers. For example, learning to use window functions felt like that galaxy brain meme where you become one with the universe.

Being able to take 50 lines of ruby code (manual (anti-)joins and aggregates, result partitioning, etc...) and replace it with a couple lines of sql that is much faster and less buggy is a life changing experience.

The only other time I had such a dramatic shift in the way that I look at building applications is when I learned Scala (functional programming).


> replace it with a couple lines of sql that is much faster and less buggy is a life changing experience.

you have snatched the pea.


I googled this expression but didn’t find anything. What does it mean?


Pretty sure it's a reference to https://www.youtube.com/watch?v=9selPW2lL-M


correct. sql moves the mind from pondering scalars to groking sets.


I’ve been using HTML since the mid 1990s. That is probably the oldest technology that I still use on a regular basis. SQL would be a close second (or maybe third - I probably started learning sql around 2001 and I can’t remember when JavaScript came out. Surely before that.)


As above has noted, Both HTML and SQL are effectively declarative, hence their staying power.


And JavaScripts staying power is backwards compatibility for billions of web pages, don't want to introduce another language to the browser because it would cost a lot of money for developers to build the engine and would increase the security attack surface, and all the browsers would have to add a new engine around the same time so developers would be willing to use it.


Sure, I don't disagree that Javascript also has some staying power.

But I suppose if you look at the web programming stack and its history (which helps explain how stupid and nonsensical it is) I'll bet that in the long run HTML and SQL stick around longer as a part, even when they're perhaps not very "good?"

As in, a good bit of the energy around Javascript is "dealing with its extreme shortcomings directly in a way that implies possible replacement?" Like, transpiling feels different from layering on top. And people transpile Javascript, but layer on top of HTML? Something like that.


And yet webassembly seems to be going somewhere.


It isn't replacing JavaScript. At the very least you still need JS to access the DOM. And, for backwards compatibility, I don't see JS going away during our lifetimes. And I don't see another language being added to browsers in place of JS (not including webassembly, which is not a replacement for JS).


XHTML looked like it would totally usurp HTML at one point but then we went back to HTML with avengence when everyone jumped on the HTML5 hype train.

Postgres used QUEL instead of SQL for the first decade of its life. There were several competing languages in the relational database space at one time, but SQL eventually won, no doubt thanks to Oracle and IBM's utter dominance in the space a few decades ago.

Declarative doesn't mean staying power. Reaching ubiquity through luck and random chance seems to be what achieves staying power. Javascript likely fits this category too. Browser developers 30 years from now will no doubt still be using Javascript.


Honestly, mines has mostly stayed the same. I've played with various techs but I mostly work the same underlying tech. Some items get added but very rarely does something get replaced or removed. The tech gets improved and new features added .

The last time I can remember something getting replaced is vargrant for docker. And honestly, I think that's a mistake for many companies. Before that it was git for svn.

The only way I can see tech coming and going is you're always an early adoptor and need to replace things that failed. If you build on stable tech, that stuff stays around for decades even if it was terrible in the first place. A good example is PHP. It was terrible at first still around and improving year on year.


Can you elaborate on why vagrant to docker is a mistake? I find the docker workflow infinitely better and faster to iterate for making dev environments.


If your deployment target is a VM/bare-metal rather than container, Vagrant makes a lot of sense for development.

E.G. my dev Vagrantfile fits in a small screen, and provisioning is handled by a single Ansible playbook that works for all environments (using conditionals). So there's no need to maintain a separate dev container, and a lot of subtle headaches are avoided in the deployment process

Setting up the dev environment consists of running `vagrant up` (assuming Vagrant/Ansible are installed). That's it.

You get all the benefits of IaC (fast ~reproducible dev) without shoehorning containers in.


It also comes with a lot of downsides and significantly increased complexity in many cases. Depends what you're doing.


Why do you think Vagrant->Docker was a mistake?


Most companies have a large amount of devs on macbooks. Docker on the macbook has historically been painful. Mainly due to the filesystem. Whereas, vagrant doesn't really have that problem. I've been using Linux for so long that I haven't really experienced the pain myself. But I have gotten to enjoy the smugness of them all complaininga about their performance while mines is super snappy.

Then if you look at the toolchain surounding vagrant such as packer. It looks really nice. Docker doesn't seem so nice. In fact, I still use packer to build my docker container images because I like the setup nicer.

With Docker it's all about the production env. I see the benefits for prod but for dev when you're literally not using the same docker setup as you would in prod it makes no real sense to me.


Vagrant still allows people to have deviating personal environments, which destroys productivity when people have to constantly diagnose what changes they have in their local env don't match other things. Vagrant+Docker is a decent compromise. Best is cloud-native development. If everyone [on a single team] SSH's into one box in the cloud with one set of dependencies, everyone gets the same environment every time, and updates go out to everyone immediately without fail. K8s has similar projects to let you develop in a pod on the actual cluster, eliminating difference between dev<->prod.


> Vagrant still allows people to have deviating personal environments,

So does docker the way many folk use it for their development. In fact, I saw in one place people set up env setups for their docker so each team member could have theirs the way they wanted it. Create a container and use forever. Very rarely in my experience are people completely rebuilding their env.


Docker was painful on Mac, but quite a while ago. I don't have any problems these days.


The last complaints I heard were about a year ago. Which point those devs left.


The problem I see here is building SQL on top of the cloud, which is like building SQL on top of a database where the tables are always changing - not just adding columns but the meaning of the columns and rows, or even the existence of the table.

I think this is a good idea, but the cloud doesn't seem to care too much about backwards compatibility. If you use MySQL/PostgreSQL/SQL Server, you can be fairly sure in 20 years time your SQL will still work :-).

The need for this library is more an indictment on "the cloud" than anything. I am mostly an Azure user so AWS/Google might be better but man it changes it's UI and CLI interfaces alot - way too much.


SQL is a technology that engineers should know in an out, and while in early 2000's you could not move an inch without knowing it, I find that the new generation of devs really considers database calls a freebie - hidden behind abstractions and ORMs. It's just a line of code, right?

This leads to highly non-performant software since, unless we have bent the laws of physics in the last 15 years, your data access layer is the most expensive one.

But we are crushing LeetCode, yeah?


"Since starting my career, the tech stack I use has changed numerous times. Technologies have come, gone, evolved. But except for a brief period in 2010 where we thought we could kill it, SQL is the one technology that has stayed around."

Ethernet comes to mind when I think of technologies that just can't be killed ...


> It might be the only TECHNICAL knowledge I learned in the first year of my career that is still relavent.

Agree. IIRC, I wrote my first SQL is 1996 and wrote some today. While not perfect, it’s amazing at what it does. I suggest reading Joe Celko to really up your SQL skills.


yep.

and i wrote my first quel in 1984 at BLI. and sql still won the war.


That's not random. SQL is basically math. The syntax is a little old school and arcane (but not terrible) but the principles behind it are set theory and logic.


> That's not random.

It's kind of random. There were plenty of competing languages in use until the 90s. Even our beloved Postgres used QUEL for its first decade of life. But Oracle took the dominant lead and its competition either disappeared or started adding SQL support to be compatible with it.

Had a different butterfly flapped its wings all those years ago, things could have turned out very different.

> SQL is basically math.

All programming languages are basically math.

> but the principles behind it are set theory

Thing is, one of the challenges with SQL is that it doesn't adhere to set theory. This leads to some unintuitive traps that could have been avoided had it carefully stuck to the relational algebra.

Such deviation may have been a reasonable tradeoff to help implementations, perhaps, but from a pure language perspective it does make it more terrible than it should have been.


Reminds me of Git.


Eh, I’ve seen various things like Git come and go. I spent the majority of my career in the days before git existed and we used things like SVN. I wouldn’t be surprised to see Git supplanted. I would be surprised to see SQL supplanted.


Git is a bit newer, and there have been many attempts to solve the same problem it solves, but it solves that specific problem VERY well... it'll be a long time before it's dethroned in my opinion.


Git is surprisingly only 17 years old while SQL is 48 years. Not that you’re wrong, it’s just interesting how comparatively newer it is


Also, git had major competition for almost half of that time where it was not clear at all what software people would settle on (or even if they they would actually settle down).

SQL was nearly undisputed its entire life. It quickly killed every previous architecture, and everything that came after it made a point on being compatible.


>It quickly killed every previous architecture

This feat was much easier to achieve when the industry was much, much smaller, and a single company mostly owned the entire business sector (who were the people that needed databases in the first place).


Well, a single company with a huge unexplainable help from the US government.

But SQL stayed unopposed due to its qualities, not because of its iffy start.


SQL is at least 30 years older than Git. Git has yet to prove its longevity in the same way.


That statement will be true forever. It doesn’t tell you much.


Let me restate in relative terms then. SQL is over twice as old as Git. When Git was originally released, SQL had been around for longer than Git has been today. Git has had a pretty good track record, but it's not comparable yet in terms of age or even in terms of mass adoption -- plenty of orgs don't use Git to manage their code.


Excited to see Steampipe shared here - thanks kiyanwang! I'm a lead on the project, so sharing some quick info below and happy to answer any questions.

Steampipe is open source [1] and uses Postgres foreign data wrappers under the hood [2]. We have 84+ plugins to SQL query AWS, GitHub, Slack, HN, etc [3]. Mods (written in HCL) provide dashboards as code and automated security & compliance benchmarks [3]. We'd love your help & feedback!

1 - https://github.com/turbot/steampipe 2 - https://steampipe.io/docs/develop/overview 3 - https://hub.steampipe.io/


Hey Nathan. Can you comment on some of the known performance pitfalls of steampipe? I'm not super familiar with the Postgres foreign data wrappers API. I assume steampipe inherits many of its technical limitations from this API.

Having done some work in this space, I'm aware that it's no small thing to compile a high-level SQL statement that describes some analytics task to be executed on a dataset into a low-level program that will efficiently perform that task. This is especially true in the context of big data and distributed analytics. Also true if you're trying to blend different data sources that themselves might not have efficient query engines.

Would love to use this tool, but just curious about some of the details of its implementation.


The premise of virtual tables (Postgres FDW) is to not store the data but instead query it from the original source. So, the primary performance challenge is API throttling. Steampipe optimizes these API calls with smart caching, massive parallelization through goroutines and calling the minimum set of APIs to hydrate the exact columns requested. For example, "select instance_id from aws_ec2_instance" will do 2 API calls get 100 instances in 2 pages, while "select instance_id, tags from aws_ec2_instance" would do 2 calls (instance paging) + 100 tag API calls (one per instance). We've also implemented support for qualifiers (i.e. where clauses) so API calls can be reduced even further - e.g. get 1 EC2 instance without pagination etc.

The Postgres planner is not really optimized for foreign tables, but we can give it hints to indicate optimal paths. We've gradually ironed out many cases here in our FDW implementation particularly for joins etc.

If you can tolerate my Aussie accent, I explain many of the details in this Yow Data talk - https://www.youtube.com/watch?v=2BNzIU5SFaw


Awesome, thanks so much for that! I guess then it sounds like, even if bandwidth were not an issue for a datasource, there would still be the physical limitations of the machine on which steampipe would do any later processing (such as joins between datasources). In other words, there's no sense in which steampipe distributes this work across multiple processes or machine. Is that correct?

Sorry if a dumb question. I could be thinking entirely in terms of the wrong paradigm here since my work in this space was primarily concerned with distributed computing and big data.


Steampipe computes the SQL part in a single Postgres instance (not distributed) - we've not found this to be a significant limit (so far). A difference in this case is that the query is really combining and organizing results that may been processed at the source. In 2022 most services are API first and offer great searching and filtering, so we can use that to do much of the query processing in many cases.


Great, thanks again! Yeah, a lot of the things steampipe is doing are right up my alley. Fun to learn about! Watching your talk now and enjoying it.


To add somewhat of a counterpoint to the other response, I've tried the Steampipe CSV plugin and got 50x slower performance vs OctoSQL[0], which is itself 5x slower than something like DataFusion[1]. The CSV plugin doesn't contact any external API's so it should be a good benchmark of the plugin architecture, though it might just not be optimized yet.

That said, I don't imagine this ever being a bottleneck for the main use case of Steampipe - in that case I think the APIs themselves will always be the limiting part. But it does - potentially - speak to what you can expect if you'd like to extend your usage of Steampipe to more than just DevOps data.

I've used the benchmark available in the OctoSQL README.

[0]: https://github.com/cube2222/octosql

[1]: https://github.com/apache/arrow-datafusion

Disclaimer: author of OctoSQL


Yep, I'm aware of DataFusion and I'd expect nothing less than top notch performance from it. Interesting comparison.


Love to see Postgres FDW used. It’s a really powerful and imo not utilized as much as it could be.


Hey! Steampipe looks great and I think the architecture you chose is very smart. Some feedback from myself:

I've tried setting up steampipe with metabase to do some dashboarding. However, I’ve found that it mostly exposes "configuration" parameters, so to say. I couldn't find dynamic info, like S3 bucket size or autoscaling group instance count.

Have I done something backwards or not noticed a table, or is that a design decision of some sort? That was half a year ago, so things might've changed since then, too.


Steampipe mostly exposes underlying APIs or data through SQL. So, "select * from aws_s3_bucket" will return bucket names, tags, policies, etc all together. But, bucket size is not a direct API so doesn't have a direct table.

In some high value cases we've added tables to simplify / abstract / normalize data - for example AWS IAM policies - https://steampipe.io/blog/normalizing-aws-iam-policies-for-a...

This example query returns the number of instances attached to an autoscaling group - https://hub.steampipe.io/plugins/turbot/aws/tables/aws_ec2_a...

BTW, we recently published a Metabase integration guide - https://steampipe.io/docs/cloud/integrations/metabase


It's a very cool project!

It might just be a coincidence, but an hour before this HN post, I discovered it way back in our queue of things to review for https://golangweekly.com/ and featured it in today's issue. Hopefully kiyanwang is one of our readers :-D


Awesome - thanks for the shout out in golangweekly!


No worries, it's my job :-) I'm just so intrigued to see (well, guess, in this case) how the path of word of mouth works because most of the time it's a complete mystery!


Congratulations! This looks incredibly powerful and I'm excited to check it out.

Although this is pitched primarily as a "live" query tool, it feels like we could get the most value out of combining this with our existing warehouse, ELT, and BI toolchain.

Do you see people trying to do this, and any advice on approaches? For example, do folks perform joins in the BI layer? (Notably, Looker doesn't do that well.) Or do people just do bulk queries to basically treat your plugins as a Fivetran/Stitch competitor?


While it's extensible, the primary use case for Steampipe so far is to query DevOps data (e.g. AWS, GitHub, Kubernetes, Terraform files, etc) - so often they are working within the Steampipe DB itself and combining with CSV data etc.

But, because it's just Postgres, it can be integrated into many different data infrastructure strategies. Many users query Steampipe from standard BI tools, others use it to extract data into S3, it has also been connected with BigQuery - https://briansuk.medium.com/connecting-steampipe-with-google...

As opposed to a lake or a warehouse, we think of it as a "Data Rainbow" - structured, ephemeral queries on live API data. Because it doesn't have to import the data it works uniquely well for small, wide data and joining large data sets (e.g. search, querying logs). I spoke about this in detail at the Yow Data conference - https://www.youtube.com/watch?v=2BNzIU5SFaw


Over the last week I’ve been wanting to use Steampipe for DevSecOps by layering Sliderule.io on top. Do you have any customers doing this or something like it already? I am going to give it a try tomorrow and take it up with our Sliderule contacts on Monday.


Sounds interesting - let us know how you go!


This is awesome. Makes me want to write a plugin for my SaaS, just because it looks fun.


I'm biased, but writing plugins is really fun - and then you can create mods / dashboards for your users as well! Please give it a go - we have a guide [1], 84+ open source plugins to copy [2], many authors in our Slack community [3] and we're happy to get your plugin added to the Steampipe Hub [4].

1 - https://steampipe.io/docs/develop/writing-plugins 2 - https://github.com/topics/steampipe-plugin 3 - https://steampipe.io/community/join 4 - https://hub.steampipe.io/plugins


This is amazing. Can't believe I hadn't seen it before. Nice job


Are people usually setting up centralized shared instances of Steampipe or is it more of a "run on everyone's laptop" deployment preferred?


We see users running Steampipe on their desktop, in pipelines (e.g. to scan Terraform files) and also as a central service. See "service mode" in the CLI for a Postgres endpoint and a dashboard web server [1] or try docker / kubernetes [2]. We also have a cloud hosted option with dashboard hosting, multi-user support, etc [3].

1 - https://steampipe.io/docs/managing/service 2 - https://steampipe.io/docs/managing/containers 3 - https://steampipe.io/docs/cloud/overview


My guess is, that it makes sense to get regular reports, e.g. weekly. But you also want to experiment and develop queries. So probably both. Not sure if there is something like a notebook for steampipe.


Because it's just Postgres, connecting Jupyter notebooks to Steampipe via SQL works really well :-)


What about joins, are they supported? Can't see them in the examples.

The real power of SQL is locked until you can join different data sources.



Yes - joins work across tables and schemas! This (toy) example connects IAM user records with Slack user data:

    select u.name, s.id, s.display_name
    from aws_iam_user as u, slack_user as s
    where u.name = s.email


Steampipe is pretty much "just" PostgreSQL with a foreign data wrapper and some fancy extras on top. The data is in tables from the database's perspective, so pretty much everything you can do with PostgreSQL, you can do with steampipe, including creating your own tables, views, functions and whatnot.


FYI hiring page requires login to notion


hmmm ... I just tested in Incognito mode and it's working for me. Could you please try again? We are hiring for multiple roles!


Steampipe killed my startup. I couldn't be happier about it.

I started a company six years ago to do exactly this -- make a SQL interface for the AWS API. Steampipe out-executed us and made a way better product, in part because their technology choices were much smarter than ours. I shut down the startup as soon as I saw steampipe.

I wish them all the best, it's a great product!


I'm curious: what were the bad technological choices you made and the good ones they made?


They use Postgres foreign data wrappers and caching with live queries against the Api. We were doing periodic scanning and then storing it in elastic. Then trying to convert the SQL to elastic queries.

The elastic part let us do free text queries but converting from SQL was challenging (this was before elastic had it built in). And also everything was only as fresh as our last scan.


Can't speak for op but one common mistake I see is having a bunch of JS developers try to make a CLI and being surprised it is slow.


Steampipe is unbeliveably powerful.

Writing custom scripts for even the simplest of queries comes nowhere near the convenience of using PostgreSQL to get what you want. If you wanted to find and report all instances in an AWS account across all regions, with steampipe it's just:

   SELECT * FROM aws_ec2_instance;
Even the simplest implementation with a Python script won't be nearly as nice, and once you start combining tables and doing more complicated queries, there's just no competition.


Tried steampipe out to replace my simple python script. It wasn't able to handle our 1k+ accounts and multiple regions, initialization took over 1 hour before I was even able to do a query.

Meanwhile my python script can run any API calls in all our accounts and regions and finish in a few minutes. Maybe 100 lines of python using boto3 and multiprocessing that outputs to json or yaml.


Python and boto3 are very powerful in many cases for sure - glad they work well for you. No doubt that multi-region for 1k+ accounts is a large surface area, so we'd love to learn from your experience! Performance is an ongoing focus area for us, including radical improvements in memory usage and startup time in our most recent version - https://steampipe.io/blog/release-0-16-0


1K accounts? What're you doing over there exactly? I'm intrigued!


Long story but currently it's a mix of per team or per system accounts to have clean permission borders (so one team cannot touch another teams resources). Though we also support account multi tenancy through scoped actions (tag or path based).


I don't agree that this simple query wouldn't be as nice with boto3. But combining different services is definitely a nice feature, if you have the use case for it.


the fun part of that query is that it'll search across all regions that you have configured steampipe to use. If you did that in Python, you at the very least have to manually iterate over each region, handle pagination yourself and combine the results.

It gets much more convenient when you want to ask more complicated questions like "What is the most used instance type per region" or "How much EBS capacity do instances use across all regions, grouped by environment tag and disk type?"


Good to see steampipe here :)

Im the author of CloudQuery (https://github.com/cloudquery) and I believe we started at the same time though took some different approaches.

PG FDW - is def an exciting PostgreSQL tech and is great for things like on-demand querying and filtering.

We took a more standard ELT approach where we have a protocol between source plugins and destination plugins so we can support multiple databases, data-lakes, storage layers, kinda similar to Fivetran and airbyte but with focus on infrastructure tools.

Also, as a more standard ELT approach our policies use standard SQL directly and dashboards re-use existing BI tools rather then implementing dashboarding in-house.

I can see the power of all-in-one and FDW and def will be following this project!


/me waves

Immediately thought of CloudQuery when I saw this and sure enough, here you are :)


https://developer.valvesoftware.com/wiki/SteamPipe

No way they did not know about that name collision as SteamPipe has been around since 2013-ish.


Name collision researcher here!

This made me curious and I started doing research on other problematic name collisions. Did you know that Valve’s game distribution service has a name that’s already widely used in the scientific community for the gaseous form of heated water? It looks like steam has been around for several centuries at least so Valve really ought to have known better.

Anyway, if you’re interested, I’m putting together a long investigation on whether Microsoft’s flagship operating system is in collision with a very common term used in the construction and glass industry. Hope to have it done soon


Both things called “steampipe” are cloud-based technology services. They even both involve a CDN. They're close enough to be confusing. The name of a gas isn't. This is silly.


Try starting a technology company called Apple, and tell me how that works out for you.

aCtUaLLy, aPpLes hAvE eXisTeD bEfoRe 1976




I'm about to blow your mind, then

https://en.wikipedia.org/wiki/Stirling_engine


Try dropping a few thousand years

https://en.wikipedia.org/wiki/Aeolipile


I've been around since 1973 and didn't know.


We have REST endpoints generated from SQL databases, and now we have SQL queries which connect to REST endpoints.


How does this handle pagination and also how does it behave towards rate limiters? For example, my API (keygen.sh) has a page size limit of 100, and often times there can be hundreds or even thousands of pages for certain resources. How would Steampipe handle `select count(*) from keygen_licenses` while obeying the rate limiter? What if a heavy query is run twice?


Pagination may be handled by the API's golang SDK, if not that's a responsibility of the plugin author.

The plugin SDK provides a default retry/backoff mechanism, and the plugin author can enhance that.

If a heavy query runs twice it'll load from cache the second time, if within the (user-configurable) cache TTL.


This makes me want to create a Grafana dashboard for everything. Instead of relying on custom plugins (or pulling & storing the data yourself), now it's just another postgres data source. I hope this becomes mainstream!


100% - works great with Grafana, Metabase, Tableau, etc because it’s just Postgres :-)

If you are into the idea of "dashboards as code", you may also enjoy our HCL+SQL approach in mods [1] - runs in your cli, open source, composable, great DX [2]

[1] - https://hub.steampipe.io/mods [2] - https://steampipe.io/blog/dashboards-as-code


Looks fun. I assume it's just a visibility tool and there is no intention of offering INSERT, UPDATE and DELETE, right?


Yes, we've focused on SELECT for read only - which provides a lot of value as a "universal API" to live query these many sources. Steampipe is built on Postgres Foreign Data Wrappers [1] under the hood, so could be extended to full CRUD in time. Our current belief is that it's better to follow CQRS [2] in this case though - using SQL for the queries and more native tools for the commands.

1 - https://www.postgresql.org/docs/current/ddl-foreign-data.htm... 2 - https://martinfowler.com/bliki/CQRS.html


Highly recommend Steampipe. No other open source tool was able to deal with the large number of Route53 records I had to deal with.


I'm in the process of determining which RDS instance type to switch to and I've had to have a browser open with a million tabs copying and pasting information and pricing data from AWS. Could I use steam pipe to help me with this?


I feel your pain - but Steampipe won't solve this (yet). We have AWS Thrifty [1] to scan your env for cost savings and other tables to query your spend [2], but haven't tackled the 1.8GB of pricing JSON data into a table yet.

1 - https://hub.steampipe.io/mods/turbot/aws_thrifty 2 - https://hub.steampipe.io/plugins/turbot/aws/tables?filter=co...


Haha thanks for the reply. Insane how much pricing data there is, right? The least AWS can do is offer an export on the particular pricing pages!


This sounds like what YQL should have grown into.

https://en.wikipedia.org/wiki/Yahoo!_Query_Language


ctrl+f “yql” and landed here. Was an amazing service for its time. Could even execute server-side JS in its engine (Rhino). Certainly the only E4X environment I ever coded in.


Seems like the next step past https://osquery.io/


osquery is awesome. It was definitely part of our inspiration!


How does this relate to or differ from Multicorn? I see that in most of the "Type" columns in the "specific web wrappers" section of the Postgres FDW page

https://wiki.postgresql.org/wiki/Foreign_data_wrappers


Postgres FDWs are extensions installed in Postgres. Multicorn provides an easy way to implement Postgres FDW extensions in Python, leading to many FDW implementations. Steampipe runs as a Postgres FDW extension with a pluggable interface to easily build and manage data sources. Multicorn-based FDWs might be installed as part of your existing Postgres server, while Steampipe is optimized to run more standalone / alongside your existing Postgres server.


So Steampipe runs alongside? In a separate process?


There is another contender in infrasql realm called cloudquery[1]. Steampipe has been around for very long. Mind you the steampipe has an AGPL license.

[1]https://www.cloudquery.io


I really like how it just worked when I installed it and the AWS plugin. The second one I tried was tailscale... I get some different results to the same query which is weird and broken, and an error "device can only be managed by original tailnet (403) (SQLSTATE HV000)"

Still, getting access to all this stuff with a SQL interface is very compelling so I'll try to get the plugins for all the stuff I have working properly.


Glad AWS worked. Sorry you hit a bump with tailscale - we'd love feedback / contributions for that plugin [1], we just released it this month so may be edges for more complex configurations :-(

1 - https://github.com/turbot/steampipe-plugin-tailscale


We built a UI-centric cross-cloud search engine, too. <edit: link redacted as other commenter perceived this to be an ad>

The first VC we pitched to felt like this was too niche of a problem. They wanted us to come back with a different, grander, pitch, so we'll see. In the current fundraising climate, it's been difficult to gather data points on whether we are on the right track (this post makes us feel like we're not crazy). We reached out to investors outside America, where we're based, and we're quickly realizing the VCs aren't as tech-savvy as we expected. After going thru the YC application process this cycle, we've have much greater appreciation for YC. They understand tech and startups, both. For one, we're instructed to define a toy / small problem, "don't talk, do" as opposed to pretty slides and ideas.

Best of luck to Steampipe and whoever else is working on this problem.


Not a fan of what feels like an advertisement hijacking the OPs post. Most would probably prefer you just make a separate HN post.


I'm sorry it was perceived as such. I should have been more mindful. I genuinely intended for the substantive part of the post to be about our frustrations as programmers navigating the business world. It felt like the short introductory sentence was necessary to establish context (I'll remove the link, but leave post intact so other viewers aren't confused). Still off topic but, at the time, it seemed to be an on-brand discussion for HackerNews. What we were really phishing for was rapport from other engineers that we aren't crazy. I even feel vulnerable putting our project out here, at this stage, and the quick cost-benefit analysis in my head was this was a net-neutral.

I doubt Steampipe knew we existed before this post. It seems like fate because I stumbled upon Steampipe just last week. I deliberated how to craft a not-awkward email along the lines of "Hey guys, we just realized we're tackling the same problem. Cheers." When I made the original post, I felt like it was the less awkward way of saying hi but now I'm not sure. At any rate, I hope you understand we did it in good will and not malice.

Edit: I also hope "Best of luck to Steampipe and whoever else is working on this problem" doesn't get misconstrued as being a passive aggressive jab. Our team would honestly be thrilled if all the friction points of IaaS goes away, regardless of who solves it. As a reverse-ad, I'll say a few nice things about Steampipe that stood out to me: written in Go, supports querying HackerNews, SQL/Tables are a really good medium.


Please make your site mobile friendly.


Apologies. This was an experimental landing page haphazardly thrown together after the investors we're pitching to wanted a two-slide problem/solution. The rest of the app is actually made mobile-first.


So this looks interesting and I wanted to test it on a use-case I have currently. I want to query both a csv and a zipped csv living online but it doesn't appear that is possible. The csv plugin implies you can only work with local files, nothing over http(s). This kind of seems like a big oversight for a product like this, no?


We have work in progress to support go-getter [1] allowing use of files from many different sources. There is a WIP branch [2] building that into the Steampipe Plugin SDK, so it can be immediately enabled for all our file based plugins (CSV, Terraform, etc) [3]. Watch this space and hopefully you can help us test!

1 - https://github.com/hashicorp/go-getter 2 - https://github.com/turbot/steampipe-plugin-sdk/compare/main.... 3 - https://github.com/turbot/steampipe-plugin-terraform/tree/ad...


I have recently tried to set up Steampipe to query all our cloud costs, but I noticed that costs are only available for AWS. Azure, GCP, DigitalOcean costs are not available in their respective plugins.

Has anyone found a way to get the costs in those plugins? It is a limitation of the plugins or of the underlying cloud APIs?


Please open GitHub issues (or PRs) for any tables you'd like to added! If there is a cost API then we'd definitely like to have a table to query it. In the meantime, you may find the "Thrifty" mods useful - they have many prebuilt queries to look for cost saving opportunities in AWS, Azure, GCP and DigitalOcean - https://hub.steampipe.io/mods?q=thrifty


I wonder if we can connect this with Hasura and query our cloud inventory via GraphQL. I'm only half-joking, actually.


Because it's just PostgreSQL, Steampipe works out of the box with SQL clients, BI tools and other integrations [1]. Please let us know how you go!

1 - https://steampipe.io/docs/query/third-party


I have discovered steampipe only recently. We work with a few AWS accounts and the steampipe dashboard and query are the best tool I have seen since terraform (of which I'm pretty sure it borrows the resource modules). Totally recommend it for making sense of your AWS accounts mess and for audits (soc2 style for example)


Shameless plug for those of you who want to query your assets with Osuqery and cloud assets with Steampipe. Check out https://www.github.com/zercurity/zercurity.


They claim that a "Join" of APIs is possible. I wonder how that would work in the general case and what the performance characteristic are? Any experiences?


I found Steampipe from a HN comment about a year and some change ago, in my last role I had to routinely audit cloud configurations and it was amazing to have a SQL query to check all S3 buckets that were encrypted, security groups, etc.


Wow. Steampipe is just bitchin'.

I've been noodling on a half-baked notion to make REST API clients easier. Think something between cURL and Postman.

Steampipe is just such a better idea. And from first looks, the implementation looks like a bullseye.

Bravo.


Does this have autocomplete in the terminal?


"steampipe query" has auto-complete for SQL queries in the interactive terminal - https://steampipe.io/docs/query/query-shell#autocomplete

"steampipe service" is a standard Postgres endpoint, so works with any BI tool or SQL client and their autocompletion capabilities - https://steampipe.io/docs/query/third-party

"steampipe completion" provides terminal completion for the Steampipe CLI - https://steampipe.io/docs/reference/cli/completion


Could someone use this query across many hundreds of accounts?


Yes - our users run queries across hundreds of AWS accounts and regions. TBH, we never expected it to scale that far when we started, but it has. Steampipe parallelizes the API calls out and performs appropriate backoff in throttling to gather the results.

Steampipe does the API calls needed for the column data requested [1]. So "select instance_id from aws_ec2_instance" is much faster than "select id, tags from aws_ec2_instance" (which does a tags API call per row). But, it will still work better than you expect, and Steampipe caches results in memory to make subsequent queries instant.

1 - https://steampipe.io/blog/selective-select


Yahoo was really ahead of its time, so sad it was mismanaged.


Anybody remembers Yahoo Pipes with YQL? This looks similar.


This looks fantastic!


looks cool, think you could expose it as a rest API?

(I kid, I kid)


You kid, but we found that useful for scripts etc! So, in addition to a secure public Postgres endpoint [1], Steampipe Cloud workspaces have a rest API to run SQL queries and get results in JSON or other formats [2].

1 - https://steampipe.io/docs/cloud/integrations/overview 2 - https://steampipe.io/docs/cloud/develop/query-api


I'd want to know how to have a hosted solution.

Ideally build reports and alerts


You can run Steampipe CLI in "service mode", getting access to a Postgres endpoint and a dashboard web server [1]. Other users leverage docker / kubernetes to run it [2]. We also have a cloud hosted option with dashboard hosting, multi-user support, etc [3].

1 - https://steampipe.io/docs/managing/service 2 - https://steampipe.io/docs/managing/containers 3 - https://steampipe.io/docs/cloud/overview


Steampipe has their own cloud at https://steampipe.io/docs/cloud/overview

Alternative is Cloudquery but they don't offer cloud anymore. https://www.cloudquery.io/

Plug, not based on Steampipe, but we also offer similar managed SQL on your cloud assets. https://www.resmo.com/


Is this IaSQL reborn or a completely new product?


Completely separate projects, with different goals. IaSQL is focused on infrastructure as code (like Terraform) in SQL. Steampipe is an "ops as code" platform to query, visualize and report on all your DevOps data (infra, SaaS, IaC, logs, etc). Steampipe's first version was released in Jan 2021.


it's all about the data sources


tl;dr unnecessary abstraction. all of this can be managed by REST APIs, and is already done so, why revert back to disgusting, archaic looking SQL limitations?


wow.


This cannot possibly require WSL2. What the heck? If I had an afternoon to waste RTFSing, I'd fix this. Some of us are stuck running WSL1 because Hyper-V breaks features we need in other hypervisors.


Steampipe installs Postgres as an embedded database under the hood. We found that the initdb step kept failing on WSL1 and couldn't find a way to make it work (yet). Similar reports on StackOverflow below [1] [2]. Suggestions or PRs would be very welcome!

1 - https://stackoverflow.com/questions/72358475/postgresql-wsl-... 2 - https://stackoverflow.com/questions/69808468/postgres-is-stu...


On the one hand, I see your point; hacking it together probably isn't hard.

On the other, they're a startup with limited resources, and time to market matters dearly for them. They probably don't even have product-market fit yet, and supporting Microsoft technologies that Microsoft has abandoned seems questionable.

Anyway, I upvoted your comment, because it could be that their target market is running in weird enterprise-y environments and your problem is common. Feedback is helpful and hard fought at small companies.

(No affiliation.)


The downvotes actually motivate me to fix this, except the docs have zero information on how to set up a development environment. I can tweak the CLI and `go build` it, but the plugin install process ends up overwriting my changes somehow. I've wasted about two hours learning Go and digging into the project's CI/CD pipeline, trying to figure out how everything works, and I haven't even begin debugging the actual problem I'm having. Ultimately, this goes on the pile of neat things I saw on HN but don't actually work IRL. It's too bad because this looks super cool.


Sorry you didn't have a smooth start there :-(. Not sure how the behavior you are describing is happening. We'd be happy to help get things going if you want to open an issue [1] or discuss on Slack [2].

1 - https://github.com/turbot/steampipe/issues 2 - https://steampipe.io/community/join


Great to see steampipe here :)

Founder of Resmo here!

Select * from cloud & SaaS + a fully managed SaaS platform with a free offering. Make sure to check us out!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: