
What is your problem with this example? Doesn't it get the job done?

  WITH temperatures AS ( /* ... */ )
  SELECT
      *,
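      -- running max over this row and the two preceding rows: a three-day window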
      MAX(c) OVER (
          ORDER BY t
          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
      ) AS hottest_temperature_last_three_days
  FROM
      temperatures;
  
       t      │ c  │ hottest_temperature_last_three_days
  ────────────┼────┼─────────────────────────────────────
   2021-01-01 │ 10 │                                  10
   2021-01-02 │ 12 │                                  12
   2021-01-03 │ 13 │                                  13
   2021-01-04 │ 14 │                                  14
   2021-01-05 │ 18 │                                  18
   2021-01-06 │ 15 │                                  18
   2021-01-07 │ 16 │                                  18
   2021-01-08 │ 17 │                                  17
Why should I fetch all the data and then form the result with another tool?

What a great article.




I re-read my comment, and I think that I expressed myself too harshly. If you like SQL and it gets the job done for you, no problem. I could see myself using that SQL query.

However, in order to get there, you need to know why you need the ''hottest_temperature_last_three_days''. Why not 2 or 4 days? Why not use heating degree days? What about serial correlation with other regions, or metering malfunctions? What if you are working directly with raw data from instruments, and still need to choose which cleaning method you will use? What if you want to check the correlation with another dataset which is not yet in your database (e.g., a private dataset in Excel from a potential vendor)?
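To be concrete, the lookback lives entirely in the frame clause, so a four-day variant of the query above is a one-line change (a sketch reusing the same temperatures CTE):

  SELECT
      *,
      MAX(c) OVER (
          ORDER BY t
          ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
      ) AS hottest_temperature_last_four_days
  FROM
      temperatures;

Knowing whether 2, 3, or 4 is the right number is the hard part, and the query cannot tell you that.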

If you know exactly what you want, sure, SQL is the way to go. However, the first step of a data project is to admit that you don't know what you want. You will perform a literature review and might have a general idea, but starting with a precise solution in mind is a recipe for failure. This is why you could need to fetch all the data and then form results with another tool. How do you even know which factors and features must be extracted before doing some exploration?

If you know the process you need and only care about RAM/CPU optimization, sure, SQL is the way to go. Your project is an ETL, and you have the job of a programmer who is coding a report. There is no ''data analysis'' there...


> If you know exactly what you want, sure SQL is the way to go.

I've been doing similar aggregations with MongoDB, simply because when I start a project where I quickly want to store data and then check a week later what I can do with it, it's a trivial tool to use. I don't need to think about so many factors, like where and how to store the data. I use MongoDB for its flexibility (schemaless), compared to SQL.

But I also use SQL, mostly because of the power that relations give you; yet I use it only for storing stuff where it's clear how the data will look forever, and which (mostly simple) queries I need to perform.

I read your comment as suggesting that these kinds of articles are not good, yet for me it was a nice overview of some interesting things that can be done with SQL. I don't chase articles on SQL, so I don't get many to read, but this one is among the best I've read.

To get back to the quote: how will I know whether SQL can do what I want if I don't read these kinds of articles?


If you know SQL well, then doing exploratory analysis in SQL is perfectly reasonable. Especially if you are using a high-performance analytic database.
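For example, a first exploratory pass over the article's temperatures table can be a one-off summary query (a minimal sketch; STDDEV_SAMP is the PostgreSQL/BigQuery spelling):

  -- quick distribution check before deciding what to compute next
  SELECT
      COUNT(*)       AS n,
      MIN(c)         AS min_c,
      MAX(c)         AS max_c,
      AVG(c)         AS mean_c,
      STDDEV_SAMP(c) AS sd_c
  FROM
      temperatures;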

Disclosure: I develop high-performance analytic database systems with SQL frontends.


What do you mean by exploratory analysis? Let's assume that my dataset contains features which are not normally distributed, and I want to check the coskewness and cokurtosis matrices before and after I remove outliers with a method specific to my field of research, found in an academic paper. Am I supposed to code all of this in SQL? What is your definition of exploratory analysis, and which metrics do you have in mind?
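Just to show what that implies: even a bog-standard three-sigma filter, much simpler than a method out of a paper, already looks like this (a sketch against a hypothetical readings table with a numeric column x):

  WITH stats AS (
      SELECT AVG(x) AS mu, STDDEV_SAMP(x) AS sigma
      FROM readings
  )
  SELECT r.*
  FROM readings r, stats s
  WHERE ABS(r.x - s.mu) <= 3 * s.sigma;  -- keep rows within three standard deviations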

And what if I want to make graphs to present stuff to a colleague? Am I supposed to use... Excel to do that?


> Am I supposed to use... Excel to do that?

This is gratuitous. You have a clear bias, granted, because it seems your domain is so specific that only a procedural language will do. But it seems you are unfamiliar with modern SQL tools. Some obvious ones that come to mind: Metabase[0] for visualisation, or Apache MADlib[1] for in-database statistics and machine learning.
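To make the MADlib suggestion concrete: its routines are called as ordinary SQL functions, so the statistics stay in the database (a minimal sketch, assuming MADlib is installed; the table names are placeholders):

  -- write per-column descriptive statistics for temperatures
  -- into an output table, then inspect it
  SELECT madlib.summary('temperatures', 'temperatures_summary');
  SELECT * FROM temperatures_summary;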

[0] https://github.com/metabase/metabase

[1] http://madlib.apache.org/


I humbly disagree. It's not that ''my domain is so specific'', but that data analysis is in itself a discovery process which is not straightforward. If you want to explore the jungle, you need a machete.

As for Metabase or MADlib, you are right; I was not aware of these new tools. They look great, and I'm certain they can help a lot of people. However, you assume that they are available! Not all IT departments are open to the idea of buying new software, and if the suggestion comes from an outsider it will be perceived as an insult (been there, done that many, many times). And when they refuse, now what? You go back to the usual procedural languages (R, Python, Julia, etc.), which are free and don't require the perpetual oversight of some DBA who thinks that all you need is an AVG(X) and GROUP BY, since kurtosis is domain-specific anyway.

I've met some Excel-VBA users who couldn't care less about pro devs or decorators, since ''they can already do everything by themselves''. Same thing with the SQL-only, Python-only, Tableau-only, or whatever-only crowd.


I usually use Python and R for analysis. However, when dealing with larger datasets, e.g., 0.5-2 PB, I have to rely on SQL/BigQuery, because I can't get Python and R to handle such workloads in reasonable time. I tried Dask, but I couldn't resolve a few bugs it had at the time.
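Concretely, that means pushing the scan into the engine, e.g. an IQR-style outlier filter built on approximate quantiles (a sketch in BigQuery SQL; the readings table and column x are placeholders):

  WITH q AS (
      SELECT
          APPROX_QUANTILES(x, 4)[OFFSET(1)] AS q1,  -- 25th percentile
          APPROX_QUANTILES(x, 4)[OFFSET(3)] AS q3   -- 75th percentile
      FROM readings
  )
  SELECT r.*
  FROM readings r, q
  WHERE r.x < q.q1 - 1.5 * (q.q3 - q.q1)
     OR r.x > q.q3 + 1.5 * (q.q3 - q.q1);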

If you were to find outliers in a 1 PB table, what tools would you use?



