a) I routinely load 6GB of financial data into R in less than 45 seconds, and it's complex and hierarchical to boot. Admittedly that's in R's native file format but even a flat CSV takes less than 5 minutes.
b) Why SQL? Has this person considered PyTables or h5py? HDF5 is an excellent file format created by people who work with large data sets all day long, and it is vastly superior to an on-disk relational database for data sets of this size (and much bigger). Moreover, its datasets map directly to NumPy arrays.
c) Why SQL (II)? Why is all the wrangling being done in slow SQL when Pandas offers in-memory querying via its groupby and query methods that is easily 100x faster (see the sketch after this list)?
d) The visualization is terrible. "Other" dominates the dataset and dwarfs the more interesting categories and their relationship to time of day. This is a very clear case where a chart grid would have been preferable: easily constructed in Pandas or of course ggplot2, and surely doable with Plotly. We would then have been able to compare how different categories have different time-of-day profiles.
e) If we're going to push data visualization tutorials into the big wide world, why not introduce people to quantile plots, useful here for comparing the distributions of complaints (literally one line in R)? Or, more simply and very intuitively, we could have superimposed all the hour-by-hour complaint density plots on each other and seen how they compare (again, literally a couple of lines in R, e.g. plot(density(x)); lines(density(y), col = "xxx")), after having normalised the data of course. This has its analogues in Python. A second, bog-standard unstacked bar chart could then have been used for aggregate complaint volume by hour, and dare I say it, even a pie chart would have been better here for comparing aggregate complaint category sizes.
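For point c), here is a minimal Pandas sketch of the in-memory route. The file and column names are guesses at what a 311 extract might look like, not the article's actual schema:

    import pandas as pd

    # Hypothetical file/column names -- substitute whatever the real extract uses.
    df = pd.read_csv("311_complaints.csv", parse_dates=["created_date"])
    df["hour"] = df["created_date"].dt.hour

    # In-memory filtering, no SQL round-trip:
    noise = df.query("complaint_type == 'Noise'")

    # Complaint volume by category and hour, ready for a chart grid:
    by_hour = df.groupby(["complaint_type", "hour"]).size().unstack(fill_value=0)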
This looks like a pitch for Plotly aimed at casual HTML developers, not serious data scientists.
Pandas has native HDF5 read/write. So they could just do a one-time read from SQL, dump to HDF5, and enjoy subsequent fast reads.
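A rough sketch of that one-time conversion, assuming a SQLite source; the paths, table and key names are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    # One-time: pull the table out of the database (requires PyTables for the HDF5 part)...
    engine = create_engine("sqlite:///complaints.db")          # placeholder path
    df = pd.read_sql("SELECT * FROM complaints", engine)

    # ...dump it to HDF5 once...
    df.to_hdf("complaints.h5", key="complaints", mode="w", format="table")

    # ...and every later session starts with a fast binary read instead.
    df = pd.read_hdf("complaints.h5", "complaints")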
I've recently used Pandas with HDF5 backing for building a web app with Flask that does complex queries on a small dataset (40k rows, ~ 15 columns of which four are indices). Full stack Python, no SQL anywhere, it's awesome. Stick it in a virtualenv, fire and forget.
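Something like the following minimal sketch, with the route, file and column names invented for illustration:

    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    # Load once at startup; 40k rows sits comfortably in memory.
    df = pd.read_hdf("data.h5", "records")                      # placeholder file/key

    @app.route("/summary")
    def summary():
        category = request.args.get("category")                 # e.g. ?category=foo
        subset = df[df["category"] == category]                  # 'category' column is hypothetical
        return jsonify(subset.groupby("region")["value"].mean().to_dict())

    if __name__ == "__main__":
        app.run()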
Wow, I have to agree with this (even though the post is using Plotly). I don't know where the culture difference between the R and Python ecosystems comes from that makes the latter so much worse than the former when it comes to graphics. Python seems to be able to do so many things so well, but then at the final output stage it all falls apart and I find myself back in R. On that point, I even wonder whether, with all the Julia brouhaha, the real competitor in data science might actually be JavaScript: much faster than Python, with a community that truly understands graphics quality (I'm looking primarily at D3, but things like MathBox.js and even Highcharts are pretty decent). I'm thinking that SVG and increasingly WebGL will incentivize linear algebra skills in the JavaScript ecosystem, and those same skills will suddenly be transferable to data.
MPL's documentation is a catastrophe, and the API design is indeed abysmal. The plots, however, are only as ugly as you like them to be. Here's a great auxiliary package with much more sensible defaults:
Well, this is surely useful for some use cases. But for repeated analyses that 50-minute loading time is a no-go, and there is certainly big room for improvement.
For instance, using R with data.table to process a ~4GB file would mean only a few minutes spent on loading and processing, while giving very similar REPL and scripting flexibility. There should be a similar way to do it in Python, I guess...
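For what it's worth, a hedged Python sketch of the same kind of fast, typed CSV load; the column names and dtypes are invented and depend entirely on the actual file:

    import pandas as pd

    # Declaring dtypes and only the columns you need up front avoids full-file
    # type inference and keeps memory predictable (names here are made up).
    df = pd.read_csv(
        "big_file.csv",
        usecols=["id", "amount", "created"],
        dtype={"id": "int64", "amount": "float64"},
        parse_dates=["created"],
    )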
The take-home message is that there is a very large class (probably a majority) of data analysis problems for which SQL databases are very well suited. Though PostgreSQL would be a better choice.
And while we're at it, we may want to consider a) defining some indexes and b) buying a little more RAM
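A quick sketch of the first point for the SQLite case; the table and column names are placeholders:

    import sqlite3

    conn = sqlite3.connect("complaints.db")                      # placeholder path
    # An index on the columns you filter/group by turns full scans into lookups.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_complaints_type_hour "
                 "ON complaints (complaint_type, created_hour)")
    conn.commit()
    conn.close()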
SQL is useful when your use case requires randomly querying the dataset. This is a case of sequential processing in which the random-access capabilities of any database (relational or not) are not used at all, yet you pay for them in much lower sequential throughput and in indexing overhead. HDF5 is far superior here.
For a vast data set (> 100 GB) one may argue that a database lets you work "on disk", but that is not the case here, and if you're doing data science you will quickly realise that even 32 GB of RAM in your machine is not much. Whatever you gain in on-disk capability from an indexed database, you'll quickly be frustrated by, because your "discovery/cleaning" wrangling workflow will be painfully slow. The real answer is that 64 or 128 GB of RAM with HDF5 is almost indispensable for the working "medium data" scientist (most people), long before we need to talk about Spark, Hadoop, YARN etc. for the ginormous, properly "big data" cases.
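For reference, a hedged sketch of how the HDF5 route can still handle a frame that doesn't quite fit in RAM; the file, key and column names are placeholders:

    import pandas as pd

    # A 'table'-format HDF5 store supports predicate pushdown, so you can pull
    # just the slice you need without loading the whole file.
    # (Assumes the frame was written with format="table" and
    #  data_columns=["created_hour"]; names here are made up.)
    with pd.HDFStore("complaints.h5") as store:
        evening = store.select("complaints", where="created_hour >= 18")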
As you say, RAM is where it's at. The real takeaway is that 128 GB of DIMMs is cheap compared with the time you'll waste trying to do medium data on disk with (no)SQL.
PostgreSQL is a much better choice, especially when compared to SQLite with Python. The lack of multiple cursors can make doing any sophisticated and complex analysis very frustrating with SQLite.
On the other hand: "This entire workflow can be done in Pandas and SQLite. No proprietary third-party software or services necessary." I vastly prefer that. :)
Plotly is awesome. A lot of companies spending $50k+ a year on BI software could do better (and save a ton of money) by throwing Plotly iframes on a simple static webpage and updating them at whatever interval they like using cron jobs and Python.
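A minimal sketch of that cron-driven setup using Plotly's offline HTML output; the file paths and column names are invented for illustration:

    import pandas as pd
    import plotly.graph_objs as go
    from plotly.offline import plot

    # Run this from cron; it rewrites a self-contained HTML file that a static
    # page can embed in an iframe (file/column names are placeholders).
    df = pd.read_csv("kpis.csv", parse_dates=["date"])
    fig = go.Figure(data=[go.Scatter(x=df["date"], y=df["revenue"])])
    plot(fig, filename="public_html/revenue.html", auto_open=False)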
I work in the business intelligence space as a consultant.
The cost of any BI project is dominated by the time and effort put into ETL and data modelling. 80% of our effort goes toward this back end work. With this in mind, the presentation layer is essentially an afterthought. The lion's share of BI cost goes to expensive humans. If you can spend a few $100Ks for presentation tools that integrate seamlessly with the back-end analysis services and streamline report design/publication, then it's a no-brainer.
With technical employees/consultants at a premium, and in short supply (especially in the data modeling space), it is worth $100Ks annually to have a toolset that allows non- and minimally-technical employees to quickly and easily build new reports.
Take a look at Microsoft's PowerBI dashboards, or at the visualization capabilities in Tableau. The functionality there is what businesses want/need.
Those tools and Plotly equally require a solid data warehouse behind them to support any meaningful and timely analysis. The marginal cost of a highly-integrated and easy-for-non-technical-resources reporting layer is pretty low after that investment.
"a toolset that allows non- and minimally-technical employees to quickly and easily build new reports"
Been there, done that.
This is the mantra of the BI system marketing machine that I've never actually seen work in practice.
But I'm glad you've seen BI enterprise systems succeed where the end users are happy and feel their toolset is flexible enough for the new daily challenges they encounter (sincerely).
I love Tableau as an exploration tool and think the UX and visualization capabilities are awesome. Just not the solution end users wanted. I've also rolled out a massive cloud based enterprise BI system (data warehousing, ETL, etc.) at a separate company. That wasn't the solution, either. Plotly was. But, just my two cents.
Microsoft's PowerBI platform is growing on me for an end-user UI.
You're definitely right that none of the enterprise solutions are completely "there."
It still seems to me that "Export to Excel" is the strongest BI feature of any tool. Pivot tables are ubiquitous. Throw any OLAP cube up on a server and host a workbook on SharePoint and you're probably 60% of the way to a good BI ecosystem.
^^ This is in terms of data vis and presentation ^^ The backend work still dominates.
Fair disclosure: my company is a Microsoft Partner, so my primary exposure and all of my work is in that ecosystem. For geek-cred, I run Arch and OpenBSD for all my personal systems.
Note that most businesses with remotely-sensitive data would not use a third-party service to host the data, or even third-party tools without sufficient liability assurance.
The best definition I've seen of Big Data is "the database indices don't fit in memory on the beefiest single server you have access to". (And this doesn't mean you can claim big data because you only have access to your 4 GB RAM laptop.)
My usual rule of thumb is "does AWS sell a single instance big enough to store and process data in real time?" If you can't find an instance big enough, you have big data.
Methodology-wise, yes. Basically, anything that you cannot confidently put into a spreadsheet, manually investigate a chunk of, and search with regex is (to use the buzzword) "big data". Also, you need to forget about all your O(n^2) scripts.
If you do any kind of analysis (and you are of size < Google), processing is the easy part. The hard part is getting the maximum out of the data while disregarding noise. (It is also why, when you ask data scientists what they do, a common answer is that 80-90% of their time is data cleaning.)
No. Big Data methodology has traditionally (going back to MapReduce) been all about distributed computing, and figuring out how to get anything meaningful at all out of your data when it can't be comfortably processed on a giant server. When you talk about "size < Google" - Big Data as a term was ORIGINALLY COINED to describe what Google has to deal with. For example, Wikipedia says:
Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
People have been dealing with multi-GB datasets for a lot longer than the term "Big Data" has been around, and it was not intended to refer to existing techniques. Unfortunately, lots of people (especially non-programmers) have tried to adopt it to describe what they do because it's buzzword-y and because Google used it.
> 80-90% of time is data-cleaning
That's Data, period. I have spent 80-90% of my time cleaning when dealing with a 1MB flat text file. That doesn't make it Big Data.
I won't argue about a particular phrase, as 1) buzzwords are more often misused than used properly and 2) there is no single, strict definition of it.
But going from ~MBs to ~GBs changes the methodology (of course, it's more about the number of items and their structure than sheer size). Even at this scale there is some quality/quantity tradeoff, and a change in which of the more complex algorithms remain feasible.
Data cleaning, sure, for all data. But when the volume is in MBs, it is still possible to open the file in a text editor and manually look for abnormalities. Once it's GBs, some artifacts will remain undetected.
The fact remains that nothing that can be analyzed on a single computer in a single threaded application in a single memory space is anything like "big data", unless you just want to tap into the trend of buzzword-laden link bait inbound marketing.
This is an interesting point because the typical amount of RAM in a modern computer is 8GB, with 16GB being a rare upgrade.
For data analyses which I've done, my 8GB computers will complain when analyzing 2-3GB datasets, because when processing the data the computer needs to make a copy in memory, since the analysis isn't in-place, and there are memory leaks (stupid R). So 4-6GB of memory usage causes hangs (since the OS is using a lot of memory too), and that is why I've had to resort to SQL for data sets that hit that size (which is perfectly fine).
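If you do want to stay out of SQL, one option is chunked processing, which keeps peak memory bounded. A minimal sketch; the file and column names are made up:

    import pandas as pd

    # Process the file in chunks instead of holding the whole thing plus
    # intermediate copies in memory at once (names here are hypothetical).
    totals = {}
    for chunk in pd.read_csv("big.csv", chunksize=500000):
        counts = chunk.groupby("category").size()
        for key, n in counts.items():
            totals[key] = totals.get(key, 0) + n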
I think it was submitted as "SQLite and Pandas for Medium Data Workflows" (that's what I see in my RSS reader), but may have been changed to reflect the article title.
That was not particularly big data in 1995. You'd just load it into Oracle running on a desktop machine and query it in SQL, draw some graphs in Developer/2000.
For a lot of companies that is a lot of data, yes. Maybe not big data, but the term is relative so if a business sits on 1000 times as much data as they have played with before, is it that unreasonable for them to say it is big data?
> ...is it that unreasonable for them to say it is big data?
Let's say I have an organization and we run into a dataset that is 1,000 times larger than anything we've dealt with. Should we put out a help-wanted ad for a "big data" developer? What if the largest dataset we had previously dealt with was 100 rows? The reason we have terms for things like this is to facilitate communication. If the definition is highly sensitive to context, then the term doesn't facilitate any communication, the whole context must still be spelled out. If the term is to have any meaning at all, it can't be a relative thing. Of course I'm of the opinion that the term is already meaningless, so I guess do whatever you want :)