a) I routinely load 6GB of financial data into R in less than 45 seconds, and it's complex and hierarchical to boot. Admittedly that's in R's native file format but even a flat CSV takes less than 5 minutes.
b) Why SQL? Has this person considered PyTables or h5py? HDF5 is an excellent file format created by people who work with large data sets all day long, and it is vastly superior to an on-disk relational database for data sets of this size (and much bigger). Moreover, its datasets map directly to NumPy arrays.
c) Why SQL (II)? Why is all the wrangling being done in slow SQL when Pandas offers in-memory querying via its groupby and query methods that is easily 100x faster (see the sketch after this list)?
d) The visualization is terrible. "Other" dominates the dataset and dwarfs the more interesting categories and their relationship to time of day. This is a very clear case where a chart grid would have been preferable: easily constructed in Pandas or of course ggplot2, and surely doable with Plotly. We would then have been able to compare how different categories have different time-of-day profiles.
e) If we're going to push data visualization tutorials into the big wide world, why not introduce people to quantile plots, useful here for comparing the distributions of complaints (literally one line in R)? Or, more simply and very intuitively, we could have superimposed all the hour-by-hour complaint density plots on each other and seen how they compare (again, literally a couple of lines in R, e.g. plot(density(x)); lines(density(y), col = "xxx")), after having normalised the data of course. This has its analogues in Python. A second, bog-standard unstacked bar chart could then have been used for aggregate complaint volume by hour, and dare I say it, even a pie chart would have been better here for comparing aggregate complaint category sizes.
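For point c), here is a minimal Pandas sketch of the in-memory route. The file and column names are guesses at what a 311 extract might look like, not the article's actual schema:

    import pandas as pd

    # Hypothetical file/column names -- substitute whatever the real extract uses.
    df = pd.read_csv("311_complaints.csv", parse_dates=["created_date"])
    df["hour"] = df["created_date"].dt.hour

    # In-memory filtering, no SQL round-trip:
    noise = df.query("complaint_type == 'Noise'")

    # Complaint volume by category and hour, ready for a chart grid:
    by_hour = df.groupby(["complaint_type", "hour"]).size().unstack(fill_value=0)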
This looks like a pitch for Plotly aimed at casual HTML developers, not serious data scientists.
Pandas has native HDF5 read/write. So they could just do a one-time read from SQL, dump to HDF5, and enjoy subsequent fast reads.
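A rough sketch of that one-time conversion, assuming a SQLite source; the paths, table and key names are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine

    # One-time: pull the table out of the database (requires PyTables for the HDF5 part)...
    engine = create_engine("sqlite:///complaints.db")          # placeholder path
    df = pd.read_sql("SELECT * FROM complaints", engine)

    # ...dump it to HDF5 once...
    df.to_hdf("complaints.h5", key="complaints", mode="w", format="table")

    # ...and every later session starts with a fast binary read instead.
    df = pd.read_hdf("complaints.h5", "complaints")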
I've recently used Pandas with HDF5 backing for building a web app with Flask that does complex queries on a small dataset (40k rows, ~ 15 columns of which four are indices). Full stack Python, no SQL anywhere, it's awesome. Stick it in a virtualenv, fire and forget.
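Something like the following minimal sketch, with the route, file and column names invented for illustration:

    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    # Load once at startup; 40k rows sits comfortably in memory.
    df = pd.read_hdf("data.h5", "records")                      # placeholder file/key

    @app.route("/summary")
    def summary():
        category = request.args.get("category")                 # e.g. ?category=foo
        subset = df[df["category"] == category]                  # 'category' column is hypothetical
        return jsonify(subset.groupby("region")["value"].mean().to_dict())

    if __name__ == "__main__":
        app.run()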
Wow, I have to agree with this (even though the post is using Plotly). I don't know where the culture difference between the R and Python ecosystems comes from that makes the latter so much worse than the former when it comes to graphics. Python seems to be able to do so many things so well, but then at the final output stage it all falls apart and I find myself back in R. On that point, I even wonder whether, with all the Julia brouhaha, the real competitor in data science might actually be JavaScript: much faster than Python, with a community that truly understands graphics quality (I'm looking primarily at D3, but things like MathBox.js and even Highcharts are pretty decent). I'm thinking that SVG and increasingly WebGL will incentivize linear algebra skills in the JavaScript ecosystem, and those same skills will suddenly be transferable to data.
MPL's documentation is a catastrophe, and the API design is indeed abysmal. The plots, however, are only as ugly as you like them to be. Here's a great auxiliary package with much more sensible defaults:
Well, this is surely useful for some use cases. But for repeated analyses that 50-minute loading time is a no-go, and there is certainly big room for improvement.
For instance, using R with data.table to process a ~4GB file would mean only a few minutes spent on loading and processing, while giving very similar REPL and scripting flexibility. There should be a similar way to do it in Python, I guess...
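For what it's worth, a hedged Python sketch of the same kind of fast, typed CSV load; the column names and dtypes are invented and depend entirely on the actual file:

    import pandas as pd

    # Declaring dtypes and only the columns you need up front avoids full-file
    # type inference and keeps memory predictable (names here are made up).
    df = pd.read_csv(
        "big_file.csv",
        usecols=["id", "amount", "created"],
        dtype={"id": "int64", "amount": "float64"},
        parse_dates=["created"],
    )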
The take-home message is that there is a very large class (probably a majority) of data analysis problems for which SQL databases are very well suited. Though PostgreSQL would be a better choice.
And while we're at it, we may want to consider a) defining some indexes and b) buying a little more RAM
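A quick sketch of the first point for the SQLite case; the table and column names are placeholders:

    import sqlite3

    conn = sqlite3.connect("complaints.db")                      # placeholder path
    # An index on the columns you filter/group by turns full scans into lookups.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_complaints_type_hour "
                 "ON complaints (complaint_type, created_hour)")
    conn.commit()
    conn.close()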
SQL is useful when your use case requires randomly querying the dataset. This is a case of sequential processing in which the random-access capabilities of any database (relational or not) are not used at all, yet you pay for them in much lower sequential throughput and in indexing overhead. HDF5 is far superior here.
For a vast data set (> 100 GB) one may argue that a database lets you work "on disk", but that is not the case here, and if you're doing data science you will quickly realise that even 32 GB of RAM in your machine is not much. Whatever you gain in on-disk capability from an indexed database, you'll quickly be frustrated by, because your "discovery/cleaning" wrangling workflow will be painfully slow. The real answer is that 64 or 128 GB of RAM with HDF5 is almost indispensable for the working "medium data" scientist (most people), long before we need to talk about Spark, Hadoop, YARN etc. for the ginormous, properly "big data" cases.
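For reference, a hedged sketch of how the HDF5 route can still handle a frame that doesn't quite fit in RAM; the file, key and column names are placeholders:

    import pandas as pd

    # A 'table'-format HDF5 store supports predicate pushdown, so you can pull
    # just the slice you need without loading the whole file.
    # (Assumes the frame was written with format="table" and
    #  data_columns=["created_hour"]; names here are made up.)
    with pd.HDFStore("complaints.h5") as store:
        evening = store.select("complaints", where="created_hour >= 18")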
As you say, RAM is where it's at. The real takeaway is that 128 GB of DIMMs is cheap compared with the time you'll waste trying to do medium data on disk with (no)SQL.
PostgreSQL is a much better choice, especially when compared to SQLite with Python. The lack of multiple cursors can make doing any sophisticated and complex analysis very frustrating with SQLite.
On the other hand: "This entire workflow can be done in Pandas and SQLite. No proprietary third-party software or services necessary." I vastly prefer that. :)
Plotly is awesome. A lot of companies spending $50k+ a year on BI software could do better (and save a ton of money) by throwing Plotly iframes on a simple static webpage and updating them at whatever interval they like using cron jobs and Python.
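A minimal sketch of that cron-driven setup using Plotly's offline HTML output; the file paths and column names are invented for illustration:

    import pandas as pd
    import plotly.graph_objs as go
    from plotly.offline import plot

    # Run this from cron; it rewrites a self-contained HTML file that a static
    # page can embed in an iframe (file/column names are placeholders).
    df = pd.read_csv("kpis.csv", parse_dates=["date"])
    fig = go.Figure(data=[go.Scatter(x=df["date"], y=df["revenue"])])
    plot(fig, filename="public_html/revenue.html", auto_open=False)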
I work in the business intelligence space as a consultant.
The cost of any BI project is dominated by the time and effort put into ETL and data modelling. 80% of our effort goes toward this back end work. With this in mind, the presentation layer is essentially an afterthought. The lion's share of BI cost goes to expensive humans. If you can spend a few $100Ks for presentation tools that integrate seamlessly with the back-end analysis services and streamline report design/publication, then it's a no-brainer.
With technical employees/consultants at a premium, and in short supply (especially in the data modeling space), it is worth $100Ks annually to have a toolset that allows non- and minimally-technical employees to quickly and easily build new reports.
Take a look at Microsoft's PowerBI dashboards, or at the visualization capabilities in Tableau. The functionality there is what businesses want/need.
Those tools and Plotly equally require a solid data warehouse behind them to support any meaningful and timely analysis. The marginal cost of a highly-integrated and easy-for-non-technical-resources reporting layer is pretty low after that investment.
"a toolset that allows non- and minimally-technical employees to quickly and easily build new reports"
Been there, done that.
This is the mantra of the BI system marketing machine that I've never actually seen work in practice.
But I'm glad you've seen BI enterprise systems succeed where the end users are happy and feel their toolset is flexible enough for the new daily challenges they encounter (sincerely).
I love Tableau as an exploration tool and think the UX and visualization capabilities are awesome. Just not the solution end users wanted. I've also rolled out a massive cloud based enterprise BI system (data warehousing, ETL, etc.) at a separate company. That wasn't the solution, either. Plotly was. But, just my two cents.
Microsoft's PowerBI platform is growing on me for an end-user UI.
You're definitely right that none of the enterprise solutions are completely "there."
It still seems to me that "Export to Excel" is the strongest BI feature of any tool. Pivot tables are ubiquitous. Throw any OLAP cube up on a server and host a workbook on SharePoint and you're probably 60% of the way to a good BI ecosystem.
^^ This is in terms of data vis and presentation ^^ The backend work still dominates.
Fair disclosure: my company is a Microsoft Partner, so my primary exposure and all of my work is in that ecosystem. For geek-cred, I run Arch and OpenBSD for all my personal systems.
Note that most businesses with remotely-sensitive data would not use a third-party service to host the data, or even third-party tools without sufficient liability assurance.
The best definition I've seen of Big Data is "the database indices don't fit in memory on the beefiest single server you have access to". (And this doesn't mean you can claim big data because you only have access to your 4 GB RAM laptop.)
My usual rule of thumb is "does AWS sell a single instance big enough to store and process data in real time?" If you can't find an instance big enough, you have big data.
Methodology-wise, yes. Basically, anything that you cannot confidently put into a spreadsheet, manually investigate a chunk of, and search with regex is (to use the buzzword) "big data". Also, you need to forget about all your O(n^2) scripts.
If you do any kind of analysis (and you are of size < Google), processing is the easy part. The hard part is getting the maximum out of the data while disregarding noise. (It is also why, when you ask data scientists what they do, a common answer is that 80-90% of their time is data cleaning.)
No. Big Data methodology has traditionally (going back to MapReduce) been all about distributed computing, and figuring out how to get anything meaningful at all out of your data when it can't be comfortably processed on a giant server. When you talk about "size < Google" - Big Data as a term was ORIGINALLY COINED to describe what Google has to deal with. For example, Wikipedia says:
Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.
People have been dealing with multi-GB datasets for a lot longer than the term "Big Data" has been around, and it was not intended to refer to existing techniques. Unfortunately, lots of people (especially non-programmers) have tried to adopt it to describe what they do because it's buzzword-y and because Google used it.
> 80-90% of time is data-cleaning
That's Data, period. I have spent 80-90% of my time cleaning when dealing with a 1MB flat text file. That doesn't make it Big Data.
I won't argue about a particular phrase, as 1) buzzwords are more often misused than used properly and 2) there is no single, strict definition of it.
But going from ~MBs to ~GBs changes the methodology (of course, it's more about the number of items and their structure than sheer size). Even at this scale there is some quality/quantity tradeoff, and a change in which of the more complex algorithms remain feasible.
Data cleaning, sure, for all data. But when the volume is in MBs, it is still possible to open the file in a text editor and manually look for abnormalities. Once it's GBs, some artifacts will remain undetected.
The fact remains that nothing that can be analyzed on a single computer in a single threaded application in a single memory space is anything like "big data", unless you just want to tap into the trend of buzzword-laden link bait inbound marketing.
This is an interesting point because the typical amount of RAM in a modern computer is 8GB, with 16GB being a rare upgrade.
For data analyses which I've done, my 8GB computers will complain when analyzing 2-3GB datasets, because when processing the data the computer needs to make a copy in memory, since the analysis isn't in-place, and there are memory leaks (stupid R). So 4-6GB of memory usage causes hangs (since the OS is using a lot of memory too), and that is why I've had to resort to SQL for data sets that hit that size (which is perfectly fine).
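If you do want to stay out of SQL, one option is chunked processing, which keeps peak memory bounded. A minimal sketch; the file and column names are made up:

    import pandas as pd

    # Process the file in chunks instead of holding the whole thing plus
    # intermediate copies in memory at once (names here are hypothetical).
    totals = {}
    for chunk in pd.read_csv("big.csv", chunksize=500000):
        counts = chunk.groupby("category").size()
        for key, n in counts.items():
            totals[key] = totals.get(key, 0) + n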
I think it was submitted as "SQLite and Pandas for Medium Data Workflows" (that's what I see in my RSS reader), but may have been changed to reflect the article title.
That was not particularly big data in 1995. You'd just load it into Oracle running on a desktop machine and query it in SQL, draw some graphs in Developer/2000.
For a lot of companies that is a lot of data, yes. Maybe not big data, but the term is relative so if a business sits on 1000 times as much data as they have played with before, is it that unreasonable for them to say it is big data?
> ...is it that unreasonable for them to say it is big data?
Let's say I have an organization and we run into a dataset that is 1,000 times larger than anything we've dealt with. Should we put out a help-wanted ad for a "big data" developer? What if the largest dataset we had previously dealt with was 100 rows? The reason we have terms for things like this is to facilitate communication. If the definition is highly sensitive to context, then the term doesn't facilitate any communication, the whole context must still be spelled out. If the term is to have any meaning at all, it can't be a relative thing. Of course I'm of the opinion that the term is already meaningless, so I guess do whatever you want :)