HDF5 is supported by many languages including C, C++, R, and Python. It has compression built in. It can read slices easily. It is battle-tested, stable, and used in production for many years by thousands of people. Pandas even has integrated support for DataFrames stored in HDF5.
What's the advantage of Feather over HDF5? Couldn't the Feather libraries be written with the same API but HDF5 as the storage format, if the Feather API is preferable?
HDF5 is a really great piece of software -- I wrote the first implementation of pandas's HDF5 integration (pandas.HDFStore) and Jeff Reback really went to town building out functionality and optimizing it for many different use cases.
But the HDF5 C libraries are a very heavy dependency. Feather by comparison is an extremely small amount of code (< 2 KLOC in the core library) and a correspondingly minimal API. It's a simple file format with excellent performance, and we wanted to make it as easy as possible for people to use Feather.
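To give a sense of how small the surface area is, here's a minimal sketch using the Python feather package (this assumes the feather-format package and pandas are installed; the file name is made up):

    import pandas as pd
    import feather  # the feather-format package

    # Build a small data frame and round-trip it through Feather.
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    feather.write_dataframe(df, "example.feather")
    df2 = feather.read_dataframe("example.feather")
    assert df.equals(df2)

That's essentially the whole API: one function to write, one to read.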
There is also the Apache Arrow factor -- integration between the Arrow memory representation and R and Python tools will have a lot of ecosystem benefits, so one of the goals of Feather is to reconcile Python's and R's metadata requirements with the "official" Arrow metadata so that we can move around data frames with very low overhead.
Is there a plan to add other language implementations (or C implementation wrappers)?
I’d love to see a nice format of this type that can easily be written/read from Javascript in a browser [e.g. to get the data into a D3 visualization] and from Matlab, in addition to Python.
I looked into trying to implement an HDF5 codec in Javascript, but that looked like a large task for one person unfamiliar with the format.
My biggest personal annoyance is that HDF5 isn't thread safe^, so it only supports parallel reading and writing via multiple processes. This makes parallel computing a pain.
This is especially annoying when using HDF5's built-in compression, which hogs a lot of CPU. Inter-process communication is slower than reading from SSDs, so that isn't a great alternative:
http://matthewrocklin.com/blog/work/2015/12/29/data-bandwidt...
There's a lot to be said for file formats that you can simply memory map, and that's exactly what Feather/Arrow are. Building out-of-core workflows on top of them should be a joy.
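As a toy illustration of why memory-mappable layouts matter (this is plain numpy, not Feather -- just the general point that slicing a mapped file only touches the pages you read):

    import numpy as np

    # Write a toy binary file, then memory-map it read-only.
    np.arange(1000000, dtype=np.float64).tofile("big.bin")
    arr = np.memmap("big.bin", dtype=np.float64, mode="r")

    # Slicing is lazy: only the touched pages are read from disk.
    chunk = arr[250000:250100]
    print(chunk.mean())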
Wes -- does the Python library for Feather already release the GIL?
^ you can use and/or compile HDF5 with a global lock, but the underlying library still isn't thread safe.
So the actual reading from and writing to disk should happen with the GIL released, if I'm reading this correctly. The conversion to and from arrays or DataFrames holds the GIL.
Concurrent writes would not be possible, but parallel reads definitely are. I haven't put any multithreading work into the library yet, but it shouldn't be a large task.
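If reads do release the GIL as described, a plain thread pool should already overlap the disk I/O. A sketch (the shard file names are hypothetical):

    import concurrent.futures
    import feather

    paths = ["part-0.feather", "part-1.feather", "part-2.feather"]

    # With the GIL released during file I/O, these reads can overlap in
    # real threads; only the conversion to DataFrames is serialized.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        frames = list(pool.map(feather.read_dataframe, paths))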
I found it quite frustrating to use HDF5. It does not handle variable-length strings well (very common). In Pandas, categoricals and MultiIndex are not supported. I found that settling for CSV and pickle is more reliable & robust. Also, HDF5 basically implements a hierarchical file system, which is overengineering IMO.
Feather currently doesn't support recursive/hierarchical data structures, like lists in R. That'll be added in the future though. We'll definitely standardise in the future so you can feel confident using it in the long term.
I have no plans to integrate it with pandas, but I'm sure Wes does ;)
As one of many people flicking between R and python/pandas, do you feel there are other areas that have the potential for collaborative tools between the two communities?
In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from feather to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.
Is "feather" in "insights from feather" the right word? It reads awkwardly to me, which could just be me lacking context.
My heart is palpitating. I love where this is going. Jake Vanderplas talked about the desire for a common data frame lib to unite the warring tribes in his PyCon keynote a year or so ago, and I couldn't have agreed more.
This appears to be "only" a serialization format ("oh, my unicorn only lays golden eggs"). I really hope this is the start of some common library infrastructure that can be used for all aspects of in- and out-of-memory data frames.
Great work, and I hope it is a harbinger of good things to come. Also, I'll treat this as tangible evidence that the "language wars" are stupid.
This is precisely the goal of the Apache Arrow project http://arrow.apache.org/ -- and I've been working very hard to bring together diverse groups of data system developers to work on this problem together. Exciting road ahead!
If anyone wants to get this running on Windows, I've made a start. In cpp\thirdparty, you can run the .sh scripts with cygwin, but in build_thirdparty you'll need to add to the appropriate section:
    elif [[ "$OSTYPE" == "cygwin"* ]]; then
      PARALLEL=$NUMBER_OF_PROCESSORS
And you'll need to use msbuild rather than make, e.g.:
    if [[ "$OSTYPE" == "cygwin"* ]]; then
      msbuild gtest.sln /p:configuration=release
That got the 3rd party stuff working. But then I hit a snag: building Python 2.7 modules on Windows requires an old MSVC version that doesn't support stdint.h, which feather uses in ext.cpp. Maybe a simple conditional include of the appropriate header will be enough to fix that, but I haven't got time to check today. So hopefully someone else can fix that...
I'd like to see SPARK-13534 completed so that we can bring Spark-to-Python/R data access performance to a level we can deem "acceptable". I dug into this issue a bit here: http://wesmckinney.com/blog/pandas-and-apache-arrow/
The format already supports larger-than-RAM data, but we don't yet have an API for creating those files or for just extracting slices. That will come in the future.
Thanks for working on this. Really a great effort. Hope you guys can take inspiration from projects like HDF5, bcolz, and PyTables, which incrementally add features while maintaining an open spec.
It's often much faster than RDS. And in the long term there will be tools for computing on feather files that don't require loading them into memory. (In the short term I'll add ways to pull in slices of the full dataset.)
Faster because it isn't (currently) using compression (which RDS uses by default), or faster, period?
Either way, the idea of mixed Python/R pipelines with feather files as intermediate inputs/outputs is pretty sweet. Learn in scikit, save to feather, plot in ggplot2... using Make to tie the pieces together?
It's usually faster than either compressed or uncompressed RDS - but if you have heavily duplicated data, compressed RDS can be faster than feather (depending on some tradeoff between compression speed and disk speed). Feather will probably gain compression support eventually.
Hadley, Wes, what are your thoughts on how to implement compression?
I recall some open source columnar datastores (e.g. infobright) that achieved very VERY fast compression rates with just a few tricks:
https://news.ycombinator.com/item?id=8354416
In particular, compression is extremely fast for columnar datastores (it's values of the same type one after the other). Since a lot of the time the data is sorted by some ID (date, individual, etc.), you should see large improvements in both speed and disk space.
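A toy demonstration of that effect, using zlib as a stand-in for a real columnar codec (exact numbers will vary, but the gap is large):

    import zlib
    import numpy as np

    # A sorted ID column: long runs of near-identical bytes compress very well.
    ids = np.sort(np.random.randint(0, 1000, size=1000000)).astype(np.int64)
    sorted_ratio = len(zlib.compress(ids.tobytes(), 1)) / len(ids.tobytes())

    # The same values shuffled (a proxy for row-interleaved storage) do much worse.
    np.random.shuffle(ids)
    shuffled_ratio = len(zlib.compress(ids.tobytes(), 1)) / len(ids.tobytes())

    print(sorted_ratio, shuffled_ratio)  # e.g. ~0.01 vs ~0.3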
It seems like C++ (C?) is also supported. Why don't you advertise it? In my (limited) experience, exchanging data with C/C++ is even more painful than between Python and R.
Quick request - could you hook into save.image in R and give the ability to save the entire workspace in R? That would be awesome.
Incidentally, I had filed a feature request for functionality to save the entire workspace in Pandas... but it was rejected as being unpythonic. Oh, and the devs claimed Apache Arrow was vaporware!
It thwarts reproducibility. By saving your workspace, it drags a lot of state from session to session that isn't accounted for. If you share code with someone else, their workspace won't be the same, and thus the code may not function the same.
Point taken. But we are again veering dangerously close to thou-shalt-not. From my perspective, it is a quick and convenient way to save all the data frames in my code. It's a boon for productivity.
If not this, then I pray for Feather to be able to save multiple data frames in one file.
I don't see it as a thou-shalt-not. As a file format, feather is lightweight. If they turned it into a container format, it would be expanding the scope. If they instrumented it to comb objects in the global namespace and serialize them to the new container format, it would be heavier still -- all to support a feature that the authors view as an anti-pattern. That's less a thou-shalt-not than it is a prioritization of their own vision.
If you're looking for a container to store lots of tabular data in one file, I'd suggest SQLite. Using dplyr, you can save those dataframes very easily. Plus, you can join tables and perform efficient aggregations on datasets too large to keep in memory.
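The same idea from Python, with sqlite3 and pandas (the table and frame names here are made up):

    import sqlite3
    import pandas as pd

    df_train = pd.DataFrame({"label": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
    df_test = pd.DataFrame({"label": ["b"], "x": [4.0]})

    con = sqlite3.connect("workspace.db")

    # Store several data frames as tables in a single file.
    df_train.to_sql("train", con, if_exists="replace", index=False)
    df_test.to_sql("test", con, if_exists="replace", index=False)

    # Read them back, or aggregate in SQL without loading everything.
    train = pd.read_sql("SELECT * FROM train", con)
    counts = pd.read_sql("SELECT label, COUNT(*) AS n FROM train GROUP BY label", con)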
In a lot of ways, I don't understand what limitations prevent SQLite from becoming the de facto common data.frame format. There probably are some, I just don't understand the tradeoffs (especially given how much SQLite gives you for free)!
Actually, this is interesting - why Feather vs SQLite? I would love to know the answer!
But coming back to the anti-pattern: well, obviously the authors have the power to not spend time on something. But I'm trying to figure out why it's an anti-pattern in general. Snapshotting execution state is probably the ideal goal, but saving intermediate data structures is a decent convenience feature.
Now if that's restricted by the limitations of the format itself (no multiple frames in a single file), then we are back to thinking that HDF5/SQLite may indeed be the better format.
Basically because you should be encoding state in code, not data. If you store data between sessions, it's easy to lose the code that you use to create it and then later on you can't recreate it.
It is convenient to save your complete workspace but I've seen too many cases where it's contributed to lack of reproducibility to spend my time working on it.
So there's a use case difference. I create models from remote data sources - this is incremental on a daily basis and takes quite a bit of time.
So I snapshot the workspace after I do a run and do some experiments. Now - for me, saving the workspace is a convenience feature, NOT a programming feature.
This is what I mean by thou-shalt-not. My use case is very well defined and I'm not stupid. And I completely know the pitfalls of what you talk about - but a philosophical opposition is what hurts me (and lots of devs like me).
I hope I didn't come across as "thou shalt not" - it's just never going to be high in my priority list.
(And even for your use case I would think you'd be better off keeping the models in a list and saving that. Then other random stuff in your env won't get carried along for the ride.)
Oh no you did not! That was polite musing. Thank you for the reply - I still hope you change your mind. Because people do have genuine, but different needs ;)
Any luck in getting it to work with SAS... Maybe in a 10.12 future release?
My daily grind is SAS to csv to R, because not everyone else has seen the light.
Type information is one. CSV is slow, since it has to parse everything on load. CSV has no random access, since rows can be arbitrary length. CSV takes up a lot of disk space, since an 8-byte double gets expanded into a 15+ digit string.
CSVs, in addition to being slow to parse (in practice, 40-100MB/s is the typical window, see some benchmarks here http://wesmckinney.com/blog/pandas-and-apache-arrow/), drop types (which have to be inferred), sometimes in a way that cannot be recovered (factor / category levels).
By comparison, Feather performs very close to disk speed. So speeds exceeding 500 MB/s (versus < 100 MB/s for CSV) are common.
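A rough way to see the gap yourself (a sketch; exact numbers depend heavily on hardware and data):

    import time
    import numpy as np
    import pandas as pd
    import feather

    df = pd.DataFrame(np.random.randn(1000000, 10))
    df.columns = [str(c) for c in df.columns]  # use string column names
    df.to_csv("data.csv", index=False)
    feather.write_dataframe(df, "data.feather")

    for name, read in [("csv", lambda: pd.read_csv("data.csv")),
                       ("feather", lambda: feather.read_dataframe("data.feather"))]:
        start = time.time()
        read()
        print(name, time.time() - start)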
After reading through it, the spec seems to promise superior data interchange in pretty much every way while sacrificing about the only good thing about CSV: that most things can be coerced to produce or consume CSV with a simple read or print.
Not necessarily. Importing CSV data from disk in R can be pretty quick with 'fread', from the data.table package.[1]
Having said that, I'm pretty excited about feather and where development will lead. I use RDS files quite heavily, mostly because of the compression, which allows much smaller file sizes for distribution.[2] However, there's a trade-off with parsing speed and also interoperability between languages. Looks like feather already has the interoperability part sorted, just waiting for compression now. Reading slices direct from disk is pretty exciting too.
Anything that needs to parse text into numbers will be slow. Your links just compare various slow methods and conclude that one isn't quite as slow as the others.
Of course it all depends on what you call 'slow'. Reading a few hundred megabytes of CSVs isn't going to be 'really' slow on modern hardware, even if it was fgetc'd character by character.
Either way: anything that represents data as text will become a bottleneck when the size of the dataset grows.
The original comment was, "CSV parsing is relatively slow."
Relative to what?
The links I provided show that, using fread from data.table in R, parsing CSV data can be quicker than returning the same data from, for example, an SQLite database.
I believe the links do show CSV parsing relative to a couple of other common data parsing methods in R. Whether they're all "slow methods" is an entirely different question.
If you'd like to compare parsing CSV data relative to a "fast method", I'd very much like to read the analysis.
I guess in re-reading my comment, I phrased it in a tone that was more combative than I meant, so don't take the GP as an attack.
My point was, all conversions from text into numbers (I'm assuming here that the target use case is reading large amounts of numeric data) are slow, and the slow part isn't the IO but the conversion from text to numbers. In that light, it isn't a surprise that sqlite isn't much faster than csv, because sqlite doesn't have 'strong typing' itself. Any storage format that is concerned with speed will store data in binary format. But of course text formats are a lot easier to work with, and to interface between programs.

I sometimes (when I have a lot of data that I know needs multiple passes of reading) build quick and dirty 'caches' where I read file.csv and do what is basically a memory dump of the parsed data into file.csv.bin. My read functions can then check if that file exists and skip the parsing step. In my experience, this can easily be an order of magnitude (10x) faster. It's not portable or even elegant, of course.
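That quick-and-dirty cache might look something like this (a sketch; the hard-coded shape and dtype are just for illustration):

    import os
    import numpy as np

    def read_matrix(path):
        """Parse the CSV once, then reuse a raw binary dump on later reads."""
        cache = path + ".bin"
        if os.path.exists(cache):
            # Shape/dtype must be known or stored elsewhere; hard-coded here.
            return np.fromfile(cache, dtype=np.float64).reshape(-1, 10)
        data = np.loadtxt(path, delimiter=",")  # the slow text-parsing step
        data.astype(np.float64).tofile(cache)   # quick and dirty memory dump
        return data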
Apart from that - when speed is of importance, one wouldn't use R in the first place, of course (I say that as someone who likes R for what it is and for the things it does well).
I don't do write ups of speed analyses, nor do I know of any, so I have to cop out on that one.
The fact that storing data in binary format gives you more speed is exactly why you'd want to use Feather. It means you don't have to translate things into an unspecified .bin format. This is the answer to the original question about "why not use CSV".
I don't know about R, but in Python, operations on objects such as NumPy arrays and Pandas DataFrames are all implemented in fast C code, and so is Feather. So you can care about speed and still use Python.
Yes, of course, and I'd much rather use something like it; and when it will support matrices with more than 2 dimensions, I will (or at least, I will look into it). In cases where I need more robust storage I already use HDF5 or NetCDF but they're a PITA to work with.
Of course in-memory operations can be implemented efficiently, R does that too. But Python needs to parse CSV into numbers just like everybody else, and even if it's done in C underneath, it'll still be 'slow' (for some values of that word).