Feather: A Fast On-Disk Format for Data Frames for R and Python (rstudio.org)
181 points by revorad on March 29, 2016 | 75 comments


HDF5 is supported by many languages including C, C++, R, and Python. It has compression built in. It can read slices easily. It is battle-tested, stable, and used in production for many years by thousands of people. Pandas even has integrated support for DataFrames stored in HDF5.

What's the advantage of Feather over HDF5? Couldn't the Feather libraries be written with the same API but HDF5 as the storage format, if the Feather API is preferable?
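
(For reference, the pandas HDF5 integration mentioned above looks roughly like the sketch below; a minimal example assuming pandas with the PyTables dependency installed, and hypothetical file/key names.)

    import pandas as pd

    # Store a frame in an HDF5 file via pandas's built-in HDFStore support
    # (backed by PyTables), then read it back with dtypes intact.
    df = pd.DataFrame({"x": range(5), "y": ["a", "b", "c", "d", "e"]})
    df.to_hdf("example.h5", "df", mode="w")
    restored = pd.read_hdf("example.h5", "df")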


Hi, Wes here.

HDF5 is a really great piece of software -- I wrote the first implementation of pandas's HDF5 integration (pandas.HDFStore) and Jeff Reback really went to town building out functionality and optimizing it for many different use cases.

But the HDF5 C libraries are a very heavy dependency. Feather by comparison is an extremely small amount of code (< 2KLOC in the core library) and a correspondingly minimal API. It's a simple file format with excellent performance, and we wanted to make it as easy as possible for people to use Feather.

There is also the Apache Arrow factor -- integration between the Arrow memory representation and R and Python tools will have a lot of ecosystem benefits, so one of the goals of Feather is to reconcile Python's and R's metadata requirements with the "official" Arrow metadata so that we can move around data frames with very low overhead.
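
(To illustrate how minimal the API is, reading and writing from Python looks roughly like the sketch below. The write_dataframe/read_dataframe names are my understanding of the Python bindings; treat them as an assumption and check the repo's README.)

    import pandas as pd
    import feather  # Python bindings around the small C++ core

    df = pd.DataFrame({"ints": [1, 2, 3], "strs": ["a", "b", "c"]})

    # One call to write a data frame, one call to read it back.
    feather.write_dataframe(df, "example.feather")
    df2 = feather.read_dataframe("example.feather")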


Is there a plan to add other language implementations (or C implementation wrappers)?

I’d love to see a nice format of this type that can easily be written/read from Javascript in a browser [e.g. to get the data into a D3 visualization] and from Matlab, in addition to Python.

I looked into trying to implement an HDF5 codec in Javascript, but that looked like a large task for one person unfamiliar with the format.


It's just a matter of someone implementing the protocol. It's not a huge amount of work for an experienced js/matlab programmer.


Is there a spec somewhere, or is the existing implementation the spec?

Edit: https://github.com/wesm/feather/blob/master/doc/FORMAT.md

Seems a bit sparse/incomplete still (as would be expected for a brand new project).


How does this contrast with the new Dask library in Python?


Dask is a compute framework. So you could use dask to create lots of Feather files, then perform computations.


HDF5 is a clunky file format and dependency. There's a whole host of usual complaints, many of which have already been mentioned: http://cyrille.rossant.net/moving-away-hdf5/

My biggest personal annoyance is that HDF5 isn't thread safe^, so it only supports parallel reading and writing via multiple processes. This makes parallel computing a pain.

This is especially annoying when using HDF5's built-in compression, which hogs a lot of CPU. Inter-process communication is slower than reading from SSDs, so that isn't a great alternative: http://matthewrocklin.com/blog/work/2015/12/29/data-bandwidt...

There's a lot to be said for file formats that you can simply memory map, and that's exactly what Feather/Arrow are. Building out-of-core workflows on top of them should be a joy.

Wes -- does the Python library for Feather already release the GIL?

^ you can use and/or compile HDF5 with a global lock, but the underlying library still isn't thread safe.


> Wes -- does the Python library for Feather already release the GIL?

Checking out the code: https://github.com/wesm/feather/blob/master/python/feather/l...

So the actual reading from and writing to disk should be with a released GIL if I'm reading this correctly. The conversion to and from arrays or dataframes holds the GIL.
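
(If that's right, the I/O portion should parallelize across threads with something like the sketch below; hypothetical file names, and assuming the read_dataframe entry point.)

    from concurrent.futures import ThreadPoolExecutor
    import feather

    paths = ["part-0.feather", "part-1.feather", "part-2.feather"]

    # The low-level reads release the GIL, so threads can overlap the disk I/O;
    # the conversion back to pandas objects still serializes on the GIL.
    with ThreadPoolExecutor(max_workers=4) as pool:
        frames = list(pool.map(feather.read_dataframe, paths))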


Concurrent writes would not be possible but parallel reads definitely are. Haven't put any multithreading work into the library yet but it shouldn't be a large task.


I found it quite frustrating to use HDF5. It does not handle variable-length strings well (very common). In Pandas, categoricals and MultiIndex are not supported. I found that settling for CSV and pickle is more reliable & robust. Also, HDF5 basically implements a hierarchical file system, which is overengineering IMO.


Are you talking about a different type of MultiIndex? http://pandas.pydata.org/pandas-docs/stable/advanced.html


That's the one. Don't recall the specifics, but combinations of MultiIndex as index and columns and the other features were not supported with HDF5.


HDF5 apparently only has one implementation, which is ridiculously bloated:

https://news.ycombinator.com/item?id=10858189


HDF5 is not thread-efficient. https://www.hdfgroup.org/hdf5-quest.html#tsafe

So, python packages like h5py do not even try to release the GIL.

This makes working with HDF5 very annoying in python (when using multiple threads).


Both Wes and I (project authors) will be tracking this thread in case you have questions!


Great idea! Some questions:

- Both R and Python support strings, factors, and complex objects in a dataframe. What is NOT supported by feather?

- Feather is "not for long term data storage". Will it be standardized at some point in the future?

- Do you plan to integrate it into Pandas?


Feather currently doesn't support recursive/hierarchical data structures, like lists in R. That'll be added in the future though. We'll definitely standardise the format so you can feel confident using it in the long term.

I have no plans to integrate it with pandas, but I'm sure Wes does ;)


Will you take a storage approach for nested data similar to Parquet?


Feather doesn't support a number of pandas features (but these are features that R doesn't really support, either): https://github.com/wesm/feather/blob/master/python/README.md (not comprehensive)


As one of many people flicking between R and python/pandas, do you feel there are other areas that have the potential for collaborative tools between the two communities?


I mentioned it elsewhere in the thread, but libdataframe.c seems like a pretty natural continuation to me.

EDIT: What I DON'T know is how much libdataframe would look like libsqlite.


Yes! Anything that is mostly C/C++ for performance could be shared between R and Python now that we have an easy interchange format.


> In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from feather to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.

Is "feather" in "insights from feather" the right word? It reads awkwardly to me, which could just be me lacking context.


Should be insights from _arrow_


My heart is palpitating. I love where this is going. Jake Vanderplas talked about the desire for a common data frame lib to unite the warring tribes in his PyCon keynote a year or so ago, and I couldn't have agreed more.

This appears to be "only" a serialization format ("oh, my unicorn only lays golden eggs"). I really hope this is the start of some common library infrastructure that can be used for all aspects of in- and out-of-memory data frames.

Great work, and I hope it is a harbinger of good things to come. Also, I'll treat this as tangible evidence that the "language wars" are stupid.


This is precisely the goal of the Apache Arrow project http://arrow.apache.org/ -- and I've been working very hard to bring together diverse groups of data system developers to work on this problem together. Exciting road ahead!


Ah jeez, I read the feather announce, but not the arrow docs. How did I not know about this!?


If anyone wants to get this running on Windows, I've made a start. In cpp\thirdparty, you can run the .sh scripts with cygwin, but in build_thirdparty you'll need to add to the appropriate section:

    elif [[ "$OSTYPE" == "cygwin"* ]]; then
      PARALLEL=$NUMBER_OF_PROCESSORS
And you'll need to use msbuild rather than make, e.g.:

      if [[ "$OSTYPE" == "cygwin"* ]]; then
        msbuild gtest.sln /p:configuration=release
That got the 3rd party stuff working. But then I hit a snag, because building Python 2.7 modules on Windows requires an old MSVC version that doesn't support stdint.h, which is used by feather in ext.cpp. Maybe a simple conditional compilation for the appropriate header will be enough to fix that, but I haven't got time to check today. So hopefully someone else can fix that...


I can see that Wes is the reporter of a Spark issue (https://issues.apache.org/jira/browse/SPARK-13534). What are the plans (if any) for tighter Spark/Python/R DataFrames integration?


I'd like to see SPARK-13534 completed so that we can bring Spark-to-Python/R data access performance to a level we can deem "acceptable". I dug into this issue a bit here: http://wesmckinney.com/blog/pandas-and-apache-arrow/


Are there any plans to support larger-than-RAM datasets, like HDF5 or bcolz do?


The format already supports larger than RAM data, but we don't yet have an API for creating those files or just extracting slices. That will come in the future.


Thanks for working on this. Really a great effort. Hope you guys can take good inspiration from projects like HDF5, bcolz, and PyTables, having the option of incrementally adding features and maintaining an open spec.


What are some practical uses for this if you're living in a pure R world?

Is this like BigMemory but for data frames?

https://cran.r-project.org/web/packages/bigmemory/index.html

Thanks.


It's often much faster than RDS. And in the long, long term there will be tools for computing on feather files that don't require loading them into memory. (In the short term I'll add ways to pull in slices of the full dataset.)


Faster because it isn't (currently) using compression (which rds uses by default) or faster period?

Either way, the idea of mixed Python/R pipelines with feather file intermediates input/outputs is pretty sweet. Learn in scikit, save to feather, plot in ggplot2... using Make to tie the pieces together?


It's usually faster than either compressed or uncompressed RDS - but if you have heavily duplicated data, compressed RDS can be faster than feather (depending on some tradeoff between compression speed and disk speed). Feather will probably gain compression support eventually.


Now if only there was a Julia package for this too


It shouldn't be hard, given that most of the work is being done by a shared C++ library.

(I'll probably take a stab at writing one this weekend, though)


What do you use Julia for?


This looks amazing!

Hadley, Wes, what are your thoughts on how to implement compression? I recall some open source columnar datastores (e.g. infobright) that achieved very VERY fast compression rates with just a few tricks: https://news.ycombinator.com/item?id=8354416

In particular, compression is extremely fast for columnar datastores (it's the same type, one value after another). Since a lot of times the data is sorted by some ID (date, individual, etc.), you should see large improvements in both speed and disk space.


Compression is on the long-term to-do list - Wes knows a lot more about it than me.


Any plans to support a pipe-aware POSIX command, or get this into PostgreSQL?


A Postgres foreign data wrapper to Arrow or Feather sounds pretty reasonable, maybe even inevitable.

https://wiki.postgresql.org/wiki/Foreign_data_wrappers


Not by us, but we expect that many other projects will add feather support now that it's used by both R and Python.


Is there any functional distinction between a data frame and a relational table? Could you implement a persistent frame just by wrapping sqlite?


It seems like C++ (C?) is also supported. Why don't you advertise it? In my (limited) experience, exchanging data with C/C++ is even more painful than between Python and R.


Quick request - could you hook into save.image in R and give the ability to save the entire workspace in R? That would be awesome.

Incidentally I had filed a feature request for functionality to save the entire workspace in Pandas...but it was rejected as being unpythonic. Oh, and the devs claimed Apache Arrow was vaporware!

https://github.com/pydata/pandas/issues/12381


I think saving your entire workspace is a bad idea too, sorry!


Could you talk about why? Other than the convenience factor (and R already does it), what makes it a bad idea?

Is it stemming from a fundamental aspect of the data format - for example can you save two data frames to the same file?

Because if you can save two - why not save two hundred.


It thwarts reproducibility. Saving your workspace drags a lot of state from session to session that isn't accounted for. If you share code with someone else, their workspace won't be the same, and thus the code may not function the same.


Point taken. But we are again delving dangerously close to thou-shalt-not. From my perspective, it is a quick and convenient way to save all the data frames in my code. It's a boon for productivity.

If not this, then I pray for Feather to be able to save multiple data frames in one file.


I don't see it as a thou-shalt-not. As a file format, feather is lightweight. If they turned it into a container format it would be expanding the scope. If they instrumented it to comb objects in the global namespace and serialize them to the new container format, it would be heavier still--all to support a feature that the authors view as an anti-pattern. That's less a thou shalt not than it is a prioritization of their own vision.

If you're looking for a container to store lots of tabular data in one file, I'd suggest SQLite. Using dplyr, you can save those dataframes very easily. Plus, you can join tables and perform efficient aggregations on datasets too large to keep in memory.

In a lot of ways, I don't understand what limitations prevent SQLite from becoming the defacto common data.frame format. There probably are some, I just don't understand the tradeoffs (especially given how much SQLite gives you for free)!
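
(The comment above uses dplyr, but the same idea from the Python side is just pandas plus the standard-library sqlite3 module; a sketch with hypothetical table and file names.)

    import sqlite3
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    with sqlite3.connect("workspace.db") as conn:
        # Store any number of data frames as tables in a single file...
        df.to_sql("measurements", conn, if_exists="replace", index=False)
        # ...and join/aggregate/read them back later without loading everything.
        back = pd.read_sql_query("SELECT * FROM measurements", conn)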


Actually this is interesting - why Feather vs SQLite? I would love to know the answer!

But coming back to the anti-pattern : well, obviously the authors have the power to not spend time on something. But I'm trying to figure out why it's an anti-pattern in general. Snapshotting execution state is probably the ideal goal, but saving intermediate data structures is a decent convenience feature.

Now if that's restricted by the limitations of the format itself (no multiple frames in a single file), then we are back to thinking that HDF5/sqlite may indeed be the better format.


Basically because you should be encoding state in code, not data. If you store data between sessions, it's easy to lose the code that you use to create it and then later on you can't recreate it.

It is convenient to save your complete workspace but I've seen too many cases where it's contributed to lack of reproducibility to spend my time working on it.


So there's a use case difference. I create models from remote data sources - this is incremental on a daily basis and takes quite a bit of time.

So I snapshot the workspace after I do a run and do some experiments. Now - for me, saving the workspace is a convenience feature, NOT a programming feature.

This is what I mean by thou-shalt-not. My use case is very well defined and I'm not stupid. And I completely know the pitfalls of what you talk about - but a philosophical opposition is what hurts me (and lots of devs like me).


I hope I didn't come across as "thou shalt not" - it's just never going to be high in my priority list.

(And even for your use case I would think you'd be better off keeping the models in a list and saving that. Then other random stuff in your env won't get carried along for the ride.)


Oh no you did not! That was polite musing. Thank you for the reply - I still hope you change your mind. Because people do have genuine, but different needs ;)


Any luck in getting it to work with SAS... Maybe in a 10.12 future release? My daily grind is SAS to CSV to R, because not everyone else has seen the light.


If SAS wanted to support it, it would be super easy for them to implement it.


Ignorant question: What does this solve that a csv doesn't? Type information?


Type information is one. CSV is slow, since it has to parse everything on load. CSV has no random access, since rows can be arbitrary length. CSV takes up a lot of disk space, since an 8-byte double gets expanded into a 15+ digit string.
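
(The disk-space point is easy to see; a sketch, with the feather function names assumed as above.)

    import os
    import numpy as np
    import pandas as pd
    import feather

    # 1e6 doubles: ~8 MB in memory and roughly that as Feather,
    # but each value becomes a ~19-character decimal string in CSV.
    df = pd.DataFrame({"x": np.random.randn(1000000)})
    df.to_csv("x.csv", index=False)
    feather.write_dataframe(df, "x.feather")
    print(os.path.getsize("x.csv"), os.path.getsize("x.feather"))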


Also there's no single, official CSV standard.


There's RFC 4180. Problem is, there is no universal acceptance in actually following it... :)


CSVs, in addition to being slow to parse (in practice, 40-100MB/s is the typical window, see some benchmarks here http://wesmckinney.com/blog/pandas-and-apache-arrow/), drop types (which have to be inferred), sometimes in a way that cannot be recovered (factor / category levels).

By comparison, Feather performs very close to disk performance. So speeds exceeding 500 MB/s (versus < 100MB/s for CSV) are common.
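
(The category-loss point is easy to reproduce with pandas alone; a minimal sketch.)

    import pandas as pd

    df = pd.DataFrame({"grade": pd.Categorical(["low", "high", "low"],
                                               categories=["low", "med", "high"])})
    df.to_csv("grades.csv", index=False)
    back = pd.read_csv("grades.csv")

    print(df["grade"].dtype)    # category, with the unused "med" level preserved
    print(back["grade"].dtype)  # object: the categorical levels are gone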


After reading through it, the spec seems to promise superior data interchange in pretty much every way while sacrificing about the only thing good about CSV: that most things can be coerced to produce or consume CSV with a simple read or print.


CSV parsing is relatively slow.


Not necessarily. Importing CSV data from disk in R can be pretty quick with 'fread', from the data.table package.[1]

Having said that, I'm pretty excited about feather and where development will lead. I use RDS files quite heavily, mostly because of the compression which allows much smaller file sizes for distribution.[2] However, there's a trade-off with parsing speed and also interoperability between languages. Looks like feather already has the interoperability part sorted, just waiting for compression now. Reading slices direct from disk is pretty exciting too.

[1] http://www.starkingdom.co.uk/faster-csv-import-with-r/

[2] http://www.starkingdom.co.uk/faster-import-with-r-redux/


Anything that needs to parse text into numbers will be slow. Your links just compare various slow methods and conclude that one isn't quite as slow as the others.

Of course it all depends on what you call 'slow'. Reading a few hundred megabytes of CSVs isn't going to be 'really' slow on modern hardware even if it was fgetc'd character by character.

Either way: anything that represents data as text will become a bottleneck when the size of the dataset grows.


The original comment was, "CSV parsing is relatively slow."

Relative to what?

The links I provided show that, using fread from data.table in R, parsing CSV data can be quicker than returning the same data from, for example, an SQLite database.

I believe the links do show CSV parsing relative to a couple of other common data parsing methods in R. Whether they're all "slow methods" is an entirely different question.

If you'd like to compare parsing CSV data relative to a "fast method", I'd very much like to read the analysis.


I guess in re-reading my comment, I phrased it in a tone that was more combative than I meant, so don't take the GP as an attack.

My point was, all conversions from text into numbers (I'm assuming here that the target use case is reading large amounts of numeric data) are slow, and the slow part isn't the IO but the conversion from text to numbers. In that light, it isn't a surprise that sqlite isn't much faster than csv, because sqlite doesn't have 'strong typing' itself. Any storage format that is concerned with speed will store data in binary format. But of course text formats are a lot easier to work with, and to interface between programs.

I sometimes (when I have a lot of data that I know needs multiple passes of reading) build quick and dirty 'caches' where I read file.csv and do what is basically a memory dump of the parsed data into file.csv.bin. My read functions can then check if that file exists and skip the parsing step. In my experience, this can be easily an order of magnitude (10x) faster. It's not portable or even elegant, of course.
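
(For what it's worth, that kind of cache is only a few lines in Python with NumPy; a sketch assuming a numeric CSV with a single header row.)

    import os
    import numpy as np

    def load_csv_cached(path):
        """Parse a numeric CSV once, then reuse a binary dump of the result."""
        cache = path + ".npy"
        if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(path):
            return np.load(cache)  # fast path: read the binary dump back
        data = np.loadtxt(path, delimiter=",", skiprows=1)  # the slow text-to-number parse
        np.save(cache, data)
        return data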

Apart from that - when speed is of importance, one wouldn't use R in the first place, of course (I say that as someone who likes R for what it is and for the things it does well).

I don't do write ups of speed analyses, nor do I know of any, so I have to cop out on that one.


The fact that storing data in binary format gives you more speed is exactly why you'd want to use Feather. It means you don't have to translate things into an unspecified .bin format. This is the answer to the original question about "why not use CSV".

I don't know about R, but in Python, operations on objects such as NumPy arrays and Pandas DataFrames are all implemented using fast C code, and so is Feather. You can care about speed and still stay in Python.


Yes, of course, and I'd much rather use something like it; and when it supports matrices with more than two dimensions, I will (or at least, I will look into it). In cases where I need more robust storage I already use HDF5 or NetCDF, but they're a PITA to work with.

Of course in-memory operations can be implemented efficiently, R does that too. But Python needs to parse CSV into numbers just like everybody else, and even if it's done in C underneath, it'll still be 'slow' (for some values of that word).


> Relative to what?

Relative to HDF5, and relative to Feather if it's doing its job right.



