
Ignorant question: What does this solve that a csv doesn't? Type information?


Type information is one. CSV is slow, since everything has to be parsed on load. CSV has no random access, since rows can be of arbitrary length. CSV takes up a lot of disk space, since an 8-byte double gets expanded into a 15+ digit string.
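
A rough illustration of that last point, using only the Python standard library (the value is arbitrary): an IEEE 754 double is always 8 bytes in binary, but its full-precision text form is more than twice that before you even add a delimiter.

    import struct

    x = 0.1234567890123456
    binary = struct.pack("<d", x)   # IEEE 754 double: always 8 bytes
    text = repr(x)                  # full-precision decimal string

    print(len(binary))              # 8
    print(len(text), text)          # 18 0.1234567890123456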


Also there's no single, official CSV standard.


There's RFC 4180. The problem is that there's no universal agreement on actually following it. :)


CSVs, in addition to being slow to parse (in practice, 40-100 MB/s is the typical range; see some benchmarks here: http://wesmckinney.com/blog/pandas-and-apache-arrow/), drop types (which have to be inferred), sometimes in a way that cannot be recovered (factor / category levels).

By comparison, Feather reads at close to disk speed, so throughput exceeding 500 MB/s (versus < 100 MB/s for CSV) is common.
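
A small sketch of the "cannot be recovered" point, assuming pandas with a Feather backend (pyarrow) installed; the file names are just placeholders. A categorical column survives a Feather round trip, but comes back as plain strings from CSV, with unused levels gone.

    import pandas as pd

    df = pd.DataFrame({
        "value": [1.5, 2.5, 3.5],
        "group": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    })

    df.to_csv("demo.csv", index=False)
    df.to_feather("demo.feather")

    print(pd.read_csv("demo.csv")["group"].dtype)          # object: levels lost
    print(pd.read_feather("demo.feather")["group"].dtype)  # category: levels kept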


After reading through it, the spec seems to promise superior data interchange in pretty much every way, while sacrificing about the only good thing about CSV: that most things can be coerced to produce or consume CSV with a simple read or print.


CSV parsing is relatively slow.


Not necessarily. Importing CSV data from disk in R can be pretty quick with 'fread', from the data.table package.[1]

Having said that, I'm pretty excited about Feather and where development will lead. I use RDS files quite heavily, mostly because of the compression, which allows much smaller file sizes for distribution.[2] However, there's a trade-off with parsing speed and also interoperability between languages. It looks like Feather already has the interoperability part sorted, so it's just waiting for compression now. Reading slices directly from disk is pretty exciting too.

[1] http://www.starkingdom.co.uk/faster-csv-import-with-r/

[2] http://www.starkingdom.co.uk/faster-import-with-r-redux/


Anything that needs to parse text into numbers will be slow. Your links just compare various slow methods and conclude that one isn't quite as slow as the others.

Of course, it all depends on what you call 'slow'. Reading a few hundred megabytes of CSVs isn't going to be 'really' slow on modern hardware, even if it were fgetc'd character by character.

Either way: anything that represents data as text will become a bottleneck when the size of the dataset grows.
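
A rough sketch of that claim in Python (timings will vary by machine, but the gap is the point): for the same million doubles, decoding binary is nearly free compared with parsing text, so the text representation itself becomes the bottleneck.

    import time
    import numpy as np

    xs = np.random.rand(1_000_000)
    as_text = "\n".join(repr(v) for v in xs)   # roughly what a CSV column holds
    as_bytes = xs.tobytes()                    # what a binary format stores

    t0 = time.perf_counter()
    parsed = np.array([float(s) for s in as_text.split("\n")])
    t1 = time.perf_counter()
    decoded = np.frombuffer(as_bytes, dtype=np.float64)
    t2 = time.perf_counter()

    print(f"parse text : {t1 - t0:.3f} s")
    print(f"read binary: {t2 - t1:.6f} s")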


The original comment was, "CSV parsing is relatively slow."

Relative to what?

The links I provided show that, using fread from data.table in R, parsing CSV data can be quicker than returning the same data from, for example, an SQLite database.

I believe the links do show CSV parsing relative to a couple of other common data parsing methods in R. Whether they're all "slow methods" is an entirely different question.

If you'd like to compare parsing CSV data relative to a "fast method", I'd very much like to read the analysis.


Re-reading my comment, I guess I phrased it in a more combative tone than I meant, so don't take the GP as an attack.

My point was that all conversions from text into numbers (I'm assuming here that the target use case is reading large amounts of numeric data) are slow, and the slow part isn't the I/O but the conversion from text to numbers. In that light, it isn't a surprise that SQLite isn't much faster than CSV, because SQLite doesn't have strong typing itself. Any storage format that is concerned with speed will store data in binary form.

But of course text formats are a lot easier to work with and to interface between programs. When I have a lot of data that I know will need multiple reading passes, I sometimes build quick and dirty 'caches': I read file.csv and do what is basically a memory dump of the parsed data into file.csv.bin. My read functions can then check whether that file exists and skip the parsing step. In my experience, this can easily be an order of magnitude (10x) faster. It's not portable or even elegant, of course.
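
A hedged sketch of that cache trick with NumPy (np.save's .npy format stands in for the raw .bin dump, and the helper name is made up), in case it's useful to anyone:

    import os
    import numpy as np

    def load_csv_cached(path):
        """Parse the CSV once, then reuse a binary sidecar file on later runs."""
        cache = path + ".npy"
        if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(path):
            return np.load(cache)               # fast path: no text parsing
        data = np.loadtxt(path, delimiter=",")  # slow path: text -> numbers
        np.save(cache, data)
        return data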

Apart from that, when speed is important, one wouldn't use R in the first place, of course (I say that as someone who likes R for what it is and for the things it does well).

I don't do write-ups of speed analyses, nor do I know of any, so I have to cop out on that one.


The fact that storing data in binary format gives you more speed is exactly why you'd want to use Feather. It means you don't have to translate things into an unspecified .bin format. This is the answer to the original question about "why not use CSV".
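
To make that concrete, here is a minimal pandas sketch (assuming pyarrow is installed for the Feather backend; the frame and file names are arbitrary) that writes the same frame both ways and times the reads; the Feather read should come out many times faster, in line with the numbers quoted above.

    import time
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(1_000_000, 10),
                      columns=[f"c{i}" for i in range(10)])
    df.to_csv("demo.csv", index=False)
    df.to_feather("demo.feather")

    t0 = time.perf_counter()
    pd.read_csv("demo.csv")
    t1 = time.perf_counter()
    pd.read_feather("demo.feather")
    t2 = time.perf_counter()

    print(f"read_csv    : {t1 - t0:.2f} s")
    print(f"read_feather: {t2 - t1:.2f} s")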

I don't know about R, but in Python, operations on objects such as NumPy arrays and pandas DataFrames are all implemented in fast C code, and so is Feather. You can care about speed and still use it.


Yes, of course, and I'd much rather use something like it; when it supports matrices with more than 2 dimensions, I will (or at least, I'll look into it). In cases where I need more robust storage I already use HDF5 or NetCDF, but they're a PITA to work with.
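
For reference, a minimal h5py sketch of the more-than-2-dimensional case (dataset name and shape are arbitrary); clunky API aside, HDF5 handles n-dimensional arrays and partial reads directly:

    import numpy as np
    import h5py

    cube = np.random.rand(10, 64, 64)   # 3-D array, e.g. a stack of images

    with h5py.File("demo.h5", "w") as f:
        f.create_dataset("cube", data=cube, compression="gzip")

    with h5py.File("demo.h5", "r") as f:
        first_slice = f["cube"][0]      # read just one slice from disk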

Of course in-memory operations can be implemented efficiently; R does that too. But Python needs to parse CSV into numbers just like everybody else, and even if it's done in C underneath, it'll still be 'slow' (for some values of that word).


> Relative to what?

Relative to HDF5, and relative to Feather if it's doing its job right.



