Dato open sources SFrame – a disk-backed, compressed columnar data frame

sandGorgon · on Feb 16, 2016

This comment is very interesting:

I really appreciate you guys doing this. There's now serious consideration of building the Julia (julialang.org) data-analysis ecosystem on top of SFrames. It will hopefully allow our community to focus on the unique strengths of Julia, such as the ability to write arbitrary JITed user-defined aggregation and transformation functions, while not reinventing the wheel of low-level scalable data carpentry primitives.

Does anybody know how accurate this is? I would love to compare R data frames or Pandas to SFrame in terms of features and performance.

shele · on Feb 16, 2016

He actually wrote a Julia wrapper exploring this a bit more https://github.com/malmaud/SFrames.jl

haijieg · on Feb 16, 2016

For those who are interested in the technical details, this blog post by Yucheng Low explains the architecture of SFrame very well: http://blog.dato.com/data-processing-architecture-of-graphla...

shoyer · on Feb 16, 2016

The title here should have a (2015) in it -- this blog post is from September 25, 2015.

elyase · on Feb 16, 2016

What I don't get is: can you use SFrame without installing the graphlab framework at all? What are the limitations? Is there an example notebook using only SFrame and not graphlab?

haijieg · on Feb 16, 2016

Yes, SFrame is a subset of GraphLab Create. SFrame provides the core data structures to work with tabular and graph data at scale on a single machine. In addition, using GraphLab Create allows you to create/use/evaluate machine learning models.

Here is a user guide to get you started: https://dato.com/learn/userguide/sframe/introduction.html Just replace "import graphlab" with "import sframe".

rixed · on Feb 16, 2016

Can someone post a link to a comparison with other columnar database storage layers?

RyanHamilton · on Feb 16, 2016

Most other stores are databases http://www.timestored.com/time-series-data/column-oriented-d...

estefan · on Feb 16, 2016

Looks like an alternative to parquet - http://parquet.apache.org/

prodigal_erik · on Feb 16, 2016

Is a "data frame" just a table with (edit: a lot of) storage optimization?

chubot · on Feb 16, 2016

As far as I know, the term "data frame" comes from R and its predecessors like S, where a data frame is the core data structure. Logically, it is indeed like an SQL table -- a column has a single type whereas a row has heterogeneous types.

AFAIK, all implementations are column-oriented, which admits certain kinds of implementation and optimization. SQL databases are mostly row-oriented, probably since updating a row at a time is a common operation.

I would think of it as a table, but embedded in a programming language rather than a database (so you don't use SQL), with more operations, and which is very often used in a read-only fashion.

The syntax in R is nicer than SQL in my opinion. It's more algebraic and composable. Instead of "SELECT name, address FROM foo WHERE age > 30", you can write foo[foo$age > 30, c('name', 'address')].

Some links:

http://www.r-bloggers.com/select-operations-on-r-data-frames...

Pandas is a data frame library for Python, modeled on R's data frames:

http://pandas.pydata.org/pandas-docs/stable/basics.html

This article explains the relevance of the relational model to data analysis / statistics (rows are observations, columns are variables):

https://scholar.google.com/scholar?cluster=77966238326629329...
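
For readers who don't use R, the row filter plus column projection in the foo[...] example above can be sketched in plain Python. This is only a minimal illustration of the idea, not how R or Pandas implement it; the table `foo`, its sample values, and the helper `select` are all hypothetical.

```python
# Hypothetical column-oriented table: each key is a column name,
# each value is a list holding that column's values.
foo = {
    "name": ["Ann", "Bob", "Cid"],
    "address": ["1 Elm St", "2 Oak Ave", "3 Pine Rd"],
    "age": [25, 41, 37],
}


def select(table, mask, columns):
    """Keep the rows where mask is True, and only the named columns."""
    return {
        col: [value for value, keep in zip(table[col], mask) if keep]
        for col in columns
    }


# Step 1: build a boolean mask with one entry per row (like foo$age > 30 in R).
mask = [age > 30 for age in foo["age"]]

# Step 2: apply the mask and project two columns
# (like foo[mask, c('name', 'address')] in R).
result = select(foo, mask, ["name", "address"])
print(result)
```

The mask is an ordinary value, so it can be built once and reused or combined with other masks, which is roughly what makes this style feel more composable than a single SQL string.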

stewbrew · on Feb 16, 2016

"data frame is the core data structure"

AFAIK a data.frame in R is actually a list of vectors (i.e. columns) with some constraints.

chubot · on Feb 16, 2016

That's not true:

    > d=data.frame(a=c(1,2,3),b=c(4,5,6))
    > e=list(a=c(1,2,3),b=c(4,5,6))

    > class(d)
    [1] "data.frame"
    > class(e)
    [1] "list"

    > d[c(TRUE,FALSE),]
      a b
    1 1 4
    3 3 6

    > e[c(TRUE,FALSE),]
    Error in e[c(TRUE, FALSE), ] : incorrect number of dimensions

They are represented similarly in R, but they are distinct data types. The data frame is the core data structure in the sense that many functions in R operate on data frames (but not lists of vectors).

stewbrew · on Feb 16, 2016

1. You shouldn't use `=` for assignments in R but `<-`. `=` does late binding.

2. You shouldn't use `class()` here but `mode()` to check the actual underlying data structure.

    > mode(d)
    [1] "list"
    > mode(e)
    [1] "list"

3. The reason `[` works differently is that it is an S3 method which invokes different functions for lists and data.frames -- that's why class(d) doesn't return "list". See `methods("[")`.

See https://cran.r-project.org/doc/manuals/r-release/R-lang.html... for details.

infinite8s · on Feb 16, 2016

Same in pandas, although it's closer to a dictionary of column names to singly-typed vectors. Most simple columnar databases are structured that way as well (a more complicated design involves chunking the columns into pages).
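
A rough sketch of the "chunking the columns into pages" design, in plain Python. It assumes nothing about how Pandas, SFrame, or any real columnar database lays out its data; `PAGE_SIZE`, `to_pages`, and `PagedTable` are hypothetical names made up for the illustration.

```python
PAGE_SIZE = 4  # hypothetical; real stores would use far larger pages


def to_pages(values, page_size=PAGE_SIZE):
    """Split one column into fixed-size pages (the last page may be shorter)."""
    return [values[i:i + page_size] for i in range(0, len(values), page_size)]


class PagedTable:
    """Dictionary of column name -> list of pages, each page a list of values."""

    def __init__(self, columns, page_size=PAGE_SIZE):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.pages = {
            name: to_pages(list(values), page_size)
            for name, values in columns.items()
        }

    def scan(self, name):
        """Yield one column's values page by page, touching no other column."""
        for page in self.pages[name]:
            yield from page


table = PagedTable({
    "age": [25, 41, 37, 19, 52, 33],
    "name": ["a", "b", "c", "d", "e", "f"],
})

# Aggregating one column only walks that column's pages.
total = sum(table.scan("age"))
print(table.pages["age"], total)
```

In a disk-backed store each page would presumably be a separately compressed block that is loaded on demand, so a scan needs only one page of one column in memory at a time.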

visarga · on Feb 16, 2016

It's a list of hashes, just like in MongoDB, where you can apply selection both of records and of keys inside a record and compute on them. The real advantage is that it works fast with data larger than memory.

It can also be compared to the standard linux pipes. You can do head, tail, and cut-like operations on a data stream.
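
The pipes analogy can be sketched with Python generators. This is an illustration of the streaming idea only, not SFrame's API; `records`, `cut`, `head`, and `tail` are hypothetical helpers, and the record stream is generated in memory here where a real one would be read from disk.

```python
from collections import deque
from itertools import islice


def records():
    """Hypothetical record stream, produced one record at a time."""
    for i in range(1, 1001):
        yield {"id": i, "name": f"user{i}", "age": 20 + i % 50}


def cut(stream, keys):
    """Like cut(1): keep only the chosen keys of each record."""
    for rec in stream:
        yield {k: rec[k] for k in keys}


def head(stream, n):
    """Like head(1): stop after the first n records."""
    return islice(stream, n)


def tail(stream, n):
    """Like tail(1): keep only the last n records seen."""
    return deque(stream, maxlen=n)


# Roughly: records | cut -f id,name | head -n 3
first = list(head(cut(records(), ["id", "name"]), 3))

# Roughly: records | cut -f id | tail -n 2
last = list(tail(cut(records(), ["id"]), 2))

print(first)
print(last)
```

Each stage holds at most a handful of records at once (tail keeps n), which is why this style keeps working when the data is larger than memory.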

asdfologist · on Feb 16, 2016

It doesn't support Python 3 :(

kermatt · on Feb 16, 2016

Hopefully soon: http://blog.dato.com/state-of-the-sframe-2016

PeCaN · on Feb 16, 2016

A lot of scientific programming still uses Python 2, so that's not particularly surprising.