For me, I use R data.table a lot and I see as the main advantages are performanc...

jarenmf · on Jan 16, 2022

Indeed, data.table is just awesome for productivity. When you're manipulating data for exploration you want the least number of keystrokes to bring an idea to life and data.table gives you that.

VeninVidiaVicii · on Jan 16, 2022

I totally agree. I often find myself wanting data.table as a standalone database platform or ORM-type interface for non-statistical programming too.

boppo1 · on Jan 16, 2022

What is terse syntax? I can parse lisp and C, how would this be different and challenging?

bckygldstn · on Jan 16, 2022

The syntax isn't self-describing and uses lots of abbreviations; it relies on some R magic that I found confusing when learning (unquoted column names and special builtin variables); and data.table is just a different approach to SQL and other dataframe libraries.

Here's an example from the docs

  flights[carrier == "AA",
    lapply(.SD, mean),
    by = .(origin, dest, month),
    .SDcols = c("arr_delay", "dep_delay")]

that's clearly less clear than SQL

  SELECT
    origin, dest, month,
    MEAN(arr_delay), MEAN(dep_delay)
  FROM flights
  WHERE carrier == "AA"
  GROUP BY arr_delay, dep_delay

or pandas

  flights[filghts.carrier == 'AA'].groupby(['arr_delay', 'dep_delay']).mean()

But once you get used to it data.table makes a lot of sense: every operation can be broken down to filtering/selecting, aggregating/transforming, and grouping/windowing. Taking the first two rows per group is a mess in SQL or pandas, but is super simple in data.table

  flights[, head(.SD, 2), by = month]

That data.table has significantly better performance than any other dataframe library in any language is a nice bonus!

hervature · on Jan 16, 2022

Taking the first two rows is a mess in pandas?

flights.groupby("month").head(2)

Not only is does this have all the same keywords, but it is organized in a much clearer way to newcomers and labels things to look up in the API. Whereas your R code has a leading comma, .SD, and a mix of quotes and non-quotes for references to columns. You even admit the last was confusing to learn. This can all be crammed in your head, but not what I would call thoughtfully designed.

Bootvis · on Jan 17, 2022

I agree the example in GP is not convincing. Consider the following table of ordered events:

    | Date | EventType |

and I want to find the count, and the first and last date of an event of a certain type happening in 2020:

    events[
        year(Date) == 2020L, 
        .(first_date = first(Date), last_date = last(Date), count = .N),
        EventType
    ]

Using first and last on ordered data will be very fast thanks to something called GForce.

When exploring data, I wouldn't need or use any whitespace. How would your Pandas approach look like?

hervature · on Jan 17, 2022

To do that, the code would look something like:

mask = events["Date"].year == 2020 events[mask].groupby("EventType").agg(first_date=("Date", min), last_date=("Date", max), count=("Date", len))

Anyway, I don't understand why terseness is even desirable. We're doing DS and ML, no project never comes down to keystrokes but ability to search the docs and debug does matter.

Bootvis · on Jan 17, 2022

It helps in quickly improving your understanding of the data by being able to answer simple but important questions quicker. In this contrived example I would want to know:

- How many events by type

- When did they happen

- Are there any breaks in the count, why?

- Some statistics on these events like average, min, max

and so on. Terseness helps me in doing this fast.

kgwgk · on Jan 16, 2022

You mean something like

    SELECT
    origin, dest, month, AVG(arr_delay), AVG(dep_delay)
    FROM flights
    WHERE carrier == 'AA'
    GROUP BY origin, dest, month

and

    flights[flights.carrier == 'AA'].groupby(['origin', 'dest', 'month'])[['arr_delay', 'dep_delay']].mean()

bckygldstn · on Jan 16, 2022

Yep thanks, you can tell I use a "guess and check" approach to writing sql and pandas...