Discovering copy-on-write in R (dyer.me)
63 points by franklin_p_dyer on Dec 21, 2023 | 22 comments


Good article. Some smaller changes you could make to the final function: in the last line, `as.data.frame(do.call(cbind, out_list))` is used to convert the list to a data.frame. Passing it to `cbind` converts the list to a matrix (i.e. combines it into one long vector internally), and then `as.data.frame` converts it back to a list (as noted in the article, data frames are lists). Instead, you can use `as.data.frame(out_list)` to make your list a data frame directly and avoid converting it to a matrix and back to a list again. The `unlist(lapply(split(cvec, groups), aggfun))` is also doing a lot of work; if you don't mind using an external package*, `collapse::BY(cvec, groups, aggfun)` is much faster (it doesn't require converting `groups` to a factor and doesn't copy `cvec`'s contents the way `split` does).
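A minimal sketch of the difference, with a made-up `out_list` of two equal-length columns:

    out_list <- list(a = 1:3, b = c(1.5, 2.5, 3.5))

    # via a matrix: everything is coerced to one common type, then back to a list of columns
    df1 <- as.data.frame(do.call(cbind, out_list))

    # direct: each list element simply becomes a column, no intermediate matrix
    df2 <- as.data.frame(out_list)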

Here's some completely not-the-point-of-the-article code review, since I can't help myself. If you can set up the earlier steps to give you a named list for `col_grouping`, and use `lapply`, the code gets a little more concise:

    efficient_flow_agg <- function(dat, col_grouping, gpcol_name = "GroupMembership") {
      # apply one aggregation rule: preprocess its columns, aggregate each
      # column by group, then postprocess the per-group results
      make_postproc <- function(gp, groups) {
        gp$preproc(dat[gp$which_cols]) |>
          lapply(collapse::BY, groups, gp$aggfun) |>
          gp$postproc()
      }
      col_grouping |>
        lapply(make_postproc, groups = dat[[gpcol_name]]) |>
        as.data.frame()
    }
* I had previously written here that `tapply` is probably faster, but apparently `tapply` does exactly `unlist(lapply(split(x, g), f))` anyway? wtf R. Strange there's not something like `collapse::BY` in base R.
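For what it's worth, a quick sketch of that equivalence with made-up `x` and `g`:

    x <- c(3, 1, 4, 1, 5)
    g <- c("a", "b", "a", "b", "a")

    unlist(lapply(split(x, g), mean))  # named vector: a = 4, b = 1
    tapply(x, g, mean)                 # same values, returned as a 1-d array
    collapse::BY(x, g, mean)           # same values, without split()'s copies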


Thanks for the feedback! The business with `cbind` is a facepalm; I'll definitely fix that. I don't think it will affect performance much since that last step won't be repeated many times, but it makes me cringe now, knowing how redundant it is.

Good advice on `col_grouping` as well, accessing those components of an aggregation rule by index rather than by name is a bad code smell and decreases readability for sure.


The main issue with cbind()ing to a matrix and then to a data.frame is that all columns get coerced to a single common type, with potential loss of information.
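A tiny illustration with made-up columns: one numeric column next to one character column is enough to force everything to character on the matrix route.

    out_list <- list(n = c(1, 2, 3), s = c("a", "b", "c"))

    str(as.data.frame(do.call(cbind, out_list)))  # both columns end up character
    str(as.data.frame(out_list))                  # n stays numeric, s stays character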


The data.table package may also make a huge difference in performance, and it often simplifies the code as well: https://github.com/Rdatatable/data.table
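As a rough sketch only (reusing the thread's `dat` and `GroupMembership` names, assuming the remaining columns are numeric, and using `mean` as a stand-in aggregation; this doesn't reproduce the article's per-rule preproc/postproc machinery), a grouped aggregation collapses to one line:

    library(data.table)

    dt <- as.data.table(dat)
    # mean of every non-grouping column, computed per group
    agg <- dt[, lapply(.SD, mean), by = GroupMembership]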


I’d also be interested to know how the tidyverse’s “tibbles” perform here.


Copy-on-write is really nice, especially since I often face a very large read-only matrix (200+ GB) and want to run some embarrassingly parallel processing on subsets of it. I haven't found another language that makes it as easy: not Python (which isn't unexpected), not even Julia.
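On Unix-alikes this falls out of forked workers: a sketch with `parallel::mclapply`, where `big_mat` (the large read-only matrix) and `chunks` (a list of column-index vectors) are made-up names. As long as the workers only read, the OS shares the matrix's pages copy-on-write instead of duplicating it per process.

    library(parallel)

    # each forked worker reads from the shared matrix; the full matrix is
    # never duplicated per worker, only each small slice gets materialized
    results <- mclapply(chunks,
                        function(cols) colSums(big_mat[, cols]),
                        mc.cores = 8)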


> not python

Pandas has a global option to turn on copy-on-write.

https://pandas.pydata.org/docs/dev/user_guide/copy_on_write....


News to me! Will definitely break some of my current code (chained assignments no longer work), but is probably a more sensible default.

It's slated to be the default mode in Pandas 3, but seeing how long it took them to pull the trigger on Pandas 2, that could be a while.


If you haven’t tried Swift, copy-on-write is one of the core tenets of its value types (`struct`s, basically), and it’s almost entirely transparent.


It's only transparent for types that already implement copy-on-write. For custom types you need to implement it yourself, using reference-typed private properties.


>” How to prevent this? The obvious way is to just not use dataframes, at least not while doing aggregation. Rather than allocating a huge dataframe and loading our partial results into its columns bit by bit, we can just store our partial results in a plain list.”

A lot to be said for not defaulting to data frames, in both R and Python. Or, if you must, use something like R's data.table or Python's polars if you don't think in other data structures easily or just want convenience.
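For reference, the pattern the quoted passage describes looks roughly like this (a sketch reusing the thread's `dat` / `GroupMembership` names, with `mean` standing in for whatever aggregation you actually need): build each aggregated column into a plain list and convert to a data frame exactly once at the end.

    val_cols <- setdiff(names(dat), "GroupMembership")

    out_list <- list()
    for (col in val_cols) {
      # aggregate one column at a time into the plain list; no large data
      # frame is rebuilt (and copied) on every assignment
      out_list[[col]] <- tapply(dat[[col]], dat[["GroupMembership"]], mean)
    }
    result <- as.data.frame(out_list)  # single conversion at the very end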


> A lot to be said for not defaulting to data frames, in both R and Python

I would even add: especially in Python. The main issue I have found is that pandas-heavy code is just not as easy to integrate into other Python tools/features/abstractions as code that mostly uses numpy, dictionaries, and various comprehensions to do the vast majority of the work.

As a heavy pandas user for several years, I decided about a year ago to stop importing pandas by default and instead treat most data problems like regular Python problems. I've been genuinely surprised at how much easier it is to create useful abstractions with the code I've been writing, and also how much easier it's been to onboard non-DS devs into the code base.

There are a few obvious cases when Pandas is very helpful, and I'll pull it out in those places, but I've been able to do a tremendous amount of data work in the last year and used very little pandas. The result is that I have an actual codebase to work with now rather than a billion broken notebooks.


> The result is that I have an actual codebase to work with now rather than a billion broken notebooks.

This is the biggest part. Giving yourself permission to make real abstractions, rather than forcing yourself to go directly from data-on-disk to pandas (or whatever) makes it that much easier to test, repeat, modify, and extend whatever analysis you're working on.


In what cases have you found it worthwhile to use pandas?


Resampling, regularizing, binning, and forward/backward-filling time series data is an absolute pain in the ass using only SQL and/or vanilla Python. It does its thing well there.

(Note that in general, I'm the biggest pandas hater I know)


It can be nice for groupby-aggregate logic. And it feeds into plotnine.


For sure! Definitely a good thing to know for an R newbie like me who is handling large datasets naively.

Thanks for mentioning polars, I hadn't heard of it before but it looks neat.


There is a macro like this for Common Lisp: modf.

https://github.com/smithzvk/modf

With modf, you use the existing place syntax to refer to part of an object. It looks like you're mutating that object, but in fact it will return a clone of the entire containing object, with the modification, while the original remains untouched.


Sir, I worked through your dog & cat adoption example. I had a few questions. So you have the x and y vectors and want the vector z, as below:

    x <- c(1, 2, 3)
    y <- c(4, 5, 6)
    z <- c(1+4, (1*2 + 4*5)/(1+4), sqrt((1*(4+9) + 4*(25+36))/(1+4) - ((1*2 + 4*5)/(1+4))^2))
But your z[2] is just an elaborate weighted mean; I would use the following built-in function instead:

    weighted.mean(c(2, 5), c(1/(1+4), 4/(1+4)))
Similarly, your z[3] is just the weighted standard deviation, available in the modi library. I was wondering: isn't it better to store the data in one vector x and the weights in another vector y, and compute the weighted mean and weighted variance in a straightforward fashion like the above? Or am I missing something? Thanks.


This is great. I've seen many discussions of efficiency in R in stats and biology circles over the years but I think this is the first time I've seen something as simple as copy-on-write and treating dataframe columns as individual lists mentioned. I'd be interested in how much faster it was.


Matlab/Octave are also copy-on-write. It's quite a powerful mechanism to take advantage of when you're aware of it.


Good to see some R programmers around despite the dominance of Python in data science these days.



