
When I was learning pandas, I saw a snippet of code in a notebook that looked like this:

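  import pandas as pd
  import seaborn

  def to_minutes(df, col):
    # Timedelta columns have no .dt.minutes; convert total seconds to minutes.
    # assign() returns a new frame instead of mutating the input.
    return df.assign(**{col: df[col].dt.total_seconds() / 60})

  (
    pd
    .read_csv('data.csv', parse_dates=['start', 'finish'])
    # A lambda is needed here: no variable named df exists inside the chain.
    .assign(dur=lambda df: df.finish - df.start)
    .pipe(to_minutes, col='dur')
    .query('dur < 30')
    .sort_values(by='dur')
    .pipe(
      seaborn.relplot,
      x='dur',
      y='distance',
      hue='city'
    )
  )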
It's a simple snippet, but the code structure is clever in several ways. Wrapping the whole expression, starting with pd, in () makes it possible to split the method chain across multiple lines without backslashes and to start each line with ".". Now we can treat DataFrame methods like the verbs of dplyr's data grammar, and "." like the %>% operator. It's easy to rearrange, add, and delete whole lines without checking whether we forgot a "." on the line above or below.

As long as every step returns a new DataFrame, nothing is mutated and there are no side effects, so it's safe to copy the code into a new notebook cell and tweak it a little without worrying about dozens of temporary df2_copy1 variables. The .pipe method lets you avoid "df = df[...]" junk, reuse code fragments, and even use plotting libraries in a ggplot-like fashion. If the chain becomes huge, just put it in a function and apply it to df with .pipe.

I wonder if it's possible to add memoization to the pipe method, to avoid recalculating the whole chain every time the cell is rerun during debugging.



Can you please give a link to the original source?



