Taking the first two rows is a mess in pandas? flights.groupby("month").head(2) ...

Bootvis · on Jan 17, 2022

I agree the example in GP is not convincing. Consider the following table of ordered events:

    | Date | EventType |

and I want to find the count, and the first and last date of an event of a certain type happening in 2020:

    events[
        year(Date) == 2020L, 
        .(first_date = first(Date), last_date = last(Date), count = .N),
        EventType
    ]

Using first and last on ordered data will be very fast thanks to something called GForce.

When exploring data, I wouldn't need or use any whitespace. What would your pandas approach look like?

hervature · on Jan 17, 2022

To do that, the code would look something like:

    mask = events["Date"].dt.year == 2020
    events[mask].groupby("EventType").agg(first_date=("Date", "min"), last_date=("Date", "max"), count=("Date", "size"))
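For anyone who wants to run this, here is a minimal self-contained sketch. The toy `events` frame is made up for illustration, and it assumes `Date` is already a datetime column sorted in order, so `"first"` and `"last"` play the role of data.table's `first()` and `last()`:

```python
import pandas as pd

# Hypothetical toy data standing in for the `events` table (Date, EventType),
# already sorted by Date.
events = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2019-12-30", "2020-01-05", "2020-03-10", "2020-07-01", "2021-02-02"]
        ),
        "EventType": ["a", "a", "b", "a", "b"],
    }
)

# The .dt accessor exposes datetime components such as the year on a Series.
mask = events["Date"].dt.year == 2020

# Named aggregation: output_column=(input_column, aggregation).
summary = events[mask].groupby("EventType").agg(
    first_date=("Date", "first"),  # first row per group, relies on sorted input
    last_date=("Date", "last"),    # last row per group, relies on sorted input
    count=("Date", "count"),       # number of non-null dates per group
)
print(summary)
```

If the data is not guaranteed to be sorted, `"min"` and `"max"` are the safer choice for the date columns.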

Anyway, I don't understand why terseness is even desirable. We're doing DS and ML; no project ever comes down to keystrokes, but the ability to search the docs and debug does matter.

Bootvis · on Jan 17, 2022

It helps you understand the data faster by letting you answer simple but important questions more quickly. In this contrived example I would want to know:

- How many events by type

- When did they happen

- Are there any breaks in the count, why?

- Some statistics on these events like average, min, max

and so on. Terseness helps me do this fast.
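For comparison, the questions above could be sketched in pandas roughly like this. The toy data is invented, and since the table only has `Date` and `EventType`, the "statistics" here are assumed to mean statistics over monthly event counts:

```python
import pandas as pd

# Hypothetical toy data: one row per event.
events = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            [
                "2020-01-05", "2020-01-20", "2020-02-11",
                "2020-04-02", "2020-04-15", "2020-04-28",
            ]
        ),
        "EventType": ["a", "b", "a", "a", "a", "b"],
    }
)

# How many events by type
by_type = events["EventType"].value_counts()

# When did they happen: earliest and latest date per type
date_range = events.groupby("EventType")["Date"].agg(["min", "max"])

# Are there any breaks in the count: events per calendar month,
# where months without events show up with a count of 0
monthly = events.set_index("Date").resample("MS").size()
gaps = monthly[monthly == 0]

# Some statistics on the monthly counts
stats = monthly.agg(["mean", "min", "max"])

print(by_type, date_range, gaps, stats, sep="\n\n")
```

Each question is a one-liner here too, although the "why" behind a gap still has to come from looking at the data.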