
WRT the performance ceiling, I'm mostly talking about things like Pandas, which evaluate eagerly and aren't amenable to a parallel execution model (multiple threads operating on the same data frame with minimal contention).
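To make the eager-evaluation point concrete, here's a minimal sketch (column names made up): every step in a pandas pipeline materializes a full intermediate DataFrame before the next one runs, with nothing fused, deferred, or parallelized.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": np.random.rand(1_000_000),
                       "b": np.random.rand(1_000_000)})

    tmp = df[df["a"] > 0.5]                  # materializes intermediate copy #1
    tmp = tmp.assign(c=tmp["a"] * tmp["b"])  # materializes intermediate copy #2
    result = tmp["c"].sum()                  # single-threaded reduction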

WRT poor APIs, I'm talking about things like matplotlib or pandas or etc that take a whole slew of arguments and try to guess the caller's intent by inspecting the types of the arguments. The referent isn't "some other scientific computing API" (although I'm sure there are some sane scientific computing APIs), but rather "other APIs in general" since there's nothing inherent to any particular domain that demands this kind of 'magical' API.
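To give one concrete instance of the type-sniffing I mean (a sketch, with a made-up frame): pandas' `[]` operator dispatches on the type of its argument, so the same syntax selects columns, slices rows, or filters rows depending on what you pass.

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    df["a"]          # a string selects one column
    df[["a", "b"]]   # a list of strings selects several columns
    df[0:2]          # a slice selects *rows* by position
    df[df["a"] > 1]  # a boolean Series filters rows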

WRT 'hardly anyone is CPU bound'--the context is numeric computing; what are people bound by if not CPU? I've seen several projects where web endpoints were timing out while grinding in Pandas, largely because there weren't good options for taking advantage of multiple processors. Based on prototypes I did, I'm confident that other languages could serve those requests in single-digit seconds if not sub-second.




You're getting some pushback, but I tend to agree with you on matplotlib and pandas. Great libraries are designed so that you can get a feel for them and -- with practice -- use them intuitively. Even after years of (admittedly light) use I still find pandas' MultiIndexes confusing, and I always have to look up the best of the myriad ways to do something in matplotlib. In comparison, R's dplyr and ggplot have stuck with me even ages after I gave up day-to-day use of R.


Pandas is really similar to base-R, which accounts for much of the weirdness (but at least R can claim to be copying a language developed around the same time as C).


So who does it right? If all these APIs suck compared to an imaginary perfect library, then that isn’t a useful comparison.

Also, if an endpoint is spending minutes to respond, then I would think actually profiling the application would be a good start. Maybe researching prior art in the problem domain would be good too. If nobody can be bothered to explore the several solutions for distributing pandas computations over multiple cores, like Dask, or to weigh the NPV of just buying more or faster cores, then “Python sucks” isn’t your problem.


That’s quite a rant with a lot of assumptions. Just about every library has a better API than matplotlib or pandas. Requests has a pretty good API, IMO. The team that was responsible for the slow endpoint did investigate Dask and alternatives, and they’ll probably end up on something like Spark because they didn’t feel they had better options. Maybe our team is just stupid and Python isn’t for mere mortals, I don’t know, but I do know that these problems don’t exist in other languages.


I would certainly hope a minimal HTTP library would be simpler than a suite of functions to manipulate and plot tabular data.

“My application is slow, the language sucks!” doesn’t indicate a very serious investigation into the problem.


> I would certainly hope a minimal HTTP library would be simpler than a suite of functions to manipulate and plot tabular data.

HTTP is pretty complex, but that's neither here nor there. The relevant bit is that there is no domain for which guessing caller intent based on reflection over argument types is appropriate.

> “My application is slow, the language sucks!” Doesn’t indicate a very serious investigation into the problem.

I was pretty explicit above and elsewhere in this thread about why Python's performance is miserable; I'm not sure why you would invoke such a poorly constructed straw man when everyone can look upthread and see my actual arguments.


IMO pandas' lenient inputs are a godsend when you're working with real-world, dirty data regularly. It's my favorite API I've ever worked with because it lets me focus on my high-level tasks and takes care of the things I don't really care about, like whether I'm working with a list of dicts or a dict of lists or whatever.
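For instance (a tiny sketch with made-up records), the DataFrame constructor happily accepts either shape and builds the same table:

    import pandas as pd

    rows = [{"name": "a", "x": 1}, {"name": "b", "x": 2}]  # list of dicts
    cols = {"name": ["a", "b"], "x": [1, 2]}               # dict of lists

    pd.DataFrame(rows).equals(pd.DataFrame(cols))  # True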

But once you've done the cleaning/exploration, you should move any heavy computing to a high-performance library like numpy.
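A minimal sketch of that handoff (the column name is hypothetical): pull the column out as a NumPy array and do the hot work there, on contiguous memory, without pandas' per-operation overhead.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": np.random.rand(1_000_000)})

    x = df["x"].to_numpy()      # usually a zero-copy view of the column
    result = np.sqrt(x).mean()  # vectorized work runs entirely in NumPy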


> there's nothing inherent to any particular domain that demands this kind of 'magical' API

Plotting seems to tend towards magic because plots are basically art, with all the desire for aesthetic customization that applies, and it's a very common task so users also want brevity (magic). The result is a plot() function with a gazillion options hidden behind keyword arguments.

I agree that matplotlib has a sprawling interface, and this can be annoying, but I'm still not sure what "guess the caller's intent by inspecting the types of the arguments" means. Sure, the functions have multiple call signatures, but that's not exactly unusual in libraries or languages. I don't understand the context that brings guesswork into the picture. Skimming the manual -- are you using the data keyword argument and hitting the `plot('n', 'o', data=obj)` ambiguity [0]? Or calling plot through `pyplot.plot` &c. (which rely on global state) instead of `Axes.plot` &c.?
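For anyone following along, this is the trap the linked docs warn about (a sketch; `obj` is a stand-in): `plot` accepts both format strings and data keys as bare strings, so the same call can parse two ways.

    import matplotlib.pyplot as plt

    obj = {"n": [1, 2, 3], "o": [4, 5, 6]}

    # Ambiguous: 'o' is a key in `obj` but also a valid format string
    # (circle markers), so this could mean plot(x, y) or plot(y, fmt).
    # matplotlib warns and treats 'o' as a data key here.
    plt.plot("n", "o", data=obj)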

Asking because if there's an interface trap I'm unaware of I'd like to learn about it before walking into it blindly.

Pandas I sort of agree with; I personally find it harder to remember how to use pandas than dplyr, despite using pandas more often and spending more time reading the pandas documentation. I also find it inconvenient to represent missing values in Pandas (`None` and `NaN` are overloaded, and `None` forces the `object` dtype). But maybe the problem is on my end.
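To illustrate the overloading (a quick sketch): the same `None` becomes `NaN` in a numeric column but stays `None` in an `object` column, and integer columns silently become floats to make room for `NaN`.

    import pandas as pd

    pd.Series([1, 2, None]).dtype           # float64 -- ints upcast, None -> NaN
    pd.Series(["a", None]).dtype            # object -- None is kept as-is
    pd.Series([1, 2, None], dtype="Int64")  # nullable dtype, holds pd.NA instead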

[0] https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.p...


"Based on prototypes I [spent a limited amount of time on and didn't research better methods], I'm confident..."


In the context of pandas, 3 GB of (raw, uncompressed) data could easily require 30 GB of RAM, and that kind of overhead adds up quickly.


Pandas is not some mysterious black box. If you need predictable runtime performance or bounded memory usage, you have to figure it out. Pandas doesn't inherently have a staggering or unpredictable amount of overhead, given that it's a statistical analysis package. There are ways to mitigate Pandas memory usage (10x is a sign that something has gone very horribly wrong), and sometimes Pandas is simply the wrong tool for the job.
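A sketch of the kind of mitigation I mean (file and column names are hypothetical): measure where the bytes actually go, then shrink dtypes -- `category` for low-cardinality strings and downcast numerics routinely cut usage by a large factor.

    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical input
    df.memory_usage(deep=True)    # per-column byte counts

    df["state"] = df["state"].astype("category")  # low-cardinality strings
    df["count"] = pd.to_numeric(df["count"], downcast="integer")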


10x reflects both experience and expert recommendations. You may recognize the author [1]:

> Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset

[1] https://wesmckinney.com/blog/apache-arrow-pandas-internals/


I don't doubt Wes' upper bound for Pandas OOTB, without optimization. The context was web applications. If you're seeing 10x on a web app, either something is wrong or you probably shouldn't be using Pandas.


I think most of us use Pandas for data exploration and one-offs, so it is completely reasonable to discuss what our likely use of RAM is going to be in this circumstance.

"Web applications using Pandas" and "highly optimized web applications" would seem to be nearly disjoint sets...


matplotlib and pandas were designed to mimic interfaces that were more popular than the projects themselves were when they were first conceived (MATLAB and base R, respectively). The "easy" interface is a large part of why those projects are now more popular than their inspirations.


Very true; I found matplotlib very appealing because I didn't have to relearn anything coming from MATLAB.


Ah, that makes sense (I suppose I should have guessed from the name). I never understood matplotlib's popularity, but if it's a MATLAB clone that makes way more sense.


Many scientific computing applications are considered to be bound by I/O.


Rather famously, one needs an arithmetically intense operation like matrix multiplication to become CPU bound -- an operation with enough arithmetic operations per data element that I/O doesn't dominate.
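A back-of-envelope sketch (the size is illustrative): an n x n matrix multiply does about 2n^3 flops while touching about 3n^2 elements, so the arithmetic per element grows linearly with n, while an elementwise op stays constant no matter the size.

    n = 4096
    matmul_intensity = 2 * n**3 / (3 * n**2)  # ~2731 flops/element, grows with n
    add_intensity = n**2 / (3 * n**2)         # ~0.33 flops/element, constant
    print(matmul_intensity, add_intensity)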




