
WRT the performance ceiling, I'm mostly talking about things like Pandas, which evaluate eagerly and aren't amenable to a parallel execution model (multiple threads operating on the same data frame with minimal contention).
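To make the eager-evaluation point concrete, here's a minimal sketch (column names made up): every step in a pandas pipeline materializes a full intermediate DataFrame before the next one runs, with nothing fused, deferred, or parallelized.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": np.random.rand(1_000_000),
                       "b": np.random.rand(1_000_000)})

    tmp = df[df["a"] > 0.5]                  # materializes intermediate copy #1
    tmp = tmp.assign(c=tmp["a"] * tmp["b"])  # materializes intermediate copy #2
    result = tmp["c"].sum()                  # single-threaded reduction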

WRT poor APIs, I'm talking about things like matplotlib or pandas or etc that take a whole slew of arguments and try to guess the caller's intent by inspecting the types of the arguments. The referent isn't "some other scientific computing API" (although I'm sure there are some sane scientific computing APIs), but rather "other APIs in general" since there's nothing inherent to any particular domain that demands this kind of 'magical' API.
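To give one concrete instance of the type-sniffing I mean (a sketch, with a made-up frame): pandas' `[]` operator dispatches on the type of its argument, so the same syntax selects columns, slices rows, or filters rows depending on what you pass.

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    df["a"]          # a string selects one column
    df[["a", "b"]]   # a list of strings selects several columns
    df[0:2]          # a slice selects *rows* by position
    df[df["a"] > 1]  # a boolean Series filters rows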

WRT 'hardly anyone is CPU bound'--the context is numeric computing; what are people bound by if not CPU? I've seen several projects where web endpoints were timing out while grinding in Pandas, largely because there weren't good options for taking advantage of multiple processors. Based on prototypes I did, I'm confident that other languages could serve those requests in single-digit seconds if not sub-second.




You're getting some pushback, but I tend to agree with you on matplotlib and pandas. Great libraries are designed so that you can get a feel for them and -- with practice -- use them intuitively. Even after years of (admittedly light) use I still find pandas' MultiIndexes confusing, and I always have to look up the best of the myriad ways to do something in matplotlib. In comparison, R's dplyr and ggplot have stuck with me even ages after I gave up day-to-day use of R.


Pandas is really similar to base-R, which accounts for much of the weirdness (but at least R can claim to be copying a language developed around the same time as C).


So who does it right? If all these APIs suck compared to an imaginary perfect library, then that isn’t a useful comparison.

Also, if an endpoint is spending minutes to respond, then I would think actually profiling the application would be a good start. Maybe researching prior art in the problem domain would be good too. If nobody can be bothered to explore the several solutions for distributing pandas computations over multiple cores, like Dask, or to weigh the NPV of just buying more or faster cores, then “Python sucks” isn’t your problem.


That’s quite a rant with a lot of assumptions. Just about every library has a better API than matplotlib or pandas. Requests has a pretty good API, IMO. The team that was responsible for the slow endpoint did investigate Dask and alternatives, and they’ll probably end up on something like Spark because they didn’t feel they had better options. Maybe our team is just stupid and Python isn’t for mere mortals, I don’t know, but I do know that these problems don’t exist in other languages.


I would certainly hope a minimal HTTP library would be simpler than a suite of functions to manipulate and plot tabular data.

“My application is slow, the language sucks!” doesn’t indicate a very serious investigation into the problem.


> I would certainly hope a minimal HTTP library would be simpler than a suite of functions to manipulate and plot tabular data.

HTTP is pretty complex, but that's neither here nor there. The relevant bit is that there is no domain for which guessing caller intent based on reflection over argument types is appropriate.

> “My application is slow, the language sucks!” Doesn’t indicate a very serious investigation into the problem.

I was pretty explicit above and elsewhere in this thread about why Python's performance is miserable; I'm not sure why you would invoke such a poorly constructed straw man when everyone can look upthread and see my actual arguments.


IMO pandas' lenient inputs are a godsend when you're working with real-world, dirty data regularly. It's my favorite API I've ever worked with because it lets me focus on my high-level tasks and takes care of the things I don't really care about, like whether I'm working with a list of dicts or a dict of lists or whatever.
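For instance (a tiny sketch with made-up records), the DataFrame constructor happily accepts either shape and builds the same table:

    import pandas as pd

    rows = [{"name": "a", "x": 1}, {"name": "b", "x": 2}]  # list of dicts
    cols = {"name": ["a", "b"], "x": [1, 2]}               # dict of lists

    pd.DataFrame(rows).equals(pd.DataFrame(cols))  # True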

But once you've done the cleaning/exploration, you should move any heavy computing to a high-performance library like numpy.
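A minimal sketch of that handoff (the column name is hypothetical): pull the column out as a NumPy array and do the hot work there, on contiguous memory, without pandas' per-operation overhead.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": np.random.rand(1_000_000)})

    x = df["x"].to_numpy()      # usually a zero-copy view of the column
    result = np.sqrt(x).mean()  # vectorized work runs entirely in NumPy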


> there's nothing inherent to any particular domain that demands this kind of 'magical' API

Plotting seems to tend towards magic because plots are basically art, with all the desire for aesthetic customization that applies, and it's a very common task so users also want brevity (magic). The result is a plot() function with a gazillion options hidden behind keyword arguments.

I agree that matplotlib has a sprawling interface, and this can be annoying, but I'm still not sure what "guess the caller's intent by inspecting the types of the arguments" means. Sure, the functions have multiple call signatures, but that's not exactly unusual in libraries or languages. I don't understand the context that brings guesswork into the picture. Skimming the manual -- are you using the data keyword argument and hitting the `plot('n', 'o', data=obj)` ambiguity [0]? Or calling plot through `pyplot.plot` &c. (which rely on global state) instead of `Axes.plot` &c.?
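For anyone following along, this is the trap the linked docs warn about (a sketch; `obj` is a stand-in): `plot` accepts both format strings and data keys as bare strings, so the same call can parse two ways.

    import matplotlib.pyplot as plt

    obj = {"n": [1, 2, 3], "o": [4, 5, 6]}

    # Ambiguous: 'o' is a key in `obj` but also a valid format string
    # (circle markers), so this could mean plot(x, y) or plot(y, fmt).
    # matplotlib warns and treats 'o' as a data key here.
    plt.plot("n", "o", data=obj)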

Asking because if there's an interface trap I'm unaware of I'd like to learn about it before walking into it blindly.

Pandas I sort of agree with; I personally find it harder to remember how to use pandas than dplyr, despite using pandas more often and spending more time reading the pandas documentation. I also find it inconvenient to represent missing values in Pandas (`None` and `NaN` are overloaded, and `None` forces the `object` dtype). But maybe the problem is on my end.
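To illustrate the overloading (a quick sketch): the same `None` becomes `NaN` in a numeric column but stays `None` in an `object` column, and integer columns silently become floats to make room for `NaN`.

    import pandas as pd

    pd.Series([1, 2, None]).dtype           # float64 -- ints upcast, None -> NaN
    pd.Series(["a", None]).dtype            # object -- None is kept as-is
    pd.Series([1, 2, None], dtype="Int64")  # nullable dtype, holds pd.NA instead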

[0] https://matplotlib.org/3.3.2/api/_as_gen/matplotlib.pyplot.p...


"Based on prototypes I [spent a limited amount of time on and didn't research better methods], I'm confident..."


In the context of pandas, 3 GB of (raw, uncompressed) data could easily require 30 GB of RAM, and that kind of overhead adds up quickly.


Pandas is not some mysterious black box. If you need predictable runtime performance or bounded memory usage, you have to figure it out. Pandas doesn't inherently have a staggering or unpredictable amount of overhead, given that it's a statistical analysis package. There are ways to mitigate Pandas memory usage (10x is a sign that something has gone very horribly wrong), and sometimes Pandas is simply the wrong tool for the job.
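A sketch of the kind of mitigation I mean (file and column names are hypothetical): measure where the bytes actually go, then shrink dtypes -- `category` for low-cardinality strings and downcast numerics routinely cut usage by a large factor.

    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical input
    df.memory_usage(deep=True)    # per-column byte counts

    df["state"] = df["state"].astype("category")  # low-cardinality strings
    df["count"] = pd.to_numeric(df["count"], downcast="integer")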


10x reflects both experience and expert recommendations. You may recognize the author [1]:

> Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset

[1] https://wesmckinney.com/blog/apache-arrow-pandas-internals/


I don't doubt Wes' upper bound for Pandas OOTB, without optimization. The context was web applications. If you're seeing 10x on a web app, either something is wrong or you probably shouldn't be using Pandas.


I think most of us use Pandas for data exploration and one-offs, so it is completely reasonable to discuss what our likely use of RAM is going to be in this circumstance.

"Web applications using Pandas" and "highly optimized web applications" would seem to be nearly disjoint sets...


matplotlib and pandas were designed to mimic interfaces that were more popular than the projects themselves were when they were first conceived (MATLAB and base R, respectively). The "easy" interface is a large part of why those projects are now more popular than their inspirations.


Very true; I found matplotlib very appealing because I didn't have to relearn anything coming from MATLAB.


Ah, that makes sense (I suppose I should have guessed from the name). I never understood matplotlib's popularity, but if it's a MATLAB clone that makes way more sense.


Many scientific computing applications are considered to be bound by I/O.


Rather famously, one needs an arithmetically intense operation like matrix multiplication to become CPU bound -- an operation with enough arithmetic operations per data element that I/O doesn't dominate.
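A back-of-envelope sketch (the size is illustrative): an n x n matrix multiply does about 2n^3 flops while touching about 3n^2 elements, so the arithmetic per element grows linearly with n, while an elementwise op stays constant no matter the size.

    n = 4096
    matmul_intensity = 2 * n**3 / (3 * n**2)  # ~2731 flops/element, grows with n
    add_intensity = n**2 / (3 * n**2)         # ~0.33 flops/element, constant
    print(matmul_intensity, add_intensity)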




