Making Python faster with Rust (ohadravid.github.io)
387 points by teddykoker on March 30, 2023 | 215 comments



Good for you! You did everything right: measure always, fix the bottleneck if possible, rewrite if necessary.

A little tip: you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist` you compare against `max_dist` squared instead, and you skip the `sqrt()` for every `norm`. Saves a few CPU ticks as well.
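For example (a quick numpy sketch, not the article's code, with made-up values):

    import numpy as np

    centers = np.random.rand(1000, 2)      # made-up polygon centers
    point = np.array([0.5, 0.5])
    max_dist = 0.1

    # with sqrt: one square root per row
    close = np.linalg.norm(centers - point, axis=1) < max_dist

    # without sqrt: compare squared distance against max_dist squared
    diff = centers - point
    close_sq = (diff * diff).sum(axis=1) < max_dist ** 2

    assert (close == close_sq).all()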

I once rewrote a GDI+ point transformation routine in pure C# and got a 200x speedup just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years later, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know the geometry behind your tools, you can either use them efficiently or rewrite them completely.


The author did talk about it on reddit, but explained that for the purpose of the blog post they wanted to focus on the big stuff and profiling-guided optimisation of the process: https://reddit.com/r/rust/comments/125pbq0/blog_post_making_...


The ignore-the-square-root-when-comparing-distances trick is a great one. That's how I got to the top of the performance leaderboard in my first algorithms class.


I think a big mistake in the article, in a context where performance is the main objective, is that the author uses an array of structs (AoS) rather than a struct of arrays (SoA). An SoA orders the data contiguously, which is easy for the CPU to read, while an AoS layout interleaves different data (namely the x and y in this case), which is very annoying for the CPU. A CPU likes to read chunks of data (for example 128 bits per read) and to process them with SIMD instructions, executing multiple calculations in one CPU cycle. This is completely broken when using an array of structs.

He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.
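Roughly, the difference in numpy terms (my own illustration, not the article's code):

    import numpy as np

    N = 100_000

    # AoS: a structured array interleaves x and y in memory
    aos = np.zeros(N, dtype=[("x", "f8"), ("y", "f8")])

    # SoA: two separate contiguous arrays
    xs = np.zeros(N)
    ys = np.zeros(N)

    # the SoA form streams through each array contiguously...
    dist_sq_soa = (xs - 0.2) ** 2 + (ys - 0.3) ** 2
    # ...while the AoS form takes strided views over the interleaved fields
    dist_sq_aos = (aos["x"] - 0.2) ** 2 + (aos["y"] - 0.3) ** 2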


Apache Arrow (https://arrow.apache.org/overview/) is built exactly around this idea: it's a library for managing the in-memory representation of large datasets.


Author here: I agree, that's great perf advice (esp. when you can restructure your code).

I couldn't get into this in the article (it would be too long), but it's a great point and the original library does this in a lot of places.

One problem in our use case is that the actual struct members are pretty big, and we need to group/regroup them a lot.

The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.
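Very roughly (an illustrative sketch only, with a made-up grouping key, not the real code):

    from collections import defaultdict
    import numpy as np

    groups = defaultdict(list)
    for poly in close_polygons:          # output of the initial filtering
        groups[poly.kind].append(poly)   # `kind` is a made-up grouping key

    # one SoA (contiguous array of centers) per group for the heavier math
    grouped_centers = {
        kind: np.array([p.center for p in polys])
        for kind, polys in groups.items()
    }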


Having used this approach in a few languages, I agree that it's (sometimes) (much) better for performance, but it tends to wreak havoc on readability.


Languages with some sort of decent metaprogramming support can alleviate this sort of issue (see Zig, Julia, Jai, etc.)


Do you have examples? I cannot think of a good way to do that.


For python, you can use pandas. It even has arrow support now.


Yeah, I need to try it.

But the parent was suggesting that metaprogramming could make readability issues go away and I can't wrap my head around how it would do that.


In Julia you can use https://juliaarrays.github.io/StructArrays.jl/stable/ which lets your code use the array of structs interface but internally it’s a structure of arrays


Nice, thanks!


Modern CPU caches are usually loaded in 64-byte units - much larger than 128 bits. I just ran some tests with a C program on an Intel I5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.

It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.

Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.

  [jim@mbp ~]$ sh -x x
  + cat x1.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x + p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m12.078s
  user 0m7.319s
  sys 0m4.363s
  + ./x1
  s=1808348672

  real 0m9.415s
  user 0m6.677s
  sys 0m2.685s
  + cat x2.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      x[i] = i;
      y[i] = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += x[i] + y[i];
    }
    printf("s=%d\n", s);
  }
  + cc -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m9.753s
  user 0m6.713s
  sys 0m2.967s
  + ./x2
  s=1808348672

  real 0m9.642s
  user 0m6.674s
  sys 0m2.902s
  + cat x3.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
    }
    for (i=0; i<NUM; i++) {
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x;
    }
    for (i=0; i<NUM; i++) {
      s += p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m13.844s
  user 0m11.095s
  sys 0m2.700s
  + ./x3
  s=1808348672

  real 0m13.686s
  user 0m11.038s
  sys 0m2.611s
  + cat x4.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++)
      x[i] = i;
    for (i=0; i<NUM; i++)
      y[i] = i;

    s=0;
    for (i=0; i<NUM; i++)
      s += x[i];
    for (i=0; i<NUM; i++)
      s += y[i];

    printf("s=%d\n", s);
  }
  + cc -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m13.530s
  user 0m10.851s
  sys 0m2.633s
  + ./x4
  s=1808348672

  real 0m13.489s
  user 0m10.856s
  sys 0m2.603s


It appears that you didn't enable optimizations. That's needed for SIMD, which can only be taken advantage of with contiguously packed data.


Results with -O are below.

x1 (AoS) vs x2 (SoA): no performance difference.

x3 (AoS, one member per loop): slower.

x4 (SoA, one array per loop): faster.

My advice is still not to assume that SoA is always faster than AoS without benchmarking.

  + cc -O -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m11.775s
  user 0m3.540s
  sys 0m6.592s
  + ./x1
  s=1808348672

  real 0m5.427s
  user 0m2.727s
  sys 0m2.682s
  + cc -O -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m5.185s
  user 0m2.296s
  sys 0m2.872s
  + ./x2
  s=1808348672

  real 0m5.140s
  user 0m2.273s
  sys 0m2.852s
  + cc -O -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m6.423s
  user 0m3.745s
  sys 0m2.660s
  + ./x3
  s=1808348672

  real 0m6.485s
  user 0m3.741s
  sys 0m2.714s
  + cc -O -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m4.875s
  user 0m2.205s
  sys 0m2.651s
  + ./x4
  s=1808348672

  real 0m4.894s
  user 0m2.189s
  sys 0m2.684s


The reason you're not seeing much difference is that your struct is very small, just 8 bytes. In the cases where you're using both x and y in the loop (x1 and x2), you can fit 8 of them in a cache line and you're not wasting space since you need both fields. In the case where you're only using one of the values (x3), you're wasting half of every cache line you load, and that shows in the benchmark. If you had a bigger struct, and/or one where you're not using all the members in the calculation, you'd see a much bigger difference in performance between SoA and AoS.


You need to add “-march=native” to turn on SIMD at the highest level your processor will support; otherwise most compilers will just use SSE2 by default (the lowest common denominator, as all x86-64 processors have it).


This is a great article but there's still a core problem there - why should developers have to choose between accessibility and performance?

So much scientific computing code suffers from core packages being split away from their core language - at what point do we stop and abandon python for languages which actually make sense? Obviously julia is the big example here, but its interest, development and ecosystem don't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x, what's stopping people from switching???


Today, there is a Python package for everything. The ecosystem is possibly best in class for having a library available that will do X. You cannot separate the language from the ecosystem. Being better, faster, and stronger means little if I have to write all of my own supporting libraries.

Also, few scientific programmers have any notion of what C or Fortran is under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets, which for the vast majority of researchers are not big data. If the one-time calculation takes 12 seconds instead of 0.1 seconds, that's not a problem worth optimizing.


>Today, there is a Python package for everything.

The same could be said about CPAN and NPM. Yet Perl is basically dead and JavaScript isn't used for any machine learning tasks as far as I'm aware. WebAssembly did help bring a niche array of audio and video codecs to the ecosystem[1][2], something I'm yet to see from Python.

I don't use Python, but with what little exposure I've had to it at work, its overall sluggish performance and need to set up a dozen virtualenvs -- only to dockerize everything in cursed ways when deploying -- makes me wonder how or why people bother with it at all beyond some 5-line script. Then again, Perl used to be THE glue language in the past and mod_perl was as big as FastAPI, and Perl users would also point out how CPAN was unparalleled in breadth and depth. I wonder if Python will follow a similar fate as Perl. One can hope :-)

[1] https://github.com/phoboslab/jsmpeg

[2] https://github.com/brion/ogv.js/


That’s a lot of opinions for so little exposure. There are a lot of uses that don’t involve docker or a dozen virtual envs.


Honestly, I use python everyday in the ML/AI space. If we're talking in that context they're pretty spot on about python, virtualenvs, and docker.


I've used Python quite a lot and their experience sounds about right.


To counter these anecdotes: I used python for building web APIs and only needed poetry to manage 1 virtual env, and containerizing with docker was straightforward.


> JavaScript isn't used for any machine learning tasks as far as I'm aware

https://github.com/facebookresearch/shumai


> WebAssembly did help bring a niche array of audio and video codecs to the ecosystem

Python already has all those: the ctypes module is just as hard to use as WebAssembly, with a much lower barrier-to-entry.


WebAssembly has the benefit of portability, though. Python on Windows... is still an open problem. Or even Python on dev-oriented distro vs. numsci-oriented distro.


This is how I got into software development.

During my PhD I was running some simulations using poorly written Python code. Initially it would take several hours. In that time I could go to the lab, run some wetlab experiments, and the results of my simulations would be there when I got back to the office. It was only by taking Python "home" and building some of my own projects that I learned how to 1. write more pythonic code and 2. write more performant code. Now I work for a software company.

If I'd stayed in academia I would probably still be writing quick and dirty code and not worrying about the runtime, because as a researcher there is always something else you can be doing.


You can have your cake and eat it with the likes of

* PythonCall.jl - https://github.com/cjdoris/PythonCall.jl

* NodeCall.jl - https://github.com/sunoru/NodeCall.j

* RCall.jl - https://github.com/JuliaInterop/RCall.jl

I tend to use Julia for most things and then just dip into another language’s ecosystem if I can’t find something to do the job and it’s too complex to build myself


* NodeCall.jl - https://github.com/sunoru/NodeCall.jl

// just fixed missing 'l' in link


Because professional software developers with a background in CS are a minority of people who program today. The learning curve of pointers, memory-allocation, binary operations, programming paradigms, O-Notation and other things you need to understand to efficiently code in something like C is a lot to ask of someone who is for example primarily a sociologist or biologist.

The use case, btw, is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is essentially a glorified calculator. Readability trumps efficiency by a large margin every time.


It also matters if you write code to run once or to serve in production, if it is experimental or stable.

If my script takes 3s to run and 5m to write in Python, vs 0.1s to run and 3h to write in C, I finish first with Python. I can try more ideas with Python.


tbf you don't need to go to C. You could write Common Lisp or OCaml, both academic high-level languages and very performant. Hell, SBCL can get you into C-range performance while you're writing dynamic, GCed code. Sure, it's a little more involved than learning Python, but not that much if you get 50x performance for free. The prevalence of Python is really baffling to me because compute resources cost money.


Academic for CS, maybe. Not so much for chemistry, biology, etc. If you work in computational areas of those subjects you are more likely to know Matlab, R, Python and maybe Julia.

I didn't know of anybody who had done any Lisp or OCaml in my time in academia (in chemistry, chemical engineering and biology departments), but that's just 3 universities, and I certainly didn't know everybody.


I've known people who write OCaml for biology, but they sure looked lonely :)


Not even readability. Academic code is mostly unreadable. If you need an example: IBM Qiskit.

Everything is just a proof of concept and no one expects anything more than that.


C is definitely not a good choice for this, I would hate to come back 2 days later to my computation and see “segfault” as the only output.


I'd be happy getting just a "segfault" from C. It would be so much better than subtly wrong results from reading uninitialized memory or out-of-bounds access, results that change depending on debug vs optimized build, or when changing some adjacent code.


Sure, a segfault was just more visual, that’s why I went with that.


They would have gotten the same performance in python with numpy if they did it like this instead of calling norm for every polygon

    centers = np.array([p.center for p in ps])
    norm(centers - point, axis=1)

They were just using numpy wrong. You can be slow in any language if you use the tools wrong


You made this assertion multiple times, but so far it’s been entirely unsupported in fact, despite TFA having made the entire code set available for you to test your hypothesis on.


On Google colab

    import numpy as np
    import time


    vals = np.random.randn(1000000, 2)
    point = np.array([.2, .3])
    s = time.time()
    for x in vals:
        np.linalg.norm(x - point) < 3
    a = time.time() - s

    s = time.time()
    np.linalg.norm(vals - point, axis=1) < 3
    b = time.time() - s

    print(a / b)
~296x faster, significantly faster than the solution in the article. And my assertion was supported by nearly 20 years of numpy being a leading tool in various quantitative fields. It’s not hard to imagine that a ubiquitous tool that’s been used and optimized for almost 20 years is actually pretty good if used properly.


Finally some dude knows how to use numpy properly. I wish I could upvote 5 times.

I basically raised the same question somewhere below and got downvoted LOL.


what is the difference?

Though I do feel like I see this a lot with these kinds of "we rewrote it in Rust and everything is fast" posts; comparing to a garbage-collected language is often the scenario.

On one hand, I feel like you should just learn how to use your stuff properly. On the other hand, it is interesting to see that people who can't write fast code or use libraries properly are actually writing fast code. Like fast code for the masses, almost, hah. Though maybe they'll just run into the same issue when they misuse a library in Rust.


The first issue I have with it is that they've now convinced a large portion of people that read this article that a very good tool is not as good as it actually is. This is a disservice to the great engineering that has gone into it.

The rest of my issue with it is hypothetical. I don't care what he does at work, but I would imagine if I was that dude's manager and he convinced me that he put in all this work and determined that the best path forward is to introduce a brand new language and tool chain into our environment to maintain (obviously not as big a deal if it was already well engrained in the team), and then I come to find out that he could have gotten even better results by changing a few lines with the existing tools, that I would have to reevaluate my view of said developer.


Everything. Why are there still COBOL programmers? Why is C++ still the de facto native language (also in research)?

but also I don't see any problem there, I think the python + c++/rust idiom is actually pretty nice. I have a billion libs to choose from on either side. Great usability on the py side, and unbeatable performance on the c++ side


One of Julia's Achilles heels is standalone, ahead-of-time compilation. Technically this is already possible [1], [2], but there are quite a few limitations when doing this (e.g. "Hello world" is 150 MB [6]) and it's not an easy or natural process.

The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.

Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.

[1]: https://julialang.github.io/PackageCompiler.jl/dev/

[2]: https://github.com/tshort/StaticCompiler.jl

[3]: https://discourse.julialang.org/t/ann-the-ion-command-line-f...

[4]: https://discourse.julialang.org/t/extremely-slow-execution-t...

[5]: https://discourse.julialang.org/t/extremely-slow-execution-t...

[6]: https://www.reddit.com/r/Julia/comments/ytegfk/size_of_a_hel...


Thank you for settling a question for me - I was looking at julia's aot compilation abilities last week and the situation seemed like kind of a hassle.


IME, having used Julia quite extensively in academia:

- the development experience is hampered by the slow start time;

- the ecosystem is quite brittle;

- the promised performance is quite hard to actually reach; profiling only gets you so far;

- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)

> what's stopping people from switching???

All of the above, plus inertia; perfect is the enemy of good enough; the alternatives are far from the python ecosystem & community; and performance is often not a showstopper.


I don't know whether this sentiment is just a byproduct of CS education, but for some reason people equate a programming language with the compute that goes on under the hood. Like if you write in Python, you are locked into the specific non optimized way of computing that Python does.

It's all machine code under the hood. Everything else on top is essentially a description of more and more complex patterns of that code. So it's a no-brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.


Julia doesn't get the latest models first, or have as big of a community.


A vectorized implementation of find_close_polygons wouldn't be very complex or hard to maintain at all, but the authors would also have to ditch their OOP class based design, and that's the real issue here. The object model doesn't lend itself to performant, vectorized numpy code.


What's a good guide to learn how to make (and see) vectorized code? It's a mindshift and not one I find easy.


I think a great start is to make arrays of similar data. Instead of an array of (x,y,z) use an array for x, an array for y and another one for z. If you then square these and sum them for example, the compiler might figure out good optimizations for it if you write it as a simple loop.

Also read about SIMD instructions like AVX2. They are often used under the hood when possible, but just knowing what they require can help "trigger" them, depending on which language you use. In C++, for example, the compiler really looks for opportunities to use those instructions. You can tell the compiler did it by checking the assembly for XMM or YMM registers (the names of the SIMD registers).
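In numpy terms (my own sketch), the "array per component" layout looks like this; numpy's C loops over contiguous arrays are exactly the pattern those SIMD units like:

    import numpy as np

    # one array per component instead of an array of (x, y, z) records
    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    z = np.random.rand(1_000_000)

    # squared length of every vector in one vectorized pass
    length_sq = x * x + y * y + z * z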


A more accurate keyword for googling is "SIMD": Single Instruction, Multiple Data.

Numpy's tutorial for broadcasting is also a good starting point.

https://numpy.org/doc/stable/user/basics.broadcasting.html
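A tiny example of the shape rules described there (my own, just to show the idea):

    import numpy as np

    a = np.arange(3).reshape(3, 1)   # shape (3, 1)
    b = np.arange(4).reshape(1, 4)   # shape (1, 4)

    # each array is stretched along its size-1 axis: the result has shape (3, 4),
    # equivalent to a double for loop computing a[i] * b[j] without writing one
    outer = a * b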


The gist of it is that you give numpy two arrays, and what operation to apply. Then numpy will figure out what the for loop(s) should look like depending on the shape of the arrays.

You can look at various tutorials to see how it works. For example: https://jakevdp.github.io/PythonDataScienceHandbook/02.05-co...


Exactly, that is the real issue, vectorization might be good enough in terms of performance. It doesn't seem to be mentioned in the article at all.


It might have been added later, but the author mentions vectorization in the beginning:

> It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable, and the gains are going to be limited (here’s a partially vertorized version, which is faster but far from the results we are going to achieve).


Semi Vectorized code:

https://github.com/ohadravid/poly-match/blob/main/poly_match...

Expecting Python engineers to be unable to read de facto standard numpy code, but expecting everyone to be able to read Rust...

Not to mention that the semi-vectorized code is still suboptimal. There are too many for loops, even though the author clearly knows they can all be vectorized.

For example instead the author can just write something like:

    np.argmin(distances[distances <= threshold])
Also, in one place there is:

    np.xxx( np.xxx, np.xxx + np.xxx)
You can just slap numexpr on top of it to compile this line on the fly.

https://github.com/pydata/numexpr
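i.e. something like this (illustrative, with placeholder arrays):

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    c = np.random.rand(1_000_000)

    plain = a * (b + c)                   # plain numpy: several temporaries
    fused = ne.evaluate("a * (b + c)")    # numexpr compiles this into one fused loop

    assert np.allclose(plain, fused)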


Author here:

For the original library we did all the numpy tricks we could think of, but we really needed to do this type of exhaustive search for some of the data.

If someone wants to open a PR with a "fully optimized" numpy code, that would be very cool just for comparison :)


Not super familiar with Python but isn't that append call within a loop going to cause a lot of allocations?


3000 calls to list.append() cost only 2ms. In a computationally intensive program, no one bothers, because usually one call to do a matrix multiplication already costs 500ms or so.

Of course you can preallocate for size=3000 and fill it in a loop, but this saves only 10ms. Too insignificant.


The most important part of the article seems to be that this Python code is taking "an avg of 293.41ms per iteration":

        def find_close_polygons(
            polygon_subset: List[Polygon], point: np.array, max_dist: float
        ) -> List[Polygon]:
            close_polygons = []
            for poly in polygon_subset:
                if np.linalg.norm(poly.center - point) < max_dist:
                    close_polygons.append(poly)

            return close_polygons
And after replacing it with this Rust code, it is taking "an avg of 23.44ms per iteration":

        use pyo3::prelude::*;
        use ndarray_linalg::Norm;
        use numpy::PyReadonlyArray1;

        #[pyfunction]
        fn find_close_polygons(
            py: Python<'_>,
            polygons: Vec<PyObject>,
            point: PyReadonlyArray1<f64>,
            max_dist: f64,
        ) -> PyResult<Vec<PyObject>> {
            let mut close_polygons = vec![];
            let point = point.as_array();
            for poly in polygons {
                let center = poly
                    .getattr(py, "center")?
                    .extract::<PyReadonlyArray1<f64>>(py)?
                    .as_array()
                    .to_owned();

                if (center - point).norm() < max_dist {
                    close_polygons.push(poly)
                }
            }

            Ok(close_polygons)
        }
Why is the Rust version 13x faster than the Python version?


Yeah but the Python code is so bad that it's easy to get a 10x speedup using only numpy, as well. The current code essentially does:

    import numpy as np
    from typing import List

    n_sides = 30
    n_polygons = 10000

    class Polygon:
        def __init__(self, x, y):
            self.x = x
            self.y = y
            self.center = np.array([self.x, self.y]).mean(axis=1)


    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        close_polygons = []
        for poly in polygon_subset:
            if np.linalg.norm(poly.center - point) < max_dist:
                close_polygons.append(poly)

        return close_polygons

    polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
    point = np.array([0, 0])
    max_dist = 0.5

    %timeit find_close_polygons(polygons, point, max_dist)
(I've made up number of sides and number of polygons to get to the same order of magnitude of runtime; also I've pre-computed centers, as they are cached anyway in their code), which on my machine takes about 40ms to run. If we just change the function to:

    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        centers = np.array([polygon.center for polygon in polygon_subset])
        mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
        return [
            polygon
            for polygon, is_pass in zip(polygon_subset, mask)
            if is_pass
        ]
then the same computation takes 4ms on my machine.

Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.

(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)


Python's for loop implementation is slow, also. You can use built-in utils like map(), which are "native" and can be a lot faster than a for loop with an append:

https://levelup.gitconnected.com/python-performance-showdown...


Nope. map() is the same speed as a for loop.

The benchmarking methodology in the link is not good. The author should use timeit() or cProfile or so; a 0.01s difference is mostly fluctuation. The order of execution also matters: say you want to test functions A and B, you actually need to run A, B, B, A to see whether the ordering makes a difference.


Yeah I guess this isn't true anymore, it looks like maybe it was true in 2.6 days.


I immediately verified both claims.

list(map(func, arr)) did bring a ~10% benefit if func is a builtin, e.g. int() or str().

But if func is tuple(), list(), set() or any kind of user-defined function, list(map()) is always slower.

You can try it yourself to see that list(map()) doesn't help:

    import numpy as np
    a = np.arange(100000 * 100).reshape(100000, 100)

    %timeit b1 = [np.sum(x) for x in a]
    %timeit b2 = list(map(np.sum, a))

    # repeat in the opposite order to rule out ordering effects
    import gc
    gc.collect()
    %timeit b2 = list(map(np.sum, a))
    %timeit b1 = [np.sum(x) for x in a]
I guess that's why I only use map() in the `list(map(itemgetter, arr))` case, because generally there is no benefit to using it.


thanks!


I don't think it's the loop implementation. The stuff in the loop should take multiple orders of magnitude more time than the loop itself:

    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)


I don't know if numpy fixed this, but it used to be that mixing Python numbers with numpy in a tight loop was horribly slow. Try hoisting max_dist out of the loop and replacing it with a max_dist_np that converts it to a numpy float once.
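Something like this (untested sketch, reusing the names from the loop quoted above):

    import numpy as np

    max_dist_np = np.float64(max_dist)   # convert the Python float once
    close_polygons = []
    for poly in polygon_subset:
        # the comparison now stays numpy-scalar vs numpy-scalar
        if np.linalg.norm(poly.center - point) < max_dist_np:
            close_polygons.append(poly)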


Speaking of this, I once found that

    for x in numpy.array: 
is 9X slower than

    for x in numpy.array.tolist():
in 2021.


It's not the looping itself that is slow in the article you linked, it's that every element is appended to the list. If you use a list comprehension it's even faster, and it still loops over all elements of the list.


Here is the disassembly of the listcomp `[x for x in range(5)]`:

    RESUME 0
    BUILD_LIST
    LOAD_FAST
    FOR_ITER 4
    STORE_FAST (x)
    LOAD_FAST (x)
    LIST_APPEND
    JUMP_BACKWARD 5
    RETURN_VALUE
As you can see from the third-to-last instruction, a listcomp does append individual elements to the list. What it doesn't need to do is call a method to do so (let alone look up the corresponding method).


No, AFAIK each for-loop iteration pushes and pops the stack in the interpreter, while map runs the loop entirely in the native implementation of the interpreter itself.


I was surprised that the Rust version is _only_ 13x as fast as the Python version.


Probably because it wasn't pure Python to start with.


Shouldn't `close_polygons` be presized in both Python and Rust to avoid repeated allocation and copy ?


One carries the entire feature set of the python runtime, the other is compiled.


The time is spent in this 3-line loop:

    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
I don't think the entire feature set of the Python runtime is involved in this.


Without using every feature you still have to conform to the complexity of the runtime. Every variable in that loop is a hash map lookup into the locals. `np.linalg.norm` is two field accesses, necessitating more hash map lookups on the module objects. `-` and `<` are attribute lookups as well as full function calls.


> Every variable in that loop is a hash map lookup into the locals.

No, it's a LOAD_FAST bytecode instruction. (The other stuff is mostly right, and probably contributes.)


The final code takes just 2.90ms per iteration.


The rest is not a fair comparison, because it rewrites the used libraries, not the application code.

You can always speed up an application if you rewrite the used libraries to match your specific use case.


It's a fair comparison if the purpose is to guide people in fixing performance issues in their python code.

"That Rust library will be faster than the corresponding python library" is a useful thing to know here.


Usually not by 10x though, unless the original implementation involved some really bad decisions.


The Rust code is still only brute force - using suitable persistent acceleration structures you can probably get a 10x again, or maybe even 100x; in 2D a k-d tree is really fast for nearest-neighbour queries.

So much faster that the allocations for the result will probably be the bottleneck.
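e.g. with scipy (a sketch; assumes the polygon centers are already in an (N, 2) array):

    import numpy as np
    from scipy.spatial import cKDTree

    centers = np.random.rand(100_000, 2)   # placeholder polygon centers
    point = np.array([0.2, 0.3])
    max_dist = 0.05

    tree = cKDTree(centers)                                # build once, reuse per query
    close_idx = tree.query_ball_point(point, r=max_dist)   # indices of centers within max_dist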


Making Python (near infinitely) faster by using it as a glue language, and running all the computation outside Python :-P


Yeah what's wrong with that? I think this sounds amazing. It gives you all the fast prototyping and simplicity of Python, but once you hit that bottleneck all you have to do is bring in a ringer to replace key components with a faster language.

No need to use Golang or Rust from the start, no need for those resources until you absolutely need the speed improvement. Sounds like a dream to a lot of people who find it much easier to develop in Python.


It sounds amazing, but bear in mind there is a lot of code which can't be sped up like this because:

- Some code doesn’t have obvious optimization hotspots, and is instead just generally slow everywhere.

- Most FFI boundaries incur their own performance cost. I’m not sure about Python, but I wouldn’t be surprised if FFI to rust in a hot loop is often slower than just writing the same code in Python directly. And it’s not always easy to refactor to avoid this.

- A lot of programs in languages like Python are slow because the working set size contains a lot of small objects, and the GC struggles. You can optimize code like this by moving large parts of the object graph into rust. But it can become a mess if the objects rust retains then need references to Python objects, in turn.

The optimization described in this blog post is the best case scenario for this sort of thing - the performance hotspot was clear, small, and CPU bound. When you can make optimizations like this you absolutely should. But your mileage may vary when you try this out on your own software.


> Most FFI boundaries incur their own performance cost. I’m not sure about Python, but I wouldn’t be surprised if FFI to rust in a hot loop is often slower than just writing the same code in Python directly. And it’s not always easy to refactor to avoid this.

They definitely do, but I’d usually suggest that if you find this an issue then perhaps the function you’re exposing from the compiled language should be higher level, with more work done in the compiled code to avoid the overhead of returning control back to the interpreted language.


Maybe. But that can also be a self tightening knot. Sometimes there’s no elegant place to cleave a program or library in two, and you really just want to pick a single language for the whole project.

Mixing languages can also be a bit of a disaster for maintainability. Refactoring codebases which meaningfully span multiple languages is miserable work.


I wish I had understood this in 2004 when I decided to go all-in with an interpreted language (in this case, Lua) for code that needs to make FFI calls in hot loops. Then again, I suppose my best alternative at the time would have been C++98 as compiled by Visual C++ 6. I'm glad we have much better options now.


Python is a rough language to be productive in. It's a great scratchpad, but dynamic typing, exceptions/poor error handling, and a horrifying deployment and dependency system make me reach for something like Go in any case where I need something to be even vaguely reliable.

The more ML I do, the more disappointed I get.


Ever tried gluing Go with either Python or JavaScript? I'm interested in learning what libraries there are to glue them and how complicated and slow they could be.


I've used gopy[0] recently to access a go library in Python. It surprisingly Just Worked, but I was disappointed by some performance issues, like converting lists to slices.

[0] https://github.com/go-python/gopy


The gulf between the world Python was written to address and the world we live in continues to grow. Python's native memory layout is actively hostile to any sort of performance on modern systems, so it's going to continue to have the problem that interacting with a faster runtime will require a ton of copying.

I've never been impressed by the claim that you can just drop in other faster languages into Python; the costs of communication are so staggeringly high that you have to write around it to have any gain, and by the time you've moved effectively your entire task into the underlying faster language, Python's gain is often quite minimal. NumPy works to some extent because it can fire off huge tasks from a single Python call, a lot of less numerical code can't do that, and I think the ML community still generally underestimates and fails to understand how much performance they leave on the floor in the Python portions of their code, at least based on my interaction with them.


To some degree, it's about familiarity with your tools. And different tools are optimized for different tasks.

Besides, aren't you deploying Docker containers, anyway?


I do, absolutely. But it seems like an exceptionally rare practice for most Python code, at least in the ML space, which puts me back at the start.


Many ML practitioners aren't software engineers. I don't expect that cohort (non-engineers) would manage a deployment well in any language.


That's just habitual. Anyone coming from Python could say the same things about another language. The typing never bothered me because you learn to work with it. But there is of course type hinting now, which I barely use.


I don't think so. I was a Python dev for years, and plenty of other languages don't have the issues Python does. It was a great learning language when I started, but I think most folks that have only used Python just don't realize what they are missing.


Can you give me an example from your recent memory?

I've been coding since the 90s, ASP, PHP, JS, C, Perl, I transitioned from Perl to Python back in 2012-2013. Dabbled in Go when it first came out because I was a Plan9 fanatic and recognized some of the source files, but never went further than tutorials.

Honestly I find very little wrong with the Python ecosystem, except the general insecurity of using package managers. But that applies to most package management, it's a social/infosec issue that Fedora has mitigated fairly well, if you want role models.

The languages that I found most annoying, as a user and developer, were C, Javascript, Typescript and Ruby.


I agree. Add in the fact that you will have to do awkward FFI to some faster language... It's better just to use that language in the first place.


Right? I mean, it's not like we haven't been doing this already. All the computationally intensive python libraries are just a convenient wrapper for C anyway, the only reason python can be used for ML.


Then you give up the benefits of using a managed language, and you now have to maintain two stacks.

IMO/IME much better to go with a language where you don't have this dichotomy in the first place - e.g. Java or C#.


Also, there is a faster Python which is also Python. And the author considered it as well (both PyPy and Numba), it's just in this particular scenario they were not the best way to go.


This is basically what Python was first designed for and as evidenced by the article still excels at


Well, that's exactly where Python works well. As scripting glue sitting between or above native code which does the heavy-lifting.


> and running all the computation outside Python

You don't have to move all the computation; just moving the `for` loops will help a lot.


I had a similar problem, when I was working as a PhD student a few years ago, where I needed to match the voxel representation of a 3D printer with the tetrahedral mesh of our rendering application.

My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.

Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.

Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). In contrast, the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.


Numba is great if you want to write a naive loop approach in python.


Long startup time is relative. I believe it's much lower now than a couple of versions ago. 0.15s or so? Interop between python and rust will also take time.


Julia 1.9 is fast. And you can use https://github.com/Suzhou-Tongyuan/jnumpy to write python extension in Julia now. So I think after 1.9 release julia would be much more usable.


This was a silly and unnecessary optimization. He’s just using numpy wrong.

Instead of:

    for p in ps:
        norm(p.center - point)

You should do:

    centers = np.array([p.center for p in ps])
    norm(centers - point, axis=1)

You’ll get your same speed up in 2 lines without introducing a new dependency


Isn't this the version referenced on the GitHub repo [0], which speeds things up 6x instead of 101x?

  There's also a "v1.5" version which is 6x faster, and uses "vectorizing" (doing more of the work directly in numpy). This version is much harder to optimize further.
[0] https://github.com/ohadravid/poly-match


No, their v1.5 is still calling norm on every polygon. They’re still using it wrong

On Google colab

    import numpy as np
    import time


    vals = np.random.randn(1000000, 2)
    point = np.array([.2, .3])
    s = time.time()
    for x in vals:
        np.linalg.norm(x - point) < 3
    a = time.time() - s

    s = time.time()
    np.linalg.norm(vals - point, axis=1) < 3
    b = time.time() - s

    print(a / b)
~296x faster, significantly faster than the solution in the article.


This is the major reason I don't really buy into things like JITs solving all performance problems (as long as you carefully write only and exactly the subset of the language they work well with) or NumPy not being affected by Python being slow. There's more code like this in the world than I think people realize.

Having to write in a subset of a language in order for it to perform decently is a big deal. Having no feedback given to the programmer when you deviate from the fast path makes it even harder to learn what the fast path is. The result is not that you get the ease of Python and the speed of C without having to understand much; the result is that you have to be a fairly deep expert in Python and understand the C bindings intimately and learn how to avoid doing what is the natural thing in Python, the Python covered in all the tutorials, or you end up writing your code to run at native Python speeds without even realizing it.

It's a feasible amount of knowledge to have, it's not like it's completely insane, but it's still rather a lot.

My career just brushed this world and I'm glad I bounced off of it. It would drive me insane to walk through this landmine field every day, and then worse, have to try to guide others through it all the while they are pointing at all the "common practices" that are also written by people utterly oblivious to all this.


This is a very valid point that I can’t disagree with. I’ve gone through the pain of learning that subset of the language decently well, but also been lucky enough to work at places that compensate very well for that knowledge.


This is nice. How would you go about it as a performance noob? I can't imagine there's a line in the docs saying "this is slow!".


LineProfiler is the best tool to learn how to write performant Python and do code optimization.

https://github.com/pyutils/line_profiler

You can literally see the hot spot of your code, then you can grind different algorithms or change the whole architecture to make it faster.

For example: replace short for loops with list comprehensions, vectorize all numpy operations (vectorizing only partially does not help), use `not any()` instead of `all()` for boolean checks, etc.

After doing this for a couple of weeks, you can basically recognize most bad code patterns at a glance.
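A minimal example of what that looks like (assuming line_profiler is installed; `profile` is injected by kernprof):

    # save as demo.py and run:  kernprof -l -v demo.py
    import numpy as np

    @profile
    def slow_sums(rows):
        out = []
        for row in rows:              # the per-line report shows the time spent here
            out.append(np.sum(row))
        return out

    slow_sums(np.arange(100 * 100).reshape(100, 100))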


You should take a look at Scalene - it's even better.

https://github.com/plasma-umass/scalene


Wow this looks sick! Great thanks!


So in addition to what akasaka said (another thumbs up for line profiler from me, great tool) this isn’t a problem with linalg.norm being slow. It’s plenty fast, but calling it thousands of separate times in a Python loop will be slow. This is more just about learning how to vectorize properly. If you’re working in numpy land and you’re calling a numpy function in a loop that’s iterating over more than a handful of items, chances are you’re not vectorizing properly


I realize I should be more specific about numpy usage.

If you see a piece of code like this, it's a telltale sign that the person has no idea what he/she is doing:

Bad Pattern:

    my_list = np.array(xxx)
    summed = []
    for row in my_list:
        summed.append(np.sum(row))
Worst Pattern:

    my_list = np.array(xxx)
    
    def get_summed(arr):
        return np.sum(arr)

    summed = []
    for i in range(len(my_list)):
        summed.append(
            get_summed(my_list[i])
        )
        
Preferred:

    # np.array(xxx) is redundant
    summed = np.sum(xxx, axis=1)
Same applies to all numpy operations.


I feel a lot of the "Python perf" thing is an inferiority complex. cPython is getting faster all the time, and (obviously using libraries like numpy and others that hook into compiled code) I don't think I've ever seen it become a business bottleneck. If it's ever a scaling problem, then you hire lower-level language devs, it's a good problem to have.

Python is much, much easier to learn, and Rust is notoriously difficult. This obviously feeds into the inferiority complex. But -- while I do know there's a number of applications where perf is crucial -- I think it's well worth doing an ego check before moving from what's a lower friction path and gives you access to myriad developers, including ones trained in very hard disciplines.

Also: does it ever make sense to write something like a CRUD+ backend in Rust? Maybe 100X Rustaceans can do it with one hand tied behind their backs; but imagine what these ubermenschen could be achieving in Python?


We're working as an internal dev team for a moderately large company and the challenge is that very few people have to maintain a ton of business automation. We love our Python+Django stack as it gives it a ton of productivity, but as our business and data grows, some things are becoming quite slow.

For now, we're getting away with Python-only optimizations and relying on cPython improvements. Still, some processes take a few minutes and a lot of that is spent just because Python is quite slow. In that sense, yes, it happens for CRUD+ applications (which I consider our CRM+ERP to be).


I think you might be surprised. I would say Rust isn't notoriously difficult per se, it's just harder to please the compiler. But that's still viewing the compiler as an adversary when it's more like an assistant, so the analogy breaks down. You don't have to be a 100x engineer to use Rust. In fact, quite the opposite. Rust gives engineers a lot more guardrails to prevent what would be runtime errors in other languages.


I think for a lot of people, like myself, writing Rust is just as easy as writing Python (actually it's way easier IMO). So when people write Python it's sort of like "uh, why would you choose the slower tool?".

Obviously the answer is "because I'm faster with Python" but then part of my brain is "well get good idk".


If writing Tcl extensions in C during the .com wave 23 years ago taught me something, it was that glue languages are great, and that I don't want to use any that doesn't come with either an AOT or a JIT compiler in the box, other than for OS scripting tasks.


> Also: does it ever make sense to write something like a CRUD+ backend in Rust?

Well... given 2 frameworks equally pleasant to work with, why not use the one with the best performance? (Whatever performance means... less CPU? less memory? better I/O?)

As a Django user, working on problems where Django shines, I have yet to see such a solution in Rust, but that doesn't mean it won't happen one day.


I use it for backends that do a lot of math because the python deployment story sucks, rust is faster anyways, and I love rust (do python and js by day). It’s been so easy to bring up rust containers on serverless and they just don’t break. The cold start times are also insanely good compared to python


Whatever performance means is really interesting. One of the few unquestionable benefits of cloud computing is that it is now easier than ever to measure the 'running' cost of a program / solution.

What Amazon charges for, say, a t2.nano must factor in the cost of the hardware, power, cooling, and system administration, so it is technically possible to evaluate the running cost of a piece of software.

If I, as the ruthless capitalist that I am pretending to be, made the running cost of the solution an integral part of annual remuneration, through bonus calculation say, there will be a tipping point when that cost has an impact on techology strategy.

I know it sounds remote but a similar scheme has/was implemented in some investment banks, where bonus calculations introduced a vesting schedule and factored in the back office cost of maintaining the trade over its lifetime rather than being based entirely on revenue.

Let's start with a totally reasonable calculation made by a non-technical user, with absolutely no understanding of the problem at hand - and by that I mean completely unfair and arbitrary, so pretty much the real world.

The TechEmpower multiple request benchmarks tell me that Rust is the performance king. Go offers about 2/3 that performance, and python 1/3, that bit doesn't sound too unreasonable, perhaps a little unkind to Go, but I digress; given these relative performance numbers, if the running cost bonus pool is $x, the maximum amount you will receive for a python based solution is 1/3x.

At what point does the financial impact influence our technology strategy in a way that perhaps our otherwise virtuous proselytizing about the need to reduce our carbon footprint doesn't?

FWIW I am continually surprised that Jon Blow, Casey Muratori, Mike Acton et al. aren't beating exactly this drum. It's not that their arguments about writing high performance software are not compelling in their own right, but appealing to people's otherwise orthogonal concerns / passions seems to be a logical next step.

Sorry to ramble on!!


I wonder why GraalVM is not more often used for these speed critical cases: https://www.graalvm.org/python/ (Same for ruby https://www.graalvm.org/ruby/)

Is the problem the Oracle involvement? Or is it not that fast as advertised or problems with the ecosystem (C libraries)?


My concern is that it’s not ready for prod yet.

“At this point, the Python runtime is made available for experimentation and curious end-users. “

https://www.graalvm.org/latest/reference-manual/python/


I would not use anything made by oracle if I have the choice. My team sees it similarly.


Better clean up the Linux kernel from Oracle contributions then, in case you are using it.

https://lwn.net/Articles/915435/


So the Linux kernel is an Oracle licensed product? Better uninstall Ubuntu then.


I am not the one having issues with Oracle.

In fact, I am thankful that they at least avoided Java being stuck in Java 6, and MaximeVM turned into GraalVM, when no one else cared to save Sun from insolvency. Only IBM made a candidate offer that was quickly withdrawn.

So people should stop acting as if there was any magic way to rescue Sun assets.

Also the fact that they were one of the first GNU/Linux supporters in an enterprise context, which allowed us to actually have a couple of GNU/Linux computers among our AIX, HP-UX and Solaris servers, happily running Oracle instances.


Language design request:

Take from Rust: algebraic types, ahead-of-time compilation and strong types, functional features, and a borrow/escape checker that automatically turns shared data into Rc or Arc as necessary, instead of tormenting me into rewriting performance-irrelevant code;

Take from Python: the simple syntax, default pass-by-reference of all non-numeric types, simplified string handling, unified slice and array syntax - and any other simplifying feature possible.

... and give me a fast, safe and powerful language that gets out of my way while maximizing the power of the compiler to prevent bugs. Golang was a commendable attempt, but they made '70s design decisions that condemned the language: mandatory garbage collection, nullables, (void*) masquerading as interface{} casts, under-powered compiler etc.


I think Nim is closest to this wishlist


Pretty much. ARC is RC without needing to write it manually. It can be as compact as Python and near C++/Rust speeds with some optimization. Then add in macros for real performance tricks like SIMD. Julia's pretty nice as well.


Modern Java and modern C# should be closer to what you described.

On a side note, python is strictly pass by value. For non-primitives, their references are passed by value.


Sounds like swift to me


This is possibly one of the best written articles end-to-end I have read. Excellent job telling the story


Agreed it was well written, but kinda pointless, since they could have "solved" the problem using the existing tools in a couple lines of code without any new deps. All that content and profiling, and they missed the fact that they were using numpy wrong.


Totally out of curiosity, could you be a bit more concrete in your posted example? I can't get it to work (I'm inexperienced with numpy and I'm missing something when translating your quick example to python).

Thanks!


Yea sorry that was just pseudo code. You want it in this form:

    centers = np.array([
        [3, 3],
        [4, 4]
    ])
    point = np.array([3.5, 3.5])
    vals = centers - point
    np.linalg.norm(vals, axis=1)


Elixir is made faster with Rust, too. Rust is a great skillset to have for those measured moments.


I wonder if being able to quickly retrieve a numpy array of the polygon centers would make an equivalent difference. If you could retrieve the centers from the polygons as an array, you could just use numpy operations for the closest-polygon search:

    centers = get_centers(polygons)  # M x 3 array
    close_idx = np.where(np.linalg.norm(centers - point, axis=1) < max_dist)[0]
    close_polygons = polygons[close_idx, :]

That's one reason I prefer to use arrays for polygons, rather than abstracting them into a Python object. Fundamentally geometries are sequences of points, and with some zero-padding to account for irregular point counts, you can still keep them in a nice, efficient array representation.


Agreed, I speed up Python numpy code with numba quite often and it isn’t at all unreadable to put it in an ndarray subclass.

    poly = Polygon(vertices)
I would bet you can achieve just as much of a speedup with numba or Cython using this form.
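Roughly like this, I'd guess (my sketch, operating on a plain array of centers rather than the article's Polygon objects):

    import numpy as np
    from numba import njit

    @njit
    def close_mask(centers, point, max_dist):
        # plain loop, compiled to machine code by numba on first call
        n = centers.shape[0]
        out = np.empty(n, dtype=np.bool_)
        for i in range(n):
            dx = centers[i, 0] - point[0]
            dy = centers[i, 1] - point[1]
            out[i] = dx * dx + dy * dy < max_dist * max_dist
        return out

    centers = np.random.rand(100_000, 2)
    mask = close_mask(centers, np.array([0.2, 0.3]), 0.5)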


in the domain of Python tooling made faster with Rust, check out https://github.com/charliermarsh/ruff which is 10-100x faster than pylint etc


I'd like to see a comparison to when they use a spatial index like an STRtree or R-tree.


> Python is a superb API for researchers,

Where does this nonsense come from? No. It isn't. It's just a stupid fashion. Something that should be discouraged, not encouraged by trying to make it work when it's obviously broken.

As someone who does work with researchers who do use Python a lot, I see the everyday painful experiences of people who use it. And this pain doesn't need to be there. It's just masochism. And the only real reason is that they don't know any better. The only other thing they know is Matlab, and that's even worse.

Python is just a bad language. Popular, but awful. Ironically, while researchers are supposed to be on the forefront of discovery and technology... well, they aren't. Industry outpaced research. So much so that today there are government programs to onboard researchers into a more automated and more automatically verified way to do research. And we aren't talking about making an elite force here. These programs are meant for people in research who copy data from Excel sheets one data point at a time into another spreadsheet. It's that kind of bad.

My wife happened to work in such a government center, and that's how I know about what's going on inside these programs. And it's very sad that decisions about the preferred tools for research automation are made by people who, unlike most of their peers, had some exposure to what happens in the industry, but had no deeper understanding of the reasons any particular technology ended up in any particular niche, nor any independent ability to assess the capabilities of any particular tech. It's really sad what's going on there.


>Python is just a bad language. Popular, but awful.

You couldn't be more wrong. Python is the language going forward. There is a reason why bleeding-edge ML work is done through Python, why it is the backend of several very popular web platforms, and why it is the second most used language on GitHub behind JS (and that only because JS is hard-tied to the web).

I have a feeling the hate for Python just comes from paradigms taught in extremely poor CS curricula in schools. If you think that Python is bad because of dynamic typing, you haven't been paying attention to the direction compute is going.


No, you couldn't be more wrong. Nice argument you dug up there!

Language going forward... I haven't had such a laugh in a while... It was dead in the 70s, long before it was born. Are you under the influence or something?

> There is a reason why bleeding edge ML stuff is done through Python

No, there isn't, and bleeding-edge ML isn't done in Python. It's done mostly in C++ and, more rarely, in C. Those are also garbage languages, but that's just the reality we live in. Python is just a tiny fraction of what's being done, the tip of the tip of the iceberg.

And no, Python is not the backend of popular platforms; it is, again, a tiny little bit of what's going on in those "popular platforms". And, unlike you, I actually worked for those "popular platforms", so I would know how that is.

But even imagine that, in your fairy-tale world, Python is somehow super-important and successful. Didn't I already explain how that is possible while it is still a garbage language? I bet I did. You just rushed to spit your despair and frustration at me because I offended something that you like... well, that you think you like, but really, you just don't know any better, just like the other people using Python :( And that's really the sad part. You lack the critical thinking to be able to tell the quality of your tools. You substituted the ability to judge the quality of something with looking at what the majority is doing and following the herd.

And I wasn't a student when Python started being taught in school. In my day the language of choice for intro to CS was Visual Basic :| It's also garbage... and the whole idea that intro to CS has to be done by learning a random fashionable language of the day is just dumb. It's a misunderstanding of SICP: to someone who didn't understand what the book is about, it looks as if the authors tried to teach the students Scheme, and later intro-to-CS classes were modeled on this book but replaced Scheme with another language, completely forgetting that Scheme was used in the first place as a language with "little syntax", to avoid spending time on learning the language itself.

You are again confused when you use the expression "dynamic typing"; you don't even know what that is. But you will never accept it because, again, you don't have the ability to think critically, you follow the herd. You read a "definition" somewhere that says that Python is a "dynamically typed language", and you believed this nonsense, even though you never really thought about what that might mean...

One thing we agree on, though: the direction compute is going is to utilize more low-skill programmers, and that means, besides other things, following fads more than doing any sensible work. Python is the current fad, and so, for a while, we are stuck with this garbage. And, the way things are going, at best I can hope for another equally idiotic language to come and replace it. But probably not very soon.


What are the best alternatives to Python then?


For research? -- Julia seems to be definitely better. It's purpose-built for doing just that. R would also be there. If you want general statistics, then add J to the fold.

But specific fields often have their own, bespoke solutions. I've only ever dealt with math, but it has plenty of its own niche languages that are much better than, say, Sage. My personal choice was Maxima, but that's because I like Common Lisp.

Furthermore, it's just the situation today. It doesn't mean that this is what it has to be tomorrow. Any of the languages I used in this domain have their issues, and could still be improved. We are nowhere near a place where it's hard to imagine something better than what we have. So, I'd say, if you really want a very good language, you might as well start building one now -- you have a very good chance yours will be the best one so far.


> For research? -- Julia seems to be definitely better. It's purpose-built for doing just that.

What does this even mean lol, “research” is incredibly broad


I'll elaborate. When I use the word "research" I mean activity performed by accredited researchers. Researcher is a rank, or a title if you want. It's similar to "software engineer" -- if you are in an academic institution the word "researcher" could be part of your title / job description.

The activity has as its goal to publish the findings. Findings are expected to be meaningful and to be found in the domain of sciences. It is also possible to extend the scope to arts, but I'm hesitant about it, and the way I used the word doesn't truly relate to what happens in arts department (for this purpose, math is an art).

Accredited, here means association with academic institution by means of employment or similar.

On a conceptual level, all discoveries in the sciences must have a hypothesis whose validity and truthfulness is to be established through empirical evidence, i.e. experiment. To interpret experiments one needs to use statistics. This is the vehicle driving experiments, and this is where the kind of programming I'm talking about comes in.

Note that researchers might use programming for other activities, such as keeping their diaries or automating their correspondence and so on. But I meant specifically the statistical aspect of doing research.


I think what OP means is that Julia has a number of features that work very well for the workflows and processes of programming for scientists.

1. Interactive workflows. One of the defining features of doing science is that you don't know what the right answer or right approach is. This makes interactive workflows (like REPLs) really valuable since you can load data once and do 100 different analyses on it. Notebooks are also really useful as a means of showing both code and results at the same time, and Pluto.jl is one of the best here since it removes the possibility of ending up with inconsistent state by tracking dependencies between cells.

2. Reproducibility. Another important feature for scientific code is that you want someone to be able to easily take your data and code, install it, run it, and get the same answer. This is one of Python's biggest shortcomings. Python has an incredibly rich package ecosystem, but it lacks a good unified system for reproducibly installing packages (Poetry is the closest, but it has problems with binary dependencies). Julia (and Rust) have virtual environments and the idea of a manifest file that records the exact version of all your (transitive) package dependencies built in, which makes it trivial for someone else, with no instructions, to get an exact clone of all the software needed to run your code.

3. Ease of use. Most scientists are scientists first and bad to mediocre at programming (there are obviously exceptions; some scientists are great programmers). Static type systems and manual memory management are major impediments for beginners. C++ gets some scientific use for its performance, but there's a reason Python, R, and Matlab are the languages of most scientists.

4. Performance. Lots of fields (bio, astronomy, high energy physics, chemistry) need a fast language to be able to get results in a reasonable amount of time. Julia is fast and is one of the easier languages to write GPU accelerated code in.

5. Open source. Closed source languages (Matlab) are a total pain to deal with.


>R would also be there.

Surely you jest. Whatever problems Python may have with proper development or deployment practices, R is ten-fold worse. R is, and will forever be, an interactive make-it-work-now language first and a production backbone second. The language is far too accommodating and will silently commit all manner of sins to keep the program running. Package management is still a huge headache, as proper isolation is still not well addressed. Too many R packages assume they have root during installation and can do whatever they wish. Volumes can be written about R namespacing.


> R is and will forever be an interactive

I don't see this as a problem for researchers. Quite the contrary. To compare it to Python: Python's interactivity sucks. Its syntax makes efficient interactive use hard, and the debugger is abysmal. And considering that the typical environment in which Python is used in a research setting is a notebook, add to that the even worse debugger wrapper available in notebooks.

Python is garbage for production systems too. If you want to use the results of your research for practical stuff, you will not use Python code you wrote for research. I'm more familiar with the world of medical research, and can confidently say that I've never seen a practical medical device or software product that used Python. Medical equipment typically wants to be realtime, which is a world completely closed to Python, for example.

However, Python (or R) being bad for production systems is perceived to be an acceptable price... I wish it wasn't, but that's how things are.

You cannot unironically claim that Python has superior package management... It's the shittiest ever. I have not seen a language which has done it worse than Python, and I've worked with at least two or three dozen of them. In my day job, I'm in the ops / infra department. I maintain a lot of Python packages, both for commercial entities and for open-source independent developers. Part of me is very upset that Python is such a shit-show when it comes to packaging, but another part of me is happy, because it means that I will have a job dealing with the fallout of Python packaging for a while.

Bottom line, if I had to support a bunch of scientists doing their research, and I had to deal with packaging their stuff, I'll take R over Python in a heartbeat.


Rust is verbose. I would use https://github.com/Suzhou-Tongyuan/jnumpy to write Python extensions in Julia and usually get similar performance.


Does it compile AOT, or are you stuck with Julia's startup time?


One of the other understated pros of rewriting some parts in Rust is that you can parallelize easily with Rayon[0]

[0] -- https://github.com/rayon-rs/rayon


Rust is great but isn’t the core problem here using the wrong algorithm? It looks like this is ideally suited for a quad tree instead of a naive for loop. I would expect that to pulverise any current benchmark.


I guess you also need to take into account the time to create the quad tree from an unsorted 'polygon soup' first, and in terms of coding effort, a brute-force conversion from Python to a compiled tight loop over unsorted arrays provides a lot of bang for the buck (and a speedup of 100x for relatively little effort might be 'good enough' for quite a while, until the input data grows big enough to require the next optimization effort).

(e.g. it doesn't need to be "as fast as possible", just fast enough to no longer be a workflow bottleneck)


I'm assuming the polygons don't change too often, so you can amortize the construction of the quadtree. Depending on your world view the implementation is trivial, since Shapely, a dependency they likely already have, has an implementation of one.
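
If it helps, Shapely's built-in index is an STRtree (an R-tree variant rather than a quadtree); a rough sketch, assuming Shapely 2.x where query() returns indices into the input geometries (shapely_polygons, x, y and max_dist are placeholders here):

    from shapely.geometry import Point
    from shapely.strtree import STRtree

    tree = STRtree(shapely_polygons)             # built once, amortized over many queries
    probe = Point(x, y).buffer(max_dist)         # bounding-box pass only
    candidates = [shapely_polygons[i] for i in tree.query(probe)]
    # still needs an exact distance check on the (hopefully few) candidates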


Author here: actually, in this analogy (as this is just a demo library), the polygons change each time so we couldn't use this type of optimization (at least not in a straightforward way).


Why do I need to rewrite in Rust, when I can just use Polars[1] that will cover most usecases [1] https://www.pola.rs/


What is the relation? Polars replaces pandas, not numpy.


Polars and DuckDB have replaced Pandas for me for the most part.


> The library was already using numpy for a lot of its calculations, so why should we expect Rust to be better?

I literally clicked in to read the article to see if they'd mention this:) But... unless I missed it, there wasn't really an answer? I thought numpy does do the heavy lifting in native code, so why is this faster? Does this version just push more of the logic into native code than numpy did?


Numpy is fast when the code is vectorized. The code they are benchmarking against was not vectorized. They wanted to calculate the distances of n points against a given point and find out which points are closer than a threshold (max_dist). Instead of vectorizing the whole operation, the Python code was just calling numpy in a loop to find the distance between two points.

Just that small change already gives 10x faster performance without ever leaving python/numpy land.
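
Roughly the difference, sketched here with centers as an (N, 2) array of the polygon centers (not the article's exact code):

    import numpy as np

    # per-polygon loop: one tiny numpy call per element, dominated by Python overhead
    dists = [np.linalg.norm(c - point) for c in centers]

    # vectorized: a single call over the whole array
    dists = np.linalg.norm(centers - point, axis=1)
    close_idx = np.flatnonzero(dists < max_dist)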


> They wanted to calculate the distances of n points against a given point and find out which points are closer than a threshold (max_dist).

Scipy has already implemented such a thing, and so has scikit-learn, since k-nearest-neighbor queries do exactly this kind of work.
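
For example with scikit-learn (a sketch; centers as an (N, 2) array and point as a length-2 array are assumed, as in the other comments):

    from sklearn.neighbors import NearestNeighbors

    nn = NearestNeighbors(radius=max_dist).fit(centers)
    dists, idx = nn.radius_neighbors([point])
    # idx[0] holds the indices of all centers within max_dist of point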


They did have a choice quote a bit before the blurb you quoted. The real-world problem is sufficiently more complex/different that vectorizing it would be a pain.

>It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable...


Thanks, I managed to miss that


The article mentioned that there are some gains from using vectorised numpy methods, i.e. spending more time in numpy code. I would be interested in whether the List[Polygon] could be converted into two long arrays of all xs and all ys, with indices into the starts (essentially a dense representation of a sparse array), and then the core function rewritten for Numba, since it would then not need any Python objects. This would break the interface of course, but may get within striking distance of the Rust implementation.
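
Something like this, as a sketch of that packed layout (xs/ys hold all vertices concatenated, starts holds per-polygon offsets; the names are mine, not from the article):

    import numpy as np
    from numba import njit

    @njit
    def polygon_centers(xs, ys, starts):
        # xs, ys: (total_vertices,) float64; starts: (n_polygons + 1,) int64 offsets
        n = starts.shape[0] - 1
        centers = np.empty((n, 2))
        for i in range(n):
            lo, hi = starts[i], starts[i + 1]
            centers[i, 0] = xs[lo:hi].mean()
            centers[i, 1] = ys[lo:hi].mean()
        return centers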


The slowness comes from the interaction of numpy and a Python object, "Polygon", which is not numpy. I suspect that a sufficiently clever coder could have optimized the result without resorting to Rust, but at the cost of a substantial increase in the complexity of the codebase. The proposed approach keeps the Python code simple (and moves the complexity into having another language to deal with).


    diff --git a/poly_match_v1.py b/poly_match_v1.py
    index 675c88a..4293a46 100644
    --- a/poly_match_v1.py
    +++ b/poly_match_v1.py
    @@ -1,4 +1,5 @@
     from functools import cached_property
    +from itertools import compress
     from typing import List, Tuple
     import numpy as np
     from dataclasses import dataclass
    @@ -56,11 +57,8 @@ def generate_example() -> Tuple[List[Polygon], List[np.array]]:
     def find_close_polygons(
         polygon_subset: List[Polygon], point: np.array, max_dist: float
     ) -> List[Polygon]:
    -    close_polygons = []
    -    for poly in polygon_subset:
    -        if np.linalg.norm(poly.center - point) < max_dist:
    -            close_polygons.append(poly)
    -
    +    dist = np.linalg.norm([poly.center for poly in polygon_subset] - point, axis=1)
    +    close_polygons = list(compress(polygon_subset, dist < max_dist))
         return close_polygons

10x faster than the original, without resorting to native code, and without any substantial increase in the complexity of the code base.


I wish the author would take your suggestion (and the other recommendations here) and try it out again. It would be a very interesting follow-up to read: "How we sped up Python and simplified our codebase by removing one more dependency".


Focusing on speeding up find_close_polygons instead of realizing that you're matching many points against the same set of polygons is also unfortunate, since that function being slow is a red herring. You can create a scipy.spatial.KDTree for example and just query all the points against that.
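
A sketch of that approach (centers as an (N, 2) array of polygon centers and points as an (M, 2) array of query points are assumed):

    from scipy.spatial import KDTree

    tree = KDTree(centers)                                     # built once
    idx_per_point = tree.query_ball_point(points, r=max_dist)
    # idx_per_point[i] lists the polygons whose center is within max_dist of points[i]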


People can write slow code in any language ;-) We've had to fire contractors who "wrote" dataframe code that was not vectorized in practice, despite repeated requests to do so. Same thing for slow code in CUDA.

From a maintenance view, I much prefer folks write vectorized data frames vs numpy or low level bindings, but that comes from having lived with the alternative for a lot longer. All of our exceptions are pretty much slotted for deletion. (Our fast path is single or multi-GPU dataframes in python.) Here's to hoping that one day we'll have dependent types in mypy!


> People can write slow code in any language ;-)

That was a lot of my curiosity, because I'm not quite a good enough programmer to know whether the problem is just that the original code is bad or does something super inefficient, such that rewriting it in any language would let you make massive improvements.


I am really curious whether there is an important reason for not trying this performance improvement with Cython first. Can someone comfortable with Cython explain the pros and cons of doing this optimization with Cython?


Python and Rust. Apple has some job openings in distributed systems for people fluent in both.


Is Apple still hybrid only, no remote? Any links to some of their careers pages on this?

Piqued my curiosity as my background almost lines up, and I'm interested in that kind of role... But I live near one of their offices only hiring for skill sets I don't really have.


I'm wondering how much would be the speedup by rewriting the critical part in cython


It's not really Python that was sped up though, it was an application written in python augmented with a bit of optimized rust code.

This sort of hybrid is super common, you typically spend 90%+ of your time in computationally intensive problems in a very small subset of your code, typically the innermost loops. Optimizing those will have very good pay-off.

Traditionally we'd do this with the high-level stuff in one language and assembly for the performance-critical parts; these days it is more likely a combination of a scripting language for the high-level part and a compiled language for the low-level parts (C, Rust, whatever). Java and the like are less suitable for this kind of optimization, both because they come with a huge runtime and because they are hard to interface to other languages unless those happen to use the same underlying VM, in which case there usually isn't much performance gain anyway.

Another nice way to optimize computationally intensive code is by finding out if the code is suitable for adaptation to the GPU, which can give you many orders of magnitude speed improvement in some cases.
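
For numpy-style array code, a GPU port can be as mechanical as swapping the array module; a sketch using CuPy, reusing the thread's centers/point/max_dist names and assuming the arrays are large enough that the host-to-device copies are worth it:

    import cupy as cp

    centers_gpu = cp.asarray(centers)                       # copy the (N, 2) centers to the GPU once
    dists = cp.linalg.norm(centers_gpu - cp.asarray(point), axis=1)
    close_idx = cp.flatnonzero(dists < max_dist).get()      # bring the result back as a numpy array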


Very cool! I can see myself using this soon actually :) On top of the "code speed up", this is a good problem for 2D data structures that perform this type of "find objects within radius" query.


It is an alternative framing of the N+1 problem (mistake) SQL users make

https://news.ycombinator.com/item?id=34207974


>Also, using any JIT-based tricks (PyPy / numba) results in very small gains (as we will measure, just to make sure).

I wasn't able to see the numba comparison. Anyone know how much worse it was?


Kudos - this is a very nice blog post - "real" engineering ;)


For this particular code, vectorization and some acceleration library, such as JAX, may be a better path to optimization. Otherwise an excellent article!
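
A minimal sketch of what that might look like with JAX (centers and point as arrays, as in the other comments; names are assumptions, not the article's code):

    import jax
    import jax.numpy as jnp

    @jax.jit
    def close_mask(centers, point, max_dist):
        # one fused, compiled kernel instead of a Python-level loop
        return jnp.linalg.norm(centers - point, axis=1) < max_dist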


It's much easier and more accurate to time your python scripts with `time python script.py`. Cool write up.


`time` is absolutely awful, with the minor exception of BSD's time maybe.

If you’re going to benchmark scripts or executables, use hyperfine.


Very nice. Thanks for sharing.

https://github.com/sharkdp/hyperfine


The premise was to not rewrite everything in rust, but you basically ended up rewriting 90% of it in rust


90% of the bottleneck, not 90% of their whole application. The author says that rewriting everything in Rust would have taken months, so the whole application must be huge.

"It is big and complex and very business critical and highly algorithmic, so that would take ~months of work, ..."


OP fully rewrote the example program in Rust, also moving the data structures there. That would mean any interaction with these ndarrays is possible only on the Rust side, hence any other code that uses them must be rewritten, unless there is some way to expose the Rust ndarrays as Python numpy ndarrays.


Yes, what did you expect? That he shares his internal code base with the world just to silence people who can’t generalize?


No? In fact, if you understood what I wrote: generalizing from this small piece of code, most of the codebase would also have to be partially rewritten in Rust to interoperate with the new data structures moved to Rust. So the "less than 100 lines of code" refers to just this simple example program, which was fully rewritten in Rust; hence, generalizing, the premise was pointless.


Using PyPy, which is a real compiler, might help.

That's doing spatial data processing by exhaustive search, which is inherently slow. There are better algorithms. If the number of items to be searched is large, the spatial indices of MySQL could help.


They tried PyPy at the beginning and it was 2x slower. Plausibly it would be better with additional optimization, but it's not cut and dried.


PyPy is a JIT compiler, not a "real compiler"; it requires warm-up time before it starts optimizing code at runtime.


It does generate machine code, which CPython does not.


In other words, PyPy is not an AOT compiler.


Did you read the article?


The original code already looks like crap, since it builds a new list of objects instead of a mask or something similar. That could also be done with a list comprehension, and it can certainly be vectorized or parallelized.

You need a more experienced Python engineer.


> Rust (with the help of pyo3) unlocks true native performance for everyday Python code, with minimal compromises.

Hasn't he heard of ctypes? You can wrap C structs as Python objects, and have been able to since forever.
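
A minimal ctypes sketch (libgeom.so and its dist() function are hypothetical, just to show the struct wrapping):

    import ctypes

    class Point(ctypes.Structure):
        # mirrors a C-side `struct point { double x, y; }`
        _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]

    lib = ctypes.CDLL("./libgeom.so")
    lib.dist.argtypes = [Point, Point]
    lib.dist.restype = ctypes.c_double

    d = lib.dist(Point(0.0, 0.0), Point(3.0, 4.0))   # -> 5.0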


There’s probably a way to get at the numpy objects without having to go through the python runtime and do all the computation in pure C.

I assume, haven’t really messed with numpy for anything but I can’t imagine it wouldn’t work that way.


Title is terrible. Better: Python code made 100x faster by rewriting a small part in Rust


We've taken the magic numbers out of the title now, as the site guidelines ask - https://news.ycombinator.com/newsguidelines.html



