Making Python faster with Rust (ohadravid.github.io)
387 points by teddykoker on March 30, 2023 | 215 comments



Good for you! You did everything right: measure always, fix the bottleneck if possible, rewrite if necessary.

A little tip: you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist` you compare against `max_dist` squared instead, and you skip the `sqrt()` for every `norm`. Saves a few CPU ticks as well.
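For example (a quick numpy sketch, not the article's code, with made-up values):

    import numpy as np

    centers = np.random.rand(1000, 2)      # made-up polygon centers
    point = np.array([0.5, 0.5])
    max_dist = 0.1

    # with sqrt: one square root per row
    close = np.linalg.norm(centers - point, axis=1) < max_dist

    # without sqrt: compare squared distance against max_dist squared
    diff = centers - point
    close_sq = (diff * diff).sum(axis=1) < max_dist ** 2

    assert (close == close_sq).all()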

I once rewrote a GDI+ point transformation routine in pure C# and got a 200x speedup just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years later, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know the geometry behind your tools, you can either use them efficiently or rewrite them completely.


The author did talk about it on reddit, but explained that for the purpose of the blog post they wanted to focus on the big stuff and profiling-guided optimisation of the process: https://reddit.com/r/rust/comments/125pbq0/blog_post_making_...


The ignore-the-square-root-when-comparing-distances trick is a great one. That's how I got to the top of the performance leaderboard in my first algorithms class.


I think a big mistake in the article, in a context where performance is the main objective, is that the author uses an array of structs (AoS) rather than a struct of arrays (SoA). An SoA orders the data contiguously, which is easy for the CPU to read, while an AoS layout interleaves different data (namely the x and y in this case), which is very annoying for the CPU. A CPU likes to read chunks of data (for example 128 bits per read) and to process them with SIMD instructions, executing multiple calculations in one CPU cycle. This is completely broken when using an array of structs.

He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.
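Roughly, the difference in numpy terms (my own illustration, not the article's code):

    import numpy as np

    N = 100_000

    # AoS: a structured array interleaves x and y in memory
    aos = np.zeros(N, dtype=[("x", "f8"), ("y", "f8")])

    # SoA: two separate contiguous arrays
    xs = np.zeros(N)
    ys = np.zeros(N)

    # the SoA form streams through each array contiguously...
    dist_sq_soa = (xs - 0.2) ** 2 + (ys - 0.3) ** 2
    # ...while the AoS form takes strided views over the interleaved fields
    dist_sq_aos = (aos["x"] - 0.2) ** 2 + (aos["y"] - 0.3) ** 2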


Apache Arrow (https://arrow.apache.org/overview/) is built exactly around this idea: it's a library for managing the in-memory representation of large datasets.


Author here: I agree, that's great perf advice (esp. when you can restructure your code).

I couldn't get into this in the article (it would be too long), but it's a great point and the original library does this in a lot of places.

One problem in our use case is that the actual struct members are pretty big, and we need to group/regroup them a lot.

The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.
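Very roughly (an illustrative sketch only, with a made-up grouping key, not the real code):

    from collections import defaultdict
    import numpy as np

    groups = defaultdict(list)
    for poly in close_polygons:          # output of the initial filtering
        groups[poly.kind].append(poly)   # `kind` is a made-up grouping key

    # one SoA (contiguous array of centers) per group for the heavier math
    grouped_centers = {
        kind: np.array([p.center for p in polys])
        for kind, polys in groups.items()
    }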


Having used this approach in a few languages, I agree that it's (sometimes) (much) better for performance, but it tends to wreak havoc on readability.


Languages with some sort of decent metaprogramming support can alleviate this sort of issue (see Zig, Julia, Jai, etc.)


Do you have examples? I cannot think of a good way to do that.


For python, you can use pandas. It even has arrow support now.


Yeah, I need to try it.

But the parent was suggesting that metaprogramming could make readability issues go away and I can't wrap my head around how it would do that.


In Julia you can use https://juliaarrays.github.io/StructArrays.jl/stable/ which lets your code use the array of structs interface but internally it’s a structure of arrays


Nice, thanks!


Modern CPU caches are usually loaded in 64-byte units - much larger than 128 bits. I just ran some tests with a C program on an Intel I5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.

It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.

Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.

  [jim@mbp ~]$ sh -x x
  + cat x1.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x + p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m12.078s
  user 0m7.319s
  sys 0m4.363s
  + ./x1
  s=1808348672

  real 0m9.415s
  user 0m6.677s
  sys 0m2.685s
  + cat x2.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      x[i] = i;
      y[i] = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += x[i] + y[i];
    }
    printf("s=%d\n", s);
  }
  + cc -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m9.753s
  user 0m6.713s
  sys 0m2.967s
  + ./x2
  s=1808348672

  real 0m9.642s
  user 0m6.674s
  sys 0m2.902s
  + cat x3.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
    }
    for (i=0; i<NUM; i++) {
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x;
    }
    for (i=0; i<NUM; i++) {
      s += p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m13.844s
  user 0m11.095s
  sys 0m2.700s
  + ./x3
  s=1808348672

  real 0m13.686s
  user 0m11.038s
  sys 0m2.611s
  + cat x4.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++)
      x[i] = i;
    for (i=0; i<NUM; i++)
      y[i] = i;

    s=0;
    for (i=0; i<NUM; i++)
      s += x[i];
    for (i=0; i<NUM; i++)
      s += y[i];

    printf("s=%d\n", s);
  }
  + cc -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m13.530s
  user 0m10.851s
  sys 0m2.633s
  + ./x4
  s=1808348672

  real 0m13.489s
  user 0m10.856s
  sys 0m2.603s


It appears that you didn't enable optimizations. That's needed for SIMD, which can only be taken advantage of with contiguously packed data.


Results with -O are below.

x1 (AoS) vs x2 (SoA): no performance difference.

x3 (AoS, one member per loop): slower.

x4 (SoA, one array per loop): faster.

My advice is still not to assume that SoA is always faster than AoS without benchmarking.

  + cc -O -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m11.775s
  user 0m3.540s
  sys 0m6.592s
  + ./x1
  s=1808348672

  real 0m5.427s
  user 0m2.727s
  sys 0m2.682s
  + cc -O -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m5.185s
  user 0m2.296s
  sys 0m2.872s
  + ./x2
  s=1808348672

  real 0m5.140s
  user 0m2.273s
  sys 0m2.852s
  + cc -O -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m6.423s
  user 0m3.745s
  sys 0m2.660s
  + ./x3
  s=1808348672

  real 0m6.485s
  user 0m3.741s
  sys 0m2.714s
  + cc -O -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m4.875s
  user 0m2.205s
  sys 0m2.651s
  + ./x4
  s=1808348672

  real 0m4.894s
  user 0m2.189s
  sys 0m2.684s


The reason you're not seeing much difference is that your struct is very small, just 8 bytes. In the cases where you're using both x and y in the loop (x1 and x2), you can fit 8 of them in a cache line and you're not wasting space since you need both fields. In the case where you're only using one of the values (x3), you're wasting half of every cache line you load, and that shows in the benchmark. If you had a bigger struct, and/or one where you're not using all the members in the calculation, you'd see a much bigger difference in performance between SoA and AoS.


You need to add “-march=native” to turn on SIMD at the highest level your processor will support; otherwise most compilers will just use SSE2 by default (the lowest common denominator, as all x86-64 processors have it).


This is a great article but there's still a core problem there - why should developers have to choose between accessibility and performance?

So much scientific computing code suffers from core packages being split away from their core language - at what point do we stop and abandon python for languages which actually make sense? Obviously julia is the big example here, but its interest, development and ecosystem don't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x, what's stopping people from switching???


Today, there is a Python package for everything. The ecosystem is possibly best in class for having a library available that will do X. You cannot separate the language from the ecosystem. Being better, faster, and stronger means little if I have to write all of my own supporting libraries.

Also, few scientific programmers have any notion of what C or Fortran is under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets, which for the vast majority of researchers are not big data. If the one-time calculation takes 12 seconds instead of 0.1 seconds, that's not a problem worth optimizing.


>Today, there is a Python package for everything.

The same could be said about CPAN and NPM. Yet Perl is basically dead and JavaScript isn't used for any machine learning tasks as far as I'm aware. WebAssembly did help bring a niche array of audio and video codecs to the ecosystem[1][2], something I'm yet to see from Python.

I don't use Python, but with what little exposure I've had to it at work, its overall sluggish performance and need to set up a dozen virtualenvs -- only to dockerize everything in cursed ways when deploying -- makes me wonder how or why people bother with it at all beyond some 5-line script. Then again, Perl used to be THE glue language in the past and mod_perl was as big as FastAPI, and Perl users would also point out how CPAN was unparalleled in breadth and depth. I wonder if Python will follow a similar fate as Perl. One can hope :-)

[1] https://github.com/phoboslab/jsmpeg

[2] https://github.com/brion/ogv.js/


That’s a lot of opinions for so little exposure. There are a lot of uses that don’t involve docker or a dozen virtual envs.


Honestly, I use python everyday in the ML/AI space. If we're talking in that context they're pretty spot on about python, virtualenvs, and docker.


I've used Python quite a lot and their experience sounds about right.


To counter these anecdotes: I used python for building web APIs and only needed poetry to manage 1 virtual env, and containerizing with docker was straightforward.


> JavaScript isn't used for any machine learning tasks as far as I'm aware

https://github.com/facebookresearch/shumai


> WebAssembly did help bring a niche array of audio and video codecs to the ecosystem

Python already has all those: the ctypes module is just as hard to use as WebAssembly, with a much lower barrier-to-entry.


WebAssembly has the benefit of portability, though. Python on Windows... is still an open problem. Or even Python on dev-oriented distro vs. numsci-oriented distro.


This is how I got into software development.

During my PhD I was running some simulations using poorly written Python code. Initially it would take several hours. In that time I could go to the lab, run some wetlab experiments, and the results of my simulations would be there when I got back to the office. It was only by taking Python "home" and building some of my own projects that I learned how to 1. write more pythonic code and 2. write more performant code. Now I work for a software company.

If I'd stayed in academia I would probably still be writing quick and dirty code and not worrying about the runtime, because as a researcher there is always something else you can be doing.


You can have your cake and eat it with the likes of

* PythonCall.jl - https://github.com/cjdoris/PythonCall.jl

* NodeCall.jl - https://github.com/sunoru/NodeCall.j

* RCall.jl - https://github.com/JuliaInterop/RCall.jl

I tend to use Julia for most things and then just dip into another language’s ecosystem if I can’t find something to do the job and it’s too complex to build myself


* NodeCall.jl - https://github.com/sunoru/NodeCall.jl

// just fixed missing 'l' in link


Because professional software developers with a background in CS are a minority of people who program today. The learning curve of pointers, memory-allocation, binary operations, programming paradigms, O-Notation and other things you need to understand to efficiently code in something like C is a lot to ask of someone who is for example primarily a sociologist or biologist.

The use case, btw, is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is essentially a glorified calculator. Readability trumps efficiency by a large margin every time.


It also matters if you write code to run once or to serve in production, if it is experimental or stable.

If my script takes 3s to run and 5m to write in Python, vs 0.1s to run and 3h to write in C, I finish first with Python. I can try more ideas with Python.


tbf you don't need to go to C. You could write Common Lisp or OCaml, both academic high-level languages and very performant. Hell, SBCL can get you into C-range performance while you're writing dynamic, GCed code. Sure, it's a little more involved than learning Python, but not that much if you get 50x performance for free. The prevalence of Python is really baffling to me because compute resources cost money.


Academic for CS, maybe. Not so much for chemistry, biology, etc. If you work in computational areas of those subjects you are more likely to know Matlab, R, Python and maybe Julia.

I didn't know of anybody who had done any Lisp or OCaml in my time in academia (in chemistry, chemical engineering and biology departments), but that's just 3 universities, and I certainly didn't know everybody.


I've known people who write OCaml for biology, but they sure looked lonely :)


Not even readability. Academic code is mostly unreadable. If you need an example: IBM Qiskit.

Everything is just a proof of concept and no one expects anything more than that.


C is definitely not a good choice for this, I would hate to come back 2 days later to my computation and see “segfault” as the only output.


I'd be happy getting just a "segfault" from C. It would be so much better than subtly wrong results from reading uninitialized memory or out-of-bounds access, results that change depending on debug vs optimized build, or when changing some adjacent code.


Sure, a segfault was just more visual, that’s why I went with that.


They would have gotten the same performance in python with numpy if they did it like this instead of calling norm for every polygon

    centers = np.array([p.center for p in ps])
    norm(centers - point, axis=1)

They were just using numpy wrong. You can be slow in any language if you use the tools wrong


You made this assertion multiple times, but so far it’s been entirely unsupported in fact, despite TFA having made the entire code set available for you to test your hypothesis on.


On Google colab

    import numpy as np
    import time


    vals = np.random.randn(1000000, 2)
    point = np.array([.2, .3])
    s = time.time()
    for x in vals:
        np.linalg.norm(x - point) < 3
    a = time.time() - s

    s = time.time()
    np.linalg.norm(vals - point, axis=1) < 3
    b = time.time() - s

    print(a / b)
~296x faster, significantly faster than the solution in the article. And my assertion was supported by nearly 20 years of numpy being a leading tool in various quantitative fields. It’s not hard to imagine that a ubiquitous tool that’s been used and optimized for almost 20 years is actually pretty good if used properly.


Finally some dude knows how to use numpy properly. I wish I could upvote 5 times.

I basically raised the same question somewhere below and got downvoted LOL.


what is the difference?

Though I do feel like I see this a lot with these kinds of "we rewrote it in Rust and everything is fast" posts; comparing to a garbage-collected language is often the scenario.

On one hand, I feel like you should just learn how to use your stuff properly. On the other hand, it is interesting to see that people who can't write fast code or use libraries properly are actually writing fast code. Like fast code for the masses, almost, hah. Though maybe they'll just run into the same issue when they misuse a library in Rust.


The first issue I have with it is that they've now convinced a large portion of people that read this article that a very good tool is not as good as it actually is. This is a disservice to the great engineering that has gone into it.

The rest of my issue with it is hypothetical. I don't care what he does at work, but I would imagine if I was that dude's manager and he convinced me that he put in all this work and determined that the best path forward is to introduce a brand new language and tool chain into our environment to maintain (obviously not as big a deal if it was already well engrained in the team), and then I come to find out that he could have gotten even better results by changing a few lines with the existing tools, that I would have to reevaluate my view of said developer.


Everything. Why are there still COBOL programmers? Why is C++ still the de facto native language (also in research)?

but also I don't see any problem there, I think the python + c++/rust idiom is actually pretty nice. I have a billion libs to choose from on either side. Great usability on the py side, and unbeatable performance on the c++ side


One of Julia's Achilles heels is standalone, ahead-of-time compilation. Technically this is already possible [1], [2], but there are quite a few limitations when doing this (e.g. "Hello world" is 150 MB [6]) and it's not an easy or natural process.

The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.

Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.

[1]: https://julialang.github.io/PackageCompiler.jl/dev/

[2]: https://github.com/tshort/StaticCompiler.jl

[3]: https://discourse.julialang.org/t/ann-the-ion-command-line-f...

[4]: https://discourse.julialang.org/t/extremely-slow-execution-t...

[5]: https://discourse.julialang.org/t/extremely-slow-execution-t...

[6]: https://www.reddit.com/r/Julia/comments/ytegfk/size_of_a_hel...


Thank you for settling a question for me - I was looking at julia's aot compilation abilities last week and the situation seemed like kind of a hassle.


IME, having used Julia quite extensively in academia:

- the development experience is hampered by the slow start time;

- the ecosystem is quite brittle;

- the promised performance is quite hard to actually reach; profiling only gets you so far;

- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)

> what's stopping people from switching???

All of the above, plus inertia; perfect is the enemy of good enough; the alternatives are far from the python ecosystem & community; and performance is often not a showstopper.


I don't know whether this sentiment is just a byproduct of CS education, but for some reason people equate a programming language with the compute that goes on under the hood. Like if you write in Python, you are locked into the specific non optimized way of computing that Python does.

It's all machine code under the hood. Everything else on top is essentially a description of more and more complex patterns of that code. So it's a no-brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.


Julia doesn't get the latest models first, or have as big of a community.


A vectorized implementation of find_close_polygons wouldn't be very complex or hard to maintain at all, but the authors would also have to ditch their OOP class based design, and that's the real issue here. The object model doesn't lend itself to performant, vectorized numpy code.


What's a good guide to learn how to make (and see) vectorized code? It's a mindshift and not one I find easy.


I think a great start is to make arrays of similar data. Instead of an array of (x,y,z) use an array for x, an array for y and another one for z. If you then square these and sum them for example, the compiler might figure out good optimizations for it if you write it as a simple loop.

Also read about SIMD instructions like AVX2. They are often used under the hood when possible, but just knowing what they require can help "trigger" them, depending on which language you use. In C++, for example, the compiler really looks for opportunities to use those instructions. You can tell the compiler did it by checking the assembly for XMM or YMM registers (the names of the SIMD registers).
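In numpy terms (my own sketch), the "array per component" layout looks like this; numpy's C loops over contiguous arrays are exactly the pattern those SIMD units like:

    import numpy as np

    # one array per component instead of an array of (x, y, z) records
    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    z = np.random.rand(1_000_000)

    # squared length of every vector in one vectorized pass
    length_sq = x * x + y * y + z * z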


A more accurate keyword for googling is "SIMD": Single Instruction, Multiple Data.

Numpy's tutorial for broadcasting is also a good starting point.

https://numpy.org/doc/stable/user/basics.broadcasting.html
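A tiny example of the shape rules described there (my own, just to show the idea):

    import numpy as np

    a = np.arange(3).reshape(3, 1)   # shape (3, 1)
    b = np.arange(4).reshape(1, 4)   # shape (1, 4)

    # each array is stretched along its size-1 axis: the result has shape (3, 4),
    # equivalent to a double for loop computing a[i] * b[j] without writing one
    outer = a * b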


The gist of it is that you give numpy two arrays, and what operation to apply. Then numpy will figure out what the for loop(s) should look like depending on the shape of the arrays.

You can look at various tutorials to see how it works. For example: https://jakevdp.github.io/PythonDataScienceHandbook/02.05-co...


Exactly, that is the real issue, vectorization might be good enough in terms of performance. It doesn't seem to be mentioned in the article at all.


It might have been added later, but the author mentions vectorization in the beginning:

> It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable, and the gains are going to be limited (here’s a partially vertorized version, which is faster but far from the results we are going to achieve).


Semi Vectorized code:

https://github.com/ohadravid/poly-match/blob/main/poly_match...

Expecting Python engineers to be unable to read de facto standard numpy code, but expecting everyone to be able to read Rust...

Not to mention that the semi-vectorized code is still suboptimal. There are too many for loops, even though the author clearly knows they can all be vectorized.

For example instead the author can just write something like:

    np.argmin(distances[distances <= threshold])
Also, in one place there is:

    np.xxx( np.xxx, np.xxx + np.xxx)
You can just slap numexpr on top of it to compile this line on the fly.

https://github.com/pydata/numexpr
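i.e. something like this (illustrative, with placeholder arrays):

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    c = np.random.rand(1_000_000)

    plain = a * (b + c)                   # plain numpy: several temporaries
    fused = ne.evaluate("a * (b + c)")    # numexpr compiles this into one fused loop

    assert np.allclose(plain, fused)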


Author here:

For the original library we did all the numpy tricks we could think of, but we really needed to do this type of exhaustive search for some of the data.

If someone wants to open a PR with a "fully optimized" numpy code, that would be very cool just for comparison :)


Not super familiar with Python but isn't that append call within a loop going to cause a lot of allocations?


3000 calls to list.append() cost only 2ms. In a computationally intensive program, no one bothers, because usually one call to do a matrix multiplication already costs 500ms or so.

Of course you can preallocate for size=3000 and fill it in a loop, but this saves only 10ms. Too insignificant.


The most important part of the article seems to be that this Python code is taking "an avg of 293.41ms per iteration":

        def find_close_polygons(
            polygon_subset: List[Polygon], point: np.array, max_dist: float
        ) -> List[Polygon]:
            close_polygons = []
            for poly in polygon_subset:
                if np.linalg.norm(poly.center - point) < max_dist:
                    close_polygons.append(poly)

            return close_polygons
And after replacing it with this Rust code, it is taking "an avg of 23.44ms per iteration":

        use pyo3::prelude::*;
        use ndarray_linalg::Norm;
        use numpy::PyReadonlyArray1;

        #[pyfunction]
        fn find_close_polygons(
            py: Python<'_>,
            polygons: Vec<PyObject>,
            point: PyReadonlyArray1<f64>,
            max_dist: f64,
        ) -> PyResult<Vec<PyObject>> {
            let mut close_polygons = vec![];
            let point = point.as_array();
            for poly in polygons {
                let center = poly
                    .getattr(py, "center")?
                    .extract::<PyReadonlyArray1<f64>>(py)?
                    .as_array()
                    .to_owned();

                if (center - point).norm() < max_dist {
                    close_polygons.push(poly)
                }
            }

            Ok(close_polygons)
        }
Why is the Rust version 13x faster than the Python version?


Yeah but the Python code is so bad that it's easy to get a 10x speedup using only numpy, as well. The current code essentially does:

    import numpy as np
    from typing import List

    n_sides = 30
    n_polygons = 10000

    class Polygon:
        def __init__(self, x, y):
            self.x = x
            self.y = y
            self.center = np.array([self.x, self.y]).mean(axis=1)


    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        close_polygons = []
        for poly in polygon_subset:
            if np.linalg.norm(poly.center - point) < max_dist:
                close_polygons.append(poly)

        return close_polygons

    polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
    point = np.array([0, 0])
    max_dist = 0.5

    %timeit find_close_polygons(polygons, point, max_dist)
(I've made up number of sides and number of polygons to get to the same order of magnitude of runtime; also I've pre-computed centers, as they are cached anyway in their code), which on my machine takes about 40ms to run. If we just change the function to:

    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        centers = np.array([polygon.center for polygon in polygon_subset])
        mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
        return [
            polygon
            for polygon, is_pass in zip(polygon_subset, mask)
            if is_pass
        ]
then the same computation takes 4ms on my machine.

Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.

(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)


Python's for loop implementation is slow, also. You can use built-in utils like map(), which are "native" and can be a lot faster than a for loop with an append:

https://levelup.gitconnected.com/python-performance-showdown...


Nope. map() is the same speed as a for loop.

The benchmarking methodology in the link is not good. The author should use timeit() or cProfile or so; a 0.01s difference is mostly fluctuation. The order of execution also matters: say you want to test functions A and B, you actually need to run A, B, B, A to see whether the ordering makes a difference.


Yeah I guess this isn't true anymore, it looks like maybe it was true in 2.6 days.


I immediately verified both claims.

list(map(func, arr)) did bring a ~10% benefit if func is a builtin, e.g. int() or str().

But if func is tuple(), list(), set() or any kind of user-defined function, list(map()) is always slower.

You can try it yourself to see that list(map()) doesn't help:

    import numpy as np
    a = np.arange(100000 * 100).reshape(100000, 100)

    %timeit b1 = [np.sum(x) for x in a]
    %timeit b2 = list(map(np.sum, a))

    # repeat in the opposite order to rule out ordering effects
    import gc
    gc.collect()
    %timeit b2 = list(map(np.sum, a))
    %timeit b1 = [np.sum(x) for x in a]
I guess that's why I only use map() in the `list(map(itemgetter, arr))` case, because generally there is no benefit to using it.


thanks!


I don't think it's the loop implementation. The stuff in the loop should take multiple orders of magnitude more time than the loop itself:

    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)


I don't know if numpy fixed this, but it used to be that mixing Python numbers with numpy in a tight loop was horribly slow. Try hoisting max_dist out of the loop and replacing it with a max_dist_np that converts it to a numpy float once.
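Something like this (untested sketch, reusing the names from the loop quoted above):

    import numpy as np

    max_dist_np = np.float64(max_dist)   # convert the Python float once
    close_polygons = []
    for poly in polygon_subset:
        # the comparison now stays numpy-scalar vs numpy-scalar
        if np.linalg.norm(poly.center - point) < max_dist_np:
            close_polygons.append(poly)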


Speaking of this, I once found that

    for x in numpy.array: 
is 9X slower than

    for x in numpy.array.tolist():
in 2021.


It's not the looping itself that is slow in the article you linked, it's that every element is appended to the list. If you use a list comprehension it's even faster, and it still loops over all elements of the list.


Here is the disassembly of the listcomp `[x for x in range(5)]`:

    RESUME 0
    BUILD_LIST
    LOAD_FAST
    FOR_ITER 4
    STORE_FAST (x)
    LOAD_FAST (x)
    LIST_APPEND
    JUMP_BACKWARD 5
    RETURN_VALUE
As you can see from the third-to-last instruction, a listcomp does append individual elements to the list. What it doesn't need to do is call a method to do so (let alone look up the corresponding method).


No, AFAIK each for-loop iteration pushes and pops the stack in the interpreter, while map runs the loop entirely in the native implementation of the interpreter itself.


I was surprised that the Rust version is _only_ 13x as fast as the Python version.


Probably because it wasn't pure Python to start with.


Shouldn't `close_polygons` be presized in both Python and Rust to avoid repeated allocation and copy ?


One carries the entire feature set of the python runtime, the other is compiled.


The time is spent in this 3-line loop:

    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
I don't think the entire feature set of the Python runtime is involved in this.


Without using every feature you still have to conform to the complexity of the runtime. Every variable in that loop is a hash map lookup into the locals. `np.linalg.norm` is two field accesses, necessitating more hash map lookups on the module objects. `-` and `<` are attribute lookups as well as full function calls.


> Every variable in that loop is a hash map lookup into the locals.

No, it's a LOAD_FAST bytecode instruction. (The other stuff is mostly right, and probably contributes.)


The final code takes just 2.90ms per iteration.


The rest is not a fair comparison, because it rewrites the used libraries, not the application code.

You can always speed up an application if you rewrite the used libraries to match your specific use case.


It's a fair comparison if the purpose is to guide people in fixing performance issues in their python code.

"That Rust library will be faster than the corresponding python library" is a useful thing to know here.


Usually not by 10x though, unless the original implementation involved some really bad decisions.


The Rust code is still only brute force - using suitable persistent acceleration structures you can probably get a 10x again, or maybe even 100x; in 2D a k-d tree is really fast for nearest-neighbour queries.

So much faster that the allocations for the result will probably be the bottleneck.
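e.g. with scipy (a sketch; assumes the polygon centers are already in an (N, 2) array):

    import numpy as np
    from scipy.spatial import cKDTree

    centers = np.random.rand(100_000, 2)   # placeholder polygon centers
    point = np.array([0.2, 0.3])
    max_dist = 0.05

    tree = cKDTree(centers)                                # build once, reuse per query
    close_idx = tree.query_ball_point(point, r=max_dist)   # indices of centers within max_dist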


Making Python (near infinitely) faster by using it as a glue language, and running all the computation outside Python :-P


Yeah what's wrong with that? I think this sounds amazing. It gives you all the fast prototyping and simplicity of Python, but once you hit that bottleneck all you have to do is bring in a ringer to replace key components with a faster language.

No need to use Golang or Rust from the start, no need for those resources until you absolutely need the speed improvement. Sounds like a dream to a lot of people who find it much easier to develop in Python.


It sounds amazing, but bear in mind there is a lot of code which can't be sped up like this because:

- Some code doesn’t have obvious optimization hotspots, and is instead just generally slow everywhere.

- Most FFI boundaries incur their own performance cost. I’m not sure about Python, but I wouldn’t be surprised if FFI to rust in a hot loop is often slower than just writing the same code in Python directly. And it’s not always easy to refactor to avoid this.

- A lot of programs in languages like Python are slow because the working set size contains a lot of small objects, and the GC struggles. You can optimize code like this by moving large parts of the object graph into rust. But it can become a mess if the objects rust retains then need references to Python objects, in turn.

The optimization described in this blog post is the best case scenario for this sort of thing - the performance hotspot was clear, small, and CPU bound. When you can make optimizations like this you absolutely should. But your mileage may vary when you try this out on your own software.


> Most FFI boundaries incur their own performance cost. I’m not sure about Python, but I wouldn’t be surprised if FFI to rust in a hot loop is often slower than just writing the same code in Python directly. And it’s not always easy to refactor to avoid this.

They definitely do, but I’d usually suggest that if you find this an issue then perhaps the function you’re exposing from the compiled language should be higher level, with more work done in the compiled code to avoid the overhead of returning control back to the interpreted language.


Maybe. But that can also be a self tightening knot. Sometimes there’s no elegant place to cleave a program or library in two, and you really just want to pick a single language for the whole project.

Mixing languages can also be a bit of a disaster for maintainability. Refactoring codebases which meaningfully span multiple languages is miserable work.


I wish I had understood this in 2004 when I decided to go all-in with an interpreted language (in this case, Lua) for code that needs to make FFI calls in hot loops. Then again, I suppose my best alternative at the time would have been C++98 as compiled by Visual C++ 6. I'm glad we have much better options now.


Python is a rough language to be productive in. It's a great scratchpad, but dynamic typing, exceptions/poor error handling, and a horrifying deployment and dependency system make me reach for something like Go in any case where I need something to be even vaguely reliable.

The more ML I do, the more disappointed I get.


Ever tried gluing Go with either Python or JavaScript? I'm interested in learning what libraries there are to glue them and how complicated and slow they could be.


I've used gopy[0] recently to access a go library in Python. It surprisingly Just Worked, but I was disappointed by some performance issues, like converting lists to slices.

[0] https://github.com/go-python/gopy


The gulf between the world Python was written to address and the world we live in continues to grow. Python's native memory layout is actively hostile to any sort of performance on modern systems, so it's going to continue to have the problem that interacting with a faster runtime will require a ton of copying.

I've never been impressed by the claim that you can just drop in other faster languages into Python; the costs of communication are so staggeringly high that you have to write around it to have any gain, and by the time you've moved effectively your entire task into the underlying faster language, Python's gain is often quite minimal. NumPy works to some extent because it can fire off huge tasks from a single Python call, a lot of less numerical code can't do that, and I think the ML community still generally underestimates and fails to understand how much performance they leave on the floor in the Python portions of their code, at least based on my interaction with them.


To some degree, it's about familiarity with your tools. And different tools are optimized for different tasks.

Besides, aren't you deploying Docker containers, anyway?


I do, absolutely. But it seems like an exceptionally rare practice for most Python code, at least in the ML space, which puts me back at the start.


Many ML practitioners aren't software engineers. I don't expect that cohort (non-engineers) would manage a deployment well in any language.


That's just habitual. Anyone coming from Python could say the same things about another language. The typing never bothered me because you learn to work with it. But there is of course type hinting now, which I barely use.


I don't think so. I was a Python dev for years, and plenty of other languages don't have the issues Python does. It was a great learning language when I started, but I think most folks that have only used Python just don't realize what they are missing.


Can you give me an example from your recent memory?

I've been coding since the 90s, ASP, PHP, JS, C, Perl, I transitioned from Perl to Python back in 2012-2013. Dabbled in Go when it first came out because I was a Plan9 fanatic and recognized some of the source files, but never went further than tutorials.

Honestly I find very little wrong with the Python ecosystem, except the general insecurity of using package managers. But that applies to most package management, it's a social/infosec issue that Fedora has mitigated fairly well, if you want role models.

The languages that I found most annoying, as a user and developer, were C, Javascript, Typescript and Ruby.


I agree. Add in the fact that you will have to do awkward FFI to some faster language... It's better just to use that language in the first place.


Right? I mean, it's not like we haven't been doing this already. All the computationally intensive python libraries are just a convenient wrapper for C anyway, the only reason python can be used for ML.


Then you give up the benefits of using a managed language, and you now have to maintain two stacks.

IMO/IME much better to go with a language where you don't have this dichotomy in the first place - e.g. Java or C#.


Also, there is a faster Python which is also Python. And the author considered it as well (both PyPy and Numba), it's just in this particular scenario they were not the best way to go.


This is basically what Python was first designed for and as evidenced by the article still excels at


Well, that's exactly where Python works well. As scripting glue sitting between or above native code which does the heavy-lifting.


> and running all the computation outside Python

You don't have to move all the computation; just moving the `for` loops will help a lot.


I had a similar problem, when I was working as a PhD student a few years ago, where I needed to match the voxel representation of a 3D printer with the tetrahedral mesh of our rendering application.

My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.

Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.

Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). In contrast, the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.


Numba is great if you want to write a naive loop approach in python.


Long startup time is relative. I believe it's much lower now than a couple of versions ago. 0.15s or so? Interop between python and rust will also take time.


Julia 1.9 is fast. And you can use https://github.com/Suzhou-Tongyuan/jnumpy to write python extension in Julia now. So I think after 1.9 release julia would be much more usable.


This was a silly and unnecessary optimization. He’s just using numpy wrong.

Instead of:

    for p in ps:
        norm(p.center - point)

You should do:

    centers = np.array([p.center for p in ps])
    norm(centers - point, axis=1)

You’ll get your same speed up in 2 lines without introducing a new dependency


Isn't this the version referenced on the GitHub repo [0], which speeds things up 6x instead of 101x?

  There's also a "v1.5" version which is 6x faster, and uses "vectorizing" (doing more of the work directly in numpy). This version is much harder to optimize further.
[0] https://github.com/ohadravid/poly-match


No, their v1.5 is still calling norm on every polygon. They’re still using it wrong

On Google colab

    import numpy as np
    import time


    vals = np.random.randn(1000000, 2)
    point = np.array([.2, .3])
    s = time.time()
    for x in vals:
        np.linalg.norm(x - point) < 3
    a = time.time() - s

    s = time.time()
    np.linalg.norm(vals - point, axis=1) < 3
    b = time.time() - s

    print(a / b)
~296x faster, significantly faster than the solution in the article.


This is the major reason I don't really buy into things like JITs solving all performance problems (as long as you carefully write only and exactly the subset of the language they work well with) or NumPy not being affected by Python being slow. There's more code like this in the world than I think people realize.

Having to write in a subset of a language in order for it to perform decently is a big deal. Having no feedback given to the programmer when you deviate from the fast path makes it even harder to learn what the fast path is. The result is not that you get the ease of Python and the speed of C without having to understand much; the result is that you have to be a fairly deep expert in Python and understand the C bindings intimately and learn how to avoid doing what is the natural thing in Python, the Python covered in all the tutorials, or you end up writing your code to run at native Python speeds without even realizing it.

It's a feasible amount of knowledge to have, it's not like it's completely insane, but it's still rather a lot.

My career just brushed this world and I'm glad I bounced off of it. It would drive me insane to walk through this landmine field every day, and then worse, have to try to guide others through it all the while they are pointing at all the "common practices" that are also written by people utterly oblivious to all this.


This is a very valid point that I can’t disagree with. I’ve gone through the pain of learning that subset of the language decently well, but also been lucky enough to work at places that compensate very well for that knowledge.


This is nice. How would you go about it as a performance noob? I can't imagine there's a line in the docs saying "this is slow!".


LineProfiler is the best tool to learn how to write performant Python and do code optimization.

https://github.com/pyutils/line_profiler

You can literally see the hot spot of your code, then you can grind different algorithms or change the whole architecture to make it faster.

For example: replace short for loops with list comprehensions, vectorize all numpy operations (vectorizing only partially does not help), use `not any()` instead of `all()` for boolean checks, etc.

After doing this for a couple of weeks, you can basically recognize most bad code patterns at a glance.
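A minimal example of what that looks like (assuming line_profiler is installed; `profile` is injected by kernprof):

    # save as demo.py and run:  kernprof -l -v demo.py
    import numpy as np

    @profile
    def slow_sums(rows):
        out = []
        for row in rows:              # the per-line report shows the time spent here
            out.append(np.sum(row))
        return out

    slow_sums(np.arange(100 * 100).reshape(100, 100))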


You should take a look at Scalene - it's even better.

https://github.com/plasma-umass/scalene


Wow this looks sick! Great thanks!


So in addition to what akasaka said (another thumbs up for line profiler from me, great tool) this isn’t a problem with linalg.norm being slow. It’s plenty fast, but calling it thousands of separate times in a Python loop will be slow. This is more just about learning how to vectorize properly. If you’re working in numpy land and you’re calling a numpy function in a loop that’s iterating over more than a handful of items, chances are you’re not vectorizing properly


I realize I should be more specific about numpy usage.

If you see a piece of code like this, it's a telltale sign that the person has no idea what he/she is doing:

Bad Pattern:

    my_list = np.array(xxx)
    summed = []
    for row in my_list:
        summed.append(np.sum(row))
Worst Pattern:

    my_list = np.array(xxx)
    
    def get_summed(arr):
        return np.sum(arr)

    summed = []
    for i in range(len(my_list)):
        summed.append(
            get_summed(my_list[i])
        )
        
Preferred:

    # np.array(xxx) is redundant
    summed = np.sum(xxx, axis=1)
Same applies to all numpy operations.


I feel a lot of the "Python perf" thing is an inferiority complex. cPython is getting faster all the time, and (obviously using libraries like numpy and others that hook into compiled code) I don't think I've ever seen it become a business bottleneck. If it's ever a scaling problem, then you hire lower-level language devs, it's a good problem to have.

Python is much, much easier to learn, and Rust is notoriously difficult. This obviously feeds into the inferiority complex. But -- while I do know there's a number of applications where perf is crucial -- I think it's well worth doing an ego check before moving from what's a lower friction path and gives you access to myriad developers, including ones trained in very hard disciplines.

Also: does it ever make sense to write something like a CRUD+ backend in Rust? Maybe 100X Rustaceans can do it with one hand tied behind their backs; but imagine what these ubermenschen could be achieving in Python?


We're working as an internal dev team for a moderately large company and the challenge is that very few people have to maintain a ton of business automation. We love our Python+Django stack as it gives it a ton of productivity, but as our business and data grows, some things are becoming quite slow.

For now, we're getting away with Python-only optimizations and relying on cPython improvements. Still, some processes take a few minutes and a lot of that is spent just because Python is quite slow. In that sense, yes, it happens for CRUD+ applications (which I consider our CRM+ERP to be).


I think you might be surprised. I would say Rust isn't notoriously difficult per se, it's just harder to please the compiler. But that's still viewing the compiler as an adversary when it's more like an assistant, so the analogy breaks down. You don't have to be a 100x engineer to use Rust. In fact, quite the opposite. Rust gives engineers a lot more guardrails to prevent what would be runtime errors in other languages.


I think for a lot of people, like myself, writing Rust is just as easy as writing Python (actually it's way easier IMO). So when people write Python it's sort of like "uh, why would you choose the slower tool?".

Obviously the answer is "because I'm faster with Python" but then part of my brain is "well get good idk".


If writing Tcl extensions in C during the .com wave 23 years ago taught me something, it was that glue languages are great, and that I don't want to use any that doesn't come with either an AOT or a JIT compiler in the box, other than for OS scripting tasks.


> Also: does it ever make sense to write something like a CRUD+ backend in Rust?

Well... given 2 frameworks equally pleasant to work with, why not use the one with the best performance? (Whatever performance means... less CPU? less memory? better I/O?)

As a Django user, working on problems where Django shines, I have yet to see such a solution in Rust, but that doesn't mean it won't happen one day.


I use it for backends that do a lot of math because the python deployment story sucks, rust is faster anyways, and I love rust (do python and js by day). It’s been so easy to bring up rust containers on serverless and they just don’t break. The cold start times are also insanely good compared to python


Whatever performance means is really interesting. One of the few unquestionable benefits of cloud computing is that it is now easier than ever to measure the 'running' cost of a program / solution.

What Amazon charges for, say, a t2.nano must factor in the cost of the hardware, power, cooling, and system administration, so it is technically possible to evaluate the running cost of a piece of software.

If I, as the ruthless capitalist that I am pretending to be, made the running cost of the solution an integral part of annual remuneration, through bonus calculation say, there will be a tipping point when that cost has an impact on techology strategy.

I know it sounds remote but a similar scheme has/was implemented in some investment banks, where bonus calculations introduced a vesting schedule and factored in the back office cost of maintaining the trade over its lifetime rather than being based entirely on revenue.

Let's start with a totally reasonable calculation made by a non-technical user, with absolutely no understanding of the problem at hand - and by that I mean completely unfair and arbitrary, so pretty much the real world.

The TechEmpower multiple request benchmarks tell me that Rust is the performance king. Go offers about 2/3 that performance, and python 1/3, that bit doesn't sound too unreasonable, perhaps a little unkind to Go, but I digress; given these relative performance numbers, if the running cost bonus pool is $x, the maximum amount you will receive for a python based solution is 1/3x.

At what point does the financial impact influence our technology strategy in a way that perhaps our otherwise virtuous proselytizing about the need to reduce our carbon footprint doesn't?

FWIW I am continually surprised that Jon Blow, Casey Muratori, Mike Acton et al. aren't beating exactly this drum. It's not that their arguments about writing high performance software are not compelling in their own right, but appealing to people's otherwise orthogonal concerns / passions seems to be a logical next step.

Sorry to ramble on!!


I wonder why GraalVM is not more often used for these speed critical cases: https://www.graalvm.org/python/ (Same for ruby https://www.graalvm.org/ruby/)

Is the problem the Oracle involvement? Or is it not that fast as advertised or problems with the ecosystem (C libraries)?


My concern is that it’s not ready for prod yet.

“At this point, the Python runtime is made available for experimentation and curious end-users. “

https://www.graalvm.org/latest/reference-manual/python/


I would not use anything made by oracle if I have the choice. My team sees it similarly.


Better clean up the Linux kernel from Oracle contributions then, in case you are using it.

https://lwn.net/Articles/915435/


So the Linux kernel is an Oracle licensed product? Better uninstall Ubuntu then.


I am not the one having issues with Oracle.

In fact, I am thankful that they at least avoided Java being stuck in Java 6, and MaximeVM turned into GraalVM, when no one else cared to save Sun from insolvency. Only IBM made a candidate offer that was quickly withdrawn.

So people should stop acting as if there was any magic way to rescue Sun assets.

Also the fact that they were one of the first GNU/Linux supporters in an enterprise context, which allowed us to actually have a couple of GNU/Linux computers among our AIX, HP-UX and Solaris servers, happily running Oracle instances.


Language design request:

Take from Rust: algebraic types, ahead-of-time compilation and strong types, functional features, and a borrow/escape checker that automatically turns shared data into Rc or Arc as necessary, instead of tormenting me into rewriting performance-irrelevant code;

Take from Python: the simple syntax, default pass-by-reference of all non-numeric types, simplified string handling, unified slice and array syntax - and any other simplifying feature possible.

... and give me a fast, safe and powerful language that gets out of my way while maximizing the power of the compiler to prevent bugs. Golang was a commendable attempt, but they made '70s design decisions that condemned the language: mandatory garbage collection, nullables, (void*) masquerading as interface{} casts, under-powered compiler etc.


I think Nim is closest to this wishlist


Pretty much. ARC is RC without needing to write it manually. It can be as compact as Python and near C++/Rust speeds with some optimization. Then add in macros for real performance tricks like SIMD. Julia's pretty nice as well.


Modern Java and modern C# should be closer to what you described.

On a side note, python is strictly pass by value. For non-primitives, their references are passed by value.


Sounds like swift to me


This is possibly one of the best written articles end-to-end I have read. Excellent job telling the story


Agreed it was well written, but kinda pointless, since they could have "solved" the problem using the existing tools in a couple lines of code without any new deps. All that content and profiling, and they missed the fact that they were using numpy wrong.


Totally out of curiosity, could you be a bit more concrete in your posted example? I can't get it to work (I'm inexperienced with numpy and I'm missing something when translating your quick example to python).

Thanks!


Yea sorry that was just pseudo code. You want it in this form:

    centers = np.array([
        [3, 3],
        [4, 4]
    ])
    point = np.array([3.5, 3.5])
    vals = centers - point
    np.linalg.norm(vals, axis=1)


Elixir is made faster with Rust, too. Rust is a great skillset to have for those measured moments.


I wonder if being able to quickly retrieve a numpy array of the polygon centers would make an equivalent difference. If you could retrieve the centers from the polygons as an array, you could just use numpy operations for the closest-polygon search:

    centers = get_centers(polygons)  # M x 3 array
    close_idx = np.where(np.linalg.norm(centers - point, axis=1) < max_dist)[0]
    close_polygons = polygons[close_idx, :]

That's one reason I prefer to use arrays for polygons, rather than abstracting them into a Python object. Fundamentally geometries are sequences of points, and with some zero-padding to account for irregular point counts, you can still keep them in a nice, efficient array representation.


Agreed, I speed up Python numpy code with numba quite often and it isn’t at all unreadable to put it in an ndarray subclass.

    poly = Polygon(vertices)
I would bet you can achieve just as much of a speedup with numba or Cython using this form.
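Roughly like this, I'd guess (my sketch, operating on a plain array of centers rather than the article's Polygon objects):

    import numpy as np
    from numba import njit

    @njit
    def close_mask(centers, point, max_dist):
        # plain loop, compiled to machine code by numba on first call
        n = centers.shape[0]
        out = np.empty(n, dtype=np.bool_)
        for i in range(n):
            dx = centers[i, 0] - point[0]
            dy = centers[i, 1] - point[1]
            out[i] = dx * dx + dy * dy < max_dist * max_dist
        return out

    centers = np.random.rand(100_000, 2)
    mask = close_mask(centers, np.array([0.2, 0.3]), 0.5)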


in the domain of Python tooling made faster with Rust, check out https://github.com/charliermarsh/ruff which is 10-100x faster than pylint etc


I'd like to see a comparison to when they use a spatial index like an STRtree or R-tree.


> Python is a superb API for researchers,

Where does this nonsense come from? No. It isn't. It's just a stupid fashion. Something that should be discouraged, not encouraged by trying to make it work when it's obviously broken.

As someone who does work with researchers who do use Python a lot, I see the everyday painful experiences of people who use it. And this pain doesn't need to be there. It's just masochism. And the only real reason is that they don't know any better. The only other thing they know is Matlab, and that's even worse.

Python is just a bad language. Popular, but awful. Ironically, while researchers are supposed to be on the forefront of discovery and technology... well, they aren't. Industry outpaced research. So much so that today there are government programs to onboard researchers into a more automated and more automatically verified way to do research. And we aren't talking about making an elite force here. These programs are meant for people in research who copy data from Excel sheets one data point at a time into another spreadsheet. It's that kind of bad.

My wife happened to work in such a government center, and that's how I know about what's going on inside these programs. And it's very sad that decisions about the preferred tools for research automation are made by people who, unlike most of their peers, had some exposure to what happens in the industry, but had no deeper understanding of the reasons any particular technology ended up in any particular niche, nor any independent ability to assess the capabilities of any particular tech. It's really sad what's going on there.


>Python is just a bad language. Popular, but awful.

You couldn't be more wrong. Python is the language going forward. There is a reason why bleeding-edge ML work is done through Python, why it is the backend of several very popular web platforms, and why it is the second most used language on GitHub behind JS (and that only because JS is hard-tied to the web).

I have a feeling the hate for Python just comes from paradigms taught in extremely poor CS curricula in schools. If you think that Python is bad because of dynamic typing, you haven't been paying attention to the direction compute is going.


No, you couldn't be more wrong. Nice argument you dug up there!

Language going forward... I haven't had such a laugh in a while... It was dead in the 70s, long before it was born. Are you under the influence or something?

> There is a reason why bleeding edge ML stuff is done through Python

No, there isn't, and bleeding-edge ML isn't done in Python. It's done mostly in C++ and, more rarely, in C. Those are also garbage languages, but that's just the reality we live in. Python is just a tiny fraction of what's being done, the tip of the tip of the iceberg.

And no, Python is not the backend of popular platforms; it is, again, a tiny little bit of what's going on in those "popular platforms". And, unlike you, I actually worked for those "popular platforms", so I would know how that is.

But even imagine that, in your fairy-tale world, Python is somehow super-important and successful. Didn't I already explain how that is possible while it is still a garbage language? I bet I did. You just rushed to spit your despair and frustration at me because I offended something that you like... well, that you think you like, but really, you just don't know any better, just like the other people using Python :( And that's really the sad part. You lack the critical thinking to be able to tell the quality of your tools. You substituted the ability to judge the quality of something with looking at what the majority is doing and following the herd.

And I wasn't a student when Python started being taught in school. In my day the language of choice for intro to CS was Visual Basic :| It's also garbage... and the whole idea that intro to CS has to be done by learning a random fashionable language of the day is just dumb. It's a misunderstanding of SICP: to someone who didn't understand what the book is about, it looks as if the authors tried to teach the students Scheme, and later intro-to-CS classes were modeled on this book but replaced Scheme with another language, completely forgetting that Scheme was used in the first place as a language with "little syntax", to avoid spending time on learning the language itself.

You are again confused when you use the expression "dynamic typing"; you don't even know what that is. But you will never accept it because, again, you don't have the ability to think critically, you follow the herd. You read a "definition" somewhere that says that Python is a "dynamically typed language", and you believed this nonsense, even though you never really thought about what that might mean...

One thing we agree on, though: the direction compute is going is to utilize more low-skill programmers, and that means, besides other things, following fads more than doing any sensible work. Python is the current fad, and so, for a while, we are stuck with this garbage. And, the way things are going, at best I can hope for another equally idiotic language to come and replace it. But probably not very soon.


What are the best alternatives to Python then?


For research? -- Julia seems to be definitely better. It's purpose-built for doing just that. R would also be there. If you want general statistics, then add J to the fold.

But specific fields often have their own, bespoke solutions. I've only ever dealt with math, but it has plenty of its own niche languages that are much better than, say, Sage. My personal choice was Maxima, but that's because I like Common Lisp.

Furthermore, it's just the situation today. It doesn't mean that this is what it has to be tomorrow. Any of the languages I used in this domain have their issues, and could still be improved. We are nowhere near a place where it's hard to imagine something better than what we have. So, I'd say, if you really want a very good language, you might as well start building one now -- you have a very good chance yours will be the best one so far.


> For research? -- Julia seems to be definitely better. It's purpose-built for doing just that.

What does this even mean lol, “research” is incredibly broad


I'll elaborate. When I use the word "research" I mean activity performed by accredited researchers. Researcher is a rank, or a title if you want. It's similar to "software engineer" -- if you are in an academic institution the word "researcher" could be part of your title / job description.

The activity has as its goal to publish the findings. Findings are expected to be meaningful and to be found in the domain of sciences. It is also possible to extend the scope to arts, but I'm hesitant about it, and the way I used the word doesn't truly relate to what happens in arts department (for this purpose, math is an art).

Accredited, here means association with academic institution by means of employment or similar.

On a conceptual level, all discoveries in the sciences must have a hypothesis whose validity and truthfulness is to be established through empirical evidence, i.e. experiment. To interpret experiments one needs to use statistics. This is the vehicle driving experiments, and this is where the kind of programming I'm talking about comes in.

Note that researchers might use programming for other activities, such as keeping their diaries or automating their correspondence and so on. But I meant specifically the statistical aspect of doing research.


I think what OP means is that Julia has a number of features that work very well for the workflows and processes of programming for scientists.

1. Interactive workflows. One of the defining features of doing science is that you don't know what the right answer or right approach is. This makes interactive workflows (like REPLs) really valuable since you can load data once and do 100 different analyses on it. Notebooks are also really useful as a means of showing both code and results at the same time, and Pluto.jl is one of the best here since it removes the possibility of ending up with inconsistent state by tracking dependencies between cells.

2. Reproducibility. Another important feature for scientific code is that you want someone to be able to easily take your data and code, install it, run it, and get the same answer. This is one of Python's biggest shortcomings. Python has an incredibly rich package ecosystem, but it lacks a good unified system for reproducibly installing packages (Poetry is the closest, but it has problems with binary dependencies). Julia (and Rust) have virtual environments and the idea of a manifest file that records the exact version of all your (transitive) package dependencies built in, which makes it trivial for someone else, with no instructions, to get an exact clone of all the software needed to run your code.

3. Ease of use. Most scientists are scientists first and bad to mediocre at programming (there are obviously exceptions; some scientists are great programmers). Static type systems and manual memory management are major impediments for beginners. C++ gets some scientific use for its performance, but there's a reason Python, R, and Matlab are the languages of most scientists.

4. Performance. Lots of fields (bio, astronomy, high energy physics, chemistry) need a fast language to be able to get results in a reasonable amount of time. Julia is fast and is one of the easier languages to write GPU accelerated code in.

5. Open source. Closed source languages (Matlab) are a total pain to deal with.


>R would also be there.

Surely you jest. Whatever problems Python may have with proper development or deployment practices, R is ten-fold worse. R is, and will forever be, an interactive make-it-work-now language first and a production backbone second. The language is far too accommodating and will silently commit all manner of sins to keep the program running. Package management is still a huge headache, as proper isolation is still not well addressed. Too many R packages assume they have root during installation and can do whatever they wish. Volumes can be written about R namespacing.


> R is and will forever be an interactive

I don't see this as a problem for researchers. Quite the contrary. To compare it to Python: Python's interactivity sucks. Its syntax makes efficient interactive use hard, and the debugger is abysmal. And considering that the typical environment in which Python is used in a research setting is a notebook, add to that the even worse debugger wrapper available in notebooks.

Python is garbage for production systems too. If you want to use the results of your research for practical stuff, you will not use Python code you wrote for research. I'm more familiar with the world of medical research, and can confidently say that I've never seen a practical medical device or software product that used Python. Medical equipment typically wants to be realtime, which is a world completely closed to Python, for example.

However, Python (or R) being bad for production systems is perceived to be an acceptable price... I wish it wasn't, but that's how things are.

You cannot unironically claim that Python has superior package management... It's the shittiest ever. I have not seen a language which has done it worse than Python, and I've worked with at least two or three dozen of them. In my day job, I'm in the ops / infra department. I maintain a lot of Python packages, both for commercial entities and for open-source independent developers. Part of me is very upset that Python is such a shit-show when it comes to packaging, but another part of me is happy, because it means that I will have a job dealing with the fallout of Python packaging for a while.

Bottom line, if I had to support a bunch of scientists doing their research, and I had to deal with packaging their stuff, I'll take R over Python in a heartbeat.


Rust is verbose. I would use https://github.com/Suzhou-Tongyuan/jnumpy to write Python extensions in Julia and usually get similar performance.


Does it compile AOT, or are you stuck with Julia's startup time?


One of the other understated pros of rewriting some parts in Rust is that you can parallelize easily with Rayon[0]

[0] -- https://github.com/rayon-rs/rayon


Rust is great but isn’t the core problem here using the wrong algorithm? It looks like this is ideally suited for a quad tree instead of a naive for loop. I would expect that to pulverise any current benchmark.


I guess you also need to take into account the time to create the quad tree from an unsorted 'polygon soup' first, and in terms of coding effort, a brute-force conversion from Python to a compiled tight loop over unsorted arrays provides a lot of bang for the buck (and a speedup of 100x for relatively little effort might be 'good enough' for quite a while, until the input data grows big enough to require the next optimization effort).

(e.g. it doesn't need to be "as fast as possible", just fast enough to no longer be a workflow bottleneck)


I'm assuming the polygons don't change too often, so you can amortize the construction of the quadtree. Depending on your world view the implementation is trivial, since Shapely, a dependency they likely already have, has an implementation of one.
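
If it helps, Shapely's built-in index is an STRtree (an R-tree variant rather than a quadtree); a rough sketch, assuming Shapely 2.x where query() returns indices into the input geometries (shapely_polygons, x, y and max_dist are placeholders here):

    from shapely.geometry import Point
    from shapely.strtree import STRtree

    tree = STRtree(shapely_polygons)             # built once, amortized over many queries
    probe = Point(x, y).buffer(max_dist)         # bounding-box pass only
    candidates = [shapely_polygons[i] for i in tree.query(probe)]
    # still needs an exact distance check on the (hopefully few) candidates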


Author here: actually, in this analogy (as this is just a demo library), the polygons change each time so we couldn't use this type of optimization (at least not in a straightforward way).


Why do I need to rewrite in Rust, when I can just use Polars[1] that will cover most usecases [1] https://www.pola.rs/


What is the relation? Polars replaces pandas, not numpy.


Polars and DuckDB have replaced Pandas for me for the most part.


> The library was already using numpy for a lot of its calculations, so why should we expect Rust to be better?

I literally clicked in to read the article to see if they'd mention this:) But... unless I missed it, there wasn't really an answer? I thought numpy does do the heavy lifting in native code, so why is this faster? Does this version just push more of the logic into native code than numpy did?


Numpy is fast when the code is vectorized. The code they are benchmarking against was not vectorized. They wanted to calculate the distances of n points against a given point and find out which points are closer than a threshold (max_dist). Instead of vectorizing the whole operation, the Python code was just calling numpy in a loop to find the distance between two points.

Just that small change already gives 10x faster performance without ever leaving python/numpy land.
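
Roughly the difference, sketched here with centers as an (N, 2) array of the polygon centers (not the article's exact code):

    import numpy as np

    # per-polygon loop: one tiny numpy call per element, dominated by Python overhead
    dists = [np.linalg.norm(c - point) for c in centers]

    # vectorized: a single call over the whole array
    dists = np.linalg.norm(centers - point, axis=1)
    close_idx = np.flatnonzero(dists < max_dist)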


> They wanted to calculate the distances of n points against a given point and find out which points are closer than a threshold (max_dist).

Scipy has already implemented such a thing, and so has scikit-learn, since k-nearest-neighbor queries do exactly this kind of work.
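
For example with scikit-learn (a sketch; centers as an (N, 2) array and point as a length-2 array are assumed, as in the other comments):

    from sklearn.neighbors import NearestNeighbors

    nn = NearestNeighbors(radius=max_dist).fit(centers)
    dists, idx = nn.radius_neighbors([point])
    # idx[0] holds the indices of all centers within max_dist of point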


They did have a choice quote a bit before the blurb you quoted. The real-world problem is sufficiently more complex/different that vectorizing it would be a pain.

>It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable...


Thanks, I managed to miss that


The article mentioned that there are some gains from using vectorised numpy methods, i.e. spending more time in numpy code. I would be interested in whether the List[Polygon] could be converted into two long arrays of all xs and all ys, with indices into the starts (essentially a dense representation of a sparse array), and then the core function rewritten for Numba, since it would then not need any Python objects. This would break the interface of course, but may get within striking distance of the Rust implementation.
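
Something like this, as a sketch of that packed layout (xs/ys hold all vertices concatenated, starts holds per-polygon offsets; the names are mine, not from the article):

    import numpy as np
    from numba import njit

    @njit
    def polygon_centers(xs, ys, starts):
        # xs, ys: (total_vertices,) float64; starts: (n_polygons + 1,) int64 offsets
        n = starts.shape[0] - 1
        centers = np.empty((n, 2))
        for i in range(n):
            lo, hi = starts[i], starts[i + 1]
            centers[i, 0] = xs[lo:hi].mean()
            centers[i, 1] = ys[lo:hi].mean()
        return centers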


The slowness comes from the interaction of numpy and a Python object, "Polygon", which is not numpy. I suspect that a sufficiently clever coder could have optimized the result without resorting to Rust, but at the cost of a substantial increase in the complexity of the codebase. The proposed approach keeps the Python code simple (and moves the complexity into having another language to deal with).


    diff --git a/poly_match_v1.py b/poly_match_v1.py
    index 675c88a..4293a46 100644
    --- a/poly_match_v1.py
    +++ b/poly_match_v1.py
    @@ -1,4 +1,5 @@
     from functools import cached_property
    +from itertools import compress
     from typing import List, Tuple
     import numpy as np
     from dataclasses import dataclass
    @@ -56,11 +57,8 @@ def generate_example() -> Tuple[List[Polygon], List[np.array]]:
     def find_close_polygons(
         polygon_subset: List[Polygon], point: np.array, max_dist: float
     ) -> List[Polygon]:
    -    close_polygons = []
    -    for poly in polygon_subset:
    -        if np.linalg.norm(poly.center - point) < max_dist:
    -            close_polygons.append(poly)
    -
    +    dist = np.linalg.norm([poly.center for poly in polygon_subset] - point, axis=1)
    +    close_polygons = list(compress(polygon_subset, dist < max_dist))
         return close_polygons

10x faster than the original, without resorting to native code, and without any substantial increase in the complexity of the code base.


I wish the author would take your suggestion (and the other recommendations here) and try it out again. It would be a very interesting follow-up to read: "How we sped up Python and simplified our codebase by removing one more dependency".


Focusing on speeding up find_close_polygons instead of realizing that you're matching many points against the same set of polygons is also unfortunate, since that function being slow is a red herring. You can create a scipy.spatial.KDTree for example and just query all the points against that.
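
A sketch of that approach (centers as an (N, 2) array of polygon centers and points as an (M, 2) array of query points are assumed):

    from scipy.spatial import KDTree

    tree = KDTree(centers)                                     # built once
    idx_per_point = tree.query_ball_point(points, r=max_dist)
    # idx_per_point[i] lists the polygons whose center is within max_dist of points[i]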


People can write slow code in any language ;-) We've had to fire contractors who "wrote" dataframe code that was not vectorized in practice, despite repeated requests to do so. Same thing for slow code in CUDA.

From a maintenance view, I much prefer folks write vectorized data frames vs numpy or low level bindings, but that comes from having lived with the alternative for a lot longer. All of our exceptions are pretty much slotted for deletion. (Our fast path is single or multi-GPU dataframes in python.) Here's to hoping that one day we'll have dependent types in mypy!


> People can write slow code in any language ;-)

That was a lot of my curiosity, because I'm not quite a good enough programmer to know whether the problem is just that the original code is bad or does something super inefficient, such that rewriting it in any language would let you make massive improvements.


I am really curious whether there is an important reason for not trying this performance improvement with Cython first. Can someone comfortable with Cython explain the pros and cons of doing this optimization with Cython?


Python and Rust. Apple has some job openings in distributed systems for people fluent in both.


Is Apple still hybrid only, no remote? Any links to some of their careers pages on this?

Piqued my curiosity as my background almost lines up, and I'm interested in that kind of role... But I live near one of their offices only hiring for skill sets I don't really have.


I'm wondering how much would be the speedup by rewriting the critical part in cython


It's not really Python that was sped up though, it was an application written in python augmented with a bit of optimized rust code.

This sort of hybrid is super common, you typically spend 90%+ of your time in computationally intensive problems in a very small subset of your code, typically the innermost loops. Optimizing those will have very good pay-off.

Traditionally we'd do this with the high-level stuff in one language and assembly for the performance-critical parts; these days it is more likely a combination of a scripting language for the high-level part and a compiled language for the low-level parts (C, Rust, whatever). Java and the like are less suitable for this kind of optimization, both because they come with a huge runtime and because they are hard to interface to other languages unless those happen to use the same underlying VM, in which case there usually isn't much performance gain anyway.

Another nice way to optimize computationally intensive code is by finding out if the code is suitable for adaptation to the GPU, which can give you many orders of magnitude speed improvement in some cases.
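
For numpy-style array code, a GPU port can be as mechanical as swapping the array module; a sketch using CuPy, reusing the thread's centers/point/max_dist names and assuming the arrays are large enough that the host-to-device copies are worth it:

    import cupy as cp

    centers_gpu = cp.asarray(centers)                       # copy the (N, 2) centers to the GPU once
    dists = cp.linalg.norm(centers_gpu - cp.asarray(point), axis=1)
    close_idx = cp.flatnonzero(dists < max_dist).get()      # bring the result back as a numpy array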


Very cool! I can see myself using this soon actually :) On top of the "code speed up", this is a good problem for 2D data structures that perform this type of "find objects within radius" query.


It is an alternative framing of the N+1 problem (mistake) SQL users make

https://news.ycombinator.com/item?id=34207974


>Also, using any JIT-based tricks (PyPy / numba) results in very small gains (as we will measure, just to make sure).

I wasn't able to see the numba comparison. Anyone know how much worse it was?


Kudos - this is a very nice blog post - "real" engineering ;)


For this particular code, vectorization and some acceleration library, such as JAX, may be a better path to optimization. Otherwise an excellent article!
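
A minimal sketch of what that might look like with JAX (centers and point as arrays, as in the other comments; names are assumptions, not the article's code):

    import jax
    import jax.numpy as jnp

    @jax.jit
    def close_mask(centers, point, max_dist):
        # one fused, compiled kernel instead of a Python-level loop
        return jnp.linalg.norm(centers - point, axis=1) < max_dist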


It's much easier and more accurate to time your python scripts with `time python script.py`. Cool write up.


`time` is absolutely awful, with the minor exception of BSD's time maybe.

If you’re going to benchmark scripts or executables, use hyperfine.


Very nice. Thanks for sharing.

https://github.com/sharkdp/hyperfine


The premise was to not rewrite everything in rust, but you basically ended up rewriting 90% of it in rust


90% of the bottleneck, not 90% of their whole application. The author says that rewriting everything in Rust would have taken months, so the whole application must be huge.

"It is big and complex and very business critical and highly algorithmic, so that would take ~months of work, ..."


OP fully rewrote the example program in Rust, also moving the data structures there. That would mean any interaction with these ndarrays is possible only on the Rust side, hence any other code that uses them must be rewritten, unless there is some way to expose the Rust ndarrays as Python numpy ndarrays.


Yes, what did you expect? That he shares his internal code base with the world just to silence people who can’t generalize?


No? In fact, if you understood what I wrote: generalizing from this small piece of code, most of the codebase would also have to be partially rewritten in Rust to interoperate with the new data structures moved to Rust. So the "less than 100 lines of code" refers to just this simple example program, which was fully rewritten in Rust; hence, generalizing, the premise was pointless.


Using PyPy, which is a real compiler, might help.

That's doing spatial data processing by exhaustive search, which is inherently slow. There are better algorithms. If the number of items to be searched is large, the spatial indices of MySQL could help.


They tried PyPy at the beginning and it was 2x slower. Plausibly it would be better with additional optimization, but it's not cut and dried.


PyPy is a JIT compiler, not a "real compiler"; it requires warm-up time before it starts optimizing code at runtime.


It does generate machine code, which CPython does not.


In other words, PyPy is not an AOT compiler.


Did you read the article?


The original code already looks like crap, since it builds a new list of objects instead of a mask or something similar. That could also be done with a list comprehension, and it can certainly be vectorized or parallelized.

You need a more experienced Python engineer.


> Rust (with the help of pyo3) unlocks true native performance for everyday Python code, with minimal compromises.

Hasn't he heard of ctypes? You can wrap C structs as Python objects, and have been able to since forever.
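
A minimal ctypes sketch (libgeom.so and its dist() function are hypothetical, just to show the struct wrapping):

    import ctypes

    class Point(ctypes.Structure):
        # mirrors a C-side `struct point { double x, y; }`
        _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]

    lib = ctypes.CDLL("./libgeom.so")
    lib.dist.argtypes = [Point, Point]
    lib.dist.restype = ctypes.c_double

    d = lib.dist(Point(0.0, 0.0), Point(3.0, 4.0))   # -> 5.0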


There’s probably a way to get at the numpy objects without having to go through the python runtime and do all the computation in pure C.

I assume, haven’t really messed with numpy for anything but I can’t imagine it wouldn’t work that way.


Title is terrible. Better: Python code made 100x faster by rewriting a small part in Rust


We've taken the magic numbers out of the title now, as the site guidelines ask - https://news.ycombinator.com/newsguidelines.html



