Nuitka: An extremely compatible Python compiler (nuitka.net)
223 points by BiteCode_dev on Sept 2, 2021 | 84 comments


Nuitka is a wonderful project which in my opinion doesn't get enough attention.

I first found it back in 2015 when I worked at a company where we built a Python-based desktop application as part of our industrial control system. Nuitka provided better performance than pyinstaller/cx_freeze while still being simple to work with. Back then there were still a few incompatibilities, but I've followed the project throughout the years and it has matured like fine wine since.


Some of the past discussions:

* https://news.ycombinator.com/item?id=8771925 (2014, 135 comments)

* https://news.ycombinator.com/item?id=10994267 (2016, 52 comments)

* https://news.ycombinator.com/item?id=15354613 (2017, 60 comments)


Nuitka looks like a traditional ahead-of-time compiler using SSA form[0] and is written in Python. I’d be interested in seeing performance comparisons with PyPy, which uses the second Futamura projection and is written in a dialect of Python.

[0]: https://nuitka.net/doc/developer-manual.html#ssa-form-for-nu...


It's slower than PyPy, but starts faster and results in a standalone executable. Also, it's compatible with most C extensions.



Thanks for the link. I don't know if you're affiliated, but there's a lot of link rot among the external links there.



This project is super impressive. It makes me wonder why there's not a similarly mature JavaScript AOT compiler implementation for use in game development or other places where you want performance better than a VM and you're not allowed to JIT.

A small recommendation for these release pages would be to link to the tag in whatever source host you're using! I just want to see the code.


JavaScript is slow mainly because it is inherently so dynamic — all function calls are "virtual," properties are string lookups, and you need a lot of indirection (boxing) to store objects. Without actually running the program on some input, it is infeasible to determine what those indirections will resolve to. So static compilation is useless (or would produce instructions very similar to what an interpreter would execute). That's why we usually interpret the code (and JIT it to optimize places where the program is not actually dynamic).

Theoretically you could "precompile" JavaScript by running your code on some known input and caching the VM's state. Then you can "hydrate" your VM with that state. But this would only speed up the VM warmup (which is not that long) and requires significant browser/VM buy-in, so it is probably infeasible.

The only way to get better performance out of JavaScript is to restrict it to a less-dynamic subset (e.g. objects must have static keys) and add at least some static typing, so that a compiler can resolve some references ahead of time. That's why asm.js was a thing. But asm.js has been superseded by WASM, which is better in most regards.
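
Python has a loose analogue of the "static keys" restriction, for illustration (this is about object layout, not a compiler): with __slots__ the attribute set is fixed up front, so the runtime can use fixed offsets instead of per-instance dict lookups.

    class Point:
        __slots__ = ("x", "y")  # static keys: no __dict__, no dynamic attributes

        def __init__(self, x: float, y: float) -> None:
            self.x = x
            self.y = y

    p = Point(1.0, 2.0)
    p.x = 3.0    # fine: resolves to a fixed slot
    # p.z = 4.0  # AttributeError: the "shape" is closed to new keys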


Have you seen the work Fastly discussed about their AOT compiler? Seems like an interesting approach that maybe could be used to preprocess code into a more efficient form before running it in the cloud. The challenge is that JS engines are primarily focused on the browser and such an optimization opportunity isn’t interesting there - the majority of development is done by browser vendors.


There is, for example https://bellard.org/quickjs/


That's an interpreter written in C and without a JIT. Not the same as an AOT compiler.

https://github.com/bellard/quickjs/blob/master/quickjs-opcod...

The point isn't just not having a JIT. You can run V8 without a JIT. The point is good performance without a JIT.


"Can compile Javascript sources to executables with no external dependency"


Which probably works by stapling bytecode to a precompiled interpreter. A tried and true strategy, but not necessarily what people want when asking for AOT compilation.


It's only a few steps away from what Graal/Truffle do: unroll the interpreter loop over the bytecode, then keep re-running dead code elimination and inlining passes - you end up with something very close to optimized native output.
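
To make that concrete, here's a minimal Python sketch of the idea (hypothetical bytecode and names, not Graal's actual machinery): specialize the generic dispatch loop for one fixed program, and inlining plus dead code elimination collapse it.

    def interpret(code, x):
        # generic interpreter: one dynamic dispatch per instruction
        stack = []
        for op, arg in code:
            if op == "push":
                stack.append(arg)
            elif op == "load":
                stack.append(x)
            elif op == "add":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "mul":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
        return stack.pop()

    PROGRAM = [("load", None), ("push", 2), ("mul", None),
               ("push", 1), ("add", None)]

    def specialized(x):
        # what unrolling interpret() over PROGRAM leaves behind once the
        # loop, the dispatch, and the operand stack all fold away
        return x * 2 + 1

    assert interpret(PROGRAM, 10) == specialized(10) == 21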


Especially since Graal/Truffle do include a JIT, which even has better optimization heuristics than HotSpot.


> @item -c

> Only output bytecode in a C file. The default is to output an executable file.

> @item -e

> Output @code{main()} and bytecode in a C file. The default is to output an executable file.

https://github.com/bellard/quickjs/blob/master/doc/quickjs.t...


Not really what you are after, but it does use AOT (via generated C++ code).

https://arcade.makecode.com/

https://makecode.com/language


> In case of microcontrollers, PXT programs are compiled in the browser to ARM Thumb assembly, and then to machine code, resulting in a file which is then deployed to the microcontroller, usually via USB mass-storage interface.

Looks like the only compilation strategy is directly to ARM Thumb assembly, or did I miss something?


Maybe my memory is fuzzy about the exact workflow, in any case you can check

"Microsoft MakeCode: from C++ to TypeScript and Blockly (and Back)"

https://www.youtube.com/watch?v=tGhhV2kfJ-w

and the respective C++ FFI,

https://makecode.com/simshim


You'd be stuck with float numerics everywhere because there's no fallback to recompile if an attempt to use integers fails.


Compile both, dispatch dynamically at runtime.
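
A minimal sketch of that idea in Python (illustrative only; a real compiler would emit both specializations as native code and the guard as a type check):

    def add_int(a, b):
        # specialized fast path: would compile to a raw machine integer add
        return a + b

    def add_generic(a, b):
        # fallback path with full dynamic semantics
        return a + b

    def add(a, b):
        # runtime dispatch: take the specialization only if the guard holds
        if type(a) is int and type(b) is int:
            return add_int(a, b)
        return add_generic(a, b)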


It wouldn't be a "both" situation. Every number in a block of code may or may not be representable as an integer. You can't generate code for every possible combination.


Because it would require tons of resources to make it more performant than the current JS JITs, and it would probably fail. So what you see is some games just embedding V8.


If you're already using Python's type hints, I'd suggest checking out mypyc. If your code type-checks under mypy, you don't need to do much more than run `mypyc` to get a performance boost.
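
For example, a module like this (hypothetical file name) compiles unchanged; assuming a standard mypyc install, `mypyc fib.py` builds a C extension that a subsequent `import fib` picks up automatically:

    # fib.py - ordinary type-hinted Python, no special syntax needed
    def fib(n: int) -> int:
        a: int = 0
        b: int = 1
        for _ in range(n):
            a, b = b, a + b
        return a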


Tangentially related question: could Nuitka target WebAssembly, given that according to their overview page, they translate Python into C? Usually when it comes to Python -> WebAssembly, the biggest problem is the lack of GC in the WebAssembly spec (as far as I understand), and I'm wondering if this would be an issue for Nuitka as well...


Of course you can build a GC in WebAssembly. You might have to avoid the native stack and lose some performance that way, but that shouldn’t be too bad I think.

One problem is that this GC doesn’t interact with the browser’s GC in any way. So you have painful memory management interactions when (for example) a DOM event handler references a WebAssembly object which in turn holds a reference to a browser object, possibly with cycles (so simple reference counting isn’t enough).

You can fall back to manual memory management here, but that’s painful to use. To make this work seamlessly, you need a way to trace references across both heaps in one swoop. Last I checked, there was standardization work underway to enable that.


Yes, but Pyodide is probably better suited for that, and it can load uncompiled Python code.

See:

https://github.com/pyodide/pyodide

And an example featuring a pure client-side Jupyter instance:

https://news.ycombinator.com/item?id=28377550


It would be neat if the big companies like Dropbox, Instagram, Google, Oracle, Shopify, Stripe, whatnot building better Python/JavaScript/Ruby implementations would start building program analysis libraries for Python/JavaScript/Ruby so that more implementations could get this for free.

For example, the homepage of Nuitka says they only just added support for constant folding and propagation. That's such low-hanging fruit that it's crazy they have to do it themselves.

You can imagine generic libraries for Python/JavaScript/Ruby that turn the AST alone into an optimal form and then let some other backend worry about code generation or VM implementation.
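
For instance, here's a toy constant-folding pass over Python's own ast module, a minimal sketch of the kind of reusable pass such a library would collect (Python 3.9+ for ast.unparse):

    import ast
    import operator

    _OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}

    class FoldConstants(ast.NodeTransformer):
        # bottom-up: fold children first, then any BinOp over two literals
        def visit_BinOp(self, node):
            self.generic_visit(node)
            op = _OPS.get(type(node.op))
            if (op and isinstance(node.left, ast.Constant)
                    and isinstance(node.right, ast.Constant)):
                try:
                    value = op(node.left.value, node.right.value)
                except Exception:
                    return node  # e.g. division by zero: leave it alone
                return ast.copy_location(ast.Constant(value), node)
            return node

    tree = FoldConstants().visit(ast.parse("x = 2 * 3 + 4"))
    print(ast.unparse(tree))  # prints: x = 10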


An issue, though not an unsolvable one, is that you probably need a standardized representation of the AST to write these optimization passes against; it is not uncommon for compiler writers to disagree on what representation is the best for what purpose.

Then there is the question of optimizations that are (typically) easier/possible only at the code-generation phase.

All of this is solvable of course, but there needs to be some will to do so (or in the case of companies, enough commercial benefit).


LLVM IR?


Wow, this got my attention. I use Python for web application development. A 2-3x speedup would be very interesting for higher-load deployments.


The best feature of Nuitka, and its main goal, is not the speed gain. That's the cherry on top.

The main goal is getting a standalone executable that works with C extensions.


That's correct: Nuitka packages Python into one single executable; it's an 'installer', so to speak.


Not an installer, no. The installer part is not provided; the result is an executable that works out of the box. An installer would then take that and put it somewhere, but you don't need to. It can even work from a USB key.


Does it also package the extensions?


What sort of web applications do you work on where Python interpreter speed is the limiting factor? Usually, those applications are constrained by network throughput and context switches and page faults.


This argument is so tiring. That's part of why websites are so slow despite having insane hardware at their disposal.

At my company, for a web app we run in production, we strive to get every response out of our infrastructure in under a millisecond; everything above that, except for a few select endpoints, is considered a bug. With sensible technology choices, it's not even that hard to do. An RDBMS like Postgres can answer queries in microseconds.

As a nice bonus, we can provide real-time computation features our competitors could only dream of, just by not using dog-slow technology.

Our customers are not techies and you know what? When they use our product, the first comment is usually "wow, it's so fast".


Are you guys hiring? xD

Just to expand on your point:

Sure you can let a chunk of code take a few seconds longer than it could with little overall effect.

But stack a hundred of these up and all those seconds compound and suddenly become immensely important.

A suit of armour with a million chinks isn't very effective armour.

(As an aside, depending on your use case, it can be worth optimizing just the big bottlenecks: the high-use or computationally expensive components, and not worrying too much about the rest.)


> A suit of armour with a million chinks isn't very effective armour.

Counterpoint: https://en.wikipedia.org/wiki/Chain_mail


Right, I agree. My question is "what application is constrained by interpreter overhead?" That overhead should be well under a millisecond for almost any code, I would think.

I am agreeing with you, I don’t understand your response.


He is pointing out that it isn't necessarily a constraint, just an improvement of the product by using more performant systems.


Most websites today are slow due to loading megabytes of JavaScript, trackers from a dozen domains, CSS frameworks, and large images.


Same at my startup. I made the decision to use Elixir because it's so fast while still having the readability of Ruby or Python. The end result saves us money on hardware. More importantly, the customers feel like they are using a responsive, fast application.


What web/app stack are you using to fulfill <1ms responses?


If you are interested in speed you might get more from Cython (not to be mistaken for CPython) or mypyc. The catch is that you need to specify types (in Cython you otherwise won't get a speed gain, and mypyc will refuse to compile without them).

I remember reading an article where somebody described how, instead of adapting an existing project to work with Cython, they started their project in Cython outright and found it beneficial.
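
For illustration, a minimal sketch in Cython's "pure Python" mode (hypothetical file name; the cython package provides shims, so plain CPython can still run the file, while something like `cythonize -i fib_cy.py` compiles it with C-level types):

    # fib_cy.py
    import cython

    def fib(n: cython.int) -> cython.long:
        a: cython.long = 0
        b: cython.long = 1
        for _ in range(n):
            a, b = b, a + b
        return a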


You're extremely unlikely to get much, if any, speedup on a large, general codebase.

The 2-3x improvements are on specific microbenchmarks.


If you're just looking to speed things up, PyPy and Cython are both good options.


Graal Python is really fast too, but I'm not sure how it plays with C extensions.



How is this more compatible than just Python? It requires libpython, so you need to have Python installed on the target machine anyway?

Also, is this actually faster than just Python? I tried it out with some deliberately bad, expensive looping over lists on Python 3.9 and it was 100ms slower (1.3s vs 1.4s).

Just in general, why should I be using Nuitka?


I think the idea behind Nuitka is to compile Python into an executable rather than interpret a script with a fully generic interpreter, but without (or with minimal) limitations, falling back to libpython's implementation where necessary.

> In the future Nuitka will be able to use type inferencing based on whole program analysis. It will apply that information in order to perform as many calculations as possible in C, using C native types, without accessing libpython.

I believe the difference from PyPy is that PyPy tries to do this using just-in-time compilation while Nuitka uses ahead-of-time compilation.


Nuitka takes your Python code and all its deps, turns it into C code, and compiles that into a standalone executable that is both faster and doesn't need a separate Python VM to be installed.


Except it does require libpython, and at least on Ubuntu 20.04 it links against the dev version of the library. So not only can you not run the resulting binary on any random Ubuntu 20.04, you also need to install the same version of Python plus the "-dev" variant.

I did try this. With the "stock" python3 (3.8.something) I couldn't get Nuitka to compile my code (or the example from the docs). With python3.9 I could start the compilation, but I didn't have the libpython-dev shared object, and finally with python3.9-dev I was able to make the binary. However, after transferring it to another Ubuntu 20.04 machine I couldn't run it, since it was linked against libpython3.9-dev.

Just as a side note: the same Python code (i.e. the .py file) ran on both machines out of the box.


If you pass the proper option, it will compile everything, including libpython, resulting in a standalone executable.

The only dep would be libc, and if you compile against an old version, it should be forward compatible.
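
Something along these lines, assuming a recent Nuitka (flag availability varies by version and platform, so check --help):

    python -m nuitka --onefile --static-libpython=yes main.py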


It has an option to build libpython in, so your comment is incorrect.


I guess there is a good reason for it not to be the default behavior.


> Also, is this actually faster than just Python? I tried it out with some deliberately bad, expensive looping over lists on Python 3.9 and it was 100ms slower (1.3s vs 1.4s).

Mind sharing that code? I'm sure someone would like to be able to look into this.


I don't have it anymore, but it was something stupid like:

    import time
    start = time.time()
    list = []                 # shadows the builtin, deliberately "stupid"
    for i in range(10000000):
        list.append(i)
    sum = 0                   # also shadows a builtin
    for item in list:
        sum = sum + i         # note: adds `i` (stuck at the last loop value), not `item`
    print(time.time() - start)
I am not a super-duper Python expert, but I know that arrays/lists are the stupidest thing you can use, so that's why I did stupid things with them. So before someone posts a more optimized version: the code was supposed to be shit, since I assumed the compiler would optimize it.

I expected the binary to essentially be

    1. start = time.time()
    2. pre-populated list since it should be constant e.g. list = [1,2,3,...] or completely remove it since it is not used anywhere
    3. pre-filled sum since it too is constant or same as above, removed since it is not used anywhere
    4. getting the second time.time() and printing the difference
But there was no compile-time optimization. Granted, this wasn't promised, but I just assumed it was there since it is a compiler.


This is where Python's dynamism makes things like this hard. You have no idea what range does, as that name may point to any user-defined object by the time the code is executed. So no, I don't think it can "optimize both loops away".

Come to think of it, you don't particularly know what sum = sum + i is doing either. There could be an __radd__ implementation in whatever your custom range spits out that indexes the WWW, renders a Mandelbrot or whatever.

And yes, of course overloading range and __radd__ that way would also be considered stupid, but if the loops were optimized away, it wouldn't be a drop-in replacement for Python.


The compiler can literally see that the `list` variable is only used to generate the `sum` variable and that `sum` is never used.

Since `sum` is never accessed by anyone, it can be optimized away, and once the only accessor of `list` is gone, it too can be optimized away.

Thus both loops most certainly CAN and SHOULD be removed from the final compiled code. The only reason Python can't do it is that it is not compiled, so it has to actually run through the code at runtime and see what happens.

This is basic static analysis and any other compiler for any other language would do this. This is literally why we have keywords like `volatile` in compiled languages.

I don't understand your first argument at all. Why wouldn't the compiler know what the standard `range()` function does? The second one only makes sense because we aren't type hinting; however, that too could be addressed by analyzing the code.

If we can't trust the functions in the standard library, then what is the benefit of compiling the code?


> I don't understand your first argument at all. Why wouldn't the compiler know what the standard `range()` function does? The second one only makes sense because we aren't type hinting; however, that too could be addressed by analyzing the code.

I'm sure it knows what the standard range() function does. It just doesn't know that that's the standard range function you use when the code is being run. Any Python program is also a module. Any module can be imported. Any imported module's global namespace is unknown at the time you write the program.

Imagine I have your code, above, in a module called nextlevelwizard, and I write my own little program like this:

    import builtins
    def myrange(n):
        cur = 0
        while cur < n:
            print(cur)  # a visible side effect: the loop can't be removed
            yield cur
            cur += 1
    builtins.range = myrange  # every module's `range` now resolves to this
    import nextlevelwizard    # the benchmark now runs with myrange
    import nextlevelwizard
Now, suddenly, your loop has a side effect. Side effects cannot be "optimized away" - the whole point of a program is generally to produce some sort of side effect!

This is a silly example, of course, but the dynamic nature of Python means that it really is very hard to skip steps you assume are irrelevant.


Except the compiler should import whatever module you are using and analyze that module's code, just like in any compiled language.

Even in the example you gave, it could replace the loop with a bunch of print calls, since the yielded values are never used.

Compilers are hard.


Maybe it was the size of the list.

Have you tried xrange?


No and I won’t. There is no reason why the compiled code shouldn’t optimize both loops away since they do nothing. And in any case it should at least be faster if it doesn’t optimize the code away.

And I am fairly sure xrange() is not in Python3


I also got a Nuitka-produced binary that was slower than running under CPython. Granted, it is a complicated beast with several external Python libraries: numpy and others sprinkled with Cython, etc. Here is the original repo I tried to speed up, using:

python -m nuitka --clang --follow-imports main.py

repo: https://github.com/MRCIEU/gwas2vcf

If someone can make the program run faster by whatever means, it will make a bunch of people quite happy.


> Retry downloads without SSL if that fails, as some Python do not have working SSL. Fixed in 0.6.14.5 already.

Is this a good idea? If I understand correctly, if an HTTPS request fails, it falls back to HTTP. So an attacker could just block the HTTPS request, then MITM the HTTP request.
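
In other words, the pattern would presumably look something like this (an illustrative sketch, not Nuitka's actual code):

    import urllib.request

    def fetch(url: str) -> bytes:
        try:
            return urllib.request.urlopen(url).read()
        except Exception:
            # the risky fallback: retry over plain HTTP, which an attacker
            # able to block port 443 is then free to intercept or modify
            insecure = url.replace("https://", "http://", 1)
            return urllib.request.urlopen(insecure).read()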


See also: shiv, which doesn't compile, but bundles your Python scripts and deps:

https://news.ycombinator.com/item?id=28377545

Very nice for deployment and scripting.


This looks nice for distribution, thanks!


How well does it handle tkinter?


Extreme compatibility has a cost.

If you're willing to give up some compatibility, you could get faster code that can be checked statically by transpiling to one of many statically typed languages.

This is the approach taken by py14, pyrs and their successor py2many, on which I've been working for some months.


But it was the lack of 100% compatibility that hampered PyPy's adoption.

Anyway, what are your incompatibilities?


It seems to be fully statically typed, for starters:

https://github.com/adsharma/py2many/blob/main/doc/langspec.m...

But, given LuaJIT and V8, it's obviously possible to generate high-performance code without such restrictions.

As for PyPy adoption, it's hindered more by the lack of API compatibility with CPython native extensions.


Yes - Julia has a pretty good JIT too, but doesn't have a great story on AOT-compiled binaries.

If you find it easy to write a quick command-line tool in Python but end up rewriting it in another language for size, performance, or self-contained deployment, this may be a good fit.


It doesn't use Python's C API or runtime. So no metaclasses, no monkey patching, no eval(). All types must be known at compile time, or it must be possible to infer them.

There are several challenges to solve: translating the stdlib of one language to another (see plugins.py) and bridging semantic gaps (Rust doesn't like mutable global state; Python has it).


Does this mean that existing Python code - like dependencies on PyPI - won't be generally compatible?

Without metaclasses and dynamic types, I think it's better to call the language something else and say it's Python-inspired (like Elm is Haskell-inspired).


Renaming the language in the future is still a possibility. For now, all the tests here:

https://github.com/adsharma/py2many/tree/main/tests/cases

are run with the CPython interpreter and verified for compatibility.

Even where there is a desire to innovate (design by contract, or pattern matching as an expression), the hope is to do so without breaking CPython (as long as you stick to the subset).


How does this compare with Pyston, which was praised in previous HN discussions?


Does it print the same stacktrace as CPython?


Yes, but since Nuitka targets distribution, you probably want to disable that for the end user.


Only one thing to say:

"Compatibility comes at a cost."


If you use pip install nutika -U, on both Mac and Linux it doesn't find an installation candidate for me. pip install -U "https://github.com/Nuitka/Nuitka/archive/develop.zip" works on both currently. But this does not put "nutika" on the path. Perhaps someone here is more familiar with the installation system and can comment?


It's nuitka, not nutika.

And use "python -m nuitka", not "nuitka" alone.

"-m" is a very important option in python that solves a lot of path problems.


nutika ≠ nuitka



