Right, I'd like to see a comparison with Nim or Julia, or with a compiled high-level language that isn't particularly performance-oriented, like Haskell, Clojure, or Common Lisp, or even Ruby with its new JIT(s). Or, for that matter, Python with Numba or one of the other JIT implementations (PyPy, Pyston, Cinder).
Is that improvement due to the program being automatically parallelized, or to the code being compiled/JITed? A 150x speedup seems too large for either alone, so I suspect both factors combine to produce it.
they say the example i used is JIT-compiled into machine code. i haven't looked into the codebase yet, but i presume that means it just un-pythons it back into C? not sure.
fwiw, i tried the gpu target (cuda) and it was faster than vanilla, but slower than the accelerated cpu target by about 4x.
(taichi) [X@X taichi]$ python primes.py
[Taichi] version 1.4.1, llvm 15.0.4, commit e67c674e, linux, python 3.9.14
[Taichi] Starting on arch=x64
Number of primes: 664579
time elapsed: 93.54279175889678/s
Number of primes: 664579
time elapsed: 0.5988388371188194/s
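For context, the 664,579 figure in the transcript is the count of primes below 10,000,000, which matches Taichi's documented count-primes benchmark. A minimal pure-Python sketch of what the "vanilla" baseline presumably looks like (the function and variable names here are my own; the accelerated variant would add `ti.init()` plus `@ti.func`/`@ti.kernel` decorators, at which point Taichi JIT-compiles the loops to native code and parallelizes the outer one). The limit is reduced here so it runs quickly without acceleration:

```python
import math
import time

def is_prime(n: int) -> bool:
    # Trial division up to sqrt(n); this inner loop is the hot path
    # that gets compiled to machine code in the accelerated run.
    for k in range(2, int(math.sqrt(n)) + 1):
        if n % k == 0:
            return False
    return True

def count_primes(limit: int) -> int:
    # Count primes in [2, limit); Taichi would parallelize this loop.
    count = 0
    for n in range(2, limit):
        if is_prime(n):
            count += 1
    return count

if __name__ == "__main__":
    start = time.perf_counter()
    print("Number of primes:", count_primes(100_000))
    print("time elapsed:", time.perf_counter() - start, "s")
```

With the original limit of 10,000,000 this interpreted version is what produces the ~93 s timing above; the ~0.6 s run is the same logic after JIT compilation.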