Pystone isn't a performance benchmark, or at least it isn't a useful one. It's more of a regression test to see if anything has changed between versions. It's not useful as a performance benchmark because it doesn't weight the results according to how much the individual features matter in real life. There are three versions of Python besides CPython that are in commercial use. Two are much slower than CPython (up to three times slower), and PyPy is (currently) faster in some applications and slower in others.
The CPython interpreter is not a simple switch. It uses computed gotos if you compile it with gcc. Microsoft VC lacks the language support needed for writing fast interpreters, so the Python source is written to fall back to a switch when compiled with MS VC. So, with every compiler except that one, it's a computed goto.
Modern CPU performance is very negatively affected by branch prediction failure and cache effects. A lot of the existing literature that you may see on interpreter performance is obsolete because it doesn't take those factors into account, but rather assumes that all code paths are equal. Threaded-code dispatch worked well with older CPUs, not so well with newer ones.
I am currently working on an interpreter that recognises a subset of Python for use as a library in complex mathematical algorithms. As part of this I have benchmarked multiple different interpreter designs for it and also compared it to native ('C') code. It is possible to get a much faster interpreter, provided you limit it to doing very simple things repetitively. These simple things also happen to be the sorts of things which are popular with benchmark writers (because they're easy to write cross-language benchmarks for), but which CPython does not do well in.
A sub-interpreter which targets these types of problems should give improved performance in this area. Rewriting the entire Python interpreter though would probably have little value, as the characteristics of opening a file or doing set operations, or handling exceptions are entirely different from adding two numbers together.
There is no such thing as a single speed "knob" which you can crank up or down to improve performance. There are many, many features in modern programming languages, all of which have their own characteristics. Picking out a benchmark which happens to exercise one or a few of them will tell you nothing about how a real-world application will perform unless it corresponds to the actual bottlenecks in your application. For that, you need to know the application domain and the language inside and out.
One thing about Python developers is that they tend to be very pragmatic. When someone comes to them with an idea, they say "show me the numbers in a real life situation". More often than not, the theoretical advantage of the approach being espoused evaporates when subjected to that type of analysis.
Anyway, I've been told that the CPython interpreter is kept very simple on purpose, to allow it to function as a standard 'definition' of the language behaviour. A simple JIT does wonders, as does a less brain-dead GC. Superinstructions, threading, ... are all possible. But you're absolutely right: it's really difficult to predict how much each improvement would contribute.
Have a look at the lines starting at line 821 in the very file you referenced. I have quoted a bit of it here:
"Computed GOTOs, or the-optimization-commonly-but-improperly-known-as-"threaded code" using gcc's labels-as-values extension (...) At the time of this writing, the "threaded code" version is up to 15-20% faster than the normal "switch" version, depending on the compiler and the CPU architecture."
They also have an explanation of the branch prediction effect which I mentioned earlier.
They have both methods (switch and computed goto) since some compilers don't support computed gotos, and some people want to use alternative compilers (e.g. Microsoft VC).
In my own interpreter, I tried both switch and computed gotos, as well as another method called "replicated switch". I auto-generate the interpreter source code (using a simple script) so that I could change methods easily for comparison. In my own testing, computed gotos were about 50% faster than a simple switch, but keep in mind that was for strictly numerical code. More complex operations would water that down somewhat, as less of the execution time would be due to dispatch overhead.
Computed gotos aren't really any more complex than a switch once you understand the format, and as I said above you can convert between the two with a simple script. What does get complex is doing Python-level static or run-time code optimization to try to predict types or remove redundant operations from loops. CPython doesn't do that, while PyPy does it extensively. It's these types of compile-time and run-time recompilation optimizations which make the big difference.
Overall, my interpreter is currently about 5.5 times faster than CPython with the specific simple benchmark program I tested. However, keep in mind it only does (and only ever will do) a narrow subset of the full Python language. Performance is never the result of a single technique. It's the result of many small improvements each of which address a specific problem.
So the conclusion really is: CPython is way slower than it should be.
Question: if the subset is small, isn't it better to use something like Shed Skin?
http://code.google.com/p/shedskin/
I once looked at it, and it does a fairly literal translation. The only problem is that it changes the semantics of the primitive types. For example, a Python integer becomes a C++ int, and the overflow semantics change.