Rust will likely not support tail call optimization (mail.mozilla.org)
157 points by steveklabnik on April 10, 2013 | 152 comments


Note that we do sibling call optimization per LLVM, and you can use trampolines like Clojure does. Sibling call optimization, most notably, includes all self-calls. This means that calls to the currently executing function in tail position will be tail call optimized, if there are no destructors in scope.

The biggest problem here is ABIs: you have to use Pascal calling conventions instead of C calling conventions. It's also difficult when you have destructors: there are not many tail call positions when destructors are in scope.
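
For readers unfamiliar with the trampoline workaround mentioned above, here is a minimal sketch in modern Rust syntax (not the 2013 dialect under discussion); the `Bounce` type and the function names are invented for illustration. Instead of making a tail call, a function returns the next call as data, and a plain loop drives the computation:

    enum Bounce<T> {
        Done(T),
        // A boxed thunk representing the next "tail call" to make.
        Call(Box<dyn FnOnce() -> Bounce<T>>),
    }

    // The trampoline itself is an ordinary loop, so the stack
    // never grows no matter how many bounces occur.
    fn trampoline<T>(mut step: Bounce<T>) -> T {
        loop {
            match step {
                Bounce::Done(v) => return v,
                Bounce::Call(thunk) => step = thunk(),
            }
        }
    }

    // A "tail-recursive" countdown written in trampolined style.
    fn countdown(n: u64) -> Bounce<u64> {
        if n == 0 {
            Bounce::Done(0)
        } else {
            Bounce::Call(Box::new(move || countdown(n - 1)))
        }
    }

    fn main() {
        // Constant stack space, at the cost of one heap
        // allocation per bounce.
        println!("{}", trampoline(countdown(1_000_000)));
    }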


I'd vote getting to use RAII is a better win than TCO.

And really it's by definition not a tail call when you have to run destructors after a "tail" call, and it's tricky/annoying to express that destructors should be run prior to a tail call.


Wouldn't it be possible to have some sort of macro able to "manually" convert calls in a recursive style into loops? Something along the lines of Clojure's `loop/recur` but less tricky (and maybe with a single "word")?


Self recursion is only a particular case, and you can often represent it with loops or folding patterns as you are imagining. However, the really cool use for tail call elimination is when you are calling other functions, since you can't rewrite the code without breaking encapsulation. Off the top of my head, one important example of this would be continuation passing style (you need TCO to turn the continuations into real "gotos").
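
As a concrete illustration, here is a hedged sketch in modern Rust (the names are invented): in continuation-passing style, "returning" means tail-calling the continuation you were handed, and no source-level rewrite can turn that into a loop, because the callee is a run-time value:

    // Sum a slice in continuation-passing style. The recursive call
    // and the call to `k` are both in tail position; with TCE they
    // would compile to jumps, but without it every element costs a
    // stack frame.
    fn sum_cps(xs: &[u64], k: &mut dyn FnMut(u64)) {
        match xs.split_first() {
            None => k(0),
            Some((&x, rest)) => sum_cps(rest, &mut |acc| k(x + acc)),
        }
    }

    fn main() {
        let xs: Vec<u64> = (1..=5).collect();
        sum_cps(&xs, &mut |total| println!("{}", total)); // prints 15
        // A long enough slice would overflow the stack, which is
        // exactly the failure mode described above.
    }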


That would be useless. I believe the point of proper tail calls is that the set of functions that may be called from a particular call site is not closed. The callee may be a parameter of the caller, for example in combinator-style libraries. You can't convert that in advance.


Yes, I think so.


How about the other two calling conventions that support tail call optimization? (cc 10 and cc 11) Do those have the same performance penalty as fastcc + tailcallopt?


Please, for the sake of $DIETY, stop calling it "Tail Call Optimization". It is not an optimization, and the wrong term causes lots of useless discussions and misunderstanding. Call it "Tail Call Elimination", because that's what it is.

With finite memory (that is, in the real world), a program that relies on tail recursion will blow up if TCE is not implemented even though it can run forever with TCE. Thus, it changes actual program results (rather than just runtime speed or amount of memory required), and cannot be considered just an "optimization", since it is functionally required.

And to future-language designers: Please, for the sake of $DIETY2, make the syntax for guaranteed-eliminated-tail-call different than a regular call: It's really a bad idea that a function such as:

    def f(x):
       ...
       return 0+f(x-1)
cannot in general be TCEd (adding 0 to a floating point improper denormal will make it proper), whereas

    def f(x):
       ...
       return f(x-1)
can. I suggest "chain y" instead of "return y" so that the compiler can verify that a TCE can indeed happen.


cannot be considered just an "optimization", since it is functionally required

Here is a C program:

    #include <stdlib.h>
    int main(void) { while (1) malloc(64); }
With clang, when compiled with -O0, this consumes all the memory on my system. When compiled with -O2, it runs forever in constant memory, because clang has optimized out the call to malloc. If I replace the call with, say, 'malloc((size_t)(-1) >> 1)', it even changes the output of the program.

Here is another example:

    long long factorial(long long x) { return x <= 1 ? 1 : x * factorial(x - 1); }
With gcc, when compiled with -O2, the factorial function can correctly compute the factorial of a very large number. When compiled with -O0, it cannot.

I would call both of these optimizations, even though someone may rely on either for correctness. So I don't think tail calls are unique in this respect.

make the syntax for guaranteed-eliminated-tail-call different than a regular call

Agreed.


Both of those programs are playing with C's infamously broad undefined behavior rules, the first for a non-terminating loop and the second for using too much stack (when called with a sufficiently large number). -O2 being nice to you is a coincidence, not something you should rely on.

In a language which prefers diagnosing programmer error over punting both should be errors.


In the finite world beagle3 talks about, don't all optimizations potentially qualify as "functionally required"? If the finite world you're operating in can only run a program that requires at most N steps ("steps" being deliberately vague "computing resources," like CPU cycles or memory usage), then any optimization that makes an (N+1)-step program complete in N steps qualifies as a "functional requirement."


The reason people are so interested in TCO/TCE is not just that someone might rely on them for correctness, which as you point out is true of any optimization, but rather that they are so predictable that you can omit looping constructs from your core language if you have them.


> even though someone may rely on either for correctness.

That someone is playing with fire. It's the kind of optimization that might be removed in a hotfix because it turns out to break something else.

It's even worse than relying on things like byte order -- and if you rely on them, you are definitely outside the realm of "C language", and into the realm of "Specifically LLVM v.37x patchlevel 2, on an SSE4 architecture"


"Please, for the sake of $DIETY"

If you worship food, that is. But you're right, people calling proper tail calls "an optimization" should be shot. Those who survive the subsequent generation scavenging should be shot again. :-)


Or follow Alef (http://www.cs.bell-labs.com/who/rsc/thread/ug.pdf) and use "become". The Alef become statement is a kind of generalized tail-call form because you can use it to drop the current stack frame from anywhere, not just in tail-position.


As a historical note, this was also the original plan for Rust, where the presence of the `be` keyword rather than the `return` keyword would indicate tail-calling.


I suggest "chain y" instead of "return y" so that the compiler can verify that a TCE can indeed happen.

I think the Rust syntax for this was `be y` instead of `return y`.


If I remember correctly, the new version of Tcl will have special syntax to enforce TCE (the `tailcall` command in Tcl 8.6). Unfortunately I did not bookmark the relevant online discussion.


Clojure also doesn't support TCO, and from what I can tell the explicit loop/recur is sufficient, which also has the advantage that it is very clear (i.e., the compiler will tell you) when TCO is not being invoked. With implicit TCO you might think you're getting the benefit, only to have a stack blow up unexpectedly at runtime with an edge-case input.
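
The rewrite that `loop/recur` makes explicit is the same one you would do by hand in a language with loops. A sketch in Rust (function names invented):

    // Tail-recursive version: each call needs TCE to avoid
    // growing the stack.
    fn gcd_rec(a: u64, b: u64) -> u64 {
        if b == 0 { a } else { gcd_rec(b, a % b) }
    }

    // The loop/recur-style rewrite: rebind the "arguments" and
    // jump back to the top. No stack growth, no TCE needed.
    fn gcd_loop(mut a: u64, mut b: u64) -> u64 {
        while b != 0 {
            let t = a % b;
            a = b;
            b = t;
        }
        a
    }

    fn main() {
        assert_eq!(gcd_rec(48, 18), gcd_loop(48, 18)); // both 6
    }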


Especially with destructors. Whenever you have a destructor in the frame, you're not in tail-call position. This means that tail-call positions in Rust are going to be few anyway.


That calls for an interesting optimization: Using escape+liveness analysis you could probably call some of those destructors earlier.


The problem is that TCO in the sense we're talking about isn't really an "optimization". It's a language feature that removes the need for explicit looping constructs by allowing you to write them as higher-order functions instead. This only works if the guy writing the higher-order function can prove that the compiler will successfully "optimize away" the tail call, or if he doesn't care whether the program dies with a stack overflow. If the success of TCO is dependent on which conservative approximation the compiler is using for escape and liveness analysis, then people who care about their programs continuing to run will demand explicit looping constructs anyway.

There are three separate inventions in Scheme that work this way. Closures give you objects without an object construct, tail-call optimization gives you loops without looping constructs, and call/cc gives you threads, exceptions, and backtracking without thread, exception, or backtracking constructs.

(I mention Scheme because all three of these were, as far as I can tell, introduced in Scheme, and only later adopted by other functional languages like ML, although to be fair, TCO at least falls out naturally from combinator-graph reduction.)

In a sense, Scheme is sort of like a functional assembly language: there are lots of object systems, looping constructs, threads, and exception systems in Scheme, and they aren't compatible with each other. It's sort of like the situation with linked lists in C, where every library has its own linked-list type.

It's exactly the opposite of assembly language in another way, though. By making object instantiation and population implicit, both the compiler and the maintenance programmer have to do extra work to figure out what's an object and what's not, and what the object's fields are. The same is true of loops, threads, and exceptions. In assembly, instead, you have the problem of things being too explicit, thus losing the signal in the noise.

That's my take, anyway. I haven't spent that much time programming in either Scheme or assembly, although I did write an almost-Scheme compiler targeting assembly called Ur-Scheme.


The "assembly language" for ML would actually be the lambda calculus. Scheme isn't really defined unless you're talking about some standard, and the semantics of the language are probably just defined in English.


Hmm, it sounds like you're trying to argue with me, but I'm not clear on what you're saying — some quick responses — hope this helps:

① I did not claim Scheme was an "assembly language" for ML. I said that programming in, reading programs in, and compiling programs in Scheme was like programming in, reading programs in, and compiling programs in assembly language in some specific ways, and very unlike it in others. The relationship between Scheme and ML is that some of the central insights of Scheme were adopted by ML.

② ML defines evaluation order. The lambda calculus does not. Typical ML implementations compile to the assembly languages of actual processors. Can you clarify?

③ I think Scheme is sufficiently well defined for this discussion — it's a family of languages originating in some papers by Sussman and Steele in the 1970s, and continuing through the current R7RS work, including a number of compilers. Several of the standards define the semantics of the language symbolically, not just in English.


Does this mean Rust's `be` keyword can be retired? :)


Yes.


Doesn't loop/recur only help in directly self-recursive calls, as opposed to mutual recursion or tail calls to function arguments?


That's correct. It should be possible to have an explicit guarantee of TCO while maintaining the ability to write things that aren't trivially reducible to loops...


Which is +- what Scala allows with an annotation.


No, Scala's @tailrec annotation only guarantees that calls that can trivially be compiled to loops in fact are. You can't do better on a standard JVM without jumping through hoops (or, rather, trampolines).


Right. In Clojure you'd use an explicit trampoline for stack-free mutual recursion.


Mutual TCO is really uncommon, for what it's worth. If you need it that badly, a trampoline should suffice.


Direct mutual recursion is somewhat rare (but useful for encoding state machines, for instance). However, using continuation passing style (and tail calls to a continuation argument) to prevent stack overflow is a common idiom in some functional languages (e.g. when mapping over a binary tree).
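
A minimal sketch of that state-machine encoding in Rust (an invented example): each state is a function, and a state transition is a cross-function tail call, so without TCE every transition costs a stack frame:

    fn is_even(n: u64) -> bool {
        if n == 0 { true } else { is_odd(n - 1) }
    }

    fn is_odd(n: u64) -> bool {
        if n == 0 { false } else { is_even(n - 1) }
    }

    fn main() {
        assert!(is_even(10));
        assert!(is_odd(7));
        // Large inputs can overflow the stack, since these
        // cross-function tail calls are not guaranteed to be
        // eliminated.
    }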


Erlang uses mutual TCO everywhere. All actors/processes are state machines with the function they're in being their current state. To switch states, you just tail-call a new function (which doesn't have to be from the same module you were in.)


It seems like part of the decision was reached because of limitations in LLVM[1]. This makes me a bit uneasy, but I guess it has to follow C to meet its goals.

1) "LLVM does support tail call optimization, but it requires using a different calling convention than C and slowing down all function calls" from https://mail.mozilla.org/pipermail/rust-dev/2013-April/00355... and http://llvm.org/docs/CodeGenerator.html#tail-call-optimizati...


They could also add their own calling convention to LLVM to support what they want like GHC and HiPE have done.


No they can't, at least not without compromising on several points laid out in TFA. Mainly these two:

    - Tail calls also "play badly" with assumptions in C 
      tools, including platform ABIs and dynamic linking.

    - Tail calls require a calling convention that is a 
      performance hit relative to the C convention.
See other posts in this discussion for examples of the impacts of some alternate calling convention choices.


Then how can GCC and clang do TCE in C programs?


1) They can't do it all the time

2) Some of the times they can do it, they do it when they control both the caller and the callee. In that case you can violate the ABI to your heart's content.

3) I don't really see TCE as an issue in Rust, since they seem to be going the RAII route and, as others have commented, RAII and TCE don't play well together (for the same reason a call at the end of a dynamic binding can't be TCE'd in Common Lisp).


What is tail call optimization? How does it usually work? There's an implementation example on the Wikipedia page to answer this question a little bit:

http://en.wikipedia.org/wiki/Tail_call
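
In short: a tail call is a call that is the very last thing a function does, so the caller's frame can be reused and the call becomes a jump. A hedged before/after sketch in Rust (names invented):

    // Tail-recursive form: nothing happens after the self-call,
    // so the frame could be reused.
    fn sum_to(n: u64, acc: u64) -> u64 {
        if n == 0 { acc } else { sum_to(n - 1, acc + n) }
    }

    // What elimination effectively produces: the parameters
    // become mutable locals and the call becomes a jump back to
    // the top.
    fn sum_to_loop(mut n: u64, mut acc: u64) -> u64 {
        while n != 0 {
            acc += n;
            n -= 1;
        }
        acc
    }

    fn main() {
        assert_eq!(sum_to(1000, 0), sum_to_loop(1000, 0));
    }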


And, of course, "Debunking the 'Expensive Procedure Call' Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO" by Guy Steele. Available here: http://library.readscheme.org/page1.html



I'm puzzled: if Rust and Clojure won't support TCO and they are both cool, why is there such bitter complaining every time it comes up that Python won't support TCO?


I can't speak for Clojure, but Rust's rationale for omitting TCO is based on sound engineering issues (as articulated) whereas Guido's sole reason for omitting it (as far as I've seen) is that you lose stack trace information, which is considered something of a red herring (as you can always selectively disable TCO for debugging, and there are ways of retaining stack information anyway.) Note that the original article here is quite disappointed that they can't include TCO, because it would be a valuable tool and allow for different, efficient styles of programming.


I can't answer fully for Clojure's reasons for not supporting TCO, but it will at least be fundamentally rooted in the fact that the JVM, as a platform, does NOT support TCO.

Scala is usually the language brought up as a counter argument here, but Scala has the same limitations as Clojure - the JVM can't do the TCO. Scala's compiler tries, with certain types of tail calls, to optimize via (I believe) a trampoline – it reduces those calls into a loop, so they are no longer a function call.

But only certain types of calls (specifically, the last line of code in a recursive function must be a call to the hosting function) can be TCO'd in Scala. There is an annotation, scala.annotation.tailrec, which can be placed above a method you want to tail recurse.

@tailrec does not, however, "force" the compiler to do TCO – it simply forces compilation to fail if the method in question cannot be Tail Call Optimized. It's a developer hook for saying "I realize the compiler tries to do TCO where possible, but I require this method to be trampolined". If it can't be, you'll get a compilation error with details on why the TCO failed.


Scala compiles direct tail recursion to a jump to the start of the function (jumps within a method are possible on the JVM) without any trampoline.

So @tailrec tells the compiler to compile this call to an unconditional jump.


> the JVM, as a platform, does NOT support TCO.

And the x86, as a platform, does?

What, specifically, does the JVM do to make TCO more difficult than it would be in the machine code of your choice?


> And the x86, as a platform, does?

"It's complicated."

http://stackoverflow.com/questions/105834/does-the-jvm-preve...


Yes, the x86 instruction that implements TCO is called JMP. The JVM doesn't have an equivalent instruction; you can only jump around inside a single method, not to the beginning of another method.


> The JVM doesn't have an equivalent instruction; you can only jump around inside a single method, not to the beginning of another method.

OK, why do functions in the source language have to translate directly into methods at the JVM level? Purely for debugging?


The JVM also doesn't (I'm fairly sure) have an indirect JMP instruction, which means you can't compile a polymorphic source-language method call into a JMP inside of a single generated method. Instead you have to use a call to a method. That means that tail-calls to functions passed as parameters (rather than functions that can be statically bound at compile time) can't use JMP.


The “JVM” can use any instruction it wants to use.

If you want a runtime with proper tail calls, use an implementation which supports it.


I'm talking about the bytecode instructions standardized in the "JVM" specification. The "JVM" has to implement the semantics of the instructions found in your bytecode program, or your program won't run. Your bytecode program can only use bytecodes defined in the "JVM" specification, or it won't run on the "JVM".


There is no need or reason to change or add any bytecode instructions to support proper tail calls.

Take standard class files and execute them on a runtime with proper tail calls. Done. Works.


Which JVMs support proper tail calls? Doesn't this violate requirements of the JRE?


For instance oss.readytalk.com/avian. Why should it?


Because the security manager needs to be able to introspect the stack. Avian isn't a JVM; it's "designed to provide a useful subset of Java's features" and can run some Java code.


It's actually possible to implement the stack inspection in a tail-call-friendly way (along the lines of Racket's continuation marks IIRC), though AFAIK nobody does it.


> Because the security manager needs to be able to introspect the stack.

Not allowing certain “optimizations” in security-sensitive contexts is perfectly fine. In fact, this is exactly what Avian, the CLR, and pretty much everyone else are doing.

> Avian isn't a JVM; it's "designed to provide a useful subset of Java's features" and can run some Java code.

Now you're talking about legal aspects. Frankly, I'm not interested in discussing those.

Avian is a JVM for all practical purposes. If you disagree, please provide a test-case which runs on HotSpot but not on Avian.


JVM interop. You can't call a Scala function from Java unless it's somehow addressable as a Java method, so Scala implements functions in terms of Java methods. The JVM, for better or worse, is largely built around Java's object model - you can do your own weird stuff, but you won't easily be able to interop with Java code or benefit from HotSpot's optimizations.


For better or worse, Scala does do quite a bit of work to go beyond Java's object model.


Amusingly, "losing stack trace information" is among the reasons the JVM doesn't do TCO -- it needs to keep it around for the security manager to do its thing. Its especially annoying there because references are held by higher stack frames and prevent GC, and the destructor excuse isn't present.


> it needs to keep it around for the security manager to do its thing

Haha. As if the JVM needed proper tail calls to get owned every other week. :-)

Anyway, the discussion is pretty much moot, because "JVM" doesn't really refer to a specific implementation. If one cares about proper tail calls, use an implementation which supports it.


Does the JVM get 'owned every other week'? Or are you thinking of the JRE?


> Haha. As if the JVM needed proper tail calls to get owned every other week. :-)

Correction: that's the JVM made available by Oracle, which most people wrongly think is the only one.

There are tons of JVM vendors out there.


Which I more or less already said with

> If one cares about proper tail calls, use an implementation which supports it.

Right?


I would say yes, but your remark about getting owned is Oracle-specific, hence my comment.


A debugging switch that, when enabled, turns a valid program into an invalid one is madness.

Python has decided to place very conservative requirements on what an implementation's stack must be capable of. This might indicate a certain lack of ambition but it's a justifiable engineering tradeoff.


There are better alternatives to said debugging switch for examining the stack of tail-call-optimized programs; cf. Clements and Felleisen's 2004 paper, "A Tail-Recursive Machine with Stack Inspection." The debugging switch is merely the simplest naïve option.

Essentially, your engineering tradeoff boils down to, "It's easier (for implementers) to omit TCO." That is actually a very reasonable and rational position, as simpler implementations are less likely to have bugs and have a lower creation cost. However, I still hold that the advantages of TCO outweigh the concern about its implementation. (Also, I do not believe that Python does place conservative requirements on an implementation's stack, as a conformant Python implementation must also implement generators; I regard this as roughly on par with TCO's complexity.)


Complexity and stack space usage are two very different things. CPython does indeed place (very!) conservative limits on stack depth, and it also implements generators. I haven't looked at the generator implementation, but it has a number of limits that make it easy to implement — the main one is that generators don't have their own stacks, so can't yield from inside a function they call, the way e.g. Lua coroutines or (effectively) Icon generators can.


I'm sorry, I made something of a terminological error. When I said 'complexity', I was referring to complexity of implementation, not algorithmic time or space complexity. My intention was to assert that implementing generators and implementing tail-call optimization are both endeavors of roughly the same order of engineering difficulty (as the mechanism used to implement TCO is quite simple) and not to compare the stack space required by each.

I also should clarify that at this point, it's likely not reasonable to expect Guido or the Python community to ever turn around and decide to implement TCO, as the current engineering effort required to add TCO to all active implementations of Python is much larger in comparison to the engineering required to add it at an earlier stage. I regard this as a failure of the language design from the start, but having made their decision I don't fault the Python community for sticking to it.


I was responding to your "I do not believe that Python does place conservative requirements on an implementation's stack", in response to bcoates's statement that it does, nothing about complexity. I was just trying to clarify (my interpretation of) what he said.


Actually, in Python 3.3 they can (sort of). See PEP 380: http://www.python.org/dev/peps/pep-0380/


Thanks! I didn't know this, and it should remove a log-N factor from a lot of algorithms :)


I believe that both C# iterator functions and Python generators are designed with enough limits that they can be implemented as syntax sugar: you lift the generator function's locals into an object and decompose the function body into a state machine.
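
That desugaring looks roughly like this in Rust (a sketch with invented names): the generator's local is lifted into a struct field, and each call to `next` resumes the "function body" where it left off:

    // A generator yielding 0, 1, ..., n-1, desugared by hand. The
    // loop variable becomes a field and the suspension point
    // becomes a return from `next`.
    struct Counter {
        i: u64,
        n: u64,
    }

    impl Iterator for Counter {
        type Item = u64;

        fn next(&mut self) -> Option<u64> {
            if self.i < self.n {
                let out = self.i; // "yield i"
                self.i += 1;
                Some(out)
            } else {
                None // generator exhausted
            }
        }
    }

    fn main() {
        let total: u64 = Counter { i: 0, n: 5 }.sum();
        assert_eq!(total, 10);
    }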


> A debugging switch that, when enabled, turns a valid program into an invalid one is madness.

You mean like gcc's -Werror? Or enabling asserts? Both of those seem useful to me.


-Werror isn't a "debugging switch". It should be enabled at all times, for development builds and production builds.

If compilation warnings are given, then the responsible thing to do is to assume the program is invalid, at least in certain situations. Fix the code, avoid the warnings completely, and that's that.

There are remarkably few cases where compilation warnings can or should be ignored.


Since the compiler didn't succeed at compilation, no program was generated at all. Removing asserts is really compiling a different program.


Because Rust and Clojure have better excuses, maybe?

Rust is supposed to be as fast as C/C++ and TCO will slow it down.

Clojure can't have TCO because the Java Virtual Machine doesn't support it. So they have a workaround (trampolines).

Python doesn't support TCO because... it screws up stack traces?


> Rust is supposed to be as fast as C/C++ and TCO will slow it down.

TCO doesn't slow programs down. If anything, it might even make them a tiny bit faster. But speed isn't the reason you'd want TCO: you want it as a stack optimization so that tail-recursive functions use constant stack space. Furthermore, although the C specification places no TCO requirement on implementors, it's actually a very reasonable optimization even in C. GCC actually does do TCO in some cases (I'm using 4.5.3, and tail-call elimination occurs with -O2 and higher).

> Clojure can't have TCO because the Java Virtual Machine doesn't support it. So they have a workaround (trampolines).

This is true. There was a time when TCO support was slated for Java 7. That time has come and gone; unofficially, it looks like it may make an appearance in Java 9 at the earliest.

> Python doesn't support TCO because... it screws up stack traces?

That's the cited reason, but it's horseshit. As others have pointed out on this thread, it's feasible to simply disable TCO when doing debugging if you need a stack trace. Even fancier, there are algorithms to recover the stack even after the calls have been eliminated (see Lua for example).


"GCC actually does do TCO in some cases (I'm using 4.5.3, and tail-call elimination occurs with -O2 and higher)."

Only sibling call optimization, as cdecl doesn't allow TCO in the general case. Rust will do this too.


"TCO doesn't slow programs down. If anything, it might even make them a tiny bit faster." right, but i think they were saying that, in this case, because of dependencies out of the rust team's control, they would have to use a different, slower calling convention to support proper tail calls.


Well, according to a quick perusal of the LLVM documentation, the only LLVM calling convention that allows TCO on all platforms seems to be fastcc, whose main purpose is to make calls as fast as possible, at the cost of not being compatible with the slower but more common calling conventions of C. http://llvm.org/docs/LangRef.html#calling-conventions


What does fastcc do when there are more arguments than registers?

These cases are the reason for the Pascal convention, according to Walton. https://mail.mozilla.org/pipermail/rust-dev/2012-January/001...


In that case you can clone the function. Internal linkage calls can use the fastcc calling convention while you leave either a whole clone with the non-fastcc calling convention or a non-fastcc wrapper of the fastcc function for calls from external linkage.


We'll just have to put Rust up against MLton and see if tail calls really are so expensive.


If I understand it correctly, the Rust issue is that tail calls are expensive in the context of LLVM and the calling convention that they are using. MLton (like Manticore, which I work on) uses a different code generation backend and calling convention entirely. And neither of us worry about "destructors" or, in general, things that happen at every return point.

Though, interestingly, in the context of a CPS-converted program, destructors would just be the next continuation called on the way to calling the procedure previously "returned to." Under our (Manticore) calling convention, we would just jump to each of those continuations and wouldn't have to grow stack (though in our case, it's keeping around heap frames, as we're not stack-based), and would have the same next allocation and instruction-level behavior as a tail call.

That said, if you think tail calls are hard to debug, CPS-converted programs would destroy most people's will to live. No stack backtraces, except in debug modes, with 5-10x performance penalties.


If you don't have any interesting control operators like call/cc, prompt/control, etc. then it should be possible to track stack-like control patterns after CPS conversion through the compiler and through DWARF (or a custom debugging format). Unless I am missing something it seems that the control operators are the problem, not CPS.


Certainly, that's true! As I mentioned in my other far-too-wordy response, though, if you just CPS convert and run a program, it will be far too slow (5x the funcalls != 5x the fun...). So, it's the combination of CPS conversion and later optimizations --- paired with whole-program compilation --- that make it so hard to debug.


I never thought tail calls were hard to debug, but can you keep text offsets to the original tokens through the CPS transform? Usually with typed languages that's enough information to figure out what went wrong or at least what to log.

Also, I'm not sure why the entire library needs the same calling convention when it's such bad form to expose all of your functions anyway.


You can keep the original tokens. The challenge with debugging such programs is really the integration with the rest of the optimizations. CPS transformation turns your program into a ton of tiny functions. Then, any reasonable optimizer will do a huge amount of inlining. Attempting to do trace-based debugging bounces you from one subexpression in one function to another and so forth. Especially when you consider that Manticore (like MLton) is a whole-program optimizing compiler, you also get little chunks of library functions we've duplicated like map, foldr, etc.

Worse, we liberally remove dead code, including unused arguments, branches of conditionals we can guarantee are never called, etc. And since it's a research compiler, there isn't really a "-O0". You can turn off individual optimization passes, but guessing what combination led to something getting inlined or not requires some careful study.

As to why they've chosen what they have with LLVM, I don't know enough to judge. Doing something like ML compilers do with a custom calling convention internally and then a C convention externally is pretty expensive, and ends up putting a massive penalty on C calls (oh, you want to call C? let me move the GC pointer out of the way, set up a fake stack frame for you, etc.). Further, doing it makes register allocation more of a challenge. Many ML compilers like to "pin" certain registers with their own values (e.g., the GC local heap limit pointer) so that we can write custom little blobs of assembly that get emitted in the right places without worrying about substituting in what the _real_ heap limit pointer is, etc. That pinning interacts badly with LLVM, as you can see from the Haskell/LLVM master's paper - they basically had to give up on it and just add their pinned values as extra arguments and then stop emitting code that relied on it being in a sane, typical place.

Hope that helps! I've been thinking about these problems and talking only with other people who do the same for so many years I'm starting to forget which parts are and aren't obvious (or even published/documented).


Very helpful, thank you!

I guess another problem would be that people often complain about how slow these kinds of compilers can be, and if you keep around enough to reconstruct good error messages it would probably make it that much slower.


Most of the time in whole-program compilers such as Manticore and MLton (> 60%, though it depends on the particular program) is spent in the combination of parsing the input source files and running GCC over the generated assembly to produce the binary.

Many compilers (including ours) did not keep around that extra information because even in 2007 (when we started) RAM was a bit scarce. The cost associated with keeping that info around between phases is primarily in the extra working set hit and the resulting GC and especially memory paging issues. Given that it's not unreasonable to expect people to have > 1GB of available physical RAM for the compilation process these days, that's something we should consider changing. Though at this point threading that info through the compiler is probably a couple weeks of dedicated effort to get working correctly.


> (like Manticore, which I work on)

Nice! I have had envious eyes for Manticore as a modern replacement for Concurrent ML. Are you guys intending for it to be available for general use or more of a research language?


It's still a research language today. Honestly, we would need to make a fairly major push --- which would freeze all research work and publications --- to get to a point where it is ready for general use. We're working on some grant proposals along those lines, because frankly without NSF support (in the context of making it ready for wider use in classes such as CMU's new intro curriculum), it's unlikely we can justify the massive amount of work required to take it from something that PL experts use to something that, say, MLton or SML/NJ users could pick up and use.

I hate to sound mercenary, but frankly if we can't significantly increase the project from the current number of developers (myself and two part-time undergrads), it's difficult to see how we can make it more generally available. Especially when I lose a couple of months of work every time we double the number of processors in a server-class machine, as there's always either a GC bug or some scalability issue still lurking in the runtime...


Well, technically, "it screws up stack traces" is the usual excuse given for why the JVM doesn't support it.


Python doesn't optimize any tail calls at all. Rust optimizes sibling calls, which includes all self-tail calls. It's only tail calls to a different function that Rust doesn't eliminate.


I imagine the same people who complain about lack of TCO in Python are also complaining about lack of TCO in Rust and Clojure.


Who says they are both cool? HN isn't a single person. It is quite possible that group A thinks Rust and Clojure are "cool", and group B thinks that languages without TCO (including Python, Clojure and Rust) are bad.


A TCO-support discussion on Y combinator.

OH IRONY.


Elaborate for the ignorant.


The Y Combinator is something important in the theory of functional programming. It takes one function as argument and applies it to itself ad infinitum. Together with lazy evaluation this is how you can implement recursion in a language which does not natively support it, namely (lazy) lambda calculus.

http://en.wikipedia.org/wiki/Fixed-point_combinator#Y_combin...

Actually, TCO and Y are not really related with each other apart from the fact that you usually learn about them in the same course about functional programming.


If you write your loops with the Y-combinator, you need tail-call elimination to keep them from overflowing the stack when they run for enough iterations, because the way your function f goes on to the next iteration of itself is by calling the (Y f) that was passed as its first argument.


You can write a lambda calculus interpreter which does beta-reduction. In this case, there is no stack which could overflow. Just a lambda expression, which is reduced to another expression.


Yes, as I mentioned in another comment in this thread, TCE falls out automatically from combinator-graph reduction.


The Y combinator is a way to express recursion in lambda calculus. Tail call optimization (turning certain function calls into GOTOs) is the traditional way to keep from overflowing the control stack in languages based on lambda calculus.


The problem seems to be that you can't do tail calls with the C calling convention.

For example, let's say a 0-argument function tail calls a 1-argument one: the 1-argument function expects an argument on the stack, so the 0-argument function must push one.

However, when the 1-argument function returns, the argument will still be on the stack, because with the C calling convention the caller removes arguments from the stack.

But the caller called a 0-argument function, so he won't remove any argument, and thus the stack is now misaligned due to the argument left over there, which will crash the program soon.

However, just switching to the Pascal/stdcall convention, where the callee removes arguments from the stack, should just work; it might be slightly slower, but on all modern architectures (i.e. those that aren't x86-32) parameters are going to be passed in registers anyway for most functions, so it shouldn't matter.

The problem with that is that non-varargs functions cannot be called with a varargs prototype; this is an issue with K&R C, which allows calling undeclared functions, but isn't an issue with Rust.

So, I'm not quite sure why Rust doesn't just switch calling conventions and support tail calls.
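
For what it's worth, Rust does let you pick a calling convention per function at the FFI boundary, which is enough to illustrate the caller-cleans vs. callee-cleans distinction described above (a sketch; `stdcall` is only meaningful on x86-32, and other targets may warn or fall back to the C convention):

    // cdecl-style: the *caller* pops the arguments after the
    // call, so it must know exactly how many it pushed.
    extern "C" fn add_c(a: i32, b: i32) -> i32 {
        a + b
    }

    // stdcall-style: the *callee* pops its own arguments before
    // returning, which is what makes tail calls between functions
    // of different arity workable.
    extern "stdcall" fn add_std(a: i32, b: i32) -> i32 {
        a + b
    }

    fn main() {
        assert_eq!(add_c(1, 2), add_std(1, 2));
    }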


Here is some information about TCE in LLVM from the lead developer: http://nondot.org/sabre/LLVMNotes/GuaranteedEfficientTailCal...


Screw it, we might as well use Go

Seriously though, why not just use a rec keyword and then disallow any of the things that don't play nicely when you're in the recursive function? If you really wanted to be cool you could put those things right into the type system.


As they mentioned in the original post, a big problem is that TCO does not play nice with two other important features they want.

1. deterministic destructors that run at the end of functions (therefore making things that look like tail calls not actually tail calls) and

2. binary compatibility with C and C++ libraries and tools (they say that tail recursion doesn't let you use the C calling conventions that these tools and libraries expect you to use)

There is no point in allowing tail recursion in restricted contexts if you can't use these restricted functions to do the sort of stuff Rust was actually made to do.


> 1. deterministic destructors that run at the end of functions (therefore making things that look like tail calls not actually tail calls) and

Couldn't owned boxes (~) passed through the jump transfer their ownership to the next frame (so they'll be destructed by whoever finally does return), and everything else just get destructed before the jump?


Yeah, we discussed the possibility of doing this. But we eventually decided that it wasn't worth it due to these issues:

* It's surprising behavior for destructors, which can have side effects. C++/D RAII and try/finally in other languages never behave like this.

* You can do it yourself, by moving the values-to-be-destructed into the bit bucket with "let _ = ..." (sketched below);

* You still have the ABI issues in that you can't use cdecl and have to use Pascal/stdcall to get TCO.
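
The second bullet looks like this in practice (a sketch in modern Rust syntax, where `std::mem::drop` is the clearer spelling of the same move-into-the-bit-bucket trick):

    struct Guard;

    impl Drop for Guard {
        fn drop(&mut self) {
            println!("cleanup");
        }
    }

    fn step(n: u64) -> u64 {
        let g = Guard;
        // Normally `g` is destroyed *after* the recursive call
        // returns, so the call below would not be a tail call.
        // Running the destructor by hand first restores tail
        // position (though rustc still makes no TCE guarantee).
        drop(g);
        if n == 0 { 0 } else { step(n - 1) }
    }

    fn main() {
        step(3); // prints "cleanup" four times
    }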


> binary compatibility with C and C++ libraries and tools (they say that tail recursion doesn't let you use the C calling conventions that these tools and libraries expect you to use)

I've read variations of this comment about Rust C++ compatibility a few times, but haven't managed to find a source. Any references you could point me to?


Apparently sanxiyn is planning on looking into adding support at a similar level to D: http://www.reddit.com/r/rust/comments/1c3clf/c_ffi/c9codm1


AFAIK there's no plan to support any sort of C++ interop natively in Rust. However, it should work just fine if your C++ code exposes a C-compatible interface. Servo has to be able to call into SpiderMonkey somehow.


GCC C++ ABIs are unstable enough that cross-platform C++ libraries usually export a C-compatible interface already.


g++ vs itself, or g++ vs msvc? g++ uses Itanium C++ ABI on at least linux/mac/mingw, and the mangling and vtable layouts seem to be the same (I'm very interested in any information to the contrary - though I realize vtables are only part of the story).


Itanic may be an exception, but there have been at least two backwards-incompatible C++ ABI changes in G++ over the years. Casting to superclass in the presence of multiple inheritance and exception handling have been among the changes. Maybe they should change vtables too, I don't know.


Figured as much - it would be hard to take any language seriously without C interop. But that's a long way from "C++ binary compatibility." I've heard this several times, so I was hoping it might be true.


I've just learned that one of the Samsungers working on Rust has tentative plans to implement C++ interop in the same fashion as D.

http://www.reddit.com/r/rust/comments/1c3clf/c_ffi/c9codm1

In any case, I'm not sure that any official Rust source has ever tried to claim "C++ binary compatibility". Perhaps this language is a result of people misinterpreting the fact that Rust lays out structs in memory in the same fashion as C and C++.


Cool, thanks. I had not seen anything official either - just pipe-dreaming that one of those commenters I saw knew something special.


They don't want to be cool. They want to create a usable, pragmatic and fast language.


At what point did people start to use the word "pragmatic" for "opinionated"? (I've heard it a lot lately.) I'm OK with opinionated technologies, but stop calling them "pragmatic". When the implementer of a language or technology makes a choice that restricts how its users can do some things, that is an opinionated decision; it just happens to be pragmatic in the use-context the implementer has thought of, and it may not be pragmatic at all for the use-context that some user imagines. Think of a language feature that may be hard-to-impossible to implement, but that, when done properly (and it only has to be done once), would simplify the lives of a large group of programmers who just want to do things in a certain way.


>at what point did people start to use the word "pragmatic" for "opinionated"?

I don't think people are doing that. It's just that being pragmatic also makes you opinionated by necessity.

Actually that would be a very good definition: being pragmatic means you are "opinionated by necessity" (as opposed to opinionated by other concerns, e.g. pureness, advancing the state of the art, etc). So being pragmatic is a subset of being opinionated, not a synonym.

>when the implementer of a language or technology makes a choice that restricts how its users can do some things, it makes an opinionated decision, that just happens to be pragmatic in the use-context he has thought of, but may not be pragmatic at all for the use-context that some technology user imagines

A language has thousands (millions) of users and every one can imagine whatever use-context. Not catering to all of them doesn't make you opinionated: merely pragmatic. Nobody has the time and the means to cater to all possible use-contexts. And we all know that adding too many things can ruin a language, over-complicate it and such. So that's another pragmatic concern.

>like a language feature that may be hard-to-impossible to implement but that when done properly (and it only has to be done once) would simplify the lives of a large group of programmers that just want to do things in a certain way.

Sounds like a very horrible feature to build into a language. Especially when you are starting out. If it's "hard-to-impossible to implement" then you are wasting tons of resources to something that might very well not pan out in the end.

To reject this is not opinionated in the bad way. It's merely being pragmatic again, i.e. understanding that you have to prioritize things, and not go on a wild goose chase.


nitpicking: the extension of pragmatic is a subset of the extension of opinionated. The word pragmatic is a hyponym of opinionated.


Sounds mostly like a pile of empty words to me. Usable and pragmatic are purely contextual. Is Rust faster than Haskell?


>Sounds mostly like a pile of empty words to me.

All words are empty until we put meaning on them. Go and read what they are trying to achieve and how, and then they won't be empty anymore.

>Usable and pragmatic are purely contextual. Is rust faster than Haskell?

It's not highly optimized yet, as it's in the pre-alpha stage. But its goal is to be faster than Haskell, and close to C/C++/Ada speed.

Usable and pragmatic means that it should work for their goals, which are very real and tangible themselves: to use it as a compiled language to create a fast, parallel and secure web browser engine.

They don't want to pile on academic concepts and programming features or compiler tricks just to be "cool", "interesting", or "cutting edge". They don't even care if they would be "nice to have". They care about: what helps their goals, and what can be implemented without overcomplicating things.

If Rust cannot become a language in which it's able to write a fast (faster than the currently available), parallel (more parallel than the currently available) and safe (safer than the currently available) browser engine, then it would have failed on its targets.


My point above about Go (put somewhat trollishly, I admit) was that if Rust forsakes the things which make functional programming great, what about it is compelling? I was under the impression that it was supposed to be the functional systems programming alternative.


Well, there are a couple compelling points about Rust:

- deterministic memory management

- actual generics. :P

- an advanced type system (vs. java/c++)

- design choices taken towards efficiency

- default-immutable memory


I think your impression is mistaken. While functional style programming is supported, I have never gotten the impression that it's the main goal: http://static.rust-lang.org/doc/tutorial.html#introduction


> I was under the impression that it was supposed to be the functional systems programming alternative.

The elevator pitch is "Speed of C++, safety of ML, concurrency of Erlang."


Note that this doesn't say "concision of ML", although clearly they're aiming for that wherever possible.

(Is concision really what you lose when you need explicit looping constructs? Maybe "abstraction capability"?)


Wouldn't self/co-recursion be a safety of ML feature? Otherwise wouldn't you need mutation to do anything useful without worrying about your stack?


I think you're mixing up tail recursion with general recursion. Rust absolutely supports recursion: the implementation uses segmented stacks to allow for stacks to grow dynamically. Tail call support is about not having the stack grow but only in the subset of recursive calls that are in tail position.

So support for tail calls is not about safety but expressiveness, and Rust chooses to express iterative algorithms through other means, such as the iteration protocol that's built on top of higher-order functions.

EDIT: Maybe I'm misreading you and you only mean the imperative nature of loops forces you to use mutation. Rust does not eschew mutation altogether. Even as a functional programmer myself, I'd argue pretty emphatically that mutation, particularly of local variables, is not a grave safety concern.


From a certain perspective, yes. You're confusing the tool with its purpose.


What do you mean? Combinators are great (even if implemented imperatively) until you really care about performance, and isn't that what this Rust discussion is all about?

Did you mean something other than combinators?


I mean that 'safety' means a wide variety of things, and you're conflating the PLT feature with the more general reason why that feature was needed.

For example, above, you mention how TCO lets you use first-class functions to help you eliminate mutability. It's the elimination of mutability that's the safety here. Rust allows you to eliminate mutability through other means.


Their specific performance-related target is to have predictable, low worst-case latencies. In throughput-style loads, they might well lose to Haskell most of the time.

For some use cases (games, web browsers, etc) the first goal is much more important than the second.


Coming from haskell, I just like the fact that rust has a decent record syntax built in :)


They eventually want to be competitive against C++ code and it seems they won't support TCO as that would harm their performance target.


Not having TCO might lead to a much bigger problem than not hitting a performance target for large numbers of loop iterations. I'd rather have a program that is slightly slower and that works correctly for all input than one that has a built-in - but invisible and hard to test for - hard limit.


I assume you mean that without TCO the stack might overflow for certain inputs and it is easy to miss those cases during testing.

Is that actually a real problem? I never hear C/C++/Java/Python/Ruby/Javascript/D/Go/Clojure programmers complain about such bugs. On a Linux/Windows/OS X system the stacks are big enough to practically never hit such bugs. Whenever I get a stack overflow, I coded an infinite loop, which is actually easier to find without TCO, because the program is terminated.


C/C++ etc don't have that problem because they have loops.

TCO is a way to turn recursion into a loop. If every situation where you would elegantly use recursion ended up as real recursion, with all its overhead, you'd indeed end up overflowing the stack. For a C/C++ etc. program it would not matter whether your loop iterated 10 or 10,000,000,000 times; the program would just run longer. The Rust/Clojure program (coded up without knowledge of the underlying mechanisms, so 'naively') would likely run out of memory. With TCO such a naive implementation would run just fine. 10,000,000,000 is merely large, not infinite, and should, all other things being equal, not come with a huge memory penalty if all you're doing is writing elegant code in the most applicable idiom for a certain language.


I can't speak for Clojure, but the idiomatic Rust code that I've seen does tend toward C-style iteration (though that iteration is discreetly accomplished via closures and higher-order functions) rather than Lisp-style recursion. So I wouldn't expect recursive memory exhaustion to be a common concern in idiomatic Rust, though that's obviously no consolation if you were hoping to code in a recursive style.

Though perhaps you'll be happy to hear that Rust stacks are growable (Go-style) rather than fixed, so stacks are typically small and running out of stack is (afaict) theoretically less of a concern than in C/C++.


The most applicable idiom for Rust is to use looping with higher-order functions. Rust code doesn't really favor recursion.


Yes, but that may very well be because it does not do TCO... What's idiomatic in a language is usually the result of how efficiently that language implements certain constructs.


Even if we implemented TCO, I doubt it would be idiomatic because destructors reduce the number of tail call positions so dramatically.


Right, because any destructor would break your ability to do TCO assuming that it runs before the tail call... So as soon as you use an object mechanism that will invoke a destructor at some point your ability to use TCO is gone.

Interesting, isn't there a possible transformation where you would be able to move the constructor/destructor pairs out of the generated loop construct without breaking the function?

After all a tail call is just a fancy go-to under the hood.


The thing is that most C and C++ compilers actually do TCO when applying -O2 or similar, if they see the possibility to do so.


As I stated above, the C and C++ optimization is only sibling call optimization, which Rust does too.


Ah ok, I missed that, thanks for pointing it out.


Rust has/had a `be` keyword reserved for potentially implementing TCO.



