I've been wondering about garbage collection in pure functional languages lately. It is often said that statically typed pure functional languages allow the compiler to do a lot more correctness checking. Shouldn't the same formal reasoning features also enable much better garbage collection algorithms? Or would that be asking for a solution to the halting problem?
There are lots of interesting optimizations enabled by immutable data. E.g., in generational collectors old data can never point at new data, which can be exploited for additional performance.
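A minimal sketch of why (example mine, not from the thread): with immutable data, a newly allocated value can only reference values that already existed when it was built, so pointers only ever go from newer to older objects.

    -- xs is allocated first and will end up in an older generation.
    xs :: [Int]
    xs = [1, 2, 3]

    -- ys is allocated later; its new cons cell points at the older xs.
    ys :: [Int]
    ys = 0 : xs

    -- Without mutation there is no way to make xs point back at ys,
    -- so the GC can collect the young generation without scanning old data.
    main :: IO ()
    main = print ys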
The GHC runtime isn't an example of immutable data, though: Haskell code actually results in a lot of mutation behind the scenes. It's quite possible for older generations to point to data in newer generations in Haskell, and even likely with some usage patterns.
Given that you already know this, I assume you must be talking about strict functional languages specifically.
Are you referring to thunk update after lazy evaluation? This is a very special kind of mutation, because after the update, it is guaranteed never to be mutated again. This is quite helpful for generational GC, actually, and underlies the concept of "eager promotion".
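A small illustration of the one-shot nature of thunk update, using Debug.Trace (example mine): the trace message fires only on the first force, because the runtime overwrites the thunk with its value.

    import Debug.Trace (trace)

    main :: IO ()
    main = do
      let x = trace "thunk forced" (6 * 7 :: Int)  -- x starts life as a thunk
      print x  -- forces x: prints "thunk forced", then 42
      print x  -- the heap cell now holds 42; no trace, no re-evaluation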
With a purely linear type system, memory management could be handled statically, obviating the need for traditional GC. The underlying behavior would be closer to C-style just-in-time {de-}allocation, or C++ RAII. LinearML is an attempt at this: https://github.com/pikatchu/LinearML
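For a feel of the idea, a toy sketch using GHC's much later -XLinearTypes extension rather than LinearML syntax (all names mine): a linear argument must be consumed exactly once, which is the guarantee a compiler needs to insert a single, static deallocation.

    {-# LANGUAGE LinearTypes #-}
    module LinearSketch where

    -- A toy resource; imagine it wrapping a raw allocation.
    data Token = Token ()

    -- (%1 ->) means the argument must be consumed exactly once, which is
    -- what lets the compiler place the one free() call statically.
    close :: Token %1 -> ()
    close (Token unit) = unit

    -- Both of these would be compile-time errors:
    -- dup t  = (t, t)   -- uses the token twice
    -- leak t = ()       -- never consumes the token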
Regarding Linear Type Systems and friends I recommend reading parts of Jesse Tov's PhD thesis [1], "Practical Programming with Substructural Types". I found parts of it pretty accessible as an undergraduate. In particular he uses substructural types to put types on the network protocol of a server and client. Neat stuff.
There's also the wikipedia page on Substructural Types [2].
Don't know of any articles, but there are a number of papers on linear logic and linear type theory. Most are pretty dense, but the first few pages of Wadler's "Linear types can change the world!" give a very readable overview of the area. http://homepages.inf.ed.ac.uk/wadler/papers/linear/linear.ps
It should, yes. AIUI the state of the art in the JVM is that we can do all the marking and sweeping in parallel; it's just heap compaction that requires a "stop the world" pause. If the VM had visibility into which objects were referentially transparent, it should be possible to compact those easily.
(Sadly even that doesn't mean a truly pauseless garbage collector - the "heap of non-referentially transparent objects" will still eventually fragment to the point of filling up)
Installing GHC manually is not hard; the website even provides pre-compiled binaries. I started doing it for different reasons, but was surprised how painless the process was. After doing it multiple times, I put together a file that guides me through the process, which you can find here: https://github.com/quchen/articles/blob/master/install_haske...
Tons of advantages to doing it that way, including the ability to easily maintain multiple versions of both GHC and Platform on the same system and swap between them with a single command. Also doesn't clutter up /usr/local/ with binaries not managed by apt or dpkg.
More generally, update-alternatives is a godsend for Stable users. It's like RVM or rbenv in the Ruby world: it lets you install and manage multiple versions of the same platform and easily swap between them with a single command, including the system version from the repos if you want. It frees you from out-of-date software in the repos while still providing a system to manage the complexities of it all.
What are the problems with building Haskell from source? (I did it recently; it takes a long time, and you have to install a bunch of other random things... But at least it works in the end. I guess that sucks for devops?)
I run into far more version conflicts installing Haskell software through Cabal than in any other language environment. Fortunately there is a solution: cabal-dev gives you per-directory package repositories.
> Cabal is a package manager that by default compiles packages either to a system database (if invoked as root/admin) or a per-user database. Since Haskell code is very often fussy about exact version numbers, and because Cabal offers essentially no way to uninstall packages, this is very painful. Cabal-dev is a front-end to Cabal that gives you per-directory repositories: suddenly conflicts go away, and you can remove packages by deleting the directory's repository and reinstalling everything again. It's a kludge, but it is very helpful.
> I would say, "You should never mutate your user or global package database" and recommend using cabal-dev. The way cabal-install does destructive updates is evil. In particular, it likes to mutate your global or user package database. You run the risk of building something and then breaking it later by mutating its dependencies.
If it's your local machine, you may be able to pull it from experimental when it comes out. If it's a server, can't you just compile everything statically and deploy the resulting executable?
Does anyone know why functional programming languages are so much favored by research teams over other types of PL?
Every time I hear of someone doing PL research, it's always on some kind of functional programming language such as Haskell or ML.
It's not about functional programming languages per se; it's just that core/academic PL research often focuses much more on concepts than on implementation, and languages with strong type systems (all of which are functional, in part because OO/subtyping is hard to reason about formally) are ideal for encoding those concepts.
There is, however, a lot of implementation research going on in other languages as well. Most GC research, for example, is done using the JVM and Java programs. For this particular paper it would be really hard to choose a different platform, though, because GHC is one of the rare industrial runtimes that offer lightweight threads.
> (all of which are functional, in part also because OO/subtyping is hard to reason about formally)
OO/subtyping is NOT incompatible with functional programming, nor is it inherently at odds with formal reasoning. OCaml has both imperative and functional objects; Haskell, Mercury, and Coq (a formal logic language) all support typeclasses, which subsume most of OO and provide subtyping.
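To make that concrete, a small Haskell sketch (names mine): a typeclass plays the role of an interface, and an existential wrapper recovers the subtype-style heterogeneous collection you'd get from OO.

    {-# LANGUAGE ExistentialQuantification #-}

    -- A typeclass as an OO-style interface.
    class Shape a where
      area :: a -> Double

    data Circle = Circle Double  -- radius
    data Square = Square Double  -- side

    instance Shape Circle where
      area (Circle r) = pi * r * r

    instance Shape Square where
      area (Square s) = s * s

    -- An existential "supertype" of all Shapes, so we can mix them
    -- in one list, as OO subtyping would allow.
    data AnyShape = forall s. Shape s => AnyShape s

    anyArea :: AnyShape -> Double
    anyArea (AnyShape s) = area s

    main :: IO ()
    main = print (sum (map anyArea [AnyShape (Circle 1), AnyShape (Square 2)]))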
Additionally, strong typing has little to do with formal reasoning (see: Lisp/Scheme/the lambda calculus). What makes functional languages amenable to formal methods is that they are functional and therefore referentially transparent. Mathematical proofs carry no implicit concept of "state"; hence if you are trying to prove anything about code in, say, C, you need to augment the code with explicit state and remove all non-local effects. (See: the Why language, which attempts to bridge this gap.)
Algebraic/generalized-algebraic type constructors used by most functional languages don't hurt either, as they allow programs to construct complex terms without relying on lower-level stateful abstractions such as memory allocation.
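For instance (my example), an expression language as an ADT; terms are built and taken apart with no visible allocation or pointer plumbing:

    data Expr = Lit Int | Add Expr Expr | Mul Expr Expr

    eval :: Expr -> Int
    eval (Lit n)   = n
    eval (Add a b) = eval a + eval b
    eval (Mul a b) = eval a * eval b

    main :: IO ()
    main = print (eval (Add (Lit 2) (Mul (Lit 3) (Lit 4))))  -- 14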
> hence if you are trying to prove anything about code in, say, C, you need to augment the code with explicit state and remove all non-local effects. (See: the Why language, which attempts to bridge this gap.)
Actually, you just need to augment your model with notions of state, which is standard in operational semantics. It can just be harder to prove things in a complex model, so theorists prefer simplification when possible.
Not inherently, no, but practically, yes. The type system of, say, Java isn't even formally sound: arrays are covariant in their element type, so storing into an array can fail at runtime with an ArrayStoreException. That would not be permissible in a sound subtype-based type system such as OCaml's.
> all of which are functional, in part also because OO/subtyping is hard to reason about formally
OO subtyping is hard to reason about formally in formal systems designed for FP...news at 11! I jest.
The reason FP does so well academically is that no one knows how to rigorously evaluate PL work that isn't either theoretical (and FP is close to math anyway) or performance-based (most other PL work).
The Haskell crowd is also very creative, so they do a lot of cool stuff. However, many of their ideas are transferable to other languages without the FP ideology.
The OO people have something to offer also, but we've been kind of muted lately.
The problem isn't that `is-a` is hard; it's the divergence of `is-subtype-of` and `is-subclass-of`. Take a look here; I think it explains it rather well.
http://okmij.org/ftp/Computation/Subtyping/
I would argue that nominative subtyping is natural (basically sub-classing), since it matches our ability to assign arbitrary meaning and relationships to words, which is incredibly useful when having a conversation. Yeah, maybe a set isn't exactly a bag, but close enough. Birds mostly fly, but penguins don't; we can deal with that.
But the main problem with subtyping, nominative or structural, is how it messes up Hindley-Damas-Milner-style type inference. Then again, it shouldn't be surprising that an FP theory of type inference wouldn't work well for OOP.
No, these are true lightweight threads (implemented on top of an OS-thread pool, just as they are in Erlang and, I suppose, Haskell). You can have hundreds of thousands or even millions of these on a single machine. They can block for IO or for any other kind of synchronization (message passing etc.), just like regular threads.
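For scale, a minimal GHC example (mine): spawning 100,000 forkIO threads, each handing a value back through an MVar, is routine. (Compile with -threaded to run them over multiple OS threads.)

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (forM_, replicateM_)

    main :: IO ()
    main = do
      done <- newEmptyMVar
      let n = 100000 :: Int
      -- Each green thread costs roughly a kilobyte of heap; the runtime
      -- multiplexes all of them over a small pool of OS threads.
      forM_ [1 .. n] $ \i -> forkIO (putMVar done i)
      replicateM_ n (takeMVar done)
      putStrLn "all threads done"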
Whenever I hear "lightweight threads", my PTSD from experience with Java "green threads" comes back. Of course, my rational mind understands that many years have passed, that other runtimes may implement things differently, and that there are many benefits vs. native threads, and yet...
Absolutely. Currently fibers (lightweight threads) may only be preempted when performing a blocking operation (IO or waiting on some synchronization mechanism). We will implement time-slice-based preemption if we see a need for it.
I don't know what you mean by "special" (nor do I understand which "claims" you take issue with), but these are true lightweight threads. We use runtime bytecode transformation to create continuations, so that we can suspend and later resume a call stack. These continuations are then scheduled on a thread pool. It works exactly like any other lightweight thread implementation AFAIK, except that in languages such as Erlang or Haskell these continuations are created by the compiler or the runtime, while in Quasar they're created by bytecode transformation. The JVM doesn't support continuations out of the box, but its instrumentation mechanism allows you to implement them in a library. It isn't any more, or less, special than lightweight threads in Erlang or Haskell.
Doesn't matter, really. For the GC research I mentioned, they usually just change one of the JVMs, so all they care about is bytecode. Then you need real programs to test on (improvements in throughput, pause times, fragmentation, etc.), which usually means Java programs, but the source language really doesn't matter.
> Does anyone know why functional programming languages are so much favored by research teams over other types of PL?
That's like asking why algebra is preferred among mathematicians over randomly pushing operators and parentheses around hoping that it will work this time.
Because (static) FP is built on a sound and rich theory (type theory and the lambda calculus), unlike other language paradigms (e.g. OO). There are also different approaches, IMHO:
FPL research: Ok, we have this cool way to describe our computation. But how can we efficiently map this to the HW? In essence build a nice language and find an implementation.
Imperative PL research: Ok, we have this HW, how can we build an expressive PL on top of it? In essence build an implementation (the HW) and find a language.
Microprocessors evolved while chasing minicomputer features, which in turn were shaped by C/UNIX sorts of ideas, which expanded and fed back. That was a very successful co-evolution. COTS supercomputers beat specialized hardware and specialized forms of parallelism not because they were abstractly better, but because markets and economies of scale drive whatever is popular to be the best performing and cheapest over time. (Down to $30 UNIX systems on a board.)
The two language schools you describe seem to be asking whether it is time (whether we have enough capacity) to simply float a new more mathematical world view on top of it all, or whether we should continue to play toward the strengths of the commodity hardware stack.
I personally think that something like Go works for my mind, and the hardware. (I will go make coffee now, rather than "define coffee" and wait for it to appear ;-)
The HN community is particularly receptive to functional programming. HN itself is written in Arc, a language implemented in Racket.
There's still active research on Fortran, e.g. for supercomputers; check out the SIGPLAN Fortran Forum [1]. OOPSLA/SPLASH has lots of non-functional language research [2].
Also check out PLDI, arguably one of the most prestigious PL conferences. Scala, Dart, and LLVM feature in their tutorials section [3].
HN also sits inside a center tag using a table with width="85%", as I recently found out while trying to override its CSS rules. It's bizarre that an organization so forward-looking uses deprecated HTML tags.
For this specific example? This isn't really a research project, in the normal sense of the term. This is a report of a very thorough engineering effort to fix the scaling of the IO manager already in GHC. It doesn't invent anything new. It does detail every issue they found and how they dealt with it.
One thing I'm surprised no one has mentioned is that they found a race condition in epoll that has existed since version 2.4 of the Linux kernel.
FP is a fertile topic for CS research in part because writing good compilers for FP languages is hard and poses lots of challenging questions for people to try to answer.
Most really low-level code generation and parsing research will apply across the board, or at least cover a wide range of language types (e.g. register allocation, instruction selection).
But consider type theory: a language like Haskell provides much wider scope for research in that area than, e.g., C. Or garbage collection: admittedly that's more widely applicable, but many FP languages throw an extra factor into the mix with immutable data. Etc.
But there's certainly other PL research as well, e.g. all the work that's gone into JavaScript compilers in the last few years. Trace trees, for example, came out of work done on a JVM.
I know that part of the reason I like FP is that a compiler is usually a pretty perfect pure function - source code -> machine code. So those people researching languages will often find that FP is the best tool for their job and will be most familiar with it. Then when they come to research something new they do it in the context they're most familiar with.
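Sketching that idea with placeholder types (all names hypothetical, bodies stubbed): each pass is a pure function, and the compiler is their composition.

    newtype SourceCode  = SourceCode String
    newtype AST         = AST String
    newtype TypedAST    = TypedAST String
    newtype MachineCode = MachineCode String

    parse :: SourceCode -> AST
    parse (SourceCode s) = AST s              -- stub

    typecheck :: AST -> TypedAST
    typecheck (AST s) = TypedAST s            -- stub

    codegen :: TypedAST -> MachineCode
    codegen (TypedAST s) = MachineCode s      -- stub

    -- The whole compiler: one pure function from source to machine code.
    compile :: SourceCode -> MachineCode
    compile = codegen . typecheck . parse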
Pure functions handle state without any issues (quite elegantly, actually). They simply can't have mutable state that is globally observable. For instance, you could have a struct, and every time you 'mutate' the struct, you actually clone it, setting the relevant field to the new value as you instantiate the copy.
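In Haskell, that clone-with-one-field-changed pattern is just record update (sketch mine):

    data Account = Account { owner :: String, balance :: Int }
      deriving Show

    -- "Mutating" the balance builds a new Account sharing the untouched
    -- fields; the original value is never modified.
    deposit :: Int -> Account -> Account
    deposit amt acct = acct { balance = balance acct + amt }

    main :: IO ()
    main = do
      let a  = Account { owner = "alice", balance = 100 }
          a' = deposit 50 a
      print a   -- balance = 100, unchanged
      print a'  -- balance = 150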
Functional programming languages have mechanisms that make this process pure, simple, and efficient, and you get the benefit of never mutating an existing value (which matters when several different functions may hold a reference to it, among other reasons).
See also: the Haskell State monad or lenses. These are pure mechanisms that Haskell provides to update state without mutating any data.
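For instance, a fresh-name counter in the State monad (assuming the mtl package; example mine):

    import Control.Monad.State (State, get, put, runState)

    -- The "state" is just a value threaded through the computation,
    -- never mutated in place.
    fresh :: State Int Int
    fresh = do
      n <- get
      put (n + 1)
      return n

    main :: IO ()
    main = print (runState (mapM (const fresh) "abc") 0)
    -- prints ([0,1,2],3)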
Edit: In the event that you do need shared mutable state, the Haskell STM (Software Transactional Memory) monad is your answer.
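A minimal STM sketch (assuming the stm package; example mine):

    import Control.Concurrent.STM (atomically, modifyTVar', newTVarIO, readTVarIO)

    main :: IO ()
    main = do
      counter <- newTVarIO (0 :: Int)
      -- Transactions compose and retry on conflict, so concurrent
      -- increments never lose updates.
      atomically (modifyTVar' counter (+ 1))
      readTVarIO counter >>= print  -- 1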
The ST monad [1] provides an efficient hashtable now; doing hashtables as pure functional data structures is well known to be slow.
I would hope one wouldn't need to resort to STM for a compiler, especially since retry behavior is fairly undefined! Still, that says nothing about iterative computation; I wonder how the ST monad would deal with a Y combinator.
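For what it's worth, iteration inside ST is just a monadic loop; a sketch (mine) with an STRef rather than a hashtable, same idea:

    import Control.Monad.ST (runST)
    import Data.STRef (modifySTRef', newSTRef, readSTRef)

    -- Mutation happens freely inside ST, but runST guarantees none of
    -- it leaks: sumST is a pure function to its callers.
    sumST :: [Int] -> Int
    sumST xs = runST $ do
      ref <- newSTRef 0
      mapM_ (\x -> modifySTRef' ref (+ x)) xs
      readSTRef ref

    main :: IO ()
    main = print (sumST [1 .. 100])  -- 5050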
Those are just signatures; what is going on inside the boxes?
Also, how does one encode iterative computation, like a data-flow analysis, in Haskell? And symbol tables? Is it sufficient that the symbol table is encapsulated within `compile` even if it involves dictionary read/writes? I'm genuinely curious.
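One common encoding (sketch mine): iterate the transfer function to a fixed point purely, and thread the symbol table through as a persistent Map; "updates" return new tables that share structure with the old ones.

    import qualified Data.Map as Map

    -- Data-flow-style iteration: apply f until the result stops changing.
    fixpoint :: Eq a => (a -> a) -> a -> a
    fixpoint f x = let x' = f x in if x' == x then x else fixpoint f x'

    -- A symbol table as a persistent map.
    type SymTab = Map.Map String Int

    declare :: String -> Int -> SymTab -> SymTab
    declare = Map.insert

    main :: IO ()
    main = do
      print (fixpoint (\n -> min (n + 10) 42) (0 :: Int))  -- 42
      print (declare "x" 1 Map.empty)                      -- fromList [("x",1)]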
Probably not: the limiting factor for most database systems, even of the plain old boring scale-up (not scale-out) SQL kind, is disk IO. Before SSDs and virtualized storage, it was all about having cabinets full of disks (usually in expensive SANs) for fast database applications. Enterprise hard disk drives only gave you low-triple-digit IOPS; I think the highest I've ever read for a 2.5" HDD is about 450 IOPS.
If you wanted data resiliency, you used RAID60 or RAID10 or some exotic variant that would usually give you on the order of a 30-50% reduction in capacity, and sometimes as much as a 75% reduction in write IOPS. It's always been the writes that are the killer; that's the stuff you need to make sure persists. RAID6 with a crappy controller, or on an overloaded good controller, would make you pay the write-hole tax.
Commodity SSDs give you, if you're willing to take peak numbers, on the order of 45,000 IOPS. To get that kind of performance in the last decade, you'd have needed a cabinet with between 200 and 400 2.5" HDDs, depending on everything involved.
Now, that's all 2000s-era stuff. The future is bright, thanks to virtualized storage arrays turning commodity storage into massive virtual SANs, with tiered storage on consumer SSDs used like race tires: consumables, replaced periodically to maintain top performance. Even so, the biggest benchmarks I've seen for enterprise systems with multiple 10GbE adapters running raw disk IO have been on the order of 1 million IOPS. I'm sure exotic systems will put out higher numbers (after all, there are machines that support over 100 physical CPUs), but they aren't attainable and certainly aren't a broad market.
All of which is essentially to say that databases have almost always been constrained by disk IO. The SDN controller in the paper? Limited, if any, disk IO. A SQL database? It could probably be written in naive, pre-java.nio Java and IO limitations would still dwarf the overhead of the language. (There are exceptions, I'm ignoring certain overheads, etc.)
Maybe a NoSQL solution in Haskell would scream with this new IO manager, but I suspect it's going to be correctness rather than performance that drives anyone to implement their database in Haskell right now.