30 Years of Decompilation and the Unsolved Structuring Problem: Part 1

albertzeyer · on Jan 3, 2024

A funny anecdote: Some time ago, I was writing a C-to-Python translator.

(Why? Just for fun, https://github.com/albertz/PyCParser, even more just-for-fun goal was this: https://github.com/albertz/PyCPython).

It would literally translate the C code into equivalent Python code, using ctypes heavily. It was mostly straightforward, except for mapping goto (thus related to this control flow structuring problem).

Of course, there are some hacks to introduce goto in Python, which in many cases would operate on the Python bytecode, which actually has the JUMP_ABSOLUTE op, but there are also other ways (https://stackoverflow.com/questions/6959360/goto-in-python).

I could also have translated C directly to equivalent Python bytecode and not Python source code, but I really wanted to have Python source code.

My ugly solution worked basically like this: Whenever there was some goto in a function, it would translate it as follows:

First, we flatten any Python AST into a series of statements, with even more goto's added for while/for loops, if's, etc. (We can avoid doing this for all sub-ASTs that contain no goto-labels. The goal is to have all goto-labels at the top-level, not inside a sub-AST.)

Only conditional goto's can stay. All gotos and goto-labels are marked somehow as special elements. So, we end up with something like:

    x()
    y()
    <goto-label "a">
    z()
    <goto-stmnt "a">
    w()
    if v(): <goto-stmnt "a">
    q()
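
As a rough illustration of the flattening step, here is a minimal sketch. It does not work on a real Python AST like PyCParser does; it assumes a made-up toy representation where a statement is either a string or a `(kind, cond, body)` tuple, and the names `flatten`, `label`, `goto`, `if_goto` and the generated label names are all hypothetical. It also flattens everything, rather than only the sub-ASTs that contain goto-labels.

```python
import itertools

def flatten(stmts, counter=None):
    """Flatten nested ("while"/"if", cond, body) tuples into a flat list.

    Toy representation (an assumption for this sketch): a plain statement
    is a string, a compound statement is a (kind, cond, body) tuple.
    """
    counter = counter if counter is not None else itertools.count()
    out = []
    for s in stmts:
        if isinstance(s, str):
            out.append(s)  # plain statement, kept as-is
            continue
        kind, cond, body = s
        n = next(counter)  # unique suffix for the generated labels
        if kind == "while":
            start, end = f"loop{n}", f"endloop{n}"
            out.append(("label", start))
            out.append(("if_goto", f"not ({cond})", end))  # leave the loop
            out.extend(flatten(body, counter))
            out.append(("goto", start))  # jump back to the loop head
            out.append(("label", end))
        elif kind == "if":
            end = f"endif{n}"
            out.append(("if_goto", f"not ({cond})", end))  # skip the body
            out.extend(flatten(body, counter))
            out.append(("label", end))
        else:
            raise ValueError(f"unsupported statement kind: {kind}")
    return out

flat = flatten(["x()", ("while", "v()", ["z()"]), "q()"])
```

After this, every label sits at the top level of the list, which is what the goto-handling needs.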

Now, we can implement the goto-handling based on this flattened code:

- Add a big endless loop around it. After the final statement, a break would leave the loop.

- Before the loop, we add the statement `goto = None`.

- The goto-labels will split the code into multiple parts, where we add some `if goto is None:` before each part (excluding the goto-labels).

- For the goto-labels themselves, we add this code:

    if goto == <goto-label>: goto = None

- For every goto-statement, we add this code:

    goto = <goto-stmnt>; continue
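
To make the steps above concrete, here is a hand-written sketch of what the output could look like for a small function with one backward goto (to label "a") and one forward goto (to label "end"). This is not the code that goto.py generates; the function `run`, the `trace` list and the label names are assumptions made up for illustration.

```python
def run(limit, skip):
    """Hand-applied goto emulation. The imagined C original is roughly:

        x();
      a:
        z(); n++;
        if (n < limit) goto a;   /* backward jump */
        if (skip) goto end;      /* forward jump */
        w();
      end:
        q();
    """
    trace = []
    n = 0
    goto = None              # no jump pending
    while True:              # the big endless loop
        if goto is None:     # part before label "a"
            trace.append("x")
        if goto == "a":      # label "a": a pending jump to it ends here
            goto = None
        if goto is None:     # part between "a" and "end"
            trace.append("z")
            n += 1
            if n < limit:
                goto = "a"
                continue     # restart the loop, skipping parts until "a"
            if skip:
                goto = "end"
                continue
            trace.append("w")
        if goto == "end":    # label "end"
            goto = None
        if goto is None:     # part after label "end"
            trace.append("q")
        break                # after the final statement, leave the loop
    return trace
```

Note that a forward jump also goes through `continue`: the loop restarts, every part is skipped because `goto` is not None, and execution resumes at the matching label.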

See here: https://github.com/albertz/PyCParser/blob/master/goto.py

mahaloz · on Jan 3, 2024

That's awesome! That's exactly how modern decompilers deal with a special type of goto occurrence. They reduce gotos (or completely eliminate them) by introducing a `while(true)` loop, followed by corresponding `continue` and `break` statements... we all, of course, know that `while(true)` did not exist in the source, but it's a nice hack!

We even do this in the angr decompiler, found here: https://github.com/angr/angr/blob/8e48d001e18a913ecd4ed2e995...

edflsafoiewq · on Jan 4, 2024

This is the https://en.wikipedia.org/wiki/Structured_program_theorem.

You may also be interested in the Relooper and Stackifier algorithms, which produce more efficient programs than a simple loop/switch.
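
For reference, the "simple loop/switch" form can be sketched like this: every basic block of the control flow graph becomes one branch of a dispatcher, and a state variable selects the next block. The block names and the function `run_cfg` are made up for this sketch; it is not the output of Relooper or Stackifier, just the naive baseline they improve on.

```python
def run_cfg(n):
    """Sum 0..n-1 with a block dispatcher instead of a structured loop."""
    total, i = 0, 0
    block = "entry"                # state variable: which block runs next
    while block != "exit":         # the single outer loop
        if block == "entry":
            block = "check"
        elif block == "check":     # conditional edge in the CFG
            block = "body" if i < n else "exit"
        elif block == "body":
            total += i
            i += 1
            block = "check"        # back edge
    return total
```

Every transfer of control pays for a trip through the dispatcher, which is why algorithms that recover real loops and ifs give more efficient code.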

HALtheWise · on Jan 4, 2024

I've wondered whether it would be possible to design a language with the explicit intent of being a decompilation target, such that you could guarantee that any program could be decompiled into it and then subsequently re-compiled with the original behavior preserved. Having such a language (perhaps a superset of C?) would make it way easier to perform binary patching.

rgovostes · on Jan 4, 2024

In a way disassembler output is exactly this. It’s a low level programming language, but one nonetheless.

There are tools like Remill (https://www.trailofbits.com/opensource/#binary) that disassemble to LLVM IR that can be fed back to the compiler to produce a new binary. You can actually target a different architecture, so it works for binary translation too.

turol · on Jan 5, 2024

This is not possible. Both disassembly and decompilation are equivalent to the halting problem. At least Mike Van Emmerik's PhD thesis ("Static Single Assignment for Decompilation") mentions this though I'm not sure if that's the original source.

konstante · on Jan 7, 2024

Yes, static disassembly is well known to be undecidable; data and code are indistinguishable in general. The reason is very simple (it's actually just an exercise). Consider this assembly code:

jmp rax

...some binary data...

Where the value of "rax" depends on some input, so the disassembler can never be sure that "some binary data" is actually "data" or "code".
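
A small sketch of why this bites a disassembler in practice. It uses an invented toy instruction set (the opcodes `NOP`, `JMP`, `JMP_REG`, `RET` and their encodings are assumptions, not any real architecture) and a recursive-traversal pass that only marks bytes as code when it can reach them through statically known control flow.

```python
# Toy ISA, invented for this sketch:
#   NOP (1 byte), JMP <target> (2 bytes), JMP_REG (1 byte, indirect), RET (1 byte)
NOP, JMP, JMP_REG, RET = 0x01, 0x02, 0x03, 0x04

def reachable_code(program, entry=0):
    """Return the offsets proven to start an instruction."""
    seen, work = set(), [entry]
    while work:
        pc = work.pop()
        while 0 <= pc < len(program) and pc not in seen:
            op = program[pc]
            if op == NOP:
                seen.add(pc)
                pc += 1                        # fall through
            elif op == JMP and pc + 1 < len(program):
                seen.add(pc)
                work.append(program[pc + 1])   # direct target is known
                break
            elif op in (JMP_REG, RET):
                seen.add(pc)
                break                          # successor not known statically
            else:
                break                          # byte does not decode
    return seen

# NOP; JMP_REG; then two bytes that may be code or data.
ambiguous = bytes([NOP, JMP_REG, NOP, RET])
proven = reachable_code(ambiguous)
```

Here only offsets 0 and 1 are proven to be code. The bytes after the indirect jump decode fine as instructions, but nothing in the binary says whether they are ever executed.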

JonChesterfield · on Jan 4, 2024

I don't think you'd need to change the language, only the implementation. The easy way to go is to emit metadata along with the machine code to aid reconstruction, roughly a variant on DWARF-format debug information.

ltfish · on Jan 4, 2024

DARPA’s Enhanced SBOM for Optimized Software Sustainment (E-BOSS) program may achieve this goal (via a slightly different approach).

Philpax · on Jan 4, 2024

This is great! I’m always on the lookout for advancements in decompilation. Looking forward to Part 2 :)

ryanjshaw · on Jan 4, 2024

I build a static analyzer in my spare time and read a bunch of +fravia in my youth, so I have some familiarity with the space but I'm not an expert so I could be completely off here: I wonder if the LLM advances we've seen recently could be used to improve the state of the art in structuring?

As in, could you abstract away the CFG and feed it into an LLM trained on a bunch of CFGs and corresponding ASTs? Especially given that the LLM could be trained on a lot of code that may very well have been reused in the code being decompiled. Even if not the same code, the same algorithms may be similar enough to improve the structuring output.

Philpax · on Jan 4, 2024

Yes, I suspect so; there's been some work on this already (from a cursory arXiv search: [0] [1] [2], but there are more). I'm also curious to see if graph neural networks can be used for structuring as well: given a decompiled CFG, can the original CFG be predicted?

There's lots to be researched here, but as the post alludes to, (public) development is limited. My hope is that there will be more work in this space to enable porting of closed-source applications on older architectures to newer architectures; I have a few ideas on how to go about that, but not enough time to look into it.

[0]: https://arxiv.org/abs/2310.06530

[1]: https://arxiv.org/abs/2306.02546

[2]: https://arxiv.org/abs/2304.03854

torstenvl · on Jan 3, 2024

This is super interesting, and I've definitely wondered how control flow "graph schemas" are applied in decompilation.

[EDIT: Removed a comment about an apparent OCR issue, which has been fixed. Thanks for this great write-up!]

mahaloz · on Jan 3, 2024

Oh my gosh, that is really bad. That's my bad. The code snippet has been corrected to have the correct variable names. Thanks for the find :)