Vectorized Emulation: fuzzing at 2 trillion instructions per second (gamozolabs.github.io)
181 points by muricula on Oct 17, 2018 | 40 comments



It would be cool to see an evaluation of this against a baseline. Software is a giant warehouse filled with soggy pinatas. Swing any kind of bat in there and you'll get some candy.


I'll try to do a blog on this specific aspect. It can be hard to do an apples-to-apples comparison with other fuzzers and tooling, but I can give it a try.

This tooling is specifically designed for "hard" targets. While "hard" is subjective, think targets with fewer than 2 CVEs a year, where getting even a null deref is hard.

I have used this on some soft targets and it's just as if you ran AFL against them: candy everywhere. The upside is that this tool usually "finishes" in an hour (no more coverage, no more crashes), making it a bit easier to develop mutators/generators, as you can run them to completion faster and have a more effective development cycle.


This might possibly be the best analogy for this industry I've ever come across.


Soggy candy mixed with pulped paper and sawdust.


I may be getting this wrong... but how can this possibly work?

Correct me if I'm wrong, but he's trying to emulate 16 systems in parallel by vectorizing the instructions.

Ok, but this assumes that all paths are identical. Once you start fuzzing by varying the inputs, the paths will all vary, at which point you're down to 16 non-identical paths again.

The whole point of fuzzing is to expose the different paths and this would fail horribly for that, surely?


trishume hits the nail on the head here. The mutation strategy is well aware of how this system works.

Each core gets a completely unique fuzz case, and each lane of the vector gets a small mutation of it. In _many_ cases this mutation doesn't even affect flow (eg, the mutated parts are skipped over or never parsed due to errors), meaning all 16 run to completion. What's really important here is that when the small modification you made to an individual lane does actually cause it to diverge, you now know where and when that part of the input is used in the program. This information is huge and can be used to tweak weights and other parameters of mutators/generators, making them learn which fields to use when and how often.
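A rough sketch of that idea (not the actual tooling; the names and buffer sizes here are made up for illustration):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define NUM_LANES 16
    #define MAX_INPUT 4096

    /* Sketch: give all 16 VMs (lanes) the same base fuzz case, then apply
       one tiny mutation per lane so the runs stay maximally coherent. If a
       lane later diverges, the offset it mutated clearly influences control
       flow, which can feed back into mutator weights. */
    void make_lane_inputs(const uint8_t *base, size_t len,
                          uint8_t inputs[NUM_LANES][MAX_INPUT],
                          size_t mutated_off[NUM_LANES]) {
        for (int lane = 0; lane < NUM_LANES; lane++) {
            memcpy(inputs[lane], base, len);
            size_t off = (size_t)rand() % len;         /* where this lane differs */
            inputs[lane][off] ^= 1u << (rand() % 8);   /* flip a single bit */
            mutated_off[lane] = off;                   /* remember it for feedback */
        }
    }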

With some better logic (covered in a later blog) I'll talk more about handling fully divergent cases by using graph analysis to find post-dominators in functions and running VMs until they can sync up. Rather than the current model of "sync if you can", this will be a smart forward-looking sync that ensures that by the end of every function all VMs are running again (even if that means I have to insert artificial post-dominator nodes into the graphs).


What kind of application are you running such that you don't get large path differences based on the input?

In all the fuzzing I have done (with AFL and the like), the paths vary wildly and end up in completely different parts of the stack.

Surely the advantage you're gaining by parallelising is completely wiped out when you lose sync (which to me must be most of the time).

I just can't see how you can possibly keep sync between parallel runs in anything but the most trivial application-under-test.

To me, it seems that this will necessarily degrade to a single path being active. Have you done any analysis on how many paths are active simultaneously over a non-trivial run?


Crypto code, maybe?


I bet his mutation strategy works in tandem with the SIMD strategy. Probably the input for each SIMD lane differs by only one byte or something so that they are maximally coherent.


Bingo


So what you're really doing is (in parallel) finding the decoherence points... These branching/decision points are much more easily found by instrumenting the code as in AFL.

No need to parallelise at all, the compiler gives you that information.

What you have seems like a neat trick, but not an efficient way of actually fuzzing.


> compiler gives you that information

It seems like he's fuzzing machine code. So he doesn't have the original source code.


I wondered about that too, but the author points out that the VMs can re-converge: if one goes round a loop 5 times and another goes round 10 times, they end up resynchronised afterwards.

Any kind of software with a message or input processing loop will naturally re-converge at the top of the loop. It's quite neat.


...but during that one iteration of the loop, the paths will diverge wildly.

In a typical message handling loop, the big-switch will immediately jump off to separate message-handling code, immediately wiping out the possibility of parallelisation.


Same way GPUs handle divergence. You mask off inapplicable lanes until control flow reconverges.

That's why AVX-512 is essential: it contains mask registers that make this approach practical.
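As a rough sketch (this is not the author's JIT output, just an illustration of masked execution): for a guest branch like `if (x > 10) x += 1; else x -= 1;`, both sides run, but each side only writes the lanes that actually took it.

    #include <immintrin.h>   /* compile with -mavx512f */

    /* 16 VMs, one 32-bit value per lane of a zmm register. Both sides of
       the branch execute, but each side is masked to the VMs that took it,
       so all 16 lanes are valid again when the function returns. */
    __m512i branch_both_sides(__m512i x) {
        __mmask16 taken     = _mm512_cmpgt_epi32_mask(x, _mm512_set1_epi32(10));
        __mmask16 not_taken = (__mmask16)~taken;

        /* then side: only lanes in `taken` get updated */
        x = _mm512_mask_add_epi32(x, taken, x, _mm512_set1_epi32(1));
        /* else side: only lanes in `not_taken` get updated */
        x = _mm512_mask_sub_epi32(x, not_taken, x, _mm512_set1_epi32(1));

        return x;   /* control flow has reconverged for all 16 lanes */
    }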


The point isn't just to execute 16 VMs at the same time. The point is to quickly identify when different executions diverge so you know which inputs lead to different paths.


Think VM like the Java VM, not VM like the AWS VM =)

Here he is simply trying to fuzz faster by executing vectorized code instead.


>The goal is to take standard applications and JIT them to their AVX-512 equivalent such that we can fuzz 16 VMs at a time per thread.

What happens if the executable already contains AVX512 (or other SIMD) instructions?

Though I guess you can rewrite them into their scalar equivalent first, and convert that into AVX512.


Technically right now if they have AVX instructions my lifter screams at me and tells me to not be lazy and implement those instructions ;)

However if I were to lift AVX instructions I would lift them to their scalar counterparts in my IL (eg. by emitting 16 32-bit operations). Let's say there are 16 `vpaddd`s in a row, thus creating a 16x16 matrix; when I lift it I'd lift these adds to 128 individual adds, and then generate 128 `vpaddd`s. While I'm doing 16x more `vpaddd`s, I'm also running 16x more VMs than the application originally had, so it cancels out, and the net effect is that I'm still running `vpaddd`s as fast as the CPU can execute them. Technically in this case my stuff might even be faster, as it's going to be 128 adds instead of 16, which means the CPU spends longer bound by instruction throughput, decoding the same instruction over and over and keeping latencies down.
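A rough sketch of what that transposed layout could look like (illustrative only; here I'm assuming a 256-bit guest `vpaddd`, i.e. 8 x 32-bit adds per VM):

    #include <immintrin.h>   /* compile with -mavx512f */

    /* Guest vector state stored element-major ("transposed"): a[i] holds
       element i of a guest ymm register for all 16 VMs at once. One guest
       `vpaddd ymm, ymm, ymm` (8 x 32-bit adds per VM) then lifts to 8 host
       vpaddds, each doing that one element's add across all 16 VMs. */
    void lifted_guest_vpaddd(__m512i dst[8], const __m512i a[8], const __m512i b[8]) {
        for (int elem = 0; elem < 8; elem++)
            dst[elem] = _mm512_add_epi32(a[elem], b[elem]);   /* one host vpaddd */
    }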

However it's possible that this transposition of the matrix might increase dependencies between instructions or something of the sorts which might slow down performance.

That being said, AVX is pretty much only used for memory loads and stores in string operations in most programs anyway, so it doesn't matter much.


You could also set the processor capability bits in the VM to pretend the processor is an older one, which will generally cause the application to fall back to a compatible piece of code. It's pretty rare in non-maths-heavy code.
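In an emulator that intercepts CPUID, that could look roughly like this (a sketch, not any particular tool's code; the bits shown are the standard CPUID feature bits):

    #include <stdint.h>

    /* Clear vector-extension feature bits in the CPUID results returned to
       the guest, so well-behaved programs select their scalar/older code
       paths instead. */
    void filter_cpuid(uint32_t leaf, uint32_t subleaf,
                      uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
        (void)eax; (void)edx;
        if (leaf == 1) {
            *ecx &= ~(1u << 28);   /* AVX */
            *ecx &= ~(1u << 12);   /* FMA */
        } else if (leaf == 7 && subleaf == 0) {
            *ebx &= ~(1u << 5);    /* AVX2 */
            *ebx &= ~(1u << 16);   /* AVX-512F */
        }
    }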


I do this as much as possible. If I have access to source I even go out of my way to compile things for a weaker variant of x86, or even cross-compile for something like MIPS.

This is done, however, not due to a theoretical limitation of the tooling, but a practical one: I haven't written SSE lifters and decoders yet. If I were to write these it would work just fine.


The real problem is when different VMs branch differently. You essentially have to run both sides of a branch in serial with masked lanes. The author mentions this and states that the method works only because the VMs are very deterministic. I don't understand how he deals with indirect jumps.


You mean the possibility that the branch target is different on the different VMs? He can fork the vectorization if the targets diverge. I’m not sure what he does though.


Sure, but how does that not utterly destroy performance as the fork would probably lead to multiple sub-forks down the line.. or lane? :)


When you are fuzzing, the ultimate goal is to explore as much of the entire search space as you can. Forks can't hurt performance; they are just another potential execution path to be vectorized.


He writes that he's using a special type of IL for this technique, presumably he forbids indirect jumps in it.


This is supremely cool!

The masking stuff reminds me of how branching works (or at least used to work) on GPUs, so I wonder, could all this be made to work on a GPU too?


I don't know what this is, except that it's something about optimization or bug finding or both, and that it's wonderfully weird. How often do you see someone use up 96 gigs of RAM, rewrite instructions to AVX-512, and use a virtual MMU?


Same. I have no idea what I just read. It's probably one of those things that only make sense if you already understand them. cough math topics on Wikipedia cough


"Fuzzing" is the technique for finding bugs by varying the inputs of a program and looking at how this affects code paths. A simpler place to start would be http://lcamtuf.coredump.cx/afl/


> if you cheap out on RAM and buy used Phis you could probably get the same setup for $1k USD.

I'd love to lay my hands on a cheap 72x5 Phi-based workstation...

But since Intel only sells those by the tray, I guess I'll have to wait until someone decommissions a large supercomputer.



The ones that end in "5" have virtualization.


Cool, can this be in the next version of http://lcamtuf.coredump.cx/afl/?


Okay, I get the idea with kmasks for branching. But what about loops, where one VM would still have to continue the loop and the others would have to leave it?


You loop until all masks are zero.
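Roughly (a sketch with AVX-512 intrinsics, not the actual JIT output): for a guest loop like `while (x > 0) x -= step;`, lanes that exit early just stop being written to, and the host keeps iterating until the "still in the loop" mask is empty.

    #include <immintrin.h>   /* compile with -mavx512f */

    __m512i run_guest_loop(__m512i x, __m512i step) {
        /* which of the 16 VMs are still inside the loop */
        __mmask16 active = _mm512_cmpgt_epi32_mask(x, _mm512_setzero_si512());
        while (active != 0) {
            /* loop body, applied only to lanes still in the loop */
            x = _mm512_mask_sub_epi32(x, active, x, step);
            /* re-evaluate the loop condition for the lanes that were active */
            active = _mm512_mask_cmpgt_epi32_mask(active, x, _mm512_setzero_si512());
        }
        return x;   /* every VM has left the loop; lanes reconverge here */
    }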


...and different branches? i.e. based on the outcome of that loop?


This is really interesting. I'm looking forward to more blogs on the last few topics/chapters!


This is insane. And awesome.


I love to see the creative misuse of technology. We need more of that.



