The first fuzzer I wrote (we didn't call them that; we were just trying to find bugs triggered by bad packets in a cable TV protocol suite) made random bad packets by building good ones and then mutating everything past some random point. It found 3 bugs, which was amazing at the time, but the real problem with all these systems, at least in terms of my problem, is:
- the number of possible bad packets is literally billions of times larger than good ones
- pretty soon the percentage of those that trigger bad behaviour gets close to 0
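For context, the mutation step was roughly this kind of thing (a minimal sketch in C, not the original code; names are made up):

    /* Take a known-good packet and trash everything after a randomly chosen
       cut point, so the front stays well-formed and the tail is garbage. */
    #include <stdlib.h>
    #include <stddef.h>

    static void mutate_tail(unsigned char *pkt, size_t len)
    {
        if (len == 0)
            return;
        size_t cut = (size_t)rand() % len;       /* random cut point        */
        for (size_t i = cut; i < len; i++)
            pkt[i] = (unsigned char)rand();      /* random bytes after it   */
    }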
Modern fuzzing tools are far more effective at probing the state space than just random mutations. It's hard to appreciate just how effective until you see it in practice.
E.g. the last time I fuzzed a network element with AFL, it took seconds for it to go from a starting corpus of a single Ethernet IPv4 SYN packet to some double-encapsulated IP-in-NSH-IP-in-NSH-Ethernet monstrosity that triggered a misparse. And seconds more for it to generate an IPv6 packet with a fragment extension header that triggered some other problem. A random walk would have no chance of finding that.
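For anyone who hasn't tried it: the harness can be tiny. Something like the sketch below (hypothetical; parse_packet() stands in for whatever your element's parsing entry point is) is all AFL needs, with the corpus directory containing just that one SYN packet as a file:

    /* Hypothetical AFL harness: read one raw frame from stdin and hand it to
       the parser under test.  parse_packet() is a stand-in; link the harness
       against the real parsing code and build with afl-clang/afl-gcc so it
       gets coverage instrumentation. */
    #include <stdio.h>
    #include <stddef.h>

    extern int parse_packet(const unsigned char *buf, size_t len);

    int main(void)
    {
        static unsigned char buf[65536];
        size_t len = fread(buf, 1, sizeof buf, stdin);
        parse_packet(buf, len);   /* AFL notices crashes/hangs, not return codes */
        return 0;
    }

Then afl-fuzz -i corpus -o findings -- ./harness does the rest.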
Do you know of any good writeups for this particular kind of process? I too did fuzzing of network devices before it was called fuzzing and am interested in trying it again with modern tooling.
The electronic design automation world calls it "constrained random simulation" and has been using the technique for two decades for hardware verification, with the same kind of coverage-driven methodology that modern fuzzers use. In some ways the problem is simpler there, since a synchronous hardware model makes the state space explicit.
Right, and a guided fuzzer would be effective at finding distinct classes of bad packets, where "bad" is defined as triggering your error function. Those complex packets that AFL was able to create from thin air were bad in the sense that they triggered bugs in our system, not bad in the sense of being malformed. The latter is interesting only insofar as those malformed packets trigger a bug.
Btw, and maybe I'm misunderstanding what you wrote, if you were generating random packets without fixing up the checksum, you were already wasting basically all of your testing capacity. All that it ends up doing is checking that the negative case of checksum verification works.
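To make that concrete: for IPv4 the fix-up is just the RFC 1071 ones'-complement sum over the header, recomputed after mutation. A sketch (assuming hdr points at the start of the IPv4 header and hdr_len is IHL * 4):

    /* Recompute the IPv4 header checksum after mutating the packet, so the
       mutation exercises the parser rather than just the checksum check. */
    #include <stdint.h>
    #include <stddef.h>

    static void fixup_ipv4_checksum(uint8_t *hdr, size_t hdr_len)
    {
        hdr[10] = hdr[11] = 0;                      /* zero the checksum field  */
        uint32_t sum = 0;
        for (size_t i = 0; i + 1 < hdr_len; i += 2) /* 16-bit big-endian words  */
            sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
        while (sum >> 16)
            sum = (sum & 0xffffu) + (sum >> 16);    /* fold carries back in     */
        uint16_t csum = (uint16_t)~sum;
        hdr[10] = (uint8_t)(csum >> 8);
        hdr[11] = (uint8_t)(csum & 0xff);
    }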
For people who are not familiar with guided fuzzers, it would not surprise me if AFL actually managed to get good checksums on its own.
I have seen it consistently produce "magic" strings, presumably by walking the strcmp function calls, where each individual character comparison is another opportunity for execution to take a different path.
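Contrived illustration of why that works: in a check like the one below, every byte that matches opens a new branch, so the fuzzer gets feedback (and keeps the input) one character at a time instead of having to guess the whole tag at once:

    #include <stddef.h>

    /* Each comparison is its own branch: matching 'M', then 'A', then 'G'...
       each produces new coverage, so a coverage-guided fuzzer converges on
       "MAGIC" byte by byte.  is_magic() is a made-up example. */
    static int is_magic(const unsigned char *buf, size_t len)
    {
        return len >= 5 &&
               buf[0] == 'M' && buf[1] == 'A' && buf[2] == 'G' &&
               buf[3] == 'I' && buf[4] == 'C';
    }

If the check is a single uninstrumented library strcmp/memcmp call, that per-byte feedback can disappear, which is why AFL++ has comparison-splitting modes (laf-intel, CMPLOG) for exactly this case.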
Coverage-based fuzzing won't find cases where an extreme value of some variable causes a malfunction but there is no code that treats this value specially, because the coder missed it. The most common case of this is an integer variable with the smallest possible value: 0x80000000 for 32 bits. The problem with this value is that if you negate it, it is still negative. That might cause a computation to go badly wrong, but a coverage-based fuzzer might "think" that it has covered every state your code can reach without finding bugs.
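For example (a contrived sketch), nothing below treats INT_MIN specially, so full branch coverage is easy to reach without ever hitting the bad value:

    #include <limits.h>

    static const unsigned char table[256] = { 0 };

    /* For delta == INT_MIN, -delta overflows (UB in C; on two's-complement
       targets it wraps back to INT_MIN), the clamp doesn't catch it because
       the value is negative rather than large, and table[] gets indexed out
       of bounds, yet no branch here distinguishes that case. */
    unsigned char bucket_for(int delta)
    {
        int magnitude = delta < 0 ? -delta : delta;
        int clamped   = magnitude > 255 ? 255 : magnitude;
        return table[clamped];
    }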
Is anyone aware of fuzzers that take issues like this into account by explicitly trying problematic values, like INT_MIN for an integer variable?
Most fuzzers I think have a dictionary of special values they’ll occasionally use. I wrote a structure-aware fuzzer framework which uses random values for the initial generation, then on subsequent mutations will perform arithmetic/bit flips with a small chance to grab a special value from the dictionary.
Even without a dictionary, if your input is reasonably small, it should discover these special values given enough iterations.
The dictionary of special values is a good approach to deal with this. Without it, you'd need billions of iterations before randomly trying 0x80000000 for a 32-bit integer.
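A sketch of that kind of mutator (made-up, not from any particular framework): mostly do plain bit flips, but with a small probability overwrite a 32-bit slot with one of the known-nasty constants:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static const uint32_t nasty[] = {
        0x00000000u, 0x00000001u, 0x0000FFFFu,
        0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu,   /* INT_MAX, INT_MIN, UINT_MAX */
    };

    static void mutate(uint8_t *buf, size_t len)
    {
        if (len < 4)
            return;
        size_t pos = (size_t)rand() % (len - 3);      /* room for a 32-bit write */
        if (rand() % 10 == 0) {                       /* ~10%: dictionary value  */
            uint32_t v = nasty[(size_t)rand() % (sizeof nasty / sizeof nasty[0])];
            memcpy(buf + pos, &v, sizeof v);
        } else {                                      /* otherwise: flip one bit */
            buf[pos] ^= (uint8_t)(1u << (rand() % 8));
        }
    }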
I find fuzzing fascinating, but it seems to only be used on things like compilers, networking protocol implementations, and other parser-heavy tools. Does anyone have experience using it for business logic?
I think as long as it's reproducible it's OK... but it should be a team decision. The data has to be non-sensitive, of course; very often it's not OK to pull data out of prod.
Fuzzers in general tend to work hard to make sure crashes are reproducible (usually with a seed at the start of the run, and by saving crashing inputs).
If you can't repro a crash a month or a week later, it's not worth it IMHO.
Usually you can get most of the benefits with data generation, based on a deterministic seed.
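A minimal sketch of that (the invariant being checked is a made-up placeholder): log the seed up front, derive every generated input from it, and any failure reproduces exactly by re-running with the same seed:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical stand-in for the business rule under test. */
    static void check_order_invariants(long quantity, long unit_price)
    {
        long total = quantity * unit_price;
        if (total < 0)
            fprintf(stderr, "FAIL: qty=%ld price=%ld\n", quantity, unit_price);
    }

    int main(int argc, char **argv)
    {
        unsigned seed = argc > 1 ? (unsigned)strtoul(argv[1], NULL, 10)
                                 : (unsigned)time(NULL);
        fprintf(stderr, "seed=%u (pass this back in to reproduce)\n", seed);
        srand(seed);
        for (int i = 0; i < 10000; i++)
            check_order_invariants(rand(), rand());
        return 0;
    }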
Mainly though, you have to ask about the purpose of the tests. Most developers want a test suite that tells them with high confidence that what they just worked on didn't create a new bug, or a regression. That lets them stay focused on their work rather than chasing through the codebase for an unrelated latent issue which might have been introduced by someone else, years ago.
There is often value in separating "exploratory" or "stochastic" tests, which might uncover new (previously unknown) bugs, from regular tests. To make it really work as part of the default workflow you need a culture which understands that a feature might take additional time because the team stopped to fix a latent issue to get back to green.
To put it bluntly, letting random-ish testing break your pipeline is making a statement of business priority (we care so much about random bugs that we will stop all other work until they are fixed) which might not align with reality.
"Is this the most important thing for me to be working on for the success of the business?"
I think the "right" way to add fuzzing / random testing to a pipeline is sell the value to the business and have an initiative to rigorously fuzz the snot out of the software. Crucially, have resources dedicated to triage and fix the identified issues. It should be a non-breaking pipeline stage right up to the point where everyone feels confident that any test failures are the result of new code.
The worst case is that an opinionated developer adds stochastic tests, with bad reproducibility, without consulting the team, and randomly breaks builds in an organisation that only rewards or understands feature delivery and ticket punching. That is basically going to make their co-workers' lives hell and is IMHO not the right way to go about it.
Although the ways that developers and users will find to abuse your systems are often extraordinary, they still pale in comparison to what a good fuzzer can do.