The first fuzzer I wrote (we didn't call them that; we were just trying to find bugs triggered by bad packets in a cable TV protocol suite) made random bad packets by building good ones and then mutating everything past some random point. It found 3 bugs, which was amazing at the time, but the real problem with all these systems, at least in terms of my problem, is:
- the number of possible bad packets is literally billions of times larger than good ones
- pretty soon the percentage of those that trigger bad behaviour gets close to 0
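For context, the mutation step was roughly this kind of thing (a minimal sketch in C, not the original code; names are made up):

    /* Take a known-good packet and trash everything after a randomly chosen
       cut point, so the front stays well-formed and the tail is garbage. */
    #include <stdlib.h>
    #include <stddef.h>

    static void mutate_tail(unsigned char *pkt, size_t len)
    {
        if (len == 0)
            return;
        size_t cut = (size_t)rand() % len;       /* random cut point        */
        for (size_t i = cut; i < len; i++)
            pkt[i] = (unsigned char)rand();      /* random bytes after it   */
    }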
Modern fuzzing tools are far more effective at probing the state space than just random mutations. It's hard to appreciate just how effective until you see it in practice.
E.g. the last time I fuzzed a network element with AFL, it took seconds for it to go from a starting corpus of a single Ethernet IPv4 SYN packet to some double-encapsulated IP-in-NSH-IP-in-NSH-Ethernet monstrosity that triggered a misparse. And seconds more for it to generate an IPv6 packet with a fragment extension header that triggered some other problem. A random walk would have no chance of finding that.
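For anyone who hasn't tried it: the harness can be tiny. Something like the sketch below (hypothetical; parse_packet() stands in for whatever your element's parsing entry point is) is all AFL needs, with the corpus directory containing just that one SYN packet as a file:

    /* Hypothetical AFL harness: read one raw frame from stdin and hand it to
       the parser under test.  parse_packet() is a stand-in; link the harness
       against the real parsing code and build with afl-clang/afl-gcc so it
       gets coverage instrumentation. */
    #include <stdio.h>
    #include <stddef.h>

    extern int parse_packet(const unsigned char *buf, size_t len);

    int main(void)
    {
        static unsigned char buf[65536];
        size_t len = fread(buf, 1, sizeof buf, stdin);
        parse_packet(buf, len);   /* AFL notices crashes/hangs, not return codes */
        return 0;
    }

Then afl-fuzz -i corpus -o findings -- ./harness does the rest.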
Do you know of any good writeups for this particular kind of process? I too did fuzzing of network devices before it was called fuzzing and am interested in trying it again with modern tooling.
The electronic design automation world calls it "constrained random simulation" and has been using the technique for two decades for hardware verification, with the same kind of coverage-driven methodology that modern fuzzers use. In some ways the problem is simpler there, since a synchronous hardware model makes the state space explicit.
Right, and a guided fuzzer would be effective at finding distinct classes of bad packets, where "bad" is defined as triggering your error function. Those complex packets that AFL was able to create from thin air were bad in the sense that they triggered bugs in our system, not bad in the sense of being malformed. The latter is interesting only insofar as those malformed packets trigger a bug.
Btw, and maybe I'm misunderstanding what you wrote, if you were generating random packets without fixing up the checksum, you were already wasting basically all of your testing capacity. All that it ends up doing is checking that the negative case of checksum verification works.
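To make that concrete: for IPv4 the fix-up is just the RFC 1071 ones'-complement sum over the header, recomputed after mutation. A sketch (assuming hdr points at the start of the IPv4 header and hdr_len is IHL * 4):

    /* Recompute the IPv4 header checksum after mutating the packet, so the
       mutation exercises the parser rather than just the checksum check. */
    #include <stdint.h>
    #include <stddef.h>

    static void fixup_ipv4_checksum(uint8_t *hdr, size_t hdr_len)
    {
        hdr[10] = hdr[11] = 0;                      /* zero the checksum field  */
        uint32_t sum = 0;
        for (size_t i = 0; i + 1 < hdr_len; i += 2) /* 16-bit big-endian words  */
            sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
        while (sum >> 16)
            sum = (sum & 0xffffu) + (sum >> 16);    /* fold carries back in     */
        uint16_t csum = (uint16_t)~sum;
        hdr[10] = (uint8_t)(csum >> 8);
        hdr[11] = (uint8_t)(csum & 0xff);
    }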
For people who are not familiar with guided fuzzers, it would not surprise me if AFL actually managed to get good checksums on its own.
I have seen it consistently produce "magic" strings, presumably by walking the strcmp function calls, where each individual character comparison is another opportunity for execution to take a different path.
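Contrived illustration of why that works: in a check like the one below, every byte that matches opens a new branch, so the fuzzer gets feedback (and keeps the input) one character at a time instead of having to guess the whole tag at once:

    #include <stddef.h>

    /* Each comparison is its own branch: matching 'M', then 'A', then 'G'...
       each produces new coverage, so a coverage-guided fuzzer converges on
       "MAGIC" byte by byte.  is_magic() is a made-up example. */
    static int is_magic(const unsigned char *buf, size_t len)
    {
        return len >= 5 &&
               buf[0] == 'M' && buf[1] == 'A' && buf[2] == 'G' &&
               buf[3] == 'I' && buf[4] == 'C';
    }

If the check is a single uninstrumented library strcmp/memcmp call, that per-byte feedback can disappear, which is why AFL++ has comparison-splitting modes (laf-intel, CMPLOG) for exactly this case.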
Coverage-based fuzzing won't find cases where an extreme value of some variable causes a malfunction but there is no code that treats this value specially, because the coder missed it. The most common case of this is an integer variable with the smallest possible value: 0x80000000 for 32 bits. The problem with this value is that if you negate it, it is still negative. That might cause a computation to go badly wrong, but a coverage-based fuzzer might "think" that it has covered every state your code can reach without finding bugs.
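For example (a contrived sketch), nothing below treats INT_MIN specially, so full branch coverage is easy to reach without ever hitting the bad value:

    #include <limits.h>

    static const unsigned char table[256] = { 0 };

    /* For delta == INT_MIN, -delta overflows (UB in C; on two's-complement
       targets it wraps back to INT_MIN), the clamp doesn't catch it because
       the value is negative rather than large, and table[] gets indexed out
       of bounds, yet no branch here distinguishes that case. */
    unsigned char bucket_for(int delta)
    {
        int magnitude = delta < 0 ? -delta : delta;
        int clamped   = magnitude > 255 ? 255 : magnitude;
        return table[clamped];
    }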
Is anyone aware of fuzzers that take issues like this into account by explicitly trying problematic values, like INT_MIN for an integer variable?
Most fuzzers I think have a dictionary of special values they’ll occasionally use. I wrote a structure-aware fuzzer framework which uses random values for the initial generation, then on subsequent mutations will perform arithmetic/bit flips with a small chance to grab a special value from the dictionary.
Even without a dictionary, if your input is reasonably small, it should discover these special values given enough iterations.
The dictionary of special values is a good approach to deal with this. Without it, you'd need billions of iterations before randomly trying 0x80000000 for a 32-bit integer.
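A sketch of that kind of mutator (made-up, not from any particular framework): mostly do plain bit flips, but with a small probability overwrite a 32-bit slot with one of the known-nasty constants:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static const uint32_t nasty[] = {
        0x00000000u, 0x00000001u, 0x0000FFFFu,
        0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu,   /* INT_MAX, INT_MIN, UINT_MAX */
    };

    static void mutate(uint8_t *buf, size_t len)
    {
        if (len < 4)
            return;
        size_t pos = (size_t)rand() % (len - 3);      /* room for a 32-bit write */
        if (rand() % 10 == 0) {                       /* ~10%: dictionary value  */
            uint32_t v = nasty[(size_t)rand() % (sizeof nasty / sizeof nasty[0])];
            memcpy(buf + pos, &v, sizeof v);
        } else {                                      /* otherwise: flip one bit */
            buf[pos] ^= (uint8_t)(1u << (rand() % 8));
        }
    }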
I find fuzzing fascinating, but it seems to only be used on things like compilers, networking protocol implementations, and other parser-heavy tools. Does anyone have experience using it for business logic?
I think as long as it's reproducible it's OK... but it should be a team decision. The data has to be non-sensitive, of course; very often it's not OK to pull data out of prod.
Fuzzers in general tend to work hard to make sure crashes are reproducible (usually with a seed at the start of the run, and by saving crashing inputs).
If you can't repro a crash a month or a week later, it's not worth it IMHO.
Usually you can get most of the benefits with data generation, based on a deterministic seed.
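A minimal sketch of that (the invariant being checked is a made-up placeholder): log the seed up front, derive every generated input from it, and any failure reproduces exactly by re-running with the same seed:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical stand-in for the business rule under test. */
    static void check_order_invariants(long quantity, long unit_price)
    {
        long total = quantity * unit_price;
        if (total < 0)
            fprintf(stderr, "FAIL: qty=%ld price=%ld\n", quantity, unit_price);
    }

    int main(int argc, char **argv)
    {
        unsigned seed = argc > 1 ? (unsigned)strtoul(argv[1], NULL, 10)
                                 : (unsigned)time(NULL);
        fprintf(stderr, "seed=%u (pass this back in to reproduce)\n", seed);
        srand(seed);
        for (int i = 0; i < 10000; i++)
            check_order_invariants(rand(), rand());
        return 0;
    }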
Mainly though, you have to ask about the purpose of the tests. Most developers want a test suite that tells them with high confidence that what they just worked on didn't create a new bug, or a regression. That lets them stay focused on their work rather than chasing through the codebase for an unrelated latent issue which might have been introduced by someone else, years ago.
There is often value in separating "exploratory" or "stochastic" tests, which might uncover new (previously unknown) bugs, from regular tests. To make it really work as part of the default workflow you need a culture which understands that a feature might take additional time because the team stopped to fix a latent issue to get back to green.
To put it bluntly, letting random-ish testing break your pipeline is making a statement of business priority (we care so much about random bugs that we will stop all other work until they are fixed) which might not align with reality.
"Is this the most important thing for me to be working on for the success of the business?"
I think the "right" way to add fuzzing / random testing to a pipeline is sell the value to the business and have an initiative to rigorously fuzz the snot out of the software. Crucially, have resources dedicated to triage and fix the identified issues. It should be a non-breaking pipeline stage right up to the point where everyone feels confident that any test failures are the result of new code.
The worst case is that an opinionated developer adds stochastic tests, with bad reproducibility, without consulting the team, and randomly breaks builds in an organisation that only rewards or understands feature delivery and ticket punching. That is basically going to make their co-workers' lives hell and is IMHO not the right way to go about it.
Although the ways that developers and users will find to abuse your systems are often extraordinary, they still pale in comparison to what a good fuzzer can do.