2022 May 21

Fuzzing Hare

Some years ago, I first encountered American Fuzzy Lop (AFL), a fuzz tester. I found it nifty and forgot about it. I recognized the term “fuzz testing”, but didn’t understand it. I wrote mostly Go then, so I didn’t need AFL, which targets C programs.

With time, my interest in C deepened, as I savored its set of tradeoffs in the great expanse of Language–Designland. I remembered that C programs benefit from nifty tooling. What was that fuzzer called? I should try it.

Meanwhile, Drew DeVault was “promoting” the “secret programming language”. I peeked around, found Hare’s project hub, and discovered an adorable C codebase—the bootstrap compiler (harec)—ready to fuzz. Why the hell not? It was a good candidate: clean style, clear architecture, a disciplined commit history, and 20k lines of code—little enough that I might understand it.

Learning a compiler by fuzzing it sounded fun! I’d learn two things at once. AFL is easy to use, so if the idea flopped, I would have wasted little time.

But it didn’t flop.

In the end, I used AFL++, which proved to be faster. If you’re interested in trying fuzz testing, read and heed the docs: use very small initial cases and write a dictionary. Fuzzing can run for arbitrary lengths of time, and a good setup makes it more productive. Or do it the hard way, like I did, and learn for yourself.

Before I submitted any patches, I gave Drew my proposal to fix the bugs found by the fuzzer. He warned me that the bootstrap compiler has one job: to build the self-hosted compiler, so it’s OK if it has bugs. Pfff. The self-hosted compiler? Does that even exist yet?

Yes, it did. But I didn’t care, I was stoked to FUZZ some HARE.

Security researchers and engineers often use fuzz testing to expose security flaws. Fuzzing advocates, noting its lack of popularity relative to other testing techniques, find most software maintainers negligent. I don’t really know, but I also know that fuzzing has benefits beyond security.

Fuzzing is desired on components that receive untrusted input, such as databases and web servers. It’s odd to fuzz a bootstrap compiler, because its input is not only trusted, but static! Sure, in a spy-flick scenario, the self-hosted compiler becomes the victim of a supply chain attack by a state actor, so you can’t even trust your own codebase. This is extremely unlikely at this stage, so it doesn’t rationalize my decision.

Of course, I don’t need to rationalize it, but I anticipate other people needing to. Here’s my suggestion: it was fun and I was spontaneously motivated to do it. And sometimes, you can’t predict the good that comes from play.

Despite Drew’s warning, he’s applied (sometimes with revision) all the patches I submitted—except a few in the pipeline at the time of writing. I take this as evidence that although security and correctness are not critical, a few small, well-scoped patches are still welcome for their improvements to robustness and user experience.

Fuzzing does have benefits beyond security. Maybe engineers find Security too difficult to apprehend, and I agree. Robustness is easier to understand and still helps security.

Anyway, enough rambling. On to the things I learned.

AFL++ took days to create thousands of Hare “programs” that crashed the bootstrap compiler somehow. The fuzzer gives a rough categorization of these cases by what signal the kernel raised to terminate the process. It tries not to be redundant, but is oversensitive and doesn’t understand the categories that matter, so many cases repeat the same bug.

Harec contains a lot of assertions, so most cases were SIGABRT. Rarer, and more interesting, were SIGSEGV and SIGFPE. If you inspect a corpus member at random, the encoding likely not ASCII. While Harec is designed to accept UTF-8 input, in practice the input is often a subset—ASCII. Thus, I first worked on ASCII cases that didn’t cause SIGABRT, automating this query with a shell script. Some of the segfault cases were stack overflows caused by large nested inputs—not interesting, so I limited file length to 80 bytes. After selecting a case, I’d run it in Valgrind to get the stack trace, then navigate to the crash site. Sometimes, the bug could be fixed right there.

In other cases, the bug was caused somewhere else. My initial approach was to try GDB. I don’t recommend this approach. It’s much more important to focus on program design than state. While the fine details of state sometimes matter, what’s more important is the overall program logic and architecture. If you don’t understand that, there’s little hope to fixing it by staring at program state dumps.

I had to learn, somehow, the way harec does what all compilers do, despite having never worked on harec before. Thankfully, since I did know the basics of compiler design, I could formulate testable guesses as to the problem. Using ctags, I navigated the unfamiliar codebase and figured out how to test my hypotheses, often by placing printf calls. That helped me not only learn harec’s design, but narrow down the cause.

There’s a non-invasive narrowing tactic that is often effective: tweaking the input. This is difficult because fuzzer-generated files are nonsensical, and it’s unclear what structure in the file is causing the crash. Minimizing it by hand isn’t too hard, and can also quickly narrow down the relevant compiler subsystem and language specification.

For instance, a crash in the type checker could be caused by several things. Is the syntax tree malformed? That requires understanding the syntax specification, so read the spec. If it’s crashing on a well-specified program, test the parser next. Otherwise, it’s probably internal to the type checker. And so on.

The most important things to know while troubleshooting: what is the system supposed to do, and how does it do it?


Feedback

Discuss this page by emailing my public inbox. Please note the etiquette guidelines.

© 2024 Karl Schultheisz