Faultlore

Compiler Optimizations Are Hard Because They Forget

Aria Beingessner

September 24th, 2022

How exactly would you design an optimizing compiler? Or more specifically, how do you design and implement the actual optimizations? Trying to do it all at once is really overwhelming and probably impossible, so a lot of compiler optimization is:

  1. Identify a situation where a trick could be applied
  2. Build analysis that can find that situation (or a generalization of it)
  3. Apply the trick to all the places you can find

Staple a hundred of those “optimization passes” together (hopefully reusing step 2 as much as possible) and suddenly your compiler starts doing really big and complicated changes as an emergent property of all the small ones!

One of the first things you’ll learn about in compilers is “peephole optimization”, which is the simplest kind of optimization pass, requiring very little analysis. Like, some of them could probably be done with an actual regex! Here’s a couple simple peepholes:
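Something like these two, as a sketch (the names match the passes that show up in the compiler below):

    replace_muls_with_shifts:   rewrite  x * 2          as  x << 1
    merge_shifts:               rewrite  (x << 1) << 1  as  x << 2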

Ah, but, hmm. What if you have (x * 2) * 2? Then you could turn that into x << 2! Oh hey, and we don’t actually need to write a new peephole for that, because the existing ones will combine to compute that, right? Just do:

(x * 2) * 2
(x << 1) << 1
x << 2

Let’s try it out on our definitely real compiler:

I’m done optimizing, the final program is (x << 1) << 1

HUH? Hold on let me look at the code I didn’t write:

fn do_optimizations() {
    merge_shifts();
    replace_muls_with_shifts();
}

Ughhh our peephole passes run in the wrong order for this to work. Ok well we can just swap them around. But wait oh god if we add enough passes there might be no optimal ordering and we’ll be constantly missing “obvious” optimizations! Do we need to run this in a loop? But then how do we know when to stop?? I guess we could check if any change makes other rules potentially newly applicable and then rerun them??? But oh god will that necessarily terminate???? And really at what point does the cost outweigh the benefit and aaAAaaAAAaAAAAA!!!!!
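For concreteness, the naive "run it in a loop until nothing changes" version looks something like this (a sketch, assuming each pass is tweaked to report whether it actually rewrote anything); it answers "when to stop", but none of the other questions:

fn do_optimizations() {
    loop {
        let mut changed = false;
        changed |= merge_shifts();              // each pass now reports whether it
        changed |= replace_muls_with_shifts();  // actually rewrote anything
        if !changed {
            break; // fixpoint: no pass found anything new to do
        }
    }
}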

There are no† solid answers here; you just try everything until you feel you’re getting reasonable results, and then you try not to think about pass ordering for a while until it’s time to kick the nest of bees again and discover that two passes have been in the wrong order this whole time – or more likely that your compiler crashes if you swap the order of two passes because no one ever tried that before and they were subtly depending on each other.

To quote someone who suffers in the guts of compilers far more than me:

optimization design is science, but pass ordering is art

† My twitter mentions are a disaster right now because I accidentally baited half of the world’s leading compiler experts into sending me their favourite papers on optimization design so maybe actually there is a perfect design and all previous compilers are cowards and fools.

Hazards

The merge_shifts peephole I first described is too simple. It can’t even handle (x << 2) << 3! Really we should be able to replace any (x << y) << z with x << (y + z)!

Compiler: hello yes I have turned x << 30 << 40 into x << 70

Much better– wait what type is x? A 64-bit integer? Wait, uh, shit, that’s… going to be problematic. Like, if you emit this as literally as possible on x64 it will be interpreted as x << 6 because the shl instruction just modulos (truncates) the shift to be mod 64. Even with the vibiest vibes for semantics that’s just not the same thing the program was doing before!

So we need to be a bit more careful about our optimizations! Usually there will be hazards that make an optimization invalid. In this case the hazard would just be “the sum of the two shifts being too big”.
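In code, the generalized rewrite plus its hazard check might look something like this (a sketch, assuming 64-bit operands and constant shift amounts; the function name is mine):

// Returns the merged shift amount for (x << y) << z, or None if the rewrite
// would be invalid.
fn merged_shift_amount(y: u32, z: u32) -> Option<u32> {
    let total = y.checked_add(z)?;
    if total < 64 {
        Some(total) // fine: rewrite (x << y) << z as x << total
    } else {
        None // hazard: the combined shift is at least the operand width, bail out
    }
}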

Sometimes these hazards are easy to notice, but as you try to make your optimizations more generalized you will start to run into places where you don’t have enough information. For instance if you wanted to allow one of the shifts to be a variable! You could conceivably replace (x << 10) << z with x << (10 + z) but… how can you ever be confident that this is correct? You could make over-wide shifts Undefined Behaviour, which would let the compiler assume z < 64, but that tells you nothing about 10 + z < 64!

So generalizing the shift peephole to that extent is just unsound, because we can’t prove that it’s correct (ignoring whether it would even be profitable). Except… we can make our analyses smarter! If the code in question is wrapped in if z < 20 {, then it would be valid. So maybe we could have some data structures and analyses that track the flow of data and facts through the code to help prove that hazards don’t exist!
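As a sketch of what “smarter” might mean here, the peephole could consult whatever facts an earlier analysis has managed to prove about z (the KnownRange type and its field are hypothetical stand-ins for real range analysis):

// An exclusive upper bound some earlier analysis proved for a value,
// e.g. "z < 20" because the code is guarded by `if z < 20 { ... }`.
struct KnownRange {
    upper_bound: u64,
}

// Can we rewrite (x << constant) << z as x << (constant + z) for a 64-bit x?
fn can_merge_with_variable_shift(constant: u64, z: &KnownRange) -> bool {
    // For every z < upper_bound we need constant + z < 64,
    // which holds exactly when constant + upper_bound <= 64.
    constant
        .checked_add(z.upper_bound)
        .map_or(false, |total| total <= 64)
}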

Suddenly we’re starting to build a Serious Optimizing Compiler and not just writing some peepholes anymore!

Oops That Was Important?

So at this point your compiler’s getting pretty complicated. You’re building a lot of fancy data structures and analyses to try to prove that certain optimizations are valid (let’s call all of this “metadata” for the sake of exposition). When you do that successfully, you rewrite the program/ir to a new form that you believe is Better.

This is hard enough, because every optimization not only needs to do its One Weird Trick, but it also needs to remember to fix up all that metadata, which was about the old version of the code. This includes debuginfo which is trying to map the current program to the original source program so that when people ask for the value of “x” you can figure out that it ended up in some random register. Needless to say a lot of bugs Just Happen in all of this, but those aren’t the bugs I’m interested in today.

The stuff that’s really interesting to me, and the reason I wanted to write this article, is that these optimization passes are generally destructive. The x * 2 * 2 => x << 2 transform only works because we completely forget that there was originally multiplication and move on with our lives. Similarly a lot of optimizations rely on things like inlining, constant folding, and dead-code-elimination to strip away extraneous details and make the Specific Situation an optimization is looking for more glaringly obvious.

And for the most part this is good! If there’s stuff that my program doesn’t need to do, I don’t want it doing it!

But here’s the problem: with a ton of optimization passes with really complicated hazard checks, it becomes really easy for things that don’t seem important to actually be load-bearing facts that prevent certain optimizations from doing the wrong thing! Oops… That Was Important?

The one that will continue to haunt me for all eternity is one that always throws people off when they first learn about it: it’s arguably incorrect for llvm to optimize out useless pointer-to-integer casts, and this can lead to actual miscompilations in correct code. YEAH.

The basic idea is that compilers have a lot of information about whether two pointers can possibly alias. If they definitely don’t alias, then writes to one can’t be observed through the other. But if you cast two pointers to integers to compare their addresses, the compiler gets really excited and starts replacing expressions containing one pointer with the other pointer, and then suddenly it has turned “write to x, read from x” into “write to y+1, read from x”.
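Here’s a minimal sketch of that pattern (adapted from the well-known pointer-provenance examples; whether the two addresses actually compare equal depends entirely on how the compiler lays out the locals, so treat it as an illustration rather than a reliable reproducer):

fn main() {
    let mut x: u64 = 0;
    let y: u64 = 0;

    let xp = &mut x as *mut u64;                 // pointer to x
    let yp = (&y as *const u64).wrapping_add(1); // one-past-the-end of y

    if yp as usize == xp as usize {
        // The integer comparison tells the compiler the two addresses are equal,
        // so it may start rewriting uses of `xp` into uses of `yp`. That turns
        // "write to x, read from x" into "write through y+1, read from x",
        // which alias analysis is then free to tear apart.
        unsafe { xp.write(10) };
        println!("{}", x); // a miscompile here could print 0 instead of 10
    }
}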

This puts us dangerously close to the compiler concluding that the write and read are unrelated, and therefore that the write can be deleted. The one saving grace is that the compiler should know that casting a pointer to an integer is a huge alias-analysis hazard and hold back on that conclusion… except another optimization saw that the pointer-to-integer casts aren’t needed and removed them.

It’s like Gretel very carefully left a trail of optimization breadcrumbs through the forest so that we’d remember how we got here, but Hansel just saw Delicious Free Breadcrumbs and ate them up. Now we’re lost in the optimization forest and likely to be eaten by a dragon with a messed up neck.

In some sense the problem is that programs are horribly prone to in-band signaling, and so doing some seemingly banal operation like a cast is also a secret handshake about alias analysis. But the alternative is… also messy. With a thousand optimizations and a dozen different auxiliary data structures, the more you store Important Information about a program out-of-band, the easier it is for an optimization to forget to check that information or keep it up to date.

I don’t have any good answers for this. It just Sucks and Compilers Are Hard and I wanted to share that sorrow. Maybe rvsdg or something magically fixes this and makes it tractable for mortals to reason about compiler semantics.

Hardware Cache-Hits Different

One last thing before we go. I recently learned that, allegedly, hardware memory models are actually a lot more well-defined and robust than the ones used by compilers. But why? And if that’s true, why don’t compilers just adopt those kinds of models?

As far as I can tell, this is just another consequence of everything discussed above. We expect compilers to be destructive when they do optimizations, while we expect hardware to more or less Do What We Say. People love pointing out that hardware internally does lots of tricks that are very compilery, but at the end of the day it’s pretty directly interpreting a lot of instructions, and always has the full source code it’s working on right in front of it.

Like, the CPU may do some fancy cache shit to “not really” do a load or a store, but it still knows that it’s supposed to be doing a logical memory access, and so all the in-band signaling that access implies is still there so it doesn’t forget to uphold certain guarantees.

On the other hand, a compiler is expected to see a load that it can cache and delete that load from the program forever, never to be spoken of again. Even if the compiler kept an auxiliary structure to remember that load existed, once we produce the final binary it’s Just Gone. There’s no indication to the CPU that the load ever existed, and so it doesn’t know it needs to maintain the load’s semantics!
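For a concrete flavour of that, here’s a hypothetical before/after (not the output of any particular compiler):

// Before: two loads from the same place.
fn before(p: &u64) -> u64 {
    let a = *p;
    let b = *p; // the compiler can prove this load gives the same value...
    a + b
}

// After: the second load is Just Gone from the binary, so the hardware
// never even finds out it was supposed to happen.
fn after(p: &u64) -> u64 {
    let a = *p;
    let b = a; // ...so it reuses `a` and the load disappears forever
    a + b
}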

And that, my dear audience, is why

drum roll

Fucking memory_order_consume exists and is a disaster! memory_order_consume is exactly trying to capture the more subtle ordering guarantees in hardware.

You see, everyone gives you really strong guarantees about accesses to the same memory (even Relaxed atomics have a Total Modification Order), but things get fuzzier when you want two unrelated pieces of memory to be consistent with each other. Normally the solution to this is to emit some kind of “fence” that tells all the cpus to take a moment and get their story straight, but this is really heavy-handed and expensive.

What hardware memory models recognize is that, well, if you load pointer B from pointer A, accesses through B are actually related to A! Ok that’s confusing, let’s give a simple example:

// Note: `Consume` below stands in for C++'s memory_order_consume; Rust's
// Ordering doesn't actually expose it, so treat this as illustrative.
static mut LATEST_DATA: AtomicPtr<u64> = AtomicPtr::new(ptr::null_mut());
static mut MY_DATA: u64 = 0;

unsafe fn thread1() {
    MY_DATA = 5;                                // Initialize our data
    LATEST_DATA.store(&mut MY_DATA, Release);   // Share it
}

unsafe fn thread2() {
    let latest_ptr = LATEST_DATA.load(Consume); // Get the latest
    let data = *latest_ptr;                     // Read out the data
    println!("{data}");
}

One thread writes some data to MY_DATA and then tells all the other threads to go check it out by storing a pointer to it in LATEST_DATA (with Release semantics so that the MY_DATA write definitely came before). Then another thread loads that pointer (Consume semantics, no fence needed) from LATEST_DATA and reads the contents of MY_DATA through it. Because reading the data LATEST_DATA points to requires reading the LATEST_DATA pointer first, there’s an obvious causality relationship and no extra synchronization is required!

“Do some shit and then swap a pointer to that shit into a data structure” is the bread and butter of lock-free data structures, so making that operation more efficient (and intuitively correct!) is a nice big win.

Unfortunately, this beautifully elegant system completely falls over in the face of optimizing compilers, because we’ve ordered them to stomp through the code and eliminate every load and store they can! Your elegantly crafted causal chain will probably be reduced to ash when the compiler realizes that a load is redundant and removes it. More subtly, the compiler might also notice that two pointers are guaranteed to be equal and replace accesses through the one that has causality with the one that doesn’t!
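Concretely, in the example above, a rewrite like this is all it takes to break the chain (hypothetical, reusing the statics and the illustrative Consume ordering from before, and assuming the compiler has somehow proven that &MY_DATA is the only pointer ever stored in LATEST_DATA):

unsafe fn thread2_after_optimization() {
    let latest_ptr = LATEST_DATA.load(Consume); // the load is still here...
    let _ = latest_ptr;                         // ...but nothing depends on it anymore
    let data = MY_DATA;                         // the read was rewritten to go straight
                                                // to MY_DATA, so the address dependency
                                                // the hardware would have honoured for
                                                // free has been optimized away
    println!("{data}");
}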

Oops That Was Important!

As a result, all the compilers just interpret consume as the strictly more expensive acquire because they really can’t meaningfully support it while also supporting all the other optimizations we expect of them.

This is understandably upsetting to people who work closely with hardware, and why you often see them asking for “nonsensical” things like “a release load”. They’re just trying to ask for “the thing that hardware already guarantees, please, it’s right there”. But alas, compiler memory models just don’t know what to do with it.

God damn compilers are way too hard.