Source: Euigrp on reddit

I joined a company right out of college to work on a consumer electronics device about 5 months before we showed it off to the public. The device is Linux powered, and while I was still getting used to the idea of monkeying around in kernel space, I got pulled into a meeting where people were trying to triage bugs. Lots of bugs. Hundreds of bugs. All with a "this is impossible, how did this happen?" flavor to them.

"Memory Corruption!" they cried. "Oh boo hoo, fix your bugs," I thought. Looking through crash dumps we see... what's this? A program hit an illegal instruction while concatenating two strings using the standard lib function. Huh, that is weird... Next log: Page could not be retrieved from swap, on a device where there isn't any swap space. (Well I think I know why we couldn't retrieve it!)

Fine. On a Friday afternoon I wrote a short program. This program allocates 80% of system ram into one array and writes sequential integers. It then waits for a press of enter, then checks that the array's contents are still what it wrote. I load it up and give it a shot. I wait 30 seconds, then give it a check. Nope, no problem. I try a few more times - Ha, I knew it wasn't memory corruption! Finally I unplug my debug cable (USB) for about 10 seconds, then I plug it in and out real fast a few times, then put it back in. Bam! 90 errors.
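A minimal sketch of what such a test could look like, in C. The function names and the choice of 32-bit words are my own assumptions; the original program is not shown in the story.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill the buffer with sequential integers: buf[i] = i. */
static void fill_pattern(uint32_t *buf, size_t count)
{
    assert(buf != NULL);
    for (size_t i = 0; i < count; i++)
        buf[i] = (uint32_t)i;
}

/* Count the words that no longer hold the value fill_pattern wrote.
 * The volatile qualifier forces every word to be re-read from memory. */
static size_t count_mismatches(const volatile uint32_t *buf, size_t count)
{
    size_t errors = 0;
    for (size_t i = 0; i < count; i++)
        if (buf[i] != (uint32_t)i)
            errors++;
    return errors;
}
```

A driver around this would allocate a buffer sized to roughly 80% of RAM, call fill_pattern, block on getchar() until enter is pressed, then print the result of count_mismatches.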

Oh Fuck.

Ok, ok, so I had to mess around on the USB port to make it work. It is USB related then right? It isn't like the USB driver implements the magic bit fairy algorithm that sprinkles around bit errors at random. So it must be hardware right? No, it wasn't, but that didn't stop us from doing all manner of dastardly things with a 15,000 Volt static gun to this device's USB port. Hardware engineers who had long since moved on to the next product got pulled back to scratch their heads over this problem. I don't properly remember how much time we wasted proving to ourselves that the hardware was really, really, reallllly solid. The grounding was fine, the voltage was stable, the clocks ticked in time and the layout of the DDR lines was so beautiful you would weep at the sight.

The devices the hardware team were testing with grew more and more unstable. My guess is that the device would load things into memory, have bit errors, then flush it back to flash, maybe not even in the right location. (The page table was often getting corrupted, so it wasn't beyond belief that file tracking structures did as well. Contents could be written to wrong locations and file system structures would be broken etc.) Over time these devices began to deteriorate to the point where they couldn't boot reliably. A hardware engineer finally broke down and re-flashed with an image he had sitting around on his laptop. Relatively speaking, that image was ancient.

"Dude. It's software."

"What?!?! I assure you we didn't write the bit fairy!" Nope: he flashed a 3 month old software build, and the problem went away. At this point I felt responsible for leading a lot of people on a very long and pointless goose chase, so I stayed through the night and binary searched months of patches. (Full software builds of an entire operating system take longer than I'd like...)
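The binary search over patches can be sketched like this in C. Everything here is illustrative: the callback type, the function names, and the idea of numbering revisions are my assumptions, and in real life the callback would mean "build, flash, run the memory test", which is where the hours go.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if revision `rev` shows the corruption. */
typedef bool (*is_bad_fn)(size_t rev, void *ctx);

/* Given a known-clean revision `good` and a known-broken revision `bad`
 * (good < bad), return the first broken revision. */
static size_t first_bad_revision(size_t good, size_t bad,
                                 is_bad_fn is_bad, void *ctx)
{
    assert(good < bad);
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(mid, ctx))
            bad = mid;   /* culprit is at mid or earlier */
        else
            good = mid;  /* culprit is after mid */
    }
    return bad;
}

/* Stand-in for the real test: every revision at or after *ctx is bad. */
static bool bad_at_or_after(size_t rev, void *ctx)
{
    return rev >= *(const size_t *)ctx;
}
```

With about 90 revisions this needs at most 7 builds instead of 90. Tools such as git bisect automate the same idea.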

So, who was the magic patch? Someone added a driver to the kernel for a chip we were evaluating. This chip wasn't on this device.

Ha! We found a witch! BURN HER!

At this point a lot of people declared mission accomplished. Nearing release they were happy to have it narrowed to a patch they could simply back out and move on. We reverted the patch with extreme prejudice, built an image, tested it, and all was good. Little did we know that within days the same flaw was reintroduced into the kernel.

So wait. If that chip isn't on our board, how is the driver screwing with us? I run an lsmod, and no, the driver isn't loaded... "So fine, whatever, I'll delete the module file and reboot. Hold on, it keeps happening. That's not right..."

I'm now on my own, looking into what the hell was going on. I start to look deep into the patch. It was a lovely 10,000 line C file the chip vendor had provided us. To call it chaos would be charitable. (To their credit, they got us a much more sane driver a few weeks later.) After poking through it a little, I concluded there was no bit-twiddle-for-fun implementation. So what else was there? 48 bytes derived from 5 lines of code. A small little structure in a bootstrap file that would say what bus address to find this chip under. I deleted the massive pile of driver, but left the struct in. The problem remained.

So, boys and girls, we have ourselves an alignment problem! Somehow, leaving in this 48 byte struct is moving something around in memory in a way that causes a problem. I narrowed it down: putting anything bigger than 32 bytes and smaller than 64 bytes in that file would cause the problem. Finding that range out really didn't help, but it felt productive at the time.

The kernel build outputs a neat file called System.map. This lists where in kernel virtual address space all of the variables compiled into the kernel are. I found my little struct halfway through the ".data" section. The .data section is full of initialized variables, so as the kernel's binary is unpacked into RAM, it will fill all of these in from the compiled image. Using System.map as a guide, I implemented a rather haphazard binary search. This ended up mostly being a binary search over the linker order of the various C files. I would pick a variable where I'd like to do a compare, find which file in the kernel contained it, put my magic struct next to it in that file, and see whether the problem was reproducible or not.
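For anyone who hasn't opened one: each System.map line is a hex address, a one-letter symbol type in the style of nm (D or d for initialized data, B or b for .bss, T or t for code), and the symbol name. A small C sketch for reading such a line follows; the struct and function names are my own inventions, not anything from the kernel tree.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct map_symbol {
    unsigned long long addr;
    char type;
    char name[128];
};

/* Parse one line of the form "<hex address> <type letter> <name>".
 * Returns 1 on success, 0 if the line does not match. */
static int parse_map_line(const char *line, struct map_symbol *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%llx %c %127s",
                  &out->addr, &out->type, out->name) == 3;
}

/* Map the nm-style type letter to a section family. */
static const char *section_of(char type)
{
    switch (type) {
    case 'D': case 'd': return ".data";
    case 'B': case 'b': return ".bss";
    case 'T': case 't': return ".text";
    default:            return "other";
    }
}
```

Sorting the parsed symbols by address gives the layout order, which is what a search over variable placement like this one walks through.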

My search wound its way into the last few elements of .data and turned up empty handed. It was not in initialized variable memory. Scrolling down further in the System.map, I realized there was an entire section that I had neglected, the .bss section where uninitialized variables go. Learning from my previous mistake, I tested the beginning and end first. Sure enough, an uninitialized variable placed at the beginning of that section would cause the problem, and one placed at the end of the section wouldn't. It was only a matter of time before I found the culprit. The variable whose movement caused a problem was...

A function pointer?!?

How on earth does the alignment of a function pointer mean life or death for our system? On ARM you can't read words from unaligned addresses, meaning every 32 bit variable needed to be placed at a memory address that was divisible by 4. The function pointer shouldn't be any different, and it always got at least that minimum. In fact, in the problem case, the address was divisible by a power of two greater than or equal to 64. Any less and the problem went away. The pointer's alignment was too good.
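To make "too good" concrete, here is a small C sketch that computes how well an address is aligned, meaning the largest power of two that divides it. The function names are mine, and the threshold of 64 is simply the one observed in the story.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two that divides addr (addr must be non-zero). */
static uintptr_t alignment_of(uintptr_t addr)
{
    assert(addr != 0);
    return addr & (~addr + 1); /* isolates the lowest set bit */
}

/* The problem case: the address is aligned to 64 bytes or more. */
static int is_over_aligned(uintptr_t addr)
{
    return alignment_of(addr) >= 64;
}
```

Feed it addresses out of System.map: something ending in 0x44 is merely 4-byte aligned and harmless, while something ending in 0x40 is 64-byte aligned and lands in the danger zone.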

There is no such thing as too good of an alignment. At least there wasn't until this bug.

Now this function pointer wasn't your grandpa's pointer. It pointed somewhere special. There was a region of SRAM inside our CPU that we could use when we weren't able to use RAM, for various bootstrap purposes. To save power while idle, we copy a routine into this special location, set this particular function pointer to point to it, then call it. What did this routine do? Let's find the assembly file it came from and have a look. At this point I'm no ARM assembly guru, but the comments were alarming enough.

// Calculate the address of a memory mapped control register
...
...
// Now we turn off the memory controller and put the LPDDR into self refresh mode

Hold on, you do what?!? You went from doing some basic register operations to turning off the memory controller in quite a hurry there. I shot an email to the vendor who wrote this routine asking them if they missed a step.

Their response (3 days later) was something along the lines of "Oh yeah, there totally should be a memory barrier there." It turns out they may have had to do extra TLB maintenance if you happened to write to a memory address divisible by 64, due to something in their L2 cache structure. In those cases we would still be using RAM when we turned off the controller.

Given the minimum 4-byte alignment requirement for most variables, and that the problem only showed up when the pointer landed on an address divisible by 64, we had a 1 in 16 shot of having a completely unusable system every time we compiled.
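The 1 in 16 is just 64 divided by 4, under the assumption (mine, for the sake of the arithmetic) that the linker is equally likely to drop the pointer on any 4-byte-aligned address. A tiny C sketch that counts the slots:

```c
#include <assert.h>

/* Count the offsets in [0, span) that are multiples of `step`. */
static unsigned slots(unsigned span, unsigned step)
{
    assert(step != 0);
    return (span + step - 1) / step;
}
```

Over a 4096-byte window there are slots(4096, 4) = 1024 legal positions for the pointer, of which slots(4096, 64) = 64 are also divisible by 64, so one build in sixteen draws the short straw.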

In the end, the product shipped with the memory barrier in place, rock solid, and the customers loved it.

Oh, and if you're wondering, I couldn't see it with a USB cable in because we can't go into that low power state while using USB. Totally a USB problem.



© 2017-02-17
qznc