DDR5 Fault Injection Platform for Rowhammer Research
As part of my Bachelor’s thesis, I designed a novel DDR5 fault injection system for Rowhammer research. It consists of a custom interposer PCB, a microcontroller board and a control server.
Rowhammer
The Rowhammer vulnerability has been known to affect nearly all desktop and server DRAM devices since it was publicly disclosed in 2014. DRAM works by storing information as electrical charge in tiny capacitors. Because of leakage currents, the capacitors need to be refreshed periodically in order to retain the stored information.
In essence, Rowhammer is a memory integrity issue with proven and severe security implications. By repeatedly accessing certain memory rows, it is possible to corrupt physically nearby memory cells and induce bit flips in them. These bit flips are especially valuable to an attacker if the corrupted cells are normally inaccessible to the attacker because they belong to the memory of another process or the operating system kernel. In other words, with Rowhammer, an attacker can completely bypass the memory isolation enforced by an operating system. This means that an unprivileged process can use Rowhammer bit flips to manipulate page tables, tamper with process data structures and escalate its privileges to full root access. Rowhammer breaks the fundamental assumption of memory integrity. Hence, normal operating systems are unable to detect or prevent Rowhammer bit flips.
Timeline and Mitigations
When Rowhammer was first publicly disclosed in 2014, DDR3 devices appear to have had no active protection against these kinds of attacks. However, it is likely that manufacturers were aware of the Rowhammer problem before this publication, especially in the context of the next generation of DRAM, DDR4, which was also released in 2014.
With DDR4 devices, manufacturers started to incorporate various forms of mitigations. Most of them included some form of Rowhammer detection mechanism (e.g., counting the number of row activations) combined with an early refresh of the potential victim rows. The mitigations were necessary because Rowhammer had become an even bigger issue on DDR4, as capacitor sizes and physical row separation shrank with the smaller process nodes used for DDR4. However, these mitigations were not effective. By using a novel fuzzing technique designed to outsmart the mitigation algorithms, Jattke et al. managed to bypass the mitigations on all tested devices.
For DDR5, the latest DRAM generation, released in 2020, the state of Rowhammer is somewhat unknown. It is clear that the techniques that worked for DDR4 are no longer effective on DDR5. Since the devices got even smaller and denser, it is likely that Rowhammer is still an inherent issue. This suggests that the proprietary mitigation techniques present in DDR5 devices must have significantly improved. For one, DDR5 devices now contain on-die error correction, which is able to correct a small number of bit flips on the fly. However, there must be additional mitigation mechanisms at play.
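To make the idea of a counter-based mitigation concrete, here is a minimal Python sketch. It is a toy model only: the class name, the threshold and the "refresh both direct neighbours" policy are assumptions for illustration, since the real in-DRAM mitigations are proprietary and undocumented.

```python
from collections import Counter


class CounterMitigation:
    """Toy model: count activations per row, refresh neighbours past a threshold."""

    def __init__(self, threshold, num_rows):
        self.threshold = threshold   # hypothetical activation limit
        self.num_rows = num_rows
        self.counts = Counter()
        self.refreshed = []          # log of (aggressor, victim) early refreshes

    def activate(self, row):
        self.counts[row] += 1
        if self.counts[row] >= self.threshold:
            # Refresh the physically adjacent rows, if they exist.
            for victim in (row - 1, row + 1):
                if 0 <= victim < self.num_rows:
                    self.refreshed.append((row, victim))
            self.counts[row] = 0


mit = CounterMitigation(threshold=4, num_rows=8)
for _ in range(4):
    mit.activate(3)      # hammer row 3
print(mit.refreshed)     # [(3, 2), (3, 4)]
```

A real device cannot afford a counter for every row, so it has to track only a subset of them. This is presumably the kind of limitation that carefully crafted access patterns exploit.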
Triggering Bit Flips Using Fault Injection
One way to completely disable Rowhammer mitigations is to prevent row refreshing altogether. This works because mitigations use the time provided by a refresh command to refresh a victim row. While the absence of refresh commands eventually leads to retention failures of the memory cells, these occur on a time scale of multiple seconds, whereas Rowhammer bit flips can be triggered in a much shorter time span. Additionally, Rowhammer bit flips can easily be distinguished from retention failures due to their location and frequency.
The absence of refresh commands violates the JEDEC specification and is therefore not implemented in normal memory controllers present in commodity CPUs. It is therefore also not a security concern, as it requires specialized hardware and physical access to the victim machine. However, this technique can help to understand the Rowhammer susceptibility of a DRAM device and the inner workings of its mitigations. This knowledge can later be leveraged to reverse engineer, and ideally defeat, the deployed mitigations in a standard environment.
One way to achieve this is fault injection. The idea behind this approach is to alter signals on the parallel command bus of the DDRx device in such a way that a command is transformed into another. By looking at the DDR4 command encoding from the JEDEC standard, it is evident that it is possible to suppress refresh commands by shorting the A14 pin to GND. Forcing A14 low also transforms Read into Write commands and has implications for the addressable range, but this can be accounted for in the experiment design. The bus also features a parity signal over the command bits, which leads to the corrupted command being discarded by the memory device.
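The time-scale argument can be checked with some rough arithmetic. The numbers below are assumptions chosen for illustration (about 50 ns per row activation, a 100 ms hammering window, retention failures starting after about 2 s), not measurements from this project.

```python
T_RC_NS = 50             # assumed time per row activation, in nanoseconds
HAMMER_WINDOW_MS = 100   # assumed duration of one experiment without refresh
RETENTION_S = 2          # assumed onset of retention failures, in seconds


def activations_in(window_ms, t_rc_ns=T_RC_NS):
    """Number of back-to-back row activations that fit into the window."""
    return (window_ms * 1_000_000) // t_rc_ns


print(activations_in(HAMMER_WINDOW_MS))        # 2000000
print(HAMMER_WINDOW_MS / 1000 / RETENTION_S)   # 0.05
```

Under these assumptions, millions of activations fit into a window that covers only a small fraction of the time after which retention failures would be expected.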
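The effect of forcing A14 low can be sketched with a simplified model of the DDR4 command truth table. The sketch assumes CS_n is low and ACT_n is high, and it computes parity over only the three command bits shown. On a real device, parity covers more signals (address and bank bits, among others), so this is an illustration and not a faithful implementation.

```python
# Simplified DDR4 command decode for CS_n low and ACT_n high.
# Tuple order: (RAS_n/A16, CAS_n/A15, WE_n/A14); 0 = low, 1 = high.
COMMANDS = {
    (0, 0, 0): "MRS",
    (0, 0, 1): "REF",
    (0, 1, 0): "PRE",
    (0, 1, 1): "RFU",
    (1, 0, 0): "WR",
    (1, 0, 1): "RD",
    (1, 1, 0): "ZQC",
    (1, 1, 1): "NOP",
}
ENCODE = {name: bits for bits, name in COMMANDS.items()}


def even_parity(bits):
    """Parity bit that makes the total number of ones even."""
    return sum(bits) % 2


def send(name, force_a14_low=False, parity_check=True):
    """Return what the DRAM decodes when the controller sends `name`."""
    bits = ENCODE[name]
    par = even_parity(bits)             # computed by the controller before the fault
    if force_a14_low:
        bits = (bits[0], bits[1], 0)    # the interposer pulls WE_n/A14 to GND
    if parity_check and even_parity(bits) != par:
        return "DISCARDED"              # parity mismatch: the device drops the command
    return COMMANDS[bits]


print(send("REF", force_a14_low=True))                      # DISCARDED
print(send("REF", force_a14_low=True, parity_check=False))  # MRS
print(send("RD", force_a14_low=True, parity_check=False))   # WR
print(send("WR", force_a14_low=True))                       # WR
```

Commands that already have A14 low, such as WR, pass through unchanged. During an Activate command (ACT_n low), the same pin carries a row address bit, which is where the restriction on the addressable range comes from.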
This idea has been implemented by Cojocar et al. for DDR4 in the mFIT system. It consists of an interposer PCB that is able to intercept and manipulate the bus signals. It is put between a standard computer motherboard and an unmodified DRAM module.
It proved to be a simple and very cost-effective way to disable on-device Rowhammer mitigations and estimate a device's Rowhammer susceptibility.
My Contribution: Fault Injection for DDR5
My project is about the design of a system similar to mFIT, but for DDR5. This comes with some challenges: DDR5 uses higher clock speeds than DDR4, has no parity signal, and uses a narrower command bus with a more complex command encoding. DDR5 also introduced subchannels, which essentially allow two entirely separate memory channels to co-exist on the same memory module. To make room for this change, the command bus is now narrower and instead features two-cycle commands.
This is challenging for fault injection, as a fault now necessarily affects two bits in a command that spreads over two clock cycles. Also, due to the lack of a parity signal, true command suppression is not possible anymore. It is only possible to transform one command into another.
Designing the Interposer
The goal now was to design an interposer that could, in contrast to the original DDR4 mFIT, force a signal either high or low, and perform this action on multiple command bits. For this, I used a collection of solid-state, high-frequency switches that can connect the intercepted DIMM signals normally to the CPU, or statically to either GND or VCC.
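The interaction between the three switch states and two-cycle commands can be modelled in a few lines of Python. The command words and pin numbers below are hypothetical and do not correspond to real DDR5 encodings. The sketch only shows that a static fault on one pin is applied to both cycles of a command.

```python
from enum import Enum


class Switch(Enum):
    PASS = "pass"   # pin connected through to the CPU
    LOW = "low"     # pin tied to GND
    HIGH = "high"   # pin tied to VCC


def apply_fault(cycles, switches):
    """Return the bus words the DRAM sees; `switches` maps pin index -> Switch."""
    out = []
    for word in cycles:
        for pin, state in switches.items():
            if state is Switch.LOW:
                word &= ~(1 << pin)
            elif state is Switch.HIGH:
                word |= 1 << pin
        out.append(word)
    return out


def changed_bits(before, after):
    """Count the bits that differ between the original and the faulty command."""
    return sum(bin(b ^ a).count("1") for b, a in zip(before, after))


command = [0b10110, 0b01101]   # hypothetical two-cycle command, not a real encoding
faulty = apply_fault(command, {2: Switch.LOW})
print([bin(w) for w in faulty])       # ['0b10010', '0b1001']
print(changed_bits(command, faulty))  # 2
```

Because the switch state is static, it cannot target a single cycle. Any forced pin has to be chosen so that the resulting change in both cycles still produces a useful command.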
The PCB design required a few iterations to get the high-frequency circuit working and to bring impedance mismatches and crosstalk down to an acceptable level. The design was done in KiCad.
Experiment Machine Software
Ideally, we would have full control over all memory accesses on the experiment machine (where the interposer is plugged in). In reality, this is nearly impossible using a standard operating system, as there are too many unpredictable processes potentially interfering with the memory accesses from the experiment.
In order to keep the memory noise to a minimum, we decided to not use an operating system at all and instead run bare-metal software. For this, we decided to use an existing, open-source UEFI app as a basis, which provides a simple runtime for our experiment code.
I patched the code to run custom Rowhammer experiments (i.e., custom memory access patterns) and modified the existing USB HID stack for communication with the injection controller.
Using this system, the experiment machine is able to inject a fault into its DDR5 bus, hammer a memory region and then check this region for Rowhammer bit flips.
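The final checking step boils down to comparing memory against a known fill pattern. The following Python sketch shows that comparison on a small buffer. The function name and the 0xAA pattern are assumptions for illustration, and the actual bare-metal code is not written in Python.

```python
def find_bit_flips(memory, pattern):
    """Compare every byte with the expected fill pattern and list flipped bits."""
    flips = []
    for offset, value in enumerate(memory):
        diff = value ^ pattern
        for bit in range(8):
            if diff & (1 << bit):
                direction = "1->0" if pattern & (1 << bit) else "0->1"
                flips.append((offset, bit, direction))
    return flips


region = bytearray([0xAA] * 16)   # victim region filled with a known pattern
region[5] ^= 0b00001000           # simulate one flip (stand-in for the hammering step)
print(find_bit_flips(region, 0xAA))   # [(5, 3, '1->0')]
```

Recording the offset, bit position and direction of each flip is what makes it possible to tell Rowhammer flips apart from retention failures later on.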
Injection Controller
In order to control the switches on the interposer, I used an external microcontroller. For this, I designed a simple carrier board for a Teensy microcontroller board that fans out the required signals to an Ethernet jack and a connector for the injection controller. I chose this Teensy microcontroller kit as its predecessor was also used in the original mFIT and because it can easily be networked for automation.
The injection controller waits for a data packet from the experiment machine. This packet signals that the experiment machine is ready for the fault injection. The microcontroller then activates the switches on the interposer to suppress refresh commands, while the experiment machine performs Rowhammer on the memory module under test. After the experiment, the injection controller removes the fault and the experiment machine scans its memory for bit flips. The results are collected automatically and sent to the injection controller.
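The sequence above can be summarized as a small simulation. The packet names (`READY`, `DONE`, `RESULTS`) and the `FakeInterposer` class are hypothetical. In particular, the explicit `DONE` packet is an assumption about how the controller learns that hammering has finished.

```python
class FakeInterposer:
    """Stand-in for the switch outputs; records every state change."""

    def __init__(self):
        self.fault_active = False
        self.log = []

    def set_fault(self, active):
        self.fault_active = active
        self.log.append("fault on" if active else "fault off")


def run_experiment(interposer, packets):
    """Process packets from the experiment machine in order and return the results."""
    results = None
    for kind, payload in packets:
        if kind == "READY":        # machine is about to start hammering
            interposer.set_fault(True)
        elif kind == "DONE":       # assumed end-of-hammering signal
            interposer.set_fault(False)
        elif kind == "RESULTS":    # bit flips found by the memory scan
            results = payload
    if interposer.fault_active:    # never leave the fault in place by accident
        interposer.set_fault(False)
    return results


board = FakeInterposer()
packets = [("READY", None), ("DONE", None), ("RESULTS", [(5, 3, "1->0")])]
print(run_experiment(board, packets))   # [(5, 3, '1->0')]
print(board.log)                        # ['fault on', 'fault off']
```

Removing the fault before the memory scan matters: as long as refresh commands are suppressed, the scan itself could be disturbed by retention failures.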
The injection controller also exposes a simple HTTP API that is accessed by the control server, which oversees and automates the whole experiment. On the control server, I also set up a local PXE server to autonomously boot the generated images on the experiment machine.
---
This project was a lot of work, but I am proud to have successfully completed the challenge. I would like to thank Prof. Dr. Kaveh Razavi, Patrick Jattke and the Computer Security Group (COMSEC) at ETH Zurich for making this project possible.