stefan-gloor.ch

DDR5 Fault Injection Platform for Rowhammer Research

Pile of DRAM sticks
Title picture by Wilbysuffolk, CC BY-SA 4.0, via Wikimedia Commons

As part of my Bachelor’s thesis, I designed a novel DDR5 fault injection system used for Rowhammer research consisting of a custom interposer PCB, a microcontroller board and a control server.

Rowhammer

The Rowhammer vulnerability has been known to affect nearly all computer and server DRAM devices since around 2014. DRAM works by storing information as electrical charge in tiny capacitors. Because of leakage currents, the capacitors need to be refreshed periodically in order to retain the stored information.

Diagram showing that a memory chip on a RAM stick consists of a grid of memory cells, each consisting of a transistor and a capacitor.
Simplified, DRAM devices consist of a grid of memory cells, each consisting of a storage capacitor and an access transistor.

In essence, Rowhammer is a memory integrity issue with proven and severe security implications. By repeatedly accessing certain memory rows, it is possible to corrupt physically nearby memory cells and induce bit flips in them. These bit flips are especially valuable to an attacker if the corrupted cells are normally not accessible to the attacker, because they belong to the memory of another process or the operating system kernel. In other words, with Rowhammer, an attacker can completely bypass the memory isolation enforced by any operating system. This means that an unprivileged process can use Rowhammer bit flips to manipulate page tables, tamper with process data structures and escalate its privileges to full root access. Rowhammer breaks the fundamental assumption of memory integrity. Hence, normal operating systems are unable to detect or prevent Rowhammer bit flips.

Timeline and Mitigations

When Rowhammer was first publicly disclosed in 2014, DDR3 devices appear to have had no active protection against these kinds of attacks. However, it is likely that manufacturers were aware of the Rowhammer problem before this publication, especially in the context of the next generation of DRAM, DDR4, which was also released in 2014.

With DDR4 devices, manufacturers started to incorporate various forms of mitigations. Most of them included some form of Rowhammer detection mechanism (e.g., counting the number of row activations) and an early refresh of the potential victim rows. The mitigations were necessary because Rowhammer had become an even bigger issue on DDR4, as capacitor sizes and physical row separation shrank with the smaller node sizes used for DDR4. However, these mitigations were not effective. By using a novel fuzzing technique designed to outsmart the mitigation algorithms, Jattke et al. managed to bypass the mitigations on all tested devices.
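The counter-based idea described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the class `ToyBank`, the thresholds and the one-row blast radius are made up for illustration, and real in-DRAM mitigations are proprietary and considerably more complex.

```python
from collections import Counter

class ToyBank:
    """Toy DRAM bank with per-row disturbance and an optional activation counter."""

    def __init__(self, rows, flip_threshold, mitigation_threshold=None):
        self.rows = rows
        self.flip_threshold = flip_threshold              # hypothetical disturbance limit
        self.mitigation_threshold = mitigation_threshold  # None disables the mitigation
        self.disturbance = [0] * rows                     # accumulated disturbance per row
        self.activations = Counter()                      # activation count per row
        self.flipped = set()                              # rows that suffered a bit flip

    def _neighbors(self, row):
        return [r for r in (row - 1, row + 1) if 0 <= r < self.rows]

    def activate(self, row):
        # Activating a row restores its own charge and disturbs both neighbors.
        self.disturbance[row] = 0
        for victim in self._neighbors(row):
            self.disturbance[victim] += 1
            if self.disturbance[victim] >= self.flip_threshold:
                self.flipped.add(victim)
        # Assumed mitigation: count activations, refresh the neighbors early.
        if self.mitigation_threshold is not None:
            self.activations[row] += 1
            if self.activations[row] >= self.mitigation_threshold:
                for victim in self._neighbors(row):
                    self.disturbance[victim] = 0
                self.activations[row] = 0

def hammer(bank, aggressors, rounds):
    """Activate the aggressor rows in turn and return the rows with bit flips."""
    for _ in range(rounds):
        for row in aggressors:
            bank.activate(row)
    return sorted(bank.flipped)

# Double-sided pattern: rows 4 and 6 are the aggressors, row 5 is the victim.
print(hammer(ToyBank(16, flip_threshold=1000), [4, 6], 600))
print(hammer(ToyBank(16, flip_threshold=1000, mitigation_threshold=200), [4, 6], 600))
```

In this model the unprotected bank flips in row 5, while the counting bank refreshes the victim long before the disturbance limit is reached. The model says nothing about how such counters can be bypassed in practice.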

For DDR5, the latest generation of DRAM, released in 2020, the state of Rowhammer is somewhat unknown. It is clear that the techniques that worked for DDR4 are no longer effective on DDR5. Since the devices got even smaller and denser, it is likely that Rowhammer is still an inherent issue. This means that the proprietary mitigation techniques present in DDR5 devices must have significantly improved. For one, DDR5 devices now contain on-die error correction, which is able to correct a small number of bit flips on the fly. However, there must be additional mitigation mechanisms at play.

Triggering Bit Flips Using Fault Injection

One way to completely disable Rowhammer mitigations is to prevent row refreshing altogether. This is because mitigations use the time provided by a refresh command to refresh a victim row. While the absence of refresh commands may lead to expected retention failures of the memory cells, the time scale for this is on the order of multiple seconds, while Rowhammer bit flips can be triggered in a much shorter time span. Additionally, Rowhammer bit flips can easily be distinguished from retention failures due to their location and frequency.

The absence of refresh commands violates the JEDEC specification and is therefore not implemented in normal memory controllers present in commodity CPUs. It is therefore also not a security concern, as it requires specialized hardware and physical access to the victim machine. However, this technique can help to understand the Rowhammer susceptibility of a DRAM device and the inner workings of its mitigations. This knowledge can later be leveraged to reverse engineer, and ideally defeat, the deployed mitigations in a standard environment.

One way to achieve this is to use fault injection. The idea behind this approach is to alter signals on the parallel command bus of the DDRx device in such a way that one command is transformed into another. Looking at the DDR4 command encoding from the JEDEC standard, it is evident that refresh commands can be suppressed by shorting the A14 pin to GND. Forcing A14 low also transforms Read into Write commands and has implications for the addressable range, but this can be accounted for in the experiment design. The bus also features a parity signal over the command bits, which leads to the corrupted command being discarded by the memory device.
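The effect of shorting A14 to GND can be traced in a few lines of Python. This is a simplified sketch, not a model of a real device: the pin levels follow the DDR4 command truth table, but parity is computed only over the four pins modeled here (on a real bus it covers more command/address signals), and the function names are made up for illustration.

```python
# Pin order: (ACT_n, RAS_n/A16, CAS_n/A15, WE_n/A14), sampled while CS_n is low.
COMMANDS = {
    (1, 0, 0, 0): "MRS",  # mode register set
    (1, 0, 0, 1): "REF",  # refresh
    (1, 0, 1, 0): "PRE",  # precharge
    (1, 1, 0, 0): "WR",   # write
    (1, 1, 0, 1): "RD",   # read
    (1, 1, 1, 0): "ZQC",  # ZQ calibration
    (1, 1, 1, 1): "NOP",  # no operation
}
ENCODE = {name: pins for pins, name in COMMANDS.items()}

def decode(pins):
    """Decode the command pins the way the device would see them."""
    if pins[0] == 0:
        return "ACT"  # with ACT_n low, the other pins carry row address bits
    return COMMANDS.get(pins, "RFU")

def parity_bit(pins):
    """Even parity: the parity bit makes the total number of ones even."""
    return sum(pins) % 2

def inject_a14_low(pins):
    """Model the interposer shorting WE_n/A14 to GND."""
    act_n, ras_n, cas_n, _we_n = pins
    return (act_n, ras_n, cas_n, 0)

def dram_receives(name):
    """Return (decoded command, parity_ok) for a command sent during injection."""
    sent = ENCODE[name]
    par = parity_bit(sent)        # computed by the controller before the fault
    seen = inject_a14_low(sent)   # what actually arrives at the device
    return decode(seen), parity_bit(seen) == par

for name in ("REF", "RD", "WR", "PRE"):
    print(name, "->", dram_receives(name))
```

In this model, a refresh arrives looking like a mode register set with a parity mismatch, which is the case the device discards. Commands whose A14 level was already low pass through unchanged.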

Table with DDR4 command encodings.
Subset of the DDR4 command encoding (Table by Wikipedia.org, derivative work of JEDEC Standard JESD79-4: DDR4 SDRAM).

This idea has been implemented by Cojocar et al. for DDR4 in the mFIT system. It consists of an interposer PCB that is able to intercept and manipulate the bus signals. It is put between a standard computer motherboard and an unmodified DRAM module.

3D concept drawing of DRAM interposer. It shows how the interposer PCB is plugged between a computer motherboard and a DRAM module.
Concept of a DRAM interposer.

It proved to be a simple and very cost-effective way to disable on-device Rowhammer mitigations and estimate a device's Rowhammer susceptibility.

My Contribution: Fault Injection for DDR5

Overview of the DDR5 fault injection system showing the DDR5 interposer with device-under-test memory module and host machine, its injection controller and the control server that controls the experiment.
Overview of the DDR5 fault injection system I designed.

My project is about the design of a system similar to mFIT, but for DDR5. This comes with some challenges: DDR5 uses higher clock speeds than DDR4, has no parity signal, a narrower command bus and a more complex command encoding. DDR5 also introduced subchannels, which essentially allow two entirely separate memory channels to co-exist on the same memory module. To make room for this change, the command bus is now narrower and instead features two-cycle commands.

Drawing showing a rough DDR5 pinout. There are two subchannels, each with its own command and address bus.
Approximate pin locations for a DDR5 UDIMM module. On DDR5, there are two independent subchannels, each with its own command/address (CA) bus.

This is challenging for fault injection, as a fault now necessarily affects two bits in a command that spreads over two clock cycles. Also, due to the lack of a parity signal, true command suppression is not possible anymore. It is only possible to transform one command into another.
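To make the two-cycle problem concrete, here is a small sketch of a static fault on one CA pin. The bus width `CA_WIDTH` and the bit patterns are arbitrary placeholders, not real DDR5 command encodings. The point is only that a pin held at a fixed level is forced in both cycles of a command, and that without parity the device simply decodes whatever pattern results.

```python
CA_WIDTH = 14  # assumed bus width, for illustration only

def force_pin(words, pin, level):
    """Apply a static fault on one CA pin to every cycle of a command."""
    mask = 1 << pin
    if level:
        return tuple(w | mask for w in words)
    return tuple(w & ~mask for w in words)

def changed_bits(original, faulty):
    """List the (cycle, pin) positions where the fault changed the value."""
    return [
        (cycle, pin)
        for cycle, (o, f) in enumerate(zip(original, faulty))
        for pin in range(CA_WIDTH)
        if ((o ^ f) >> pin) & 1
    ]

# Hypothetical two-cycle command: one CA word per clock cycle.
command = (0b10011, 0b00001)

print(changed_bits(command, force_pin(command, 0, 0)))  # pin 0 forced low
print(changed_bits(command, force_pin(command, 4, 0)))  # pin 4 forced low
```

Forcing pin 0 low changes a bit in both cycles, while forcing pin 4 low only changes the first cycle because the second one was already low. Which commands can be turned into which others therefore depends on the levels of the chosen pin in both halves of the encoding.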

Designing the Interposer

The goal now was to design an interposer that could, in contrast to the original DDR4 mFIT, force a signal either high or low, and perform this action on multiple command bits. For this, I used a collection of solid-state, high-frequency switches that can connect the intercepted DIMM signals normally to the CPU, or statically to either GND or VCC.

The PCB design required a few iterations to get the high-frequency circuit working and to bring impedance mismatches and crosstalk down to an acceptable level. The design was done in KiCad.

Rendering of the DDR5 fault injection interposer with circuit details redacted.
Rendering of the DDR5 fault injection interposer I designed (details redacted). Not shown is the vertical DDR5 UDIMM slot which is mounted on top.

Experiment Machine Software

Ideally, we would have full control over all memory accesses on the experiment machine (where the interposer is plugged in). In reality, this is nearly impossible using a standard operating system, as there are too many unpredictable processes potentially interfering with the memory accesses from the experiment.

In order to keep the memory noise to a minimum, we decided to not use an operating system at all and instead run bare-metal software. For this, we decided to use an existing, open-source UEFI app as a basis, which provides a simple runtime for our experiment code.

I patched the code to run custom Rowhammer experiments (i.e., custom memory access patterns) and modified the existing USB HID stack for communication with the injection controller.

Using this system, the experiment machine is able to inject a fault into its DDR5 bus, hammer a memory region and then check this region for Rowhammer bit flips.
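The final check for bit flips boils down to comparing the hammered region against the pattern it was filled with. The sketch below shows that logic in Python. The actual experiment code runs bare-metal, so this is only an illustration, and the function name `find_bit_flips` and the fill byte are assumptions.

```python
def find_bit_flips(memory, expected_byte):
    """Return (offset, bit, direction) for every bit that differs from the fill pattern."""
    flips = []
    for offset, value in enumerate(memory):
        diff = value ^ expected_byte  # set bits mark the flipped positions
        for bit in range(8):
            if (diff >> bit) & 1:
                direction = "1->0" if (expected_byte >> bit) & 1 else "0->1"
                flips.append((offset, bit, direction))
    return flips

# Simulated region filled with 0xAA, with two bit flips planted by hand.
region = bytearray([0xAA] * 64)
region[10] ^= 0x04  # bit 2 was 0 in the pattern
region[33] ^= 0x80  # bit 7 was 1 in the pattern

print(find_bit_flips(region, 0xAA))
```

Recording the direction of each flip alongside its location is useful because both tend to be reported when characterizing a device.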

Injection Controller

In order to control the switches on the interposer, I used an external microcontroller. For this, I designed a simple carrier board for a Teensy microcontroller board that fans out the required signals to an Ethernet jack and a connector for the injection controller. I chose this Teensy microcontroller kit as its predecessor was also used in the original mFIT and because it can easily be networked for automation.

Diagram showing that the injection controller is controlled by the experiment machine to inject a fault into the experiment machine using the interposer.
The injection controller allows the experiment machine to inject a fault into its own memory via the interposer.

The injection controller waits for a data packet from the experiment machine. This packet signals that the experiment machine is ready for the fault injection. The microcontroller then activates the switches on the interposer to suppress refresh commands, while the experiment machine performs Rowhammer on the memory module under test. After the experiment, the injection controller removes the fault and the experiment machine scans its memory for bit flips. The results are collected automatically and sent to the injection controller.
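The handshake described above can be sketched as a small state machine. This is not the firmware running on the Teensy. The packet names (`"ready"`, `"done"`, `"results"`) are hypothetical, and using a second packet to signal the end of the experiment is an assumption, since a timer would work just as well.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    INJECTING = "injecting"

class InjectionController:
    """Toy model of the controller side of the handshake."""

    def __init__(self, set_switches):
        self.set_switches = set_switches  # callback that drives the interposer switches
        self.state = State.IDLE
        self.results = []

    def on_packet(self, kind, payload=None):
        if self.state is State.IDLE and kind == "ready":
            self.set_switches(True)   # inject the fault before hammering starts
            self.state = State.INJECTING
        elif self.state is State.INJECTING and kind == "done":
            self.set_switches(False)  # remove the fault before the memory scan
            self.state = State.IDLE
        elif self.state is State.IDLE and kind == "results":
            self.results.append(payload)
        else:
            raise ValueError(f"unexpected packet {kind!r} in state {self.state.name}")

# Usage: record the switch activity instead of driving real hardware.
switch_log = []
controller = InjectionController(switch_log.append)
controller.on_packet("ready")
controller.on_packet("done")
controller.on_packet("results", {"flips": 3})
print(switch_log, controller.results)
```

Rejecting packets that arrive in the wrong state keeps the fault from staying active by accident, for example if a "ready" packet is repeated.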

The injection controller also exposes a simple HTTP API that is accessed by the control server, which oversees and automates the whole experiment. On the control server, I also set up a local PXE server to autonomously boot the generated images on the experiment machine.

---

This project was a lot of work, but I am proud to have successfully completed the challenge. I would like to thank Prof. Dr. Kaveh Razavi, Patrick Jattke and the Computer Security Group (COMSEC) at ETH Zurich for making this project possible.