The "AMD Cool'n'Quiet" firmware setting caused memory corruptions on my machine
A while ago I published a post on downgrading Nvidia proprietary drivers on NixOS in an attempt to troubleshoot a random kernel panic bug that had been affecting my machine for about two months. It turned out that while the Nvidia driver does indeed cause kernel panics once in a while, most of the freezes were actually unrelated to it.
The frequent kernel panics and hard freezes went away once I disabled the “AMD Cool’n’Quiet” option in the UEFI interface.
Symptoms, in case it helps
Here are the symptoms of the random freezes I saw before disabling “AMD Cool’n’Quiet”. The machine has a B550M DS3H board from Gigabyte and a Ryzen 5 3500X processor. The other parts have no effect on the freezing issue.
Logging:
- Sometimes it freezes without any logs.
- Sometimes there are some suspicious logs in journalctl --boot=-1, such as:
  - kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
  - list_del corruption, ffffabb8c179be58->next is NULL
- Most of the suspicious logs indicate that some contents of the memory just turned into zeros.
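
To spot such lines quickly after a reboot, the previous boot's journal can be filtered for them. The following is only a minimal sketch: the two patterns are taken from the log lines quoted above, and the function names `find_suspicious_lines` and `read_previous_boot_log` are made up for illustration. It assumes a systemd machine with a persistent journal, so that `journalctl --boot=-1` has something to show.

```python
import re
import subprocess

# Patterns taken from the two log lines quoted above; extend as needed.
SUSPICIOUS_PATTERNS = [
    re.compile(r"BUG: kernel NULL pointer dereference"),
    re.compile(r"list_del corruption"),
]


def find_suspicious_lines(log_text):
    """Return the log lines that match any of the suspicious patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SUSPICIOUS_PATTERNS)
    ]


def read_previous_boot_log():
    """Return the journal of the previous boot (journalctl --boot=-1)."""
    result = subprocess.run(
        ["journalctl", "--boot=-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `find_suspicious_lines(read_previous_boot_log())` returns the matching lines. An empty result does not mean the boot was healthy, since some freezes leave no logs at all.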
Severity:
- Sometimes it freezes but leaves the magic SysRq keys working so that I can press the REISUB sequence to reboot gracefully.
- Sometimes it freezes with everything dead, including the magic SysRq keys and netconsole.
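
Whether REISUB can work at all depends on the `kernel.sysrq` setting, which many distributions restrict by default. Below is a rough sketch that checks whether a given setting permits every key in the sequence. The bit values follow the kernel's SysRq documentation; the helper names `reisub_allowed` and `read_sysrq_setting` are hypothetical, and this is not something the troubleshooting above depended on.

```python
# Bit values of the kernel.sysrq bitmask, per the kernel's SysRq documentation.
SYSRQ_KEYBOARD = 4      # R: take the keyboard out of raw mode
SYSRQ_SYNC = 16         # S: sync all mounted filesystems
SYSRQ_REMOUNT_RO = 32   # U: remount filesystems read-only
SYSRQ_SIGNAL = 64       # E and I: send SIGTERM / SIGKILL to processes
SYSRQ_REBOOT = 128      # B: reboot immediately

REISUB_MASK = (
    SYSRQ_KEYBOARD | SYSRQ_SYNC | SYSRQ_REMOUNT_RO | SYSRQ_SIGNAL | SYSRQ_REBOOT
)


def reisub_allowed(sysrq_value):
    """Return True if the setting permits every key of the REISUB sequence."""
    if sysrq_value == 1:
        # 1 means "enable all SysRq functions".
        return True
    return (sysrq_value & REISUB_MASK) == REISUB_MASK


def read_sysrq_setting(path="/proc/sys/kernel/sysrq"):
    """Read the current kernel.sysrq value from procfs."""
    with open(path) as f:
        return int(f.read().strip())
```

For example, `reisub_allowed(read_sysrq_setting())` reports whether the running kernel would honor the whole sequence. This only covers the configuration: in a freeze where the SysRq keys themselves are dead, the setting makes no difference.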
Signs:
- Sometimes it freezes without any sign in advance.
- Sometimes weird things happen before a freeze, such as:
  - Firefox tabs (or the whole browser) crash.
  - Emacs refuses to start.
  - The BTRFS root subvolume becomes read-only due to some errors.