error injection

Mauro Carvalho Chehab edited this page Dec 12, 2024 · 22 revisions

Firmware error injection using QEMU

Setting up a firmware-first error injection environment on QEMU is not hard. The main requirement is a QEMU build whose QMP interface supports error injection.

ARM processor QEMU error injection

1. Build QEMU

There's a QEMU patch adding support for error injection that was based on this patch series. An enhanced version of it is at:

https://gitlab.com/mchehab_kernel/qemu/-/tree/qemu_submission_v16?ref_type=heads

Compiling QEMU with this patch applied adds a QMP extension for error injection that is compatible with the UEFI 2.9A errata.

To build QEMU with only arm64 support, run:

git clone https://gitlab.com/mchehab_kernel/qemu -b arm-error-inject
mkdir qemu/build
cd qemu/build
../configure --target-list=aarch64-softmmu --enable-slirp
make

Alternatively, you can run configure without --target-list, but in that case it will build support for multiple architectures.

2. Download an arm64 filesystem

An arm64 image is needed. For instance, you can download one from http://cdimage.debian.org/cdimage/cloud/sid/daily/.

3. Generate an arm64 UEFI bios for QEMU

Please see section EDK2 in https://people.kernel.org/jic23/howto-test-cxl-enablement-on-arm64-using-qemu for some instructions.

4. Compile the Linux Kernel with RAS enabled

To test it, build the Linux kernel for ARM64 from its default config, with the RAS and tracing features enabled, e.g.:

 make defconfig
 ./scripts/config -e CONFIG_FTRACE -e CONFIG_FTRACE_SYSCALLS -e CONFIG_TRACEPOINTS -e CONFIG_TRACING -e CONFIG_ENABLE_DEFAULT_TRACERS \
   -e CONFIG_FUNCTION_TRACER -e CONFIG_BRANCH_PROFILE_NONE -e CONFIG_PROBE_EVENTS -e CONFIG_TRACEPOINT_BENCHMARK -e CONFIG_STACK_TRACER
 make olddefconfig
 make all
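
After make olddefconfig, it can be useful to confirm which of the requested options actually ended up in the generated .config. The following is a minimal sketch, not part of the kernel tree or of rasdaemon; the function names are made up for illustration, and the option list is simply copied from the ./scripts/config call above. Kconfig may legitimately leave some of these options disabled when their dependencies are not met, so treat the result as a hint rather than as a failure.

```python
import re

# Options requested by the ./scripts/config call above.
REQUIRED_OPTIONS = [
    "CONFIG_FTRACE",
    "CONFIG_FTRACE_SYSCALLS",
    "CONFIG_TRACEPOINTS",
    "CONFIG_TRACING",
    "CONFIG_ENABLE_DEFAULT_TRACERS",
    "CONFIG_FUNCTION_TRACER",
    "CONFIG_BRANCH_PROFILE_NONE",
    "CONFIG_PROBE_EVENTS",
    "CONFIG_TRACEPOINT_BENCHMARK",
    "CONFIG_STACK_TRACER",
]


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y' in config_text."""
    enabled = set(re.findall(r"^(CONFIG_\w+)=y$", config_text, flags=re.MULTILINE))
    return [opt for opt in required if opt not in enabled]


def check_config_file(path=".config"):
    """Read a kernel .config file and list the options that are not enabled."""
    with open(path) as config_file:
        return missing_options(config_file.read())
```

Calling check_config_file() from the top of the kernel tree returns a list of the options that are not set to y; an empty list means all of them are enabled.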

5. Place rasdaemon in the image

For this step, you can build rasdaemon on your local machine and copy it into the image (don't forget to cross-compile for arm64 if your machine has another architecture).

Alternatively, boot your image with QEMU and build rasdaemon inside the guest.

6. Booting the image with QEMU

Assuming that you're running it from the QEMU directory, where:

  • BIOS is named as: ./QEMU_EFI.fd;
  • Kernel is at: ../kernel/arch/arm64/boot/Image.gz;
  • Debian image is at: ./debian.qcow2,

a command line to start QEMU with RAS enabled would be similar to:

$ build/qemu-system-aarch64 -m 4g,maxmem=8G,slots=8 -monitor stdio -no-reboot \
  -bios ./QEMU_EFI.fd -kernel ../kernel/arch/arm64/boot/Image.gz \
  -drive if=none,file=./debian.qcow2,format=qcow2,id=hd \
  -device pcie-root-port,id=root_port1 \
  -device virtio-blk-pci,drive=hd -device virtio-net-pci,netdev=mynet,id=bob \
  -object memory-backend-ram,size=4G,id=mem0 \
  -netdev type=user,id=mynet,hostfwd=tcp::5555-:22 \
  -qmp tcp:localhost:4445,server=on,wait=off \
  -M virt,nvdimm=on,gic-version=3,ras=on -cpu max -smp 4 \
  -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
  -append 'earlycon nomodeset root=/dev/vda1 fsck.mode=skip tp_printk maxcpus=4'

This command line starts the QMP interface at localhost:4445, which is the default used by the error injection script.
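
Before injecting anything, it may help to confirm that the QMP socket is reachable. The sketch below is not the injection script and does not use any injection command; it only performs the standard QMP handshake (read the greeting, send qmp_capabilities) and then issues the generic query-status command. The function names are illustrative, and the host and port defaults are taken from the -qmp option above.

```python
import json
import socket


def qmp_command(stream, name, arguments=None):
    """Send one QMP command and return its reply, skipping async events."""
    request = {"execute": name}
    if arguments is not None:
        request["arguments"] = arguments
    stream.write(json.dumps(request) + "\n")
    stream.flush()
    for line in stream:
        reply = json.loads(line)
        # Events carry an "event" key; command replies carry "return" or "error".
        if "return" in reply or "error" in reply:
            return reply
    raise ConnectionError("QMP connection closed before a reply arrived")


def qmp_query_status(host="localhost", port=4445):
    """Connect, negotiate capabilities, and ask QEMU for the VM run state."""
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rw", encoding="utf-8")
        greeting = json.loads(stream.readline())
        if "QMP" not in greeting:
            raise ConnectionError("unexpected greeting: %r" % (greeting,))
        qmp_command(stream, "qmp_capabilities")
        return qmp_command(stream, "query-status")
```

With the guest started as shown above, qmp_query_status() should return a dictionary whose "return" entry describes the run state of the VM. A connection error usually means that QEMU was started without the -qmp option or with a different port.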

7. Injecting errors with QEMU

A utility under scripts/ connects to the QMP TCP host/port to do the error injection. The utility is part of the patch series mentioned earlier.

With it, injecting an ARM64 CPU error is as simple as running:

$ ./scripts/ghes_inject.py arm
Error injected.

This injects an error using default values.

Kernel output for this event (without rasdaemon running):

[   12.360716] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   12.361103] {1}[Hardware Error]: event severity: recoverable
[   12.361367] {1}[Hardware Error]:  Error 0, type: recoverable
[   12.361639] {1}[Hardware Error]:   section_type: ARM processor error
[   12.361924] {1}[Hardware Error]:   MIDR: 0x00000000000f0510
[   12.362228] {1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
[   12.362587] {1}[Hardware Error]:   running state: 0x0
[   12.362836] {1}[Hardware Error]:   Power State Coordination Interface state: 0
[   12.363113] {1}[Hardware Error]:   Error info structure 0:
[   12.363323] {1}[Hardware Error]:   num errors: 2
[   12.363589] {1}[Hardware Error]:    error_type: 0x02: cache error
[   12.363851] {1}[Hardware Error]:    error_info: 0x000000000091000f
[   12.364124] {1}[Hardware Error]:     transaction type: Data Access
[   12.364480] {1}[Hardware Error]:     cache error, operation type: Data write
[   12.364796] {1}[Hardware Error]:     cache level: 2
[   12.365008] {1}[Hardware Error]:     processor context not corrupted
[   12.365721] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
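
When testing from a script, the kernel lines above can be grouped by the {N} sequence number that the kernel prints in front of [Hardware Error], which makes it easy to check that a given injection reached the guest. This is a minimal sketch that assumes the log format shown above; the function names are illustrative and the input is expected to be text captured from the guest console or from dmesg.

```python
import re

# Matches lines such as:
# [   12.363589] {1}[Hardware Error]:    error_type: 0x02: cache error
HW_ERROR_LINE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\] \{(?P<seq>\d+)\}\[Hardware Error\]:\s*(?P<text>.*)$"
)


def group_hardware_errors(log_text):
    """Group '[Hardware Error]' lines by their {N} sequence number."""
    records = {}
    for line in log_text.splitlines():
        match = HW_ERROR_LINE.match(line)
        if match:
            records.setdefault(int(match["seq"]), []).append(match["text"])
    return records


def error_types(record_lines):
    """Return (value, description) pairs from the 'error_type:' lines."""
    found = []
    for text in record_lines:
        match = re.match(r"error_type: (0x[0-9a-fA-F]+): (.*)", text)
        if match:
            found.append((int(match.group(1), 16), match.group(2)))
    return found
```

For the output above, group_hardware_errors() yields a single record with key 1, and error_types() on that record returns one entry for the cache error. The [Firmware Warn] line is not part of the record and is ignored.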

The script supports several error injection commands, each with many parameters for customizing the values of the fields inside the GHES record. For instance:

$ ./scripts/ghes_inject.py arm --psci 0x49455020 --no-r \
  -t tlb,bus,micro-arch micro-arch tlb bus cache -m 2 \
  --arm mpidr,affinity,running,vendor \
  --ctx-array 0xdead,0xbeef,0xabba,0xbaab \
  --vendor-specific 12,23,53,52 3 123 243 0xff
Error injected.

This produces the following kernel output:

[  239.814751] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  239.815130] {2}[Hardware Error]: event severity: recoverable
[  239.815382] {2}[Hardware Error]:  Error 0, type: recoverable
[  239.815634] {2}[Hardware Error]:   section_type: ARM processor error
[  239.815896] {2}[Hardware Error]:   MIDR: 0x00000000000f0510
[  239.816134] {2}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000080000000
[  239.816482] {2}[Hardware Error]:   error affinity level: 0
[  239.816711] {2}[Hardware Error]:   running state: 0x0
[  239.816936] {2}[Hardware Error]:   Power State Coordination Interface state: 0
[  239.817279] {2}[Hardware Error]:   Error info structure 0:
[  239.817516] {2}[Hardware Error]:   num errors: 3
[  239.817742] {2}[Hardware Error]:    error_type: 0x1c: TLB error|bus error|micro-architectural error
[  239.818108] {2}[Hardware Error]:   Error info structure 1:
[  239.818353] {2}[Hardware Error]:   num errors: 2
[  239.818562] {2}[Hardware Error]:    error_type: 0x10: micro-architectural error
[  239.818879] {2}[Hardware Error]:    error_info: 0x0000000078da03ff
[  239.819147] {2}[Hardware Error]:   Error info structure 2:
[  239.819393] {2}[Hardware Error]:   num errors: 2
[  239.819611] {2}[Hardware Error]:    error_type: 0x04: TLB error
[  239.819875] {2}[Hardware Error]:    error_info: 0x000000000054007f
[  239.820147] {2}[Hardware Error]:     transaction type: Instruction
[  239.820425] {2}[Hardware Error]:     TLB error, operation type: Instruction fetch
[  239.820741] {2}[Hardware Error]:     TLB level: 1
[  239.820958] {2}[Hardware Error]:     processor context not corrupted
[  239.821232] {2}[Hardware Error]:     the error has not been corrected
[  239.821503] {2}[Hardware Error]:     PC is imprecise
[  239.821721] {2}[Hardware Error]:   Error info structure 3:
[  239.821961] {2}[Hardware Error]:   num errors: 2
[  239.822173] {2}[Hardware Error]:    error_type: 0x08: bus error
[  239.822433] {2}[Hardware Error]:    error_info: 0x00000080d6460fff
[  239.822707] {2}[Hardware Error]:     transaction type: Generic
[  239.822977] {2}[Hardware Error]:     bus error, operation type: Generic read (type of instruction or data request cannot be determined)
[  239.823441] {2}[Hardware Error]:     affinity level at which the bus error occurred: 1
[  239.823759] {2}[Hardware Error]:     processor context corrupted
[  239.824029] {2}[Hardware Error]:     the error has been corrected
[  239.824283] {2}[Hardware Error]:     PC is imprecise
[  239.824503] {2}[Hardware Error]:     Program execution can be restarted reliably at the PC associated with the error.
[  239.824921] {2}[Hardware Error]:     participation type: Local processor observed
[  239.825236] {2}[Hardware Error]:     request timed out
[  239.825469] {2}[Hardware Error]:     address space: External Memory Access
[  239.825765] {2}[Hardware Error]:     memory access attributes:0x20
[  239.826036] {2}[Hardware Error]:     access mode: secure
[  239.826263] {2}[Hardware Error]:   Error info structure 4:
[  239.826505] {2}[Hardware Error]:   num errors: 2
[  239.826721] {2}[Hardware Error]:    error_type: 0x02: cache error
[  239.826991] {2}[Hardware Error]:    error_info: 0x000000000091000f
[  239.827261] {2}[Hardware Error]:     transaction type: Data Access
[  239.827521] {2}[Hardware Error]:     cache error, operation type: Data write
[  239.827812] {2}[Hardware Error]:     cache level: 2
[  239.828035] {2}[Hardware Error]:     processor context not corrupted
[  239.828301] {2}[Hardware Error]:   Context info structure 0:
[  239.828566] {2}[Hardware Error]:    register context type: AArch64 EL1 context registers
[  239.829114] {2}[Hardware Error]:    00000000: 0000dead 00000000 0000beef 00000000
[  239.829486] {2}[Hardware Error]:    00000010: 0000abba 00000000 0000baab 00000000
[  239.829820] {2}[Hardware Error]:    00000020: 00000000 00000000
[  239.830112] {2}[Hardware Error]:   Vendor specific error info has 8 bytes:
[  239.830472] {2}[Hardware Error]:    00000000: 3435170c fff37b03                    ..54.{..
[  239.830908] [Firmware Warn]: GHES: Unhandled processor error type 0x1c: TLB error|bus error|micro-architectural error
[  239.831323] [Firmware Warn]: GHES: Unhandled processor error type 0x10: micro-architectural error
[  239.831689] [Firmware Warn]: GHES: Unhandled processor error type 0x04: TLB error
[  239.832001] [Firmware Warn]: GHES: Unhandled processor error type 0x08: bus error
[  239.832316] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
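
Two details of this output can be cross-checked against the command line. The error_type value is a bitmask, so 0x1c is the combination of the TLB, bus and micro-architectural bits, and the hex dumps appear to print the payload as little-endian 32-bit words. The sketch below only illustrates that arithmetic: the bit names are copied from the log above, the function names are made up, and treating the context array entries as 64-bit values is an inference from the dump, not something taken from the script.

```python
import struct

# Bit values and names as they appear in the kernel output above.
ERROR_TYPE_BITS = {
    0x02: "cache error",
    0x04: "TLB error",
    0x08: "bus error",
    0x10: "micro-architectural error",
}


def decode_error_type(mask):
    """Join the names of all bits set in an error_type value."""
    names = [name for bit, name in ERROR_TYPE_BITS.items() if mask & bit]
    return "|".join(names)


def as_le_words(data):
    """Format bytes as little-endian 32-bit words, like the hex dump."""
    count = len(data) // 4
    words = struct.unpack("<%dI" % count, data[: count * 4])
    return " ".join("%08x" % word for word in words)


# Bytes passed on the command line: --vendor-specific 12,23,53,52 3 123 243 0xff
vendor_bytes = bytes([12, 23, 53, 52, 3, 123, 243, 0xFF])

# First two --ctx-array values, assumed here to be stored as 64-bit registers.
ctx_bytes = struct.pack("<2Q", 0xDEAD, 0xBEEF)
```

Here decode_error_type(0x1c) gives "TLB error|bus error|micro-architectural error", as_le_words(vendor_bytes) gives "3435170c fff37b03", and as_le_words(ctx_bytes) gives "0000dead 00000000 0000beef 00000000", matching the corresponding lines in the log.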

Please notice that the GHES: Unhandled messages, in both cases, appear because rasdaemon is not running on the guest. When rasdaemon is running, it will catch such reports and the firmware warning messages will be suppressed.

Hardware error injection

Some machines can optionally do firmware (and/or hardware) error injection. This is usually done by enabling EINJ support through special settings at the BIOS level. Those are hardware-specific and may require a special BIOS that OEM vendors use for hardware development.