Workaround for opal-prd using 100% CPU

opal-prd is the Processor RunTime Diagnostics daemon, the userspace process that on OpenPower systems is responsible for some of the runtime diagnostics. Although a userspace process, it memory maps (as in mmap) in some code loaded by early firmware (Hostboot) called the HostBoot RunTime (HBRT) and runs it, using calls to the kernel to accomplish any needed operations (e.g. reading/writing registers inside the chip). Running this in user space gives us benefits such as being able to attach gdb, recover from segfaults etc.

The reason this code is shipped as part of firmware rather than as an OS package is that it is very system specific, and it would be a giant pain to update a package in every Linux distribution every time a new chip or machine was introduced.

Anyway, there’s a bug in the HBRT code that means if there’s an ECC error in the HBEL (HostBoot Error Log) partition in the system flash (“bios” or “pnor”… the flash where your system firmware lives), the opal-prd process may get stuck chewing up 100% CPU and not doing anything useful. There’s https://github.com/open-power/hostboot/issues/67 for this.

You will notice a problem if the opal-prd process is using 100% CPU and the last log messages are something like:

HBRT: ERRL:>>ErrlManager::ErrlManager constructor.
HBRT: ERRL:iv_hiddenErrorLogsEnable = 0x0
HBRT: ERRL:>>setupPnorInfo
HBRT: PNOR:>>RtPnor::getSectionInfo
HBRT: PNOR:>>RtPnor::readFromDevice: i_offset=0x0, i_procId=0 sec=11 size=0x20000 ecc=1
HBRT: PNOR:RtPnor::readFromDevice: removing ECC...
HBRT: PNOR:RtPnor::readFromDevice> Uncorrectable ECC error : chip=0,offset=0x0

(the parameters to readFromDevice may differ)

Luckily, there’s a simple workaround to fix it all up! You will need the pflash utility. Primarily, pflash is meant only for developers and those who know what they’re doing. You can turn your computer into a brick using it.

pflash is packaged in Ubuntu 16.10 and RHEL 7.3, but you can otherwise build it from source easily enough:

git clone https://github.com/open-power/skiboot.git
cd skiboot/external/pflash
make

Now that you have pflash, you just need to erase the HBEL partition and write (ECC) zeros:

dd if=/dev/zero of=/tmp/hbel bs=1 count=147456
pflash -P HBEL -e
pflash -P HBEL -p /tmp/hbel

Note: you cannot just erase the partition or use the pflash option to do an ECC erase, you may render your system unbootable if you get it wrong.

After that, restart opal-prd however your distro handles restarting daemons (e.g. systemctl restart opal-prd.service) and all should be well.

OpenPower firmware up on github!

With the whole OpenPower thing, a lot of low level firmware is being open sourced, which is really exciting for the platform – the less proprietary code sitting in memory the better in my books.

If you go to https://github.com/open-power you’ll see code for a bunch of the low level firmware for OpenPower and POWER8.

Hostboot is the bit of code that brings up the CPU and skiboot both sets up hardware and provides runtime services to Linux (such as talking to the service processor, if one is present).

Patches to https://github.com/open-power/skiboot/blob/master/doc/overview.txt are (of course) really quite welcome. It shouldn’t be too hard to get your head around the basics.

To see the Linux side of the OPAL interface, go check out linux/arch/powerpc/platforms/powernv -there you can see how we ask OPAL to do things for us.

If you buy a POWER8 system from IBM running PowerKVM you’re running this code.