

                      PCI Bus EEH Error Recovery
                      --------------------------
                           Linas Vepstas
                       <linas@austin.ibm.com>
                          12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Extended Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to data corruption (of both user data and kernel data),
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards, or,
unfortunately quite commonly, due to device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card.  Catching this is a powerful
feature of EEH, as it prevents what would otherwise have been silent
memory corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
The following discussion gives a generic overview of how to detect
and recover from EEH errors, followed by an overview of how the
current implementation in the Linux kernel does it.  The actual
implementation is subject to change, and some of the finer points
are still being debated.  These may in turn be swayed if or when
other architectures implement similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
Isolation applies to PCI memory, I/O space, and PCI config
space.  Interrupts, however, will continue to be delivered.
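
As a rough illustration of the all-ff's convention described above,
the following sketch tests whether a value read with a given access
width has every bit set.  The helper name read_looks_isolated() is
hypothetical and is not part of the kernel; it is plain user-space C
meant only to show the comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): returns true if `val`,
 * obtained from an access of `width` bytes (1, 2 or 4), has every bit
 * set, which is what a read from an isolated slot returns.
 */
bool read_looks_isolated(uint32_t val, unsigned int width)
{
    assert(width == 1 || width == 2 || width == 4);

    /* Build the all-ones pattern for this access width. */
    uint32_t all_ones = (width == 4) ? 0xffffffffu
                                     : ((1u << (8 * width)) - 1u);

    return (val & all_ones) == all_ones;
}
```

An all-ones value is only a hint: a healthy device may legitimately
return it, so a real check still has to ask the firmware.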

Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.
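
The "give up after a few resets" policy can be sketched as a bounded
retry loop.  This is a minimal user-space illustration, not kernel
code: recover_card(), fake_reset() and MAX_RESETS are hypothetical
names, and the limit of four is simply one reading of "three or four
resets".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RESETS 4    /* assumption: the text suggests three or four */

/* Hypothetical callback: reset the card once; true if it came back. */
typedef bool (*reset_fn)(void *card);

/*
 * Try to recover a card.  Returns the number of resets used on
 * success, or -1 if the card should be reported as dead.
 */
int recover_card(reset_fn reset, void *card)
{
    assert(reset != NULL);

    for (int attempt = 1; attempt <= MAX_RESETS; attempt++) {
        if (reset(card))
            return attempt;
    }
    return -1;  /* give up: report to the sysadmin */
}

/* Simulated card: each reset fails until `fails_left` reaches zero. */
bool fake_reset(void *card)
{
    int *fails_left = card;

    if (*fails_left > 0) {
        (*fails_left)--;
        return false;
    }
    return true;
}
```

With a simulated card that fails twice, recover_card() succeeds on the
third attempt; a card that never comes back yields -1 after the fourth.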


Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and again whenever a PCI slot is hot-plugged. The former is performed by
eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the pci device associated
with that address (if any).

The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
etc. include a check to see if the I/O read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.
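
The fast-path/slow-path split described above can be sketched in
user-space C as follows.  Everything here is simulated: struct
sim_slot, sim_firmware_says_frozen() and sim_checked_readb() are
hypothetical stand-ins, not the kernel's readb() or
eeh_dn_check_failure(), and the "firmware" is just a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated device and firmware state; all names are hypothetical. */
struct sim_slot {
    uint8_t reg;                    /* value the device register holds */
    bool frozen;                    /* true once the slot is isolated  */
    unsigned int false_positives;   /* all-ones reads that were benign */
    unsigned int failures;          /* all-ones reads on a frozen slot */
};

/* Stand-in for the firmware query made on the slow path. */
bool sim_firmware_says_frozen(const struct sim_slot *slot)
{
    return slot->frozen;
}

/* Read one byte; consult the "firmware" only if the value is all ones. */
uint8_t sim_checked_readb(struct sim_slot *slot)
{
    assert(slot != NULL);

    /* An isolated slot returns all ones regardless of the register. */
    uint8_t val = slot->frozen ? 0xff : slot->reg;

    if (val == 0xff) {
        if (sim_firmware_says_frozen(slot))
            slot->failures++;
        else
            slot->false_positives++;    /* a legitimate 0xff */
    }
    return val;
}
```

The point of the design is that ordinary reads pay only for one
comparison; the comparatively expensive firmware query happens only
when the value is suspicious.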

If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
syslog (/var/log/messages).  This stack trace has proven to be very
useful to device-driver authors for finding out at what point the EEH
error was detected, as the error itself usually occurs slightly
beforehand.

Next, the code uses the Linux kernel notifier chain/work queue
mechanism to allow any interested parties to find out about the
failure.  Device drivers, or other parts of the kernel, can use
eeh_register_notifier(struct notifier_block *) to find out about EEH
events.  The event will include a pointer to the pci device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
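
The idea of a notifier chain can be shown with a small user-space
model.  This sketch does not use the kernel's notifier_block API; the
sim_* types and functions, and the fields of the event structure, are
hypothetical and exist only to show how registered listeners each
receive the same event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event payload; the real event layout is not shown here. */
struct sim_eeh_event {
    const char *dev_name;
    int state;
};

struct sim_notifier {
    void (*call)(struct sim_notifier *self, const struct sim_eeh_event *ev);
    struct sim_notifier *next;
    int seen;                       /* events delivered to this listener */
};

struct sim_notifier *sim_chain;     /* head of the listener list */

/* Add a listener to the front of the chain. */
void sim_register_notifier(struct sim_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = sim_chain;
    sim_chain = nb;
}

/* Deliver one event to every registered listener; returns how many ran. */
int sim_notify_all(const struct sim_eeh_event *ev)
{
    int n = 0;

    for (struct sim_notifier *nb = sim_chain; nb != NULL; nb = nb->next) {
        nb->call(nb, ev);
        n++;
    }
    return n;
}

/* Example listener: just counts the events it receives. */
void sim_count_event(struct sim_notifier *self, const struct sim_eeh_event *ev)
{
    (void)ev;
    self->seen++;
}
```

Each listener decides for itself what to do with the event, which is
what lets a default handler and driver-specific handlers coexist.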

To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
   located topologically under the pci slot.
eeh_save_bars() and eeh_restore_bars() -- save and restore the PCI
   config-space info for a device and any devices under it.


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs, and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for ethernet cards).
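
The order of operations in the handler can be summarized with a small
simulation that records each step instead of performing it.  This is
only a sketch of the sequence as described above: the sim_* names are
hypothetical, every step is a stub, and the real handler's details may
differ.

```c
#include <assert.h>
#include <string.h>

#define MAX_STEPS 8

/* Records the order in which the (simulated) recovery steps run. */
struct sim_log {
    const char *steps[MAX_STEPS];
    int count;
};

void sim_step(struct sim_log *log, const char *name)
{
    assert(log->count < MAX_STEPS);
    log->steps[log->count++] = name;
}

/* Returns the position of a step in the log, or -1 if it never ran. */
int sim_step_index(const struct sim_log *log, const char *name)
{
    for (int i = 0; i < log->count; i++) {
        if (strcmp(log->steps[i], name) == 0)
            return i;
    }
    return -1;
}

/* Mirrors the sequence described in the text; every step is a stub. */
void sim_handle_eeh_event(struct sim_log *log)
{
    sim_step(log, "save BARs");
    sim_step(log, "unconfigure adapter");   /* driver stops, uevents go out */
    sim_step(log, "wait for user space");   /* the text says 5 seconds */
    sim_step(log, "reset slot");
    sim_step(log, "restore BARs");
    sim_step(log, "configure bridges");
    sim_step(log, "enable slot");           /* driver restarts, more uevents */
}
```

The important ordering constraints are that the BARs are saved before
the device is torn down, and that the reset happens only after user
space has had a chance to react.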


Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a pci slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

The following example shows the sequence of calls that causes a device
driver's close function to be called during the first phase of an EEH
reset, using the pcnet32 device driver.

    rpaphp_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in /drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in /drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in /drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in /net/core/dev.c
                      {
                        calls
                        dev_close()  // in /net/core/dev.c
                        {
                          calls dev->stop();
                          which is just pcnet32_close() // in pcnet32.c
                          {
                            which does what you wanted
                            to stop the device
                          }
                        }
                      }
                      which
                      frees pcnet32 device driver memory
                    }
    }}}}}}}}


    In drivers/pci/pci_driver.c,
    struct device_driver->remove() is just pci_device_remove()
    which calls struct pci_driver->remove() which is pcnet32_remove_one()
    which calls unregister_netdev()  (in net/core/dev.c)
    which calls dev_close()  (in net/core/dev.c)
    which calls dev->stop() which is pcnet32_close()
    which then does the appropriate shutdown.
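
The chain above works because each layer calls the next through a
function pointer.  The following user-space sketch models that
dispatch with simplified structures; the sim_* names are hypothetical
stand-ins for the kernel structures and functions named above, not
their real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures named above. */
struct sim_netdev {
    int running;
    void (*stop)(struct sim_netdev *dev);
};

struct sim_pci_driver {
    void (*remove)(struct sim_netdev *dev);
};

/* Plays the role of pcnet32_close(): quiesce the hardware. */
void sim_close(struct sim_netdev *dev)
{
    dev->running = 0;
}

/* Plays the role of unregister_netdev() -> dev_close() -> dev->stop(). */
void sim_unregister_netdev(struct sim_netdev *dev)
{
    if (dev->running && dev->stop != NULL)
        dev->stop(dev);
}

/* Plays the role of pcnet32_remove_one(). */
void sim_remove_one(struct sim_netdev *dev)
{
    sim_unregister_netdev(dev);
    /* a real driver would free its private memory here */
}

/* Plays the role of pci_device_remove(): dispatch through the driver. */
void sim_pci_device_remove(const struct sim_pci_driver *drv,
                           struct sim_netdev *dev)
{
    assert(drv != NULL && dev != NULL);
    if (drv->remove != NULL)
        drv->remove(dev);
}
```

Because the generic layers only see function pointers, the same
unconfigure path shuts down any driver without knowing what kind of
device it manages.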

    ---
    Following is the analogous stack trace for events sent to user-space
    when the pci device is unconfigured.

    rpaphp_unconfig_pci_adapter() {              // in rpaphp_pci.c
      calls
      pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
        calls
        pci_destroy_dev (struct pci_dev *) {
          calls
          device_unregister (&dev->dev) {        // in /drivers/base/core.c
            calls
            device_del(struct device * dev) {    // in /drivers/base/core.c
              calls
              kobject_del() {                    // in /lib/kobject.c
                calls
                kobject_uevent() {               // in /lib/kobject.c
                  calls
                  kset_uevent() {                // in /lib/kobject.c
                    calls
                    kset->uevent_ops->uevent()   // which is really just
                    a call to
                    dev_uevent() {               // in /drivers/base/core.c
                      calls
                      dev->bus->uevent() which is really just a call to
                      pci_uevent () {            // in drivers/pci/hotplug.c
                        which prints device name, etc....
                      }
                    }
                  }
                  then kobject_uevent() sends a netlink uevent to userspace
                     --> userspace uevent
                     (during early boot, nobody listens to netlink events and
                     kobject_uevent() executes uevent_helper[], which runs the
                     event process /sbin/hotplug)
                }
                kobject_del() then calls sysfs_remove_dir(), which would
                alert any user-space daemon that was watching sysfs
                for delete events.
              }
    }}}}}


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-- A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that did not even need to know that the pci
   card was being rebooted.

-- A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped.  Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed. Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails. These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-- If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.
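
The suggestion in the second item, adding an EEH reset to the existing
cascade of SCSI resets, can be pictured as one more level in an
escalation loop.  The sketch below is a user-space model under that
assumption: the enum values and sim_* functions are hypothetical and
do not correspond to the SCSI subsystem's actual error-handling API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Escalation levels, mildest first.  RESET_EEH_SLOT is the proposed
 * addition; the others mirror the resets named in the text. */
enum sim_reset_level {
    RESET_DEVICE,
    RESET_BUS,
    RESET_HOST,
    RESET_EEH_SLOT,
    RESET_LEVELS        /* sentinel: number of levels */
};

/* Hypothetical callback: try one level; true if I/O works again. */
typedef bool (*sim_try_reset)(enum sim_reset_level level, void *ctx);

/*
 * Walk the chain until one level succeeds.  Returns that level, or
 * RESET_LEVELS if every level failed.
 */
enum sim_reset_level sim_escalate(sim_try_reset try_reset, void *ctx)
{
    assert(try_reset != NULL);

    for (int lvl = RESET_DEVICE; lvl < RESET_LEVELS; lvl++) {
        if (try_reset((enum sim_reset_level)lvl, ctx))
            return (enum sim_reset_level)lvl;
    }
    return RESET_LEVELS;
}

/* Simulated fault: only levels at or above *ctx clear the error. */
bool sim_needs_at_least(enum sim_reset_level level, void *ctx)
{
    return level >= *(enum sim_reset_level *)ctx;
}
```

Since the whole cascade runs below the block layer, a successful
slot-level reset in this position would look to the file system like
any other retried command.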


Conclusions
-----------
There's forward progress ...

