The PCI Express Advanced Error Reporting Driver Guide HOWTO
		T. Long Nguyen	<tom.l.nguyen@intel.com>
		Yanmin Zhang	<yanmin.zhang@intel.com>
				07/29/2006


1. Overview

1.1 About this guide

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.

1.2 Copyright © Intel Corporation 2006.

1.3 What is the PCI Express AER Driver?

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. It provides three
basic functions:

-	Gathers comprehensive error information when errors occur.
-	Reports errors to the user.
-	Performs error recovery actions.

The AER driver attaches only to root ports that support the PCI Express
AER capability.


2. User Guide

2.1 Include the PCI Express AER Root Driver into the Linux Kernel

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver has to be
compiled into the kernel. Option CONFIG_PCIEAER supports this
capability. It depends on CONFIG_PCIEPORTBUS, so please set
CONFIG_PCIEPORTBUS=y and CONFIG_PCIEAER=y.

2.2 Load PCI Express AER Root Driver
Some systems have AER support in the BIOS. Enabling the AER Root driver
while the BIOS also handles AER may result in unpredictable behavior.
To avoid this conflict, a successful load of the AER Root driver
requires ACPI _OSC support in the BIOS, which allows the AER Root
driver to request native control of AER. See the PCI FW 3.0
Specification for details regarding _OSC usage. Currently, many
firmware implementations use PCI Express but do not provide _OSC
support. To support such firmware, forceload, a parameter of type
bool, lets AER initialization continue even though the firmware has no
_OSC support. To enable this workaround, please add
aerdriver.forceload=y to the kernel boot parameter line when booting
the kernel. Note that forceload=n by default.

2.3 AER error output
When a PCI Express AER error is captured, an error message is printed
to the console. If it is a correctable error, it is printed as a
warning. Otherwise, it is printed as an error. Users can therefore
choose a log level that filters out correctable error messages.

An example is shown below:
+------ PCI-Express Device Error -----+
Error Severity          : Uncorrected (Fatal)
PCIE Bus Error type     : Transaction Layer
Unsupported Request     : First
Requester ID            : 0500
VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
TLB Header:
04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the root port. Please refer to the PCI Express
specifications for the other fields.
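
The Requester ID in the example (0500) matches the Bus=05h, Device=00h,
Function=00h line that follows it. A minimal sketch of that decoding is
shown below, assuming the conventional 16-bit layout of an 8-bit bus
number, a 5-bit device number, and a 3-bit function number. The struct
and the function decode_requester_id() are hypothetical names used only
for illustration; they are not part of the AER driver.

```c
#include <stdint.h>

/* Fields packed into a 16-bit PCI Express Requester ID. */
struct requester_id {
	uint8_t bus;       /* bits 15:8 */
	uint8_t device;    /* bits 7:3  */
	uint8_t function;  /* bits 2:0  */
};

/* Hypothetical helper: split a Requester ID into bus/device/function. */
struct requester_id decode_requester_id(uint16_t id)
{
	struct requester_id r;

	r.bus = (uint8_t)(id >> 8);
	r.device = (uint8_t)((id >> 3) & 0x1f);
	r.function = (uint8_t)(id & 0x07);
	return r;
}
```

Decoding 0x0500 with this helper gives bus 05h, device 00h, and
function 00h, which agrees with the sample output above.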


3. Developer Guide

AER-aware support requires a device driver to configure the AER
capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.
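
The classification above can be summarized in a few lines of C. This is
only an illustrative sketch: the enum and the three predicate functions
are hypothetical names and do not come from the AER driver.

```c
#include <stdbool.h>

/* Hypothetical severity classes mirroring the text above. */
enum sketch_severity {
	SKETCH_CORRECTABLE,  /* corrected by hardware, no data loss */
	SKETCH_NONFATAL,     /* uncorrectable, transaction unreliable */
	SKETCH_FATAL,        /* uncorrectable, link unreliable */
};

/* Is the affected transaction still trustworthy? */
bool sketch_transaction_reliable(enum sketch_severity sev)
{
	return sev == SKETCH_CORRECTABLE;
}

/* Is the PCI Express link itself still trustworthy? */
bool sketch_link_reliable(enum sketch_severity sev)
{
	return sev != SKETCH_FATAL;
}

/* Does software have to take part in recovery? */
bool sketch_needs_software_recovery(enum sketch_severity sev)
{
	return sev != SKETCH_CORRECTABLE;
}
```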

When AER is enabled, a PCI Express device automatically sends an
error message to the PCI Express root port above it when the device
captures an error. The Root Port, upon receiving an error reporting
message, internally processes and logs the error message in its AER
capability structure. Logging the error includes storing the error
reporting agent's requester ID in the Error Source Identification
Registers and setting the error bits of the Root Error Status Register
accordingly. If AER error reporting is enabled in the Root Error
Command Register, the Root Port generates an interrupt when an error
is detected.
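
The logging step can be pictured with a small software model of the
three root port registers mentioned above. This is a simplified sketch
and not how the hardware or the driver is implemented: it ignores the
"multiple errors received" bits and always overwrites the logged
source. The bit values are intended to follow the AER register layout,
with the correctable source in the low 16 bits of the Error Source
Identification register and the uncorrectable source in the high 16
bits, but all names here are local to the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Root Error Status bits (names local to this sketch). */
#define SK_ROOT_STATUS_COR_RCV    0x00000001u  /* ERR_COR received */
#define SK_ROOT_STATUS_UNCOR_RCV  0x00000004u  /* ERR_FATAL/NONFATAL received */

/* Root Error Command bits (names local to this sketch). */
#define SK_ROOT_CMD_COR_EN        0x00000001u
#define SK_ROOT_CMD_NONFATAL_EN   0x00000002u
#define SK_ROOT_CMD_FATAL_EN      0x00000004u

enum sk_err_msg { SK_ERR_COR, SK_ERR_NONFATAL, SK_ERR_FATAL };

struct sk_root_port {
	uint32_t root_status;   /* Root Error Status */
	uint32_t root_command;  /* Root Error Command */
	uint32_t err_source;    /* Error Source Identification */
};

/*
 * Log one received error message. Returns true if the model would
 * raise an interrupt, i.e. if reporting for this message type is
 * enabled in the Root Error Command register.
 */
bool sk_root_port_receive(struct sk_root_port *rp, enum sk_err_msg msg,
			  uint16_t requester_id)
{
	uint32_t enable;

	if (msg == SK_ERR_COR) {
		rp->root_status |= SK_ROOT_STATUS_COR_RCV;
		rp->err_source = (rp->err_source & 0xffff0000u) | requester_id;
		enable = SK_ROOT_CMD_COR_EN;
	} else {
		rp->root_status |= SK_ROOT_STATUS_UNCOR_RCV;
		rp->err_source = (rp->err_source & 0x0000ffffu) |
				 ((uint32_t)requester_id << 16);
		enable = (msg == SK_ERR_FATAL) ? SK_ROOT_CMD_FATAL_EN
					       : SK_ROOT_CMD_NONFATAL_EN;
	}
	return (rp->root_command & enable) != 0;
}
```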

Note that the errors as described above are related to the PCI Express
hierarchy and links. These errors do not include any device specific
errors because device specific errors will still get sent directly to
the device driver.

3.1 Configure the AER capability structure

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They may also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
section 3.3.

3.2 Provide callbacks

3.2.1 Callback reset_link to reset the PCI Express link

This callback is used to reset the PCI Express physical link when a
fatal error happens. The root port AER service driver provides a
default reset_link function, but different upstream ports might
have different requirements for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.

pci_ers_result_t (*reset_link) (struct pci_dev *dev);

Section 3.2.2.2 provides more detailed information on when reset_link
is called.

3.2.2 PCI error-recovery callbacks

The PCI Express AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

The data structure pci_driver has a pointer, err_handler, which points
to a pci_error_handlers structure consisting of several callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are
called.

3.2.2.1 Correctable errors

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.
|  | 175 | 3.2.2.2 Non-correctable (non-fatal and fatal) errors | 
|  | 176 |  | 
|  | 177 | If an error message indicates a non-fatal error, performing link reset | 
|  | 178 | at upstream is not required. The AER driver calls error_detected(dev, | 
|  | 179 | pci_channel_io_normal) to all drivers associated within a hierarchy in | 
|  | 180 | question. for example, | 
|  | 181 | EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. | 
|  | 182 | If Upstream port A captures an AER error, the hierarchy consists of | 
|  | 183 | Downstream port B and EndPoint. | 
|  | 184 |  | 
|  | 185 | A driver may return PCI_ERS_RESULT_CAN_RECOVER, | 
|  | 186 | PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on | 
|  | 187 | whether it can recover or the AER driver calls mmio_enabled as next. | 
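
The broadcast over "the hierarchy in question" can be sketched in user
space as a walk over everything below the reporting port. The code
below is a model for illustration only, not the AER driver's code: the
tree structure, the result enum, and every function name are
hypothetical stand-ins, and treating a device without a handler as
"cannot recover" is an assumption based on section 3.4.

```c
#include <stdbool.h>
#include <stddef.h>

enum sk_chan_state { SK_IO_NORMAL, SK_IO_FROZEN };
enum sk_result { SK_CAN_RECOVER, SK_NEED_RESET, SK_DISCONNECT };

struct sk_dev;
typedef enum sk_result (*sk_error_detected_fn)(struct sk_dev *dev,
						enum sk_chan_state state);

struct sk_dev {
	const char *name;
	struct sk_dev *child;    /* first device below this one */
	struct sk_dev *sibling;  /* next device on the same bus */
	sk_error_detected_fn error_detected;  /* NULL: no handler */
	int notified;            /* how often the handler was called */
};

struct sk_summary {
	int notified;          /* number of handlers called */
	bool all_can_recover;  /* true if mmio_enabled may come next */
};

/* Visit a device, its siblings, and everything below them. */
static void sk_walk(struct sk_dev *d, enum sk_chan_state state,
		    struct sk_summary *s)
{
	for (; d != NULL; d = d->sibling) {
		if (d->error_detected == NULL) {
			/* Assumption: no handler means no recovery. */
			s->all_can_recover = false;
		} else {
			d->notified++;
			s->notified++;
			if (d->error_detected(d, state) != SK_CAN_RECOVER)
				s->all_can_recover = false;
		}
		sk_walk(d->child, state, s);
	}
}

/* Notify every driver below the port that captured the error. */
struct sk_summary sk_broadcast_error_detected(struct sk_dev *port,
					      enum sk_chan_state state)
{
	struct sk_summary s = { 0, true };

	sk_walk(port->child, state, &s);
	return s;
}

/* Two sample handlers standing in for real drivers. */
enum sk_result sk_sample_can_recover(struct sk_dev *dev,
				     enum sk_chan_state state)
{
	(void)dev;
	(void)state;
	return SK_CAN_RECOVER;
}

enum sk_result sk_sample_need_reset(struct sk_dev *dev,
				    enum sk_chan_state state)
{
	(void)dev;
	(void)state;
	return SK_NEED_RESET;
}
```

With the example topology, calling sk_broadcast_error_detected() on
Upstream port A notifies Downstream port B and EndPoint, but not port A
itself or the root port.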

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. A link reset at the upstream component is
then necessary. Because different kinds of devices might reset the
link in different ways, the AER port service driver is required to
provide the function that resets the link. The kernel first checks
whether the upstream component has an AER driver. If it does, the
kernel uses the reset_link callback of that AER driver. If the
upstream component has no AER driver and the port is a downstream
port, the kernel uses the AER driver of the root port that reported
the AER error. Upstream ports should provide their own AER service
drivers with a reset_link function. If error_detected returns
PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
PCI_ERS_RESULT_RECOVERED, error handling proceeds to mmio_enabled.
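
The rule for choosing a reset_link callback can be written as a short
decision function. The sketch below is a user-space model under stated
assumptions: the port structure, the enums, and all function names are
hypothetical, and returning NULL for an upstream port without its own
callback reflects the statement in section 3.4 that fatal error
recovery fails in that case.

```c
#include <stddef.h>

enum sk_port_type { SK_PORT_ROOT, SK_PORT_UPSTREAM, SK_PORT_DOWNSTREAM };
enum sk_reset_result { SK_RESET_RECOVERED, SK_RESET_DISCONNECT };

struct sk_port;
typedef enum sk_reset_result (*sk_reset_link_fn)(struct sk_port *dev);

struct sk_port {
	enum sk_port_type type;
	/* reset_link of the port's AER service driver, or NULL if none */
	sk_reset_link_fn reset_link;
};

/*
 * Pick the callback used to reset the link at the upstream component.
 * 'reporting_root' is the root port that reported the AER error.
 */
sk_reset_link_fn sk_select_reset_link(const struct sk_port *upstream,
				      const struct sk_port *reporting_root)
{
	/* 1. Prefer the upstream component's own AER driver. */
	if (upstream->reset_link != NULL)
		return upstream->reset_link;

	/* 2. A downstream port falls back to the root port's driver. */
	if (upstream->type == SK_PORT_DOWNSTREAM)
		return reporting_root->reset_link;

	/* 3. An upstream port must bring its own; otherwise give up. */
	return NULL;
}

/* Sample callbacks standing in for real service drivers. */
enum sk_reset_result sk_root_default_reset(struct sk_port *dev)
{
	(void)dev;
	return SK_RESET_RECOVERED;
}

enum sk_reset_result sk_switch_specific_reset(struct sk_port *dev)
{
	(void)dev;
	return SK_RESET_RECOVERED;
}
```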

3.3 Helper functions

3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
do not enable error reporting by default, so device drivers need to
call this function to enable it.

3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.
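
In register terms, enabling and disabling error reporting amounts to
setting and clearing the error reporting enable bits of the PCI Express
Device Control register. The sketch below models this on a plain 16-bit
value. It is not the kernel implementation: the macro and function
names are local to the example, and exactly which bits the real helpers
touch is an assumption that should be checked against the kernel
source.

```c
#include <stdint.h>

/* Error reporting enable bits of the Device Control register. */
#define SK_DEVCTL_COR_EN       0x0001u  /* correctable errors */
#define SK_DEVCTL_NONFATAL_EN  0x0002u  /* non-fatal errors */
#define SK_DEVCTL_FATAL_EN     0x0004u  /* fatal errors */
#define SK_DEVCTL_UNSUP_EN     0x0008u  /* unsupported requests */

#define SK_DEVCTL_AER_FLAGS \
	(SK_DEVCTL_COR_EN | SK_DEVCTL_NONFATAL_EN | \
	 SK_DEVCTL_FATAL_EN | SK_DEVCTL_UNSUP_EN)

/* Model of enabling: set all reporting bits, keep the other bits. */
uint16_t sk_enable_error_reporting(uint16_t devctl)
{
	return (uint16_t)(devctl | SK_DEVCTL_AER_FLAGS);
}

/* Model of disabling: clear all reporting bits, keep the other bits. */
uint16_t sk_disable_error_reporting(uint16_t devctl)
{
	return (uint16_t)(devctl & (uint16_t)~SK_DEVCTL_AER_FLAGS);
}
```

A driver would typically call the real enable helper while setting up
the device and the disable helper while tearing it down.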

3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status clears the uncorrectable
error status register.
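
Bits in the uncorrectable error status register are cleared by writing
1 to them. The sketch below models that write-1-to-clear behavior and a
cleanup routine that clears every pending bit. It is an illustration
under stated assumptions, not the kernel code: the structure and
function names are hypothetical, the bit values used when trying it out
are arbitrary, and the real helper may clear only a subset of the
pending bits.

```c
#include <stdint.h>

/* A register whose bits are cleared by writing 1 to them. */
struct sk_rw1c_reg {
	uint32_t value;
};

uint32_t sk_rw1c_read(const struct sk_rw1c_reg *reg)
{
	return reg->value;
}

/* Writing a 1 clears the bit; writing a 0 leaves it unchanged. */
void sk_rw1c_write(struct sk_rw1c_reg *reg, uint32_t val)
{
	reg->value &= ~val;
}

/* Read the pending bits and write them back to clear them. */
void sk_cleanup_uncorrect_status(struct sk_rw1c_reg *status)
{
	uint32_t pending = sk_rw1c_read(status);

	if (pending != 0)
		sk_rw1c_write(status, pending);
}
```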

3.4 Frequently Asked Questions

Q: What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?

A: The devices bound to the driver won't be recovered. If the
error is fatal, the kernel will print warning messages. Please refer
to section 3 for more information.

Q: What happens if an upstream port service driver does not provide
the reset_link callback?

A: Fatal error recovery will fail if the errors are reported by the
upstream ports to which the service driver is attached.

Q: How does this infrastructure deal with a driver that is not PCI
Express aware?

A: This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to the root
port.

Q: What modifications will that driver need to make it compatible
with the PCI Express AER Root driver?

A: It can call the helper functions to enable AER in its devices and
to clean up the uncorrectable error status register. Please refer to
section 3.3.