| Andi Kleen | a98f0dd | 2007-02-13 13:26:23 +0100 | [diff] [blame] | 1 |  | 
Configurable sysfs parameters for the x86-64 machine check code.

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
exception (often with a panic); corrected ones cause a machine check
log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents in a bank. The exact meaning
of the banks and subevents is CPU-specific.

mcelog knows how to decode them.

When the "Machine check errors logged" message appears in the system
log, mcelog should be run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cron job.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

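As a rough illustration of the per-CPU layout described above, the following sketch enumerates the machinecheckN directories. The helper name machinecheck_dirs and its base argument are assumptions made for this example; only the sysfs path itself comes from the text.

```python
import re
from pathlib import Path

# Path taken from the text above. The "base" argument is only there so the
# helper can be pointed at a scratch directory when trying it out.
MCE_SYSFS_BASE = Path("/sys/devices/system/machinecheck")


def machinecheck_dirs(base=MCE_SYSFS_BASE):
    """Return {cpu_number: directory} for every machinecheckN under base."""
    base = Path(base)
    found = {}
    if not base.is_dir():
        # No machine check sysfs tree here (or a wrong path was given).
        return found
    for entry in base.iterdir():
        match = re.fullmatch(r"machinecheck(\d+)", entry.name)
        if match and entry.is_dir():
            found[int(match.group(1))] = entry
    # Sort numerically so that CPU 10 comes after CPU 2.
    return dict(sorted(found.items()))
```

Calling machinecheck_dirs() with no argument looks at the real sysfs tree and returns an empty dictionary on systems that do not expose it.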
The directory contains some configurable entries:

Entries:

bankNctl
(N = bank number)
        64-bit hex bitmask enabling/disabling specific subevents for bank N.
        When a bit in the bitmask is zero, the respective subevent will
        not be reported.
        By default all events are enabled.
        Note that the BIOS maintains another mask to disable specific
        events per bank. This mask is not visible here.

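The bitmask semantics of bankNctl can be sketched with a few pure helpers. All function names here are hypothetical, and the 0x prefix used when formatting a value for writing is an assumption; check what your kernel accepts before writing to the file.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # bankNctl is described as a 64-bit mask


def parse_bank_mask(text):
    """Parse the hex text read from a bankNctl file (0x prefix optional)."""
    return int(text.strip(), 16)


def enabled_subevents(mask):
    """List the bit positions that are set, i.e. subevents that are reported."""
    return [bit for bit in range(64) if (mask >> bit) & 1]


def disable_subevents(mask, bits):
    """Clear the given bit positions so those subevents are not reported."""
    for bit in bits:
        if not 0 <= bit < 64:
            raise ValueError(f"bit {bit} is outside the 64-bit mask")
        mask &= ~(1 << bit)
    return mask & MASK_64


def format_bank_mask(mask):
    """Format a mask for writing back. The 0x prefix is an assumption."""
    return hex(mask & MASK_64)
```

For example, disable_subevents(parse_bank_mask("ff"), [0, 3]) clears bits 0 and 3 and leaves 0xf6. The BIOS-level mask mentioned above is not affected by any of this.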
The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
        How often to poll for corrected machine check errors, in seconds
        (note that the output is hexadecimal). Default: 5 minutes. When the
        poller finds MCEs, it triggers an exponential speedup (poll more
        often) on the polling interval. When the poller stops finding MCEs,
        it triggers an exponential backoff (poll less often) on the polling
        interval. The check_interval variable is both the initial and
        maximum polling interval.

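The adaptive polling described for check_interval can be modelled roughly as follows. This is a sketch of the idea only: the factor of 2 and the 1-second floor are assumptions chosen for illustration, not values taken from the kernel, and the function names are hypothetical.

```python
def parse_check_interval(text):
    """Convert the hexadecimal text read from check_interval to seconds."""
    return int(text.strip(), 16)


def next_interval(current, found_mce, max_interval, min_interval=1, factor=2):
    """Return the next polling interval in seconds.

    found_mce=True  -> speed up (shorter interval, not below min_interval)
    found_mce=False -> back off (longer interval, not above max_interval)

    max_interval plays the role of check_interval: it is both the starting
    value and the upper bound. min_interval and factor are assumptions.
    """
    if found_mce:
        return max(min_interval, current // factor)
    return min(max_interval, current * factor)
```

With the default of 5 minutes, a read of check_interval would be parsed from "12c" to 300. Starting there, two polls that find MCEs shrink the interval to 150 and then 75 seconds, and quiet polls afterwards grow it back until it is capped at 300 again.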
tolerant
        Tolerance level. When a machine check exception occurs for an
        uncorrected machine check, the kernel can take different actions.
        Since machine check exceptions can happen at any time, it is
        sometimes risky for the kernel to kill a process because doing so
        defies normal kernel locking rules. The tolerance level configures
        how hard the kernel tries to recover, even at some risk of
        deadlock. Higher tolerant values trade potentially better uptime
        against the risk of a crash or even corruption (for tolerant >= 3).

        0: always panic on uncorrected errors, log corrected errors
        1: panic or SIGBUS on uncorrected errors, log corrected errors
        2: SIGBUS or log uncorrected errors, log corrected errors
        3: never panic or SIGBUS, log all errors (for testing only)

        Default: 1

        Note that this only makes a difference if the CPU allows recovery
        from a machine check exception. Current x86 CPUs generally do not.

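For scripts that report the current setting, the tolerant levels listed above can be kept in a small lookup table. The names below are hypothetical, and the value is parsed as hexadecimal on the assumption that this entry is printed like check_interval; for the documented levels 0 to 3 the base makes no difference.

```python
# Descriptions copied from the level list above.
TOLERANT_LEVELS = {
    0: "always panic on uncorrected errors, log corrected errors",
    1: "panic or SIGBUS on uncorrected errors, log corrected errors",
    2: "SIGBUS or log uncorrected errors, log corrected errors",
    3: "never panic or SIGBUS, log all errors (for testing only)",
}
DEFAULT_TOLERANT = 1


def describe_tolerant(text):
    """Turn the text read from the tolerant file into a one-line summary."""
    # Base 16 is an assumption; digits 0-3 read the same in base 10.
    level = int(text.strip(), 16)
    meaning = TOLERANT_LEVELS.get(level, "not documented in the table above")
    note = " [default]" if level == DEFAULT_TOLERANT else ""
    return f"{level}: {meaning}{note}"
```

Values outside the table are reported rather than rejected, since the text does not say how the kernel treats them.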
trigger
        Program to run when a machine check event is detected.
        This is an alternative to running mcelog regularly from cron
        and allows events to be detected faster.

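Setting the trigger amounts to writing a program path into the trigger file. The sketch below assumes the path should be absolute, which the text does not state; the function name and the example program path are likewise hypothetical.

```python
from pathlib import Path


def set_trigger(machinecheck_dir, program):
    """Write the path of the trigger program into <machinecheck_dir>/trigger.

    machinecheck_dir is one of the per-CPU directories; since the entry is
    shared between all CPUs, any one of them should do.
    """
    program = Path(program)
    # Assumption: the kernel runs the program without a shell or PATH lookup,
    # so a relative path is rejected here to be safe.
    if not program.is_absolute():
        raise ValueError(f"trigger program must be an absolute path: {program}")
    trigger_file = Path(machinecheck_dir) / "trigger"
    trigger_file.write_text(f"{program}\n")
    return trigger_file
```

A call such as set_trigger("/sys/devices/system/machinecheck/machinecheck0", "/usr/local/sbin/mce-trigger") would need root privileges on a real system; the program path shown is only a placeholder.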
TBD document entries for AMD threshold interrupt configuration

For more details about the x86 machine check architecture,
see the Intel and AMD architecture manuals from their developer websites.

For more details about the architecture, see
http://one.firstfloor.org/~andi/mce.pdf