|  |  | 
|  | Configurable sysfs parameters for the x86-64 machine check code. | 
|  |  | 
|  | Machine checks report internal hardware error conditions detected | 
|  | by the CPU. Uncorrected errors typically cause a machine check | 
|  | (often with panic), corrected ones cause a machine check log entry. | 
|  |  | 
|  | Machine checks are organized in banks (normally associated with | 
|  | a hardware subsystem) and subevents in a bank. The exact meaning | 
|  | of the banks and subevents is CPU specific. | 
|  |  | 
|  | mcelog knows how to decode them. | 
|  |  | 
|  | When you see the "Machine check errors logged" message in the system | 
|  | log then mcelog should run to collect and decode machine check entries | 
|  | from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. | 
|  |  | 
|  | Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN | 
|  | (N = CPU number). | 
|  |  | 
|  | The directory contains some configurable entries: | 
|  |  | 
|  | Entries: | 
|  |  | 
|  | bankNctl | 
|  | (N bank number) | 
|  | 64-bit hex bitmask enabling/disabling specific subevents for bank N. | 
|  | When a bit in the bitmask is zero then the respective | 
|  | subevent will not be reported. | 
|  | By default all events are enabled. | 
|  | Note that the BIOS maintains another mask to disable specific events | 
|  | per bank. That mask is not visible here. | 
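The bitmask semantics above can be illustrated with a minimal Python sketch. The helper names (parse_bank_ctl, disabled_subevents, read_bank_ctl) are hypothetical and not part of any kernel interface; the sketch only assumes that a bankNctl entry reads back as a hexadecimal number, as described above.

```python
from pathlib import Path

# Directory layout as described in this document.
SYSFS_BASE = "/sys/devices/system/machinecheck"


def parse_bank_ctl(text):
    """Convert the hex text read from a bankNctl entry into an integer."""
    return int(text.strip(), 16)


def disabled_subevents(mask, width=64):
    """Return the bit positions that are zero, i.e. subevents not reported."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


def read_bank_ctl(cpu, bank, base=SYSFS_BASE):
    """Read and parse bank<bank>ctl for one CPU (assumed file layout)."""
    path = Path(base) / f"machinecheck{cpu}" / f"bank{bank}ctl"
    return parse_bank_ctl(path.read_text())
```

For example, disabled_subevents(read_bank_ctl(0, 4)) would list the subevents masked off in bank 4 of CPU 0, provided that entry exists on the machine. Masks applied by the BIOS would not show up in the result.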
|  |  | 
|  | The following entries appear for each CPU, but they are truly shared | 
|  | between all CPUs. | 
|  |  | 
|  | check_interval | 
|  | How often to poll for corrected machine check errors, in seconds | 
|  | (Note output is hexadecimal). Default 5 minutes.  When the poller | 
|  | finds MCEs it triggers an exponential speedup (poll more often) on | 
|  | the polling interval.  When the poller stops finding MCEs, it | 
|  | triggers an exponential backoff (poll less often) on the polling | 
|  | interval. The check_interval variable is both the initial and | 
|  | maximum polling interval. | 
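The adaptive polling described above can be sketched in Python as follows. This is an illustration, not the kernel's code: the factor of two and the one-second lower bound are assumptions, since the text only says that the change is exponential and that check_interval is the initial and maximum value.

```python
def parse_check_interval(text):
    """Parse the hexadecimal check_interval value into seconds."""
    return int(text.strip(), 16)


def next_interval(current, found_mce, check_interval, min_interval=1):
    """Return the next polling interval in seconds.

    Assumed policy: halve the interval when MCEs were found (speedup),
    double it otherwise (backoff), never exceeding check_interval.
    """
    if found_mce:
        return max(min_interval, current // 2)
    return min(check_interval, current * 2)


def simulate(check_interval, findings):
    """Apply next_interval over a sequence of poll results."""
    interval = check_interval  # check_interval is the initial interval
    history = []
    for found in findings:
        interval = next_interval(interval, found, check_interval)
        history.append(interval)
    return history
```

With the default of 5 minutes (read back as 12c), two polls that find errors followed by three that do not would give the intervals 150, 75, 150, 300, 300 under these assumptions.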
|  |  | 
|  | tolerant | 
|  | Tolerance level. When a machine check exception occurs for an | 
|  | uncorrected machine check the kernel can take different actions. | 
|  | Since machine check exceptions can happen any time it is sometimes | 
|  | risky for the kernel to kill a process because it defies | 
|  | normal kernel locking rules. The tolerance level configures | 
|  | how hard the kernel tries to recover even at some risk of | 
|  | deadlock.  Higher tolerant values trade potentially better uptime | 
|  | against the risk of a crash or even corruption (for tolerant >= 3). | 
|  |  | 
|  | 0: always panic on uncorrected errors, log corrected errors | 
|  | 1: panic or SIGBUS on uncorrected errors, log corrected errors | 
|  | 2: SIGBUS or log uncorrected errors, log corrected errors | 
|  | 3: never panic or SIGBUS, log all errors (for testing only) | 
|  |  | 
|  | Default: 1 | 
|  |  | 
|  | Note this only makes a difference if the CPU allows recovery | 
|  | from a machine check exception. Current x86 CPUs generally do not. | 
|  |  | 
|  | trigger | 
|  | Program to run when a machine check event is detected. | 
|  | This is an alternative to running mcelog regularly from cron | 
|  | and allows events to be detected faster. | 
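Setting the trigger amounts to writing a program path into the sysfs entry. The following Python sketch shows one way to do that; the helper names and the example handler path are hypothetical, and the sketch assumes the entry accepts a path followed by a newline. Since the entry is shared between all CPUs, writing it through one CPU's directory should be enough.

```python
from pathlib import Path

# Directory layout as described in this document.
SYSFS_BASE = "/sys/devices/system/machinecheck"


def trigger_path(cpu=0, base=SYSFS_BASE):
    """Build the path of the trigger entry for one CPU."""
    return Path(base) / f"machinecheck{cpu}" / "trigger"


def set_trigger(program, cpu=0, base=SYSFS_BASE):
    """Write the program path to the trigger entry (needs root on sysfs)."""
    trigger_path(cpu, base).write_text(program + "\n")


def get_trigger(cpu=0, base=SYSFS_BASE):
    """Read back the currently configured trigger program."""
    return trigger_path(cpu, base).read_text().strip()
```

A call such as set_trigger("/usr/local/bin/mce-handler") would then register that (hypothetical) program. The base parameter exists only so the helpers can be tried against a scratch directory instead of the real sysfs tree.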
|  |  | 
|  | TBD: document entries for AMD threshold interrupt configuration. | 
|  |  | 
|  | For more details about the x86 machine check architecture | 
|  | see the Intel and AMD architecture manuals from their developer websites. | 
|  |  | 
|  | For more details about the architecture | 
|  | see http://one.firstfloor.org/~andi/mce.pdf |