Configurable sysfs parameters for the x86-64 machine check code.

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check exception
(often with a panic); corrected ones cause a machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents in a bank. The exact meaning
of the banks and subevents is CPU specific.

mcelog knows how to decode them.

When you see the "Machine check errors logged" message in the system
log, mcelog should be run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cron job.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).
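
As a minimal sketch, the per-CPU directories can be enumerated by matching
the machinecheckN naming pattern given above. The function name and the
`base` parameter are illustrative choices, not part of the kernel interface.

```python
import os
import re

# Location of the per-CPU directories, as given in the text above.
MACHINECHECK_BASE = "/sys/devices/system/machinecheck"


def machinecheck_dirs(base=MACHINECHECK_BASE):
    """Return {cpu_number: directory_path} for each machinecheckN entry.

    `base` is a parameter only so the function can be pointed at a
    different directory; it is a hypothetical convenience.
    """
    found = {}
    if not os.path.isdir(base):
        return found  # no machine check sysfs directory on this system
    for name in os.listdir(base):
        match = re.fullmatch(r"machinecheck(\d+)", name)
        if match:
            found[int(match.group(1))] = os.path.join(base, name)
    return found
```

Calling machinecheck_dirs() with no argument scans the real sysfs location
and returns an empty dict when it is absent.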

The directory contains some configurable entries:

Entries:

bankNctl
(N = bank number)
	64-bit hex bitmask enabling/disabling specific subevents for bank N.
	When a bit in the bitmask is zero, the respective
	subevent will not be reported.
	By default all events are enabled.
	Note that the BIOS maintains another mask to disable specific events
	per bank. This is not visible here.
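
The bitmask semantics above can be sketched with a few Python helpers.
All function names are hypothetical. Formatting the value with a "0x"
prefix for writing back is an assumption about what the file accepts,
not something this document states.

```python
MASK_64 = (1 << 64) - 1  # bankNctl is described as a 64-bit mask


def parse_bank_mask(text):
    """Parse the hex text read from a bankNctl file into an integer."""
    return int(text.strip(), 16) & MASK_64


def subevent_enabled(mask, bit):
    """A set bit means the subevent is reported, a zero bit means it is not."""
    return bool((mask >> bit) & 1)


def disable_subevent(mask, bit):
    """Return a copy of mask with the given subevent bit cleared."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    return mask & ~(1 << bit) & MASK_64


def format_bank_mask(mask):
    """Format a mask as hex text (the 0x prefix is an assumption)."""
    return f"0x{mask & MASK_64:x}"
```

For example, clearing bit 3 of the mask ff yields f7, which stops subevent 3
of that bank from being reported while leaving the others enabled.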

The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
	How often to poll for corrected machine check errors, in seconds
	(note that the output is hexadecimal). Default 5 minutes. When the
	poller finds MCEs it triggers an exponential speedup (poll more often)
	on the polling interval. When the poller stops finding MCEs, it
	triggers an exponential backoff (poll less often) on the polling
	interval. The check_interval variable is both the initial and
	maximum polling interval.
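
The adaptive polling described above might be modelled as follows. This is
only a sketch: the text says the speedup and backoff are exponential but
does not give the factor or a lower bound, so the factor of 2 and the
`min_interval` parameter are assumptions, and the function names are
hypothetical.

```python
def parse_check_interval(text):
    """Parse the check_interval file content (hexadecimal seconds)."""
    return int(text.strip(), 16)


def next_interval(current, found_mce, check_interval, min_interval=1):
    """Return the next polling interval in seconds.

    Assumed model: halve the interval when MCEs were found, double it
    otherwise. check_interval is the upper bound, as stated in the text;
    min_interval is an assumed lower bound.
    """
    if found_mce:
        return max(min_interval, current // 2)
    return min(check_interval, current * 2)
```

With the default of 5 minutes the file would read 12c (300 seconds). In this
model one poll that finds MCEs shortens the interval to 150 seconds, and a
following quiet poll restores it to 300.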

tolerant
	Tolerance level. When a machine check exception occurs for an
	uncorrected machine check, the kernel can take different actions.
	Since machine check exceptions can happen at any time, it is sometimes
	risky for the kernel to kill a process because doing so defies
	normal kernel locking rules. The tolerance level configures
	how hard the kernel tries to recover even at some risk of deadlock.

	0: always panic,
	1: panic if deadlock possible,
	2: try to avoid panic,
	3: never panic or exit (for testing only)

	Default: 1

	Note this only makes a difference if the CPU allows recovery
	from a machine check exception. Current x86 CPUs generally do not.
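
The four tolerance levels can be represented as an enumeration when a script
reads or validates this entry. The class, member, and function names below
are illustrative. Parsing the value as hexadecimal is an assumption, and it
makes no difference for the values 0 to 3.

```python
from enum import IntEnum


class Tolerant(IntEnum):
    """Tolerance levels as listed above (member names are hypothetical)."""
    ALWAYS_PANIC = 0
    PANIC_IF_DEADLOCK_POSSIBLE = 1
    AVOID_PANIC = 2
    NEVER_PANIC = 3  # for testing only


DEFAULT_TOLERANT = Tolerant.PANIC_IF_DEADLOCK_POSSIBLE


def parse_tolerant(text):
    """Parse the content of the tolerant file.

    Raises ValueError for anything outside the documented levels 0..3.
    """
    return Tolerant(int(text.strip(), 16))
```

Rejecting unknown values up front keeps a configuration script from writing
a level that the text above does not define.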

trigger
	Program to run when a machine check event is detected.
	This is an alternative to running mcelog regularly from cron
	and allows events to be detected faster.
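
Setting the trigger amounts to writing a program path into the trigger file
of a machinecheckN directory. The helper below is a sketch: its name is
hypothetical, and requiring an absolute path is an assumption made for
safety here, not a rule stated in this document.

```python
import os


def set_trigger(cpu_dir, program):
    """Write `program` to the trigger file inside `cpu_dir`.

    cpu_dir is one of the machinecheckN directories. Writing to the real
    sysfs file normally requires root privileges.
    """
    if not os.path.isabs(program):
        # Assumption: an absolute path avoids depending on any search path.
        raise ValueError("program must be an absolute path")
    with open(os.path.join(cpu_dir, "trigger"), "w") as f:
        f.write(program + "\n")
```

Because the entry is shared between all CPUs, writing it through any one
machinecheckN directory should be enough.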

TBD: document entries for AMD threshold interrupt configuration

For more details about the x86 machine check architecture
see the Intel and AMD architecture manuals from their developer websites.

For more details about the architecture see
http://one.firstfloor.org/~andi/mce.pdf