What is hwpoison?

Upcoming Intel CPUs have support for recovering from some memory errors
("MCA recovery"). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

The code consists of the high-level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications. KVM support requires a recent qemu-kvm release.

For the KVM use case a new signal type was needed so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.

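As a rough illustration of how a specialized application might consume
this signal information, the sketch below installs a SIGBUS handler that
inspects si_code and si_addr. It assumes the BUS_MCEERR_AO (error found
in the background) and BUS_MCEERR_AR (error hit on access) codes exposed
through <signal.h>. The helper names and the policy of continuing after
a background report are hypothetical and not taken from this patchkit.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values match the Linux uapi headers. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* error hit while consuming data (action required) */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* error found in the background (action optional) */
#endif

/* Hypothetical bookkeeping: address of the most recent poison report. */
static void *volatile last_poison_addr;

static void mce_sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;

	if (si->si_code == BUS_MCEERR_AO) {
		/* Nothing was consumed yet: remember the address, keep running. */
		last_poison_addr = si->si_addr;
		return;
	}

	/*
	 * BUS_MCEERR_AR or an unrelated SIGBUS: returning would re-run the
	 * faulting access, so this sketch simply exits.
	 */
	_exit(1);
}

/* Install the handler. Returns 0 on success, -1 on error (see errno). */
int install_mce_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = mce_sigbus_handler;
	sa.sa_flags = SA_SIGINFO;	/* needed to receive si_code and si_addr */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Address from the last background report, or NULL if there was none. */
void *mce_last_poison_addr(void)
{
	return last_poison_addr;
}
```

A real consumer would map the reported address back to one of its own
objects and drop or rebuild that object instead of only recording it.
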
---

There are two (actually three) modes memory failure recovery can be in:

vm.memory_failure_recovery sysctl set to zero:
	All memory failures cause a panic. Do not attempt recovery.
	(On x86 this can also be affected by the tolerant level of the
	MCE subsystem.)

early kill
	(can be controlled globally and per process)
	Send SIGBUS to the application as soon as the error is detected.
	This allows applications that can process memory errors to handle
	them gracefully (e.g. drop the affected object).
	This is the mode used by KVM qemu.

late kill
	Send SIGBUS when the application runs into the corrupted page.
	This is best for applications that are unaware of memory errors,
	and it is the default.
	Note that some pages are always handled as late kill.

---

User control:

vm.memory_failure_recovery
	See sysctl.txt

vm.memory_failure_early_kill
	Enable early kill mode globally

PR_MCE_KILL
	Set early/late kill mode/revert to system default
	arg1: PR_MCE_KILL_CLEAR: Revert to system default
	arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
		PR_MCE_KILL_EARLY: Early kill
		PR_MCE_KILL_LATE:  Late kill
		PR_MCE_KILL_DEFAULT: Use system global default

PR_MCE_KILL_GET
	Return current mode

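The per-thread controls above might be used roughly as follows. This is
a minimal sketch, assuming the PR_MCE_KILL constants are available from
<sys/prctl.h> (fallback values are supplied for older headers). The
wrapper function names are hypothetical.

```c
#include <sys/prctl.h>

/* Fallbacks for old headers; values match include/linux/prctl.h. */
#ifndef PR_MCE_KILL
#define PR_MCE_KILL		33
#define PR_MCE_KILL_CLEAR	0
#define PR_MCE_KILL_SET		1
#define PR_MCE_KILL_LATE	0
#define PR_MCE_KILL_EARLY	1
#define PR_MCE_KILL_DEFAULT	2
#define PR_MCE_KILL_GET		34
#endif

/* Ask for early kill on the calling thread. Returns 0 or -1 (see errno). */
int mce_request_early_kill(void)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
		     (unsigned long)PR_MCE_KILL_EARLY, 0UL, 0UL);
}

/* Revert the calling thread to the system default. Returns 0 or -1. */
int mce_revert_to_default(void)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_CLEAR,
		     0UL, 0UL, 0UL);
}

/*
 * Query the current mode. Returns PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or
 * PR_MCE_KILL_DEFAULT on success, -1 on error.
 */
int mce_current_mode(void)
{
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}
```

An application that wants early notification would typically call
mce_request_early_kill() once at startup, after installing its SIGBUS
handler.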

---

Testing:

madvise(MADV_HWPOISON, ....)
	(as root)
	Poison a page in the process for testing

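A test program built on this call could look like the sketch below. It
assumes a kernel with memory failure support built in and a caller with
root privileges; otherwise madvise() is expected to fail and the helper
returns the negated errno. The function name is hypothetical, and the
fallback value for MADV_HWPOISON is only used with older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100	/* value from the Linux uapi headers */
#endif

/*
 * Map one anonymous page, fault it in and ask the kernel to poison it.
 * Returns 0 on success or a negative errno value on failure.
 * On success one physical page stays poisoned after this returns.
 */
int poison_one_test_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *p;
	int err = 0;

	if (page_size <= 0)
		return -EINVAL;

	p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -errno;

	p[0] = 1;	/* touch the page so a real page backs the mapping */

	if (madvise(p, (size_t)page_size, MADV_HWPOISON) != 0)
		err = -errno;

	/* Do not read or write the page after a successful poison. */
	munmap(p, (size_t)page_size);
	return err;
}
```

Touching the page after a successful call is how a late kill SIGBUS
would be provoked. The sketch deliberately avoids that and unmaps the
page instead.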

hwpoison-inject module through debugfs

/sys/kernel/debug/hwpoison/

corrupt-pfn

Inject hwpoison fault at PFN echoed into this file. This does
some early filtering to avoid corrupting unintended pages in test suites.

unpoison-pfn

Software-unpoison page at PFN echoed into this file. This
way a page can be reused again.
This only works for Linux injected failures, not for real
memory failures.

Note these injection interfaces are not stable and might change between
kernel versions.

corrupt-filter-dev-major
corrupt-filter-dev-minor

Only handle memory failures to pages associated with the file system defined
by block device major/minor.  -1U is the wildcard value.
This should only be used for testing with artificial injection.

corrupt-filter-memcg

Limit injection to pages owned by a memory cgroup (memcg), specified by
the inode number of the memcg.

Example:
        mkdir /cgroup/hwpoison

        usemem -m 100 -s 1000 &
        echo `jobs -p` > /cgroup/hwpoison/tasks

        memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ')
        echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg

        page-types -p `pidof init`   --hwpoison  # shall do nothing
        page-types -p `pidof usemem` --hwpoison  # poison its pages

corrupt-filter-flags-mask
corrupt-filter-flags-value

When specified, only poison pages if ((page_flags & mask) == value).
This allows stress testing of many kinds of pages. The page_flags
are the same as in /proc/kpageflags. The flag bits are defined in
include/linux/kernel-page-flags.h and documented in
Documentation/vm/pagemap.txt.

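To make the mask/value rule concrete, the sketch below builds a filter
for dirty LRU pages that are not anonymous and applies the same
predicate in user space. The three bit numbers are copied from
include/linux/kernel-page-flags.h. The function names and the choice of
filter are illustrative assumptions, not part of the injector.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers from include/linux/kernel-page-flags.h (subset). */
#define KPF_DIRTY	4
#define KPF_LRU		5
#define KPF_ANON	12

static inline uint64_t kpf_bit(unsigned int nr)
{
	return UINT64_C(1) << nr;
}

/* The predicate quoted above: ((page_flags & mask) == value). */
bool hwpoison_flags_match(uint64_t page_flags, uint64_t mask, uint64_t value)
{
	return (page_flags & mask) == value;
}

/*
 * Example filter: dirty LRU pages that are not anonymous.
 * The mask names the bits that are examined, the value gives the
 * required state of each examined bit (ANON must be clear).
 */
void hwpoison_dirty_file_lru_filter(uint64_t *mask, uint64_t *value)
{
	*mask  = kpf_bit(KPF_LRU) | kpf_bit(KPF_DIRTY) | kpf_bit(KPF_ANON);
	*value = kpf_bit(KPF_LRU) | kpf_bit(KPF_DIRTY);
}
```

With these bit numbers the mask works out to 0x1030 and the value to
0x30. Bits outside the mask are ignored, so a page that is also
referenced or up to date still matches.
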
Architecture specific MCE injector

x86 has mce-inject, mce-test

Some portable hwpoison test programs in mce-test, see below.

---

References:

http://halobates.de/mce-lc09-2.pdf
	Overview presentation from LinuxCon 09

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
	Test suite (hwpoison specific portable tests in tsrc)

git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
	x86 specific injector


---

Limitations:

- Not all page types are supported and never will be. Most kernel internal
  objects cannot be recovered, only LRU pages for now.
- Right now hugepage support is missing.

---
Andi Kleen, Oct 2009