Manish Ahuja | d28a793 | 2008-03-22 10:33:10 +1100
                   Hypervisor-Assisted Dump
                   ------------------------
                       November 2007

The goal of hypervisor-assisted dump is to enable the dump of
a crashed system, and to do so from a fully-reset system, and
to minimize the total elapsed time until the system is back
in production use.

As compared to kdump or other strategies, hypervisor-assisted
dump offers several strong, practical advantages:

-- Unlike kdump, the system has been reset, and loaded
   with a fresh copy of the kernel.  In particular,
   PCI and I/O devices have been reinitialized and are
   in a clean, consistent state.
-- As the dump is performed, the dumped memory becomes
   immediately available to the system for normal use.
-- After the dump is completed, no further reboots are
   required; the system will be fully usable, and running
   in its normal, production mode on its normal kernel.

The above can only be accomplished by coordination with,
and assistance from, the hypervisor. The procedure is
as follows:

-- When a system crashes, the hypervisor will save
   the low 256MB of RAM to a previously registered
   save region. It will also save system state, system
   registers, and hardware PTEs.

-- After the low 256MB area has been saved, the
   hypervisor will reset PCI and other hardware state.
   It will *not* clear RAM. It will then launch the
   bootloader, as normal.

-- The freshly booted kernel will notice that there
   is a new node (ibm,dump-kernel) in the device tree,
   indicating that there is crash data available from
   a previous boot. It will boot into only 256MB of RAM,
   reserving the rest of system memory.

-- Userspace tools will parse /sys/kernel/release_region
   and read /proc/vmcore to obtain the contents of memory,
   which holds the previously crashed kernel. The userspace
   tools may copy this data to disk, or to network, NAS,
   SAN, iSCSI, etc. as desired.

   For example, the values in /sys/kernel/release_region
   would look something like this (address-range pairs):
   CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /
   DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A

-- As the userspace tools complete saving a portion of
   the dump, they echo an offset and size to
   /sys/kernel/release_region to release the reserved
   memory back to general use.

   An example of this is:
     "echo 0x40000000 0x10000000 > /sys/kernel/release_region"
   which will release 256MB at the 1GB boundary.

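The release step above can be sketched as follows. This is only
an illustration, not the tool that ships with this feature: the
helper names release_chunks() and release_command() are
hypothetical, and the 256MB chunk size is an assumption taken
from the example command. The sketch only builds the command
strings; it does not write to sysfs.

```python
# Hypothetical helpers: split a saved range into chunks and build
# the matching "echo <offset> <size>" release commands.
RELEASE_FILE = "/sys/kernel/release_region"
CHUNK = 0x10000000  # 256MB, as in the example command (assumption)


def release_chunks(start, size, chunk=CHUNK):
    """Yield (offset, length) pairs covering [start, start + size)."""
    offset = start
    end = start + size
    while offset < end:
        length = min(chunk, end - offset)
        yield offset, length
        offset += length


def release_command(offset, length):
    """Format the shell command that releases one chunk."""
    return f"echo {offset:#x} {length:#x} > {RELEASE_FILE}"


if __name__ == "__main__":
    # Release 640MB starting at the 1GB boundary, 256MB at a time.
    for off, length in release_chunks(0x40000000, 0x28000000):
        print(release_command(off, length))
```

Releasing in chunks as each portion is saved is what lets the
dumped memory return to general use while the rest of the dump
is still being written out.
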
Please note that the hypervisor-assisted dump feature
is only available on Power6-based systems with recent
firmware versions.

Implementation details:
-----------------------

During boot, a check is made to see if firmware supports
this feature on this particular machine. If it does, then
we check to see if an active dump is waiting for us. If so,
everything but 256MB of RAM is reserved during early
boot. If there is dump data, then the
/sys/kernel/release_region file is created, and the
reserved memory is held. This area is released once the
dump has been collected by the userland scripts that are
run.

If there is no waiting dump data, then only the highest
256MB of RAM is reserved as a scratch area. This area
is *not* released: this region will be kept permanently
reserved, so that it can act as a receptacle for a copy
of the low 256MB in case a crash does occur. See,
however, "Open issues/ToDo" below, as to whether
such a reserved region is really needed.

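The two reservation cases can be worked through with a small
sketch. It models only the arithmetic described above; the
function name reserved_range() is hypothetical, and any
alignment or firmware constraints that a real kernel applies
are not modeled.

```python
# Hypothetical model of the early-boot reservation arithmetic.
MB = 1 << 20
BOOT_MEM = 256 * MB  # the kernel boots into / saves the low 256MB


def reserved_range(total_ram, dump_waiting):
    """Return (start, size) of the region reserved at early boot.

    dump_waiting=True:  everything above the low 256MB is reserved.
    dump_waiting=False: only the highest 256MB is reserved as scratch.
    """
    if total_ram <= BOOT_MEM:
        raise ValueError("this model assumes more than 256MB of RAM")
    if dump_waiting:
        return BOOT_MEM, total_ram - BOOT_MEM
    return total_ram - BOOT_MEM, BOOT_MEM


if __name__ == "__main__":
    for waiting in (True, False):
        start, size = reserved_range(4096 * MB, waiting)
        print(f"dump_waiting={waiting}: start={start:#x} size={size:#x}")
```

For a 4GB machine this gives a reserved region of 3840MB
starting at 256MB when a dump is waiting, and a 256MB region
starting at 3840MB otherwise.
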
Currently the dump will be copied from /proc/vmcore to
a new file upon user intervention. The starting address
to be read and the range for each data point are provided
in /sys/kernel/release_region.

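As an illustration of reading those values, the sketch below
parses text in the form of the sample release_region contents
shown earlier in this document. The exact file format is an
assumption based on that single sample: labels such as CPU,
HPTE and DUMP followed by one or more hex pairs, with "/" used
as a line continuation. The function name
parse_release_region() is hypothetical, and the two numbers of
each pair are returned as-is without interpretation.

```python
import re

# One labelled entry: LABEL: pair[, pair...]
ENTRY = re.compile(
    r"([A-Z]+):\s*((?:0x[0-9a-fA-F]+-0x[0-9a-fA-F]+(?:,\s*)?)+)"
)
# One hex pair: 0x<first>-0x<second>
PAIR = re.compile(r"(0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)")


def parse_release_region(text):
    """Return {label: [(first, second), ...]} with integer values."""
    # Drop the "/" continuation marker and normalise whitespace.
    flat = " ".join(text.replace("/", " ").split())
    regions = {}
    for label, body in ENTRY.findall(flat):
        regions[label] = [
            (int(first, 16), int(second, 16))
            for first, second in PAIR.findall(body)
        ]
    return regions


if __name__ == "__main__":
    sample = (
        "CPU:0x177fee000-0x10000: HPTE:0x177ffe020-0x1000: /\n"
        "DUMP:0x177fff020-0x10000000, 0x10000000-0x16F1D370A\n"
    )
    for label, pairs in parse_release_region(sample).items():
        print(label, [(hex(a), hex(b)) for a, b in pairs])
```
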
The tools to examine the dump will be the same as the ones
used for kdump.

General notes:
--------------
Security: please note that there are potential security issues
with any sort of dump mechanism. In particular, plaintext
(unencrypted) data, and possibly passwords, may be present in
the dump data. Userspace tools must take adequate precautions to
preserve security.

Open issues/ToDo:
-----------------
 o The various code paths that tell the hypervisor that a crash
   occurred, vs. it simply being a normal reboot, should be
   reviewed, and possibly clarified/fixed.

 o Instead of using /sys/kernel, should there be a /sys/dump
   instead? There is a dump_subsys being created by the s390 code;
   perhaps the pseries code should use a similar layout as well.

 o Is reserving a 256MB region really required? The goal of
   reserving a 256MB scratch area is to make sure that no
   important crash data is clobbered when the hypervisor
   saves low memory to the scratch area. But, if one could
   assure that nothing important is located in some 256MB area,
   then it would not need to be reserved. This is something
   that can be improved in subsequent versions.

 o Still working with the kdump team to integrate this with
   kdump; some work remains, but this would not affect the
   current patches.

 o Still need to write a shell script to copy the dump away.
   Currently I am parsing it manually.
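
As a rough illustration of the copy step mentioned in the last
item, the sketch below copies a dump in fixed-size blocks. It
is not the script referred to above. It assumes the source can
be read sequentially like an ordinary file; the function name
copy_dump() and the 1MB block size are hypothetical choices.

```python
import io


def copy_dump(src, dst, chunk_size=1 << 20):
    """Copy src to dst in chunk_size blocks; return bytes copied."""
    total = 0
    while True:
        block = src.read(chunk_size)
        if not block:  # end of the dump
            break
        dst.write(block)
        total += len(block)
    return total


if __name__ == "__main__":
    # Self-contained demonstration using in-memory files.
    source = io.BytesIO(b"\x00" * 3000)
    target = io.BytesIO()
    print(copy_dump(source, target, chunk_size=1024))
```

On a real system the source would be /proc/vmcore opened in
binary mode and the destination a file on disk or a network
mount. Copying in blocks keeps memory use small, which matters
while the kernel is running in only 256MB of RAM.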