Debugging hibernation and suspend
(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL

1. Testing hibernation (aka suspend to disk or STD)

To check if hibernation works, you can try to hibernate in the "reboot" mode:

# echo reboot > /sys/power/disk
# echo disk > /sys/power/state

and the system should create a hibernation image, reboot, resume and get back to
the command prompt where you started the transition.  If that happens,
hibernation is most likely to work correctly.  Still, you need to repeat the
test at least a couple of times in a row for confidence, because some problems
only show up on a second attempt at suspending and resuming the system.
Moreover, hibernating in the "reboot" and "shutdown" modes causes the PM core
to skip some platform-related callbacks which on ACPI systems might be
necessary to make hibernation work.  Thus, if your machine fails to hibernate
or resume in the "reboot" mode, you should try the "platform" mode:

# echo platform > /sys/power/disk
# echo disk > /sys/power/state

which is the default and recommended mode of hibernation.

Unfortunately, the "platform" mode of hibernation does not work on some systems
with broken BIOSes.  In such cases the "shutdown" mode of hibernation might
work:

# echo shutdown > /sys/power/disk
# echo disk > /sys/power/state

(it is similar to the "reboot" mode, but it requires you to press the power
button to make the system resume).
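
Because the test has to be repeated a few times in a row, it may be convenient
to wrap the commands above in a small shell function.  The following is only a
sketch: hibernate_cycles and the PM_DIR variable are names made up for this
example (PM_DIR defaults to /sys/power), and the function has to be run as
root.

```shell
# Hypothetical helper, not part of the kernel or of any standard tool.
# PM_DIR is an illustrative variable; it defaults to the real sysfs directory.
PM_DIR="${PM_DIR:-/sys/power}"

# hibernate_cycles MODE [COUNT]
# Select the hibernation MODE ("reboot", "platform" or "shutdown") and start
# hibernation COUNT times in a row (3 by default), stopping at the first
# write that fails.
hibernate_cycles() {
    mode="$1"
    count="${2:-3}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "$mode" > "$PM_DIR/disk" || return 1
        echo disk > "$PM_DIR/state" || return 1
        echo "hibernation cycle $i of $count ($mode) completed"
        i=$((i + 1))
    done
}
```

For example, "hibernate_cycles reboot 3" would attempt three hibernation cycles
in the "reboot" mode.  A hang during suspend or resume will, of course, not be
reported by the function.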

If neither the "platform" nor the "shutdown" hibernation mode works, you will
need to identify what goes wrong.

a) Test modes of hibernation

To find out why hibernation fails on your system, you can use a special testing
facility available if the kernel is compiled with CONFIG_PM_DEBUG set.  In that
case the file /sys/power/pm_test can be used to make the hibernation core run
in a test mode.  There are five test modes available:

freezer    - test the freezing of processes

devices    - test the freezing of processes and suspending of devices

platform   - test the freezing of processes, suspending of devices and
             platform global control methods(*)

processors - test the freezing of processes, suspending of devices, platform
             global control methods(*) and the disabling of nonboot CPUs

core       - test the freezing of processes, suspending of devices, platform
             global control methods(*), the disabling of nonboot CPUs and
             suspending of platform/system devices

(*) the platform global control methods are only available on ACPI systems
    and are only tested if the hibernation mode is set to "platform"

To use one of them, write the corresponding string to /sys/power/pm_test
(e.g. "devices" to test the freezing of processes and suspending of devices)
and issue the standard hibernation commands.  For example, to use the "devices"
test mode along with the "platform" mode of hibernation, you should do the
following:

# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo disk > /sys/power/state

Then, the kernel will try to freeze processes, suspend devices, wait 5 seconds,
resume devices and thaw processes.  If "platform" is written to
/sys/power/pm_test, then after suspending devices the kernel will additionally
invoke the global control methods (e.g. ACPI global control methods) used to
prepare the platform firmware for hibernation.  Next, it will wait 5 seconds and
invoke the platform (e.g. ACPI) global methods used to cancel hibernation,
followed by the remaining resume steps.

Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal
hibernation/suspend operations.  Also, when opened for reading,
/sys/power/pm_test contains a space-separated list of all available tests
(including "none", which represents the normal functionality) in which the
current test level is indicated by square brackets.
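
The bracketed entry can be picked out of that list with standard text tools.
The function below is a minimal sketch; its name, pm_test_current, and the
optional file argument are made up for this example (the argument defaults to
/sys/power/pm_test and is there only so that the function can be tried on a
copy of the file).

```shell
# Hypothetical helper: print the currently selected test level, i.e. the
# entry of the space-separated pm_test list enclosed in square brackets.
pm_test_current() {
    # Put every entry on its own line, then keep only the bracketed one
    # and strip the brackets.
    tr ' ' '\n' < "${1:-/sys/power/pm_test}" | sed -n 's/^\[\(.*\)\]$/\1/p'
}
```

Running "pm_test_current" before a real hibernation attempt is a simple way to
make sure that a test level has not been left selected by accident.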

Generally, as you can see, each test level is more "invasive" than the previous
one and the "core" level tests the hardware and drivers as deeply as possible
without creating a hibernation image.  Obviously, if the "devices" test fails,
the "platform" test will fail as well and so on.  Thus, as a rule of thumb, you
should try the test modes starting from "freezer", through "devices", "platform"
and "processors" up to "core" (repeat the test on each level a couple of times
to rule out random factors).
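
The rule of thumb above can be sketched as a loop over the test levels.  This
is an illustration only: pm_test_ladder and PM_DIR are names made up for this
example (PM_DIR defaults to /sys/power), and the function assumes that a
failing test shows up as a failed write to the sysfs file, which need not be
true for every failure (a hang, for instance, never returns).

```shell
# Hypothetical helper, not part of the kernel or of any standard tool.
PM_DIR="${PM_DIR:-/sys/power}"

# pm_test_ladder [MODE]
# Run the test levels from the least to the most invasive one using the
# hibernation MODE ("platform" by default) and stop at the first level for
# which one of the writes fails.
pm_test_ladder() {
    mode="${1:-platform}"
    status=0
    for level in freezer devices platform processors core; do
        echo "testing level: $level"
        if ! { echo "$level" > "$PM_DIR/pm_test" &&
               echo "$mode" > "$PM_DIR/disk" &&
               echo disk > "$PM_DIR/state"; }; then
            echo "level $level failed, check the output of dmesg" >&2
            status=1
            break
        fi
    done
    # Always switch back to the normal operations afterwards.
    echo none > "$PM_DIR/pm_test"
    return "$status"
}
```

The function runs each level once, so it should be invoked a couple of times to
follow the advice about repeating the tests.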

If the "freezer" test fails, there is a task that cannot be frozen (in that case
it is usually possible to identify the offending task by analysing the output of
dmesg obtained after the failing test).  Failure at this level usually means
that there is a problem with the tasks freezer subsystem that should be
reported.
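
One way of looking for the offending task is to search the kernel log for the
message printed by the freezer.  The sketch below assumes that this message
contains the words "refusing to freeze"; the exact wording depends on the
kernel version, so the search string may have to be adjusted.  The function
name, freezer_failures, is made up for this example, and the -A option is
assumed to be supported by the installed grep.

```shell
# Hypothetical helper: read kernel log text on standard input and print the
# lines reporting tasks that refuse to freeze, followed by a few lines of
# context (10 by default, or the number given as the first argument).
# The search string is an assumption about the wording of the message.
freezer_failures() {
    grep -i -A "${1:-10}" 'refusing to freeze'
}
```

It would typically be used as "dmesg | freezer_failures" right after the
failing test.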

If the "devices" test fails, most likely there is a driver that cannot suspend
or resume its device (in the latter case the system may hang or become unstable
after the test, so please take that into consideration).  To find this driver,
you can carry out a binary search according to the rules:
- if the test fails, unload half of the drivers currently loaded and repeat
  (that would probably involve rebooting the system, so always note what
  drivers have been loaded before the test),
- if the test succeeds, load half of the drivers you have unloaded most
  recently and repeat.
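
The bookkeeping for such a binary search can be done with a saved list of
module names.  The two functions below are a minimal sketch; their names,
module_names and first_half, are made up for this example, and the input is
assumed to have the usual lsmod layout (one header line, then one module per
line with the name in the first column).

```shell
# Hypothetical helper: read lsmod-style text on standard input and print
# the module names, one per line (the header line is skipped).
module_names() {
    awk 'NR > 1 { print $1 }'
}

# Hypothetical helper: print the first half (rounded up) of the lines of
# the file given as the first argument.
first_half() {
    total=$(wc -l < "$1")
    head -n $(( (total + 1) / 2 )) "$1"
}
```

For example, "lsmod | module_names > modules.before" records what was loaded
before the test, and "first_half modules.before" lists the candidates to unload
in the next step.  The unloading itself is left out of the sketch, because
modules that are in use by other modules cannot be removed in arbitrary order.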

Once you have found the failing driver (there can be more than just one of
them), you have to unload it every time before hibernation.  In that case please
make sure to report the problem with the driver.

It is also possible that the "devices" test will still fail after you have
unloaded all modules.  In that case, you may want to look in your kernel
configuration for the drivers that can be compiled as modules (and test again
with these drivers compiled as modules).  You may also try to use some special
kernel command line options such as "noapic", "noacpi" or even "acpi=off".

If the "platform" test fails, there is a problem with the handling of the
platform (e.g. ACPI) firmware on your system.  In that case the "platform" mode
of hibernation is not likely to work.  You can try the "shutdown" mode, but that
is rather a poor man's workaround.

If the "processors" test fails, the disabling/enabling of nonboot CPUs does not
work (of course, this may only be an issue on SMP systems) and the problem
should be reported.  In that case you can also try to switch the nonboot CPUs
off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and
see if that works.
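
Switching the CPUs off and on by hand can be sketched as follows.  The
function name, cycle_cpus, and the CPU_DIR variable are made up for this
example (CPU_DIR defaults to /sys/devices/system/cpu); CPUs that have no
"online" attribute are simply skipped.

```shell
# Hypothetical helper, not part of the kernel or of any standard tool.
CPU_DIR="${CPU_DIR:-/sys/devices/system/cpu}"

# cycle_cpus
# Take every CPU that has an "online" attribute offline and bring it back
# online, stopping at the first write that fails.
cycle_cpus() {
    for f in "$CPU_DIR"/cpu[0-9]*/online; do
        # Skip the unexpanded pattern if no CPU has the attribute.
        [ -e "$f" ] || continue
        echo 0 > "$f" || { echo "offlining failed: $f" >&2; return 1; }
        echo 1 > "$f" || { echo "onlining failed: $f" >&2; return 1; }
        echo "ok: $f"
    done
}
```

If this fails or hangs outside of any suspend test, the problem is in the CPU
hotplug path rather than in the hibernation code itself.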

If the "core" test fails, the suspending of the system/platform devices has
failed (these devices are suspended on one CPU with interrupts off).  The
problem is most probably hardware-related and serious, so it should be
reported.

A failure of any of the "platform", "processors" or "core" tests may cause your
system to hang or become unstable, so please beware.  Such a failure usually
indicates a serious problem that may very well be related to the hardware, but
please report it anyway.

b) Testing minimal configuration

If all of the hibernation test modes work, you can boot the system with the
"init=/bin/bash" command line parameter and attempt to hibernate in the
"reboot", "shutdown" and "platform" modes.  If that does not work, there
probably is a problem with a driver statically compiled into the kernel and you
can try to compile more drivers as modules, so that they can be tested
individually.  Otherwise, there is a problem with a modular driver and you can
find it by loading half of the modules you normally use and binary searching
in accordance with the algorithm:
- if there are n modules loaded and the attempt to suspend and resume fails,
  unload n/2 of the modules and try again (that would probably involve rebooting
  the system),
- if there are n modules loaded and the attempt to suspend and resume succeeds,
  load n/2 modules more and try again.

Again, if you find the offending modules, they must be unloaded every time
before hibernation, and please report the problem with them.

c) Advanced debugging

If hibernation does not work on your system even in the minimal configuration
and compiling more drivers as modules is not practical or some modules cannot
be unloaded, you can use one of the more advanced debugging techniques to find
the problem.  First, if there is a serial port in your box, you can boot the
kernel with the 'no_console_suspend' parameter and try to log kernel messages
using the serial console.  This may provide you with some information about the
reasons for the suspend (resume) failure.  Alternatively, it may be possible to
use a FireWire port for debugging with firescope
(ftp://ftp.firstfloor.org/pub/ak/firescope/).  On x86 it is also possible to
use the PM_TRACE mechanism documented in Documentation/s2ram.txt .
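
Before relying on the serial console, it may be worth checking that the
parameter has really reached the running kernel.  The function below is a
sketch under the assumption that /proc/cmdline is available; its name,
cmdline_has, and the optional file argument are made up for this example.

```shell
# Hypothetical helper: succeed if the kernel command line contains exactly
# the parameter given as the first argument.  The optional second argument
# (file to read) defaults to /proc/cmdline and is there only so that the
# function can be tried on a sample file.
cmdline_has() {
    # Put every parameter on its own line and look for a whole-line,
    # fixed-string match.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example, "cmdline_has no_console_suspend || echo missing" prints "missing"
if the kernel has been booted without the parameter.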

2. Testing suspend to RAM (STR)

To verify that the STR works, it is generally more convenient to use the s2ram
tool available from http://suspend.sf.net and documented at
http://en.opensuse.org/s2ram .  However, before doing that it is recommended to
carry out STR testing using the facility described in section 1.

Namely, after writing "freezer", "devices", "platform", "processors", or "core"
into /sys/power/pm_test (available if the kernel is compiled with
CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding
to the given string.  The STR test modes are defined in the same way as for
hibernation, so please refer to section 1 for more information about them.  In
particular, the "core" test allows you to test everything except for the actual
invocation of the platform firmware in order to put the system into the sleep
state.

Among other things, the testing with the help of /sys/power/pm_test may allow
you to identify drivers that fail to suspend or resume their devices.  They
should be unloaded every time before an STR transition.
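
A single STR test run can be sketched as follows.  The example assumes that
"mem" is the string accepted by /sys/power/state for suspend to RAM, which is
the usual one but is not stated in this document; str_test and PM_DIR are
names made up for this example (PM_DIR defaults to /sys/power).

```shell
# Hypothetical helper, not part of the kernel or of any standard tool.
PM_DIR="${PM_DIR:-/sys/power}"

# str_test LEVEL
# Select the test LEVEL ("freezer", "devices", "platform", "processors" or
# "core"), start an STR transition in that test mode and switch back to the
# normal operations afterwards.  Returns the status of the suspend attempt.
str_test() {
    echo "$1" > "$PM_DIR/pm_test" || return 1
    # "mem" is assumed to select suspend to RAM.
    echo mem > "$PM_DIR/state"
    result=$?
    echo none > "$PM_DIR/pm_test"
    return "$result"
}
```

For example, "str_test devices" exercises the freezing of processes and the
suspending of devices without actually putting the system to sleep.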

Next, you can follow the instructions at http://en.opensuse.org/s2ram to test
the system, but if it does not work "out of the box", you may need to boot it
with "init=/bin/bash" and test s2ram in the minimal configuration.  In that
case, you may be able to search for failing drivers by following the procedure
analogous to the one described in section 1.  If you find some failing drivers,
you will have to unload them every time before an STR transition (i.e. before
you run s2ram), and please report the problems with them.