| Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 |  | 1 | The Linux Watchdog driver API. | 
|  | 2 |  | 
|  | 3 | Copyright 2002 Christer Weingel <wingel@nano-system.com> | 
|  | 4 |  | 
|  | 5 | Some parts of this document are copied verbatim from the sbc60xxwdt | 
|  | 6 | driver which is (c) Copyright 2000 Jakob Oestergaard <jakob@ostenfeld.dk> | 
|  | 7 |  | 
|  | 8 | This document describes the state of the Linux 2.4.18 kernel. | 
|  | 9 |  | 
|  | 10 | Introduction: | 
|  | 11 |  | 
|  | 12 | A Watchdog Timer (WDT) is a hardware circuit that can reset the | 
|  | 13 | computer system in case of a software fault.  You probably knew that | 
|  | 14 | already. | 
|  | 15 |  | 
|  | 16 | Usually a userspace daemon will notify the kernel watchdog driver via the | 
|  | 17 | /dev/watchdog special device file that userspace is still alive, at | 
|  | 18 | regular intervals.  When such a notification occurs, the driver will | 
|  | 19 | usually tell the hardware watchdog that everything is in order, and | 
|  | 20 | that the watchdog should wait for yet another little while to reset | 
|  | 21 | the system.  If userspace fails (RAM error, kernel bug, whatever), the | 
|  | 22 | notifications cease to occur, and the hardware watchdog will reset the | 
|  | 23 | system (causing a reboot) after the timeout occurs. | 
|  | 24 |  | 
|  | 25 | The Linux watchdog API is a rather ad hoc construction and different | 
|  | 26 | drivers implement different, and sometimes incompatible, parts of it. | 
|  | 27 | This file is an attempt to document the existing usage and allow | 
|  | 28 | future driver writers to use it as a reference. | 
|  | 29 |  | 
|  | 30 | The simplest API: | 
|  | 31 |  | 
|  | 32 | All drivers support the basic mode of operation, where the watchdog | 
|  | 33 | activates as soon as /dev/watchdog is opened and will reboot unless | 
|  | 34 | the watchdog is pinged within a certain time.  This time is called the | 
|  | 35 | timeout or margin.  The simplest way to ping the watchdog is to write | 
|  | 36 | some data to the device.  So a very simple watchdog daemon would look | 
|  | 37 | like this: | 
|  | 38 |  | 
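|  |  | #include <fcntl.h> | 
|  |  | #include <stdio.h> | 
|  |  | #include <stdlib.h> | 
|  |  | #include <unistd.h> | 
|  |  |  | 
|  | 39 | int main(void) { | 
|  |  |     /* Opening the device activates the watchdog. */ | 
|  | 40 |     int fd = open("/dev/watchdog", O_WRONLY); | 
|  | 41 |     if (fd == -1) { | 
|  | 42 |         perror("watchdog"); | 
|  | 43 |         exit(1); | 
|  | 44 |     } | 
|  | 45 |     while (1) { | 
|  |  |         /* Any write to the device counts as a ping. */ | 
|  | 46 |         write(fd, "\0", 1); | 
|  | 47 |         sleep(10); | 
|  | 48 |     } | 
|  | 49 | } | 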
|  | 50 |  | 
|  | 51 | A more advanced daemon could, for example, check that an HTTP server is | 
|  | 52 | still responding before doing the write call to ping the watchdog. | 
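
As a rough sketch of that idea, the write can be made conditional on a
health check.  The names ping_if_healthy and service_is_alive are
hypothetical, and the check is only a placeholder that always succeeds;
a real daemon would put its own test (for example a local HTTP request)
there.

```c
#include <unistd.h>

/* Hypothetical stand-in for a real health check, such as an HTTP
 * request to a local server.  It always reports success here. */
int service_is_alive(void)
{
    return 1;
}

/* Ping the watchdog only if the health check passes.
 * Returns 1 if a ping was written, 0 if the check failed and the ping
 * was withheld, -1 if the write itself failed. */
int ping_if_healthy(int fd, int (*check)(void))
{
    if (!check())
        return 0;
    if (write(fd, "\0", 1) != 1)
        return -1;
    return 1;
}
```

Calling ping_if_healthy(fd, service_is_alive) in place of the bare
write in the main loop means that a hung service eventually leads to a
reset, because the pings stop.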
|  | 53 |  | 
|  | 54 | When the device is closed, the watchdog is disabled.  This is not | 
|  | 55 | always such a good idea, since if there is a bug in the watchdog | 
|  | 56 | daemon and it crashes, the system will not reboot.  Because of this, | 
|  | 57 | some of the drivers support the configuration option "Disable watchdog | 
|  | 58 | shutdown on close", CONFIG_WATCHDOG_NOWAYOUT.  If it is set to Y when | 
|  | 59 | compiling the kernel, there is no way of disabling the watchdog once | 
|  | 60 | it has been started.  So, if the watchdog daemon crashes, the system | 
|  | 61 | will reboot after the timeout has passed. | 
|  | 62 |  | 
|  | 63 | Some other drivers will not disable the watchdog unless a specific | 
|  | 64 | magic character 'V' has been sent to /dev/watchdog just before closing | 
|  | 65 | the file.  If the userspace daemon closes the file without sending | 
|  | 66 | this special character, the driver will assume that the daemon (and | 
|  | 67 | userspace in general) died, and will stop pinging the watchdog without | 
|  | 68 | disabling it first.  This will then cause a reboot. | 
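
A minimal sketch of a clean shutdown for drivers with this magic
character handling might look as follows.  The function name
watchdog_close_clean is made up for illustration, and on drivers
without the handling the 'V' is presumably just treated as an ordinary
ping.

```c
#include <unistd.h>

/* Send the magic character 'V' and then close the device, so that a
 * driver with magic close handling disables the watchdog instead of
 * letting it fire.  Returns 0 on success, -1 on error. */
int watchdog_close_clean(int fd)
{
    if (write(fd, "V", 1) != 1) {
        /* The clean shutdown could not be announced.  The caller has
         * to decide whether to keep pinging or to close anyway. */
        return -1;
    }
    return close(fd);
}
```

A daemon would call this from its normal exit path only.  If the daemon
crashes, the 'V' is never written and the reboot described above still
happens.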
|  | 69 |  | 
|  | 70 | The ioctl API: | 
|  | 71 |  | 
|  | 72 | All conforming drivers also support an ioctl API. | 
|  | 73 |  | 
|  | 74 | Pinging the watchdog using an ioctl: | 
|  | 75 |  | 
|  | 76 | All drivers that have an ioctl interface support at least one ioctl, | 
|  | 77 | KEEPALIVE.  This ioctl does exactly the same thing as a write to the | 
|  | 78 | watchdog device, so the main loop in the above program could be | 
|  | 79 | replaced with: | 
|  | 80 |  | 
|  | 81 | while (1) { | 
|  | 82 |     ioctl(fd, WDIOC_KEEPALIVE, 0); | 
|  | 83 |     sleep(10); | 
|  | 84 | } | 
|  | 85 |  | 
|  | 86 | The argument to the ioctl is ignored. | 
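
Since not every driver has an ioctl interface, a daemon that should
work with any driver could try the ioctl first and fall back to a plain
write.  This is only a sketch.  The helper name watchdog_ping is an
assumption, and it falls back on any ioctl error rather than inspecting
errno.

```c
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Ping the watchdog, preferring the KEEPALIVE ioctl.
 * Returns 0 if either method succeeded, -1 if both failed. */
int watchdog_ping(int fd)
{
    if (ioctl(fd, WDIOC_KEEPALIVE, 0) == 0)
        return 0;
    /* The ioctl is not available on this device: write a byte instead. */
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}
```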
|  | 87 |  | 
|  | 88 | Setting and getting the timeout: | 
|  | 89 |  | 
|  | 90 | For some drivers it is possible to modify the watchdog timeout on the | 
|  | 91 | fly with the SETTIMEOUT ioctl.  Those drivers have the WDIOF_SETTIMEOUT | 
|  | 92 | flag set in their options field.  The argument is an integer | 
|  | 93 | representing the timeout in seconds.  The driver returns the real | 
|  | 94 | timeout used in the same variable, and this timeout might differ from | 
|  | 95 | the requested one due to limitations of the hardware. | 
|  | 96 |  | 
|  | 97 | int timeout = 45; | 
|  | 98 | ioctl(fd, WDIOC_SETTIMEOUT, &timeout); | 
|  | 99 | printf("The timeout was set to %d seconds\n", timeout); | 
|  | 100 |  | 
|  | 101 | This example might actually print "The timeout was set to 60 seconds" | 
|  | 102 | if the device has a granularity of minutes for its timeout. | 
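
A slightly more defensive variant checks the WDIOF_SETTIMEOUT flag
before trying to change the timeout.  This is a sketch under the
assumption that the driver fills in the options field correctly.  The
function name watchdog_set_timeout is invented here.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Try to set the timeout to 'seconds'.
 * Returns the timeout the driver actually uses, which may differ from
 * the request, or -1 if the driver does not advertise
 * WDIOF_SETTIMEOUT or one of the ioctls fails. */
int watchdog_set_timeout(int fd, int seconds)
{
    struct watchdog_info ident;
    int timeout = seconds;

    if (ioctl(fd, WDIOC_GETSUPPORT, &ident) == -1)
        return -1;
    if (!(ident.options & WDIOF_SETTIMEOUT))
        return -1;
    if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout) == -1)
        return -1;
    return timeout;
}
```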
|  | 103 |  | 
|  | 104 | Starting with the Linux 2.4.18 kernel, it is possible to query the | 
|  | 105 | current timeout using the GETTIMEOUT ioctl. | 
|  | 106 |  | 
|  | 107 | ioctl(fd, WDIOC_GETTIMEOUT, &timeout); | 
|  | 108 | printf("The timeout is %d seconds\n", timeout); | 
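
One use for GETTIMEOUT is to derive how often the daemon should ping
instead of hardcoding the sleep.  The sketch below pings at half the
timeout.  That factor is a rule of thumb assumed for this example and
is not something the drivers require, and both function names are
made up.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Assumed rule of thumb: ping at half the timeout, but never less
 * often than once per second. */
int ping_interval_for(int timeout)
{
    int interval = timeout / 2;

    return interval < 1 ? 1 : interval;
}

/* Ask the driver for its timeout and derive a ping interval from it.
 * If GETTIMEOUT is not supported, fallback_timeout is used instead. */
int watchdog_ping_interval(int fd, int fallback_timeout)
{
    int timeout;

    if (ioctl(fd, WDIOC_GETTIMEOUT, &timeout) == -1)
        timeout = fallback_timeout;
    return ping_interval_for(timeout);
}
```

The fallback matters for drivers with a hardcoded timeout, where the
daemon has to know the value from somewhere else.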
|  | 109 |  | 
|  | 110 | Environmental monitoring: | 
|  | 111 |  | 
|  | 112 | All watchdog drivers are required to return more information about the | 
|  | 113 | system.  Some do temperature, fan and power level monitoring, and some can | 
|  | 114 | tell you the reason for the last reboot of the system.  The GETSUPPORT | 
|  | 115 | ioctl is available to ask what the device can do: | 
|  | 116 |  | 
|  | 117 | struct watchdog_info ident; | 
|  | 118 | ioctl(fd, WDIOC_GETSUPPORT, &ident); | 
|  | 119 |  | 
|  | 120 | The fields returned in the ident struct are: | 
|  | 121 |  | 
|  | 122 | identity		a string identifying the watchdog driver | 
|  | 123 | firmware_version	the firmware version of the card if available | 
|  | 124 | options			flags describing what the device supports | 
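
The three fields can be dumped with a few lines of C.  This is a sketch
only.  The function name print_watchdog_info is an assumption, and the
identity string is printed with a length bound in case a driver does
not terminate it.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Print what GETSUPPORT reports.  Returns 0 on success, -1 if the
 * ioctl is not supported or fails. */
int print_watchdog_info(int fd)
{
    struct watchdog_info ident;

    if (ioctl(fd, WDIOC_GETSUPPORT, &ident) == -1) {
        perror("WDIOC_GETSUPPORT");
        return -1;
    }
    /* identity is a fixed-size array, so bound the read. */
    printf("identity:         %.*s\n", (int)sizeof(ident.identity),
           (const char *)ident.identity);
    printf("firmware_version: %u\n", (unsigned int)ident.firmware_version);
    printf("options:          0x%04x\n", (unsigned int)ident.options);
    return 0;
}
```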
|  | 125 |  | 
|  | 126 | The options field can have the following bits set, and describes what | 
|  | 127 | kind of information the GETSTATUS and GETBOOTSTATUS ioctls can | 
|  | 128 | return.   [FIXME -- Is this correct?] | 
|  | 129 |  | 
|  | 130 | WDIOF_OVERHEAT		Reset due to CPU overheat | 
|  | 131 |  | 
|  | 132 | The machine was last rebooted by the watchdog because the thermal limit was | 
|  | 133 | exceeded | 
|  | 134 |  | 
|  | 135 | WDIOF_FANFAULT		Fan failed | 
|  | 136 |  | 
|  | 137 | A system fan monitored by the watchdog card has failed | 
|  | 138 |  | 
|  | 139 | WDIOF_EXTERN1		External relay 1 | 
|  | 140 |  | 
|  | 141 | External monitoring relay/source 1 was triggered. Controllers intended for | 
|  | 142 | real world applications include external monitoring pins that will trigger | 
|  | 143 | a reset. | 
|  | 144 |  | 
|  | 145 | WDIOF_EXTERN2		External relay 2 | 
|  | 146 |  | 
|  | 147 | External monitoring relay/source 2 was triggered | 
|  | 148 |  | 
|  | 149 | WDIOF_POWERUNDER	Power bad/power fault | 
|  | 150 |  | 
|  | 151 | The machine is showing an undervoltage status | 
|  | 152 |  | 
|  | 153 | WDIOF_CARDRESET		Card previously reset the CPU | 
|  | 154 |  | 
|  | 155 | The last reboot was caused by the watchdog card | 
|  | 156 |  | 
|  | 157 | WDIOF_POWEROVER		Power over voltage | 
|  | 158 |  | 
|  | 159 | The machine is showing an overvoltage status. Note that if one level is | 
|  | 160 | under and one over both bits will be set - this may seem odd but makes | 
|  | 161 | sense. | 
|  | 162 |  | 
|  | 163 | WDIOF_KEEPALIVEPING	Keep alive ping reply | 
|  | 164 |  | 
|  | 165 | The watchdog saw a keepalive ping since it was last queried. | 
|  | 166 |  | 
|  | 167 | WDIOF_SETTIMEOUT	Can set/get the timeout | 
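
To make a flags word readable, the bits listed above can be decoded
with a small table.  The table and the function name
print_watchdog_flags are illustrative.  Only the WDIOF_* constants come
from <linux/watchdog.h>, and only the bits described in this document
are listed.

```c
#include <stdio.h>
#include <linux/watchdog.h>

/* The flag bits described in this document, with printable names. */
static const struct {
    int bit;
    const char *name;
} wdiof_names[] = {
    { WDIOF_OVERHEAT,      "WDIOF_OVERHEAT" },
    { WDIOF_FANFAULT,      "WDIOF_FANFAULT" },
    { WDIOF_EXTERN1,       "WDIOF_EXTERN1" },
    { WDIOF_EXTERN2,       "WDIOF_EXTERN2" },
    { WDIOF_POWERUNDER,    "WDIOF_POWERUNDER" },
    { WDIOF_CARDRESET,     "WDIOF_CARDRESET" },
    { WDIOF_POWEROVER,     "WDIOF_POWEROVER" },
    { WDIOF_KEEPALIVEPING, "WDIOF_KEEPALIVEPING" },
    { WDIOF_SETTIMEOUT,    "WDIOF_SETTIMEOUT" },
};

/* Print one line per known bit set in 'flags'.
 * Returns the number of known bits that were set. */
int print_watchdog_flags(FILE *out, int flags)
{
    int count = 0;
    size_t i;

    for (i = 0; i < sizeof(wdiof_names) / sizeof(wdiof_names[0]); i++) {
        if (flags & wdiof_names[i].bit) {
            fprintf(out, "%s\n", wdiof_names[i].name);
            count++;
        }
    }
    return count;
}
```

The same function works for the options field from GETSUPPORT and for
the values returned by the status ioctls, since they share the bit
definitions.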
|  | 168 |  | 
|  | 169 |  | 
|  | 170 | For those drivers that return any bits set in the options field, the | 
|  | 171 | GETSTATUS and GETBOOTSTATUS ioctls can be used to ask for the current | 
|  | 172 | status, and the status at the last reboot, respectively. | 
|  | 173 |  | 
|  | 174 | int flags; | 
|  | 175 | ioctl(fd, WDIOC_GETSTATUS, &flags); | 
|  | 176 |  | 
|  | 177 | or | 
|  | 178 |  | 
|  | 179 | ioctl(fd, WDIOC_GETBOOTSTATUS, &flags); | 
|  | 180 |  | 
|  | 181 | Note that not all devices support these two calls, and some only | 
|  | 182 | support the GETBOOTSTATUS call. | 
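
A typical use of GETBOOTSTATUS is to log at startup whether the
previous reboot was forced by the watchdog.  The sketch below only
looks at WDIOF_CARDRESET.  The function names are hypothetical, and a
driver that does not implement the call simply yields -1.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Returns 1 if the flags say the watchdog card reset the CPU. */
int flags_indicate_card_reset(int flags)
{
    return (flags & WDIOF_CARDRESET) != 0;
}

/* Returns 1 if the last reboot was caused by the watchdog, 0 if not,
 * and -1 if the driver cannot tell (GETBOOTSTATUS failed). */
int last_reboot_was_watchdog(int fd)
{
    int flags;

    if (ioctl(fd, WDIOC_GETBOOTSTATUS, &flags) == -1)
        return -1;
    return flags_indicate_card_reset(flags);
}
```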
|  | 183 |  | 
|  | 184 | Some drivers can measure the temperature using the GETTEMP ioctl.  The | 
|  | 185 | returned value is the temperature in degrees Fahrenheit. | 
|  | 186 |  | 
|  | 187 | int temperature; | 
|  | 188 | ioctl(fd, WDIOC_GETTEMP, &temperature); | 
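
Since the value is in Fahrenheit, a daemon that reports in Celsius has
to convert it.  A small sketch, with invented helper names:

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Convert whole degrees Fahrenheit to degrees Celsius. */
double fahrenheit_to_celsius(int fahrenheit)
{
    return (fahrenheit - 32) * 5.0 / 9.0;
}

/* Read the temperature from the watchdog and store it in *celsius.
 * Returns 0 on success, -1 if GETTEMP is not supported or fails. */
int watchdog_temperature_celsius(int fd, double *celsius)
{
    int fahrenheit;

    if (ioctl(fd, WDIOC_GETTEMP, &fahrenheit) == -1)
        return -1;
    *celsius = fahrenheit_to_celsius(fahrenheit);
    return 0;
}
```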
|  | 189 |  | 
|  | 190 | Finally, the SETOPTIONS ioctl can be used to control some aspects of | 
|  | 191 | the card's operation; right now the pcwd driver is the only one | 
|  | 192 | supporting this ioctl. | 
|  | 193 |  | 
|  | 194 | int options = 0; | 
|  | 195 | ioctl(fd, WDIOC_SETOPTIONS, &options); | 
|  | 196 |  | 
|  | 197 | The following options are available: | 
|  | 198 |  | 
|  | 199 | WDIOS_DISABLECARD	Turn off the watchdog timer | 
|  | 200 | WDIOS_ENABLECARD	Turn on the watchdog timer | 
|  | 201 | WDIOS_TEMPPANIC		Kernel panic on temperature trip | 
|  | 202 |  | 
|  | 203 | [FIXME -- better explanations] | 
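
As an illustration, turning the timer on or off through SETOPTIONS
could be wrapped like this.  The sketch assumes the ioctl takes a
pointer to the int holding the option bits, which is how
WDIOC_SETOPTIONS is declared in <linux/watchdog.h>.  The function name
watchdog_set_enabled is made up.

```c
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Turn the watchdog timer on (enable != 0) or off (enable == 0).
 * Returns 0 on success, -1 if the driver does not support SETOPTIONS
 * or the ioctl fails. */
int watchdog_set_enabled(int fd, int enable)
{
    int options = enable ? WDIOS_ENABLECARD : WDIOS_DISABLECARD;

    return ioctl(fd, WDIOC_SETOPTIONS, &options);
}
```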
|  | 204 |  | 
|  | 205 | Implementations in the current drivers in the kernel tree: | 
|  | 206 |  | 
|  | 207 | Here I have tried to summarize what the different drivers support and | 
|  | 208 | where they do strange things compared to the other drivers. | 
|  | 209 |  | 
|  | 210 | acquirewdt.c -- Acquire Single Board Computer | 
|  | 211 |  | 
|  | 212 | This driver has a hardcoded timeout of 1 minute | 
|  | 213 |  | 
|  | 214 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 215 |  | 
|  | 216 | GETSUPPORT returns WDIOF_KEEPALIVEPING.  GETSTATUS will return 1 if | 
|  | 217 | the device is open, 0 if not.  [FIXME -- isn't this rather | 
|  | 218 | silly?  To be able to use the ioctl, the device must be open | 
|  | 219 | and so GETSTATUS will always return 1]. | 
|  | 220 |  | 
|  | 221 | advantechwdt.c -- Advantech Single Board Computer | 
|  | 222 |  | 
|  | 223 | Timeout that defaults to 60 seconds, supports SETTIMEOUT. | 
|  | 224 |  | 
|  | 225 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 226 |  | 
|  | 227 | GETSUPPORT returns WDIOF_KEEPALIVEPING and WDIOF_SETTIMEOUT. | 
|  | 228 | The GETSTATUS call returns whether the device is open or not. | 
|  | 229 | [FIXME -- silliness again?] | 
|  | 230 |  | 
|  | 231 | eurotechwdt.c -- Eurotech CPU-1220/1410 | 
|  | 232 |  | 
|  | 233 | The timeout can be set using the SETTIMEOUT ioctl and defaults | 
|  | 234 | to 60 seconds. | 
|  | 235 |  | 
|  | 236 | Also has a module parameter "ev" (event type) that controls | 
|  | 237 | what should happen on a timeout: the string "int", or anything | 
|  | 238 | else, which causes a reboot.  [FIXME -- better description] | 
|  | 239 |  | 
|  | 240 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 241 |  | 
|  | 242 | GETSUPPORT returns WDIOF_CARDRESET and WDIOF_SETTIMEOUT, but | 
|  | 243 | GETSTATUS is not supported and GETBOOTSTATUS just returns 0. | 
|  | 244 |  | 
|  | 245 | i810-tco.c -- Intel 810 chipset | 
|  | 246 |  | 
|  | 247 | Also has support for a lot of other i8x0 stuff, but the | 
|  | 248 | watchdog is one of the things. | 
|  | 249 |  | 
|  | 250 | The timeout is set using the module parameter "i810_margin", | 
|  | 251 | which is in steps of 0.6 seconds where 2<i810_margin<64.  The | 
|  | 252 | driver supports the SETTIMEOUT ioctl. | 
|  | 253 |  | 
|  | 254 | Supports CONFIG_WATCHDOG_NOWAYOUT. | 
|  | 255 |  | 
|  | 256 | GETSUPPORT returns WDIOF_SETTIMEOUT.  The GETSTATUS call | 
|  | 257 | returns some kind of timer value which is not compatible with | 
|  | 258 | the other drivers.  GETBOOTSTATUS returns some kind of | 
|  | 259 | hardware specific boot status.  [FIXME -- describe this] | 
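
The i810_margin arithmetic above can be worked through in code.  The
sketch assumes that one step is exactly 0.6 seconds and that the valid
range is 3 to 63, as 2<i810_margin<64 is read here.  It uses
milliseconds to stay in integer arithmetic, and the helper names are
invented.

```c
/* Assumptions taken from the text: one i810_margin step is 0.6 s
 * (600 ms) and the value must satisfy 2 < i810_margin < 64. */
enum {
    I810_STEP_MS    = 600,
    I810_MARGIN_MIN = 3,
    I810_MARGIN_MAX = 63
};

/* Smallest valid i810_margin whose timeout is at least timeout_ms.
 * Requests outside the range are clamped to the nearest valid value. */
int i810_margin_from_ms(int timeout_ms)
{
    int margin = (timeout_ms + I810_STEP_MS - 1) / I810_STEP_MS;

    if (margin < I810_MARGIN_MIN)
        margin = I810_MARGIN_MIN;
    if (margin > I810_MARGIN_MAX)
        margin = I810_MARGIN_MAX;
    return margin;
}

/* Timeout in milliseconds that a given i810_margin corresponds to. */
int i810_margin_to_ms(int margin)
{
    return margin * I810_STEP_MS;
}
```

So a 30 second timeout corresponds to i810_margin=50, and the longest
timeout under these assumptions is 63 * 0.6 = 37.8 seconds.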
|  | 260 |  | 
|  | 261 | ib700wdt.c -- IB700 Single Board Computer | 
|  | 262 |  | 
|  | 263 | Default timeout of 30 seconds and the timeout is settable | 
|  | 264 | using the SETTIMEOUT ioctl.  Note that only a few timeout | 
|  | 265 | values are supported. | 
|  | 266 |  | 
|  | 267 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 268 |  | 
|  | 269 | GETSUPPORT returns WDIOF_KEEPALIVEPING and WDIOF_SETTIMEOUT. | 
|  | 270 | The GETSTATUS call returns whether the device is open or not. | 
|  | 271 | [FIXME -- silliness again?] | 
|  | 272 |  | 
|  | 273 | machzwd.c -- MachZ ZF-Logic | 
|  | 274 |  | 
|  | 275 | Hardcoded timeout of 10 seconds | 
|  | 276 |  | 
|  | 277 | Has a module parameter "action" that controls what happens | 
|  | 278 | when the timeout runs out which can be 0 = RESET (default), | 
|  | 279 | 1 = SMI, 2 = NMI, 3 = SCI. | 
|  | 280 |  | 
|  | 281 | Supports CONFIG_WATCHDOG_NOWAYOUT and the magic character | 
|  | 282 | 'V' close handling. | 
|  | 283 |  | 
|  | 284 | GETSUPPORT returns WDIOF_KEEPALIVEPING, and the GETSTATUS call | 
|  | 285 | returns whether the device is open or not.  [FIXME -- silliness | 
|  | 286 | again?] | 
|  | 287 |  | 
|  | 288 | mixcomwd.c -- MixCom Watchdog | 
|  | 289 |  | 
|  | 290 | [FIXME -- I'm unable to tell what the timeout is] | 
|  | 291 |  | 
|  | 292 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 293 |  | 
|  | 294 | GETSUPPORT returns WDIOF_KEEPALIVEPING, and GETSTATUS returns whether | 
|  | 295 | the device is opened or not [FIXME -- I'm not really sure how | 
|  | 296 | this works, there seems to be some magic connected to | 
|  | 297 | CONFIG_WATCHDOG_NOWAYOUT] | 
|  | 298 |  | 
|  | 299 | pcwd.c -- Berkshire PC Watchdog | 
|  | 300 |  | 
|  | 301 | Hardcoded timeout of 1.5 seconds | 
|  | 302 |  | 
|  | 303 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 304 |  | 
|  | 305 | GETSUPPORT returns WDIOF_OVERHEAT|WDIOF_CARDRESET and both | 
|  | 306 | GETSTATUS and GETBOOTSTATUS return something useful. | 
|  | 307 |  | 
|  | 308 | The SETOPTIONS call can be used to enable and disable the card | 
|  | 309 | and to ask the driver to call panic if the system overheats. | 
|  | 310 |  | 
|  | 311 | sbc60xxwdt.c -- 60xx Single Board Computer | 
|  | 312 |  | 
|  | 313 | Hardcoded timeout of 10 seconds | 
|  | 314 |  | 
|  | 315 | Does not support CONFIG_WATCHDOG_NOWAYOUT, but has the magic | 
|  | 316 | character 'V' close handling. | 
|  | 317 |  | 
|  | 318 | No bits set in GETSUPPORT | 
|  | 319 |  | 
|  | 320 | scx200.c -- National SCx200 CPUs | 
|  | 321 |  | 
|  | 322 | Not in the kernel yet. | 
|  | 323 |  | 
|  | 324 | The timeout is set using a module parameter "margin" which | 
|  | 325 | defaults to 60 seconds.  The timeout can also be set using | 
|  | 326 | SETTIMEOUT and read using GETTIMEOUT. | 
|  | 327 |  | 
|  | 328 | Supports a module parameter "nowayout" that is initialized | 
|  | 329 | with the value of CONFIG_WATCHDOG_NOWAYOUT.  Also supports the | 
|  | 330 | magic character 'V' handling. | 
|  | 331 |  | 
|  | 332 | shwdt.c -- SuperH 3/4 processors | 
|  | 333 |  | 
|  | 334 | [FIXME -- I'm unable to tell what the timeout is] | 
|  | 335 |  | 
|  | 336 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 337 |  | 
|  | 338 | GETSUPPORT returns WDIOF_KEEPALIVEPING, and the GETSTATUS call | 
|  | 339 | returns whether the device is open or not.  [FIXME -- silliness | 
|  | 340 | again?] | 
|  | 341 |  | 
|  | 342 | softdog.c -- Software watchdog | 
|  | 343 |  | 
|  | 344 | The timeout is set with the module parameter "soft_margin" | 
|  | 345 | which defaults to 60 seconds.  The timeout is also settable | 
|  | 346 | using the SETTIMEOUT ioctl. | 
|  | 347 |  | 
|  | 348 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 349 |  | 
|  | 350 | WDIOF_SETTIMEOUT bit set in GETSUPPORT | 
|  | 351 |  | 
|  | 352 | w83877f_wdt.c -- W83877F Computer | 
|  | 353 |  | 
|  | 354 | Hardcoded timeout of 30 seconds | 
|  | 355 |  | 
|  | 356 | Does not support CONFIG_WATCHDOG_NOWAYOUT, but has the magic | 
|  | 357 | character 'V' close handling. | 
|  | 358 |  | 
|  | 359 | No bits set in GETSUPPORT | 
|  | 360 |  | 
|  | 361 | w83627hf_wdt.c -- w83627hf watchdog | 
|  | 362 |  | 
|  | 363 | Timeout that defaults to 60 seconds, supports SETTIMEOUT. | 
|  | 364 |  | 
|  | 365 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 366 |  | 
|  | 367 | GETSUPPORT returns WDIOF_KEEPALIVEPING and WDIOF_SETTIMEOUT. | 
|  | 368 | The GETSTATUS call returns whether the device is open or not. | 
|  | 369 |  | 
|  | 370 | wdt.c -- ICS WDT500/501 ISA and | 
|  | 371 | wdt_pci.c -- ICS WDT500/501 PCI | 
|  | 372 |  | 
|  | 373 | Default timeout of 60 seconds.  The timeout is also settable | 
|  | 374 | using the SETTIMEOUT ioctl. | 
|  | 375 |  | 
|  | 376 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 377 |  | 
|  | 378 | GETSUPPORT returns with bits set depending on the actual | 
|  | 379 | card. The WDT501 supports a lot of external monitoring, the | 
|  | 380 | WDT500 much less. | 
|  | 381 |  | 
|  | 382 | wdt285.c -- Footbridge watchdog | 
|  | 383 |  | 
|  | 384 | The timeout is set with the module parameter "soft_margin" | 
|  | 385 | which defaults to 60 seconds.  The timeout is also settable | 
|  | 386 | using the SETTIMEOUT ioctl. | 
|  | 387 |  | 
|  | 388 | Does not support CONFIG_WATCHDOG_NOWAYOUT | 
|  | 389 |  | 
|  | 390 | WDIOF_SETTIMEOUT bit set in GETSUPPORT | 
|  | 391 |  | 
|  | 392 | wdt977.c -- Netwinder W83977AF chip | 
|  | 393 |  | 
|  | 394 | Hardcoded timeout of 3 minutes | 
|  | 395 |  | 
|  | 396 | Supports CONFIG_WATCHDOG_NOWAYOUT | 
|  | 397 |  | 
|  | 398 | Does not support any ioctls at all. | 
|  | 399 |  |