debug lockups: Improve lockup detection
When debugging a recent lockup bug i found various deficiencies
in how our current lockup detection helpers work:
- SysRq-L is not very efficient as it uses a workqueue, hence
it cannot punch through hard lockups and cannot see through
most soft lockups either.
- The SysRq-L code depends on the NMI watchdog - which is off
by default.
- We dont print backtraces from the RCU code's built-in
'RCU state machine is stuck' debug code. This debug
code tends to be one of the first (and only) mechanisms
that show that a lockup has occured.
This patch changes the code so taht we:
- Trigger the NMI backtrace code from SysRq-L instead of using
a workqueue (which cannot punch through hard lockups)
- Trigger print-all-CPU-backtraces from the RCU lockup detection
code
Also decouple the backtrace printing code from the NMI watchdog:
- Dont use variable size cpumasks (it might not be initialized
and they are a bit more fragile anyway)
- Trigger an NMI immediately via an IPI, instead of waiting
for the NMI tick to occur. This is a lot faster and can
produce more relevant backtraces. It will also work if the
NMI watchdog is disabled.
- Dont print the 'dazed and confused' message when we print
a backtrace from the NMI
- Do a show_regs() plus a dump_stack() to get maximum info
out of the dump. Worst-case we get two stacktraces - which
is not a big deal. Sometimes, if register content is
corrupted, the precise stack walker in show_regs() wont
give us a full backtrace - in this case dump_stack() will
do it.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 7717b95..9c5fa9f 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -35,6 +35,7 @@
#include <linux/rcupdate.h>
#include <linux/interrupt.h>
#include <linux/sched.h>
+#include <linux/nmi.h>
#include <asm/atomic.h>
#include <linux/bitops.h>
#include <linux/module.h>
@@ -469,6 +470,8 @@
}
printk(" (detected by %d, t=%ld jiffies)\n",
smp_processor_id(), (long)(jiffies - rsp->gp_start));
+ trigger_all_cpu_backtrace();
+
force_quiescent_state(rsp, 0); /* Kick them all. */
}
@@ -479,12 +482,14 @@
printk(KERN_ERR "INFO: RCU detected CPU %d stall (t=%lu jiffies)\n",
smp_processor_id(), jiffies - rsp->gp_start);
- dump_stack();
+ trigger_all_cpu_backtrace();
+
spin_lock_irqsave(&rnp->lock, flags);
if ((long)(jiffies - rsp->jiffies_stall) >= 0)
rsp->jiffies_stall =
jiffies + RCU_SECONDS_TILL_STALL_RECHECK;
spin_unlock_irqrestore(&rnp->lock, flags);
+
set_need_resched(); /* kick ourselves to get things going. */
}