
List:       linux-kernel
Subject:    Re: workqueue: WARN at at kernel/workqueue.c:2176
From:       Lai Jiangshan <laijs@cn.fujitsu.com>
Date:       2014-05-16 3:50:42
Message-ID: 53758B12.8060609@cn.fujitsu.com

On 05/15/2014 12:52 AM, Jason J. Herne wrote:
> On 05/12/2014 10:17 PM, Sasha Levin wrote:
> > I don't have an easy way to reproduce it as I only saw the bug once, but
> > it happened when I started stressing the CPU hotplug paths by adding and removing
> > CPUs often. Maybe it has something to do with that?
> 
> As per the original report (http://article.gmane.org/gmane.linux.kernel/1643027)
> I am able to reproduce the problem.
> 
> The workload is (on S390 architecture):
> 2 processes onlining random cpus in a tight loop by using 'echo 1 >
> /sys/bus/cpu.../online'
> 2 processes offlining random cpus in a tight loop by using 'echo 0 >
> /sys/bus/cpu.../online'
> Otherwise, fairly idle system. load average: 5.82, 6.27, 6.27
> 
> The machine has 10 processors.
> The warning message sometimes hits within a few minutes of starting the
> workload. Other times it takes several hours.
> 
> 
> -- Jason J. Herne (jjherne@linux.vnet.ibm.com)
> 
> 


Hi, Peter and other scheduler Gurus:

When I was trying to test wq-VS-hotplug, I always hit a problem in the scheduler
with the following WARNING:

[   74.765519] WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
[   74.765520] Modules linked in: wq_hotplug(O) fuse cpufreq_ondemand ipv6 kvm_intel kvm uinput snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi e1000e snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer ptp iTCO_wdt iTCO_vendor_support lpc_ich snd mfd_core pps_core soundcore acpi_cpufreq i2c_i801 microcode wmi radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
[   74.765545] CPU: 1 PID: 13 Comm: migration/1 Tainted: G           O  3.15.0-rc3+ #153
[   74.765546] Hardware name: LENOVO ThinkCentre M8200T/  , BIOS 5JKT51AUS 11/02/2010
[   74.765547]  000000000000007c ffff880236199c88 ffffffff814d7d2c 0000000000000000
[   74.765550]  0000000000000000 ffff880236199cc8 ffffffff8103add4 ffff880236199cb8
[   74.765552]  ffffffff81023e1b ffff8802361861c0 0000000000000001 ffff88023fd92b40
[   74.765555] Call Trace:
[   74.765559]  [<ffffffff814d7d2c>] dump_stack+0x51/0x75
[   74.765562]  [<ffffffff8103add4>] warn_slowpath_common+0x81/0x9b
[   74.765564]  [<ffffffff81023e1b>] ? native_smp_send_reschedule+0x2d/0x4b
[   74.765566]  [<ffffffff8103ae08>] warn_slowpath_null+0x1a/0x1c
[   74.765568]  [<ffffffff81023e1b>] native_smp_send_reschedule+0x2d/0x4b
[   74.765571]  [<ffffffff8105c2ea>] smp_send_reschedule+0xa/0xc
[   74.765574]  [<ffffffff8105fe46>] resched_task+0x5e/0x62
[   74.765576]  [<ffffffff81060238>] check_preempt_curr+0x43/0x77
[   74.765578]  [<ffffffff81060680>] __migrate_task+0xda/0x100
[   74.765580]  [<ffffffff810606a6>] ? __migrate_task+0x100/0x100
[   74.765582]  [<ffffffff810606c3>] migration_cpu_stop+0x1d/0x22
[   74.765585]  [<ffffffff810a33c6>] cpu_stopper_thread+0x84/0x116
[   74.765587]  [<ffffffff814d8642>] ? __schedule+0x559/0x581
[   74.765590]  [<ffffffff814dae3c>] ? _raw_spin_lock_irqsave+0x12/0x3c
[   74.765592]  [<ffffffff8105bd75>] ? __smpboot_create_thread+0x109/0x109
[   74.765594]  [<ffffffff8105bf46>] smpboot_thread_fn+0x1d1/0x1d6
[   74.765598]  [<ffffffff81056665>] kthread+0xad/0xb5
[   74.765600]  [<ffffffff810565b8>] ? kthread_freezable_should_stop+0x41/0x41
[   74.765603]  [<ffffffff814e0e2c>] ret_from_fork+0x7c/0xb0
[   74.765605]  [<ffffffff810565b8>] ? kthread_freezable_should_stop+0x41/0x41
[   74.765607] ---[ end trace 662efb362b4e8ed0 ]---

After debugging, I found that the CPU being hotplugged in is active but !online
in this case. The problem was introduced by commit 5fbd036b. Some code assumes
that any cpu in cpu_active_mask is also online, but 5fbd036b breaks this
assumption, so the code that relies on it needs to be updated as well.
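
For reference, the check that fires is the offline guard in
native_smp_send_reschedule(). It looks roughly like this (paraphrased from
arch/x86/kernel/smp.c of this kernel, details may differ):

	static void native_smp_send_reschedule(int cpu)
	{
		/* the scheduler picked this cpu because it is in cpu_active_mask ... */
		if (unlikely(cpu_is_offline(cpu))) {
			/* ... but it is not online yet, so we hit this WARN */
			WARN_ON(1);
			return;
		}
		apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
	}

So in the trace above, check_preempt_curr()/resched_task() end up sending an IPI
to the active-but-!online cpu, and the guard WARNs.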


Hi, Jason J. Herne and Sasha Levin

Thank you for testing wq-VS-hotplug.

The following patch is just a workaround. After it is applied, the above WARNING
is gone, but I still can't hit the wq problem that you found.

You can use the following workaround patch to test wq-VS-hotplug again, or just
wait for the scheduler folks to send a proper fix.
(An interesting thing: 5fbd036b also touches the s390 architecture.)

Thanks,
Lai
---
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a9e710e..253a129 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -726,9 +726,10 @@ void set_cpu_present(unsigned int cpu, bool present)
 
 void set_cpu_online(unsigned int cpu, bool online)
 {
-	if (online)
+	if (online) {
 		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
-	else
+		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
+	} else
 		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..c1a712d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5043,7 +5043,6 @@ static int sched_cpu_active(struct notifier_block *nfb,
 				      unsigned long action, void *hcpu)
 {
 	switch (action & ~CPU_TASKS_FROZEN) {
-	case CPU_STARTING:
 	case CPU_DOWN_FAILED:
 		set_cpu_active((long)hcpu, true);
 		return NOTIFY_OK;
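
For clarity, with the first hunk applied set_cpu_online() ends up like this
(reconstructed from the diff above, nothing new added):

	void set_cpu_online(unsigned int cpu, bool online)
	{
		if (online) {
			cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
			/* mark the cpu active only once it is really online */
			cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
		} else
			cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
	}

The second hunk then drops CPU_STARTING from sched_cpu_active(), so the active
bit is no longer set before the cpu is actually online.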


