[prev in list] [next in list] [prev in thread] [next in thread] 

List:       linux-block
Subject:    [v4.13-rc BUG] system lockup when running big buffered write(4M) to IB SRP via mpath
From:       Ming Lei <ming.lei () redhat ! com>
Date:       2017-08-08 14:17:16
Message-ID: 20170808141715.GB22763 () ming ! t460p
[Download RAW message or body]

Hi Guys,

Laurence and I see a system lockup issue when running concurrent
big buffered write(4M bytes) to IB SRP on v4.13-rc3.

1 how to reproduce

1) setup IB_SRR & multi path

	#./start_opensm.sh
	#./start_srp.sh 

	# cat start_opensm.sh 
	#!/bin/bash
	rmmod ib_srpt
	opensm -F opensm.1.conf &
	opensm -F opensm.2.conf &

	# cat start_srp.sh
	run_srp_daemon  -f /etc/ddn/srp_daemon.conf -R 30 -T 10 -t 7000 -ance -i mlx5_0 -p 1 \
1>/root/srp1.log 2>&1 &  run_srp_daemon  -f /etc/ddn/srp_daemon.conf -R 30 -T 10 -t \
7000 -ance -i mlx5_1 -p 1 1>/root/srp2.log 2>&1 &  

2) run the following test script

	#./test-mp.sh
 
	#cat test-mp.sh 
	#!/bin/sh
	DEVS="360014051186d24cc27b4d049ffff94e3 36001405b2b5c6c24c084b6fa4d55da2f \
36001405b26ebe76dcb94a489f6f245f8"

	for i in $DEVS; do
		for j in `seq 0 15`; do
			./hammer_write.sh $i 1>/dev/null 2>&1 &
		done
	done

	#cat hammer_write.sh
	#!/bin/bash
	while true; do
		dd if=/dev/zero of=/dev/mapper/$1 bs=4096k count=800 
	done


2 lockup log
[root@ibclient ~]# ./test-mp.sh 
[root@ibclient ~]# [  169.095494] perf: interrupt took too long (2575 > 2500), \
lowering kernel.perf_event_max_sample_rate to 77000 [  178.569124] perf: interrupt \
took too long (3249 > 3218), lowering kernel.perf_event_max_sample_rate to 61000 [  \
190.586624] perf: interrupt took too long (4066 > 4061), lowering \
kernel.perf_event_max_sample_rate to 49000 [  210.381836] perf: interrupt took too \
long (5083 > 5082), lowering kernel.perf_event_max_sample_rate to 39000 [  \
240.498659] perf: interrupt took too long (6373 > 6353), lowering \
kernel.perf_event_max_sample_rate to 31000

[root@ibclient ~]# 
[root@ibclient ~]# [  292.252795] perf: interrupt took too long (8006 > 7966), \
lowering kernel.perf_event_max_sample_rate to 24000 [  410.354443] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.432725] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.499530] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.564952] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.644493] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.698091] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.755175] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.796784] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.837690] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  410.878743] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.476823] \
multipath_clone_and_map: 32 callbacks suppressed [  463.506773] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.586752] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.656880] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.698919] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.723572] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.751820] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.781110] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.828267] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.856997] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  463.885998] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.572060] \
multipath_clone_and_map: 5201 callbacks suppressed [  468.602191] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.655761] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.697540] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.738610] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.779597] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.820705] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.862712] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.904662] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.945132] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  468.985128] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  473.643013] \
multipath_clone_and_map: 6569 callbacks suppressed [  473.673466] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  473.713929] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  473.754989] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  473.796069] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  473.849768] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  473.891640] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  473.933321] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  473.973791] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  474.016739] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  474.059858] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  478.732265] \
multipath_clone_and_map: 6310 callbacks suppressed [  478.762164] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  478.802327] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  478.844235] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  478.886113] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  478.927164] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  478.968141] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  479.008772] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  479.060225] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  479.102873] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  479.145341] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  483.834884] \
multipath_clone_and_map: 5703 callbacks suppressed [  483.865056] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  483.906792] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  483.959492] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  484.001768] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  484.044785] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  484.087289] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  484.129854] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  484.171317] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  484.212314] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  484.263931] device-mapper: \
multipath: blk_get_request() returned -11 - requeuing [  484.982608] perf: interrupt \
took too long (10021 > 10007), lowering kernel.perf_event_max_sample_rate to 19000 [  \
485.362787] watchdog: BUG: soft lockup - CPU#11 stuck for 21s! [ksoftirqd/11:78] [  \
485.400762] Modules linked in: dm_round_robin xt_CHECKSUM ipt_MASQUERADE \
nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT \
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp \
llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle \
ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 \
nf_nat_ipv4 nf_nat rpcrdma nf_conntrack ib_isert iscsi_target_mod iptable_mangle \
iptable_security iptable_raw ebtable_filter ebtables ib_iser ip6table_filter \
ip6_tables libiscsi iptable_filter scsi_transport_iscsi target_core_mod ib_srp \
scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm \
mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul \
crc32_pclmul ghash_clmulni_intel [  485.809222]  pcbc joydev ipmi_si aesni_intel sg \
hpwdt hpilo crypto_simd iTCO_wdt cryptd iTCO_vendor_support ipmi_devintf gpio_ich \
glue_helper shpchp lpc_ich ipmi_msghandler acpi_power_meter pcspkr i7core_edac \
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs \
libcrc32c sd_mod amdkfd amd_iommu_v2 radeon mlx5_core i2c_algo_bit drm_kms_helper \
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlxfw drm ptp hpsa pps_core \
scsi_transport_sas crc32c_intel bnx2 i2c_core serio_raw devlink dm_mirror \
dm_region_hash dm_log dm_mod [last unloaded: ib_srpt] [  486.096352] CPU: 11 PID: 78 \
Comm: ksoftirqd/11 Tainted: G          I     4.13.0-rc3+ #23 [  486.142775] Hardware \
name: HP ProLiant DL380 G7, BIOS P67 08/16/2015 [  486.178816] task: ffff8a3fe453df40 \
task.stack: ffff9a6cc65a8000 [  486.212674] RIP: \
0010:__blk_recalc_rq_segments+0xb8/0x3d0 [  486.242444] RSP: 0018:ffff9a6cc65aba60 \
EFLAGS: 00000282 ORIG_RAX: ffffffffffffff10 [  486.284086] RAX: ffff8a4a8e26e440 RBX: \
0000000000001000 RCX: ffff8a4a8e26e440 [  486.323445] RDX: 0000000000000000 RSI: \
0000000000001000 RDI: ffff8a4a8e2699c8 [  486.363614] RBP: ffff9a6cc65abae0 R08: \
0000000000007000 R09: 0000000000000001 [  486.404621] R10: 0000000000001000 R11: \
0000000000000000 R12: 0000000000000000 [  486.445127] R13: 0000000000000000 R14: \
0000000000000001 R15: 0000000000000007 [  486.486120] FS:  0000000000000000(0000) \
GS:ffff8a4b66940000(0000) knlGS:0000000000000000 [  486.532771] CS:  0010 DS: 0000 \
ES: 0000 CR0: 0000000080050033 [  486.565772] CR2: 00007f32bfcee000 CR3: \
0000000d67609000 CR4: 00000000000006e0 [  486.606325] Call Trace:
[  486.620269]  ? mempool_free+0x2b/0x80
[  486.641248]  blk_recalc_rq_segments+0x28/0x40
[  486.666754]  blk_update_request+0x249/0x310
[  486.690838]  end_clone_bio+0x46/0x70 [dm_mod]
[  486.715949]  bio_endio+0xa1/0x120
[  486.734408]  blk_update_request+0xb7/0x310
[  486.757313]  scsi_end_request+0x38/0x1d0
[  486.779375]  scsi_io_completion+0x13c/0x630
[  486.802928]  scsi_finish_command+0xd9/0x120
[  486.825875]  scsi_softirq_done+0x12a/0x150
[  486.849029]  __blk_mq_complete_request+0x90/0x140
[  486.875484]  blk_mq_complete_request+0x16/0x20
[  486.900631]  scsi_mq_done+0x21/0x70
[  486.920462]  srp_recv_done+0x48f/0x610 [ib_srp]
[  486.946212]  __ib_process_cq+0x5c/0xb0 [ib_core]
[  486.972576]  ib_poll_handler+0x22/0x70 [ib_core]
[  486.998638]  irq_poll_softirq+0xaf/0x110
[  487.021727]  __do_softirq+0xc9/0x269
[  487.042027]  run_ksoftirqd+0x29/0x50
[  487.062675]  smpboot_thread_fn+0x110/0x160
[  487.085597]  kthread+0x109/0x140
[  487.104113]  ? sort_range+0x30/0x30
[  487.124120]  ? kthread_park+0x60/0x60
[  487.144624]  ret_from_fork+0x25/0x30
[  487.164640] Code: c3 83 fa 09 0f 84 8e 01 00 00 48 8b 44 24 20 48 8b 4c 24 28 45 \
31 ff 45 31 c0 45 31 c9 0f b6 80 36 06 00 00 88 44 24 17 8b 59 28 <44> 8b 71 2c 44 8b \
61 34 85 db 0f 84 92 01 00 00 45 89 f5 45 89 

Message from syslogd@ibclient at Aug  8 04:12:23 ...
 kernel:watchdog: BUG: soft lockup - CPU#11 stuck for 21s! [ksoftirqd/11:78]



Thanks,
Ming


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic