We are working on an embedded Linux system using Live555 WIS-Streamer to stream video over RTSP over a network.
On one particular system we see WIS-Streamer get stuck in the TASK_UNINTERRUPTIBLE state: from the command line, ps shows the status of the process as DW, and the children of the WIS-Streamer process are all listed in the Z (zombie) state.
It looks like there's nothing we can do once we're in this state, other than reboot (not desirable). However, we'd really like to get to the root cause of this - I suspect that within the streamer it's hanging on a blocking send() call or somesuch. Is there anything we can do, either in the code or from the command line, to try and narrow down what's blocked?
As an example, I've tried looking at the output of netstat (netstat -alp) to see if there are dangling sockets attached to the PID of the blocked/zombie threads, but to no avail.
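Other things on my list to try, depending on what's actually compiled into this kernel (wchan is almost always there, but /proc/<pid>/stack needs CONFIG_STACKTRACE, and /proc/<pid>/syscall isn't available on every arch/kernel):
$> cat /proc/546/wchan      # symbolic name of the kernel function the task is sleeping in
$> cat /proc/546/syscall    # syscall number and arguments it is stuck inside, if supported
$> cat /proc/546/stack      # kernel-side stack trace for the task, if supported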
Update with more info:
It's not thrashing the CPU; top lists the blocked & zombie threads as 0% mem / 0% CPU / VSZ 0.
Further things I've tried while poking about the system:
/proc/<pid>/status for the main & child threads. 546 is the parent, which is blocked:
$> cat /proc/546/status
Name: wis-streamer
State: D (disk sleep)
Tgid: 546
Pid: 546
PPid: 1
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups:
Threads: 1
SigQ: 17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000004102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed: 1
Cpus_allowed_list: 0
voluntary_ctxt_switches: 997329
nonvoluntary_ctxt_switches: 2428751
Children:
Name: wis-streamer
State: Z (zombie)
Tgid: 581
Pid: 581
PPid: 546
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups:
Threads: 1
SigQ: 17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000000102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed: 1
Cpus_allowed_list: 0
voluntary_ctxt_switches: 856676
nonvoluntary_ctxt_switches: 15626
Name: wis-streamer
State: Z (zombie)
Tgid: 582
Pid: 582
PPid: 546
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups:
Threads: 1
SigQ: 17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000000102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed: 1
Cpus_allowed_list: 0
voluntary_ctxt_switches: 856441
nonvoluntary_ctxt_switches: 15694
Name: wis-streamer
State: Z (zombie)
Tgid: 583
Pid: 583
PPid: 546
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups:
Threads: 1
SigQ: 17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000000102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed: 1
Cpus_allowed_list: 0
voluntary_ctxt_switches: 856422
nonvoluntary_ctxt_switches: 15837
Name: wis-streamer
State: Z (zombie)
Tgid: 584
Pid: 584
PPid: 546
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups:
Threads: 1
SigQ: 17/353
SigPnd: 0000000000000000
ShdPnd: 0000000000000102
SigBlk: 0000000000000000
SigIgn: 0000000000001004
SigCgt: 0000000180006a02
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed: 1
Cpus_allowed_list: 0
voluntary_ctxt_switches: 856339
nonvoluntary_ctxt_switches: 15500
Other things from the /proc/ filesystem:
$> cat /proc/546/personality
00c00000
$> cat /proc/546/stat
546 (wis-streamer) D 1 453 453 0 -1 4194564 391 0 135 0 140098 232409 0 0 20 0 1 0 1094 0 0 4294967295 0 0 0 0 0 0 0 4100 27138 3223605768 0 0 17 0 0 0 0 0 0
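If I'm reading proc(5) correctly, the third field of that line confirms the D state, and field 35 (3223605768, i.e. 0xC0245208) is the kernel wait-channel address the task is sleeping on - which should be the same thing /proc/546/wchan reports as a symbol name:
$> awk '{ print $35 }' /proc/546/stat   # wait-channel address in decimal (field counting works here since comm has no spaces)
$> cat /proc/546/wchan                  # same wait channel, reported as a kernel symbol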
Update upon update:
I have a feeling that a SysV-IPC message-queue or semaphore call (or something wrapped around one) may be hanging - our system is held together by inter-process message queues (at least 40% Not Invented Here, written by Elbonian Code Slaves as part of a horrible, horrible SDK) which can trap the unwary. I have re-jigged a couple of semaphore get/release routines which I suspect were less than fully watertight (in fact, probably only just squirrel-proof) and will keep an eye on things - unfortunately it takes, on average, 12 hours of running on a very particular test setup to induce this failure.
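For reference, the re-jigged acquire routine is now roughly the shape below - a minimal sketch of the idea rather than our actual SDK code (the wrapper name is made up, and it assumes semtimedop() is available in our libc). The timeout stops a missing release from blocking us forever, and SEM_UNDO makes the kernel hand the semaphore back if the holder dies.

#define _GNU_SOURCE            /* for semtimedop() on glibc */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <errno.h>
#include <time.h>

/* Hypothetical wrapper: take semaphore 0 of 'semid', but give up after
 * 'secs' seconds rather than waiting forever.  SEM_UNDO tells the kernel
 * to roll the operation back if this process exits while holding the
 * semaphore, so a crashed or killed child can't wedge everyone behind it. */
static int sem_acquire_timed(int semid, unsigned secs)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    struct timespec to = { .tv_sec = secs, .tv_nsec = 0 };

    for (;;) {
        if (semtimedop(semid, &op, 1, &to) == 0)
            return 0;              /* acquired */
        if (errno == EINTR)
            continue;              /* interrupted by a signal: retry */
        return -1;                 /* EAGAIN means timed out, anything else is a real error */
    }
}

If semtimedop() turns out not to exist on this platform, plain semop() with SEM_UNDO at least covers the died-while-holding-it case.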
1 Answer
From the documentation for sysrq:

'w' - Dumps tasks that are in uninterruptible (blocked) state.

This shows detailed information about the blocked tasks on the console (it should also be visible via dmesg); in particular, the kernel stack trace helps to shed light on the problem.
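For example, the dump can be triggered from a shell rather than a keyboard, assuming the kernel on the box was built with CONFIG_MAGIC_SYSRQ:
$> echo w > /proc/sysrq-trigger   # dump all uninterruptible (blocked) tasks
$> dmesg | tail -n 60             # kernel stack traces for the blocked tasks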