The right way to deal with frozen processes on Unix

trotsky · on Sept 21, 2012

I'm not sure if that's Darwin terminology, but calling a process that's blocking on I/O or in a tight loop "frozen" certainly isn't traditional in Unix. You're much more likely to say hung or stuck. On Linux specifically the term has a completely different meaning: "freezing" a process is intentional and involves putting the process into a cgroup and then freezing that cgroup through the freezer controller so that its tasks are never scheduled, much as the kernel freezes tasks before sleep or hibernation.

jlgreco · on Sept 21, 2012

The "frozen" terminology is amusing to me since I would normally say that processes in tight loops are "burning CPU".

SageRaven · on Sept 21, 2012

I have always described such processes as "wedged".

bradleyland · on Sept 21, 2012

I don't mean to disparage other Ruby application servers, but this is why I use Passenger. It's not that I believe other Ruby application servers aren't good, or that their authors don't understand Unix as well as the guys at Phusion, but there's a real dedication to stability and predictability inside Phusion that I don't see elsewhere.

You can achieve the same results with other app servers, but you're going to have to do a lot of the heavy lifting yourself. I'm not ashamed to admit that I have a lot more confidence in Passenger's solution than I do in my own.

unixnoob · on Sept 21, 2012

OK, total noob question here. Could we achieve the same thing with something like pgrphack from daemontools?

    pgrphack sh -c "processes"

Kill the pid for the sh ("agent") and you thereby kill all the processes?

Again, sorry for the noob question. I'm still learning and making mistakes.

FooBarWidget · on Sept 22, 2012

No. To instruct kill() to kill a process group, you have to specify the process group ID (the PID of the process group leader) as a negative number. Otherwise kill() will kill only a single process.

unixnoob · on Sept 23, 2012

But won't all the processes in my example have the PGID of sh?

FooBarWidget · on Sept 23, 2012

Yes they do, but that is irrelevant. kill(pid) kills the process specified by 'pid'. kill(-pid) kills the process group specified by 'pid'.
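To make the distinction concrete, here is a minimal Python sketch (not from the article). It assumes a POSIX system with `sh` and `sleep` on the PATH, and the helper names `spawn_group`, `group_has_exited`, and `demo` are made up for illustration. Python's `os.kill` hands its first argument to kill(2), so a negative value addresses a process group.

```python
import os
import select
import signal
import subprocess


def spawn_group():
    """Start sh as the leader of a new process group with one background child."""
    leader = subprocess.Popen(
        ["sh", "-c", "sleep 30 & echo ready; wait"],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # setsid(): sh's PGID now equals its own PID
    )
    leader.stdout.readline()  # block until the background sleep has been started
    return leader


def group_has_exited(leader, timeout):
    """True once no process in the group still holds the write end of the pipe."""
    readable, _, _ = select.select([leader.stdout], [], [], timeout)
    return bool(readable) and leader.stdout.read() == ""


def demo():
    # kill(pid): only the shell is signalled, the background sleep lives on.
    leader = spawn_group()
    pgid = os.getpgid(leader.pid)
    os.kill(leader.pid, signal.SIGTERM)
    leader.wait()
    child_survived = not group_has_exited(leader, timeout=0.5)
    os.kill(-pgid, signal.SIGTERM)  # clean up the orphan through its group
    leader.stdout.close()

    # kill(-pgid): every member of the group is signalled.
    leader = spawn_group()
    os.kill(-os.getpgid(leader.pid), signal.SIGTERM)
    leader.wait()
    group_gone = group_has_exited(leader, timeout=5)
    leader.stdout.close()
    return child_survived, group_gone


if __name__ == "__main__":
    print(demo())
```

The pipe is used as the liveness check because every group member inherits its write end, so end-of-file only shows up after the last member has exited. That avoids guessing from PIDs, which can linger as zombies.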

unixnoob · on Sept 23, 2012

What if I just use the userland kill(1) utility? Is it possible to kill all processes under a PGID using kill(1)?

Say the PGID I get for sh is 321. If I do

    kill [signal] 321

that will not kill all the processes having PGID 321?

If it would not kill them, then couldn't we modify kill(1) to be able to call kill() with a negative integer as you describe?

Sorry for the noob questions. I am still learning and making mistakes.
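For what it's worth, POSIX already specifies that kill(1) treats a negative pid operand the way kill() does, so no modification should be needed. A plain `kill [signal] 321` signals only process 321; the group form is `kill -s TERM -- -321`. Below is a minimal Python sketch that drives the shell's kill this way. It assumes a POSIX `sh` and `sleep`; `kill_group` and `demo` are hypothetical names, and the spawned shell is only a stand-in for the pgrphack example.

```python
import subprocess


def kill_group(pgid, signal_name="TERM"):
    """Signal a whole process group through the kill provided by sh."""
    # "--" ends option parsing, so "-321" is read as a target and not as a signal.
    script = 'kill -s "$1" -- "-$2"'
    argv = ["sh", "-c", script, "sh", signal_name, str(pgid)]
    return subprocess.run(argv).returncode


def demo():
    # Stand-in for the pgrphack example: sh leads a new group with one child.
    group = subprocess.Popen(
        ["sh", "-c", "sleep 30 & echo ready; wait"],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,
    )
    group.stdout.readline()          # the background sleep exists from here on
    status = kill_group(group.pid)   # the PGID equals the leader's PID
    group.wait()
    leftover = group.stdout.read()   # returns only once sh and sleep are both gone
    group.stdout.close()
    return status, leftover


if __name__ == "__main__":
    print(demo())
```

The `-s TERM` spelling is used because it is the form POSIX documents. Some shells also accept `kill -TERM -321`, but how `--` combines with the `-TERM` shorthand varies between implementations.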

ibotty · on Sept 21, 2012

Isn't "frozen" in usual Unix terminology any SIGSTOPped process?

Evbn · on Sept 22, 2012

That's suspended to me.

bifrost · on Sept 21, 2012

It's nice to see traditional debugging made easy. This is stuff that you try to teach people as a sysadmin/ops guy and they never pay attention. Yay!

X-Istence · on Sept 21, 2012

The page is unfortunately not loading, and there doesn't seem to be a Google cache for the page.

FooBarWidget · on Sept 21, 2012

There was a slight interruption of service, but the blog has been restored now. Our apologies for the inconvenience.

janerik · on Sept 21, 2012

The server is not responding right now and Google Cache is empty. Anyone got a copy of this?

dredmorbius · on Sept 21, 2012

In related news, Hongli Lai has solved the halting problem.

geofft · on Sept 21, 2012

You do realize that it's quite easy to solve the halting problem in most cases, right?

The halting problem is unsolvable because _very particular_ (and pathological) cases are unsolvable. If a process is in a loop between the same instructions at the same states, it's very easy to tell that it's not going to halt -- the only challenge is that you can't make this determination for all processes all the time.
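The "same instructions at the same states" case can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a real process monitor: the program is modelled as a deterministic `step` function over hashable states, and `classify`, `countdown`, `spin`, and `count_up` are hypothetical names.

```python
def classify(step, state, max_steps=10_000):
    """Return "halts", "loops", or "unknown" for a deterministic step function.

    step(state) returns the next state, or None when the program halts.
    A state must be hashable and capture everything that influences execution.
    """
    seen = set()
    for _ in range(max_steps):
        if state in seen:
            return "loops"  # same full state twice: the run repeats forever
        seen.add(state)
        state = step(state)
        if state is None:
            return "halts"
    return "unknown"  # budget exhausted without a verdict


def countdown(n):
    return None if n == 0 else n - 1  # reaches 0 and stops


def spin(n):
    return (n + 1) % 4  # cycles through 0, 1, 2, 3 forever


def count_up(n):
    return n + 1  # never halts, but never repeats a state either


if __name__ == "__main__":
    print(classify(countdown, 5))
    print(classify(spin, 0))
    print(classify(count_up, 0, max_steps=100))
```

The third case is the catch: a run that never revisits a state cannot be classified by this check, which is why the answer is sometimes "unknown".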

Mathematically, note that the concept of an oracle for the halting problem is well-defined (and a useful concept).

dredmorbius · on Sept 22, 2012

I was mostly making a jibe at the notion that there is a single, correct, and reliable method for identifying stuck/hung processes.

There are in fact fairly reliable heuristics for noting when things are going pear-shaped. The edge cases get sticky though.