Debugging high CPU use under xargs
One of my hobbies is figuring out bugs in other people’s software, especially when I don’t have access to the source code. This is why I like using macOS, which has infinite bugs, and almost no source code.
Recently, I’ve been looking into the performance issues of an EDR daemon that came with my work laptop. To help our tech support debug it, I needed to collect `nettop`, `top` and some other debug output whenever load averages spiked on my system. For this I wrote a shell script that happened to use `xargs` for process control.
When I ran the script, I noticed that the `nettop` program on macOS was using 100% of a CPU core, but only when run under `xargs`.
After figuring out what the bug was in `nettop`, I thought it would make for an interesting post about how easy it is to get non-blocking IO wrong on UNIX.
TL;DR: Just tell me how to fix it.
If you see high CPU utilization by a program run under `xargs`, try running `xargs -o`, which keeps `stdin` open.
The Problem
If you’re on a Mac (I haven’t found a coreutil with this bug on modern Linux), try running these two commands while watching the Activity Monitor:
```sh
# Print network utilization every second:
> nettop -PL0

# Same, but I want to fry an egg on the chassis:
> echo "x" | xargs -I{} nettop -PL0
```
The two commands do the same thing, but the second version keeps a CPU core 100% busy. What gives?
Aside: why am I running `nettop` from `xargs`?
Glad you asked. You see, `bash` is actually a purely functional programming language and `xargs` is `pmap`.
You are probably familiar with this usage:
```sh
# Call `cat` on two different files:
{
  echo "foo.txt"
  echo "bar.txt"
} | xargs cat
```
Less commonly, the argument to `xargs` can also be the command you want to run:
```sh
# Run two different commands on the same file:
{
  echo "stat"
  echo "cat"
} | xargs -I{} sh -c '{} foo.txt'
```
So that’s neat, but the really powerful thing about `xargs` is parallelism. Just by adding `-P N`, you tell `xargs` to run a threadpool (actually a subprocess pool, but on modern UNIX that’s almost the same thing[^1]) of `N` threads, each of which gets one line of input.
For example, you might want to output stats about network IO and system load into a single terminal. `xargs` lets you do that:
```sh
# Note: for brevity, this example elides dealing with interleaved output.
{
  echo "nettop -PL0"
  echo "top -d -o cpu -l0 -n0"
} | xargs -I{} -P2 sh -c '{}'
```
This construction has a number of advantages over the shell’s builtin job control: it’s portable, it never leaves orphaned processes behind, and `xargs` blocks until all tasks have finished.
It also has one disadvantage: as of macOS 13, poorly implemented utils like `nettop`, when run like this, will use a full CPU core to do their thing.
Debugging `nettop`’s high CPU use
The high CPU utilization wasn’t, in fact, the first thing I noticed about `nettop`. Compared to running without `xargs`, the output was also strangely choppy - the program would buffer 10, sometimes 20 seconds of output before writing it to STDOUT all at once.
My first thought was to grab an IO profile using Instruments. Here’s what that looks like for the well-behaved `nettop` (without `xargs`):
And here’s the poorly-behaved version with `xargs`:
Note that the poorly behaved version has called `recvfrom` quite a lot more than the well-behaved version. This would certainly explain where the CPU time is going, but not why.
For that we have to see what the damn thing is doing. Sadly, the source code for `nettop` is not included in the open source Darwin. On real operating systems, we would at this point reach for `strace` - a tool to log the system calls a program is making. The macOS implementation of the same, built on top of DTrace, is called `dtruss`.
You can’t run `dtruss` on your MacBook, because of a security measure Apple calls SIP, which disables most mechanisms macOS has to inspect the kernel, including DTrace. I keep a laptop around with SIP disabled for exactly these occasions, and so I’ve grabbed the output.
These are the system calls `nettop` makes when it’s running normally (annotations mine):
```
workq_kernreturn(0x20, 0x0, 0x1) = 0 0
// errno 2 is ENOENT, but I suspect this is some kind of mach-o thing instead.
workq_kernreturn(0x4, 0x0, 0x0) = 0 Err#-2
bsdthread_ctl(0x400, 0x2B03, 0x20FF) = 0 0

// Read from FD 3, presumably a socket where the kernel
// is sending information about network traffic.
// Keep reading until there is no more data.
recvfrom(0x3, 0x124808800, 0x1000) = 3816 0
// ...
recvfrom(0x3, 0x124808800, 0x1000) = 16 0
// 35 = EAGAIN - means back off and read again later.
recvfrom(0x3, 0x124808800, 0x1000) = -1 Err#35

// Now write the information to STDOUT (FD 1):
write_nocancel(0x1, "time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,\n\0", 0x7B) = 123 0
// Elided 25 calls to write_nocancel
write_nocancel(0x1, "17:32:40.731151,Adobe Desktop S.1130,,,0,0,0,0,0,,,,,,,,,,,,\n\0", 0x3D) = 61 0
```
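This is the textbook non-blocking read pattern: drain the descriptor until `recvfrom` fails with `EAGAIN`, then go do the writes. In C, the loop looks roughly like this (my reconstruction of the pattern, not `nettop`’s actual source):

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

// Drain a non-blocking socket: keep reading until the kernel has no more
// data for us, which it signals by failing with EAGAIN/EWOULDBLOCK.
void drain_socket(int fd) {
    char buf[0x1000];
    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;  // No more data for now - come back later.
            perror("recvfrom");
            return;
        }
        // ... accumulate n bytes for the next batch of writes ...
    }
}
```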
On the other hand, when it’s running under `xargs`, the `dtruss` output is full of these 4 system calls repeating forever:
```
kevent_qos(0xFFFFFFFFFFFFFFFF, 0x16D6BA808, 0x1) = 0 0
workq_kernreturn(0x40, 0x16D6BAB80, 0x0) = 0 Err#-2
bsdthread_ctl(0x400, 0x2903, 0x20FF) = 0 0
bsdthread_ctl(0x100, 0x800020FF, 0xFFFFFFFF) = 0 0
```
It looks like the program is now stuck in a hot loop, waking up many times a second. `kevent_qos` is a call I didn’t recognize, but `kevent` is related to `kqueue`: a sort of mirror-universe BSD version of `epoll`. `kevent_qos` seems to be a bizarro macOS extension.
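For context, here is a minimal sketch of mine (not `nettop`’s code) of how a program waits on a descriptor with `kqueue`; `kevent_qos` is, as far as I can tell, the same dance with quality-of-service extras bolted on:

```c
#include <stdio.h>
#include <string.h>
#include <sys/event.h>
#include <unistd.h>

int main(void) {
    int kq = kqueue();
    if (kq < 0) {
        perror("kqueue");
        return 1;
    }

    // Registration: ask the kernel to wake us when stdin becomes readable.
    struct kevent change;
    EV_SET(&change, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);

    // Block until an event fires; kevent returns the number of events.
    struct kevent event;
    int n = kevent(kq, &change, 1, &event, 1, NULL);

    // If stdin is invalid, the registration itself fails, and the error
    // comes back as an event with EV_ERROR set and the errno in `data`.
    if (n > 0 && (event.flags & EV_ERROR))
        fprintf(stderr, "kevent error: %s\n", strerror((int)event.data));

    close(kq);
    return 0;
}
```

The registration step is exactly where an invalid descriptor would first make itself known.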
The `kevent_qos` syscall returns the number of IO events that have occurred, in this case 0. If no events have occurred, that suggests there was an error - such as would happen if you were polling an invalid (or closed) file descriptor.
We know `stdout` is open, but running a program under `xargs` runs it with a closed `stdin` - maybe that’s the problem?
Well, if we just pass the `-o` flag to `xargs`, it will provide a valid `stdin`.
Lo and behold, that fixes the problem.
```sh
# This will work as expected, and nettop will no longer spin endlessly.
{
  echo "nettop -PL0"
  echo "top -d -o cpu -l0 -n0"
} | xargs -o -I{} -P2 sh -c '{}'
```
So what’s the bug?
If you’re familiar with `kqueue` and its friends (`poll`, `select`, `epoll`…) then you have probably also already guessed that `nettop` adds `stdin` to its pollset without checking that it’s valid.
Like its predecessor `poll`, `kqueue` exists to let a single thread multiplex IO operations. Below, I wrote an explanation of the concepts behind these system calls using `poll`, which is a lot simpler and should be easier to grasp. Here’s how it works:
```c
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

void handle_io() {
    struct pollfd pfds[2] = {0};

    // Tell poll that we want to know when STDIN has data to read (POLLIN).
    pfds[0].fd = 0;
    pfds[0].events = POLLIN;

    // Tell poll that we also want to know when STDOUT has room to write data
    // (POLLOUT).
    pfds[1].fd = 1;
    pfds[1].events = POLLOUT;

    // For the purpose of this exercise, this value will become true when
    // the program gets signaled (to shut down, using SIGTERM or SIGINT).
    int signaled = 0;

    while (!signaled) {
        // This will block until STDIN is available to read or STDOUT is
        // available to write.
        if (poll(pfds, sizeof(pfds) / sizeof(pfds[0]), /*timeout=*/-1) < 0) {
            // poll returns a negative number if it gets an error.
            // In a real program we would check what the error was and perhaps
            // recover.
            perror("poll");
            return;
        }

        if (pfds[0].revents & POLLIN) {
            // Handle input.
        }
        if (pfds[1].revents & POLLOUT) {
            // Handle output.
        }
    }
}
```
The call to `poll` blocks the thread until some IO operation becomes available, or an error occurs. Then it returns. While waiting in `poll`, the thread yields the CPU, so other threads can be scheduled. The operating system takes care to wake it up once there’s some work to do.
The above example has a number of bugs, including the one `nettop` has. Because we don’t check whether `stdin` is valid, `poll` might wake up immediately. We also always wait for `stdout`, even when we have no data to write! Real programs must take care not to listen for `POLLOUT` unless they have data buffered, otherwise `poll` will keep waking up immediately.
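Here’s a sketch of the same loop with both problems patched; `out_buffered` is a hypothetical flag that tracks whether we actually have output queued up:

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

void handle_io_fixed(int out_buffered) {
    struct pollfd pfds[2] = {0};

    // Only poll stdin if it is actually a valid, open descriptor;
    // poll ignores entries with a negative fd.
    int stdin_valid = fcntl(STDIN_FILENO, F_GETFD) != -1;
    pfds[0].fd = stdin_valid ? STDIN_FILENO : -1;
    pfds[0].events = POLLIN;

    pfds[1].fd = STDOUT_FILENO;

    int signaled = 0;
    while (!signaled) {
        // Only ask for POLLOUT when data is queued; otherwise stdout is
        // almost always writable and poll would return immediately.
        pfds[1].events = out_buffered ? POLLOUT : 0;

        if (poll(pfds, 2, /*timeout=*/-1) < 0) {
            perror("poll");
            return;
        }
        if (pfds[0].revents & POLLIN) {
            // Handle input.
        }
        if (pfds[1].revents & POLLOUT) {
            // Handle output; clear out_buffered once the buffer drains.
        }
    }
}
```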
There is also no way for another thread of the same process to cause this thread to wake up from `poll` without signaling the entire process.
There are solutions to all of these problems (e.g. the famed self-pipe trick). Over time, though, it became clear that new APIs might present fewer such footguns, and so on BSD we ended up with `kqueue`. However, `kqueue`’s more complex design can make it less obvious what’s happening, and - clearly - someone at Apple is holding it wrong.
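For reference, the self-pipe trick looks roughly like this: poll the read end of a pipe alongside the descriptors you care about, and wake the blocked thread by writing a byte into the other end. A minimal sketch, not production code:

```c
#include <poll.h>
#include <unistd.h>

static int wake_pipe[2];  // Initialize once at startup with pipe(wake_pipe).

// Called from another thread (or an async-signal-safe signal handler):
// writing one byte makes the pipe's read end readable, which wakes any
// thread blocked in poll on it.
void wake_up(void) {
    char byte = 0;
    write(wake_pipe[1], &byte, 1);
}

void wait_for_input(void) {
    struct pollfd pfds[2] = {0};
    pfds[0].fd = STDIN_FILENO;   // The "real" IO we care about.
    pfds[0].events = POLLIN;
    pfds[1].fd = wake_pipe[0];   // The self-pipe's read end.
    pfds[1].events = POLLIN;

    poll(pfds, 2, /*timeout=*/-1);

    if (pfds[1].revents & POLLIN) {
        // Someone called wake_up(): consume the byte and handle the wakeup.
        char byte;
        read(wake_pipe[0], &byte, 1);
    }
}
```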
[^1]: POSIX begs to differ, but from an implementation standpoint, processes and threads are very similar. The scheduler only operates on threads. Some threads own resources like a virtual address space and file descriptors, and we call those processes. In terms of performance, there is very little difference between threading and multiprocessing.