Debugging high CPU use under xargs
One of my hobbies is figuring out bugs in other people’s software, especially when I don’t have access to the source code. This is why I like using macOS, which has infinite bugs, and almost no source code.
Recently, I’ve been looking into the performance issues of an EDR daemon that came with my work laptop. To help our tech support debug it, I needed to collect `nettop`, `top` and some other debug output whenever load averages spiked on my system. For this I wrote a shell script that happened to use `xargs` for process control.
When I ran the script, I noticed that the `nettop` program on macOS was using 100% of a CPU core, but only when run under `xargs`.
After figuring out what the bug was in `nettop`, I thought it would make for an interesting post about how easy it is to get non-blocking IO wrong on UNIX.
TL;DR: Just tell me how to fix it.
If you see high CPU utilization by a program run under `xargs`, try running `xargs -o`, which keeps `stdin` open.
The Problem
If you’re on a Mac (I haven’t found a coreutil with this bug on modern Linux), try running these two commands while watching the Activity Monitor:
```sh
# Print network utilization every second:
> nettop -PL0

# Same, but I want to fry an egg on the chassis:
> echo "x" | xargs -I{} nettop -PL0
```
The two commands do the same thing, but the second version keeps a CPU core 100% busy. What gives?
Aside: why am I running `nettop` from `xargs`?
Glad you asked. You see, `bash` is actually a purely functional programming language and `xargs` is `pmap`.
You are probably familiar with this usage:
```sh
# Call `cat` on two different files:
{
  echo "foo.txt"
  echo "bar.txt"
} | xargs cat
```
Less commonly, the argument to `xargs` can also be the command you want to run:
```sh
# Run two different commands on the same file:
{
  echo "stat"
  echo "cat"
} | xargs -I{} sh -c '{} foo.txt'
```
So that’s neat, but the really powerful thing about `xargs` is parallelism. Just by adding `-P N`, you tell `xargs` to run a threadpool (actually a subprocess pool, but on modern UNIX that’s almost the same thing[^1]) of `N` threads, each of which gets one line of input.
For example, you might want to output stats about network IO and system load into a single terminal. `xargs` lets you do that:
```sh
# Note: for brevity, this example elides dealing with interleaved output.
{
  echo "nettop -PL0"
  echo "top -d -o cpu -l0 -n0"
} | xargs -I{} -P2 sh -c '{}'
```
This construction has a number of advantages over the shell’s builtin job control: it’s portable, it never leaves orphaned processes behind, and `xargs` blocks until all tasks have finished.
It also has one disadvantage: as of macOS 13, poorly implemented utils like `nettop`, when run like this, will use a full CPU core to do their thing.
Debugging `nettop`’s high CPU use
The high CPU utilization wasn’t, in fact, the first thing I noticed about `nettop`. Compared to running without `xargs`, the output was also strangely choppy - the program would buffer 10, sometimes 20 seconds of output before writing it to STDOUT all at once.
My first thought was to grab an IO profile using Instruments. Here’s what that looks like for the well-behaved `nettop` (without `xargs`):
And here’s the poorly-behaved version with `xargs`:
Note that the poorly behaved version has called `recvfrom` quite a lot more than the well-behaved version. This would certainly explain where the CPU time is going, but not why.
For that we have to see what the damn thing is doing. Sadly, the source code for `nettop` is not included in the open source Darwin. On real operating systems, we would at this point reach for `strace` - a tool to log the system calls a program is making. The macOS implementation of the same, built on top of DTrace, is called `dtruss`.
You can’t run `dtruss` on your MacBook, because of a security measure Apple calls SIP, which disables most mechanisms macOS has to inspect the kernel, including DTrace. I keep a laptop around with SIP disabled for exactly these occasions, and so I’ve grabbed the output.
These are the system calls `nettop` makes when it’s running normally (annotations mine):
```
workq_kernreturn(0x20, 0x0, 0x1) = 0 0
// errno 2 is ENOENT, but I suspect this is some kind of mach-o thing instead.
workq_kernreturn(0x4, 0x0, 0x0) = 0 Err#-2
bsdthread_ctl(0x400, 0x2B03, 0x20FF) = 0 0

// Read from FD 3, presumably a socket where the kernel
// is sending information about network traffic.
// Keep reading until there is no more data.
recvfrom(0x3, 0x124808800, 0x1000) = 3816 0
// ...
recvfrom(0x3, 0x124808800, 0x1000) = 16 0
// 35 = EAGAIN - means back off and read again later.
recvfrom(0x3, 0x124808800, 0x1000) = -1 Err#35

// Now write the information to STDOUT (FD 1):
write_nocancel(0x1, "time,,interface,state,bytes_in,bytes_out,rx_dupe,rx_ooo,re-tx,rtt_avg,rcvsize,tx_win,tc_class,tc_mgt,cc_algo,P,C,R,W,arch,\n\0", 0x7B) = 123 0
// Elided 25 calls to write_nocancel
write_nocancel(0x1, "17:32:40.731151,Adobe Desktop S.1130,,,0,0,0,0,0,,,,,,,,,,,,\n\0", 0x3D) = 61 0
```
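This is the textbook non-blocking read pattern: drain the descriptor until `recvfrom` fails with `EAGAIN`, then go do the writes. In C, the loop looks roughly like this (my reconstruction of the pattern, not `nettop`’s actual source):

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

// Drain a non-blocking socket: keep reading until the kernel has no more
// data for us, which it signals by failing with EAGAIN/EWOULDBLOCK.
void drain_socket(int fd) {
    char buf[0x1000];
    for (;;) {
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;  // No more data for now - come back later.
            perror("recvfrom");
            return;
        }
        // ... accumulate n bytes for the next batch of writes ...
    }
}
```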
On the other hand, when it’s running under `xargs`, the `dtruss` output is full of these 4 system calls repeating forever:
```
kevent_qos(0xFFFFFFFFFFFFFFFF, 0x16D6BA808, 0x1) = 0 0
workq_kernreturn(0x40, 0x16D6BAB80, 0x0) = 0 Err#-2
bsdthread_ctl(0x400, 0x2903, 0x20FF) = 0 0
bsdthread_ctl(0x100, 0x800020FF, 0xFFFFFFFF) = 0 0
```
It looks like the program is now stuck in a hot loop, waking up many times a second. `kevent_qos` is a call I didn’t recognize, but `kevent` is related to `kqueue`: a sort of mirror-universe BSD version of `epoll`. `kevent_qos` seems to be a bizarro macOS extension.
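For context, here is a minimal sketch of mine (not `nettop`’s code) of how a program waits on a descriptor with `kqueue`; `kevent_qos` is, as far as I can tell, the same dance with quality-of-service extras bolted on:

```c
#include <stdio.h>
#include <string.h>
#include <sys/event.h>
#include <unistd.h>

int main(void) {
    int kq = kqueue();
    if (kq < 0) {
        perror("kqueue");
        return 1;
    }

    // Registration: ask the kernel to wake us when stdin becomes readable.
    struct kevent change;
    EV_SET(&change, STDIN_FILENO, EVFILT_READ, EV_ADD, 0, 0, NULL);

    // Block until an event fires; kevent returns the number of events.
    struct kevent event;
    int n = kevent(kq, &change, 1, &event, 1, NULL);

    // If stdin is invalid, the registration itself fails, and the error
    // comes back as an event with EV_ERROR set and the errno in `data`.
    if (n > 0 && (event.flags & EV_ERROR))
        fprintf(stderr, "kevent error: %s\n", strerror((int)event.data));

    close(kq);
    return 0;
}
```

The registration step is exactly where an invalid descriptor would first make itself known.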
The `kevent_qos` syscall returns the number of IO events that have occurred, in this case 0. If no events have occurred, that suggests there was an error - such as would happen if you were polling an invalid (or closed) file descriptor.
We know `stdout` is open, but running a program under `xargs` runs it with a closed `stdin` - maybe that’s the problem?
Well, if we just pass the `-o` flag to `xargs`, it will provide a valid `stdin`.
Lo and behold, that fixes the problem.
```sh
# This will work as expected, and nettop will no longer spin endlessly.
{
  echo "nettop -PL0"
  echo "top -d -o cpu -l0 -n0"
} | xargs -o -I{} -P2 sh -c '{}'
```
So what’s the bug?
If you’re familiar with `kqueue` and its friends (`poll`, `select`, `epoll`…) then you have probably also already guessed that `nettop` adds `stdin` to its pollset without checking that it’s valid.
Like its predecessor `poll`, `kqueue` exists to let a single thread multiplex IO operations. Below, I wrote an explanation of the concepts behind these system calls using `poll`, which is a lot simpler and should be easier to grasp. Here’s how it works:
```c
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

void handle_io() {
    struct pollfd pfds[2] = {0};

    // Tell poll that we want to know when STDIN has data to read (POLLIN).
    pfds[0].fd = 0;
    pfds[0].events = POLLIN;

    // Tell poll that we also want to know when STDOUT has room to write data
    // (POLLOUT).
    pfds[1].fd = 1;
    pfds[1].events = POLLOUT;

    // For the purpose of this exercise, this value will become true when
    // the program gets signaled (to shut down, using SIGTERM or SIGINT).
    int signaled = 0;

    while (!signaled) {
        // This will block until STDIN is available to read or STDOUT is
        // available to write.
        if (poll(pfds, sizeof(pfds) / sizeof(pfds[0]), /*timeout=*/-1) < 0) {
            // poll returns a negative number if it gets an error.
            // In a real program we would check what the error was and perhaps
            // recover.
            perror("poll");
            return;
        }

        if (pfds[0].revents & POLLIN) {
            // Handle input.
        }
        if (pfds[1].revents & POLLOUT) {
            // Handle output.
        }
    }
}
```
The call to `poll` blocks the thread until some IO operation becomes available, or an error occurs. Then it returns. While waiting in `poll`, the thread yields the CPU, so other threads can be scheduled. The operating system takes care to wake it up once there’s some work to do.
The above example has a number of bugs, including the one `nettop` has. Because we don’t check whether `stdin` is valid, `poll` might wake up immediately. We also always wait for `stdout`, even when we have no data to write! Real programs must take care not to listen for `POLLOUT` unless they have data buffered, otherwise `poll` will keep waking up immediately.
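Here’s a sketch of the same loop with both problems patched; `out_buffered` is a hypothetical flag that tracks whether we actually have output queued up:

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

void handle_io_fixed(int out_buffered) {
    struct pollfd pfds[2] = {0};

    // Only poll stdin if it is actually a valid, open descriptor;
    // poll ignores entries with a negative fd.
    int stdin_valid = fcntl(STDIN_FILENO, F_GETFD) != -1;
    pfds[0].fd = stdin_valid ? STDIN_FILENO : -1;
    pfds[0].events = POLLIN;

    pfds[1].fd = STDOUT_FILENO;

    int signaled = 0;
    while (!signaled) {
        // Only ask for POLLOUT when data is queued; otherwise stdout is
        // almost always writable and poll would return immediately.
        pfds[1].events = out_buffered ? POLLOUT : 0;

        if (poll(pfds, 2, /*timeout=*/-1) < 0) {
            perror("poll");
            return;
        }
        if (pfds[0].revents & POLLIN) {
            // Handle input.
        }
        if (pfds[1].revents & POLLOUT) {
            // Handle output; clear out_buffered once the buffer drains.
        }
    }
}
```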
There is also no way for another thread of the same process to cause this thread to wake up from `poll` without signaling the entire process.
There are solutions to all of these problems (e.g. the famed self-pipe trick). Over time, though, it became clear that new APIs might present fewer such footguns, and so on BSD we ended up with `kqueue`. However, `kqueue`’s more complex design can make it less obvious what’s happening, and - clearly - someone at Apple is holding it wrong.
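For reference, the self-pipe trick looks roughly like this: poll the read end of a pipe alongside the descriptors you care about, and wake the blocked thread by writing a byte into the other end. A minimal sketch, not production code:

```c
#include <poll.h>
#include <unistd.h>

static int wake_pipe[2];  // Initialize once at startup with pipe(wake_pipe).

// Called from another thread (or an async-signal-safe signal handler):
// writing one byte makes the pipe's read end readable, which wakes any
// thread blocked in poll on it.
void wake_up(void) {
    char byte = 0;
    write(wake_pipe[1], &byte, 1);
}

void wait_for_input(void) {
    struct pollfd pfds[2] = {0};
    pfds[0].fd = STDIN_FILENO;   // The "real" IO we care about.
    pfds[0].events = POLLIN;
    pfds[1].fd = wake_pipe[0];   // The self-pipe's read end.
    pfds[1].events = POLLIN;

    poll(pfds, 2, /*timeout=*/-1);

    if (pfds[1].revents & POLLIN) {
        // Someone called wake_up(): consume the byte and handle the wakeup.
        char byte;
        read(wake_pipe[0], &byte, 1);
    }
}
```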
[^1]: POSIX begs to differ, but from an implementation standpoint, processes and threads are very similar. The scheduler only operates on threads. Some threads own resources like a virtual address space and file descriptors, and we call those processes. In terms of performance, there is very little difference between threading and multiprocessing.