WaitForMultipleObjects() considered expensive

Here’s an interesting tidbit I discovered at work the other day: the WaitForMultipleObjects() function, part of the Win32 API, is surprisingly expensive. For the developers amongst my readers, the function puts the current thread to sleep until one (or all) of a given set of handles is signalled. For event handles, the complementary function SetEvent() does the signalling. The net effect is that the two functions can be used similarly to POSIX condition variables, but the Win32 functions are more versatile in that you can also wait for socket events to occur [1].

My non-developer readers will be confused at this point, no doubt. Feel free to skip the rest of the article; it continues in much the same vein.

I’ve found several articles mentioning that WaitForMultipleObjects() should be considered potentially expensive, but the reason they give is different from what I found. An article on intel.com probably summarizes the point best (though it’s really about WaitForSingleObject()): the function has the nice feature that it works with events shared between processes rather than just between threads, but that implies that it’ll always perform a system call. Other primitives that one could use to synchronize threads within a single process incur this overhead only if there is contention. Admittedly, that comparison only makes sense if what you’re implementing is a lock.

I’ll skip over the details of what our code looks like, partially because I have no right to publish any of it. Suffice it to say that the scenario is one thread in a loop that sleeps via WSAWaitForMultipleEvents(), wakes up to perform some work (not hugely CPU intensive), and then goes back to sleep again. Other threads, according to their needs, wake up our first thread via SetEvent().

Here’s what the problem was: on Mac OS X, Windows Vista and GNU/Linux (Ubuntu), the program showed up as using, on average, about 5% CPU time. On Windows XP, the CPU was maxed out (more specifically, CPU time was at roughly 50% on a dual-core CPU).

How could that happen?

Non-Windows operating systems don’t have the above functions, of course. So for comparison, on those systems the first thread was waiting in select() on the read end of a pipe created via pipe(), while the other threads would write a single byte into the write end to interrupt select().
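For the curious, the select()/pipe() arrangement can be sketched in a few lines of Python. This is only an illustration for POSIX systems, not our actual code; the names wake(), worker() and wakeups are made up for the example.

```python
import os
import select
import threading

# Hypothetical names throughout; this is not the code from the article.
read_fd, write_fd = os.pipe()
wakeups = []

def wake():
    """Called by other threads: one byte makes the read end readable."""
    os.write(write_fd, b"x")

def worker(rounds):
    for _ in range(rounds):
        # Sleep in the kernel until the read end of the pipe is readable.
        readable = select.select([read_fd], [], [])[0]
        if read_fd in readable:
            os.read(read_fd, 1)      # consume exactly one wake-up byte
            wakeups.append("woke")   # stand-in for the real work

t = threading.Thread(target=worker, args=(3,))
t.start()
for _ in range(3):
    wake()
t.join()
os.close(read_fd)
os.close(write_fd)
```

Each byte written corresponds to one pass through the loop, so every wake-up costs at least a write(), a return from select() and a read().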

As every other part of the code was identical, the culprit had to be in the underlying OS functions used to implement this type of thread synchronization. But WaitForMultipleObjects() should by rights do little more than enter kernel space to sleep, so how can that function incur CPU usage for the user space process? As a colleague pointed out, that’s as if sleep() were to consume CPU time. Inconceivable!

Yet profiling the process with VTune clearly showed most of the time being spent in that function. Spending wall time in there would only have been expected, of course; that’s what it’s supposed to do. But the profile also showed a surprising amount of CPU time, in NtWaitForMultipleObjects() to be precise, the native function on top of which WaitForMultipleObjects() is implemented.
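To make the wall time versus CPU time distinction concrete, here is a small Python sketch for POSIX systems. It blocks in select() on a pipe nobody writes to and measures both clocks; measure_blocked_wait() is a name invented for this example, and this is of course no substitute for a real profiler like VTune.

```python
import os
import select
import time

def measure_blocked_wait(seconds):
    """Block in select() until the timeout; return (wall, cpu) in seconds."""
    read_fd, write_fd = os.pipe()   # nothing is ever written to this pipe
    try:
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        # The read end never becomes readable, so this sleeps for `seconds`.
        select.select([read_fd], [], [], seconds)
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return wall, cpu

wall, cpu = measure_blocked_wait(0.2)
print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")
```

A thread that is merely asleep should accumulate wall time while its CPU time stays close to zero, which is what made the profile so puzzling.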

That was a bit of a surprise, to say the least.

Granted, signalling the first thread about 30 times per second to wake up might have had something to do with the performance hit we saw. In fact, it did. Once we found a good way to reduce the number of wake-up calls, CPU time of the process dropped to an acceptable (nay, outstanding) 2-5%. So it’s not being stuck in WSAWaitForMultipleEvents() that consumes CPU time; it’s something related to entering and/or exiting that function, which happens more often if it’s invoked in a loop and you keep interrupting it.
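I won’t describe how we actually reduced the wake-ups, but one common technique, assumed here purely for illustration, is to coalesce them: if a wake-up is already pending, further notifications only queue their work and skip the signal. The Python sketch below uses the POSIX pipe variant; CoalescingWaker and all of its members are hypothetical names.

```python
import os
import select
import threading

class CoalescingWaker:
    """Hypothetical helper: at most one wake-up byte is pending at a time."""

    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()
        self._lock = threading.Lock()
        self._pending = False
        self.queue = []          # work items handed to the sleeping thread
        self.bytes_written = 0   # counts the wake-ups actually sent

    def notify(self, item):
        with self._lock:
            self.queue.append(item)
            if self._pending:
                return           # a wake-up is already on its way
            self._pending = True
            os.write(self.write_fd, b"x")
            self.bytes_written += 1

    def wait_and_drain(self):
        # Sleep until notified, then take everything queued so far.
        select.select([self.read_fd], [], [])
        with self._lock:
            os.read(self.read_fd, 1)
            self._pending = False
            items, self.queue = self.queue, []
        return items

waker = CoalescingWaker()
for i in range(30):
    waker.notify(i)              # thirty notifications ...
items = waker.wait_and_drain()   # ... drained after a single wake-up
print(len(items), waker.bytes_written)
```

Here thirty notifications collapse into a single write to the pipe, so the sleeping thread enters and leaves its wait once instead of thirty times.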

So while it’s a good idea to avoid large numbers of system calls on any operating system, Windows XP seems to impose an additional performance hit for [WSA]WaitFor{Single|Multiple}{Event|Object}[s](); Vista seems not to have that problem. There, you’ve got a reason to upgrade now.

  1. Via WSAWaitForMultipleEvents().
  • Paul Mohr

    I actually followed what you were saying there. I have been using Valgrind the last couple days and I have tried VTune. I have done some of the analysis of overhead in older Windows systems up to XP and it probably is improving. Wow, and it only took 17 years and 400 billion dollars to make the improvements. Please don’t wish Vista on me though. Very well presented and informative article, by the way.

    • http://www.unwesen.de/ unwesen


      Yes, valgrind and VTune are fairly similar in the type of information they present, it’s just their approaches that are vastly different.

  • http://www.der-eremit.de der-eremit

    Really great and informative article, although I’m not too deep into that topic. The point that I especially noticed was the testing part on GNU/Linux and Ubuntu; sounds like you’re planning to release a *nix client at last ;)

    • http://www.unwesen.de/ unwesen

      Our servers have been running on some flavour of *nix for a long time, testing on that platform doesn’t mean there’ll be a GNU/Linux client. Though there might be. Who knows? I don’t!