Introduction to Why Threads Suck, By Robert Zeh

Over the past several years I've had numerous encounters with parallelism, threads, and distributed computing. Nearly all of the experiences have been negative. I continue to have discussions where I play the bad guy, discouraging thread usage. I've decided to simply write everything down once and for all.

NPAC

My first experiences with parallelism were at Syracuse University's NPAC, which had a 32,768 processor beast called the CM-2. The one-bit processors were grouped around shared floating point processors (some thought of it as a machine with 1024 floating point processors and some one-bit processors thrown in). Each processor executed the same instructions in lockstep, but with different data. Data was shared via message passing. If you needed more processors you could allocate virtual processors that were time-shared.

My CM-2 experience was rather enjoyable. I was able to do interesting graphics work, producing real-time animated ray-traced scenes in 1990.

Only much later would I realize how much better the lack of data sharing and the lockstep execution made my CM-2 experience. At the time they seemed constraining, but in hindsight they were good constraints. Like turning the fuse off before you start working on wiring, and then locking the fuse box with a padlock. Annoying in the short run, but gratifying in the long run.

Parallel Programming Laboratory

After Syracuse I went to graduate school at the University of Illinois Urbana Champaign, and joined the Parallel Programming Laboratory. The primary product at the time was CHARM, a C library for writing distributed and threaded applications. At the time CHARM offered something no one else did: a portable and efficient way to write distributed and threaded applications.

The combination of distributed computation with threads was a nightmare.

It was routine to see programs crash and burn after simply changing the number of processors. This wasn't CHARM's fault; it was the developers'. Changing the number of processors changed the scheduling, bringing latent bugs to the fore.

CHARM was written in C (later C++), so all of C's memory horrors arose, compounded by threads. Memory freed in one thread but still being used in another. Crashes without usable call stacks. Library calls that were not thread safe (pretty rare these days, but not then).
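
To make that concrete, here is a minimal sketch (my own, not CHARM code) of the "freed in one thread, still used in another" bug. Whether it crashes depends entirely on which thread runs first, which is exactly why changing the number of processors shook these bugs loose:

    /* Sketch only -- not CHARM code.  Whether this crashes depends on
     * scheduling: the reader thread races against main()'s free(). */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *shared_msg;

    static void *reader(void *unused)
    {
        /* If this runs after main() has freed the buffer, it reads freed memory. */
        printf("%s\n", shared_msg);
        return 0;
    }

    int main(void)
    {
        pthread_t t;
        shared_msg = strdup("latest prices");
        pthread_create(&t, 0, reader, 0);
        free(shared_msg);                 /* races with the reader thread */
        pthread_join(t, 0);
        return 0;
    }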

Structuring programs so they didn't overwhelm the communications network was difficult. Again, this had nothing to do with CHARM and everything to do with distributed computing. It's hard to break work down into just the right size chunks, and as processor speeds improve more quickly than communication latencies do, the problem becomes worse.

A routine trick in the parallel computing community at the time was to compute the speed-up by running your parallel program on a range of processor counts, from 1 to the most you had. This produced much more impressive results than comparing your parallel program to the fastest single-processor version (which was always faster than the parallel version running on a single processor).
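
To put the trick in symbols (the notation is mine): let T(p) be the parallel program's running time on p processors, and T_serial the running time of the fastest single-processor program. The honest speed-up is T_serial / T(p); the flattering one is T(1) / T(p). Because the parallel code run on one processor still carries all of its threading and communication overhead, T(1) is greater than T_serial, so the flattering curve is always steeper.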

The threaded version has overhead the single processor version doesn't. For example, the SMP version of the Linux kernel has spin locks to guard data structures from concurrent access. In the UP version of the kernel the spin locks compile to nothing.
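
As a rough illustration, here's a simplified sketch of a spin lock that vanishes on a uniprocessor build. It's loosely modeled on the kernel's approach but is not the actual kernel source; the CONFIG_SMP name and the GCC atomic builtins are just my stand-ins:

    /* Simplified sketch, loosely modeled on the Linux idea -- not the real
     * kernel source.  On an SMP build the lock does real work; on a
     * uniprocessor build it compiles away to nothing. */
    #ifdef CONFIG_SMP

    typedef struct { volatile int locked; } spinlock_t;

    static inline void spin_lock(spinlock_t *l)
    {
        while (__sync_lock_test_and_set(&l->locked, 1))
            ;                             /* busy-wait until the lock is free */
    }

    static inline void spin_unlock(spinlock_t *l)
    {
        __sync_lock_release(&l->locked);
    }

    #else  /* uniprocessor: no other processor to race with */

    typedef struct { int unused; } spinlock_t;
    #define spin_lock(l)    do { (void)(l); } while (0)
    #define spin_unlock(l)  do { (void)(l); } while (0)

    #endif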

Several years after graduating I spoke with a research group trying to sell an options pricing system. They'd used the latest in parallel architectures, with a very, very fast network to connect everything. They were able to compute a range of option valuations for many, many stocks with the same type of computation being performed for different stocks on the same processor. They thought this was cool.

My then employer was able to do the same thing, on standard 100Mb Ethernet, by combining all of the work for a single stock on the same processor and buying multiple machines.

Oops.

The Bank

A bank computing portfolio credit risk with a Monte Carlo simulation had two problems: the simulation crashed a lot, and took far too long to compute.

It crashed because it was written by a physicist who didn't know how to code and didn't understand C's memory rules. This was tedious but fixable: first, stop the physicist from coding; second, spend a lot of time with Purify.

The application was perfectly parallel, as most Monte Carlo simulations are. If the different runs can be done in parallel (and if they can't, something really strange is going on), the speed-up should be linear.

Two approaches were considered: buying a single multi-processor machine and reworking the code to take advantage of threads, or buying a bunch of single processor machines and rewriting the code to run in a distributed manner.

The bank wisely chose to buy a bunch of single processor machines and modify the code to run in a distributed manner. From the start, this solution had a number of benefits. The C code, which was a mess, would have been very expensive to make thread safe. Each run was independent of the other runs, so no communication was required at all. As the icing on the cake, buying 10 single processor machines was much cheaper than buying a single 10 processor machine.

Down the road there was another benefit. As the application's load increased it was easy to buy more single processor machines. If the bank had taken the threaded approach it would have had to buy a bigger multiprocessor machine.
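
The shape of the solution looks roughly like this (the names and the toy "simulation" are mine, not the bank's code): each run gets its own process, shares nothing, and could just as easily be sent to its own machine:

    /* Sketch only: independent Monte Carlo runs, one process each, no
     * shared state and no locks.  The "simulation" is a stand-in. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static double one_run(unsigned seed)
    {
        /* Stand-in for a real portfolio simulation. */
        double sum = 0.0;
        int i;
        srand(seed);
        for (i = 0; i < 1000000; ++i)
            sum += rand() / (double)RAND_MAX;
        return sum / 1000000.0;
    }

    int main(void)
    {
        const int runs = 8;
        int r;
        for (r = 0; r < runs; ++r) {
            if (fork() == 0) {            /* child: one independent run */
                printf("run %d: %f\n", r, one_run(r + 1));
                _exit(0);
            }
        }
        for (r = 0; r < runs; ++r)        /* parent: wait for every run */
            wait(NULL);
        return 0;
    }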

The Pricing Engine

At a later firm I worked on a complicated, distributed, multi-threaded C++ application that published prices.

It had one major problem. It only worked in two of the firm's three offices.

The difference between offices was load. The threaded application worked fine under low load, but would crash under high load. Again, changes in thread scheduling were to blame.

Like any threaded application, the engine was impossible to properly test. Simply because one run worked didn't mean the program was correct. The engine worked fine in production for months, but after data volumes rose it began crashing daily.

Fixing the engine was difficult. Making it faster to cope with the increased load was even harder.

It was difficult to fix because reading threaded code requires knowing which thread could be executing each piece of code. You have to combine the program's call graph with all of the threading calls, which is hard (at least for me).

Making it fast was even harder, because the threads were actually slowing the program down. The overhead for acquiring a lock is frequently underestimated; on many current architectures a single mutex acquisition is equivalent to about 100 instructions, and a complete memory flush. As the number of processors grows, the memory penalty becomes worse, because more cache is being invalidated.
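
A crude way to see the per-acquisition cost is to time the same loop with and without a mutex. This is my own sketch, not code from the engine, and the lock here isn't even contended; contention and the cache invalidation mentioned above make things far worse:

    /* Crude sketch: the same increment, with and without a mutex.
     * Absolute numbers vary wildly by machine; the ratio is the point. */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        const long n = 10000000L;
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        volatile long counter = 0;
        long i;
        clock_t t0, t1, t2;

        t0 = clock();
        for (i = 0; i < n; ++i)
            ++counter;                    /* plain increment */
        t1 = clock();
        for (i = 0; i < n; ++i) {         /* same increment, under a mutex */
            pthread_mutex_lock(&m);
            ++counter;
            pthread_mutex_unlock(&m);
        }
        t2 = clock();

        printf("plain:  %ld clock ticks\n", (long)(t1 - t0));
        printf("locked: %ld clock ticks\n", (long)(t2 - t1));
        return 0;
    }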

As we removed threads (and hence locks) the engine became faster. Every thread removal saved many, many context switches, something else that is frequently ignored. At another firm I saw better performance from a single-threaded Java application.

Along the way I also noticed that people tend to model threads after real world objects. For example, a thread for each exchange, or a thread for each connected client. On my CM-2 animation project I'd assigned a pixel to each virtual processor.

Unfortunately, for efficiency the number of threads should be proportional to the number of processors, not the number of real world objects.
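
Here's a small sketch of the alternative: pick a thread count that matches the machine and hand each thread a slice of the real-world objects. The names and numbers are illustrative only:

    /* Sketch: size the thread count to the machine, not to the number of
     * real-world objects.  Everything here is illustrative. */
    #include <pthread.h>
    #include <stdio.h>

    #define OBJECTS 1000   /* e.g. connected clients or exchanges */
    #define THREADS 4      /* roughly the number of processors    */

    struct slice { int first, last; };

    static void *work(void *arg)
    {
        struct slice *s = arg;
        int i;
        for (i = s->first; i < s->last; ++i)
            printf("handling object %d\n", i);   /* stand-in for real work */
        return 0;
    }

    int main(void)
    {
        pthread_t tid[THREADS];
        struct slice slices[THREADS];
        int t;

        for (t = 0; t < THREADS; ++t) {
            slices[t].first = t * OBJECTS / THREADS;
            slices[t].last  = (t + 1) * OBJECTS / THREADS;
            pthread_create(&tid[t], 0, work, &slices[t]);
        }
        for (t = 0; t < THREADS; ++t)
            pthread_join(tid[t], 0);
        return 0;
    }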

Summary

Why threads suck:

  1. Threaded programs are impossible to test. One successful run means nothing.
  2. Threaded programs' tendency to fail varies with load. This frequently leads to misplaced optimism at a project's start, when load is low.
  3. Threaded programs are difficult to read, because you need to know in which thread each piece of code could execute. This is difficult to determine from inspection.
  4. Threaded programs are difficult to maintain for the same reason.
  5. Threaded programs are difficult to debug. Being unable to repeat failed execution runs hurts.
  6. Threaded code is frequently slower than single-threaded code because of the extra overhead.
  7. Threaded programs don't scale. Even if you buy hardware with more processors, the speed-up will be sub-linear while the increase in cost is super-linear. This is a space where distributed computation wins: there is a reason Google buys many commodity boxes rather than big multi-processor boxes.
  8. Threaded programs will work fine for years, and then start failing without having been changed.

What to do about it

Don't use them.

If you have a set of tasks that are parallel, break them down into communicating processes. You'll be able to take advantage of multiple processor machines, but better yet, you'll be able to take advantage of multiple machines.
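
In its smallest form, that might look something like this (the task and all the names are invented): each worker runs in its own process and sends its partial result back to the parent over a pipe. The same structure ports naturally to sockets and multiple machines:

    /* Sketch of communicating processes: workers send partial results
     * back to the parent over a pipe.  The "work" is a stand-in. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        const int workers = 4;
        int fds[2], w;
        double total = 0.0;

        if (pipe(fds) != 0) {
            perror("pipe");
            return 1;
        }

        for (w = 0; w < workers; ++w) {
            if (fork() == 0) {                 /* child process */
                double partial = w * 1.0;      /* stand-in for real work */
                write(fds[1], &partial, sizeof partial);
                _exit(0);
            }
        }
        for (w = 0; w < workers; ++w) {        /* parent: collect results */
            double partial;
            read(fds[0], &partial, sizeof partial);
            total += partial;
        }
        while (wait(NULL) > 0)
            ;                                  /* reap all the children */
        printf("total: %f\n", total);
        return 0;
    }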

Make your environment take advantage of the threading for you (for example: force the garbage collector or compiler into its own thread).

And for the love of God, use a language that has full, built-in support for threads.


Last Changed September 10th, 2006 by Robert Zeh