More threads make the program slower?

Multithreading is A Good Thing. Many solutions can be structured so that you bring multiple processor cores to bear on the problem and get very close to linear increases in throughput. That is, your quad-core processor will complete a job nearly four times as fast as a single-core processor. But you have to be careful how you structure your data, or you’ll find that adding threads decreases throughput.

As an example, let’s use a program that has multiple threads, each decrementing a separate counter. That is, each thread does this:

while (myCounter > 0)
{
    --myCounter;
}

myCounter, in this case, is a separate variable for each thread. The program below will handle up to four threads, each decrementing its own counter. Wrapping it all up so we can keep track of everything and supply timings takes a bit of code, but it’s still pretty simple.

struct Counters
{
    public long counter1;
    public long counter2;
    public long counter3;
    public long counter4;
}

const long maxCount = 500000000;
static void DoIt()
{
    var c = new Counters();
    c.counter1 = maxCount;
    c.counter2 = maxCount;
    c.counter3 = maxCount;
    c.counter4 = maxCount;

    int numThreads = 3;     // Set this anywhere from 1 to 4 to vary the number of worker threads.
    var threads = new Thread[4];
    threads[0] = new Thread((s) => { while (c.counter1 > 0) --c.counter1; });
    if (numThreads > 1)
        threads[1] = new Thread((s) => { while (c.counter2 > 0) --c.counter2; });
    if (numThreads > 2)
        threads[2] = new Thread((s) => { while (c.counter3 > 0) --c.counter3; });
    if (numThreads > 3)
        threads[3] = new Thread((s) => { while (c.counter4 > 0) --c.counter4; });

    // start threads
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < numThreads; ++i)
    {
        threads[i].Start();
    }
    // Wait for 500 ms and then access the counters.
    // This just proves that the threads are actually updating the counters.
    Thread.Sleep(500);
    long z1 = c.counter1;
    long z2 = c.counter2;
    long z3 = c.counter3;
    long z4 = c.counter4;
    Console.WriteLine("{0},{1},{2},{3}", z1, z2, z3, z4);
    // Wait for threads to stop
    for (int i = 0; i < numThreads; ++i)
    {
        threads[i].Join();
    }
    sw.Stop();
    Console.WriteLine();
    Console.WriteLine("Elapsed time = {0:N0} ms", sw.ElapsedMilliseconds);
}
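
For reference, the listing above is only the interesting part of the program. To actually run it you need the usual scaffolding around it, something like the sketch below (the class name is my own invention):

using System;
using System.Diagnostics;             // Stopwatch
using System.Runtime.InteropServices; // StructLayout, used by the padded struct later
using System.Threading;               // Thread

class CacheContentionDemo
{
    // The Counters struct, maxCount, and DoIt() shown above go here.

    static void Main()
    {
        DoIt();
    }
}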

Understand, I didn’t set out to write a program that just decrements a couple of counters. This is just a very simple illustration of a problem that I encountered. I’m well aware that having public mutable members in a struct is generally a bad idea, and I wouldn’t do it in production code. This code is for illustration purposes only.

Run with a single thread on my system, that code executes in about 1,325 ms. With two threads, you would expect about the same thing, right? After all, each thread is executing on a separate core (assuming you have multiple cores), doing the same amount of work, and there aren’t any locks protecting shared data. It’s straight computation.

Instead, it took about 1,870 ms, more than half a second longer. Three threads take about 2,600 ms, and four threads about 2,560 ms. There’s definitely something wrong.

The timings I’m reporting here are from my machine at home, which is a Core2 Quad running at 2.4 GHz. I’m running the Windows 8 Developer Preview and the Visual Studio 11 Developer Preview with .NET 4.5. I’ve also run the tests on my development machine at the office: a Core2 Quad, 2.0 GHz, with Visual Studio 2010 and .NET 4.0. My home machine is faster, of course, but the relative timings are the same. That is, the test that takes 1,325 ms on my home machine takes 1,600 ms on my work machine.

At first I thought that for some reason the threads were not running concurrently, so I re-ran the test and watched CPU usage. But, no. When this starts up, CPU usage is 25% times the number of running threads, which makes sense since I have four cores: with four threads running, CPU usage is 100%; with two threads, it’s 50%.

The problem turns out to be cache contention, often called false sharing: multiple cores trying to write to memory locations that are very close to each other. Let me explain cache contention to those who might not be aware of it.

The processor’s cache hardware divides memory up into “cache lines,” which on my system are 64 bytes. When a processor core reads a value from memory, it actually loads the entire cache line that contains that single value. Those 64 bytes are stored in the core’s memory cache. The core can then read from and write to its cache, which is much faster than reading and writing main memory. The cache control circuitry in the processor takes care of copying changed data from the cache to main memory. Usually, that’s done asynchronously so there’s no performance impact on the processing unit.

Things get messy, though, when another core wants to read a value from the same cache line that the first core has modified. The memory control circuitry only knows that something in the cache line changed, not which value. So in order to ensure that the second core gets the most up-to-date value, it has to get that whole cache line from the first core. And, of course, the second core has to wait for that.

In the little program above, the individual counters are right next to each other in memory. They will occupy one or two cache lines. That is, the first one could be at the end of one cache line, with the next three being allocated on the next cache line. That’s assuming that they are actually stored in memory in the order that they’re defined. Assuming that they’re stored sequentially in memory on the same cache line, whenever any thread wants to read or update its counter, it has to wait for some other thread to flush its cache. Those cache stalls add up to a whole lot of wasted processor time.
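
You can actually inspect the layout. Here’s a quick sketch (my own addition, not part of the original program) that uses Marshal.OffsetOf to print where each counter sits within the unpadded Counters struct. Strictly speaking this reports the marshaled layout, but since C# structs default to sequential layout and these fields are all blittable, it matches the in-memory layout: the offsets come out to 0, 8, 16, and 24, so all four counters span just 32 bytes.

// Requires using System.Runtime.InteropServices;
static void PrintCounterOffsets()
{
    // Each offset is relative to the start of the struct. With no padding,
    // consecutive counters are only 8 bytes apart, so all four fit within
    // at most two 64-byte cache lines.
    Console.WriteLine("counter1 at byte {0}", Marshal.OffsetOf(typeof(Counters), "counter1"));
    Console.WriteLine("counter2 at byte {0}", Marshal.OffsetOf(typeof(Counters), "counter2"));
    Console.WriteLine("counter3 at byte {0}", Marshal.OffsetOf(typeof(Counters), "counter3"));
    Console.WriteLine("counter4 at byte {0}", Marshal.OffsetOf(typeof(Counters), "counter4"));
}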

If you know about cache contention (which I did, prior to encountering this problem), you’d never write a program with such an obvious defect as the one shown above. And I didn’t. The program in which I encountered this problem is somewhat more complex, and the cache contention took a little bit of debugging to find.

So how do I prove that cache contention really is the issue? What would happen if I added some padding to the Counters structure to force each of the counters onto a separate cache line?

[StructLayout(LayoutKind.Sequential)]
struct Counters
{
    public long counter1;
    public long a1, a2, a3, a4, a5, a6, a7;
    public long counter2;
    public long b1, b2, b3, b4, b5, b6, b7;
    public long counter3;
    public long c1, c2, c3, c4, c5, c6, c7;
    public long counter4;
    public long d1, d2, d3, d4, d5, d6, d7;
}

I just added seven long values after each counter so that, together with the counter, each group adds up to 64 bytes. I can’t guarantee that a counter is at the beginning of a cache line, but I do know that, since there are 64 bytes from the start of counter1 to the start of counter2, those two values will be on different cache lines. The StructLayout attribute is there to ensure that the compiler lays the structure out in memory exactly as I have defined it.
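
If you’d rather not litter the struct with dummy fields, an alternative (my variation, not something the original program uses) is to specify the spacing directly with explicit layout. Putting 64 bytes between the start of each counter guarantees that no two of them share a cache line, just as the padding does:

[StructLayout(LayoutKind.Explicit)]
struct Counters
{
    // Each counter starts 64 bytes after the previous one, so no two
    // counters can land on the same 64-byte cache line.
    [FieldOffset(0)]   public long counter1;
    [FieldOffset(64)]  public long counter2;
    [FieldOffset(128)] public long counter3;
    [FieldOffset(192)] public long counter4;
}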

So what happens if I run that? One thread takes 1,325 ms. Two threads, same thing. Three threads takes a little longer: about 1,350 ms. And with four threads it takes between 1,390 and 1,420 ms. The times for three and four threads aren’t terribly surprising because this isn’t the only program running on the system. System services still have to run, which means that some of my program’s threads will be swapped out. I could spend a lot of time shutting down unnecessary services so I could test that theory, but there’s really no need. I’m satisfied with the “other programs” explanation.

If you reduce the padding to just three long values (i.e., just a1, a2, and a3), the execution time is about 1,900 ms for four threads. That’s because the smaller padding only ensures that no more than two of the counters will be on any one cache line. This is the same behavior you see with just two threads when there’s no padding at all.
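
Concretely, that reduced-padding version of the struct would look something like this:

[StructLayout(LayoutKind.Sequential)]
struct Counters
{
    // Three padding values per counter: each counter-plus-padding group is
    // 32 bytes, so two counters can still share a 64-byte cache line.
    public long counter1;
    public long a1, a2, a3;
    public long counter2;
    public long b1, b2, b3;
    public long counter3;
    public long c1, c2, c3;
    public long counter4;
    public long d1, d2, d3;
}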

The lesson here is that you don’t want multiple threads mutating memory locations that are very close to each other. It’s bound to happen from time to time, of course, and that’s fine. A few cache flushes now and then aren’t going to kill your program. But you should avoid writing code that has critical loops in separate threads modifying adjacent memory locations. If your threads are just reading those locations, there’s no problem. Reads are essentially free. But if you modify memory close to where one of the other threads is reading, those other threads will stall until the processor core updates its cache.
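
The usual way to follow that advice, sketched below as a variation of my own rather than a fix the original program needs, is to let each thread do its hot-loop work on a local variable and write the shared location only when it finishes. (In this particular benchmark that also changes what’s being measured, since the loop can then run entirely in a register, and the mid-run WriteLine would show the untouched starting values; in a real program, though, keeping per-thread state local is exactly what you want.)

threads[0] = new Thread((s) =>
{
    // Work on a local copy; the loop never touches the shared struct.
    long local = c.counter1;
    while (local > 0)
        --local;
    // Publish the result once, after the loop is done.
    c.counter1 = local;
});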

All the fancy new tools that make multithreaded programming easier are wonderful things. I like the Task Parallel Library and the new Asynchronous Programming stuff that’s coming with .NET 4.5. Using those is much easier and often performs better than explicitly managing threads as I’ve shown above. Hiding complexity is usually a good thing. But those libraries don’t do everything. You still need to understand, at least to some extent, what the hardware is doing.
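
For what it’s worth, here is roughly what the same experiment looks like with the Task Parallel Library. This is a sketch of my own using Task.Run from .NET 4.5 (on .NET 4.0 you’d use Task.Factory.StartNew instead), with c being the same Counters variable as before. The library hides the thread management, but the cache behavior is exactly the same, which is the point:

// Requires using System.Threading.Tasks;
var tasks = new[]
{
    Task.Run(() => { while (c.counter1 > 0) --c.counter1; }),
    Task.Run(() => { while (c.counter2 > 0) --c.counter2; }),
    Task.Run(() => { while (c.counter3 > 0) --c.counter3; }),
    Task.Run(() => { while (c.counter4 > 0) --c.counter4; })
};
Task.WaitAll(tasks);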