Data Layout & Performance

After unwisely getting myself involved in a discussion on /., I figure that I’ve actually got a topic I can go into some depth about. It’s all about how the way you lay out data in memory affects performance, especially but not exclusively when trying to process larger amounts of data in parallel.

The key to understanding this lies in two pieces of knowledge:

  1. Modern CPUs are so incredibly fast at processing data in registers that performance bottlenecks will almost certainly involve moving data from main memory into registers, via the CPU cache(s).
  2. Data is loaded into the CPU cache at a rather coarse granularity.

Main Memory Access

At the assembly language level, you’re not going to issue instructions for explicitly loading cache lines into the CPU cache. There might be hardware out there where you do, but most developers won’t ever need to think about that. What you do is issue instructions such as this:

mov ax, [123h]

That is, the instruction will load the data located at offset 0x123 into the register ax. This sort of thing will happen when you increment a variable in your C code, i.e.

int i = 1;  // first statement
++i;        // second statement

will be translated into something like this:

  mov [123h], 1   ; first statement
  mov ax, [123h]  ; second statement
  inc ax
  mov [123h], ax

Note that this isn’t meant to be good assembly code; it’s meant to illustrate that you don’t deal with cache lines at the assembly level, you deal with memory addresses, much like you would in C when using pointers.
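
For comparison, here’s a minimal C sketch of the same idea, assuming nothing beyond standard C: take an address, then read, modify, and write through it, just like the assembly above does with 123h.

#include <stdio.h>

int main(void)
{
  int i = 1;    /* first statement: store 1 at i's address */
  int *p = &i;  /* take the address, the C analogue of 123h */
  ++*p;         /* load from the address, increment, store back */
  printf("%d\n", i);  /* prints 2 */
  return 0;
}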

Memory & Cache

So what happens under the hood for the mov and similar instructions is that the CPU issues a command to the memory controller that the data at the indicated address is needed[1]. The memory controller then either serves that data from one of the caches it knows about, or, if the data is not in a cache yet, first loads a cache line containing that memory location into the cache[2].

The caches exist because loading data from main memory is slow[3]. There are faster, but much more expensive, types of memory available, though. Processor caches are made of this faster memory, and are used to temporarily store data that’s currently being accessed a lot, in order to speed up loads and stores from the CPU’s perspective.
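
Incidentally, you can ask your system how big these cache lines are. On Linux, glibc exposes the L1 data cache line size through sysconf; a minimal sketch, using a glibc-specific extension rather than portable C:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
  /* glibc extension: reports the L1 data cache line size in
     bytes, or -1/0 where the value isn't known. */
  long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  printf("L1 data cache line size: %ld bytes\n", line);
  return 0;
}

On most current x86 machines this will print 64 bytes.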

Note that a cache line is a contiguous memory region, which for reasons of efficiency is loaded into the cache all at once. These reasons can be summarized as:

  1. Loading data from main memory is comparatively slow[4].
  2. It’s statistically likely that data that is close together in main memory is used around the same time.

It’s that last point that is both a blessing and a curse, and whose implications we programmers really should understand.
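
To see the blessing in action, here’s a rough sketch of a micro-benchmark: it compares a sequential sweep over a large array against strided passes that touch only one int per (assumed 64 byte) cache line at a time. The line size, array size and loop structure are my assumptions for illustration; the actual numbers depend on your machine, compiler flags and the hardware prefetcher, so treat this as an illustration rather than a rigorous measurement.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)  /* 16M ints, 64 MB: far bigger than any cache */
#define STRIDE 16             /* 16 ints = 64 bytes, one assumed cache line */

int main(void)
{
  int *data = malloc(N * sizeof *data);
  if (!data) return 1;
  for (long i = 0; i < N; ++i) data[i] = 1;

  long sum = 0;

  /* Sequential: each cache line, once loaded, serves 16 reads. */
  clock_t t0 = clock();
  for (long i = 0; i < N; ++i)
    sum += data[i];
  clock_t t1 = clock();

  /* Strided: the same N reads in total, but each pass uses just
     one int per line, so lines are reloaded over and over. */
  for (long s = 0; s < STRIDE; ++s)
    for (long i = s; i < N; i += STRIDE)
      sum += data[i];
  clock_t t2 = clock();

  printf("sequential: %.2fs, strided: %.2fs (sum %ld)\n",
         (double)(t1 - t0) / CLOCKS_PER_SEC,
         (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
  free(data);
  return 0;
}

The strided version does the same amount of arithmetic but roughly sixteen times the memory traffic, which is exactly the kind of difference that cache line granularity creates.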

Data Layout

It doesn’t matter much for the purposes of this discussion how large a cache line actually is. For one thing, that varies considerably from processor to processor, as you might imagine. For another, it’s sufficient to assume that each cache line will hold more than one variable of the built-in data types you might use in C. That is, in the following code snippet, all data is likely going to end up in the same cache line:

int a = 1;
int b = 2;
int c = a + b;

If that’s the case, it doesn’t take much imagination to assume that at least for smaller sized structures, all data members will also end up in the same cache line:

struct foo {
  int a;
  int b;
  int c;
};

struct foo f;

In fact, this is almost guaranteed to be the case, unless your structure is huge and you’re wondering whether e.g. the first and last member are in the same cache line. This is firstly because structures in C are just contiguous bits of memory.
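
You can check the contiguity yourself with offsetof; a minimal sketch, where the exact values assume a typical platform with 4 byte ints and no padding between them:

#include <stdio.h>
#include <stddef.h>

struct foo {
  int a;
  int b;
  int c;
};

int main(void)
{
  /* Typically prints offsets 0, 4, 8 and size 12: the members
     sit back to back in one contiguous block of memory. */
  printf("a at %zu, b at %zu, c at %zu, sizeof = %zu\n",
         offsetof(struct foo, a), offsetof(struct foo, b),
         offsetof(struct foo, c), sizeof(struct foo));
  return 0;
}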

Secondly and more importantly, CPUs require data to be aligned on a boundary that is convenient for them. Cache lines are typically multiples of this boundary; that is, if data is supposed to be aligned on a 64 bit boundary, cache lines are going to be multiples of 64 bits in size rather than, say, 96 bits.

Primarily that means that individual variables of built-in types will always fit into a single cache line, rather than straddling two.
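
If you want to see those boundaries, C11’s _Alignof reports the required alignment of a type; a quick sketch, where the printed values are what you’d typically get on a 64 bit platform:

#include <stdio.h>

int main(void)
{
  /* Built-in types are aligned to at most their own size, which
     is why a single variable never straddles a cache line. */
  printf("char: %zu, int: %zu, long long: %zu, double: %zu\n",
         _Alignof(char), _Alignof(int),
         _Alignof(long long), _Alignof(double));
  return 0;
}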

Structures, of course, might be oddly “shaped” as it were. If your structure contains three 64 bit integers, the CPU requires data to be aligned on 64 bit boundaries, and cache lines are 256 bits in size, then not every structure in an array can fit neatly into a cache line; some elements will inevitably straddle a line boundary. Keep that in mind, as it compounds the problems I’m going into later on.
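
Here’s a small sketch playing out that example with numbers: an array of structures holding three 64 bit integers each, checked against the hypothetical 256 bit (32 byte) cache line from the text; aligning the array to a line boundary makes the output deterministic.

#include <stdio.h>
#include <stdint.h>

#define LINE 32  /* the hypothetical 256 bit cache line from the text */

struct three {
  int64_t a, b, c;  /* 24 bytes, aligned on 8 byte boundaries */
};

int main(void)
{
  _Alignas(LINE) struct three arr[8];

  for (int i = 0; i < 8; ++i) {
    uintptr_t first = (uintptr_t)&arr[i];
    uintptr_t last = first + sizeof arr[i] - 1;
    printf("element %d %s\n", i,
           (first / LINE == last / LINE)
             ? "fits within one line" : "straddles two lines");
  }
  return 0;
}

With these sizes, elements 0, 3, 4 and 7 fit neatly and the other four straddle a boundary, so roughly half the elements cost two cache line loads instead of one.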

  1. At this point the offset will already have been converted into an absolute address, but never mind the details.
  2. More details on that than most mere mortals can handle are available in Ulrich Drepper’s series of articles “What every programmer should know about memory”. Even if you think you can’t handle it, you should try.
  3. I’m going to repeat this a few times…
  4. Yes, here we are again.