Memory Alignment

Submitted by epreisz on Sat, 02/17/2007 - 20:57.

Reading from unaligned memory can cause severe performance problems. Luckily, un-aligning memory by accident is rare. The following passage is important preparation for using SIMD instructions, which require us to alter the default memory alignment.

Programmers usually think about memory without considering its alignment, because the compiler normally handles alignment for us automatically. When the compiler lays out a struct, it looks at the member with the strictest alignment requirement, which is normally the largest primitive type, and aligns the struct to that type. If your struct's largest member is 4 bytes, the struct will be 4-byte aligned. If your struct's largest member is 8 bytes, the struct will be 8-byte aligned. Consider the following struct:

struct myBytes
{
    int m_Int;
    double m_Double1;
    double m_Double2;
};

Upon first examination, a programmer may proclaim that the struct is 20 bytes in size. This is understandable since an int is 4 bytes and a double is 8 bytes. The correct answer is actually 24.

The reason this struct is 24 bytes lies in memory alignment. Since the largest type is 8 bytes, this struct is 8-byte aligned. When the compiler builds the struct, it pads the 4-byte int with 4 more bytes. This ensures that the first double is aligned on an 8-byte boundary.

Examining the memory sequentially gives a clear picture of the padding that’s involved with aligning your memory.

int | padding | double | double
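
The layout above can be checked at compile time. The following is a minimal sketch, assuming a platform where int is 4 bytes and double is 8-byte aligned (typical for 64-bit targets; some 32-bit ABIs align double to 4 bytes, in which case the assertions fail). The helper paddingAfterInt is a hypothetical name introduced only for illustration.

```cpp
#include <cstddef>   // offsetof, std::size_t

// Same members as the struct in the text.
struct myBytes
{
    int    m_Int;      // 4 bytes, followed by 4 bytes of padding
    double m_Double1;  // 8 bytes, must start on an 8-byte boundary
    double m_Double2;  // 8 bytes
};

// Platform assumptions (not guaranteed by the C++ standard).
static_assert(sizeof(int) == 4 && alignof(double) == 8, "unexpected platform");

// The struct takes the alignment of its most strictly aligned member.
static_assert(alignof(myBytes) == 8, "struct should be 8-byte aligned");
static_assert(offsetof(myBytes, m_Double1) == 8, "4 bytes of padding follow m_Int");
static_assert(offsetof(myBytes, m_Double2) == 16, "second double follows the first");
static_assert(sizeof(myBytes) == 24, "4 + 4 (padding) + 8 + 8");

// Hypothetical helper: number of padding bytes between m_Int and m_Double1.
constexpr std::size_t paddingAfterInt()
{
    return offsetof(myBytes, m_Double1) - sizeof(int);
}
```

If the program compiles, the compiler agrees with the diagram: the int sits at offset 0, the padding at offset 4, and the doubles at offsets 8 and 16.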

Consider the following:

struct myBytes
{
    int m_Int;
    int m_Int2;
    double m_Double1;
    double m_Double2;
};

How many bytes will this struct be? The answer is still 24. The second int occupies the 4 bytes that were previously padding, so there is no need for the compiler to pad.

int | int | double | double
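
As a sketch of the same idea, the checks below confirm the 24-byte size under the same platform assumptions as before (4-byte int, 8-byte-aligned double). The second struct, myInterleavedBytes, is a hypothetical variant that does not appear in the text; it holds the same members in a different order to show that the padding returns when each int is followed by a double.

```cpp
#include <cstddef>   // offsetof

// The struct from the text: the second int fills the former padding.
struct myBytes
{
    int    m_Int;      // offset 0
    int    m_Int2;     // offset 4
    double m_Double1;  // offset 8
    double m_Double2;  // offset 16
};

// Hypothetical variant (not from the text): same members, interleaved.
struct myInterleavedBytes
{
    int    m_Int;      // offset 0, then 4 bytes of padding
    double m_Double1;  // offset 8
    int    m_Int2;     // offset 16, then 4 bytes of padding
    double m_Double2;  // offset 24
};

static_assert(offsetof(myBytes, m_Int2) == 4, "second int directly follows the first");
static_assert(offsetof(myBytes, m_Double1) == 8, "no padding is needed");
static_assert(sizeof(myBytes) == 24, "4 + 4 + 8 + 8");
static_assert(sizeof(myInterleavedBytes) == 32, "4 + 4 (pad) + 8 + 4 + 4 (pad) + 8");
```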

How do we un-align memory?

Most of the time, we do not need to consider memory alignment; however, there are several occasions where we may need to un-align memory, or may do so accidentally. In a later chapter, we will cover SIMD operations that allow us to perform 4 operations at the speed of 1. In order to use SIMD operations optimally, we will need to align our memory to 16 bytes.

One way that we may un-align memory is to cast from a smaller type to a larger type. For example, suppose you allocate memory as an integer (4 bytes) and cast it to an 8-byte long integer. When the memory is allocated as the 4-byte integer, it will be aligned at a 4-byte address. A 4-byte-aligned address can be identified by its lower bits: if the address is a multiple of 4 (its lowest two bits are zero), then the address is 4-byte aligned. Some addresses that are multiples of 4 are also multiples of 8: 8, 16, and 24 are multiples of both 8 and 4, while 4, 12, and 20 are multiples of 4 but not 8. If memory is allocated at a 4-byte address that is not a multiple of 8, and is then cast to an 8-byte type, that address will be un-aligned for the new type.

A subsequent read from that unaligned memory, as an 8-byte type, can cause a costly exception on some platforms. This exception is caught by the OS and handled by loading the lower 4 bytes, the upper 4 bytes, and combining them to create an 8-byte value. Tests using tools such as VTune show that loading un-aligned data is around 3 times slower than reading aligned memory.
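
The lower-bits test described above can be sketched as follows. The function names isAlignedAddress, isAlignedPointer, and loadUnaligned64 are hypothetical helpers introduced for illustration. The last one shows one commonly used way to read an 8-byte value from an address that may not be 8-byte aligned: copying the bytes with std::memcpy instead of casting the pointer.

```cpp
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uintptr_t, std::int64_t
#include <cstring>   // std::memcpy

// True when addr is a multiple of alignment.
// alignment must be a power of two, so (alignment - 1) masks the lower bits.
inline bool isAlignedAddress(std::uintptr_t addr, std::size_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}

// Same test applied to a pointer.
inline bool isAlignedPointer(const void* p, std::size_t alignment)
{
    return isAlignedAddress(reinterpret_cast<std::uintptr_t>(p), alignment);
}

// Reads 8 bytes from p without requiring p to be 8-byte aligned.
// Copying the bytes avoids dereferencing a misaligned 8-byte pointer.
inline std::int64_t loadUnaligned64(const void* p)
{
    std::int64_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}
```

With these helpers, addresses 8, 16, and 24 pass the 8-byte test, while 4, 12, and 20 pass only the 4-byte test, matching the examples in the text.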

Another way that we may un-align memory is to use a preprocessor directive, a Visual Studio project property, or a command-line argument. These readings will explore these concepts again when we discuss SIMD processing.
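
The text does not name a specific mechanism here. As an assumption, one candidate is the #pragma pack directive, which MSVC, GCC, and Clang all accept (MSVC also offers a matching /Zp command-line option). The sketch below packs the earlier struct on 1-byte boundaries; myPackedBytes is a hypothetical name, and the sizes assume a 4-byte int and an 8-byte double.

```cpp
#include <cstddef>   // offsetof

#pragma pack(push, 1)   // pack members on 1-byte boundaries: no padding
struct myPackedBytes    // hypothetical name, same members as myBytes
{
    int    m_Int;       // offset 0
    double m_Double1;   // offset 4: no longer on an 8-byte boundary
    double m_Double2;   // offset 12
};
#pragma pack(pop)       // restore the default packing

static_assert(sizeof(myPackedBytes) == 20, "4 + 8 + 8, with no padding");
static_assert(offsetof(myPackedBytes, m_Double1) == 4, "double is now un-aligned");
static_assert(alignof(myPackedBytes) == 1, "packed struct is only 1-byte aligned");
```

The struct shrinks to the 20 bytes a programmer might first expect, at the cost of placing both doubles at addresses that are not multiples of 8.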

Cache Line-Aligning Memory

Another micro-optimization that may increase performance is aligning your data to cache lines. This optimization is not preferred for PC development since not all cache lines are 64 bytes. The cache system loads and stores whole cache lines. If your data consistently straddles cache lines, you may cause what are known as split loads and split writes. A split load or write increases the overhead costs associated with loading and storing memory.

To avoid split loads and writes, build critical structures to have a size of 64 bytes and align each structure to a 64-byte address. You may need to pad your structure with unused data to achieve the correct size. To align your structure to a 64-byte address, prefix its declaration with the following Microsoft-specific specifier:

__declspec(align(64))
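
Since __declspec(align(64)) is specific to Microsoft compilers, the sketch below uses the standard C++11 alignas specifier to the same effect. It assumes a 64-byte cache line, as the text does. The struct myCacheLineData, its members, and the helper startsOnCacheLine are hypothetical names chosen for illustration.

```cpp
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uintptr_t

// Assumed cache line size; not all processors use 64 bytes.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical structure sized and aligned to exactly one cache line.
struct alignas(64) myCacheLineData
{
    float m_Values[12];   // 48 bytes of payload
    char  m_Pad[16];      // unused bytes that bring the size up to 64
};

static_assert(sizeof(myCacheLineData) == kCacheLineSize, "one struct per cache line");
static_assert(alignof(myCacheLineData) == kCacheLineSize, "starts on a cache line");

// Hypothetical helper: true when the object begins on a 64-byte boundary.
inline bool startsOnCacheLine(const myCacheLineData& data)
{
    return reinterpret_cast<std::uintptr_t>(&data) % kCacheLineSize == 0;
}
```

Because the size equals the alignment, every element of an array of myCacheLineData begins on its own cache line, so no element straddles two lines.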