CPU Cache Knowledge, Part 4: Why align data to cache lines



What does “cacheline aligned” mean?

CPU caches transfer data from and to main memory in chunks called cache lines; a typical size for this seems to be 64 bytes. Data that are located closer to each other than this (i.e., within the same 64 bytes) may end up on the same cache line.

If these data are needed by different cores, the system has to work hard to keep the

data consistent between the copies residing in the cores’ caches. Essentially, while

one thread modifies the data, the other thread is blocked by a lock from accessing the data.

The article you reference talks about one such problem that was found in PostgreSQL

in a data structure in shared memory that is frequently updated by different processes.

By introducing padding into the structure to inflate it to 64 bytes, it is guaranteed

that no two such data structures end up in the same cache line, and the processes that

access them are not blocked more than absolutely necessary.

This is only relevant if your program parallelizes execution and accesses a shared

memory region, either by multithreading or by multiprocessing with shared memory.

In this case you can benefit by making sure that data that are frequently accessed by

different execution threads are not located close enough in memory that they can end

up in the same cache line.

The typical way to do that is by adding “dead” padding space at the end of a data structure.


1. Suppose two data structures have the following position and layout in memory:

----------------------------  <-- 0x0
A {
    unsigned int a;
}
----------------------------  <-- 0x4
B {
    unsigned int b;
}
----------------------------  <-- 0x8


2. On a dual-core processor where each CPU's cache line is 64 bytes

If process A on CPU0 accesses structure A, CPU0's cache loads the memory range 0x0~0x40 into one of its cache lines.

If process A on CPU0 then modifies structure A, the block of memory behind that cache line (0x0~0x40) is locked to keep other processes out. The unacceptable side effect is that structure B also lives in that region and gets locked along with it, so process B on CPU1, which only wants structure B, is blocked as well.


3. If the two data structures are aligned to cache lines, their position and layout in memory become:

----------------------------  <-- 0x0
A {
    unsigned int a;
} __cacheline_aligned
pad
pad
pad
...
pad
----------------------------  <-- 0x40
B {
    unsigned int b;
} __cacheline_aligned
pad
pad
pad
...
pad
----------------------------  <-- 0x80

4. On a dual-core processor where each CPU's cache line is 64 bytes

If process A on CPU0 accesses structure A, CPU0's cache loads the memory range 0x0~0x40 into one of its cache lines.

If process A on CPU0 modifies structure A, the memory block behind that cache line (0x0~0x40) is locked to keep other processes out.

Now when process B on CPU1 accesses structure B, CPU1's cache loads the range 0x40~0x80 into one of its own cache lines,

and the lock on 0x0~0x40 does not affect accesses to 0x40~0x80 at all.

This is why data structures are annotated with __cacheline_aligned.


5. For years it has been said that cache line misalignment can hurt performance; can we now summarize why?

When a CPU core accesses a data structure A, it reads one cache line's worth of memory starting at A into its own cache. If it modifies the contents of that cache line (the contents of A), the memory mapped by that line is locked. If the structure is not cache line aligned, the same line may also contain some other structure B that a process on another CPU needs, and the lock then blocks that process, degrading system performance.

6. Why is this memory locked?

Under the cache coherence protocol (MESI), CPU0's modification of structure A invalidates CPU1's copy of the cache line, and likewise CPU1's modification of structure B invalidates CPU0's copy. If CPU0 and CPU1 modify them repeatedly, the system performs these locking and invalidation operations constantly, which inevitably degrades performance. This phenomenon is called cache line false sharing: the two CPUs share no data at all, yet because they access the same cache line they share it in effect. One fix is to align the structures to cache lines, a classic space-for-time trade-off.



Copyright notice: this is an original article by denglin12315, distributed under the CC 4.0 BY-SA license; reproductions must include a link to the original and this notice.