Introduction to the SLUB Allocator Implementation via Socket Buffer Allocation and Deallocation

Contents #

  1. Introduction
  2. Per-CPU
  3. NUMA Node
  4. Kernel Memory Cache and Socket Buffer
  5. SLUB allocator and SKB allocation
  6. SLUB allocator and SKB deallocation
  7. References

Introduction #

Since Jeff Bonwick introduced the SLAB allocator and Christoph Lameter later proposed the SLUB allocator as an alternative, the SLUB allocator has been the one predominantly used in modern Linux kernels [1, 2].

This post examines the SLUB allocator implementation through the allocation and deallocation of SKBs (struct sk_buff), based on Linux kernel version 5.15.30.

Per-CPU #

In a symmetric multiprocessing system, data handled by multiple CPUs can cause lock-contention bottlenecks and cache line bouncing. To minimize this overhead, the Linux kernel keeps a local data area for each CPU; in other words, every CPU gets its own space that it can access without locking [3].

The Linux kernel provides interfaces such as per_cpu_ptr() and raw_cpu_ptr() for accessing per-CPU data. They resolve a per-CPU pointer through the following call traces. (Note that the call trace of raw_cpu_ptr() converges with per_cpu_ptr() at SHIFT_PERCPU_PTR().)

/* function call trace */

per_cpu_ptr(ptr, cpu)                                 raw_cpu_ptr(ptr)
SHIFT_PERCPU_PTR() /* __p = ptr,                      arch_raw_cpu_ptr()
           __offset = per_cpu_offset((cpu)) */        SHIFT_PERCPU_PTR() /* __p = ptr
RELOC_HIDE()                                                __offset = __my_cpu_offset */

In the call traces above, RELOC_HIDE() adds the given offset (__offset) to the given ptr and returns a pointer usable on that CPU. The offset itself is obtained as follows in each case.

 1/* include/asm-generic/percpu.h */
 2
 3#include <linux/compiler.h>
 4#include <linux/threads.h>
 5#include <linux/percpu-defs.h>
 6
 7#ifdef CONFIG_SMP
 8
 9/*
10 * per_cpu_offset() is the offset that has to be added to a
11 * percpu variable to get to the instance for a certain processor.
12 *
13 * Most arches use the __per_cpu_offset array for those offsets but
14 * some arches have their own ways of determining the offset (x86_64, s390).
15 */
16#ifndef __per_cpu_offset
17extern unsigned long __per_cpu_offset[NR_CPUS];
18
19#define per_cpu_offset(x) (__per_cpu_offset[x])
20#endif
 1/* arch/ia64/include/asm/percpu.h */
 2
 3#define __my_cpu_offset	__ia64_per_cpu_var(local_per_cpu_offset)
 4
 5extern void *per_cpu_init(void);
 6
 7#else /* ! SMP */
 8
 9#define per_cpu_init()				(__phys_per_cpu_start)
10
11#endif	/* SMP */
12
13#define PER_CPU_BASE_SECTION ".data..percpu"
14
15/*
16 * Be extremely careful when taking the address of this variable!  Due to virtual
17 * remapping, it is different from the canonical address returned by this_cpu_ptr(&var)!
18 * On the positive side, using __ia64_per_cpu_var() instead of this_cpu_ptr() is slightly
19 * more efficient.
20 */
21#define __ia64_per_cpu_var(var) (*({					\
22	__verify_pcpu_ptr(&(var));					\
23	((typeof(var) __kernel __force *)&(var));			\
24}))
25
26#include <asm-generic/percpu.h>
27
28/* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
29DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);
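
To make the pointer arithmetic concrete, here is a minimal userspace model of how a per-CPU pointer is resolved: each CPU owns a separate copy of the per-CPU area, and the lookup is just "pointer plus that CPU's offset". All names here (my_per_cpu_ptr, my_per_cpu_offset, percpu_area) are invented for illustration; this is a sketch of the idea, not kernel code.

/* Userspace model (illustration only): per_cpu_ptr(ptr, cpu) boils down to
 * "ptr + __per_cpu_offset[cpu]", where each CPU owns its own copy of the
 * per-CPU data area. */
#include <stdio.h>
#include <stdint.h>

#define NR_CPUS 4

static long percpu_area[NR_CPUS][16];        /* one copy of the area per CPU */
static uintptr_t my_per_cpu_offset[NR_CPUS]; /* models __per_cpu_offset[]    */

/* Models SHIFT_PERCPU_PTR()/RELOC_HIDE(): add the CPU's offset to the pointer. */
static void *my_per_cpu_ptr(void *ptr, int cpu)
{
	return (void *)((uintptr_t)ptr + my_per_cpu_offset[cpu]);
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		my_per_cpu_offset[cpu] =
			(uintptr_t)percpu_area[cpu] - (uintptr_t)percpu_area[0];

	long *counter = &percpu_area[0][3]; /* a "per-CPU" variable (generic copy) */

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		*(long *)my_per_cpu_ptr(counter, cpu) = cpu * 100;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d: %ld\n", cpu, *(long *)my_per_cpu_ptr(counter, cpu));

	return 0;
}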

Such per-CPU data is allocated when a kernel memory cache is initialized, as the following call trace illustrates.

/* function call trace */

socket_init()
skb_init()
kmem_cache_create_usercopy() /* name = "skbuff_head_cache",
                                size = sizeof(struct sk_buff) = 0xE0,
                                align = 0,
                                flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                                useroffset = offsetof(struct sk_buff, cb),
                                usersize = sizeof_field(struct sk_buff, cb),
                                ctor = NULL */
create_cache() /* object_size = size = sizeof(struct sk_buff) */
__kmem_cache_create()
kmem_cache_open()
alloc_kmem_cache_cpus()
      |
      +-----------------------+
      |1                      |2
      V                       V
 __alloc_percpu()        init_kmem_cache_cpus()
 pcpu_alloc()
 pcpu_find_block_fit()

In pcpu_alloc() in the call trace above, the request is served from per-CPU chunks, i.e. the slots set up earlier by pcpu_setup_first_chunk() and related code.

NUMA Node #

Non-Uniform Memory Access (NUMA) is a shared memory architecture that partitions the available memory among units of hardware resources (processors, I/O buses, ...) in order to reduce the average access time in a symmetric multiprocessing system. In the physical view, the Linux kernel groups each set of hardware resources, including memory, into a cell, which is then abstracted as a node; each node is in turn mapped back to a cell. This can be pictured as follows [4, 5].

              Node 1                                           Node 2
+-----------------------------------+         +-----------------------------------+
| +------------------------------+  |         | +------------------------------+  |
| |  processor1, processor2, ... |  |         | |  processor1, processor2, ... |  |
| +------------------------------+  |---------| +------------------------------+  |
|           |       |               |inter    |           |       |               |
|           |       |               |connected|           |       |               |
|   +--------------------------+    |---------|    +------------------------+     |
|   |       memory             |    |         |    |      memory            |     |
|   +--------------------------+    |         |    +------------------------+     |
+-----------------------------------+ \     / +-----------------------------------+
                                  \     \  /     /
                                   \     \/     /
                                    \    /\    /
                                     \  /  \  /
                                      \/    \/
                                      /\    /\                                    
              Node 3                 /  \  /  \                 Node 4
+-----------------------------------+    \/   +-----------------------------------+
| +------------------------------+  |    /\   | +------------------------------+  |
| |  processor1, processor2, ... |  |   /  \  | |  processor1, processor2, ... |  |
| +------------------------------+  |---------| +------------------------------+  |
|           |       |               |inter    |           |       |               |
|           |       |               |connected|           |       |               |
|   +--------------------------+    |---------|    +------------------------+     |
|   |       memory             |    |         |    |      memory            |     |
|   +--------------------------+    |         |    +------------------------+     |
+-----------------------------------+         +-----------------------------------+

As the figure shows, in a NUMA architecture the access time to the local node differs from the access time to a remote node.

The Linux kernel provides interfaces such as get_node() for accessing NUMA nodes. Looking at its implementation, it simply returns the kmem_cache_node entry indexed by node.

1/* mm/slab.h */
2
3static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
4{
5	return s->node[node];
6}
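
As a usage example, the slab free slowpath shown later in this post calls it as get_node(s, page_to_nid(page)) to reach the kmem_cache_node, and therefore the node-level partial list, of the page being freed.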

These NUMA nodes are mapped when a kernel memory cache is initialized, as the following call trace shows. (Note that the numbering only indicates relative order.)

/* function call trace */

socket_init()
skb_init()
kmem_cache_create_usercopy() /* name = "skbuff_head_cache",
                                size = sizeof(struct sk_buff) = 0xE0,
                                align = 0,
                                flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                                useroffset = offsetof(struct sk_buff, cb),
                                usersize = sizeof_field(struct sk_buff, cb),
                                ctor = NULL */
create_cache() /* object_size = size = sizeof(struct sk_buff) */
__kmem_cache_create()
kmem_cache_open()
init_kmem_cache_nodes()
       |
       +----------------------------+
       |1                           |2
       V                            V
    kmem_cache_alloc_node()     init_kmem_cache_node()

Kernel Memory Cache and Socket Buffer #

To manage the slab caches for socket buffers, the Linux kernel keeps skbuff_head_cache and skbuff_fclone_cache. The one of interest here is skbuff_head_cache, which is declared as a pointer to struct kmem_cache.

1/* include/linux/skbuff.h */
2
3extern struct kmem_cache *skbuff_head_cache;
 1/* include/linux/slub_def.h */
 2
 3/*
 4 * Slab cache management.
 5 */
 6struct kmem_cache {
 7	struct kmem_cache_cpu __percpu *cpu_slab;
 8	/* Used for retrieving partial slabs, etc. */
 9	slab_flags_t flags;
10	unsigned long min_partial;
11	unsigned int size;	/* The size of an object including metadata */
12	unsigned int object_size;/* The size of an object without metadata */
13	struct reciprocal_value reciprocal_size;
14	unsigned int offset;	/* Free pointer offset */
15#ifdef CONFIG_SLUB_CPU_PARTIAL
16	/* Number of per cpu partial objects to keep around */
17	unsigned int cpu_partial;
18#endif
19	struct kmem_cache_order_objects oo;
20
21	/* Allocation and freeing of slabs */
22	struct kmem_cache_order_objects max;
23	struct kmem_cache_order_objects min;
24	gfp_t allocflags;	/* gfp flags to use on each alloc */
25	int refcount;		/* Refcount for slab cache destroy */
26	void (*ctor)(void *);
27	unsigned int inuse;		/* Offset to metadata */
28	unsigned int align;		/* Alignment */
29	unsigned int red_left_pad;	/* Left redzone padding size */
30	const char *name;	/* Name (only for display!) */
31	struct list_head list;	/* List of slab caches */
32#ifdef CONFIG_SYSFS
33	struct kobject kobj;	/* For sysfs */
34#endif
35#ifdef CONFIG_SLAB_FREELIST_HARDENED
36	unsigned long random;
37#endif
38
39#ifdef CONFIG_NUMA
40	/*
41	 * Defragmentation by allocating from a remote node.
42	 */
43	unsigned int remote_node_defrag_ratio;
44#endif
45
46#ifdef CONFIG_SLAB_FREELIST_RANDOM
47	unsigned int *random_seq;
48#endif
49
50#ifdef CONFIG_KASAN
51	struct kasan_cache kasan_info;
52#endif
53
54	unsigned int useroffset;	/* Usercopy region offset */
55	unsigned int usersize;		/* Usercopy region size */
56
57	struct kmem_cache_node *node[MAX_NUMNODES];
58};

This cache is initialized from skb_init().

/* function call trace */

socket_init()
skb_init()
kmem_cache_create_usercopy() /* name = "skbuff_head_cache",
                                size = sizeof(struct sk_buff) = 0xE0,
                                align = 0,
                                flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                                useroffset = offsetof(struct sk_buff, cb),
                                usersize = sizeof_field(struct sk_buff, cb),
                                ctor = NULL */
create_cache() /* object_size = size = sizeof(struct sk_buff) */
 1static struct kmem_cache *create_cache(const char *name,
 2		unsigned int object_size, unsigned int align,
 3		slab_flags_t flags, unsigned int useroffset,
 4		unsigned int usersize, void (*ctor)(void *),
 5		struct kmem_cache *root_cache)
 6{
 7	struct kmem_cache *s;
 8	int err;
 9
10	/* ... */
11
12	err = -ENOMEM;
13	s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
14	if (!s)
15		goto out;
16
17	s->name = name;
18	s->size = s->object_size = object_size;
19	s->align = align;
20	s->ctor = ctor;
21	s->useroffset = useroffset;
22	s->usersize = usersize;
23
24	s->refcount = 1;
25	list_add(&s->list, &slab_caches);
26out:
27	if (err)
28		return ERR_PTR(err);
29	return s;
30
31out_free_cache:
32	kmem_cache_free(kmem_cache, s);
33	goto out;
34}

So s->object_size is the size of struct sk_buff, which GDB confirms to be 0xe0 bytes. An important member of struct kmem_cache is struct kmem_cache_cpu. It is a per-cpu object and is allocated and initialized through the following call trace. (Note that the call trace below does not list every function and that the numbering only indicates relative order.)

/* function call trace */

socket_init()
skb_init()
kmem_cache_create_usercopy() /* name = "skbuff_head_cache",
                                size = sizeof(struct sk_buff) = 0xE0,
                                align = 0,
                                flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                                useroffset = offsetof(struct sk_buff, cb),
                                usersize = sizeof_field(struct sk_buff, cb),
                                ctor = NULL */
create_cache() /* object_size = size = sizeof(struct sk_buff) */
__kmem_cache_create()
kmem_cache_open()
      |
      |
      +---------------------------+------------------------------------+
      |1                          |2                                   |3
      |                           |                                    |
      V                           V                                    V
alloc_kmem_cache_cpus()       set_min_partial() /* s = s, */      set_cpu_partial()
      |                            /* min = ilog2(s->size / 2) */
      +-----------------------+
      |1                      |2
      V                       V
 __alloc_percpu()        init_kmem_cache_cpus()
 pcpu_alloc()
 pcpu_find_block_fit()
 1/* include/linux/slub_def.h */
 2
 3/*
 4 * When changing the layout, make sure freelist and tid are still compatible
 5 * with this_cpu_cmpxchg_double() alignment requirements.
 6 */
 7struct kmem_cache_cpu {
 8	void **freelist;	/* Pointer to next available object */
 9	unsigned long tid;	/* Globally unique transaction id */
10	struct page *page;	/* The slab from which we are allocating */
11#ifdef CONFIG_SLUB_CPU_PARTIAL
12	struct page *partial;	/* Partially allocated frozen slabs */
13#endif
14	local_lock_t lock;	/* Protects the fields above */
15#ifdef CONFIG_SLUB_STATS
16	unsigned stat[NR_SLUB_STAT_ITEMS];
17#endif
18};
 1/* mm/slub.c */
 2
 3static void init_kmem_cache_cpus(struct kmem_cache *s)
 4{
 5	int cpu;
 6	struct kmem_cache_cpu *c;
 7
 8	for_each_possible_cpu(cpu) {
 9		c = per_cpu_ptr(s->cpu_slab, cpu);
10		local_lock_init(&c->lock);
11		c->tid = init_tid(cpu);
12	}
13}

Also worth examining are set_min_partial() and set_cpu_partial(). The former sets the minimum number of nodes on the partial list, the linked list of slabs in which only some of the objects have been freed; the latter sets the maximum number of nodes on the per-cpu partial list. Both will come up again in the slab free path described later.

 1/* mm/slub.c */
 2
 3/*
 4 * Minimum number of partial slabs. These will be left on the partial
 5 * lists even if they are empty. kmem_cache_shrink may reclaim them.
 6 */
 7#define MIN_PARTIAL 5
 8
 9/*
10 * Maximum number of desirable partial slabs.
11 * The existence of more partial slabs makes kmem_cache_shrink
12 * sort the partial list by the number of objects in use.
13 */
14#define MAX_PARTIAL 10
15
16static void set_min_partial(struct kmem_cache *s, unsigned long min)
17{
18	if (min < MIN_PARTIAL)
19		min = MIN_PARTIAL;
20	else if (min > MAX_PARTIAL)
21		min = MAX_PARTIAL;
22	s->min_partial = min;
23}
 1/* mm/slub.c */
 2
 3static void set_cpu_partial(struct kmem_cache *s)
 4{
 5#ifdef CONFIG_SLUB_CPU_PARTIAL
 6	/*
 7	 * cpu_partial determined the maximum number of objects kept in the
 8	 * per cpu partial lists of a processor.
 9	 *
10	 * Per cpu partial lists mainly contain slabs that just have one
11	 * object freed. If they are used for allocation then they can be
12	 * filled up again with minimal effort. The slab will never hit the
13	 * per node partial lists and therefore no locking will be required.
14	 *
15	 * This setting also determines
16	 *
17	 * A) The number of objects from per cpu partial slabs dumped to the
18	 *    per node list when we reach the limit.
19	 * B) The number of objects in cpu partial slabs to extract from the
20	 *    per node list when we run out of per cpu objects. We only fetch
21	 *    50% to keep some capacity around for frees.
22	 */
23	if (!kmem_cache_has_cpu_partial(s))
24		slub_set_cpu_partial(s, 0);
25	else if (s->size >= PAGE_SIZE)
26		slub_set_cpu_partial(s, 2);
27	else if (s->size >= 1024)
28		slub_set_cpu_partial(s, 6);
29	else if (s->size >= 256)
30		slub_set_cpu_partial(s, 13);
31	else
32		slub_set_cpu_partial(s, 30);
33#endif
34}
1#define slub_cpu_partial(s)		((s)->cpu_partial)
2#define slub_set_cpu_partial(s, n)		\
3({						\
4	slub_cpu_partial(s) = (n);		\
5})
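
As a quick check against the GDB dump shown below: skbuff_head_cache has s->size = 0x100 = 256 bytes, so set_cpu_partial() takes the s->size >= 256 branch and sets cpu_partial to 13, which matches the cpu_partial = 0xd field in the dump.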

Various other members are initialized as well; an example dump looks like this:

gef> p *(struct kmem_cache *) skbuff_head_cache
$3 = {
  cpu_slab = 0x2e380,
  flags = 0x40042000,
  min_partial = 0x5,
  size = 0x100,
  object_size = 0xe0,
  reciprocal_size = {
    m = 0x1,
    sh1 = 0x1,
    sh2 = 0x7
  },
  offset = 0x70,
  cpu_partial = 0xd,
  oo = {
    x = 0x10
  },
  max = {
    x = 0x10
  },
  min = {
    x = 0x10
  },
  allocflags = 0x0,
  refcount = 0x1,
  ctor = 0x0 <fixed_percpu_data>,
  inuse = 0xe0,
  align = 0x40,
  red_left_pad = 0x0,
  name = 0xffffffff8261d1f0 "skbuff_head_cache",
  list = {
    next = 0xffff8880036a4168,
    prev = 0xffff8880036a4368
  },
  kobj = {
    name = 0xffffffff8261d1f0 "skbuff_head_cache",
    entry = {
      next = 0xffff8880036a4180,
      prev = 0xffff8880036a4380
    },
    parent = 0xffff888003629a98,
    kset = 0xffff888003629a80,
    ktype = 0xffffffff8295b620 <slab_ktype>,
    sd = 0xffff88800363ff80,
    kref = {
      refcount = {
        refs = {
          counter = 0x1
        }
      }
    },
    state_initialized = 0x1,
    state_in_sysfs = 0x1,
    state_add_uevent_sent = 0x0,
    state_remove_uevent_sent = 0x0,
    uevent_suppress = 0x0
  },
  remote_node_defrag_ratio = 0x3e8,
  useroffset = 0x28,
  usersize = 0x30,
  node = {
    [0x0] = 0xffff8880036a5080,
    [0x1] = 0x0 <fixed_percpu_data>,
    [0x2] = 0x0 <fixed_percpu_data>,
    [0x3] = 0x0 <fixed_percpu_data>,
    [0x4] = 0x0 <fixed_percpu_data>,
    [0x5] = 0x0 <fixed_percpu_data>,
    [0x6] = 0x0 <fixed_percpu_data>,
    [0x7] = 0x2e3a0,
    [0x8] = 0x40042000,
    [0x9] = 0x5 <fixed_percpu_data+5>,
    [0xa] = 0x20000000200,
    [0xb] = 0x80100000001,
    [0xc] = 0xd000000e0,
    [0xd] = 0x1001000010010,
    [0xe] = 0x4000000000008,
    [0xf] = 0x2 <fixed_percpu_data+2>,
    [0x10] = 0x0 <fixed_percpu_data>,
    [0x11] = 0x4000000200,
    [0x12] = 0x0 <fixed_percpu_data>,
    [0x13] = 0xffffffff8261d202,
    [0x14] = 0xffff8880036a4268,
    [0x15] = 0xffff8880036a4468,
    [0x16] = 0xffff8880034fc2d0,
    [0x17] = 0xffff8880036a4280,
    [0x18] = 0xffff8880036a4480,
    [0x19] = 0xffff888003629a98,
    [0x1a] = 0xffff888003629a80,
    [0x1b] = 0xffffffff8295b620 <slab_ktype>,
    [0x1c] = 0xffff88800363f000,
    [0x1d] = 0x300000001,
    [0x1e] = 0x3e8,
    [0x1f] = 0x0 <fixed_percpu_data>,
    [0x20] = 0xffff8880036a50c0,
    [0x21] = 0x0 <fixed_percpu_data>,
    [0x22] = 0x0 <fixed_percpu_data>,
    [0x23] = 0x0 <fixed_percpu_data>,
    [0x24] = 0x0 <fixed_percpu_data>,
    [0x25] = 0x0 <fixed_percpu_data>,
    [0x26] = 0x0 <fixed_percpu_data>,
    [0x27] = 0x2e3c0,
    [0x28] = 0x40022000,
    [0x29] = 0x5 <fixed_percpu_data+5>,
    [0x2a] = 0x30000000340,
    [0x2b] = 0x9013b13b13c,
    [0x2c] = 0xd00000300,
    [0x2d] = 0x2001300020013,
    [0x2e] = 0x4001000000004,
    [0x2f] = 0x1 <fixed_percpu_data+1>,
    [0x30] = 0xffffffff818d8e40 <init_once>,
    [0x31] = 0x4000000300,
    [0x32] = 0x0 <fixed_percpu_data>,
    [0x33] = 0xffffffff8261cbe8,
    [0x34] = 0xffff8880036a4368,
    [0x35] = 0xffff888003508c68,
    [0x36] = 0xffffffff8261cbe8,
    [0x37] = 0xffff8880036a4380,
    [0x38] = 0xffff888003508c80,
    [0x39] = 0xffff888003629a98,
    [0x3a] = 0xffff888003629a80,
    [0x3b] = 0xffffffff8295b620 <slab_ktype>,
    [0x3c] = 0xffff88800363e100,
    [0x3d] = 0x300000001,
    [0x3e] = 0x3e8,
    [0x3f] = 0x0 <fixed_percpu_data>
  }
}
gef> 

SLUB allocator and SKB allocation #

The SLUB allocator was proposed as an alternative to the existing SLAB allocator, and it views a slab as a group of pages packed with objects of a particular type. A slab does not manage its freelist by itself; instead, the data needed to manage the freelist is stored in the page structure [1].

  1/* include/linux/mm_types.h */
  2
  3/*
  4 * Each physical page in the system has a struct page associated with
  5 * it to keep track of whatever it is we are using the page for at the
  6 * moment. Note that we have no way to track which tasks are using
  7 * a page, though if it is a pagecache page, rmap structures can tell us
  8 * who is mapping it.
  9 *
 10 * If you allocate the page using alloc_pages(), you can use some of the
 11 * space in struct page for your own purposes.  The five words in the main
 12 * union are available, except for bit 0 of the first word which must be
 13 * kept clear.  Many users use this word to store a pointer to an object
 14 * which is guaranteed to be aligned.  If you use the same storage as
 15 * page->mapping, you must restore it to NULL before freeing the page.
 16 *
 17 * If your page will not be mapped to userspace, you can also use the four
 18 * bytes in the mapcount union, but you must call page_mapcount_reset()
 19 * before freeing it.
 20 *
 21 * If you want to use the refcount field, it must be used in such a way
 22 * that other CPUs temporarily incrementing and then decrementing the
 23 * refcount does not cause problems.  On receiving the page from
 24 * alloc_pages(), the refcount will be positive.
 25 *
 26 * If you allocate pages of order > 0, you can use some of the fields
 27 * in each subpage, but you may need to restore some of their values
 28 * afterwards.
 29 *
 30 * SLUB uses cmpxchg_double() to atomically update its freelist and
 31 * counters.  That requires that freelist & counters be adjacent and
 32 * double-word aligned.  We align all struct pages to double-word
 33 * boundaries, and ensure that 'freelist' is aligned within the
 34 * struct.
 35 */
 36#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
 37#define _struct_page_alignment	__aligned(2 * sizeof(unsigned long))
 38#else
 39#define _struct_page_alignment
 40#endif
 41
 42struct page {
 43	unsigned long flags;		/* Atomic flags, some possibly
 44					 * updated asynchronously */
 45	/*
 46	 * Five words (20/40 bytes) are available in this union.
 47	 * WARNING: bit 0 of the first word is used for PageTail(). That
 48	 * means the other users of this union MUST NOT use the bit to
 49	 * avoid collision and false-positive PageTail().
 50	 */
 51	union {
 52		/* ... */
 53		struct {	/* slab, slob and slub */
 54			union {
 55				struct list_head slab_list;
 56				struct {	/* Partial pages */
 57					struct page *next;
 58#ifdef CONFIG_64BIT
 59					int pages;	/* Nr of pages left */
 60					int pobjects;	/* Approximate count */
 61#else
 62					short int pages;
 63					short int pobjects;
 64#endif
 65				};
 66			};
 67			struct kmem_cache *slab_cache; /* not slob */
 68			/* Double-word boundary */
 69			void *freelist;		/* first free object */
 70			union {
 71				void *s_mem;	/* slab: first object */
 72				unsigned long counters;		/* SLUB */
 73				struct {			/* SLUB */
 74					unsigned inuse:16;
 75					unsigned objects:15;
 76					unsigned frozen:1;
 77				};
 78			};
 79		};
 80		/* ... */
 81
 82		/** @rcu_head: You can use this to free a page by RCU. */
 83		struct rcu_head rcu_head;
 84	};
 85
 86	union {		/* This union is 4 bytes in size. */
 87		/*
 88		 * If the page can be mapped to userspace, encodes the number
 89		 * of times this page is referenced by a page table.
 90		 */
 91		atomic_t _mapcount;
 92
 93		/*
 94		 * If the page is neither PageSlab nor mappable to userspace,
 95		 * the value stored here may help determine what this page
 96		 * is used for.  See page-flags.h for a list of page types
 97		 * which are currently stored here.
 98		 */
 99		unsigned int page_type;
100
101		unsigned int active;		/* SLAB */
102		int units;			/* SLOB */
103	};
104
105	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
106	atomic_t _refcount;
107
108#ifdef CONFIG_MEMCG
109	unsigned long memcg_data;
110#endif
111
112	/*
113	 * On machines where all RAM is mapped into kernel address space,
114	 * we can simply calculate the virtual address. On machines with
115	 * highmem some memory is mapped into kernel virtual memory
116	 * dynamically, so we need a place to store that address.
117	 * Note that this field could be 16 bits on x86 ... ;)
118	 *
119	 * Architectures with slow multiplication can define
120	 * WANT_PAGE_VIRTUAL in asm/page.h
121	 */
122#if defined(WANT_PAGE_VIRTUAL)
123	void *virtual;			/* Kernel virtual address (NULL if
124					   not kmapped, ie. highmem) */
125#endif /* WANT_PAGE_VIRTUAL */
126
127#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
128	int _last_cpupid;
129#endif
130} _struct_page_alignment;

As the comments in the source above suggest, SLUB allocates an object by following the freelist member and keeps track of the next object to hand out via an offset. Pictured, it looks like this:

               page:
                     +-----------------------+
                     |         ...           |
    freelist ------->+-----------------------+
                     | returned object       |
                     +-----------------------+
                     |         ...           |
                     +-----------------------+<----- freelist + offset
                     | next object           |
                     +-----------------------+
                     |         ...           |
                     +-----------------------+
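
The following small userspace sketch mimics the layout in the figure: the pointer to the next free object is stored inside each free object at a fixed offset, and allocation simply pops the head of the freelist. The object size, offset value, and all function names here are made up for illustration and are not the kernel's.

/* Userspace sketch of a SLUB-style freelist (illustration only). */
#include <stdio.h>
#include <string.h>

#define OBJ_SIZE 0x100 /* size of one object including metadata (example) */
#define OFFSET   0x70  /* where the free pointer lives in a free object    */
#define NR_OBJS  4

static unsigned char slab_page[OBJ_SIZE * NR_OBJS]; /* the "slab" */

/* Read the next-free pointer stored inside a free object. */
static void *get_freepointer_model(void *object)
{
	void *next;

	memcpy(&next, (unsigned char *)object + OFFSET, sizeof(next));
	return next;
}

/* Store the next-free pointer inside a free object. */
static void set_freepointer_model(void *object, void *next)
{
	memcpy((unsigned char *)object + OFFSET, &next, sizeof(next));
}

int main(void)
{
	void *freelist = NULL;

	/* Build the initial freelist: each free object points to the next one. */
	for (int i = NR_OBJS - 1; i >= 0; i--) {
		void *obj = slab_page + i * OBJ_SIZE;

		set_freepointer_model(obj, freelist);
		freelist = obj;
	}

	/* "Allocate": pop the head; the pointer stored inside it becomes the new head. */
	while (freelist) {
		void *object = freelist;

		freelist = get_freepointer_model(object);
		printf("allocated object at %p\n", object);
	}

	return 0;
}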

Now let us follow the source code through which SLUB allocates a struct sk_buff [1].

The entry point is alloc_skb(), a wrapper function around __alloc_skb(). (Note that the call trace of kmalloc() converges with it at slab_alloc_node().)

/* function call trace */

alloc_skb()                                                 kmalloc()
__alloc_skb() /* flags = 0, node = NUMA_NO_NODE = -1 */ __kmalloc() /* size<pagesize*2 */
kmem_cache_alloc_node() /* cache = skbuff_head_cache, */        |
                        /* gfp_mask & ~GFP_DMA, */              +-----------+
                        /* node = NUMA_NO_NODE = -1 */          |           |
slab_alloc_node() /* s = cache = skbuff_head_cache, */          V           V
                  /* gfpflags = gfp_mask & ~GFP_DMA, */  kmalloc_slab()  slab_alloc()
                  /* node = NUMA_NO_NODE = -1, */                       slab_alloc_node()  
                  /* addr = _RET_IP_, */
                  /* orig_size = s->object_size */

Looking at the implementation of slab_alloc_node() from the call trace above, the different allocation scenarios start to appear. If this is the very first allocation, or the slab cache has been fully consumed, the freelist and page members of struct kmem_cache_cpu will not hold valid memory addresses. Otherwise they will hold usable addresses, and the allocation can be served directly from them.

  1/* mm/slub.c */
  2
  3/*
  4 * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  5 * have the fastpath folded into their functions. So no function call
  6 * overhead for requests that can be satisfied on the fastpath.
  7 *
  8 * The fastpath works by first checking if the lockless freelist can be used.
  9 * If not then __slab_alloc is called for slow processing.
 10 *
 11 * Otherwise we can simply pick the next object from the lockless free list.
 12 */
 13static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 14		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
 15{
 16	void *object;
 17	struct kmem_cache_cpu *c;
 18	struct page *page;
 19	unsigned long tid;
 20	struct obj_cgroup *objcg = NULL;
 21	bool init = false;
 22
 23	/* ... */
 24
 25redo:
 26	/*
 27	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
 28	 * enabled. We may switch back and forth between cpus while
 29	 * reading from one cpu area. That does not matter as long
 30	 * as we end up on the original cpu again when doing the cmpxchg.
 31	 *
 32	 * We must guarantee that tid and kmem_cache_cpu are retrieved on the
 33	 * same cpu. We read first the kmem_cache_cpu pointer and use it to read
 34	 * the tid. If we are preempted and switched to another cpu between the
 35	 * two reads, it's OK as the two are still associated with the same cpu
 36	 * and cmpxchg later will validate the cpu.
 37	 */
 38	c = raw_cpu_ptr(s->cpu_slab);
 39	tid = READ_ONCE(c->tid);
 40
 41	/*
 42	 * Irqless object alloc/free algorithm used here depends on sequence
 43	 * of fetching cpu_slab's data. tid should be fetched before anything
 44	 * on c to guarantee that object and page associated with previous tid
 45	 * won't be used with current tid. If we fetch tid first, object and
 46	 * page could be one associated with next tid and our alloc/free
 47	 * request will be failed. In this case, we will retry. So, no problem.
 48	 */
 49	barrier();
 50
 51	/*
 52	 * The transaction ids are globally unique per cpu and per operation on
 53	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
 54	 * occurs on the right processor and that there was no operation on the
 55	 * linked list in between.
 56	 */
 57
 58	object = c->freelist;
 59	page = c->page;
 60	/*
 61	 * We cannot use the lockless fastpath on PREEMPT_RT because if a
 62	 * slowpath has taken the local_lock_irqsave(), it is not protected
 63	 * against a fast path operation in an irq handler. So we need to take
 64	 * the slow path which uses local_lock. It is still relatively fast if
 65	 * there is a suitable cpu freelist.
 66	 */
 67	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
 68	    unlikely(!object || !page || !node_match(page, node))) {
 69		object = __slab_alloc(s, gfpflags, node, addr, c);
 70	} else {
 71		void *next_object = get_freepointer_safe(s, object);
 72
 73		/*
 74		 * The cmpxchg will only match if there was no additional
 75		 * operation and if we are on the right processor.
 76		 *
 77		 * The cmpxchg does the following atomically (without lock
 78		 * semantics!)
 79		 * 1. Relocate first pointer to the current per cpu area.
 80		 * 2. Verify that tid and freelist have not been changed
 81		 * 3. If they were not changed replace tid and freelist
 82		 *
 83		 * Since this is without lock semantics the protection is only
 84		 * against code executing on this cpu *not* from access by
 85		 * other cpus.
 86		 */
 87		if (unlikely(!this_cpu_cmpxchg_double(
 88				s->cpu_slab->freelist, s->cpu_slab->tid,
 89				object, tid,
 90				next_object, next_tid(tid)))) {
 91
 92			note_cmpxchg_failure("slab_alloc", s, tid);
 93			goto redo;
 94		}
 95		prefetch_freepointer(s, next_object);
 96		stat(s, ALLOC_FASTPATH);
 97	}
 98
 99	maybe_wipe_obj_freeptr(s, object);
100	init = slab_want_init_on_alloc(gfpflags, s);
101
102out:
103	slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
104
105	return object;
106}

Continuing onward, the call trace can be summarized as follows. (Again, note that the kmalloc() call trace converges at slab_alloc_node().)

/* function call trace */

alloc_skb()                                                 kmalloc()
__alloc_skb() /* flags = 0, node = NUMA_NO_NODE = -1 */ __kmalloc() /* size<pagesize*2 */
kmem_cache_alloc_node() /* cache = skbuff_head_cache, */        |
                        /* gfp_mask & ~GFP_DMA, */              +-----------+
                        /* node = NUMA_NO_NODE = -1 */          |           |
slab_alloc_node() /* s = cache = skbuff_head_cache, */          V           V
     |            /* gfpflags = gfp_mask & ~GFP_DMA, */  kmalloc_slab()  slab_alloc()
     |            /* node = NUMA_NO_NODE = -1, */                       slab_alloc_node()  
     |            /* addr = _RET_IP_, */
     |            /* orig_size = s->object_size */
     |
     +-------------------+
     |slowpath           |fastpath (slab cache freelist)
     |                   |
     |no freelist or     |
     |no page            |--------------------------+
     |                   |1                         |2
     V                   V                          V
__slab_alloc()         get_freepointer_safe()   this_cpu_cmpxchg_double()
___slab_alloc()                                 raw_cpu_cmpxchg_double()
     |                                          __pcpu_double_call_return_bool()
     |                                          raw_cpu_cmpxchg_double_[1248]()
     |                                          raw_cpu_generic_cmpxchg_double()
     |
     |
     +---------------------------+------------------------+
     |3                          |1                       |2
     |no page                    |per-cpu partial list    |NUMA node partial list
     |                           |                        |
     V                           V                        V
new_slab()                slub_percpu_partial()       get_partial()
allocate_slab()
alloc_slab_page()
alloc_pages()
alloc_pages_node()
__alloc_pages_node()
__alloc_pages() /* This is the 'heart' of the zoned buddy allocator */
get_page_from_freelist()
rmqueue()
    |
    |order <= 3
    |
    V
rmqueue_pcplist()
__rmqueue_pcplist() /* Remove page from the per-cpu list, caller must protect the list */
    |
    |if the list is empty
    |
    V
rmqueue_bulk() /* Obtain a specified number of elements from the buddy allocator */

Looking at the call trace above, it is clear that the initial allocation for a slab gets its pages from the page allocator (the zoned buddy allocator); in particular, when the per-cpu page list (a linked list) is empty, pages are pulled in from the buddy allocator. Here we focus on the slab fastpath. Reading the implementations of get_freepointer_safe() and slab_alloc_node(), we can see that the address of the next object to hand out is read from the slab cache's freelist plus s->offset. On top of that, when CONFIG_SLAB_FREELIST_HARDENED is enabled, an additional transformation is applied to the stored pointer (see freelist_ptr()).

 1/* mm/slub.c */
 2
 3static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
 4{
 5	unsigned long freepointer_addr;
 6	void *p;
 7
 8	if (!debug_pagealloc_enabled_static())
 9		return get_freepointer(s, object);
10
11	object = kasan_reset_tag(object);
12	freepointer_addr = (unsigned long)object + s->offset;
13	copy_from_kernel_nofault(&p, (void **)freepointer_addr, sizeof(p));
14	return freelist_ptr(s, p, freepointer_addr);
15}
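
As far as I understand, the additional transformation mentioned above comes from CONFIG_SLAB_FREELIST_HARDENED: freelist_ptr() XORs the stored free pointer with a per-cache random value (s->random) and the byte-swapped address of the slot holding it. The standalone sketch below only illustrates that idea with made-up values; it relies on the GCC/Clang __builtin_bswap64() builtin and is not the kernel implementation.

/* Userspace sketch of free-pointer obfuscation (illustration only). */
#include <stdio.h>
#include <stdint.h>

static uint64_t cache_random = 0x5eedf00ddeadbeefULL; /* models s->random */

/* Encode on store, decode on load: the same XOR works in both directions. */
static uint64_t obfuscate(uint64_t ptr, uint64_t ptr_addr)
{
	return ptr ^ cache_random ^ __builtin_bswap64(ptr_addr);
}

int main(void)
{
	uint64_t next_free = 0xffff888003400100ULL; /* next free object (example)   */
	uint64_t slot      = 0xffff888003400070ULL; /* object + s->offset (example) */

	uint64_t stored  = obfuscate(next_free, slot); /* what ends up in memory   */
	uint64_t decoded = obfuscate(stored, slot);    /* what the allocator reads */

	printf("stored : 0x%llx\n", (unsigned long long)stored);
	printf("decoded: 0x%llx\n", (unsigned long long)decoded); /* == next_free */

	return 0;
}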

The address obtained this way is then written to the freelist in raw_cpu_generic_cmpxchg_double().

 1/* include/asm-generic/percpu.h */
 2/* 
 3   pcp1 = s->cpu_slab->freelist
 4   pcp2 = s->cpu_slab->tid
 5   oval1 = object (= freelist)
 6   oval2 = tid
 7   nval1 = next_object (= get_freepointer_safe())
 8   nval2 = next_tid(tid)
 9*/
10
11#define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
12({									\
13	typeof(pcp1) *__p1 = raw_cpu_ptr(&(pcp1));			\
14	typeof(pcp2) *__p2 = raw_cpu_ptr(&(pcp2));			\
15	int __ret = 0;							\
16	if (*__p1 == (oval1) && *__p2  == (oval2)) {			\
17		*__p1 = nval1;						\
18		*__p2 = nval2;						\
19		__ret = 1;						\
20	}								\
21	(__ret);							\
22})
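
For readers not used to the (freelist, tid) pairing, the toy model below sketches what the generic cmpxchg_double above does: both words must still hold the previously read values, otherwise the caller jumps back to redo. The struct and function names are invented for illustration; the real kernel variant operates on per-CPU data and can be backed by a single atomic instruction.

/* Userspace model of the (freelist, tid) double compare-and-exchange
 * (illustration only). */
#include <stdio.h>

struct cpu_slab_model {
	void *freelist;
	unsigned long tid;
};

/* Both words must match the expected values for the update to happen. */
static int cmpxchg_double_model(struct cpu_slab_model *c,
				void *old_freelist, unsigned long old_tid,
				void *new_freelist, unsigned long new_tid)
{
	if (c->freelist == old_freelist && c->tid == old_tid) {
		c->freelist = new_freelist;
		c->tid = new_tid;
		return 1;
	}
	return 0; /* something changed in between: the caller must redo */
}

int main(void)
{
	long objects[2];
	struct cpu_slab_model c = { .freelist = &objects[0], .tid = 0 };

	/* Succeeds: nothing changed between the read and the update. */
	int ok = cmpxchg_double_model(&c, &objects[0], 0, &objects[1], 1);
	printf("first attempt:  %s\n", ok ? "fastpath taken" : "redo");

	/* Fails: the expected (freelist, tid) pair is stale, so we would redo. */
	ok = cmpxchg_double_model(&c, &objects[0], 0, &objects[1], 2);
	printf("second attempt: %s\n", ok ? "fastpath taken" : "redo");

	return 0;
}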

To summarize, when allocating kernel objects, SKBs in particular, the sources are tried in the following order:

  1. Slab cache (skbuff cache) freelist
  2. SLUB per-cpu partial list
  3. NUMA node partial list
  4. Page allocator (zoned buddy allocator)
    1. Per-cpu freelist
    2. Other paths ...

SLUB allocator and SKB deallocation #

When a slab object is freed, the SLUB allocator puts the page it belongs to on a partial list. If, however, the entire slab has become free, the page is handed back to the page allocator [1].

Freeing an SKB starts from kfree_skb(). (Note that the call trace of kfree() converges with it at slab_free().)

/* function call trace */

kfree_skb()                                           kfree()
__kfree_skb()                                         slab_free()
kfree_skbmem()
kmem_cache_free()
slab_free() /* s = skbuff_head_cache
               page = virt_to_head_page(skb)
               head = skb
               tail = NULL
               cnt = 1
               addr = _RET_IP_ */
do_slab_free() /* this function is fastpath */

If the page that the object being freed belongs to is the same page the cpu slab is currently pointing at, the work can be finished simply by manipulating that freelist. This is the fastpath of slab free; if the condition is not met, we fall through to the slowpath.

 1/* mm/slub.c */
 2
 3/*
 4 * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
 5 * can perform fastpath freeing without additional function calls.
 6 *
 7 * The fastpath is only possible if we are freeing to the current cpu slab
 8 * of this processor. This typically the case if we have just allocated
 9 * the item before.
10 *
11 * If fastpath is not possible then fall back to __slab_free where we deal
12 * with all sorts of special processing.
13 *
14 * Bulk free of a freelist with several objects (all pointing to the
15 * same page) possible by specifying head and tail ptr, plus objects
16 * count (cnt). Bulk free indicated by tail pointer being set.
17 */
18static __always_inline void do_slab_free(struct kmem_cache *s,
19				struct page *page, void *head, void *tail,
20				int cnt, unsigned long addr)
21{
22	void *tail_obj = tail ? : head;
23	struct kmem_cache_cpu *c;
24	unsigned long tid;
25
26	/* memcg_slab_free_hook() is already called for bulk free. */
27	if (!tail)
28		memcg_slab_free_hook(s, &head, 1);
29redo:
30	/*
31	 * Determine the currently cpus per cpu slab.
32	 * The cpu may change afterward. However that does not matter since
33	 * data is retrieved via this pointer. If we are on the same cpu
34	 * during the cmpxchg then the free will succeed.
35	 */
36	c = raw_cpu_ptr(s->cpu_slab);
37	tid = READ_ONCE(c->tid);
38
39	/* Same with comment on barrier() in slab_alloc_node() */
40	barrier();
41
42	if (likely(page == c->page)) {
43#ifndef CONFIG_PREEMPT_RT
44		/* ... */
45#else /* CONFIG_PREEMPT_RT */
46		/*
47		 * We cannot use the lockless fastpath on PREEMPT_RT because if
48		 * a slowpath has taken the local_lock_irqsave(), it is not
49		 * protected against a fast path operation in an irq handler. So
50		 * we need to take the local_lock. We shouldn't simply defer to
51		 * __slab_free() as that wouldn't use the cpu freelist at all.
52		 */
53		void **freelist;
54
55		local_lock(&s->cpu_slab->lock);
56		c = this_cpu_ptr(s->cpu_slab);
57		if (unlikely(page != c->page)) {
58			local_unlock(&s->cpu_slab->lock);
59			goto redo;
60		}
61		tid = c->tid;
62		freelist = c->freelist;
63
64		set_freepointer(s, tail_obj, freelist);
65		c->freelist = head;
66		c->tid = next_tid(tid);
67
68		local_unlock(&s->cpu_slab->lock);
69#endif
70		stat(s, FREE_FASTPATH);
71	} else
72		__slab_free(s, page, head, tail_obj, cnt, addr);
73
74}
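
The fastpath shown above (the PREEMPT_RT branch; the non-RT branch is elided) boils down to pushing the freed object onto the front of the cpu freelist. A minimal userspace sketch of that linking step, with invented names and an example offset, looks like this:

/* Userspace sketch of the free fastpath linking step (illustration only). */
#include <stdio.h>
#include <string.h>

#define OFFSET 0x70 /* free-pointer offset inside an object (example value) */

static unsigned char obj_a[0x100], obj_b[0x100]; /* two "objects" */

/* Store the old freelist head inside the object being freed. */
static void set_freepointer_model(void *object, void *next)
{
	memcpy((unsigned char *)object + OFFSET, &next, sizeof(next));
}

int main(void)
{
	void *cpu_freelist = obj_a; /* current head of c->freelist */

	/* Freeing obj_b: set_freepointer(s, tail_obj, freelist); c->freelist = head; */
	set_freepointer_model(obj_b, cpu_freelist);
	cpu_freelist = obj_b;

	printf("new freelist head: %p (obj_b is at %p)\n", cpu_freelist, (void *)obj_b);

	return 0;
}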

Following the slowpath further, the call trace can be drawn as follows. (In the call trace below, "else condition" marks the path taken when none of the other conditions hold; also note that the kfree() call trace converges at slab_free().)

/* function call trace */

kfree_skb()                                           kfree()
__kfree_skb()                                         slab_free()
kfree_skbmem()
kmem_cache_free()
slab_free() /* s = skbuff_head_cache,
               page = virt_to_head_page(skb),
               head = skb,
               tail = NULL,
               cnt = 1,
               addr = _RET_IP_ */
do_slab_free() /* this function is fastpath */
     |
     |slowpath
     |
     V
__slab_free() /* s = skbuff_head_cache,
     |           page = virt_to_head_page(skb),
     |           head = skb,
     |           tail = skb,
     |           cnt = 1,
     |           addr = _RET_IP_ */
     |
     +---------------------------+---------------------------------+
     |if slab is empty           |likely,                          |
     |                           |if slab is full                  |
     |                           |                                 |
     |                           |                                 |
     |page list                  |per-cpu partial list             |NUMA node partial
     |                           |                                 |list
     |                           |                                 |
     V                           V                                 V
discard_slab()<--+            put_cpu_partial()/* drain=1 */ add_partial() <---+
free_slab()      |               |                           __add_partial()   |
__free_slab()    |slab empty     |per-cpu partial list                         |else
__free_pages()   |               |is full                                      |condition
free_the_page()  |               |                                             |
    |            |               V                                             |
    |            +----------- __unfreeze_partials()----------------------------+
    |order <= 3
    |
    V
free_unref_page()
free_unref_page_commit() /* Free a pcp page */

The conditions labeled "slab is empty", "slab is full", and "per-cpu partial list is full" in the call trace above can all be found in __slab_free().

 1/* mm/slab.h */
 2
 3/*
 4 * The slab lists for all objects.
 5 */
 6struct kmem_cache_node {
 7	spinlock_t list_lock;
 8
 9#ifdef CONFIG_SLAB
10	/* ... */
11#endif
12
13#ifdef CONFIG_SLUB
14	unsigned long nr_partial;
15	struct list_head partial;
16#ifdef CONFIG_SLUB_DEBUG
17	atomic_long_t nr_slabs;
18	atomic_long_t total_objects;
19	struct list_head full;
20#endif
21#endif
22
23};
  1/* mm/slub.c */
  2
  3/*
  4 * Slow path handling. This may still be called frequently since objects
  5 * have a longer lifetime than the cpu slabs in most processing loads.
  6 *
  7 * So we still attempt to reduce cache line usage. Just take the slab
  8 * lock and free the item. If there is no additional partial page
  9 * handling required then we can return immediately.
 10 */
 11static void __slab_free(struct kmem_cache *s, struct page *page,
 12			void *head, void *tail, int cnt,
 13			unsigned long addr)
 14
 15{
 16	void *prior;
 17	int was_frozen;
 18	struct page new;
 19	unsigned long counters;
 20	struct kmem_cache_node *n = NULL;
 21	unsigned long flags;
 22
 23	/* ... */
 24  
 25	do {
 26		if (unlikely(n)) {
 27			spin_unlock_irqrestore(&n->list_lock, flags);
 28			n = NULL;
 29		}
 30		prior = page->freelist;
 31		counters = page->counters;
 32		set_freepointer(s, tail, prior);
 33		new.counters = counters;
 34		was_frozen = new.frozen;
 35		new.inuse -= cnt;
 36		if ((!new.inuse || !prior) && !was_frozen) {
 37
 38			if (kmem_cache_has_cpu_partial(s) && !prior) {
 39
 40				/*
 41				 * Slab was on no list before and will be
 42				 * partially empty
 43				 * We can defer the list move and instead
 44				 * freeze it.
 45				 */
 46				new.frozen = 1;
 47
 48			} else { /* Needs to be taken off a list */
 49
 50				n = get_node(s, page_to_nid(page));
 51				/*
 52				 * Speculatively acquire the list_lock.
 53				 * If the cmpxchg does not succeed then we may
 54				 * drop the list_lock without any processing.
 55				 *
 56				 * Otherwise the list_lock will synchronize with
 57				 * other processors updating the list of slabs.
 58				 */
 59				spin_lock_irqsave(&n->list_lock, flags);
 60
 61			}
 62		}
 63
 64	} while (!cmpxchg_double_slab(s, page,
 65		prior, counters,
 66		head, new.counters,
 67		"__slab_free"));
 68
 69	if (likely(!n)) {
 70
 71		if (likely(was_frozen)) {
 72			/*
 73			 * The list lock was not taken therefore no list
 74			 * activity can be necessary.
 75			 */
 76			stat(s, FREE_FROZEN);
 77		} else if (new.frozen) {
 78			/*
 79			 * If we just froze the page then put it onto the
 80			 * per cpu partial list.
 81			 */
 82			put_cpu_partial(s, page, 1);
 83			stat(s, CPU_PARTIAL_FREE);
 84		}
 85
 86		return;
 87	}
 88
 89	if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
 90		goto slab_empty;
 91
 92	/*
 93	 * Objects left in the slab. If it was not on the partial list before
 94	 * then add it.
 95	 */
 96	if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
 97		remove_full(s, n, page);
 98		add_partial(n, page, DEACTIVATE_TO_TAIL);
 99		stat(s, FREE_ADD_PARTIAL);
100	}
101	spin_unlock_irqrestore(&n->list_lock, flags);
102	return;
103
104slab_empty:
105    if (prior) {
106		/*
107		 * Slab on the partial list.
108		 */
109		remove_partial(n, page);
110		stat(s, FREE_REMOVE_PARTIAL);
111	} else {
112		/* Slab must be on the full list */
113		remove_full(s, n, page);
114	}
115  
116	/* ... */
117    discard_slab(s, page);
118}
 1/* mm/slub.c */
 2
 3static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
 4		void *freelist_old, unsigned long counters_old,
 5		void *freelist_new, unsigned long counters_new,
 6		const char *n)
 7{
 8#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
 9    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
10	if (s->flags & __CMPXCHG_DOUBLE) {
11		if (cmpxchg_double(&page->freelist, &page->counters,
12				   freelist_old, counters_old,
13				   freelist_new, counters_new))
14			return true;
15	} else
16#endif
17	{
18		unsigned long flags;
19
20		local_irq_save(flags);
21		__slab_lock(page);
22		if (page->freelist == freelist_old &&
23					page->counters == counters_old) {
24			page->freelist = freelist_new;
25			page->counters = counters_new;
26			__slab_unlock(page);
27			local_irq_restore(flags);
28			return true;
29		}
30		__slab_unlock(page);
31		local_irq_restore(flags);
32	}
33
34	cpu_relax();
35	stat(s, CMPXCHG_DOUBLE_FAIL);
36
37#ifdef SLUB_DEBUG_CMPXCHG
38	pr_info("%s %s: cmpxchg double redo ", n, s->name);
39#endif
40
41	return false;
42}

Note that, when deciding whether the slab is empty, the number of pages on the NUMA node's partial list (n->nr_partial) is compared against s->min_partial. Taking the slab_empty path therefore means that the page being freed has no objects in use and that the NUMA node already holds at least the minimum number of partial slabs.

References #

  1. Jonathan Corbet, Apr. 2007, "The SLUB allocator," LWN. [Online]. Available: https://lwn.net/Articles/229984/
  2. Jeff Bonwick, "The Slab Allocator: An Object-Caching Kernel Memory Allocator," in Proc. USENIX Summer 1994 Technical Conference, Boston, Massachusetts, USA, Jun. 1994, pp. 87 -- 98.
  3. Jonathan Corbet, Jul. 2011, "Per-CPU variables and the realtime tree," LWN. [Online]. Available: https://lwn.net/Articles/452884/
  4. Kanoj Sarcar, et al., Nov. 1999, "What is NUMA?," Linux Memory Management Documentation. [Online]. Available: https://www.kernel.org/doc/html/v4.18/vm/numa.html
  5. "Optimizing Applications for NUMA," Intel Guide for Developing Multithreaded Applications. [Online]. Available: https://www.intel.com/content/dam/develop/external/us/en/documents/3-5-memmgt-optimizing-applications-for-numa-184398.pdf