Table of Contents
- Introduction
- Per-CPU
- NUMA Node
- Kernel Memory Cache and Socket Buffer
- SLUB allocator and SKB allocation
- SLUB allocator and SKB deallocation
- References
Introduction #
Jeff Bonwick이 SLAB allocator를 소개하고 Christoph Lameter가 SLUB allocator를 대안으로 제시한 이후로 현대 리눅스 커널에는 SLUB allocator가 주로 사용되고 있다[1, 2].
본 글에서는 리눅스 커널 5.15.30 버전을 기준으로 SKB (struct sk_buff
)의 할당과 해제를 통해 SLUB allocator 구현을 살펴볼 것이다.
Per-CPU #
Symmetric multiprocessing system에서 다수의 CPU들이 핸들링하게 되는 데이터는 락으로 인한 병목이나 cache line bouncing 등을 초래할 수 있다. 이에 리눅스 커널 등은 각 CPU local에 가용한 데이터 공간을 두어 성능 저하를 최소화한다. 즉, CPU 마다 lockless하게 접근할 수 있는 공간을 두는 것이다[3].
리눅스 커널은 CPU에 접근하기 위한 인터페이스로 per_cpu_ptr()
나 raw_cpu_ptr()
등을 제공한다. 이들은 다음과 같은 call trace를 통해 접근할 수 있다. (raw_cpu_ptr()
의 call trace가 SHIFT_PERCPU_PTR()
에서 겹침을 확인하라)
/* function call trace */
per_cpu_ptr(ptr, cpu) raw_cpu_ptr(ptr)
SHIFT_PERCPU_PTR() /* __p = ptr, arch_raw_cpu_ptr()
__offset = per_cpu_offset((cpu)) */ SHIFT_PERCPU_PTR() /* __p = ptr
RELOC_HIDE() __offset = __my_cpu_offset */
위 call trace에서 RELOC_HIDE()
는 전달된 ptr
에 전달된 off
(= __offset
)을 더하여 CPU에 접근가능한 포인터를 반환한다. 이때 그 오프셋은 각각 다음과 같이 얻어진다.
1/* include/asm-generic/percpu.h */
2
3#include <linux/compiler.h>
4#include <linux/threads.h>
5#include <linux/percpu-defs.h>
6
7#ifdef CONFIG_SMP
8
9/*
10 * per_cpu_offset() is the offset that has to be added to a
11 * percpu variable to get to the instance for a certain processor.
12 *
13 * Most arches use the __per_cpu_offset array for those offsets but
14 * some arches have their own ways of determining the offset (x86_64, s390).
15 */
16#ifndef __per_cpu_offset
17extern unsigned long __per_cpu_offset[NR_CPUS];
18
19#define per_cpu_offset(x) (__per_cpu_offset[x])
20#endif
1/* arch/ia64/include/asm/percpu.h */
2
3#define __my_cpu_offset __ia64_per_cpu_var(local_per_cpu_offset)
4
5extern void *per_cpu_init(void);
6
7#else /* ! SMP */
8
9#define per_cpu_init() (__phys_per_cpu_start)
10
11#endif /* SMP */
12
13#define PER_CPU_BASE_SECTION ".data..percpu"
14
15/*
16 * Be extremely careful when taking the address of this variable! Due to virtual
17 * remapping, it is different from the canonical address returned by this_cpu_ptr(&var)!
18 * On the positive side, using __ia64_per_cpu_var() instead of this_cpu_ptr() is slightly
19 * more efficient.
20 */
21#define __ia64_per_cpu_var(var) (*({ \
22 __verify_pcpu_ptr(&(var)); \
23 ((typeof(var) __kernel __force *)&(var)); \
24}))
25
26#include <asm-generic/percpu.h>
27
28/* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
29DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);
이러한 per-CPU는 kernel memory cache를 초기화할 때 할당되며, 다음 call trace가 그 예시이다.
/* function call trace */
socket_init()
skb_init()
kmem_cache_create_usercopy() /* name = "skbuff_head_cache,
size = sizeof(struct sk_buff) = 0xE0,
align = 0,
flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC,
useroffset = offsetof(struct sk_buff, cb),
usersize = sizeof_field(struct sk_buff, cb),
ctor = NULL */
create_cache() /* object_size = size = sizeof(struct sk_buff) */
__kmem_cache_create()
kmeme_cache_open()
alloc_kmem_cache_cpus()
|
+-----------------------+
|1 |2
V V
__alloc_percpu() init_kmem_cache_cpus()
pcpu_alloc()
pcpu_find_block_fit()
위 calltrace의 pcpu_alloc()
에서 pcpu_setup_first_chunk()
등으로 할당된 slots인 per-CPU chunks로부터 할당하게 된다.
NUMA Node #
Non-Uniformed Memory Access (NUMA)는 일정 단위의 하드웨어 자원 (processor, I/O bus, ...)마다 가용한 메모리 부분을 나누어 symmetric multiprocessing system의 평균적인 경우에 대한 access time을 줄이기 위한 shared memory architecture이다. 리눅스 커널은 물리적 뷰에서 메모리를 포함한 각 하드웨어 자원을 cell로 구분하며, 이를 다시 node로 추상화한다. 이러한 node는 다시 cell과 매핑된다. 이를 그림으로 나타내면 다음과 같다[4, 5].
Node 1 Node 2
+-----------------------------------+ +-----------------------------------+
| +------------------------------+ | | +------------------------------+ |
| | processor1, processor2, ... | | | | processor1, processor2, ... | |
| +------------------------------+ |---------| +------------------------------+ |
| | | |inter | | | |
| | | |connected| | | |
| +--------------------------+ |---------| +------------------------+ |
| | memory | | | | memory | |
| +--------------------------+ | | +------------------------+ |
+-----------------------------------+ \ / +-----------------------------------+
\ \ / /
\ \/ /
\ /\ /
\ / \ /
\/ \/
/\ /\
Node 3 / \ / \ Node 4
+-----------------------------------+ \/ +-----------------------------------+
| +------------------------------+ | /\ | +------------------------------+ |
| | processor1, processor2, ... | | / \ | | processor1, processor2, ... | |
| +------------------------------+ |---------| +------------------------------+ |
| | | |inter | | | |
| | | |connected| | | |
| +--------------------------+ |---------| +------------------------+ |
| | memory | | | | memory | |
| +--------------------------+ | | +------------------------+ |
+-----------------------------------+ +-----------------------------------+
NUMA 아키텍처는 위 그림에서 볼 수 있듯이 local node에서의 access time과 remote node에서의 access time에 차이가 나게 된다.
리눅스 커널은 NUMA node에 접근하기 위한 인터페이스로 get_node()
등을 제공한다. 그리고 그 동작을 살펴보면 node를 인덱스로 하여 반환함을 알 수 있다.
1/* mm/slab.h */
2
3static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
4{
5 return s->node[node];
6}
이러한 NUMA node는 kernel memory cache를 초기화할 때 매핑되며, 다음 call trace를 통해 알 수 있다. (상대적 순서임에 주의하라)
/* function call trace */
socket_init()
skb_init()
kmem_cache_create_usercopy() /* name = "skbuff_head_cache,
size = sizeof(struct sk_buff) = 0xE0,
align = 0,
flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC,
useroffset = offsetof(struct sk_buff, cb),
usersize = sizeof_field(struct sk_buff, cb),
ctor = NULL */
create_cache() /* object_size = size = sizeof(struct sk_buff) */
__kmem_cache_create()
kmeme_cache_open()
init_kmem_cache_nodes()
|
+----------------------------+
|1 |2
V V
kmem_cache_alloc_node() init_kmem_cache_node()
Kernel Memory Cache and Socket Buffer #
리눅스 커널은 소켓 버퍼에 대한 slab cache를 관리하기 위해 skbuff_head_cache
와 skbuff_fclone_cache
를 둔다. 여기서 관심을 갖는 것은 skbuff_head_cache
이며, struct kmem_cache
타입으로 선언되어 있다.
1/* include/linux/skbuff.h */
2
3extern struct kmem_cache *skbuff_head_cache;
1/* include/linux/slub_def.h */
2
3/*
4 * Slab cache management.
5 */
6struct kmem_cache {
7 struct kmem_cache_cpu __percpu *cpu_slab;
8 /* Used for retrieving partial slabs, etc. */
9 slab_flags_t flags;
10 unsigned long min_partial;
11 unsigned int size; /* The size of an object including metadata */
12 unsigned int object_size;/* The size of an object without metadata */
13 struct reciprocal_value reciprocal_size;
14 unsigned int offset; /* Free pointer offset */
15#ifdef CONFIG_SLUB_CPU_PARTIAL
16 /* Number of per cpu partial objects to keep around */
17 unsigned int cpu_partial;
18#endif
19 struct kmem_cache_order_objects oo;
20
21 /* Allocation and freeing of slabs */
22 struct kmem_cache_order_objects max;
23 struct kmem_cache_order_objects min;
24 gfp_t allocflags; /* gfp flags to use on each alloc */
25 int refcount; /* Refcount for slab cache destroy */
26 void (*ctor)(void *);
27 unsigned int inuse; /* Offset to metadata */
28 unsigned int align; /* Alignment */
29 unsigned int red_left_pad; /* Left redzone padding size */
30 const char *name; /* Name (only for display!) */
31 struct list_head list; /* List of slab caches */
32#ifdef CONFIG_SYSFS
33 struct kobject kobj; /* For sysfs */
34#endif
35#ifdef CONFIG_SLAB_FREELIST_HARDENED
36 unsigned long random;
37#endif
38
39#ifdef CONFIG_NUMA
40 /*
41 * Defragmentation by allocating from a remote node.
42 */
43 unsigned int remote_node_defrag_ratio;
44#endif
45
46#ifdef CONFIG_SLAB_FREELIST_RANDOM
47 unsigned int *random_seq;
48#endif
49
50#ifdef CONFIG_KASAN
51 struct kasan_cache kasan_info;
52#endif
53
54 unsigned int useroffset; /* Usercopy region offset */
55 unsigned int usersize; /* Usercopy region size */
56
57 struct kmem_cache_node *node[MAX_NUMNODES];
58};
이 캐시는 skb_init()
에서 초기화 작업을 진행한다.
/* function call trace */
socket_init()
skb_init()
kmem_cache_create_usercopy() /* name = "skbuff_head_cache,
size = sizeof(struct sk_buff) = 0xE0,
align = 0,
flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC,
useroffset = offsetof(struct sk_buff, cb),
usersize = sizeof_field(struct sk_buff, cb),
ctor = NULL */
create_cache() /* object_size = size = sizeof(struct sk_buff) */
1static struct kmem_cache *create_cache(const char *name,
2 unsigned int object_size, unsigned int align,
3 slab_flags_t flags, unsigned int useroffset,
4 unsigned int usersize, void (*ctor)(void *),
5 struct kmem_cache *root_cache)
6{
7 struct kmem_cache *s;
8 int err;
9
10 /* ... */
11
12 err = -ENOMEM;
13 s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
14 if (!s)
15 goto out;
16
17 s->name = name;
18 s->size = s->object_size = object_size;
19 s->align = align;
20 s->ctor = ctor;
21 s->useroffset = useroffset;
22 s->usersize = usersize;
23
24 s->refcount = 1;
25 list_add(&s->list, &slab_caches);
26out:
27 if (err)
28 return ERR_PTR(err);
29 return s;
30
31out_free_cache:
32 kmem_cache_free(kmem_cache, s);
33 goto out;
34}
따라서 s->object_size
는 struct sk_buff
타입의 크기임을 알 수 있고, GDB로 확인해보면 0xe0
바이트임을 알 수 있다. 그리고 struct kmem_cache
에서 중요한 멤버로 struct kmem_cache_cpu
가 있다. 이는 per-cpu 오브젝트이며, 다음과 같은 call trace를 거쳐 할당되고 초기화된다. (아래 call trace가 모든 함수를 표기하지 않았으며, 상대적 순서임에 주의하라)
/* function call trace */
socket_init()
skb_init()
kmem_cache_create_usercopy() /* name = "skbuff_head_cache,
size = sizeof(struct sk_buff) = 0xE0,
align = 0,
flags = SLAB_HWCACHE_ALIGN | SLAB_PANIC,
useroffset = offsetof(struct sk_buff, cb),
usersize = sizeof_field(struct sk_buff, cb),
ctor = NULL */
create_cache() /* object_size = size = sizeof(struct sk_buff) */
__kmem_cache_create()
kmeme_cache_open()
|
|
+---------------------------+------------------------------------+
|1 |2 |3
| | |
V V V
alloc_kmem_cache_cpus() set_min_partial() /* s = s, */ set_cpu_partial()
| /* min = ilog2(s->size / 2) */
+-----------------------+
|1 |2
V V
__alloc_percpu() init_kmem_cache_cpus()
pcpu_alloc()
pcpu_find_block_fit()
1/* include/linux/slub_def.h */
2
3/*
4 * When changing the layout, make sure freelist and tid are still compatible
5 * with this_cpu_cmpxchg_double() alignment requirements.
6 */
7struct kmem_cache_cpu {
8 void **freelist; /* Pointer to next available object */
9 unsigned long tid; /* Globally unique transaction id */
10 struct page *page; /* The slab from which we are allocating */
11#ifdef CONFIG_SLUB_CPU_PARTIAL
12 struct page *partial; /* Partially allocated frozen slabs */
13#endif
14 local_lock_t lock; /* Protects the fields above */
15#ifdef CONFIG_SLUB_STATS
16 unsigned stat[NR_SLUB_STAT_ITEMS];
17#endif
18};
1/* mm/slub.c */
2
3static void init_kmem_cache_cpus(struct kmem_cache *s)
4{
5 int cpu;
6 struct kmem_cache_cpu *c;
7
8 for_each_possible_cpu(cpu) {
9 c = per_cpu_ptr(s->cpu_slab, cpu);
10 local_lock_init(&c->lock);
11 c->tid = init_tid(cpu);
12 }
13}
또, 살펴볼만한 것으로 set_min_partial()
와 set_cpu_partial()
이 있다. 전자는 일부의 오브젝트만 free된 slab의 linked list인 partial list의 최소 노드 수를 설정한다. 그리고 후자는 per-cpu partial list의 최대 노드 수를 설정한다. 이들은 후에 설명할 slab free 과정에서 다시 나올 것이다.
1/* mm/slub.c */
2
3/*
4 * Minimum number of partial slabs. These will be left on the partial
5 * lists even if they are empty. kmem_cache_shrink may reclaim them.
6 */
7#define MIN_PARTIAL 5
8
9/*
10 * Maximum number of desirable partial slabs.
11 * The existence of more partial slabs makes kmem_cache_shrink
12 * sort the partial list by the number of objects in use.
13 */
14#define MAX_PARTIAL 10
15
16static void set_min_partial(struct kmem_cache *s, unsigned long min)
17{
18 if (min < MIN_PARTIAL)
19 min = MIN_PARTIAL;
20 else if (min > MAX_PARTIAL)
21 min = MAX_PARTIAL;
22 s->min_partial = min;
23}
1/* mm/slub.c */
2
3static void set_cpu_partial(struct kmem_cache *s)
4{
5#ifdef CONFIG_SLUB_CPU_PARTIAL
6 /*
7 * cpu_partial determined the maximum number of objects kept in the
8 * per cpu partial lists of a processor.
9 *
10 * Per cpu partial lists mainly contain slabs that just have one
11 * object freed. If they are used for allocation then they can be
12 * filled up again with minimal effort. The slab will never hit the
13 * per node partial lists and therefore no locking will be required.
14 *
15 * This setting also determines
16 *
17 * A) The number of objects from per cpu partial slabs dumped to the
18 * per node list when we reach the limit.
19 * B) The number of objects in cpu partial slabs to extract from the
20 * per node list when we run out of per cpu objects. We only fetch
21 * 50% to keep some capacity around for frees.
22 */
23 if (!kmem_cache_has_cpu_partial(s))
24 slub_set_cpu_partial(s, 0);
25 else if (s->size >= PAGE_SIZE)
26 slub_set_cpu_partial(s, 2);
27 else if (s->size >= 1024)
28 slub_set_cpu_partial(s, 6);
29 else if (s->size >= 256)
30 slub_set_cpu_partial(s, 13);
31 else
32 slub_set_cpu_partial(s, 30);
33#endif
34}
1#define slub_cpu_partial(s) ((s)->cpu_partial)
2#define slub_set_cpu_partial(s, n) \
3({ \
4 slub_cpu_partial(s) = (n); \
5})
이밖에도 여러 멤버 변수들이 초기화되며, 그 예시는 다음과 같다:
gef> p *(struct kmem_cache *) skbuff_head_cache
$3 = {
cpu_slab = 0x2e380,
flags = 0x40042000,
min_partial = 0x5,
size = 0x100,
object_size = 0xe0,
reciprocal_size = {
m = 0x1,
sh1 = 0x1,
sh2 = 0x7
},
offset = 0x70,
cpu_partial = 0xd,
oo = {
x = 0x10
},
max = {
x = 0x10
},
min = {
x = 0x10
},
allocflags = 0x0,
refcount = 0x1,
ctor = 0x0 <fixed_percpu_data>,
inuse = 0xe0,
align = 0x40,
red_left_pad = 0x0,
name = 0xffffffff8261d1f0 "skbuff_head_cache",
list = {
next = 0xffff8880036a4168,
prev = 0xffff8880036a4368
},
kobj = {
name = 0xffffffff8261d1f0 "skbuff_head_cache",
entry = {
next = 0xffff8880036a4180,
prev = 0xffff8880036a4380
},
parent = 0xffff888003629a98,
kset = 0xffff888003629a80,
ktype = 0xffffffff8295b620 <slab_ktype>,
sd = 0xffff88800363ff80,
kref = {
refcount = {
refs = {
counter = 0x1
}
}
},
state_initialized = 0x1,
state_in_sysfs = 0x1,
state_add_uevent_sent = 0x0,
state_remove_uevent_sent = 0x0,
uevent_suppress = 0x0
},
remote_node_defrag_ratio = 0x3e8,
useroffset = 0x28,
usersize = 0x30,
node = {
[0x0] = 0xffff8880036a5080,
[0x1] = 0x0 <fixed_percpu_data>,
[0x2] = 0x0 <fixed_percpu_data>,
[0x3] = 0x0 <fixed_percpu_data>,
[0x4] = 0x0 <fixed_percpu_data>,
[0x5] = 0x0 <fixed_percpu_data>,
[0x6] = 0x0 <fixed_percpu_data>,
[0x7] = 0x2e3a0,
[0x8] = 0x40042000,
[0x9] = 0x5 <fixed_percpu_data+5>,
[0xa] = 0x20000000200,
[0xb] = 0x80100000001,
[0xc] = 0xd000000e0,
[0xd] = 0x1001000010010,
[0xe] = 0x4000000000008,
[0xf] = 0x2 <fixed_percpu_data+2>,
[0x10] = 0x0 <fixed_percpu_data>,
[0x11] = 0x4000000200,
[0x12] = 0x0 <fixed_percpu_data>,
[0x13] = 0xffffffff8261d202,
[0x14] = 0xffff8880036a4268,
[0x15] = 0xffff8880036a4468,
[0x16] = 0xffff8880034fc2d0,
[0x17] = 0xffff8880036a4280,
[0x18] = 0xffff8880036a4480,
[0x19] = 0xffff888003629a98,
[0x1a] = 0xffff888003629a80,
[0x1b] = 0xffffffff8295b620 <slab_ktype>,
[0x1c] = 0xffff88800363f000,
[0x1d] = 0x300000001,
[0x1e] = 0x3e8,
[0x1f] = 0x0 <fixed_percpu_data>,
[0x20] = 0xffff8880036a50c0,
[0x21] = 0x0 <fixed_percpu_data>,
[0x22] = 0x0 <fixed_percpu_data>,
[0x23] = 0x0 <fixed_percpu_data>,
[0x24] = 0x0 <fixed_percpu_data>,
[0x25] = 0x0 <fixed_percpu_data>,
[0x26] = 0x0 <fixed_percpu_data>,
[0x27] = 0x2e3c0,
[0x28] = 0x40022000,
[0x29] = 0x5 <fixed_percpu_data+5>,
[0x2a] = 0x30000000340,
[0x2b] = 0x9013b13b13c,
[0x2c] = 0xd00000300,
[0x2d] = 0x2001300020013,
[0x2e] = 0x4001000000004,
[0x2f] = 0x1 <fixed_percpu_data+1>,
[0x30] = 0xffffffff818d8e40 <init_once>,
[0x31] = 0x4000000300,
[0x32] = 0x0 <fixed_percpu_data>,
[0x33] = 0xffffffff8261cbe8,
[0x34] = 0xffff8880036a4368,
[0x35] = 0xffff888003508c68,
[0x36] = 0xffffffff8261cbe8,
[0x37] = 0xffff8880036a4380,
[0x38] = 0xffff888003508c80,
[0x39] = 0xffff888003629a98,
[0x3a] = 0xffff888003629a80,
[0x3b] = 0xffffffff8295b620 <slab_ktype>,
[0x3c] = 0xffff88800363e100,
[0x3d] = 0x300000001,
[0x3e] = 0x3e8,
[0x3f] = 0x0 <fixed_percpu_data>
}
}
gef>
SLUB allocator and SKB allocation #
SLUB allocator는 기존에 사용되던 SLAB allocator의 대안으로 제시되었으며, slab을 특정 오브젝트로 구성된 페이지의 그룹으로 바라본다. 이러한 slab은 그 자체로는 freelist를 관리하지 않는다. 그래서 freelist를 관리하기 위한 데이터를 페이지 구조체에 저장한다[1].
1/* include/linux/mm_types.h */
2
3/*
4 * Each physical page in the system has a struct page associated with
5 * it to keep track of whatever it is we are using the page for at the
6 * moment. Note that we have no way to track which tasks are using
7 * a page, though if it is a pagecache page, rmap structures can tell us
8 * who is mapping it.
9 *
10 * If you allocate the page using alloc_pages(), you can use some of the
11 * space in struct page for your own purposes. The five words in the main
12 * union are available, except for bit 0 of the first word which must be
13 * kept clear. Many users use this word to store a pointer to an object
14 * which is guaranteed to be aligned. If you use the same storage as
15 * page->mapping, you must restore it to NULL before freeing the page.
16 *
17 * If your page will not be mapped to userspace, you can also use the four
18 * bytes in the mapcount union, but you must call page_mapcount_reset()
19 * before freeing it.
20 *
21 * If you want to use the refcount field, it must be used in such a way
22 * that other CPUs temporarily incrementing and then decrementing the
23 * refcount does not cause problems. On receiving the page from
24 * alloc_pages(), the refcount will be positive.
25 *
26 * If you allocate pages of order > 0, you can use some of the fields
27 * in each subpage, but you may need to restore some of their values
28 * afterwards.
29 *
30 * SLUB uses cmpxchg_double() to atomically update its freelist and
31 * counters. That requires that freelist & counters be adjacent and
32 * double-word aligned. We align all struct pages to double-word
33 * boundaries, and ensure that 'freelist' is aligned within the
34 * struct.
35 */
36#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
37#define _struct_page_alignment __aligned(2 * sizeof(unsigned long))
38#else
39#define _struct_page_alignment
40#endif
41
42struct page {
43 unsigned long flags; /* Atomic flags, some possibly
44 * updated asynchronously */
45 /*
46 * Five words (20/40 bytes) are available in this union.
47 * WARNING: bit 0 of the first word is used for PageTail(). That
48 * means the other users of this union MUST NOT use the bit to
49 * avoid collision and false-positive PageTail().
50 */
51 union {
52 /* ... */
53 struct { /* slab, slob and slub */
54 union {
55 struct list_head slab_list;
56 struct { /* Partial pages */
57 struct page *next;
58#ifdef CONFIG_64BIT
59 int pages; /* Nr of pages left */
60 int pobjects; /* Approximate count */
61#else
62 short int pages;
63 short int pobjects;
64#endif
65 };
66 };
67 struct kmem_cache *slab_cache; /* not slob */
68 /* Double-word boundary */
69 void *freelist; /* first free object */
70 union {
71 void *s_mem; /* slab: first object */
72 unsigned long counters; /* SLUB */
73 struct { /* SLUB */
74 unsigned inuse:16;
75 unsigned objects:15;
76 unsigned frozen:1;
77 };
78 };
79 };
80 /* ... */
81
82 /** @rcu_head: You can use this to free a page by RCU. */
83 struct rcu_head rcu_head;
84 };
85
86 union { /* This union is 4 bytes in size. */
87 /*
88 * If the page can be mapped to userspace, encodes the number
89 * of times this page is referenced by a page table.
90 */
91 atomic_t _mapcount;
92
93 /*
94 * If the page is neither PageSlab nor mappable to userspace,
95 * the value stored here may help determine what this page
96 * is used for. See page-flags.h for a list of page types
97 * which are currently stored here.
98 */
99 unsigned int page_type;
100
101 unsigned int active; /* SLAB */
102 int units; /* SLOB */
103 };
104
105 /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
106 atomic_t _refcount;
107
108#ifdef CONFIG_MEMCG
109 unsigned long memcg_data;
110#endif
111
112 /*
113 * On machines where all RAM is mapped into kernel address space,
114 * we can simply calculate the virtual address. On machines with
115 * highmem some memory is mapped into kernel virtual memory
116 * dynamically, so we need a place to store that address.
117 * Note that this field could be 16 bits on x86 ... ;)
118 *
119 * Architectures with slow multiplication can define
120 * WANT_PAGE_VIRTUAL in asm/page.h
121 */
122#if defined(WANT_PAGE_VIRTUAL)
123 void *virtual; /* Kernel virtual address (NULL if
124 not kmapped, ie. highmem) */
125#endif /* WANT_PAGE_VIRTUAL */
126
127#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
128 int _last_cpupid;
129#endif
130} _struct_page_alignment;
SLUB은, 위 소스 코드의 주석에서도 볼 수 있듯이 freelist
멤버를 참조하여 오브젝트를 할당하고 다음으로 할당할 위치를 offset
으로 관리한다. 이를 그림으로 나타내면 다음과 같다:
page:
+-----------------------+
| ... |
freelist ------->+-----------------------+
| returned object |
+-----------------------+
| ... |
+-----------------------+<----- freelist + offset
| next object |
+-----------------------+
| ... |
+-----------------------+
그럼 SLUB이 struct sk_buff
를 할당하는 과정을 소스 코드로 알아보겠다[1].
그 시작은 alloc_skb
함수이고, __alloc_skb
함수의 레퍼 함수 (wrapper function)이다. (kmalloc()
의 call trace가 slab_alloc_node()
부터 겹침을 확인하라)
/* function call trace */
alloc_skb() kmalloc()
__alloc_skb() /* flags = 0, node = NUMA_NO_NODE = -1 */ __kmalloc() /* size<pagesize*2 */
kmem_cache_alloc_node() /* cache = skbuff_head_cache, */ |
/* gfp_mask & ~GFP_DMA, */ +-----------+
/* node = NUMA_NO_NODE = -1 */ | |
slab_alloc_node() /* s = cache = skbuff_head_cache, */ V V
/* gfpflags = gfp_mask & ~GFP_DMA, */ kmalloc_slab() slab_alloc()
/* node = NUMA_NO_NODE = -1, */ slab_alloc_node()
/* addr = _RET_IP_, */
/* orig_size = s->object_size */
위 call trace에서 slab_alloc_node
함수의 구현을 살펴보면 할당의 상황들이 나오기 시작한다. 만약 현재 상황이 초기 할당이거나 slab cache가 모두 할당된 상황이라면 struct kmem_cache_cpu
의 freelist
나 page
멤버가 유효한 메모리 주소를 갖고 있지 않을 것이다. 하지만 이에 해당하지 않는다면 이들은 어떤 주소든 갖고 있을 것이며, 따라서 이로부터 할당하면 될 것이다.
1/* mm/slub.c */
2
3/*
4 * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
5 * have the fastpath folded into their functions. So no function call
6 * overhead for requests that can be satisfied on the fastpath.
7 *
8 * The fastpath works by first checking if the lockless freelist can be used.
9 * If not then __slab_alloc is called for slow processing.
10 *
11 * Otherwise we can simply pick the next object from the lockless free list.
12 */
13static __always_inline void *slab_alloc_node(struct kmem_cache *s,
14 gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
15{
16 void *object;
17 struct kmem_cache_cpu *c;
18 struct page *page;
19 unsigned long tid;
20 struct obj_cgroup *objcg = NULL;
21 bool init = false;
22
23 /* ... */
24
25redo:
26 /*
27 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
28 * enabled. We may switch back and forth between cpus while
29 * reading from one cpu area. That does not matter as long
30 * as we end up on the original cpu again when doing the cmpxchg.
31 *
32 * We must guarantee that tid and kmem_cache_cpu are retrieved on the
33 * same cpu. We read first the kmem_cache_cpu pointer and use it to read
34 * the tid. If we are preempted and switched to another cpu between the
35 * two reads, it's OK as the two are still associated with the same cpu
36 * and cmpxchg later will validate the cpu.
37 */
38 c = raw_cpu_ptr(s->cpu_slab);
39 tid = READ_ONCE(c->tid);
40
41 /*
42 * Irqless object alloc/free algorithm used here depends on sequence
43 * of fetching cpu_slab's data. tid should be fetched before anything
44 * on c to guarantee that object and page associated with previous tid
45 * won't be used with current tid. If we fetch tid first, object and
46 * page could be one associated with next tid and our alloc/free
47 * request will be failed. In this case, we will retry. So, no problem.
48 */
49 barrier();
50
51 /*
52 * The transaction ids are globally unique per cpu and per operation on
53 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
54 * occurs on the right processor and that there was no operation on the
55 * linked list in between.
56 */
57
58 object = c->freelist;
59 page = c->page;
60 /*
61 * We cannot use the lockless fastpath on PREEMPT_RT because if a
62 * slowpath has taken the local_lock_irqsave(), it is not protected
63 * against a fast path operation in an irq handler. So we need to take
64 * the slow path which uses local_lock. It is still relatively fast if
65 * there is a suitable cpu freelist.
66 */
67 if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
68 unlikely(!object || !page || !node_match(page, node))) {
69 object = __slab_alloc(s, gfpflags, node, addr, c);
70 } else {
71 void *next_object = get_freepointer_safe(s, object);
72
73 /*
74 * The cmpxchg will only match if there was no additional
75 * operation and if we are on the right processor.
76 *
77 * The cmpxchg does the following atomically (without lock
78 * semantics!)
79 * 1. Relocate first pointer to the current per cpu area.
80 * 2. Verify that tid and freelist have not been changed
81 * 3. If they were not changed replace tid and freelist
82 *
83 * Since this is without lock semantics the protection is only
84 * against code executing on this cpu *not* from access by
85 * other cpus.
86 */
87 if (unlikely(!this_cpu_cmpxchg_double(
88 s->cpu_slab->freelist, s->cpu_slab->tid,
89 object, tid,
90 next_object, next_tid(tid)))) {
91
92 note_cmpxchg_failure("slab_alloc", s, tid);
93 goto redo;
94 }
95 prefetch_freepointer(s, next_object);
96 stat(s, ALLOC_FASTPATH);
97 }
98
99 maybe_wipe_obj_freeptr(s, object);
100 init = slab_want_init_on_alloc(gfpflags, s);
101
102out:
103 slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
104
105 return object;
106}
계속해서 나아가면, 다음과 같이 call trace를 정리할 수 있다. (kmalloc()
의 call trace가 slab_alloc_node()
부터 겹침을 확인하라)
/* function call trace */
alloc_skb() kmalloc()
__alloc_skb() /* flags = 0, node = NUMA_NO_NODE = -1 */ __kmalloc() /* size<pagesize*2 */
kmem_cache_alloc_node() /* cache = skbuff_head_cache, */ |
/* gfp_mask & ~GFP_DMA, */ +-----------+
/* node = NUMA_NO_NODE = -1 */ | |
slab_alloc_node() /* s = cache = skbuff_head_cache, */ V V
| /* gfpflags = gfp_mask & ~GFP_DMA, */ kmalloc_slab() slab_alloc()
| /* node = NUMA_NO_NODE = -1, */ slab_alloc_node()
| /* addr = _RET_IP_, */
| /* orig_size = s->object_size */
|
+-------------------+
|slowpath |fastpath (slab cache freelist)
| |
|no freelist or |
|no page |--------------------------+
| |1 |2
V V V
__slab_alloc() get_freepointer_safe() this_cpu_cmpxchg_double()
___slab_alloc() raw_cpu_cmpxchg_double()
| __pcpu_double_call_return_bool()
| raw_cpu_cmpxchg_double_[1248]()
| raw_cpu_generic_cmpxchg_double()
|
|
+---------------------------+------------------------+
|3 |1 |2
|no page |per-cpu partial list |NUMA node partial list
| | |
V V V
new_slab() slub_percpu_partial() get_partial()
allocate_slab()
alloc_slab_page()
alloc_pages()
alloc_pages_node()
__alloc_pages_node()
__alloc_pages() /* This is the 'heart' of the zoned buddy allocator */
get_page_from_freelist()
rmqueue()
|
|order <= 3
|
V
rmqueue_pcplist()
__rmqueue_pcplist() /* Remove page from the per-cpu list, caller must protect the list */
|
|if the list is empty
|
V
rmqueue_bulk() /* Obtain a specified number of elements from the buddy allocator */
위 call trace를 살펴보면, slab의 초기 할당에서는 page allocator (zoned buddy allocator)로부터 페이지를 할당받는 것이 분명해보인다. 특히 per-cpu의 page list (linked list)가 비어있을 때는 buddy allocator로부터 페이지를 얻어올 것이다. 여기서는 slab fastpath를 살펴보도록 하겠다. 먼저 get_freepointer_safe()
와 slab_alloc_node()
의 구현을 읽어보면 slab cache의 freelist
에 s->offset
를 더하여 할당할 오브젝트의 주소를 얻음을 알 수 있다. 그리고 여기에 KASLR이 활성화될 경우 추가적인 연산 (freelist_ptr()
을 참고하라)이 수행된다.
1/* mm/slub.c */
2
3static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
4{
5 unsigned long freepointer_addr;
6 void *p;
7
8 if (!debug_pagealloc_enabled_static())
9 return get_freepointer(s, object);
10
11 object = kasan_reset_tag(object);
12 freepointer_addr = (unsigned long)object + s->offset;
13 copy_from_kernel_nofault(&p, (void **)freepointer_addr, sizeof(p));
14 return freelist_ptr(s, p, freepointer_addr);
15}
이렇게 얻어진 주소는 raw_cpu_generic_cmpxchg_double()
에서 freelist
에 쓰여지게 된다.
1/* include/asm-generic/percpu.h */
2/*
3 pcp1 = s->cpu_slab->freelist
4 pcp2 = s->cpu_slab->tid
5 oval1 = object (= freelist)
6 oval2 = tid
7 nval1 = next_object (= get_freepointer_safe())
8 nval2 = next_tid(tid)
9*/
10
11#define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
12({ \
13 typeof(pcp1) *__p1 = raw_cpu_ptr(&(pcp1)); \
14 typeof(pcp2) *__p2 = raw_cpu_ptr(&(pcp2)); \
15 int __ret = 0; \
16 if (*__p1 == (oval1) && *__p2 == (oval2)) { \
17 *__p1 = nval1; \
18 *__p2 = nval2; \
19 __ret = 1; \
20 } \
21 (__ret); \
22})
정리하면, 커널 내부에서 사용되는 오브젝트, 특히 SKB를 할당할 때의 우선순위는 다음과 같다:
- Slab cache (skbuff cache) freelist
- Slub per-cpu partial list
- NUMA node partial list
- Page allocator (zoned buddy allocator)
- Per-cpu freelis
- Other paths ...
SLUB allocator and SKB deallocation #
SLUB allocator는 slab이 해제될 때 이 slab이 속한 페이지를 partial list에 넣는다. 그런데 이때 slab 전체가 해제된 경우 그 페이지는 page allocator가 갖게 된다[1].
SKB를 해제하는 과정은 kfree_skb
함수에서부터 시작된다. (kfree()
의 call trace가 slab_free()
에서부터 겹침을 확인하라)
/* function call trace */
kfree_skb() kfree()
__kfree_skb() slab_free()
kfree_skbmem()
kmem_cache_free()
slab_free() /* s = skbuff_head_cache
page = virt_to_head_page(skb)
head = skb
tail = NULL
cnt = 1
addr = _RET_IP */
do_slab_free() /* this function is fastpath */
만약 해제하고자 하는 slab이 속한 페이지와 cpu slab이 바라보고 있는 페이지가 같다면, 그 freelist를 조작하는 것으로 작업을 끝낼 수 있다. 이것이 slab free에서의 fastpath이고, 상기의 조건을 만족시키지 못한다면 slowpath로 넘어간다.
1/* mm/slub.c */
2
3/*
4 * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
5 * can perform fastpath freeing without additional function calls.
6 *
7 * The fastpath is only possible if we are freeing to the current cpu slab
8 * of this processor. This typically the case if we have just allocated
9 * the item before.
10 *
11 * If fastpath is not possible then fall back to __slab_free where we deal
12 * with all sorts of special processing.
13 *
14 * Bulk free of a freelist with several objects (all pointing to the
15 * same page) possible by specifying head and tail ptr, plus objects
16 * count (cnt). Bulk free indicated by tail pointer being set.
17 */
18static __always_inline void do_slab_free(struct kmem_cache *s,
19 struct page *page, void *head, void *tail,
20 int cnt, unsigned long addr)
21{
22 void *tail_obj = tail ? : head;
23 struct kmem_cache_cpu *c;
24 unsigned long tid;
25
26 /* memcg_slab_free_hook() is already called for bulk free. */
27 if (!tail)
28 memcg_slab_free_hook(s, &head, 1);
29redo:
30 /*
31 * Determine the currently cpus per cpu slab.
32 * The cpu may change afterward. However that does not matter since
33 * data is retrieved via this pointer. If we are on the same cpu
34 * during the cmpxchg then the free will succeed.
35 */
36 c = raw_cpu_ptr(s->cpu_slab);
37 tid = READ_ONCE(c->tid);
38
39 /* Same with comment on barrier() in slab_alloc_node() */
40 barrier();
41
42 if (likely(page == c->page)) {
43#ifndef CONFIG_PREEMPT_RT
44 /* ... */
45#else /* CONFIG_PREEMPT_RT */
46 /*
47 * We cannot use the lockless fastpath on PREEMPT_RT because if
48 * a slowpath has taken the local_lock_irqsave(), it is not
49 * protected against a fast path operation in an irq handler. So
50 * we need to take the local_lock. We shouldn't simply defer to
51 * __slab_free() as that wouldn't use the cpu freelist at all.
52 */
53 void **freelist;
54
55 local_lock(&s->cpu_slab->lock);
56 c = this_cpu_ptr(s->cpu_slab);
57 if (unlikely(page != c->page)) {
58 local_unlock(&s->cpu_slab->lock);
59 goto redo;
60 }
61 tid = c->tid;
62 freelist = c->freelist;
63
64 set_freepointer(s, tail_obj, freelist);
65 c->freelist = head;
66 c->tid = next_tid(tid);
67
68 local_unlock(&s->cpu_slab->lock);
69#endif
70 stat(s, FREE_FASTPATH);
71 } else
72 __slab_free(s, page, head, tail_obj, cnt, addr);
73
74}
Slowpath에서 계속 따라가면 다음과 같이 call trace를 그려볼 수 있다. (아래 call trace에서 else condition
은 다른 조건이 만족되지 않을 경우 진행되는 경로임에 주의; kfree()
의 call trace가 slab_free()
에서부터 겹침을 확인하라)
/* function call trace */
kfree_skb() kfree()
__kfree_skb() slab_free()
kfree_skbmem()
kmem_cache_free()
slab_free() /* s = skbuff_head_cache,
page = virt_to_head_page(skb),
head = skb,
tail = NULL,
cnt = 1,
addr = _RET_IP */
do_slab_free() /* this function is fastpath */
|
|slowpath
|
V
__slab_free() /* s = skbuff_head_cache,
| page = virt_to_head_page(skb),
| head = skb,
| tail = skb,
| cnt = 1,
| addr = _RET_IP_ */
|
+---------------------------+---------------------------------+
|if slab is empty |likely, |
| |if slab is full |
| | |
| | |
|page list |per-cpu partial list |NUMA node partial
| | |list
| | |
V V V
discard_slab()<--+ put_cpu_partial()/* drain=1 */ add_partial() <---+
free_slab() | | __add_partial() |
__free_slab() |slab empty |per-cpu partial list |else
__free_pages() | |is full |condition
free_the_page() | | |
| | V |
| +----------- __unfreeze_partials()----------------------------+
|order <= 3
|
V
free_unref_page()
free_unref_page_commit() /* Free a pcp page */
위 call trace에서 slab empty
와 slab full
, per-cpu partial list full
의 조건은 __slab_free()
에서 확인할 수 있다.
1/* mm/slab.h */
2
3/*
4 * The slab lists for all objects.
5 */
6struct kmem_cache_node {
7 spinlock_t list_lock;
8
9#ifdef CONFIG_SLAB
10 /* ... */
11#endif
12
13#ifdef CONFIG_SLUB
14 unsigned long nr_partial;
15 struct list_head partial;
16#ifdef CONFIG_SLUB_DEBUG
17 atomic_long_t nr_slabs;
18 atomic_long_t total_objects;
19 struct list_head full;
20#endif
21#endif
22
23};
1/* mm/slub.c */
2
3/*
4 * Slow path handling. This may still be called frequently since objects
5 * have a longer lifetime than the cpu slabs in most processing loads.
6 *
7 * So we still attempt to reduce cache line usage. Just take the slab
8 * lock and free the item. If there is no additional partial page
9 * handling required then we can return immediately.
10 */
11static void __slab_free(struct kmem_cache *s, struct page *page,
12 void *head, void *tail, int cnt,
13 unsigned long addr)
14
15{
16 void *prior;
17 int was_frozen;
18 struct page new;
19 unsigned long counters;
20 struct kmem_cache_node *n = NULL;
21 unsigned long flags;
22
23 /* ... */
24
25 do {
26 if (unlikely(n)) {
27 spin_unlock_irqrestore(&n->list_lock, flags);
28 n = NULL;
29 }
30 prior = page->freelist;
31 counters = page->counters;
32 set_freepointer(s, tail, prior);
33 new.counters = counters;
34 was_frozen = new.frozen;
35 new.inuse -= cnt;
36 if ((!new.inuse || !prior) && !was_frozen) {
37
38 if (kmem_cache_has_cpu_partial(s) && !prior) {
39
40 /*
41 * Slab was on no list before and will be
42 * partially empty
43 * We can defer the list move and instead
44 * freeze it.
45 */
46 new.frozen = 1;
47
48 } else { /* Needs to be taken off a list */
49
50 n = get_node(s, page_to_nid(page));
51 /*
52 * Speculatively acquire the list_lock.
53 * If the cmpxchg does not succeed then we may
54 * drop the list_lock without any processing.
55 *
56 * Otherwise the list_lock will synchronize with
57 * other processors updating the list of slabs.
58 */
59 spin_lock_irqsave(&n->list_lock, flags);
60
61 }
62 }
63
64 } while (!cmpxchg_double_slab(s, page,
65 prior, counters,
66 head, new.counters,
67 "__slab_free"));
68
69 if (likely(!n)) {
70
71 if (likely(was_frozen)) {
72 /*
73 * The list lock was not taken therefore no list
74 * activity can be necessary.
75 */
76 stat(s, FREE_FROZEN);
77 } else if (new.frozen) {
78 /*
79 * If we just froze the page then put it onto the
80 * per cpu partial list.
81 */
82 put_cpu_partial(s, page, 1);
83 stat(s, CPU_PARTIAL_FREE);
84 }
85
86 return;
87 }
88
89 if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
90 goto slab_empty;
91
92 /*
93 * Objects left in the slab. If it was not on the partial list before
94 * then add it.
95 */
96 if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
97 remove_full(s, n, page);
98 add_partial(n, page, DEACTIVATE_TO_TAIL);
99 stat(s, FREE_ADD_PARTIAL);
100 }
101 spin_unlock_irqrestore(&n->list_lock, flags);
102 return;
103
104slab_empty:
105 if (prior) {
106 /*
107 * Slab on the partial list.
108 */
109 remove_partial(n, page);
110 stat(s, FREE_REMOVE_PARTIAL);
111 } else {
112 /* Slab must be on the full list */
113 remove_full(s, n, page);
114 }
115
116 /* ... */
117 discard_slab(s, page);
118}
1/* mm/slub.c */
2
3static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
4 void *freelist_old, unsigned long counters_old,
5 void *freelist_new, unsigned long counters_new,
6 const char *n)
7{
8#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
9 defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
10 if (s->flags & __CMPXCHG_DOUBLE) {
11 if (cmpxchg_double(&page->freelist, &page->counters,
12 freelist_old, counters_old,
13 freelist_new, counters_new))
14 return true;
15 } else
16#endif
17 {
18 unsigned long flags;
19
20 local_irq_save(flags);
21 __slab_lock(page);
22 if (page->freelist == freelist_old &&
23 page->counters == counters_old) {
24 page->freelist = freelist_new;
25 page->counters = counters_new;
26 __slab_unlock(page);
27 local_irq_restore(flags);
28 return true;
29 }
30 __slab_unlock(page);
31 local_irq_restore(flags);
32 }
33
34 cpu_relax();
35 stat(s, CMPXCHG_DOUBLE_FAIL);
36
37#ifdef SLUB_DEBUG_CMPXCHG
38 pr_info("%s %s: cmpxchg double redo ", n, s->name);
39#endif
40
41 return false;
42}
이때 slab empty를 판별하는 과정에서 NUMA node의 partial list 내 페이지 수 (n->nr_partial
) 와 s->min_partial
을 비교함을 확인하라. 이는 전달된 페이지에 할당된 slab이 없고, NUMA node에 이미 최소한의 partial slab이 확보되어 있음을 의미한다.
References #
- corbet, Apr. 2007, "The SLUB allocator," LWN. [Online]. Available: https://lwn.net/Articles/229984/
- Jeff Bonwick, "The Slab Allocator: An Object-Cacheing Kenrel Memory Allocator," in Proc. USENIX Summer 1994 Technical Conference, Boston, Massachusetts, USA, Jun. 1994, pp. 87 -- 98.
- Jonathan Corbet, Jul. 2011, "Per-CPU variables and the realtime tree," LWN. [Online]. Available: https://lwn.net/Articles/452884/
- Kanoj Sarcar, et al., Nov. 1999, "What is NUMA?," Linux Memory Management Documentation. [Online]. Available: https://www.kernel.org/doc/html/v4.18/vm/numa.html
- "Optimizing Applications for NUMA," Intel Guide for Developing Multithreaded Applications. [Online]. Available: https://www.intel.com/content/dam/develop/external/us/en/documents/3-5-memmgt-optimizing-applications-for-numa-184398.pdf