CVE-2022-0847: Dirty Pipe In Linux Kernel 5.8

Introduction
Background
Root Cause Analysis
PoC: Exploit
References

Introduction

The Dirty Pipe vulnerability was discovered by Max Kellermann while he was investigating a customer support request. It arises because the Linux kernel does not properly initialize a newly allocated pipe buffer [6, 9].

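A pipe is a unidirectional data channel that can be used for communication between processes. The Linux kernel implements it as a ring (FIFO) of pipe buffers, and each pipe buffer references a page through which data is read or written. To reduce the overhead of copying data between user space and kernel space, the splice system call was introduced in commit 5274f052e7b3 [8] [2, 6, 7].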

Rather than copying the data held in a page, the splice system call moves data by adding a pointer that references the page. Since commit f6dd975583bd [10], if the page already contains data, the write system call checks the flags member of the pipe buffer structure, and if PIPE_BUF_FLAG_CAN_MERGE is set, it appends the new data to that page [4, 6].

However, if the page being appended to is not a page owned by the pipe, the data can end up in a file other than the pipe [6].

Background

Let us now look in more detail at how file I/O works in the Linux kernel, and at how pipes and the splice system call work.

How file I/O works in the Linux kernel

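When a user calls the read or write system call, the Linux kernel first performs the generic checks that apply to every file, and then calls the read or write function of the file system the file belongs to. That file system function performs its own checks for reading or writing the file and then calls a function that reads data from, or writes data to, the page cache. This can be seen in the following example function call trace (for a file write):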

write()
ksys_write()
vfs_write()
    rw_verify_area()
    new_sync_write()
call_write_iter()
file->f_op->write_iter()  /* Here, we assume this is ext4_file_write_iter() */
                          /* f_op is assigned by the open system call */
ext4_buffered_write_iter()
generic_perform_write() /* Write data to page cache and mark as dirty */

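In this way the Linux kernel uses the page cache when reading and writing files to avoid relatively expensive disk operations. When data is read from a file, it is first brought from disk into the page cache, and later reads are served from the page cache rather than from disk. The same applies to writes: the data is first written to the page cache, the page is marked dirty, and the dirty page is written back to disk at a later point (for example by the kernel's periodic writeback or when a sync is requested) [3].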
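The write path described above can be observed from user space. The following is a minimal sketch, not taken from the original post: `write_and_sync()` is a hypothetical helper that writes a buffer and then calls `fsync()`, so the dirty pages are flushed immediately instead of waiting for writeback.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes to `path` and force them to
 * stable storage. Returns 0 on success, -1 on failure.
 */
int write_and_sync(const char *path, const void *data, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;

	/* write() normally returns once the data sits in the page cache. */
	ssize_t n = write(fd, data, len);
	if (n < 0 || (size_t)n != len) {
		close(fd);
		return -1;
	}

	/* fsync() asks the kernel to flush this file's dirty pages now. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Without the `fsync()` call the data would still be visible to other readers, because they are served from the same page cache.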

Linux pipe and splice system call

In the Linux kernel a pipe is handled as a file that belongs to the pipe file system. Calling the write system call on an open pipe therefore ends up in pipe_write(), because the file_operations of the pipe file system are implemented as follows (path: fs/pipe.c):

const struct file_operations pipefifo_fops = {
	.open           = fifo_open,
	.llseek         = no_llseek,
	.read_iter      = pipe_read,
	.write_iter     = pipe_write,
	.poll           = pipe_poll,
	.unlocked_ioctl = pipe_ioctl,
	.release        = pipe_release,
	.fasync         = pipe_fasync,
};
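To make the dispatch through an operations table concrete, here is a small user-space sketch. All `toy_*` names are hypothetical stand-ins and are not kernel APIs; the point is only that the generic entry point calls whatever `write_iter` the table provides.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for struct file / file_operations. */
struct toy_file;

struct toy_file_ops {
	long (*write_iter)(struct toy_file *f, const char *buf, size_t len);
};

struct toy_file {
	const struct toy_file_ops *f_op; /* chosen when the file is opened */
	char data[64];
	size_t used;
};

/* "Pipe" flavour of write: append to the in-memory buffer. */
static long toy_pipe_write(struct toy_file *f, const char *buf, size_t len)
{
	if (len > sizeof(f->data) - f->used)
		len = sizeof(f->data) - f->used;
	memcpy(f->data + f->used, buf, len);
	f->used += len;
	return (long)len;
}

const struct toy_file_ops toy_pipefifo_fops = {
	.write_iter = toy_pipe_write,
};

/* Generic entry point: it only knows about the ops table. */
long toy_vfs_write(struct toy_file *f, const char *buf, size_t len)
{
	if (!f->f_op || !f->f_op->write_iter)
		return -1;
	return f->f_op->write_iter(f, buf, len);
}
```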

Internally, such a pipe is implemented as a ring (FIFO) of pipe buffers, and each buffer references a page used for reading or writing data (path: include/linux/pipe_fs_i.h).

    +--> bufs: [0] <-------------head, tail<-+
    |           +---> page            |      |
    |          [1]                    V      |
    |           +---> page            |      |
    |          [2]                    V      |
    |           +---> page            |      |
    |          [...]                  V      |
    |           +---> page            +------+
    |
    +--> tmp_page
    |
    |
pipe_inode_info

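In the figure above, head is the index of the next pipe buffer to be allocated (the slot a writer fills next), and tail is the index of the next pipe buffer to be used (the slot a reader consumes next). A pipe buffer is implemented in the code as follows: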

/**
 *	struct pipe_buffer - a linux kernel pipe buffer
 *	@page: the page containing the data for the pipe buffer
 *	@offset: offset of data inside the @page
 *	@len: length of data inside the @page
 *	@ops: operations associated with this buffer. See @pipe_buf_operations.
 *	@flags: pipe buffer flags. See above.
 *	@private: private data owned by the ops.
 **/
struct pipe_buffer {
	struct page *page;
	unsigned int offset, len;
	const struct pipe_buf_operations *ops;
	unsigned int flags;
	unsigned long private;
};
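The head and tail indices and the `& mask` indexing that appears in the kernel listings can be modelled in a few lines. This is a sketch under the assumption that the ring size is a power of two; `toy_ring` and its helpers are hypothetical names, loosely modelled on the kernel's pipe_empty() and pipe_full() helpers rather than copied from them.

```c
#include <stdbool.h>

/* Hypothetical toy ring; ring_size must be a power of two. */
struct toy_ring {
	unsigned int head;      /* next slot a writer fills */
	unsigned int tail;      /* next slot a reader drains */
	unsigned int ring_size; /* number of slots */
};

static inline unsigned int toy_occupancy(const struct toy_ring *r)
{
	/* Unsigned subtraction stays correct after head wraps around. */
	return r->head - r->tail;
}

static inline bool toy_empty(const struct toy_ring *r)
{
	return r->head == r->tail;
}

static inline bool toy_full(const struct toy_ring *r)
{
	return toy_occupancy(r) >= r->ring_size;
}

/* Map a free-running index to an array slot, like bufs[head & mask]. */
static inline unsigned int toy_slot(const struct toy_ring *r, unsigned int index)
{
	return index & (r->ring_size - 1);
}
```

Because the indices run freely and are only masked on access, the most recently written buffer is found at `(head - 1) & mask`, which is the expression pipe_write() uses.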

Putting together what has been described so far, we can expect writing data to a pipe to proceed as follows:

1. The user calls the write system call.
2. pipe_write() is called through call_write_iter().
3. The data is transferred from the write end of the pipe to the read end.
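From user space, the three steps above look like the following sketch. `pipe_roundtrip()` is a hypothetical helper, and it assumes `len` is small enough that the write cannot block on a full pipe.

```c
#include <unistd.h>

/*
 * Hypothetical helper: push `len` bytes through a fresh pipe and read
 * them back into `out`. Returns the number of bytes read, or -1 on error.
 */
ssize_t pipe_roundtrip(const void *data, size_t len, void *out, size_t out_size)
{
	int fds[2]; /* fds[0]: read end, fds[1]: write end */
	ssize_t n;

	if (pipe(fds) < 0)
		return -1;

	n = write(fds[1], data, len);            /* handled by pipe_write() */
	if (n >= 0)
		n = read(fds[0], out, out_size); /* handled by pipe_read() */

	close(fds[0]);
	close(fds[1]);
	return n;
}
```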

Marker (1) in the following pipe_write() code shows that, if PIPE_BUF_FLAG_CAN_MERGE is set, the data is appended to the already existing page:

static ssize_t
pipe_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *filp = iocb->ki_filp;
	struct pipe_inode_info *pipe = filp->private_data;
	unsigned int head;
	ssize_t ret = 0;
	size_t total_len = iov_iter_count(from);
	ssize_t chars;
	bool was_empty = false;
	bool wake_next_writer = false;

	/* Null write succeeds. */
	if (unlikely(total_len == 0))
		return 0;

	__pipe_lock(pipe);

	if (!pipe->readers) {
		send_sig(SIGPIPE, current, 0);
		ret = -EPIPE;
		goto out;
	}

	/* ... */

	head = pipe->head;
	was_empty = pipe_empty(head, pipe->tail);
	chars = total_len & (PAGE_SIZE-1);
	if (chars && !was_empty) {
		unsigned int mask = pipe->ring_size - 1;
		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
		int offset = buf->offset + buf->len;

		if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
		    offset + chars <= PAGE_SIZE) {
			ret = pipe_buf_confirm(pipe, buf);
			if (ret)
				goto out;

			ret = copy_page_from_iter(buf->page, offset, chars, from); /* (1) */
			if (unlikely(ret < chars)) {
				ret = -EFAULT;
				goto out;
			}

			buf->len += ret;
			if (!iov_iter_count(from))
				goto out;
		}
	}

	for (;;) {
		if (!pipe->readers) {
			send_sig(SIGPIPE, current, 0);
			if (!ret)
				ret = -EPIPE;
			break;
		}

		head = pipe->head;
		if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
			unsigned int mask = pipe->ring_size - 1;
			struct pipe_buffer *buf = &pipe->bufs[head & mask];
			struct page *page = pipe->tmp_page;
			int copied;

			/* ... */

			if (is_packetized(filp))
				buf->flags = PIPE_BUF_FLAG_PACKET;
			else /* (2) */
				buf->flags = PIPE_BUF_FLAG_CAN_MERGE;

			/* ... */
		}

		/* ... */
	}

	/* ... */
}
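The merge condition around marker (1) is plain arithmetic, so it can be checked in isolation. The sketch below restates it in user space; the `toy_*` names, the 4 KiB page size, and the flag value are assumptions made for illustration and are not kernel definitions.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u      /* assumption: 4 KiB pages */
#define TOY_FLAG_CAN_MERGE 0x10u /* illustrative value only */

struct toy_buf {
	unsigned int offset; /* start of valid data in the page */
	unsigned int len;    /* length of valid data */
	unsigned int flags;
};

/*
 * Returns how many bytes of a `total_len`-byte write would be appended
 * to the page of the most recent buffer, or 0 if a new buffer is needed.
 */
size_t toy_mergeable_bytes(const struct toy_buf *last, size_t total_len)
{
	size_t chars = total_len & (TOY_PAGE_SIZE - 1); /* sub-page remainder */
	size_t end = (size_t)last->offset + last->len;  /* first free byte */

	if (chars == 0)
		return 0;
	if (!(last->flags & TOY_FLAG_CAN_MERGE))
		return 0;
	if (end + chars > TOY_PAGE_SIZE)
		return 0;
	return chars;
}
```

For example, with 100 bytes already in the page, a 50-byte write is merged, while a 4000-byte write does not fit and goes to a new buffer.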

So when is PIPE_BUF_FLAG_CAN_MERGE set? Marker (2) in pipe_write() above shows that it is set when the pipe's file structure is not packetized. The manual page for pipes states that a pipe operates in packet mode when it is opened with O_DIRECT. Therefore, a pipe opened without O_DIRECT gets PIPE_BUF_FLAG_CAN_MERGE set on the buffers it writes [2].
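The difference between the two modes is observable from user space. The sketch below is an illustration and assumes a Linux system where pipe2() accepts O_DIRECT; `first_read_size()` is a hypothetical helper. It writes two 3-byte messages and reports how many bytes a single large read() returns.

```c
#define _GNU_SOURCE /* for pipe2() and O_DIRECT */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a pipe with `pipe_flags` (0 or O_DIRECT),
 * write two 3-byte messages, then issue one large read().
 * Returns the byte count of that read, or -1 on error.
 */
ssize_t first_read_size(int pipe_flags)
{
	int fds[2];
	char buf[64];
	ssize_t n = -1;

	assert(pipe_flags == 0 || pipe_flags == O_DIRECT);

	if (pipe2(fds, pipe_flags) < 0)
		return -1;

	if (write(fds[1], "abc", 3) == 3 && write(fds[1], "def", 3) == 3)
		n = read(fds[0], buf, sizeof(buf));

	close(fds[0]);
	close(fds[1]);
	return n;
}
```

In the default mode the second write is merged into the page of the first, so the read sees both messages as one byte stream. In packet mode each write stays a separate packet, and the read returns only the first one.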

Splice system call

The splice system call was proposed to reduce the overhead caused by copying data between user space and kernel space when communicating through pipes. It is not limited to transfers between pipes, however: it can be used file-to-pipe, pipe-to-file, and pipe-to-pipe. The overhead is reduced by adding a reference to the page that holds the data to be copied (in other words, by incrementing that page's refcount) [4, 8].

splice between file and pipe:
               +-----> [page] <----+
               |                   |
               |                   |
          page cache             pipe buffer
	  
splice between pipe and pipe:
               +-----> [page] <----+
               |                   |
               |                   |
          ipipe buffer        opipe buffer

Markers (1) and (2) in the kernel code below (splice_pipe_to_pipe() and copy_page_to_iter_pipe()) illustrate this:

/*
 * Splice contents of ipipe to opipe.
 */
static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
			       struct pipe_inode_info *opipe,
			       size_t len, unsigned int flags)
{
	struct pipe_buffer *ibuf, *obuf;
	unsigned int i_head, o_head;
	unsigned int i_tail, o_tail;
	unsigned int i_mask, o_mask;
	int ret = 0;
	bool input_wakeup = false;

retry:
	ret = ipipe_prep(ipipe, flags);
	if (ret)
		return ret;

	ret = opipe_prep(opipe, flags);
	if (ret)
		return ret;

	/*
	 * Potential ABBA deadlock, work around it by ordering lock
	 * grabbing by pipe info address. Otherwise two different processes
	 * could deadlock (one doing tee from A -> B, the other from B -> A).
	 */
	pipe_double_lock(ipipe, opipe);

	i_tail = ipipe->tail;
	i_mask = ipipe->ring_size - 1;
	o_head = opipe->head;
	o_mask = opipe->ring_size - 1;

	do {
		/* ... */

		ibuf = &ipipe->bufs[i_tail & i_mask];
		obuf = &opipe->bufs[o_head & o_mask];

		if (len >= ibuf->len) {
			/*
			 * Simply move the whole buffer from ipipe to opipe
			 */
			*obuf = *ibuf; /* (1) */

			/* ... */
		} else {
			/* ... */

			*obuf = *ibuf;

			/* ... */
		}

		/* ... */
	} while (len);

	/* ... */
}

static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
			 struct iov_iter *i)
{
	/* ... */
	off = i->iov_offset;
	buf = &pipe->bufs[i_head & p_mask];
	if (off) { /* (3) */
		if (offset == off && buf->page == page) {
			/* merge with the last one */
			buf->len += bytes;
			i->iov_offset += bytes;
			goto out;
		}
		i_head++;
		buf = &pipe->bufs[i_head & p_mask];
	}
	if (pipe_full(i_head, p_tail, pipe->max_usage))
		return 0;

	buf->ops = &page_cache_pipe_buf_ops;
	get_page(page); /* increases refcount of argument page */
	buf->page = page; /* (2) */
	buf->offset = offset;
	buf->len = bytes;

	/* ... */
}

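Note that get_page(), called just before (2) in the code above, increments the refcount of the page passed to it. This means that the page passed to copy_page_to_iter_pipe() and the page assigned at (2) are the same page. In other words, in the file-to-pipe case the page member of the pipe buffer references the very page that was allocated for the file's page cache. As a result, the "merge with the last one" branch inside the if statement marked (3) can run on a later call.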
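The sharing described above can be modelled with a toy reference-counted page. Everything here (`toy_page`, `toy_pipe_buf`, `toy_splice_page()`) is hypothetical and only mirrors the shape of the kernel code: the pipe buffer stores a pointer to the cache page instead of a copy.

```c
#include <stddef.h>

/* Hypothetical toy page with a reference count. */
struct toy_page {
	int refcount;
	char data[16];
};

struct toy_pipe_buf {
	struct toy_page *page;
	unsigned int offset, len;
};

static void toy_get_page(struct toy_page *p)
{
	p->refcount++;
}

/* "Splice": point the pipe buffer at the cache page, no data copy. */
void toy_splice_page(struct toy_pipe_buf *buf, struct toy_page *cache_page,
		     unsigned int offset, unsigned int len)
{
	toy_get_page(cache_page);
	buf->page = cache_page;
	buf->offset = offset;
	buf->len = len;
}
```

After `toy_splice_page()` the cache and the pipe buffer refer to the same memory, so any write performed through `buf->page` is also visible through the cache page.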

Root Cause Analysis

Returning to the splice system call, the file-to-pipe case goes through the following function call trace:

splice()
do_splice()
do_splice_to()

This shows that do_splice_to() is called:

/*
 * Attempt to initiate a splice from a file to a pipe.
 */
static long do_splice_to(struct file *in, loff_t *ppos,
			 struct pipe_inode_info *pipe, size_t len,
			 unsigned int flags)
{
	int ret;

	if (unlikely(!(in->f_mode & FMODE_READ)))
		return -EBADF;

	ret = rw_verify_area(READ, in, ppos, len);
	if (unlikely(ret < 0))
		return ret;

	if (unlikely(len > MAX_RW_COUNT))
		len = MAX_RW_COUNT;

	if (in->f_op->splice_read)
		return in->f_op->splice_read(in, ppos, pipe, len, flags);
	return default_file_splice_read(in, ppos, pipe, len, flags);
}

Assuming that the file belongs to an ext4 file system, execution then goes through the following function call trace:

file->f_op->splice_read() /* this is generic_file_splice_read() */
call_read_iter()
file->f_op->read_iter()
ext4_file_read_iter()
generic_file_read_iter()
generic_file_buffered_read()
copy_page_to_iter()
copy_page_to_iter_pipe()

This shows that copy_page_to_iter_pipe() is called. The code of this function is as follows:

static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
			 struct iov_iter *i)
{
	struct pipe_inode_info *pipe = i->pipe;
	struct pipe_buffer *buf;
	unsigned int p_tail = pipe->tail;
	unsigned int p_mask = pipe->ring_size - 1;
	unsigned int i_head = i->head;
	size_t off;

	if (unlikely(bytes > i->count))
		bytes = i->count;

	if (unlikely(!bytes))
		return 0;

	if (!sanity(i))
		return 0;

	off = i->iov_offset;
	buf = &pipe->bufs[i_head & p_mask];
	if (off) {
		if (offset == off && buf->page == page) {
			/* merge with the last one */
			buf->len += bytes;
			i->iov_offset += bytes;
			goto out;
		}
		i_head++;
		buf = &pipe->bufs[i_head & p_mask];
	}
	if (pipe_full(i_head, p_tail, pipe->max_usage))
		return 0;

	/****** (1) ******/
	buf->ops = &page_cache_pipe_buf_ops;
	get_page(page); /* increases refcount of argument page */
	buf->page = page;
	buf->offset = offset;
	buf->len = bytes;

	pipe->head = i_head + 1;
	i->iov_offset = offset + bytes;
	i->head = i_head;
out:
	i->count -= bytes;
	return bytes;
}

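Looking at the code from (1) onward, there is no initialization of the flags member of the pipe buffer structure. In other words, a pipe buffer set up by this function keeps whatever flags value was previously set on that slot (e.g., PIPE_BUF_FLAG_CAN_MERGE).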
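The effect of the missing initialization can be reproduced in a harmless user-space model. The sketch below is a toy simulation with hypothetical `toy_*` types, a 16-byte "page", and an illustrative flag value; it does not touch the kernel. The `init_flags` parameter models the change suggested by the title of the fix commit [1], namely clearing flags when a buffer slot is reused.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_FLAG_CAN_MERGE 0x10u /* illustrative value only */
#define TOY_PAGE_SIZE 16u        /* tiny "page" to keep the model small */

struct toy_page {
	char data[TOY_PAGE_SIZE];
};

struct toy_pipe_buf {
	struct toy_page *page;
	unsigned int offset, len;
	unsigned int flags;
};

/* Ordinary pipe write into a pipe-owned page: sets CAN_MERGE. */
void toy_pipe_write_new(struct toy_pipe_buf *buf, struct toy_page *own,
			const char *src, unsigned int n)
{
	memcpy(own->data, src, n);
	buf->page = own;
	buf->offset = 0;
	buf->len = n;
	buf->flags = TOY_FLAG_CAN_MERGE;
}

/* Splice a page-cache page into the same slot. */
void toy_splice_in(struct toy_pipe_buf *buf, struct toy_page *cache,
		   unsigned int offset, unsigned int n, bool init_flags)
{
	buf->page = cache;
	buf->offset = offset;
	buf->len = n;
	if (init_flags)
		buf->flags = 0; /* without this, the old flags survive */
}

/* Follow-up pipe write: merges into buf->page when CAN_MERGE is set. */
bool toy_pipe_write_merge(struct toy_pipe_buf *buf, const char *src,
			  unsigned int n)
{
	unsigned int end = buf->offset + buf->len;

	if (!(buf->flags & TOY_FLAG_CAN_MERGE) || end + n > TOY_PAGE_SIZE)
		return false;
	memcpy(buf->page->data + end, src, n);
	buf->len += n;
	return true;
}
```

With `init_flags` set to false, a slot that once held an ordinary pipe write still carries the merge flag after the splice, so the follow-up write lands in the cache page. With `init_flags` set to true, the merge is refused and the cache page stays untouched.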

PoC: Exploit

References

[1] Max Kellermann, "lib/iov_iter: initialize flags in new pipe_buffer," 2022. [Online]. Available: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/lib/iov_iter.c?id=9d2231c5d74e13b2a0546fee6737ee4446017903, [Accessed Sep. 07, 2023].
[2] "pipe(2) -- Linux manual page," 2023. [Online]. Available: https://man7.org/linux/man-pages/man2/pipe.2.html, [Accessed Sep. 07, 2023].
[3] "Linux memory management." [Online]. Available: https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#page-cache, [Accessed Sep. 07, 2023].
[4] "splice(2) -- Linux manual page," 2023. [Online]. Available: https://man7.org/linux/man-pages/man2/splice.2.html, [Accessed Sep. 07, 2023].
[5] hyeyoo, "Page Cache: filemap_read," 2022. [Online]. Available: https://hyeyoo.com/161, [Accessed Sep. 07, 2023].
[6] Max Kellermann, "The Dirty Pipe Vulnerability," 2022. [Online]. Available: https://dirtypipe.cm4all.com/, [Accessed Sep. 08, 2023].
[7] Jonathan Corbet, "Rethinking splice()," 2023. [Online]. Available: https://lwn.net/Articles/923237/, [Accessed Jan. 21, 2024].
[8] Jens Axboe, "[PATCH] Introduce sys_splice() system call," 2006. [Online]. Available: https://github.com/torvalds/linux/commit/5274f052e7b3dbd81935772eb551dfd0325dfa9d, [Accessed Jan. 21, 2024].
[9] "CVE-2022-0847 Detail," NIST. [Online]. Available: https://nvd.nist.gov/vuln/detail/cve-2022-0847, [Accessed Jan. 21, 2024].
[10] Christoph Hellwig, "pipe: merge anon_pipe_buf*_ops," 2020. [Online]. Available: https://github.com/torvalds/linux/commit/f6dd975583bd8ce088400648fd9819e4691c8958, [Accessed Jan. 21, 2024].

#1day

last updated: 2024-12-23
