Bypassing eBPF-based Security Enforcement Tools


June 6, 2022

During penetration tests and red team engagements, eBPF-based security observability and runtime enforcement tools can make it difficult to use public offensive security tools and techniques, which are then more likely to be detected and blocked.

However, eBPF-based tools have limitations which allow adversaries to bypass their controls. In this blog post, I will introduce some of the limitations and bypass techniques.

This post will cover:

01 Tetragon: an open-source eBPF-based security observability and runtime enforcement tool

02 Policy limitations

03 Bypassing I/O system call monitoring with io_uring

04 Process execution context

Tetragon

Isovalent recently released Tetragon as an open-source project. Tetragon is an eBPF-based security observability and runtime enforcement platform that has been part of Isovalent Cilium Enterprise for a couple of years.

Open-sourcing parts of Isovalent Cilium Enterprise as project Tetragon, and opening it up to the entire community, inspired us at the Form3 Offensive Security team to explore the capabilities and limitations of eBPF-based security observability and runtime enforcement tools.

Setup

To get hands-on experience with Tetragon and the events it generates, follow the Tetragon quickstart guide to set up a Kind cluster and install Tetragon using a Helm-based installation.

Functionality Overview

Tetragon uses kprobe hook points to observe arbitrary calls in the Linux kernel, which lets it monitor the files a process opens, reads, writes, and closes throughout the process lifecycle.

To explore how Tetragon syscall monitoring works, apply the write.yaml TracingPolicy.

kubectl apply -f ./tetragon/crds/examples/write.yaml

Create a testing pod using the ubuntu image.

kubectl run demo --image ubuntu --command sleep infinity

In another terminal, start monitoring the events from the demo pod.

kubectl logs -n kube-system ds/tetragon -c export-stdout -f | tetragon observe --namespace default --pod demo

To test the TracingPolicy, kubectl exec into the demo pod and read the contents of /etc/hostname.

kubectl exec -it demo -- cat /etc/hostname 

The output in the terminal should be the hostname of the pod.

demo

Looking at the output in the terminal running tetragon observe, we should see the write system call used by the cat command.

🚀 process default/demo /usr/bin/cat /etc/hostname                        
📝 write   default/demo /usr/bin/cat  5 bytes                             
💥 exit    default/demo /usr/bin/cat /etc/hostname 0             

The output shows the cat process inside a Kubernetes workload performing a write system call that sends the contents of /etc/hostname to stdout (standard output), which by default is file descriptor 1. Now that we have a baseline, let's look at how to bypass this TracingPolicy.

The sys-write TracingPolicy

This TracingPolicy uses kprobe hook points to observe arbitrary system calls in the Linux kernel. We will look at the sys-write example TracingPolicy and see how it detects write system calls.

The snippet below shows how the __x64_sys_write call is used in the TracingPolicy.

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "sys-write"
spec:
  kprobes:
  - call: "__x64_sys_write"
    syscall: true
    args:
    - index: 0
      type: "int"
    - index: 1
      type: "char_buf"
      sizeArgIndex: 3
    - index: 2
      type: "size_t"
    # follow any non-init pids stdout e.g. exec into container
    selectors:
    - matchPIDs:
      - operator: NotIn
        followForks: true
        isNamespacePID: true
        values:
        - 1
      matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "1"
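To make the selector semantics concrete, the policy can be modelled as a plain predicate. This is only an illustrative sketch and not Tetragon code: the names `syscall_event` and `sys_write_policy_matches` are hypothetical, and the model ignores followForks.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified view of one observed system call. */
struct syscall_event {
    const char *call;   /* hooked kernel symbol, e.g. "__x64_sys_write" */
    int ns_pid;         /* PID as seen inside the container's PID namespace */
    int fd;             /* first argument (index 0) of the call */
};

/* Mirrors the sys-write policy above:
 *  - only __x64_sys_write is hooked,
 *  - namespace PID 1 is excluded (matchPIDs NotIn [1]),
 *  - only writes to descriptor 1 are reported (matchArgs index 0 Equal "1"). */
bool sys_write_policy_matches(const struct syscall_event *ev)
{
    if (strcmp(ev->call, "__x64_sys_write") != 0)
        return false;
    if (ev->ns_pid == 1)
        return false;
    return ev->fd == 1;
}
```

Under this model, an event for any other kernel symbol, or a write to any descriptor other than 1, returns false.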

Baseline

The first observation is that the effectiveness of the policy, as with most security monitoring tools, is directly related to the synchronous behaviour of the evaluated processes and system calls. To analyse this behaviour, we start by isolating the system call in a small C program and use it as a baseline.

write(2) function synopsis

#include <unistd.h>

ssize_t write(int fd, const void *buf, size_t count);

The write(2) system call writes to a file descriptor; in our baseline example, the file descriptor is standard output, which has the file descriptor number 1.

Baseline

#include <unistd.h>

int main()
{
    write(1,"Writing using write()!\n",24); 
}

Executing the program with strace allows us to observe the system calls.

$ strace ./write
execve("./write", ["./write"], 0x7ffce310ca50 /* 24 vars */) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe797eb220) = -1 EINVAL (Invalid argument)
brk(NULL)                               = 0x175c000
brk(0x175d1c0)                          = 0x175d1c0
arch_prctl(ARCH_SET_FS, 0x175c880)      = 0
uname({sysname="Linux", nodename="lab", ...}) = 0
readlink("/proc/self/exe", "/home/ubuntu/write", 4096) = 18
brk(0x177e1c0)                          = 0x177e1c0
brk(0x177f000)                          = 0x177f000
mprotect(0x4bd000, 12288, PROT_READ)    = 0
write(1, "Writing using write()!\n\0", 24Writing using write()!
) = 24
exit_group(0)                           = ?
+++ exited with 0 +++

From the output, we can see the write(2) system call that the kprobe hook is expected to detect.

write(1, "Writing using write()!\n\0", 24) = 24

Executing the program in the demo pod confirms that our baseline works as intended.

$ kubectl exec -it demo -- write
Writing using write()!

From the output, we can see that Tetragon can detect the write(2) system call.

$ kubectl logs -n kube-system ds/tetragon -c export-stdout -f | tetragon observe --namespace default --pod demo
🚀 process default/demo /usr/bin/write                                    
📝 write   default/demo /usr/bin/write  24 bytes                          
💥 exit    default/demo /usr/bin/write  0                        

Limitations

Obviously, the simplest way to bypass the example rule is to use a function equivalent to write(), such as writev(), which performs the same action as write() but gathers the output data from the iovcnt buffers specified by the members of the iov array.

writev()

#include <sys/uio.h>

int main()
{
    struct iovec vecs;
    vecs.iov_base = "Writing using writev()!\n";
    vecs.iov_len = 25;

    writev(1, &vecs, 1);
}

Executing the program with strace lets us verify which system calls are made.

$ strace ./writev 
execve("./writev", ["./writev"], 0x7ffc915e7460 /* 24 vars */) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcf4e0cf80) = -1 EINVAL (Invalid argument)
brk(NULL)                               = 0x6d8000
brk(0x6d91c0)                           = 0x6d91c0
arch_prctl(ARCH_SET_FS, 0x6d8880)       = 0
uname({sysname="Linux", nodename="lab", ...}) = 0
readlink("/proc/self/exe", "/home/ubuntu/writev", 4096) = 19
brk(0x6fa1c0)                           = 0x6fa1c0
brk(0x6fb000)                           = 0x6fb000
mprotect(0x4bd000, 12288, PROT_READ)    = 0
writev(1, [{iov_base="Writing using writev()!\n\0", iov_len=25}], 1Writing using writev()!
) = 25
exit_group(0)                           = ?
+++ exited with 0 +++

From the output we can see the writev(2) system call is called by the writev program.

writev(1, [{iov_base="Writing using writev()!\n\0", iov_len=25}], 1) = 25

Executing the program in the demo pod allows us to confirm that the system call is not detected.

$ kubectl exec -it demo -- writev
Writing using writev()!

From the output, as expected, Tetragon is unable to detect the writev(2) system call because it does not match the TracingPolicy.

$ kubectl logs -n kube-system ds/tetragon -c export-stdout -f | tetragon observe --namespace default --pod demo
🚀 process default/demo /usr/bin/writev                                   
💥 exit    default/demo /usr/bin/writev  0                                            
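The writev program uses a single buffer, but the gather behaviour of writev(2) means several separate buffers can be written with one system call. A minimal sketch follows, assuming a hypothetical helper named `gather_write` that is not part of the original programs:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Write three separate strings to fd with a single writev(2) call.
 * Returns the number of bytes written, or -1 on error (errno is set). */
ssize_t gather_write(int fd, const char *a, const char *b, const char *c)
{
    struct iovec vecs[3];

    vecs[0].iov_base = (void *)a;
    vecs[0].iov_len  = strlen(a);
    vecs[1].iov_base = (void *)b;
    vecs[1].iov_len  = strlen(b);
    vecs[2].iov_base = (void *)c;
    vecs[2].iov_len  = strlen(c);

    /* The kernel gathers the three buffers in array order. */
    return writev(fd, vecs, 3);
}
```

Calling `gather_write(1, "Writing ", "using ", "writev()!\n")` would emit the same message as the writev program without ever calling write(2).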

These limitations, although obvious and expected from such a simple example rule, are still worth mentioning, as this is a recurring problem for security monitoring tools, often described as a cat-and-mouse game: defenders increase the coverage of their rules, and adversaries look for exceptions and fringe cases.

A less intuitive example of this issue occurs with the sendfile(2) system call, which transfers data between file descriptors; in practice, sendfile(2) behaves like a combination of the read(2) and write(2) system calls.
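Another fringe case follows from the matchArgs selector in the example policy, which compares the first argument with the value 1. Assuming the filter looks only at that numeric argument, a write through a duplicate of the descriptor would not be expected to match. The sketch below has not been tested against Tetragon, and the helper name `write_via_dup` is hypothetical.

```c
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes of msg to the file behind fd, but through a duplicate
 * descriptor, so the number passed to write(2) differs from fd.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_via_dup(int fd, const void *msg, size_t len)
{
    int alias = dup(fd);   /* new descriptor number, same open file */
    ssize_t n;

    if (alias < 0)
        return -1;
    n = write(alias, msg, len);
    close(alias);
    return n;
}
```

With `write_via_dup(1, "demo\n", 5)`, the data still reaches standard output, but the write(2) call carries a descriptor number other than 1.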

sendfile(2) synopsis

#include <sys/sendfile.h>

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

The sendfile(2) system call is used, for example, by BusyBox, a software suite that provides several Unix utilities in a single executable file and is commonly present in container images. If they are not taken into consideration when developing detection policies, equivalent functions, even when not intentionally misused, can still be abused by an adversary to bypass detection.

Executing busybox with strace, we can verify which system calls are being made.
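To illustrate the idea, here is a minimal cat-like routine built only on sendfile(2). It is a sketch and not BusyBox's actual implementation; the function name `cat_with_sendfile` is hypothetical.

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the file at path to out_fd using sendfile(2) instead of a
 * read(2)/write(2) loop. Returns the total bytes copied, or -1 on error. */
ssize_t cat_with_sendfile(const char *path, int out_fd)
{
    ssize_t n, total = 0;
    int in_fd = open(path, O_RDONLY);

    if (in_fd < 0)
        return -1;
    /* A NULL offset uses and advances in_fd's own file position;
     * sendfile returns 0 once the end of the file is reached. */
    while ((n = sendfile(out_fd, in_fd, NULL, 1 << 24)) > 0)
        total += n;
    close(in_fd);
    return n < 0 ? -1 : total;
}
```

Calling `cat_with_sendfile("/etc/hostname", 1)` copies the file to standard output without the program issuing write(2) itself.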

$ strace busybox cat /etc/hostname
[...snip...]
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
sendfile(1, 3, NULL, 16777216lab
)          = 4
sendfile(1, 3, NULL, 16777216)          = 0
close(3)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Execution

$ kubectl exec -it demo -- busybox cat /etc/hostname
demo

Output

🚀 process default/demo /usr/bin/busybox cat /etc/hostname                
💥 exit    default/demo /usr/bin/busybox cat /etc/hostname 0     

Enter the io_uring

Now that we have seen how system call detection can be bypassed using equivalent functions, let's look at a more generic and novel way to bypass system call monitoring, taking advantage of io_uring. This new asynchronous I/O API for Linux was created by Jens Axboe from Facebook to address performance issues with similar interfaces provided by functions like read(2) and write(2), and by other functions that operate on data accessed through sockets and file descriptors.

The purpose of this example is not to bypass detection using another equivalent function, but to observe how the new asynchronous I/O APIs for Linux can present new challenges to security monitoring tools.

io_uring: a high-level overview

The io_uring asynchronous I/O API was introduced in Linux kernel version 5.1 (May 2019) and consists of three system calls: io_uring_setup(2), io_uring_register(2) and io_uring_enter(2). An io_uring instance uses two rings shared between the kernel and the program: a submission queue (SQ) for submitting requests and a completion queue (CQ) that reports the completion of those requests. The rings are created with io_uring_setup() and mapped into the program using two mmap(2) calls.
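Because io_uring depends on the kernel version, and may also be disabled or blocked for a given workload, it can be useful to probe for it before relying on it. The sketch below calls io_uring_setup(2) directly through syscall(2); the function name `io_uring_available` is hypothetical, and the code assumes kernel headers that ship linux/io_uring.h.

```c
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if io_uring_setup(2) succeeds for this process, 0 otherwise
 * (for example on a kernel without io_uring, or when the call is denied). */
int io_uring_available(void)
{
    struct io_uring_params params;
    int ring_fd;

    memset(&params, 0, sizeof(params));
    /* glibc provides no wrapper, so invoke the raw system call,
     * asking for the smallest possible ring (one entry). */
    ring_fd = (int)syscall(__NR_io_uring_setup, 1, &params);
    if (ring_fd < 0)
        return 0;
    close(ring_fd);   /* closing the descriptor tears the ring down */
    return 1;
}
```

A return value of 0 only says that setup failed for this process; errno distinguishes a missing system call from a denied one.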

The program creates one or more SQ entries (SQE) instructing io_uring what asynchronous I/O operation it needs to get done, readv(2) or writev(2) for example, and then updates the SQ tail. The kernel reads the SQEs, and updates the SQ head.

The kernel then creates CQ entries (CQE) for one or more of the completed requests and updates the CQ tail. The program then consumes the CQEs and updates the CQ head. An important note is that completion events can arrive in any order, associated only with the specific SQEs.
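The head and tail bookkeeping described above can be modelled in a few lines of ordinary C. This is a single-threaded conceptual sketch and not the kernel's data layout: the names `ring`, `ring_push` and `ring_pop` are hypothetical, and the memory barriers a real shared ring needs are omitted.

```c
#include <stdbool.h>

#define RING_ENTRIES 8u                 /* a power of two, so masking works */
#define RING_MASK    (RING_ENTRIES - 1u)

/* One producer and one consumer. For the SQ the program produces and the
 * kernel consumes; for the CQ the roles are reversed. */
struct ring {
    unsigned head;                      /* advanced only by the consumer */
    unsigned tail;                      /* advanced only by the producer */
    int entries[RING_ENTRIES];
};

/* Producer: fill the slot at the tail, then publish it by moving the tail. */
bool ring_push(struct ring *r, int value)
{
    if (r->tail - r->head == RING_ENTRIES)
        return false;                   /* ring is full */
    r->entries[r->tail & RING_MASK] = value;
    r->tail++;
    return true;
}

/* Consumer: read the slot at the head, then release it by moving the head. */
bool ring_pop(struct ring *r, int *value)
{
    if (r->head == r->tail)
        return false;                   /* ring is empty */
    *value = r->entries[r->head & RING_MASK];
    r->head++;
    return true;
}
```

The indices only ever grow and are masked on access, so each side can tell full from empty by comparing head and tail.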

The bypass

With this in mind, we can write a small C program to bypass system call monitoring.

#include <liburing.h>

int main()
{
	struct iovec vecs;
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;

	vecs.iov_base = "Writing using io_uring!\n";
	vecs.iov_len = 25;

	io_uring_queue_init(8, &ring, 0);

	sqe = io_uring_get_sqe(&ring);

	io_uring_prep_writev(sqe, 1, &vecs, 1, 0);

	io_uring_submit(&ring);

	io_uring_wait_cqe(&ring, &cqe);

	io_uring_cqe_seen(&ring, cqe);
}

In the program the io_uring_queue_init() function executes the io_uring_setup syscall to initialise the submission and completion queues in the kernel and then maps the resulting file descriptor to memory shared between the program and the kernel.

The io_uring_get_sqe() function gets the next vacant entry from the submission queue belonging to the ring parameter and returns a pointer to that submission queue entry (SQE).

The SQE is then prepared for the IORING_OP_WRITEV operation, which provides an asynchronous interface to the writev(2) system call, using the liburing io_uring_prep_writev() helper function.

The SQE is submitted with a call to io_uring_submit(), which returns the number of submitted SQEs, and our program waits for a completion by calling io_uring_wait_cqe(). Finally, the program calls io_uring_cqe_seen() to inform the kernel that the given CQE has been consumed.

Executing the uwrite program with strace allows us to verify that it is only calling the io_uring_setup(2), mmap(2) and io_uring_enter(2) system calls while still writing the message to standard output.

$ strace ./uwrite
execve("./uwrite", ["./uwrite"], 0x7ffde19a0850 /* 24 vars */) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe7d93b6a0) = -1 EINVAL (Invalid argument)
brk(NULL)                               = 0x1ede000
brk(0x1edf1c0)                          = 0x1edf1c0
arch_prctl(ARCH_SET_FS, 0x1ede880)      = 0
uname({sysname="Linux", nodename="lab", ...}) = 0
readlink("/proc/self/exe", "/home/ubuntu/uwrite", 4096) = 19
brk(0x1f001c0)                          = 0x1f001c0
brk(0x1f01000)                          = 0x1f01000
mprotect(0x4be000, 12288, PROT_READ)    = 0
io_uring_setup(8, {flags=0, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=8, cq_entries=16, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|0x7e0, sq_off={head=0, tail=64, ring_mask=256, ring_entries=264, flags=276, dropped=272, array=576}, cq_off={head=128, tail=192, ring_mask=260, ring_entries=268, overflow=284, cqes=320, resv=[0x118, 0]}}) = 3
mmap(NULL, 608, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 3, 0) = 0x7f961a244000
mmap(NULL, 512, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 3, 0x10000000) = 0x7f961a243000
io_uring_enter(3, 1, 0, 0, NULL, 8Writing using io_uring!
)     = 1
exit_group(0)                           = ?
+++ exited with 0 +++

Running the program in the demo pod confirms that the write operation was not detected.

$ kubectl exec -it demo -- uwrite
Writing using io_uring!

From the output, as expected, Tetragon is unable to detect the write operation.

🚀 process default/demo /usr/bin/uwrite                                   
💥 exit    default/demo /usr/bin/uwrite  0                       

In summary, io_uring is effectively a runtime for processing I/O requests that spawns threads, sets up work queues, and dispatches requests for processing. Asynchronous I/O makes it harder for runtime security enforcement and observability tools to filter, block and react to events: tools must keep track of the process and its submitted requests, which increases the complexity of the rules.

One more thing

After successfully bypassing the write operation detection, there is still one more thing that we can try to bypass: the process calling the write operation. Since security tools also monitor process execution events, a new binary spawning a process, even if not associated with system call activity, is always prone to raise an alert.

To tamper with the process execution context, we will look at Bash builtin commands to execute the write operation directly, without invoking another program.

First, let's rewrite our baseline example so it can be loaded in Bash using the enable command.

#include "/usr/include/bash/builtins.h"

int writeb_builtin_load ()
{
    write(1,"Writing using writeb!\n",23); 
    return (1);
}

struct builtin writeb_struct = {
	"writeb",		/* builtin name */
	NULL,			/* function implementing the builtin */
	0x1,			/* this builtin is enabled. */
	NULL,			/* array of long documentation strings. */
	NULL,			/* usage synopsis; becomes short_doc */
	0			/* reserved for internal use */
};

Loading the shared library shows that Bash builtin commands can be used to run code in the Bash execution context.

$ bash -c 'enable -f ./writeb.so writeb'
Writing using writeb!

From the strace output, we can see that the write(2) system call is made by Bash itself.

$ strace bash -c 'enable -f ./writeb.so writeb'
[...snip...]
write(1, "Writing using writeb!\n\0", 23Writing using writeb!
) = 23
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

This technique allows us to decouple the write operation from the execution context of the process and adds another layer of complexity to the attack.

Running the program in the demo pod shows that the write operation executes normally.

$ kubectl exec -it demo -- bash -c 'enable -f ./writeb writeb'
Writing using writeb!

In Tetragon's output, however, the write operation is now attributed to the Bash process.

🚀 process default/demo /usr/bin/bash -c "enable -f ./writeb writeb"   
📝 write   default/demo /usr/bin/bash  23 bytes                           
💥 exit    default/demo /usr/bin/bash -c "enable -f ./writeb writeb" 0 

Using this technique, we can build an io_uring bypass that runs in the Bash execution context.

#include <liburing.h>
#include "/usr/include/bash/builtins.h"

int uwriteb_builtin_load ()
{
	struct iovec vecs;
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;

	vecs.iov_base = "Writing from bash using io_uring!\n";
	vecs.iov_len = 34;

	io_uring_queue_init(8, &ring, 0);

	sqe = io_uring_get_sqe(&ring);

	io_uring_prep_writev(sqe, 1, &vecs, 1, 0);

	io_uring_submit(&ring);

	io_uring_wait_cqe(&ring, &cqe);

	io_uring_cqe_seen(&ring, cqe);

    return (1);
}

struct builtin uwriteb_struct = {
	"uwriteb",		/* builtin name */
	NULL,			/* function implementing the builtin */
	0x1,			/* this builtin is enabled. */
	NULL,			/* array of long documentation strings. */
	NULL,			/* usage synopsis; becomes short_doc */
	0			/* reserved for internal use */
};

Running the program in the demo pod shows that the write operation executes normally.

$ kubectl exec -it demo -- bash -c 'enable -f ./uwriteb uwriteb'
Writing from bash using io_uring!

In Tetragon's output, the write operation is not detected, and the only events shown belong to the Bash process.

$ kubectl logs -n kube-system ds/tetragon -c export-stdout -f | tetragon observe --namespace default --pod demo
🚀 process default/demo /usr/bin/bash -c "enable -f ./uwriteb uwriteb"     
💥 exit    default/demo /usr/bin/bash -c "enable -f ./uwriteb uwriteb" 0 

Conclusion

This blog post introduced some of the limitations and challenges faced by defensive teams, along with basic techniques used regularly by red teams and adversaries. The techniques and limitations described here are not exclusive to Tetragon; they affect other monitoring solutions that use similar system call detection rules. Although by no means exhaustive, I hope that the techniques presented will inspire both teams to improve and keep the cat-and-mouse game engaging for some time.

by Daniel Teixeira, Lead of Offensive Security