Rethinking the Linux kernel with eBPF

The Linux kernel is not directly programmable by developers outside the core team, because a mistake in kernel code can crash the whole system. Still, there are times when we want a feature that isn’t in the kernel, and waiting for it to be accepted upstream and then shipped by distributions can take years. That caution is generally a good thing: kernel stability and safety are paramount, and new features need a lot of testing before they are introduced.

Kernel modules, which are used to develop device drivers for Linux, are one way to add programmability to the kernel. However, modules still pose a security and stability risk because they run with full kernel privileges. Testing a kernel module from the earliest stages of development is imperative, and even then vulnerabilities in kernel modules can be discovered much later by attackers. A safer approach has its roots in the summer of 1990, when Steven McCanne and Van Jacobson developed BPF, the Berkeley Packet Filter, or, as it was named in their paper dated December 1992, the BSD Packet Filter. BPF was originally built for BSD Unix and was later adopted by Linux, where it made it possible to run small user-supplied programs inside the kernel without recompiling the kernel.

Classic BPF (cBPF)

User space is where normal user processes and applications run. These processes can’t access kernel space directly; they can only request kernel services through system calls, which are limited in scope and privileges. Kernel space is strictly reserved for the privileged operating system kernel, its extensions, and most device drivers. It manages the applications and processes running in user space, provides the system call interface, and provides controlled access to the underlying system hardware. Kernel space and user space each have their own memory address space, and the two don’t overlap. This separation is required to keep the kernel stable and secure.

Classic BPF grew in part out of a restriction imposed by this architecture: network packet filtering had to be done in user space, after copying the packets out of kernel space. This was highly inefficient. Application performance monitoring (APM) tools in user space did allow for network troubleshooting, but they couldn’t provide the fine-grained detail available inside the kernel. Where an APM tool might report network information at intervals of seconds, kernel space can provide it at intervals of milliseconds, close to the origin of the problem, where you can investigate in novel ways. Classic BPF was proposed so that programs could perform stateless packet filtering in kernel space, close to the system hardware. At this stage, though, cBPF had a limited use case.
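To make "stateless packet filtering" concrete, here is a minimal sketch of a classic-BPF-style filter, written as a toy interpreter in Python. It is only a model: the tuple-based instruction encoding and the names run_filter, ACCEPT_IPV4, and ethernet_frame are assumptions made for illustration, and the kernel's real binary instruction format is not reproduced here. The filter itself follows the familiar "accept IPv4 frames" shape: load the EtherType, compare it, and return either a byte count to keep or 0 to drop.

```python
# Toy model of a classic-BPF-style packet filter (illustrative only).
# The program inspects one packet, keeps no state between packets, and
# returns how many bytes of the packet to accept (0 means "drop").

def run_filter(program, packet):
    acc = 0  # accumulator register
    pc = 0   # program counter
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "ldh":
            # Load a big-endian 16-bit half-word from the packet.
            offset = args[0]
            if offset + 2 > len(packet):
                return 0  # read past the end of the packet: drop it
            acc = int.from_bytes(packet[offset:offset + 2], "big")
        elif op == "jeq":
            # Skip forward by jt instructions if equal, else by jf.
            value, jt, jf = args
            pc += jt if acc == value else jf
        elif op == "ret":
            return args[0]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return 0  # fell off the end of the program: drop


# "Accept IPv4": the EtherType field sits at byte offset 12 of an Ethernet header.
ACCEPT_IPV4 = [
    ("ldh", 12),
    ("jeq", 0x0800, 0, 1),
    ("ret", 262144),  # accept, keeping up to 262144 bytes
    ("ret", 0),       # drop
]


def ethernet_frame(ethertype):
    # 6-byte destination MAC + 6-byte source MAC + EtherType + dummy payload.
    return bytes(12) + ethertype.to_bytes(2, "big") + b"payload"
```

Calling run_filter(ACCEPT_IPV4, ethernet_frame(0x0800)) accepts the frame, while a frame with any other EtherType is dropped. The point of running such a filter in kernel space is that rejected packets never have to be copied to user space at all.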

The second problem cBPF aimed to solve was the trust issue with kernel modules. Classic BPF, as it was initially conceived, avoided this issue by making BPF programs highly restricted in what they were allowed to do. These restrictions, together with a rudimentary in-kernel verifier that statically analyzed each program before it was loaded, prevented programs from operating beyond their intended scope. Programs were written for a small virtual machine with its own assembly-like instruction set rather than in a general-purpose language; only two 32-bit registers were available (an accumulator and an index register); instructions could jump forward but never backward, so loops were impossible; there was an upper bound on the instruction count; memory was limited to a small fixed-size scratch area, with no stack; and there were no global variables and no floating-point numbers. As you can see, it was a very restrictive environment, and for good reason. Because of these limitations, and with the advent of modern 64-bit computer architectures, Classic BPF gave way to Enhanced BPF (eBPF).
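The kind of static check described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the kernel's verifier: the tuple-based instruction format, the function name find_violation, and the three opcodes are hypothetical, and MAX_INSNS is simply an assumed bound for the sketch. It shows the two key rules: the program has a bounded length, and every jump goes strictly forward to an instruction that exists.

```python
# Toy static checker for a classic-BPF-style program (illustrative only).
# Instructions are tuples: ("ldh", offset), ("jeq", value, jt, jf), ("ret", k),
# where jt and jf are relative forward offsets from the next instruction.

MAX_INSNS = 4096  # assumed upper bound on program length for this sketch


def find_violation(program):
    """Return None if the program passes, else a description of the first problem."""
    if not program:
        return "empty program"
    if len(program) > MAX_INSNS:
        return "too many instructions"
    for pc, (op, *args) in enumerate(program):
        if op == "ldh":
            if args[0] < 0:
                return f"instruction {pc}: negative packet offset"
        elif op == "jeq":
            _value, jt, jf = args
            for off in (jt, jf):
                if off < 0:
                    return f"instruction {pc}: backward jump"
                if pc + 1 + off >= len(program):
                    return f"instruction {pc}: jump out of range"
        elif op == "ret":
            pass  # always allowed
        else:
            return f"instruction {pc}: unknown opcode {op!r}"
    if program[-1][0] != "ret":
        return "last instruction must be ret"
    return None
```

Because every jump moves the program counter forward and the length is bounded, a program that passes this check can execute at most len(program) instructions, so it always terminates. That guarantee is obtained without ever running the program, which is what "statically analyzed" means here.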

Enhanced BPF (eBPF)

In December of 2014, Alexei Starovoitov and Daniel Borkmann enhanced the BPF virtual machine within the Linux kernel. The enhancements included ten 64-bit registers, fall-through jumps, key-value maps for sharing data between user space and kernel space, a 512-byte stack, an instruction set that makes it easier to JIT-compile BPF bytecode into platform-specific machine instructions, helper functions, a new bpf() system call, tail calls, an improved interpreter, and an improved verifier that checks whether eBPF programs are secure and stable.

eBPF programs are similar to kernel modules. Unlike kernel modules, however, eBPF programs don’t require you to recompile your kernel, and the verifier is there to guarantee that they complete without crashing the kernel. Fundamentally, eBPF allows programs supplied from user space to run in kernel space. Programs are loaded by a user process and automatically unloaded when that process exits. Each eBPF program is a safe, run-to-completion set of instructions. The eBPF verifier statically determines that the program terminates and is safe to execute. During verification, the program takes a reference on the maps it intends to use, so those maps cannot be removed until the program is unloaded. A program can be attached to different events. These events can be packets, tracepoint events, and other types in the future. A new event triggers the execution of the program, which may store information about the event in the maps. Beyond storing data, programs may call in-kernel helper functions, which can dump the stack, call trace_printk, or perform other forms of live kernel debugging. The same program can be attached to multiple events, and different programs can access the same map.
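The lifecycle described above (load, attach to events, store data in a map, unload) can be modeled with a small Python simulation. Everything here is hypothetical and exists only for illustration: ToyKernel, Program, count_by_pid, and demo are invented names, and nothing in this sketch touches the real bpf() system call or a real BPF library. It only mirrors the relationships the paragraph describes.

```python
from collections import defaultdict


class Program:
    """A loaded toy program: a function plus the names of the maps it uses."""

    def __init__(self, func, uses):
        self.func = func
        self.uses = tuple(uses)


class ToyKernel:
    """Illustrative stand-in for the kernel side; not a real eBPF API."""

    def __init__(self):
        self.maps = {}                    # map name -> key-value store
        self.map_refs = defaultdict(int)  # map name -> loaded programs using it
        self.hooks = defaultdict(list)    # event name -> attached programs

    def create_map(self, name):
        self.maps[name] = {}

    def load(self, func, uses):
        # At load time the program takes a reference on each map it declares.
        for name in uses:
            if name not in self.maps:
                raise KeyError(f"unknown map: {name}")
            self.map_refs[name] += 1
        return Program(func, uses)

    def attach(self, prog, event):
        self.hooks[event].append(prog)

    def fire(self, event, ctx):
        # A new event triggers every program attached to it.
        for prog in list(self.hooks[event]):
            prog.func(self.maps, ctx)

    def unload(self, prog):
        for attached in self.hooks.values():
            if prog in attached:
                attached.remove(prog)
        for name in prog.uses:
            self.map_refs[name] -= 1

    def delete_map(self, name):
        # A map cannot be removed while a loaded program still references it.
        if self.map_refs[name] > 0:
            return False
        self.maps.pop(name, None)
        return True


def count_by_pid(maps, ctx):
    # The "program": record how many events each process ID has produced.
    counts = maps["counts"]
    counts[ctx["pid"]] = counts.get(ctx["pid"], 0) + 1


def demo():
    kernel = ToyKernel()
    kernel.create_map("counts")
    prog = kernel.load(count_by_pid, uses=["counts"])
    kernel.attach(prog, "packet")      # the same program on two events
    kernel.attach(prog, "tracepoint")
    kernel.fire("packet", {"pid": 42})
    kernel.fire("tracepoint", {"pid": 42})
    kernel.fire("packet", {"pid": 7})
    blocked = kernel.delete_map("counts")    # refused: program still loaded
    snapshot = dict(kernel.maps["counts"])   # "user space" reads the map
    kernel.unload(prog)
    removed = kernel.delete_map("counts")    # allowed after unload
    return snapshot, blocked, removed
```

In this model, demo() shows one program attached to two events, both writing into the same map, with the map protected from removal until the program is unloaded. A real deployment would add the verification step before loading and would read the map from user space through a BPF library rather than a shared Python dictionary.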

These enhancements have allowed eBPF to outgrow its original packet-filtering purpose. It is now used for kernel tracing, performance tuning, event monitoring, software-defined load balancing, firewalling, DDoS mitigation, cloud-native observability, stateful processing, and dynamic interaction with user space applications, among other things. The interesting bit is that eBPF allows developers to safely and efficiently embed eBPF programs in any piece of software, not just the Linux kernel. It is used in these ways at companies including Meta, Google, Microsoft, Isovalent (creators of Cilium), Netflix, Cloudflare, and others. Today, we can refer to eBPF as just BPF, since it has become the default, even though cBPF is still used in tools like tcpdump.

BPF to the Future

What started as the hard work of a few kernel developers has grown to more than 200 contributors. There is still more to improve in this technology. As recently as 2020, a zero-day exploit took advantage of a bug in Linux’s BPF range analysis to escalate a process’s privileges to root. Even so, the technology is being adopted readily because it gives developers a dynamic way to extend the capabilities of the Linux kernel. As of now, BPF programs that run in kernel space are written in low-level languages such as C and Rust and compiled to BPF bytecode with a compiler such as Clang/LLVM or GCC. In user space, we can create BPF tools in high-level languages like Java, Python, Go, and Lua by using BPF libraries. This opens up BPF to many more developers. Microsoft is even working on bringing BPF to Windows. You will be hearing a lot about BPF and tools based on BPF in the future as it continues to mature.