Zero to Performance Hero: How to Benchmark and Profile Your eBPF Code in Rust


The silent eBPF revolution is well underway. Extended Berkeley Packet Filter (eBPF) is used across the cloud-native world to power networking, observability, and security tooling. eBPF is a virtual machine within the Linux kernel that allows for extending the kernel’s functionality safely and maintainably. As more logic moves into the kernel, ensuring systems stay performant is crucial.

Profiling eBPF Code

Profiling eBPF code helps developers identify areas needing performance optimizations. Different profiling techniques highlight various areas of interest, helping pinpoint the root cause of performance problems.

Getting Started with eBPF

eBPF allows you to extend the kernel’s functionality without developing a kernel module. It ensures safety by verifying code at load time. eBPF bytecode is loaded into the eBPF virtual machine and executed within the kernel to perform tasks like tracing syscalls, probing user or kernel space, capturing perf events, instrumenting Linux Security Modules (LSM), and filtering packets.

Building an eBPF Profiler

We will create a basic eBPF sampling profiler in Rust using Aya. This profiler will periodically capture a snapshot of the target application’s stack.

Setting Up the Development Environment

First, set up your Aya development environment and create a new project called profiler.
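If you have not used Aya before, the prerequisites and project scaffolding look roughly like this (a sketch based on the Aya template; exact prompts, crate names, and versions may differ):

# Toolchain and tools an Aya project relies on.
rustup toolchain install nightly --component rust-src   # the eBPF crate builds on nightly
cargo install bpf-linker                                 # links the eBPF object code
cargo install cargo-generate                             # scaffolds projects from templates

# Scaffold a new Aya workspace named `profiler` from the official template.
cargo generate --name profiler https://github.com/aya-rs/aya-template

With the workspace in place, the eBPF program itself (the profiler-ebpf crate in the template layout) looks like this: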

// In eBPF, we can’t use the Rust standard library.
#![no_std]
// The kernel calls our `perf_event`, so there is no `main` function.
#![no_main]

use aya_ebpf::{
    helpers::gen::{bpf_get_stack, bpf_ktime_get_ns},
    macros::{map, perf_event},
    maps::ring_buf::RingBuf,
    programs::PerfEventContext,
    EbpfContext,
};
use profiler_common::{Sample, SampleHeader};

// Create a global variable that will be set by user space.
#[no_mangle]
static PID: u32 = 0;

// Use the Aya `map` procedural macro to create a ring buffer eBPF map.
#[map]
static SAMPLES: RingBuf = RingBuf::with_byte_size(4_096 * 4_096, 0);

#[perf_event]
pub fn perf_profiler(ctx: PerfEventContext) -> u32 {
    let Some(mut sample) = SAMPLES.reserve::<Sample>(0) else {
        aya_log_ebpf::error!(&ctx, "Failed to reserve sample.");
        return 0;
    };

    unsafe {
        let stack_len = bpf_get_stack(
            ctx.as_ptr(),
            sample.as_mut_ptr().byte_add(SampleHeader::SIZE) as *mut core::ffi::c_void,
            Sample::STACK_SIZE as u32,
            aya_ebpf::bindings::BPF_F_USER_STACK as u64,
        );

        let Ok(stack_len) = u64::try_from(stack_len) else {
            aya_log_ebpf::error!(&ctx, "Failed to get stack.");
            sample.discard(aya_ebpf::bindings::BPF_RB_NO_WAKEUP as u64);
            return 0;
        };

        core::ptr::write_unaligned(
            sample.as_mut_ptr() as *mut SampleHeader,
            SampleHeader {
                ktime: bpf_ktime_get_ns(),
                pid: ctx.tgid(),
                tid: ctx.pid(),
                stack_len,
            },
        )
    }

    sample.submit(0);
    0
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    unsafe { core::hint::unreachable_unchecked() }
}  
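The eBPF program and the user-space loader exchange data through the Sample and SampleHeader types from a shared profiler_common crate. That crate is not shown in this guide; a minimal sketch of what it could contain, inferred from the fields used above (the exact layout and the STACK_SIZE value are assumptions), is:

// profiler-common/src/lib.rs: a hypothetical sketch of the shared types.
#![no_std]

/// Maximum number of raw stack bytes captured per sample (an assumed value).
const MAX_STACK_BYTES: usize = 1024;

/// Metadata written at the front of every sample.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct SampleHeader {
    pub ktime: u64,
    pub pid: u32,
    pub tid: u32,
    pub stack_len: u64,
}

impl SampleHeader {
    /// Size of the header in bytes; the stack bytes start at this offset.
    pub const SIZE: usize = core::mem::size_of::<SampleHeader>();
}

/// One sample: the header followed by the raw stack bytes.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct Sample {
    pub header: SampleHeader,
    pub stack: [u8; MAX_STACK_BYTES],
}

impl Sample {
    pub const STACK_SIZE: usize = MAX_STACK_BYTES;
}

impl Default for Sample {
    fn default() -> Self {
        Sample {
            header: SampleHeader::default(),
            stack: [0; MAX_STACK_BYTES],
        }
    }
}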

Loading eBPF Code into the Kernel

Next, set up user-space code to load the eBPF program into the kernel.

use aya::{include_bytes_aligned, maps::ring_buf::RingBuf, programs::perf_event, BpfLoader};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    env_logger::init();
    let pid: u32 = std::env::args().last().unwrap().parse()?;

    #[cfg(debug_assertions)]
    let mut bpf = BpfLoader::new()
        .set_global("PID", &pid, true)
        .load(include_bytes_aligned!(
            "../../target/bpfel-unknown-none/debug/profiler"
        ))?;
    #[cfg(not(debug_assertions))]
    let mut bpf = BpfLoader::new()
        .set_global("PID", &pid, true)
        .load(include_bytes_aligned!(
            "../../target/bpfel-unknown-none/release/profiler"
        ))?;
    aya_log::BpfLogger::init(&mut bpf)?;

    let program: &mut perf_event::PerfEvent =
        bpf.program_mut("perf_profiler").unwrap().try_into()?;
    program.load()?;
    program.attach(
        perf_event::PerfTypeId::Software,
        perf_event::perf_sw_ids::PERF_COUNT_SW_CPU_CLOCK as u64,
        perf_event::PerfEventScope::OneProcessAnyCpu { pid },
        perf_event::SamplePolicy::Frequency(100),
        true,
    )?;

    tokio::spawn(async move {
        let samples = RingBuf::try_from(bpf.take_map("SAMPLES").unwrap()).unwrap();
        let mut poll = tokio::io::unix::AsyncFd::new(samples).unwrap();
        loop {
            let mut guard = poll.readable_mut().await.unwrap();
            let ring_buf = guard.get_inner_mut();
            while let Some(sample) = ring_buf.next() {
                log::info!("{sample:?}");
            }
            guard.clear_ready();
        }
    });

    tokio::signal::ctrl_c().await?;
    Ok(())
}  

Profiling the Profiler

Users of our profiler report sluggishness. Let’s use sampling and instrumenting profilers to pinpoint the issue.

Sampling Profiler

A sampling profiler periodically records the call stack of the running program. Install cargo-flamegraph, which drives perf to sample the user-space profiler and renders the collected stacks as a flame graph; the widest frames in the graph point at the bottleneck.
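A sketch of such a run (the PID 1234 stands in for the application being profiled, and you may need --bin if the workspace has several binaries):

# Install cargo-flamegraph; it wraps `perf record` and produces flamegraph.svg.
cargo install flamegraph

# Run the user-space profiler under perf; trailing arguments go to our binary.
cargo flamegraph -- 1234

# perf usually needs elevated privileges (or a lowered kernel.perf_event_paranoid).
# Open flamegraph.svg and look for suspiciously wide frames.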

Instrumenting Profiler

An instrumenting profiler records every event of interest rather than sampling. Use dhat-rs to measure the heap allocations made by the user-space profiler.

#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();
    ...
}  

Run the profiler with --features dhat-heap and analyze the results.
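Assuming the dhat dependency and a dhat-heap cargo feature are declared in Cargo.toml, as the dhat-rs documentation describes, a profiling run might look like this (the PID is again a placeholder):

# Build and run with the instrumenting heap profiler compiled in.
cargo run --release --features dhat-heap -- 1234

# On exit, dhat prints a short summary and writes dhat-heap.json, which can be
# opened in DHAT's viewer (dh_view.html) to see which call sites allocate the
# most bytes and how often.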

Benchmarking the Profiler

Use Criterion to benchmark the process_sample function, which handles each sample that user space reads from the ring buffer.

pub fn process_sample(sample: profiler_common::Sample) -> anyhow::Result<()> {
    // Don't look at me!
    let _oops = Box::new(std::thread::sleep(std::time::Duration::from_millis(
        u64::from(chrono::Utc::now().timestamp_subsec_millis()),
    )));
    log::info!("{sample:?}");
    Ok(())
}  

Add benchmarks using Criterion.

fn bench_process_sample(c: &mut criterion::Criterion) {
    c.bench_function("process_sample", |b| {
        b.iter(|| {
            profiler::process_sample(profiler_common::Sample::default()).unwrap();
        })
    });
}

criterion::criterion_group!(benchmark_profiler, bench_process_sample);
criterion::criterion_main!(benchmark_profiler);

Run the benchmarks with cargo bench.
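With Criterion, the bench target must be registered in Cargo.toml with harness = false so that Criterion's own runner takes over; a typical invocation then looks like this (the name filter is optional):

# Run every benchmark, or filter down to the one we care about.
cargo bench
cargo bench -- process_sample

Criterion reports the measured time for process_sample and, on subsequent runs, whether it changed relative to the previous baseline.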

Continuous Benchmarking

Implement continuous benchmarking using Bencher to catch performance regressions in CI.

bencher run \
    --project simple-profiler \
    --token $BENCHER_API_TOKEN \
    cargo bench  

Bencher tracks results over time and across dimensions such as branch and testbed, so regressions stand out before they are merged.

Conclusion

eBPF allows adding custom capabilities to the Linux kernel. Using Rust and Aya, we built a simple profiler, identified performance regressions using sampling and instrumenting profilers, and verified optimizations with benchmarks. Continuous benchmarking ensures performance regressions are caught before merging changes.

By following these steps, you can ensure your eBPF programs remain performant and maintainable.

All the source code for this guide is available on GitHub.
