Analysis of eBPF for Safety

Agenda

  1. BPF
    1. Introduction
    2. Practical use of BPF
      1. Observability
        • BPF front-end tool
        • BCC tools and programming
        • BPFtrace
      2. Security
        • BPF verifier
      3. Networking
        • XDP
    3. Work done during mentorship
      1. XDP tools
      2. Accepted patches in the Linux Kernel
      3. Ongoing
    4. Attempted but not accepted
      1. BPF current issue proposed by Daniel Borkmann
    5. How to use eBPF for safety
    6. Mentorship wrap up

    Introduction

    The Berkeley Packet Filter (BPF) is a technology developed in 1992. BPF brought two innovations to packet-filtering technology:

    • A virtual machine that works efficiently on register-based CPUs
    • The use of per-application buffers that can filter packets without copying all of the packet information.

    BPF then gained popularity after massively improving the performance of the packet-capture tools of the time (tcpdump).

    In 2013, BPF was extended and optimised for modern machines. This version became known as eBPF and has been in constant development since then. The number of registers in the BPF VM was increased from two 32-bit registers to ten 64-bit registers, making it possible to write more complex BPF programs. This extended version of BPF also added JIT support, which increased performance by four times.

    This new version turned BPF into a general-purpose execution engine that can be used for a variety of use cases, ranging from security and networking to observability, to name but a few.

    What is BPF?

    BPF can be difficult to define because of its wide range of use cases.

    Alexei Starovoitov, the creator of the new version, defines BPF simply as an instruction set, a new language, an extension to C, or a safer C. Any programming language can be compiled into BPF, he added.

    eBPF implements a dedicated virtual machine with a custom interpreter. User space can attach programs at various tracepoints within the kernel to perform a wide variety of tasks: tracing, monitoring, and debugging.

    BPF can be considered a virtual machine: it has an in-kernel execution engine that processes its virtual instruction set.

    The technology is composed of an instruction set, storage objects (maps), and helper functions.

    Having started as a simple language for writing packet-filtering code for utilities like tcpdump (McCanne 92), BPF grew into a general-purpose execution engine that can be used for a variety of things, including the creation of advanced performance-analysis tools.

    With BPF, we can run mini programs on a wide variety of kernel and application events.

    An eBPF program is attached to a designated code path in the kernel. When that code path is traversed, any attached eBPF programs are executed.
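
    As a minimal sketch of this attach-and-execute model (using the BCC Python front-end covered later; the choice of the clone(2) probe is illustrative, not part of the original notes), the program below attaches to a kernel code path and runs each time that path is traversed:

      #!/usr/bin/env python3
      # Minimal sketch: attach a BPF program to a kernel code path (a kprobe
      # on the clone(2) syscall) so it runs whenever that path is traversed.
      # Assumes the BCC toolkit is installed; run as root.
      from bcc import BPF

      # Kernel-side program, written in restricted C.
      prog = r"""
      int hello(void *ctx) {
          bpf_trace_printk("clone() called\n");
          return 0;
      }
      """

      b = BPF(text=prog)
      b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
      b.trace_print()  # stream the kernel trace output until Ctrl-C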

    The main uses of BPF are networking, observability (tracing), and security.

    In this introduction, we will focus on these main uses of the BPF subsystem.

    Practical Use

    1. Observability

      Observability is the understanding of a system through observation, using tracing tools, sampling tools, and tools based on fixed counters. These tools are written using BPF.

      BPF is an event-driven programming model that provides observability (tracing): tools built on it can give extra information that common system-administration tools do not provide.

      • BCC

        The BPF Compiler Collection (BCC) is a higher-level tracing framework developed for BPF. The framework provides a C programming environment for writing kernel BPF code and other languages (Python, Lua, C++) for the user-level interface.
        1. bcc tools

          The BCC repository has more than 70 BPF tools for performance analysis. We will go through 12 of them.
          1. execsnoop

            # execsnoop
            PCOMM            PID    PPID   RET ARGS
            dhcpcd-run-hook  29407  2642     0 /lib/dhcpcd/dhcpcd-run-hooks
            sed              29410  29409    0 /bin/sed -n s/^domain //p wlan0.dhcp
            cmp              29417  29407    0 /usr/bin/cmp -s /etc/resolv.conf ../resolv.conf.wlan0.ra
                     qemu-system-x86  29422  27546    0 /usr/bin/qemu-system-x86_64 -m 4096 -smp 8 ... -snapshot 
            
            execsnoop works by tracing the execve(2) system call and reveals processes that may be so short-lived that they are invisible to other tools such as ps(1).
          2. opensnoop

            # opensnoop -T
            	      TIME(s)       PID    COMM               FD ERR PATH
            	      0.000000000   11552  baloo_file_extr    20   0 /home/jules/../linux/../unistd_32.h
            	      0.000433000   11552  baloo_file_extr    20   0 /home/jules/../linux/../unistd_64.h
            	      0.000764000   11552  baloo_file_extr    20   0 /home/jules/../linux/../unistd_x32.h
            	      0.001084000   11552  baloo_file_extr    20   0 /home/jules/../linux/../syscalls_32.h
            	      0.001391000   11552  baloo_file_extr    20   0 /home/jules/../linux/../unistd_32_ia32.h
            	      0.001685000   11552  baloo_file_extr    20   0 /home/jules/../linux/../unistd_64_x32.h
            	      0.079771000   3486   qemu-system-x86    23   0 /etc/resolv.conf
            	      0.422395000   11858  Chrome_IOThread   389   0 /dev/shm/.com.google.Chrome.ct746O
            	      
            This debugging tool prints one line of output for each open(2) system call and its variants. opensnoop can be used to troubleshoot failing software that may be attempting to open files from the wrong path, as well as to determine where configuration and log files are kept.
          3. ext4slower

            # ext4slower
                   Tracing ext4 operations slower than 10 ms
                   TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
                   22:16:08 baloo_file_ext 4458   S 0       0         125.20 index
                   22:16:12 baloo_file_ext 4458   S 0       0         134.65 index
                   22:16:16 baloo_file_ext 4458   S 0       0         151.65 index
                   22:16:20 baloo_file_ext 4458   S 0       0         172.81 index
            	     22:16:25 baloo_file_ext 4458   W 60678144 5098540    11.48 index
            	     
            This tool traces common operations of the ext4 file system (reads, writes, opens, syncs) and prints those that exceed a time threshold.
          4. biolatency

            # biolatency
            	      Tracing block device I/O... Hit Ctrl-C to end.
            	      usecs               : count     distribution
            	      0 -> 1          : 0        |                                        |
            	      2 -> 3          : 0        |                                        |
            	      4 -> 7          : 3        |                                        |
            	      8 -> 15         : 115      |**************                          |
            	      16 -> 31         : 49       |******                                  |
            	      32 -> 63         : 36       |****                                    |
            	      64 -> 127        : 1        |                                        |
            	      128 -> 255        : 286      |************************************    |
                    256 -> 511        : 160      |********************                    |
                    512 -> 1023       : 315      |****************************************|
                    1024 -> 2047       : 21       |**                                      |
            	      2048 -> 4095       : 1        |                                        |
            	      
            biolatency traces disk I/O latency and shows the results as a histogram. Latency here refers to the time from device issue to completion. The tool gives better performance information than iostat(1).
          5. biosnoop

            # biosnoop
            	      TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)
            	      0.000000    kworker/23:1   9126           R 18446744073709551615 0         0.61
            	      1.774198    ThreadPoolFore 5270   nvme0n1 W 520198144  225280    0.48
            	      1.774381    jbd2/nvme0n1p3 686    nvme0n1 W 490161296  65536     0.03
            	      1.774609    ?              0              R 0          0         0.21
            	      1.774809    jbd2/nvme0n1p3 686    nvme0n1 W 490161424  4096      0.19
            	      2.069546    kworker/23:1   9126           R 18446744073709551615 0         0.17
            	      2.159061    ?              0              R 0          0         0.24
            	      2.159129    ThreadPoolFore 5270   nvme0n1 W 777702184  4096      0.01
            	      2.159341    ?              0              R 0          0         0.20
            	      2.159387    ThreadPoolFore 5270   nvme0n1 W 15221256   8192      0.01
            	      2.159598    ?              0              R 0          0         0.20
            	      2.159713    jbd2/nvme0n1p3 686    nvme0n1 W 490161432  53248     0.02
            The tool prints a line of output for each disk I/O, with details including latency.
          6. tcpconnect

            # tcpconnect
            	      Tracing connect ... Hit Ctrl-C to end
            	      PID    COMM         IP SADDR            DADDR            DPORT
            	      4909   Chrome_Child 4  192.168.1.245    40.74.98.194     443
            	      4909   Chrome_Child 4  192.168.1.245    40.74.98.194     443
            	      5564   Chrome_Child 4  192.168.1.245    172.217.16.238   443
            	      4909   Chrome_Child 4  192.168.1.245    52.97.208.18     443
            	      5564   Chrome_Child 4  192.168.1.245    142.250.200.14   443
            	      5564   Chrome_Child 4  192.168.1.245    35.206.151.171   443
            	      4909   Chrome_Child 4  192.168.1.245    52.113.205.5     443
            	      5564   Chrome_Child 4  192.168.1.245    34.131.36.146    443
            	      4909   Chrome_Child 4  192.168.1.245    13.89.179.10     443
            	      5564   Chrome_Child 4  192.168.1.245    142.250.179.229  443
            	      
            tcpconnect displays one line of output for every active TCP connection (via connect(2)).
          7. tcpretrans

            # tcpretrans
            Tracing retransmits ... Hit Ctrl-C to end
            TIME     PID    IP LADDR:LPORT          T> RADDR:RPORT          STATE
             22:36:32 0  4  192.168.1.245:42072  R> 13.33.52.19:443  ESTABLISHED
             22:39:50 0  4  192.168.1.245:59090  R> 142.250.179.229:443  ESTABLISHED
            22:39:50 0   4  192.168.1.245:59070  R> 142.250.179.229:443  ESTABLISHED
            22:39:51 1372 4  192.168.1.245:59090  R> 142.250.179.229:443  ESTABLISHED
            22:39:51 1372 4  192.168.1.245:59092  R> 142.250.179.229:443  ESTABLISHED
             
            This tool uses dynamic tracing of the kernel tcp_retransmit_skb() and tcp_send_loss_probe() functions to trace only TCP retransmits, showing address, port, PID, and TCP state information.
          8. dcsnoop

            # dcsnoop
              TIME(s)     PID    COMM             T FILE
              8.893741    29295  sadc             M dev
              8.893782    29295  sadc             M dev
              8.893813    29295  sadc             M dev
              8.894006    29295  sadc             M nfs
              8.894028    29295  sadc             M nfsd
              8.894041    29295  sadc             M sockstat
              8.894053    29295  sadc             M softnet_stat
              13.240580   3743   ThreadPoolForeg  M todelete_e3e954c5761fd557_0_1
              13.240988   3557   Chrome_IOThread  M .org.chromium.Chromium.PPeyk5
              13.243443   3743   ThreadPoolForeg  M e3e954c5761fd557_0
              21.747442   29303  dhcpcd-run-hook  M resolv.conf.wlan0.ra
              21.747854   29313  cmp              M maps
              21.748626   29315  rm               M resolv.conf.wlan0.ra
              21.750007   2470   dhcpcd           M if_inet6 
            The tool traces every dcache lookup, and shows the process performing the lookup and the filename requested.
          9. cachestat

            # cachestat
              HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
               16        1        1   94.12%         1312       3249
                0        0        0    0.00%         1312       3249
               34        3       15   91.89%         1312       3249
                0        0        0    0.00%         1312       3249
               14        3        5   82.35%         1312       3249
              407        0       80  100.00%         1312       3249
                 0        0        0    0.00%         1312       3249
                 0        0        0    0.00%         1312       3249
                 0        0       19    0.00%         1312       3249
                 0        0        0    0.00%         1312       3249
              9743        0      136  100.00%         1312       3249
                 0        0        3    0.00%         1312       3249
                 0        0        0    0.00%         1312       3249
                 5        0        0  100.00%         1312       3249
               0        0        0    0.00%         1312       3249
            cachestat prints a one-line summary every second (or at a custom interval) showing statistics from the file system cache.
          10. trace

             # trace 'do_nanosleep(struct hrtimer_sleeper *t) "task: %x", t->task'
             PID     TID     COMM            FUNC             -
             3437    3489    teams           do_nanosleep  task: d4588000
             2511    2815    pool-gsd-smartc do_nanosleep  task: 6b248000
             18685   18693   nautilus        do_nanosleep  task: 21b55200
             112985  113009  vqueue:src      do_nanosleep  task: 328b0000
             3437    3489    teams           do_nanosleep  task: d4588000
             
            trace is a multi-tool for per-event tracing from many different sources: kprobes, uprobes, tracepoints, and USDT probes. It is used to inspect the arguments of a kernel- or user-level function call, the return value of a function, whether a function is failing, how a function is called, and the user- or kernel-level stack trace. The tool is suited to infrequently called events; if used for frequently occurring events, trace would produce so much output that it would cost significant overhead to instrument. To reduce the overhead, it is advised to use a filter expression to print only the events of interest.
          11. funccount

             # funccount 'tcp_send*'
             Tracing 16 functions for "b'tcp_send*'"... Hit Ctrl-C to end.
             ^C
                   FUNC                                    COUNT
                   b'tcp_send_probe0'                          5
                   b'tcp_send_active_reset'                    8
                   b'tcp_send_loss_probe'                      9
                   b'tcp_send_dupack'                         18
                   b'tcp_send_fin'                            58
                   b'tcp_send_mss'                          2594
                   b'tcp_sendmsg_locked'                    2595
                   b'tcp_sendmsg'                           2595
                   b'tcp_send_delayed_ack'                  2778
                   b'tcp_send_ack'                          3723
                   Detaching...
                   
            funccount counts events, specifically function calls, and answers queries about whether a specific kernel- or user-level function is being called and at what rate it is being called (calls per second).
          12. stackcount

            The tool counts stack traces that led to an event. The event can be a kernel- or user-level function, a tracepoint, or a USDT probe. stackcount answers queries such as:
            1. Why is this event called? What is the code path?
            2. What are all the different code paths that call this event, and what are their frequencies?
            stackcount performs the summary entirely in kernel context.

            # sudo stackcount t:sched:sched_switch
              b'__sched_text_start'
              b'__sched_text_start'
              b'schedule'
              b'__down_write_common'
              b'down_write_killable'
              b'mmap_write_lock_killable'
              b'__vm_munmap'
              b'__x64_sys_munmap'
              b'do_syscall_64'
              b'entry_SYSCALL_64_after_hwframe'
                1
        2. bcc programming
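
          As a brief, hedged sketch of BCC programming (the traced function and map layout are illustrative choices): the kernel side is written in C and counts events into a BPF map, while the Python user side loads the program, attaches it, and reads the map:

            #!/usr/bin/env python3
            # Sketch of BCC programming: kernel-side BPF code in C, user-side
            # code in Python. Counts vfs_read() calls per process in a hash map.
            # Assumes the BCC toolkit is installed; run as root.
            from time import sleep
            from bcc import BPF

            prog = r"""
            BPF_HASH(counts, u32, u64);     // map: PID -> call count

            int count_reads(struct pt_regs *ctx) {
                u32 pid = bpf_get_current_pid_tgid() >> 32;
                counts.increment(pid);      // BCC rewrites this into map helpers
                return 0;
            }
            """

            b = BPF(text=prog)
            b.attach_kprobe(event="vfs_read", fn_name="count_reads")

            sleep(5)  # let the kernel side collect data

            # Read the map from user space and print per-PID counts.
            for pid, count in sorted(b["counts"].items(), key=lambda kv: kv[1].value):
                print("PID %d: %d vfs_read() calls" % (pid.value, count.value))
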
      • bpftrace

        bpftrace is an open-source tracer built on BPF and BCC. bpftrace provides a high-level programming language that allows you to create powerful one-liners and short tools for performance analysis. Below is an example that summarizes the vfs_read() return value (byte count or error value) as a histogram using bpftrace:
           
          $ sudo bpftrace  -e 'kretprobe:vfs_read { @bytes = hist(retval); }'
          Attaching 1 probe...
          ^C
          @bytes:
          (..., 0)              78 |                                                    |
          [0]                   56 |                                                    |
          [1]                  492 |@@@@@                                               |
          [2, 4)                54 |                                                    |
          [4, 8)              2395 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
          [8, 16)             1414 |@@@@@@@@@@@@@@@                                     |
          [16, 32)              30 |                                                    |
          [32, 64)             236 |@@                                                  |
          [64, 128)             12 |                                                    |
          [128, 256)             4 |                                                    |
          [256, 512)             6 |                                                    |
          [512, 1K)              6 |                                                    |
          [1K, 2K)             311 |@@@                                                 |
          [2K, 4K)               0 |                                                    |
          [4K, 8K)               0 |                                                    |
          [8K, 16K)              0 |                                                    |
          [16K, 32K)          4656 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
          [32K, 64K)             0 |                                                    |
          [64K, 128K)            1 |                                                    |
          [128K, 256K)           0 |                                                    |
          [256K, 512K)           1 |                                                    |
          

          bpftrace being a programming language, a hello world is written as:

          $ sudo bpftrace -e 'BEGIN { printf("Hello, World!\n"); }'
          Attaching 1 probe...
          Hello, World!
          
    2. XDP

      XDP (eXpress Data Path) is an eBPF-based high-performance data path used to send and receive network packets at high rates by bypassing most of the operating system networking stack.

      In XDP, a BPF hook is added early in the RX path of the kernel, enabling the user-supplied BPF program to decide the fate of the packet. With XDP, code is executed very early, as soon as a network packet arrives at the kernel.

      Unsurprisingly, XDP programs are controlled through the bpf(2) syscall and loaded using the program type BPF_PROG_TYPE_XDP.
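
      As a small, hedged sketch of loading and attaching such a program (written with the BCC front-end for brevity; the interface name and the pass-and-count logic are illustrative):

        #!/usr/bin/env python3
        # Sketch: load a program of type BPF_PROG_TYPE_XDP with BCC, attach it
        # to a NIC, count packets, and let them all pass.
        # Assumes the BCC toolkit is installed; run as root.
        from bcc import BPF

        prog = r"""
        #include <uapi/linux/bpf.h>

        BPF_ARRAY(pkt_count, u64, 1);

        int xdp_count(struct xdp_md *ctx) {
            u32 key = 0;
            u64 *value = pkt_count.lookup(&key);
            if (value)                      // NULL check demanded by the verifier
                __sync_fetch_and_add(value, 1);
            return XDP_PASS;                // let the packet continue up the stack
        }
        """

        device = "eth0"                     # illustrative; use a real interface
        b = BPF(text=prog)
        fn = b.load_func("xdp_count", BPF.XDP)
        b.attach_xdp(device, fn, 0)

        try:
            input("XDP program attached; press Enter to detach...")
        finally:
            b.remove_xdp(device, 0)
            print("packets seen:", sum(v.value for v in b["pkt_count"].values()))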

      The execution of an XDP program can happen in one of three modes:

      1. Native XDP: the XDP BPF program is run directly out of the networking driver's early receive path. Most drivers support XDP. The command below (run in a kernel source tree) checks which drivers support it:

         # git grep -l XDP_SETUP_PROG drivers/
         
      2. Offloaded XDP

        The XDP BPF program is offloaded directly into the NIC instead of being executed by the host CPU. The command below checks which drivers support offloaded XDP:

         # git grep -l XDP_SETUP_PROG_HW drivers/
          
      3. Generic XDP

      This is a test mode for developers: the XDP program is run at a generic hook in the networking stack and can be loaded on virtualised devices such as veth.

    3. eBPF Verifier

      The verifier is a mechanism that determines the safety of an eBPF program and only allows the execution of programs that pass its safety checks.

      The checks are done in two steps:

      1. Directed Acyclic Graph (DAG) check: the verifier checks whether the program will terminate (acyclic), ensuring that the program does not have any backward branches, as its control flow must form a directed acyclic graph, though the program can branch forward to the same point. A program with unreachable instructions will not be allowed to run.

      This step is done with a depth-first search of the program's control-flow graph (CFG). The check enforces two eBPF rules:

      a) No back-edges
      b) No unreachable instructions

      2. Simulation check: the verifier simulates the execution of every instruction in the program, starting from the first instruction and trying all possible paths the instructions can lead to, while observing the state changes of registers and the stack, making sure no invalid operations are performed.
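
      To make the simulation check concrete, here is a hedged sketch (via BCC; the program is deliberately unsafe) of a load the verifier rejects: the result of a map lookup is dereferenced without a NULL check, so one possible path performs an invalid memory access:

        #!/usr/bin/env python3
        # Sketch: a deliberately unsafe BPF program that the verifier rejects
        # during its simulation check. The map lookup can return NULL, and the
        # code dereferences it unchecked. Assumes BCC is installed; run as root.
        from bcc import BPF

        unsafe_prog = r"""
        BPF_HASH(m, u32, u64);

        int bad(struct pt_regs *ctx) {
            u32 key = 0;
            u64 *value = m.lookup(&key);
            return *value;   // no NULL check: the verifier refuses the load
        }
        """

        try:
            b = BPF(text=unsafe_prog)
            b.attach_kprobe(event="vfs_read", fn_name="bad")
        except Exception as e:
            # Typically fails with a message like:
            #   R0 invalid mem access 'map_value_or_null'
            print("verifier rejected the program:", e)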

      Workplan

        • BPF current issue proposed by Daniel Borkmann
          • Move samples/bpf to the BPF selftests folder to improve on test_prog BPF CI - currently ongoing, rewriting the Makefile
            • Requires reading on Makefile writing
            • Duration: ongoing - one more week - expected to complete 26th May
            • Progress:
              + moved files and made them compile correctly in the same directory
              + challenges:
                + merging two Makefiles into one
                + adding more tests to BPF CI from samples/bpf

        • Create a bcc tool for network statistics using XDP/BPF technology
          • proposed tool to gather statistics per IP address connected to the network and eventually block an IP if suspicious
          • Duration: two weeks
            • 1 week reading xdp documents and practice
            • 1 week coding
            • expected completion 5th June.
        • Improving the eBPF verifier: work with Wenhui
          • Duration: one month - starts 6th June
        • Any other issues assigned by mentor