Components of OSv
The most extensive description of what OSv is and how it works is in our USENIX ATC 2014 paper: see https://www.usenix.org/conference/atc14/technical-sessions/presentation/kivity for the paper, the slides, and an 18-minute video of me presenting them.
Here I'll also give a brief introduction to OSv's software components. For brevity, I'll omit a lot of details and smaller features, and assume the x86 architecture. I can elaborate on anything that people have questions about. I hope this will be helpful and that I didn't forget anything.
The diagram below shows the major components of OSv across the logical layers. At the top is libc, which is largely based on musl. In the middle is the core layer, comprising the ELF dynamic linker, VFS, networking stack, thread scheduler, page cache, RCU, and memory management components. At the bottom is the layer of clock, block, and networking device drivers that allow OSv to interact with hypervisors such as VMware and VirtualBox, or those based on KVM and Xen.
General-purpose operating systems need to work on thousands of different hardware devices, and thus have millions of lines of driver code. Luckily, OSv only needs to implement drivers for the small number of (virtual) hardware devices presented by the hypervisors we support (KVM, Xen, VMware, and VirtualBox). This includes a minimal set of traditional PC hardware (PCI, IDE, APIC, serial port, keyboard, VGA, HPET, etc.), as well as paravirtual drivers: for KVM we support kvmclock (a paravirtual high-resolution clock much more efficient than HPET) and virtio, in particular virtio-net (for network), virtio-blk (for disk), and virtio-rng (for paravirtual random number seeding). Note that we do not have drivers for physical network cards, so device assignment will not work out-of-the-box with OSv. If we do want to use device assignment of a particular network card, one option is to write such a driver, or port an existing driver (e.g. from FreeBSD). Another option is to use a user-space application framework with support for this card - a prime example of such a framework is DPDK, which contains user-space drivers for several high-end network cards.
OSv's filesystem design is based on the traditional Unix "VFS" abstraction layer (http://en.wikipedia.org/wiki/Virtual_file_system). Below it, we have several toy filesystem implementations (procfs, ramfs, devfs), and one serious filesystem implementation - ZFS. ZFS is a sophisticated (some say, best of breed) file system and volume manager implementation, originating in Solaris. We use it to implement a persistent filesystem on top of the block device or devices given to us (via virtio-blk or similar) by the host.
OSv obviously contains the bootstrapping code, starting with a real-mode boot-loader running on one CPU (like any kernel begins on the antediluvian x86 architecture), and then loading the rest of the OSv kernel into memory (the compressed kernel is uncompressed prior to loading it), setting up all the CPUs (OSv fully supports SMP VMs), etc. The loader ends by reading a "command line" from the disk (set by, e.g., scripts/imgedit.py) which can specify a few kernel options, as well as which application or applications the dynamic linker should load from the filesystem and run.
OSv executes unmodified Linux executables. Currently, we only support dynamically-linked executables (statically-linked executables are not supported). Moreover, "ordinary" non-relocatable executables are not currently supported, so to run an executable on OSv it must be compiled as either a shared object or a position-independent executable (PIE). The dynamic linker maps the executable and its dependent shared libraries to memory (OSv has demand paging, see http://en.wikipedia.org/wiki/Demand_paging), and does the appropriate relocations and symbol resolutions necessary to make the code runnable. ELF TLS (thread-local storage, e.g., gcc's "__thread" or C++11's thread_local) is also fully supported. The ELF dynamic linker is what makes OSv into a "library OS" - there are no "system calls" or system-call-specific overheads: when the application calls read(), the dynamic linker resolves this call to a call to the read() implementation inside the kernel, and it's just a function call. The entire application runs in one address space, and in the kernel privilege level (ring 0).
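To make the "shared object or PIE" requirement concrete, here is a minimal sketch of how one might check a binary before trying to run it. It relies only on the standard ELF header layout: the 16-bit e_type field sits right after the 16-byte e_ident array, with the value 2 (ET_EXEC) for a fixed-address executable and 3 (ET_DYN) for a shared object or PIE. The function names and the synthetic header builder are made up for this illustration and are not part of OSv.

```python
import struct

ET_EXEC = 2  # fixed-address ("ordinary") executable
ET_DYN = 3   # shared object or position-independent executable (PIE)


def elf_type(header: bytes) -> int:
    """Return the e_type field of an ELF header (needs its first 18 bytes)."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[5] (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type


def is_pie_or_shared_object(header: bytes) -> bool:
    """True for ET_DYN files, the kind the text above describes as runnable."""
    return elf_type(header) == ET_DYN


def fake_header(e_type: int) -> bytes:
    """Hypothetical helper: a synthetic little-endian 64-bit header prefix."""
    e_ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)  # 16 bytes in total
    return e_ident + struct.pack("<H", e_type)


if __name__ == "__main__":
    print(is_pie_or_shared_object(fake_header(ET_DYN)))
    print(is_pie_or_shared_object(fake_header(ET_EXEC)))
```

For a real file, pass the first 18 bytes, for example `open(path, "rb").read(18)`. Note that this check cannot tell a PIE from a plain shared library, since both use ET_DYN.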
As mentioned earlier, OSv maintains a single address space, i.e., a single page table, for the kernel and all application threads (i.e., we do not support the notion of processes with separate address spaces). OSv supports, obviously, both malloc() and mmap() memory allocations. For efficiency, malloc() allocations are always backed by huge pages (2 MB pages), while mmap() allocations are also backed by huge pages if large enough. Disk-based mmap() supports demand paging as well as page eviction - these are assumed by most applications using mmap() for disk IO (this is popular on Java especially due to Java's limitations).
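A quick back-of-the-envelope calculation shows why huge pages matter for efficiency: the larger the page, the fewer mappings (and TLB entries) are needed to cover the same memory. The 1 GiB heap size below is an arbitrary figure chosen for illustration, and 4 KiB is assumed as the regular x86-64 page size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

SMALL_PAGE = 4 * KIB  # regular x86-64 page size (assumed baseline)
HUGE_PAGE = 2 * MIB   # huge page size mentioned in the text


def pages_needed(size: int, page_size: int) -> int:
    """Number of pages required to map `size` bytes, rounded up."""
    return -(-size // page_size)


heap = 1 * GIB  # hypothetical allocation size, for illustration only
small = pages_needed(heap, SMALL_PAGE)
huge = pages_needed(heap, HUGE_PAGE)

if __name__ == "__main__":
    print(f"4 KiB pages: {small}")       # 262144
    print(f"2 MiB pages: {huge}")        # 512
    print(f"ratio: {small // huge}x")    # 512x
```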
OSv does not support processes, but definitely does have complete support for threads, as almost all modern applications use them (and besides, what good would an SMP VM do us if we only have one thread?). Our thread scheduler multiplexes N threads on top of M CPUs (N might be, in some cases, much larger than M), and guarantees fairness (competing threads get an equal share of the CPU) and load balancing (threads are moved between CPUs to improve fairness). Thread priorities, real-time threads, and other user-visible features of the Linux scheduler are also supported, but the implementation is quite different from that of Linux. One of the consequences of our simpler and more efficient scheduler implementation is that in OSv, context switches are significantly faster than in Linux: between 3 and 10 times faster.
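To illustrate what "fairness" means when N threads compete for M CPUs, here is a toy simulation. It is not OSv's scheduling algorithm: it assumes discrete ticks, always-runnable threads of equal priority, and a single shared queue from which each CPU picks the thread that has received the least CPU time so far. All names are invented for the sketch.

```python
import heapq


def simulate(num_threads: int, num_cpus: int, ticks: int) -> list:
    """Return the number of ticks each thread ran under least-runtime-first."""
    runtime = [0] * num_threads
    # Min-heap of (accumulated runtime, thread id): least-served thread first.
    queue = [(0, tid) for tid in range(num_threads)]
    heapq.heapify(queue)
    for _ in range(ticks):
        # Each CPU takes one distinct thread for this tick.
        slots = min(num_cpus, num_threads)
        running = [heapq.heappop(queue) for _ in range(slots)]
        for served, tid in running:
            runtime[tid] += 1
            heapq.heappush(queue, (served + 1, tid))
    return runtime


if __name__ == "__main__":
    # 6 competing threads on 2 CPUs for 300 ticks.
    print(simulate(num_threads=6, num_cpus=2, ticks=300))
```

With 6 threads on 2 CPUs for 300 ticks there are 600 CPU-ticks to hand out, and each thread ends up with exactly 100 of them. When the numbers do not divide evenly, the shares differ by at most one tick.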
OSv does not use spin-locks, because those cause the "lock holder preemption" problem on VMs. Rather, we have a rare implementation of a lock-free mutex, as well as an interesting collection of lock-free algorithms and an implementation of RCU (http://en.wikipedia.org/wiki/Read-copy-update).
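For readers unfamiliar with RCU, the following sketch shows the basic read-copy-update idea only; it is not OSv's implementation, and the class and method names are invented. Readers take no lock and simply read from whichever version is currently published; a writer copies that version, modifies the private copy, and publishes it with a single reference swap. In Python the old version is reclaimed by the garbage collector once the last reader drops it, whereas a kernel RCU has to wait for a grace period before freeing it.

```python
import threading


class RcuTable:
    """Read-mostly table updated by copy-and-publish (illustrative only)."""

    def __init__(self, initial: dict):
        self._current = dict(initial)         # published version, never mutated in place
        self._writer_lock = threading.Lock()  # serializes writers; readers never take it

    def snapshot(self) -> dict:
        """Return the currently published version (treat it as read-only)."""
        return self._current

    def read(self, key, default=None):
        # Lock-free read path: grab a reference once, then read from that version.
        return self.snapshot().get(key, default)

    def update(self, key, value) -> None:
        with self._writer_lock:
            new_version = dict(self._current)  # copy
            new_version[key] = value           # update the private copy
            self._current = new_version        # publish with one reference swap


if __name__ == "__main__":
    table = RcuTable({"mtu": 1500})
    old = table.snapshot()      # a reader that started before the update
    table.update("mtu", 9000)
    print(old["mtu"], table.read("mtu"))
```

The reader holding `old` keeps seeing a consistent version of the table even after the update, which is the property that lets readers run without any locking.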
As explained above, we needed to implement in OSv all the traditional Linux system calls and glibc library calls, in a way that is 100% ABI-compatible (i.e., binary-compatible) with glibc. We implemented many of the functions ourselves. We imported some of the functions - such as the math functions and stdio functions - from the musl-libc project, an MIT-licensed libc implementation.
OSv has a full-featured TCP/IP network stack (on top of a network driver such as virtio-net). The code was originally imported from FreeBSD, but later underwent a major overhaul to use Van Jacobson's "network channels" design, to reduce the number of locks and lock operations (even uncontended locks are slow, and contended locks are obviously worse) and the amount of slow cache-line bounces on SMP VMs.
OSv contains a built-in DHCP client, so it can find its IP address without being configured manually.
OSv's Makefile is a complicated beast (currently undergoing a rewrite), which builds the OSv kernel, as well as a complete disk image containing the OSv kernel and an application. An application is built and specified by simple Python scripts. Explaining how to build applications for OSv is beyond the scope of this mail.
run.py is a small Python script for running the image created by the Makefile (see above) on the local host (using KVM, qemu, or Xen).
"apps.git" is a collection of about 60 applications known to run on OSv. The repository does not contain the applications themselves - just small scripts to fetch the applications from their own online sources, and to build them into an image using the Makefile (see above) or Capstan (see below). Some of these apps aren't quite apps; they are run-time environments on which many apps can be run. Prime examples are "java" (OpenJDK 7) and "openjdk8-fedora" (OpenJDK 8). At some point during our development, Java support was OSv's primary focus, so our Java support is quite complete and tested.
Capstan is an alternative to the Makefile and run.py described above. It is similar in purpose (but not implementation) to Docker: it composes images from OSv and an application specified by a "Capstanfile" file. Capstan also allows you to upload the images you compose to a site, to download pre-composed images, and also to run these images. Using Capstan is not necessary for using OSv, but some find it more natural than the Makefile/run.py approach. I'm not one of these people ;-)
Our "httpd" is a separate application (running as a thread, of course), but it is by default compiled into images created by our Makefile. It provides a "REST API" to OSv, to do all sorts of things from getting the list of threads to rebooting the machine. On top of this REST API we also have a shell (written in Lua) which runs on the host (or guest, if you prefer) and functions by sending REST (HTTP) requests to the guest. Another thing we have on top of the REST API is a graphical administration UI.
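As a rough sketch of what a client of this REST API could look like, the snippet below builds and sends plain HTTP requests with Python's standard library. The guest address, the port, and the two endpoint paths are assumptions made for illustration; check them against the API exposed by your own image before relying on them.

```python
import json
import urllib.request

# Assumed values for illustration only: adjust to your own guest.
GUEST = "http://192.168.122.15:8000"
THREADS_PATH = "/os/threads"  # assumed path for listing threads
REBOOT_PATH = "/os/reboot"    # assumed path for rebooting the guest


def build_request(path: str, method: str = "GET") -> urllib.request.Request:
    """Prepare (but do not send) an HTTP request to the guest."""
    return urllib.request.Request(GUEST + path, method=method)


def call(path: str, method: str = "GET"):
    """Send the request and decode a JSON body, if the guest returned one."""
    with urllib.request.urlopen(build_request(path, method), timeout=5) as resp:
        body = resp.read().decode("utf-8")
    return json.loads(body) if body else None


if __name__ == "__main__":
    # Requires a running guest reachable at GUEST.
    print(call(THREADS_PATH))
```

A reboot would then be something like `call(REBOOT_PATH, method="POST")`, again assuming that path and method match what the guest actually exposes.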