If you have read my previous blog posts, you can see that some time ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works at a low level and many other things. So, I decided to write yet another series of posts about the Linux kernel for x86_64.
Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter 0xAX, drop me an email or just create an issue. I appreciate it. All posts will also be accessible at linux-insides and if you find something wrong with my English or the post content, feel free to send a pull request.
Note that this isn't the official documentation, just learning and sharing knowledge.
Required knowledge
- Understanding C code
- Understanding assembly code (AT&T syntax)
Anyway, if you are just starting to learn these tools, I will try to explain some parts in this and the following posts. OK, the introduction is finished, and now we can start to dive into the kernel and low-level stuff.
All code is for kernel version 3.18. If there are changes, I will update the posts accordingly.
Although this is a series of posts about the Linux kernel, we will not start from the kernel code (at least not in this paragraph). OK, you press the magic power button on your laptop or desktop computer and it starts to work. After the motherboard sends a signal to the power supply, the power supply provides the computer with the proper amount of electricity. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.
80386 and later CPUs set the following predefined values in CPU registers after the computer resets:
IP 0xfff0
CS selector 0xf000
CS base 0xffff0000
The processor starts working in real mode. Let's back up a little to try and understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the 8086 all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it can work with a 0-0xFFFFF address space (1 megabyte). But it only has 16-bit registers, and with 16-bit registers the maximum address is 2^16 - 1 or 0xffff (64 kilobytes). Memory segmentation is used to make use of all the address space available. All memory is divided into small, fixed-size segments of 65536 bytes, or 64 KB. Since we cannot address memory above 64 KB with 16-bit registers, an alternate method was devised. An address consists of two parts: a segment selector, which has an associated base address, and an offset from this base address. In real mode, the associated base address of a segment selector is Segment Selector * 16. Thus, to get a physical address in memory, we need to multiply the segment selector part by 16 and add the offset part:
PhysicalAddress = Segment Selector * 16 + Offset
For example, if CS:IP is 0x2000:0x0010, the corresponding physical address will be:
>>> hex((0x2000 << 4) + 0x0010)
'0x20010'
But if we take the largest segment selector and offset, 0xffff:0xffff, it will be:
>>> hex((0xffff << 4) + 0xffff)
'0x10ffef'
which is 65520 bytes past the first megabyte. Since only one megabyte is accessible in real mode, 0x10ffef becomes 0x00ffef when the A20 line is disabled.
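The segment arithmetic above, including the wrap-around when the A20 line is disabled, can be sketched in a few lines of Python. This is only an illustration: the function name real_mode_address and its a20_enabled parameter are my own and do not come from the kernel or any CPU manual.

```python
def real_mode_address(segment, offset, a20_enabled=False):
    """Translate a real-mode segment:offset pair into a physical address."""
    # Both parts are 16-bit values.
    segment &= 0xFFFF
    offset &= 0xFFFF
    linear = (segment << 4) + offset
    if not a20_enabled:
        # With A20 disabled, address bit 20 is forced to zero, so
        # addresses past the first megabyte wrap around to low memory.
        linear &= 0xFFFFF
    return linear


print(hex(real_mode_address(0x2000, 0x0010)))                    # 0x20010
print(hex(real_mode_address(0xFFFF, 0xFFFF)))                    # 0xffef
print(hex(real_mode_address(0xFFFF, 0xFFFF, a20_enabled=True)))  # 0x10ffef
```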
OK, now we know about real mode and memory addressing. Let's get back to the register values after reset.
The CS register consists of two parts: the visible segment selector and the hidden base address. While the base address is normally formed by multiplying the segment selector value by 16, during a hardware reset, the segment selector in the CS register is loaded with 0xf000 and the base address is loaded with 0xffff0000. The processor uses this special base address until CS is changed.
The starting address is formed by adding the base address to the value in the EIP register:
>>> hex(0xffff0000 + 0xfff0)
'0xfffffff0'
We get 0xfffffff0, which is 16 bytes below 4GB. This point is called the reset vector. This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a jump instruction which usually points to the BIOS entry point. For example, if we look in the coreboot source code, we see:
.section ".reset"
.code16
.globl reset_vector
reset_vector:
.byte 0xe9
.int _start - ( . + 2 )
...
Here we can see the jmp instruction opcode, 0xe9, and its relative displacement to the destination, _start - ( . + 2 ). We can also see that the reset section is 16 bytes and starts at 0xfffffff0:
SECTIONS {
_ROMTOP = 0xfffffff0;
. = _ROMTOP;
.reset . : {
*(.reset)
. = 15 ;
BYTE(0x00);
}
}
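To make the 0xe9 encoding a bit more concrete, here is a small Python sketch of how a 16-bit near jump stores its destination as a displacement relative to the end of the instruction (this is what the ". + 2" in the coreboot snippet accounts for). The helper names and the target offset 0xe05b are hypothetical; they are not taken from coreboot.

```python
import struct


def encode_jmp_rel16(jmp_offset, target_offset):
    """Encode a 16-bit near jump (opcode 0xe9) placed at jmp_offset."""
    # The displacement is counted from the end of the 3-byte instruction
    # (1 opcode byte + 2 displacement bytes).
    next_ip = jmp_offset + 3
    displacement = (target_offset - next_ip) & 0xFFFF
    return struct.pack("<BH", 0xE9, displacement)


def decode_jmp_rel16(jmp_offset, code):
    """Return the offset a 0xe9 near jump at jmp_offset transfers control to."""
    opcode, displacement = struct.unpack("<BH", code[:3])
    if opcode != 0xE9:
        raise ValueError("not a jmp rel16 instruction")
    return (jmp_offset + 3 + displacement) & 0xFFFF


RESET_IP = 0xFFF0         # IP value after reset
HYPOTHETICAL_START = 0xE05B  # made-up entry point, for illustration only

encoded = encode_jmp_rel16(RESET_IP, HYPOTHETICAL_START)
print(encoded.hex())                               # e968e0
print(hex(decode_jmp_rel16(RESET_IP, encoded)))    # 0xe05b
```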
Now the BIOS starts: after initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector code is stored in the first 446 bytes of the first sector (which is 512 bytes). The final two bytes of the first sector are 0x55 and 0xaa, which signal to the BIOS that this device is bootable. For example:
;
; Note: this example is written in Intel Assembly syntax
;
[BITS 16]
[ORG 0x7c00]
boot:
mov al, '!'
mov ah, 0x0e
mov bh, 0x00
mov bl, 0x07
int 0x10
jmp $
times 510-($-$$) db 0
db 0x55
db 0xaa
Build and run it with:
nasm -f bin boot.nasm && qemu-system-x86_64 boot
This will instruct QEMU to use the boot binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to 0x7c00, and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.
You will see the ! symbol printed on the QEMU screen.
In this example we can see that the code will be executed in 16-bit real mode and will start at 0x7c00 in memory. After starting, it calls the 0x10 interrupt, which just prints the ! symbol. It fills the remainder of the first 510 bytes with zeros and finishes with the two magic bytes 0x55 and 0xaa.
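The layout of such a sector can also be sketched without an assembler. The Python snippet below builds the same 512 bytes by hand and checks the signature the way a BIOS would. The machine code bytes are my own hand-assembly of the example above, and the helper names are made up for this illustration.

```python
# Hand-assembled bytes for the example above (assumed to match nasm's output).
CODE = bytes([
    0xB0, 0x21,  # mov al, '!'
    0xB4, 0x0E,  # mov ah, 0x0e
    0xB7, 0x00,  # mov bh, 0x00
    0xB3, 0x07,  # mov bl, 0x07
    0xCD, 0x10,  # int 0x10
    0xEB, 0xFE,  # jmp $
])


def build_boot_sector(code):
    """Pad code with zeros up to 510 bytes and append the magic bytes."""
    if len(code) > 510:
        raise ValueError("boot code does not fit into one sector")
    return code + bytes(510 - len(code)) + b"\x55\xaa"


def is_bootable(sector):
    """Check the two magic bytes at the end of the first sector."""
    return len(sector) >= 512 and sector[510:512] == b"\x55\xaa"


sector = build_boot_sector(CODE)
print(len(sector))          # 512
print(is_bootable(sector))  # True
```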
You can see a binary dump of this with the objdump utility:
nasm -f bin boot.nasm
objdump -D -b binary -mi386 -Maddr16,data16,intel boot
A real-world boot sector has code to continue the boot process and the partition table instead of a bunch of 0's and an exclamation mark :) From this point onwards, BIOS hands over control to the bootloader.
NOTE: As you can read above the CPU is in real mode. In real mode, calculating the physical address in memory is done as follows:
PhysicalAddress = Segment Selector * 16 + Offset
The same as mentioned before. We have only 16-bit general purpose registers, and the maximum value of a 16-bit register is 0xffff, so if we take the largest values, the result will be:
>>> hex((0xffff * 16) + 0xffff)
'0x10ffef'
Where 0x10ffef is the last byte of a range of 1MB + 64KB - 16 bytes. But an 8086 processor, which is the first processor with real mode, has a 20-bit address bus, and 2^20 = 1048576 bytes is 1MB. This means the actual memory available is 1MB.
The general real mode memory map is:
0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
0x00000400 - 0x000004FF - BIOS Data Area
0x00000500 - 0x00007BFF - Unused
0x00007C00 - 0x00007DFF - Our Bootloader
0x00007E00 - 0x0009FFFF - Unused
0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
0x000B0000 - 0x000B7FFF - Monochrome Video Memory
0x000B8000 - 0x000BFFFF - Color Video Memory
0x000C0000 - 0x000C7FFF - Video ROM BIOS
0x000C8000 - 0x000EFFFF - BIOS Shadow Area
0x000F0000 - 0x000FFFFF - System BIOS
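As a quick way to play with this map, the table can be turned into a lookup. This is just a sketch: the regions_for helper is my own name, and the monochrome range is assumed to end at 0xB7FFF, right below the color video memory.

```python
REAL_MODE_MAP = [
    (0x00000000, 0x000003FF, "Real Mode Interrupt Vector Table"),
    (0x00000400, 0x000004FF, "BIOS Data Area"),
    (0x00000500, 0x00007BFF, "Unused"),
    (0x00007C00, 0x00007DFF, "Our Bootloader"),
    (0x00007E00, 0x0009FFFF, "Unused"),
    (0x000A0000, 0x000BFFFF, "Video RAM (VRAM) Memory"),
    # Assumption: the monochrome buffer ends just below the color buffer.
    (0x000B0000, 0x000B7FFF, "Monochrome Video Memory"),
    (0x000B8000, 0x000BFFFF, "Color Video Memory"),
    (0x000C0000, 0x000C7FFF, "Video ROM BIOS"),
    (0x000C8000, 0x000EFFFF, "BIOS Shadow Area"),
    (0x000F0000, 0x000FFFFF, "System BIOS"),
]


def regions_for(address):
    """Return the names of all map entries that contain address."""
    return [name for start, end, name in REAL_MODE_MAP
            if start <= address <= end]


print(regions_for(0x7C00))   # ['Our Bootloader']
print(regions_for(0xB8000))  # ['Video RAM (VRAM) Memory', 'Color Video Memory']
```

The video entries overlap on purpose (the monochrome and color buffers are parts of the VRAM window), which is why the helper returns a list instead of a single name.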
In the beginning of this post I wrote that the first instruction executed by the CPU is located at address 0xFFFFFFF0, which is much larger than 0xFFFFF (1MB). How can the CPU access this in real mode? The answer is in the coreboot documentation:
0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
At the start of execution, the BIOS is not in RAM, but in ROM.
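A tiny sanity check of the numbers involved, as a sketch: the reset vector computed earlier falls into the last 16 bytes of that 128 kilobyte ROM window. The constant names are mine.

```python
ROM_START = 0xFFFE0000
ROM_END = 0xFFFFFFFF
RESET_VECTOR = 0xFFFF0000 + 0xFFF0  # CS base + IP after reset

rom_size = ROM_END - ROM_START + 1
bytes_left = ROM_END - RESET_VECTOR + 1

print(rom_size // 1024)                       # 128
print(ROM_START <= RESET_VECTOR <= ROM_END)   # True
print(bytes_left)                             # 16
```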
There are a number of bootloaders that can boot Linux, such as GRUB 2 and syslinux. The Linux kernel has a Boot protocol which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2.
Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from boot.img. This code is very simple due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with diskboot.img, which is usually stored immediately after the first sector in the unused space before the first partition. The diskboot.img code loads the rest of the core image, which contains GRUB 2's kernel and drivers for handling filesystems, into memory. After loading the rest of the core image, it executes grub_main.
grub_main initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules etc. At the end of execution, grub_main moves grub to normal mode. grub_normal_execute (from grub-core/normal/main.c) completes the last preparation and shows a menu to select an operating system. When we select one of the grub menu entries, grub_menu_execute_entry runs, which executes the grub boot command, booting the selected operating system.
As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at offset 0x01f1 from the kernel setup code. The kernel header arch/x86/boot/header.S starts from:
.globl hdr
hdr:
setup_sects: .byte 0
root_flags: .word ROOT_RDONLY
syssize: .long 0
ram_size: .word 0
vid_mode: .word SVGA_MODE
root_dev: .word 0
boot_flag: .word 0xAA55
The bootloader must fill this and the rest of the headers (only those marked as write in the Linux boot protocol) with values which it either got from the command line or calculated. We will not go through a description and explanation of all fields of the kernel setup header here; we will get back to them when the kernel uses them. You can find a description of all fields in the boot protocol.
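To get a feeling for how a bootloader sees these first header fields, here is a rough Python sketch that reads them from offset 0x01f1 of an image. It assumes the field sizes shown in the listing above (.byte is 1 byte, .word is 2 bytes, .long is 4 bytes, all little-endian). The parse_setup_header helper and the fake image with its placeholder values are made up for the example; they are not a real kernel image.

```python
import struct

HDR_OFFSET = 0x01F1
# setup_sects (byte), root_flags (word), syssize (long),
# ram_size, vid_mode, root_dev, boot_flag (words); little-endian, no padding.
HDR_FORMAT = "<BHIHHHH"
HDR_FIELDS = ("setup_sects", "root_flags", "syssize", "ram_size",
              "vid_mode", "root_dev", "boot_flag")


def parse_setup_header(image):
    """Read the first fields of the setup header from a kernel image."""
    values = struct.unpack_from(HDR_FORMAT, image, HDR_OFFSET)
    return dict(zip(HDR_FIELDS, values))


# Fake 512-byte "image" with placeholder values, only for this sketch.
fake_image = bytearray(0x200)
struct.pack_into(HDR_FORMAT, fake_image, HDR_OFFSET,
                 0, 1, 0, 0, 0xFFFF, 0, 0xAA55)

header = parse_setup_header(fake_image)
print(hex(header["boot_flag"]))                     # 0xaa55
print(HDR_OFFSET + struct.calcsize(HDR_FORMAT))     # 512
```

Note that these seven fields end exactly at offset 0x200, so boot_flag occupies the last two bytes of the first 512-byte sector.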
As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:
| Protected-mode kernel |
100000 +------------------------+
| I/O memory hole |
0A0000 +------------------------+
| Reserved for BIOS | Leave as much as possible unused
~ ~
| Command line | (Can also be below the X+10000 mark)
X+10000 +------------------------+
| Stack/heap | For use by the kernel real-mode code.
X+08000 +------------------------+
| Kernel setup | The kernel real-mode code.
| Kernel boot sector | The kernel legacy boot sector.
X +------------------------+
| Boot loader | <- Boot sector entry point 0x7C00
001000 +------------------------+
| Reserved for MBR/BIOS |
000800 +------------------------+
| Typically used by MBR |
000600 +------------------------+
| BIOS use only |
000000 +------------------------+
So when the bootloader transfers control to the kernel, it starts at:
X + sizeof(KernelBootSector)
where X is the address at which the kernel boot sector is loaded and sizeof(KernelBootSector) is 512 bytes (0x200). In my case X is 0x10000, as we can see in a memory dump:
The bootloader has now loaded the Linux kernel into memory, filled the header fields and jumped to it. Now we can move directly to the kernel setup code.
Finally we are in the kernel. Technically the kernel hasn't run yet; first we need to set up the kernel, memory manager, process manager, etc. Kernel setup execution starts from arch/x86/boot/header.S at _start. It is a little strange at first sight, as there are several instructions before it.
A long time ago the Linux kernel had its own bootloader, but now if you run for example:
qemu-system-x86_64 vmlinuz-3.18-generic
You will see:
Actually header.S starts with the MZ magic number (see image above), followed by error message printing and the PE header:
#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.byte 0x4d
.byte 0x5a
#endif
...
...
...
pe_header:
.ascii "PE"
.word 0
It needs this to load an operating system with UEFI. We won't see how this works right now, we'll see this in one of the next chapters.
So the actual kernel setup entry point is:
// header.S line 292
.globl _start
_start:
The bootloader (grub2 and others) knows about this point (0x200 offset from MZ) and jumps directly to it, despite the fact that header.S starts from the .bstext section, which prints an error message:
//
// arch/x86/boot/setup.ld
//
. = 0; // current position
.bstext : { *(.bstext) } // put .bstext section to position 0
.bsdata : { *(.bsdata) }
So the kernel setup entry point is:
.globl _start
_start:
.byte 0xeb
.byte start_of_setup-1f
1:
//
// rest of the header
//
Here we can see a jmp instruction opcode, 0xeb, followed by the displacement start_of_setup-1f. The Nf notation means that, for example, 2f refers to the next local 2: label. In our case it is label 1, which goes right after the jump. It contains the rest of the setup header. Right after the setup header we see the .entrytext section, which starts at the start_of_setup label.
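The start_of_setup-1f expression is simply the distance from the byte after the jump to its destination. A short Python sketch of how such a 2-byte short jump is resolved follows; the value used for start_of_setup (0x268) is hypothetical, I did not take it from a real build, and the helper name is mine.

```python
def short_jmp_target(jmp_offset, rel8):
    """Destination of a short jump (opcode 0xeb) located at jmp_offset."""
    # The displacement is a signed byte...
    if rel8 >= 0x80:
        rel8 -= 0x100
    # ...counted from the end of the 2-byte instruction (label 1 above).
    return jmp_offset + 2 + rel8


START = 0x200            # offset of _start
LABEL_1 = START + 2      # first byte after the jump
START_OF_SETUP = 0x268   # hypothetical offset, for illustration only

rel8 = START_OF_SETUP - LABEL_1
print(hex(rel8))                           # 0x66
print(hex(short_jmp_target(START, rel8)))  # 0x268
```

The same rule explains the jmp $ from the boot sector example: its encoding eb fe has a displacement of -2, which lands back on the jump itself.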
Actually this is the first code that runs (aside from the previous jump instruction of course). After the kernel setup gets control from the bootloader, the first jmp instruction is located at the 0x200 offset (the first 512 bytes) from the start of the kernel real mode code. We can read this in the Linux kernel boot protocol and also see it in the grub2 source code:
segment = grub_linux_real_target >> 4;
state.gs = state.fs = state.es = state.ds = state.ss = segment;
state.cs = segment + 0x20;
It means that segment registers will have the following values after kernel setup starts:
gs = fs = es = ds = ss = 0x1000
cs = 0x1020
in my case, when the kernel is loaded at 0x10000.
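The arithmetic from the grub2 snippet is easy to replay. The sketch below mirrors it in Python for a load address of 0x10000; the function name is made up, while the shift and the 0x20 increment come from the snippet above.

```python
def setup_segments(real_target):
    """Mirror the segment setup from the grub2 snippet above."""
    segment = real_target >> 4   # data segments: ds, es, fs, gs, ss
    cs = segment + 0x20          # code segment, 0x200 bytes further
    return segment, cs


segment, cs = setup_segments(0x10000)
print(hex(segment), hex(cs))   # 0x1000 0x1020
print(hex(cs << 4))            # 0x10200
```

Adding 0x20 to the segment is the same as adding 0x200 to the physical address, so cs:0 points exactly at the 512-byte offset where _start lives.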
After the jump to start_of_setup, it needs to do the following:
- Be sure that all values of all segment registers are equal
- Set up correct stack if needed
- Set up bss
- Jump to C code at main.c
Let's look at the implementation.
First of all it ensures that the ds and es segment registers point to the same address and clears the direction flag with the cld instruction:
movw %ds, %ax
movw %ax, %es
cld
As I wrote earlier, grub2 loads the kernel setup code at address 0x10000 and sets cs to 0x1020, because execution doesn't start from the start of the file, but from:
_start:
.byte 0xeb
.byte start_of_setup-1f
This jump is at a 512-byte offset from the 4d 5a bytes. The setup code also needs to align cs from 0x1020 to 0x1000 to match all other segment registers. This is done before setting up the stack:
pushw %ds
pushw $6f
lretw
This pushes the value of ds to the stack, followed by the address of the 6 label, and executes the lretw instruction. When lretw executes, it loads the address of label 6 into the instruction pointer register and loads cs with the value of ds. After this, ds and cs will have the same values.
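If the push/lretw trick looks confusing, a toy model of the stack may help. This Python sketch only imitates the order of pushes and pops; the offset used for label 6 is hypothetical and the far_return helper is my own name.

```python
def far_return(stack):
    """Pop IP first, then CS, like a 16-bit far return (lretw) does."""
    ip = stack.pop()
    cs = stack.pop()
    return cs, ip


ds = 0x1000        # value shared by the data segment registers
label_6 = 0x0270   # hypothetical offset of label 6

stack = []
stack.append(ds)       # pushw %ds
stack.append(label_6)  # pushw $6f
cs, ip = far_return(stack)  # lretw

print(hex(cs), hex(ip))  # 0x1000 0x270
```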
Actually, almost all of the setup code is preparation for the C language environment in real mode. The next step is checking the ss register value and making a correct stack if ss is wrong:
movw %ss, %dx
cmpw %ax, %dx
movw %sp, %dx
je 2f
This can lead to 3 different scenarios:
- ss has a valid value 0x1000 (as do all other segment registers beside cs)
- ss is invalid and the CAN_USE_HEAP flag is set (see below)
- ss is invalid and the CAN_USE_HEAP flag is not set (see below)
Let's look at all three of these scenarios:
- ss has a correct value (0x1000). In this case we go to label 2:
2: andw $~3, %dx
jnz 3f
movw $0xfffc, %dx
3: movw %ax, %ss
movzwl %dx, %esp
sti
Here we can see the alignment of dx (which contains the sp given by the bootloader) to 4 bytes and a check for whether or not it is zero. If it is zero, we put 0xfffc (the last 4-byte aligned address before the maximum segment size of 64 KB) in dx. If it is not zero, we continue to use the sp given by the bootloader (0xf7f4 in my case). After this we put the ax value, which stores the correct segment value of 0x1000, into ss and set up a correct sp. We now have a correct stack.
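The code at label 2 boils down to a couple of lines. Here is a Python sketch of the same decision; the function name is mine, and 0xf7f4 is the sp value from my case mentioned above.

```python
def pick_stack_pointer(sp):
    """Imitate label 2: align sp down to 4 bytes, fall back to 0xfffc."""
    dx = sp & ~3 & 0xFFFF   # andw $~3, %dx
    if dx == 0:             # jnz 3f not taken
        dx = 0xFFFC         # movw $0xfffc, %dx
    return dx


print(hex(pick_stack_pointer(0xF7F4)))  # 0xf7f4
print(hex(pick_stack_pointer(0xF7F6)))  # 0xf7f4
print(hex(pick_stack_pointer(0x0002)))  # 0xfffc
```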
- In the second scenario (ss != ds), we first put the _end value (the address of the end of the setup code) in dx and check the loadflags header field with the testb instruction to see whether we can use the heap or not. loadflags is a bitmask header field which is defined as:
#define LOADED_HIGH (1<<0)
#define QUIET_FLAG (1<<5)
#define KEEP_SEGMENTS (1<<6)
#define CAN_USE_HEAP (1<<7)
And as we can read in the boot protocol:
Field name: loadflags
This field is a bitmask.
Bit 7 (write): CAN_USE_HEAP
Set this bit to 1 to indicate that the value entered in the
heap_end_ptr is valid. If this field is clear, some setup code
functionality will be disabled.
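Decoding such a bitmask is straightforward. The sketch below uses the four flag values from the defines above; the decode_loadflags helper and the sample value 0x81 are only for illustration.

```python
LOADFLAGS = {
    "LOADED_HIGH": 1 << 0,
    "QUIET_FLAG": 1 << 5,
    "KEEP_SEGMENTS": 1 << 6,
    "CAN_USE_HEAP": 1 << 7,
}


def decode_loadflags(value):
    """Return the names of all flags set in a loadflags byte."""
    return [name for name, bit in LOADFLAGS.items() if value & bit]


def can_use_heap(value):
    """Equivalent of testing the CAN_USE_HEAP bit with testb."""
    return bool(value & LOADFLAGS["CAN_USE_HEAP"])


print(decode_loadflags(0x81))  # ['LOADED_HIGH', 'CAN_USE_HEAP']
print(can_use_heap(0x01))      # False
```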
If the CAN_USE_HEAP bit is set, heap_end_ptr is put in dx (which until then pointed to _end), and STACK_SIZE (the minimal stack size, 512 bytes) is added to it. After this, if the addition did not set the carry flag (it will not when dx = _end + 512), we jump to label 2 as in the previous case and make a correct stack.
- When CAN_USE_HEAP is not set, we just use a minimal stack from _end to _end + STACK_SIZE.
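The second and third scenarios can be put together in one small Python sketch. Please treat it as my reading of the logic, not as the kernel's code: the function name and the sample addresses are hypothetical, and the handling of the carry case (resetting dx to zero so that label 2 falls back to 0xfffc) is an assumption that goes beyond what is described above.

```python
STACK_SIZE = 512          # minimal stack size
CAN_USE_HEAP = 1 << 7


def stack_top_for_invalid_ss(end, loadflags, heap_end_ptr):
    """Pick sp when ss does not match ds (scenarios two and three)."""
    dx = end                          # start from _end
    if loadflags & CAN_USE_HEAP:
        dx = heap_end_ptr             # the heap is usable, stack goes after it
    dx += STACK_SIZE
    if dx > 0xFFFF:                   # 16-bit carry
        dx = 0                        # assumption: avoid wraparound
    # Same steps as label 2: align to 4 bytes, fall back to 0xfffc.
    dx &= 0xFFFC
    if dx == 0:
        dx = 0xFFFC
    return dx


END = 0x4000  # hypothetical value of _end

print(hex(stack_top_for_invalid_ss(END, 0x00, 0)))                 # 0x4200
print(hex(stack_top_for_invalid_ss(END, CAN_USE_HEAP, 0xE000)))    # 0xe200
print(hex(stack_top_for_invalid_ss(END, CAN_USE_HEAP, 0xFF00)))    # 0xfffc
```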
The last two steps that need to happen before we can jump to the main C code, are setting up the BSS area and checking the "magic" signature. First, signature checking:
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
This simply compares setup_sig with the magic number 0x5a5aaa55. If they are not equal, a fatal error is reported.
If the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.
The BSS section is used to store statically allocated, uninitialized data. Linux carefully ensures this area of memory is first blanked, using the following code:
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep; stosl
First of all, the __bss_start address is moved into di and the _end + 3 address (+3 rounds the size up to a multiple of 4 bytes) is moved into cx. The eax register is cleared (using a xor instruction), and the bss section size (cx - di) is calculated and put into cx. Then, cx is divided by four (the size of the 4-byte 'long' that stosl stores), and the stosl instruction is repeatedly used, storing the value of eax (zero) into the address pointed to by di and automatically increasing di by four (this occurs until cx reaches zero). The net effect of this code is that zeros are written through all 4-byte units in memory from __bss_start to _end.
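Here is a small Python model of that loop, to make the +3 and the shift by two easier to follow. It works on a plain bytearray instead of real memory, and the addresses 0x10 and 0x1a are hypothetical stand-ins for __bss_start and _end.

```python
def clear_bss(memory, bss_start, end):
    """Imitate the rep; stosl loop that zeroes the BSS area."""
    di = bss_start                   # movw $__bss_start, %di
    cx = (end + 3 - di) & 0xFFFF     # movw $_end+3, %cx; subw %di, %cx
    cx >>= 2                         # shrw $2, %cx (number of 4-byte stores)
    while cx:                        # rep; stosl
        memory[di:di + 4] = b"\x00\x00\x00\x00"
        di += 4
        cx -= 1
    return di


memory = bytearray(b"\xff" * 64)     # pretend memory full of garbage
last_di = clear_bss(memory, 0x10, 0x1A)

print(hex(last_di))                      # 0x1c
print(memory[0x10:0x1C] == bytes(12))    # True
```

Because of the rounding, a 10-byte area is cleared with three 4-byte stores, so up to three bytes past _end may be zeroed as well.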
That's all: we have the stack and BSS, so we can jump to the main() C function:
calll main
The main() function is located in arch/x86/boot/main.c. You can read about what this does in the next part.
This is the end of the first part about Linux kernel insides. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see the first C code which executes in the Linux kernel setup, the implementation of memory routines such as memset and memcpy, the earlyprintk implementation, early console initialization and much more.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.