Kernel startup entry point / How the Linux kernel boots

We take ARM as our hardware platform, so the kernel startup entry point is in arch/arm/kernel/head.S

This is normally called from the decompressor code. The requirements
are: MMU = off, D-cache = off, I-cache = dont care, r0 = 0,
r1 = machine nr, r2 = atags or dtb pointer.

This code is mostly position independent, so if you link the kernel at
0xc0008000, you call this at __pa(0xc0008000).
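
To make the __pa() remark concrete, here is a minimal C sketch of the linear virtual-to-physical conversion. The PHYS_OFFSET value and the function names are assumptions for illustration only; the real offsets depend on the platform and kernel configuration.

```c
#include <stdint.h>

/* Assumed values for illustration only; both are platform-specific. */
#define PAGE_OFFSET 0xc0000000u /* start of the kernel's virtual mapping */
#define PHYS_OFFSET 0x80000000u /* hypothetical physical start of RAM */

/* Virtual -> physical for an address in the kernel's linear mapping. */
static uint32_t virt_to_phys_sketch(uint32_t vaddr)
{
    return vaddr - PAGE_OFFSET + PHYS_OFFSET;
}

/* Physical -> virtual, the inverse of the conversion above. */
static uint32_t phys_to_virt_sketch(uint32_t paddr)
{
    return paddr - PHYS_OFFSET + PAGE_OFFSET;
}
```

Under these assumed offsets, a kernel linked at 0xc0008000 would be entered at physical address 0x80008000.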

See linux/arch/arm/tools/mach-types for the complete list of machine numbers for r1.

.arm

__HEAD
ENTRY(stext)

First, the code checks the processor type, as below,

mrc p15, 0, r9, c0, c0 @ get processor id
bl __lookup_processor_type @ r5=procinfo r9=cpuid
movs r10, r5 @ invalid processor (r5=0)?

__lookup_processor_type is defined in arch/arm/kernel/head-common.S as below,

/*
* Read processor ID register (CP#15, CR0), and look up in the linker-built
* supported processor list. Note that we can't use the absolute addresses
* for the __proc_info lists since we aren't running with the MMU on
* (and therefore, we are not in the correct address space). We have to
* calculate the offset.
*
* r9 = cpuid
* Returns:
* r3, r4, r6 corrupted
* r5 = proc_info pointer in physical address space
* r9 = cpuid (preserved)
*/
__lookup_processor_type:
adr r3, __lookup_processor_type_data
ldmia r3, {r4 - r6}

[ … some more code ( refer file ) …]
ENDPROC(__lookup_processor_type)
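
The matching idea behind this lookup can be sketched in C: each table entry carries a value and a mask, and an entry matches when the CPU ID agrees with the value on every masked bit. The struct, the function name and the table contents below are hypothetical and only illustrate the technique; the real table is built by the linker from arch/arm/mm/proc-*.S.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for a processor table entry. */
struct proc_info_sketch {
    uint32_t cpu_val;   /* expected ID bits */
    uint32_t cpu_mask;  /* which ID bits must match */
    const char *name;
};

/* Illustrative entries only, not copied from the kernel's tables. */
static const struct proc_info_sketch proc_table[] = {
    { 0x410fc080u, 0xff0ffff0u, "example entry A" },
    { 0x410fc090u, 0xff0ffff0u, "example entry B" },
};

/* Return the first entry whose masked value matches cpuid, or NULL. */
static const struct proc_info_sketch *lookup_processor_sketch(uint32_t cpuid)
{
    size_t i;

    for (i = 0; i < sizeof(proc_table) / sizeof(proc_table[0]); i++) {
        if (((proc_table[i].cpu_val ^ cpuid) & proc_table[i].cpu_mask) == 0)
            return &proc_table[i];
    }
    return NULL; /* corresponds to the "invalid processor (r5=0)" case */
}
```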

We now return to arch/arm/kernel/head.S, where execution continues after __lookup_processor_type returns,

bl __create_page_tables

/* The following calls CPU specific code in a position independent
* manner. See arch/arm/mm/proc-*.S for details. r10 = base of
* xxx_proc_info structure selected by __lookup_processor_type
* above.
*
* The processor init function will be called with:
* r1 - machine type
* r2 - boot data (atags/dt) pointer
* r4 - translation table base (low word)
* r5 - translation table base (high word, if LPAE)
* r8 - translation table base 1 (pfn if LPAE)
* r9 - cpuid
* r13 - virtual address for __enable_mmu -> __turn_mmu_on
* On return, the CPU will be ready for the MMU to be turned on,
* r0 will hold the CPU control register value, r1, r2, r4, and
* r9 will be preserved. r5 will also be preserved if LPAE.
*/
ldr r13, =__mmap_switched @ address to jump to after
@ mmu has been enabled

1: b __enable_mmu
ENDPROC(stext)

As noted above, r13 holds the address of __mmap_switched, which is jumped to once the MMU has been enabled. Its source is in arch/arm/kernel/head-common.S,

/*
* The following fragment of code is executed with the MMU on in MMU mode,
* and uses absolute addresses; this is not position independent.
*
* r0 = cp#15 control register
* r1 = machine ID
* r2 = atags/dtb pointer
* r9 = processor ID
*/
__INIT
__mmap_switched:
adr r3, __mmap_switched_data

[ … some more code, refer source …]

str r9, [r4] @ Save processor ID
str r1, [r5] @ Save machine type
str r2, [r6] @ Save atags pointer
cmp r7, #0
strne r0, [r7] @ Save control register values
b start_kernel
ENDPROC(__mmap_switched)

Note the branch to "start_kernel" above: this is the actual entry from the assembly code into the C code of the Linux kernel.

"start_kernel" is defined in init/main.c as below. It is the key function here: it performs all the basic, early initialization up to starting the first application program.

asmlinkage __visible void __init start_kernel(void)
{
char *command_line;
char *after_dashes;

/*
* Need to run as early as possible, to initialize the
* lockdep hash:
*/
lockdep_init();
set_task_stack_end_magic(&init_task);
smp_setup_processor_id();
debug_objects_early_init();

/*
* Set up the initial canary ASAP:
*/
boot_init_stack_canary();

cgroup_init_early();

local_irq_disable();
early_boot_irqs_disabled = true;

/*
* Interrupts are still disabled. Do necessary setups, then
* enable them
*/
boot_cpu_init();

boot_cpu_init() is a static function in the same file; it marks the boot CPU in the CPU masks,

/*
* Activate the first processor.
*/

static void __init boot_cpu_init(void)
{
int cpu = smp_processor_id();
/* Mark the boot cpu "present", "online" etc for SMP and UP case */
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
}
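
As a rough sketch of what these four calls do, each CPU mask can be pictured as a bitmap with one bit per CPU. The C below assumes one unsigned long per mask and uses made-up names ending in _sketch; the kernel's real cpumask implementation is more general.

```c
#include <stdbool.h>

/* One bit per CPU; assumes the CPU number fits in an unsigned long. */
static unsigned long online_mask, active_mask, present_mask, possible_mask;

/* Set or clear the bit for one CPU in the given mask. */
static void set_cpu_bit_sketch(unsigned long *mask, unsigned int cpu, bool value)
{
    if (value)
        *mask |= 1UL << cpu;
    else
        *mask &= ~(1UL << cpu);
}

/* Mark the boot CPU in all four masks, mirroring boot_cpu_init(). */
static void boot_cpu_init_sketch(unsigned int cpu)
{
    set_cpu_bit_sketch(&online_mask, cpu, true);
    set_cpu_bit_sketch(&active_mask, cpu, true);
    set_cpu_bit_sketch(&present_mask, cpu, true);
    set_cpu_bit_sketch(&possible_mask, cpu, true);
}

/* Query helper: is this CPU marked online? */
static bool cpu_is_online_sketch(unsigned int cpu)
{
    return (online_mask >> cpu) & 1UL;
}
```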
page_address_init();
pr_notice("%s", linux_banner);

"linux_banner" is defined in init/version.c as below,

/* FIXED STRINGS! Don't touch! */
const char linux_banner[] =
"Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";

Macros such as "LINUX_COMPILE_HOST" are generated at compile time from the development environment and machine used to build the kernel. You can find them in the file include/generated/compile.h, for example,

/* This file is auto generated, version 1 */
/* SMP */
#define UTS_MACHINE "arm"
#define UTS_VERSION "#1 SMP Sun Oct 11 14:57:27 IST 2015"
#define LINUX_COMPILE_BY "devbee"
#define LINUX_COMPILE_HOST "ubuntu"
#define LINUX_COMPILER "gcc version 4.9.2 20140904 (prerelease) (crosstool-NG linaro-1.13.1-4.9-2014.09 - Linaro GCC 4.9-2014.09) "
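
The banner is assembled purely by the C compiler's concatenation of adjacent string literals. A small standalone sketch, in which the UTS_RELEASE value is a made-up placeholder and LINUX_COMPILER is shortened for readability:

```c
#include <stdio.h>
#include <string.h>

#define UTS_RELEASE        "4.2.0"              /* hypothetical value */
#define UTS_VERSION        "#1 SMP Sun Oct 11 14:57:27 IST 2015"
#define LINUX_COMPILE_BY   "devbee"
#define LINUX_COMPILE_HOST "ubuntu"
#define LINUX_COMPILER     "gcc version 4.9.2"  /* shortened for the sketch */

/* Adjacent string literals are joined into one string at compile time. */
static const char banner_sketch[] =
    "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@"
    LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";

/* Print the banner and return its length in characters. */
static size_t print_banner_sketch(void)
{
    fputs(banner_sketch, stdout);
    return strlen(banner_sketch);
}
```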

setup_arch(&command_line);

The setup_arch function is defined in arch/arm/kernel/setup.c and performs the low-level, architecture-specific setup. Below are a few of the important parts of setup_arch; refer to the source code for more details,

void __init setup_arch(char **cmdline_p)
{
const struct machine_desc *mdesc;

setup_processor();

init_mm.start_code = (unsigned long) _text;
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = (unsigned long) _end;

/* populate cmd_line too for later use, preserving boot_command_line */
strlcpy(cmd_line, boot_command_line, COMMAND_LINE_SIZE);
*cmdline_p = cmd_line;

}

"setup_processor" is another important function called from "setup_arch"; it looks up and validates the processor type, among other things.

static void __init setup_processor(void)
{
struct proc_info_list *list;

/*
* locate processor in the list of supported processor
* types. The linker builds this table for us from the
* entries in arch/arm/mm/proc-*.S
*/
list = lookup_processor_type(read_cpuid_id());

cpu_name = list->cpu_name;
__cpu_architecture = __get_cpu_architecture();

pr_info("CPU: %s [%08x] revision %d (ARMv%s), cr=%08lx\n",
cpu_name, read_cpuid_id(), read_cpuid_id() & 15,
proc_arch[cpu_architecture()], get_cr());

cpu_init();
}
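
The "revision" printed above is simply the low four bits of the CPU ID (read_cpuid_id() & 15). As a sketch, the other fields of the main ID register can be extracted the same way. The field layout below is the usual one for current ARM cores, and the struct and function names are hypothetical:

```c
#include <stdint.h>

/* Hypothetical container for the decoded main ID register fields. */
struct midr_fields_sketch {
    unsigned int implementer;  /* bits 31:24 */
    unsigned int variant;      /* bits 23:20 */
    unsigned int architecture; /* bits 19:16 */
    unsigned int part;         /* bits 15:4  */
    unsigned int revision;     /* bits 3:0, what pr_info() prints */
};

/* Split a raw CPU ID value into its bit fields. */
static struct midr_fields_sketch decode_midr_sketch(uint32_t midr)
{
    struct midr_fields_sketch f;

    f.implementer  = (midr >> 24) & 0xffu;
    f.variant      = (midr >> 20) & 0xfu;
    f.architecture = (midr >> 16) & 0xfu;
    f.part         = (midr >> 4) & 0xfffu;
    f.revision     = midr & 0xfu;
    return f;
}
```

For an example ID value of 0x413fc082 this yields implementer 0x41, variant 3, part 0xc08 and revision 2.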
mm_init_cpumask(&init_mm);
setup_command_line(command_line);
setup_nr_cpu_ids();
setup_per_cpu_areas();
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */

build_all_zonelists(NULL, NULL);
page_alloc_init();

pr_notice("Kernel command line: %s\n", boot_command_line);
parse_early_param();
after_dashes = parse_args("Booting kernel",
static_command_line, __start___param,
__stop___param - __start___param,
-1, -1, &unknown_bootoption);

The above code parses the kernel boot parameters; this includes, for example, setting up the console used to print debug messages on the UART. Refer to the console_setup function in kernel/printk/printk.c,

/*
* Set up a console. Called via do_early_param() in init/main.c
* for each "console=" parameter in the boot command line.
*/
static int __init console_setup(char *str)
{

/* Refer source for code */

}
__setup("console=", console_setup);
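
The __setup() macro registers a (prefix, handler) pair, and the boot code later matches each command-line parameter against the registered prefixes. The sketch below imitates that dispatch with a plain array; in the kernel the entries are collected in a dedicated linker section instead. All names ending in _sketch are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one registered boot parameter. */
struct setup_entry_sketch {
    const char *prefix;
    int (*handler)(const char *value);
};

static char console_name[32];

/* Remember the requested console, e.g. "ttyO2,115200n8". */
static int console_setup_sketch(const char *value)
{
    strncpy(console_name, value, sizeof(console_name) - 1);
    console_name[sizeof(console_name) - 1] = '\0';
    return 1; /* parameter handled */
}

/* A plain array replaces the linker-built table in this sketch. */
static const struct setup_entry_sketch setup_table[] = {
    { "console=", console_setup_sketch },
};

/* Call the handler whose prefix matches; return 0 if nobody claims it. */
static int dispatch_param_sketch(const char *param)
{
    size_t i;

    for (i = 0; i < sizeof(setup_table) / sizeof(setup_table[0]); i++) {
        size_t n = strlen(setup_table[i].prefix);

        if (strncmp(param, setup_table[i].prefix, n) == 0)
            return setup_table[i].handler(param + n);
    }
    return 0;
}
```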

You can also refer to the parse_args source in kernel/params.c
if (!IS_ERR_OR_NULL(after_dashes))
parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
set_init_arg);

jump_label_init();

/*
* These use large bootmem allocations and must precede
* kmem_cache_init()
*/
setup_log_buf(0);
pidhash_init();
vfs_caches_init_early();
sort_main_extable();
trap_init();
mm_init();

/*
* Set up the scheduler prior starting any interrupts (such as the
* timer interrupt). Full topology setup happens at smp_init()
* time - but meanwhile we still have a functioning scheduler.
*/
sched_init();
/*
* Disable preemption - early bootup scheduling is extremely
* fragile until we cpu_idle() for the first time.
*/
preempt_disable();
if (WARN(!irqs_disabled(),
"Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
idr_init_cache();
rcu_init();

/* trace_printk() and trace points may be used after this */
trace_init();

context_tracking_init();
radix_tree_init();
/* init some links before init_ISA_irqs() */
early_irq_init();
init_IRQ();

init_IRQ is the hook from the kernel core into the platform / machine-specific IRQ initialisation. It is defined in arch/arm/kernel/irq.c as below,

void __init init_IRQ(void)
{
int ret;

if (IS_ENABLED(CONFIG_OF) && !machine_desc->init_irq)
irqchip_init();
else
machine_desc->init_irq();
}

Here, machine_desc->init_irq() is a call to the board-specific IRQ initialization function.

For example, for the BeagleBoard, see the macro initialisation below from the board file arch/arm/mach-omap2/board-omap3beagle.c

MACHINE_START(OMAP3_BEAGLE, "OMAP3 Beagle Board")
/* Maintainer: Syed Mohammed Khasim - http://beagleboard.org */
.atag_offset = 0x100,
.reserve = omap_reserve,
.map_io = omap3_map_io,
.init_early = omap3_init_early,
.init_irq = omap3_init_irq,
.init_machine = omap3_beagle_init,
.init_late = omap3_init_late,
.init_time = omap3_secure_sync32k_timer_init,
.restart = omap3xxx_restart,
MACHINE_END

Here, MACHINE_START is a macro that initializes a machine_desc structure; for ARM it is defined in arch/arm/include/asm/mach/arch.h
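
The hook mechanism itself is just a structure of function pointers with a generic fallback. Below is a simplified sketch, assuming device-tree support is enabled so that a missing init_irq hook falls back to the generic path. The struct, the two descriptors and the counters are hypothetical and exist only to make the control flow visible.

```c
#include <stddef.h>

/* Hypothetical, reduced version of a machine descriptor. */
struct machine_desc_sketch {
    const char *name;
    void (*init_irq)(void);
};

static int board_irq_calls, generic_irq_calls;

/* Stand-in for a board-specific hook such as omap3_init_irq. */
static void board_init_irq_sketch(void) { board_irq_calls++; }

/* Stand-in for the generic irqchip_init() path. */
static void generic_irqchip_init_sketch(void) { generic_irq_calls++; }

/* A board that supplies its own hook, and one that does not. */
static const struct machine_desc_sketch board_desc = {
    "board with hook", board_init_irq_sketch
};
static const struct machine_desc_sketch dt_desc = {
    "device-tree board", NULL
};

/* Prefer the board hook; otherwise use the generic initialisation. */
static void init_irq_sketch(const struct machine_desc_sketch *mdesc)
{
    if (mdesc->init_irq)
        mdesc->init_irq();
    else
        generic_irqchip_init_sketch();
}
```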
tick_init();
rcu_init_nohz();
init_timers();

init_timers() is defined in kernel/time/timer.c
hrtimers_init(); /* refer kernel/time/hrtimer.c */
softirq_init(); /* refer source from kernel/softirq.c */
timekeeping_init(); /* refer source from kernel/time/timekeeping.c */
time_init();

time_init() is defined in arch/arm/kernel/time.c and is the hook from the kernel core into the board / hardware time subsystem,

void __init time_init(void)
{
if (machine_desc->init_time) {
machine_desc->init_time();
} else {
#ifdef CONFIG_COMMON_CLK
of_clk_init(NULL);
#endif
clocksource_of_init();
}
}

The machine_desc->init_time() call above goes to the board-specific timer initialization in the same way as described for init_irq. For example, for the BeagleBoard, in arch/arm/mach-omap2/board-omap3beagle.c

MACHINE_START(OMAP3_BEAGLE, "OMAP3 Beagle Board")

.init_time      = omap3_secure_sync32k_timer_init,

MACHINE_END

sched_clock_postinit();
perf_event_init();
profile_init();

The above calls initialize the performance-event and profiling subsystems.
call_function_init();
WARN(!irqs_disabled(), "Interrupts were enabled early\n");
early_boot_irqs_disabled = false;
local_irq_enable();

kmem_cache_init_late();

/*
* HACK ALERT! This is early. We're enabling the console before
* we've done PCI setups etc, and console_init() must be aware of
* this. But we do want output early, in case something goes wrong.
*/
console_init();
if (panic_later)
panic("Too many boot %s vars at `%s'", panic_later,
panic_param);

lockdep_info();

/*
* Need to run this when irqs are enabled, because it wants
* to self-test [hard/soft]-irqs on/off lock inversion bugs
* too:
*/
locking_selftest();

#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
pr_crit("initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n",
page_to_pfn(virt_to_page((void *)initrd_start)),
min_low_pfn);
initrd_start = 0;
}
#endif
page_ext_init();
debug_objects_mem_init();
kmemleak_init();
setup_per_cpu_pageset();
numa_policy_init();
if (late_time_init)
late_time_init();
sched_clock_init();
calibrate_delay();
pidmap_init();
anon_vma_init();
acpi_early_init();
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
#endif
#ifdef CONFIG_X86_ESPFIX64
/* Should be run before the first non-init thread is created */
init_espfix_bsp();
#endif
thread_info_cache_init();
cred_init();
fork_init();
proc_caches_init();
buffer_init();
key_init();
security_init();
dbg_late_init();
vfs_caches_init(totalram_pages);
signals_init();
/* rootfs populating might need page-writeback */
page_writeback_init();

The functions above perform the remaining core initialization. Refer to the actual source for details of what each one does.
proc_root_init();

proc_root_init(), defined in fs/proc/root.c, registers and initializes the /proc filesystem as,

void __init proc_root_init(void)
{
int err;

err = register_filesystem(&proc_fs_type);
if (err)
return;

#ifdef CONFIG_SYSVIPC
proc_mkdir("sysvipc", NULL);
#endif
proc_mkdir("fs", NULL);
proc_mkdir("driver", NULL);
proc_create_mount_point("fs/nfsd");

proc_tty_init();
proc_mkdir("bus", NULL);
proc_sys_init();
}

Refer to the source at fs/proc/root.c for more details.

nsfs_init();
cpuset_init();
cgroup_init();
taskstats_init_early();
delayacct_init();

check_bugs();

acpi_subsystem_init();
sfi_init_late();

if (efi_enabled(EFI_RUNTIME_SERVICES)) {
efi_late_init();
efi_free_boot_services();
}

ftrace_init();

/* Do the rest non-__init'ed, we're now alive */
rest_init();

rest_init() is defined in the same source file, init/main.c. It starts the kernel_init thread, which becomes the first "init" process with PID 1 and the parent of all user-space processes, and then the kthreadd thread. The respective code is as below,

static noinline void __init_refok rest_init(void)
{
int pid;
/*
* We need to spawn init first so that it obtains pid 1, however
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
kernel_thread(kernel_init, NULL, CLONE_FS);

numa_default_policy();

pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
complete(&kthreadd_done);

}

"kernel_init", the function passed when creating the first thread, is what actually starts the "init" process, as below,

static int __ref kernel_init(void *unused)
{

kernel_init_freeable();

/*
* We try each of these until one succeeds.
*
* The Bourne shell can be used instead of init if we are
* trying to recover a really broken machine.
*/
if (execute_command) {
ret = run_init_process(execute_command);
if (!ret)
return 0;
panic("Requested init %s failed (error %d).",
execute_command, ret);
}
if (!try_to_run_init_process("/sbin/init") ||
!try_to_run_init_process("/etc/init") ||
!try_to_run_init_process("/bin/init") ||
!try_to_run_init_process("/bin/sh"))
return 0;

}
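
The selection logic above can be sketched as a small C function: an explicitly requested init is tried alone, otherwise the default candidates are tried in order until one succeeds. The try_exec callback and the only_bin_sh_exists stub are hypothetical stand-ins for actually executing a program, and where this sketch returns NULL for a failed explicit request the kernel panics instead.

```c
#include <stddef.h>
#include <string.h>

/* Callback that tries to start a program; 0 means success. */
typedef int (*try_exec_fn)(const char *path);

/* Default candidates, in the order used by kernel_init(). */
static const char *const init_candidates[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
};

/* Return the path that was started, or NULL if nothing could be run. */
static const char *pick_init_sketch(const char *requested, try_exec_fn try_exec)
{
    size_t i;

    /* An explicitly requested init is tried alone, with no fallback. */
    if (requested)
        return try_exec(requested) == 0 ? requested : NULL;

    for (i = 0; i < sizeof(init_candidates) / sizeof(init_candidates[0]); i++) {
        if (try_exec(init_candidates[i]) == 0)
            return init_candidates[i];
    }
    return NULL;
}

/* Hypothetical stub: pretend that only /bin/sh exists on this system. */
static int only_bin_sh_exists(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}
```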

The kernel_init_freeable() call above performs the initialization that lives in freeable memory, i.e. code marked with the __init macro. This covers filesystem setup, machine initialization, etc., as below,

static noinline void __init kernel_init_freeable(void)
{

do_basic_setup();

}

The above do_basic_setup() function is defined as below,

/*
* Ok, the machine is now initialized. None of the devices
* have been touched yet, but the CPU subsystem is up and
* running, and memory and process management works.
*
* Now we can finally start doing some real work..
*/
static void __init do_basic_setup(void)
{
cpuset_init_smp();
usermodehelper_init();
shmem_init(); /* initializes tmpfs */
driver_init();  /* initializes the device driver model , refer below */
init_irq_proc();
do_ctors();
usermodehelper_enable();
do_initcalls();
random_int_secret_init();
}
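
do_initcalls() walks a list of initialization functions that drivers and subsystems have registered, running them in level order. The sketch below imitates that with a plain array of function pointers; in the kernel the list is assembled by the linker from the initcall macros. The three functions and the recording array are made up for illustration.

```c
#include <stddef.h>

typedef int (*initcall_fn_sketch)(void);

/* Records the order in which the hypothetical initcalls ran. */
static int call_order[3];
static size_t calls_made;

static int core_init_sketch(void)   { call_order[calls_made++] = 1; return 0; }
static int subsys_init_sketch(void) { call_order[calls_made++] = 2; return 0; }
static int device_init_sketch(void) { call_order[calls_made++] = 3; return 0; }

/* Earlier levels come first, as in the kernel's ordered initcall sections. */
static initcall_fn_sketch const initcall_table[] = {
    core_init_sketch, subsys_init_sketch, device_init_sketch,
};

/* Run every registered initcall in order; return how many reported failure. */
static int do_initcalls_sketch(void)
{
    size_t i;
    int failures = 0;

    for (i = 0; i < sizeof(initcall_table) / sizeof(initcall_table[0]); i++) {
        if (initcall_table[i]() != 0)
            failures++;
    }
    return failures;
}
```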

The driver_init() call above performs the necessary initialization of the Linux kernel device driver model; the respective source code is in drivers/base/init.c

void __init driver_init(void)
{
/* These are the core pieces */
devtmpfs_init();
devices_init();
buses_init();
classes_init();
firmware_init();
hypervisor_init();

/* These are also core pieces, but must come after the
* core core pieces.
*/
platform_bus_init();
cpu_dev_init();
memory_dev_init();
container_dev_init();
of_core_init();
}

The above code is self-explanatory: it initializes devices, buses, classes and firmware. Refer to the respective source code for details.

Now that we have seen how the init process is started during kernel boot, let's check it on a running system. Type the command "ps -ax" in a terminal and look at the first two lines of output,

$ ps -ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:05 /sbin/init
2 ? S 0:00 [kthreadd]
