Nov 11, 2014

Linux kernel startup

I've been looking at the Linux kernel after a 10 year hiatus, so I started again at the very beginning: when the kernel is just starting.

Reset vector takes you to architecture specific assembly

This section is likely to lose the audience unless explained carefully.  I will use the ARM assembly, because I am playing around with the Zedboard.

WIP.

C "main"

As we just saw, there is a lot of assembly code that sets up the interrupt vectors, stack, and the C runtime environment, but let's start when the C code begins to run.  This IBM article (written for x86) is a good read, but it still does not have enough detail on how the drivers are loaded in the beginning, so I went to <kernel>/init/main.c:start_kernel(), whose calls match the dmesg lines very well:


start_kernel() | dmesg
boot_cpu_init(); | Booting Linux on physical CPU 0x0
page_address_init(); |
pr_notice("%s", linux_banner); | Linux version 3.15.0 (henry@Zotac64) (gcc version 4.8.3 (Buildroot 2014.08) ) #3 SMP PREEMPT Sun Oct 12 16:55:23 PDT 2014
setup_arch(&command_line); | CPU: ARMv7 Processor [413fc090] revision 0 (ARMv7), cr=18c5387d
 | CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
 | Machine model: Xilinx Zynq ZED
mm_init_owner(&init_mm, &init_task); |
mm_init_cpumask(&init_mm); |
setup_command_line(command_line); | bootconsole [earlycon0] enabled
setup_nr_cpu_ids(); | cma: CMA: reserved 128 MiB at 17800000...
setup_per_cpu_areas(); | PERCPU: Embedded 8 pages/cpu @dfb9e000 s8448 r8192 d16128 u32768
smp_prepare_boot_cpu(); | pcpu-alloc: s8448 r8192 d16128 u32768 alloc=8*4096
 | pcpu-alloc: [0] 0 [0] 1
build_all_zonelists(NULL, NULL); | Built 1 zonelists in Zone order, mobility grouping on. Total pages: 130048
page_alloc_init(); |
pr_notice("Kernel command line: %s\n", boot_command_line); | Kernel command line: console=ttyPS0,115200 ip=192.168.1.9:192.168.1.2:192.168.1.1:255.255.255.0 root=/dev/nfs nfsroot=192.168.1.2:/export/root/zed rw earlyprintk
parse_early_param(); |
jump_label_init(); |
setup_log_buf(0); |
pidhash_init(); | PID hash table entries: 2048 (order: 1, 8192 bytes)
vfs_caches_init_early(); | Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
 | Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
sort_main_extable(); |
trap_init(); |
mm_init(); | Memory: 381504K/524288K available (4498K kernel code, 254K rwdata, 1684K rodata, 204K init, 146K bss, 142784K reserved, 0K highmem)
 | Virtual kernel memory layout: …
sched_init(); |
preempt_disable(); |
idr_init_cache(); |
rcu_init(); | Preemptible hierarchical RCU implementation...
tick_nohz_init(); |
context_tracking_init(); |
radix_tree_init(); |
early_irq_init(); |
init_IRQ(); | NR_IRQS:16 nr_irqs:16 16
tick_init(); |
init_timers(); |
hrtimers_init(); |
softirq_init(); |
timekeeping_init(); |
time_init(); | zynq_clock_init: clkc starts at e0802100...
sched_clock_postinit(); | sched_clock: 16 bits at 54kHz, resolution 18432ns, wraps every 1207951633ns
perf_event_init(); |
profile_init(); |
call_function_init(); |
local_irq_enable(); |
kmem_cache_init_late(); |
console_init(); | Console: colour dummy device 80x30
calibrate_delay(); | Calibrating delay loop... 1332.01 BogoMIPS (lpj=6660096)
pidmap_init(); | pid_max: default: 32768 minimum: 301
proc_caches_init(); | L310 cache controller enabled
 | l2x0: 8 ways, CACHE_ID 0x410000c8, AUX_CTRL 0x72760000, Cache size: 512 kB
check_bugs(); | CPU: Testing write buffer coherency: ok
sfi_init_late(); |
ftrace_init(); |
rest_init(); |

In rest_init(), the init process is forked off as a kernel thread; the constructors are called later from that thread, in do_basic_setup() --> do_ctors().  The forked off init process executes later in its own thread, which enters the kernel at ret_from_fork().  Finally, as in any embedded program, the kernel goes into an infinite idle loop, putting the processor to sleep as much as possible to save power.  As proof, when I halt the processor at a random time while the kernel is running, I usually get this stack trace:

ARM Cortex-A9 MPCore #0 (Suspended)
0xc0020428 cpu_v7_do_idle(): arch/arm/mm/proc-v7.S, line 74
0xc0013d1c arm_cpuidle_simple_enter(): arch/arm/kernel/cpuidle.c, line 18
0xc03d11a4 cpuidle_enter_state(): drivers/cpuidle/cpuidle.c, line 104
0xc03d1298 cpuidle_enter(): drivers/cpuidle/cpuidle.c, line 159
0xc0060ab0 cpu_startup_entry(): kernel/sched/idle.c, line 154
0xc05578e4 rest_init(): init/main.c, line 397
0xc07dbba4 start_kernel(): init/main.c, line 652
0x00008074
...

The way to idle the CPU is to call WFI with IRQ disabled:
ENTRY(cpu_v7_do_idle)
  dsb                @ WFI may enter a low-power mode
  wfi
  mov pc, lr
ENDPROC(cpu_v7_do_idle)

__initcall: how everything else that is static (vs. dynamically inserted) in the kernel is initialized

"simple-bus" (AKA the top level platform bus) is created in init_machine() --> of_platform_populate(NULL, of_default_bus_match_table, NULL, NULL), very early in kernel initialization.  The platform bus/dev children of the top level simple-bus are created recursively at that time.  Each new device goes through bus_probe_device(), which triggers the driver <--> device match check; on a match, the platform device (struct device) is probed, i.e. the driver's .probe function is called from <kernel>/drivers/base/platform.c:platform_drv_probe().  The important thing to remember is that probe() happens when the device and the device driver match, whether by name or through the driver's ID table (the table that MODULE_DEVICE_TABLE exports).

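The match-then-probe flow can be imitated in a few lines of user-space C.  This is only a sketch: every demo_* name is made up for illustration, the "bus" is a plain array, and the real driver core does considerably more (locking, deferred probing, sysfs, and so on).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* All demo_* names are hypothetical stand-ins, not kernel types. */
struct demo_device {
	const char *compatible;	/* e.g. a "compatible" string from the device tree */
	int bound;		/* set to 1 once a driver has probed the device */
};

struct demo_driver {
	const char *name;
	const char *const *match_table;	/* NULL-terminated list of strings */
	int (*probe)(struct demo_device *dev);
};

/* Match check: does the driver's table list this device? */
int demo_bus_match(const struct demo_device *dev, const struct demo_driver *drv)
{
	for (const char *const *id = drv->match_table; *id; id++)
		if (strcmp(*id, dev->compatible) == 0)
			return 1;
	return 0;
}

/* Walk the devices; probe each unbound one that matches. Returns the count. */
int demo_driver_attach(struct demo_driver *drv, struct demo_device *devs, int ndevs)
{
	int probed = 0;

	assert(drv != NULL && devs != NULL);
	for (int i = 0; i < ndevs; i++) {
		if (devs[i].bound || !demo_bus_match(&devs[i], drv))
			continue;
		if (drv->probe(&devs[i]) == 0) {
			devs[i].bound = 1;
			probed++;
		}
	}
	return probed;
}

static int demo_uart_probe(struct demo_device *dev)
{
	printf("probing %s\n", dev->compatible);
	return 0;
}

static const char *const demo_uart_ids[] = { "demo,uart", "demo,uart-v2", NULL };

struct demo_driver demo_uart_driver = {
	.name = "demo-uart",
	.match_table = demo_uart_ids,
	.probe = demo_uart_probe,
};
```

Calling demo_driver_attach(&demo_uart_driver, devs, n) on an array of devices probes only the ones whose compatible string appears in demo_uart_ids, and skips devices that are already bound.
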
rest_init() forks off a kernel thread to run kernel_init() --> kernel_init_freeable() --> do_basic_setup():

/*
 * Ok, the machine is now initialized. None of the devices
 * have been touched yet, but the CPU subsystem is up and
 * running, and memory and process management works.
 *
 * Now we can finally start doing some real work..
 */
static void __init do_basic_setup(void)
{
  cpuset_init_smp();
  usermodehelper_init();
  shmem_init();
  driver_init();
  init_irq_proc();
  do_ctors();
  usermodehelper_enable();
  do_initcalls();
  random_int_secret_init();
}

The initialization order of the subroutines within <kernel>/drivers/base/init.c:driver_init() sheds some light on the device model dependency: for example, dev and devices are both right under /sys, and block and char are in turn below /sys/dev.

void __init driver_init(void) {
  /* These are the core pieces */
  devtmpfs_init();
  devices_init();
  buses_init();
  classes_init();
  firmware_init();
  hypervisor_init();

  /* These are also core pieces, but must come after the
   * core core pieces.
   */
  platform_bus_init();
  cpu_dev_init();
  memory_dev_init();
  container_dev_init();
}

All initcalls have a level, so that the kernel can stage the initializations:

static void __init do_initcalls(void)
{
  int level;
  for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
    do_initcall_level(level);
}

These levels have suggestive names:

static char *initcall_level_names[] __initdata = {
  "early", "core", "postcore", "arch", "subsys", "fs", "device", "late",
};
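
To make the staging concrete, here is a small user-space sketch.  The level names are the ones above; everything prefixed demo_ is hypothetical, and a flat table stands in for the kernel's real bookkeeping.  The entries are listed out of order on purpose, to show that the level, not the position in the table, decides when each function runs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef int (*demo_initcall_t)(void);

/* Same names as initcall_level_names. */
static const char *const demo_level_names[] = {
	"early", "core", "postcore", "arch", "subsys", "fs", "device", "late",
};
#define DEMO_NUM_LEVELS (sizeof(demo_level_names) / sizeof(demo_level_names[0]))

struct demo_initcall {
	int level;
	const char *name;
	demo_initcall_t fn;
};

/* Records the order in which the calls actually ran. */
static char demo_order[128];

static int demo_record(const char *tag)
{
	strncat(demo_order, tag, sizeof(demo_order) - strlen(demo_order) - 1);
	strncat(demo_order, " ", sizeof(demo_order) - strlen(demo_order) - 1);
	return 0;
}

static int demo_driver_init(void) { return demo_record("driver"); }
static int demo_bus_init(void)    { return demo_record("bus"); }
static int demo_fs_init(void)     { return demo_record("fs"); }

/* Deliberately out of order. */
static const struct demo_initcall demo_initcalls[] = {
	{ 6, "demo_driver_init", demo_driver_init },	/* device */
	{ 2, "demo_bus_init",    demo_bus_init },	/* postcore */
	{ 5, "demo_fs_init",     demo_fs_init },	/* fs */
};

/* Run every registered call, one level at a time, lowest level first. */
const char *demo_do_initcalls(void)
{
	size_t n = sizeof(demo_initcalls) / sizeof(demo_initcalls[0]);

	demo_order[0] = '\0';
	for (size_t level = 0; level < DEMO_NUM_LEVELS; level++) {
		for (size_t i = 0; i < n; i++) {
			int ret;

			if ((size_t)demo_initcalls[i].level != level)
				continue;
			printf("[%s] %s\n", demo_level_names[level],
			       demo_initcalls[i].name);
			ret = demo_initcalls[i].fn();
			assert(ret == 0);
			(void)ret;
		}
	}
	return demo_order;
}
```

demo_do_initcalls() returns the recorded order as a string, so the bus (postcore) comes first, the filesystem (fs) second, and the driver (device) last.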

__initcalls appearing on this Zedboard HW:

Kernel code | dmesg
arch_hw_breakpoint_init() | found 5 (+1 reserved) breakpoint and 1 watchpoint registers
zynq_ocm_init() | zynq-ocm f800c000.ps7-ocmc: ZYNQ OCM pool: 256 KiB @ 0xe0880000
customize_machine() | For mach-zynq, calls zynq_init_machine(). Supposed to act on device tree information
usb_init() | usbcore: registered new interface driver usbfs...
inet_init() | NET: Registered protocol family 2
 | TCP established hash table entries: 4096 (order: 2, 16384 bytes)
pl330_probe() | dma-pl330 f8003000.ps7-dma: Loaded driver for PL330 DMAC-2364208...

What a clever idea: if a module (or anything else with an init function complying with the kernel's __initcall declaration protocol) is statically compiled into the kernel, the kernel will call the registered function without knowing a priori the full list of such functions.
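
One way to get this "call functions nobody listed" behavior is to let the linker collect function pointers into a dedicated section, which is, as far as I can tell, the spirit of what the kernel's initcall macros do.  Below is a user-space imitation, assuming a GNU-style toolchain targeting ELF (gcc or clang on Linux), where the linker defines __start_<section> and __stop_<section> symbols for sections whose names are valid C identifiers.  The section name and all demo_* identifiers are made up for this sketch.

```c
#include <assert.h>
#include <stdio.h>

typedef int (*demo_initcall_t)(void);

/* Drop a pointer to fn into the "demo_initcalls" section (hypothetical name). */
#define demo_initcall(fn) \
	static demo_initcall_t demo_initcall_ptr_##fn \
	__attribute__((used, section("demo_initcalls"))) = fn

/* Provided by the linker: the bounds of the "demo_initcalls" section. */
extern demo_initcall_t __start_demo_initcalls[];
extern demo_initcall_t __stop_demo_initcalls[];

static int demo_usb_init(void)
{
	puts("demo_usb_init");
	return 0;
}
demo_initcall(demo_usb_init);

static int demo_inet_init(void)
{
	puts("demo_inet_init");
	return 0;
}
demo_initcall(demo_inet_init);

/* Call whatever got registered; returns how many functions were called. */
int demo_run_initcalls(void)
{
	int count = 0;

	for (demo_initcall_t *call = __start_demo_initcalls;
	     call < __stop_demo_initcalls; call++) {
		int ret = (*call)();

		assert(ret == 0);
		(void)ret;
		count++;
	}
	return count;
}
```

Note that demo_run_initcalls() never names demo_usb_init() or demo_inet_init(); adding a third function plus a demo_initcall() line anywhere in the program is enough to get it called.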

<kernel>/arch/arm/mach-zynq/common.c enforces that the DTS compatible field is "xlnx,zynq-7000", through the zynq_dt_match string array:

DT_MACHINE_START(XILINX_EP107, "Xilinx Zynq Platform")
  .smp = smp_ops(zynq_smp_ops),
  .map_io = zynq_map_io,
  .init_irq = zynq_irq_init,
  .init_machine = zynq_init_machine,
  .init_late = zynq_init_late,
  .init_time = zynq_timer_init,
  .dt_compat = zynq_dt_match,
  .reserve = zynq_memory_init,
  .restart = zynq_system_reset,
MACHINE_END

The machine dt_compat is checked while parsing the DTB, very early in start_kernel(), in setup_arch() --> setup_machine_fdt().  The device tree actually becomes a tree (from a flattened blob) in setup_arch() --> unflatten_device_tree(); machine specific setup hooks like .init_machine() run later, from customize_machine(), and act on that tree.

Platform drivers are for HW that will not dynamically come and go in a Linux system, such as the video and audio controllers in a tablet.  It makes sense to statically pull in the necessary code through the __initcall magic discussed above.  For example:

/* module_platform_driver() - Helper macro for drivers that don't do
 * anything special in module init/exit.  This eliminates a lot of
 * boilerplate.  Each module may only use this macro once, and
 * calling it replaces module_init() and module_exit()
 */
#define module_platform_driver(__platform_driver) \
  module_driver(__platform_driver, platform_driver_register, \
    platform_driver_unregister)

An example is an MDIO driver for an Ethernet MAC:

static struct platform_driver fsl_pq_mdio_driver = {
  .driver = {
    .name = "fsl-pq_mdio",
    .owner = THIS_MODULE,
    .of_match_table = fsl_pq_mdio_match,
  },
  .probe = fsl_pq_mdio_probe,
  .remove = fsl_pq_mdio_remove,
};

module_platform_driver(fsl_pq_mdio_driver);
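
What the macro saves the driver author from writing can be imitated in user space.  In this sketch the demo_* types, functions, and macros are all hypothetical stand-ins; the real module_driver() additionally hands the generated functions to module_init() and module_exit(), which is how a built-in driver ends up among the initcalls.  The generated function name is the driver variable's name plus _init, which is consistent with zed_adau1761_card_driver_init() showing up in the stack trace in the next section.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space stand-ins for the kernel type and functions. */
struct demo_platform_driver {
	const char *name;
	int registered;
};

int demo_platform_driver_register(struct demo_platform_driver *drv)
{
	assert(drv != NULL);
	drv->registered = 1;
	printf("registered %s\n", drv->name);
	return 0;
}

void demo_platform_driver_unregister(struct demo_platform_driver *drv)
{
	assert(drv != NULL);
	drv->registered = 0;
	printf("unregistered %s\n", drv->name);
}

/* Generates <driver>_init() and <driver>_exit() around the given calls. */
#define demo_module_driver(drv, reg, unreg) \
	int drv##_init(void) { return reg(&(drv)); } \
	void drv##_exit(void) { unreg(&(drv)); }

#define demo_module_platform_driver(drv) \
	demo_module_driver(drv, demo_platform_driver_register, \
			   demo_platform_driver_unregister)

struct demo_platform_driver demo_mdio_driver = { .name = "demo-mdio" };

/* Expands to demo_mdio_driver_init() and demo_mdio_driver_exit(). */
demo_module_platform_driver(demo_mdio_driver)
```

After this, demo_mdio_driver_init() registers the driver and demo_mdio_driver_exit() unregisters it, although neither function was written out by hand.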

Loading firmware into devices

When you stop to think about it, a computer system is composed of many peripherals that are commanded by their device drivers only through well defined messaging protocols, but are otherwise independent embedded HW.  Each is controlled by its own CPU (often an 8-bit/16-bit microcontroller, but increasingly a more sophisticated DSP, or even a reconfigurable FPGA or CPU/FPGA hybrid) running its own program, generically called firmware.  Because these uC/uP/DSP/FPGA usually have their own non-volatile memory to store the program/data, their FW usually does NOT need to be programmed during the kernel startup.  The only real reasons to do so are:
  • To upgrade the FW/data, for new behavior
  • To recover when the program memory is corrupt (perhaps during the upgrade?) and needs to be reverted to a known good version.
Either way, FW in consumer grade computers is rarely updated, because the users either don't care or don't know how.  A way to work around this "human obstacle" is for the kernel (the device driver, really) to ALWAYS push the firmware on startup.  Since FW management is the responsibility of the corresponding device driver rather than the kernel proper, I hesitated about discussing the topic in this blog entry.  But since the kernel offers the firmware API request_firmware(), perhaps many device drivers handle the FW upload similarly.  I attached Xilinx SDK to the running kernel on the Zedboard, then set a HW breakpoint on request_firmware() (see my other blog entry for how), and pressed the CPU reset button to see the CPU halt on request_firmware() entry, apparently from the ADAU1761 HDMI audio device driver:

ARM Cortex-A9 MPCore #0 (Breakpoint)
0xc02f1fd0 request_firmware(): drivers/base/firmware_class.c, line 1162
0xc048baa4 _process_sigma_firmware(): sound/soc/codecs/sigmadsp.c, line 135
0xc048bc6c process_sigma_firmware_regmap(): ...und/soc/codecs/sigmadsp-regmap.c, line 30
0xc048b194 adau17x1_load_firmware(): sound/soc/codecs/adau17x1.c, line 761
0xc048b8b4 adau1761_codec_probe(): sound/soc/codecs/adau1761.c, line 702
0xc047c004 soc_probe_codec(): sound/soc/soc-core.c, line 1151
0xc047d010 snd_soc_register_card(): sound/soc/soc-core.c, line 1353
0xc048c5ec zed_adau1761_probe(): sound/soc/adi/zed_adau1761.c, line 131
0xc02e2c28 platform_drv_probe(): drivers/base/platform.c, line 491
0xc02e1028 driver_probe_device(): drivers/base/dd.c, line 302
0xc02e12d8 __driver_attach(): drivers/base/dd.c, line 477
0xc02df2fc bus_for_each_dev(): drivers/base/bus.c, line 311
0xc02e0aec driver_attach(): drivers/base/dd.c, line 496
0xc02e0734 bus_add_driver(): drivers/base/bus.c, line 692
0xc02e198c driver_register(): drivers/base/driver.c, line 167
0xc02e2bf8 __platform_driver_register(): drivers/base/platform.c, line 546
0xc080e148 zed_adau1761_card_driver_init(): sound/soc/adi/zed_adau1761.c, line 159
0xc00089bc do_one_initcall(): init/main.c, line 696
0xc07dbcf4 kernel_init_freeable(): init/main.c, line 762
0xc0557904 kernel_init(): init/main.c, line 840
0xc000ed78 ret_from_fork(): arch/arm/kernel/entry-common.S, line 91

Looking at this stack trace that starts at ret_from_fork() kernel entrypoint (it's assembly) and reading the source code, I realized that:
  • kernel_init_freeable() runs BEFORE the userland init process is started
  • driver inits are at level 6 (counting from 0), which agrees with initcall_level_names seen earlier.
  • There is a kernel parameter called "initcall_debug" that will printk initcall names and the usec taken in each initcall.  Might be useful to figure out where the kernel is dying during startup.
  • adau1761 driver complies with the kernel's sound library--the SoC variant.
  • This (sigma codec) FW is "builtin"; request_firmware() calls fw_get_builtin_firmware()
  • After getting the FW binary, _process_sigma_firmware will perform sanity checks (size, CRC32, magic number in header).
  • If there is a problem, the adau1761 DSP is NOT enabled.
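
The kind of sanity check described in the last two bullets (size, magic number, CRC32) can be sketched in portable C.  The header layout below (7-byte magic, 1-byte version, 4-byte little-endian CRC32 of the payload) and the DEMOFW1 magic are assumptions made up for this example; this is not the actual sigmadsp.c code, and the kernel has its own CRC helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout, for illustration only. */
#define DEMO_FW_MAGIC     "DEMOFW1"
#define DEMO_FW_MAGIC_LEN 7
#define DEMO_FW_HDR_LEN   (DEMO_FW_MAGIC_LEN + 1 + 4)	/* magic, version, CRC32 */

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320 (the zlib variant). */
uint32_t demo_crc32(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	assert(data != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Returns 0 if the image passes the size, magic, and CRC checks; -1 otherwise. */
int demo_fw_check(const uint8_t *fw, size_t size)
{
	uint32_t stored;

	if (fw == NULL || size <= DEMO_FW_HDR_LEN)
		return -1;	/* too small to hold any payload */
	if (memcmp(fw, DEMO_FW_MAGIC, DEMO_FW_MAGIC_LEN) != 0)
		return -1;	/* wrong magic */

	/* The CRC is stored little-endian right after the version byte. */
	stored = (uint32_t)fw[8] | (uint32_t)fw[9] << 8 |
		 (uint32_t)fw[10] << 16 | (uint32_t)fw[11] << 24;
	if (stored != demo_crc32(fw + DEMO_FW_HDR_LEN, size - DEMO_FW_HDR_LEN))
		return -1;	/* payload does not match the stored CRC */
	return 0;
}
```

A driver following this pattern would refuse to push the image to the device when demo_fw_check() fails, which matches the observation that the DSP is left disabled on error.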

3 comments:

  1. Thanks for the detailed info here. I have a question about initializations that are dependent on another. For e.g., I have an I2C eprom memory which holds a hardware revision info, which if I could read early in .init_machine, will help with initializing the rest of the hardware. but for reading the I2C memory, I Guess I will need the I2C core to initialize and process the I2C bus/device information initialized by .init_machine thru i2c_register_board_info(). Is there a way I can wait early in the .initmachine routine right after setting up this I2C info, until the I2C eprom probe occur and update a global hardware revision data.??
    Any help is much appreciated.

    1. There is a probe deferral mechanism, that is used by my DRM GPU driver, but I don't know how that works exactly. I am currently studying this code, so I might have more information later.

    2. Actually, I did write something about this maybe more than a year ago, but it's not much. http://henryomd.blogspot.com/2014/10/qt5-gui-on-zedboard.html

      Bottom line is that probe deferral mechanism seems to be easy and robust.
