Jan 30, 2016

Mainline uClinux on STM32F429-discovery

In a previous blog entry, I explored mainline U-Boot on the stm32f429-discovery eval board.  The mainline kernel now supports the same board, so in this entry I will study uClinux (the Linux kernel without an MMU) on that board.

Mainline kernel's arch/arm/configs/stm32_defconfig

Studying the kernel configs for the reference board should reveal what kernel features are available for the Cortex M4, so let's go through arch/arm/configs/stm32_defconfig

CONFIG_NO_HZ_IDLE=y

Enable tickless idle system, required for any wearable device that can last more than a day on a small rechargeable battery.

CONFIG_HIGH_RES_TIMERS=y

I would guess this is only required for a high sampling frequency control system.  Why is this turned on?

CONFIG_LOG_BUF_SHIFT=16

Log buffer size.  64 KB seems reasonable.
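The 64 KB figure follows from the shift value, since the log buffer holds 2^CONFIG_LOG_BUF_SHIFT bytes.  A minimal Python sketch of that arithmetic (the function name is my own, just for illustration):

```python
def log_buf_bytes(shift):
    """Kernel log buffer size in bytes for a given CONFIG_LOG_BUF_SHIFT."""
    return 1 << shift

print(log_buf_bytes(16) // 1024, "KB")  # 64 KB
```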

CONFIG_BLK_DEV_INITRD=y

initramfs/initrd support.  Essential for a disk-less system.

CONFIG_CC_OPTIMIZE_FOR_SIZE=y

Choose "-Os" over "-O2", since the flash space is severely constrained (only 2 MB)

# CONFIG_UID16 is not set

Legacy 16-bit UID syscall wrappers.  Agree we do not need to support legacy syscall.

# CONFIG_BASE_FULL is not set

Disabling this option reduces the size of miscellaneous core kernel data structures. This saves memory on small machines, but may reduce performance.

# CONFIG_FUTEX is not set

Disabling this option will cause the kernel to be built without support for "fast userspace mutexes".  The resulting kernel may not run glibc-based applications correctly.  This is OK because I will use uclibc (smaller).

# CONFIG_EPOLL is not set

Do not support the epoll family of system calls.  "man epoll" yields:
The  epoll  API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them.
I will need to turn this back on if any userspace libraries I need require it.

# CONFIG_SIGNALFD is not set

Needed if I want to receive signals on a file descriptor.

# CONFIG_EVENTFD is not set

Needed to receive events (kernel or userspace) on a file descriptor.

# CONFIG_AIO is not set

Disable POSIX async I/O (used by some high performance threaded applications) to save 7 KB from the kernel.

CONFIG_EMBEDDED=y

Needed for an embedded system so that certain expert options are available for configuration; selects CONFIG_EXPERT.

# CONFIG_VM_EVENT_COUNTERS is not set

Turn off showing the event count in /proc/vmstat (to reduce the kernel size I guess).

# CONFIG_SLUB_DEBUG is not set

According to this discussion, we should be using the SLOB (drastically simpler and space efficient) as the SLAB allocator instead of the default SLUB allocator.  Anyway, Kconfig says:
SLUB has extensive debug support features. Disabling these can result in significant savings in code size. This also disables SLUB sysfs support. /sys/slab will not exist and there will be no support for cache validation etc. 

# CONFIG_LBDAF is not set

I do NOT need to support large (> 2 TB) block device.

# CONFIG_BLK_DEV_BSG is not set

No need for SCSI generic v4.

# CONFIG_IOSCHED_DEADLINE is not set

Exclude deadline-iosched.o

# CONFIG_IOSCHED_CFQ is not set

Excludes cfq-iosched.o

# CONFIG_MMU is not set

Cortex-M CPUs have no MMU.

CONFIG_ARM_SINGLE_ARMV7M=y

1 CPU ARMv7-M based platforms (Cortex-M0/M3/M4).  Selects:
  • ARM_NVIC: ARM Cortex has NVIC (nested vectored interrupt controller) HW
  • AUTO_ZRELADDR: ZRELADDR is the physical address where the decompressed kernel image will be placed. If AUTO_ZRELADDR is selected, the address will be determined at run-time by masking the current IP with 0xf8000000. This assumes the zImage being placed in the first 128MB from start of memory.
  • CLKSRC_OF
  • COMMON_CLK: single definition of struct clk, useful across many platforms
  • CPU_V7M
  • GENERIC_CLOCKEVENTS
  • NO_IOPORT_MAP
  • SPARSE_IRQ: Sparse irq numbering is useful for [if you] want to define a high CONFIG_NR_CPUS value but still want to have low kernel memory footprint on smaller machines. ( Sparse irqs can also be beneficial on NUMA boxes, as they spread out the interrupt descriptors in a more NUMA-friendly way. )
  • USE_OF: FDT support
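To make the AUTO_ZRELADDR masking concrete, here is a small Python sketch of the arithmetic described in the Kconfig text above.  The sample address is hypothetical; I picked one inside the SDRAM only for illustration.

```python
ZRELADDR_MASK = 0xF8000000  # mask quoted in the Kconfig help text

def auto_zreladdr_base(pc):
    """Start of the 128 MB-aligned window that contains pc."""
    return pc & ZRELADDR_MASK

# Clearing the low 27 bits means each window is 2**27 bytes = 128 MB.
window_bytes = (~ZRELADDR_MASK & 0xFFFFFFFF) + 1

print(hex(auto_zreladdr_base(0x90123456)))   # 0x90000000
print(window_bytes // (1024 * 1024), "MB")   # 128 MB
```

This is why the help text insists on the zImage sitting in the first 128 MB of memory: anything further out would be masked down to the wrong base.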

CONFIG_ARCH_STM32=y

There is no finer granularity for the STM32 flavors than this.  Selects:
  • ARCH_HAS_RESET_CONTROLLER
  • ARMV7M_SYSTICK: ARMv7m has systick HW
  • CLKSRC_STM32
  • RESET_CONTROLLER

CONFIG_SET_MEM_PARAM=y

This does NOT appear in any Kconfig, Makefile, or source?!

CONFIG_DRAM_BASE=0x90000000

Where the external SDRAM begins.

CONFIG_FLASH_MEM_BASE=0x08000000

The on-chip flash is at 0x08000000 for all STM32 variants.  The STM32Lxxx has another bank at 0x0, but let's not digress.

CONFIG_FLASH_SIZE=0x00200000

On-chip flash for the STM32F429 is 2 MB.

CONFIG_PREEMPT=y

Enable kernel preemption, which improves (but does NOT quite guarantee) real-time responsiveness.

# CONFIG_ATAGS is not set

No need for this traditional way of passing data to the kernel at boot time. Unnecessary if solely relying on the flattened device tree.

CONFIG_ZBOOT_ROM_TEXT=0x0

Kconfig:
The physical address at which the ROM-able zImage is to be placed in the target.  Platforms which normally make use of ROM-able zImage formats normally set this to a suitable value in their defconfig file.
Q: what is a ROM-able zImage?

ZBOOT_ROM lets you execute the zImage (compressed kernel image) directly from ROM/flash.

CONFIG_ZBOOT_ROM_BSS=0x0

The base address of an area of read/write memory in the target for the ROM-able zImage which must be available while the decompressor is running. It must be large enough to hold the entire decompressed kernel (which is unlikely to be < 1 MB!) plus an additional 128 KiB.  Since ZBOOT_ROM_TEXT == ZBOOT_ROM_BSS, it means ZBOOT_ROM (execute zImage directly from ROM/flash) is NOT selected.

CONFIG_XIP_KERNEL=y

Kconfig help:
Execute-In-Place allows the kernel to run from non-volatile storage directly addressable by the CPU, such as NOR flash. This saves RAM space since the text section of the kernel is not loaded from flash to RAM.  Read-write sections, such as the data section and stack, are still copied to RAM.  The XIP kernel is not compressed since it has to run directly from flash, so it will take more space to store it.  The flash address used to link the kernel object files, and for storing it, is configuration dependent. Therefore, if you say Y here, you must know the proper physical address where to store the kernel image depending on your own flash memory usage.  Also note that the make target becomes "make xipImage" rather than "make zImage" or "make Image".  The final kernel binary to put in ROM memory will be arch/arm/boot/xipImage.
Denx.de documentation:
  • XIP conserves RAM at the expense of flash. This might be useful if you have a big flash memory and little RAM.
Q: Is the simplicity of running the kernel directly from the on-chip flash worth giving up the in-field upgrade capability? 

CONFIG_XIP_PHYS_ADDR=0x08008000

For !CONFIG_MMU, XIP_VIRT_ADDR is the same as CONFIG_XIP_PHYS_ADDR.  

This is NOT the uC's boot address, as you can confirm in the Boot configuration section of the STM32F4 reference manual:
After this startup delay is over, the CPU fetches the top-of-stack value from address 0x0000 0000, then starts code execution from the boot memory starting from 0x0000 0004.  On the device, 0x0 ~ 0x1FFFFF is aliased to the boot memory (whether ROM, SRAM, or flash depending on the BOOT pins). 
So we are leaving 32 KB at the front: the first 16 KB (at the boot address 0x08000000) is meant for the bootloader (like U-Boot or afboot-stm32), and the next 16 KB holds the DTB at 0x08004000.  The bootloader should pick up the address of the kernel entry function stext in the .head.text section (AKA __HEAD in head-nommu.S) at CONFIG_XIP_PHYS_ADDR.  The address is "+ 1" to indicate thumb mode.  The linker script (arch/arm/kernel/vmlinux.lds.S) places the .head.text section at this address:

#ifdef CONFIG_XIP_KERNEL
. = XIP_VIRT_ADDR(CONFIG_XIP_PHYS_ADDR);
#else
. = PAGE_OFFSET + TEXT_OFFSET;
#endif
.head.text : {
_text = .;
HEAD_TEXT
}
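As a quick check of the flash layout discussed above, this Python sketch computes the room left for the bootloader and the DTB, and the address a bootloader would branch to.  The constant names are mine; the addresses are the ones used in this entry.

```python
FLASH_BASE    = 0x08000000  # CONFIG_FLASH_MEM_BASE: the bootloader lives here
DTB_ADDR      = 0x08004000  # device tree blob
XIP_PHYS_ADDR = 0x08008000  # CONFIG_XIP_PHYS_ADDR: start of .head.text

def thumb_entry(addr):
    """Branch target for Thumb code: the same address with bit 0 set."""
    return addr | 1

bootloader_room = DTB_ADDR - FLASH_BASE     # bytes available to the bootloader
dtb_room = XIP_PHYS_ADDR - DTB_ADDR         # bytes available to the DTB

print(bootloader_room // 1024, dtb_room // 1024)  # 16 16
print(hex(thumb_entry(XIP_PHYS_ADDR)))            # 0x8008001
```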

CONFIG_BINFMT_FLAT=y

Picks up binfmt_flat.o

CONFIG_BINFMT_SHARED_FLAT=y

A kind of flat binary file.  I thought shared library is not viable on no-MMU, but when I checked the Buildroot target option, sure enough, I found "Shared binary" option under FLAT binary type.  But gcc does not even build for this option if NPTL threads is enabled.  Besides, the Buildroot arch/Config.in explains that BR2_BINFMT_FLAT_SHARED is no substitute for the "real" shared library:
        # Even though this really generates shared binaries, there is no libdl
        # and dlopen() cannot be used. So packages that require shared
        # libraries cannot be built. Therefore, we don't select
        # BR2_BINFMT_SUPPORTS_SHARED and therefore force BR2_STATIC_LIBS

Since I configured my toolchain without shared binary file format, the kernel support for it is pointless.  Still, it does raise the question of how one loads such a binary file.

# CONFIG_COREDUMP is not set

Turns off kernel coredump support.  Is this the user space "core dump" I know about?

CONFIG_DEVTMPFS=y

Kconfig help:
This creates a tmpfs/ramfs filesystem instance early at bootup.  In this filesystem, the kernel driver core maintains device nodes with their default names and permissions for all registered devices with an assigned major/minor number.  Userspace can modify the filesystem content as needed, add symlinks, and apply needed permissions.  It provides a fully functional /dev directory, where usually udev runs on top, managing permissions and adding meaningful symlinks.  In very limited environments, it may provide a sufficient functional /dev without any further help. It also allows simple rescue systems, and reliably handles dynamic major/minor numbers.  Notice: if CONFIG_TMPFS isn't enabled, the simpler ramfs file system will be used instead.

CONFIG_DEVTMPFS_MOUNT=y

Kconfig help:
Automount devtmpfs at /dev, after the kernel mounted the rootfs.   This option does not affect initramfs based booting, here the devtmpfs filesystem always needs to be mounted manually after the rootfs is mounted. With this option enabled, it allows to bring up a system in rescue mode with init=/bin/sh, even when the /dev directory on the rootfs is completely empty.

# CONFIG_FW_LOADER is not set

Turns off the USERSPACE (vs. kernel) FW loading.  No loss of feature for me.

# CONFIG_BLK_DEV is not set

Turns off ALL block dev (even ramdisk or loop).

CONFIG_EEPROM_93CX6=y

EEPROM chipsets 93c46 and 93c66 driver.  Does stm32f429 even have an EEPROM?

# CONFIG_INPUT is not set

This board will be a headless system for now.

# CONFIG_SERIO is not set

Some input devices (e.g. keyboard, PS2 mouse) use serial I/O to communicate with the kernel.  Again, this is a headless system.

# CONFIG_VT is not set

VT = virtual terminal.  Again, this is a headless system.

# CONFIG_UNIX98_PTYS is not set

I won't run telnet or xterm into this system, so turning off PTY is OK.

# CONFIG_LEGACY_PTYS is not set

Ditto.

CONFIG_SERIAL_NONSTANDARD=y

Necessary for the STM32 serial port below.

# CONFIG_DEVKMEM is not set

No need for /dev/kmem to debug the kernel

CONFIG_SERIAL_STM32=y

This driver supports all industry standard baud rates (what might those be?).

CONFIG_SERIAL_STM32_CONSOLE=y

Here's the console on STM32!

# CONFIG_HW_RANDOM is not set

STM32F4 has a RNG HW, but perhaps older uC don't.  TODO: enable RNG support for the discovery board.

# CONFIG_HWMON is not set

Not going to care about the HW health for now.  Perhaps the on-chip temperature might be interesting later.

# CONFIG_USB_SUPPORT is not set

Let's not bother with USB for now.

CONFIG_NEW_LEDS=y

The discovery board has LEDs.

CONFIG_LEDS_CLASS=y

sysfs class in /sys/class/leds.  You'll need this to do anything useful with LEDs

CONFIG_LEDS_TRIGGERS=y

LED triggers allow kernel events to drive the LEDs and can be configured via sysfs, like blinking by a programmable timer, or one-shot blink, disk activity, or CPU activity, heartbeat, GPIO activity, or as a camera flash.

CONFIG_LEDS_TRIGGER_HEARTBEAT=y

LED flashes as a hyperbolic function of the 1-minute load average.

# CONFIG_FILE_LOCKING is not set

Disable the standard file locking support (required for NFS and flock) to save 11 KB.

# CONFIG_DNOTIFY is not set

According to Kconfig file, what I am giving up is:
Dnotify is a directory-based per-fd file change notification system that uses signals to communicate events to user-space.  There exist superior alternatives.

# CONFIG_INOTIFY_USER is not set

According to Kconfig file, what I am giving up is:
Inotify allows monitoring of both files and directories via a single open fd.  Events are read from the file descriptor, which is also select()- and poll()-able.  Inotify fixes numerous shortcomings in dnotify and introduces several new features including multiple file events, one-shot support, and unmount notification. 

CONFIG_NLS=y

Native language support

CONFIG_PRINTK_TIME=y

Include time in printk.

CONFIG_DEBUG_INFO=y

"-g" compiler flag.  Essential for JTAG debugging, but might have to be sacrificed if I have only 1 MB flash.

# CONFIG_ENABLE_WARN_DEPRECATED is not set

We don't care to know, I guess:

#undef __deprecated
#undef __deprecated_for_modules
#define __deprecated
#define __deprecated_for_modules

# CONFIG_ENABLE_MUST_CHECK is not set

CPP just ignores "__must_check"

CONFIG_MAGIC_SYSRQ=y

Search for "Magic SysRq Key" in the Linux Kernel Development 3rd edition.

# CONFIG_SCHED_DEBUG is not set

I do NOT want to debug the scheduler.

# CONFIG_DEBUG_BUGVERBOSE is not set

Do NOT make the kernel's __BUG() macro verbose.

# CONFIG_FTRACE is not set

Ftrace is a kernel tracing feature.  Probably no room for this advanced feature?

CONFIG_CRC_ITU_T=y

Kconfig help:
This option is provided for the case where no in-kernel-tree modules require CRC ITU-T V.41 functions, but a module built outside the kernel tree does. Such modules that use library CRC ITU-T V.41 functions require M here.

CONFIG_CRC7=y

Same as above, except this is about the CRC7.

Building the stm32 defconfig kernel and rootfs in Buildroot

The 4.3.3 kernel with above defconfig builds nicely in latest mainline Buildroot (with the Cortex M4 addition to Buildroot I discussed in a previous blog entry).  My buildroot defconfig is:

BR2_arm=y
BR2_cortex_m4=y
BR2_ARM_FPU_VFPV4=y
BR2_ENABLE_DEBUG=y
BR2_TOOLCHAIN_BUILDROOT_LOCALE=y
# BR2_UCLIBC_INSTALL_UTILS is not set
BR2_BINUTILS_VERSION_2_25_X=y
BR2_GCC_VERSION_5_X=y
BR2_TOOLCHAIN_BUILDROOT_CXX=y
BR2_PACKAGE_HOST_ELF2FLT=y
BR2_PACKAGE_HOST_GDB=y
BR2_PACKAGE_HOST_GDB_TUI=y
BR2_PACKAGE_HOST_GDB_PYTHON=y
BR2_GDB_VERSION_7_10=y
BR2_ENABLE_LOCALE_PURGE=y
BR2_ENABLE_LOCALE_WHITELIST="en_US"
BR2_ECLIPSE_REGISTER=y
BR2_TARGET_GENERIC_HOSTNAME="uClinux"
BR2_TARGET_GENERIC_ISSUE="Welcome!"
BR2_ROOTFS_DEVICE_CREATION_STATIC=y
BR2_TARGET_GENERIC_ROOT_PASSWD="*******"
BR2_TARGET_GENERIC_GETTY_PORT="ttyPS0"
BR2_LINUX_KERNEL=y
BR2_LINUX_KERNEL_DEFCONFIG="stm32"
BR2_LINUX_KERNEL_IMAGE_TARGET_CUSTOM=y
BR2_LINUX_KERNEL_IMAGE_TARGET_NAME="xipImage"
BR2_LINUX_KERNEL_DTS_SUPPORT=y
BR2_LINUX_KERNEL_INTREE_DTS_NAME="stm32f429-disco"
BR2_TARGET_ROOTFS_INITRAMFS=y
# BR2_TARGET_ROOTFS_TAR is not set

Note that I left the thread library implementation at the default: NPTL, rather than "none" or "linuxthreads", as is common among other people working on "Linux on Cortex M", to preserve a sliver of hope of porting Qt5 (which currently requires C++, NPTL toolchain, and shared lib) to a no MMU system.

[Update: this is a false hope.  uClibc currently requires MMU for NPTL, so the best I could do was LinuxThreads.  It's just that Buildroot did not expose this uClibc constraint properly, so I got the impression that I was building uClibc for NPTL, when in fact I was DISABLING threading altogether.]

I use the static device table because a wearable product will not have dynamic dev entries.  The devtmpfs may still afford some convenience in device table population, so I may yet revert back to "dynamic with devtmpfs".

The choice of root filesystem Buildroot offers is overwhelming, but this TI wiki page was of some help.
  • axfs: advanced XIP file system allows files to be executed directly from flash/ROM (rather than being copied into RAM)
  • cloop: compressed filesystem
  • cramfs: compressed read-only filesystem
  • ext2/3/4 root filesystem
    • ext2 requires a block device (which does NOT account for the erase block behavior of the flash)
  • initramfs linked into the kernel is the easiest option if:
    • There is extra room where the kernel is stored (increases the kernel image size)
    • There is extra RAM to hold the root filesystem that will be copied along with the kernel
    • You don't need to persist changes to the filesystem (corollary of the above)
  • jffs2 (journalling flash file system): mostly used for NOR flash.  TI wiki page quite helpful.
  • romfs: small read-only filesystem
  • squashfs
  • ubifs: jffs2 successor (better scalability and NAND support)
  • yaffs2: mostly for NAND flash
Since the only non-volatile storage on the stm32f429-discovery board is the on-chip flash, the only viable option for a non-networked rootfs is the initramfs, which I've chosen in the Buildroot defconfig.

I deliberately did NOT build u-boot in Buildroot because the stm32f429-disco does NOT build with the eabihf toolchain, as I discussed in a previous blog entry.

After make <board>_defconfig and then "make", the resulting kernel (remember that debugging is turned on) is > 1 MB:

BRm4$ ls -glh output/images/
total 1.3M
-rw-r--r-- 1 henry  64K Jan  2 23:06 rootfs.cpio
-rw-r--r-- 1 henry 3.2K Jan  2 23:06 stm32f429-disco.dtb
-rwxr-xr-x 1 henry 1.1M Jan  2 23:06 xipImage

Since the xipImage is NOT an ELF file, I can't poke at it more.

BRm4$ file output/images/xipImage 
output/images/xipImage: data

But I CAN look at the vmlinux (uncompressed ELF file, which is used for source code debugging).

BRm4$ arm-linux-size output/build/linux-4.3.3/vmlinux
   text   data    bss    dec    hex filename
1082608  52900 100608 1236116 12dc94 output/build/linux-4.3.3/vmlinux

Even the text section is > 1 MB.  Running readelf on the same file shows that the text has been placed in the on-chip flash, and the kernel expects the .data and .bss sections at the external SDRAM (at 0x90000000):

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .head.text        PROGBITS        08008000 008000 000060 00  AX  0   0  4
  [ 2] .text             PROGBITS        08008060 008060 0b4ad0 00  AX  0   0 32
...
  [15] .data             PROGBITS        90008000 118000 00c9a0 00  WA  0   0 512
  [16] .bss              NOBITS          900149a0 1249a0 018900 00  WA  0   0 32

Note that the .data section is at DRAM_BASE + XIP kernel text offset.  You can also run "$(CROSS_COMPILE)readelf -h" to find where the kernel entry point stext was linked:

  Entry point address:               0x8008001
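A rough flash budget can be worked out from the size output above.  This Python sketch assumes that what lands in flash is the text column plus the initial values of .data (the .bss needs no flash), and that the first 32 KB are reserved ahead of CONFIG_XIP_PHYS_ADDR; treat it as an estimate rather than an exact image size.

```python
FLASH_SIZE = 0x00200000      # CONFIG_FLASH_SIZE (2 MB)
RESERVED_FRONT = 0x00008000  # bootloader + DTB ahead of CONFIG_XIP_PHYS_ADDR

def flash_needed(text, data):
    """Bytes the image occupies in flash: code plus initial .data values."""
    return text + data

available = FLASH_SIZE - RESERVED_FRONT
needed = flash_needed(1082608, 52900)  # text and data columns from arm-linux-size
free = available - needed

print(available, needed, free)  # 2064384 1135508 928876
```

So even with debugging turned on, a bit under half of the usable flash is still free, which is consistent with the 1.1M xipImage listed earlier.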

Describing the stm32f429-discovery board to the kernel

DTS is now the accepted way to describe the hardware to the kernel.  For this board, arch/arm/boot/dts/stm32f429-disco.dts is that file.  But since DTS inheritance is used extensively, this file inherits from increasingly non-board-specific DTSI files: stm32f429.dtsi --> armv7-m.dtsi --> skeleton.dtsi.  The skeleton is literally just the skeleton, so let's start with armv7-m.dtsi, which describes the interrupt controller and the systick hardware available on ALL ARMv7-M parts:

        nvic: nv-interrupt-controller  {
                compatible = "arm,armv7m-nvic";
                interrupt-controller;
                #interrupt-cells = <1>;
                reg = <0xe000e100 0xc00>;
        };

        systick: timer@e000e010 {
                compatible = "arm,armv7m-systick";
                reg = <0xe000e010 0x10>;
                status = "disabled";
        };
...

stm32f429.dtsi then supplies the ST-specific hardware: clocks, timer, USART, random number generator, and RCC (reset and clock control):

        clocks {
                clk_hse: clk-hse {
                        #clock-cells = <0>;
                        compatible = "fixed-clock";
                        clock-frequency = <0>;
                };
        };
        soc {
...
                usart1: serial@40011000 {
                        compatible = "st,stm32-usart", "st,stm32-uart";
                        reg = <0x40011000 0x400>;
                        interrupts = <37>;
                        clocks = <&rcc 0 164>;
                        status = "disabled";
                };
                rcc: rcc@40023810 {
                        #clock-cells = <2>;
                        compatible = "st,stm32f42xx-rcc", "st,stm32-rcc";
                        reg = <0x40023800 0x400>;
                        clocks = <&clk_hse>;
                };
                rng: rng@50060800 {
                        compatible = "st,stm32-rng";
                        reg = <0x50060800 0x400>;
                        interrupts = <80>;
                        clocks = <&rcc 0 38>;
                };
        };

The USART base address is found in the uC datasheet memory mapping table.  On the stm32f429-disco board, usart1 TX/RX are exposed on P1 header, as explained here.

Finally, the board specific DTS chooses the serial console, SDRAM address, external clock frequency, and the kernel boot arg:

/ {
        model = "STMicroelectronics STM32F429i-DISCO board";
        compatible = "st,stm32f429i-disco", "st,stm32f429";
        chosen {
                bootargs = "root=/dev/ram rdinit=/linuxrc";
                stdout-path = "serial0:115200n8";
        };
        memory {
                reg = <0x90000000 0x800000>;
        };
};
&clk_hse {
        clock-frequency = <8000000>;
};
&usart1 {
        status = "okay";
};

The DTB produced by Buildroot (see above) has to be written to the address that the boot loader will pass to the kernel: 0x08004000 in this case.  Since afboot-stm32 is essentially just a function call for me, I use openocd itself to write the DTB to the 2nd flash block:

BRm4$ sudo openocd -f /usr/local/share/openocd/scripts/board/stm32f429discovery.cfg \
-c "init" -c "reset init" \
-c "flash probe 0" -c "flash info 0" \
-c "flash write_image erase output/images/stm32f429-disco.dtb 0x08004000" \
-c "reset run" -c "shutdown"

Note that a NOR flash works by erasing a whole block (and therefore is NOT suitable for a random write usage).  Since the bootloader is at 0x08000000, writing to 0x08004000 would erase the boot loader if these 2 addresses were in the same flash block.  The blocks of the boot loader, the DTB, and the kernel were chosen to support independent update of these 3 binaries (that is why the kernel is at the 3rd flash block).
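To see which erase block each binary lands in, here is a Python sketch of an address-to-sector lookup.  The sector sizes are my assumption from the flash organization table in the reference manual for the 2 MB dual-bank part (per bank: 4 x 16 KB, 1 x 64 KB, 7 x 128 KB); check your own manual before relying on them.

```python
FLASH_BASE = 0x08000000
KB = 1024

# Assumed sector sizes of one 1 MB bank; the 2 MB part has two such banks.
BANK_SECTORS = [16 * KB] * 4 + [64 * KB] + [128 * KB] * 7

def sector_of(addr):
    """Return (sector index, sector start address) of the block holding addr."""
    start = FLASH_BASE
    for index, size in enumerate(BANK_SECTORS * 2):
        if start <= addr < start + size:
            return index, start
        start += size
    raise ValueError("address outside the on-chip flash")

for name, addr in [("bootloader", 0x08000000),
                   ("DTB", 0x08004000),
                   ("kernel", 0x08008000)]:
    index, start = sector_of(addr)
    print(name, index, hex(start))
```

Under that assumption the three binaries start in sectors 0, 1, and 2 respectively, so each can be erased and rewritten without touching the others (the kernel of course spills into the following sectors).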

According to the openocd documentation, reading NOR flash works the same as reading memory, because NOR flash HW behaves like a RAM for reading.  So this command should print an endian flipped DTB header--one that matches the hexdump on the DTB. 

BRm4$ sudo openocd -f /usr/local/share/openocd/scripts/board/stm32f429discovery.cfg \
-c "init" -c "reset init" -c "mdw 0x08004000 4" -c "reset run" -c "shutdown"
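The "endian flipped" part deserves a small illustration.  DTB header fields are stored big-endian, and the first one is the magic number 0xd00dfeed, while the Cortex M4 reads words little-endian.  The Python sketch below shows how the magic looks when read back as a little-endian word; the 8-byte sample header (and its total size field) is made up for the example.

```python
import struct

FDT_MAGIC = 0xD00DFEED  # stored big-endian at the start of a DTB

def fdt_header(blob):
    """Return (magic, totalsize) from the first 8 bytes of a DTB."""
    return struct.unpack(">II", blob[:8])

def as_little_endian_word(be_value):
    """How a big-endian 32-bit field looks when read as a little-endian word."""
    return struct.unpack("<I", struct.pack(">I", be_value))[0]

# Hypothetical header prefix: the magic followed by a made-up total size.
sample = struct.pack(">II", FDT_MAGIC, 0x0C80)
magic, totalsize = fdt_header(sample)

print(hex(magic), totalsize)                    # 0xd00dfeed 3200
print(hex(as_little_endian_word(FDT_MAGIC)))    # 0xedfe0dd0
```

So if the first word printed by mdw is 0xedfe0dd0, the DTB is where the kernel expects it.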

JTAG debugging the stm32 defconfig kernel

I found a simple boot loader for the XIP kernel: afboot-stm32.  The code turns on all hardware on stm32f429, but since I turned off all peripheral init for the XIP kernel (I was curious whether the XIP kernel can initialize the external SDRAM by itself), the "bootloader" just jumps to the kernel entry location and consequently fits comfortably in the 16 KB NOR flash block at the uC boot address (0x08000000).  Since the kernel is on a different block, I can write the kernel image (in the vmlinux ELF produced right before the kernel makefile builds the xipImage) into the flash with openocd (through the Eclipse CDT debugger for convenience) and debug the XIP kernel.  I used the following debug configuration for the ST-Link, where I turned off invoking openocd within the debugger (because openocd requires sudo on Ubuntu--I suppose I can try to fix this with udev rules...):
Writing > 1 MB to the on-chip flash over 1800 kHz JTAG (requested 2000 kHz but got 1800 kHz instead) should only take about 5 seconds but actually takes > 30 seconds, so the flash write path over JTAG must be quite inefficient.  Eventually, I hit the HW breakpoint set in arch/arm/mm/proc-v7m.S __v7m_setup (I set the breakpoint there to debug why the kernel was jumping to 0x0 at some point).
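The "5 seconds" estimate is just the raw bit count divided by the clock rate.  A Python sketch of that lower bound, assuming one data bit per clock and ignoring all protocol and flash programming overhead:

```python
def ideal_transfer_seconds(num_bytes, clock_hz):
    """Lower bound on transfer time: one data bit per clock, no overhead."""
    return num_bytes * 8 / clock_hz

print(round(ideal_transfer_seconds(1024 * 1024, 1_800_000), 1))  # 4.7
```

The gap between this figure and the observed 30+ seconds is overhead that the simple model leaves out.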

I see that the vector_table (defined in entry-v7m.S) is expected to be somewhere on the SDRAM (0x9000e400).  That broke the initial SVCall, because the vector_table had not yet been copied to the SDRAM.

The vector table definition in the assembly file merely says to place it in the .data section:

.data
#if CONFIG_CPU_V7M_NUM_IRQ <= 112
.align 9
#else
.align 10
#endif

/*
 * Vector table (Natural alignment need to be ensured)
 */
ENTRY(vector_table)
.long 0 @ 0 - Reset stack pointer
.long __invalid_entry @ 1 - Reset
.long __invalid_entry @ 2 - NMI
.long __invalid_entry @ 3 - HardFault
.long __invalid_entry @ 4 - MemManage
.long __invalid_entry @ 5 - BusFault
.long __invalid_entry @ 6 - UsageFault
.long __invalid_entry @ 7 - Reserved
.long __invalid_entry @ 8 - Reserved
.long __invalid_entry @ 9 - Reserved
.long __invalid_entry @ 10 - Reserved
.long vector_swi @ 11 - SVCall
.long __invalid_entry @ 12 - Debug Monitor
.long __invalid_entry @ 13 - Reserved
.long __pendsv_entry @ 14 - PendSV
.long __invalid_entry @ 15 - SysTick
.rept CONFIG_CPU_V7M_NUM_IRQ
.long __irq_entry @ External Interrupts
.endr

The ENTRY() macro merely aligns the symbol and puts it in the global namespace.
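The .align 9 versus .align 10 choice in the listing follows from the table size: 16 system vectors plus CONFIG_CPU_V7M_NUM_IRQ external ones, 4 bytes each, and the table has to be naturally aligned.  A Python sketch of that calculation (my own helper, not kernel code):

```python
def vector_table_align_shift(num_irq):
    """Smallest shift such that 2**shift covers (16 + num_irq) 4-byte vectors."""
    size = (16 + num_irq) * 4
    shift = 0
    while (1 << shift) < size:
        shift += 1
    return shift

print(vector_table_align_shift(112))  # 9
print(vector_table_align_shift(113))  # 10
print(vector_table_align_shift(240))  # 10
```

With 112 IRQs the table is exactly 512 bytes, which is where the "<= 112" threshold comes from.  The assembly only ever picks 9 or 10, so for small IRQ counts it over-aligns compared with this helper.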

The .data section is placed ("linked") into the target memory in arch/arm/kernel/vmlinux.lds.S, at the address specified by PAGE_OFFSET:

#ifdef CONFIG_XIP_KERNEL
__data_loc = ALIGN(4); /* location in binary */
. = PAGE_OFFSET + TEXT_OFFSET;
#else
...
#endif

.data : AT(__data_loc) {
_data = .; /* address in memory */
_sdata = .;
...

Remember the distinction between the data address within "binary" (kernel image) and "memory" where the kernel will run from (presumably the SDRAM).

The argument to AT() is the LMA (load address), not the virtual address.  Note that VMA (virtual address) is NOT specified here, and according to the LD manual:

If you do not provide address, the linker will set it based on region if present, or
otherwise based on the current value of the location counter.

The current location counter is in SDRAM territory when .data is declared in vmlinux.lds.S.  So while __data_loc is in the flash, the .data section's VMA is in the SDRAM.  My first suspicion was that this is the bug: that the VMA should be in the flash (__data_loc), while the LMA should be in the SDRAM.

Symbol table '.symtab' contains 21570 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
...
 18650: 08114914     0 NOTYPE  GLOBAL DEFAULT   14 __data_loc
...

We just saw that __data_loc was in the flash so why is .data in the SDRAM, according to "arm-linux-readelf -S vmlinux" as we saw above?

__data_loc is used in arch/arm/kernel/head-common.S __mmap_switched_data:

.align 2
.type __mmap_switched_data, %object
__mmap_switched_data:
.long __data_loc @ r4
.long _sdata @ r5
.long __bss_start @ r6
.long _end @ r7
...

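This seems to load R4 with __data_loc (where .data sits in the flash image) and R5 with _sdata (where .data is expected at run time, in the SDRAM).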
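My understanding (not verified instruction by instruction) is that the boot code uses these four values to copy the initialized data from its load address in flash to its run address in RAM, word by word, until it reaches the start of .bss.  A toy Python model of that copy, with a dict standing in for memory and a made-up two-word .data section:

```python
def copy_data(memory, data_loc, sdata, bss_start):
    """Copy .data word by word from its load address to its run address."""
    src, dst = data_loc, sdata
    while dst != bss_start:
        memory[dst] = memory[src]
        src += 4
        dst += 4

# Toy memory: two words of initialized data stored in "flash" at __data_loc.
memory = {0x08114914: 0x11111111, 0x08114918: 0x22222222}

# Hypothetical 8-byte .data section; the real one is far larger.
copy_data(memory, data_loc=0x08114914, sdata=0x90008000, bss_start=0x90008008)

print(hex(memory[0x90008000]), hex(memory[0x90008004]))
```

The important consequence is that the destination memory must already be working when this copy runs.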

I found CONFIG_PAGE_OFFSET in autoconf.h to be the same as the DRAM_BASE specified in the defconfig.  And TEXT_OFFSET was set as the -DTEXT_OFFSET=0x00008000 argument to the linker script generation, in turn specified by arch/arm/Makefile:

# Text offset. This list is sorted numerically by address in order to
# provide a means to avoid/resolve conflicts in multi-arch kernels.
textofs-y := 0x00008000
...
# The byte offset of the kernel image in RAM from the start of RAM.
TEXT_OFFSET := $(textofs-y)

Although reserving the 32 KB at the beginning of the SDRAM makes sense when the system runs on the SDRAM (like the laptop)--for the same reason we reserved the first 32 KB of the on-chip NOR flash as discussed above--this seems to be a waste of SDRAM.  [I'll try to make use of these 8 pages later.]
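The numbers line up with the readelf output seen earlier.  A Python sketch of the arithmetic, assuming 4 KB pages:

```python
PAGE_OFFSET = 0x90000000  # same as CONFIG_DRAM_BASE on this no-MMU build
TEXT_OFFSET = 0x00008000  # textofs-y from arch/arm/Makefile
PAGE_SIZE = 4096          # assumption: 4 KB pages

data_vma = PAGE_OFFSET + TEXT_OFFSET
reserved_pages = TEXT_OFFSET // PAGE_SIZE

print(hex(data_vma))     # 0x90008000
print(reserved_pages)    # 8
```

0x90008000 is exactly the Addr of the .data section in the section headers, and the skipped 32 KB is the 8 pages mentioned above.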

Putting all these together, I understand the ARM kernel boot code expects the memory that holds the data to have been started already.  Since I want to store the kernel image on the on-chip flash for now (i.e. the board does NOT have an external flash), it means I have to put the external SDRAM boot-up code in the 16 KB at 0x08000000.  But U-Boot text size is over 100 KB, even with debugging symbols removed (u-boot I built below has debugging symbols).

u-boot$ arm-linux-size u-boot
   text   data    bss    dec    hex filename
 126413   7384  40572 174369  2a921 u-boot

So I found a bare metal code called afboot-stm32 to bootstrap the SDRAM on the discovery board before starting the kernel.

Patch the binfmt_flat.c to add the thumb bit to the start address

In start_kernel() --> rest_init() --> kernel_thread --> do_fork, the init is forked off with a newly created kernel thread.  kernel_init runs the init, which in this case is in the initramfs.

if (ramdisk_execute_command) {
ret = run_init_process(ramdisk_execute_command);
if (!ret)
return 0;
pr_err("Failed to execute %s (error %d)\n",
      ramdisk_execute_command, ret);
}

But the flat binary produced by Buildroot has a bug on thumb: the start address of the executable is missing the "thumb bit".  A quick and dirty workaround is to force it in the kernel, in fs/binfmt_flat.c load_flat_binary(), like this:

start_addr = libinfo.lib_list[0].entry | 1;

But with the latest uClibc, this workaround might not be necessary any more.
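For what it's worth, forcing the bit is harmless even when the toolchain already sets it, because OR-ing with 1 is idempotent.  A Python sketch with a hypothetical entry address:

```python
def with_thumb_bit(addr):
    """Set bit 0 so the CPU treats the branch target as Thumb code."""
    return addr | 1

def is_thumb(addr):
    """True if bit 0 (the Thumb bit) is set."""
    return bool(addr & 1)

entry = 0x90020044  # hypothetical flat binary entry point, Thumb bit missing
fixed = with_thumb_bit(entry)

print(hex(fixed), is_thumb(entry), is_thumb(fixed))  # 0x90020045 False True
print(with_thumb_bit(fixed) == fixed)                # True
```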

Producing the STATIC_FLAT uClibc and busybox exe

At this point, the kernel was running the init program called "/linuxrc" (which happens to be just a softlink to busybox--as every executable on this build is).  More precisely arch/arm/kernel/vmlinux.lds.S places the COMPRESSED initramfs image at the end of the .init.data, as you can see here:

.init.data : {
#ifndef CONFIG_XIP_KERNEL
INIT_DATA
#endif
INIT_SETUP(16)
INIT_CALLS
CON_INITCALL
SECURITY_INITCALL
INIT_RAM_FS
}

This section is then copied to the SDRAM during one of the kernel initcalls: do_initcalls() -->  populate_rootfs() --> unpack_to_rootfs(__initramfs_start, __initramfs_size).  The INIT_RAM_FS macro above defines the __initramfs_start and __initramfs_size used in the code (in my example 0x8118404 and 176319 respectively).  Next, the init executable ("/linuxrc" resolving to busybox) is located (AKA "loaded") in the rootfs, relocated within the SDRAM (binfmt_flat.c), and "vforked" off.  The relocation behavior depends on the target file format.  Recall that the kernel is expecting BFLT format, but in the stm32 kernel config above the "shared flat binary" feature was turned on.  I lost several days debugging the load failure because I turned on SHARED_FLAT target format in uClibc (vs. STATIC_FLAT or STATIC_FLAT_SEP_DATA).

The problem phenotype is that the busybox binary has a lot of GOT entries (which are 32 bits), and the MSB of those entries is 0x7F (which is certainly not a valid shared library ID, given the kernel's limit--set to 3).  On a memory constrained system, sharing the executable's text section (cannot share data or bss) seems like a wonderful idea, and this feature seems to have been pioneered for blackfin.  But both the shared library and separate code/data features require gcc support (the blackfin specific gcc option is -mid-shared-library)--which has NOT yet been ported to ARM.  While building the toolchain, using anything but the static flat binary format in the Buildroot config stops the gcc build (so you know).  But Buildroot's toolchain exe format does not get propagated to the uClibc equivalent option, so for now, one must specify a custom uClibc defconfig.
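As I read binfmt_flat.c, with shared flat support enabled the loader treats the top byte of a relocation value as the shared library ID and the low 24 bits as the offset.  The Python sketch below models that split; the limit of 4 libraries (IDs 0 to 3) and the sample GOT entry are my assumptions for illustration.

```python
MAX_SHARED_LIBS = 4  # assumption: valid library IDs are 0..3

def split_reloc(r):
    """Split a 32-bit relocation value into (library id, 24-bit offset)."""
    return (r >> 24) & 0xFF, r & 0x00FFFFFF

lib_id, offset = split_reloc(0x7F001234)  # hypothetical GOT entry

print(lib_id, hex(offset), lib_id < MAX_SHARED_LIBS)  # 127 0x1234 False
```

A library ID of 127 is far outside the allowed range, which matches the load failure I was seeing.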

Even after fixing this, /linuxrc did not run.  When I looked at the busybox executable produced by Buildroot in the output/build/busybox-<version> folder, I saw that it was only roughly 300 bytes: impossibly small.  After several days of after-work debugging, I learned that the final link step for busybox uses gcc rather than ld, but the LDFLAGS were still being pulled in.  Normally, this is not a problem, but for the BFLT case, there are different flags for the "elf2flt" transformation:
  • The gcc (when used as a linker) option is -Wl,-elf2flt
  • The ld option is -elf2flt 
These CPPFLAGS and LDFLAGS workarounds (Buildroot patch for the nommu/flat) are necessary to skip the stripping step after the link (which is broken for the BFLT format).  Since gcc just passes the options starting with "-Wl," to the linker, the 2 options mean the same.  BUT, since the LDFLAGS included the bare "-elf2flt", gcc interpreted this as the "entry point" option ("-e" with the argument "lf2flt"), and the link step failed silently, yielding an empty executable.  I worked around this by replacing "-elf2flt" with "-Wl,-elf2flt" in the makefiles, as shown in this example for the busybox/Makefile:

     "$(LDFLAGS) $(subst elf2flt,Wl$(comma)-elf2flt, $(EXTRA_LDFLAGS))" \

There were also a few patches I picked up from the varcain repo.  With these changes, busybox builds and is > 500 KB, which makes more sense than the 300 B seen earlier.
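
The "-elf2flt" to "-Wl,-elf2flt" rewrite described above can also be sketched in plain shell, which is handy for checking what a set of flags would turn into before touching a makefile. This is only an illustration: wrap_elf2flt is a name I made up, and unlike the $(subst ...) one-liner it works word by word, so a flag that already carries the "-Wl," prefix is left alone.

```shell
# Prefix a bare -elf2flt linker flag with -Wl, so gcc hands it to the
# linker instead of parsing it as "-e lf2flt".
wrap_elf2flt() {
    out=
    for flag in "$@"; do
        case $flag in
            -elf2flt) flag="-Wl,$flag" ;;   # bare ld-style flag: wrap it
        esac
        out="${out:+$out }$flag"            # join with single spaces
    done
    printf '%s\n' "$out"
}

wrap_elf2flt -elf2flt -static
```

The last line prints the flags as gcc should see them when it is used as the link driver.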

BRm4/output/build/busybox-1.24.1$ arm-linux-size busybox_unstripped.gdb 
   text   data    bss    dec    hex filename
 381820 154816  19408 556044  87c0c busybox_unstripped.gdb

But I tried to pare this down a bit.  Dumping all network related commands and daemons saves about 90 KB.

BRm4/output/build/busybox-1.24.1$ arm-linux-size busybox_unstripped.gdb 
   text   data    bss    dec    hex filename
 318700 127232  18352 464284  7159c busybox_unstripped.gdb

Dumping commands for the storage media I will never use (like HDD, cdrom, floppy) knocks off another 55 KB.

BRm4/output/build/busybox-1.24.1$ arm-linux-size busybox_unstripped.gdb 
   text   data    bss    dec    hex filename
 282364 107256  18248 407868  6393c busybox_unstripped.gdb

I saved the resulting changed .config as BR2/package/busybox/stm32f429-disco.config, and made Buildroot use this as the busybox config.

As soon as I started running the right busybox exe ("/linuxrc"), I found that the init process crashed while running the very first shell script: /etc/init.d/rcS

Simplified init script for stm32

Debugging an init shell script was difficult (even with the shell script tracing option "set -x" added to the script), and I finally resorted to bisecting the script to hunt down the offending shell syntax:

for i in /etc/init.d/S??* ;do
     # Ignore dangling symlinks (if any).
     [ ! -f "$i" ] && continue

Firstly, the file pattern match did not even work: "S??*" did NOT get expanded out, but remained at "S??".  More immediately, the "test for file" syntax is what crashed busybox.  It looks like hush (the only shell available to the uClinux system for now) cannot parse legitimate sh scripts.  So I got rid of the fancy statements in the busybox provided rcS (which means I am not starting logging or the urandom device driver).
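
The bisection mentioned above can be sketched as a small helper that feeds only the first N lines of a script to a shell. This is a rough illustration rather than what I actually typed on the board: run_prefix is a made-up name, and the third argument (the shell to use) defaults to sh.

```shell
# Run only the first N lines of a script, to narrow down which line
# makes the shell fall over.
# $1: script path, $2: number of lines to run, $3: shell (default: sh)
run_prefix() {
    script=$1
    lines=$2
    shell=${3:-sh}
    head -n "$lines" "$script" | "$shell"
}
```

Usage would be something like run_prefix /etc/init.d/rcS 10, then raising or lowering the line count depending on whether it survives.  Cutting a loop or an if block in the middle produces syntax errors of its own, so the cut points have to fall between complete statements.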

Mount devtmpfs

mount: mounting devpts on /dev/pts failed: No such device

It's in varcain

[    0.880000] STM32 USART driver initialized
[    0.880000] 40011000.serial: ttyS0 at MMIO 0x40011000 (irq = 17, base_baud = 5625000) is a stm32-usart
[    1.130000] console [ttyS0] enabled
...
can't open /dev/ttyS0: No such file or directory

This is printed by getty's open_tty(); getty is spawned from inittab (which launches init.d/rcS as well).

# Put a getty on the serial port
ttyS0::respawn:/sbin/getty -L  ttyS0 0 vt100 # GENERIC_SERIAL

These inittab lines come from the Buildroot xconfig "getty" option.

Recall that we chose devtmpfs in the kernel config.  It's supposed to populate /dev, so why isn't it?  According to the kernel log, devtmpfs is getting initialized.

[    0.110000] devtmpfs: initialized

But I am not seeing "mounted".  I thought the kernel config specified devtmpfs mount, so I did not understand why it wasn't getting mounted, until I realized that the feature does NOT apply to an initramfs rootfs.  So I mounted devtmpfs in the rcS script:

mount -t devtmpfs devtmpfs /dev
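
If the same rcS ever ends up on a rootfs where the kernel does mount devtmpfs by itself, the mount above would be attempted twice.  A guarded variant could look like the sketch below.  It assumes /proc is already mounted so that /proc/mounts is readable, it deliberately avoids the bracket test that gave hush trouble, and devtmpfs_mounted is a hypothetical helper name.  I have not verified that hush on this build accepts shell functions.

```shell
# Succeed if a devtmpfs is already mounted on /dev.
# $1: mounts table to inspect (defaults to /proc/mounts)
devtmpfs_mounted() {
    grep -q '^[^ ]* /dev devtmpfs ' "${1:-/proc/mounts}"
}

# In rcS the guard would then read:
#   devtmpfs_mounted || mount -t devtmpfs devtmpfs /dev
```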

As soon as I logged in as root, the system crashed after starting /bin/sh (with argument "-sh" for login) while running /etc/profile, which again has syntax that hush cannot handle (the bracket statement).  So I simplified it to this:

export PATH=/bin:/sbin:/usr/bin:/usr/sbin

export PS1='# '

export PAGER='/bin/more '
export EDITOR='/bin/vi'

# Source configuration files from /etc/profile.d
. /etc/profile.d/umask

And finally, I can see the shell!

/root # uname -a
[  266.740000] BINFMT_FLAT: Loading file: /bin/uname
[  266.740000] Mapping is 90200000, Entry point is dd1, data_start is 429a4
[  266.740000] Load /bin/uname: TEXT=90200040-902429a4 DATA=902429c0-902552c4 BSS=902552c4-902591bc
Linux (none) 4.4.0 #5 PREEMPT Sat Jan 30 19:25:21 PST 2016 armv7ml GNU/Linux
/root # 

According to top, it looks like I am using about half of the 8 MB SDRAM:

Mem: 3620K used, 4308K free, 0K shrd, 0K buff, 1556K cached
CPU:  88% usr  11% sys   0% nic   0% idle   0% io   0% irq   0% sirq
Load average: 0.00 0.00 0.00 1/19 29
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
   29    26 root     R      396   5% 100% top
   26     1 root     S      404   5%   0% -sh
    1     0 root     S      396   5%   0% {linuxrc} init
    2     0 root     SW       0   0%   0% [kthreadd]
    3     2 root     SW       0   0%   0% [ksoftirqd/0]
    4     2 root     SW       0   0%   0% [kworker/0:0]
    7     2 root     SW       0   0%   0% [rcu_preempt]
    8     2 root     SW       0   0%   0% [rcu_sched]
    9     2 root     SW       0   0%   0% [rcu_bh]
   10     2 root     SW       0   0%   0% [kdevtmpfs]
   11     2 root     SW<      0   0%   0% [writeback]
   12     2 root     SW<      0   0%   0% [bioset]
   13     2 root     SW<      0   0%   0% [kblockd]
   14     2 root     SW       0   0%   0% [kworker/0:1]
   15     2 root     SW       0   0%   0% [kswapd0]
   16     2 root     SW<      0   0%   0% [deferwq]
   17     2 root     SW       0   0%   0% [kworker/u2:1]
    5     2 root     SW<      0   0%   0% [kworker/0:0H]
    6     2 root     SW       0   0%   0% [kworker/u2:0]

Next step

Explore a sexy UI on a wearable band  

Appendix: kernel debugging techniques

I learned how to use the assembly view "instruction stepping mode" in Eclipse.
I also learned how to load the busybox ELF file's symbols at the relocated text, data, and bss locations reported by the kernel.  To be pedantic, gdb cannot show me the symbols of an executable that has been relocated on the no-MMU system (the dynamic loader on MMU works auto-magically with the ELF file to pull this off), so I have to TELL gdb where the busybox executable has been relocated.
Firstly, the BFLT loader in the kernel prints these values to the console only if the BFLT file has the kernel trace bit set in its flags field (the same field that stores whether the file format is GOTPIC, RAM, etc).  This bit can be enabled by changing the "ktrace" counter in the elf2flt.c file, so I created a patch in Buildroot's package/elf2flt to initialize the variable ktrace to 1 (instead of 0).
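
To check whether a given flat binary actually carries the trace bit, the flags word can be read straight out of the file header.  The sketch below assumes the usual 64-byte bFLT header with the big-endian flags word at byte offset 36, and the flag values as I remember them from the kernel's flat.h (RAM 0x01, GOTPIC 0x02, GZIP 0x04, GZDATA 0x08, KTRACE 0x10); please verify both against your own tree.  flat_flags is a made-up helper name.

```shell
# Print the flags word of a bFLT file and decode the known bits.
# Offsets and bit values are assumptions taken from memory of flat.h.
flat_flags() {
    magic=$(dd if="$1" bs=4 count=1 2>/dev/null)
    if [ "$magic" != "bFLT" ]; then
        echo "$1: not a bFLT file" >&2
        return 1
    fi
    # Bytes 36..39 hold the flags, most significant byte first.
    hex=$(od -An -tx1 -j36 -N4 "$1" | tr -d ' \n')
    flags=$(( 0x$hex ))
    printf 'flags=0x%x' "$flags"
    [ $(( flags & 0x01 )) -ne 0 ] && printf ' RAM'
    [ $(( flags & 0x02 )) -ne 0 ] && printf ' GOTPIC'
    [ $(( flags & 0x04 )) -ne 0 ] && printf ' GZIP'
    [ $(( flags & 0x08 )) -ne 0 ] && printf ' GZDATA'
    [ $(( flags & 0x10 )) -ne 0 ] && printf ' KTRACE'
    printf '\n'
}
```

Running it on the host against a binary under output/target should show KTRACE once the elf2flt patch is in effect.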

Then during kernel startup, I see flat file load addresses, like this example:

[  103.160000] BINFMT_FLAT: Loading file: /bin/sh
ogin[26]: root login on 'ttyS0'
[  103.170000] Mapping is 90600000, Entry point is dd1, data_start is 429a4
[  103.180000] Load /bin/sh: TEXT=90600040-906429a4 DATA=906429c0-906552c4 BSS=906552c4-906591bc

Then I fed those 3 addresses to gdb console, like this example:

add-symbol-file /mnt/work/band/uClinux/BRm4/output/build/busybox-1.24.1/busybox_unstripped.gdb 0x90600040 -s .data 0x906429c0 -s .bss 0x906552c4
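
Typing those addresses by hand gets old quickly, so here is a sketch of a helper that turns the kernel's "Load ...: TEXT=... DATA=... BSS=..." line into the gdb command.  It assumes the log format shown above; gdb_add_symbols is a name I made up, and the ELF path is whatever unstripped .gdb file matches the binary being loaded.

```shell
# Read kernel log lines on stdin and print an add-symbol-file command
# for every BINFMT_FLAT "Load" line found.
# $1: path to the unstripped ELF (.gdb) file with the symbols
gdb_add_symbols() {
    elf=$1
    # Keep only the start address of each of the three ranges.
    sed -n 's/.*TEXT=\([0-9a-f]*\)-.* DATA=\([0-9a-f]*\)-.* BSS=\([0-9a-f]*\)-.*/\1 \2 \3/p' |
    while read -r text data bss; do
        printf 'add-symbol-file %s 0x%s -s .data 0x%s -s .bss 0x%s\n' \
            "$elf" "$text" "$data" "$bss"
    done
}
```

Piping a captured console log through gdb_add_symbols busybox_unstripped.gdb yields one command per loaded flat file, ready to paste into the gdb console.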

The symbols can be removed as well:

remove-symbol-file /mnt/work/band/uClinux/BRm4/output/build/busybox-1.24.1/busybox_unstripped.gdb 