Dec 13, 2014

PXE booting x86 kernel from Ubuntu host

By itself, building my own Linux kernel and root file system only marginally boosted my understanding of the Linux kernel.  But since modifying the kernel features (and, if necessary, the kernel sources) is a good way to understand the Linux kernel, rolling my own Linux distribution with Buildroot is critical.  For an ARM target, my recent blog entries (this, this, and this) documented downloading the Linux kernel from a development host over TFTP, and then exporting the rootfs over NFS.  I was writing this as part of my blog entry on ways to study the Linux kernel, but it got too big, so I am breaking out just the part about building an x86 Buildroot distribution and booting that over PXE.  For an x86 target, I went through the same exercise a few years ago while studying sample loop jitter on Linux, but I documented it in Google Docs, which seems to be less accessible for people than Blogger.  So I am going to repeat the exercise using a Dell Optiplex 755 tower (model DCNE) with a 16x PCIe slot, which packs a Core 2 Duo CPU and Gigabit Ethernet.  As PXE works the same for 32-bit x86 and x86_64 targets, you might see parts of the steps illustrated for my older 32-bit laptop; don't be confused if you see the name change.

The overall interaction between the target and the development host follows the x86 laptop example I went through in my Google Docs blog entry from a few years ago.

Get Buildroot

The required packages (for xconfig--the QT4 GUI; GTK GUI option requires GTK2+ dev packages, which I don't want to bother with) on Ubuntu 14.04 LTS are:
$ sudo apt-get install git build-essential gawk bison flex gettext texinfo libqt4-dev

Get the buildroot source.
$ git clone git://git.buildroot.net/buildroot

And then switch over to the latest stable branch (which is 2014.08 right now)
$ cd buildroot
$ git checkout 2014.08
$ git pull . 2014.08

Configure Buildroot for the target CPU and glibc

Unlike a Zedboard (ARM based) which has a reference Buildroot config in (<BR2>/board/avnet/zedboard), there does not seem to be a checked in config for T40.  But this old blog entry seems like a good place to start.  Fire up xconfig to configure Buildroot.
$ make xconfig

These are the configs I changed from the Buildroot default:
  • Target option: x86_64/core2
  • Build options
    • Enable compiler cache
    • gcc optimization level: O3 (rather than for size)
  • Toolchain
    • glibc 2.19 (to mimic the Ubuntu 14.04 environment, which Cuda supports).
      • But stop and think: why didn't I just use the gcc I got with build-essential package?  Because Buildroot manual says: "We also do not support using the distribution toolchain (i.e. the gcc/binutils/C library installed by your distribution) as the toolchain to build software for the target. This is because your distribution toolchain is not a "pure" toolchain (i.e. only with the C/C++ library), so we cannot import it properly into the Buildroot build environment. So even if you are building a system for a x86 or x86_64 target, you have to generate a cross-compilation toolchain with Buildroot or crosstool-NG."
    • Enable C++ support
    • Build cross gdb for the host
    • Purge unwanted locales
      • Locales to keep: en_US
    • Register toolchain within Eclipse Buildroot plug-in, because I want the Buildroot Eclipse CDT integration.
  • System configuration
    • hostname: o755
    • /dev management: Dynamic using eudev, to more closely match Ubuntu
    • Root password: I changed from the empty (no password) to something else
    • Check "install timezone info"
      • default local time: America/Los_Angeles
    • Path to users tables: /mnt/work/o755/BR2/ROOTFS_USERS_TABLES, which looks like this:
      foo -1 wheel -1 =<password> /home/foo /bin/sh - Foo user
  • Kernel: Enable "Linux Kernel" to show kernel options
    • Kernel version: Local directory: /mnt/work/zed/kernel.  This is the kernel from ADI git tree, for Zedboard.  Why don't I just use the stock kernel that Buildroot points to?  Further down, I discuss the kernel configs necessary for rootfs over NFS.  The typical defconfig used for x86 is "i386" and ia64 is "generic".  My defconfig will have to be used on the "generic", but there is no Buildroot hook to copy a defconfig from elsewhere into the downloaded kernel.  So it's easier just to use a kernel that already has the defconfig I want.
    • Defconfig name: x86_64_nfs (which is in <kernel>/arch/x86/configs; see below for my defconfig)
  • Target packages
    • Debugging profiling and benchmarking
      • strace
      • trace-cmd
    • Networking applications
      • dropbear: do NOT optimize for size (I want speed instead)
      • ethtool: used extensively in Linux Kernel Networking by Rami Rosen.
    • Shell and utilities
      • file
      • logrotate
      • screen
      • sudo
Note that I did NOT configure the bootloader; the PC architecture specifies its own bootloader: it used to be the BIOS, but now the industry has transitioned to UEFI (Unified Extensible Firmware Interface).  From my 

Configure the kernel for NFS rootfs and kgdb

<kernel>/arch/x86/configs/x86_64_defconfig is already quite extensive.  I appended the following configs to create x86_64_nfs_defconfig for the NFS rootfs feature:

CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_ROOT_NFS=y
CONFIG_NFS_USE_KERNEL_DNS=y
# CONFIG_NFSD is not set
CONFIG_NFS_COMMON=y
CONFIG_E1000E=y

Note that I am overriding the CONFIG_NFSD=m set earlier in the file.  CONFIG_E1000E is technically NOT an NFS rootfs config, but the kernel needs the network device driver statically compiled in, to get IP communication before the rootfs is available.
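The "append to the stock defconfig" step can be scripted.  This is only a sketch: make_defconfig is a hypothetical helper name, and nfs.fragment in the usage line is an assumed file holding the extra CONFIG_ lines listed above.  It relies on Kconfig taking the later assignment when a symbol appears twice, which is what lets the fragment override CONFIG_NFSD=m.

```shell
# Hypothetical helper: build a new defconfig from a base defconfig plus a
# fragment of extra CONFIG_ lines.  The fragment is written last so that
# its settings take precedence over the base.
# $1 = base defconfig, $2 = fragment, $3 = output defconfig
make_defconfig() {
    base=$1; fragment=$2; out=$3
    # Refuse to write anything if either input is unreadable.
    [ -r "$base" ] && [ -r "$fragment" ] || return 1
    cat "$base" "$fragment" > "$out"
}
```

From <kernel>/arch/x86/configs, this would be called as: make_defconfig x86_64_defconfig nfs.fragment x86_64_nfs_defconfig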

I added the following configs for the ftrace feature, which I discussed in a past blog entry:

CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
# CONFIG_PSTORE_FTRACE is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set

I also added the following configs for the kgdb feature:

CONFIG_KGDB=y
CONFIG_DEBUG_INFO=y

Optional but recommended (makes sense for me):
  • CONFIG_FRAME_POINTER=y -- save frame information, to allow gdb to more accurately construct stack trace
  • CONFIG_KGDB_SERIAL_CONSOLE=y -- kgdb over Ethernet is not in mainline, so stick with the tried and true.  This also allows kgdbwait and early debugging
  • # CONFIG_DEBUG_RODATA is not set -- This option marks certain regions of the kernel's memory space as RO.  I did NOT have this option to begin with, so I will leave this alone.  If my processor did NOT have HW breakpoint, I should turn this off.  Zynq has 5 breakpoints and 1 watchpoint registers--which I found kgdb cannot use for some reason.
  • # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set -- I have plenty of disk space but the processor is relatively slow.  Besides, I find it confusing if the compiler optimizes some code away when debugging.

Buildroot (everything)

"make" at the BR2 folder builds everything (but I had to debug my configuration mistakes a few times).  After "make" goes through, the compressed kernel image and the tarred rootfs are in output/images:

henry@Zotac64:~/work/o755/buildroot$ ls -gh output/images/
total 25M
-rw-rw-r-- 1 henry 5.4M Nov 28 18:07 bzImage
-rw-rw-r-- 1 henry  19M Nov 28 18:19 rootfs.tar

We need to export the rootfs to /export/root later.

Setting up the PXE DHCP server on the development host

A development host needs a fixed IP address for the target to download the kernel.  How will the host and the target get such fixed IP addresses?  One way is to reserve fixed IP addresses on the DHCP server.

Dorking with the DHCP server: probably not an option for most hobbyists

I like to use DHCP because the target can access the Internet without my having to set up a route on the host.  At home, I have the convenience of setting aside a static IP address through the router's admin console (I use Netgear); my host IP address there is 192.168.1.2.
I used this method in the past in a corporate setting: just tell the IT department, and they will take care of it (They are happy to do it!  They don't like people like me setting up my own router within the company, and it gives them a chance to give you a ticket, and "get paid"--one way or another).  Alas, when using PXE, you need an enhanced version of the DHCP server, which is NOT what the cheapo home router/DHCP server runs.  I am kicking myself because I used to have a Linksys WRT router, on which you can run a Linux network server, but gave that away when I upgraded my house to wireless.

Alternatively, get a 2nd NIC and a cross Ethernet cable

With a PCI or USB Ethernet adapter like the Linksys USB300M (paid $1 PLUS $6 shipping!), and a cross cable, I can have a subnet (consisting of 2 nodes) even without a switch.
The only thing to watch out for is the Linux device driver support.  I bought the USB300M because my previous company used it on Linux.  When I plug this into the host, it shows up in "ifconfig" as eth1.  The interface should be statically configured.
This interface becomes active only when there is another machine on the other end.  Is there a way to change this behavior?
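One way to express that static configuration on Ubuntu 14.04 is a stanza in /etc/network/interfaces.  This is a minimal sketch under two assumptions: that ifupdown (not NetworkManager) manages eth1, and that the host side of the link takes 192.168.2.1 with a /24 netmask, matching the dhcpd.conf further down.  static_iface_stanza is a hypothetical helper name.

```shell
# Hypothetical helper: print an ifupdown stanza for a statically
# configured interface.
# $1 = interface name, $2 = IPv4 address, $3 = netmask
static_iface_stanza() {
    cat <<EOF
auto $1
iface $1 inet static
    address $2
    netmask $3
EOF
}
```

Running static_iface_stanza eth1 192.168.2.1 255.255.255.0 prints the four-line stanza, which can then be appended to /etc/network/interfaces as root.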

The DHCP server is installed on Ubuntu with this command:

$ sudo apt-get install isc-dhcp-server

isc-dhcp-server config

I can add this to the INTERFACES line in /etc/default/isc-dhcp-server:

INTERFACES="eth1"

Note that in the introduction picture, my target also included a VirtualBox machine, which I am no longer using.  The latest version of the isc-dhcp-server does seem to have a problem starting if I specify "allow booting" and "allow bootp" in the isc-dhcp-server file.  I do not understand why, but when I moved those 2 lines to dhcpd.conf, isc-dhcp-server started.

Next, reserve the IP address for all targets I expect to serve over PXE in /etc/dhcp/dhcpd.conf:

subnet 192.168.2.0 netmask 255.255.255.0 {
  range 192.168.2.10 192.168.2.19; # dynamic will not be used
  option routers 192.168.2.1;
  next-server 192.168.2.1; # point targets at the TFTP server (me)
  filename "pxelinux.0";
  allow booting;
  allow bootp;
  option domain-name-servers 192.168.1.1;
  default-lease-time 600;
  max-lease-time 7200;
  ddns-update-style none;

  host o755 {
    hardware ethernet 00:1A:A0:CB:54:FF;
    fixed-address 192.168.2.3;
    option root-path "192.168.2.1:/export/root/o755/" ;
  }
}

When I start the daemon to check for any errors, I see the expected output:

henry@Zotac64:~/work$ sudo dhcpd
Internet Systems Consortium DHCP Server 4.2.4
Copyright 2004-2012 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
WARNING: Host declarations are global.  They are not limited to the scope you declared them in.
Wrote 0 deleted host decls to leases file.
Wrote 0 new dynamic host decls to leases file.
Wrote 0 leases to leases file.
Listening on LPF/eth1/48:f8:b3:45:48:b0/192.168.2.0/24
Sending on   LPF/eth1/48:f8:b3:45:48:b0/192.168.2.0/24
...
Sending on   Socket/fallback/fallback-net

When the target obtains its IP address, it is also told to download the SSBL (2nd stage bootloader) (pxelinux.0) from "next-server" (over TFTP, implicit in the PXE protocol).

Running the ISC DHCP server in Ubuntu service framework

You can always restart the server like this example:

henry@Zotac64:~$ sudo service isc-dhcp-server start
isc-dhcp-server start/running, process 3358

The server will start on bootup automatically, but as discussed here, it errors out because the network is not ready.  Rather than put in a 10 second delay, I will just manually start things for now.

Putting PXE DHCP to a test

Let's take the DHCP server out for a spin.  The target exchanges the DHCP messages with the server, as they appear in /var/log/syslog:

Zotac64 NetworkManager[881]: <info> (eth1): carrier now ON (device state 100)
Zotac64 kernel: [ 1937.698818] asix 2-6:1.0 eth1: link up, 100Mbps, full-duplex, lpa 0xC1E1
Zotac64 dhcpd: DHCPDISCOVER from 00:1a:a0:cb:54:ff via eth1
Zotac64 dhcpd: DHCPOFFER on 192.168.2.3 to 00:1a:a0:cb:54:ff via eth1
Zotac64 dhcpd: DHCPREQUEST for 192.168.2.3 (192.168.2.1) from 00:1a:a0:cb:54:ff via eth1
Zotac64 dhcpd: DHCPACK on 192.168.2.3 to 00:1a:a0:cb:54:ff via eth1
Zotac64 in.tftpd[3446]: tftp: client does not accept options

As expected, the target cannot locate the PXE configuration file (because I did NOT create one yet!), and boot fails.

Installing (and configuring) the TFTP server on the development host

Install the tftp server with this command:

sudo apt-get install tftpd-hpa

My only addition to the default TFTP config (/etc/default/tftpd-hpa) is this line (if I  leave IPv6 running; otherwise, I have to change the address to bind to, and specify --ipv4 to the server option):

RUN_DAEMON="yes"

Then restart the TFTP daemon:

sudo service tftpd-hpa start

The SSBL (pxelinux.0) and the associated menu (menu.c32) should be COPIED (rather than soft-linked) from the /usr/lib/syslinux folder into the /var/lib/tftpboot folder.  Then create the pxelinux.cfg folder, containing a single file called default for now:

henry@Zotac64:/var/lib/tftpboot$ cat pxelinux.cfg/default
DEFAULT menu.c32
PROMPT 0
MENU TITLE PXE Boot Menu
TIMEOUT 50 #This means 5 seconds apparently
LABEL buildroot
 MENU LABEL buildroot kernel
 kernel o755Image
 append ip=dhcp

The last line appears to be the kernel arg.
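The copy-and-create steps above can be collected into one small function.  This is a sketch, not the exact commands I ran: stage_pxe is a hypothetical name, and the two directories are passed in rather than hard-coded so the function can be tried on a scratch directory first.

```shell
# Hypothetical helper: populate a TFTP root with the pxelinux files.
# $1 = syslinux directory holding pxelinux.0 and menu.c32
# $2 = TFTP root directory
stage_pxe() {
    src=$1; root=$2
    # pxelinux looks for its config files under pxelinux.cfg/
    mkdir -p "$root/pxelinux.cfg" || return 1
    # Copy (do not symlink) the bootloader and the menu module.
    cp "$src/pxelinux.0" "$src/menu.c32" "$root/" || return 1
}
```

With the paths from this entry, that would be stage_pxe /usr/lib/syslinux /var/lib/tftpboot, run as root; the default file still has to be written into pxelinux.cfg by hand.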

Sanity test tftp

Install the tftp client:

sudo apt-get install tftp-hpa

Then try to get the file, first on the server itself

tftp localhost -v -m binary -c get pxelinux.0

If you don't receive this file (perhaps due to some corruption while installing something else), removing, and then reinstalling the tftpd-hpa may get you out of the jam (as it did for me).

Kernel downloads and boots

With this, the kernel (~/work/o755/buildroot/output/images/bzImage) I copied into /var/lib/tftpboot downloads and boots!  But since the kernel does not know to use NFS rootfs, it panics:

...
---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(2,0)

As in the ARM case, the kernel needs to be told to run NFS rootfs through the bootarg.

Serial console to the target

But if the kernel does not boot (hint: expect problems every step of the way), having a serial console is necessary for debugging.  I connected the target and host PC with a 9-pin null-modem (and a USB-serial cable), so that the target's /dev/ttyS0 (on the target side) appears as /dev/ttyUSB0 on the development host.

NFS rootfs kernel arg

In the pxelinux menu above, I coded the kernel image and boot arg in the "default" config.  Let's make things a bit more interesting: specify the kernel image and options tailored to a target.  As explained here, the default config is the LAST config that pxelinux will check.  The first is the PXE GUID (printed on the console), and the 2nd is the (LOWER case, HYPHENATED) MAC address prefixed with the ARP hardware type (01 for Ethernet), which for this target is 01-00-1a-a0-cb-54-ff.  Another way is to tie the configuration to the target's IP address, which is 192.168.2.3 in this case, translating to C0A80203 (uppercase hex notation) as the filename in pxelinux.cfg, which now looks like this:

DEFAULT menu.c32
PROMPT 0
MENU TITLE PXE Boot Menu
TIMEOUT 20 #This means 2 seconds apparently
LABEL buildroot
 MENU LABEL buildroot kernel
 kernel o755Image
 append ip=dhcp root=/dev/nfs nfsroot=192.168.2.1:/export/root/o755,nfsvers=3 rw earlyprintk
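The per-target file names can be derived rather than worked out by hand.  A small sketch follows; both function names are hypothetical.  The IP form is the four octets printed as uppercase hex, and the MAC form is lower case, hyphenated, and prefixed by pxelinux with the ARP hardware type (01 for Ethernet).

```shell
# Hypothetical helper: dotted-quad IPv4 address -> pxelinux.cfg file name.
ip_to_pxe_name() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into four octets
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: colon-separated MAC -> pxelinux.cfg file name.
mac_to_pxe_name() {
    # Lower-case the hex digits, turn ":" into "-", and prefix "01-".
    printf '01-%s\n' "$(printf '%s' "$1" | tr 'A-F:' 'a-f-')"
}
```

For this target, ip_to_pxe_name 192.168.2.3 prints C0A80203 and mac_to_pxe_name 00:1A:A0:CB:54:FF prints 01-00-1a-a0-cb-54-ff.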

During the 2 second wait, you can halt the boot and review the argument that PXE picked up.  Recall that Buildroot built rootfs, but we did not yet export it to NFS, so the kernel will still panic when trying to mount the rootfs.  So let's do this now.

$ sudo mkdir /export/root/o755
$ sudo tar -C /export/root/o755/ -xf ~/work/o755/buildroot/output/images/rootfs.tar
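The extracted tree also needs an entry in /etc/exports before anything can mount it.  The sketch below only prints such an entry; nfsroot_export_line is a hypothetical helper, and the option set other than no_root_squash (rw, sync, no_subtree_check) is my assumption of a reasonable default, not something copied from my actual file.

```shell
# Hypothetical helper: print one /etc/exports entry for an NFS root.
# $1 = exported directory, $2 = client network allowed to mount it
nfsroot_export_line() {
    # no_root_squash lets the target's root user own files on the export.
    printf '%s %s(rw,sync,no_root_squash,no_subtree_check)\n' "$1" "$2"
}
```

For example, nfsroot_export_line /export/root/o755 192.168.2.0/22 prints a line that can be appended to /etc/exports as root, followed by "sudo exportfs -ra" to reload the export table.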

The NFS server export option must specify the "no_root_squash" option as explained in my earlier blog entry.  As a sanity check, I could mount this on the same machine with a mount command:

$ mkdir ~/o755
$ sudo mount 192.168.2.1:/export/root/o755 ~/o755

Don't use localhost, because /etc/exports only allows the 192.168.2.0/22 mask.  Alas, the target was still not initiating the NFS mount.  An ARM target (Zedboard) can mount NFS just fine (a folder right next to o755 using the same set of kernel arguments), and shows these printk messages:

IP-Config: Got DHCP answer from 192.168.1.1, my address is 192.168.1.9
...
VFS: Mounted root (nfs filesystem) on device 0:13.

For the NFS mount to succeed, the kernel has to complete the DHCP exchange, so the question is: why doesn't the o755 target attempt DHCP?  When the target successfully boots Ubuntu 14.04 LTS (and can respond over the network), it probes e1000e successfully, as you can see from this dmesg output:

[    2.184242] e1000e: Intel(R) PRO/1000 Network Driver - 2.3.2-k
...
[    2.558382] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[    2.570813] e1000e 0000:00:19.0 eth0: MAC: 7, PHY: 6, PBA No: 1041FF-0FF

I then checked the kernel config: CONFIG_E1000E was missing in x86_64, so I pulled that into my defconfig based on x86_64.  That got me past DHCP, but I still did not see

VFS: Mounted root (nfs filesystem) on device 0:13.

Instead, I see:

VFS: Unable to mount root fs via NFS, trying floppy
VFS: Cannot open root device "nfs" or unknown-block(2,0): error -6

In /var/log/syslog, there is no log after a successful DHCP exchange, and there is no error message between DHCP and the VFS failure above.  What is actually failing is sys_mount(), called from do_mount_root().  After debugging with kgdb (see the appendix below), I realized that I did not know how to specify the TCP transport in the nfsroot kernel argument.  Yes, I tried appending ",tcp" to the nfsroot argument, but the problem seems to be with nfsvers=4 rather than the transport, because nfsvers=3 worked with either TCP or UDP.  So I changed the nfsroot option's nfsvers argument back to 3 as a workaround, and voila!

Welcome to o755
o755 login: 

Appendix: Debug the NFS mounting problem with kgdbwait

Unfortunately, the nfs file handle is still not being found.  To debug, I thought I would give the kgdb a shot (because unlike on ARM, I don't have a JTAG debugger for this PC).  Earlier, I already added the Kernel configs necessary for kgdb, so I just have to tell the kernel in the kernel arg, by APPENDING the following to the "append" line in the PXE menu:

kgdboc=ttyS0,115200 kgdbwait

On the VGA screen, I see that the kernel has setup ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) and ttyS1 at I/O 0xec98 (irq = 17, base_baud = 115200), right before blocking on connection from remote gdb, so I dutifully oblige by firing up the cross gdb (the host gdb probably works fine, but since I went through all the trouble of building the cross gdb, why not use it?):

henry@Zotac64:~/work/o755/buildroot$ output/host/usr/bin/x86_64-linux-gdb output/build/linux-custom/vmlinux

I am not sure whether I got the serial port right (ttyS0) on the target side, but on the host, there is no doubt about the tty device because I am using USB-serial adapter, and I see only 1 /dev/ttyUSB0. On Ubuntu, this device is only writable by the group "dialout", so I have to add myself to that group.

$ sudo apt-get remove modemmanager
$ sudo adduser henry  dialout
$ sudo chmod a+rw /dev/ttyUSB0

Then I can connect to the target within gdb:

(gdb) set serial baud 115200
(gdb) target remote /dev/ttyUSB0
Remote debugging using /dev/ttyUSB0
kgdb_breakpoint () at kernel/debug/debug_core.c:1051
1051 wmb(); /* Sync point after breakpoint */

(gdb) hb nfs_validate_text_mount_data
(gdb) c

I wonder whether I am really using the HW breakpoint on this target.  I traced the failure to fs/nfs/super.c:nfs_fs_mount(), due to error -22, or EINVAL, because of the transport protocol disagreement:

nfs_validate_transport_protocol(args);
if (args->nfs_server.protocol == XPRT_TRANSPORT_UDP)
        goto out_invalid_transport_udp;

When I stared at the code for a while, I realized that Linux NFSv4 wants to use TCP as the transport.