Dec 23, 2017

Xenomai on Raspberry Pi

I am still trying to get enough free time to work on AMP (asymmetric multiprocessing) on RPi2 or RPi3, but until that works, I want to use Xenomai as a fallback.  I discussed the real-time capabilities of the Xenomai kernel patch several years ago.  I was on x86 at that time, and Xenomai has since moved to version 3, so I will see whether the new Xenomai Cobalt architecture does better or worse on the ARM kernel than Xenomai did when I tested it on x86 five years ago.

Building the Xenomai kernel

After ensuring that I could build an RPi kernel that still boots the Raspbian distribution I downloaded (I expected this to work, since I did not change the kernel config significantly except for CONFIG_DEBUG_INFO), I started changing the kernel config for Xenomai.  On x86, I built the kernel and a lightweight rootfs with Buildroot.  This time, I don't need the rootfs, but it's still easier to have Buildroot build Xenomai.

Cross-compiling the RPi kernel with Buildroot

Before building the Xenomai kernel, I practiced building a regular kernel that can still boot the downloadable Raspbian image, following the steps outlined in the RPi kernel build documentation.  One tip is to NOT use the toolchain recommended on the webpage; gcc version 4.8 has a bug that errors out during a kernel module build later.  I found that the other toolchain, also available from the RPi tools repo, works:

export PATH=$HOME/rpi/tools/arm-bcm2708/arm-rpi-4.9.3-linux-gnueabihf/bin:$PATH
export CROSS_COMPILE=arm-linux-gnueabihf-

To build a kernel that is just like the current kernel, I copy the kernel config from the known good kernel.  This method has the advantage that you can keep reusing the kernel modules that came with your install of Raspbian.
pi@raspberrypi$ sudo modprobe configs
pi@raspberrypi$ gunzip -c /proc/config.gz > ~/.config

Once I copy the .config file above to the Linux repo on the development host, I can bootstrap the Linux repo (rpi-4.9.y branch, to match the Raspbian image downloaded from the Raspberry site) by applying the saved .config file, like this (the make targets are as recommended by the above RPi document):

parallels@ubuntu:~/rpi/linux$ cp ../BAK/.config .
parallels@ubuntu:~/rpi/linux$ ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- make oldconfig

Save this config as a defconfig by running this command:

parallels@ubuntu:~/rpi/linux$ ARCH=arm make savedefconfig
parallels@ubuntu:~/rpi/linux$ mv defconfig ../BAK/rpi_linux_defconfig

To build just the kernel (no need to build the kernel modules if the config has not changed), just the zImage target is sufficient.

ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- make zImage

The resulting binary arch/arm/boot/zImage can be copied to the SD card, and the kernel line in config.txt can point to it.

Patching the RPi kernel with Xenomai Cobalt co-kernel

Before patching the currently working Linux kernel, I copy the kernel tree to another folder: ~/rpi/linux.xeno.

The Xenomai kernel patch is available as an I-pipe patch from the Xenomai download area.  Since my RPi kernel version is 4.9.41, I picked the patch that is as close to it as possible: ipipe-core-4.9.38-arm-3.patch

I then ran the script that Xenomai guys refer to in their install guide:

parallels@ubuntu:~/rpi/xenomai-3.0.6$ scripts/ --linux=../linux.xeno --ipipe ../ipipe-core-4.9.38-arm-3.patch --arch=arm

Unfortunately, this fails because of the differences between the mainline kernel version 4.9.38 and the RPi kernel 4.9.41, despite the two being only 3 patch versions apart.  I worked around the problem by:
  1. deleting the patches that fail from the *.patch file
  2. applying the reduced patch
  3. manually applying the patch to the files that failed the automatic patch
  4. saving the resulting diff as a new patch file (git diff > ../ipipe-core-4.9.41-arm-3.patch)

Configuring the kernel for Cobalt

On recommendation from the Xenomai folks, I disabled several kernel features, notably power management and CPU frequency adjustment.

Build the kernel

Given the massive changes to the kernel, I can no longer get away with just building the kernel proper: I have to also build the kernel modules.

parallels@ubuntu:~/rpi/linux.xeno$ ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- make zImage modules

The built modules must be copied to the SD card's ext4 partition, by pointing INSTALL_MOD_PATH at wherever the ext4 partition is mounted on your dev host (when the SD card is inserted into the dev host).

sudo make ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- INSTALL_MOD_PATH=<mount point> modules_install

The new kernel is still named zImage, and is copied to the boot partition of the SD card.  The kernel line in the RPi bootloader's config.txt points at this zImage.


There is a tense moment when the modified SD card is inserted into the RPi and I turn on the RPi.  I seem to have been lucky: this kernel boots up with the Raspbian image, and the Xenomai co-kernel registered itself during the kernel bootup:

pi@raspberrypi:~ $ dmesg | grep -Ei 'xeno|pipe'
[    0.000000] I-pipe, 19.200 MHz clocksource, wrap in 960767920505705 ms
[    0.000000] clocksource: ipipe_tsc: mask: 0xffffffffffffffff max_cycles: 0x46d987e47, max_idle_ns: 440795202767 ns
[    0.001474] Interrupt pipeline (release #3)
[    0.172468] clocksource: Switched to clocksource ipipe_tsc
[    0.278993] [Xenomai] scheduling class idle registered.
[    0.279039] [Xenomai] scheduling class rt registered.
[    0.279289] I-pipe: head domain Xenomai registered.
[    0.284432] [Xenomai] Cobalt v3.0.6 (Stellar Parallax)

Since Xenomai is a way to run real-time code in userspace, it makes sense that I need to build userspace libs.

Building the Xenomai userspace

In the same scripts/ folder where I ran the above, there is a "bootstrap" script, which generates the "configure" script.

parallels@ubuntu:~/rpi/xenomai-3.0.6$ scripts/bootstrap
parallels@ubuntu:~/rpi/xenomai-3.0.6$ ./configure --with-core=cobalt --enable-smp --host=arm-linux-gnueabihf CFLAGS="-mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard" LDFLAGS="-mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard"

I tried -mcpu=cortex-a53, but that seemed to be incompatible with the --enable-smp option.  Maybe gcc-4.9 doesn't know about cortex-a53.  The --host argument should be the same as the CROSS_COMPILE environment variable, but without the trailing dash.  This is consistent with the target you can see when you run gcc natively on the Raspberry Pi (gcc -v).  I suspect that I could use a more modern version of the FPU, but Buildroot seems to use VFPv4 as well (as of Buildroot version 2017.10), so I am not going to be aggressive about it.  Keep the build output in a separate folder:

parallels@ubuntu:~/rpi/xenomai-3.0.6$ make DESTDIR=~/build install

The build output will then be neatly packaged in ~/build folder.

parallels@ubuntu:~/rpi/xenomai-3.0.6$ ls ~/build
dev  usr

Deploying the Xenomai userspace

I created 3 separate tarballs in this folder (are the headers and the lib tarballs actually used?):

$ cd ~/build
$ tar czf xeno-userspace.tar.gz *
$ cd usr/xenomai/include/
$ tar czf ../../../xeno-headers.tar.gz *
$ cd ../lib

$ tar czf ../../../xeno-libs.tar.gz *

The Xenomai userspace tarball has to be extracted to the root folder /:

pi@raspberrypi:~ $ cd /
pi@raspberrypi:/ $ sudo tar xzf ~/xeno-userspace.tar.gz

This will put the Xenomai device mount points in /dev folder, and the userspace headers and libs in /usr/xenomai folder.  The lib folder has to be appended to the /etc/

pi@raspberrypi:/ $ echo "/usr/xenomai/lib/" | sudo tee /etc/
pi@raspberrypi:/ $ sudo ldconfig -v
pi@raspberrypi:/ $ sync
pi@raspberrypi:/ $ sudo reboot

ldconfig updates the dynamic linker's bindings to include the Xenomai shared libraries.  Running sync to flush to the disk is overkill in my opinion, but I thought it a harmless neat trick.

Testing the Xenomai userspace

The latency test exercises the Xenomai libs:

pi@raspberrypi:/ $ sudo /usr/xenomai/bin/latency

Note that running Xenomai requires sudo privilege because the Xenomai devices are accessible only to the root user by default (I ignored the build/config option to expose the Xenomai device to a designated group, because I think I have to call mlockall as root anyway).  Here are my typical results (tabulated every second) on RPi3:

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      1.310|      2.229|      8.602|       0|     0|     -1.166|     32.842
RTD|      1.252|      2.234|     10.471|       0|     0|     -1.166|     32.842

I stressed the system by scp'ing a 9 GB file (loading the Ethernet, which is really USB, plus the filesystem and the SD card driver), and by installing beefy APT packages like emacs.  You can see that the worst case latency over the whole run is more than 2 orders of magnitude greater than the average latency.

RTT|  00:29:04  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      0.793|      2.277|     14.126|       0|     0|     -1.961|    418.934
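
To check the "2 orders of magnitude" claim without squinting at the columns, here is a minimal Python sketch that splits one RTD row into named fields and computes the worst/average ratio.  The field names are my own labels for the RTH header columns, and I am assuming the values are in microseconds.

```python
def parse_rtd(line):
    """Split one 'RTD|' row of the latency output into named floats."""
    # My own labels for the RTH header columns (assumed unit: microseconds)
    names = ["lat_min", "lat_avg", "lat_max", "overrun", "msw",
             "lat_best", "lat_worst"]
    fields = line.split("|")
    if fields[0].strip() != "RTD":
        raise ValueError("not an RTD row: %r" % line)
    return dict(zip(names, (float(f) for f in fields[1:])))

# The stressed-system row from above
row = parse_rtd("RTD|      0.793|      2.277|     14.126|       0|     0|     -1.961|    418.934")
ratio = row["lat_worst"] / row["lat_avg"]
print("worst/avg = %.0f" % ratio)
```

The same function can be fed every RTD line of an overnight log to track how the worst case grows over time.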

Therefore, it appears that while Xenomai is certainly much better than stock Linux, it still shouldn't be trusted to run hardware loops faster than 100~200 Hz (using my rule of thumb that the jitter should be < 1% of the sampling period).
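
The rule of thumb can be turned into a number.  This is only a sketch of the arithmetic, assuming the jitter is taken to be the worst case latency reported by the latency test; the function name and the choice of which worst case to plug in are mine.

```python
def max_loop_rate_hz(jitter_us, jitter_fraction=0.01):
    """Fastest loop rate for which jitter_us stays below jitter_fraction of the period."""
    min_period_s = (jitter_us * 1e-6) / jitter_fraction
    return 1.0 / min_period_s

# Worst case latencies from the two tables above (typical load vs. stressed)
for label, worst_us in [("typical", 32.842), ("stressed", 418.934)]:
    print("%s: %.0f Hz" % (label, max_loop_rate_hz(worst_us)))
```

By this arithmetic the 100~200 Hz figure sits between the typical-load and the stressed cases, so which worst case to design against remains a judgment call.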

Show stopper: random crash after a few hours

I noticed system freezes when I left the latency test running overnight.

Dec 20 17:40:49 raspberrypi kernel: [ 9091.384209] brcmfmac: brcmf_sdio_readframes: RXHEADER FAILED: -110
Dec 20 17:40:49 raspberrypi kernel: [ 9091.384232] brcmfmac: brcmf_sdio_rxfail: abort command, terminate frame, send NAK

I disabled wireless altogether in /boot/config.txt (wireless is a bad idea anyway for robust operation in a real-world noisy environment).


These steps do not seem to solve the problem.  I still think it has to do with some device drivers expecting power management or CPU frequency adjustment, which I turned off at the kernel level above, but the root cause is unclear.

Nov 25, 2017

"Bare metal" control of the New Haven OLED

Many embedded devices have no display, and make do with status LEDs (think about your home router or switch).  One step up is a small display panel, like the New Haven OLED display I bought a few years ago to study the Linux frame buffer device drivers.  Originally, I was going to experiment on my Zedboard (which packs a Xilinx Zynq SoC), but since then I've embraced the Raspberry Pi project.  So in this blog entry, I create a status display GUI on my NHD-1.27-12896UGC3, which comes with its own display controller, the SSD1351.  When complete, the RPi Linux driven terminal looks like this if the soldering and connections were OK.
RPi console output on NHD-1.27-12896UGC3.  Note the crisp color and the deep black.

RPi Linux supports SSD1351 out of the box

The Raspberry Pi Linux kernel (which can be cloned from the project's repository) already has the matching kernel module for the display driver, which you can verify for yourself by running the following commands on your Raspberry Pi:

pi@raspberrypi$ sudo modprobe configs
pi@raspberrypi$ gunzip -c /proc/config.gz > ~/BAK/.config 

You can see SSD1351 support in the resulting file:


This define pulls in fb_ssd1351, which is one of the fbtft (TFT frame buffer) devices enumerated in fbtft_device.c: the 2 SSD1351 devices enumerated there are pioled and freetronicsoled128, neither of which is the NHD 1.27 128x96 device I have.  They are however driven similarly: 20 MHz SPI in Mode 0 (of the 4 SPI modes available; in mode 0, the slave samples the data line on the rising edge of the clock).  One puzzle is how pioled can drive a display only 32 pixels tall when fb_ssd1351.c hard codes the initial height to 128, but let's see whether freetronicsoled128 can handle the "height=96" modprobe argument.  It looks like the probe code (fbtft_probe_dt) handles a whole bunch of options:

pdata->display.width = fbtft_of_value(node, "width");
pdata->display.height = fbtft_of_value(node, "height");
pdata->display.regwidth = fbtft_of_value(node, "regwidth");
pdata->display.buswidth = fbtft_of_value(node, "buswidth");
pdata->display.backlight = fbtft_of_value(node, "backlight");
pdata->display.bpp = fbtft_of_value(node, "bpp");
pdata->display.debug = fbtft_of_value(node, "debug");
pdata->rotate = fbtft_of_value(node, "rotate");
pdata->bgr = of_property_read_bool(node, "bgr");
pdata->fps = fbtft_of_value(node, "fps");
pdata->txbuflen = fbtft_of_value(node, "txbuflen");
pdata->startbyte = fbtft_of_value(node, "startbyte");
of_property_read_string(node, "gamma", (const char **)&pdata->gamma);

if (of_find_property(node, "led-gpios", NULL))
    pdata->display.backlight = 1;

The default module properties in the kernel code for this device bear remembering:

.name = "freetronicsoled128",
.spi = &(struct spi_board_info) {
    .modalias = "fb_ssd1351",
    .max_speed_hz = 20000000,
    .mode = SPI_MODE_0,
    .platform_data = &(struct fbtft_platform_data) {
        .display = {
            .buswidth = 8,
        },
        .bgr = true,
        .gpios = (const struct fbtft_gpio []) {
            { "reset", 24 },
            { "dc", 25 },
            {},
        },
    }
}

The reset and dc pins above are the D/C# and RES# pins defined in the SSD1351 controller interface table shown below:
Since it mentions the DC# pin explicitly (rather than being tied low as for the 3-wire interface), the device driver is expecting to use the 4-wire SPI interface above--through RPi GPIO pin 25.  The kernel config did not set CONFIG_FBTFT_ONBOARD_BACKLIGHT because the device doesn't need backlight (it's an OLED!).  NHD-1.27-12896UGC3 data sheet shows the recommended wiring for the 4-wire SPI mode as follows:
Including the 3.3 V power and ground, only 7 wires connect the display module to RPi GPIO header:
  • D/C: RPi P1.22, GPIO.25
  • SCLK: RPi P1.23 (AKA SCLK)
  • SDIN: RPi P1.19 (AKA MOSI)
  • /RES: RPi P1.18, GPIO.24
  • /CS: RPi P1.24 (AKA CE0), GPIO.8
When all soldered and wired, the connection looks like this (ignore the logic analyzer probes on the display pins).

To load the kernel module, I supply the display height (which is different from the default 128 pixels) as a module argument like this:

sudo modprobe fbtft_device name=freetronicsoled128 height=96

But according to the kernel log, the height argument was ignored:

Nov 25 17:22:38 hchoi2-RPi1B kernel: [ 1709.489760] graphics fb1: fb_ssd1351 frame buffer, 128x128, 32 KiB video memory, 4 KiB DMA buffer memory, fps=20, spi0.0 at 20 MHz

Anyhow, the module load succeeded and I now have another frame buffer device (in addition to the default HDMI out):

pi@hchoi2-RPi1B:~ $ ls /dev/fb
fb0  fb1  

I can then use the 2nd frame buffer as the console output:

pi@hchoi2-RPi1B:~ $ con2fbmap 1 1

The console can be redirected back to the default frame buffer (fb0) by changing the last 1 in the above command to 0.

Low level control of SSD1351 on Arduino Uno

Linux FB framework is powerful but requires a lot of code, which does not fit on most deeply embedded targets.  The vendor (New Haven Display) put out a "bare metal" example on GitHub for controlling the device from Arduino Uno.  This is an easier way to understand the low level control than wading through the many layers of the Linux FB driver code.  The following is my annotation of the example Arduino code.

Low level primitive

The supplied example shows 3 different methods of sending a command to the device: 2 parallel interfaces and the 4-wire SPI.  I am only interested in the serial interface (I cannot dedicate that many pins just for the display!) so I will ignore the parallel interfaces going forward.  The chip expects each byte MSb (most significant bit) first, and the Arduino bit-bangs each bit on its GPIO pin while holding CS (chip select) and D/C# low for the whole duration of the 8 bits.

Writing 1 B of data over the serial interface is exactly the same, except for holding D/C# high while writing.
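
The command and data writes described above can be sketched in Python.  This is not the vendor's Arduino code: set_pin is a hypothetical callback standing in for a GPIO write, and the pin labels "DC", "CS", "SCLK" and "SDIN" are just names I picked for the sketch.

```python
LOW, HIGH = 0, 1

def shift_out_msb_first(set_pin, byte, is_data):
    """Bit-bang one byte, MSb first, on a 4-wire SPI bus (mode 0)."""
    set_pin("DC", HIGH if is_data else LOW)   # D/C#: low = command, high = data
    set_pin("CS", LOW)                        # keep the chip selected for all 8 bits
    for bit in range(7, -1, -1):              # bit 7 goes out first
        set_pin("SCLK", LOW)
        set_pin("SDIN", (byte >> bit) & 1)    # present the bit while the clock is low
        set_pin("SCLK", HIGH)                 # the controller samples on the rising edge
    set_pin("CS", HIGH)

def write_command(set_pin, byte):
    shift_out_msb_first(set_pin, byte, is_data=False)

def write_data(set_pin, byte):
    shift_out_msb_first(set_pin, byte, is_data=True)

# Record the pin activity instead of driving real hardware
trace = []
write_command(lambda pin, level: trace.append((pin, level)), 0xAF)
bits = [level for pin, level in trace if pin == "SDIN"]
print(bits)
```

Passing the pin writer in as a callback keeps the sketch testable on a desktop; on real hardware it would wrap whatever GPIO library is at hand.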


The initialization sequence in the example is:
  1. Chip reset: pull the RES# pin low for 500 usec, pull it high again, and then wait for at least 500 usec.
  2. Unlock command: write 0x12 and then 0xB1 to the command lock register (0xFD)
  3. Sleep mode on (display off): send command 0xAE (no data)
  4. Set the clock divider field (low nibble) to 1 and the oscillator frequency field (high nibble) to 0xF: write 0xF1 to 0xB3.  Writing to this register requires command unlocking (step #2).
  5. Set mux ratio
  6. Set display offset and start
  7. Set color depth to 18-bit (262k color), 16-bit format 2.
  8. GPIO input disabled
  9. Enable internal Vdd regulator
  10. Choose external VSL
  11. Set contrast current for the 3 colors (slightly different than the default: 0x8A, 0x70, 0x8A)
  12. Reset output currents for all colors
  13. Enhance display performance
  14. ...
  15. Sleep mode off (display on): send command 0xAF (no data)
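
The steps above lend themselves to a table-driven sketch.  The Python below only covers the steps whose register values are spelled out in the list; the hardware reset (step 1) and steps 5 to 14 are left out rather than guessed.  I am also assuming that the two unlock values are each sent as the data byte of a separate 0xFD command, and write_command / write_data are hypothetical callbacks for the low level primitives.

```python
# (description, command byte, data bytes); only values given in the list above
INIT_STEPS = [
    ("unlock: first key",                    0xFD, [0x12]),
    ("unlock: second key",                   0xFD, [0xB1]),
    ("sleep mode on (display off)",          0xAE, []),
    ("clock divider / oscillator frequency", 0xB3, [0xF1]),
    # steps 5..14 of the list would go here
    ("sleep mode off (display on)",          0xAF, []),
]

def run_init(write_command, write_data, steps=INIT_STEPS):
    """Send each command byte followed by its data bytes, in order."""
    for _description, command, data in steps:
        write_command(command)
        for byte in data:
            write_data(byte)

# Dry run that records what would go out on the bus
sent = []
run_init(lambda b: sent.append(("cmd", b)), lambda b: sent.append(("data", b)))
for kind, byte in sent:
    print("%-4s 0x%02X" % (kind, byte))
```

Keeping the sequence as data makes it easy to diff against the data sheet, and to reuse on another target by swapping the two callbacks.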

Blank out the entire screen to black

Blanking out the screen to any color just means writing the same (whatever) color to every pixel.  It consists of setup and data stage:
  1. Set column start and end to 0 and 127, respectively
  2. Set row start and end to 0 and 95, respectively
  3. Start write to RAM: write the destination register address (0x5C)
  4. For the next 128x96 pixels, write the given pixel value (RGB) as SPI data.  For the 262k color over 8-bit serial interface, the data format is given in Table 8-8 of the SSD1351 data sheet.  If I don't check for saturation, it's convenient to keep the colors as separate bytes, and output 8-bits for each color in rapid succession.
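
The setup and data stages can be sketched as follows.  Only the write-RAM command (0x5C) comes from the steps above; the set-column (0x15) and set-row (0x75) command bytes are from my reading of the SSD1351 data sheet, so treat them as assumptions, as is the order of the 3 color bytes, which depends on the color remap setting.  write_command and write_data are hypothetical callbacks.

```python
WIDTH, HEIGHT = 128, 96
SET_COLUMN = 0x15   # assumed from the SSD1351 data sheet
SET_ROW    = 0x75   # assumed from the SSD1351 data sheet
WRITE_RAM  = 0x5C   # given in step 3 above

def fill_screen(write_command, write_data, first, second, third):
    """Write the same 3-byte color to every pixel of the 128x96 window."""
    write_command(SET_COLUMN)
    write_data(0)
    write_data(WIDTH - 1)          # columns 0..127
    write_command(SET_ROW)
    write_data(0)
    write_data(HEIGHT - 1)         # rows 0..95
    write_command(WRITE_RAM)
    for _ in range(WIDTH * HEIGHT):
        # one byte per color component, in rapid succession
        write_data(first)
        write_data(second)
        write_data(third)

commands, data = [], []
fill_screen(commands.append, data.append, 0, 0, 0)   # black
print(len(commands), len(data))
```

At 3 bytes per pixel, a full-screen fill is about 36 KB of SPI traffic, which is why the SPI clock speed matters for the refresh rate.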

Print a fixed font letter

If I emit a different color for a pixel than the background color, I can show a dot at a given point.  If I arrange a group of neighboring pixels in a pre-arranged way, that is a symbol that can be shown at offset (x, y) on the screen.  If I then hold a read-only bitmap representing a letter, it is possible to print one letter at a time on the screen, by testing each bit of the bitmap as the pixel position moves to the right.  Here's an example of the letter 'A' in 10-point font:

const unsigned char A10pt [] = { // 'A' (11 pixels wide)
0x0E, 0x00, //     ###    
0x0F, 0x00, //     ####   
0x1B, 0x00, //    ## ##   
0x1B, 0x00, //    ## ##   
0x13, 0x80, //    #  ###  
0x31, 0x80, //   ##   ##  
0x3F, 0xC0, //   ######## 
0x7F, 0xC0, //  ######### 
0x60, 0xC0, //  ##     ## 
0x60, 0xE0, //  ##     ###
0xE0, 0xE0, // ###     ###
};

Note that this "10 point" font is actually 11 pixels tall, and each letter takes up 13 pixels of width on the screen (the 11-pixel-wide glyph plus spacing).  A for-loop to print this letter at position x and y on the screen is:

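   // NOTE: OLED_WriteMemoryStart_12896RGB and OLED_Pixel_12896RGB are assumed helper names
   index = 0;
   mask = 0x80;
   for (i = 0; i < 11; i++) {   // display custom character A, one glyph row per pass
        OLED_SetColumnAddress_12896RGB(x, 0x7F);
        OLED_SetRowAddress_12896RGB(y, 0x5F);
        OLED_WriteMemoryStart_12896RGB();            // start RAM write (0x5C)
        for (count = 0; count < 8; count++) {        // left byte of the row
            if ((A10pt[index] & mask) == mask)
                OLED_Pixel_12896RGB(textColor);
            else
                OLED_Pixel_12896RGB(backgroundColor);
            mask = mask >> 1;
        }
        index++;
        mask = 0x80;
        for (count = 0; count < 8; count++) {        // right byte of the row
            if ((A10pt[index] & mask) == mask)
                OLED_Pixel_12896RGB(textColor);
            else
                OLED_Pixel_12896RGB(backgroundColor);
            mask = mask >> 1;
        }
        index++;
        mask = 0x80;
        y++;   // next glyph row (the direction depends on the panel's row remap setting)
   }
   x += 13;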

This implementation is intimately tied to the font representation above (each row of the font consists of 2 B, and the pixel width and height are hard coded).  But note that a few of the hard coded parameters can be parametrized: the letter position (x, y), the letter itself, and the foreground color (and possibly the background color).  The loop can then be refactored into a common function that looks up the letter in a table, like the ASCII table:

void OLED_Text_12896RGB(unsigned char x_pos, unsigned char y_pos, unsigned char letter, unsigned long textColor, unsigned long backgroundColor);

This strategy is slow but functional.  Each byte write can be grouped together into a long sequence of bytes:
  • The nRS pin can be held low the whole time (i.e. avoid the repeated function calls)
  • The SPI write can be accelerated over DMA if the background is the same. That is, instead of a letter consisting of just 1 bitmap, it can just be a long sequence of colors for the entire rectangular region the letter takes up.  This will bloat the DATA segment dedicated to the letters.
Even more optimization techniques such as keeping a frame buffer and writing out a whole screen in one shot are just the beginning in graphics programming, and I won't write these myself because I don't want to reinvent the wheel.
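
To make the bit-testing idea behind the letter printing concrete, here is a small Python sketch that expands the 'A' bitmap from above into rows of pixels on the desktop, using '#' for the text color and a space for the background.  It only mirrors the mask logic; it does not talk to the display, and the function name is mine.

```python
# The 'A' bitmap from above: 11 rows, 2 bytes per row
A10PT = [
    0x0E, 0x00, 0x0F, 0x00, 0x1B, 0x00, 0x1B, 0x00,
    0x13, 0x80, 0x31, 0x80, 0x3F, 0xC0, 0x7F, 0xC0,
    0x60, 0xC0, 0x60, 0xE0, 0xE0, 0xE0,
]

def glyph_rows(bitmap, bytes_per_row=2):
    """Expand a packed bitmap into strings, testing each bit from the MSb down."""
    rows = []
    for start in range(0, len(bitmap), bytes_per_row):
        row = ""
        for byte in bitmap[start:start + bytes_per_row]:
            mask = 0x80
            while mask:
                row += "#" if (byte & mask) == mask else " "
                mask >>= 1
        rows.append(row)
    return rows

rows = glyph_rows(A10PT)
print("\n".join(r.rstrip() for r in rows))
```

Swapping the string append for a pixel write of the text or background color gives the display version of the same loop.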

Porting the Arduino example to RPi Linux user space

Driving out the SPI signal from RPi is an excellent way to prototype an embedded GUI platform even before the new board is brought up.  Even after the board is brought up, writing a user space program to try out an idea is a great convenience.  The key to porting the Arduino example to RPi is to leverage someone else's work on driving the RPi's SPI interface.  The BCM2835 library is mature and performant.  Using it, configuring the GPIO and SPI can be coded concisely:

#include <stdio.h>
#include <bcm2835.h>

#define    DC_PIN   25
#define   RES_PIN   24

int main() {
    if (!bcm2835_init())
        return 1;
    if (!bcm2835_spi_begin()) {
        fprintf(stderr, "bcm2835_spi_begin failed. Are you running as root??\n");
        return 1;
    }
    bcm2835_spi_setBitOrder(BCM2835_SPI_BIT_ORDER_MSBFIRST);      // The default
    bcm2835_spi_setDataMode(BCM2835_SPI_MODE0);                   // The default
    bcm2835_spi_setClockDivider(BCM2835_SPI_CLOCK_DIVIDER_32);    // see the SPI speed note below
    bcm2835_spi_chipSelect(BCM2835_SPI_CS0);                      // The default
    bcm2835_spi_setChipSelectPolarity(BCM2835_SPI_CS0, LOW);      // the default
    // the output pins: D/C (GPIO.25), RES (GPIO.24)
    bcm2835_gpio_fsel(DC_PIN, BCM2835_GPIO_FSEL_OUTP);
    bcm2835_gpio_fsel(RES_PIN, BCM2835_GPIO_FSEL_OUTP);

    /* ... reset the panel, run the init sequence, and draw here ... */

    bcm2835_spi_end();
    bcm2835_close();
    return 0;
}

Divider = 32 yields 8 MHz SPI speed.  I could try going faster, but even at 8 MHz, the signal integrity is marginal.  When the part is integrated on a PCB, I should be able to go faster.  Anyway, the smiley shows that once the image is set, the display can just refresh itself without a periodic refresh from the host, which means that a slow processor like a C2000 can just update the display asynchronously.

This is not quite bare metal in the true sense.  But still, this code should be readily transferable to an embedded target such as C2000.

Bare metal control of the NHD panel from C2000


Nov 19, 2017

JTAG DAP parser

I have been trying to get a low level debug session going against my Raspberry Pi 3 using my J-Link debug probe.  If you Google for "J-Link Raspberry Pi", you will find success reported mostly for the original Raspberry Pi.  At first, I tried to use JLinkGDBServer and JLinkExe on my Ubuntu VM, but I haven't managed to write a working JLinkScript yet.  When even OpenOCD failed to connect to the target, I started digging into the root cause.  Following this page to enable JTAG on RPi's GPIO was relatively easy, as was exposing the copper for the JTAG's TRST, TDI, TMS, TCK, and TDO lines and capturing the failed debug session on the logic analyzer.  Saleae already has a JTAG signal analyzer, so reading the raw bits going into and coming out of the JTAG scan is all relatively easy, as you can see below.
A small portion of a JTAG session between J-Link and Raspberry Pi 3 target, with JTAG enabled on RPi3's P8 header
But such low level exchange does not yield insight, so I started reading documents: ADI (ARM debug interface) 5.2, CoreSight specification 3.0, ARM Cortex A/R programmer's guide, and ARM Cortex-A7 TRM (technical reference manual), and I understood how the debug host controls the ARM CPU's debug subsystem by writing appropriate values to the DAP (debug access port) registers.  But to actually apply this understanding to my problem required rather painful mental bit-shifting, and repeatedly looking up the register definitions in the ADI and CoreSight specification.  So I saved the Saleae's JTAG capture to a CSV file, and wrote a Python script to do the low level heavy-lifting for me.


In the above trace, there are only 2 JTAG TAP (test access port) states that yield decodable values: Shift-IR (instruction register) and Shift-DR (data register).  All other transactions either lead up to these states, or move the state machine back to the starting state.  Roughly speaking, the target register is specified in the Shift-IR state, and the data values for the specified register are given in the Shift-DR state.  So unless the JTAG signals are bad (unlikely on shipping HW like the Raspberry Pi), I can just focus on these 2 states and ignore most of the lines in the CSV emitted by the Saleae Logic GUI.  I learned about the pandas Python library in a Udacity data science course on Supervised Learning: it will do much of the tabular data cleanup for me.  Although I can pip install pandas on my system, it was just easier to install Anaconda (version 2, to stay with Python 2.7) to a separate folder.  So my script begins by using the python interpreter that comes with anaconda2.

from enum import Enum
import sys
class TapState(Enum): Invalid, I, D = range(3)

tap_state = TapState.Invalid.value

Again, the reduction of the JTAG TAP states to I (instruction) and D (data) is a drastic simplification for this case, where I am only interested in the parsing the layer above the JTAG.

In the capture, there are 3 other registers that appear besides the DPACC (debug port access) and APACC (access port access), so I enumerate them.

class IR(Enum):
    ABORT = 8
    DPACC = 10
    APACC = 11
    IDCODE = 14
    BYPASS = 15

ir = IR.BYPASS.value

From my previous experience with SWD, I know about the trick that ARM plays with the SELECT register to map different registers to the limited size of register banks.  Here, 3 separate selections can happen independently, so I need 3 separate variables to maintain in a DAP transaction:

apsel = -1 # After PoR, APSEL unknown
apbank_sel = 0
dpbank_sel = 0

This simple script just takes 1 CSV file, which is separated with ';' rather than a comma.  The pandas package deals with it easily:

import pandas as pd
fn = sys.argv[1]
df = pd.read_csv(fn, sep=';', index_col=0)

Saleae emits the timestamp as the first column, and it is sometimes convenient to look up a packet by the timestamp, so I am specifying the 0th column as the index.

The very first exchange in a JTAG session is the JTAG scan: discovering how many JTAG devices are cascaded.  It's a complete waste for the normal case of just 1 device, but the long-ass sequence is still there; so I just drop it:

df = df[df['TDIBitCount'] < 100] # drop the JTAG scan

Next, I need to deal with the pandas representation of CSV data: all numbers are floating point by default, and the rest are strings.

df['TDOBitCount'] = df['TDOBitCount'].astype(int)
df['TDIBitCount'] = df['TDIBitCount'].astype(int)
df['TDI'] = df.TDI.apply(lambda x: int(x, 16))
df['TDO'] = df.TDO.apply(lambda x: int(x, 16))
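
For readers without pandas at hand, the same cleanup can be approximated with the standard csv module.  This is a swapped-in technique, not what my script does.  The TDI, TDO and TDIBitCount column names are the ones used above, but "Time [s]" and "Packet" are hypothetical names for the timestamp and packet-type columns, and the sample rows are made up for illustration.

```python
import csv
import io

# Made-up stand-in for a Saleae export ("Time [s]" and "Packet" are assumed names)
SAMPLE = u"""Time [s];Packet;TDI;TDO;TDIBitCount;TDOBitCount
0.0100;Shift-DR;0x0;0x0;256;256
0.0181;Shift-IR;0xA;0x1;4;4
0.0182;Shift-DR;0x280000002;0x2;35;35
"""

def load_packets(text_file, max_bits=100):
    """Read the ';'-separated capture, drop long scans, convert hex strings to ints."""
    packets = []
    for rec in csv.DictReader(text_file, delimiter=";"):
        n_bits = int(float(rec["TDIBitCount"]))   # tolerates both "35" and "35.0"
        if n_bits >= max_bits:
            continue                              # drop the long JTAG scan
        packets.append({
            "timestamp": float(rec["Time [s]"]),
            "packet_type": rec["Packet"],
            "TDI": int(rec["TDI"], 16),           # int() accepts the 0x prefix in base 16
            "TDO": int(rec["TDO"], 16),
            "nBit": n_bits,
        })
    return packets

packets = load_packets(io.StringIO(SAMPLE))
for p in packets:
    print(p["timestamp"], p["packet_type"], hex(p["TDI"]), p["nBit"])
```

For a capture of any real size, pandas is still the more convenient tool; the point here is only to show what the type conversions amount to.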

Finally, I can iterate through the TAP Shift-IR and Shift-DR.  The first thing is to break out the items as separate variables, for legibility.  The last column is the number of bits output, which is the same as the number of bits input in all cases I've seen (JTAG seems to work like SPI), so it's safe to drop it.

for row in df.itertuples():
    timestamp, packet_type, TDI, TDO, nBit = row[:-1] # row[0] is timestamp

Since Shift-IR just sets the target register, handling that is straight-forward:

    if packet_type == 'Shift-IR':
        tap_state = TapState.I.value
        if (nBit == 4) and TDI in [IR.DPACC.value, IR.APACC.value, IR.IDCODE.value]:
            ir = TDI
        else: tap_state = TapState.Invalid.value # drop this packet

Shift-DR is far more complicated, but once again, I make the simplifying assumption that I am only interested in the DPACC or APACC.  In both cases, I am only interested in the standard TDI packet comprising 32 bits of data, a 2 bit address, and a 1 bit R/W indicator.

    elif packet_type == 'Shift-DR':
        tap_state = TapState.D.value
        if ir == IR.DPACC.value:
            if nBit == 35: 
                dout = TDO >> 3
                ack = TDO & 0x7
                din = TDI >> 3
                addr = (TDI & 0x6) << 1
                rnw = 'R' if (TDI & 0x1) else 'W'

                decoded = None
                # decode DAP reg
                if addr == 0: decoded = 'DPIDR'
                elif addr == 0x8:
                    apsel = din >> 24
                    apbank_sel = (din >> 4) & 0xF
                    dpbank_sel = din & 0xF
                    decoded = 'SELECT AP {:#x} APB {:#x} DPB {:#x}'.format(apsel, apbank_sel, dpbank_sel)
                elif addr == 0xC: decoded = 'RDBUFF'
                elif addr == 0x4: # act on dpbank_sel
                    if dpbank_sel == 0:
                        decoded = 'CTRL/STAT'
                    elif dpbank_sel == 1:
                        decoded = 'DLCR'
                    elif dpbank_sel == 2:
                        decoded = 'TARGETID'
                    elif dpbank_sel == 3:
                        decoded = 'DLPIDR'
                    elif dpbank_sel == 4:
                        decoded = 'EVENTSTAT'

                print('@{} {:#x} | {:x} {} | {} -> DPACC -> {:#x} | {}'. \
                    format(timestamp, din, addr, decoded, rnw, dout, ack))
            else: print('@{} Unhandled {:#x} -> DPACC -> {:#x}'.format(timestamp, TDI, TDO))

For APACC, my current level of understanding of the ARM MEM-AP registers is not solid enough to hard code the values I see in the TAR (transfer address register), so I keep things simple.

        elif ir == IR.APACC.value:
            if nBit == 35: 
                dout = TDO >> 3
                ack = TDO & 0x7
                din = TDI >> 3
                addr = (TDI & 0x6) << 1
                rnw = 'R' if (TDI & 0x1) else 'W'

                decoded = None
                # Assume this is a MEM-AP and decode
                if apbank_sel == 0:
                    if addr == 0:
                        decoded = 'CSW'
                    elif addr == 4:
                        decoded = 'TAR'
                    elif addr == 0xC:
                        decoded = 'DRW'
                elif apbank_sel == 0xf:
                    if addr == 4:
                        decoded = 'CFG'
                    elif addr == 8:
                        decoded = 'BASE'
                    elif addr == 0xC:
                        decoded = 'IDR'

                print('@{} {:#x} | {:x} {} | {} -> APACC -> {:#x} | {}'. \
                    format(timestamp, din, addr, decoded, rnw, dout, ack))
            else: print('@{} Unhandled {:#x} -> APACC -> {:#x}'.format(timestamp, TDI, TDO))

Finally, I can handle the IDCODE easily, so I just threw that in at the end:

        elif ir == IR.IDCODE.value:
            if nBit == 32: print('IDCODE -> {:#8x}'.format(TDO))
        else: # Hmm what is this?
            tap_state = TapState.Invalid.value

All in all, a simple parser!  Let's see if it is of any use.
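
The heart of the parser is the split of a 35-bit DPACC/APACC scan into its fields, which can be tried out in isolation.  The sketch below repeats the same shifts and masks as a standalone function; encode_scan is a hypothetical helper that exists only to build a test value.

```python
def decode_scan(tdi, tdo):
    """Split a 35-bit DPACC/APACC scan the same way the parser above does."""
    return {
        "rnw":  "R" if (tdi & 0x1) else "W",   # bit 0: read-not-write
        "addr": (tdi & 0x6) << 1,              # bits 2:1 carry A[3:2]
        "din":  tdi >> 3,                      # bits 34:3: 32-bit data in
        "ack":  tdo & 0x7,                     # bits 2:0 of the response
        "dout": tdo >> 3,                      # bits 34:3: 32-bit data out
    }

def encode_scan(data, addr, read):
    """Hypothetical inverse, used here only to construct an example scan."""
    return (data << 3) | ((addr >> 2) << 1) | (1 if read else 0)

# A write of 0x50000000 to the register at address 0x4, answered with ack = 2
tdi = encode_scan(0x50000000, 0x4, read=False)
fields = decode_scan(tdi, tdo=0x2)
print(hex(tdi), fields["rnw"], hex(fields["addr"]), hex(fields["din"]), fields["ack"])
```

Having the decode as a pure function makes it easy to check against hand-computed values before running it over a whole capture.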

Using the parser on openocd session

The very first lines decoded with the parser on a session between my J-Link Ultra+ and RPi3 are:

@0.01815178 0x0 | 4 CTRL/STAT | R -> DPACC -> 0x0 | 2
@0.0181938 0x20 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2
@0.01823582 0x0 | 4 CTRL/STAT | R -> DPACC -> 0x0 | 2
@0.01827784 0x50000000 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2

According to my copy of ADI v5.2 section B.2.2 (CTRL/STAT, Control/Status register), 0x20 is the STICKYERR bit, and writing 1'b1 to it clears that bit.  That makes sense, except that the bit was not set to begin with, so the write is a complete waste of time.  Also, openocd writes 0x0 and then 0x5 to the top nibble.  In the CTRL/STAT layout, 0x5 in that nibble sets CSYSPWRUPREQ (bit 30) and CDBGPWRUPREQ (bit 28), so it is requesting power-up of the system and debug domains rather than a reset (the debug reset request, CDBGRSTREQ, is bit 26).  I am still not sure that toggling these requests is a good thing if I just want to halt a running system, so I don't know if I will run into a problem later.
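
Looking up CTRL/STAT bits by hand gets old quickly, so here is a small decoder sketch.  The bit positions are from my reading of the ADI v5.2 CTRL/STAT description and are worth double checking against the specification; only the single-bit flags are listed, and the multi-bit fields are left out.

```python
# Single-bit CTRL/STAT flags, per my reading of ADI v5.2 (an assumption to verify)
CTRL_STAT_BITS = {
    0: "ORUNDETECT", 1: "STICKYORUN", 4: "STICKYCMP", 5: "STICKYERR",
    6: "READOK", 7: "WDATAERR",
    26: "CDBGRSTREQ", 27: "CDBGRSTACK",
    28: "CDBGPWRUPREQ", 29: "CDBGPWRUPACK",
    30: "CSYSPWRUPREQ", 31: "CSYSPWRUPACK",
}

def decode_ctrl_stat(value):
    """Return the names of the flags set in value, lowest bit first."""
    return [name for bit, name in sorted(CTRL_STAT_BITS.items())
            if value & (1 << bit)]

# Values taken from the decoded traces in this post
for value in (0x20, 0x50000000, 0xa0000000, 0xf0000001):
    print("{:#010x}: {}".format(value, ", ".join(decode_ctrl_stat(value))))
```

Run against the values in the traces, this shows the power-up requests (REQ) going out and the acknowledgements (ACK) coming back.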

So it seems that, armed with tables of the various DP/AP registers, I can start to make sense of what openocd is requesting from the target.  I was therefore surprised to discover, just a few ms later, that openocd goes through a "ping" of potential APs in the address space: all 256 of them, by trying to read the CIDR (component ID register; the 1st register in AP register bank 0xF) of each of the possible 4 KB mappings in the base register.  Using the same parser, I saw that J-Link discovers all available AP components by reading the ROM table (it fails to use the discovered ROM table in an intelligent way, but that's another topic altogether).  Going through 256 possible APs takes J-Link Ultra+ about 133 ms at 100 kbps; it would take J-Link+ about 10x that duration (its inter-packet time is long for some reason).  It then queries the IDR of each possible AP component, only 8 of which are populated on the RPi3 (another ~130 ms wasted).  OpenOCD then goes through another round of unnecessary exchanges with the target: it tries to unlock software access to the debug registers by writing the magic keys for each of the discovered components (the RPi's AP components are ROM table v9, which do not implement the software lock/unlock).  The waste is even worse, because after the discovery, openocd writes the same 0x50000000 request to CTRL/STAT (again), and then goes through the same discovery and software unlock mechanism it went through last time.

@0.32220612 0x20 | 4 CTRL/STAT | W -> DPACC -> 0xf0000001 | 2
@0.32224814 0x0 | 4 CTRL/STAT | R -> DPACC -> 0xa0000000 | 2
@0.32229016 0x50000000 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2

Oct 11, 2017

Hacking the Raspberry Pi Boot

The downloadable Raspbian image does not come with U-Boot (the most popular embedded system boot loader), yet it can clearly boot Linux.  But since the Pi's 2nd stage boot loader is closed source, booting the Pi with U-Boot allows me to perform non-standard boot actions (like exposing the RPi's JTAG pins).

How does the RPi boot?

Booting an ARM system is very different from booting an Intel/AMD system (BIOS or UEFI).  RPi's 1st stage bootloader is on-chip, and looks for the file bootcode.bin in the boot partition:

pi@raspberrypi:~ $ mount
/dev/mmcblk0p1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)

pi@raspberrypi:~ $ ls -gh /boot/bootcode.bin 
-rwxr-xr-x 1 root 50K Jul  3 10:07 /boot/bootcode.bin

The 2nd stage bootloader runs on RPi's GPU (also on-chip), and requires the start_<device>.elf executables, also in the /boot partition:

-rwxr-xr-x 1 root 645K Jul  3 14:07 start_cd.elf
-rwxr-xr-x 1 root 4.8M Jul  3 14:07 start_db.elf
-rwxr-xr-x 1 root 2.8M Jul  3 14:07 start.elf
-rwxr-xr-x 1 root 3.8M Jul  3 14:07 start_x.elf

A headless system will only have start.elf.  The 3rd stage bootloader (also running on the GPU) is currently booting straight into Linux, using the kernel.img and the DTB (device tree blob) files (also in /boot):

-rwxr-xr-x 1 root  18K May 15 19:09 bcm2708-rpi-3-b.dtb
-rwxr-xr-x 1 root 4.4M Jul  3 10:07 kernel7.img # for >= RPi2
-rwxr-xr-x 1 root 4.2M Jul  3 10:07 kernel.img # All other RPi

The RPi boot loader running on the GPU reads config.txt (also in /boot) for boot options, including which Linux kernel to load.  There is no "kernel" line in the config.txt right after a fresh install, so the boot loader falls back to its default: it loads the highest numbered kernel<num>.img.  Since we have a working system but want a boot loader capable of booting multiple OSes, we should point the RPi bootloader to the U-Boot image instead of the kernel<num>.img.  [Other people seem to prefer renaming u-boot.bin to kernel8.img, taking advantage of the boot loader's search for the highest numbered kernel<num>.img.]

While booting the kernel, the RPi bootloader feeds the content of the file /boot/cmdline.txt to the kernel, but now I need U-Boot to do that.  When you view this file, it can get overwhelming, but just note it down somewhere for now, so you can punch it into U-Boot down below.

Let's first back up this known working /boot partition to the development host before messing with it.

Cross compiler toolchain for RPi

Remember throughout this article that the RPi is the target and my 64-bit Ubuntu 16.04 is the development host, which also implies that I am cross compiling the RPi's programs (starting with the bootloader) on the host, using a toolchain graciously provided by the Raspberry Pi folks on GitHub.  To clone the git repo:

parallels@ubuntu:$ git clone

The tools are in the directory arm-bcm2708

$ ls arm-bcm2708/
arm-bcm2708hardfp-linux-gnueabi  gcc-linaro-arm-linux-gnueabihf-raspbian
arm-bcm2708-linux-gnueabi        gcc-linaro-arm-linux-gnueabihf-raspbian-x64

I decided to use the Linaro toolchain, because it is popular among embedded developers.  Note that the 32-bit version of the toolchain will not even run on a 64-bit host (that's pretty much all PCs these days).  I need to put the toolchain's bin/ folder in $PATH, by adding the following line at the end of my ~/.profile, like this:


[If using .profile, you have to logout and log back in for the change to take effect.]

In the toolchain's bin folder, there are a whole bunch of executables, like this:

~/rpi/tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64/bin$ ls
arm-linux-gnueabihf-addr2line     arm-linux-gnueabihf-gfortran

To build either the kernel or U-Boot, an environment variable called CROSS_COMPILE is necessary; the build system uses it as the prefix for each tool it runs.  In the above case, that prefix is arm-linux-gnueabihf-, which I again hard code in my .profile.

Building U-Boot for RPi3

I clone the latest U-Boot repo:

$ git clone git://

In the repo's configs/ folder, there are 4 RPi board configs:

~/rpi/u-boot$ ls configs/*rpi*
configs/rpi_2_defconfig      configs/rpi_3_defconfig
configs/rpi_3_32b_defconfig  configs/rpi_defconfig

These are the predefined U-Boot configs.  Without understanding the details of the configs, I will try using the 32b version for now.  Remembering that I now have the environment variable CROSS_COMPILE, I seed the .config file from this defconfig:

~/rpi/u-boot$ make rpi_3_32b_defconfig
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/
  HOSTLD  scripts/kconfig/conf
# configuration written to .config

And then build the U-Boot binary itself with

~/rpi/u-boot$ make

If all goes well, you should see the ELF binary (with symbols) and the stripped binary (without symbols) in the root folder, like this:

-rwxrwxr-x   1 parallels 2.2M Oct  1 12:00 u-boot
-rw-rw-r--   1 parallels 372K Oct  1 12:00 u-boot.bin

Copy u-boot.bin to RPi's /boot folder.

To cause RPi boot loader to choose U-Boot instead of the kernel, just add a new line in the config.txt:


Overriding what the GPU boots this way (as opposed to changing the file that the GPU looks for) offers the possibility of backing out easily using the recovery method (which lets you modify the config.txt)--IF you are running the NOOBS install.  Also enable UART in config.txt, for the debugging session to begin shortly below.


U-Boot boots kernel7.img

When I reboot RPi with the HDMI monitor and keyboard connected, I see the U-Boot screen on the monitor with this message:

U-Boot 2017.09-00351-g7eefa35-dirty (Oct 11 2017 - 21:25:18 -0700)

DRAM:  896 MiB
RPI 3 Model B (0xa22082)
MMC:   sdhci@7e300000: 0
Hit any key to stop autoboot:  0
Scanning mmc 0:1...

Note that the memory available to the CPU is 128 MB shy of the 1 GB DRAM on board, because I gave 128 MB to the GPU (in /boot/config.txt).

Anyway, it's clearly not booting to Linux, so I reboot and hit the Enter key to stop the autoboot, and I am left with the U-Boot console:


The next step is to configure U-Boot to load the kernel image already on /boot: kernel7.img.  Load the image and the DTB file, and give those to the bootz command in the bootcmd_mmc0 command, as you can see below.  [Note that "mmc 0:1" is pointing to the 1st partition on the SD card--which is NOT the /boot partition for the NOOBS install (in which case it is the 0:6 partition).]

U-Boot> setenv fdtfile bcm2708-rpi-3-b.dtb
U-Boot> setenv bootcmd_mmc0 fatload mmc 0:1 ${kernel_addr_r} kernel7.img\;fatload mmc 0:1 ${fdt_addr} ${fdtfile}\;bootz ${kernel_addr_r} - ${fdt_addr}

This works because the default boot command--bootcmd--eventually runs bootcmd_mmc0, as you can see in the boot command chain below:

U-Boot> printenv bootcmd
bootcmd=run distro_bootcmd

The distro_bootcmd, in turn, is a for loop, going through different boot hardware:

U-Boot> printenv distro_bootcmd
distro_bootcmd=for target in ${boot_targets}; do run bootcmd_${target}; done

And the RPi3 config file sets up multiple boot devices:

U-Boot> printenv boot_targets
boot_targets=mmc0 usb0 pxe dhcp
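
The chain above boils down to "for each word in boot_targets, run the variable named bootcmd_<word>".  Here is a minimal host-side sketch of that expansion, only to make the lookup order explicit.  The function names boot_target_cmd and print_boot_order are made up for illustration; they are not U-Boot APIs, and U-Boot's real shell does far more than this.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a U-Boot API): writes the name of the variable
 * that the distro_bootcmd loop would "run" for the index-th word of
 * boot_targets. Returns 1 on success, 0 if there is no such word or the
 * output buffer is too small. */
int boot_target_cmd(const char *boot_targets, int index,
                    char *out, size_t out_size)
{
    const char *p = boot_targets;

    for (;;) {
        p += strspn(p, " ");              /* skip separators */
        if (*p == '\0')
            return 0;                     /* ran out of targets */
        size_t len = strcspn(p, " ");     /* length of this word */
        if (index-- == 0) {
            int n = snprintf(out, out_size, "bootcmd_%.*s", (int)len, p);
            return n > 0 && (size_t)n < out_size;
        }
        p += len;
    }
}

/* Prints the order in which the per-target boot commands would be tried. */
void print_boot_order(const char *boot_targets)
{
    char name[64];

    for (int i = 0; boot_target_cmd(boot_targets, i, name, sizeof name); i++)
        printf("run %s\n", name);
}
```

Calling print_boot_order("mmc0 usb0 pxe dhcp") lists bootcmd_mmc0 first, which is why overriding that one variable is enough to take over the default boot.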

The kernel needs bootargs: the content of /boot/cmdline.txt that I saved away earlier.  In U-Boot, you pass that through an environment variable called bootargs.  Because it is so long, it is easy to make a typo when creating this environment variable, so I do it in bite-sized chunks:

U-Boot> setenv rootargs "root=/dev/mmcblk0p2 rootfstype=ext4"
U-Boot> setenv fbargs "bcm2708_fb.fbwidth=1824 bcm2708_fb.fbheight=984 bcm2708_fb.fbswap=1"
U-Boot> setenv vcmemargs "vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000"
U-Boot> setenv miscargs "dwc_otg.lpm_enable=0 elevator=deadline console=serial0,115200"
U-Boot> setenv bootargs ${rootargs} ${fbargs} ${miscargs}

Note that I changed the rootfs device from a UUID to the more generic "2nd partition on the mmc device 0".  I found that for a NOOBS install (vs. a straight Raspbian imaging as I did), the ext4 partition is the 6th partition, so you can adjust it to mmcblk0p6 above.

I save the configuration change to the U-Boot environment file (persisted in /boot) before rebooting.

U-Boot> saveenv
U-Boot> boot

After a few seconds, I see that RPi3 is starting to boot Linux, and then voilà: I am back to the Raspbian desktop!  I save the resulting /boot/uboot.env file for safe keeping.  Since I want to start the FW before booting Linux, I studied U-Boot's standalone examples.

From (, I learned that I could get the RPi bootloader to do all the heavy lifting of creating a merged DTB and putting it at the address specified in config.txt--for the device_tree_address line. The default U-Boot ${fdt_addr} is 0x2effb600, so I just added the following line to /boot/config.txt:


U-Boot Hello World example

According to the U-Boot documentation, the cache coherence problem can be worked around by having the U-Boot build system package up the application binary into a U-Boot image (RPi's kernel.img I've been working with above is one such example).  U-Boot already builds a stand-alone hello_world example (examples/standalone), whose start address is defined in arch/arm/


I can verify this by reading the ELF produced by the build.

parallels@ubuntu:~/rpi/u-boot/examples/standalone$ ${CROSS_COMPILE}readelf hello_world -h

  Entry point address:               0xc100000

I can copy the build output (hello_world.bin) to the SD card's /boot partition, and run it in U-Boot shell (note that my /boot partition on the SD card is partition #1 on device #0):

U-Boot> fatload mmc 0:1 0xc100000 hello_world.bin
U-Boot> go 0xc100000 Hello world
## Starting application at 0x0C100000 ...
arg[1] = Hello
arg[2] = world
arg[3] = ""
## Application terminated, rc = 0x0

This shows that U-Boot can run a stand-alone application, which in turn can control low-level peripherals.

Aside: cache problem

On the U-Boot main website's documentation page (as well as the README at the top of the U-Boot repo) there is some discussion about a cache coherence problem, because U-Boot runs all stand-alone code with the cache disabled.  Supposedly, the workaround is to use the "bootm" command, which requires a U-Boot image.  I used mkimage to produce one, so that I could copy the resulting file to the SD card.

~/rpi/u-boot/examples/standalone$ ../../tools/mkimage -A arm -O u-boot -T standalone -C none -a 0xc100000 -d hello_world.bin -v hello_world.img
Adding Image hello_world.bin
Image Name:  
Created:      Sat Oct  7 19:33:34 2017
Image Type:   ARM U-Boot Standalone Program (uncompressed)
Data Size:    630 Bytes = 0.62 KiB = 0.00 MiB
Load Address: 0c100000
Entry Point:  0c100000

But this did not work, because U-Boot rejects it as an invalid kernel image.
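
To double check what mkimage actually wrote, the 64-byte header at the start of hello_world.img can be decoded on the host.  The sketch below assumes the legacy image layout as I read it in U-Boot's image.h: big-endian fields, magic 0x27051956, type 1 for a standalone program and type 2 for a kernel.  The names uimage_info, be32 and uimage_parse are made up for this example.  It does not explain why the image was rejected; it only shows which type the header declares.

```c
#include <stddef.h>
#include <stdint.h>

#define UIMAGE_MAGIC           0x27051956u /* IH_MAGIC */
#define UIMAGE_HEADER_LEN      64
#define UIMAGE_TYPE_STANDALONE 1           /* IH_TYPE_STANDALONE */
#define UIMAGE_TYPE_KERNEL     2           /* IH_TYPE_KERNEL */

/* Hypothetical summary struct; the field names are mine, not U-Boot's. */
struct uimage_info {
    uint32_t size;   /* payload size in bytes */
    uint32_t load;   /* load address */
    uint32_t entry;  /* entry point */
    uint8_t os, arch, type, comp;
};

/* Reads a big-endian 32-bit value from a byte buffer. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 1 and fills *out if buf starts with a legacy image header,
 * 0 if the buffer is too short or the magic number does not match. */
int uimage_parse(const uint8_t *buf, size_t len, struct uimage_info *out)
{
    if (len < UIMAGE_HEADER_LEN || be32(buf) != UIMAGE_MAGIC)
        return 0;
    out->size  = be32(buf + 12);
    out->load  = be32(buf + 16);
    out->entry = be32(buf + 20);
    out->os    = buf[28];
    out->arch  = buf[29];
    out->type  = buf[30];
    out->comp  = buf[31];
    return 1;
}
```

For the image built above, I would expect type to come back as UIMAGE_TYPE_STANDALONE rather than UIMAGE_TYPE_KERNEL, matching the "Standalone Program" line in the mkimage output.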

Modifying the hello_world example to toggle a GPIO line

Printing a string to console is nice, but controlling GPIO lines or SPI would be more practical for embedded applications.  In an industrial application, an RPi system can signal that it is about to boot the Linux kernel by bouncing a GPIO line, or sending out a SPI/I2C packet.  Let's say I want to toggle the RPi3 pin J8.15 right before booting the kernel.  On the bcm2835~7 SoC, GPIO22 drives that J8.15 pin, as the command "gpio readall" (available in out-of-the-box Raspbian) shows (see the row with BCM 22 and physical pin 15):

pi@raspberrypi:~ $ gpio readall
 +-----+-----+---------+------+---+---Pi 3---+---+------+---------+-----+-----+
 | BCM | wPi |   Name  | Mode | V | Physical | V | Mode | Name    | wPi | BCM |
 |     |     |    3.3v |      |   |  1 || 2  |   |      | 5v      |     |     |
 |   2 |   8 |   SDA.1 |   IN | 1 |  3 || 4  |   |      | 5v      |     |     |
 |   3 |   9 |   SCL.1 |   IN | 1 |  5 || 6  |   |      | 0v      |     |     |
 |   4 |   7 | JTAG_05 |   IN | 1 |  7 || 8  | 1 | ALT5 | TxD     | 15  | 14  |
 |     |     |      0v |      |   |  9 || 10 | 1 | ALT5 | RxD     | 16  | 15  |
 |  17 |   0 | GPIO. 0 |   IN | 0 | 11 || 12 | 0 | IN   | GPIO. 1 | 1   | 18  |
 |  27 |   2 | JTAG_07 |   IN | 0 | 13 || 14 |   |      | 0v      |     |     |
 |  22 |   3 | JTAG_03 |   IN | 0 | 15 || 16 | 0 | IN   | JTAG_11 | 4   | 23  |
 |     |     |    3.3v |      |   | 17 || 18 | 0 | IN   | JTAG_13 | 5   | 24  |
 |  10 |  12 |    MOSI |   IN | 0 | 19 || 20 |   |      | 0v      |     |     |
 |   9 |  13 |    MISO |   IN | 0 | 21 || 22 | 0 | IN   | JTAG_09 | 6   | 25  |
 |  11 |  14 |    SCLK |   IN | 0 | 23 || 24 | 1 | IN   | CE0     | 10  | 8   |
 |     |     |      0v |      |   | 25 || 26 | 1 | IN   | CE1     | 11  | 7   |
 |   0 |  30 |   SDA.0 |   IN | 1 | 27 || 28 | 1 | IN   | SCL.0   | 31  | 1   |
 |   5 |  21 | GPIO.21 |   IN | 1 | 29 || 30 |   |      | 0v      |     |     |
 |   6 |  22 | GPIO.22 |   IN | 1 | 31 || 32 | 0 | IN   | GPIO.26 | 26  | 12  |
 |  13 |  23 | GPIO.23 |   IN | 0 | 33 || 34 |   |      | 0v      |     |     |
 |  19 |  24 | GPIO.24 |   IN | 0 | 35 || 36 | 0 | IN   | GPIO.27 | 27  | 16  |
 |  26 |  25 | JTAG_05 |   IN | 0 | 37 || 38 | 0 | IN   | GPIO.28 | 28  | 20  |
 |     |     |      0v |      |   | 39 || 40 | 0 | IN   | GPIO.29 | 29  | 21  |

To bounce a specified GPIO line, I can modify the hello_world program (I can create another stand-alone project, but I am feeling lazy) like this:

#include <common.h>
#include <exports.h>

/* see BCM2835 peripheral datasheet p.92, downloadable from RPi website */
#define GPIO_ALT_FUNCTION_OUT 0x1 /* plain GPIO output */
#define GPIO_ALT_FUNCTION_0   0x4
#define GPIO_ALT_FUNCTION_1   0x5
#define GPIO_ALT_FUNCTION_2   0x6
#define GPIO_ALT_FUNCTION_3   0x7
#define GPIO_ALT_FUNCTION_4   0x3
#define GPIO_ALT_FUNCTION_5   0x2

/* from dwelch67's bare metal project */
#define BCM2708_PERI_BASE            0x3f000000
#define GPIO_BASE (BCM2708_PERI_BASE + 0x200000)

__attribute__((always_inline)) static inline void GPIOOutSet(int GPIO)
{
    volatile uint32_t* GPIO_SET = (volatile uint32_t*)(GPIO_BASE + 0x1C);
    /* GPSETn is write-1-to-set, so a plain write is enough */
    GPIO_SET[GPIO >= 32] = 1u << (GPIO & 0x1F);
}

__attribute__((always_inline)) static inline void GPIOOutClear(int GPIO)
{
    volatile uint32_t* GPIO_CLR = (volatile uint32_t*)(GPIO_BASE + 0x28);
    /* GPCLRn is write-1-to-clear */
    GPIO_CLR[GPIO >= 32] = 1u << (GPIO & 0x1F);
}

/* Have to forward declare because whatever text that appears at the beginning
 * is put at the entry point of the stand alone app.
 */
static void GPIOFunction(int GPIO, int functionCode);

int hello_world (int argc, char * const argv[])
{
    app_startup(argv); /* as in the stock hello_world example */

    if (argc > 1 && strcmp(argv[1], "gpio") == 0) {
        if (argc < 3) {
            printf("which pin? ");
            return 0;
        }
        /* expects exactly two decimal digits, e.g. "22"; simpler than sscanf */
        unsigned pin = 10 * (argv[2][0] - '0')
            + (argv[2][1] - '0');
        printf("%u\n", pin);
        GPIOFunction(pin, GPIO_ALT_FUNCTION_OUT);
        GPIOOutSet(pin); GPIOOutClear(pin);
    }
    return (0);
}

/* integer division by repeated subtraction */
static unsigned _divide(unsigned n, unsigned d) {
    unsigned q = 0;
    while (n >= d) {
        n -= d;
        q++;
    }
    return q;
}

static void GPIOFunction(int GPIO, int functionCode)
{
    /* each GPFSELn register holds ten 3-bit function fields */
    int registerIndex = _divide(GPIO, 10);
    int bit = (GPIO - 10 * registerIndex) * 3;
    volatile uint32_t* GPIO_FSEL = (volatile uint32_t*)(GPIO_BASE + 0);
    uint32_t oldValue = GPIO_FSEL[registerIndex];
    uint32_t mask = 0x7u << bit;
    GPIO_FSEL[registerIndex] = (oldValue & ~mask) /* don't touch others */
                            | ((functionCode << bit) & mask);
}

As you can see, it's a complete rewrite of the hello world stand alone application.  But U-Boot's make system will dutifully produce a hello_world.bin with correct entry point.

parallels@ubuntu:~/rpi/u-boot$ ${CROSS_COMPILE}readelf examples/standalone/hello_world -h
  Entry point address:               0xc100000
  Start of program headers:          52 (bytes into file)

I then copied hello_world.bin to the SD card's /boot and tried it out in U-Boot command line:

U-Boot> fatload mmc 0:1 0xc100000 hello_world.bin
U-Boot> go 0xc100000 gpio 22

This is what I saw on the logic analyzer connected to the UART TX (J8.8) and J8.15: UART TX is busy printing the printf statements, but the ~150 ns wide impulse is the GPIO bounce I wanted.
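
The function-select arithmetic in GPIOFunction is easy to get wrong, so here is a small host-side sketch that reproduces it without touching any hardware.  It assumes the layout used above: ten 3-bit fields per function-select register, with 0x1 meaning output and 0x3 meaning ALT4.  The names fsel_register_index, fsel_bit_shift and fsel_update are made up for this check, and plain integer division stands in for the _divide helper.

```c
#include <stdint.h>

#define FSEL_OUTPUT 0x1u /* plain GPIO output */
#define FSEL_ALT4   0x3u /* alternative function 4 */

/* Index of the function-select register that holds the field for a pin. */
unsigned fsel_register_index(unsigned pin)
{
    return pin / 10;
}

/* Bit position of the pin's 3-bit field inside that register. */
unsigned fsel_bit_shift(unsigned pin)
{
    return (pin % 10) * 3;
}

/* New register value: only the pin's 3-bit field changes. */
uint32_t fsel_update(uint32_t old_value, unsigned pin, uint32_t function_code)
{
    unsigned shift = fsel_bit_shift(pin);
    uint32_t mask = UINT32_C(7) << shift;

    return (old_value & ~mask) | ((function_code << shift) & mask);
}
```

For GPIO22 this gives register index 2 and bit shift 6, so making the pin an output means writing 0x1 into bits 8:6 of the third function-select register.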

Enabling JTAG debugging on RPi3

Now that I can control the GPIO peripheral before booting the kernel, another cool thing to do is turn on the JTAG.  On p. 102 of the BCM2835 peripheral datasheet mentioned above, there is a table of GPIO alternative functions.  In the ALT4 column, GPIO22~27 are the ARM JTAG pins.  I configure these pins for JTAG using code similar to the above.

} else if (strcmp(argv[1], "jtag") == 0) {
    printf ("Remapping GPIO to JTAG ...\n");
    GPIOFunction(22, GPIO_ALT_FUNCTION_4);//TRST: JTAG 3
    GPIOFunction(4, GPIO_ALT_FUNCTION_5);//TDI : JTAG 5
    GPIOFunction(27, GPIO_ALT_FUNCTION_4);//TMS : JTAG 7
    GPIOFunction(25, GPIO_ALT_FUNCTION_4);//TCK : JTAG 9

    //Don't need: RPi doesn't "return" CLK
    //GPIOFunction(23, GPIO_ALT_FUNCTION_4);//RTCK: JTAG 11
    GPIOFunction(24, GPIO_ALT_FUNCTION_4);//TDO : JTAG 13
    if (argc > 2 && strcmp(argv[2], "wait") == 0) {
        printf ("Waiting ...\n");
        while (!tstc());
        (void) getc();
    }
}
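
The wiring is easier to check as a table than as a list of calls.  The sketch below collects, for each JTAG signal, the BCM GPIO number, the function-select code, the physical J8 pin and the 20-pin J-Link connector pin.  The values come from the JTAG_xx labels in the "gpio readall" table above and from my reading of the datasheet's ALT4/ALT5 columns; the struct and function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one JTAG signal as wired in this article. */
struct jtag_pin {
    const char *signal;   /* ARM JTAG signal name */
    unsigned gpio;        /* BCM GPIO number */
    unsigned fsel;        /* function-select code: ALT4 = 0x3, ALT5 = 0x2 */
    unsigned j8_pin;      /* physical pin on the J8 header */
    unsigned jlink_pin;   /* pin on the 20-pin J-Link connector */
};

static const struct jtag_pin jtag_pins[] = {
    { "TRST", 22, 0x3, 15,  3 },
    { "TDI",   4, 0x2,  7,  5 },
    { "TMS",  27, 0x3, 13,  7 },
    { "TCK",  25, 0x3, 22,  9 },
    { "TDO",  24, 0x3, 18, 13 },
};

/* Returns the table entry for a signal name, or NULL if it is not wired. */
const struct jtag_pin *jtag_pin_by_signal(const char *signal)
{
    for (size_t i = 0; i < sizeof jtag_pins / sizeof jtag_pins[0]; i++) {
        if (strcmp(jtag_pins[i].signal, signal) == 0)
            return &jtag_pins[i];
    }
    return NULL;
}
```

RTCK is left out on purpose, matching the commented-out call in the code above.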

Now I need to find these pins on the J8 header (the rows named JTAG_xx in the "gpio readall" table above).  I use SEGGER J-Link, and the JTAG pin definitions are listed in the 20-pin J-Link connector section of the J-Link manual.  When I run JLinkGDBServer, I see this error:

parallels@ubuntu:~$ JLinkGDBServer -device cortex-a7
------Target related settings------
Target device:                 cortex-a7
Target interface:              JTAG
Target interface speed:        1000kHz
Target endian:                 little
Target voltage: 3.31 V
Listening on TCP/IP port 2331
Connecting to target...ERROR: Cortex-A/R-JTAG (connect): Could not determine address of core debug registers. Incorrect CoreSight ROM table in device?
ERROR: Could not connect to target.
Target connection failed. GDBServer will be closed...Restoring target state and closing J-Link connection...
Shutting down...
Could not connect to target.
Please check power, connection and settings.

When I look at TDO (which is driven from the RPi) on the logic analyzer, I see that the target is responding.

JLinkExe fails to find core debug registers

JLinkExe can even get some information out of the target:

Connecting to target via JTAG
TotalIRLen = 4, IRPrint = 0x01
JTAG chain detection found 1 devices:
 #0 Id: 0x4BA00477, IRLen: 04, CoreSight JTAG-DP
ARM AP[0]: 0x24770002, APB-AP
ROMTbl[0][0]: CompAddr: 80010000 CID: B105900D, PID:04-004BBD03
ROMTbl[0][1]: CompAddr: 80011000 CID: B105900D, PID:04-004BB9D3
ROMTbl[0][2]: CompAddr: 80012000 CID: B105900D, PID:04-004BBD03
ROMTbl[0][3]: CompAddr: 80013000 CID: B105900D, PID:04-004BB9D3
ROMTbl[0][4]: CompAddr: 80014000 CID: B105900D, PID:04-004BBD03
ROMTbl[0][5]: CompAddr: 80015000 CID: B105900D, PID:04-004BB9D3
ROMTbl[0][6]: CompAddr: 80016000 CID: B105900D, PID:04-004BBD03
ROMTbl[0][7]: CompAddr: 80017000 CID: B105900D, PID:04-004BB9D3
TotalIRLen = 4, IRPrint = 0x01
JTAG chain detection found 1 devices:
 #0 Id: 0x4BA00477, IRLen: 04, CoreSight JTAG-DP

****** Error: Cortex-A/R-JTAG (connect): Could not determine address of core debug registers. Incorrect CoreSight ROM table in device?
TotalIRLen = 4, IRPrint = 0x01
JTAG chain detection found 1 devices:
 #0 Id: 0x4BA00477, IRLen: 04, CoreSight JTAG-DP
TotalIRLen = 4, IRPrint = 0x01
JTAG chain detection found 1 devices:
 #0 Id: 0x4BA00477, IRLen: 04, CoreSight JTAG-DP
Cannot connect to target.
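
The CID and PID values in the ROM table dump can be split into fields.  The sketch below assumes the CoreSight identification layout as I understand it: the component class is bits 15:12 of the 32-bit component ID (the remaining bits are a fixed preamble), and in the low 32 bits of the peripheral ID, bits 11:0 are the part number and bits 18:12 the JEP106 identity code.  The function names are made up for illustration.

```c
#include <stdint.h>

/* Component class: bits 15:12 of the 32-bit component ID. */
unsigned cid_class(uint32_t cid)
{
    return (cid >> 12) & 0xFu;
}

/* Everything except the class field should match the fixed preamble. */
int cid_preamble_ok(uint32_t cid)
{
    return (cid & 0xFFFF0FFFu) == 0xB105000Du;
}

/* Part number: bits 11:0 of the low 32 bits of the peripheral ID. */
unsigned pid_part_number(uint32_t pid_low)
{
    return pid_low & 0xFFFu;
}

/* JEP106 identity code: bits 18:12 of the low 32 bits. */
unsigned pid_jep106_id(uint32_t pid_low)
{
    return (pid_low >> 12) & 0x7Fu;
}
```

Applied to the dump above, every entry decodes to class 9, and the part numbers alternate between 0xD03 and 0x9D3, so the eight entries look like two component types repeated four times.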

I then tried OpenOCD instead.

OpenOCD cannot connect either

OpenOCD does its own thing, but it can also get some response from the RPi.

$ openocd -f /usr/share/openocd/scripts/interface/jlink.cfg -f ~/rpi/openocd/rpi3.cfg
Open On-Chip Debugger 0.9.0 (2015-09-02-10:42)
Licensed under GNU GPL v2
For bug reports, read
adapter speed: 1000 kHz
adapter_nsrst_delay: 400
none separate
Info : auto-selecting first available session transport "jtag". To override use 'transport select <transport>'.
Info : J-Link ARM V8 compiled Nov 28 2014 13:44:46
Info : J-Link caps 0xb9ff7bbf
Info : J-Link hw version 80000
Info : J-Link hw type J-Link
Info : J-Link max mem block 9224
Info : J-Link configuration
Info : USB-Address: 0x0
Info : Kickstart power on JTAG-pin 19: 0xffffffff
Info : Vref = 3.306 TCK = 1 TDI = 0 TDO = 1 TMS = 0 SRST = 1 TRST = 1
Info : J-Link JTAG Interface ready
Info : clock speed 1000 kHz
Info : JTAG tap: rspi.arm tap/device found: 0x4ba00477 (mfg: 0x23b, part: 0xba00, ver: 0x4)
Warn : JTAG tap: rspi.arm       UNEXPECTED: 0x4ba00477 (mfg: 0x23b, part: 0xba00, ver: 0x4)
Error: JTAG tap: rspi.arm  expected 1 of 1: 0x07b7617f (mfg: 0x0bf, part: 0x7b76, ver: 0x0)
Error: Trying to use configured scan chain anyway...
Warn : Bypassing JTAG setup events due to errors
Error: 'arm11 target' JTAG error SCREG OUT 0x00
Error: unexpected ARM11 ID code
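
The mfg/part/ver numbers in the log are just bit fields of the 32-bit JTAG IDCODE.  A minimal sketch of that split, assuming the standard layout (version in bits 31:28, part number in bits 27:12, manufacturer in bits 11:1); the struct and function names are made up for illustration:

```c
#include <stdint.h>

/* Fields of a JTAG IDCODE, named after the labels in the OpenOCD log. */
struct jtag_idcode {
    unsigned version;       /* "ver":  bits 31:28 */
    unsigned part;          /* "part": bits 27:12 */
    unsigned manufacturer;  /* "mfg":  bits 11:1  */
};

/* Splits a raw 32-bit IDCODE into its three fields. */
struct jtag_idcode decode_idcode(uint32_t idcode)
{
    struct jtag_idcode id;

    id.version = (idcode >> 28) & 0xFu;
    id.part = (idcode >> 12) & 0xFFFFu;
    id.manufacturer = (idcode >> 1) & 0x7FFu;
    return id;
}
```

Decoding 0x4ba00477 and 0x07b7617f this way reproduces the two sets of numbers in the log: the config file expects a different manufacturer and part than the one the RPi3 reports, which fits the "arm11 target" errors that follow.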

It seems that JTAG pin mapping works, but the client side tools don't play well with RPi3.  I will come back after getting a response back from SEGGER.