Nov 25, 2017

"Bare metal" control of the New Haven OLED

Many embedded devices have no display, and make do with status LEDs (think about your home router or switch).  One step up is a small graphic display, like the New Haven OLED display I bought a few years ago to study the Linux frame buffer device drivers.  Originally, I was going to experiment on my Zedboard (which packs a Xilinx Zynq SoC), but since then I've embraced the Raspberry Pi project.  So in this blog entry, I create a status display GUI on my NHD-1.27-12896UGC3, which comes with its own display controller, the SSD1351.  When complete (and if the soldering and connections are OK), the RPi Linux driven terminal looks like this:
RPi console output on NHD-1.27-12896UGC3.  Note the crisp color and the deep black.

RPi Linux supports SSD1351 out of the box

The Raspberry Pi Linux kernel (which can be cloned from https://github.com/raspberrypi/linux) already has the matching kernel module for the display driver, which you can verify for yourself by running the following commands on your Raspberry Pi:

pi@raspberrypi$ sudo modprobe configs
pi@raspberrypi$ gunzip -c /proc/config.gz > ~/BAK/.config 

You can see SSD1351 support in the resulting file:

CONFIG_FB_TFT_SSD1351=m

This config option builds fb_ssd1351, which is one of the fbtft (TFT frame buffer) devices enumerated in fbtft_device.c.  The 2 SSD1351 devices enumerated there are pioled and freetronicsoled128, neither of which is the NHD 1.27 128x96 device I have.  They are, however, driven similarly: 20 MHz SPI in mode 0 (of the 4 SPI modes available; in mode 0, the slave samples the data line on the rising edge of the clock).  One puzzle is how pioled can drive a display only 32 pixels tall when fb_ssd1351.c hard codes the initial height to 128, but let's see whether freetronicsoled128 can handle the "height=96" modprobe argument.  It looks like the probe code (fbtft_probe_dt) handles a whole bunch of options:

pdata->display.width = fbtft_of_value(node, "width");
pdata->display.height = fbtft_of_value(node, "height");
pdata->display.regwidth = fbtft_of_value(node, "regwidth");
pdata->display.buswidth = fbtft_of_value(node, "buswidth");
pdata->display.backlight = fbtft_of_value(node, "backlight");
pdata->display.bpp = fbtft_of_value(node, "bpp");
pdata->display.debug = fbtft_of_value(node, "debug");
pdata->rotate = fbtft_of_value(node, "rotate");
pdata->bgr = of_property_read_bool(node, "bgr");
pdata->fps = fbtft_of_value(node, "fps");
pdata->txbuflen = fbtft_of_value(node, "txbuflen");
pdata->startbyte = fbtft_of_value(node, "startbyte");
of_property_read_string(node, "gamma", (const char **)&pdata->gamma);

if (of_find_property(node, "led-gpios", NULL))
    pdata->display.backlight = 1;

The default module properties in the kernel code for this device bear remembering:

.name = "freetronicsoled128",
.spi = &(struct spi_board_info) {
    .modalias = "fb_ssd1351",
    .max_speed_hz = 20000000,
    .mode = SPI_MODE_0,
    .platform_data = &(struct fbtft_platform_data) {
        .display = {
            .buswidth = 8,
            .backlight = FBTFT_ONBOARD_BACKLIGHT,
        },
        .bgr = true,
        .gpios = (const struct fbtft_gpio []) {
            { "reset", 24 },
            { "dc", 25 },
            {},
        },
    }
}

The reset and dc pins above are the D/C# and RES# pins defined in the SSD1351 controller interface table shown below:
Since it mentions the D/C# pin explicitly (rather than leaving it tied low as in the 3-wire interface), the device driver is expecting to use the 4-wire SPI interface above, through RPi GPIO pin 25.  The kernel config did not set CONFIG_FBTFT_ONBOARD_BACKLIGHT because the device doesn't need a backlight (it's an OLED!).  The NHD-1.27-12896UGC3 data sheet shows the recommended wiring for the 4-wire SPI mode as follows:
Including the 3.3 V power and ground, only 7 wires connect the display module to RPi GPIO header:
  • D/C: RPi P1.22, GPIO.25
  • SCLK: RPi P1.23 (AKA SCLK)
  • SDIN: RPi P1.19 (AKA MOSI)
  • /RES: RPi P1.18, GPIO.24
  • /CS: RPi P1.24 (AKA CE0), GPIO.8
When all soldered and wired, the connection looks like this (ignore the logic analyzer probes on the display pins).

To load the kernel module, I supply the display height (which is different from the default 128 pixels) as a module argument like this:

sudo modprobe fbtft_device name=freetronicsoled128 height=96

But according to the kernel log, the height argument was ignored:

Nov 25 17:22:38 hchoi2-RPi1B kernel: [ 1709.489760] graphics fb1: fb_ssd1351 frame buffer, 128x128, 32 KiB video memory, 4 KiB DMA buffer memory, fps=20, spi0.0 at 20 MHz

Anyhow, the module load succeeded and I now have another frame buffer device (in addition to the default HDMI out):

pi@hchoi2-RPi1B:~ $ ls /dev/fb
fb0  fb1  

I can then use the 2nd frame buffer as the console output:

pi@hchoi2-RPi1B:~ $ con2fbmap 1 1

The console can be redirected back to the default HDMI frame buffer (fb0) by changing the last 1 in the above command to 0.

Low level control of SSD1351 on Arduino Uno

Linux FB framework is powerful but requires a lot of code, which does not fit on most deeply embedded targets.  The vendor (New Haven Display) put out a "bare metal" example on GitHub for controlling the device from Arduino Uno.  This is an easier way to understand the low level control than wading through the many layers of the Linux FB driver code.  The following is my annotation of the example Arduino code.

Low level primitive

The supplied example shows 3 different methods of sending a command to the device: 2 parallel interfaces and the 4-wire SPI.  I am only interested in the serial interface (I cannot dedicate that many pins just for the display!) so I will ignore the parallel interfaces going forward.  The chip requires MSb-first (most significant bit first) transfers, and the Arduino bit-bangs each bit on its GPIO pin while holding CS (chip select) and D/C# low for the whole duration of the 8 bits.

Writing 1 B of data over the serial interface is exactly the same, except for holding D/C# high while writing.
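
The two primitives can be tried off-target. The following is a minimal sketch in Python rather than the vendor's Arduino code: the GPIO is replaced by a recording stub so it runs anywhere, and the names `FakePins`, `write_byte`, `write_command`, `write_data` and the pin labels are all hypothetical.

```python
class FakePins(object):
    """Stand-in for GPIO: records every pin write instead of driving hardware."""
    def __init__(self):
        self.log = []

    def set(self, pin, level):
        self.log.append((pin, level))


def write_byte(pins, value, is_data):
    """Shift one byte out MSb first; D/C# is low for a command, high for data."""
    pins.set('DC', 1 if is_data else 0)
    pins.set('CS', 0)                      # keep the chip selected for all 8 bits
    for bit in range(7, -1, -1):           # bit 7 (the MSb) goes out first
        pins.set('SCLK', 0)
        pins.set('SDIN', (value >> bit) & 1)
        pins.set('SCLK', 1)                # data is sampled on this rising edge (SPI mode 0)
    pins.set('CS', 1)


def write_command(pins, value):
    write_byte(pins, value, is_data=False)


def write_data(pins, value):
    write_byte(pins, value, is_data=True)


pins = FakePins()
write_command(pins, 0xAE)
sent_bits = [level for pin, level in pins.log if pin == 'SDIN']
print(sent_bits)   # the bits of 0xAE, most significant first
```

On a real target the `set` method would toggle the actual pins; everything else stays the same.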

Initialization

  1. Chip reset: pull the RES# pin low for 500 usec, pull it high again, and then wait for at least 500 usec.
  2. Unlock command: write 0x12 and then 0xB1 to the command lock register (0xFD)
  3. Sleep mode on (display off): send the 0xAE command, which takes no data bytes
  4. Set the clock divider field to 1 and the oscillator frequency to 0xF: write 0xF1 to 0xB3.  Writing to this register requires command unlocking (step #2).
  5. Set mux ratio
  6. Set display offset and start
  7. Set color depth to 18-bit (262k colors), 16-bit format 2.
  8. GPIO input disabled
  9. Enable internal Vdd regulator
  10. Choose external VSL
  11. Set contrast current for the 3 colors (slightly different from the default: 0x8A, 0x70, 0x8A)
  12. Reset output currents for all colors
  13. Enhance display performance
  14. ...
  15. Sleep mode off (display on): send the 0xAF command, which takes no data bytes.
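
A sequence like this is convenient to keep as a table and replay through the command/data primitives. Below is a minimal sketch in Python that covers only the steps whose values are spelled out above; the remaining steps are intentionally left as a gap, and `INIT_SEQUENCE` and `flatten` are hypothetical names.

```python
# (command, [data bytes]) pairs for the steps whose values are given in the list above.
INIT_SEQUENCE = [
    (0xFD, [0x12]),   # step 2: first write to the command lock register
    (0xFD, [0xB1]),   # step 2: second write to the command lock register
    (0xAE, []),       # step 3: sleep mode on (display off), no data bytes
    (0xB3, [0xF1]),   # step 4: clock divider / oscillator frequency
    # steps 5 through 14 are not listed here because their values are not given above
    (0xAF, []),       # step 15: sleep mode off (display on), no data bytes
]


def flatten(sequence):
    """Turn the table into (is_data, byte) tuples in transmit order."""
    stream = []
    for command, data in sequence:
        stream.append((False, command))               # D/C# low for the command byte
        stream.extend((True, byte) for byte in data)  # D/C# high for each data byte
    return stream


stream = flatten(INIT_SEQUENCE)
for is_data, byte in stream:
    print('{} {:#04x}'.format('data' if is_data else 'cmd ', byte))
```

Keeping the sequence as data makes it easy to diff against the data sheet and to reuse on another target.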

Blank out the entire screen to black

Blanking out the screen to any color just means writing the same (whatever) color to every pixel.  It consists of a setup stage and a data stage:
  1. Set column start and end to 0 and 127, respectively
  2. Set row start and end to 0 and 95, respectively
  3. Start write to RAM: write the destination register address (0x5C)
  4. For the next 128x96 pixels, write the given pixel value (RGB) as SPI data.  For the 262k color over 8-bit serial interface, the data format is given in Table 8-8 of the SSD1351 data sheet.  If I don't check for saturation, it's convenient to keep the colors as separate bytes, and output 8-bits for each color in rapid succession.
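
The four steps above can be sketched as a function that builds the byte stream to send. This is a Python illustration under assumptions: 0x15 and 0x75 are taken to be the SSD1351 set-column and set-row commands, each pixel is sent as 3 separate color bytes as described in step 4, and the red/green/blue byte order and the name `fill_screen` are hypothetical.

```python
WIDTH, HEIGHT = 128, 96

SET_COLUMN = 0x15   # assumed: SSD1351 "set column address" command
SET_ROW = 0x75      # assumed: SSD1351 "set row address" command
WRITE_RAM = 0x5C    # "start write to RAM" register from step 3


def fill_screen(red, green, blue):
    """Return the (is_data, byte) tuples that paint every pixel one color."""
    stream = [
        (False, SET_COLUMN), (True, 0), (True, WIDTH - 1),
        (False, SET_ROW), (True, 0), (True, HEIGHT - 1),
        (False, WRITE_RAM),
    ]
    pixel = [(True, red), (True, green), (True, blue)]  # one byte per color (assumed order)
    stream.extend(pixel * (WIDTH * HEIGHT))
    return stream


stream = fill_screen(0, 0, 0)   # all zeros: black
print(len(stream))              # 7 setup bytes plus 128 * 96 * 3 data bytes
```

The size of the result shows why blanking is the slowest operation here: almost all of the traffic is pixel data.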

Print a fixed font letter

If I emit a different color for a pixel than the background color, I can show a dot at a given point.  If I arrange a group of neighboring pixels in a pre-arranged way, that is a symbol that can be shown at offset (x, y) on the screen.  If I then hold a read-only bitmap representing a letter, it is possible to print one letter at a time on the screen, by testing each bit of the bitmap as the pixel position moves to the right.  Here's an example of the letter 'A' in 10-point font:

const unsigned char A10pt [] = { // 'A' (11 pixels wide)
0x0E, 0x00, //     ###    
0x0F, 0x00, //     ####   
0x1B, 0x00, //    ## ##   
0x1B, 0x00, //    ## ##   
0x13, 0x80, //    #  ###  
0x31, 0x80, //   ##   ##  
0x3F, 0xC0, //   ######## 
0x7F, 0xC0, //  ######### 
0x60, 0xC0, //  ##     ## 
0x60, 0xE0, //  ##     ###
0xE0, 0xE0, // ###     ###
};

Note that the glyph in this "10 point" font is actually 11 pixels tall and 11 pixels wide, stored as 2 B (16 bits) per row; the loop below advances x by 13 pixels per letter.  A for-loop to print this letter at position x and y on the screen is:

   index = 0;
   mask = 0x80;          // start each byte at its most significant bit
   for(i=0;i<11;i++)     // display custom character A, one row per iteration
   {
        OLED_SetColumnAddress_12896RGB(x, 0x7F);
        OLED_SetRowAddress_12896RGB(y, 0x5F);
        OLED_WriteMemoryStart_12896RGB();
        for (count=0;count<8;count++)   // left byte of the row
        {
            if((A10pt[index] & mask) == mask)
                OLED_Pixel_12896RGB(textColor);
            else
                OLED_Pixel_12896RGB(backgroundColor);
            mask = mask >> 1;
        }
        index++;
        mask = 0x80;
        for (count=0;count<8;count++)   // right byte of the row
        {
            if((A10pt[index] & mask) == mask)
                OLED_Pixel_12896RGB(textColor);
            else
                OLED_Pixel_12896RGB(backgroundColor);
            mask = mask >> 1;
        }
        index++;
        mask = 0x80;
        y--;              // move to the next row of the glyph
   }
   x += 13;              // advance to the next letter position

This implementation is intimately tied to the font representation above: each row of the font consists of 2 B, and the pixel width and height are hard coded.  But note that a few of the hard coded parameters can be parametrized: the letter position (x, y), the letter itself, and the foreground color (and possibly the background color).  The loop can then be refactored into a common function that looks up the letter in a table, like the ASCII table:

void OLED_Text_12896RGB(unsigned char x_pos, unsigned char y_pos, unsigned char letter, unsigned long textColor, unsigned long backgroundColor);

This strategy is slow but functional.  It can be sped up by grouping the individual byte writes into one long sequence of bytes:
  • The nRS pin can be held low the whole time (i.e. avoid the repeated function calls)
  • The SPI write can be accelerated over DMA if the background is the same. That is, instead of a letter consisting of just 1 bitmap, it can just be a long sequence of colors for the entire rectangular region the letter takes up.  This will bloat the DATA segment dedicated to the letters.
Even more optimization techniques such as keeping a frame buffer and writing out a whole screen in one shot are just the beginning in graphics programming, and I won't write these myself because I don't want to reinvent the wheel.
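
The bit-testing idea in this section is easy to check off-target. Here is a minimal Python sketch that walks the 'A' bitmap above with the same 0x80 mask and prints it as text instead of pixels; `glyph_rows` and `render` are hypothetical names, and the 2 bytes per row layout is the one used by the bitmap above.

```python
# The 'A' bitmap from above: 11 rows, 2 bytes per row.
A10PT = [
    0x0E, 0x00, 0x0F, 0x00, 0x1B, 0x00, 0x1B, 0x00, 0x13, 0x80, 0x31, 0x80,
    0x3F, 0xC0, 0x7F, 0xC0, 0x60, 0xC0, 0x60, 0xE0, 0xE0, 0xE0,
]


def glyph_rows(bitmap, bytes_per_row=2):
    """Expand the packed bitmap into rows of booleans, leftmost pixel first."""
    rows = []
    for start in range(0, len(bitmap), bytes_per_row):
        bits = []
        for byte in bitmap[start:start + bytes_per_row]:
            mask = 0x80                 # test the most significant bit first
            while mask:
                bits.append(bool(byte & mask))
                mask >>= 1
        rows.append(bits)
    return rows


def render(bitmap, ink='#', paper='.'):
    """Return one string per glyph row, using ink for set bits."""
    return [''.join(ink if bit else paper for bit in row)
            for row in glyph_rows(bitmap)]


lines = render(A10PT)
for line in lines:
    print(line)
```

Swapping the two characters for a foreground and background color value gives the pixel stream for one letter.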

Porting the Arduino example to RPi Linux user space

Driving out the SPI signal from RPi is an excellent way to prototype an embedded GUI platform even before the new board is brought up.  Even after the board is brought up, writing a user space program to try out an idea is a great convenience.  The key to porting the Arduino example to RPi is to leverage someone else's work on driving the RPi's SPI interface.  The BCM2835 library is mature and performant.  Using it, configuring the GPIO and SPI can be coded concisely:

#include <errno.h>
#include <stdio.h>
#include <bcm2835.h>

#define    DC_PIN   25
#define   RES_PIN   24

int main() {
    if (!bcm2835_init())
        return 1;
    if (!bcm2835_spi_begin())    {
        fprintf(stderr, "bcm2835_spi_begin failed %d. Are you running as root??\n",
                errno);
        return 1;
    }
    bcm2835_spi_setBitOrder(BCM2835_SPI_BIT_ORDER_MSBFIRST);      // The default
    bcm2835_spi_setDataMode(BCM2835_SPI_MODE0);                   // The default
    bcm2835_spi_setClockDivider(BCM2835_SPI_CLOCK_DIVIDER_32);
    bcm2835_spi_chipSelect(BCM2835_SPI_CS0);                      // The default
    bcm2835_spi_setChipSelectPolarity(BCM2835_SPI_CS0, LOW);      // the default

    // the output pins: D/C (GPIO.25), RES (GPIO.24)
    bcm2835_gpio_fsel(DC_PIN, BCM2835_GPIO_FSEL_OUTP);
    bcm2835_gpio_fsel(RES_PIN, BCM2835_GPIO_FSEL_OUTP);

    // ... reset the panel and send the initialization sequence here ...

    bcm2835_spi_end();
    bcm2835_close();
    return 0;
}

Divider = 32 yields roughly 8 MHz SPI speed (the 250 MHz core clock divided by 32 is 7.8125 MHz).  I could try going faster, but even at 8 MHz, the signal integrity is marginal.  When the part is integrated on a PCB, I should be able to go faster.  Anyway, the smiley shows that once the image is set, the display can just refresh itself without a periodic refresh from the host, which means that a slow processor like a C2000 can just update the display asynchronously.
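
To put the SPI speed in perspective, here is a back-of-the-envelope calculation in Python. It assumes a 250 MHz core clock feeding the SPI block and 3 bytes per pixel (the 262k color format), and it ignores command bytes and any gaps between transfers, so the real figure will be somewhat higher.

```python
CORE_CLOCK_HZ = 250e6    # assumed core clock feeding the SPI block
DIVIDER = 32             # BCM2835_SPI_CLOCK_DIVIDER_32
WIDTH, HEIGHT = 128, 96
BYTES_PER_PIXEL = 3      # assumed: one byte per color

spi_hz = CORE_CLOCK_HZ / DIVIDER
frame_bits = WIDTH * HEIGHT * BYTES_PER_PIXEL * 8
frame_ms = 1000.0 * frame_bits / spi_hz

print('SPI clock: {:.4f} MHz'.format(spi_hz / 1e6))
print('Full-frame write: {:.1f} ms'.format(frame_ms))
```

A full-screen rewrite therefore costs tens of milliseconds at this clock, which is why updating only the changed region matters on a slower host.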

This is not quite bare metal in the true sense.  But still, this code should be readily transferable to an embedded target such as C2000.

Bare metal control of the NHD panel from C2000

TODO

Nov 19, 2017

JTAG DAP parser

I have been trying to get a low level debug session going against my Raspberry Pi 3 using my J-Link debug probe.  If you Google for "J-Link Raspberry Pi", you will find success reported mostly for the original Raspberry Pi.  At first, I tried to use JLinkGDBServer and JLinkExe on my Ubuntu VM, but I haven't managed to write a working JLinkScript yet.  When even OpenOCD failed to connect to the target, I started digging into the root cause.  Following this page to enable JTAG on RPi's GPIO was relatively easy, as was exposing the copper for the JTAG's TRST, TDI, TMS, TCK, TDO lines and capturing the failed debug session on the logic analyzer.  Saleae already has the JTAG signal analyzer, so reading the raw bits going into and coming out of the JTAG scan is all relatively easy, as you can see below.
A small portion of a JTAG session between J-Link and Raspberry Pi 3 target, with JTAG enabled on RPi3's P8 header
But such low level exchange does not yield insight, so I started reading documents: ADI (ARM debug interface) 5.2, CoreSight specification 3.0, ARM Cortex A/R programmer's guide, and ARM Cortex-A7 TRM (technical reference manual), and I understood how the debug host controls the ARM CPU's debug subsystem by writing appropriate values to the DAP (debug access port) registers.  But to actually apply this understanding to my problem required rather painful mental bit-shifting, and repeatedly looking up the register definitions in the ADI and CoreSight specification.  So I saved the Saleae's JTAG capture to a CSV file, and wrote a Python script to do the low level heavy-lifting for me.

Parser

In the above trace, there are only 2 JTAG TAP (test access port) states that yield decodable values: Shift-IR (instruction register) and Shift-DR (data register).  All other transactions either lead up to these states, or move the state machine back to the starting state.  Roughly speaking, the target register is specified in the Shift-IR state, and the data values for the specified register are given in the Shift-DR state.  So unless the JTAG signals are bad (unlikely on shipping HW like the Raspberry Pi), I can just focus on these 2 states and ignore most of the lines in the CSV emitted by the Saleae Logic GUI.  I learned about the pandas Python library in a Udacity data science course on Supervised Learning: it will do much of the tabular data cleanup for me.  Although I can pip install pandas on my system, it was just easier to install Anaconda (version 2, to stay with Python 2.7) to a separate folder.  So my script begins by using the python interpreter that comes with anaconda2.

#!/anaconda2/bin/python
from enum import Enum
import sys
class TapState(Enum): Invalid, I, D = range(3)

tap_state = TapState.Invalid.value

Again, the reduction of the JTAG TAP states to I (instruction) and D (data) is a drastic simplification for this case, where I am only interested in parsing the layer above JTAG.

In the capture, there are 3 other registers that appear besides the DPACC (debug port access) and APACC (access port access), so I enumerate them.

class IR(Enum):
    ABORT = 8
    DPACC = 10
    APACC = 11
    IDCODE = 14
    BYPASS = 15

ir = IR.BYPASS.value

From my previous experience with SWD, I know about the trick that ARM plays with the SELECT register to map different registers to the limited size of register banks.  Here, 3 separate selections can happen independently, so I need 3 separate variables to maintain in a DAP transaction:

apsel = -1 # After PoR, APSEL unknown
apbank_sel = 0
dpbank_sel = 0

This simple script just takes 1 CSV file, which is separated with ';' rather than a comma.  pandas deals with it easily:

import pandas as pd
fn = sys.argv[1]
df = pd.read_csv(fn, sep=';', index_col=0)

Saleae emits the timestamp as the first column, and it is sometimes convenient to look up a packet by the timestamp, so I am specifying the 0th column as the index.

The very first exchange in a JTAG session is the JTAG scan: discovering how many JTAG devices are cascaded.  It's a complete waste for the normal case of just 1 device, but the long-ass sequence is still there; so I just drop it:

df = df[df['TDIBitCount'] < 100] # drop the JTAG scan

Next, I need to deal with the pandas representation of CSV data: all numbers are floating point by default, and the rest are strings.

df['TDOBitCount'] = df['TDOBitCount'].astype(int)
df['TDIBitCount'] = df['TDIBitCount'].astype(int)
df['TDI'] = df.TDI.apply(lambda x: int(x, 16))
df['TDO'] = df.TDO.apply(lambda x: int(x, 16))

Finally, I can iterate through the TAP Shift-IR and Shift-DR.  The first thing is to break out the items as separate variables, for legibility.  The last column is the number of bits output, which is the same as the number of bits input in all cases I've seen (JTAG seems to work like SPI), so it's safe to drop it.

for row in df.itertuples():
    timestamp, packet_type, TDI, TDO, nBit = row[:-1] # row[0] is timestamp

Since Shift-IR just sets the target register, handling that is straightforward:

    if packet_type == 'Shift-IR':
        tap_state = TapState.I.value
        if (nBit == 4) and TDI in [IR.DPACC.value, IR.APACC.value, IR.IDCODE.value]:
            ir = TDI
        else: tap_state = TapState.Invalid.value # drop this packet

Shift-DR is far more complicated, but once again, I make the simplifying assumption that I am only interested in DPACC or APACC.  In both cases, I am only interested in the standard 35-bit TDI packet comprising 32 bits of data, a 2-bit address, and a 1-bit R/W indicator.

    elif packet_type == 'Shift-DR':  # assumed guard: decode data for the IR selected above
        if ir == IR.DPACC.value:
            if nBit == 35: 
                dout = TDO >> 3
                ack = TDO & 0x7
                din = TDI >> 3
                addr = (TDI & 0x6) << 1
                rnw = 'R' if (TDI & 0x1) else 'W'

                decoded = None
                # decode DAP reg
                if addr == 0: decoded = 'DPIDR'
                elif addr == 0x8:
                    apsel = din >> 24
                    apbank_sel = (din >> 4) & 0xF
                    dpbank_sel = din & 0xF
                    decoded = 'SELECT AP {:#x} APB {:#x} DPB {:#x}'.format(apsel, apbank_sel, dpbank_sel)
                elif addr == 0xC: decoded = 'RDBUFF'
                elif addr == 0x4: # act on dpbank_sel
                    if dpbank_sel == 0:
                        decoded = 'CTRL/STAT'
                    elif dpbank_sel == 1:
                        decoded = 'DLCR'
                    elif dpbank_sel == 2:
                        decoded = 'TARGETID'
                    elif dpbank_sel == 3:
                        decoded = 'DLPIDR'
                    elif dpbank_sel == 4:
                        decoded = 'EVENTSTAT'

                print('@{} {:#x} | {:x} {} | {} -> DPACC -> {:#x} | {}'. \
                    format(timestamp, din, addr, decoded, rnw, dout, ack))
            else: print('@{} Unhandled {:#x} -> DPACC -> {:#x}'.format(timestamp, TDI, TDO))
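
To make the 35-bit layout concrete, here is a small standalone Python sketch that uses the same shifts and masks as the parser on one hand-built scan. `decode_scan` and `encode_scan` are hypothetical helpers, and the example value mirrors the CTRL/STAT write seen later in the capture.

```python
def decode_scan(tdi, tdo):
    """Split a 35-bit DPACC/APACC scan into its fields (same layout as the parser)."""
    return {
        'rnw': 'R' if (tdi & 0x1) else 'W',   # bit 0: 1 = read, 0 = write
        'addr': (tdi & 0x6) << 1,             # bits 2:1 carry A[3:2] -> 0x0, 0x4, 0x8, 0xC
        'din': tdi >> 3,                      # bits 34:3 of TDI: data from the host
        'ack': tdo & 0x7,                     # bits 2:0 of TDO: acknowledge code
        'dout': tdo >> 3,                     # bits 34:3 of TDO: data from the target
    }


def encode_scan(data, addr, read):
    """Build the TDI value for one scan; the inverse of decode_scan for TDI."""
    return (data << 3) | ((addr >> 2) << 1) | (1 if read else 0)


tdi = encode_scan(0x50000000, 0x4, read=False)
fields = decode_scan(tdi, tdo=0x2)
print('{:#x} -> {}'.format(tdi, fields))
```

Round-tripping a known value like this is a cheap way to catch an off-by-one in the shifts before trusting the decoded capture.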

For APACC, my current level of understanding of the ARM MEM-AP registers is not solid enough to hard code the values I see in the TAR (transfer address register), so I keep things simple.

        elif ir == IR.APACC.value:
            if nBit == 35: 
                dout = TDO >> 3
                ack = TDO & 0x7
                din = TDI >> 3
                addr = (TDI & 0x6) << 1
                rnw = 'R' if (TDI & 0x1) else 'W'

                decoded = None
                # Assume this is a MEM-AP and decode
                if apbank_sel == 0:
                    if addr == 0:
                        decoded = 'CSW'
                    elif addr == 4:
                        decoded = 'TAR'
                    elif addr == 0xC:
                        decoded = 'DRW'
                elif apbank_sel == 0xf:
                    if addr == 4:
                        decoded = 'CFG'
                    elif addr == 8:
                        decoded = 'BASE'
                    elif addr == 0xC:
                        decoded = 'IDR'

                print('@{} {:#x} | {:x} {} | {} -> APACC -> {:#x} | {}'. \
                    format(timestamp, din, addr, decoded, rnw, dout, ack))
            else: print('@{} Unhandled {:#x} -> APACC -> {:#x}'.format(timestamp, TDI, TDO))
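
The nested if/elif chain can also be expressed as a table lookup, by combining the bank selection and the 2-bit address into one register offset. This is a sketch of that alternative in Python, not the script above; `MEM_AP_REGS` and `ap_register` are hypothetical names, and the table holds only the registers the parser already decodes.

```python
# Offset within the AP register space -> name, for the registers decoded above.
MEM_AP_REGS = {
    0x00: 'CSW', 0x04: 'TAR', 0x0C: 'DRW',
    0xF4: 'CFG', 0xF8: 'BASE', 0xFC: 'IDR',
}


def ap_register(apbank_sel, addr):
    """Combine the bank from SELECT with A[3:2] and look the register up."""
    offset = (apbank_sel << 4) | addr     # bank supplies bits 7:4, addr supplies bits 3:2
    return offset, MEM_AP_REGS.get(offset)


for bank, addr in [(0x0, 0x4), (0xF, 0xC), (0x1, 0x0)]:
    offset, name = ap_register(bank, addr)
    print('bank {:#x} addr {:#x} -> offset {:#04x} {}'.format(bank, addr, offset, name))
```

Unknown offsets come back as None, which matches how the parser prints an undecoded register.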

Finally, I can handle the IDCODE easily, so I just threw that in at the end:

        elif ir == IR.IDCODE.value:
            if nBit == 32: print('IDCODE -> {:#8x}'.format(TDO))
        else: # Hmm what is this?
            tap_state = TapState.Invalid.value

All in all, a simple parser!  Let's see if it's of any use.

Using the parser on an openocd session

The very first lines decoded with the parser on a session between my J-Link Ultra+ and RPi3 are:

@0.01815178 0x0 | 4 CTRL/STAT | R -> DPACC -> 0x0 | 2
@0.0181938 0x20 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2
@0.01823582 0x0 | 4 CTRL/STAT | R -> DPACC -> 0x0 | 2
@0.01827784 0x50000000 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2

According to my copy of ADI v5.2 section B.2.2 CTRL/STAT, Control/Status register, 0x20 is the STICKYERR bit; writing a 1'b1 to it clears that bit.  That makes sense, except that the bit was not set to begin with, so it is a complete waste of time.  Next, openocd writes 0x5 to the top nibble: 0x50000000 sets CSYSPWRUPREQ (bit 30) and CDBGPWRUPREQ (bit 28), which request power-up of the system and debug domains.  (The debug reset request, CDBGRSTREQ, is bit 26 and is not set here.)  A power-up request should be harmless if I just want to halt a running system, but I don't know yet whether I will run into a problem later.
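
The CTRL/STAT values in the trace above can be decoded bit by bit. A minimal Python sketch, assuming the single-bit flag positions from the ADIv5 CTRL/STAT register description (multi-bit fields such as TRNMODE and TRNCNT are left out, and `decode_ctrl_stat` is a hypothetical helper):

```python
# (bit position, flag name) for the single-bit CTRL/STAT flags (assumed from ADIv5).
CTRL_STAT_BITS = [
    (0, 'ORUNDETECT'), (1, 'STICKYORUN'), (4, 'STICKYCMP'), (5, 'STICKYERR'),
    (26, 'CDBGRSTREQ'), (27, 'CDBGRSTACK'),
    (28, 'CDBGPWRUPREQ'), (29, 'CDBGPWRUPACK'),
    (30, 'CSYSPWRUPREQ'), (31, 'CSYSPWRUPACK'),
]


def decode_ctrl_stat(value):
    """Return the names of the flags set in value, lowest bit first."""
    return [name for bit, name in CTRL_STAT_BITS if value & (1 << bit)]


for value in (0x20, 0x50000000):
    print('{:#010x}: {}'.format(value, ', '.join(decode_ctrl_stat(value))))
```

Feeding the values read back from the target through the same function shows which requests have been acknowledged.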

So it seems that, armed with tables of the various DP/AP registers, I can start to make sense of what openocd is requesting of the target.  I was therefore surprised to discover, just a few ms later, that openocd goes through a "ping" of potential APs in the address space: all 256 of them, by trying to read the CIDR (component ID register; the 1st register in AP register bank 0xF) of each of the possible 4 KB mappings in the base register.  Using the same parser, I saw that J-Link discovers all available AP components by reading the ROM table (it fails to use the discovered ROM table in an intelligent way, but that's another topic altogether).  Going through 256 possible APs takes J-Link Ultra+ about 133 ms at 100 kbps; it would take J-Link+ about 10x that duration (its inter-packet time is long for some reason).  It then queries the IDR of each possible AP component, only 8 of which are populated for the RPi3 (another ~130 ms wasted).

OpenOCD then goes through another round of unnecessary exchange with the target: it tries to unlock software access to the debug registers by writing the magic keys for each of the discovered components (the RPi's AP components are ROM table v9, which do not implement the software lock/unlock).  The waste is even worse, because after the discovery, openocd writes the same 0x50000000 request to CTRL/STAT (again), and then goes through the same discovery and software unlock mechanism it went through last time.

@0.32220612 0x20 | 4 CTRL/STAT | W -> DPACC -> 0xf0000001 | 2
@0.32224814 0x0 | 4 CTRL/STAT | R -> DPACC -> 0xa0000000 | 2
@0.32229016 0x50000000 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2