Nov 19, 2017

JTAG DAP parser

I have been trying to get a low level debug session going against my Raspberry Pi 3 using my J-Link debug probe.  If you Google for "J-Link Raspberry Pi", you will find success reported mostly for the original Raspberry Pi.  At first, I tried to use JLinkGDBServer and JLinkExe on my Ubuntu VM, but I haven't managed to write a working JLinkScript yet.  When even OpenOCD failed to connect to the target, I started digging into the root cause.  Following this page to enable JTAG on RPi's GPIO was relatively easy, as was exposing the copper for the JTAG's TRST, TDI, TDMS, TCLK, TDO lines and capturing the failed debug session on the logic analyzer.  Saleae already has the JTAG signal analyzer, so reading the raw bits going into and coming out of the JTAG scan is all relatively easy, as you can see below.
A small portion of a JTAG session between J-Link and Raspberry Pi 3 target, with JTAG enabled on RPi3's P8 header
But such low level exchange does not yield insight, so I started reading documents: ADI (ARM debug interface) 5.2, CoreSight specification 3.0, ARM Cortex A/R programmer's guide, and ARM Cortex-A7 TRM (technical reference manual), and I understood how the debug host controls the ARM CPU's debug subsystem by writing appropriate values to the DAP (debug access port) registers.  But to actually apply this understanding to my problem required rather painful mental bit-shifting, and repeatedly looking up the register definitions in the ADI and CoreSight specification.  So I saved the Saleae's JTAG capture to a CSV file, and wrote a Python script to do the low level heavy-lifting for me.

Parser

In the above trace, there are only 2 JTAG TAP (test access protocol) states that yield decodable value: Shift-IR (instruction register) and Shift-DR (data register).  All other transactions either lead up to this state, or move the state machine back to the starting state.  Roughly speaking, the target register is specified in Shift-IR state, and the data values for the specified registers are given in the Shift-DR state.  So unless the JTAG signals are bad (unlikely on a shipping HW like the Raspberry Pi), I can just focus on these 2 states and ignore most of the lines in the CSV emitted by Saleae Logic GUI.  I learned about the pandas Python library in a Udacity data science course on Supervised Learning: it will do much of the tabular data cleanup for me.  Although I can pip install pandas on my system, it was just easier to install Anaconda (version 2, to stay with Python 2.7) to a separate folder.  So my script begins by using that special python package that comes with anaconda2.

#!/anaconda2/bin/python
from enum import Enum
import sys
class TapState(Enum): Invalid, I, D = range(3)

tap_state = TapState.Invalid.value

Again, the reduction of the JTAG TAP states to I (instruction) and D (data) is a drastic simplification for this case, where I am only interested in the parsing the layer above the JTAG.

In the capture, there are 3 other registers that appear besides the DPACC (debug port access) and APACC (application port access), so I enumerate them.

class IR(Enum):
    ABORT = 8
    DPACC = 10
    APACC = 11
    IDCODE = 14
    BYPASS = 15

ir = IR.BYPASS.value

From my previous experience with SWD, I know about the trick that ARM plays with the SELECT register to map different registers to the limited size of register banks.  Here, 3 separate selections can happen independently, so I need 3 separate variables to maintain in a DAP transaction:

apsel = -1 # After PoR, APSEL unknown
apbank_sel = 0
dpbank_sel = 0

This simple script just takes 1 CSV file--which is separated with ';' rather than a comma.  pandas packages easily deals with it:

import pandas as pd
fn = sys.argv[1]
df = pd.read_csv(fn, sep=';', index_col=0)

Saleae emits the timestamp as the first column, and it is sometimes convenient to look up a packet by the timestamp, so I am specifying the 0th column as the index.

The very first exchange in a JTAG session is the JTAG scan: discovering how many JTAG devices are cascaded.  It's a complete waste for the normal case of just 1 device, but the long-ass sequence is still there; so I just drop it:

df = df[df['TDIBitCount'] < 100] # drop the JTAG scan

Next, I need to deal with pandas representation of CSV data: all numbers are floating point by default, and the rest are string.

df['TDOBitCount'] = df['TDOBitCount'].astype(int)
df['TDIBitCount'] = df['TDIBitCount'].astype(int)
df['TDI'] = df.TDI.apply(lambda x: int(x, 16))
df['TDO'] = df.TDO.apply(lambda x: int(x, 16))

Finally, I can iterate through the TAP Shift-IR and Shift-DR.  The first thing is to break out the items as separate variables, for legibility.  The last column is the number of bits output, which is the same as the number of bits input in all cases I've seen (JTAG seems to work like SPI), so it's safe to drop it.

for row in df.itertuples():
    timestamp, packet_type, TDI, TDO, nBit = row[:-1] # row[0] is timestamp

Since Shift-IR just sets the target register, handling that is straight-forward:

    if packet_type == 'Shift-IR':
        tap_state = TapState.I.value
        if (nBit == 4) and TDI in [IR.DPACC.value, IR.APACC.value, IR.IDCODE.value]:
            ir = TDI
        else: tap_state = TapState.Invalid.value # drop this packet

Shift-DR is far more complicated, but once again, I make a simplifying assumption that I am only interested in the DPACC or APACC.  In both cases, I am only interested in the standard TDI packet comprising of 32 bit data, 2 bit address, and 1 bit R/W indicator.

        if ir == IR.DPACC.value:
            if nBit == 35: 
                dout = TDO >> 3
                ack = TDO & 0x7
                din = TDI >> 3
                addr = (TDI & 0x6) << 1
                rnw = 'R' if (TDI & 0x1) else 'W'

                decoded = None
                # decode DAP reg
                if addr == 0: decoded = 'DPIDR'
                elif addr == 0x8:
                    apsel = din >> 24
                    apbank_sel = (din >> 4) & 0xF
                    dpbank_sel = din & 0xF
                    decoded = 'SELECT AP {:#x} APB {:#x} DPB {:#x}'.format(apsel, apbank_sel, dpbank_sel)
                elif addr == 0xC: decoded = 'RDBUFF'
                elif addr == 0x4: # act on dpbank_sel
                    if dpbank_sel == 0:
                        decoded = 'CTRL/STAT'
                    elif dpbank_sel == 1:
                        decoded = 'DLCR'
                    elif dpbank_sel == 2:
                        decoded = 'TARGETID'
                    elif dpbank_sel == 3:
                        decoded = 'DLPIDR'
                    elif dpbank_sel == 4:
                        decoded = 'EVENTSTAT'

                print('@{} {:#x} | {:x} {} | {} -> DPACC -> {:#x} | {}'. \
                    format(timestamp, din, addr, decoded, rnw, dout, ack))
            else: print('@{} Unhandled {:#x} -> DPACC -> {:#x}'.format(timestamp, TDI, TDO))

For APACC, my current level of understanding of the ARM MEM-AP registers are not solid enough to hard code the values I see in the TAR (target address register), so I keep things simple.

        elif ir == IR.APACC.value:
            if nBit == 35: 
                dout = TDO >> 3
                ack = TDO & 0x7
                din = TDI >> 3
                addr = (TDI & 0x6) << 1
                rnw = 'R' if (TDI & 0x1) else 'W'

                decoded = None
                # Assume this is a MEM-AP and decode
                if apbank_sel == 0:
                    if addr == 0:
                        decoded = 'CSW'
                    elif addr == 4:
                        decoded = 'TAR'
                    elif addr == 0xC:
                        decoded = 'DRW'
                elif apbank_sel == 0xf:
                    if addr == 4:
                        decoded = 'CFG'
                    elif addr == 8:
                        decoded = 'BASE'
                    elif addr == 0xC:
                        decoded = 'IDR'

                print('@{} {:#x} | {:x} {} | {} -> APACC -> {:#x} | {}'. \
                    format(timestamp, din, addr, decoded, rnw, dout, ack))
            else: print('@{} Unhandled {:#x} -> APACC -> {:#x}'.format(timestamp, TDI, TDO))

Finally, I can handle the IDCODE easily, so I just threw that in at the end:

        elif ir == IR.IDCODE.value:
            if nBit == 32: print('IDCODE -> {:#8x}'.format(TDO))
        else: # Hmm what is this?
            tap_state = TapState.Invalid.value

All in all, a simple parser!  Let's see if it's any useful.

Using the parser on openocd session

The very first line decoded with the parser on a session between my J-Link Ultra+ and RPi3 are:

@0.01815178 0x0 | 4 CTRL/STAT | R -> DPACC -> 0x0 | 2
@0.0181938 0x20 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2
@0.01823582 0x0 | 4 CTRL/STAT | R -> DPACC -> 0x0 | 2
@0.01827784 0x50000000 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2

According to my copy of ADI v5.2 section B.2.2 CTRL/STAT, Control/Status register, 0x20 is the STICKYERR bit; writing a 1'b1 to it clears that bit; makes sense except for the fact it was not set to begin with, so a complete waste of time.  Also, openocd is writing 0x0 to the upper 8 bits and then then writing 0x5 means it is requesting the system and debug subsystem reset.  This is actually not a good thing if I just want to halt a running system, so I don't know if I will run into a problem later.

So it seems that armed with tables of the various DP/AP registers, I can start to make sense of what openocd is requesting the target.  I was therefore surprised to discover--just a few ms later, that openocd goes through a "ping" of potential AP in the address space: all 256 of them, by trying to read the CIDR (component ID register; the 1st register in AP register bank 0xF) of each of the possible 4 KB mapping in the base register.  Using the same parser, I saw that J-Link discovers all available AP components by reading the ROM table (it fails to use the discovered ROM table in an intelligent way, but that's another topic altogether).  Going through 256 possible AP takes J-Link Ultra+ about 133 ms  at 100 kbps; it would take J-Link+ about 10x that duration (its inter-packet time is long for some reason).  It then queries the IDR of each possible AP component--only 8 of which are populated for the RPi3 (another ~130 ms wasted).  OpenOCD then goes through another round of unnecessary exchange with the target: it tries to unlock software access to the debug registers by writing the magic keys for each of the discovered components (the RPi's AP components are ROM table v9, which do not implement the software lock/unlock).  The waste is even worse, because after the discovery, openocd tries to reset the system and the debug subsystem (again), and then go through the same discovery and software unlock mechanism it went through last time.

@0.32220612 0x20 | 4 CTRL/STAT | W -> DPACC -> 0xf0000001 | 2
@0.32224814 0x0 | 4 CTRL/STAT | R -> DPACC -> 0xa0000000 | 2
@0.32229016 0x50000000 | 4 CTRL/STAT | W -> DPACC -> 0x0 | 2