Mar 31, 2015

State machine based Qt5 GUI on Zedboard

In a previous blog entry, I explored creating a minimal embedded Linux distribution containing the Qt5 framework, and writing and debugging a "Hello world" Qt GUI application.  Whenever possible, I write all my SW within an event-driven, hierarchical state machine framework called QP.  But since Qt is also an event-driven framework in its own right, meshing the two together is not straightforward.  When creating a WPF MVVM (model-view-view model) GUI application with state machines, I could update the WPF view model from a special active object (I called it the GuiStateMachine) in response to any update events (of interest to the GUI) from ALL other active objects.  Apparently, you cannot do that in Qt, because in the official Qt-QP integration example, the singleton GUI state machine runs in the Qt context.  So unlike in my WPF-QP integration, the events delivered to the GUI state machine (an active object, really) are transformed into Qt events and pushed through Qt's event delivery mechanism.  The Qt-QP reference application is available for mingw, but I cross-compile for the Zynq (ARM Cortex-A9), so I am going to modify the reference application for my situation.

Create the DPP Qt Widgets project

The reference application creates the QP Qt library first.  But on my system, one Qt GUI is the only application (I am an embedded SW engineer, not a desktop SW engineer), so I will not bother with a separate library, and just put all the code in 1 Qt Widgets application, in the qpcpp/examples/qt/arm/buildroot folder.

~/work/Dorking/QP/qpcpp/examples/qt$ mkdir -p arm/buildroot

Then in Qt Creator (the previous blog entry discussed how to get and install the Qt Creator FROM qt.io rather than as a Debian package)
  1. Click "New Project" button, and then choose the "Qt Widgets Application" template.
  2. Following the reference application example, I create a project called "dpp-gui" in the /mnt/work/Dorking/QP/qpcpp/examples/qt/arm/buildroot folder just created.
  3. Next, I choose the zedbr2 kit I created in the  previous blog entry.
  4. In a departure from the example, I create my GUI as a QMainWindow (vs. QDialog).  Also unlike the example, I WILL use the form.  But I will still call the main class "Gui", to follow the example.
Qt Creator can already build this empty main class, which is always a good first step.

Preprocessor include path and defines in qmake project file

At minimum, the project must include the QP include/, qep/source/, qf/source/, and the QP port folders.  Unlike in other IDEs, build variables like include paths are NOT a project property; I write them directly into the project (.pro) file in a text editor, using a qmake variable, like this:

QP_ROOT = ../../../../..

INCLUDEPATH += $$QP_ROOT/include \
    $$QP_ROOT/qep/source \
    $$QP_ROOT/qf/source \
    $$QP_ROOT/ports/qt




Qt itself has a state machine infrastructure, which is redundant for a QP state machine application, so I turn off the Qt's state machine feature in the qmake .pro file:

DEFINES += QT_NO_STATEMACHINE

Add sources to the project and tailor to my needs

QP platform independent sources

In Qt Creator, right click on Sources --> Add Existing Directory --> Browse to the qpcpp/qep/source/ folder --> Start Parsing, to expand the folder and unselect the unnecessary files, as shown below (I do not use FSM, only HSM):
I later learned that you can also include the header files, and Qt Creator will correctly pull them into the HEADERS variable, so qep_pkg.h should have been checked in the above screenshot.

I add qpcpp/qf/source folder similarly, without leaving out any files this time.

Note on updating to the QP 5 API

When copying examples written for QP API 4.5 or earlier, the following changes are required:
  • Delete the deprecated call to QS_RESET()
  • The QTimeEvt ctor now takes the owning active object as the 1st argument.  In C++, that would show up as the "this" pointer if the timer belongs to an active object.  In exchange, the armX method of QTimeEvt--which should be used instead of the postIn() method--does NOT take an active object anymore.
  • Q_NEW now takes ctor arguments, and calls the PLACEMENT new operator of the type being created (i.e. unlike the plain new, it does NOT hit the heap).  While this is great when a single process uses the memory pool, the virtual table pointer you get with the new operator is dangerous when the memory pool spans multiple processes (through shared memory)--as will be the case for me.  The danger lies in different compilers (or even different versions of the same compiler) possibly laying out the virtual table differently.  I decide to play it safe here and turn off QEvt's CTOR and VIRTUAL features in qep_port.h, as shown below (and pay the price of having to initialize the memory pool objects myself):
// don't define QEvent to avoid conflict with Qt
#define Q_NQEVENT    1

// do NOT provide QEvt constructors
#undef Q_EVT_CTOR

// do NOT provide QEvt virtual destructor
#undef Q_EVT_VIRTUAL
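
To make the trade-off concrete, here is a minimal sketch of what "initializing the memory pool objects myself" can look like once the constructor and the virtual destructor are switched off.  PlainEvt, PlainTableEvt, VirtualEvt and initTableEvt() are hypothetical stand-ins, NOT the QP classes.  The point is only that a non-polymorphic event carries no hidden virtual table pointer, so its bytes mean the same thing to every process that maps the pool.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for an event with the CTOR and VIRTUAL features off.
struct PlainEvt {
    std::uint16_t sig;      // signal number
    std::uint8_t  poolId;   // which pool the event came from
    std::uint8_t  refCtr;   // reference counter
};

// Hypothetical application event: the base fields plus one parameter.
struct PlainTableEvt : PlainEvt {
    std::uint8_t philoNum;
};

// For comparison only: a virtual destructor adds a hidden vtable pointer.
struct VirtualEvt {
    std::uint16_t sig;
    virtual ~VirtualEvt() {}
};

// No vtable pointer, and safe to copy byte-for-byte through shared memory.
static_assert(!std::is_polymorphic<PlainTableEvt>::value, "no vtable expected");
static_assert(std::is_trivially_copyable<PlainTableEvt>::value, "plain bytes expected");
static_assert(std::is_polymorphic<VirtualEvt>::value, "vtable expected");

// Without a constructor, every field must be set by hand after allocation.
inline void initTableEvt(PlainTableEvt &e, std::uint16_t sig, std::uint8_t philoNum) {
    e.sig = sig;
    e.poolId = 0;
    e.refCtr = 0;
    e.philoNum = philoNum;
}
```

The cost is exactly the one mentioned above: forgetting a field in the hand-written initializer is now my problem, not the compiler's.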

QP Qt port sources

Because Qt is multi-platform, the example QP port to mingw Qt still works for embedded ARM.  I just have to include the qpcpp/ports/qt/ folder, like I have done for the qep/ and qf/ folders above.  But since PixelLabel is only necessary for the fly-and-shoot example, I excluded it.

SOURCES += \...
$$QP_ROOT/ports/qt/guiapp.cpp \
$$QP_ROOT/ports/qt/qf_port.cpp


HEADERS += gui.h \
$$QP_ROOT/ports/qt/qep_port.h \
$$QP_ROOT/ports/qt/qf_port.h \
$$QP_ROOT/ports/qt/tickerthread.h \
$$QP_ROOT/ports/qt/aothread.h \
$$QP_ROOT/ports/qt/guiapp.h \
$$QP_ROOT/ports/qt/guiactive.h

Unlike in the example Qt integration on mingw, setting the stack size to 4 KB prevents the QThread from starting, so I commented out the call and let QThread use the default thread stack size for now.

   //thread->setStackSize(stkSize);

Application support files

The final step in mating QP to an application is to specify functions that QP calls for certain events (startup, onClockTick, onAssert, etc) and the application state machine calls (like updating the philosopher stats from the Table state machine).  Unlike the port files, which can theoretically be shared between different QP-Qt projects (again, I will only have 1), the application specific files are coupled to the application logic.  For the DPP application, dpp.h and the bsp header/source files are such files, so I add them to the first lines of SOURCES and HEADERS in the qmake pro file:

SOURCES += main.cpp gui.cpp bsp.cpp philo.cpp table.cpp \
...


HEADERS += gui.h bsp.h dpp.h \
...



dpp.h contains the application specific event class TableEvt.  To turn off the event polymorphism feature, I take in only the signal number in the TableEvt constructor.

When I examine bsp.cpp, I see that the philosopher states (THINKING/HUNGRY/EATING) are displayed with QPixmaps showing 3 different PNG files, and the table state (PAUSED/SERVING) is displayed with a text on a button.  The images for the philosopher states are in res folder,  pointed to by the gui.qrc (Qt resource) file.  So I add this file to the project (Add Existing File).  I also copied the entire res/ folder from the mingw example folder, so that when I click on one of the PNG files in the resource, I see the image in the Qt Creator, like this:

In the qmake pro file, the resource shows up like this:

RESOURCES += gui.qrc

To update the files to the latest QP API, I make the changes discussed above, in "Note on updating to the QP 5 API" section.

UI

Instead of just blindly copying the QDialog based UI from the example, I went through the trouble of copying the buttons and labels from the example UI to the QMainWindow based UI, all to preserve the possibility of using the top menu and the bottom status bars in the future.  In Qt Creator's Designer View, the UI looks like this:
Note that all widgets I copied are in the central widget; that is, the north, south, east, west widget areas do not exist.

I wire the signals emitted from the widgets to the 3 slots defined in gui.cpp constructor:

...
    QObject::connect(m_quitButton, SIGNAL(clicked()), this, SLOT(onQuit()));
    QObject::connect(m_pauseButton, SIGNAL(pressed()), this, SLOT(onPausePressed()));
    QObject::connect(m_pauseButton, SIGNAL(released()), this, SLOT(onPauseReleased()));
    QObject::connect(this, SIGNAL(finished(int)), this, SLOT(onQuit()));
    } // setupUi

The UI designer just lays out the widgets (and possibly statically connects signals to slots).  The code behind the UI is in gui.cpp, which I copied from the example.  After this step, my gui.cpp code is the same as the example, except for Gui parent being QMainWindow instead of QDialog.

State machines

The philosopher and the table state machines drive the application logic.  The Qt integration example has the 2 state machine implementations generated by the QM state charting tool, but I do NOT want to generate my code, so I copy philo.cpp and table.cpp from another example (examples/arm/vanilla/gnu/dpp-at91sam7s-ek) that does not yet use the new style of coding the state transition.  I also added these 2 files to the project.  But I later found out that weird crashes can occur if I update the GUI from a non-GUI thread.  Examples of the crash:

QObject::startTimer: Timers cannot be started from another thread
QBasicTimer::stop: Failed. Possibly trying to stop from a different thread
QObject::connect: Cannot queue arguments of type 'QTextBlock'
(Make sure 'QTextBlock' is registered using qRegisterMetaType().)

valgrind  --undef-value-errors=no --leak-check=yes dpp-gui > dpp_valgrind.txt 2>&1

I added the Desktop kit to the project, in the Projects toolbar icon, and reproduced the problem even on Ubuntu.  More errors:



QApplication: Object event filter cannot be in a different thread.
QWidget::repaint: Recursive repaint detected

This is why in the Qt integration example, the table active object is the ONLY active object that derives from the GuiQActive class, which is supplied in the port.

class Table : public QP::GuiQActive {
...
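
The underlying rule (only the GUI thread touches the widgets; everyone else hands work over through a queue) can be sketched without Qt or QP.  The GuiThreadQueue class below is a hypothetical stand-in for the GUI event loop's queue, written with the standard library only; it is NOT how the port implements GuiQActive, just an illustration of the hand-off idea.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a GUI event loop's queue (not Qt or QP code).
class GuiThreadQueue {
public:
    // May be called from any thread: only enqueues, never touches a widget.
    void post(std::function<void()> update) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pending.push_back(std::move(update));
    }

    // Must be called from the GUI thread only: runs the queued updates there
    // and returns how many were executed.
    std::size_t drain() {
        std::deque<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            batch.swap(m_pending);
        }
        for (auto &update : batch) {
            update();
        }
        return batch.size();
    }

private:
    std::mutex m_mutex;
    std::deque<std::function<void()>> m_pending;
};
```

A philosopher thread would call post() with a small closure describing the change, and the GUI thread would call drain() from its own loop, so the widget code never runs on a foreign thread.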

Application main

I copied main.cpp verbatim from the example, which gives the table GuiQActive object NO event queue (because events to the GUI go through the Qt event delivery mechanism).  So the following code snippet is correct:

    DPP::AO_Table->start((uint_fast8_t)(N_PHILO + 1),
                         //GuiQActive does not need event queue
                         //&l_tableQueueSto[0], Q_DIM(l_tableQueueSto),
                         (QP::QEvt const **)0, (uint32_t)0,
                         (void *)0, (uint_fast16_t)0);

Build and debug on the target

  1. Leveraging the hard work of setting up the cross-compile in the previous blog entry, I build the target ELF file easily by clicking on the build icon (the hammer).  The debug target is still only 2.3 MB on the disk.
  2. Following the workaround for the cross-debug not working, I copy the ELF file to the target's /root folder.
  3. I start the gdbserver on the copied app, specifying the mouse device (note that this application does NOT use the keyboard, but the keyboard device is event1)

    gdbserver localhost:1234 /root/dpp-gui -plugin evdevmouse:/dev/input/event0
  4. In Qt Creator, attach to the remote gdbserver (menu --> Debug --> Start Debugging --> Attach to Remote Debug Server), specifying the port and the ELF file, as you can see in this example:

I see 5 Homer icons happily taking turns eating, thinking, being hungry!

Mar 22, 2015

Porting qpcpp to Zynq AMP CPU1


QP is my favorite RTOS.  It's actually NOT an OS per se, but instead a state-machine framework.  But with a small amount of scheduling code (a very particular scheduler called a single-threaded run-to-completion (RTC) scheduler), a QP application with multiple "threads" (active objects, actually) runs in a clean and high performance manner.  I have been using this framework for over 12 years now, and have not yet seen a better SW framework--especially the QK framework for hard real-time code.

I started my Zynq study about 6 months ago, and wanted to run bare metal C/C++ on Zynq.  Its dual-core architecture is ideal for running a GPOS like Linux on CPU0, and bare metal C/C++ code on CPU1.  In previous blog entries, I figured out how to kick off a bare metal (more accurately, no OS) C application on CPU1 from Linux on CPU0, so the next logical step is to write a QK application for Zynq CPU1.

Although there is an application note for porting QP to bare metal ARM, it targets a small ARM CPU running by itself.  What I need is a QK port that starts from the code already in the DRAM (Linux on CPU0 puts the code there and resets CPU1), so much of the code in qpcpp's QK port to ARM GNU is unnecessary.

xsdk static library project for QK-C++

You can get QP through git:

git clone git://git.code.sf.net/p/qpc/qpcpp

HW independent code

At the root of the qpcpp folder, the include/, qf/, qep/, and qk/ have platform independent code.  A new port should go to ports/ folder.  I create this Zynq (Xilinx ARM) remoteproc (CPU1 kicked off from Linux on CPU0) port in ports/arm/qk/zynq_rproc, as shown in an example of Xilinx xsdk new static library project creation window.
xsdk (which is based on Eclipse CDT) creates the .cproject and .project files in the above folder.  To add QP files to the project, I create a LINKED folder (right click on the project --> New --> Folder --> Advanced) to the qf, qep, qk source folders, like this example:
There is 1 problem with this method of adding all QP folders through linked folder feature: I had to delete the file qvanilla.cpp, which contains the same implementations that are in qk.cpp--which is what I want.

The only HW independent code I have to change is the QP_API_VERSION in qp_port.h, which controls the degree of backward compatibility.  0 (the default) means maximum compatibility.  For a new project (such as this) without a backward compatibility requirement, set QP_API_VERSION to 999.

To avoid pulling in standard libraries, I had to supply the -fno-rtti and -fno-exceptions compiler options, both for the qk library and for the executable below.

The screenshot above also shows the compiler option "-ffunction-sections", which forces each function into its own section.  This allows the linker to place any section in a different memory region.  Together with the linker option "--gc-sections" (discussed when building the application ELF file below), this compiler option also leads to a smaller executable footprint.  BUT this option strips out the resource table static struct read by the Linux remoteproc module--the one that copies the ELF file into CPU1's memory, reboots CPU1, and kicks off the executable.  So do NOT include this option!

I also copy the code recommended by the "Bare metal ARM code development", to avoid dynamic memory allocation.

#include "qp_port.h"
#include <stdlib.h>                   // for prototypes of malloc() and free()

Q_DEFINE_THIS_FILE

extern "C" {
int __aeabi_atexit(void *object, void (*destructor)(void *), void *dso_handle)
{ return 0; }
//............................................................................
void __cxa_atexit(void (*arg1)(void *), void *arg2, void *arg3) {
}
//............................................................................
void __cxa_guard_acquire() {
}
//............................................................................
void __cxa_guard_release() {
}
//............................................................................
void *malloc(size_t) {
    Q_ERROR();
    return (void *)0;
}
//............................................................................
void free(void *) {
    Q_ERROR();
}
//............................................................................
void *calloc(size_t, size_t) {
    Q_ERROR();
    return (void *)0;
}
}

//............................................................................
void *operator new(size_t size) throw() { return malloc(size); }
//............................................................................
void operator delete(void *p) throw() { free(p); }

HW specific linker script and startup assembly code

To repeat, this port of QK is for the 2nd Zynq CPU placed into its own RAM area by Linux on CPU0.  I started from the bare metal ARM 7/9 port available from state-machine.com, but removed the code I do not need.  The linker script specifies the entry point into the application is the reset vector of the vector table.

Vector table

ENTRY(_vector_table)

This port will supply the standard ARM interrupt handlers, so the vector table is hard coded to the functions in the port in startup.s.  Technically speaking, a linker script is NOT part of the QK library; it is application specific.  But the vector table is tightly coupled to the QK port.

    .text
    .code 32

.section .vectors
_vector_table:
B _boot
B QF_undef
B QF_swi
B QF_pAbort
B QF_dAbort
B QF_reserved/* Placeholder for address exception vector*/
B QK_irq
B QF_fiq_dummy

.set vector_base, _vector_table

The ARM 7/9 port actually starts with an empty vector table until the RAM is remapped, and even then, the 1st level vector table merely branches indirectly to a 2nd level that is later filled in by QF::onStartup.  But as explained above, my code is placed straight into already initialized RAM, and the entire CPU1 is dedicated to this application.  So I decided to keep things simple by hard coding the vector table (without the indirection recommended in the bare metal ARM application note).  The qk_port.s section below will detail all the above interrupt handlers--except the reset vector (_boot), which I elaborate on right now.

Every vector except for the _boot and QK_irq vectors turns off interrupts and calls Q_onAssert(char* source, int linenum).  Here is an example for the FIQ vector, which this port of QP does NOT handle:

Csting_fiq:         .string  "FIQ dummy"

    .global QF_fiq_dummy
    .func   QF_fiq_dummy
    .align  3
QF_fiq_dummy:
    LDR     r0,=Csting_fiq
    B       QF_except
    .size   QF_fiq_dummy, . - QF_fiq_dummy
    .endfunc

C calling convention uses R0 and R1 as the 1st and the 2nd parameters:

    .global QF_except
    .func   QF_except
    .align  3
QF_except:
    /* r0 is set to the string with the exception name */
    SUB     r1,lr,#4            /* set line number to the exception address */
    MSR     cpsr_c,#(SYS_MODE | NO_IRQ | NO_FIQ) /* SYSTEM,IRQ/FIQ disabled */
    LDR     r12,=Q_onAssert
    MOV     lr,pc               /* store the return address */
    BX      r12                 /* call the assertion-handler (ARM/THUMB) */
    /* the assertion handler should not return, but in case it does
    * hang up the machine in this endless loop
    */
    B       .

_boot vector

The very first code out of reset is to allow only CPU1 to run.  Maybe I don't need this code, but I inherited this code from Xilinx BSP's boot.S.  Depending on the time of the day, I feel like this is a good thing to do:

.section .boot,"ax"
_prestart:
_boot: /* only allow cpu1 through */
mrc p15,0,r1,c0,c0,5
and r1, r1, #0xf
cmp r1, #1
beq OKToRun
EndlessLoop1:
wfe
b EndlessLoop1

There are CPU version specific workarounds to be applied.  The errata to pick up is defined in xil_errata.h, which I COPIED from the auto-generated BSP code.  For Zynq only ARM errata 742230 and 743622 apply.

Recall from my earlier blog on the zynq_remoteproc module (which controls the lifecycle of the CPU1 app) that CPU1 gets the 2 MB starting at 0x1fe00000.  So the vectors are laid down at that address.  But when an interrupt comes (for example an IRQ), how does CPU1 know to call QK_irq()?  The answer is the VBAR (vector base address register):

ldr r0, =vector_base
mcr p15, 0, r0, c12, c0, 0

MMU initialization

Next, the SCU (snoop control unit) is enabled and invalidated.  It must be referring to CPU1's interface to the SCU, because the SCU is shared between the 2 cores.  According to ARM bloggers, this, and writing the ACTLR.SMP bit, seem to be related to the cache coherency setting.

Next, I invalidate the I and D caches and the MMU TLB, and then disable the MMU--all through the ARM CP15 co-processor--before writing the MMU table.  This port being specialized for the 2nd CPU with its reserved DRAM range (510~512 MB), the MMU table is configured this way:
  • DRAM allotted to CPU0 is marked unassigned/reserved.  If CPU1 accesses this range, the HW will raise pAbort or dAbort.
  • The rest of the memory is marked non-shared, with L2 turned off (TEX=0b100) and L1 enabled (0b01: write-back, write-allocate).
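
As a rough illustration of those two kinds of entries, the sketch below builds 1 MB section descriptors using the bit positions of the ARMv7-A short-descriptor format.  The function names, the full-access permission (AP = 0b11) and domain 0 are assumptions of this sketch; they are not taken from my translation table.

```cpp
#include <cstdint>

constexpr std::uint32_t kSectionSize = 0x00100000u;  // 1 MB per first-level entry

// An entry whose bits [1:0] are 0b00 is invalid: any access through it faults.
constexpr std::uint32_t kFaultEntry = 0u;

// Section entry: outer (L2) non-cacheable, inner (L1) write-back write-allocate,
// non-shared (S bit left at 0). AP = 0b11 and domain 0 are assumed values.
constexpr std::uint32_t sectionEntry(std::uint32_t physAddr) {
    return (physAddr & 0xFFF00000u)  // section base address, bits [31:20]
         | (0x4u << 12)              // TEX = 0b100: outer policy 0b00, non-cacheable
         | (0x3u << 10)              // AP[1:0] = 0b11 (assumed: full access)
         | (0x0u << 3)               // C = 0, upper bit of the inner policy
         | (0x1u << 2)               // B = 1, so inner policy is 0b01
         | 0x2u;                     // bits [1:0] = 0b10: this is a section
}

// Index of the first-level table entry that covers an address.
constexpr std::uint32_t tableIndex(std::uint32_t addr) {
    return addr / kSectionSize;
}
```

With these helpers, the entries for CPU0's DRAM would simply be kFaultEntry, and the entry at tableIndex(0x1FE00000) would be the first one that CPU1 may actually use.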
The MMU and cache are enabled, again through the cp15 registers.  The syntax for the CP15 access is:

mcr p15, <op1>, r0, <target register 0; CRn>, <target register 1; CRm>, <op2>
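
To see how the operands of that syntax end up in the instruction word, here is a small sketch that encodes an MCR/MRC instruction (ARM state, condition "always") by hand.  encodeCp() and the two constants are names I made up for this illustration; the field positions follow the A32 coprocessor register transfer encoding.

```cpp
#include <cstdint>

// Encodes an ARM (A32) coprocessor register transfer with condition AL.
// toCore = false gives MCR (core register -> coprocessor), true gives MRC.
constexpr std::uint32_t encodeCp(bool toCore, unsigned coproc, unsigned op1,
                                 unsigned rt, unsigned crn, unsigned crm,
                                 unsigned op2) {
    return (0xEu << 28)                 // condition field: always
         | (0xEu << 24)                 // coprocessor register transfer class
         | ((op1 & 0x7u) << 21)
         | ((toCore ? 1u : 0u) << 20)   // L bit: 0 = MCR, 1 = MRC
         | ((crn & 0xFu) << 16)
         | ((rt & 0xFu) << 12)
         | ((coproc & 0xFu) << 8)
         | ((op2 & 0x7u) << 5)
         | (1u << 4)
         | (crm & 0xFu);
}

// mcr p15, 0, r0, c12, c0, 0   (the VBAR write used earlier)
constexpr std::uint32_t kWriteVbar = encodeCp(false, 15, 0, 0, 12, 0, 0);
// mrc p15, 0, r1, c0, c0, 5    (the CPU ID read at the top of _boot)
constexpr std::uint32_t kReadMpidr = encodeCp(true, 15, 0, 1, 0, 0, 5);
```

Comparing these words with the disassembly view is a quick way to double check that the assembler understood the operands in the order I intended.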

CPU feature initialization

Next, various features in the CPU are enabled, starting with the  FPU.  Perhaps one can be more optimal about this sort of thing, but if the application runs any control algorithm, floating point is probably inevitable.

.set FPEXC_EN, 0x40000000 /* FPU enable bit, (1 << 30) */
fmrx r1, FPEXC /* read the exception register */
orr r1,r1, #FPEXC_EN /* set VFP enable bit, leave the others in orig state */
fmxr FPEXC, r1 /* write back the exception register */

Branch prediction (the Xilinx BSP calls this flow prediction) and data/instruction prefetches are turned on, again through CP15.  I am not sure why the Xilinx BSP turns on the asynchronous abort exception...  And then, more CPU initialization, followed by the performance counter reset.

mov r2, #0x80000007 /* clear CCNT and Px overflow */
mcr p15, 0, r2, c9, c12, 3

MOV R1, #0 //select counter 0
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MOV R2, #0x10 //monitor branch miss
MCR p15, 0, R2, c9, c13, 1 //Write EVTSELx Register

MOV R1, #1 //select counter 1
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MOV R2, #0x01 //monitor instruction cache miss
MCR p15, 0, R2, c9, c13, 1 //Write EVTSELx Register

MOV R1, #2 //select counter 2
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MOV R2, #0x03 //monitor data cache miss
MCR p15, 0, R2, c9, c13, 1 //Write EVTSELx Register

//Control register
//D: 1/64 off!
//C: reset counter
//P: reset event counter
//E: enable all counters including CCNT
mov r2, #0xF //DCPE
mcr p15, 0, r2, c9, c12, 0
mov r2, #0x80000007 /* enable CCNT and Px */
mcr p15, 0, r2, c9, c12, 1

Using the performance counters, you can see all sorts of details!  This may be necessary to debug why CPU1 is running slower than CPU0.  I chose to look at branch misses, instruction cache misses, and data cache misses, as configured above.

Supporting C/C++

BSS section should be zeroed out before entering main().  This is an efficient memset(0):

    LDR     r1,=__bss_start__
    LDR     r2,=__bss_end__
    MOV     r3,#0
1:
    CMP     r1,r2
    STMLTIA r1!,{r3}
    BLT     1b

STM means "store multiple", and its syntax is STM{cond}{addressing mode}.  So in STMLTIA, LT means "less than" (the result of the comparison above), and IA means "increment after": the register in front of the "!" is incremented after each store.  In the debugger, the instructions actually show up as:

1fe001a0:   ldr     r1, [pc, #+252]
1fe001a4:   ldr     r2, [pc, #+252]
1fe001a8:   mov     r3, #0
1fe001ac:   cmp     r1, r2
1fe001b0:   stmlt   r1!, {r3}
1fe001b4:   blt     -1

__bss_start__ is stored at [pc, #252] = 0x1fe001a0 + 252 (260 actually, if I account for the PC reading 2 instructions ahead because of the pipeline).

QP uses just 1 stack (and no heap).  In the linker, the stack is defined in the RAM this way:

_STACK_SIZE = DEFINED(_STACK_SIZE) ? _STACK_SIZE : 0x2000;
_HEAP_SIZE = 0;
.stack (NOLOAD) : { /* Single stack application */
__stack_start__ = . ;
__irq_stack_top__ = .;
__fiq_stack_top__ = .;
__svc_stack_top__ = .;
__abt_stack_top__ = .;
__und_stack_top__ = .;

. += _STACK_SIZE;
. = ALIGN (4);
__c_stack_top__ = . ;
__stack_end__ = .;
} > ps7_ddr_0_S_AXI_BASEADDR

Note that there is actually NO stack for any mode except the user/system mode.

Similar to zeroing out the BSS, filling the stack with a special pattern (to detect stack overrun when debugging weird problems) uses the STMLTIA pseudo-instruction as well:

    LDR     r1,=__stack_start__
    LDR     r2,=__stack_end__
.equ    STACK_FILL,     0xAAAAAAAA
    LDR     r3,=STACK_FILL
1:
    CMP     r1,r2
    STMLTIA r1!,{r3}
    BLT     1b

In the disassembly window, this code shows up as

1fe001b8:   ldr     r1, [pc, #+236]
1fe001bc:   ldr     r2, [pc, #+236]
1fe001c0:   ldr     r3, [pc, #+236]
1fe001c4:   cmp     r1, r2
1fe001c8:   stmlt   r1!, {r3}
1fe001cc:   blt     -16

Wouldn't you feel reassured if you saw 0xAAAAAAAA at the PC (0x1fe001c0) + 236 + 8 (for pipelining)?  In the memory view, 0x1fe002b4 indeed has the STACK_FILL pattern!

0x1fe002ac : 0x1FE002AC <Hex Integer>
  Address   0 - 3     4 - 7     8 - B     C - F               
  1FE002A0  00001005  1FE0C05C  1FE0C3FC  1FE0C400          
  1FE002B0  1FE0E400  AAAAAAAA  1FE0E400  1FE03FE0          
  1FE002C0  1FE00424  000003FF  00007FFF  E92D4800          
  1FE002D0  E28DB004  E24DD008  E30E39FF  E50B3008          
  1FE002E0  E30004D2  EB00002D  E24BD004  E8BD8800          
  1FE002F0  E52DB004  E28DB000  E24DD00C  E1A03000          

The 2 words before the fill pattern (0x1fe0c400 and 0x1fe0e400) are the stack start and end addresses, which confirms that the stack is 0x2000 bytes big.
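
The address arithmetic used in the last few paragraphs can be written down once and checked against the disassembly and the memory view.  literalAddress() is a helper name I made up for this sketch; the only rule it encodes is that in ARM state the PC reads as the address of the current instruction plus 8.

```cpp
#include <cstdint>

// In ARM state the PC reads as the current instruction's address plus 8.
constexpr std::uint32_t literalAddress(std::uint32_t instrAddr, std::uint32_t offset) {
    return instrAddr + 8u + offset;
}

// Checks against the disassembly and the memory view shown above.
static_assert(literalAddress(0x1fe001a0u, 252) == 0x1fe002a4u, "__bss_start__ literal");
static_assert(literalAddress(0x1fe001b8u, 236) == 0x1fe002acu, "__stack_start__ literal");
static_assert(literalAddress(0x1fe001bcu, 236) == 0x1fe002b0u, "__stack_end__ literal");
static_assert(literalAddress(0x1fe001c0u, 236) == 0x1fe002b4u, "STACK_FILL literal");

// Stack size implied by the two stack literals read from the memory view.
constexpr std::uint32_t kStackBytes = 0x1fe0e400u - 0x1fe0c400u;
static_assert(kStackBytes == 0x2000u, "matches _STACK_SIZE in the linker script");
```

Each computed address lands on the word in the memory view that holds the expected value (0x1FE0C05C, 0x1FE0C400, 0x1FE0E400 and 0xAAAAAAAA, respectively).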

Since the ARM stack register (alias: sp) is banked, I have to be in the desired CPU mode (ARM has a total of 7 modes, but the user and system modes share the same registers) to set the stack pointer for each mode.  But the QK ARM port uses only the SYSTEM mode, so I only need the following code:

    MSR     CPSR_c,#(SYS_MODE | I_BIT | F_BIT)
    LDR     sp,=__c_stack_top__        /* set the C stack pointer */

Using the counters I initialized way above, I can measure the speed of the BSS and stack init:

MSR     CPSR_c,#(SYS_MODE | I_BIT | F_BIT)
LDR     sp,=__c_stack_top__                  /* set the C stack pointer */

//How long did BSS and stack init take?
MRC p15, 0, R0, c9, c13, 0//read the CCNT

MOV R1, #0 //select counter
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MRC p15, 0, R2, c9, c13, 2//read event count

MOV R1, #1 //select counter
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MRC p15, 0, R3, c9, c13, 2//read event count

MOV R1, #2 //select counter
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MRC p15, 0, R4, c9, c13, 2//read event count

Performance of initializing the stack

And the result (with the divide-by-64 mode turned off) is: R0: 5124, R2: 4, R3: 3, R4: 13.  This means that it took 5124 clock cycles to zero the 928 bytes/232 words of BSS (__bss_end__ - __bss_start__ = 0x1fe0c3fc - 0x1fe0c05c), write the stack fill pattern to 0x2000 (8K) bytes or 2K words, and set the stack pointer.  The BSS zeroing and stack filling loops each consist of 3 assembly instructions per word.  So 3 times (232 + 2048) iterations = 6840 instructions, which is LARGER than the CCNT count, showing the Cortex-A9's pipeline (which can issue more than one instruction per cycle) at work!  The instruction and data cache misses were 3 and 13, respectively, and there were 4 branch prediction misses.
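
The back-of-the-envelope numbers above can be reproduced in a few lines.  This is only a sketch of the arithmetic: it counts the loop instructions alone (ignoring the handful of setup instructions), and the constant names are mine.

```cpp
#include <cstdint>

constexpr std::uint32_t kBssWords   = (0x1fe0c3fcu - 0x1fe0c05cu) / 4u;  // BSS size in words
constexpr std::uint32_t kStackWords = 0x2000u / 4u;                      // stack size in words
constexpr std::uint32_t kLoopInstr  = 3u;      // CMP, STMLT, BLT per word written
constexpr std::uint32_t kInstructions = kLoopInstr * (kBssWords + kStackWords);
constexpr std::uint32_t kCycles = 5124u;       // CCNT value reported above

// Average number of loop instructions completed per clock cycle.
constexpr double instructionsPerCycle() {
    return static_cast<double>(kInstructions) / kCycles;
}
```

The ratio comes out at roughly 1.3 instructions per cycle, which is only possible if more than one instruction completes in some cycles.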

Calling the C/C++ static initializers

__libc_init_array(), normally supplied in libc, calls the constructors of the classes declared globally.

    LDR     r12,=__libc_init_array //in libc
    MOV     lr,pc
    BX      r12

Since I did not want to link against Xilinx BSP libc, I wrote my own function shown way above.

Jump to main()

Since my embedded application will never shutdown, I did not even bother calling the destructors for the global instances, to undo what I did in __libc_init_array.

    LDR     r12,=main //call int main(void)
    MOV     lr,pc           /* set the return address */
    BX      r12             /* the target code can be ARM or THUMB */
    //bl __libc_fini_array /* Cleanup global constructors */
_forever_: b _forever_ //End of the world!

BSP_init(): HW specific initialization

BSP_init() is responsible for initializing all HW the application will use, so it is of course application dependent.  BUT, there is some HW that ALL real-time embedded applications use: a timer interrupt and a GPIO LED.  So I set up these 2 pieces of HW here, and then perhaps come back to more application specific HW--such as watchdog, PWM timer, and I2C--in another blog.

Setup HB GPIO LED

Linux kernel source to enable GPIO clock

In case the GPIO HW clock is not yet enabled, I turn it on here.  The GPIO clock is defined in the DTS:

                gpio: gpio@e000a000 {
                        compatible = "xlnx,ps7-gpio-1.00.a", "xlnx,zynq-gpio-1.00.a", "xlnx,zynq-gpio-1.0";
                        reg = <0xe000a000 0x1000>;
                        interrupts = <0 20 4>;
                        interrupt-parent = <&gic>;
                        clocks = <&clkc 42>;
                        gpio-controller;
                        #gpio-cells = <2>;
                        interrupt-controller;
                        #interrupt-cells = <2>;
                };

<&clkc 42> points to the "gpio_aper" clock, the entry at index 42 of clock-output-names:

                slcr: slcr@f8000000 {
                        #address-cells = <1>;
                        #size-cells = <1>;
                        compatible = "xlnx,zynq-slcr", "syscon";
                        reg = <0xf8000000 0x1000>;
                        ranges ;
                        clkc: clkc {
                                #clock-cells = <1>;
                                clock-output-names = "armpll", "ddrpll", "iopll", "cpu_6or4x", "cpu_3or2x",
                                        "cpu_2x", "cpu_1x", "ddr2x", "ddr3x", "dci",
                                        "lqspi", "smc", "pcap", "gem0", "gem1",
                                        "fclk0", "fclk1", "fclk2", "fclk3", "can0",
                                        "can1", "sdio0", "sdio1", "uart0", "uart1",
                                        "spi0", "spi1", "dma", "usb0_aper", "usb1_aper",
                                        "gem0_aper", "gem1_aper", "sdio0_aper", "sdio1_aper", "spi0_aper",
                                        "spi1_aper", "can0_aper", "can1_aper", "i2c0_aper", "i2c1_aper",
                                        "uart0_aper", "uart1_aper", "gpio_aper", "lqspi_aper", "smc_aper",
                                        "swdt", "dbg_trc", "dbg_apb";
                                compatible = "xlnx,ps7-clkc";
                                ps-clk-frequency = <33333333>;
                                fclk-enable = <0xf>;
                                reg = <0x100 0x100>;
                        };
                };

Kernel code for Zynq clock is <>/drivers/clk/zynq/clk.c.  The gpio_aper clock's bit index is 22 among the aper clocks.  These are the peripheral clocks driven at the CPU_1x rate, and they are controlled by the register at offset 0x2C from the Zynq clock base 0xF8000100 (= 0xF8000000 + 0x100, as defined in the DTS above), which is described in the Zynq TRM Table 25-2.  Enabling the clock simply means setting the bit (22 in this case) in that clock control register, like this:

    //Turn on the GPIO clock (in case it is off)
#define ZYNQ_APER_CLK_CTRL 0xF800012C
    *(volatile uint32_t*)ZYNQ_APER_CLK_CTRL |= 1 << 22;
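
For the record, here is how the register address and the bit mask fall out of the numbers quoted above.  It is a sketch of the arithmetic only (nothing here touches real HW), and the constant and function names are made up for the illustration.

```cpp
#include <cstdint>

constexpr std::uint32_t kSlcrBase       = 0xF8000000u;  // slcr "reg" in the DTS
constexpr std::uint32_t kClkcOffset     = 0x100u;       // clkc "reg" in the DTS
constexpr std::uint32_t kAperClkCtrlOff = 0x2Cu;        // register offset quoted above
constexpr std::uint32_t kGpioAperBit    = 22u;          // gpio_aper bit index

constexpr std::uint32_t kAperClkCtrlAddr = kSlcrBase + kClkcOffset + kAperClkCtrlOff;
static_assert(kAperClkCtrlAddr == 0xF800012Cu, "matches ZYNQ_APER_CLK_CTRL");

// Returns the register value with the given clock enabled; other bits are kept.
constexpr std::uint32_t withClockEnabled(std::uint32_t regValue, std::uint32_t bit) {
    return regValue | (1u << bit);
}
```

Using a read-modify-write (the |= in the snippet above) matters here, because the same register also gates the clocks of the peripherals that Linux on CPU0 is using.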

Configure GPIO

On Zynq, all GPIO is configured for interrupt enable out of reset.  The 128 possible such interrupt sources can be quenched by writing to the GPIO interrupt disable registers, which are not contiguous, but spread in blocks, one per bank, across the 4 banks of GPIOs.

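#define ZYNQ_GPIO_BASE_ADDR 0xe000a000
#define ZYNQ_GPIO_BANK_CTRL_OFFSET 0x40
// assumed: INT_DIS_0 sits at offset 0x214 from the GPIO base (check the TRM)
#define ZYNQ_GPIO_INT_DIS_ADDR (ZYNQ_GPIO_BASE_ADDR + 0x214)

    for (int i = 0; i < 4; ++i)
        *(volatile uint32_t*)
            (ZYNQ_GPIO_INT_DIS_ADDR + i * ZYNQ_GPIO_BANK_CTRL_OFFSET) = ~0UL;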

A GPIO pin is an input on PoR.  To write to that pin, I have to set the direction and the output enable registers.  It is slightly complicated by the non-linear mapping of a logical GPIO number to the GPIO banks, as in this example for the GPIO07--the only LED addressable from the CPU on the Zedboard:

uint8_t const BSP_HB_LED_GPIO = static_cast<uint32_t>(7);

    uint8_t gpio_bank, gpio_pin;
    XGpioPs_GetBankPin(BSP_HB_LED_GPIO, &gpio_bank, &gpio_pin);//map
    *(volatile uint32_t*)(gpio_bank * ZYNQ_GPIO_BANK_CTRL_OFFSET +
        ZYNQ_GPIO_DIRM_ADDR) = 1 << gpio_pin;
    *(volatile uint32_t*)(gpio_bank * ZYNQ_GPIO_BANK_CTRL_OFFSET +
        ZYNQ_GPIO_OUTEN_ADDR) = 1 << gpio_pin;

The non-linear mapping is handled with a helper function:

static inline void XGpioPs_GetBankPin(uint8_t PinNumber,
        uint8_t *BankNumber, uint8_t *PinNumberInBank) {
    //Find the first bank whose last pin number is >= PinNumber
    for (*BankNumber = 0; *BankNumber < 4; (*BankNumber)++)
        if (PinNumber <= XGpioPsPinTable[*BankNumber])
            break;

    if (*BankNumber == 0) {
        *PinNumberInBank = PinNumber;
    } else {
        //Offset from the first pin of this bank
        *PinNumberInBank = PinNumber %
            (XGpioPsPinTable[*BankNumber - 1] + 1);
    }
}
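
XGpioPsPinTable is not shown above.  To try the mapping on a desktop, here is a minimal sketch that assumes the table holds the last logical pin number of each bank (31, 53, 85, 117), which is my reading of the Xilinx standalone GPIO driver.  BankPin and mapGpio are names made up for this illustration, and the sketch uses a subtraction where the helper uses a modulo (the two agree for every valid pin).

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: last logical pin number of each of the 4 banks
static const uint8_t kLastPinInBank[4] = {31, 53, 85, 117};

struct BankPin {
    uint8_t bank;  // 0..3
    uint8_t pin;   // pin index inside that bank
};

// Hypothetical helper: same mapping as XGpioPs_GetBankPin, returned by value
inline BankPin mapGpio(uint8_t pinNumber) {
    assert(pinNumber <= kLastPinInBank[3]);
    uint8_t bank = 0;
    while (bank < 3 && pinNumber > kLastPinInBank[bank]) {
        ++bank;
    }
    // First logical pin of the bank found above
    uint8_t const first = (bank == 0) ? uint8_t{0}
        : static_cast<uint8_t>(kLastPinInBank[bank - 1] + 1);
    return BankPin{bank, static_cast<uint8_t>(pinNumber - first)};
}
```

Under that assumption the heartbeat LED (GPIO 7) maps to bank 0, pin 7, while GPIO 54 maps to bank 2, pin 0.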

During bringup, BSP_init may hang (e.g. when performing a self-test of a serial peripheral).  I have found it convenient to light up the heartbeat LED as early as possible and keep it lit until BSP_init() is done.  Writing to a GPIO just means setting the correct bit in either the high or the low data register, each of which handles only 16 GPIO pins:

void BSP_writeGPIO(uint8_t pin, bool on) {
    uint8_t gpio_bank, gpio_pin;
    volatile uint32_t* data_reg;
    XGpioPs_GetBankPin(pin, &gpio_bank, &gpio_pin);//map

    if(gpio_pin > 15) {
        gpio_pin -= 16;
        data_reg = (volatile uint32_t*)
            (gpio_bank * ZYNQ_GPIO_BANK_DATA_OFFSET +
             ZYNQ_GPIO_DATA_HI16_ADDR);
    } else {
        data_reg = (volatile uint32_t*)
            (gpio_bank * ZYNQ_GPIO_BANK_DATA_OFFSET +
             ZYNQ_GPIO_DATA_LO16_ADDR);
    }

    //mask (upper 16 bits) shields other pins from the 0s in the data
    *data_reg = ~(1 << (gpio_pin+16))
        & (((on & 1) << gpio_pin) | 0xFFFF0000);
}
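
The last statement is the least obvious part, so here is the same expression pulled out into a small function that can be run anywhere.  The upper 16 bits act as a per-pin write mask (a 0 lets the matching data bit through) and the lower 16 bits carry the data.  maskDataWord is a made-up name for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the word BSP_writeGPIO stores for one pin of a 16-pin half
inline uint32_t maskDataWord(uint8_t pinInHalf, bool on) {
    assert(pinInHalf < 16);
    // All ones except the mask bit of the pin being written
    uint32_t const mask = ~(uint32_t{1} << (pinInHalf + 16));
    // Data bit of the pin, with every mask bit set so other pins stay shielded
    uint32_t const data = (static_cast<uint32_t>(on) << pinInHalf) | 0xFFFF0000u;
    return mask & data;
}
```

For pin 7 this gives 0xFF7F0080 to turn the pin on and 0xFF7F0000 to turn it off: only mask bit 23 is cleared, so only pin 7 is updated.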

I turn on the LED before starting the rest of HW initialization--which in the simplest case will be just the timer interrupt.

BSP_writeGPIO(BSP_HB_LED_GPIO, true); //light up the LED ASAP

Setup timer interrupt and the GIC (the interrupt controller)

GIC has complicated rules about the secure/non-secure interrupts.  I do NOT use the security feature of the HW, so I just copied the XSDK BSP generated example:

#define XPAR_PS7_SCUGIC_0_BASEADDR      0xF8F00100 //CPU base addr
#define XSCUGIC_CPU_CONTROL_OFFSET             0x0
#define XSCUGIC_CPU_PRIOR_OFFSET               0x4

    *(volatile uint32_t*)//See ICCICR
        (XPAR_PS7_SCUGIC_0_BASEADDR + XSCUGIC_CPU_CONTROL_OFFSET) =
            1 << 2 | 1 << 1 | 1;

The GIC identifies each of up to 96 different interrupts it services with an interrupt ID.  The private timer interrupt ID is 29.

enum InterruptId { //These are all interrupts I care to handle
    INT_ID_PRIVATE_TIMER = 29 //priority? ICDIPR and ICDIPTR
};

I have to tell the GIC distributor that I want this interrupt (and any others I care to receive in the future):

#define XPAR_PS7_SCUGIC_0_DIST_BASEADDR 0xF8F01000
#define XSCUGIC_ENABLE_SET_OFFSET            0x100//ICDISER0,ICDISER1,ICDISER2
    *(volatile uint32_t*)//See ICDISER0: for intID 0~31
        (XPAR_PS7_SCUGIC_0_DIST_BASEADDR + XSCUGIC_ENABLE_SET_OFFSET + 0) =
            1 << (INT_ID_PRIVATE_TIMER % 32);
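
For interrupt IDs above 31 the same idea applies with the next enable-set register.  A minimal sketch of that arithmetic, assuming each ICDISERn register covers 32 consecutive IDs and the registers are 4 bytes apart; EnableSet and enableSetFor are names made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

struct EnableSet {
    uint32_t offset;  // offset from the distributor base address
    uint32_t mask;    // bit to write into that register
};

// Hypothetical helper: which ICDISERn register and bit enable a given interrupt ID
inline EnableSet enableSetFor(unsigned intId) {
    assert(intId < 96);  // the text says the GIC services up to 96 interrupts
    return EnableSet{0x100u + 4u * (intId / 32u),
                     uint32_t{1} << (intId % 32u)};
}
```

For the private timer (ID 29) this yields offset 0x100 and mask 0x20000000, matching the write above.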

Then I configure the private timer HW.  If I want to receive a timer interrupt with a 1 second period, I have to set the timer load value to be 1 less than the private timer clock frequency, which (according to the TRM) is HALF of the CPU frequency.  The definitive source for the CPU clock frequency is the Vivado PS7 config wizard, as you can see here:
I wrote half of the actual CPU clock rate (less 1) to the SCU timer load register like this:

#define CPU3x2xHZ (666666687 / 2)
    *(volatile uint32_t*)//See Private_Timer_Load_Register in Zynq TRM
        (XPAR_PS7_SCUTIMER_0_BASEADDR + XSCUTIMER_LOAD_OFFSET) =
            CPU3x2xHZ - 1;
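
If a faster tick is needed later, the load value follows from the private timer's interval formula, period = (prescaler + 1) * (load + 1) / timer clock.  A small sketch solving that for the load value; timerLoad is a made-up helper, and the prescaler is 0 in the setup described here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: load value that produces ticksPerSec interrupts per second.
// Integer division truncates, so the real rate can be slightly off.
inline uint32_t timerLoad(uint32_t timerClkHz, uint32_t prescaler,
                          uint32_t ticksPerSec) {
    assert(ticksPerSec > 0);
    uint32_t const countsPerTick = timerClkHz / ((prescaler + 1) * ticksPerSec);
    assert(countsPerTick > 0);  // requested rate must not exceed the counting rate
    return countsPerTick - 1;
}
```

With the 666666687 / 2 Hz timer clock used above, a 1 Hz tick gives 333333342 (the value written above) and a 1 kHz tick would give 333332.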

Finally, I started the timer through the control register:

    *(volatile uint32_t*)//See Private_Timer_Control_Register Zynq TRM
        (XPAR_PS7_SCUTIMER_0_BASEADDR + XSCUTIMER_CONTROL_OFFSET) =
            1 << 2 | //interrupt enable
            1 << 1 | //auto-reload
            1;       //Enable

With HW initialization complete, I turn off the heartbeat LED on my way out of BSP_init().

    BSP_writeGPIO(BSP_HB_LED_GPIO, false); //turn off the LED

HW specific interrupt locking/unlocking in qf_port.h

The maximum number of active objects and the maximum number of event pools just size static variables; these values are adequate for most applications:

#define QF_MAX_ACTIVE               32//Should be enough for most
#define QF_MAX_EPOOL 6 // The maximum number of event pools in the application

The most important decision when porting QP is the interrupt locking policy.  Since Zynq has a prioritized interrupt controller (GIC), I can use the simple policy of unconditionally locking and unlocking interrupts while still retaining the nested interrupt feature (because the GIC takes the responsibility for holding back the interrupts with the same or lower priority than the currently asserted interrupt).  Understanding the interrupt enabling/disabling code is easier with section 9.2.3.1 of the ARM System Developer's Guide and the CPSR register:
The CPSR_c pseudo register name allows me to write only bits [7:0] (the control field) of CPSR.  Bits 7 and 6 of the CPSR register are the IRQ and FIQ disable bits (1 means masked).  The interrupt controller is connected to the IRQ line but NOT the FIQ line, so I will try hard to avoid FIQ in this port.

#define QF_INT_DISABLE()   \
    __asm volatile ("MSR cpsr_c,#(0x1F | 0x80)" ::: "cc")

#define QF_INT_ENABLE() \
    __asm volatile ("MSR cpsr_c,#(0x1F)" ::: "cc")
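
The two immediates are easier to read once the control byte is broken into its fields: bits [4:0] select the mode (0x1F is SYSTEM), bit 6 masks FIQ and bit 7 masks IRQ.  A small desktop sketch of that layout; cpsrControl, irqMasked and fiqMasked are names made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kModeSys = 0x1F;  // SYSTEM mode, bits [4:0]
constexpr uint8_t kFiqMask = 0x40;  // F bit (bit 6): 1 masks FIQ
constexpr uint8_t kIrqMask = 0x80;  // I bit (bit 7): 1 masks IRQ

// Hypothetical helper: build the byte written through cpsr_c (T bit left at 0)
inline uint8_t cpsrControl(uint8_t mode, bool maskIrq, bool maskFiq) {
    assert(mode <= 0x1F);
    return static_cast<uint8_t>(mode | (maskIrq ? kIrqMask : 0)
                                     | (maskFiq ? kFiqMask : 0));
}

// Hypothetical helpers: read the mask bits back out of a CPSR value
inline bool irqMasked(uint32_t cpsr) { return (cpsr & kIrqMask) != 0; }
inline bool fiqMasked(uint32_t cpsr) { return (cpsr & kFiqMask) != 0; }
```

So QF_INT_DISABLE() writes 0x9F (SYSTEM mode, IRQ masked, FIQ left alone) and QF_INT_ENABLE() writes 0x1F.  Both macros also force SYSTEM mode, so they only make sense in code that is meant to run in that mode.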

BUT to allow for the possibility of nested critical sections, QP saves the interrupt status when entering a critical section.  Showing just the ARM case, selected with the __arm__ preprocessor define (I did NOT bother with THUMB code):

#define QF_CRIT_STAT_TYPE       unsigned int
#define QF_CRIT_ENTRY(stat_)    do { \
    __asm volatile ("MRS %0,cpsr" : "=r" (stat_) :: "cc"); \
    QF_INT_DISABLE(); \
} while (0)
#define QF_CRIT_EXIT(stat_) \
    __asm volatile ("MSR cpsr_c,%0" :: "r" (stat_) : "cc")

Note that it is OK to pass a full 32-bit value to CPSR_c: the _c field mask limits the write to bits [7:0], so the top 24 bits of the register are ignored.

#define QF_LOG2(n_) ((uint8_t)(32U - __builtin_clz(n_)))
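
The QF_LOG2 macro above is shown without comment.  It returns the 1-based position of the highest set bit, presumably so that QP can find the highest ready priority in one step.  A sketch of the same expression as a function, for experimenting on a desktop with GCC or Clang; log2Plus1 is a made-up name.

```cpp
#include <cassert>
#include <cstdint>

// Same expression as QF_LOG2: 1-based index of the most significant set bit
inline uint8_t log2Plus1(uint32_t n) {
    assert(n != 0);  // __builtin_clz(0) is undefined
    return static_cast<uint8_t>(32U - static_cast<unsigned>(__builtin_clz(n)));
}
```

For example, 0x80 (bit 7 set) yields 8, and 1 yields 1.  The macro itself has no guard against 0, so the caller has to make sure the argument is non-zero.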

The QF or QK port has to provide the various ARM exception handlers, which are forward declared here, so that the vector table can point to the assembly implementation (in qk_port.s):

extern "C" {
    void QF_reset(void);
    void QF_undef(void);
    void QF_swi(void);
    void QF_pAbort(void);
    void QF_dAbort(void);
    void QF_reserved(void);
    void QF_fiq_dummy(void);
}

HW specific interrupt handling in qk_port.s

Earlier, when explaining the startup code, I covered the _boot vector and all the other exception handlers, which are expected never to be used.  ALL legitimate interrupt handling is done in QK_irq, which I copied straight from the qpcpp qk ARM 7/9 port.

QK IRQ interrupt handler wrapper

QK_irq is just a thin wrapper that keeps track of the nested interrupt count and calls the QK scheduler; the actual interrupt handling is done within BSP_irq.  qk_port.h forward declares them:

extern "C" {
    void QK_irq(void);
    void BSP_irq(void);
}

The wrapper works like this (when reading below, remember that R13: stack pointer, R14: link register (return address), R15: PC):
  1. Save the SYSTEM context ({R0-R3, R12, R14, PC, SPSR}, complying with the ARM v7-M interrupt stack frame) onto the SYSTEM stack, and change back to the SYSTEM mode.
    1. Set aside R0 and R1 of the SYSTEM context, and load the return address and the SPSR (the saved program status register) into R0 and R1.
    2. Disable IRQ and change back to the SYSTEM mode.
    3. Push R0 and R1 to the SYSTEM stack (because we are in the SYSTEM mode now).
    4. Push the general purpose registers allowed to be modified by the AAPCS (ARM architecture procedure call standard) to the stack.
    5. Remember the new stack pointer.
    6. Change back to IRQ mode and save the SYSTEM R0, R1 (which were set aside in step #1 above) into the stack.
  2. Increment QK_intNest_, which keeps track of the nested interrupt level
  3. Run the C interrupt handler (BSP_irq).  IRQ should be disabled at this point, but QK
  4. Decrement QK_intNest_.  If it comes down to 0, run the event checker (QK_schedPrio_) function, which will return the priority of the active object with an event pending.  If that priority is NOT zero, run the scheduler (QK_sched_).
  5. Restore context and change back to IRQ mode
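
Steps 2 to 4 are the part that decides when the scheduler runs, and they can be modeled in a few lines.  This is only a desktop sketch of the bookkeeping, not the assembly in qk_port.s: every name below is a made-up stand-in for the QK internals mentioned in the list.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the QK internals named in the list above
uint8_t intNest = 0;     // role of QK_intNest_
uint8_t readyPrio = 0;   // what the event checker would report
int schedulerRuns = 0;   // how many times the stand-in scheduler ran

uint8_t schedPrio() { return readyPrio; }                 // role of QK_schedPrio_
void sched(uint8_t) { ++schedulerRuns; readyPrio = 0; }   // role of QK_sched_

// Steps 2 to 4 of the wrapper; handler plays the role of BSP_irq
void irqWrapper(void (*handler)()) {
    ++intNest;
    handler();
    assert(intNest > 0);
    --intNest;
    if (intNest == 0) {  // only the outermost interrupt runs the scheduler
        uint8_t const p = schedPrio();
        if (p != 0) {
            sched(p);
        }
    }
}

// Example handlers: one does nothing, one makes an active object ready,
// one is preempted by a nested interrupt while it runs
void idleHandler() {}
void postingHandler() { readyPrio = 3; }
void nestedHandler() { readyPrio = 2; irqWrapper(idleHandler); }
```

In the nested case the inner call leaves the count at 1 and returns without scheduling; the scheduler runs once, when the outer interrupt brings the count back to 0.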

Minimal BSP_irq() for QP: handle the timer tick

Please recall from above that the QK_irq() wrapper does NOT acknowledge the interrupt to the HW.  So BSP_irq reads the currently pending highest priority interrupt (as decided by the GIC), and acknowledges the interrupt to both the GIC and the HW that generated it.  If I am only interested in the private timer interrupt, this code does exactly that:

void BSP_irq(void) {
    /*
     * Read the int_ack register to identify the highest priority interrupt ID
     * and make sure it is valid. Reading Int_Ack will clear the interrupt
     * in the GIC.
     */
#define XSCUGIC_INT_ACK_OFFSET 0xC
    uint32_t intAck = *(volatile uint32_t*)//See ICCIAR register in Zynq TRM
        (XPAR_PS7_SCUGIC_0_BASEADDR + XSCUGIC_INT_ACK_OFFSET);
#define XSCUGIC_ACK_INTID_MASK 0x3FF
    uint32_t intID = intAck & XSCUGIC_ACK_INTID_MASK;

    QF_INT_ENABLE(); // allow nesting interrupts
    switch(intID) {
    case INT_ID_PRIVATE_TIMER:
        if(*(volatile uint32_t*)//See Private_Timer_Interrupt_Status_Register
            (XPAR_PS7_SCUTIMER_0_BASEADDR + XSCUTIMER_ISR_OFFSET)
            & 0x1) {
            *(volatile uint32_t*)// clear interrupt source
                (XPAR_PS7_SCUTIMER_0_BASEADDR + XSCUTIMER_ISR_OFFSET) = 1;
            DPP::BSP_writeGPIO(BSP_HB_LED_GPIO, BSP_HB_LED_on = !BSP_HB_LED_on);
            QP::QF::TICK(&l_ISR_tick);
        }
        break;
    default: break;
    }
    QF_INT_DISABLE();// disable IRQ before return

#define XSCUGIC_EOI_OFFSET 0x10
    *(volatile uint32_t*)//See ICCEOIR register in Zynq TRM
        (XPAR_PS7_SCUGIC_0_BASEADDR + XSCUGIC_EOI_OFFSET) = intAck;
}

DPP example application on Zynq CPU1

Since the DPP application just needs a timer interrupt and an LED to indicate the tick event firing, I can just use the main.cpp, philo.cpp, and table.cpp from the reference ARM 7/9 port I downloaded.

I created a new Xilinx C++ project (right click in xsdk Project Explorer --> New --> Project --> Xilinx --> Application Project), and specified the project location, Processor, Language, and the board support package, as shown below:
I chose to create a Xilinx standalone application and THEN remove the dependence on the BSP, but perhaps it would have been more straightforward just to create a vanilla C++ application; the only difference between a Xilinx application and a C++ application is that the linker command is given an extra option (-T) to use the linker script in the project.  Removing the dependence on the Xilinx standalone BSP requires a few steps explained below.

A QP application project must include the QP include files and the QP port include files.  Since I use the xsdk generated BSP, the project generator also put in the standalone BSP include folder, as you see below:
Remove the BSP include folder (above the /qk/include in the above screenshot) from the include folder.

Similarly, the project will link against the qk library just built above.  Since the library location is build config dependent, I use an Eclipse variable ${ConfigName}, as shown below:
Remove the standalone BSP lib folder (highlighted above) from "Library Paths".

For the library name, I only added "qk", even though the full library file name is libqk.a (UNIX naming convention) as shown below.
I removed ALL libraries from the Libraries in the above screenshot except for qk, to avoid picking up unnecessary code.

This application will NOT depend on the standalone BSP, so navigate to "Project References" and uncheck the dependence on standalone_bsp_1.


To generate the map file, specify -Wl,-Map,<name of the map file>,--cref,--gc-sections as a linker option, as you can see here:

The --cref option causes a cross-reference table to be emitted to the map file, and --gc-sections strips out unused input sections.  With --gc-sections, the DPP ELF debug (-O0) code size went down from ~49 KB to 45 KB.  BUT, do NOT include this option if the application will be started on CPU1 by the Linux remoteproc module, because the resource table--which is NOT used by anything in the application itself--must be left intact in the ELF file.  I suppose I could write code in the application that refers to the resource table to work around this problem.

I see the LED blink at 0.5 Hz (because I toggle the LED at 1 Hz rate).

Idling: QK::onIdle()

For the most part, a hard-real-time application has nothing to do.  One can choose to just keep running the background idle() function over and over, or put the CPU to sleep, and save some power.   Since Zynq is ARM Cortex A9 based, which has the "wait for event" instruction that does exactly that, I can just call that instruction, and expect to be woken up if there is an interrupt/event for the CPU.

void QK::onIdle(void) {
    asm("WFE" : : : );//NOTE: an interrupt starts the CPU clock again
}

Debugging QP application on CPU1

Q_ASSERT is used extensively in QP--both within the infrastructure and the application code.  I found that for some reason, XSDK generated code optimized out the file name and line number arguments into Q_assert, and would not show the stack trace while stuck in the infinite loop in Q_onAssert().  So I worked around the first problem by creating global variables oops_file and oops_line and saving the arguments to those, so that I can display them in the Expression window, as shown below:


Next step: inter-AMP DPP application

The usual DPP application has either a GUI showing the states of the philosophers, or an LED for each philosopher, or a QSpy text output.  I will make my AMP system more interesting by sending messages between the Linux application and the real-time bare-metal application.