QP is my favorite RTOS. It's actually NOT an OS per se, but a state-machine framework. But with a small amount of scheduling code (a very particular scheduler called a single threaded RTC scheduler), a QP application with multiple "threads" (active objects actually) runs in a clean and high performance manner. I have been using this framework for over 12 years now, and have not yet seen a better SW framework, especially the QK framework for hard real-time code.
I started my Zynq study about 6 months ago, and wanted to run bare metal C/C++ on Zynq. Its dual-core architecture is ideal for running a GPOS like Linux on CPU0, and bare metal C/C++ code on CPU1. In previous blog entries, I figured out how to kick off a bare metal (more accurately, no OS) C application on CPU1 from Linux on CPU0, so the next logical step is to write a QK application for Zynq CPU1.
Although there is an application note for porting QP to bare metal ARM, it targets a small ARM CPU running by itself. What I need is a QK port that starts from code already in the DRAM (Linux on CPU0 puts the code there and resets CPU1), so much of the code in qpcpp's QK port to ARM GNU is unnecessary.
xsdk static library project for QK-C++
You can get QP through git:
git clone git://git.code.sf.net/p/qpc/qpcpp
HW independent code
At the root of the qpcpp folder, the include/, qf/, qep/, and qk/ have platform independent code. A new port should go to ports/ folder. I create this Zynq (Xilinx ARM) remoteproc (CPU1 kicked off from Linux on CPU0) port in ports/arm/qk/zynq_rproc, as shown in an example of Xilinx xsdk new static library project creation window.
xsdk (which is based on Eclipse CDT) creates the .cproject and .project files in the above folder. To add QP files to the project, I create a LINKED folder (right click on the project --> New --> Folder --> Advanced) to the qf, qep, qk source folders, like this example:
There is 1 problem with this method of adding all QP folders through linked folder feature: I had to delete the file qvanilla.cpp, which contains the same implementations that are in qk.cpp--which is what I want.
The only HW independent code I have to change is the QP_API_VERSION in qp_port.h, which controls the degree of backward compatibility. 0 (the default) means maximum compatibility. For a new project (such as this) without a backward compatibility requirement, set QP_API_VERSION to 999.
To avoid pulling in standard libraries, I had to supply the -fno-rtti and -fno-exceptions compiler options, both for the qk library and the executable below.
Above screenshot also shows the compiler option "-ffunction-sections", which forces each function into its own section. This allows the linker to place any section in a different memory. Together with the linker option "--gc-sections" (discussed when building the application ELF file below), this compiler option also leads to a smaller executable footprint. BUT this option strips out the resource table static struct read by the Linux remoteproc module (the one that copies the ELF file into CPU1's memory, reboots CPU1, and kicks off the executable). So do NOT include this option!
I also copy the code recommended by the "Bare metal ARM code development", to avoid dynamic memory allocation.
#include "qp_port.h"
#include <stdlib.h> // for prototypes of malloc() and free()
Q_DEFINE_THIS_FILE
extern "C" {
int __aeabi_atexit(void *object, void (*destructor)(void *), void *dso_handle)
{ return 0; }
//............................................................................
void __cxa_atexit(void (*arg1)(void *), void *arg2, void *arg3) {
}
//............................................................................
void __cxa_guard_acquire() {
}
//............................................................................
void __cxa_guard_release() {
}
//............................................................................
void *malloc(size_t) {
Q_ERROR();
return (void *)0;
}
//............................................................................
void free(void *) {
Q_ERROR();
}
//............................................................................
void *calloc(size_t, size_t) {
Q_ERROR();
return (void *)0;
}
}
//............................................................................
void *operator new(size_t size) throw() { return malloc(size); }
//............................................................................
void operator delete(void *p) throw() { free(p); }
HW specific linker script and startup assembly code
To repeat, this port of QK is for the 2nd Zynq CPU placed into its own RAM area by Linux on CPU0. I started from the bare metal ARM 7/9 port available from state-machine.com, but removed the code I do not need. The linker script specifies the entry point into the application is the reset vector of the vector table.
Vector table
ENTRY(_vector_table)
This port will supply the standard ARM interrupt handlers, so the vector table is hard coded to the functions in the port in startup.s. Technically speaking, a linker script is NOT part of the QK library; it is application specific. But the vector table is tightly coupled to the QK port.
.text
.code 32
.section .vectors
_vector_table:
B _boot
B QF_undef
B QF_swi
B QF_pAbort
B QF_dAbort
B QF_reserved/* Placeholder for address exception vector*/
B QK_irq
B QF_fiq_dummy
.set vector_base, _vector_table
The ARM 7/9 port actually starts with an empty vector table until the RAM is remapped, and even then, the 1st level vector table merely jumps indirectly to a 2nd level table that is later filled in by QF::onStartup. But as explained above, my code is placed straight into already initialized RAM, and the entire CPU1 is dedicated to this application. So I decided to keep things simple by hard coding the vector table (without the indirection recommended in the bare metal ARM application note). The qk_port.s section below will detail all the above interrupt handlers except the reset vector (_boot), which I elaborate on right now.
Every vector except _boot and QK_irq turns off interrupts and calls Q_onAssert(char* source, int linnum). Here is an example for the FIQ vector, which this port of QP does NOT handle:
Csting_fiq: .string "FIQ dummy"
.global QF_fiq_dummy
.func QF_fiq_dummy
.align 3
QF_fiq_dummy:
LDR r0,=Csting_fiq
B QF_except
.size QF_fiq_dummy, . - QF_fiq_dummy
.endfunc
The ARM C calling convention passes the 1st and the 2nd parameters in R0 and R1, so QF_except sets up those 2 registers before calling Q_onAssert:
.global QF_except
.func QF_except
.align 3
QF_except:
/* r0 is set to the string with the exception name */
SUB r1,lr,#4 /* set line number to the exception address */
MSR cpsr_c,#(SYS_MODE | NO_IRQ | NO_FIQ) /* SYSTEM,IRQ/FIQ disabled */
LDR r12,=Q_onAssert
MOV lr,pc /* store the return address */
BX r12 /* call the assertion-handler (ARM/THUMB) */
/* the assertion handler should not return, but in case it does
* hang up the machine in this endless loop
*/
B .
_boot vector
The very first code out of reset is to allow only CPU1 to run. Maybe I don't need this code, but I inherited this code from Xilinx BSP's boot.S. Depending on the time of the day, I feel like this is a good thing to do:
.section .boot,"ax"
_prestart:
_boot: /* only allow cpu1 through */
mrc p15,0,r1,c0,c0,5
and r1, r1, #0xf
cmp r1, #1
beq OKToRun
EndlessLoop1:
wfe
b EndlessLoop1
There are CPU version specific workarounds to be applied. The errata to pick up are defined in xil_errata.h, which I COPIED from the auto-generated BSP code. For Zynq, only ARM errata 742230 and 743622 apply.
Recall from my earlier blog on the zynq_remoteproc module (which controls the lifecycle of the CPU1 app) that CPU1 gets the 2 MB starting at 0x1fe00000. So the vectors are laid down at that address. But when an interrupt comes (for example an IRQ), how does CPU1 know to call QK_irq()? The answer is the VBAR (vector base address register):
ldr r0, =vector_base
mcr p15, 0, r0, c12, c0, 0
MMU initialization
Next, the SCU (snoop control unit) is enabled and invalidated. This must refer to CPU1's interface to the SCU, because the SCU is shared between the 2 cores. According to ARM bloggers, this, together with writing the ACTLR.SMP bit, seems to be related to the cache coherency setting.
Next, I invalidate the I and D caches and the MMU TLB, and then disable the MMU (all through the ARM CP15 system control coprocessor) before writing the MMU table. This port being specialized for the 2nd CPU with its reserved DRAM range (510~512 MB), the MMU table is configured this way:
- DRAM allotted to CPU0 is marked unassigned/reserved. If CPU1 accesses this range, the HW will raise pAbort or dAbort.
- The rest of the memory is marked non-shared, with L2 caching turned off (TEX=0b100) and L1 caching enabled (C,B=0b01: write-back, write-allocate)
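To make the two kinds of table entries above more concrete, here is a minimal C++11 sketch that builds ARMv7-A short-descriptor 1 MB section entries. The function names, the choice of domain 0, and the full-access permission (AP=0b11) are my assumptions for illustration; they are not taken from the port's actual translation table.

```cpp
#include <cassert>
#include <cstdint>

// ARMv7-A short-descriptor format, 1 MB section entry:
//   [31:20] section base, [14:12] TEX, [11:10] AP, [8:5] domain,
//   [3] C, [2] B, [1:0] = 0b10 (section)
constexpr uint32_t kSectionType = 0x2u;

// Entry for memory CPU1 may use: TEX=0b100 (outer/L2 non-cacheable),
// C:B=0b01 (inner/L1 write-back, write-allocate), shareable bit left at 0.
// AP=0b11 and domain 0 are assumptions of this sketch.
constexpr uint32_t cpu1_section(uint32_t phys_addr) {
    return (phys_addr & 0xFFF00000u)
         | (0x4u << 12)   // TEX = 0b100
         | (0x3u << 10)   // AP  = 0b11
         | (0x0u << 3)    // C   = 0
         | (0x1u << 2)    // B   = 1
         | kSectionType;
}

// Entry for DRAM owned by CPU0: bits [1:0] = 0b00 marks an invalid entry,
// so any access from CPU1 faults.
constexpr uint32_t reserved_section() { return 0x0u; }

// Index into the 4096-entry first-level table for a given address.
constexpr uint32_t section_index(uint32_t addr) { return addr >> 20; }
```

With these helpers, the first megabyte of CPU1's range (0x1FE00000) lands at table index 510, which matches the 510~512 MB range mentioned above.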
The MMU and cache are enabled, again through the cp15 registers. The syntax for the CP15 access is:
mcr p15, <op1>, r0, <target register 0; CRn>, <target register 1; CRm>, <op2>
CPU feature initialization
Next, various features in the CPU are enabled, starting with the FPU. Perhaps one can be more optimal about this sort of thing, but if the application runs any control algorithm, floating point is probably inevitable.
.set FPEXC_EN, 0x40000000 /* FPU enable bit, (1 << 30) */
fmrx r1, FPEXC /* read the exception register */
orr r1,r1, #FPEXC_EN /* set VFP enable bit, leave the others in orig state */
fmxr FPEXC, r1 /* write back the exception register */
Branch prediction (the Xilinx BSP calls this flow prediction) and data/instruction prefetches are turned on, again through CP15. I am not sure why the Xilinx BSP turns on the asynchronous abort exception... And then comes more CPU initialization, followed by the performance counter reset.
mov r2, #0x80000007 /* clear CCNT and Px overflow */
mcr p15, 0, r2, c9, c12, 3
MOV R1, #0 //select counter 0
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MOV R2, #0x10 //monitor branch miss
MCR p15, 0, R2, c9, c13, 1 //Write EVTSELx Register
MOV R1, #1 //select counter 1
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MOV R2, #0x01 //monitor instruction cache miss
MCR p15, 0, R2, c9, c13, 1 //Write EVTSELx Register
MOV R1, #2 //select counter 2
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MOV R2, #0x03 //monitor data cache miss
MCR p15, 0, R2, c9, c13, 1 //Write EVTSELx Register
//Control register
//D: 1/64 off!
//C: reset counter
//P: reset event counter
//E: enable all counters including CCNT
mov r2, #0xF //DCPE
mcr p15, 0, r2, c9, c12, 0
mov r2, #0x80000007 /* enable CCNT and Px */
mcr p15, 0, r2, c9, c12, 1
Using the performance counters, you can see all sorts of details! This may be necessary to debug why CPU1 is running slower than CPU0. I chose to look at branch misses, instruction cache misses, and data cache misses.
Supporting C/C++
BSS section should be zeroed out before entering main(). This is an efficient memset(0):
LDR r1,=__bss_start__
LDR r2,=__bss_end__
MOV r3,#0
1:
CMP r1,r2
STMLTIA r1!,{r3}
BLT 1b
STM means "store multiple", and its syntax is STM{cond}{addressing mode}. So in STMLTIA, LT means "less than" (the result of the comparison above), and IA means "increment after": the base register, the one in front of the "!", is incremented after each store, and the "!" writes the updated address back to it. In the debugger, the instructions actually show up as:
1fe001a0: ldr r1, [pc, #+252]
1fe001a4: ldr r2, [pc, #+252]
1fe001a8: mov r3, #0
1fe001ac: cmp r1, r2
1fe001b0: stmlt r1!, {r3}
1fe001b4: blt -1
__bss_start__ is stored at [pc, #252] = 0x1fe001a0 + 252 (260 actually, if I account for the 2 instruction pipeline), i.e. at 0x1fe002a4.
QP uses just 1 stack (and no heap). In the linker, the stack is defined in the RAM this way:
_STACK_SIZE = DEFINED(_STACK_SIZE) ? _STACK_SIZE : 0x2000;
_HEAP_SIZE = 0;
.stack (NOLOAD) : { /* Single stack application */
__stack_start__ = . ;
__irq_stack_top__ = .;
__fiq_stack_top__ = .;
__svc_stack_top__ = .;
__abt_stack_top__ = .;
__und_stack_top__ = .;
. += _STACK_SIZE;
. = ALIGN (4);
__c_stack_top__ = . ;
__stack_end__ = .;
} > ps7_ddr_0_S_AXI_BASEADDR
Note that there is actually NO stack for any mode except the user/system mode.
Similar to zeroing out the BSS, filling the stack with a special pattern (to detect stack overrun when debugging weird problems) uses the conditional STMLTIA instruction as well:
LDR r1,=__stack_start__
LDR r2,=__stack_end__
.equ STACK_FILL, 0xAAAAAAAA
LDR r3,=STACK_FILL
1:
CMP r1,r2
STMLTIA r1!,{r3}
BLT 1b
In the disassembly window, this code shows up as
1fe001b8: ldr r1, [pc, #+236]
1fe001bc: ldr r2, [pc, #+236]
1fe001c0: ldr r3, [pc, #+236]
1fe001c4: cmp r1, r2
1fe001c8: stmlt r1!, {r3}
1fe001cc: blt -16
Wouldn't you feel reassured if you saw 0xAAAAAAAA at the PC (0x1fe001c0) + 236 + 8 (for pipelining)? In the memory view, 0x1fe002b4 indeed has the STACK_FILL pattern!
0x1fe002ac : 0x1FE002AC <Hex Integer>
Address 0 - 3 4 - 7 8 - B C - F
1FE002A0 00001005 1FE0C05C 1FE0C3FC 1FE0C400
1FE002B0 1FE0E400 AAAAAAAA 1FE0E400 1FE03FE0
1FE002C0 1FE00424 000003FF 00007FFF E92D4800
1FE002D0 E28DB004 E24DD008 E30E39FF E50B3008
1FE002E0 E30004D2 EB00002D E24BD004 E8BD8800
1FE002F0 E52DB004 E28DB000 E24DD00C E1A03000
The 2 words before the fill pattern (0x1fe0c400 and 0x1fe0e400) are the stack start and end addresses, which confirms that the stack is 0x2000 bytes big.
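The PC-relative arithmetic used in the two disassembly listings can be checked with a few lines of C++11. This is only a host-side sketch; the function name is hypothetical, and the addresses and offsets are copied from the listings and the memory view above.

```cpp
#include <cassert>
#include <cstdint>

// In ARM state, reading PC yields the address of the current instruction + 8,
// so "ldr rX, [pc, #offset]" loads from (instruction address + 8 + offset).
constexpr uint32_t literal_address(uint32_t instr_addr, uint32_t offset) {
    return instr_addr + 8u + offset;
}

// BSS loop: the literals hold __bss_start__ and __bss_end__.
static_assert(literal_address(0x1FE001A0u, 252u) == 0x1FE002A4u, "__bss_start__");
static_assert(literal_address(0x1FE001A4u, 252u) == 0x1FE002A8u, "__bss_end__");

// Stack fill loop: the literals hold __stack_start__, __stack_end__, STACK_FILL.
static_assert(literal_address(0x1FE001B8u, 236u) == 0x1FE002ACu, "__stack_start__");
static_assert(literal_address(0x1FE001BCu, 236u) == 0x1FE002B0u, "__stack_end__");
static_assert(literal_address(0x1FE001C0u, 236u) == 0x1FE002B4u, "STACK_FILL");
```

Each computed address lines up with a word in the memory view: 0x1FE002A4 holds 1FE0C05C, 0x1FE002A8 holds 1FE0C3FC, 0x1FE002AC holds 1FE0C400, 0x1FE002B0 holds 1FE0E400, and 0x1FE002B4 holds AAAAAAAA.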
Since the ARM stack pointer register (alias: sp) is banked, I have to be in the desired CPU mode (ARM has a total of 7 modes, but the user and system modes use the same banked registers) to set the stack pointer for each mode. But the QK ARM port uses only the SYSTEM mode, so I only need the following code:
MSR CPSR_c,#(SYS_MODE | I_BIT | F_BIT)
LDR sp,=__c_stack_top__ /* set the C stack pointer */
Using the counters I initialized way above, I can measure the speed of the BSS and stack init:
MSR CPSR_c,#(SYS_MODE | I_BIT | F_BIT)
LDR sp,=__c_stack_top__ /* set the C stack pointer */
//How long did BSS and stack init take?
MRC p15, 0, R0, c9, c13, 0//read the CCNT
MOV R1, #0 //select counter
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MRC p15, 0, R2, c9, c13, 2//read event count
MOV R1, #1 //select counter
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MRC p15, 0, R3, c9, c13, 2//read event count
MOV R1, #2 //select counter
MCR p15, 0, R1, c9, c12, 5 //Write PMNXSEL Register
MRC p15, 0, R4, c9, c13, 2//read event count
Performance of initializing the stack
And the result (when I turn off the divide by 64 mode) is R0: 5124, R2: 4, R3: 3, R4: 13. This means that it took 5124 clock cycles to zero the 928 bytes/232 words of BSS (__bss_end__ - __bss_start__ = 0x1fe0c3fc - 0x1fe0c05c), write the stack fill pattern to 0x2000 (8K) bytes or 2K words, and set the stack pointer. The BSS zeroing and stack filling loops both consist of 3 assembly instructions. So 3 times (232 + 2048) loop iterations = 6840 instructions, which is LARGER than the CCNT count: the Cortex-A9 completed more than 1 instruction per cycle on average, showing its superscalar (dual-issue) pipeline at work! The instruction and data cache misses were 3 and 13, respectively, and there were 4 branch prediction misses.
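The arithmetic behind that estimate can be written down as a small C++11 sketch. The constant names are hypothetical, and the instruction count is only an estimate: it counts 3 instructions per word written and ignores the handful of setup instructions around the loops.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kBssBytes     = 0x1FE0C3FCu - 0x1FE0C05Cu; // __bss_end__ - __bss_start__
constexpr uint32_t kStackBytes   = 0x2000u;                   // _STACK_SIZE
constexpr uint32_t kInstrPerWord = 3u;                        // CMP, STMLT, BLT
constexpr uint32_t kCycles       = 5124u;                     // CCNT value read into R0

// Estimated number of instructions executed by the two loops.
constexpr uint32_t loop_instructions() {
    return kInstrPerWord * (kBssBytes / 4u + kStackBytes / 4u);
}

// Instructions per cycle, scaled by 100 to stay in integer arithmetic.
constexpr uint32_t ipc_x100() {
    return loop_instructions() * 100u / kCycles;
}
```

With the measured numbers this works out to roughly 1.33 instructions per cycle, which is only possible on a core that can issue more than one instruction per clock.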
Calling the C/C++ static initializers
__libc_init_array(), normally supplied in libc, calls the constructors of the classes declared globally.
LDR r12,=__libc_init_array //in libc
MOV lr,pc
BX r12
Since I did not want to link against Xilinx BSP libc, I wrote my own function shown way above.
Jump to main()
Since my embedded application will never shutdown, I did not even bother calling the destructors for the global instances, to undo what I did in __libc_init_array.
LDR r12,=main //call int main(void)
MOV lr,pc /* set the return address */
BX r12 /* the target code can be ARM or THUMB */
//bl __libc_fini_array /* Cleanup global constructors */
_forever_: b _forever_ //End of the world!
BSP_init(): HW specific initialization
BSP_init() is responsible for initializing all HW the application will use, so it is of course application dependent. BUT, there are some HW that ALL real-time embedded applications use: timer interrupt and GPIO LED. So I set up these 2 HW here, and then perhaps come back to more application specific HW--such as watchdog, PWM timer, and I2C in another blog.
Setup HB GPIO LED
Linux kernel source to enable GPIO clock
In case the GPIO HW clock is not yet enabled, I turn it on here. The GPIO clock is defined in the DTS:
gpio: gpio@e000a000 {
compatible = "xlnx,ps7-gpio-1.00.a", "xlnx,zynq-gpio-1.00.a", "xlnx,zynq-gpio-1.0";
reg = <0xe000a000 0x1000>;
interrupts = <0 20 4>;
interrupt-parent = <&gic>;
clocks = <&clkc 42>;
gpio-controller;
#gpio-cells = <2>;
interrupt-controller;
#interrupt-cells = <2>;
};
Index 42 in the clkc clock-output-names list is the "gpio_aper" clock:
slcr: slcr@f8000000 {
#address-cells = <1>;
#size-cells = <1>;
compatible = "xlnx,zynq-slcr", "syscon";
reg = <0xf8000000 0x1000>;
ranges ;
clkc: clkc {
#clock-cells = <1>;
clock-output-names = "armpll", "ddrpll", "iopll", "cpu_6or4x", "cpu_3or2x",
"cpu_2x", "cpu_1x", "ddr2x", "ddr3x", "dci",
"lqspi", "smc", "pcap", "gem0", "gem1",
"fclk0", "fclk1", "fclk2", "fclk3", "can0",
"can1", "sdio0", "sdio1", "uart0", "uart1",
"spi0", "spi1", "dma", "usb0_aper", "usb1_aper",
"gem0_aper", "gem1_aper", "sdio0_aper", "sdio1_aper", "spi0_aper",
"spi1_aper", "can0_aper", "can1_aper", "i2c0_aper", "i2c1_aper",
"uart0_aper", "uart1_aper", "gpio_aper", "lqspi_aper", "smc_aper",
"swdt", "dbg_trc", "dbg_apb";
compatible = "xlnx,ps7-clkc";
ps-clk-frequency = <33333333>;
fclk-enable = <0xf>;
reg = <0x100 0x100>;
};
};
Kernel code for the Zynq clocks is <>/drivers/clk/zynq/clk.c. The gpio_aper clock's bit index is 22 within the control register for the aper clocks, which are the peripheral clocks driven at the CPU_1x rate. That register sits at offset 0x2C from the Zynq clock base 0xF8000100 (= 0xF8000000 + 0x100, as defined in the DTS above), and is described in the Zynq TRM Table 25-2. Enabling the clock simply means setting that bit (22 in this case) in the clock control register, like this:
//Turn on the GPIO clock (in case it is off)
#define ZYNQ_APER_CLK_CTRL 0xF800012C
*(volatile uint32_t*)ZYNQ_APER_CLK_CTRL |= 1 << 22;
Configure GPIO
On Zynq, every GPIO is configured with its interrupt enabled out of reset. The 128 possible such interrupt sources can be quenched by writing to the GPIO interrupt disable registers, which are not contiguous, but spread in blocks across the 4 banks of GPIOs.
#define ZYNQ_GPIO_BASE_ADDR 0xe000a000
#define ZYNQ_GPIO_BANK_CTRL_OFFSET 0x40
for(i=0; i < 4; ++i)
*(volatile uint32_t*)
(ZYNQ_GPIO_INT_DIS_ADDR + i * ZYNQ_GPIO_BANK_CTRL_OFFSET) = ~0UL;
A GPIO pin is an input on PoR. To write to that pin, I have to set the direction and the output enable registers. It is slightly complicated by the non-linear mapping of a logical GPIO number to the GPIO banks, as in this example for the GPIO07--the only LED addressable from the CPU on the Zedboard:
uint8_t const BSP_HB_LED_GPIO = static_cast<uint8_t>(7);
uint8_t gpio_bank, gpio_pin;
XGpioPs_GetBankPin(BSP_HB_LED_GPIO, &gpio_bank, &gpio_pin);//map
*(volatile uint32_t*)(gpio_bank * ZYNQ_GPIO_BANK_CTRL_OFFSET +
ZYNQ_GPIO_DIRM_ADDR) = 1 << gpio_pin;
*(volatile uint32_t*)(gpio_bank * ZYNQ_GPIO_BANK_CTRL_OFFSET +
ZYNQ_GPIO_OUTEN_ADDR) = 1 << gpio_pin;
The non-linear mapping is handled with a helper function:
static inline void XGpioPs_GetBankPin(uint8_t PinNumber,
uint8_t *BankNumber, uint8_t *PinNumberInBank) {
for (*BankNumber = 0; *BankNumber < 4; (*BankNumber)++)
if (PinNumber <= XGpioPsPinTable[*BankNumber])
break;
if (*BankNumber == 0) {
*PinNumberInBank = PinNumber;
} else {
*PinNumberInBank = PinNumber %
(XGpioPsPinTable[*BankNumber - 1] + 1);
}
}
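The helper above depends on XGpioPsPinTable, which is not shown in this post. Here is a self-contained sketch of the same mapping that can be tried on a host PC. The table values {31, 53, 85, 117} are what I recall from the Xilinx XGpioPs driver for Zynq-7000, so treat them as an assumption; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Highest pin number in each of the 4 banks (assumed values, see above).
static const uint8_t kLastPinInBank[4] = {31, 53, 85, 117};

struct BankPin {
    uint8_t bank;  // 0..3, or 4 if the pin number is out of range
    uint8_t pin;   // bit position inside the bank
};

inline BankPin gpio_bank_pin(uint8_t pin_number) {
    BankPin r = {0, pin_number};
    // Find the first bank whose last pin is >= pin_number.
    while (r.bank < 4 && pin_number > kLastPinInBank[r.bank]) {
        ++r.bank;
    }
    if (r.bank > 0 && r.bank < 4) {
        // Subtract the first pin number of this bank. For in-range pins this
        // gives the same result as the modulo used in the helper above.
        r.pin = static_cast<uint8_t>(pin_number - (kLastPinInBank[r.bank - 1] + 1));
    }
    return r;
}
```

For the heartbeat LED, gpio_bank_pin(7) yields bank 0, pin 7, while pin 54 (the first pin of the third bank under the assumed table) yields bank 2, pin 0.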
During bringup, BSP_init may hang (e.g. when performing a self-test of a serial peripheral). In my experience, it is convenient to light up the heartbeat LED as soon as possible, and keep it lit until BSP_init() is done. Writing to the GPIO is just a matter of setting the correct bit in either the high or the low data register, each of which handles only 16 GPIO pins:
void BSP_writeGPIO(uint8_t pin, bool on) {
uint8_t gpio_bank, gpio_pin;
volatile uint32_t* data_reg;
XGpioPs_GetBankPin(pin, &gpio_bank, &gpio_pin);//map
if(gpio_pin > 15) {
gpio_pin -= 16;
data_reg = (volatile uint32_t*)
(gpio_bank * ZYNQ_GPIO_BANK_DATA_OFFSET +
ZYNQ_GPIO_DATA_HI16_ADDR);
} else {
data_reg = (volatile uint32_t*)
(gpio_bank * ZYNQ_GPIO_BANK_DATA_OFFSET +
ZYNQ_GPIO_DATA_LO16_ADDR);
}
*data_reg = ~(1 << (gpio_pin+16)) //mask shields other pins from
& (((on & 1) << gpio_pin) | 0xFFFF0000); //0 in data
}
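The last statement of BSP_writeGPIO() packs a write mask and the data into one 32-bit word. To see what actually gets written, here is the same expression pulled out into a pure function (the function name is hypothetical) so it can be checked on a host PC:

```cpp
#include <cassert>
#include <cstdint>

// Value for a masked data register: the upper 16 bits are a write mask
// (1 = leave that pin untouched) and the lower 16 bits are the data.
// pin_in_half must be 0..15 (the pin position within the 16-bit half).
inline uint32_t gpio_mask_data_word(uint8_t pin_in_half, bool on) {
    assert(pin_in_half < 16);
    const uint32_t unmask_this_pin = ~(UINT32_C(1) << (pin_in_half + 16));
    const uint32_t data = (on ? UINT32_C(1) : UINT32_C(0)) << pin_in_half;
    return unmask_this_pin & (data | UINT32_C(0xFFFF0000));
}
```

For the heartbeat LED on pin 7, turning it on writes 0xFF7F0080 and turning it off writes 0xFF7F0000: only mask bit 23 is cleared, so no other pin is affected.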
I turn on the LED before starting the rest of HW initialization--which in the simplest case will be just the timer interrupt.
BSP_writeGPIO(BSP_HB_LED_GPIO, true); //light up the LED ASAP
Setup timer interrupt and the GIC (the interrupt controller)
GIC has complicated rules about the secure/non-secure interrupts. I do NOT use the security feature of the HW, so I just copied the XSDK BSP generated example:
#define XPAR_PS7_SCUGIC_0_BASEADDR 0xF8F00100 //CPU base addr
#define XSCUGIC_CPU_CONTROL_OFFSET 0x0
#define XSCUGIC_CPU_PRIOR_OFFSET 0x4
*(volatile uint32_t*)//See ICCICR
(XPAR_PS7_SCUGIC_0_BASEADDR + XSCUGIC_CPU_CONTROL_OFFSET) =
1 << 2 | 1 << 1 | 1;
The GIC identifies each of up to 96 different interrupts it services with an interrupt ID. The private timer interrupt ID is 29.
enum InterruptId { //These are all interrupts I care to handle
INT_ID_PRIVATE_TIMER = 29 //priority? ICDIPR and ICDIPTR
};
I have to tell the GIC distributor that I want this interrupt (and any others I care to receive in the future):
#define XPAR_PS7_SCUGIC_0_DIST_BASEADDR 0xF8F01000
#define XSCUGIC_ENABLE_SET_OFFSET 0x100//ICDISER0,ICDISER1,ICDISER2
*(volatile uint32_t*)//See ICDISER0: for intID 0~31
(XPAR_PS7_SCUGIC_0_DIST_BASEADDR + XSCUGIC_ENABLE_SET_OFFSET + 0) =
1 << (INT_ID_PRIVATE_TIMER % 32)
;
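For interrupt IDs above 31 the enable bit lives in ICDISER1 or ICDISER2, so the register index has to be computed as well. A minimal C++11 sketch, with hypothetical helper names, might look like this. The last helper relies on the GIC convention that shared peripheral interrupt (SPI) number N corresponds to interrupt ID N + 32, which would put the GPIO interrupt from the DTS entry shown earlier (interrupts = <0 20 4>) at ID 52.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kGicDistBase     = 0xF8F01000u; // XPAR_PS7_SCUGIC_0_DIST_BASEADDR
constexpr uint32_t kEnableSetOffset = 0x100u;      // XSCUGIC_ENABLE_SET_OFFSET

// Address of the ICDISERn register that holds the enable bit for int_id.
constexpr uint32_t icdiser_address(uint32_t int_id) {
    return kGicDistBase + kEnableSetOffset + 4u * (int_id / 32u);
}

// Bit to write into that register (writing 1 enables the interrupt).
constexpr uint32_t icdiser_bit(uint32_t int_id) {
    return UINT32_C(1) << (int_id % 32u);
}

// Device tree SPI number -> GIC interrupt ID.
constexpr uint32_t spi_to_int_id(uint32_t spi) { return spi + 32u; }
```

For the private timer (ID 29) this reproduces the write above: register 0xF8F01100, bit 29.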
Then I configure the private timer HW. If I want to receive a timer interrupt with a 1 second period, I have to set the timer load value to 1 less than the private timer clock frequency, which (according to the TRM) is HALF of the CPU frequency. The definitive source for the CPU clock frequency is the Vivado PS7 config wizard, as you can see here:
I wrote half of the actual CPU clock rate (less 1) to the SCU private timer load register like this:
#define CPU3x2xHZ (666666687 / 2)
*(volatile uint32_t*)//See Private_Timer_Load_Register in Zynq TRM
(XPAR_PS7_SCUTIMER_0_BASEADDR + XSCUTIMER_LOAD_OFFSET) =
CPU3x2xHZ - 1;
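A 1 second tick is handy for watching an LED, but a real application will likely want a faster tick. Here is a small C++11 sketch of the load value calculation for an arbitrary tick rate, assuming the timer's prescaler field is left at 0. The names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kCpuHz   = 666666687u;   // CPU clock from the Vivado PS7 configuration
constexpr uint32_t kTimerHz = kCpuHz / 2u;  // private timer runs at half the CPU clock

// Load value for a given tick rate, assuming a prescaler of 0.
// The timer counts from the load value down to 0, so one period
// is (load + 1) timer clocks.
constexpr uint32_t timer_load(uint32_t ticks_per_second) {
    return kTimerHz / ticks_per_second - 1u;
}
```

timer_load(1) gives the same value as CPU3x2xHZ - 1 above, and timer_load(100) would give a 10 ms tick.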
Finally, I started the timer through the control register:
*(volatile uint32_t*)//See Private_Timer_Control_Register Zynq TRM
(XPAR_PS7_SCUTIMER_0_BASEADDR + XSCUTIMER_CONTROL_OFFSET) =
1 << 2 | //interrupt enable
1 << 1 | //auto-reload
1; //Enable
With HW initialization complete, I turn off the heartbeat LED on my way out of BSP_init().
BSP_writeGPIO(BSP_HB_LED_GPIO, false); //turn off the LED
HW specific interrupt locking/unlocking in qf_port.h
The maximum number of active objects and event pools just size static variables; these values are adequate for most applications:
#define QF_MAX_ACTIVE 32//Should be enough for most
#define QF_MAX_EPOOL 6 // The maximum number of event pools in the application
The most important decision when porting QP is the interrupt locking policy. Since Zynq has a prioritized interrupt controller (GIC), I can use the simple policy of unconditionally locking and unlocking interrupts while still retaining the nested interrupt feature, because the GIC takes the responsibility for holding back interrupts with the same or lower priority than the currently asserted interrupt. Understanding the interrupt enabling/disabling code is easier with section 9.2.3.1 of the ARM System Developer's Guide and the CPSR register:
The CPSR_c pseudo register name allows me to write only bits [7:0] (the control field) of the CPSR. Bits 7 and 6 of the CPSR are the IRQ and FIQ disable bits. The interrupt controller is connected to the IRQ line but NOT the FIQ line, so I will try hard to avoid FIQ in this port.
#define QF_INT_DISABLE() \
__asm volatile ("MSR cpsr_c,#(0x1F | 0x80)" ::: "cc")
#define QF_INT_ENABLE() \
__asm volatile ("MSR cpsr_c,#(0x1F)" ::: "cc")
BUT to allow for the possibility of nesting critical section, QP saves interrupt status when entering critical section. Showing just the ARM case--selected with __arm__ preprocessor define (I did NOT bother with THUMB code):
#define QF_CRIT_STAT_TYPE unsigned int
#define QF_CRIT_ENTRY(stat_) do { \
__asm volatile ("MRS %0,cpsr" : "=r" (stat_) :: "cc"); \
QF_INT_DISABLE(); \
} while (0)
#define QF_CRIT_EXIT(stat_) \
__asm volatile ("MSR cpsr_c,%0" :: "r" (stat_) : "cc")
Note that it is apparently OK to write a full 32-bit value to CPSR_c; the _c field mask makes the MSR instruction update only bits [7:0], so the top 24 bits are ignored.
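To convince myself that saving and restoring the status really makes critical sections nestable, the logic can be modeled on a host PC with a plain variable standing in for the CPSR. This is only a sketch of the idea, not the port's code; fake_cpsr and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kSysMode = 0x1Fu;  // CPSR mode bits for SYSTEM mode
constexpr uint32_t kIrqBit  = 0x80u;  // CPSR bit 7: IRQ disabled when set

// Stand-in for the real CPSR: SYSTEM mode, IRQ enabled.
static uint32_t fake_cpsr = kSysMode;

inline uint32_t crit_entry() {          // models QF_CRIT_ENTRY
    const uint32_t saved = fake_cpsr;   // MRS %0,cpsr
    fake_cpsr = kSysMode | kIrqBit;     // QF_INT_DISABLE()
    return saved;
}

inline void crit_exit(uint32_t saved) { // models QF_CRIT_EXIT
    // MSR cpsr_c,%0 updates only bits [7:0]
    fake_cpsr = (fake_cpsr & ~0xFFu) | (saved & 0xFFu);
}

inline bool irq_enabled() { return (fake_cpsr & kIrqBit) == 0; }
```

Leaving an inner critical section restores the status that was saved on its entry, which still has the IRQ bit set, so interrupts stay locked until the outermost critical section exits.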
#define QF_LOG2(n_) ((uint8_t)(32U - __builtin_clz(n_)))
The QF or QK port has to provide the various ARM exception handlers, which are forward declared here, so that the vector table can point to the assembly implementation (in qk_port.s):
extern "C" {
void QF_reset(void);
void QF_undef(void);
void QF_swi(void);
void QF_pAbort(void);
void QF_dAbort(void);
void QF_reserved(void);
void QF_fiq_dummy(void);
}
HW specific interrupt handling in qk_port.s
Earlier, I explained the _boot vector when explaining the startup code, and all other interrupt handlers--which are expected never to be used. ALL legitimate interrupt handling is done in QK_irq, which I copied straight from the qpcpp qk ARM 7/9 port.
QK IRQ interrupt handler wrapper
QK_irq is just a thin wrapper that keeps track of the nested interrupt count and calls the QK scheduler; the actual interrupt handling is done within BSP_irq. qk_port.h forward declares them:
extern "C" {
void QK_irq(void);
void BSP_irq(void);
}
The wrapper works like this (when reading below, remember that R13: stack pointer, R14: link register (return address), R15: PC):
- Save the SYSTEM context ({R0-R3, R12, R14, PC, SPSR}, complying with the ARM v7-M interrupt stack frame) onto the SYSTEM stack, and change back to the SYSTEM mode.
- Save R0 and R1 from the system context, and save the return address and SPSR (the saved program status register) to R0 and R1.
- Disable IRQ and change back to the SYSTEM mode.
- Push R0 and R1 to the SYSTEM stack (because we are in the SYSTEM mode now).
- Push general purpose registers allowed to by modified by the AAPCS (ARM architecture procedure call standard) to the stack.
- Remember the new stack pointer
- Change back to IRQ mode and save the SYSTEM R0, R1 (which has been saved in step #1 above) into the stack.
- Increment QK_intNest_, which keeps track of the nested interrupt level
- Run the C interrupt handler (BSP_irq). IRQ should be disabled at this point, but QK
- Decrement QK_intNest_. If it comes down to 0, run the event checker (QK_schedPrio_) function, which will return the priority of the active object with an event pending. If that priority is NOT zero, run the scheduler (QK_sched_).
- Restore context and change back to IRQ mode
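Leaving the register shuffling aside, the bookkeeping in the last few steps can be modeled in C++ on a host PC. This is only a sketch: the variables and the fake_* functions are hypothetical stand-ins for QK_intNest_, QK_schedPrio_ and QK_sched_, and the handler argument plays the role of BSP_irq.

```cpp
#include <cassert>
#include <cstdint>

static uint8_t  intNest    = 0;  // stands in for QK_intNest_
static uint8_t  readyPrio  = 0;  // priority the fake scheduler check will report
static unsigned schedCalls = 0;  // how many times the fake scheduler ran

// Hypothetical stand-ins for QK_schedPrio_() and QK_sched_().
inline uint8_t fake_schedPrio() { return readyPrio; }
inline void fake_sched(uint8_t) { ++schedCalls; readyPrio = 0; }

// Model of the wrapper's logic around the C interrupt handler.
inline void irq_wrapper(void (*handler)()) {
    ++intNest;
    handler();
    --intNest;
    if (intNest == 0) {                  // only the outermost interrupt schedules
        const uint8_t p = fake_schedPrio();
        if (p != 0) {
            fake_sched(p);
        }
    }
}

// Example handlers: the outer one is preempted by a nested interrupt
// that makes an active object with priority 3 ready to run.
inline void inner_handler() { readyPrio = 3; }
inline void outer_handler() { irq_wrapper(inner_handler); }
```

Calling irq_wrapper(outer_handler) runs the scheduler exactly once, on the way out of the outermost interrupt, even though the event was posted from the nested one.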
Minimal BSP_irq() for QP: handle the timer tick
Please recall from above that the QK_irq() wrapper does NOT acknowledge the interrupt to the HW. So BSP_irq reads the currently pending highest priority interrupt (as decided by the GIC), and acknowledges the interrupt to both the GIC and the HW that generated the interrupt. If I am only interested in the private timer interrupt, this code does exactly that:
void BSP_irq(void) {
/*
* Read the int_ack register to identify the highest priority interrupt ID
* and make sure it is valid. Reading Int_Ack will clear the interrupt
* in the GIC.
*/
#define XSCUGIC_INT_ACK_OFFSET 0xC
uint32_t intAck = *(volatile uint32_t*)//See ICCIAR register in Zynq TRM
(XPAR_PS7_SCUGIC_0_BASEADDR + XSCUGIC_INT_ACK_OFFSET);
#define XSCUGIC_ACK_INTID_MASK 0x3FF
uint32_t intID = intAck & XSCUGIC_ACK_INTID_MASK;
QF_INT_ENABLE(); // allow nesting interrupts
switch(intID) {
case INT_ID_PRIVATE_TIMER:
if(*(volatile uint32_t*)//See Private_Timer_Interrupt_Status_Register
(XPAR_PS7_SCUTIMER_0_BASEADDR + XSCUTIMER_ISR_OFFSET)
& 0x1) {
*(volatile uint32_t*)// clear interrupt source
(XPAR_PS7_SCUTIMER_0_BASEADDR + XSCUTIMER_ISR_OFFSET) = 1;
DPP::BSP_writeGPIO(BSP_HB_LED_GPIO, BSP_HB_LED_on = !BSP_HB_LED_on);
QP::QF::TICK(&l_ISR_tick);
}
break;
default: break;
}
QF_INT_DISABLE();// disable IRQ/FIQ before return
#define XSCUGIC_EOI_OFFSET 0x10
*(volatile uint32_t*)//See ICCEOIR register in Zynq TRM
(XPAR_PS7_SCUGIC_0_BASEADDR + XSCUGIC_EOI_OFFSET) = intAck;
}
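The value read from the acknowledge register carries a bit more than the interrupt ID. Here is a small decoding sketch based on the GIC register layout as I understand it (bits [9:0] hold the interrupt ID, bits [12:10] hold the requesting CPU for software generated interrupts, and ID 1023 means nothing was pending). The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kAckIntIdMask = 0x3FFu;  // same as XSCUGIC_ACK_INTID_MASK
constexpr uint32_t kSpuriousId   = 1023u;   // read back when no interrupt is pending

struct IrqAck {
    uint32_t int_id;   // interrupt ID, bits [9:0]
    uint32_t cpu_id;   // bits [12:10], meaningful for SGIs only
    bool     spurious; // true if there was nothing to service
};

inline IrqAck decode_ack(uint32_t int_ack) {
    IrqAck a;
    a.int_id   = int_ack & kAckIntIdMask;
    a.cpu_id   = (int_ack >> 10) & 0x7u;
    a.spurious = (a.int_id == kSpuriousId);
    return a;
}
```

A more defensive BSP_irq() could check the spurious flag and return early instead of falling through to the default case.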
DPP example application on Zynq CPU1
Since the DPP application just needs a timer interrupt and an LED to indicate the tick event firing, I can just use the main.cpp, philo.cpp, and table.cpp from the reference ARM 7/9 port I downloaded.
I created a new Xilinx C++ project (right click in xsdk Project Explorer --> New --> Project --> Xilinx --> Application Project), and specify the project location, Processor, Language, and the board support package, as shown below:
I chose to create a Xilinx standalone application and THEN remove the dependence on the BSP, but perhaps it would have been more straightforward just to create a vanilla C++ application; the only difference between a Xilinx application and a C++ application is that the linker command is given an extra option (-T) to use the linker script in the project. Removing the dependence on the Xilinx standalone BSP requires a few steps explained below.
A QP application project must include the QP include files and the QP port include files. Since I use the xsdk generated BSP, the project generator also put in the standalone BSP include folder, as you see below:
Remove the BSP include folder (above the /qk/include in the above screenshot) from the include folder.
Similarly, the project will link against the qk library just built above. Since the library location is build config dependent, I use an Eclipse variable ${ConfigName}, as shown below:
Remove the standalone BSP lib folder (highlighted above) from "Library Paths".
For the library name, I only added "qk", even though the full library file name is libqk.a (UNIX naming convention) as shown below.
I removed ALL libraries from the Libraries in the above screenshot except for qk, to avoid picking up unnecessary code.
This application will NOT depend on the standalone BSP, so navigate to "Project References" and uncheck the dependence on standalone_bsp_1.
To generate the map file, specify -Wl,-Map,<name of the map file>,--cref,--gc-sections as a linker option, as you can see here:
The --cref option causes a cross-reference table to be emitted to the map file, and --gc-sections strips out unused input sections. With this option, the DPP ELF debug (-O0) code size went down from ~49 KB to 45 KB. BUT, do NOT include this option if the application will be started on CPU1 by the Linux remoteproc module, because the resource table (which is NOT used by anything in the application itself) must be left intact in the ELF file. I suppose I could write code in the application that refers to the resource table to work around this problem.
I see the LED blink at 0.5 Hz (because I toggle the LED at 1 Hz rate).
Idling: QK::onIdle()
For the most part, a hard-real-time application has nothing to do. One can choose to just keep running the background idle() function over and over, or put the CPU to sleep, and save some power. Since Zynq is ARM Cortex A9 based, which has the "wait for event" instruction that does exactly that, I can just call that instruction, and expect to be woken up if there is an interrupt/event for the CPU.
void QK::onIdle(void) {
asm("WFE" : : : );//NOTE: an interrupt starts the CPU clock again
}
Debugging QP application on CPU1
Q_ASSERT is used extensively in QP, both within the infrastructure and the application code. I found that for some reason, the XSDK generated code optimized out the file name and line number arguments into Q_onAssert(), and would not show the stack trace when in the infinite loop in Q_onAssert(). So I worked around the first problem by creating global variables oops_file and oops_line and saving the arguments to those, so that I can display them in the Expressions window, as shown below:
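A minimal sketch of that workaround could look like the following. The Q_onAssert() signature is my assumption of the QP callback, and record_assert() is a hypothetical helper, split out only so that the recording part can be exercised without hanging.

```cpp
#include <cassert>

// volatile, so the compiler must actually perform the stores.
char const * volatile oops_file = 0;
int volatile oops_line = 0;

// Records where the assertion fired.
inline void record_assert(char const *file, int line) {
    oops_file = file;
    oops_line = line;
}

// Assumed signature of the QP assertion callback.
extern "C" void Q_onAssert(char const * const file, int line) {
    record_assert(file, line);
    for (;;) {
        // Park here; inspect oops_file and oops_line in the debugger.
        // The volatile read keeps the loop from being optimized away.
        (void)oops_line;
    }
}
```

Because the globals are volatile, their values survive optimization and can be read in the debugger even when the function arguments themselves are no longer visible.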
Next step: inter-AMP DPP application
The usual DPP application has either a GUI showing the states of the philosophers, or an LED for each philosopher, or a QSpy text output. I will make my AMP system more interesting by sending messages between the Linux application and the real-time bare-metal application.