Apr 5, 2015

AMP state machine applications on Zynq

In a previous blog entry, I demonstrated how to read/write OCM from a Linux userspace application.  In another, I ran an LED-blink bare metal application on Zynq CPU1, launched from Linux running on CPU0.  In this article, I take the next step and demonstrate a shared-memory message queue in OCM.  This is an indispensable part of a practical AMP computing architecture, because hard real-time software needs high-level control (some UI) running on Linux (or Windows, if Xilinx ever supports Windows CE).

Dining Homer demo

The DPP (dining philosophers problem) is the demo of choice when porting the QP framework.  I showed a working DPP demo when porting QP to bare metal, and on Linux, integrated with Qt.  Here is the same DPP GUI, containing only 1 active object (the table object), running on the desktop-less Buildroot Linux distribution I built, cooperating with 5 instances of the philosopher (Homer) active objects on the bare metal side.

Making CPU1 hot-pluggable: 3-step startup/shutdown

In a previous blog, I started/stopped a bare metal application from a Linux shell command.  This makes the bare metal app on CPU1 hot-pluggable for any userspace Linux application.   Conversely, a Linux userspace application may start and stop at any time, so the CPU1 application cannot count on its CPU0 counterpart always being there.  A question naturally arises: how do I remove the assumption that the other side is still running?  Here are the techniques I have used in the past:
  • Master/slave: communication is initiated by the master.  If the slave does not respond, the master considers the slave dead.  The slave may also consider the master dead if it does not hear from the master within a watchdog timeout.  In practice, the slave runs mostly independently of the master, so whether the master is alive is usually irrelevant.
  • Fault tolerant networking: there are many networking options available to a Linux application, while the choice is more limited for the bare metal application.  The key for a networked application is to be tolerant of the connection going down silently.
The OCM is essentially an error-free communication channel between the 2 CPUs.  A circular queue on top of OCM would yield a reliable single-direction communication channel.

Without using a mutex, there is no way to prevent the reader from reading a garbage head index before the queue has been initialized.  So one side has to initialize the queue first; I will require the bare metal side to start and initialize the queue BEFORE the Linux app starts.

Startup sequence

  1. The zynq_remoteproc module is loaded on Linux boot, but if it has been unloaded (see the shutdown sequence below), the user can load it as root with a modprobe command. The zynq_remoteproc module I modified (originally TI/Xilinx code) does NOT kick off the CPU1 application on module load; the application is merely in a ready state.
  2. The Linux startup script can then start the CPU1 application by writing "1" to the zynq_remoteproc module's "up" attribute (a sketch of such a helper follows this list).  The CPU1 application then initializes the inter-AMP message queue and starts its state machines.  It CAN potentially start writing to the CPU1 --> CPU0 queue.
  3. The same Linux startup script can then start a privileged (because it needs to perform mmap to the OCM memory) userspace application that OVERLAYS the message queue on the OCM, without initializing the queue.  It can write to the CPU0 --> CPU1 queue.
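
For illustration only, here is a minimal sketch of a helper that performs step 2 of this sequence (and step 2 of the shutdown sequence below) by writing "1" or "0" to the module's "up" attribute.  The exact sysfs path of the attribute is not given in this article, so it is passed in as a parameter.

#include <fstream>
#include <string>

//Hedged sketch only: start ("1") or stop ("0") the CPU1 application by
//writing to whatever sysfs path the modified zynq_remoteproc module
//exposes as its "up" attribute (the path itself is an assumption here).
static bool setCpu1Running(const std::string& upAttrPath, bool run) {
    std::ofstream attr(upAttrPath.c_str());
    if (!attr) {
        return false;//not running as root, or the module is not loaded
    }
    attr << (run ? "1" : "0");
    return attr.good();
}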

Shutdown sequence

  1. Quit the userspace app, WITHOUT touching the inter-AMP message queue.  This is sufficient for the usual (and more frequent) case of developing the userspace application.
  2. If the real-time application has to be restarted for some reason, writing "0" to the zynq_remoteproc module's "up" attribute will merely STOP the CPU1 application (it does not call the destructors).
  3. Optionally, rmmod the zynq_remoteproc module to release the CPU1 firmware (ELF file), so that new FW can be written.

Learning about zero-copy event queue from the QP framework

The QP framework offers lots of ideas on how to implement a zero-copy event queue.  For the zero copy to work safely, memory pools for the events are created by the application, but registered to the framework for management.  When an event is needed, it can be loaned from a pool and then cast to QEvt, which is the parent of all QP events.

#define QF_EPOOL_GET_(p_, e_, m_) \
        ((e_) = static_cast<QEvt *>((p_).get((m_))))

When loaning an element from its pool, QMPool searches for the next free block within a critical section, which has to be effective across all CPUs that are part of the QP framework.  The event has to be returned to the framework when it is no longer used by any active object.  ONE of the places this garbage collection takes place is right after the event has been handled by all possible receivers ("act" in the POSIX port below):

// loop until m_thread is cleared in QActive::stop()
do {
    QEvt const *e = act->get_(); // wait for event
    act->dispatch(e); // dispatch to the active object's state machine
    gc(e); // check if the event is garbage, and collect it if so
} while (act->m_thread != static_cast<uint8_t>(0));

An event has a reference count that must be protected both when it is incremented (when the event gains a new user) and when it is decremented (when a user is done with it).  In QP, there is no assumption about which thread may use an event; an event may be chained (received and immediately forwarded somewhere else), for example.

An event queue just POINTS at these events (whether static, or dynamic and therefore garbage collected as shown above).  But the framework still needs an array of POINTERS to QEvt children, like this example:

   static QP::QEvt const *tableQueueSto[N_PHILO];

Internally, an event queue is initialized to manage this raw pointer array.

    m_eQueue.init(qSto, qLen);

An event queue belongs to only 1 thread.  The owner of the event queue waits for an event to be inserted if the queue is empty:

    QACTIVE_EQUEUE_WAIT_(this); // wait for event to arrive directly

which on POSIX is just a condition variable wait

while ((me_)->m_eQueue.m_frontEvt == static_cast<QEvt const *>(0)) \
    pthread_cond_wait(&(me_)->m_osObject, &QF_pThreadMutex_)

Note that IF there is already a pending event (such as in QK_sched_, called from the QK_irq interrupt handler), the owner does NOT wait, in order to drain the events as quickly as possible.  Conversely, an event sender signals the queue owner when it inserts an event into an empty queue:

if (m_eQueue.m_frontEvt == null_evt) {
    m_eQueue.m_frontEvt = e;      // deliver event directly
    QACTIVE_EQUEUE_SIGNAL_(this); // signal the event queue
}

Since there are usually multiple event senders (but only 1 receiver), the above code must be protected with a mutex.
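
For completeness, the signaling counterpart in the POSIX port boils down to a condition variable signal; paraphrased from memory (the exact macro may differ between QP versions), it is roughly:

//Roughly what QACTIVE_EQUEUE_SIGNAL_() expands to in the QP POSIX port
//(paraphrase, not verbatim QP source):
#define QACTIVE_EQUEUE_SIGNAL_(me_) \
    pthread_cond_signal(&(me_)->m_osObject)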

Adaptation of the QP event queue to AMP

In a previous blog, I worked out how to wake up the bare metal code running on CPU1 from a Linux userspace application on CPU0.  That code can be used in the EQUEUE_SIGNAL pseudo code above.  I also know how to wake up a Linux kernel module, but I have not yet written code to chain that to a userspace application.  At any rate, waking up a Linux userspace application with minimal latency is not all that important.  There are 2 bigger problems: protecting access to the queue, and protecting the reference count in an event.  In both cases, a mutex is necessary.  But there is no general-purpose mutex that can work in both Linux and bare metal code.  I could copy the Linux mutex code--which uses atomic_t under the hood--but that seems like a long row to hoe.  QP's event queue locks the queue on both insert and removal, so it cannot be used across the AMP boundary.  But if I constrain the AMP inter-process messaging problem, a cross-OS mutex becomes unnecessary: the bare metal code runs within an RTC (run-to-completion) single-threaded framework (such as QK or QVanilla).

A lockless queue is a special case of a circular queue under a single-writer, single-reader constraint: the writer only increments the tail, and the reader only increments the head.  If all writers are on the Linux side, a POSIX mutex can serialize them and protect the tail.  I can use this queue to send events from a Linux userspace app to a bare metal software ISR, which then turns around and forwards the events to the RTC application.  When the RTC framework handles a message, I know it is completely done, so I can garbage collect the event right there.  This allows an important simplification: an interrupt lock suffices to protect a bare metal side resource, whereas a full-blown mutex is required on the Linux side.  I drew up the Linux-to-bare-metal scenario and the opposite in 2 diagrams below.
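
To make the constraint concrete, here is a minimal, standalone illustration (NOT the actual AMP_CEQ shown later, which stores word offsets) of who is allowed to advance which index:

#include <stdint.h>

//Single-writer/single-reader ring: each index has exactly one writer,
//so no lock is needed between the producer and the consumer.
//(Overflow of a full ring is not handled in this illustration.)
struct SpscRing {
    volatile uint8_t m_tail;       //advanced ONLY by the producing side
    volatile uint8_t m_head;       //advanced ONLY by the consuming side
    volatile uint16_t m_slot[256]; //uint8_t indices wrap at 256 for free

    bool empty() const { return m_head == m_tail; }
    void put(uint16_t v) { m_slot[m_tail] = v; ++m_tail; } //producer only
    uint16_t take() { return m_slot[m_head++]; }            //consumer only
};
//Linux --> metal: the producers are Linux threads (serialized among
//themselves by a POSIX mutex); the single consumer is the bare metal side.
//Metal --> Linux: the producers are bare metal state machines (serialized
//by a brief interrupt lock); the single consumer is the Linux tick handler.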

I had to make 1 modification to QP (in qep.h) to put QEvt (its children, actually) in OCM: zero out poolId_ and refCtr_ in the constructor.  This is necessary because the bare metal side constructs the events with placement new over raw OCM memory, which is not zero-initialized the way ordinary static storage is:

        QEvt(QSignal const s) // QP originally left poolId_/refCtr_ uninitialized here
            : sig(s)
        {
            poolId_ = refCtr_ = 0;//good hygiene for overlaying on raw memory
        }


Turn off caching in OCM

As discussed in a previous blog, the Linux userspace application accesses the OCM through the Xilinx proprietary device driver, which exposes the 256 KB of OCM in /proc/iomem.  As shown in that blog entry, the userspace application can mmap the address 0xFFFC0000.  The mmap man page says:
A mapping created using /dev/mem will be uncached if it's above the top of RAM.
So the Linux side already turns off caching on OCM.  But to ensure the same for the bare metal side, I need to manually change the master MMU table (AKA the L1 translation table).  When porting QP to Zynq CPU1, I left the MMU table at its default.  After debugging the inter-CPU OCM messaging for a couple of days, I discovered that the problem was caching being enabled for the OCM range.  I was forced to modify translation_table.s:

//.word SECT + 0x4c0e /* S=b0 TEX=b100 AP=b11, Domain=b0, C=b1, B=b1 */
.word SECT + 0x4c02 /* S=b0 TEX=b100 AP=b11, Domain=b0, C=b0, B=b0 */
.set SECT, SECT+0x100000

The only change is clearing the C (cacheable) and B (bufferable) bits of the L1 section descriptor.

Circular event queue between Linux and bare metal

Like the QP event queue, my first event queue implementation stored pointers (to the events) in the array.  But unlike the bare metal side, where a pointer refers to physical memory, the Linux side holds VIRTUAL addresses.  Since the 2 sides see the SAME physical location at different addresses, the queue stores word OFFSETS instead of pointers, and I wrote 2 versions of the queue: the bare metal side computes the offset from the beginning of the physical OCM, while the Linux side computes it from the address returned by mmap.  Examine the bare metal side first:

Bare metal circular event queue

class AMP_CEQ {
    uint8_t m_head;
    volatile uint8_t m_tail;
    uint8_t m_mask, dummy;//to align to word
    //16-bit [word] offsets are enough for the 256 KB OCM
    volatile uint16_t m_offset_array[1<<8];//Keep everything real simple

public:
    inline void push(QP::QEvt* e) {
        //store the word offset from the OCM base; the uint8_t index wraps at 256
        uint16_t e_offset = (uint16_t)(((uint32_t)e - OCM_LOC) >> 2);
        m_offset_array[m_tail++] = e_offset;
    }
    inline const QP::QEvt* pop() {
        uint16_t e_offset = m_offset_array[m_head++];
        const QP::QEvt* e = (const QP::QEvt*)
                (((uint32_t)e_offset << 2) + OCM_LOC);
        return e;
    }
    inline bool empty() { return m_tail == m_head; }
    inline void init() {
        m_tail = m_head = 0;
        m_mask = 0xFF;
    }
private://don't allow ctor; the queue is overlaid on OCM and init()'ed instead
    AMP_CEQ() {}
};

Since QK is a single-stack, run-to-completion scheduler, I initially thought I could let the state machines send messages directly to Linux without any concurrency protection, like this:

            BSP_send2Linux(&BSP_staticHungryEvts[PHILO_ID(me)]);

which just shoves the event into m2lQ; the function is helpful in hiding the circular queue from the rest of the SW--the state machines just call a BSP service to push events to Linux.

void BSP_send2Linux(QP::QEvt* evt) {
    QF_INT_DISABLE();
    m2lQ->push(evt);//push atomically into L1 cache (NOT to the memory yet)
    QF_INT_ENABLE();
}

But when I thought about it more, I realized that QK's PREEMPTIVE nature breaks this assumption, which would otherwise hold under the QVanilla (no preemption; run to completion) scheduling.  So the quick interrupt locking and unlocking above is required for concurrency protection.

You might ask: why don't I raise a SW interrupt to Linux, so that it can start servicing the message as soon as possible?  Remember that the whole point of the AMP architecture is that the non-real-time Linux side and the hard-real-time bare metal side are loosely coupled.  There are no real-time guarantees about when the messages will be communicated.  So it's perfectly fine for the Linux Qt GUI to check for pending messages every system tick (as slow as 2 Hz, but it can easily run at 100 Hz or faster):

void QP::QF_onClockTick(void) {
    QP::QF::TICK_X(0U, &l_time_tick);

    while(!m2lQ->empty()) {
        //In this application, there is only 1 active object: DPP:AO_Table
        const DPP::TableEvt* e = static_cast<const DPP::TableEvt*>
                        (m2lQ->pop());
        DPP::AO_Table->POST(e, &l_dummyOnClockTick);
    }
}

In this simple AMP DPP application, the table is the only active object on the Linux side, so I can post directly to it.  But in general, I will have to PUBLISH the received event to the QP framework.
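
For reference, here is a hedged sketch of what the publish-based version of the loop above might look like (assuming the receiving active objects subscribe to the signals; this is not code from the demo):

    while(!m2lQ->empty()) {
        //deliver to every active object that subscribed to e->sig
        QP::QEvt const *e = m2lQ->pop();
        QP::QF::PUBLISH(e, &l_dummyOnClockTick);
    }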

Linux side circular event queue

This code is the same as above, except for using the globally stored l_ocm pointer, returned from mmap() during BSP_init():

static char* l_ocm = NULL;//(virtual) IO memory mapped OCM section
static int l_memf = -1;//and the Linux /dev/mem file backing it up
class AMP_CEQ {
    uint8_t m_head;
    volatile uint8_t m_tail;
    uint8_t m_mask, dummy;//to align to word
    //16 bit for [word] offsets are enough for 256 KB OCM
    volatile uint16_t m_offset_array[1<<8];//Keep everything real simple

public:
    inline void push(const QP::QEvt* e) {
        uint16_t e_offset = (uint16_t)(((uint32_t)e - (uint32_t)l_ocm) >> 2);
        m_offset_array[m_tail++] = e_offset;
        //m_tail &= m_mask;
    }
    inline const QP::QEvt* pop() {
        uint16_t e_offset = m_offset_array[m_head++];
        const QP::QEvt* e = (const QP::QEvt*)
                (((uint32_t)e_offset << 2) + (uint32_t)l_ocm);
        //m_head &= m_mask;
        return e;
    }
    inline bool empty() { return m_tail == m_head; }
private:
    AMP_CEQ() {}
};

static AMP_CEQ* l2mQ = NULL;
static AMP_CEQ* m2lQ = NULL;

Note that the 2 queues are owned by the bare metal side, so the Linux side just gets the pointers to the agreed-upon locations in BSP_init():

    //mmap OCM.  man mem says: "Byte addresses in /dev/mem are
    //interpreted as physical memory addresses."
    l_memf = open("/dev/mem"
            , O_RDWR /*| O_SYNC*/); //do I want caching?
    Q_ASSERT(l_memf > 0);

    //A mapping created using /dev/mem will be uncached if it's above
    //the top of RAM.  Also, Zynq OCM driver mapped this memory as non-cached
#define OCM_LOC 0xFFFC0000
#define OCM_SIZE (4*64*1024)
    l_ocm = (char*)mmap(NULL,//Tried specifying OCM_LOC, no luck
                        OCM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED /*| MAP_LOCKED*/,
                        l_memf, OCM_LOC);//0xe0080000);//
    char* pocm = l_ocm;
    l2mQ = (AMP_CEQ*)pocm; pocm += sizeof(AMP_CEQ);
    m2lQ = (AMP_CEQ*)pocm; pocm += sizeof(AMP_CEQ);

On the Linux side, MT (multi-thread) protection is required, so I use QP's critical section, which is a pthread mutex in the POSIX port:

void BSP_send2Metal(const QP::QEvt* evt) {
    //Multiple AO can send to Linux, so the write must be serialized
    QF_CRIT_ENTRY(dummy);
    l2mQ->push(evt);//push atomically
    QF_CRIT_EXIT(dummy);
    //Explicit DMB not required because mutex unlock should already do that
    //asm("DMB": : : "memory");//flush L1 for CPU0
}

Like the Linux reader, the bare metal BSP pops the messages from its timer tick handler.

...
QP::QF::TICK(&l_ISR_tick);
//This interrupt will NOT nest, so NO need to lock interrupt
while(!l2mQ->empty()) {
    const DPP::TableEvt* e = static_cast<const DPP::TableEvt*>(l2mQ->pop());
    AO_Philo[e->philoNum]->POST(e, &l_ISR_sw2);
}

And like the Linux side, this POST() will have to change to PUBLISH() soon.

Finding static events preallocated in OCM

Let's do the easy case first: static events.  A static event lives for the duration of the application--without changing its content (events are read-only to the event receivers).  Such an event can be overlaid on OCM, along with the event queues, like this:

#define BARE_METAL
#define OCM_SIZE (4*64*1024) //256KB too big?
#define OCM_LOC 0xFFFC0000


//Don't try to be too fancy in the global initializers; just pin the pointers to
//the right place; placement new will be called later on the instances in BSP_init
#define L2M_Q_LOC OCM_LOC //(M2L_Q_ARRAY_LOC + sizeof(QP::QEvt*) * (1<<M2L_Q_SIZE_EXP))
#define M2L_Q_LOC (L2M_Q_LOC      + sizeof(AMP_CEQ))
#define EAT_EVT_LOC (M2L_Q_LOC      + sizeof(AMP_CEQ))
#define HUNGRY_EVT_LOC (EAT_EVT_LOC    + sizeof(TableEvt) * N_PHILO)
#define DONE_EVT_LOC (HUNGRY_EVT_LOC + sizeof(TableEvt) * N_PHILO)

static AMP_CEQ *const l2mQ = (AMP_CEQ*)L2M_Q_LOC;
static AMP_CEQ *const m2lQ = (AMP_CEQ*)M2L_Q_LOC;
DPP::TableEvt* const BSP_staticEatEvts    = (DPP::TableEvt*)EAT_EVT_LOC;
DPP::TableEvt* const BSP_staticHungryEvts = (DPP::TableEvt*)HUNGRY_EVT_LOC;
DPP::TableEvt* const BSP_staticDoneEvts   = (DPP::TableEvt*)DONE_EVT_LOC;

Note that this just establishes the pointers.  The queue initialization and the static event placement new ctors are called in BSP_init():

    l2mQ->init();//new(l2mQ) AMP_CEQ();
    m2lQ->init();//new(m2lQ) AMP_CEQ();

    for(i=0; i < 128; ++i) {//For sanity test, shove fake addresses
        l2mQ->push((QP::QEvt*)(OCM_LOC + 4*i));
        m2lQ->push((QP::QEvt*)(OCM_LOC + 4*i));
    }

    for(i=0; i < N_PHILO; ++i) { //call the constructors for each event
        new(&BSP_staticEatEvts[i])    TableEvt(EAT_SIG, i);
        new(&BSP_staticHungryEvts[i]) TableEvt(HUNGRY_SIG, i);
        new(&BSP_staticDoneEvts[i])   TableEvt(DONE_SIG, i);
    }

Note the placement new semantics, which construct the events in place without calling malloc().

Memory pool in OCM

Since AMP events will NOT be contained within 1 QP system, I cannot blindly entrust some OCM block to QP's memory pool--not unless I come up with an AMP mutex.  An alternative is to make the memory pool circular, just like the event queue.  As long as an event is not (re)used for longer than it takes the memory pool's free pointer to wrap around, the event can be used as if it were a static event.  In fact, to ensure that the QP framework does NOT try to reclaim the memory on its own, the AMP event is declared as static to QP (by leaving poolId_ at 0, just like a static event).

Without the QP framework managing the reference count, I cannot think of a good way to indicate that an event is completely done being used.  But let me come back to this later.

Bare metal side memory pool

If I enforce that the memory to be loaned is word aligned, the circular memory pool is quite simple on the bare metal side, because I can hard-code where to place the memory pool:

//The pool should be word aligned
static uint32_t* BSP_m2lPool = (uint32_t*)(OCM_LOC + 1*64*1024);
static uint32_t* BSP_l2mPool = (uint32_t*)(OCM_LOC + 2*64*1024);

uint32_t* BSP_loanMemory(uint8_t wordSize) {
    static uint32_t* p = (uint32_t*)(OCM_LOC + 64*1024);//start of the m2l pool
    QF_INT_DISABLE();//vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    uint32_t* next = p + wordSize;
    if(next >= BSP_l2mPool) {//wrap back to the start of the m2l pool
        p = BSP_m2lPool;
        next = p + wordSize;
    }
    uint32_t* e = p;
    p = next;//move the pointer
    QF_INT_ENABLE();//^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return e;
}

A convenience macro can then just call the placement new constructor on any type that has a constructor.  Note that I divide the size of the type by 4 to get its size in words:

#define BSP_loanEvent(evtT_, sig_, ...) (new(BSP_loanMemory(sizeof(evtT_)>>2)) \
evtT_((sig_), ##__VA_ARGS__))

An active object can then loan an event from the event pool and just toss the event to the Linux side, like this example:

            TableEvt *pe = BSP_loanEvent(TableEvt, HUNGRY_SIG, PHILO_ID(me));
            BSP_send2Linux(pe);//&BSP_staticHungryEvts[PHILO_ID(me)]);

Linux side memory pool

The Linux side closely mirrors the bare metal side, except that, once again, it does not know a priori the base address of the OCM in virtual memory.  So the global pointers are initially NULL and are given their proper addresses in BSP_init().  Once initialized properly, the loan function works entirely within a critical section:

static uint32_t* BSP_l2mPool = NULL, *BSP_l2mPoolEnd = NULL, *BSP_l2mPoolPtr = NULL;

uint32_t* BSP_loanMemory(uint8_t wordSize) {
    QF_CRIT_ENTRY(dummy);//vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    uint32_t* next = BSP_l2mPoolPtr + wordSize;
    if(next >= BSP_l2mPoolEnd) {//wrap
        BSP_l2mPoolPtr = BSP_l2mPool;
        next = BSP_l2mPoolPtr + wordSize;
    }
    uint32_t* e = BSP_l2mPoolPtr;
    //qDebug("BSP_l2mPoolPtr = %p -> %p", BSP_l2mPoolPtr, next);
    BSP_l2mPoolPtr = next;//move the pointer
    QF_CRIT_EXIT(dummy);//^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return e;
}

The helper macro and its usage are the same as on the bare metal side.
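
For instance, a hypothetical Linux-side usage (mirroring the bare metal example above; the variable n stands for a philosopher number, and the namespace qualification is assumed) would look like this:

    //Hedged sketch: loan a TableEvt from the Linux-to-metal pool and ship it;
    //the bare metal tick ISR will then POST it to AO_Philo[n].
    DPP::TableEvt* pe = BSP_loanEvent(DPP::TableEvt, DPP::EAT_SIG, n);
    BSP_send2Metal(pe);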