Feb 26, 2015

Zynq AMP: Linux on CPU0 and bare metal on CPU1

When I first started playing around with the Zedboard, I set a goal to investigate ways to integrate all the computing that I've ever done on expensive hardware (I've never worked on something that sold for less than $100K--actually more like $500K) into an SoC.  Studying how to run a bare metal C application on each of the 2 Zynq ARM CPUs (xapp 1079) was the first step, and I learned about some of the Linux kernel and device drivers after that.  When I studied xapp 1079, I had trouble thoroughly understanding its companion reference app xapp 1078, in which the app on CPU1 is kicked off from Linux running on CPU0.  But my half-year long detour through the various Linux subsystems just paid off serendipitously, because I found a Linux kernel module that may obviate the need for xapp 1078 altogether (actually it will make xapp 1078 seem like a giant head-fake; maybe not as bad as James Clark's WebTV venture during the height of the dot-com boom, but still right up there).

remoteproc kernel module

There are 2 reasons to keep zynq_remoteproc as a module rather than compiling it into the kernel:
  1. Since I am hosting the root file system on NFS, this module should NOT start until the NFS rootfs is mounted.  Modules seem to start AFTER NFS mounting.
  2. To start/stop CPU1, this module should be probed and removed.
NOTE TO SELF: after modifying and compiling the kernel module and doing a modules_install, the modules still need to be copied to the NFS export!

When Xilinx made a marketing push to AMP (asymmetric multi-processing) a couple of years ago, they put out (rather quietly) an application note ug978 that launched FreeRTOS on CPU1 from Linux running on CPU0.  I will try to use zynq_remoteproc module--the specialization of the generic Linux remoteproc module--as verbatim as possible (<kernel>/drivers/remoteproc/zynq_remoteproc.c), to launch my own bare metal C++ application on CPU1.

Firstly, the module has to be built.  I added the following lines to my kernel defconfig:


Next, the kernel has to be told about my desire to use the zynq_remoteproc driver, through DTS.  I added the following entry in zynq-zed-adv7511.dts:

remoteproc@1 {
     compatible = "xlnx,zynq_remoteproc";
     reg = < 0x1FE00000 0x200000 >;
     interrupt-parent = <&gic>;
     interrupts = < 0 37 0 0 38 0 >;
     firmware = "cpu1app.elf";
     ipino = <0>; //The only free ipino
     vring0 = <2>;
     vring1 = <3>;
};

Here, I am telling the kernel that I want to use the last 2 MB (out of 512 MB available on Zedboard) of the RAM for the bare metal app running on CPU1.  Please recall that the memory was declared in zynq-zed.dtsi, which is included by zynq-zed-adv7511.dts:

memory {
     device_type = "memory";
     reg = <0x000000000 0x20000000>;
};

To constrain the Linux kernel to only 510 MB without having to change the above DTS entry, I add "mem=510M" in the U-Boot kernel bootargs.  Without it, the module cannot allocate coherent DMA mapping for the last 2 MB because the following code in zynq_remoteproc probe will fail (I tried it already):

ret = dma_declare_coherent_memory(&pdev->dev, local->mem_start,
local->mem_start, local->mem_end - local->mem_start + 1,

In Xilinx document ug978, the CPU1 application was placed in the boot partition, right next to BOOT.bin--which lives on my SD card.  For convenience during development, I want to put the application ELF file on the NFS export.  Many Linux distributions seem to put firmware in /lib/firmware, but according to the hard coded paths in the fw_path string array (<kernel>/drivers/base/firmware_class.c), /lib/firmware/updates/ is also a possibility, as well as a custom path specified in the "path" module parameter.  The /lib/firmware folder is conveniently accessible on my NFS host, making development iteration easier.

I can just compile this DTS in bash and move the DTB into the TFTP download folder, because I am downloading the kernel over TFTP:

~/work/zed/kernel/arch/arm/boot/dts$ ~/work/zed/kernel/scripts/dtc/dtc -I dts -O dtb -o zynq-zed-adv7511.dtb  zynq-zed-adv7511.dts
~/work/zed/kernel/arch/arm/boot/dts$ sudo mv zynq-zed-adv7511.dtb  /var/lib/tftpboot/

Of course, there is no cpu1app ELF file in /lib/firmware yet, BUT the modprobe fails for a different reason if ipino in the DTS is anything other than 0:

CPU0: IPI handler 0x5 already registered to ipi_cpu_stop
zynq_remoteproc 1fe00000.remoteproc: IPI handler already registered
zynq_remoteproc 1fe00000.remoteproc: Deleting the irq_list
CPU1: Booted secondary processor
CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
zynq_remoteproc 1fe00000.remoteproc: Can't power on cpu1 -1
zynq_remoteproc: probe of 1fe00000.remoteproc failed with error -1

This code is stopping the probe():

ret = set_ipi_handler(local->ipino, ipi_kick, "Firmware kick");
if (ret) {
    dev_err(&pdev->dev, "IPI handler already registered\n");
    goto irq_fault;
}

Reading set_ipi_handler(), I realized that 0 (IPI_WAKEUP) is the only available IPI handler number, so I changed DTS.  I do NOT plan to use virtio, so I simply commented out anything related to vring in zynq_remoteproc with CONFIG_ZYNQ_IPC #ifdef.

Simplest bare metal (actually uses the Xilinx stand-alone BSP) CPU1 application: blinks

Since bare metal AMP was demonstrated in xapp 1079, it may be easiest to pick up from there.  But briefly, building a stand-alone (no OS) application for CPU1 involves the following high-level steps:
  1. Create a standalone BSP specialized for AMP CPU1 (when creating the Xilinx BSP project in xsdk, select ps_cortexa9_1 as the CPU).  Since I did not install the FreeRTOS template, the only OS choice I get is standalone--hence the project name "standalone_bsp_1".
  2. Compile an ELF executable that targets CPU1, depends on the BSP just created above, and is hard coded to some load address.
Since the CPU1 BSP will NOT be used for FSBL, there is an opportunity to reduce the code size (compared to the CPU0 BSP) by NOT selecting any libraries, such as xilffs or xilrsa.
Since I am NOT interested in debugging the BSP, I have an opportunity to increase the optimization level and remove the debug (-g) flag in the BSP setting.  But this is important: the USE_AMP=1 preprocessor define in the BSP setting (right click on the BSP project in Eclipse --> Board Support Package settings) changes some BSP code from the default BSP:
  • GIC (Generic Interrupt Controller) distributor is disabled
  • L2 cache invalidation is disabled in boot.S, and instead, virtual address 0x20000000 is mapped to 0x0 and marked as non-cacheable (while MMU is disabled of course).  xapp 1079 comments this out, so I did too.
  • Recently, John McDougall added more AMP code in boot.S to:
    • Mark the Linux DDR region as unassigned/reserved to the MMU, which is a private resource of CPU1
    • Mark the CPU1 DDR as inner (L1) cached only
  • L2 cache is NOT turned back on (because it was not invalidated in the first place!)
Marking certain sections of the DDR as reserved and the last part of the DDR as inner cached only is done in boot.S, when USE_AMP=1:

#if USE_AMP==1
// /* In case of AMP, map virtual address 0x20000000 to 0x00000000  and mark it as non-cacheable */
// ldr r3, =0x1ff /* 512 entries to cover 512MB DDR */
// ldr r0, =TblBase /* MMU Table address in memory */
// add r0, r0, #0x800 /* Address of entry in MMU table, for 0x20000000 */
// ldr r2, =0x0c02 /* S=b0 TEX=b000 AP=b11, Domain=b0, C=b0, B=b0 */
// str r2, [r0] /* write the entry to MMU table */
// add r0, r0, #0x4 /* next entry in the table */
// add r2, r2, #0x100000 /* next section */
// subs r3, r3, #1
// bge mmu_loop /* loop till 512MB is covered */

/* Mark Linux DDR [0x00000000, 0x1FE00000) as unassigned/reserved */
ldr r3, =0x1fd  /* counter=509 to cover 510MB DDR */
ldr r0, =TblBase /* MMU Table address in memory */
ldr r2, =0x0000  /* S=b0 TEX=b000 AP=b00, Domain=b0, C=b0, B=b0 */
mmu_loop:
str r2, [r0]    /* write the entry to MMU table */
add r0, r0, #0x4 /* next entry in the table */
add r2, r2, #0x100000 /* next section */
subs r3, r3, #1     //counter--
bge mmu_loop    /* loop till Linux DDR MB covered */

/* Mark CPU1 DDR [0x1FE00000, 0x20000000) as inner cached only */
ldr r3, =0x1  /* counter=1 to cover 2MB DDR */
movw r2, #0x4de6  /* S=b0 TEX=b100 AP=b11, Domain=b1111, C=b0, B=b1 */
movt r2, #0x1FE0      /* S=b0, Section start for address 0x1FE00000 */
mmu_loop1:
str r2, [r0]    /* write the entry to MMU table */
add r0, r0, #0x4 /* next entry in the table */
add r2, r2, #0x100000 /* next section */
subs r3, r3, #1     //counter--
bge mmu_loop1    /* loop till CPU1 DDR MB is covered */
#endif

For the application, I copy the xapp 1079 CPU1 application as a new project "cpu1app" and start modifying.  Besides the application logic itself, the linker script (lscript.ld) specifies where the code/data sections will be placed in memory (in DDR, to be specific; the actual copying is done by CPU0, but that is not the concern of the linker script).  xapp 1079 reserved 0x02000000 through 0x02ffffff (16 MB) for CPU1, but as shown in the DTS above, I want to allocate CPU1 memory at 0x1FE00000.  So I change the ps7_ddr_0_S_AXI_BASEADDR location and size in the linker script editor, like this:

   ps7_ddr_0_S_AXI_BASEADDR : ORIGIN = 0x1fe00000, LENGTH = 0x200000

Since the linker places all sections into the DDR, there is no reason to even mention other on-chip memory (BRAM at 0x0 and OCM at 0xFFFC0000).  I don't know the correct stack and heap size yet, so I'll just leave them alone (8 KB each).


The simplest app I can think of is a blinker.  Recently, John McDougall introduced a sleep method using CPU1's private timer (which seems to be called SCU timer--I don't yet see the connection to the snoop control unit).  John McDougall's code for initializing the SCU timer and calling a sleep on it is in this download (in design/src/apps/app_cpu1/scu_sleep.[ch]).  My main() simply calls the SCU timer init and then sleep for 1 second over and over.

#include "xparameters.h"
#include "xgpiops.h"
#include "xstatus.h"

#define GPIO_DEVICE_ID XPAR_XGPIOPS_0_DEVICE_ID
#define LED_DELAY 10000000
#define OUTPUT_PIN 7 /* Pin connected to LED/Output */
XGpioPs Gpio; /* The driver instance for GPIO Device. */

static int GpioOutputExample(void)
{
    volatile int Delay;

    /* Configure the pin as an enabled output, initially low */
    XGpioPs_SetDirectionPin(&Gpio, OUTPUT_PIN, 1);
    XGpioPs_SetOutputEnablePin(&Gpio, OUTPUT_PIN, 1);
    XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x0);

    while(1) { /* Toggle forever, with a busy-wait between edges */
        XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x1);
        for (Delay = 0; Delay < LED_DELAY; Delay++);
        XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x0);
        for (Delay = 0; Delay < LED_DELAY; Delay++);
    }
    return XST_SUCCESS; /* not reached */
}

int main(void)
{
    int Status;
    XGpioPs_Config *ConfigPtr;

    ConfigPtr = XGpioPs_LookupConfig(GPIO_DEVICE_ID);
    Status = XGpioPs_CfgInitialize(&Gpio, ConfigPtr,
                                   ConfigPtr->BaseAddr);
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }
    Status = GpioOutputExample();
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }
    return XST_SUCCESS;
}

WITHOUT the USE_AMP=1 modifications I made to boot.S above, I can launch this program from xsdk (Xilinx SW development IDE), and I can see the blinking LED.

xsdk builds the ELF file with ease, and I moved that file into a new folder /lib/firmware within the NFS exported root for the target.  When I rebooted Zedboard, I was greeted with what seems like a minor success in dmesg output:

CPU1: shutdown
 remoteproc0: 1fe00000.remoteproc is available
 remoteproc0: Note: remoteproc is still under development and considered experimental.
 remoteproc0: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.

As dmesg suggests, Linux first shut down CPU1.  Silently, it tries to load the firmware through this chain: zynq_remoteproc_probe() --> rproc_add() --> rproc_add_virtio_devices() --> request_firmware_nowait() --> INIT_WORK(&fw_work->work, request_firmware_work_func) --> request_firmware_work_func() --> _request_firmware() --> fw_get_filesystem_firmware() --> fw_read_file_contents().  request_firmware_work_func() should also do post-FW load work (like booting the remote proc) through the fw_work->cont function pointer to rproc_fw_config_virtio(), but that is bombing out because rproc_find_rsc_table() (which resolves to rproc_elf_find_rsc_table()) cannot find a resource table in my ELF.

The debugger does NOT respond when CPU1 is halted (as in this case), so I had to rely on printk.  I came to appreciate the value of out-of-tree module compilation:

~/work/zed/kernel/drivers/remoteproc$ make -C /mnt/work/zed/buildroot/output/build/linux-custom ARCH=arm M=`pwd` modules

Having the target's modules folder on the NFS export (/export/root/zedbr2/lib/modules/3.15/kernel/drivers/remoteproc in this case) made the otherwise printk based debugging much faster (it still took a few days to navigate through all the source and try different hypotheses).  Finally, I realized that my executable does not have the .resource_table section the ELF loader is looking for.  I put a minimal resource table (a single entry; note that num=1 below) in its own section (which is what the remoteproc module looks for after the ELF loader parses the ELF file) in lscript.ld:

.resource_table : {
   __rtable_start = .;
   *(.rtable)
   __rtable_end = .;
} > ps7_ddr_0_S_AXI_BASEADDR

The C program can have the global data as the resource table content:

#include <stddef.h> /* offsetof */
#include "xil_types.h" /* u8, u32 */

#define __packed __attribute__((packed))
#define RAM_ADDR 0x1fe00000
struct resource_table {//Just copied from linux/remoteproc.h
    u32 ver;//Must be 1 for remoteproc module!
    u32 num;
    u32 reserved[2];
    u32 offset[1];
} __packed;
enum fw_resource_type {
    RSC_CARVEOUT = 0,
    RSC_DEVMEM = 1,
    RSC_TRACE = 2,
    RSC_VDEV = 3,
    RSC_MMU = 4,
};
struct fw_rsc_carveout {
    u32 type;//from struct fw_rsc_hdr
    u32 da;
    u32 pa;
    u32 len;
    u32 flags;
    u32 reserved;
    u8 name[32];
} __packed;

__attribute__ ((section (".rtable")))
const struct rproc_resource {
    struct resource_table base;
    //u32 offset[4];
    struct fw_rsc_carveout code_cout;
} ti_ipc_remoteproc_ResourceTable = {
    .base = { .ver = 1, .num = 1, .reserved = { 0, 0 },
        .offset = { offsetof(struct rproc_resource, code_cout) },
    },
    .code_cout = {
        .type = RSC_CARVEOUT, .da = RAM_ADDR, .pa = RAM_ADDR, .len = 1<<19,
        .flags=0, .reserved=0, .name="CPU1CODE",
    },
};

With this change, my program is copied to the correct location in the DRAM, and I can dynamically start/stop Linux on CPU1 by probing and removing the module, like this:

# rmmod zynq_remoteproc
# modprobe kernel/drivers/remoteproc/zynq_remoteproc.ko

This driver shows up in /sys/module/zynq_remoteproc/ and /sys/devices/1fe00000.remoteproc.  But the zynq_remoteproc probe does NOT call rproc_boot(); it merely loads the firmware.  Indeed, it cannot, because the firmware loading completes asynchronously from module probing.  Supposedly, the rpmsg module probe should call rproc_boot(), so I tried the following:

# modprobe kernel/drivers/rpmsg/virtio_rpmsg_bus.ko

But the module's probe still does NOT get called (note that I crossed out CONFIG_RPMSG=y in my defconfig above)!  I could not figure out how to get the virtio device probed, and for that matter, another determined engineer could not either, so I just added a single-threaded work queue to call rproc_boot() after the firmware is loaded.

struct zynq_rproc_pdata {
    struct irq_list mylist;
    struct rproc *rproc;
    u32 ipino;
    u32 vring0;
    u32 vring1;
    u32 mem_start;
    u32 mem_end;

    //Need my own workqueue rather than a shared work queue because I will block for completion
    struct workqueue_struct* wq;
    struct work_struct boot_work;
};

static void boot_cpu1(struct work_struct *work) {
    struct zynq_rproc_pdata* local =
        container_of(work, struct zynq_rproc_pdata, boot_work);
    struct rproc* rproc = local->rproc;
    int err;

    dev_info(&rproc->dev, "firmware_loading_complete\n");
    err = rproc_boot(rproc);
    dev_err(&rproc->dev, "rproc_boot %d\n", err);
}

static int zynq_remoteproc_probe(struct platform_device *pdev)
{
    /* ... */
    ret = rproc_add(local->rproc);
    if (ret) {
        dev_err(&pdev->dev, "rproc registration failed\n");
        goto rproc_fault;
    }

    INIT_WORK(&local->boot_work, boot_cpu1);
    local->wq = create_singlethread_workqueue("znq_remoteproc boot");
    if (!local->wq) { //returns NULL (not an ERR_PTR) on failure
        dev_err(&pdev->dev, "create_singlethread_workqueue failed\n");
        ret = -ENOMEM;
        goto rproc_fault;
    }
    queue_work(local->wq, &local->boot_work);
    /* ... */
}

static int zynq_remoteproc_remove(struct platform_device *pdev)
{
    struct zynq_rproc_pdata *local = platform_get_drvdata(pdev);
    u32 ret;

    dev_info(&pdev->dev, "%s\n", __func__);
    /* ... */
}

With this change, my cpu1app runs on boot:

 remoteproc0: firmware_loading_complete
 remoteproc0: powering up 1fe00000.remoteproc
 remoteproc0: Read /lib/firmware/cpu1app.elf 0
 remoteproc0: firmware: direct-loading firmware cpu1app.elf
 remoteproc0: assign_firmware_buf, flag 5 state 0
 remoteproc0: Booting fw image cpu1app.elf, size 150445
zynq_remoteproc 1fe00000.remoteproc: iommu not found
 remoteproc0: rsc: type 0
 remoteproc0: phdr: type 1 da 0x1fe00000 memsz 0xd890 filesz 0x8058
 remoteproc0: rproc_da_to_va 1fe00000 -->   (null) remoteproc0: rproc_da_to_va 1fe0800c -->   (null)
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_start
 remoteproc0: remote processor 1fe00000.remoteproc is now up

I can also debug my app in the xsdk JTAG debugger.  This debugger stack trace is proof that I am running Linux on CPU0 and my bare metal application on CPU1:

ARM Cortex-A9 MPCore #0 (Suspended)
0xc0020428 cpu_v7_do_idle(): arch/arm/mm/proc-v7.S, line 74
0xc0013d1c arm_cpuidle_simple_enter(): arch/arm/kernel/cpuidle.c, line 18
0xc03d08b8 cpuidle_enter_state(): drivers/cpuidle/cpuidle.c, line 104
0xc03d09ac cpuidle_enter(): drivers/cpuidle/cpuidle.c, line 159
0xc0060ad0 cpu_startup_entry(): kernel/sched/idle.c, line 154
0xc0573fac rest_init(): init/main.c, line 397
0xc07ebba4 start_kernel(): init/main.c, line 652
ARM Cortex-A9 MPCore #1 (Suspended)
0x1fe00594 GpioOutputExample(): ../src/xgpiops_polled_example.c, line 93
0x1fe005f4 main(): ../src/xgpiops_polled_example.c, line 113
0x1fe02264 _start()

rmmod zynq_remoteproc does not work; remove() method is not even getting called.  As a result, I cannot stop cpu1app; it just starts at the system bootup, and keeps running--which is OK for an embedded application.  Another approach would be to create another module that boots and stops zynq_remoteproc, but I don't know how to get a handle to the existing zynq_remoteproc instance...

Better alternative: provide "up" device attribute to read/write

If I provide a sysfs file for the userspace to write to, the firmware will probably have been loaded already by the time the user writes '1' to the attribute file.  So I created the store/show methods of "up" attribute as shown here:

ssize_t up_store(struct device *dev, struct device_attribute *attr,
        const char *buf, size_t count) {
    struct rproc *rproc = container_of(dev, struct rproc, dev);
    //struct platform_device *pdev = to_platform_device(dev);
    //struct zynq_rproc_pdata *local = platform_get_drvdata(pdev);
    if(buf[0] == '0') { //want to shut down
        rproc_shutdown(rproc);
    } else { // bring up
        rproc_boot(rproc);
    }
    return count;
}
static ssize_t up_show(struct device *dev,
        struct device_attribute *attr, char *buf) {
    struct rproc *rproc = container_of(dev, struct rproc, dev);
    return sprintf(buf, "%d\n", rproc->state);
}
static DEVICE_ATTR_RW(up);

And in probe, I can register this file:

... ret = rproc_add(local->rproc);
if (ret) {
    dev_err(&pdev->dev, "rproc registration failed\n");
    goto rproc_fault;
}

ret = device_create_file(&local->rproc->dev, &dev_attr_up);
return ret;

When I probe this module, I can read the "up" file

# cat  /sys/devices/1fe00000.remoteproc/remoteproc0/up

I then start the cpu1app by writing 1 to the file:

# echo 1 > /sys/devices/1fe00000.remoteproc/remoteproc0/up
 remoteproc0: powering up 1fe00000.remoteproc
 remoteproc0: Read /lib/firmware/cpu1app.elf 0
 remoteproc0: firmware: direct-loading firmware cpu1app.elf
 remoteproc0: assign_firmware_buf, flag 5 state 0
 remoteproc0: Booting fw image cpu1app.elf, size 150445
zynq_remoteproc 1fe00000.remoteproc: iommu not found
 remoteproc0: rsc: type 0
 remoteproc0: phdr: type 1 da 0x1fe00000 memsz 0xd890 filesz 0x8058
 remoteproc0: rproc_da_to_va 1fe00000 -->   (null) remoteproc0: rproc_da_to_va 1fe0800c -->   (null)
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_start
 remoteproc0: remote processor 1fe00000.remoteproc is now up

And the up file now reads 2, which means RPROC_RUNNING (and the LED is blinking!).

# cat  /sys/devices/1fe00000.remoteproc/remoteproc0/up

To stop CPU1, I have to do 2 things in succession: write 0 to the "up" file, and then remove the module:

# echo 0 > /sys/devices/1fe00000.remoteproc/remoteproc0/up
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_stop
 remoteproc0: stopped remote processor 1fe00000.remoteproc

# rmmod zynq_remoteproc
zynq_remoteproc 1fe00000.remoteproc: zynq_remoteproc_remove
zynq_remoteproc 1fe00000.remoteproc: Deleting the irq_list
 remoteproc0: releasing 1fe00000.remoteproc
CPU1: Booted secondary processor

At this point, Linux has been restarted on the 2nd processor; if I do things in this way, I can restart the app again by modprobing and then writing 1 to the "up" file again.


  1. Hi Henry,

    Just wanted to say thank you: your posts have been of invaluable help for my project, and especially this one.
    I have been able to follow your explanation and I now have a working modified remoteproc.

    Did you understand why zynq_remoteproc doesn't work out of the box? I mean: how is the original code supposed to operate in order to actually launch the code on CPU1? Without your clever modifications, the original code seems just useless...

    Also: do you know why we do have to explicitly add the resource table? With the lscript.ld I had, my ELF already contained the desired address in a segment. It seems to be redundant to have to add the rtable by hand in the C code, I am puzzled.

    Note: on my system the "up" file ends up in /sys/devices/soc0/1fe00000.remoteproc/... I don't know why "soc0" is added to the path.

    Small question: did you figure out how to use an arbitrary path to the ELF firmware without modifying code?

    Thanks again!

  2. Thanks for the kind words. I think that it did work initially, but the underlying remoteproc and virtio changed at some point. So I don't fault the zynq_remoteproc author. I agree that the whole mechanism seems like a Rube Goldberg exercise, but I haven't heard any discussion in the Linux community about simplifying the arrangement. The rtable mechanism can in theory do a lot for you--although I don't use any of it.

    The "soc0" is because your DTS is newer than mine. When I upgrade the ADI kernel to the latest (3.18), I got that as well.

    I think the path can be specified in either DTS or modprobe argument, but haven't looked into it.

    Do you mind sharing your usage of zynq_remoteproc? I haven't heard of anyone else besides me who used AMP...

  3. Hi Henry,
    Thanks for your reply. Yes, rube-goldberg… that’s my feeling too ;-)
    Before trying with remoteproc I tried a more “direct” approach, from user code, similar to this:
    While it basically works, my issue was that the 3 word “jump code” written at 0x00000000 was sometimes still in a cache somewhere at the instant where CPU1 was unstopped. Therefore CPU1 was basically executing whatever it saw at 0x00000000 instead of my jump code, about 1 time out of 5. So while the mechanism is correct, I didn’t figure out how to flush the cache. I tried to tweak the cache in the zynq settings and to sync/flush/etc. all the Linux caches I can think of, but without success. Other people seem to have the exact same issue, and I eventually gave up.
    Remoteproc is now working for me, but what I dislike in this architecture is that some piece of information has to be repeated in several places, leading to potential issues and maintenance overhead. For instance the address of my CPU1 code (your 0x1fe00000) is declared:
    1) in lscript.ld (as DDR baseaddr)
    2) in the “.rtable” declaration in my main.c
    3) in boot.S of the BSP
    4) in the remoteproc declaration inside DTS
    5) implicitly by the “mem=” of kernel bootargs
    6) in my user-space program in order to be able to find the “up” file…
    The fact that the firmware name is hardcoded inside the DTS is also not very convenient in my opinion… I would be happy to be able to launch an arbitrary ELF for testing purpose. And despite my efforts I didn’t figure out how to specify an alternate path to it. I am using symlink now.
    I feel that it could be possible to make things simpler and more compact… Maybe I’ll figure it out how when I’ll become more familiar with all this.
    Well, I’ll stop complaining, sorry, I guess you know all this ;-)
    I am not the owner of the project I am working on, so I cannot disclose a lot of details. But basically we have Software Defined Radio in the FPGA, real-time C++ bare-metal code on CPU1 (using a home-made state-machine framework) for operating it, and a .Net mono app under Linux on CPU0 to manage and give orders to CPU1 (via a home-made message queue). I am thus calling “modprobe remoteproc” from a C# application ;-)
    Thanks again a lot for your hard work and for sharing it! Your solution is so far the only one that works perfectly!

  4. Hello Henry, this blog post has helped me tremendously in getting AMP to run on the zynq. With your help, I have Linux running on CPU0 and got it to load an elf on CPU1. But my cpu1 app is not blinking the LED. Do you know why I'm getting this error "BUG: using smp_processor_id() in preemptible [00000000] code"? http://pastebin.com/hj286ivk Does this error have something to do with cpu1app not fully working? On my custom zynq board, the LED I'm toggling is connected to AXI GPIO, not PS GPIO, so my cpu1app is talking to an AXI GPIO device that's not listed in Linux's device tree.

    I've done pretty much everything in your blog post except modifying boot.S to modify the DDR cache states. I'm running Linux in the upper 256MB of DDR RAM and my cpu1 app is using 32MB of the lower 256MB DDR RAM. I'm not sure how to modify boot.S to fit my usage (or if it's necessary to do so). I would appreciate any advice you can give! Thanks.

    1. Hi Eric, I never got that KERN_ERR (obviously). I recommend that you first match my settings (510 MB for Linux and 2 MB for bare metal), and my boot.S. I suspect this is NOT memory related, so the other thing you could try (perhaps first) is setting a JTAG breakpoint on that KERN_ERR line (smp_processor_id.c line 42) and looking at the stack trace. I described JTAG debugging in Xilinx XSDK in another blog (http://henryomd.blogspot.com/2014/10/ways-to-study-linux-kernel-and-driver.html).

      Finally, if your application is really working OK (despite the KERN_ERR), I would personally just move on and come back to this some other day, because life is short.

    2. Also, what would be involved in using a higher ipino like 15? I see my Linux uses IPI 1-7. Was there a particular reason you were limited to 0 instead of 8 or higher? Thanks.

    3. Hello Henry, I did not see your reply before sending my 2nd reply. But yeah, if my cpu1 app were working, I also would move on to more important things. Good suggestion on duplicating your exact configuration first. I will do so tomorrow and see. How did you find out the file and line number from the stack trace? Thanks for your help.

    4. Arg, I am completely sidetracked with trying to give all 512MB of RAM back to Linux. I thought changing the ddr entry in the device tree from

      ps7_ddr_0: memory@0 {
      reg = <0x10000000 0x10000000>;

      back to

      ps7_ddr_0: memory@0 {
      reg = <0x0 0x20000000>;
      would do the trick, but Linux still only sees 256MB. It is "Ignoring memory below PHYS_OFFSET: 0x00000000-0x10000000". I can't remember where else I might have changed settings that affected memory.

    5. OK, I loaded the kernel to 0x100000 and it now sees 512MB (I forgot I was loading it at 0x1000_0000 before. I guess the kernel only sees memory above it?). And with mem=510M, it only sees 510MB.

      I now have AMP configured as you do here (510M/2M), with the boot.S changes as well. The KERN_ERR still occurs and cpu1app loads, but doesn't boot now:

      [ 3.323518] remoteproc0: firmware_loading_complete
      [ 3.328488] remoteproc0: powering up 1fe00000.remoteproc-test
      [ 3.334433] remoteproc0: Booting fw image cpu1app.elf, size 66576
      [ 3.345650] remoteproc0: bad phdr da 0x1fe00000 mem 0x8040
      [ 3.351142] remoteproc0: Failed to load program segments: -22
      [ 3.364925] remoteproc0: rproc_boot -22

      It seems like the KERN_ERR occurs because these checks here fail:

      And so it goes on to print the KERN_ERR. But I suspect it may not be entirely fatal because it goes on to execute code that would have been executed had the KERN_ERR not occurred, plus in my first AMP configuration (256M/32M), cpu1app got loaded and apparently booted despite the KERN_ERR (though it didn't toggle the LED as expected).

      I also noticed my elf file size is 170599, but remoteproc reports a fw image size of 66576. Does this size difference happen for you too? I am using Yocto to generate the kernel, device tree and rootfs images. The cpu1app elf is in the rootfs at /lib/firmware and the file size there is also the smaller 66576. I wonder if there's compression going on during the rootfs image creation. Are you using PetaLinux or something else?

      Anyway, sorry for polluting your blog with my progress. If you have the time and inclination, I'd appreciate any help you can offer.

    6. I was away on a training seminar the whole day, where it occurred to me that I *might* have seen the KERN_ERR way before I got the zynq_remoteproc working--possibly when I was playing around with interprocess interrupts. From your output, it appears that you arranged to start the cpu1app on startup. Are you also somehow interrupting Linux on CPU0 from CPU1--possibly using IPI? If so, I think I have seen that error before--I could be spacing out but I recall that things seemed to work for me even with the KERN_ERR.

      So if your application is working, I recommend you just move on--unless interrupting Linux from CPU1 is important to you. I know you probably spent a couple of frustrating days trying to debug this, but if you still have the gumption, I would really appreciate learning about your use case, because I decided not to use AMP interrupts--after getting it to work. I just thought about the extra code I would have to write on the Linux to take advantage of it, and decided it was not worth it.

      Please note that right now, I am always manually starting my cpu1app. I know I have to run an init.rc script at some point, so I would be curious in learning what you have done. I am sorry that for now, I am not going to roll up my sleeves on this, because I am trying to make progress on my other problems (trying to figure out video DMA).

      As for figuring out the kernel code line number, I use Eclipse, as I described in this article: http://henryomd.blogspot.com/2014/10/ways-to-study-linux-kernel-and-driver.html

    7. Hi Eric, I think sonatique's suggestion will probably do the trick. I vaguely remember that I saw that error message if my memory size is inconsistent somewhere; sonatique already remarked on the MANY places where the 510 MB appears in his earlier exchange with me. I know it's horrible, but I have bigger fish to fry--which is bringing up my system. To wit: when you say your application does NOT work, are you trying the blinker bare metal app first? If so, does your blinker app work if you run it stand-alone on CPU1 (i.e. do NOT run Linux on CPU1)? Did you try starting your app AFTER Linux completely boots?

    8. I meant Blogspot ate this comment...

    9. Blogspot keeps disappearing my comment. I submitted 4 or 5 times already. It will appear published but then goes away after a while. I put in pastebin here: http://pastebin.com/vqVC5Pnj Hope you can read it that way.

    10. Eric, if your LED was blinking when running standalone but does NOT when started from Linux, it could be because Linux is stopping the GPIO clock after it boots. I scratched my head for a few days on this problem, and then put the following code in my bare metal app:

      //Turn on the clocks I depend on (in case they are off)
      #define ZYNQ_APER_CLK_CTRL (*(volatile uint32_t*)0xF800012C) //SLCR APER_CLK_CTRL; needs <stdint.h>
      ZYNQ_APER_CLK_CTRL = ZYNQ_APER_CLK_CTRL //leave existing clocks alone!
          | (1u << 18)  //I2C0
          | (1u << 22); //GPIO
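
      The same read-modify-write can be wrapped in a small helper that takes the register pointer, which makes it easy to try off-target against a plain variable. This is only a sketch: aper_clk_enable and the APER_CLK_* names are made up for illustration, the I2C0 and GPIO bit positions are the ones from the snippet above, and the UART0 bit is an assumption that should be checked against the Zynq-7000 TRM.

```c
#include <stdint.h>

/* Clock-enable bits in the Zynq SLCR APER_CLK_CTRL register (0xF800012C).
 * I2C0 and GPIO come from the snippet above. */
#define APER_CLK_I2C0   (1u << 18)
#define APER_CLK_UART0  (1u << 20)  /* assumption: verify in the TRM */
#define APER_CLK_GPIO   (1u << 22)

/* Hypothetical helper: turn on the requested clocks without clearing any
 * that are already enabled. Returns the value that was written. */
static inline uint32_t aper_clk_enable(volatile uint32_t *reg, uint32_t mask)
{
    uint32_t value = *reg | mask;  /* leave existing clocks alone */
    *reg = value;
    return value;
}
```

      On the target, the call would look like aper_clk_enable((volatile uint32_t *)0xF800012C, APER_CLK_I2C0 | APER_CLK_GPIO).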

    11. Hi Henry, I am using AXI GPIO device to toggle LED, not the PS GPIO. But I did extend my cpu1app to write a character to UART0 as well, so I enabled the UART0 clock (and the GPIO clock anyway) with your code. I was very hopeful, but that didn't help. It's so frustrating because I feel so close, all indications are that cpu1app is loaded and running (rproc state is 2 - RPROC_RUNNING), and yet nothing. I'm going to go bang my head on this some more and try to set up the debugger in order to see what's going on in CPU1. Maybe I need to enable some AXI clock. I am very appreciative of all your help and suggestions!

      Just to get it on record, I found some answers to my previous questions:
      The cpu1app ELF size is ~160K. Remoteproc reports loading a firmware image size of 66576. This is because all the debug info and symbols were stripped from the ELF during the rootfs image creation process.

      The remoteproc load failure (bad phdr) I was getting in reply dated May 20, 2015 at 3:41 PM was due to me forgetting to change the RAM address in one of the many places it needs to be specified. In this case, in the resource table.
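
      The kind of consistency check that fails when one of those addresses is stale can be sketched as follows. This is not the remoteproc code itself; struct carveout and segment_fits are hypothetical names, and the example values (base 0x1fe00000, size 2 MB) are the ones used for the 510M/2M split discussed in these comments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the DRAM region set aside for CPU1. */
struct carveout {
    uint32_t base;  /* e.g. 0x1fe00000 for a 510M/2M split */
    uint32_t size;  /* e.g. 0x00200000 (2 MB) */
};

/* True if the segment [addr, addr + memsz) lies entirely inside the
 * carveout. Written to avoid overflow in addr + memsz. */
static bool segment_fits(const struct carveout *c, uint32_t addr, uint32_t memsz)
{
    if (addr < c->base)
        return false;
    uint32_t offset = addr - c->base;
    return offset <= c->size && memsz <= c->size - offset;
}
```

      A segment linked for the old 256M/256M address (0x10000000) fails this test against a carveout at 0x1fe00000, which is the flavor of mismatch described above.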

    12. Hi Eric, so the code does execute, but you just don't see the AXI GPIO LED? If that is the only thing that seems to be broken, can you possibly drive this (or perhaps another) LED from PS GPIO--as a sanity check?

    13. As far as I can tell, the code is executing. From the Linux CPU0 side, remoteproc is reporting success in loading firmware and booting CPU1, and the rproc state is RUNNING. I don't have any direct insights into the CPU1 side though so far. I was thinking of getting the debugger attached to the running CPUs and see what CPU1 is doing, which is probably a whole exercise in and of itself.

      On the custom zynq board I'm using, there is one lone PS GPIO not connected to anything that I might be able to get at though and I am considering doing exactly as you said at this point. All of the other PS MIO pins are used for USB, ethernet, etc. But my cpu1app is also driving PS UART0 now, and no activity there either (I set ps7_uart_0 as compatible="invalid" in the device tree. Does that make Linux ignore it so it's free for CPU1 to use?).

    14. I'm still pounding away at this. I was able to get the Xilinx debugger attached. Found out as soon as I do modprobe zynq_remoteproc, CPU1 gets suspended, with a stack trace at addresses 0xffff0010 and 0xffff000c. I also found when I try to view memory at 0x1fe00000 or 0xffff000c, I get "MMU section translation fault". Sometimes I get "Data read abort, Fault status 0x5, domain 0xE" for 0x1fe00000. Assuming the fault status is from the Data Fault Status Register, 0x5 means First Level Translation MMU Fault. I take this to mean something is wrong with my MMU configuration. Maybe Linux is still protecting memory at 0x1fe00000? I do have your changes in boot.S. And I don't know why cpu1app goes to OCM at those 0xffffxxxx addresses, but maybe that's moot because of the MMU config.

      Anyway, I think CPU1 isn't able to read the cpu1app elf because of some MMU misconfiguration or protection still in place. Would appreciate any thoughts you might have on where to go with this.
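
      For reference, the fault status reading above can be reproduced with a small decoder. This sketch assumes the ARMv7-A short-descriptor format (FS[3:0] in DFSR[3:0], FS[4] in DFSR[10], domain in DFSR[7:4]) and only names a handful of common encodings; the function names are made up for illustration, and anything not listed should be looked up in the ARM Architecture Reference Manual.

```c
#include <stdint.h>

/* Fault status: FS[3:0] is DFSR[3:0], FS[4] is DFSR[10]. */
static inline unsigned dfsr_fs(uint32_t dfsr)
{
    return (dfsr & 0xFu) | ((dfsr >> 6) & 0x10u);
}

/* Domain of the faulting access: DFSR[7:4]. */
static inline unsigned dfsr_domain(uint32_t dfsr)
{
    return (dfsr >> 4) & 0xFu;
}

/* Human-readable name for a few common short-descriptor fault codes. */
static const char *dfsr_describe(uint32_t dfsr)
{
    switch (dfsr_fs(dfsr)) {
    case 0x01: return "alignment fault";
    case 0x05: return "translation fault, section (first level)";
    case 0x07: return "translation fault, page (second level)";
    case 0x09: return "domain fault, section";
    case 0x0B: return "domain fault, page";
    case 0x0D: return "permission fault, section";
    case 0x0F: return "permission fault, page";
    default:   return "other (see the ARM Architecture Reference Manual)";
    }
}
```

      A DFSR value of 0xE5 decodes to fault status 0x5 with domain 0xE, matching the debugger message quoted above.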

    15. Hi Eric, sorry I was pounding away on my own project, trying to beat Vivado and QtCreator into submission.

      I think I got the same error you did, but I think it's expected: if you followed my CPU1 bootup code (originally John McDougall's), it turns off the L2 cache for CPU1's portion of the DRAM, and OCM caching is turned off altogether. I guess I don't understand what you are trying to debug here. The JTAG debugger is finicky (but even that is better than the gdb software debugger); when I was bringing up my system, I found that I had to HALT CPU1 before I could set a HW breakpoint that actually breaks--most of the time. Often, xsdk on Ubuntu would just hang, and I would have to restart it. As a last resort, I will move an infinite loop around in startup.S, or my C program, and try to bisect the problematic code. I ASSUME you've seen my other blog entry on low level JTAG debugging: http://henryomd.blogspot.com/2014/10/ways-to-study-linux-kernel-and-driver.html

      Restricting the debugger scope to only CPU1 may help, as well as deleting the symbol file and attaching it to a HALTED CPU1. I don't know how else to help you other than sitting down together over a GoToMeeting. But have hope; I found the low level difficult at first, but after I settled on a repetitive process, it has been pretty stable for me.

    16. Please don't apologize, I appreciate your patience with me and having someone to bounce ideas off of. Your GoToMeeting offer is incredibly kind, but I'm not at the end of my rope yet, though I hope I can reserve it for when I am! =) To help your understandable confusion, I am not really debugging anything with the debugger, just looking at the CPU states and the general vicinity of the PC when I halt it. That's how I discovered CPU1 gets suspended whenever I did a modprobe zynq_remoteproc.

      I tend to think the MMU translation errors are not the cause of my problem anymore because I studied XAPP1079 and XAPP1078 a lot more in-depth and found some discrepancies in my boot.S. The boot.S in those app notes modified not only the MMU settings, but also the very start of boot.S, adding some cpu0_catch/cpu1_catch mechanism to determine where to jump to based on which CPU is running. I didn't have that in my boot.S and that led me to find some issues related to that. I don't know how this can happen because it is using a BSP for ps_1, but it looks like my cpu1app was somehow compiled with XPAR_CPU_ID = 0, so when it runs the boot code, it will only let CPU0 continue, but since it's running on CPU1, it gets branched to an endless WFE loop. I think this may be the cause of the CPU1 suspension I'm seeing. I figured this out debugging the XAPP1079 bare-metal/bare-metal configuration though and fixed it (by getting rid of that CPU check code). I was so sure this would also fix my Linux/bare-metal AMP problems, but CPU1 is still getting suspended once I do modprobe zynq_remoteproc on the Linux/bare-metal AMP configuration. I ran out of time before the weekend to verify the fixed boot code was used, but I'm pretty sure it was. I hope to figure this out or find out more on Monday.
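
      The CPU gate described above can be modeled in a few lines of C. This is a sketch of the logic only, under the assumption that the boot code compares the low bits of MPIDR against the CPU number the BSP was built for; the function names are hypothetical, and the real check lives in assembly in boot.S.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPU number as the boot code sees it: the low bits of MPIDR
 * (0 for CPU0, 1 for CPU1 on the dual-core Zynq-7000). */
static inline unsigned cpu_id_from_mpidr(uint32_t mpidr)
{
    return mpidr & 0xFu;
}

/* Model of the boot-time gate: only the CPU the image was built for
 * (XPAR_CPU_ID in the BSP) may continue; any other CPU parks in WFE. */
static inline bool boot_gate_allows(uint32_t mpidr, unsigned built_for_cpu)
{
    return cpu_id_from_mpidr(mpidr) == built_for_cpu;
}
```

      With an image built for CPU 0, boot_gate_allows() returns false when the MPIDR low bits read 1, which is the endless WFE loop case described above.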

      Do you know exactly how zynq_remoteproc gets CPU1 to start execution at the cpu1app elf start address? When Linux releases CPU1, what does CPU1 then do? I imagine Linux has to somehow tell it an address at which to start execution. I know on entire system reset, CPU1 looks for non-zero address at 0xFFFF_FFF0 to jump to. I wonder if the same happens if only CPU1 is reset and zynq_remoteproc just resets CPU1 and writes the cpu1app elf address to 0xFFFF_FFF0, but then again, I don't think it can because that whole process on entire system reset puts a bit of boot code in the upper reaches of OCM for CPU1 to use (put there by the boot ROM), so there's no guarantee OCM is not in use by the time Linux is up and running.

    17. It uses a trampoline mechanism too. I didn't look into all that much detail once I got my CPU1 application booted up. It may be helpful for you to take a look at my CPU1 startup.S and ldscript, but I hesitate because I only enable 1 interrupt (IRQ) besides the boot. Have a look at this if you just want to do what I did and move on: http://henryomd.blogspot.com/2015/03/porting-qpcpp-to-zynq-amp-cpu1.html

    18. So I finally got my cpu1app to work! It turns out in the zynq_remoteproc module, zynq_rproc_start() calls zynq_cpun_start(0,1). I had to change that to zynq_cpun_start(0x1fe00000,1) so that CPU1 would start at the correct address. I suppose THAT'S why the memory remapping MMU code was in boot.S. But thank you Henry for this blog post and all your help in the comments.

    19. Henry, why do you suppose it is that the zynq_remoteproc module doesn't work as provided by Xilinx? To actually start running the firmware, you had to add the work queue stuff and/or the "up" attribute. Is it just alpha quality software? It's been around for at least 3 years now, so it surely must be working for others as is.

    20. Hi Eric, I think only someone like Linus Torvalds can make a credible appraisal of the quality of kernel modules. I learned from the Free Electrons slides that the Linux community has invisible walls, and it takes time and effort to get into the community--which may be worth the effort if one will continue to engage with the Linux kernel/driver code. A few people at Xilinx, ADI, and TI are probably already in that select group. To application people like us, all the bells and whistles that these talented people throw into the kernel and any module/driver may seem excessive at times. I think of the situation as similar to how the AI community was dominated (hijacked?) by mathematicians for decades until fairly recently: it is unfortunate, but individuals like me can only hope to selectively derive whatever benefit we can, after (heavy) filtering of their work. Applied to the current zynq_remoteproc module, I don't think it will become much simpler unless TI and ADI want to give up the idea of setting up their DSPs completely through this module. I tend to get into arguments with (high level) SW people because I keep pushing them to abandon (what I consider) unnecessary features, but starting a few years ago, my new strategy is to avoid the unproductive arguments and just do the whole thing myself--the way I think it can be done (meaning: minimally). That is why I am not going to spend more time on this kernel module, because I am busy working on the lower level (FPGA and sensors) and the high level (image processing and UI) by myself.

      To answer your direct question, when you Google for remoteproc, there aren't many hits. I actually corresponded with another engineer (I think the link is in this blog somewhere) who wrote YET another kernel module, to use this remoteproc module with his TI DSP. I would be interested in finding out how many people out there actually use this kernel module the way TI/ADI/Xilinx thought it would be used; I suspect not many.

  5. Kernel "mem=" bootargs, maybe?

    1. Thanks for the suggestion. It turns out I was still loading the kernel at 0x1000_0000 from back when I was splitting the RAM 256M/256M. To do like in this blog post, with a 510M/2M RAM split, I needed to load the kernel at 0x100000. After that, Linux saw all 512MB of RAM, and with "mem=510M", sees 510MB.
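
      The arithmetic behind the split is worth writing down once, since the resulting base address has to match everywhere it appears. A minimal sketch, assuming DRAM starts at physical address 0x0 and that Linux gets the low part; struct amp_split and split_dram are names invented for this example.

```c
#include <stdint.h>

#define MIB (1024u * 1024u)

/* Result of dividing DRAM between Linux (low part) and CPU1 (high part). */
struct amp_split {
    uint32_t linux_size;  /* bytes given to Linux via "mem=" */
    uint32_t cpu1_base;   /* first physical address owned by CPU1 */
    uint32_t cpu1_size;   /* bytes left over for the CPU1 application */
};

/* Assumes DRAM starts at physical 0x0 and linux_mb <= total_mb. */
static struct amp_split split_dram(uint32_t total_mb, uint32_t linux_mb)
{
    struct amp_split s;
    s.linux_size = linux_mb * MIB;
    s.cpu1_base  = s.linux_size;
    s.cpu1_size  = (total_mb - linux_mb) * MIB;
    return s;
}
```

      For 512 MB total and "mem=510M" this gives a CPU1 base of 0x1fe00000 and a size of 0x200000, while the older 256M/256M split gives a base of 0x10000000.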

  6. Hi Henry, I also had a question about your boot.S mods. The original code maps virtual address 0x2000_0000 to 0x0000_0000. Am I correct that your modified code doesn't do this? Your code just alters the cacheability settings for [0x0000_0000, 0x2000_0000)? Do you know what is the original purpose of mapping the virtual address 0x2000_0000 to 0x0?

    1. I think it's to allow programming cpu1app (by which I mean whatever code you run on CPU1 in a bare metal mode) to use an address map that starts at 0x0. I actually just inherited John McDougall's approach in xapp1079, where he did NOT do the address remapping. I actually cannot think of a good reason to do the address translation; I just find dealing with the actual physical address to be much easier.
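
      The difference between remapping and an identity map comes down to which physical base is written into a first-level table entry. A sketch assuming the ARMv7-A short-descriptor format with 1 MB sections; the helper names are made up, and the attribute bits (cacheability, access permissions, domain) are left to the caller because the exact values used in boot.S are not reproduced here.

```c
#include <stdint.h>

/* First-level table: 4096 entries, each covering 1 MB of virtual space. */
static inline uint32_t l1_index(uint32_t va)
{
    return va >> 20;
}

/* Build a section entry pointing at the 1 MB physical section containing
 * pa. Bits [1:0] = 0b10 mark a section; bit 18 stays 0 (not a
 * supersection). attrs supplies the remaining attribute bits. */
static inline uint32_t l1_section(uint32_t pa, uint32_t attrs)
{
    return (pa & 0xFFF00000u) | (attrs & 0x000BFFFCu) | 0x2u;
}

/* Physical base address that a section entry maps to. */
static inline uint32_t l1_section_base(uint32_t entry)
{
    return entry & 0xFFF00000u;
}
```

      Remapping virtual 0x2000_0000 to physical 0x0 means table[l1_index(0x20000000)] = l1_section(0x0, attrs); an identity map instead uses l1_section(va, attrs) for every entry, so only the attribute bits differ from one region to another.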

  7. Hello Henry,

    I would like to use Ubuntu Linux on Core0 and Bare Metal on Core1. Do you have your project files uploaded to github? I also would like to use vivado 2015.1 and build the SW without the SDK, just via make (+ Makefile). Is that possible?

  8. Probably inadequate answer to your questions:

    1. No github. I'd rather push my zynq_remoteproc mods to the ADI kernel branch, but they actually want to mainline their work, so I don't know how things will shake out. I don't think I have enough visibility with the ADI/Xilinx kernel developers (Lars, Michal, etc) to even send a pull request. In the past, when they convinced themselves about the argument I made, they would just make the change on their end. I don't see the benefit of making a github project for a blinking light bare metal code running at DRAM 0x1fe00000 (510 MB)--started by the zynq_remoteproc. My "up" file addition is such an easy change for anyone. I COULD put out my diff file, but then you would have to know how to apply the patch to YOUR kernel, OR know how to use something like Buildroot/Yocto. In short, I expect my project will be applicable to myself only, as everyone else will have his own setup, and can read my blog entries to quickly figure out what he needs to do.

    2. Vivado 2015.1: good luck; I am stuck on 2014.4 because my EE and SW colleagues are on 2014.4.

    3. No SDK route: probably possible, but not interested in figuring it out right now. I want to know if you get that working.

  9. Thanks for your quick reply!

    Why the ADI kernel branch?
    I was referring to your own github repo... so that people can directly try out the code samples. What do you think?

    How much effort is it to adapt your Vivado 2014 setup to the 2015.1 changes... any experience? And later on, to remove the SDK dependency?

  10. Hi Henry,

    Thanks a lot for all this extremely valuable information!

    I've adapted XAPP1078 and your instructions to Vivado 2015.1 and the Red Pitaya board:




  11. Hello Mr. Henry,
    Very nice post. Thank you very much. I am working on a similar solution, but I want to start CPU1 from U-Boot, so that my secondary core is already running before Linux boots up. I want to use only rpmsg for communication between the cores. Do I still need remoteproc?

    1. If you will NEVER restart CPU1 app, then you don't need remoteproc--if you can avoid the dependency of rpmsg on remoteproc. I should think that you can design your own message format and just stay away from the whole remoteproc infrastructure.

      I am curious: why do you HAVE to start bare metal first?

    2. Hi Henry,

      Many thanks for your reply..
      I am working on a solution where I am using an automotive RTOS that should communicate with the vehicle CAN network within a few milliseconds. That is the reason I am starting the RTOS first on core1. Once Linux boots up, it has to inform the RTOS that it is up, and based on requests from Linux, the RTOS can share vehicle data; Linux then sends it to the rest of the world or even displays it to the driver.

      I am new to this remoteproc framework. Based on my scenario as explained above, could you give some input on how and where I need to update the rpmsg modules to remove the dependency on remoteproc? Is it in the zynq_remoteproc module, where we set up interrupts? Or can I skip loading remoteproc completely?

      Best regards,
      Diwakar Reddy

    3. In that case, you should NOT use the remote proc framework. You should instead figure out booting your own RTOS FIRST from U-Boot, and THEN proceeding to boot Linux. It might just be easier to run your RTOS on CPU0. RTOS <--> Linux comm over OCM doesn't have to be complicated; see my other post on that topic: http://henryomd.blogspot.com/2015/04/amp-state-machine-applications-on-zynq.html

      I haven't looked deeply into U-Boot, but figuring that out will be another challenge entirely for you. I suppose you are really sure about the requirement of the RTOS having to control your car within a few ms, because if you could wait a few seconds, you could potentially just use my solution. Consider that a full blown Chromebook cold boots in < 10 seconds. Waking back up from sleep takes a second or two. Would a customer really want to drive a car within 2 seconds of pressing the power button?
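
      To give a feel for how little is needed for communication over shared memory, here is a sketch of a one-slot mailbox. It is not taken from the linked post: the struct layout and function names are invented for illustration, it assumes a single sender and a single receiver, and it assumes the memory is mapped non-cached on both sides (on real hardware a data memory barrier would also be needed where marked).

```c
#include <stdbool.h>
#include <stdint.h>

#define MBOX_PAYLOAD 16u

/* Hypothetical one-slot mailbox placed in non-cached shared memory. */
struct mbox {
    volatile uint32_t seq_written;  /* bumped by the sender after the payload */
    volatile uint32_t seq_read;     /* updated by the receiver after copying  */
    volatile uint8_t  payload[MBOX_PAYLOAD];
};

/* Sender side: fails if the previous message has not been consumed yet. */
static bool mbox_send(struct mbox *m, const uint8_t *data, uint32_t len)
{
    if (len > MBOX_PAYLOAD || m->seq_written != m->seq_read)
        return false;
    for (uint32_t i = 0; i < len; i++)
        m->payload[i] = data[i];
    /* On the target, a data memory barrier (DMB) belongs here. */
    m->seq_written = m->seq_written + 1u;
    return true;
}

/* Receiver side: fails if there is nothing new to read. */
static bool mbox_recv(struct mbox *m, uint8_t *out, uint32_t len)
{
    if (len > MBOX_PAYLOAD || m->seq_written == m->seq_read)
        return false;
    for (uint32_t i = 0; i < len; i++)
        out[i] = m->payload[i];
    m->seq_read = m->seq_written;   /* frees the slot for the sender */
    return true;
}
```

      On the target, the struct would sit at an agreed OCM address that both the RTOS and a Linux mapping point to; that address is left out here on purpose because it depends on the system.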

    4. Hello Henry,
      Now I am able to load the RTOS from U-Boot and send the first CAN message within 370 ms, before Linux boots up. As you suggested, I have modified the remoteproc/rpmsg framework and am able to pass the CAN data from the RTOS to Linux. Thanks a lot for the help I got from all your posts about the Zynq hardware. Without that it would have taken more time than expected. Now I will try message queues and the other solutions you suggested to make it simple.
      Best regards,
      Diwakar Reddy

    5. Thanks for the update Diwakar,
      I am trying to learn u-boot these days. Did you write up anything about how you loaded your own RTOS (and then Linux) in u-boot? It would be great if you can share it.

  12. Hi Henry,
    Thanks a lot.
    In fact, if the ECU(zynq controller) does not respond within 500 ms, vehicle network assumes that it is not available because it has to send some diag messages. I would like to try the solution suggested by you..booting rtos on cpu0 and then Linux on cpu1 and then using OCM for communication. I went through the uboot code and I felt it is possible to do it that way. I will post it here once it is done.
    Best regards,
    Diwakar reddy

  13. This comment has been removed by the author.

  14. If you want to stick with Linux on CPU0, you can do some processing in the Zynq first stage bootloader (FSBL), which loads U-Boot.

    1. I assume you are talking about doing that work in the FSBL. That's brilliant! Perhaps the bare metal app running in the FSBL can send some signal to the rest of the system that it is healthy.

      But when I thought about it again, I wondered if Diwakar's system could tolerate not hearing back from the ARM CPU for the few seconds at least that it takes to boot Linux and then start the CPU1 bare metal app through the zynq_remoteproc.

  15. I have followed the above instructions and I have the same environment (petalinux 2014.4 + zed-board). I changed the call to zynq_cpun_start(0x1fe00000,1). When I modprobe zynq_remoteproc, I get a message saying CPU1 is now up. I don't know whether to trust that, as the LED does not blink. I also added a bit of code from xapp1078 to increment a variable in OCM, but it's not changing either. In short, I don't think the second processor is running.

    1. Should I assume that your blinker app works in a stand-alone configuration? If you read the comments of some of the people who had the same problem, getting the CPU1 load address consistently everywhere can be challenging.
