Henry Choi: Zynq AMP: Linux on CPU0 and bare metal on CPU1

When I first started playing around with Zedboard, I set a goal to investigate ways to integrate all the computing I've ever done on expensive hardware (I've never worked on something that sold for less than $100K; actually more like $500K) into an SoC. Studying how to run 2 bare metal C applications, one on each Zynq ARM CPU (xapp 1079), was the first step, and I learned about some of the Linux kernel and device drivers after that. When I studied xapp 1079, I had trouble thoroughly understanding its companion reference app xapp 1078, in which the app on CPU1 is kicked off from Linux running on CPU0. But my half-year long detour through the various Linux subsystems just paid off serendipitously, because I found a Linux kernel module that may obviate the need for xapp 1078 altogether (actually, it will make xapp 1078 seem like a giant head-fake; maybe not as bad as James Clark's WebTV venture during the height of the dot-com boom, but still right up there).

remoteproc kernel module

There are 2 reasons to keep zynq_remoteproc as a module rather than compiling into the kernel:

Since I am hosting the root file system on NFS, this module should NOT start until the NFS rootfs is mounted. Modules seem to start AFTER NFS mounting.
To start/stop CPU1, this module should be probed and removed

NOTE TO SELF: after modifying and compiling the kernel module and running modules_install, the modules still need to be copied to the NFS export!

When Xilinx made a marketing push to AMP (asymmetric multi-processing) a couple of years ago, they put out (rather quietly) a user guide, ug978, that launched FreeRTOS on CPU1 from Linux running on CPU0. I will try to use the zynq_remoteproc module--the specialization of the generic Linux remoteproc module--as close to verbatim as possible (<kernel>/drivers/remoteproc/zynq_remoteproc.c), to launch my own bare metal C++ application on CPU1.

Firstly, the module has to be built. I added the following lines to my kernel defconfig:

~~CONFIG_RPMSG=y~~
CONFIG_REMOTEPROC=y
CONFIG_ZYNQ_REMOTEPROC=m

Next, the kernel has to be told about my desire to use the zynq_remoteproc driver, through DTS. I added the following entry in zynq-zed-adv7511.dts:

remoteproc@1 {
    compatible = "xlnx,zynq_remoteproc";
    reg = < 0x1FE00000 0x200000 >;
    interrupt-parent = <&gic>;
    interrupts = < 0 37 0 0 38 0 >;
    firmware = "cpu1app.elf";
    ipino = <0>; //The only free ipino
    vring0 = <2>;
    vring1 = <3>;
};

Here, I am telling the kernel that I want to use the last 2 MB (out of 512 MB available on Zedboard) of the RAM for the bare metal app running on CPU1. Please recall that the memory was declared in zynq-zed.dtsi, which is included by zynq-zed-adv7511.dts:

memory {
    device_type = "memory";
    reg = <0x000000000 0x20000000>;
};

To constrain the Linux kernel to only 510 MB without having to change the above DTS entry, I add "mem=510M" in the U-Boot kernel bootargs. Without it, the module cannot allocate coherent DMA mapping for the last 2 MB because the following code in zynq_remoteproc probe will fail (I tried it already):

ret = dma_declare_coherent_memory(&pdev->dev, local->mem_start,
        local->mem_start, local->mem_end - local->mem_start + 1,
        DMA_MEMORY_IO);
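
The arithmetic that ties the DTS reg entry to the mem=510M bootarg can be checked with a few lines of C. This is only a sketch: the helper names carveout_base() and linux_mem_mib() are made up for illustration and do not exist in the kernel or the driver.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024u * 1024u)

/* Hypothetical helper: base address of a carve-out taken from the top of DDR. */
uint32_t carveout_base(uint32_t ddr_base, uint32_t ddr_size,
                       uint32_t carveout_size)
{
    assert(carveout_size <= ddr_size);
    return ddr_base + (ddr_size - carveout_size);
}

/* Hypothetical helper: memory left for Linux in MiB, i.e. the <n> in "mem=<n>M". */
uint32_t linux_mem_mib(uint32_t ddr_size, uint32_t carveout_size)
{
    assert(carveout_size <= ddr_size);
    return (ddr_size - carveout_size) / MIB;
}
```

With the Zedboard numbers (512 MB of DDR at 0x0, 2 MB carve-out), carveout_base(0x0, 0x20000000, 0x200000) gives 0x1FE00000 and linux_mem_mib(0x20000000, 0x200000) gives 510, which matches the reg value and the bootarg used here.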

In Xilinx document ug978, the CPU1 application was placed in the boot partition, right next to BOOT.bin--which lives on my SD card. For convenience during development, I want to put the application ELF file on the NFS export. Many Linux distributions seem to put firmware in /lib/firmware, but according to the hard coded paths in fw_path string array (<>/drivers/base/firmware_class.c), /lib/firmware/updates/ is also a possibility, as well as a custom path specified in the "path" module parameter. This folder is conveniently accessible on my NFS host, making development iteration easier.
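
To make the search order concrete, here is a small user-space model of how a candidate firmware path could be built. The list of directories is how I remember the fw_path array in the 3.x sources (custom path first, then updates/<release>, updates, <release>, and plain /lib/firmware), so treat it as an assumption and check firmware_class.c in your own tree. The function name fw_candidate_path() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the idx-th candidate path for firmware `name`
 * into `out`. `custom` models the "path" module parameter and `release`
 * models the kernel release string (for example "3.15").
 * Returns 0 on success, -1 if the slot is skipped or the buffer is too small.
 */
int fw_candidate_path(int idx, const char *custom, const char *release,
                      const char *name, char *out, size_t outlen)
{
    int n;

    assert(release != NULL && name != NULL && out != NULL);
    switch (idx) {
    case 0:
        if (custom == NULL || custom[0] == '\0')
            return -1;          /* no custom path configured: skip */
        n = snprintf(out, outlen, "%s/%s", custom, name);
        break;
    case 1:
        n = snprintf(out, outlen, "/lib/firmware/updates/%s/%s", release, name);
        break;
    case 2:
        n = snprintf(out, outlen, "/lib/firmware/updates/%s", name);
        break;
    case 3:
        n = snprintf(out, outlen, "/lib/firmware/%s/%s", release, name);
        break;
    case 4:
        n = snprintf(out, outlen, "/lib/firmware/%s", name);
        break;
    default:
        return -1;
    }
    /* Treat truncation as failure rather than returning a partial path. */
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Walking idx from 0 to 4 and trying to open each path in turn models the lookup; with no custom path, the last candidate for my firmware is /lib/firmware/cpu1app.elf.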

I can just compile this DTS in bash and move the DTB into the TFTP download folder, because I am downloading the kernel over TFTP:

~/work/zed/kernel/arch/arm/boot/dts$ ~/work/zed/kernel/scripts/dtc/dtc -I dts -O dtb -o zynq-zed-adv7511.dtb zynq-zed-adv7511.dts

~/work/zed/kernel/arch/arm/boot/dts$ sudo mv zynq-zed-adv7511.dtb /var/lib/tftpboot/

Of course, there is no cpu1app ELF file in /lib/firmware yet, BUT the modprobe fails for a different reason if ipino in the DTS is anything other than 0:

CPU0: IPI handler 0x5 already registered to ipi_cpu_stop
zynq_remoteproc 1fe00000.remoteproc: IPI handler already registered
zynq_remoteproc 1fe00000.remoteproc: Deleting the irq_list
CPU1: Booted secondary processor
CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
zynq_remoteproc 1fe00000.remoteproc: Can't power on cpu1 -1
zynq_remoteproc: probe of 1fe00000.remoteproc failed with error -1

This code is stopping the probe():

ret = set_ipi_handler(local->ipino, ipi_kick, "Firmware kick");
if (ret) {
    dev_err(&pdev->dev, "IPI handler already registered\n");
    goto irq_fault;
}

Reading set_ipi_handler(), I realized that 0 (IPI_WAKEUP) is the only available IPI handler number, so I changed the DTS accordingly. I do NOT plan to use virtio, so I simply compiled out anything related to vring in zynq_remoteproc by wrapping it in #ifdef CONFIG_ZYNQ_IPC.
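
The failure mode is easy to model in user space: a fixed table of handler slots where registration refuses a slot that already has an owner. This is a sketch of the idea only, not the kernel's code; the table size NR_IPI_MODEL and the names model_set_ipi_handler(), model_clear_ipi_handler() and dummy_kick() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IPI_MODEL 16   /* assumption: size of this model table, not a kernel constant */

typedef void (*ipi_fn)(void);

struct ipi_slot {
    ipi_fn handler;       /* NULL means the slot is free */
    const char *desc;
};

static struct ipi_slot ipi_table[NR_IPI_MODEL];

/* Register a handler; fail if the slot is out of range or already taken. */
int model_set_ipi_handler(unsigned int ipinr, ipi_fn handler, const char *desc)
{
    assert(handler != NULL);
    if (ipinr >= NR_IPI_MODEL)
        return -1;
    if (ipi_table[ipinr].handler != NULL)
        return -1;        /* the "IPI handler already registered" case */
    ipi_table[ipinr].handler = handler;
    ipi_table[ipinr].desc = desc;
    return 0;
}

/* Free a slot again, as a driver's remove path would need to. */
void model_clear_ipi_handler(unsigned int ipinr)
{
    if (ipinr < NR_IPI_MODEL) {
        ipi_table[ipinr].handler = NULL;
        ipi_table[ipinr].desc = NULL;
    }
}

/* Placeholder handler used only to occupy a slot in the model. */
void dummy_kick(void)
{
}
```

In this model, asking for a slot that is already owned (like 5 in the log above, owned by ipi_cpu_stop) returns -1, while a free slot succeeds, and it can be registered again only after it has been cleared.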

Simplest bare metal (actually uses the Xilinx stand-alone BSP) CPU1 application: blinks

Since bare metal AMP was demonstrated in xapp 1079, it may be easiest to pick up from there. But briefly, building a stand-alone (no OS) application for CPU1 involves the following high-level steps:

Create a standalone BSP specialized for AMP CPU1 (when creating the Xilinx BSP project in xsdk, select ps_cortexa9_1 as the CPU). Since I did not install the FreeRTOS template, the only OS choice I get is standalone--hence the project name "standalone_bsp_1".
Compile an ELF executable that targets CPU1, depends on the BSP just created above, and is hard coded to some load address.

Since the CPU1 BSP will NOT be used for FSBL, there is an opportunity to reduce the code size (compared to the CPU0 BSP) by NOT selecting any libraries (such as xilffs or xilrsa), which is what I did.

Since I am NOT interested in debugging the BSP, I have an opportunity to increase the optimization level and remove the debug (-g) flag in the BSP settings. But this is important: the USE_AMP=1 preprocessor define in the BSP settings (right click on the BSP project in Eclipse --> Board Support Package settings) changes some BSP code relative to the default BSP:

GIC (generic interrupt controller) distributor is disabled
L2 cache invalidation is disabled in boot.S, and instead, virtual address 0x20000000 is mapped to 0x0 and marked as non-cacheable (while MMU is disabled of course). xapp 1079 comments this out, so I did too.
Recently, John McDougall added more AMP code in boot.S to:

Mark the Linux DDR region as unassigned/reserved to the MMU, which is a private resource of CPU1
Mark the CPU1 DDR as inner (L1) cached only

L2 cache is NOT turned back on (because it was not invalidated in the first place!)

Marking certain sections of the DDR as reserved and the last part of the DDR as inner cached only is done in boot.S, when USE_AMP=1:

#if USE_AMP==1
// /* In case of AMP, map virtual address 0x20000000 to 0x00000000 and mark it as non-cacheable */
// ldr r3, =0x1ff /* 512 entries to cover 512MB DDR */
// ldr r0, =TblBase /* MMU Table address in memory */
// add r0, r0, #0x800 /* Address of entry in MMU table, for 0x20000000 */
// ldr r2, =0x0c02 /* S=b0 TEX=b000 AP=b11, Domain=b0, C=b0, B=b0 */
//mmu_loop:
// str r2, [r0] /* write the entry to MMU table */
// add r0, r0, #0x4 /* next entry in the table */
// add r2, r2, #0x100000 /* next section */
// subs r3, r3, #1
// bge mmu_loop /* loop till 512MB is covered */

/* Mark Linux DDR [0x00000000, 0x1FE00000) as unassigned/reserved */
ldr r3, =0x1fd /* counter=509 to cover 510MB DDR */
ldr r0, =TblBase /* MMU Table address in memory */
ldr r2, =0x0000 /* S=b0 TEX=b000 AP=b00, Domain=b0, C=b0, B=b0 */
mmu_loop:
str r2, [r0] /* write the entry to MMU table */
add r0, r0, #0x4 /* next entry in the table */
add r2, r2, #0x100000 /* next section */
subs r3, r3, #1 //counter--
bge mmu_loop /* loop till Linux DDR MB covered */

/* Mark CPU1 DDR [0x1FE00000, 0x20000000) as inner cached only */
ldr r3, =0x1 /* counter=1 to cover 2MB DDR */
movw r2, #0x4de6 /* S=b0 TEX=b100 AP=b11, Domain=b1111, C=b0, B=b1 */
movt r2, #0x1FE0 /* S=b0, Section start for address 0x1FE00000 */
mmu_loop1:
str r2, [r0] /* write the entry to MMU table */
add r0, r0, #0x4 /* next entry in the table */
add r2, r2, #0x100000 /* next section */
subs r3, r3, #1 //counter--
bge mmu_loop1 /* loop till CPU1 DDR MB is covered */
#endif
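
The magic numbers in this assembly are ARMv7-A short-descriptor "section" entries (one 32-bit word per 1 MB of virtual address space). A small C sketch can compose them from the fields named in the comments. The bit positions are from my reading of the ARM architecture manual; the helper names section_entry(), section_index() and subs_bge_iterations() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions of a first-level "section" entry (short-descriptor format). */
#define SECT_TYPE        0x2u                       /* bits [1:0] = 0b10 */
#define SECT_B           (1u << 2)                  /* bufferable */
#define SECT_C           (1u << 3)                  /* cacheable */
#define SECT_DOMAIN(d)   (((uint32_t)(d) & 0xFu) << 5)
#define SECT_AP(ap)      (((uint32_t)(ap) & 0x3u) << 10)
#define SECT_TEX(t)      (((uint32_t)(t) & 0x7u) << 12)
#define SECT_S           (1u << 16)                 /* shareable */
#define SECT_BASE_MASK   0xFFF00000u                /* 1 MB aligned base */

/* Compose a section entry from its fields. */
uint32_t section_entry(uint32_t phys, unsigned tex, unsigned ap,
                       unsigned domain, int c, int b, int s)
{
    assert((phys & ~SECT_BASE_MASK) == 0);          /* must be 1 MB aligned */
    return phys | SECT_TEX(tex) | SECT_AP(ap) | SECT_DOMAIN(domain) |
           (c ? SECT_C : 0u) | (b ? SECT_B : 0u) | (s ? SECT_S : 0u) |
           SECT_TYPE;
}

/* Index of the table entry that translates `vaddr` (each entry is 4 bytes). */
uint32_t section_index(uint32_t vaddr)
{
    return vaddr >> 20;
}

/* Passes through a "subs rN, rN, #1 ; bge loop" loop that starts at `counter`. */
uint32_t subs_bge_iterations(uint32_t counter)
{
    return counter + 1u;    /* body runs for counter, counter-1, ..., 0 */
}
```

With these helpers, section_entry(0x1FE00000, 4, 3, 0xF, 0, 1, 0) reproduces the movw/movt pair (0x1FE04DE6), the commented-out value 0x0c02 is AP=b11 with everything else zero, and the byte offset 0x800 is section_index(0x20000000) * 4. The loop counters 0x1fd and 0x1 give 510 and 2 passes respectively, which is where the 510 MB and 2 MB in the comments come from.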

For the application, I copy the xapp 1079 CPU1 application as a new project "cpu1app" and start modifying. Besides the application logic itself, the linker script (lscript.ld) specifies where the code/data sections will be placed in memory (DDR, to be specific; the actual placement is done by CPU0, but that is not the concern of the linker script). xapp1079 reserved 0x02000000 through 0x02ffffff (16 MB) for CPU1, but as shown in the DTS above, I want to allocate CPU1 memory at 0x1FE00000. So I change the ps7_ddr_0_S_AXI_BASEADDR location and size in the linker script editor, like this:

MEMORY
{
ps7_ddr_0_S_AXI_BASEADDR : ORIGIN = 0x1fe00000, LENGTH = 0x200000
}

Since the linker places all sections into the DDR, there is no reason to even mention other on-chip memory (BRAM at 0x0 and OCM at 0xFFFC0000). I don't know the correct stack and heap size yet, so I'll just leave them alone (8 KB each).

_STACK_SIZE = DEFINED(_STACK_SIZE) ? _STACK_SIZE : 0x2000;
_HEAP_SIZE = DEFINED(_HEAP_SIZE) ? _HEAP_SIZE : 0x2000;

The simplest app I can think of is a blinker. Recently, John McDougall introduced a sleep method using CPU1's private timer (which seems to be called SCU timer--I don't yet see the connection to the snoop control unit). John McDougall's code for initializing the SCU timer and calling a sleep on it is in this download (in design/src/apps/app_cpu1/scu_sleep.[ch]). My main() simply calls the SCU timer init and then sleep for 1 second over and over.

#include "xparameters.h" /* XPAR_XGPIOPS_0_DEVICE_ID */
#include "xgpiops.h"     /* XGpioPs driver */
#include "xstatus.h"     /* XST_SUCCESS, XST_FAILURE */

#define GPIO_DEVICE_ID XPAR_XGPIOPS_0_DEVICE_ID
#define LED_DELAY 10000000
#define OUTPUT_PIN 7 /* Pin connected to LED/Output */
XGpioPs Gpio; /* The driver instance for GPIO Device. */

static int GpioOutputExample(void)
{
    volatile int Delay; /* volatile keeps the busy-wait loops from being optimized away */

    /* Configure the pin as an output and drive it low initially */
    XGpioPs_SetDirectionPin(&Gpio, OUTPUT_PIN, 1);
    XGpioPs_SetOutputEnablePin(&Gpio, OUTPUT_PIN, 1);
    XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x0);

    /* Blink forever */
    while(1) {
        XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x1);
        for (Delay = 0; Delay < LED_DELAY; Delay++);
        XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x0);
        for (Delay = 0; Delay < LED_DELAY; Delay++);
    }
    return XST_SUCCESS; /* not reached */
}

int main(void)
{
    int Status;
    XGpioPs_Config *ConfigPtr;

    ConfigPtr = XGpioPs_LookupConfig(GPIO_DEVICE_ID);
    if (ConfigPtr == NULL) {
        return XST_FAILURE;
    }
    Status = XGpioPs_CfgInitialize(&Gpio, ConfigPtr,
            ConfigPtr->BaseAddr);
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }
    Status = GpioOutputExample();
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }

    return XST_SUCCESS;
}

WITHOUT the USE_AMP=1 modifications I made to boot.S above, I can launch this program from xsdk (Xilinx SW development IDE), and I can see the blinking LED.

xsdk builds the ELF file with ease, and I moved that file into a new folder /lib/firmware within the NFS exported root for the target. When I rebooted Zedboard, I was greeted with what seems like a minor success in dmesg output:

CPU1: shutdown
remoteproc0: 1fe00000.remoteproc is available
remoteproc0: Note: remoteproc is still under development and considered experimental.
remoteproc0: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.

As dmesg suggests, Linux first shut down CPU1. Silently, it tries to load the firmware through this chain: zynq_remoteproc_probe() --> rproc_add() --> rproc_add_virtio_devices() --> request_firmware_nowait() --> INIT_WORK(&fw_work->work, request_firmware_work_func) --> request_firmware_work_func() --> _request_firmware() --> fw_get_filesystem_firmware() --> fw_read_file_contents(). request_firmware_work_func() should also do the post-FW-load work (like booting the remote proc) through the fw_work->cont function pointer to rproc_fw_config_virtio(), but that bombs out because rproc_find_rsc_table() (which resolves to rproc_elf_find_rsc_table()) does not find a resource table.

The debugger does NOT respond when CPU1 is halted (as in this case), so I had to rely on printk. I came to appreciate the value of out-of-tree module compilation:

~/work/zed/kernel/drivers/remoteproc$ make -C /mnt/work/zed/buildroot/output/build/linux-custom ARCH=arm M=`pwd` modules

Having the target's modules folder on NFS export (/export/root/zedbr2/lib/modules/3.15/kernel/drivers/remoteproc in this case) made the otherwise printk based debugging much faster (it still took a few days to navigate through all the source and try different hypotheses). Finally, I realized that my executable does not have the .resource_table section the ELF loader is looking for. I put a minimal resource table with a single carveout entry (note that num=1 below) in its own section (which is what the remoteproc module looks for after the ELF loader parses the ELF file) in lscript.ld:

.resource_table : {
    __rtable_start = .;
    *(.rtable)
    __rtable_end = .;
} > ps7_ddr_0_S_AXI_BASEADDR

The C program can have the global data as the resource table content:

#define RAM_ADDR 0x1fe00000
struct resource_table {//Just copied from linux/remoteproc.h
    u32 ver;//Must be 1 for remoteproc module!
    u32 num;
    u32 reserved[2];
    u32 offset[1];
} __packed;
enum fw_resource_type {
    RSC_CARVEOUT = 0,
    RSC_DEVMEM = 1,
    RSC_TRACE = 2,
    RSC_VDEV = 3,
    RSC_MMU = 4,
    RSC_LAST = 5,
};
struct fw_rsc_carveout {
    u32 type;//from struct fw_rsc_hdr
    u32 da;
    u32 pa;
    u32 len;
    u32 flags;
    u32 reserved;
    u8 name[32];
} __packed;

__attribute__ ((section (".rtable")))
const struct rproc_resource {
    struct resource_table base;
    //u32 offset[4];
    struct fw_rsc_carveout code_cout;
} ti_ipc_remoteproc_ResourceTable = {
    .base = { .ver = 1, .num = 1, .reserved = { 0, 0 },
        .offset = { offsetof(struct rproc_resource, code_cout) },
    },
    .code_cout = {
        .type = RSC_CARVEOUT, .da = RAM_ADDR, .pa = RAM_ADDR, .len = 1<<19,
        .flags=0, .reserved=0, .name="CPU1CODE",
    },
};
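
Since a malformed table costs days of printk debugging, it can help to sanity-check the table layout on the host first. The sketch below is a user-space approximation of the kind of checks I saw in the remoteproc ELF loader (version must be 1, reserved words must be zero, offsets must stay inside the section); it is not the kernel's code, and the function name check_rsc_table() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical checker for a raw resource table image of `len` bytes.
 * Layout assumed: u32 ver, u32 num, u32 reserved[2], u32 offset[num].
 * Returns 0 if the header is sane and every offset leaves room for at
 * least the 32-bit resource type word; returns -1 otherwise.
 */
int check_rsc_table(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    const size_t hdr_len = 4 * sizeof(uint32_t);
    uint32_t ver, num, reserved[2], off, i;

    assert(buf != NULL);
    if (len < hdr_len)
        return -1;
    /* memcpy avoids unaligned reads from an arbitrary byte buffer */
    memcpy(&ver, p, sizeof ver);
    memcpy(&num, p + 4, sizeof num);
    memcpy(reserved, p + 8, sizeof reserved);
    if (ver != 1 || reserved[0] != 0 || reserved[1] != 0)
        return -1;
    if (num > (len - hdr_len) / sizeof(uint32_t))
        return -1;                  /* offset array would overrun the table */
    for (i = 0; i < num; i++) {
        memcpy(&off, p + hdr_len + i * sizeof(uint32_t), sizeof off);
        if (off > len || len - off < sizeof(uint32_t))
            return -1;              /* entry's type word is out of bounds */
    }
    return 0;
}
```

One way to use it would be to dump the .resource_table section from the ELF file into a buffer on the host and run the check before copying the firmware to the NFS export.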

With this change, my program is copied to the correct location in the DRAM, and I can dynamically start/stop Linux on CPU1 by probing and removing the module, like this:

# rmmod zynq_remoteproc

# modprobe kernel/drivers/remoteproc/zynq_remoteproc.ko

This driver shows up in /sys/module/zynq_remoteproc/ and /sys/devices/1fe00000.remoteproc. But the zynq_remoteproc probe does NOT call rproc_boot(); it merely loads the firmware. Indeed, it cannot, because the firmware loading completes asynchronously from module probing. Supposedly, the rpmsg module probe should call rproc_boot(), so I tried the following:

# modprobe kernel/drivers/rpmsg/virtio_rpmsg_bus.ko

But the module's probe still does NOT get called (note that I crossed out CONFIG_RPMSG=y in my defconfig above)! I could not figure out how to get the virtio device probed, and for that matter, another determined engineer could not either, so I just added a single-threaded work queue to call rproc_boot() after the firmware is loaded.

struct zynq_rproc_pdata {
    struct irq_list mylist;
    struct rproc *rproc;
    u32 ipino;
#ifdef CONFIG_ZYNQ_IPC
    u32 vring0;
    u32 vring1;
#endif
    u32 mem_start;
    u32 mem_end;

    //Need my own workqueue rather than a shared work queue because I will block for completion
    struct workqueue_struct* wq;
    struct work_struct boot_work;
};

static void boot_cpu1(struct work_struct *work) {
    struct zynq_rproc_pdata* local =
        container_of(work, struct zynq_rproc_pdata, boot_work);
    struct rproc* rproc = local->rproc;
    int err;

    wait_for_completion(&rproc->firmware_loading_complete);
    dev_info(&rproc->dev, "firmware_loading_complete\n");
    err = rproc_boot(rproc);
    if(err)
        dev_err(&rproc->dev, "rproc_boot %d\n", err);
}

static int zynq_remoteproc_probe(struct platform_device *pdev)
{
    ...
    ret = rproc_add(local->rproc);
    if (ret) {
        dev_err(&pdev->dev, "rproc registration failed\n");
        goto rproc_fault;
    }

    INIT_WORK(&local->boot_work, boot_cpu1);
    /* create_singlethread_workqueue() returns NULL (not an ERR_PTR) on failure */
    local->wq = create_singlethread_workqueue("znq_remoteproc boot");
    if (!local->wq) {
        dev_err(&pdev->dev, "create_singlethread_workqueue failed\n");
        ret = -ENOMEM;
        goto rproc_fault;
    }
    queue_work(local->wq, &local->boot_work);
    ...
}

static int zynq_remoteproc_remove(struct platform_device *pdev)
{
    struct zynq_rproc_pdata *local = platform_get_drvdata(pdev);
    u32 ret;

    dev_info(&pdev->dev, "%s\n", __func__);
    rproc_shutdown(local->rproc);
    destroy_workqueue(local->wq);
    ...

With this change, my cpu1app runs on boot:

remoteproc0: firmware_loading_complete
remoteproc0: powering up 1fe00000.remoteproc
remoteproc0: Read /lib/firmware/cpu1app.elf 0
remoteproc0: firmware: direct-loading firmware cpu1app.elf
remoteproc0: assign_firmware_buf, flag 5 state 0
remoteproc0: Booting fw image cpu1app.elf, size 150445
zynq_remoteproc 1fe00000.remoteproc: iommu not found
remoteproc0: rsc: type 0
remoteproc0: phdr: type 1 da 0x1fe00000 memsz 0xd890 filesz 0x8058
remoteproc0: rproc_da_to_va 1fe00000 --> (null)
remoteproc0: rproc_da_to_va 1fe0800c --> (null)
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_start
remoteproc0: remote processor 1fe00000.remoteproc is now up

I can also debug my app in the xsdk JTAG debugger. This debugger stack trace is proof that I am running Linux on CPU0 and my bare metal application on CPU1:

ARM Cortex-A9 MPCore #0 (Suspended)
0xc0020428 cpu_v7_do_idle(): arch/arm/mm/proc-v7.S, line 74
0xc0013d1c arm_cpuidle_simple_enter(): arch/arm/kernel/cpuidle.c, line 18
0xc03d08b8 cpuidle_enter_state(): drivers/cpuidle/cpuidle.c, line 104
0xc03d09ac cpuidle_enter(): drivers/cpuidle/cpuidle.c, line 159
0xc0060ad0 cpu_startup_entry(): kernel/sched/idle.c, line 154
0xc0573fac rest_init(): init/main.c, line 397
0xc07ebba4 start_kernel(): init/main.c, line 652
0x00008074
0x00008074
ARM Cortex-A9 MPCore #1 (Suspended)
0x1fe00594 GpioOutputExample(): ../src/xgpiops_polled_example.c, line 93
0x1fe005f4 main(): ../src/xgpiops_polled_example.c, line 113
0x1fe02264 _start()

rmmod zynq_remoteproc does not work; remove() method is not even getting called. As a result, I cannot stop cpu1app; it just starts at the system bootup, and keeps running--which is OK for an embedded application. Another approach would be to create another module that boots and stops zynq_remoteproc, but I don't know how to get a handle to the existing zynq_remoteproc instance...

Better alternative: provide "up" device attribute to read/write

If I provide a sysfs file for the userspace to write to, the firmware will probably have been loaded already by the time the user writes '1' to the attribute file. So I created the store/show methods of "up" attribute as shown here:

static ssize_t up_store(struct device *dev, struct device_attribute *attr,
        const char *buf, size_t count) {
    struct rproc *rproc = container_of(dev, struct rproc, dev);
    //struct platform_device *pdev = to_platform_device(dev);
    //struct zynq_rproc_pdata *local = platform_get_drvdata(pdev);
    int err;

    if(buf[0] == '0') { //want to shut down
        rproc_shutdown(rproc);
    } else { // bring up
        err = rproc_boot(rproc);
        if(err)
            return err; //report the boot failure to the writer
    }
    return count;
}

static ssize_t up_show(struct device *dev,
        struct device_attribute *attr, char *buf) {
    struct rproc *rproc = container_of(dev, struct rproc, dev);
    return sprintf(buf, "%d\n", rproc->state);
}

static DEVICE_ATTR_RW(up);
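
Note that the store method treats anything other than a leading '0' as a request to boot. If stricter input handling is wanted, the parsing can be modeled and tested in user space first. This is a sketch under my own assumptions: the enum up_request and the function parse_up() are hypothetical names, and the policy (accept exactly "0" or "1", optionally followed by the newline that echo appends) is just one reasonable choice.

```c
#include <assert.h>
#include <stddef.h>

enum up_request { UP_INVALID = -1, UP_STOP = 0, UP_START = 1 };

/*
 * Hypothetical model of parsing a sysfs write of `count` bytes:
 * accept "0" or "1", optionally followed by a single '\n'.
 */
enum up_request parse_up(const char *buf, size_t count)
{
    assert(buf != NULL);
    if (count != 1 && count != 2)
        return UP_INVALID;          /* empty or too long */
    if (count == 2 && buf[1] != '\n')
        return UP_INVALID;          /* trailing garbage such as "10" */
    if (buf[0] == '0')
        return UP_STOP;
    if (buf[0] == '1')
        return UP_START;
    return UP_INVALID;
}
```

A store method built on this could return -EINVAL for UP_INVALID instead of booting CPU1 on any stray write.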

And in probe, I can register this file:

...
ret = rproc_add(local->rproc);
if (ret) {
    dev_err(&pdev->dev, "rproc registration failed\n");
    goto rproc_fault;
}

ret = device_create_file(&local->rproc->dev, &dev_attr_up);
return ret;

When I probe this module, I can read the "up" file

# cat /sys/devices/1fe00000.remoteproc/remoteproc0/up

I then start the cpu1app by writing 1 to the file:

# echo 1 > /sys/devices/1fe00000.remoteproc/remoteproc0/up
remoteproc0: powering up 1fe00000.remoteproc
remoteproc0: Read /lib/firmware/cpu1app.elf 0
remoteproc0: firmware: direct-loading firmware cpu1app.elf
remoteproc0: assign_firmware_buf, flag 5 state 0
remoteproc0: Booting fw image cpu1app.elf, size 150445
zynq_remoteproc 1fe00000.remoteproc: iommu not found
remoteproc0: rsc: type 0
remoteproc0: phdr: type 1 da 0x1fe00000 memsz 0xd890 filesz 0x8058
remoteproc0: rproc_da_to_va 1fe00000 --> (null)
remoteproc0: rproc_da_to_va 1fe0800c --> (null)
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_start
remoteproc0: remote processor 1fe00000.remoteproc is now up

And the up file now reads 2, which is RPROC_RUNNING in enum rproc_state (and the LED is blinking!).

# cat /sys/devices/1fe00000.remoteproc/remoteproc0/up

To stop CPU1, I have to do 2 things in succession: write 0 to the "up" file, and then remove the module:

# echo 0 > /sys/devices/1fe00000.remoteproc/remoteproc0/up
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_stop
remoteproc0: stopped remote processor 1fe00000.remoteproc

# rmmod zynq_remoteproc
zynq_remoteproc 1fe00000.remoteproc: zynq_remoteproc_remove
zynq_remoteproc 1fe00000.remoteproc: Deleting the irq_list
remoteproc0: releasing 1fe00000.remoteproc
CPU1: Booted secondary processor

At this point, Linux has been restarted on the 2nd processor; if I do things in this way, I can restart the app again by modprobing and then writing 1 to the "up" file again.

Henry Choi

Feb 26, 2015
