Feb 28, 2015

Zynq inter-processor interrupts

I started thinking about AMP (asymmetric multi-processing) communicating via OCM (on-chip memory) when I first started playing around with Linux on Zynq.  Although the Zynq OCM already had a device driver, it took me all this time to get comfortable enough with the Linux kernel and device drivers to reach this point, where I can start a bare metal application on CPU1 from Linux on CPU0.  In this blog, I study the logical next step: inter-processor interrupts.

Learning from existing code

<kernel>/drivers/irqchip/irq-gic.c: interrupt related functions in Linux kernel

What can I learn from existing kernel functions?  Firstly, every IRQ register change is done while holding a spinlock (irq_controller_lock).

To disable an interrupt (gic_mask_irq), a '1' is written to the appropriate bit in ICDICER0 (0xF8F01180; 0x180 relative to the ICD base 0xF8F01000) ~ ICDICER2 (0xF8F01188).  Writing 0 to these registers has no effect; to enable forwarding the interrupt again, a '1' is written to the same bit position in the matching set-enable register ICDISER0 (0xF8F01100; 0x100 relative to the ICD base) ~ ICDISER2 (0xF8F01108), as shown in this example:

static void gic_unmask_irq(struct irq_data *d)
{
    u32 mask = 1 << (gic_irq(d) % 32);

    raw_spin_lock(&irq_controller_lock);
    if (gic_arch_extn.irq_unmask)
        gic_arch_extn.irq_unmask(d);
    writel_relaxed(mask, gic_dist_base(d) + GIC_DIST_ENABLE_SET + (gic_irq(d) / 32) * 4);
    raw_spin_unlock(&irq_controller_lock);
}
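
The register and bit selection in gic_unmask_irq() can be checked on a host PC with a small sketch.  The helper names below are my own (hypothetical), not kernel functions; 0x100 (set-enable) and 0x180 (clear-enable) are the GIC distributor offsets, and 0xF8F01000 is the Zynq distributor base quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define ZYNQ_ICD_BASE         0xF8F01000u /* distributor base from the text */
#define GIC_DIST_ENABLE_SET   0x100u      /* ICDISERn: write 1 to enable */
#define GIC_DIST_ENABLE_CLEAR 0x180u      /* ICDICERn: write 1 to disable */

/* Address of the 32-bit enable register that covers interrupt `irq`.
 * `bank_offset` is GIC_DIST_ENABLE_SET or GIC_DIST_ENABLE_CLEAR. */
uint32_t gic_enable_reg(uint32_t bank_offset, unsigned irq)
{
    assert(irq < 96); /* three 32-bit registers cover IDs 0..95 */
    return ZYNQ_ICD_BASE + bank_offset + (irq / 32) * 4;
}

/* Bit to write within that register (same expression as the kernel's mask). */
uint32_t gic_enable_bit(unsigned irq)
{
    return UINT32_C(1) << (irq % 32);
}
```

For interrupt ID 52, this selects the second register of each bank (offset +4) and bit 20.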

On Zynq (arch/arm/mach-zynq), the irq_mask/irq_unmask methods are mask_msi_irq()/unmask_msi_irq() in <>/drivers/pci/msi.c, which handle the plain MSI (message signalled interrupt) case and the MSI-X case.  Zynq does NOT seem to use these extensions.

ICCIAR/GIC_INT_ACK (0xF8F0010C): interrupt acknowledge register; reading the ID acknowledges the pending interrupt

ICCEOIR/GIC_EOI (0xF8F00110): end of interrupt register; write the interrupt ID from GIC_INT_ACK.
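
To make the acknowledge/EOI pair more concrete, here is a host-side sketch that splits a raw ICCIAR value into its fields.  The function name is hypothetical; the field layout (interrupt ID in bits [9:0], requesting CPU in bits [12:10] for SGIs, and ID 1023 meaning "nothing pending") is my reading of the ARM GIC architecture, so treat it as an assumption to verify against the TRM.

```c
#include <assert.h>
#include <stdint.h>

#define GIC_SPURIOUS_ID 1023u /* ID read from ICCIAR when nothing is pending */

/* Split a raw ICCIAR value into the interrupt ID and, for SGIs,
 * the ID of the CPU that raised it. */
void gic_decode_ack(uint32_t iar, unsigned *id, unsigned *src_cpu)
{
    assert(id != 0 && src_cpu != 0);
    *id = iar & 0x3FFu;             /* bits [9:0] */
    *src_cpu = (iar >> 10) & 0x7u;  /* bits [12:10], meaningful for SGIs only */
}
```

The value written to ICCEOIR should be the whole word that was read from ICCIAR (not just the ID), so that for an SGI the CPU field matches too.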

Interrupt handling in XSDK standalone BSP

Being lower level, the Xilinx BSP may give a better example of IRQ handling.  I started with an example interrupt driven program auto-generated from the BSP summary page: the interrupt driven GPIO example.  The 1st interrupt related function is SetupInterruptSystem(), which is specific to GPIO (i.e. not generic for all interrupts).  But most of the lower level calls inside it are generic.
  1. Point all XSCUGIC_MAX_NUM_INTR_INPUTS (95) interrupt handler entries at the stub handler (which just increments the interrupt controller's UnhandledInterrupts counter)
  2. DistInit (initialize distributor): do nothing if USE_AMP (Linux is the interrupt distributor master)
  3. Write 0xF0 to ICCPMR (CPU interrupt priority mask register; 0x4 relative to the CPU interface base, which itself is at offset 0x00000100, so 0xF8F00104).  Why would we set the interrupt priority threshold to 0xF0?
  4. Write 0x7 to ICCICR (CPU interface control register), having to do with secure interrupts (don't care to learn about this for now).
  5. Set the interrupt handler as the ISR for HW's IRQ vector (vs other HW defined interrupts, such as FIQ, RESET, ABORT, SWI).
    1. Therefore, we know that on Zynq, interrupt handling is a 2 step process: ALL GIC interrupts (95 of them) are handled by this ISR, which then demultiplexes into the handlers that will be defined for different types of interrupts.
    2. In xparameters, shared interrupt IDs start at 32 (saw this before, where the interrupt number defined in HW design shows up with 32 added to it).
  6. GPIO interrupt ID is 52 (XPAR_XGPIOPS_0_INTR defined in xparameters.h); its handler is registered with XScuGic_Connect()
    1. Q: is there a number set aside for the 16 software generated interrupts?
  7. Some peripherals like GPIO can configure the interrupt type (edge/level) through peripheral specific register(s).
  8. The 3rd leg of chained interrupt handler is the peripheral specific ISR, written by the application, which does NOT seem to have to acknowledge the interrupt (done by the 1st and 2nd ISRs).
  9. Peripheral specific interrupt is enabled to the 2nd multiplexer
  10. HW IRQ interrupt is enabled by Xil_ExceptionEnableMask(XIL_EXCEPTION_IRQ); 
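
The 2 step dispatch in the list above can be sketched as a host-side simulation.  Everything here (intc_sim, intc_connect, intc_dispatch, count_handler) is a hypothetical stand-in that I made up to mirror the idea; it is not the XScuGic API, and it never touches real GIC registers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INTR_INPUTS 95 /* same count as XSCUGIC_MAX_NUM_INTR_INPUTS */

typedef void (*intr_handler)(void *ref);

struct vector_entry {
    intr_handler handler;
    void *ref;
};

struct intc_sim {
    struct vector_entry table[MAX_INTR_INPUTS];
    unsigned unhandled; /* counterpart of the UnhandledInterrupts counter */
};

/* Default handler: only counts the interrupt. */
static void stub_handler(void *ref)
{
    struct intc_sim *intc = ref;
    intc->unhandled++;
}

/* Example application handler: counts how often it was called. */
void count_handler(void *ref)
{
    ++*(int *)ref;
}

/* Step 1 of the list: point every table entry at the stub. */
void intc_init(struct intc_sim *intc)
{
    assert(intc != NULL);
    intc->unhandled = 0;
    for (size_t i = 0; i < MAX_INTR_INPUTS; i++) {
        intc->table[i].handler = stub_handler;
        intc->table[i].ref = intc;
    }
}

/* Register a handler for one ID; returns 0 on success, -1 on bad input. */
int intc_connect(struct intc_sim *intc, unsigned id, intr_handler h, void *ref)
{
    if (id >= MAX_INTR_INPUTS || h == NULL)
        return -1;
    intc->table[id].handler = h;
    intc->table[id].ref = ref;
    return 0;
}

/* What the single IRQ-vector ISR does once it knows the interrupt ID:
 * look the ID up in the table and call whatever is registered there. */
void intc_dispatch(struct intc_sim *intc, unsigned id)
{
    if (id >= MAX_INTR_INPUTS)
        return; /* out of range IDs are ignored */
    intc->table[id].handler(intc->table[id].ref);
}
```

Connecting count_handler to ID 52 and dispatching 52 calls it once; dispatching an unconnected ID only bumps the unhandled counter.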

Interrupting CPU1 from CPU0

Exposing an interrupt write attribute (targeting CPU1) to the userspace

With the knowledge gained from studying Linux and Xilinx BSP code, let's send the interrupt from Linux.  The Linux kernel already provides a function to raise software interrupt to any CPU.  For example, to raise IRQ number "irqnum" to CPU1:

gic_raise_softirq(cpumask_of(1), irqnum)

Under the hood, this writes "(gic_cpu_map[1] << 16) | irqnum" to address "GIC0 distributor base address + 0xF00".  The Linux kernel code is valid for the Zynq GIC because it is based on the ARM GIC architecture.  Interrupts are NOT vectored in HW; instead, an interrupt distributor implements (configurable) priority (and serializes interrupts targeting multiple CPUs).  The SGI (software generated interrupt) being raised above is explained in Zynq TRM section 7.2.1: SGI IDs range from 0 to 15, and an SGI is raised by writing to the ICDSGIR (Software Generated Interrupt) register at 0xF8F01F00, or 0xF00 relative to the ICD (interrupt controller distributor at 0xF8F01000).  gic_cpu_map[1] above corresponds to the target list filter being 0 (use the specified target list) and the target being CPU1.
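
As a sanity check of the value that ends up in ICDSGIR, here is a small host-side sketch.  icdsgir_value() is a hypothetical helper of mine; the field positions (target list filter in bits [25:24], CPU target list in bits [23:16], SGI ID in bits [3:0]) follow the ARM GIC architecture, and I am assuming that gic_cpu_map[1] is the single-bit mask 0x02 for CPU1.

```c
#include <assert.h>
#include <stdint.h>

/* Compose the word to write to ICDSGIR (0xF8F01F00 on Zynq).
 * filter: 0 = use cpu_mask, 1 = all CPUs except the sender, 2 = sender only. */
uint32_t icdsgir_value(unsigned filter, unsigned cpu_mask, unsigned sgi)
{
    assert(filter < 3);
    assert(cpu_mask <= 0xFFu);
    assert(sgi < 16);
    return ((uint32_t)filter << 24) | ((uint32_t)cpu_mask << 16) | sgi;
}
```

Under these assumptions, raising SGI 2 on CPU1 corresponds to writing 0x00020002.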

[BTW, section 7.4.2 seems VERY important; in particular, I need to better understand this sentence: "If the interrupt is active in the GIC (because the CPU interface has acknowledged the interrupt), then the software ISR determines the cause by checking the GIC registers first and then polling the I/O Peripheral interrupt status registers."]

Assuming that writing to this register does raise the software interrupt to CPU1, there is currently no way for a USERSPACE application to raise this interrupt.  The zynq_remoteproc device with which I flexibly booted a bare metal application on CPU1 now has an attribute file that the userspace can get to, as demonstrated in the last blog.  I can create another attribute for the userspace app to write to, with this code:

#include <linux/irqchip/arm-gic.h>
static ssize_t irq_store(struct device *dev, struct device_attribute *attr,
        const char *buf, size_t count)
{
    u8 irqnum;

    /* parse a decimal number, so that SGI 10 ~ 15 can be raised too */
    if (kstrtou8(buf, 10, &irqnum) || irqnum >= 16) {
        dev_err(dev, "Invalid soft IRQ num\n");
        return -EINVAL; /* report the failure to the writer */
    }
    gic_raise_softirq(cpumask_of(1), irqnum);
    return count;
}
static DEVICE_ATTR_WO(irq);

static int zynq_remoteproc_probe(struct platform_device *pdev)
{
    ...
    ret = device_create_file(&local->rproc->dev, &dev_attr_irq);
    if (ret) {
        dev_err(&pdev->dev, "device_create_file %s %d\n",
                dev_attr_irq.attr.name, ret);
        goto attr_up_err;
    }
    return ret;
attr_up_err:
    device_remove_file(&local->rproc->dev, &dev_attr_up);
    ...

The device sysfs folder now has an "irq" file (next to the "up" file created in the last blog entry):

# ls /sys/devices/1fe00000.remoteproc/remoteproc0/
irq     power   uevent  up

Catching the interrupt on CPU1 bare metal application

The bare metal cpu1app will have to install interrupt handler and enable HW interrupt, with this code copied mostly from the BSP auto-generated GPIO example:

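#include "xil_exception.h"
#include "xscugic.h"
#include "xgpiops.h"

/* Gpio (XGpioPs) and Intc (XScuGic) are assumed to be globals defined
 * in the elided part of cpu1app */
int ledon = 1;
static void on_SGI(void *CallBackRef) {
    /* XScuGic_InterruptHandler has already read ICCIAR (acknowledging the
     * interrupt) before calling this handler and writes ICCEOIR afterwards,
     * so a 2nd ICCIAR read here is unnecessary (and could swallow another
     * pending interrupt) */
    XGpioPs_WritePin(&Gpio, OUTPUT_PIN, ledon ^= 1);//toggle LED
}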

//Mostly copied from BSP auto-generated xgpiops_int_example
#define INTC_DEVICE_ID XPAR_PS7_SCUGIC_0_DEVICE_ID
static int SetupInterruptSystem(void) {
    int Status;
    XScuGic_Config *IntcConfig; //GIC config

    Xil_ExceptionInit();

    IntcConfig = XScuGic_LookupConfig(INTC_DEVICE_ID);
    XScuGic_CfgInitialize(&Intc, IntcConfig, IntcConfig->CpuBaseAddress);

    //connect to the HW
    Xil_ExceptionRegisterHandler(XIL_EXCEPTION_ID_INT,//== XIL_EXCEPTION_IRQ
            (Xil_ExceptionHandler)XScuGic_InterruptHandler, &Intc);
#define SGI_NUM 2
    Status = XScuGic_Connect(&Intc, SGI_NUM,
            (Xil_ExceptionHandler)on_SGI, (void *)&Intc);
    if (Status != XST_SUCCESS) {
        return XST_FAILURE;
    }
    XScuGic_Enable(&Intc, SGI_NUM);

    // Enable interrupts in the Processor.
    Xil_ExceptionEnableMask(XIL_EXCEPTION_IRQ);
    return XST_SUCCESS;
}
int main(void)
{
    ...
    XGpioPs_WritePin(&Gpio, OUTPUT_PIN, ledon);

    SetupInterruptSystem();

    while(1) {
        volatile int Delay;
        for (Delay = 0; Delay < 10000000; Delay++);
    }
    return XST_SUCCESS;
}

The idea is to initially turn on the LED, and toggle it only in the ISR.

Userspace test

To see if the interrupt can be delivered to CPU1, I first boot the bare metal application as I did in the last blog entry

# echo 1 > /sys/devices/1fe00000.remoteproc/remoteproc0/up

The MIO LED is lit when cpu1app starts.  To get ready to examine the interrupt status registers, I bring up the Xilinx JTAG debugger (see this previous blog entry for how), and then write to the irq file

# echo 2 > /sys/devices/1fe00000.remoteproc/remoteproc0/irq

The LED turns off!  And then on, and off, every time I run the above command!
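
The echo command can also be issued from a userspace C program.  This is only a sketch: raise_sgi() is a hypothetical helper, and the attribute path is whatever the driver exposes on your system.  Whether two-digit values (10 ~ 15) reach the driver intact depends on how irq_store parses the buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Write an SGI number (0..15) to the driver's "irq" attribute file.
 * Returns 0 on success, -1 on a bad number or an I/O error. */
int raise_sgi(const char *attr_path, unsigned sgi)
{
    FILE *f;
    int ok;

    assert(attr_path != NULL);
    if (sgi >= 16)
        return -1;
    f = fopen(attr_path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%u\n", sgi) > 0;
    /* fclose() flushes the buffer; a rejected write is reported here */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

On the target, the call would look like raise_sgi("/sys/devices/1fe00000.remoteproc/remoteproc0/irq", 2).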

Bonus: putting the CPU1 into WFE while waiting for the interrupt

My typical real-time SW is completely event driven, so that the main loop does not need to do any work.  In this case, putting CPU1 to sleep while waiting for an interrupt will save power.  Changing main()'s infinite while loop to sleep is trivial, thanks to the WFE instruction available on ARMv7 and later:

while(1) {
asm("WFE" : : : );
}

The LED still toggles in response to my writing 2 into the irq attribute file, so WFE works as expected.

In fact, sending an event itself can be a poor man's way of interrupting the bare metal application on CPU1, if CPU1 is normally waiting for a command from CPU0.  It is a poor man's way because the SEV instruction wakes up ALL processors; but in a 2 processor situation, 1 is already awake, so there is not much of a hit except for possibly 1 unnecessary context switch.

Interrupting Linux from bare metal

Sending the interrupt from CPU1

As shown in the Linux code, SENDING the interrupt is much easier than receiving the interrupt.  Xilinx BSP makes it almost trivial:

static void on_SGI(void *CallBackRef) {
    int status;
    /* ICCIAR was already read by XScuGic_InterruptHandler before this call */
    XGpioPs_WritePin(&Gpio, OUTPUT_PIN, ledon ^= 1);//toggle LED

    //raise SGI 0 to CPU0
    status = XScuGic_SoftwareIntr(&Intc, 0, XSCUGIC_SPI_CPU0_MASK);
    //TODO check error
}

This test code raises SW interrupt 0, which zynq_remoteproc is already listening for.

Catching the interrupt in the Linux kernel

In the last blog, I found out (rather painfully) that the zynq_remoteproc module already installs a Linux IPI (inter-processor interrupt) handler that doesn't do any work, and that 0 (IPI_WAKEUP) was the only remaining unassigned IPI number (because the Linux SMP IPI table only goes up to 7), even though Zynq has a whopping 16 possible software interrupt numbers:

static void ipi_kick(void)
{
dev_info(&remoteprocdev->dev, "KICK Linux because of pending message\n");
//schedule_work(&workqueue);
}

Leaving aside the utility of current kernel module, I wanted to see if the interrupt is caught at all.  So I rebuilt cpu1app and ran it again.  This time, when I sent soft IRQ 2 to CPU1, CPU1 raised an IRQ 0 back to CPU0, and I saw this in the command prompt:

KICK Linux because of pending message

So it does work!

Propagating the interrupt to the userspace: unnecessary?

The ways to alert the userspace application that is waiting for an event from CPU1 might be application specific:
  • If only 1 application were waiting for some kind of data, sending a signal may be the easiest.
  • If the kernel module does not know how many userspace applications want the data, a netlink socket broadcast may be more appropriate.
Perhaps independent of how to wake up a userspace application, if the data rate is high, maybe the message should be sent over DMA, and the DMA controller may raise a DMA done interrupt, which CPU0 can catch and handle.

A caution: Zynq OCM is already used by kernel

Linux kernel suspend (part of pm subsystem) runs the last stage of suspend from OCM (after powering off the DDR?).  In ADI kernel's arch/arm/mach-zynq/pm.c zynq_pm_suspend_init(), zynq_sys_suspend_sz number of bytes are copied into the OCM base.  zynq_sys_suspend_sz is calculated in <kernel>/arch/arm/mach-zynq/suspend.S:

ENTRY(zynq_sys_suspend_sz)
.word . - zynq_sys_suspend

which means: zynq_sys_suspend_sz is the size of the assembly function that starts at ENTRY(zynq_sys_suspend) in the same file (line 50).  Just counting the lines from that point to the .word label above (line 182), and subtracting empty and comment lines, I'd say it's about 100 lines of assembly, so I'd ballpark the suspend code to be ~400 bytes (assuming this code is ARM--I don't see anything that indicates the code is THUMB).
I would guess it's good practice to avoid the 1st page of the OCM.  Therefore, I will try to constrain my usage of OCM to start at 0xFFFC1000.
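
The arithmetic behind 0xFFFC1000 is just "round the end of the suspend code up to the next page".  A host-side sketch, assuming the OCM is mapped high at 0xFFFC0000 and a 4 KB page (both are my assumptions; ocm_first_free() is a hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

#define OCM_BASE_HIGH 0xFFFC0000u /* assumed OCM base (high mapping) */
#define PAGE_SZ       0x1000u     /* assumed 4 KB page */

/* First page-aligned address at or after base + reserved_bytes. */
uint32_t ocm_first_free(uint32_t base, uint32_t reserved_bytes)
{
    uint32_t end = base + reserved_bytes;

    assert(end >= base); /* the reserved area must not wrap around */
    return (end + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
}
```

With my ~400 byte ballpark for the suspend code, ocm_first_free(OCM_BASE_HIGH, 400) lands on 0xFFFC1000.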

Feb 26, 2015

Zynq 2nd CPU as a Linux device

Current status

This blog is orphaned in favor of the superior method described in another blog.  I just keep this blog as a note to myself.

Introduction

Xilinx engineer John McDougall has been improving the reference design xapp 1079 over the last couple of years.  In the initial implementation, the 2nd CPU just waited in a WFE (wait for event) loop until the 1st CPU changed the content of a magic address (the jump address); now CPU0 resets CPU1 using a Zynq proprietary system control register.  The companion reference application xapp 1078 does not seem to have been similarly updated.  At a high level xapp 1078 does the following, to demonstrate cooperation between Linux running on CPU0 and a bare metal C application running on CPU1:
  1. Linux is procured, and the DTS is modified to restrict Linux running on CPU0 to use only half of the RAM.
  2. A Linux user space application called rwmem is compiled.
  3. The bare metal C application ELF is compiled in XSDK, and is copied to the SD card's BOOT partition (along with the bitstream, FSBL, U-Boot, and the Linux image).  The bare metal app's load address is hard coded at this time (this is NOT a position independent code!).
  4. When the board starts, the FSBL programs the HW with the bitstream, then copies all ELF to the designated load addresses, and then starts the 1st ELF program, which happens to be U-Boot.  CPU1 is parked at the WFE loop by the bootROM code.
  5. Once U-Boot starts Linux, the rwmem program can then write the start address of the bare metal application for CPU1 at a special address.
  6. The bare metal application initializes itself (does NOT initialize the level 2 cache but remaps the address--remember that the RAM has been segmented into those dedicated for CPU0 and the rest for CPU1), and can then begin communicating with Linux application through OCM.
I realize now how sensible it is for CPU0 to reset CPU1, as John McDougall has done in recent changes to xapp 1079 (in app_cpu0.c he provides for CPU0): the master SW running on Linux can not only start/stop CPU1, but to some extent choose which application to run.  Before I realized that the CPU0 SW does NOT write the CPU1 SW instructions into the RAM (it is the FSBL, as mentioned above), I thought it would be very cool for the Linux master to arbitrarily run different CPU1 SW applications--if it can somehow access the area of RAM that has been reserved for CPU1 (I was thinking about trying to let Linux access the whole RAM, and seeing whether CPU1 can still use the last few MB of the RAM).  It is in fact possible to keep Linux from using the last 2 MB of the Zedboard RAM with the "mem=510M" argument, and later use this code to access that memory (copied from LDD 3rd ed, p. 443):

ioremap(0x1FE00000 /* 510M */, 0x200000 /* 2M */);

But even in this limited form, I can still have multiple SW to run on CPU1--as long as I ensure for the FSBL that no CPU1 SW overlaps in RAM, and the Linux application remembers the different start address for each desired CPU1 application.
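
The two constants passed to ioremap() follow directly from "mem=510M" on a 512 MB board.  A host-side sketch of that arithmetic (reserved_above() and struct region are hypothetical names; RAM is assumed to start at physical address 0):

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024u * 1024u)

struct region {
    uint32_t start; /* physical start address */
    uint32_t size;  /* size in bytes */
};

/* RAM left over above the "mem=" limit: it starts where Linux stops
 * and runs to the end of the RAM. */
struct region reserved_above(uint32_t ram_mib, uint32_t linux_mib)
{
    struct region r;

    assert(linux_mib <= ram_mib);
    assert(ram_mib < 4096); /* keep the byte counts within 32 bits */
    r.start = linux_mib * MiB;
    r.size = (ram_mib - linux_mib) * MiB;
    return r;
}
```

reserved_above(512, 510) gives start 0x1FE00000 and size 0x200000, the same pair used in the ioremap() call.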

If I can start/stop CPU1 from Linux and read/write from/to it through OCM, then CPU1 can be abstracted as a file: fopen it to start an application on CPU1, fclose it to stop CPU1 (accomplished by stopping its clock), and read/write to the file handle.  The driver should expose as many devices as there are different applications: e.g. /sys/amp/cpu1-10000000 and /sys/amp/cpu1-18000000.  Maybe more CPUs will run bare metal applications in the future.  The device driver can build these devices from a module parameter (or DTS).  In turn, the device file name will then indicate the load address to the device driver.
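
To show how a device name could carry the load address, here is a host-side parsing sketch.  The "cpu<N>-<hex address>" format is taken from the example names above; amp_parse_name() itself is a hypothetical helper, not part of any driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Parse "cpu<N>-<hex load address>", e.g. "cpu1-18000000".
 * Returns 0 on success, -1 if the name does not match the format. */
int amp_parse_name(const char *name, unsigned *cpu, uint32_t *load_addr)
{
    unsigned c;
    unsigned long a;
    int consumed = 0;

    assert(name != NULL && cpu != NULL && load_addr != NULL);
    /* %n records how many characters were consumed, to reject trailing junk */
    if (sscanf(name, "cpu%u-%8lx%n", &c, &a, &consumed) != 2)
        return -1;
    if (name[consumed] != '\0')
        return -1;
    *cpu = c;
    *load_addr = (uint32_t)a;
    return 0;
}
```

A driver would do the equivalent with kernel string helpers; the point is only that the name alone is enough to recover both the CPU number and the load address.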

Iteration 0: build a stub in-tree module

I will start with a character device model; the command/update pattern is well suited for many things I want to do on CPU1.  Plus, since I will communicate with CPU1 over OCM, this device is best characterized as a character mem device (device major ID 1).  I create a new folder <kernel>/drivers/char/zynq_amp, to keep my work cleanly separated from the existing char driver sources:

~/work/zed/kernel/drivers/char$ mkdir zynq_amp

I want to build this module only if CONFIG_ZYNQ_AMP is defined, so I added the following line to <kernel>/drivers/char/Makefile:

obj-$(CONFIG_ZYNQ_AMP) += zynq_amp/

(The trailing slash tells kbuild to descend into the zynq_amp directory.)  The kernel configs will be pulled in by <kernel>/drivers/char/Kconfig:

source "drivers/char/zynq_amp/Kconfig"

I see Makefile comment "# When adding new entries keep the list in alphabetical order", but that guidance has clearly gone unheeded in the existing Makefiles.  Then my own <>/drivers/char/zynq_amp/Makefile can be a one-liner:

obj-$(CONFIG_ZYNQ_AMP) += zynq_amp.o

<kernel>/drivers/char/zynq_amp/Kconfig needs to explain the config entry:

menu "Zynq AMP"

config ZYNQ_AMP
    tristate "Run bare-metal programs on other CPUs"
    depends on ARCH_ZYNQ
    help
      Say yes here to build support for AMP as explained in Xilinx xapp 1078
      but now through the device driver formalism. To compile this driver as a
      module, choose M here: the module will be called zynq_amp.

endmenu

I will actually depend on the Zynq OCM platform driver, but that driver comes with the Zynq machine architecture; that's why ZYNQ_AMP depends on ARCH_ZYNQ.  CONFIG_ZYNQ_AMP (whether as y or m) should be in the .config file, as in this example at the end of my zynq_xcomm_adv7511_nfs_defconfig:

CONFIG_ZYNQ_AMP=m

There should be a way to opt out of loading the device driver even if the driver is compiled into the kernel (Linux is--too--flexible).  The DTS compatibility table seems to be the way.  If I want this driver, then I can declare it at the top of the device tree:

/dts-v1/;
/include/ "zynq-zed.dtsi"

/ {
 ...
    amp@1 {
        compatible = "zynq-amp-1.0";
        reg = <0x18000000 0x1000>;
    };
};

Originally, I wanted to hang a pseudo device off the OCM (which is required for this driver as explained above) in DTS (zynq-zed-adv7511.dts), similar to how I tacked the spidev device driver onto the raw Xilinx SPI device driver in an earlier blog, like this:

/ {
    ...
    axi: amba@0 {
        ps7_ocmc_0: ps7-ocmc@f800c000 {
            amp@1 {
                compatible = "zynq-amp-1.0";
                reg = <0x18000000 0x1000>;
            };
        };
    };
};

But the driver would NOT probe (using the probing method below)!

Until the device driver is fully baked, it's easier to bypass Buildroot for the kernel and module build, and just use the Xilinx toolchain (which is actually just an old version of the Codesourcery toolchain):

~/work/zed/kernel$ source /opt/Xilinx/SDK/2014.4/settings64.sh
~/work/zed/kernel$ export CROSS_COMPILE=arm-xilinx-linux-gnueabi-
~/work/zed/kernel$ make ARCH=arm zynq_xcomm_adv7511_nfs_defconfig
~/work/zed/kernel$ make ARCH=arm uImage LOADADDR=0x8000

The trailing dash in CROSS_COMPILE is important because the cross-compile binaries are invoked simply by prepending ${CROSS_COMPILE} to the generic tool names, such as gcc.  If you have ccache, adding CC="ccache ${CROSS_COMPILE}gcc" right before ARCH=arm in the previous command will help reduce the subsequent build time (a CC given on the make command line replaces the kernel Makefile's default $(CROSS_COMPILE)gcc, so the prefix has to be spelled out), like this example:

~/work/zed/kernel$ make CC="/mnt/work/zed/buildroot/output/host/usr/bin/ccache ${CROSS_COMPILE}gcc" ARCH=arm uImage LOADADDR=0x8000

If I already have a compiled kernel, I can build out-of-tree like this:

~/work/zed/kernel/drivers/char/zynq_amp$ make -C /mnt/work/zed/buildroot/output/build/linux-custom ARCH=arm M=`pwd` modules

And of course changing the build target to clean will clean the object files and the ko file.  An "installed" module will appear on the target root file system's /lib/modules/<kernel version>/kernel.  For example, this driver would end up in this directory:

# ls /lib/modules/3.15.0/kernel/drivers/char/zynq_amp/
zynq_amp.ko

I can then probe this module:

# modprobe kernel/drivers/char/zynq_amp/zynq_amp.ko
# lsmod
Module                  Size  Used by    Not tainted
zynq_amp                1237  0 
ipv6                  272881 12 [permanent]
# rmmod kernel/drivers/char/zynq_amp/zynq_amp.ko

To iterate code development, the simplest way is to copy the kernel object into the NFS root's /lib/modules/<kernel version>/kernel folder.

Iteration 0: skeleton only

Linux already has many driver frameworks (e.g. for ADC, TTY, HID) because very few devices are truly new.  So I know that this device should not register as a bare character device--but what kind of character device?  I fell back to the misc character device: <linux/miscdevice.h>.  The driver should match itself against the device tree entry (above) like this:

static struct of_device_id zynq_amp_dt_compatible[] = {
    { .compatible = "zynq-amp-1.0" },
    { /* end of table */ }
};
static struct platform_driver amp_driver = {
    .probe = amp_probe,
    .remove = amp_remove,
    .driver = {
        .name = "amp",
        .of_match_table = zynq_amp_dt_compatible,
    },
};
module_platform_driver(amp_driver);/* because the init/exit are no-op */

Probe

At minimum, probe should read the DTS entry to configure itself correctly, and register itself to the chosen driver framework (which is the miscdevice in this case).  A no-op implementation is given below.

struct ampdevice {
struct miscdevice parent;
/* embed within singleton instead of DEFINE_MUTEX(dev_mutex);*/
struct mutex mutex;
u32 addr;
struct file* file;/* file that has this device opened */
};

static const struct file_operations amp_chrdev_ops = {
.owner = THIS_MODULE,
.open = amp_open,
.release    = amp_release,
.read = amp_read,
.llseek  = no_llseek,
};

/* Request open slot; see misc_register() implementation */
#define AMP_MISCDEV_MINOR MISC_DYNAMIC_MINOR

struct ampdevice amp = {/* the singleton amp device */
.parent = {
.minor = AMP_MISCDEV_MINOR,
.name = MODNAME,
.nodename = MODNAME,
.fops = &amp_chrdev_ops,
}
};

static int amp_remove(struct platform_device *pdev)
{
int err = 0;
struct device *dev = &pdev->dev;
struct ampdevice* amp = platform_get_drvdata(pdev);
dev_info(dev, "remove");/* should be dev_dbg? */

mutex_lock(&amp->mutex);/* ---------------------------------------- */
err = misc_deregister(&amp->parent);
if(err) {
dev_err(dev, "misc_deregister() %d", err);
}
amp->addr = 0;
mutex_unlock(&amp->mutex);/* ++++++++++++++++++++++++++++++++++++++ */
return err;
}

static int amp_probe(struct platform_device *pdev) {
    ...
    id = of_match_node(zynq_amp_dt_compatible, dev->of_node);
    if (!id) {
        err = -EINVAL;
        goto err_;
    }
    /* addr is assumed to be declared as struct resource* in the elided part */
    addr = platform_get_resource(pdev, IORESOURCE_MEM, 0);/* the reg field */
    if(!addr) {
        err = -EINVAL;/* Invalid reg */
        dev_err(dev, "NULL reg");
        goto err_;
    }

    mutex_init(&amp.mutex);
    mutex_lock(&amp.mutex);/* ---------------------------------------- */
    platform_set_drvdata(pdev, &amp);
    amp.addr = (u32)addr->start;/* physical address, not the pointer */

    err = misc_register(&amp.parent);
    mutex_unlock(&amp.mutex);/* ++++++++++++++++++++++++++++++++++++++ */
    if(err) {
        dev_err(dev, "misc_register %d", err);
        goto err_;/* nothing was registered, so nothing to remove */
    }

    return 0;/* success! */

err_:
    return err;
}

Open and release

A typical cdev registers itself to the kernel with cdev_add(), so that the kernel will link the cdev struct with the inode that is used to present the device to the userspace--in the inode.i_cdev union member (the union's other members are i_bdev and i_pipe).  open() can then use this pointer to get back to the device specific structure for itself.  An example is the iio_device_register.  But miscdevice does NOT cdev_add() for each device, so the miscdevices found in the kernel work around this in 3 ways: 1) find itself in the misc_dev list using the dev_t i_rdev in the inode as the key, 2) do the same, using a hard coded (hopefully assigned) minor device number, 3) just keep a global private structure.  While studying misc_open(), I saw that the miscdevice pointer is recorded in the file's private_data before open() is called, so I can indeed get back my struct this way.

Since this device abstracts a bare metal application running on CPU1, there is only 1 such device.  So open() enforces that only 1 file uses this device.  As explained in LDD 3rd edition, fork() and dup() do NOT open new file handle, so the code below should be valid.

int amp_open(struct inode *n, struct file *f) {
int err;
/* See ? for the explanation of this code */
struct ampdevice *amp = container_of(f->private_data, struct ampdevice
, parent);
struct device *dev;
if(!amp) { /* cannot log since no handle to device */
err = -ENODEV;
goto err_;
}
dev = amp->parent.this_device;
dev_printk(KERN_DEBUG, dev, "open");
mutex_lock(&amp->mutex);/* ---------------------------------------- */
if(amp->file) {
dev_err(dev, "already open");
err = -EMFILE;/* Too many open files */
goto err_unlock;
}
amp->file = f;/* Now this device is owned */
err = 0;

err_unlock:
mutex_unlock(&amp->mutex);/* ++++++++++++++++++++++++++++++++++++++ */
err_:
return err;
}
int amp_release(struct inode* n, struct file* f) {
int err;
struct ampdevice *amp = container_of(f->private_data, struct ampdevice
, parent);
struct device *dev;
if(!amp) { /* cannot log since no handle to device */
err = -ENODEV;
goto err_;
}
dev = amp->parent.this_device;
dev_printk(KERN_DEBUG, dev, "release");
mutex_lock(&amp->mutex);/* ---------------------------------------- */
if(f != amp->file) { /* precondition violation */
dev_err(dev, "not owner");
err = -EINVAL;
goto err_unlock;
}
amp->file = 0;/* Now this device is free */
err = 0;

err_unlock:
mutex_unlock(&amp->mutex);/* ++++++++++++++++++++++++++++++++++++++ */
err_:
return err;
}
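
The container_of() trick used in open/release can be tried out in userspace.  This is a simplified sketch: the macro below is a stripped-down version without the kernel's type checking, and miscdevice_sim/ampdevice_sim are hypothetical stand-ins for the real structs.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace version of the kernel macro (no type checking). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct miscdevice_sim {
    int minor;
    const char *name;
};

struct ampdevice_sim {
    unsigned addr;
    struct miscdevice_sim parent; /* deliberately NOT the first member */
};

/* Recover the enclosing ampdevice_sim from a pointer to its embedded parent. */
struct ampdevice_sim *to_amp(struct miscdevice_sim *misc)
{
    assert(misc != NULL);
    return container_of(misc, struct ampdevice_sim, parent);
}
```

One thing this makes visible: container_of() only subtracts an offset, so its result is NULL for a NULL input only when the embedded member sits at offset 0 (as parent does in my ampdevice struct).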

Read

Not copying any data is perfectly valid when prototyping the driver stub (the stub below just reports req bytes as read without touching the buffer).  One annoyance of printk based debugging is that I have not yet managed to make dev_dbg print its messages, even when DEBUG is defined.

ssize_t amp_read(struct file* f, char __user *buf, size_t req, loff_t *f_pos) {
int err;
struct ampdevice *amp = container_of(f->private_data, struct ampdevice
, parent);
struct device *dev;
if(!amp) { /* cannot log since no handle to device */
err = -ENODEV;
goto err_;
}
dev = amp->parent.this_device;
dev_printk(KERN_DEBUG, dev, "read");

err = req;
err_:
return err;
}

With these stub functions, I can see the open, read, and release life cycle in the kernel log when running the following command (after modprobe shown above of course):

# echo "8 4 1 8" > /proc/sys/kernel/printk
# od -vAn -N4 -tx4 /dev/amp
misc amp: open
misc amp: read
 00000000
misc amp: release

Iteration 1: probe device gets a handle to sysctl register

Iteration 2: open device resets CPU1

In JTAG debugger, CPU1 is running through this loop:

00000010:   andeq   r0, r0, r0
00000014:   andeq   r0, r0, r0
00000018:   andeq   r0, r0, r0
0000001c:   .word 0x01411c47
00000020:   andvc   r0, r7, r0, lsr #8
00000024:   ldrbhi  r0, [r8], #-2187
00000028:   ldrbne  r5, [r4], #-3216



Zynq AMP: Linux on CPU0 and bare metal on CPU1

When I first started playing around with Zedboard, I set a goal to investigate ways to integrate all computing that I've ever done in expensive (I've never worked on something that sold for less than $100K--actually more like $500K) hardware into an SoC.  Studying how to run 2 bare metal C applications on each Zynq ARM CPU (xapp 1079) was the first step, and I learned about some of the Linux kernel and device drivers after that.  When I studied xapp 1079, I had trouble thoroughly understanding its companion reference app xapp 1078, in which the app on CPU1 is kicked off from Linux running on CPU0.  But my half-year long detour through the various Linux subsystems just paid off serendipitously, because I found a Linux kernel module that may obviate the need for xapp 1078 altogether (actually will make xapp 1078 seem like a giant head-fake; maybe not as bad as James Clark's WebTV venture during the height of the dot-com boom, but still right up there).

remoteproc kernel module

There are 2 reasons to keep zynq_remoteproc as a module rather than compiling into the kernel:
  1. Since I am hosting the root file system on NFS, this module should NOT start until the NFS rootfs is mounted.  Modules seem to start AFTER NFS mounting.
  2. To start/stop CPU1, this module should be probed and removed
NOTE TO SELF: after modifying and compiling the kernel module and doing a modules_install, the modules still need to be copied to the NFS export!

When Xilinx made a marketing push to AMP (asymmetric multi-processing) a couple of years ago, they put out (rather quietly) an application note ug978 that launched FreeRTOS on CPU1 from Linux running on CPU0.  I will try to use zynq_remoteproc module--the specialization of the generic Linux remoteproc module--as verbatim as possible (<kernel>/drivers/remoteproc/zynq_remoteproc.c), to launch my own bare metal C++ application on CPU1.

Firstly, the module has to be built.  I added the following lines to my kernel defconfig:

CONFIG_RPMSG=y
CONFIG_REMOTEPROC=y
CONFIG_ZYNQ_REMOTEPROC=m

Next, the kernel has to be told about my desire to use the zynq_remoteproc driver, through DTS.  I added the following entry in zynq-zed-adv7511.dts:

remoteproc@1 {
     compatible = "xlnx,zynq_remoteproc";
     reg = < 0x1FE00000 0x200000 >;
     interrupt-parent = <&gic>;
     interrupts = < 0 37 0 0 38 0 >;
     firmware = "cpu1app.elf";
     ipino = <0>; //The only free ipino
     vring0 = <2>;
     vring1 = <3>;
};

Here, I am telling the kernel that I want to use the last 2 MB (out of 512 MB available on Zedboard) of the RAM for the bare metal app running on CPU1.  Please recall that the memory was declared in zynq-zed.dtsi, which is included by zynq-zed-adv7511.dts:

memory {
device_type = "memory";
reg = <0x000000000 0x20000000>;
};

To constrain the Linux kernel to only 510 MB without having to change the above DTS entry, I add "mem=510M" in the U-Boot kernel bootargs.  Without it, the module cannot allocate coherent DMA mapping for the last 2 MB because the following code in zynq_remoteproc probe will fail (I tried it already):

ret = dma_declare_coherent_memory(&pdev->dev, local->mem_start,
local->mem_start, local->mem_end - local->mem_start + 1,
DMA_MEMORY_IO);

In Xilinx document ug978, the CPU1 application was placed in the boot partition, right next to BOOT.bin--which lives on my SD card.  For convenience during development, I want to put the application ELF file on the NFS export.  Many Linux distributions seem to put firmware in /lib/firmware, but according to the hard coded paths in fw_path string array (<>/drivers/base/firmware_class.c), /lib/firmware/updates/ is also a possibility, as well as a custom path specified in the "path" module parameter.  This folder is conveniently accessible on my NFS host, making development iteration easier.

I can just compile this DTS in bash and move the DTB into the TFTP download folder, because I am downloading the kernel over TFTP:

~/work/zed/kernel/arch/arm/boot/dts$ ~/work/zed/kernel/scripts/dtc/dtc -I dts -O dtb -o zynq-zed-adv7511.dtb  zynq-zed-adv7511.dts
~/work/zed/kernel/arch/arm/boot/dts$ sudo mv zynq-zed-adv7511.dtb  /var/lib/tftpboot/

Of course, there is no cpu1app ELF file in /lib/firmware yet, BUT the modprobe fails for a different reason if ipino in DTS is anything other than 0:

CPU0: IPI handler 0x5 already registered to ipi_cpu_stop
zynq_remoteproc 1fe00000.remoteproc: IPI handler already registered
zynq_remoteproc 1fe00000.remoteproc: Deleting the irq_list
CPU1: Booted secondary processor
CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
zynq_remoteproc 1fe00000.remoteproc: Can't power on cpu1 -1
zynq_remoteproc: probe of 1fe00000.remoteproc failed with error -1

This code is stopping the probe():

ret = set_ipi_handler(local->ipino, ipi_kick, "Firmware kick");
if (ret) {
dev_err(&pdev->dev, "IPI handler already registered\n");
goto irq_fault;
}

Reading set_ipi_handler(), I realized that 0 (IPI_WAKEUP) is the only available IPI handler number, so I changed DTS.  I do NOT plan to use virtio, so I simply commented out anything related to vring in zynq_remoteproc with CONFIG_ZYNQ_IPC #ifdef.
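The failure mode is easy to reproduce in a user-space model.  The sketch below is NOT the kernel's set_ipi_handler(); it is a toy table with the same "first come, first served" rule, where I assume (as the log above suggests) that every slot except 0 is already claimed.  All names are hypothetical:

```c
#include <stddef.h>

#define NR_IPI 8

typedef void (*ipi_fn)(void);

struct ipi_slot {
    ipi_fn handler;
    const char *desc;
};

/* Stand-in for the handlers the kernel installs for its own use. */
static void builtin_ipi(void) { }

/* Slot 0 is left free; slots 1..7 are modelled as already taken. */
static struct ipi_slot ipi_table[NR_IPI] = {
    [1] = { builtin_ipi, "in use" },
    [2] = { builtin_ipi, "in use" },
    [3] = { builtin_ipi, "in use" },
    [4] = { builtin_ipi, "in use" },
    [5] = { builtin_ipi, "ipi_cpu_stop" }, /* the one named in the log */
    [6] = { builtin_ipi, "in use" },
    [7] = { builtin_ipi, "in use" },
};

/* Returns 0 on success, -1 if the slot is invalid or already claimed. */
int model_set_ipi_handler(unsigned int ipinr, ipi_fn handler, const char *desc)
{
    if (ipinr >= NR_IPI || handler == NULL)
        return -1;
    if (ipi_table[ipinr].handler != NULL)
        return -1; /* "IPI handler already registered" */
    ipi_table[ipinr].handler = handler;
    ipi_table[ipinr].desc = desc;
    return 0;
}
```

In this model, registering "Firmware kick" on slot 5 fails while slot 0 succeeds once; a second registration on slot 0 fails again, which is also why the handler has to be released when the module is removed.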

Simplest bare metal (actually uses the Xilinx stand-alone BSP) CPU1 application: blinks

Since bare metal AMP was demonstrated in xapp 1079, it may be easiest to pick up from there.  But briefly, building a stand-alone (no OS) for CPU1 involves the following high-level steps:
  1. Create a standalone BSP specialized for AMP CPU1 (when creating the Xilinx BSP project in xsdk, select ps_cortexa9_1 as the CPU).  Since I did not install the FreeRTOS template, the only OS choice I get is standalone--hence the project name "standalone_bsp_1".
  2. Compile an ELF executable that targets CPU1, depends on the BSP just created above, and is hard coded to some load address
Since the CPU1 BSP will NOT be used for FSBL, there is an opportunity to reduce the code size (compared to the CPU0 BSP) by NOT selecting any libraries, such as xilffs or xilrsa.
Since I am NOT interested in debugging the BSP, I can also raise the optimization level and remove the debug (-g) flag in the BSP settings.  But this is important: the USE_AMP=1 preprocessor define in the BSP settings (right click on the BSP project in Eclipse --> Board Support Package settings) changes some BSP code relative to the default BSP:
  • GIC (Generic Interrupt Controller) distributor is disabled
  • L2 cache invalidation is disabled in boot.S, and instead, virtual address 0x20000000 is mapped to 0x0 and marked as non-cacheable (while MMU is disabled of course).  xapp 1079 comments this out, so I did too.
  • Recently, John McDougall added more AMP code in boot.S to:
    • Mark the Linux DDR region as unassigned/reserved to the MMU, which is a private resource of CPU1
    • Mark the CPU1 DDR as inner (L1) cached only
  • L2 cache is NOT turned back on (because it was not invalidated in the first place!)
Marking certain sections of the DDR as reserved and the last part of the DDR as inner cached only is done in boot.S, when USE_AMP=1:

#if USE_AMP==1
// /* In case of AMP, map virtual address 0x20000000 to 0x00000000  and mark it as non-cacheable */
// ldr r3, =0x1ff /* 512 entries to cover 512MB DDR */
// ldr r0, =TblBase /* MMU Table address in memory */
// add r0, r0, #0x800 /* Address of entry in MMU table, for 0x20000000 */
// ldr r2, =0x0c02 /* S=b0 TEX=b000 AP=b11, Domain=b0, C=b0, B=b0 */
//mmu_loop:
// str r2, [r0] /* write the entry to MMU table */
// add r0, r0, #0x4 /* next entry in the table */
// add r2, r2, #0x100000 /* next section */
// subs r3, r3, #1
// bge mmu_loop /* loop till 512MB is covered */

/* Mark Linux DDR [0x00000000, 0x1FE00000) as unassigned/reserved */
ldr r3, =0x1fd  /* counter=509 to cover 510MB DDR */
ldr r0, =TblBase /* MMU Table address in memory */
ldr r2, =0x0000  /* S=b0 TEX=b000 AP=b00, Domain=b0, C=b0, B=b0 */
mmu_loop:
str r2, [r0]    /* write the entry to MMU table */
add r0, r0, #0x4 /* next entry in the table */
add r2, r2, #0x100000 /* next section */
subs r3, r3, #1     //counter--
bge mmu_loop    /* loop till Linux DDR MB covered */

/* Mark CPU1 DDR [0x1FE00000, 0x20000000) as inner cached only */
ldr r3, =0x1  /* counter=1 to cover 2MB DDR */
movw r2, #0x4de6  /* S=b0 TEX=b100 AP=b11, Domain=b1111, C=b0, B=b1 */
movt r2, #0x1FE0      /* S=b0, Section start for address 0x1FE00000 */
mmu_loop1:
str r2, [r0]    /* write the entry to MMU table */
add r0, r0, #0x4 /* next entry in the table */
add r2, r2, #0x100000 /* next section */
subs r3, r3, #1     //counter--
bge mmu_loop1    /* loop till CPU1 DDR MB is covered */
#endif
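The magic numbers in the assembly are ARMv7-A short-descriptor "section" entries (1 MB each).  To double check the comments against the constants, here is a small C sketch that packs the fields named in those comments into a descriptor; section_desc() and loop_iterations() are my own helper names, not part of the BSP:

```c
#include <stdint.h>

/*
 * Build a 1 MB section descriptor from the fields used in the boot.S
 * comments.  Bits [1:0] = 0b10 mark the entry as a section; S, XN, nG
 * and AP[2] are left at 0, as in the listing.
 */
uint32_t section_desc(uint32_t base, unsigned tex, unsigned ap,
                      unsigned domain, unsigned c, unsigned b)
{
    return (base & 0xFFF00000u)        /* section base address, bits [31:20] */
         | ((tex    & 0x7u) << 12)     /* TEX[2:0] */
         | ((ap     & 0x3u) << 10)     /* AP[1:0]  */
         | ((domain & 0xFu) << 5)      /* domain   */
         | ((c      & 0x1u) << 3)      /* C        */
         | ((b      & 0x1u) << 2)      /* B        */
         | 0x2u;                       /* type = section */
}

/*
 * "subs r3, r3, #1; bge loop" keeps looping while the decremented counter
 * is still >= 0, so the body runs (initial counter + 1) times.
 */
unsigned loop_iterations(unsigned initial_counter)
{
    return initial_counter + 1u;
}
```

section_desc(0x1FE00000, 4, 3, 15, 0, 1) evaluates to 0x1FE04DE6, which is exactly the movw/movt pair above, and section_desc(0, 0, 3, 0, 0, 0) gives the 0x0c02 of the commented-out block.  loop_iterations(0x1fd) is 510 and loop_iterations(1) is 2, matching the 510 MB and 2 MB regions.  The Linux region entries, by contrast, have bits [1:0] = 0b00, which makes them fault entries rather than sections.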

For the application, I copy the xapp 1079 CPU1 application as a new project "cpu1app" and start modifying.  Besides the application logic itself, the linker script (lscript.ld) specifies where the code/data sections will be placed in memory (DDR, to be specific, by CPU0--but that is not the concern of the linker script).  xapp1079 reserved 0x02000000 through 0x02ffffff (16 MB) for CPU1, but as shown in the DTS above, I want to allocate CPU1 memory at 0x1FE00000.  So I change the ps7_ddr_0_S_AXI_BASEADDR location and size in the linker script editor, like this:

MEMORY
{
   ps7_ddr_0_S_AXI_BASEADDR : ORIGIN = 0x1fe00000, LENGTH = 0x200000
}

Since the linker places all sections into the DDR, there is no reason to even mention other on-chip memory (BRAM at 0x0 and OCM at 0xFFFC0000).  I don't know the correct stack and heap size yet, so I'll just leave them alone (8 KB each).

_STACK_SIZE = DEFINED(_STACK_SIZE) ? _STACK_SIZE : 0x2000;
_HEAP_SIZE = DEFINED(_HEAP_SIZE) ? _HEAP_SIZE : 0x2000;

The simplest app I can think of is a blinker.  Recently, John McDougall introduced a sleep method using CPU1's private timer (which seems to be called SCU timer--I don't yet see the connection to the snoop control unit).  John McDougall's code for initializing the SCU timer and calling a sleep on it is in this download (in design/src/apps/app_cpu1/scu_sleep.[ch]).  With that code, main() simply calls the SCU timer init and then sleeps for 1 second over and over.  The listing here is the even simpler polled GPIO variant, which uses a busy-wait delay loop instead of the timer:

#define GPIO_DEVICE_ID   XPAR_XGPIOPS_0_DEVICE_ID
#define LED_DELAY 10000000
#define OUTPUT_PIN 7 /* Pin connected to LED/Output */
XGpioPs Gpio; /* The driver instance for GPIO Device. */

static int GpioOutputExample(void)
{
volatile int Delay;

XGpioPs_SetDirectionPin(&Gpio, OUTPUT_PIN, 1);
XGpioPs_SetOutputEnablePin(&Gpio, OUTPUT_PIN, 1);
XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x0);

while(1) {
XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x1);
for (Delay = 0; Delay < LED_DELAY; Delay++);
XGpioPs_WritePin(&Gpio, OUTPUT_PIN, 0x0);
for (Delay = 0; Delay < LED_DELAY; Delay++);
}
return XST_SUCCESS;
}

int main(void)
{
int Status;
XGpioPs_Config *ConfigPtr;

ConfigPtr = XGpioPs_LookupConfig(GPIO_DEVICE_ID);
Status = XGpioPs_CfgInitialize(&Gpio, ConfigPtr,
ConfigPtr->BaseAddr);
if (Status != XST_SUCCESS) {
return XST_FAILURE;
}
Status = GpioOutputExample();
if (Status != XST_SUCCESS) {
return XST_FAILURE;
}

return XST_SUCCESS;
}

WITHOUT the USE_AMP=1 modifications I made to boot.S above, I can launch this program from xsdk (Xilinx SW development IDE), and I can see the blinking LED.

xsdk builds the ELF file with ease, and I moved that file into a new folder /lib/firmware within the NFS exported root for the target.  When I rebooted Zedboard, I was greeted with what seems like a minor success in dmesg output:

CPU1: shutdown
 remoteproc0: 1fe00000.remoteproc is available
 remoteproc0: Note: remoteproc is still under development and considered experimental.
 remoteproc0: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.

As dmesg suggests, Linux first shut down CPU1.  Silently, it tries to load the firmware through this chain: zynq_remoteproc_probe() --> rproc_add() --> rproc_add_virtio_devices() --> request_firmware_nowait() --> INIT_WORK(&fw_work->work, request_firmware_work_func) --> request_firmware_work_func() --> _request_firmware() --> fw_get_filesystem_firmware() --> fw_read_file_contents().  request_firmware_work_func() should also do the post-load work (like booting the remote processor) through the fw_work->cont function pointer, which points to rproc_fw_config_virtio(), but that bails out because rproc_find_rsc_table() (which ends up in rproc_elf_find_rsc_table()) finds no resource table in my ELF file.

The debugger does NOT respond when CPU1 is halted (as in this case), so I had to rely on printk.  I came to appreciate the value of out-of-tree module compilation:

~/work/zed/kernel/drivers/remoteproc$ make -C /mnt/work/zed/buildroot/output/build/linux-custom ARCH=arm M=`pwd` modules

Having the target's modules folder on NFS export (/export/root/zedbr2/lib/modules/3.15/kernel/drivers/remoteproc in this case) made the otherwise printk based debugging much faster (it still took a few days to navigate through all the source and try different hypotheses).  Finally, I realized that my executable does not have the .resource_table section the ELF loader is looking for.  I put a minimal resource table with a single carveout entry (note that num=1 below) in its own section (which is what the remoteproc module looks for after the ELF loader parses the ELF file) in lscript.ld:

.resource_table : {
   __rtable_start = .;
   *(.rtable)
   __rtable_end = .;
} > ps7_ddr_0_S_AXI_BASEADDR

The C program can have the global data as the resource table content:

#define RAM_ADDR 0x1fe00000
struct resource_table {//Just copied from linux/remoteproc.h
u32 ver;//Must be 1 for remoteproc module!
u32 num;
u32 reserved[2];
u32 offset[1];
} __packed;
enum fw_resource_type {
RSC_CARVEOUT = 0,
RSC_DEVMEM = 1,
RSC_TRACE = 2,
RSC_VDEV = 3,
RSC_MMU = 4,
RSC_LAST = 5,
};
struct fw_rsc_carveout {
u32 type;//from struct fw_rsc_hdr
u32 da;
u32 pa;
u32 len;
u32 flags;
u32 reserved;
u8 name[32];
} __packed;

__attribute__ ((section (".rtable")))
const struct rproc_resource {
    struct resource_table base;
    //u32 offset[4];
    struct fw_rsc_carveout code_cout;
} ti_ipc_remoteproc_ResourceTable = {
.base = { .ver = 1, .num = 1, .reserved = { 0, 0 },
.offset = { offsetof(struct rproc_resource, code_cout) },
},
.code_cout = {
   .type = RSC_CARVEOUT, .da = RAM_ADDR, .pa = RAM_ADDR, .len = 1<<19,
   .flags=0, .reserved=0, .name="CPU1CODE",
},
};
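To convince myself that the table layout is what a loader expects, it helps to walk the table from the other side.  The following is a user-space sketch, not the remoteproc code: count_carveouts() is a hypothetical helper that applies the checks I believe the ELF loader makes (version must be 1, reserved words must be 0, every offset must stay inside the section) and then counts the RSC_CARVEOUT entries.  The struct names are shortened copies of the ones above using stdint types:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { RSC_CARVEOUT = 0 };

struct rsc_table_hdr {          /* struct resource_table without offset[] */
    uint32_t ver;
    uint32_t num;
    uint32_t reserved[2];
};

struct rsc_carveout {           /* same layout as struct fw_rsc_carveout */
    uint32_t type;
    uint32_t da;
    uint32_t pa;
    uint32_t len;
    uint32_t flags;
    uint32_t reserved;
    uint8_t  name[32];
};

/* Returns the number of carveout entries, or -1 if the table is malformed. */
int count_carveouts(const uint8_t *tbl, size_t len)
{
    struct rsc_table_hdr hdr;
    uint32_t i;
    int found = 0;

    if (len < sizeof(hdr))
        return -1;
    memcpy(&hdr, tbl, sizeof(hdr));
    if (hdr.ver != 1 || hdr.reserved[0] != 0 || hdr.reserved[1] != 0)
        return -1;
    if (hdr.num > (len - sizeof(hdr)) / sizeof(uint32_t))
        return -1;              /* offset array would overrun the section */

    for (i = 0; i < hdr.num; i++) {
        uint32_t off, type;

        memcpy(&off, tbl + sizeof(hdr) + i * sizeof(off), sizeof(off));
        if (off > len || len - off < sizeof(type))
            return -1;          /* entry starts outside the section */
        memcpy(&type, tbl + off, sizeof(type));
        if (type == RSC_CARVEOUT) {
            if (len - off < sizeof(struct rsc_carveout))
                return -1;      /* truncated carveout entry */
            found++;
        }
    }
    return found;
}
```

For the table in this post, the header takes 16 bytes, offset[0] takes 4 more, so the carveout entry sits at offset 20 and the whole section is 76 bytes; count_carveouts() on those bytes returns 1.  Setting ver to anything other than 1 makes it return -1, which corresponds to the "Must be 1" comment above.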

With this change, my program is copied to the correct location in the DRAM, and I can dynamically start/stop Linux on CPU1 by probing and removing the module, like this:

# rmmod zynq_remoteproc
# modprobe kernel/drivers/remoteproc/zynq_remoteproc.ko

This driver shows up in /sys/module/zynq_remoteproc/ and /sys/devices/1fe00000.remoteproc.  But the zynq_remoteproc probe does NOT call rproc_boot(); it merely loads the firmware.  Indeed, it cannot, because the firmware loading completes asynchronously from module probing.  Supposedly, the rpmsg module probe should call rproc_boot(), so I tried the following:

# modprobe kernel/drivers/rpmsg/virtio_rpmsg_bus.ko

But the module's probe still does NOT get called (note that I crossed out CONFIG_RPMSG=y in my defconfig above)!  I could not figure out how to get the virtio device probed, and for that matter, another determined engineer could not either, so I just added a single-threaded work queue to call rproc_boot() after the firmware is loaded.

struct zynq_rproc_pdata {
struct irq_list mylist;
struct rproc *rproc;
u32 ipino;
#ifdef CONFIG_ZYNQ_IPC
u32 vring0;
u32 vring1;
#endif
u32 mem_start;
u32 mem_end;

//Need my own workqueue rather than a shared work queue because I will block for completion
struct workqueue_struct* wq;
struct work_struct boot_work;
};

static void boot_cpu1(struct work_struct *work) {
struct zynq_rproc_pdata* local =
container_of(work, struct zynq_rproc_pdata, boot_work);
struct rproc* rproc = local->rproc;
int err;

wait_for_completion(&rproc->firmware_loading_complete);
dev_info(&rproc->dev, "firmware_loading_complete\n");
err = rproc_boot(rproc);
if(err)
dev_err(&rproc->dev, "rproc_boot %d\n", err);
}

static int zynq_remoteproc_probe(struct platform_device *pdev)
{
...
ret = rproc_add(local->rproc);
if (ret) {
dev_err(&pdev->dev, "rproc registration failed\n");
goto rproc_fault;
}

INIT_WORK(&local->boot_work, boot_cpu1);
local->wq = create_singlethread_workqueue("znq_remoteproc boot");
if (!local->wq) { /* returns NULL on failure, not an ERR_PTR */
dev_err(&pdev->dev, "create_singlethread_workqueue failed\n");
ret = -ENOMEM;
goto rproc_fault;
}
queue_work(local->wq, &local->boot_work);
...
}


static int zynq_remoteproc_remove(struct platform_device *pdev)
{
struct zynq_rproc_pdata *local = platform_get_drvdata(pdev);
u32 ret;

dev_info(&pdev->dev, "%s\n", __func__);
rproc_shutdown(local->rproc);
destroy_workqueue(local->wq);
...

With this change, my cpu1app runs on boot:

 remoteproc0: firmware_loading_complete
 remoteproc0: powering up 1fe00000.remoteproc
 remoteproc0: Read /lib/firmware/cpu1app.elf 0
 remoteproc0: firmware: direct-loading firmware cpu1app.elf
 remoteproc0: assign_firmware_buf, flag 5 state 0
 remoteproc0: Booting fw image cpu1app.elf, size 150445
zynq_remoteproc 1fe00000.remoteproc: iommu not found
 remoteproc0: rsc: type 0
 remoteproc0: phdr: type 1 da 0x1fe00000 memsz 0xd890 filesz 0x8058
 remoteproc0: rproc_da_to_va 1fe00000 -->   (null) remoteproc0: rproc_da_to_va 1fe0800c -->   (null)
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_start
 remoteproc0: remote processor 1fe00000.remoteproc is now up

I can also debug my app in the xsdk JTAG debugger.  This debugger stack trace is proof that I am running Linux on CPU0 and my bare metal application on CPU1:

ARM Cortex-A9 MPCore #0 (Suspended)
0xc0020428 cpu_v7_do_idle(): arch/arm/mm/proc-v7.S, line 74
0xc0013d1c arm_cpuidle_simple_enter(): arch/arm/kernel/cpuidle.c, line 18
0xc03d08b8 cpuidle_enter_state(): drivers/cpuidle/cpuidle.c, line 104
0xc03d09ac cpuidle_enter(): drivers/cpuidle/cpuidle.c, line 159
0xc0060ad0 cpu_startup_entry(): kernel/sched/idle.c, line 154
0xc0573fac rest_init(): init/main.c, line 397
0xc07ebba4 start_kernel(): init/main.c, line 652
0x00008074
0x00008074
ARM Cortex-A9 MPCore #1 (Suspended)
0x1fe00594 GpioOutputExample(): ../src/xgpiops_polled_example.c, line 93
0x1fe005f4 main(): ../src/xgpiops_polled_example.c, line 113
0x1fe02264 _start()

rmmod zynq_remoteproc does not work; remove() method is not even getting called.  As a result, I cannot stop cpu1app; it just starts at the system bootup, and keeps running--which is OK for an embedded application.  Another approach would be to create another module that boots and stops zynq_remoteproc, but I don't know how to get a handle to the existing zynq_remoteproc instance...

Better alternative: provide "up" device attribute to read/write

If I provide a sysfs file for the userspace to write to, the firmware will probably have been loaded already by the time the user writes '1' to the attribute file.  So I created the store/show methods of "up" attribute as shown here:

ssize_t up_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count) {
struct rproc *rproc = container_of(dev, struct rproc, dev);
//struct platform_device *pdev = to_platform_device(dev);
//struct zynq_rproc_pdata *local = platform_get_drvdata(pdev);
if(buf[0] == '0') { //want to shut down
rproc_shutdown(rproc);
} else { // bring up
rproc_boot(rproc);
}
return count;
}
static ssize_t up_show(struct device *dev,
    struct device_attribute *attr, char *buf) {
struct rproc *rproc = container_of(dev, struct rproc, dev);
return sprintf(buf, "%d\n", rproc->state);
}
static DEVICE_ATTR_RW(up);

And in probe, I can register this file:

... ret = rproc_add(local->rproc);
if (ret) {
dev_err(&pdev->dev, "rproc registration failed\n");
goto rproc_fault;
}

ret = device_create_file(&local->rproc->dev, &dev_attr_up);
return ret;

When I probe this module, I can read the "up" file:

# cat  /sys/devices/1fe00000.remoteproc/remoteproc0/up
 0

I then start the cpu1app by writing 1 to the file:

# echo 1 > /sys/devices/1fe00000.remoteproc/remoteproc0/up
 remoteproc0: powering up 1fe00000.remoteproc
 remoteproc0: Read /lib/firmware/cpu1app.elf 0
 remoteproc0: firmware: direct-loading firmware cpu1app.elf
 remoteproc0: assign_firmware_buf, flag 5 state 0
 remoteproc0: Booting fw image cpu1app.elf, size 150445
zynq_remoteproc 1fe00000.remoteproc: iommu not found
 remoteproc0: rsc: type 0
 remoteproc0: phdr: type 1 da 0x1fe00000 memsz 0xd890 filesz 0x8058
 remoteproc0: rproc_da_to_va 1fe00000 -->   (null) remoteproc0: rproc_da_to_va 1fe0800c -->   (null)
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_start
 remoteproc0: remote processor 1fe00000.remoteproc is now up

And the up file now reads 2, which means RPROC_RUNNING (and the LED is blinking!).

# cat  /sys/devices/1fe00000.remoteproc/remoteproc0/up
 2
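Since up_show() prints the raw rproc->state number, a tiny lookup makes the output easier to read.  This sketch assumes the enum rproc_state values from linux/remoteproc.h of this kernel generation (0 = offline, 1 = suspended, 2 = running, 3 = crashed); rproc_state_name() is my own helper, so please check the header of your kernel before relying on it:

```c
/* Map the number read from the "up" file to a human-readable name. */
const char *rproc_state_name(int state)
{
    switch (state) {
    case 0:  return "offline";
    case 1:  return "suspended";
    case 2:  return "running";
    case 3:  return "crashed";
    default: return "unknown";
    }
}
```

With this mapping, the 0 read right after probing means "offline" and the 2 read after writing 1 means "running".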

To stop CPU1, I have to do 2 things in succession: write 0 to the "up" file, and then remove the module:

# echo 0 > /sys/devices/1fe00000.remoteproc/remoteproc0/up
zynq_remoteproc 1fe00000.remoteproc: zynq_rproc_stop
 remoteproc0: stopped remote processor 1fe00000.remoteproc

# rmmod zynq_remoteproc
zynq_remoteproc 1fe00000.remoteproc: zynq_remoteproc_remove
zynq_remoteproc 1fe00000.remoteproc: Deleting the irq_list
 remoteproc0: releasing 1fe00000.remoteproc
CPU1: Booted secondary processor

At this point, Linux has been restarted on the 2nd processor; if I do things in this way, I can restart the app again by modprobing and then writing 1 to the "up" file again.
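The echo/cat sequence above can also be driven from a user-space program.  Here is a minimal sketch using only the C standard library; write_up() and read_state() are hypothetical helper names, and the path is passed in by the caller (on my board it would be /sys/devices/1fe00000.remoteproc/remoteproc0/up):

```c
#include <stdio.h>

/* Write "1" (boot) or "0" (shut down) to the attribute file.
 * Returns 0 on success, -1 on any I/O error. */
int write_up(const char *path, int up)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fputs(up ? "1\n" : "0\n", f) >= 0;
    if (fclose(f) != 0) /* the write may only be reported at close time */
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the state number back.  Returns the number, or -1 on error. */
int read_state(const char *path)
{
    FILE *f = fopen(path, "r");
    int state;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &state) != 1)
        state = -1;
    fclose(f);
    return state;
}
```

A supervisor process could call write_up(path, 1), poll read_state(path) until it returns 2, and call write_up(path, 0) before the module is removed.  Note that up_store() as written ignores the return value of rproc_boot(), so a successful write does not by itself prove that CPU1 came up; reading the state back is the only check available from user space.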