Feb 26, 2015

Zynq 2nd CPU as a Linux device

Current status

This blog is orphaned in favor of the superior method described in another blog.  I keep this one only as a note to myself.

Introduction

Xilinx engineer John McDougall has been improving the reference design xapp 1079 over the last couple of years.  In the initial implementation, the 2nd CPU just waits in a WFE (wait for event) loop until the 1st CPU writes the jump address into a magic address; now CPU0 resets CPU1 using a Zynq proprietary system control register.  The companion reference application, xapp 1078, does not seem to have been similarly updated.  At a high level, xapp 1078 does the following to demonstrate cooperation between Linux running on CPU0 and a bare metal C application running on CPU1:
  1. Linux is procured, and the DTS is modified to restrict Linux running on CPU0 to use only half of the RAM.
  2. A Linux user space application called rwmem is compiled.
  3. The bare metal C application ELF is compiled in XSDK, and is copied to the SD card's BOOT partition (along with the bitstream, FSBL, U-Boot, and the Linux image).  The bare metal app's load address is hard coded at this time (this is NOT a position independent code!).
  4. When the board starts, the FSBL programs the HW with the bitstream, then copies every ELF to its designated load address, and then starts the 1st ELF program, which happens to be U-Boot.  CPU1 is parked in the WFE loop by the bootROM code.
  5. Once U-Boot starts Linux, the rwmem program can then write the start address of the bare metal application for CPU1 at a special address.
  6. The bare metal application initializes itself (does NOT initialize the level 2 cache but remaps the address--remember that the RAM has been segmented into those dedicated for CPU0 and the rest for CPU1), and can then begin communicating with Linux application through OCM.
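Step 5 boils down to a single 32-bit store through a physical-memory mapping.  The following is a minimal user-space sketch of that idea, not the actual rwmem source: poke32() is a hypothetical helper, and both the file to map (which would be /dev/mem on the target) and the address to write are supplied by the caller instead of being hard coded.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store one 32-bit word at byte offset `phys` of the
 * file `path` through a shared mapping.  Returns 0 on success, -1 on error. */
int poke32(const char *path, off_t phys, uint32_t value)
{
	long page = sysconf(_SC_PAGESIZE);
	off_t base;
	size_t delta;
	void *map;
	int fd;

	if (page <= 0 || phys < 0 || (phys & 3))
		return -1;                      /* only aligned word stores */
	base = phys & ~((off_t)page - 1);       /* mmap offset must be page aligned */
	delta = (size_t)(phys - base);

	fd = open(path, O_RDWR | O_SYNC);
	if (fd < 0)
		return -1;
	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, base);
	close(fd);                              /* the mapping outlives the fd */
	if (map == MAP_FAILED)
		return -1;

	*(volatile uint32_t *)((char *)map + delta) = value;
	munmap(map, (size_t)page);
	return 0;
}
```

On the target this would be called as poke32("/dev/mem", special_address, start_address), with the two values taken from the xapp documentation.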
I realize now how sensible it is for CPU0 to reset CPU1, as John McDougall has done in recent changes to xapp 1079 (in the app_cpu0.c he provides for CPU0): the master SW running on Linux can not only start/stop CPU1, but also, to some extent, choose which application to run.  Before I realized that the CPU0 SW does NOT write the CPU1 SW instructions into the RAM (the FSBL does, as mentioned above), I thought it would be very cool for the Linux master to run arbitrary CPU1 SW applications--if it could somehow access the area of RAM that has been reserved for CPU1 (I was thinking about letting Linux access the whole RAM, and seeing whether CPU1 can still use the last few MB of the RAM).  It is in fact possible to keep Linux from using the last 2 MB of the Zedboard RAM with the 'mem=510M' boot argument, and later use this code to access that memory (copied from LDD 3rd ed, p. 443):

ioremap(0x1FE00000 /* 510M */, 0x200000 /* 2M */);

But even in this limited form, I can still have multiple SW to run on CPU1--as long as I ensure for the FSBL that no CPU1 SW overlaps in RAM, and the Linux application remembers the different start address for each desired CPU1 application.
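The mem=510M argument and the two ioremap() arguments above are tied together by simple arithmetic.  A small sketch that makes the relation explicit follows; reserved_region() and struct region are names invented here, and the sketch assumes the RAM starts at physical address 0 and that the Zedboard has 512 MB of it.

```c
#include <stdint.h>

#define MIB (1024u * 1024u)

struct region {
	uint32_t base;  /* first physical address Linux does not own */
	uint32_t size;  /* number of bytes left over for CPU1 */
};

/* Assumes RAM starts at physical address 0: with mem=<linux_mib>M, Linux
 * owns [0, linux_mib MiB) and the remainder up to total_mib is reserved. */
struct region reserved_region(uint32_t total_mib, uint32_t linux_mib)
{
	struct region r = { 0, 0 };

	if (linux_mib < total_mib) {
		r.base = linux_mib * MIB;
		r.size = (total_mib - linux_mib) * MIB;
	}
	return r;
}
```

For the Zedboard, reserved_region(512, 510) yields base 0x1FE00000 and size 0x200000, which are the two numbers passed to ioremap().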

If I can start/stop CPU1 from Linux and read/write from/to it through OCM, then CPU1 can be abstracted as a file: fopen it to start an application on CPU1, fclose it to stop CPU1 (accomplished by stopping its clock), and read/write to the file handle.  The driver should expose as many devices as there are different applications: e.g. /sys/amp/cpu1-10000000 and /sys/amp/cpu1-18000000.  Maybe more CPUs will run bare metal applications in the future.  The device driver can build these devices from a module parameter (or DTS).  In turn, the device file name will then indicate the load address to the device driver.
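To show how a device name could carry the load address, here is a user-space sketch that parses names of the form cpu<N>-<hex address>.  The function parse_amp_name() is hypothetical, and the exact name format is only my reading of the two examples above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parser for names such as "cpu1-18000000".
 * Returns 0 and fills *cpu and *load_addr on success, -1 otherwise. */
int parse_amp_name(const char *name, unsigned *cpu, uint32_t *load_addr)
{
	unsigned c, a;
	int consumed = 0;

	/* %8x caps the address at 8 hex digits; %n records how far we got */
	if (sscanf(name, "cpu%u-%8x%n", &c, &a, &consumed) != 2)
		return -1;
	if (name[consumed] != '\0')
		return -1;              /* trailing characters: reject */
	*cpu = c;
	*load_addr = (uint32_t)a;
	return 0;
}
```

With this, "cpu1-18000000" parses to CPU 1 and load address 0x18000000, while a name with trailing characters or a missing address is rejected.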

Iteration 0: build a stub in-tree module

I will start with a character device model; the command/update pattern is well suited for many things I want to do on CPU1.  Plus, since I will communicate with CPU1 over OCM, this device is best characterized as a character mem device (device major ID 1).  I create a new folder <kernel>/drivers/char/zynq_amp, to keep my work cleanly separated from the existing char driver sources:

~/work/zed/kernel/drivers/char$ mkdir zynq_amp

I want to build this module only if CONFIG_ZYNQ_AMP is defined, so I added the following line to <kernel>/drivers/char/Makefile:

obj-$(CONFIG_ZYNQ_AMP) += zynq_amp/

The kernel configs will be pulled in <kernel>/drivers/char/Kconfig:

source "drivers/char/zynq_amp/Kconfig"

I see the Makefile comment "# When adding new entries keep the list in alphabetical order", but that guidance has clearly gone unheeded in the existing Makefiles.  Then my own <kernel>/drivers/char/zynq_amp/Makefile can be a one-liner:

obj-$(CONFIG_ZYNQ_AMP) += zynq_amp.o

<kernel>/drivers/char/zynq_amp/Kconfig needs to explain the config entry:

menu "Zynq AMP"

config ZYNQ_AMP
	tristate "Run bare-metal programs on other CPUs"
	depends on ARCH_ZYNQ
	help
	  Say yes here to build support for AMP as explained in Xilinx xapp 1078
	  but now through the device driver formalism. To compile this driver as a
	  module, choose M here: the module will be called zynq_amp.

endmenu

I will actually depend on the Zynq OCM platform driver, but that driver comes with the Zynq machine architecture; that's why ZYNQ_AMP depends on ARCH_ZYNQ.  CONFIG_ZYNQ_AMP (whether as y or m) should be in the .config file, as in this example at the end of my zynq_xcomm_adv7511_nfs_defconfig:

CONFIG_ZYNQ_AMP=m

There should be a way to opt out of loading the device driver even if the driver is compiled into the kernel (Linux is--too--flexible).  The DTS compatibility table seems to be the way.  If I want this driver, then I can declare it at the top of the device tree:

/dts-v1/;
/include/ "zynq-zed.dtsi"

/ {
 ...
    amp@1 {
        compatible = "zynq-amp-1.0";
        reg = <0x18000000 0x1000>;
    };
};

Originally, I wanted to hang a pseudo device off the OCM (which is required for this driver as explained above) in DTS (zynq-zed-adv7511.dts), similar to how I tacked the spidev device driver onto the raw Xilinx SPI device driver in an earlier blog, like this:

/ {
 ...
    axi: amba@0 {
        ps7_ocmc_0: ps7-ocmc@f800c000 {
            amp@1 {
                compatible = "zynq-amp-1.0";
                reg = <0x18000000 0x1000>;
            };
        };
    };
};

But the driver would NOT probe (using the probing method below)!

Until the device driver is fully baked, it's easier to bypass Buildroot for the kernel and module build, and just use the Xilinx toolchain (which is actually just an old version of the Codesourcery toolchain):

~/work/zed/kernel$ source /opt/Xilinx/SDK/2014.4/settings64.sh
~/work/zed/kernel$ export CROSS_COMPILE=arm-xilinx-linux-gnueabi-
~/work/zed/kernel$ make ARCH=arm zynq_xcomm_adv7511_nfs_defconfig
~/work/zed/kernel$ make ARCH=arm uImage LOADADDR=0x8000

The last dash in CROSS_COMPILE is important because the cross-compile binaries are invoked simply by prepending ${CROSS_COMPILE} to the generic tool names, such as gcc.  If you have ccache in your path, adding CC="ccache gcc" right before ARCH=arm in the previous command will help reduce the subsequent build time, like this example:

~/work/zed/kernel$ make CC="/mnt/work/zed/buildroot/output/host/usr/bin/ccache gcc" ARCH=arm uImage LOADADDR=0x8000

If I already have a compiled kernel, I can build out-of-tree like this:

~/work/zed/kernel/drivers/char/zynq_amp$ make -C /mnt/work/zed/buildroot/output/build/linux-custom ARCH=arm M=`pwd` modules

And of course changing the build target to clean will clean the object files and the ko file.  An "installed" module will appear under the target root file system's /lib/modules/<kernel version>/kernel.  For example, this driver would end up in this directory:

# ls /lib/modules/3.15.0/kernel/drivers/char/zynq_amp/
zynq_amp.ko

I can then probe this module:

# modprobe kernel/drivers/char/zynq_amp/zynq_amp.ko
# lsmod
Module                  Size  Used by    Not tainted
zynq_amp                1237  0 
ipv6                  272881 12 [permanent]
# rmmod kernel/drivers/char/zynq_amp/zynq_amp.ko

To iterate code development, the simplest way is to copy the kernel object into the NFS root's /lib/modules/<kernel version>/kernel folder.

Iteration 0: skeleton only

Linux already has many driver frameworks (e.g. for ADC, TTY, HID) because very few devices are truly new.  So I know that this device should not register as a bare character device--but what kind of character device?  I fell back to the misc character device: <linux/miscdevice.h>.  The driver should match itself against the device tree entry (above) like this:

static struct of_device_id zynq_amp_dt_compatible[] = {
	{ .compatible = "zynq-amp-1.0" },
	{ /* end of table */ }
};
static struct platform_driver amp_driver = {
	.probe = amp_probe,
	.remove = amp_remove,
	.driver = {
		.name = "amp",
		.of_match_table = zynq_amp_dt_compatible,
	},
};
module_platform_driver(amp_driver);/* because the init/exit are no-op */

Probe

At minimum, probe should read the DTS entry to configure itself correctly, and register itself to the chosen driver framework (which is the miscdevice in this case).  A no-op implementation is given below.

struct ampdevice {
	struct miscdevice parent;
	/* embed within singleton instead of DEFINE_MUTEX(dev_mutex);*/
	struct mutex mutex;
	u32 addr;
	struct file* file;/* file that has this device opened */
};

static const struct file_operations amp_chrdev_ops = {
	.owner = THIS_MODULE,
	.open = amp_open,
	.release = amp_release,
	.read = amp_read,
	.llseek = no_llseek,
};

/* Request open slot; see misc_register() implementation */
#define AMP_MISCDEV_MINOR MISC_DYNAMIC_MINOR

struct ampdevice amp = {/* the singleton amp device */
	.parent = {
		.minor = AMP_MISCDEV_MINOR,
		.name = MODNAME,
		.nodename = MODNAME,
		.fops = &amp_chrdev_ops,
	}
};

static int amp_remove(struct platform_device *pdev)
{
	int err = 0;
	struct device *dev = &pdev->dev;
	struct ampdevice* amp = platform_get_drvdata(pdev);
	dev_info(dev, "remove");/* should be dev_dbg? */

	mutex_lock(&amp->mutex);/* ---------------------------------------- */
	err = misc_deregister(&amp->parent);
	if(err) {
		dev_err(dev, "misc_deregister() %d", err);
	}
	amp->addr = 0;
	mutex_unlock(&amp->mutex);/* ++++++++++++++++++++++++++++++++++++++ */
	return err;
}

static int amp_probe(struct platform_device *pdev) {
...
	id = of_match_node(zynq_amp_dt_compatible, dev->of_node);
	if (!id) {
		err = -EINVAL;
		goto err_;
	}
	addr = platform_get_resource(pdev, IORESOURCE_MEM, 0);/* the reg field */
	if(!addr) {
		err = -EINVAL;/* Invalid reg */
		dev_err(dev, "NULL reg");
		goto err_;
	}

	mutex_init(&amp.mutex);
	platform_set_drvdata(pdev, &amp);

	mutex_lock(&amp.mutex);/* ---------------------------------------- */
	/* store the physical base from reg, not the resource pointer itself */
	amp.addr = (u32)addr->start;
	err = misc_register(&amp.parent);
	if(err)
		amp.addr = 0;/* nothing registered, so nothing to deregister */
	mutex_unlock(&amp.mutex);/* ++++++++++++++++++++++++++++++++++++++ */
	if(err) {
		dev_err(dev, "misc_register %d", err);
		goto err_;
	}

	return 0;/* success! */

err_:
	return err;
}

Open and release

A typical cdev registers itself with the kernel through cdev_add(), so that the kernel links the cdev struct with the inode that presents the device to userspace--through the i_cdev member of a union in the inode (alongside i_bdev and i_pipe).  open() can then use this pointer to get back to the device specific structure.  An example is iio_device_register().  But a miscdevice does NOT call cdev_add() itself, so the miscdevices found in the kernel work around this in 3 ways: 1) find themselves in the misc device list using the dev_t i_rdev in the inode as the key, 2) do the same, using a hard coded (hopefully assigned) minor device number, or 3) just keep a global private structure.  While studying misc_open(), I saw that the miscdevice pointer is recorded in the file's private_data field before the driver's open() is called, so I can indeed get back my struct this way.
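The trick of getting from the miscdevice pointer back to the enclosing driver structure is plain pointer arithmetic.  Below is a user-space sketch of it with mock types; the *_mock structs and amp_from_file() are made up for illustration, and the macro is a simplified stand-in for the kernel's container_of().

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space stand-in for the kernel's container_of() */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct miscdevice_mock { int minor; const char *name; };
struct file_mock { void *private_data; };

struct ampdevice_mock {
	uint32_t addr;                  /* placed first on purpose, so that  */
	struct miscdevice_mock parent;  /* parent sits at a non-zero offset  */
};

/* Recover the driver structure from the pointer stashed in the file */
struct ampdevice_mock *amp_from_file(struct file_mock *f)
{
	return container_of(f->private_data, struct ampdevice_mock, parent);
}
```

Because the result is just the stored pointer minus a constant offset, a NULL check on it only means something when the member is at offset 0, which is the case for parent in the real struct ampdevice.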

Since this device abstracts a bare metal application running on CPU1, there is only 1 such device.  So open() enforces that only 1 file uses this device.  As explained in LDD 3rd edition, fork() and dup() do NOT open a new file handle, so the code below should be valid.

int amp_open(struct inode *n, struct file *f) {
	int err;
	/* misc_open() stored the miscdevice pointer in f->private_data */
	struct ampdevice *amp = container_of(f->private_data, struct ampdevice
			, parent);
	struct device *dev;
	if(!amp) { /* cannot log since no handle to device */
		err = -ENODEV;
		goto err_;
	}
	dev = amp->parent.this_device;
	dev_printk(KERN_DEBUG, dev, "open");
	mutex_lock(&amp->mutex);/* ---------------------------------------- */
	if(amp->file) {
		dev_err(dev, "already open");
		err = -EMFILE;/* Too many open files */
		goto err_unlock;
	}
	amp->file = f;/* Now this device is owned */
	err = 0;

err_unlock:
	mutex_unlock(&amp->mutex);/* ++++++++++++++++++++++++++++++++++++++ */
err_:
	return err;
}
int amp_release(struct inode* n, struct file* f) {
	int err;
	struct ampdevice *amp = container_of(f->private_data, struct ampdevice
			, parent);
	struct device *dev;
	if(!amp) { /* cannot log since no handle to device */
		err = -ENODEV;
		goto err_;
	}
	dev = amp->parent.this_device;
	dev_printk(KERN_DEBUG, dev, "release");
	mutex_lock(&amp->mutex);/* ---------------------------------------- */
	if(f != amp->file) { /* precondition violation */
		dev_err(dev, "not owner");
		err = -EINVAL;
		goto err_unlock;
	}
	amp->file = NULL;/* Now this device is free */
	err = 0;

err_unlock:
	mutex_unlock(&amp->mutex);/* ++++++++++++++++++++++++++++++++++++++ */
err_:
	return err;
}
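The single-owner rule in amp_open() and amp_release() can be tried out in user space.  The sketch below is not the driver code: it swaps the mutex for a C11 atomic compare-and-swap so that it builds without any threading library, and struct amp_slot, slot_open() and slot_release() are hypothetical names.  The error codes mirror the ones used above.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct amp_slot {
	_Atomic(void *) owner;  /* NULL while nobody holds the device */
};

/* Claim the slot for `file`; fails if somebody else already owns it */
int slot_open(struct amp_slot *s, void *file)
{
	void *expected = NULL;

	if (!atomic_compare_exchange_strong(&s->owner, &expected, file))
		return -EMFILE;
	return 0;
}

/* Give the slot back; only the current owner is allowed to do so */
int slot_release(struct amp_slot *s, void *file)
{
	void *expected = file;

	if (!atomic_compare_exchange_strong(&s->owner, &expected, NULL))
		return -EINVAL;
	return 0;
}
```

A second slot_open() while the slot is held returns -EMFILE, and a slot_release() from anyone but the owner returns -EINVAL, which is the same life cycle the driver enforces under its mutex.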

Read

Copying no data is perfectly valid when prototyping the driver stub: the read below leaves the user buffer untouched and simply reports that all requested bytes were read.  One annoyance of printk based debugging is that I have not yet beaten dev_dbg into printing the message even when DEBUG is defined.

ssize_t amp_read(struct file* f, char __user *buf, size_t req, loff_t *f_pos) {
	int err;
	struct ampdevice *amp = container_of(f->private_data, struct ampdevice
			, parent);
	struct device *dev;
	if(!amp) { /* cannot log since no handle to device */
		err = -ENODEV;
		goto err_;
	}
	dev = amp->parent.this_device;
	dev_printk(KERN_DEBUG, dev, "read");

	err = req;/* stub: claim req bytes without writing to buf */
err_:
	return err;
}

With these stub functions, I can see the open, read, and release life cycle in the kernel log when running the following command (after modprobe shown above of course):

# echo "8 4 1 8" > /proc/sys/kernel/printk
# od -vAn -N4 -tx4 /dev/amp
misc amp: open
misc amp: read
 00000000
misc amp: release
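The same open, read, release cycle can be driven from a small C program instead of od.  This is a minimal sketch; read_word() is a hypothetical helper, and the device path is passed in by the caller (it would be /dev/amp on the target).

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Open `path`, read exactly one 32-bit word, and close it again.
 * Returns 0 and stores the word in *out on success, -1 otherwise. */
int read_word(const char *path, uint32_t *out)
{
	uint32_t word = 0;  /* pre-zeroed: the stub driver fills nothing in */
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, &word, sizeof word);
	close(fd);          /* triggers the driver's release() */
	if (n != (ssize_t)sizeof word)
		return -1;
	*out = word;
	return 0;
}
```

Each call should produce one open, one read, and one release line in the kernel log, just like the od command.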

Iteration 1: probe device gets a handle to sysctl register

Iteration 2: open device resets CPU1

In JTAG debugger, CPU1 is running through this loop:

00000010:   andeq   r0, r0, r0
00000014:   andeq   r0, r0, r0
00000018:   andeq   r0, r0, r0
0000001c:   .word 0x01411c47
00000020:   andvc   r0, r7, r0, lsr #8
00000024:   ldrbhi  r0, [r8], #-2187
00000028:   ldrbne  r5, [r4], #-3216