Oct 30, 2014

Understanding /dev/zero

The Linux /dev/zero pseudo file generates "0" bytes, as you can see in this example, where 4 zero bytes is strung together to give an impression of a ulong = 0:

henry@Zotac64:~$ od -vAn -N4 -tx4 /dev/zero
 00000000

The od command dumps file in octal and other formats "-N4" gets 4 bytes, and "-tx4" formats it as an 4 byte hex number.  strace shows that the system call used is read(), to grab 4 bytes:
...
open("/dev/zero", O_RDONLY)          = 3
read(3, "zv_\22", 4)                 = 4
...

How does it actually work under the hood?  Firstly, it is a mem device, as you can see under /sys folder:

henry@Zotac64:~$ ls -o /sys/class/mem/zero
... /sys/class/mem/zero -> ../../devices/virtual/mem/zero

It is a char mem device (major number 1), as you can see below:

henry@Zotac64:~$ ls -lh --time-style=+ /dev/zero
crw-rw-rw- 1 root root 1, 5  /dev/zero

Browsing the kernel source

The memory devices are defined <kernel>/drivers/char/mem.c:

static const struct memdev {
const char *name;
umode_t mode;
const struct file_operations *fops;
struct backing_dev_info *dev_info;
} devlist[] = {
... [3] = { "null", 0666, &null_fops, NULL },
... [5] = { "zero", 0666, &zero_fops, &zero_bdi },
[7] = { "full", 0666, &full_fops, NULL },
[8] = { "random", 0666, &random_fops, NULL },
[9] = { "urandom", 0666, &urandom_fops, NULL },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, &kmsg_fops, NULL },
#endif
};

Note that the array indices ARE the minor device number appearing in the /dev/ folder.  The actual device creation is done en-mass in chr_dev_init():

err = bdi_init(&zero_bdi);
...
mem_class = class_create(THIS_MODULE, "mem");
...

for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
...
device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),

     NULL, devlist[minor].name);
...

device_create() registers it with sysfs, as we can check in /sys/class:

henry@Zotac64:~$ ls /sys/class/mem/
full  kmsg  mem  null  port  random  urandom  zero

chr_dev_init() is probably called by an indirection through the fs_initcall(chr_dev_init) section declaration, which inserts the function with a predefined pattern for the linker:

#define __define_initcall(fn, id) \
static initcall_t __initcall_##fn##id __used \
__attribute__((__section__(".initcall" #id ".init"))) = fn; \
LTO_REFERENCE_INITCALL(__initcall_##fn##id)

The above macro will expand out to

static initcall_t __initcall_chr_dev_init5 __attribute((__section__(".initcall_chr_dev_init5.init"))) = chr_dev_init

that is, there is a static function pointer pointing to chr_dev_init.

/dev/zero file ops are in mem.c also:

static const struct file_operations zero_fops = {
.llseek = zero_lseek,
.read = read_zero,
.write = write_zero,
.aio_read = aio_read_zero,
.aio_write = aio_write_zero,
.mmap = mmap_zero,
};

For "zero", open/release ops are unnecessary, so the fops does NOT include them.  The zero file does not need to seek or write, so they are defined off appropriately:

#define zero_lseek null_lseek

#define write_zero write_null

#define aio_write_zero aio_write_null

The read clears the user buffer using either slow byte operation for < 64 bytes, or wholesale memset; for larger buffer, the clearing happens in chunks of PAGE_SIZE.

static ssize_t read_zero(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
...
if (!access_ok(VERIFY_WRITE, buf, count))

return -EFAULT;


while (count) {
unwritten = __clear_user(buf, chunk);
written += chunk - unwritten;
if (unwritten)
break;
if (signal_pending(current))
return written ? written : -ERESTARTSYS;
buf += chunk;
count -= chunk;
cond_resched();

}
return written ? written : -EFAULT;

}

cond_resched() allows an important task to preempt this thread--all part of being a considerate citizen.

Although I don't understand fully what it means to memory map a character device, it IS possible to mmap zero for shared mapping:



static int mmap_zero(struct file *file, struct vm_area_struct *vma) {
#ifndef CONFIG_MMU
    return -ENOSYS;
#endif
    if (vma->vm_flags & VM_SHARED)
        return shmem_zero_setup(vma);
    return 0;
}

Trace this, baby!

Let's make this blog entry a bit more interesting by tracing the interaction with the /dev/zero file.  The Ubuntu 12.04 LTS kernel I am running on my PC has some FTRACE capability, according to the config file saved in /boot"

henry@Zotac64:~$ grep FTRACE /boot/config-3.13.0-37-generic 
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
# CONFIG_PSTORE_FTRACE is not set
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set

Low level tracing

According to the ftrace author Steven Rostedt (also this more advanced topic), CONFIG_DYNAMIC_FTRACE allows me to tap into any of the events in /sys/kernel/debug/tracing/available_events, which numbers over 1000 on my PC kernel!

# wc -l /sys/kernel/debug/tracing/available_events 
1114 /sys/kernel/debug/tracing/available_events

Of those, the read and write system calls (as shown in strace above) might be interesting:

syscalls:sys_exit_read
syscalls:sys_enter_read
syscalls:sys_exit_write
syscalls:sys_enter_write

Writing to trace buffer can be turned on/off by writing 1/0 to a file /sys/kernel/debug/tracing/tracing_on, but tracing itself is changed by writing the desired tracer to /sys/kernel/debug/tracing/current_tracer, which is nop on system startup:

# echo function_graph > current_tracer

function_graph tracer is more full featured than the simpler function tracer.

To synchronize the events with the user space, ftrace exposes a file /sys/kernel/debug/tracing/trace_marker, that can be written to like this example:

[tracing]# echo hello world > trace_marker
[tracing]# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-3718  [001]  5546.183420: 0: hello world

If I want to do the equivalent from within the kernel, use the function trace_printk().  I found that what I wrote to the trace_marker does NOT show up in trace when I use function_graph tracer.

While tracing is enabled, the result can be viewed either in /sys/kernel/debug/tracing/per_cpu/cpu[0-9]/trace, or rolled up in /sys/kernel/debug/tracing/trace (might be confusing to look at).  To view the file AFTER tracing is turned off, you have to copy the trace into another file.

Putting everything together, a way to briefly trace reading 4 bytes from /dev/zero would be:

echo read_zero > set_ftrace_filter

echo 1 > tracing_on
echo function > current_tracer
echo "begin read zero" > trace_marker
od -vAn -N4 -tx4 /dev/zero
echo "end read zero" > trace_marker
echo 0 > tracing_on
cp trace /home/henry/read_zero.ftrace
echo nop > current_tracer
chown henry /home/henry/read_zero.ftrace
echo > set_ftrace_filter # Undo the filter

Then look at the result in /sys/kernel/debug/tracing/trace

echo nop > current_tracer

Make it easy on your self: trace-cmd

I found that Steven Rosted wrote another another tool to ease data collection: trace-cmd.  

trace-cmd record -p function od -vAn -N4 -tx4 /dev/zero

The "-l" argument is the filter, and "-e" is the events; both can be repeated.  The rest of the command is the command to call: what you want to trace (in this example, it would be the string starting with "od").  To look at the result, I played around with kernelshark, but I found it more confusing than just the text output from trace-cmd report:

trace-cmd report | grep -v 'trace-cmd' > read_zero.txt

It shows that we fly through the open and then read of the /dev/zero in a few microseconds:

od-4878  [000] 22634.564764: sys_enter:            NR 2 ...
od-4878  [000] 22634.564764: sys_enter_open:...
...
od-4878  [000] 22634.564771: do_sys_open:          [FAILED TO PARSE] filename=/dev/zero flags=32768 mode=438
...
od-4878  [000] 22634.564772: sys_exit:             NR 2 = 3
od-4878  [000] 22634.564772: sys_exit_open:        0x3
...
od-4878  [000] 22634.564782: sys_enter:            NR 0 (3, 22612f0, 4, 7ffff7fd18c0, 1, 3)
od-4878  [000] 22634.564782: sys_enter_read: ...
od-4878  [000] 22634.564784: function:             read_zero
od-4878  [000] 22634.564784: sys_exit:             NR 0 = 4
od-4878  [000] 22634.564785: sys_exit_read:        0x4

The whole read took 3 us (22634.564785 - 22634.564782)!  The read() system call number is apparently 3.  If I remove the filter option, I can see the call graph like this example:

trace-cmd record -p function -e "irq:*" od -vAn -N4 -tx4 /dev/zero
25182.162030:              SyS_read
25182.162031:                 fget_light
25182.162031:                 vfs_read
25182.162031:                    rw_verify_area
25182.162031:                       security_file_permission
25182.162032:                          apparmor_file_permission
25182.162032:                             common_file_perm
25182.162032:                                aa_file_perm
25182.162032:                          __fsnotify_parent
25182.162032:                          fsnotify
25182.162033:                    read_zero
25182.162033:                       read_zero.part.5
25182.162033:                          __clear_user
25182.162033:                          _cond_resched
25182.162034:                    __fsnotify_parent
25182.162034:                    fsnotify

This call graph matches the source code we browsed earlier.  But I am disappointed about not seeing any interrupts so far; what does it look like in ftrace?  I think that system call involves raising a SWI (vector 11 on ARM), which handled in assembly (for ARM: <kernel>/arch/arm/kernel/entry-common.S, vector_swi function, which should call sys_read but calls sys_ni_syscall if system call is not implmented).  That sys_read is linked (through clever assembly NR_syscalls numbering in entry-common.S) to the read function defined in <kernel>/fs/read_write.c:

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
...

Conclusions

  • The read system call sure is fast at least when it does not actually have to read anything.
  • Tentatively, I cannot see the SWI interrupt through ftrace.