henry@Zotac64:~$ od -vAn -N4 -tx4 /dev/zero
00000000
The od command dumps file in octal and other formats "-N4" gets 4 bytes, and "-tx4" formats it as an 4 byte hex number. strace shows that the system call used is read(), to grab 4 bytes:
...
open("/dev/zero", O_RDONLY) = 3
read(3, "zv_\22", 4) = 4
...
How does it actually work under the hood? Firstly, it is a mem device, as you can see under /sys folder:
henry@Zotac64:~$ ls -o /sys/class/mem/zero
... /sys/class/mem/zero -> ../../devices/virtual/mem/zero
It is a char mem device (major number 1), as you can see below:
henry@Zotac64:~$ ls -lh --time-style=+ /dev/zero
crw-rw-rw- 1 root root 1, 5 /dev/zero
static const struct memdev {
const char *name;
umode_t mode;
const struct file_operations *fops;
struct backing_dev_info *dev_info;
} devlist[] = {
... [3] = { "null", 0666, &null_fops, NULL },
... [5] = { "zero", 0666, &zero_fops, &zero_bdi },
[7] = { "full", 0666, &full_fops, NULL },
[8] = { "random", 0666, &random_fops, NULL },
[9] = { "urandom", 0666, &urandom_fops, NULL },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, &kmsg_fops, NULL },
#endif
};
Note that the array indices ARE the minor device number appearing in the /dev/ folder. The actual device creation is done en-mass in chr_dev_init():
err = bdi_init(&zero_bdi);
...
mem_class = class_create(THIS_MODULE, "mem");
...
for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
...
device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
NULL, devlist[minor].name);
...
device_create() registers it with sysfs, as we can check in /sys/class:
henry@Zotac64:~$ ls /sys/class/mem/
full kmsg mem null port random urandom zero
chr_dev_init() is probably called by an indirection through the fs_initcall(chr_dev_init) section declaration, which inserts the function with a predefined pattern for the linker:
#define __define_initcall(fn, id) \
static initcall_t __initcall_##fn##id __used \
__attribute__((__section__(".initcall" #id ".init"))) = fn; \
LTO_REFERENCE_INITCALL(__initcall_##fn##id)
The above macro will expand out to
static initcall_t __initcall_chr_dev_init5 __attribute((__section__(".initcall_chr_dev_init5.init"))) = chr_dev_init
that is, there is a static function pointer pointing to chr_dev_init.
/dev/zero file ops are in mem.c also:
static const struct file_operations zero_fops = {
.llseek = zero_lseek,
.read = read_zero,
.write = write_zero,
.aio_read = aio_read_zero,
.aio_write = aio_write_zero,
.mmap = mmap_zero,
};
cond_resched() allows an important task to preempt this thread--all part of being a considerate citizen.
Although I don't understand fully what it means to memory map a character device, it IS possible to mmap zero for shared mapping:
henry@Zotac64:~$ grep FTRACE /boot/config-3.13.0-37-generic
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
# CONFIG_PSTORE_FTRACE is not set
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
henry@Zotac64:~$ ls -lh --time-style=+ /dev/zero
crw-rw-rw- 1 root root 1, 5 /dev/zero
Browsing the kernel source
The memory devices are defined <kernel>/drivers/char/mem.c:static const struct memdev {
const char *name;
umode_t mode;
const struct file_operations *fops;
struct backing_dev_info *dev_info;
} devlist[] = {
... [3] = { "null", 0666, &null_fops, NULL },
... [5] = { "zero", 0666, &zero_fops, &zero_bdi },
[7] = { "full", 0666, &full_fops, NULL },
[8] = { "random", 0666, &random_fops, NULL },
[9] = { "urandom", 0666, &urandom_fops, NULL },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, &kmsg_fops, NULL },
#endif
};
Note that the array indices ARE the minor device number appearing in the /dev/ folder. The actual device creation is done en-mass in chr_dev_init():
err = bdi_init(&zero_bdi);
...
mem_class = class_create(THIS_MODULE, "mem");
...
for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
...
device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
NULL, devlist[minor].name);
...
henry@Zotac64:~$ ls /sys/class/mem/
full kmsg mem null port random urandom zero
#define __define_initcall(fn, id) \
static initcall_t __initcall_##fn##id __used \
__attribute__((__section__(".initcall" #id ".init"))) = fn; \
LTO_REFERENCE_INITCALL(__initcall_##fn##id)
The above macro will expand out to
static initcall_t __initcall_chr_dev_init5 __attribute((__section__(".initcall_chr_dev_init5.init"))) = chr_dev_init
that is, there is a static function pointer pointing to chr_dev_init.
/dev/zero file ops are in mem.c also:
static const struct file_operations zero_fops = {
.llseek = zero_lseek,
.read = read_zero,
.write = write_zero,
.aio_read = aio_read_zero,
.aio_write = aio_write_zero,
.mmap = mmap_zero,
};
For "zero", open/release ops are unnecessary, so the fops does NOT include them. The zero file does not need to seek or write, so they are defined off appropriately:
#define zero_lseek null_lseek
#define write_zero write_null
#define aio_write_zero aio_write_null
#define write_zero write_null
#define aio_write_zero aio_write_null
The read clears the user buffer using either slow byte operation for < 64 bytes, or wholesale memset; for larger buffer, the clearing happens in chunks of PAGE_SIZE.
static ssize_t read_zero(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
...
if (!access_ok(VERIFY_WRITE, buf, count))
return -EFAULT;
while (count) {
unwritten = __clear_user(buf, chunk);
written += chunk - unwritten;
if (unwritten)
break;
if (signal_pending(current))
return written ? written : -ERESTARTSYS;
buf += chunk;
count -= chunk;
cond_resched();
}
return written ? written : -EFAULT;
}
size_t count, loff_t *ppos)
{
...
if (!access_ok(VERIFY_WRITE, buf, count))
return -EFAULT;
while (count) {
unwritten = __clear_user(buf, chunk);
written += chunk - unwritten;
if (unwritten)
break;
if (signal_pending(current))
return written ? written : -ERESTARTSYS;
buf += chunk;
count -= chunk;
cond_resched();
}
return written ? written : -EFAULT;
}
Although I don't understand fully what it means to memory map a character device, it IS possible to mmap zero for shared mapping:
static int mmap_zero(struct file *file, struct vm_area_struct *vma)
{
#ifndef CONFIG_MMU
return -ENOSYS;
#endif
if (vma->vm_flags & VM_SHARED)
return shmem_zero_setup(vma);
return 0;
}
Trace this, baby!
Let's make this blog entry a bit more interesting by tracing the interaction with the /dev/zero file. The Ubuntu 12.04 LTS kernel I am running on my PC has some FTRACE capability, according to the config file saved in /boot"henry@Zotac64:~$ grep FTRACE /boot/config-3.13.0-37-generic
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
# CONFIG_PSTORE_FTRACE is not set
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
Low level tracing
According to the ftrace author Steven Rostedt (also this more advanced topic), CONFIG_DYNAMIC_FTRACE allows me to tap into any of the events in /sys/kernel/debug/tracing/available_events, which numbers over 1000 on my PC kernel!
# wc -l /sys/kernel/debug/tracing/available_events
1114 /sys/kernel/debug/tracing/available_events
1114 /sys/kernel/debug/tracing/available_events
Of those, the read and write system calls (as shown in strace above) might be interesting:
syscalls:sys_exit_read
syscalls:sys_enter_read
syscalls:sys_exit_write
syscalls:sys_enter_write
syscalls:sys_enter_read
syscalls:sys_exit_write
syscalls:sys_enter_write
Writing to trace buffer can be turned on/off by writing 1/0 to a file /sys/kernel/debug/tracing/tracing_on, but tracing itself is changed by writing the desired tracer to /sys/kernel/debug/tracing/current_tracer, which is nop on system startup:
# echo function_graph > current_tracer
function_graph tracer is more full featured than the simpler function tracer.
To synchronize the events with the user space, ftrace exposes a file /sys/kernel/debug/tracing/trace_marker, that can be written to like this example:
[tracing]# echo hello world > trace_marker
[tracing]# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<...>-3718 [001] 5546.183420: 0: hello world
[tracing]# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<...>-3718 [001] 5546.183420: 0: hello world
If I want to do the equivalent from within the kernel, use the function trace_printk(). I found that what I wrote to the trace_marker does NOT show up in trace when I use function_graph tracer.
While tracing is enabled, the result can be viewed either in /sys/kernel/debug/tracing/per_cpu/cpu[0-9]/trace, or rolled up in /sys/kernel/debug/tracing/trace (might be confusing to look at). To view the file AFTER tracing is turned off, you have to copy the trace into another file.
Putting everything together, a way to briefly trace reading 4 bytes from /dev/zero would be:
echo read_zero > set_ftrace_filter
echo 1 > tracing_on
echo function > current_tracer
echo "begin read zero" > trace_marker
od -vAn -N4 -tx4 /dev/zero
echo "end read zero" > trace_marker
echo 0 > tracing_on
cp trace /home/henry/read_zero.ftrace
echo nop > current_tracer
chown henry /home/henry/read_zero.ftrace
echo 1 > tracing_on
echo function > current_tracer
echo "begin read zero" > trace_marker
od -vAn -N4 -tx4 /dev/zero
echo "end read zero" > trace_marker
echo 0 > tracing_on
cp trace /home/henry/read_zero.ftrace
echo nop > current_tracer
chown henry /home/henry/read_zero.ftrace
# echo > set_ftrace_filter # Undo the filter
Then look at the result in /sys/kernel/debug/tracing/trace
echo nop > current_tracer
Make it easy on your self: trace-cmd
I found that Steven Rosted wrote another another tool to ease data collection: trace-cmd.
trace-cmd record -p function od -vAn -N4 -tx4 /dev/zero
The "-l" argument is the filter, and "-e" is the events; both can be repeated. The rest of the command is the command to call: what you want to trace (in this example, it would be the string starting with "od"). To look at the result, I played around with kernelshark, but I found it more confusing than just the text output from trace-cmd report:
trace-cmd report | grep -v 'trace-cmd' > read_zero.txt
It shows that we fly through the open and then read of the /dev/zero in a few microseconds:
od-4878 [000] 22634.564764: sys_enter: NR 2 ...
od-4878 [000] 22634.564764: sys_enter_open:...
...
od-4878 [000] 22634.564771: do_sys_open: [FAILED TO PARSE] filename=/dev/zero flags=32768 mode=438
...
od-4878 [000] 22634.564772: sys_exit: NR 2 = 3
od-4878 [000] 22634.564772: sys_exit_open: 0x3
...
od-4878 [000] 22634.564782: sys_enter: NR 0 (3, 22612f0, 4, 7ffff7fd18c0, 1, 3)
od-4878 [000] 22634.564782: sys_enter_read: ...
od-4878 [000] 22634.564784: function: read_zero
od-4878 [000] 22634.564784: sys_exit: NR 0 = 4
od-4878 [000] 22634.564785: sys_exit_read: 0x4
od-4878 [000] 22634.564764: sys_enter_open:...
...
od-4878 [000] 22634.564771: do_sys_open: [FAILED TO PARSE] filename=/dev/zero flags=32768 mode=438
...
od-4878 [000] 22634.564772: sys_exit: NR 2 = 3
od-4878 [000] 22634.564772: sys_exit_open: 0x3
...
od-4878 [000] 22634.564782: sys_enter: NR 0 (3, 22612f0, 4, 7ffff7fd18c0, 1, 3)
od-4878 [000] 22634.564782: sys_enter_read: ...
od-4878 [000] 22634.564784: function: read_zero
od-4878 [000] 22634.564784: sys_exit: NR 0 = 4
od-4878 [000] 22634.564785: sys_exit_read: 0x4
The whole read took 3 us (22634.564785 - 22634.564782)! The read() system call number is apparently 3. If I remove the filter option, I can see the call graph like this example:
trace-cmd record -p function -e "irq:*" od -vAn -N4 -tx4 /dev/zero
25182.162030: SyS_read
25182.162031: fget_light
25182.162031: vfs_read
25182.162031: rw_verify_area
25182.162031: security_file_permission
25182.162032: apparmor_file_permission
25182.162032: common_file_perm
25182.162032: aa_file_perm
25182.162032: __fsnotify_parent
25182.162032: fsnotify
25182.162033: read_zero
25182.162033: read_zero.part.5
25182.162033: __clear_user
25182.162033: _cond_resched
25182.162034: __fsnotify_parent
25182.162034: fsnotify
25182.162030: SyS_read
25182.162031: fget_light
25182.162031: vfs_read
25182.162031: rw_verify_area
25182.162031: security_file_permission
25182.162032: apparmor_file_permission
25182.162032: common_file_perm
25182.162032: aa_file_perm
25182.162032: __fsnotify_parent
25182.162032: fsnotify
25182.162033: read_zero
25182.162033: read_zero.part.5
25182.162033: __clear_user
25182.162033: _cond_resched
25182.162034: __fsnotify_parent
25182.162034: fsnotify
This call graph matches the source code we browsed earlier. But I am disappointed about not seeing any interrupts so far; what does it look like in ftrace? I think that system call involves raising a SWI (vector 11 on ARM), which handled in assembly (for ARM: <kernel>/arch/arm/kernel/entry-common.S, vector_swi function, which should call sys_read but calls sys_ni_syscall if system call is not implmented). That sys_read is linked (through clever assembly NR_syscalls numbering in entry-common.S) to the read function defined in <kernel>/fs/read_write.c:
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
...
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
...
Conclusions
- The read system call sure is fast at least when it does not actually have to read anything.
- Tentatively, I cannot see the SWI interrupt through ftrace.