The Sysctl Interface

(September 1997)

The sysctl system call is an interesting feature of the Linux kernel, and quite an unusual one in the Unix world. It exports the ability to fine-tune kernel parameters and is tightly bound to the /proc filesystem: the simpler file-based interface can perform the same tasks that are available through the system call. sysctl appeared in 1.3.57 and has been fully supported ever since. This article explains how to use sysctl with any kernel between 2.0.0 and 2.1.35.

by Alessandro Rubini

When running Unix kernels, system administrators often need to fine-tune some low-level features according to their specific needs. Most often, tailoring the system forces you to rebuild the kernel image and reboot the computer. Such tasks are lengthy, and they require good skills and a little luck to complete successfully. The Linux developers departed from this approach and chose to implement variable parameters in place of hardwired constants; runtime configuration is then made easy by exploiting the /proc filesystem. The internals of sysctl are designed not only to read and modify configuration parameters, but also to support a dynamic set of such variables. In other words, a module writer can insert new entries in the sysctl tree and allow run-time configuration of driver features.

The `/proc` interface to system control

Most Linux users quickly get accustomed to the idea of /proc. In short, the filesystem can be considered a sort of gateway to kernel internals: its files are entry points to some kernel information. Such information is usually exchanged in textual form to ease interactive use, although some nodes return binary data to be used by specific applications. The typical example of a binary /proc file is /proc/kcore: it is a core file representing the current kernel, so that you can run "gdb /usr/src/linux/vmlinux /proc/kcore" and peek into your running kernel. Naturally, compiling vmlinux with -g helps a lot if you run gdb on /proc/kcore.

Most of the /proc files are read-only: writing to them has no effect. This applies, for instance, to /proc/interrupts, /proc/ioports, /proc/net/route and all the other informative nodes. The directory /proc/sys, on the other hand, behaves differently: it is the root of a file tree related to system control. Every subdirectory in /proc/sys deals with a kernel subsystem, like net and vm, while the kernel subdirectory holds kernel-wide parameters, like the hostname.

Each sysctl file encloses numeric or string values: sometimes a single value, sometimes an array of them. For example, this is the content of some of the control files; a few of them (like inode-nr and printk) hold an array of values. The screen snapshot refers to version 2.1.32 of the kernel.

      morgana.root# pwd
      /proc/sys
      morgana.root# grep . kernel/*
      kernel/ctrl-alt-del:0
      kernel/domainname:systemy.it
      kernel/file-max:1024
      kernel/file-nr:128
      kernel/hostname:morgana
      kernel/inode-max:3072
      kernel/inode-nr:384     263
      kernel/osrelease:2.1.32
      kernel/ostype:Linux
      kernel/panic:0
      kernel/printk:6 4       1       7
      kernel/securelevel:0
      kernel/version:#9 Mon Apr 7 23:08:18 MET DST 1997

It's worth stressing that reading /proc items with less doesn't work, because they appear as 0-length files to the stat system call, and less checks the features of the file before reading it. The inaccuracy of stat is a feature of /proc, rather than a bug: it's a saving in human resources (in writing code), and kernel size (in carrying the code around). stat information is completely irrelevant for most files, as cat, grep and all the other tools work painlessly. If you really need to run less over /proc, you can always invoke "cat file | less".

If you need to change system parameters, writing the new values to the right file in /proc/sys is all that's needed. If the file hosts an array of values, they are overwritten in order. Let's look at kernel/printk for example (but note that the file was introduced only in version 2.1.32). The four numbers in /proc/sys/kernel/printk control the ``verbosity'' level of the printk kernel function, and the first number is the console_loglevel: kernel messages whose priority number is lower than the specified one will be printed to the system console (the active virtual console, unless you changed it). The parameter doesn't affect the operation of klogd, which receives all the messages in any case. The following commands show how to change the loglevel:

      morgana.root# cat kernel/printk
      6       4       1       7
      morgana.root# echo 8 > kernel/printk
      morgana.root# cat kernel/printk
      8       4       1       7

A loglevel of 8 lets every message through, including the debugging ones: they are not printed on the console by default, but the previous session changes the behaviour to print every message.

Similarly, you can change the hostname by writing the new value to /proc/sys/kernel/hostname. This is a useful feature when the hostname command is missing.

Using the system call

Even though the /proc filesystem is a great resource to exploit, sometimes it is just missing. The filesystem is not vital to system operation, and there are cases when you choose to leave it out of the kernel image or simply don't mount it. When you build an embedded system, for example, saving 40-50 kB can be an interesting option; if you are very concerned about security, on the other hand, you might decide to hide system information and leave /proc unmounted.

The system call interface to kernel tuning, namely sysctl, is an alternative way to peek into configurable parameters and to modify them. An additional advantage of the system call interface is that it's faster, as no fork/exec is involved, nor any directory lookup. Anyway, unless you run a very old platform, the performance savings are irrelevant.

To use the system call, the header <sys/sysctl.h> must be included: it declares the function as:

      int sysctl (int *name, int nlen, void *oldval,
              size_t *oldlenp, void *newval, size_t newlen);

If your standard library is not up to date, the function is neither prototyped in the headers nor defined in the library. I don't know exactly when the library function was introduced, but I know at least that libc-5.0 lacks it, while libc-5.3 has sysctl support. If you have an old library you must invoke the system call directly, using code like this:

      #include <linux/unistd.h>
      #include <linux/sysctl.h>

      _syscall1(int, _sysctl, struct __sysctl_args *, args);
      /* now "_sysctl(struct __sysctl_args *args)" can be called */

As you see, the system call gets a single argument instead of six of them, so the mismatch in the prototypes has been solved by prepending an underscore to the name of the system call. Therefore, the system call is _sysctl and gets one argument, while the library function is sysctl and gets six arguments. The sample code introduced in this article uses the library function.

The arguments of the function have the following meaning:

name points to an array of integers: each of the integer values identifies a sysctl item, either a directory or a leaf node file. The symbolic names for such values are defined in <linux/sysctl.h>.
nlen states how many integer numbers are listed in the array name: to reach a particular entry you need to specify the path through the subdirectories, so you need to tell how long that path is.
oldval is a pointer to a data buffer where the old value of the sysctl item must be stored. If it is NULL, the system call won't return values to user space.
oldlenp points to a size_t variable stating the length of the oldval buffer. The system call changes the value to reflect how much data has been written, which can be less than the buffer length.
newval points to a data buffer hosting replacement data: the kernel will read this buffer to change the sysctl entry being acted upon. If it is NULL, the kernel value is not changed.
newlen is the length of newval. The kernel will read no more than newlen bytes from newval.

Let's try now to write some C code to access the four parameters in /proc/sys/kernel/printk. The ``name'' of the file is KERN_PRINTK, within the directory CTL_KERN. The following code is the complete program to access the values, and I call it pkparms.c.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* strerror() */
#include <errno.h>      /* errno */
#include <sys/sysctl.h>
#include <linux/sysctl.h>

int main(int argc, char **argv)
{
        int name[] = {CTL_KERN, KERN_PRINTK};
        int namelen = 2;
        int oldval[8];  /* 4 would suffice */
        size_t len = sizeof(oldval);
        int i, error;

        /* read only: no new value is passed */
        error = sysctl (name, namelen, (void *)oldval, &len,
                NULL /* newval */, 0 /* newlen */);
        if (error) {
                fprintf(stderr,"%s: sysctl(): %s\n",
                        argv[0],strerror(errno));
                exit(1);
        }
        /* len now tells how many bytes the kernel returned */
        printf("len is %lu bytes\n", (unsigned long)len);
        for (i = 0; i < (int)(len/sizeof(int)); i++)
                printf("%i\t", oldval[i]);
        printf("\n");
        exit(0);
}

Changing sysctl values is similar to reading them: just use newval and newlen. A program similar to pkparms.c can be used to change the console loglevel, the first number in kernel/printk. The program is called setlevel.c, and its core looks like:

        int newval[1];
        int newlen = sizeof(newval);

        /* assign newval[0] */

        error = sysctl (name, namelen, NULL /* oldval */, NULL /* oldlenp */,
                newval, newlen);

As you see, the program overwrites only the first sizeof(int) bytes of the kernel entry, but this is exactly what we meant. Please remember, however, that the printk parameters are not exported to sysctl in version 2.0 of the kernel. The programs won't compile under 2.0 due to the missing KERN_PRINTK symbol; if on the other hand you compile either of them elsewhere and run it under 2.0, you'll get an error when invoking sysctl.

A simple run of the two programs looks like the following:

      morgana.root# ./pkparms
      len is 16 bytes
      6       4       1       7
      morgana.root# cat /proc/sys/kernel/printk
      6       4       1       7
      morgana.root# ./setlevel 8
      morgana.root# ./pkparms
      len is 16 bytes
      8       4       1       7

If you run kernel 2.0, don't despair: the files are meant as samples, and the same code can be used to access any sysctl item with minimal modifications.

On the same ftp site you'll also find hname.c: a bare-bones ``hostname'' command based on sysctl. The source works with the 2.0 kernels and also shows how to invoke the system call with no library support, because my Linux-2.0 runs on a libc-5.0-based PC.

A quick look at some sysctl entries

Although low-level, the tunable parameters of the kernel are very interesting to play with, and can help optimize system performance for the different uses of a Linux box.

The following list is an incomplete overview of the kernel and vm directories under /proc/sys. The information applies to all 2.0 kernels and to 2.1 kernels up to 2.1.35.

kernel/panic The integer value is the number of seconds the system will wait before automatic reboot in case of system panic. `0' means "disabled". Automatic reboot is an interesting feature to turn on for unattended systems. The command-line option panic= can be used to set the value at boot time.
kernel/file-max The maximum number of open files in the system. file-nr is read-only: it reports how many file structures are currently allocated. Similar entries exist for the inodes: inode-max and inode-nr. Servers with many processes and many open files might benefit from raising the maximum values.
kernel/securelevel This is a hook for security features in the system. The securelevel is (currently) read-only even for root (!), so it can only be changed by program code (e.g., modules). Nowadays only the ext2 filesystem uses the securelevel: it refuses to change file flags (like "immutable" and "append-only") if the securelevel is greater than 0. A kernel with securelevel precompiled to 1 and no support for modules can be used to protect precious files from corruption in case of network intrusions. Stay tuned for new features of securelevel.
vm/freepages The file hosts three numbers, each of them a count of free pages. The first number is the minimum free space in the system (free pages are needed to fulfill atomic allocation requests, like incoming network packets). The second number is the level at which heavy swapping starts, and the third is the level at which light swapping starts. A network server with high bandwidth will benefit from higher numbers, to avoid dropping packets due to free memory shortage. By default, 1% of the memory is kept free.
vm/bdflush The numbers in this file can fine-tune the behaviour of the buffer cache. They are documented in fs/buffer.c.
vm/kswapd This file exists in all the 2.0.x kernels, but was removed in 2.1.33 as useless. It can be safely ignored.
vm/swapctl This big file encloses all the parameters used in fine-tuning the swapping algorithms. The fields are listed in include/linux/swapctl.h, and are used in mm/swap.c. Interesting but difficult.

The programming interface: plugging new features

Module writers can easily add their own tunable features to /proc/sys by using the programming interface to extend the control tree. The following functions are exported to modules:

struct ctl_table_header * register_sysctl_table(ctl_table * table,
                                                int insert_at_head);
void unregister_sysctl_table(struct ctl_table_header * table);

The former is used to register a ``table'' of entries and returns a token, which is used by the latter function to detach your table. insert_at_head tells whether the new table must be inserted before or after other ones, and you can easily ignore the issue and specify 0 (not-at-head). But what is the ctl_table type, then? It is a structure made up of the following fields:

int ctl_name. This is a numeric id, unique in each table.
const char *procname. If the entry must be placed in /proc, this is the corresponding name.
void *data. The pointer to data. For example, it will point to an integer value for integer items.
int maxlen. The size of pointed data. Like sizeof(int).
mode_t mode. The octal mode of the file. Directories should have the executable bit turned on (e.g.: 0555).
ctl_table *child. For directories, the child table. For leaf nodes, NULL.
proc_handler *proc_handler. The handler is in charge of performing any read/write spawned by /proc files. If the item has no procname, this field is not used.
ctl_handler *strategy. This handler reads/writes data when the system call is used.
struct proc_dir_entry *de. Used internally.
void *extra1, *extra2. These fields only exist from 1.3.69 onwards, and are used to specify extra information for specific handlers. The kernel has a handler for integer vectors, for example, that uses the extra fields to know the allowable minimum and maximum value for each number in the array.

Well, I see that the previous outline can scare most readers. Therefore, I won't show the prototypes for the handling functions; I'll switch directly to some sample code instead. Writing code is much easier than understanding it, because you can start by copying lines around. The outcome will fall under the GPL, but I don't see it as a disadvantage.

So, let's try to write a module with two integer parameters, called ontime and offtime. The module will busy-loop for a few timer ticks and sleep for a few more: the parameters control the duration of each state. Yes, this is silly, but it is the simplest hardware-independent thing I could conceive.

The parameters will appear in /proc/sys/kernel/busy, a new directory. To this aim, we need to register a tree like the one shown in figure 1. The kernel directory won't be created by register_sysctl_table, because it already exists, and it won't be deleted at unregister time because it still has active children: by specifying the whole tree you can thus add files to any directory within /proc/sys.

[Figure 1: the sysctl tree registered by the busy module. PostScript version: sysctl.ps]

In the source file busy.c, the following code does all the work related to sysctl:

#define KERN_BUSY 434 /* a random number, high enough */
enum {BUSY_ON=1, BUSY_OFF};

int busy_ontime = 0;   /* loop 0 ticks */
int busy_offtime = HZ; /* every second */

/* two integer items (files) */
static ctl_table busy_table[] = {
        {BUSY_ON, "ontime", &busy_ontime, sizeof(int), 0644,
        NULL, &proc_dointvec, &sysctl_intvec, /* fill with 0's */},
        {BUSY_OFF, "offtime", &busy_offtime, sizeof(int), 0644,
        NULL, &proc_dointvec, &sysctl_intvec, /* fill with 0's */},
        {0}
        };

/* a directory */
static ctl_table busy_kern_table[] = {
        {KERN_BUSY, "busy", NULL, 0, 0555, busy_table},
        {0}
        };

/* the parent directory */
static ctl_table busy_root_table[] = {
        {CTL_KERN, "kernel", NULL, 0, 0555, busy_kern_table},
        {0}
        };

static struct ctl_table_header *busy_table_header;

int init_module(void) 
{
        busy_table_header = register_sysctl_table(busy_root_table, 0);
        if (!busy_table_header)
                return -ENOMEM;
        busy_loop();
        return 0;
}

void cleanup_module(void)
{
        unregister_sysctl_table(busy_table_header);
}

The trick here is leaving all the hard work to proc_dointvec and sysctl_intvec. These handlers are only exported by version 2.1.8 and later of the kernel, so you need to copy them into your module (or implement something similar) when compiling for older kernels.

I won't show here the code related to busy looping, which is completely out of the scope of this article. It works with both 2.0 and 2.1; Intel, Alpha and Sparc.

Probing further

Despite the usefulness of sysctl, it's hard to find documentation about it. This is not a concern for system programmers, who are accustomed to peeking in the source code, whence information can be extracted.

The main entry points to the sysctl internals are kernel/sysctl.c and net/sysctl_net.c. Most items in the sysctl tables just act on integers, strings or arrays of integers, so you'll end up using the data field as a symbol name to grep for in the whole source tree. I see no shortcut to this.

As an example, let's trace the meaning of ip_log_martians in /proc/sys/net/ipv4. sysctl_net.c refers to ipv4_table, which in turn is exported by sysctl_net_ipv4.c. This last file includes the following entry in its table:

        {NET_IPV4_LOG_MARTIANS, "ip_log_martians",
         &ipv4_config.log_martians, sizeof(int), 0644, NULL,
         &proc_dointvec},

The problem, therefore, reduces to looking for the field ipv4_config.log_martians. It is used to control verbose reporting (via printk) of erroneous packets delivered to this host.

Unfortunately, many system administrators are not programmers, and need other sources of information. To their benefit, kernel developers sometimes write short documents as a break from writing code, and these documents are distributed with the kernel source. The bad news is that sysctl is quite recent in design, and such extra documentation is scarce. Documentation/networking/Configurable is a short introduction to sysctl (much shorter than this article), and points to net/TUNABLE, which in turn is a huge list of configurable parameters in the network subtree. The description of each item is not intelligible to the uninitiated, but those who don't know the details of networking can't tune network parameters proficiently anyway. As I'm writing, this file is the only non-C-language source of information about system control that I know of.

Alessandro reads email as rubini@linux.it and enjoys breeding oaks and playing with kernel code. He is currently looking for a job in either field.

Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved

Reprinted with permission of Linux Journal