The sysctl system call is an interesting feature of the Linux kernel; it is quite unique in the Unix world. The system call exports the ability to fine-tune kernel parameters and is tightly bound to the /proc filesystem: the simpler file-based interface can be used to perform the same tasks that are available by means of the system call. sysctl appeared in 1.3.57 and has been fully supported ever since. This article explains how to use sysctl with any kernel between 2.0.0 and 2.1.35.
When running Unix kernels, system administrators often need to fine-tune some low-level features according to their specific needs. Most often, system tailoring forces you to rebuild the kernel image and reboot the computer. Such tasks are lengthy, and they require good skills and a little luck to complete successfully. The Linux developers departed from this approach and chose to implement variable parameters in place of hardwired constants; runtime configuration is then eased by exploiting the /proc filesystem. The internals of sysctl are designed not only to read and modify configuration parameters, but also to support a dynamic set of such variables. In other words, the module writer can insert new entries in the sysctl tree, and allow run-time configuration of driver features.
/proc interface to system control
Most Linux users get quickly accustomed to the idea of /proc. In short, the filesystem can be considered a sort of gateway to kernel internals: its files are entry points to some kernel information. Such information is usually exchanged in textual form to ease interactive use, although some nodes return binary data to be used by specific applications. The typical example of a binary /proc file is /proc/kcore: it is a core file representing the current kernel, so that you can run "gdb /usr/src/linux/vmlinux /proc/kcore" and peek in your running kernel. Naturally, compiling vmlinux with -g helps a lot if you run gdb on /proc/kcore.
Most of the /proc files are read-only: writing to them has no effect. This applies for instance to /proc/interrupts, /proc/ioports, /proc/net/route and all the other informative nodes. The directory /proc/sys, on the other hand, behaves differently: it is the root of a file tree related to system control. Every subdirectory in /proc/sys deals with a kernel subsystem like net and vm, while kernel includes kernel-wide parameters, like the hostname.
Each sysctl file encloses numeric or string values: sometimes a single value, sometimes an array of them. For example, this is the content of some of the control files, which appear as an array of values. The screen snapshot refers to version 2.1.32 of the kernel.
morgana.root# pwd
/proc/sys
morgana.root# grep . kernel/*
kernel/ctrl-alt-del:0
kernel/domainname:systemy.it
kernel/file-max:1024
kernel/file-nr:128
kernel/hostname:morgana
kernel/inode-max:3072
kernel/inode-nr:384 263
kernel/osrelease:2.1.32
kernel/ostype:Linux
kernel/panic:0
kernel/printk:6 4 1 7
kernel/securelevel:0
kernel/version:#9 Mon Apr 7 23:08:18 MET DST 1997
It's worth stressing that reading /proc items with less doesn't work, because they appear as 0-length files to the stat system call, and less checks the features of the file before reading it. The inaccuracy of stat is a feature of /proc, rather than a bug: it's a saving in human resources (in writing code), and kernel size (in carrying the code around). stat information is completely irrelevant for most files, as cat, grep and all the other tools work painlessly. If you really need to run less over /proc, you can always invoke "cat file | less".
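The mismatch between what stat reports and what a read returns can be checked with a few lines of C. The following is only a sketch: the helper names reported_size and readable_size are made up for this illustration and are not part of any library.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Size of the file as reported by stat(), or -1 on error. */
long reported_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Number of bytes actually returned when reading up to EOF, or -1. */
long readable_size(const char *path)
{
    FILE *f = fopen(path, "r");
    char buf[256];
    size_t n;
    long total = 0;

    if (!f)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        total += (long)n;
    fclose(f);
    return total;
}
```

For a regular file the two functions agree. For a node such as /proc/sys/kernel/ostype, on a system with /proc mounted, the first one is expected to return 0 while the second returns the length of the text, which is the behaviour that confuses less.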
If you need to change system parameters, writing the new values to the right file in /proc/sys is all that's needed. If the file hosts an array of values, they are overwritten in order. Let's look at kernel/printk for example (but note that the file was introduced only in version 2.1.32). The four numbers in /proc/sys/kernel/printk control the ``verbosity'' level of the printk kernel function, and the first number is the console_loglevel: kernel messages whose priority value is lower than this number will be printed to the system console (the active virtual console, unless you changed it). The parameter doesn't affect operation of klogd, which receives all the messages in any case. The following commands show how to change the loglevel:
morgana.root# cat kernel/printk
6 4 1 7
morgana.root# echo 8 > kernel/printk
morgana.root# cat kernel/printk
8 4 1 7
A console loglevel of 8 lets debugging messages (priority 7) through: they are not printed on the console by default, but the previous session changes the behaviour to print every message, even the debugging ones.
Similarly, you can change the hostname by writing the new value to /proc/sys/kernel/hostname. This is a useful feature when the hostname command is missing.
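The same read and write operations can be performed from a C program by treating the control file as ordinary text. The sketch below assumes nothing beyond the standard I/O library; read_int_vector and write_int are hypothetical helper names chosen for this example.

```c
#include <stdio.h>

/* Read up to max whitespace-separated integers from path.
 * Returns how many were read, or -1 if the file can't be opened. */
int read_int_vector(const char *path, int *vals, int max)
{
    FILE *f = fopen(path, "r");
    int n = 0;

    if (!f)
        return -1;
    while (n < max && fscanf(f, "%d", &vals[n]) == 1)
        n++;
    fclose(f);
    return n;
}

/* Write one integer followed by a newline, like "echo 8 > file".
 * Returns 0 on success, -1 on error. */
int write_int(const char *path, int value)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    if (fprintf(f, "%d\n", value) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling read_int_vector("/proc/sys/kernel/printk", v, 8) on a kernel that has the file should fill v with the four numbers shown above, and write_int on the same path (as root) mirrors the echo command. Note that on a regular file write_int replaces the whole content; the "overwritten in order" behaviour, where the remaining numbers survive, belongs to the kernel side of /proc/sys.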
Even though the /proc filesystem is a great resource to exploit, sometimes it is just missing. The filesystem is not vital to system operation, and there are cases when you choose to leave it out of the kernel image or simply don't mount it. When you build an embedded system, for example, saving 40-50 kB can be an interesting option; if you are very concerned about security, on the other hand, you might decide to hide system information and leave /proc unmounted.
The system call interface to kernel tuning, namely sysctl, is an alternative way to peek into configurable parameters and to modify them. An additional advantage of the system call interface is that it's faster, as no fork/exec is involved, nor any directory lookup. Anyway, unless you run a very old platform, the performance savings are irrelevant.
To use the system call, the header <sys/sysctl.h> must be included: it declares the function as:
int sysctl (int *name, int nlen, void *oldval,
size_t *oldlenp, void *newval, size_t newlen);
If your standard library is not up to date, the function is neither prototyped in the headers nor defined in the library. I don't know exactly when the library function was introduced, but I do know that libc-5.0 lacks it, while libc-5.3 has sysctl support. If you have an old library you must invoke the system call directly, using code like this:
#include <linux/unistd.h>
#include <linux/sysctl.h>
_syscall1(int, _sysctl, struct __sysctl_args *, args);
/* now "_sysctl(struct __sysctl_args *args)" can be called */
As you see, the system call gets a single argument instead of six, so the mismatch in the prototypes has been solved by prepending an underscore to the name of the system call. Therefore, the system call is _sysctl and gets one argument, while the library function is sysctl and gets six arguments. The sample code introduced in this article uses the library function.
The arguments of the function have the following meaning:
name points to an array of integers: each of the integer values identifies a sysctl item, either a directory or a leaf node file. The symbolic names for such values are defined in <linux/sysctl.h>.
nlen states how many integer numbers are listed in the array name: to reach a particular entry you need to specify the path through the subdirectories, so you need to tell how long that path is.
oldval is a pointer to a data buffer where the old value of the sysctl item must be stored. If it is NULL, the system call won't return values to user space.
oldlenp points to an integer number stating the length of the oldval buffer. The system call changes the value to reflect how much data has been written, which can be less than the buffer length.
newval points to a data buffer hosting replacement data: the kernel will read this buffer to change the sysctl entry being acted upon. If it is NULL, the kernel value is not changed.
newlen is the length of newval. The kernel will read no more than newlen bytes from newval.
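The interplay of the last four arguments can be tried out in user space with a toy model. The sketch below is not kernel code and does not call sysctl: toy_sysctl and toy_entry are invented names, there is no name lookup and no permission check, and the only thing modelled is the buffer handling described in the list above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one kernel entry: a vector of four integers. */
static int toy_entry[4] = {6, 4, 1, 7};

/*
 * User-space model of the oldval/oldlenp/newval/newlen semantics only.
 * The old value is copied out first, then the new one is copied in.
 */
int toy_sysctl(void *oldval, size_t *oldlenp,
               const void *newval, size_t newlen)
{
    if (oldval && oldlenp) {
        size_t n = *oldlenp;

        if (n > sizeof(toy_entry))
            n = sizeof(toy_entry);   /* never return more than exists */
        memcpy(oldval, toy_entry, n);
        *oldlenp = n;                /* report how much was written */
    }
    if (newval) {
        size_t n = newlen;

        if (n > sizeof(toy_entry))
            n = sizeof(toy_entry);   /* never read more than fits */
        memcpy(toy_entry, newval, n);
    }
    return 0;
}
```

Passing a 32-byte buffer as oldval leaves *oldlenp at 4 * sizeof(int), and passing a single integer as newval replaces only the first element of the vector while the other three keep their values.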
Let's now try to write some C code to access the four parameters in /proc/sys/kernel/printk. The ``name'' of the file is KERN_PRINTK, within the directory CTL_KERN. The following code is the complete program to access the values, and I call it pkparms.c.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strerror() */
#include <errno.h>    /* errno */
#include <sys/sysctl.h>
#include <linux/sysctl.h>

int main(int argc, char **argv)
{
    int name[] = {CTL_KERN, KERN_PRINTK};
    int namelen = 2;
    int oldval[8];          /* 4 would suffice */
    size_t len = sizeof(oldval);
    int i, error;

    error = sysctl (name, namelen, (void *)oldval, &len,
                    NULL /* newval */, 0 /* newlen */);
    if (error) {
        fprintf(stderr,"%s: sysctl(): %s\n",
                argv[0],strerror(errno));
        exit(1);
    }
    /* len now holds the number of bytes stored in oldval */
    printf("len is %lu bytes\n", (unsigned long)len);
    for (i = 0; i < len/(sizeof(int)); i++)
        printf("%i\t", oldval[i]);
    printf("\n");
    exit(0);
}
Changing sysctl values is similar to reading them: just use newval and newlen. A program similar to pkparms.c can be used to change the console loglevel, the first number in kernel/printk. The program is called setlevel.c, and its core looks like:
int newval[1];
size_t newlen = sizeof(newval);
/* assign newval[0] */
error = sysctl (name, namelen, NULL /* oldval */, NULL /* oldlenp */,
                newval, newlen);
As you see, the program overwrites only the first sizeof(int) bytes of the kernel entry, but this is exactly what we meant. Please remember, however, that the printk parameters are not exported to sysctl in version 2.0 of the kernel. The programs won't compile under 2.0 due to the missing KERN_PRINTK symbol; if on the other hand you compile either of them elsewhere and run it under 2.0, you'll get an error when invoking sysctl.
A simple run of the two programs looks like the following:
morgana.root# ./pkparms
len is 16 bytes
6 4 1 7
morgana.root# cat /proc/sys/kernel/printk
6 4 1 7
morgana.root# ./setlevel 8
morgana.root# ./pkparms
len is 16 bytes
8 4 1 7
If you run kernel 2.0, don't despair: the files are meant as samples, and the same code can be used to access any sysctl item with minimal modifications.
On the same ftp site you'll also find hname.c: a bare-bones ``hostname'' command based on sysctl. The source works with the 2.0 kernels and also shows how to invoke the system call with no library support, because my Linux-2.0 runs on a libc-5.0-based PC.
Although low-level, the tunable parameters of the kernel are very interesting to play with, and can help optimize system performance for the different uses of a Linux box.
The following list is an incomplete overview of the kernel and vm directories under /proc/sys. The following information applies to all 2.0 kernels and to 2.1 kernels up to 2.1.35.
kernel/panic
The integer value is the number of seconds the system will wait before automatic reboot in case of system panic. `0' means "disabled". Automatic reboot is an interesting feature to turn on for unattended systems. The command-line option panic= can be used to set the value at boot time.
kernel/file-max
The maximum number of open files in the system. file-nr is the per-process maximum and can't be modified because it is constrained by the page size. Similar entries exist for the inodes: one system-wide and one per-process. Servers with many processes and many open files might benefit from raising these values.
kernel/securelevel
This is a hook for security
features in the system. The securelevel is (currently)
read-only even for root (!), so it can only be changed by
program code (e.g., modules). Nowadays only the ext2
filesystem uses the securelevel: it refuses to change file
flags (like "immutable" and "append-only") if the securelevel
is greater than 0. A kernel with securelevel precompiled to 1
and no support for modules can be used to protect precious
files from corruption in case of network intrusions. Stay
tuned for new features of securelevel.
vm/freepages
The file hosts three numbers, each of them a count of free pages. The first number is the minimum free space in the system (free pages are needed to fulfill atomic allocation requests, like incoming network packets). The second number is the level at which to start heavy swapping, and the third is the level at which to start light swapping. A network server with high bandwidth will benefit from higher numbers, to avoid dropping packets due to free memory shortage. By default, 1% of the memory is kept free.
vm/bdflush
The numbers in this file can fine-tune the behaviour of the buffer cache. They are documented in fs/buffer.c.
vm/kswapd
This file exists in all the 2.0.x kernels, but was removed in 2.1.33 as useless. It can be safely ignored.
vm/swapctl
This big file encloses all the parameters used in fine-tuning the swapping algorithms. The fields are listed in include/linux/swapctl.h, and are used in mm/swap.c. Interesting but difficult.
Module writers can easily add their own tunable features to /proc/sys by using the programming interface to extend the control tree. The following functions are exported to modules: register_sysctl_table and unregister_sysctl_table. The former is used to register a ``table'' of entries and returns a token, which is used by the latter function to detach your table.
struct ctl_table_header * register_sysctl_table(ctl_table * table,
int insert_at_head);
void unregister_sysctl_table(struct ctl_table_header * table);
insert_at_head tells whether the new table must be inserted before or after other ones; you can safely ignore the issue and specify 0 (not-at-head).
But what is the ctl_table type, then? It is a structure made up of the following fields:
int ctl_name. This is a numeric id, unique in each table.
const char *procname. If the entry must be placed in /proc, this is the corresponding name.
void *data. The pointer to data. For example, it will point to an integer value for integer items.
int maxlen. The size of the pointed-to data, like sizeof(int).
mode_t mode. The octal mode of the file. Directories should have the executable bit turned on (e.g.: 0555).
ctl_table *child. For directories, the child table. For leaf nodes, NULL.
proc_handler *proc_handler. The handler is in charge of performing any read/write spawned by /proc files. If the item has no procname, this field is not used.
ctl_handler *strategy. This handler reads/writes data when the system call is used.
struct proc_dir_entry *de. Used internally.
void *extra1, *extra2. These fields only exist from 1.3.69 onwards, and are used to specify extra information for specific handlers. The kernel has a handler for integer vectors, for example, that uses the extra fields to know the allowable minimum and maximum value for each number in the array.
Well, I see that the previous outline can scare most readers. Therefore, I won't show the prototypes for the handling functions and will switch directly to some sample code. Writing code is much easier than understanding it, because you can start by copying lines around. The outcome will fall under the GPL, but I don't see it as a disadvantage.
So, let's try to write a module with two integer parameters, called ontime and offtime. The module will busy-loop for a few timer ticks and sleep for a few more: the parameters control the duration of each state. Yes, this is silly, but it is the simplest hardware-independent thing I could conceive.
The parameters will appear in /proc/sys/kernel/busy, a new directory. To this aim, we need to register a tree like the one shown in figure 1. The kernel directory won't be created by register_sysctl_table, because it already exists, and it won't be deleted at unregister time because it still has active children: by specifying the whole tree you can thus add files to any directory within /proc/sys.
Figure 1. Download PostScript: sysctl.ps
In the source file busy.c, the following code does all the work related to sysctl:
#define KERN_BUSY 434 /* a random number, high enough */
enum {BUSY_ON=1, BUSY_OFF};

int busy_ontime = 0;   /* loop 0 ticks */
int busy_offtime = HZ; /* every second */

/* two integer items (files); each id must be unique within the table */
static ctl_table busy_table[] = {
    {BUSY_ON, "ontime", &busy_ontime, sizeof(int), 0644,
     NULL, &proc_dointvec, &sysctl_intvec, /* fill with 0's */},
    {BUSY_OFF, "offtime", &busy_offtime, sizeof(int), 0644,
     NULL, &proc_dointvec, &sysctl_intvec, /* fill with 0's */},
    {0}
};

/* a directory */
static ctl_table busy_kern_table[] = {
    {KERN_BUSY, "busy", NULL, 0, 0555, busy_table},
    {0}
};

/* the parent directory */
static ctl_table busy_root_table[] = {
    {CTL_KERN, "kernel", NULL, 0, 0555, busy_kern_table},
    {0}
};

static struct ctl_table_header *busy_table_header;

int init_module(void)
{
    busy_table_header = register_sysctl_table(busy_root_table, 0);
    if (!busy_table_header)
        return -ENOMEM;
    busy_loop();
    return 0;
}

void cleanup_module(void)
{
    unregister_sysctl_table(busy_table_header);
}
The trick here is leaving all the hard work to proc_dointvec and sysctl_intvec. These handlers are only exported by version 2.1.8 and later of the kernel, so you need to copy them into your module (or implement something similar) when compiling for older kernels.
I won't show here the code related to busy looping, which is completely out of the scope of this article. It works with both 2.0 and 2.1; Intel, Alpha and Sparc.
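To get a feeling for how a name vector such as {CTL_KERN, KERN_BUSY, BUSY_ON} selects an entry in the nested tables, the tree can be modelled in user space. The following is only a sketch: struct toy_table keeps just four of the ctl_table fields, every toy_ and TOY_ name is invented for the example, and toy_lookup is a guess at the general idea rather than the kernel's actual lookup code.

```c
#include <stddef.h>

/* Hypothetical, reduced model of a table entry (not the kernel's ctl_table). */
struct toy_table {
    int ctl_name;             /* numeric id; 0 terminates a table */
    const char *procname;     /* name of the file or directory */
    void *data;               /* leaf: pointer to the value */
    struct toy_table *child;  /* directory: pointer to the child table */
};

/* Illustrative ids mirroring the ones used in busy.c. */
enum { TOY_KERN = 1, TOY_BUSY = 434, TOY_ON = 1, TOY_OFF = 2 };

static int toy_ontime = 0;
static int toy_offtime = 100;

static struct toy_table toy_busy[] = {
    {TOY_ON,  "ontime",  &toy_ontime,  NULL},
    {TOY_OFF, "offtime", &toy_offtime, NULL},
    {0, NULL, NULL, NULL}
};
static struct toy_table toy_kern[] = {
    {TOY_BUSY, "busy", NULL, toy_busy},
    {0, NULL, NULL, NULL}
};
static struct toy_table toy_root[] = {
    {TOY_KERN, "kernel", NULL, toy_kern},
    {0, NULL, NULL, NULL}
};

/* Follow name[0..nlen-1] down the tree; return the entry found or NULL. */
struct toy_table *toy_lookup(struct toy_table *table,
                             const int *name, int nlen)
{
    struct toy_table *entry = NULL;
    int depth;

    for (depth = 0; depth < nlen; depth++) {
        if (!table)
            return NULL;      /* the path continues below a leaf */
        for (entry = table; entry->ctl_name; entry++)
            if (entry->ctl_name == name[depth])
                break;
        if (!entry->ctl_name)
            return NULL;      /* no such id at this level */
        table = entry->child;
    }
    return entry;
}
```

With these tables, looking up {TOY_KERN, TOY_BUSY, TOY_ON} with a length of 3 yields the entry whose data field points to toy_ontime, while a path with an unknown id, or one that tries to descend below a leaf, yields NULL.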
Despite the usefulness of sysctl, it's hard to find documentation about it. This is not a concern for system programmers, who are accustomed to peeking in the source code, whence information can be extracted.
The main entry points to the sysctl internals are kernel/sysctl.c and net/sysctl_net.c. Most items in the sysctl tables just act on integers, strings or arrays of integers, so you'll end up using the data field as a symbol name to grep for in the whole source tree. I see no shortcut to this.
As an example, let's trace the meaning of ip_log_martians in /proc/sys/net/ipv4. sysctl_net.c refers to ipv4_table, which in turn is exported by sysctl_net_ipv4.c. This last file includes the following entry in its table:
{NET_IPV4_LOG_MARTIANS, "ip_log_martians",
&ipv4_config.log_martians, sizeof(int), 0644, NULL,
&proc_dointvec},
The problem, therefore, reduces to looking for the field ipv4_config.log_martians. It is used to control verbose reporting (via printk) of erroneous packets delivered to this host.
Unfortunately, many system administrators are not programmers, and need other sources of information. To their benefit, kernel developers sometimes write little docs as a diversion from writing code, and these docs are distributed with the kernel source. The bad news is that sysctl is quite recent in design, and such extra docs are quite scarce.
Documentation/networking/Configurable is a short introduction to sysctl (much shorter than this article), and points to net/TUNABLE, which in turn is a huge list of configurable parameters in the network subtree. The description of each item is not intelligible to the uninitiated, but those who don't know the details of networking can't proficiently tune network parameters anyway. As I'm writing, this file is the only non-C-language source of information about system control that I know of.
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved
Reprinted with permission of Linux Journal