by Alessandro Rubini
In the Linux (or Unix) world, most network interfaces, such as eth0 and ppp0, are associated with a physical device that is in charge of transmitting and receiving data packets. However, there are exceptions to this rule, and some logical network interfaces don't feature any physical packet transmission; the best-known examples are the shaper and eql interfaces. This article shows how such ``virtual'' interfaces attach to the kernel and to the packet transmission mechanism.
From the kernel's point of view, a network interface is a software object that can process outgoing packets, and the actual transmission mechanism remains hidden inside the interface driver. Even though most interfaces are associated with physical devices (or, for the loopback interface, with a software-only data loop), it is possible to design network interface drivers that rely on other interfaces to perform actual packet transmission. The idea of a ``virtual'' interface can be useful to implement special-purpose processing on data packets without hacking the network subsystem of the kernel. To support this discussion with a real-world example, I wrote an insane (INterface SAmple for Network Errors) driver, available as insane.tar.gz. The interface simulates semi-random packet loss or intermittent network failures. The code fragments shown here are part of the insane driver, and have been tested with Linux-2.3.41.
While the following description is rather terse, the sample code is well commented and tries to fill the gaps left open by this quick tour of the topic.
Like other kinds of device drivers, a network interface module connects to the rest of Linux by registering its own data structure within the kernel. The insane driver, for example, registers itself by calling ``register_netdev(&insane_dev);''.
The device structure being registered, insane_dev, is a struct net_device object (Linux 2.3.13 and earlier called it struct device), and it must feature at least two valid fields: the interface name and a pointer to its initialization function:
static struct net_device insane_dev = {
    name: "insane",
    init: insane_init,
};
The init callback is meant for internal use by the driver: it usually fills other fields of the data structure with pointers to device methods, the functions that perform the real work during the interface lifetime. When an interface driver is linked into the kernel (instead of being loaded as a module), the first task of the init function is checking whether the interface hardware is there.
As you may imagine, the interface can be removed by calling unregister_netdev(), usually invoked by cleanup_module() (or not invoked at all if the driver is not modularized).
The net_device structure includes, in addition to all the standardized fields, a ``private'' pointer (a void *) that the driver can use for its own purposes. Where virtual interfaces are concerned, the private field is the best place to host configuration information; the insane sample interface follows the good practice of allocating its own priv structure at initialization time:
/* priv is used to host the statistics, and packet dropping policy */
dev->priv = kmalloc(sizeof(struct insane_private), GFP_USER);
if (!dev->priv) return -ENOMEM;
memset(dev->priv, 0, sizeof(struct insane_private));
The allocation is released at interface shutdown (i.e., when the module is removed from the kernel).
A network interface object, like most kernel objects, exports a list of methods so the rest of the kernel can use it. These methods are function pointers located in fields of the object data structure, here struct net_device.
An interface can be perfectly functional by exporting just a subset
of all the methods; the recommended minimum subset includes
open, stop (i.e., ``close''), do_ioctl and
get_stats. These methods are directly related to system calls
invoked by a user program (such as ifconfig). With the
exception of ioctl, which needs some detailed discussion, their
implementation is pretty trivial, and they turn out to be just a few
lines of code.
int insane_open(struct net_device *dev)
{
    dev->start = 1;
    MOD_INC_USE_COUNT;
    return 0;
}

int insane_close(struct net_device *dev)
{
    dev->start = 0;
    MOD_DEC_USE_COUNT;
    return 0;
}

struct net_device_stats *insane_get_stats(struct net_device *dev)
{
    return &((struct insane_private *)dev->priv)->priv_stats;
}
The open method is called when you call ``ifconfig insane up'', and close deals with ``ifconfig insane down''; get_stats returns a pointer to the local statistics structure and is used by ifconfig as well as by the /proc informative files. The driver is responsible for filling the statistic information (although it may choose not to), whose fields are defined in <linux/netdevice.h>.
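As a rough user-space illustration of how init, the method pointers and ``ifconfig insane up'' fit together, the sketch below models the pattern with a toy structure. It is not kernel code: struct toy_netdev and every toy_ function are made-up stand-ins for struct net_device, register_netdev() and the ifconfig path, and they only mimic the dispatch idea described above.

```c
#include <stddef.h>

/* Toy stand-ins for the kernel types; every toy_ name is hypothetical. */
struct toy_stats { unsigned long tx_packets, tx_bytes; };

struct toy_netdev {
    const char *name;
    int  (*init)(struct toy_netdev *dev);   /* set before registration */
    int  (*open)(struct toy_netdev *dev);   /* filled in by init */
    int  (*stop)(struct toy_netdev *dev);   /* filled in by init */
    struct toy_stats *(*get_stats)(struct toy_netdev *dev);
    int   start;                            /* 1 while the interface is up */
    void *priv;                             /* driver-private data */
};

static struct toy_stats toy_insane_stats;

static int toy_insane_open(struct toy_netdev *dev)  { dev->start = 1; return 0; }
static int toy_insane_close(struct toy_netdev *dev) { dev->start = 0; return 0; }

static struct toy_stats *toy_insane_get_stats(struct toy_netdev *dev)
{
    return (struct toy_stats *)dev->priv;
}

/* init: attach the private data and fill in the method pointers. */
static int toy_insane_init(struct toy_netdev *dev)
{
    dev->priv      = &toy_insane_stats;
    dev->open      = toy_insane_open;
    dev->stop      = toy_insane_close;
    dev->get_stats = toy_insane_get_stats;
    return 0;
}

/* Registration in this model: check the two mandatory fields, then call init. */
static int toy_register_netdev(struct toy_netdev *dev)
{
    if (!dev || !dev->name || !dev->init)
        return -1;
    return dev->init(dev);
}

/* What bringing the interface up or down amounts to in this model. */
static int toy_ifconfig(struct toy_netdev *dev, int up)
{
    return up ? dev->open(dev) : dev->stop(dev);
}
```

Only name and init need to be set by hand; everything else is reachable through pointers that init installs, which is why a driver can export just the subset of methods it cares about.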
There are other methods, more related to the low level details of packet transmission, but they fall outside of the scope of this discussion (they are on show in the source package, though). The only interesting low-level method is hard_start_xmit, discussed later.
The do_ioctl entry point is the most important one for virtual interfaces. When a user program configures the behavior of the interface, it does its task by invoking the ioctl() system call. This is how shapecfg defines network shaping and how eql_enslave attaches real interfaces to the load-balancing interface eql. Similarly, the insanely application configures the insane behavior on the insane virtual interface.
Unlike what happens for ``normal'' device drivers (char and block drivers), the implementation of ioctl for interfaces is pretty well-defined: the invoking file descriptor must be a socket, the available commands are only SIOCDEVPRIVATE to SIOCDEVPRIVATE+15, and the infamous ``third argument'' of the system call is always a struct ifreq * pointer, instead of the generic void * pointer. This ``restriction'' in ioctl arguments takes place because socket ioctl commands span several logical layers and several protocols; the predefined values are reserved for device private use, and are unique throughout the protocol stack (note that no other ioctl command will be delivered to the network interface method, so you really cannot choose your own values). Passing a predefined data structure to ioctl doesn't limit the flexibility of interface configuration, as the ifreq structure includes a data field (ifr_data), a caddr_t value that can point to arbitrary configuration information.
Based on the information above, the insane interface can be controlled using these commands (defined in "insane.h"):
#define SIOCINSANESETINFO SIOCDEVPRIVATE
#define SIOCINSANEGETINFO (SIOCDEVPRIVATE+1)
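To make the sixteen-command window concrete, here is a small user-space sketch that checks whether a command number falls inside the device-private range. The helper is_device_private_cmd() is a hypothetical name introduced for this illustration, and the fallback value 0x89F0 is the one Linux uses for SIOCDEVPRIVATE, supplied only for C libraries that do not expose the constant through <sys/ioctl.h>.

```c
#include <sys/ioctl.h>

/* Fallback for C libraries that do not define the constant here;
 * 0x89F0 is the value used by Linux. */
#ifndef SIOCDEVPRIVATE
#define SIOCDEVPRIVATE 0x89F0
#endif

#define SIOCINSANESETINFO SIOCDEVPRIVATE
#define SIOCINSANEGETINFO (SIOCDEVPRIVATE+1)

/* Returns 1 if cmd lies in SIOCDEVPRIVATE .. SIOCDEVPRIVATE+15, else 0.
 * Hypothetical helper, not part of the insane package. */
static int is_device_private_cmd(unsigned long cmd)
{
    return cmd >= SIOCDEVPRIVATE && cmd <= SIOCDEVPRIVATE + 15;
}
```

Both insane commands pass the check, while SIOCDEVPRIVATE+16 does not, which is why a driver that needs more than sixteen operations has to multiplex them through the data structure rather than through new command numbers.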
Actual use of the command within the user-space program insanely turns out to be pretty simple:
int sock = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
struct insane_userinfo info; /* configuration data in/out */
struct ifreq req;
strcpy(req.ifr_name, "insane");
req.ifr_data = (caddr_t)&info;
/* fill info structure... */
if (ioctl(sock, SIOCINSANESETINFO, &req) < 0) {
    /* deal with error */
}
The kernel-space counterpart of the configuration process is slightly more complex, but only because it must deal with permission checks and copying data:
struct insane_userinfo info;
struct insane_userinfo *uptr;
/* only authorized users can control the interface */
if (cmd == SIOCINSANESETINFO && !capable(CAP_NET_ADMIN))
    return -EPERM;
/* retrieve the data structure from user space */
uptr = (struct insane_userinfo *)ifr->ifr_data;
/* copy_from_user() returns the number of bytes left uncopied, not an errno */
if (copy_from_user(&info, uptr, sizeof(info)))
    return -EFAULT;
/* deal with the information */
return 0;
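On the user-space side, the interface name copied into ifr_name is limited to IFNAMSIZ bytes including the terminator, so a slightly more defensive variant of the request set-up checks the length first. The following is a minimal sketch under stated assumptions: struct toy_userinfo is a hypothetical layout (the real struct insane_userinfo is defined in insane.h and is not reproduced here), and toy_prepare_req() is a made-up helper name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose struct ifreq on glibc in strict modes */
#endif
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>

/* Hypothetical configuration block; the real layout lives in insane.h. */
struct toy_userinfo {
    char name[IFNAMSIZ];    /* slave interface, e.g. "eth0" */
    int  mode;
    int  arg1, arg2;
};

/* Prepare req for a device-private ioctl on ifname.
 * Returns 0 on success, -1 if ifname does not fit in ifr_name. */
static int toy_prepare_req(struct ifreq *req, const char *ifname,
                           struct toy_userinfo *info)
{
    if (strlen(ifname) >= IFNAMSIZ)
        return -1;
    memset(req, 0, sizeof(*req));
    strcpy(req->ifr_name, ifname);   /* safe: length checked above */
    req->ifr_data = (void *)info;    /* opaque pointer handed to the driver */
    return 0;
}
```

The prepared request would then be passed to ioctl() on a socket exactly as in the fragment above.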
The most important entry point for a network interface driver is hard_start_xmit, where hard is a shorthand for hardware. The device method gets called whenever a network packet gets routed through the interface. Unlike the methods described above (and like the ones not discussed here), this one is not directly related to any system call or application; rather, it is used by the network subsystem of the Linux kernel according to its own policies.
When virtual interfaces are concerned, no actual hardware
transmission takes place in the interface itself. The interface will
instead resort to another network interface to perform
transmission. Packet passing is implemented in two steps: first
(usually at configuration time, within ioctl), the interface
must connect to another interface, the one that can transmit packets;
then, its own hard_start_xmit must take proper action to pass
the packet.
/* look for the hardware interface */
slave = __dev_get_by_name(info.name);
if (!slave) return -ENODEV;
priv->priv_device = slave;
/* .... */
/* update your statistic counters */
priv->priv_stats.tx_packets++;
priv->priv_stats.tx_bytes += skb->len;
/* assign the packet to the hw interface */
skb->dev = priv->priv_device;
/* and tell Linux to pass it to its device */
dev_queue_xmit (skb);
In a perfect world, the virtual interface should also register a notifier callback, so Linux will tell the driver when the physical hardware interface goes away -- if the slave interface is a module, its removal will make insane unhappy. The released insane implementation doesn't register any callback, and making it saner is left as an exercise for the reader.
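The two steps, together with the clean-up that a notifier callback would perform, can be modelled in user space. The sketch below is only an analogy under stated assumptions: struct toy_dev, struct toy_skb and all toy_ functions are invented for this illustration, toy_queue_xmit() merely stands in for dev_queue_xmit(), and the device table replaces the kernel's lookup by name.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for kernel objects; every toy_ name is hypothetical. */
struct toy_dev {
    const char     *name;
    unsigned long   tx_packets, tx_bytes;
    struct toy_dev *slave;          /* plays the role of priv->priv_device */
};

struct toy_skb {
    struct toy_dev *dev;            /* interface the packet is queued on */
    size_t          len;
};

/* Step 1 (configuration time): find the slave by name in a device table. */
static int toy_attach(struct toy_dev *virt, struct toy_dev *table, size_t n,
                      const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].name, name) == 0) {
            virt->slave = &table[i];
            return 0;
        }
    }
    return -1;                      /* the kernel fragment returns -ENODEV */
}

/* Stand-in for dev_queue_xmit(): account the packet on skb->dev. */
static void toy_queue_xmit(struct toy_skb *skb)
{
    skb->dev->tx_packets++;
    skb->dev->tx_bytes += skb->len;
}

/* Step 2 (transmission time): count the packet, then hand it to the slave. */
static int toy_xmit(struct toy_skb *skb, struct toy_dev *virt)
{
    if (!virt->slave)
        return -1;                  /* slave gone or never attached */
    virt->tx_packets++;
    virt->tx_bytes += skb->len;
    skb->dev = virt->slave;
    toy_queue_xmit(skb);
    return 0;
}

/* What a notifier callback would do when the slave is unregistered. */
static void toy_slave_gone(struct toy_dev *virt, const struct toy_dev *gone)
{
    if (virt->slave == gone)
        virt->slave = NULL;
}
```

In this model, forgetting the slave on removal turns a dangling pointer into a clean transmit failure, which is the behaviour a real notifier would be expected to provide.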
When network packets hit an interface board, they generate an interrupt so that the Operating System can handle packet arrival (the only exception is the loopback interface, whose reception mechanism is part of packet transmission).
A virtual interface, on the other hand, has no way to receive interrupts, and thus it cannot receive any network packet. This can be perceived as unfortunate, because it would be nice to attach the same software operations to both directions of data flow. But the mechanics of packet reception don't allow virtual interfaces to enter the game, and whoever needs to intercept incoming packets must use other ways to hook into the packets' path. This kind of functionality is beyond the scope of this discussion and leans very much towards the way netfilter works.
All of this talking may look rather pointless unless we can see it at work. The insane interface relies on an Ethernet interface for physical transmission, and it can be configured to operate in one of three insane modes. It can relay every packet (``pass'' mode), relay only a given percentage of packets (``percent'' mode, with an integer parameter), or turn relaying on and off on a periodic basis (``time'' mode, with two parameters, on-time and off-time, specified as jiffy counts, architecture-dependent time quanta that correspond to 10ms each on the PC platform). Here are three examples of use of insanely:
# insanely eth0 pass ; # relay everything to eth0
# insanely eth0 percent 80 ; # drop 20% (pseudo random)
# insanely eth0 time 50 100 ; # relay for .5 seconds, drop for 1s
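The three modes boil down to a per-packet relay-or-drop decision. The following user-space sketch shows one way such a decision could be written; it is not taken from the insane source. The names toy_policy, toy_should_relay() and toy_ms_to_jiffies() are hypothetical, the random number and the current jiffy count are passed in as plain arguments, and the on/off phase is simply derived from the jiffy counter modulo the period, which is an assumption of this sketch.

```c
/* Hypothetical model of the three insane modes. */
enum toy_mode { TOY_PASS, TOY_PERCENT, TOY_TIME };

struct toy_policy {
    enum toy_mode mode;
    unsigned long arg1;   /* percent to relay, or on-time in jiffies */
    unsigned long arg2;   /* off-time in jiffies (time mode only) */
};

/* Milliseconds to jiffies, assuming HZ = 100 (10 ms per jiffy, as on the PC). */
#define TOY_HZ 100
static unsigned long toy_ms_to_jiffies(unsigned long ms)
{
    return ms * TOY_HZ / 1000;
}

/* Returns 1 to relay the packet, 0 to drop it.
 * now is the current jiffy count, rnd any pseudo-random number. */
static int toy_should_relay(const struct toy_policy *p,
                            unsigned long now, unsigned long rnd)
{
    switch (p->mode) {
    case TOY_PASS:
        return 1;
    case TOY_PERCENT:
        return rnd % 100 < p->arg1;
    case TOY_TIME: {
        unsigned long period = p->arg1 + p->arg2;
        if (period == 0)
            return 1;               /* degenerate setting: never drop */
        return now % period < p->arg1;
    }
    }
    return 1;
}
```

With the ``time 50 100'' setting, this model relays during the first 50 jiffies of every 150-jiffy period and drops during the remaining 100, matching the half second on and one second off described in the example.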
In order to connect insane to the network, you need to assign a ``local'' IP address to the interface (that IP address will be used as the ``source address'' and by remote hosts to send their replies) and route some packets through it. Current versions of Linux automatically associate a network route with each device, and this route cannot be removed. Therefore, we can't re-route all of the LAN through insane at once, and the following example re-routes a single host, called "morgana", in the routing table of the host "borea".
borea# insmod insane ; # load module
borea# ifconfig insane borea ; # give same IP as eth0
borea# route add morgana dev insane ; # re-route this host
borea# ./insanely eth0 percent 60 ; # set dropping rate
Unfortunately, due to a glitch in Linux-2.3.41, you'll also need to disable the packet filters on the Ethernet interface used by insane. The following command worked for me: ``echo 0 > /proc/sys/net/ipv4/conf/eth0/rp_filter''.
With this setup, you can connect to morgana with any protocol you like, and experience a 40% packet loss -- only on transmitted packets, though, unless morgana runs another instance of insane with a similar configuration.
An interesting effect of this transmission path through two interfaces is that you can run tcpdump on both eth0 and insane and see different results. While ``tcpdump -i eth0'' shows the packets actually being transmitted, ``tcpdump -i insane'' displays every packet sent out by the protocol layers, before any dropping is applied.
rubini@gnu.org
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved