Processes on Linux

This is an unfinished work in progress! This short course on Linux Processes is not ready for consumption, but is posted here so I can get feedback on its contents. Please use the comments form at the bottom of the page to leave me a note (requests, mistakes, additions, etc…). Thanks!

Before I left Oracle I was asked by my manager to create a short ‘course’ on how processes work under Linux. I was going to cover everything from how they are started, what they do when they are running, and how they die. Unfortunately now I’m too busy with Uni and other things to finish it off. If you’d like it finished, please drop me a line and tell me what you’re most interested in.

Introduction
Starting Processes
Tracing Running Processes
The End of a Process’s Life
Conclusion

Introduction

The objective of this course is to cover the ins and outs of how processes work on Linux. You will learn how and when processes are started, how to trace what a process is doing, and when and how processes exit. We will look at both the kernel interfaces for process control and their glibc counterparts, as well as what the kernel does to handle processes. We will also look at a plethora of tools such as strace and gdb, the /proc filesystem, and signals.

A process goes through three major stages in its life: first it is spawned (started), then it spends most of its life executing, and finally it dies (exits). This course is divided into three chapters, one for each of these stages.

Starting Processes

Most people assume that a process is started by executing a new program file, optionally passing a number of arguments to it. In the UNIX world (including Linux, Mac OS X, Sun Solaris, etc…) this operation is divided into two stages, forking and executing, each of which can be performed separately in order to achieve various effects. Forking a process creates a copy of the current process that continues working as a second thread of execution, and executing a binary replaces the program running in the current process with a new program.

Forking

When a process forks, a second, mostly identical process is created. Both processes continue from the same point: the return of the fork() system call. They can only tell each other apart by what fork() returns: the child’s PID in the parent process, or 0 in the new child process. The following short program demonstrates this:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    /* fork() returns -1 on error, 0 in the child, and the child's PID in the parent */
    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return 1;
    }
    else if (pid) {
        printf("I am the parent process (my child is %d)\n", (int)pid);
    }
    else {
        printf("I am the child process\n");
    }
    return 0;
}

Below is the output of a sample run of the above program. Note that the order of the lines may be different on your system, and the PID in the print statement will be different at each execution.

I am the child process
I am the parent process (my child is 26010)

I wrote earlier that the child process created by a fork() is mostly identical to its parent. The fork(2) manpage states the following: “fork creates a child process that differs from the parent process only in its PID and PPID, and in the fact that resource utilizations are set to 0. File locks and pending signals are not inherited.” The glibc threads documentation states that threads are also not copied over for a variety of reasons.

Glibc wraps the fork() call in order to make sure internal structures are kept in a consistent state, as well as to call the pthread_atfork handler functions. These are designed to reset mutexes and other such threading related things. Unless one is very careful, mixing threads with fork() can be very dangerous, since mutexes in the forked process are invalid and must be re-initialised.

There is a second version of fork(), called vfork(), which may be faster than fork() in certain situations, but the child is much more limited in what it can do. Where fork() gives the child a copy of the process’s entire address space (on Linux this is an efficient copy-on-write operation), vfork() shares the parent’s address space with the child. This means that the child cannot change any variables other than the one that stores the return value from vfork(), can only call execve() or _exit() (note the underscore), and cannot return from the function from which vfork() was called (the stack is also shared). On Linux the parent process also blocks until the child calls execve() or _exit().

Executing

As I mentioned earlier, the second stage in starting a new program as a new process is executing it, which is done with the execve() system call. When called, it replaces the current program with the new program, keeping the PID and open file descriptors (such as open files and network sockets), apart from any descriptors marked close-on-exec.

The execve() call only ever returns on failure, a little like exit(), except that exit() can’t fail. If execve() succeeds, the new program is immediately loaded into memory and replaces the old one. If it fails, it returns -1 and sets errno, and the original program carries on running.

Glibc has a very minimal wrapper around the execve() syscall (system call), which simply provides a C function around the assembly code required to call it. The syscall is picked up in the kernel as sys_execve(), which copies in some data from userspace and calls do_execve().

The do_execve() function opens the file for execution (which checks permissions on it), creates a new execution context, and prepares certain parameters: it copies in the command-line arguments and environment variables, checks for the setuid and setgid bits and sets the euid and egid appropriately, and reads the file’s first 128 bytes. It then calls search_binary_handler(), which decides what to do with the executable, and finally updates process accounting information before returning.

The search_binary_handler() function then goes through each registered binfmt handler until one accepts the binary. There are 4 standard handlers built on 32-bit x86 Linux systems: binfmt_script for starting scripts (which is included on every architecture and system), binfmt_elf for ELF binaries (Executable and Linking Format, which is the modern binary format for Linux and other systems like HP-UX and Solaris), binfmt_misc for binaries started by a wrapper program, and binfmt_aout for loading legacy a.out format binaries.

binfmt_script

The binfmt_script loader is the simplest loader in the kernel: it reads the start of the executable and looks for the shebang (“#!”) as the first two characters of the program. If this exists, and the handler isn’t being called as an interpreter for another script, it prepends the interpreter and its arguments from the shebang to the argument list (argc/argv), and calls the interpreter. This is done similarly to do_execve(), in that it opens the interpreter’s executable file, prepares its arguments and flags, and hands off to search_binary_handler().

binfmt_elf

ELF is the modern binary format on Linux. This is now used by Linux on virtually all the architectures on which it is available, including 32-bit and 64-bit processors, and big- and little-endian systems. It natively supports dynamic shared objects and other advanced functionality, which unfortunately can make it incredibly complex to understand.

ELF’s loader function first checks that the file’s magic numbers are correct, and that the binary was created for the current architecture and operating system (since ELF is also used on other operating systems). It then reads the program’s header information and, if the image has a PT_INTERP program header, extracts the name of the interpreter program to execute from it. It then proceeds to flush the old program out of memory and start the interpreter. If the program has no interpreter (as is the case for the interpreter program itself), the loader simply loads the program into memory and starts it.

The ELF interpreter program can be either an ELF or a.out executable, nothing else. The interpreter (or dynamic linker / loader) program—most often /lib/ld-linux.so.2 for libc6 (glibc 2.x) applications—is charged with loading the shared libraries a program is linked with, preparing it for execution, and finally running it.

binfmt_misc

This binary format handler is very special and, as far as I know, unique to Linux. It allows almost any file to be treated as a program and executed, such as a Java class or JAR file, a .NET program (via Mono), etc… This is accomplished by registering with the kernel certain criteria by which to detect the program, and associating them with a wrapper program that is executed with the binary as an argument, a little like a script interpreter. For example, with binfmt_misc appropriately set up, a Java program packaged in a JAR file can be run simply by marking the JAR as executable and running it directly:

chmod +x something.jar
./something.jar arg1 arg2 arg3

Compare this to the standard way of doing this:

java -jar something.jar arg1 arg2 arg3

When the handler is called, it looks through its database of file magic numbers and filename extensions. If one matches, it adjusts the command-line arguments and calls the wrapper just as the binfmt_script handler calls an interpreter: it opens the wrapper’s executable file, prepares its arguments and flags, and hands off to search_binary_handler().

binfmt_aout

The binfmt_aout handler is provided for backwards compatibility: it loads the legacy a.out format binaries. ELF was introduced around kernel 1.1.52 (we are now at 2.6.12), and the a.out format’s use is now actively discouraged. Since this format is very simple (read: weak), the handler is almost as plain as the binfmt_script handler in that it does very little.

All the handler needs to do is check the magic constants in the program, flush the old program out of memory, and load and start the new program.

Chris's Digital Realm

Chris Boot's very, very occasional ramblings
