A system with real-time constraints aims to perform work with a guarantee of the time when the work will be finished. This time is often called a deadline, and the system is designed with the purpose of not missing any, or as few as possible, deadlines.
A system where the consequences of missing deadlines are severe, for example with respect to danger for personnel or damage to equipment, is called a hard real-time system.
A system where deadlines occasionally can be missed is called a soft real-time system.
The work done by a real-time system is often initiated by an external event, such as an interrupt. The nature of the work often requires participation of one or more concurrently executing tasks. Each task participates by doing processing, combined with interaction with other tasks. The interaction typically leads to one or more task switches. A deadline for the work is often associated with the completion of the work, i.e. when the last participating task has finished its processing.
There are several challenges when implementing real-time systems. One challenge is to obtain as much determinism as possible. In this way, it becomes easier to make predictions and calculations of the actual execution time that will be required. From these predictions, the risk that the system would miss deadlines can be evaluated.
When implementing a soft real-time system and using Linux as an operating system, it is important to try to characterize possible sources of indeterminism. This knowledge can then be used to configure, and perhaps also modify, Linux so that its real-time properties become more deterministic, and hence the risk of missing deadlines is minimized, although never fully eliminated.
The remainder of this section gives a selection of areas in Linux where sources of indeterminism are present. The purpose is to give a brief technical overview, to serve as general advice for projects aiming at using Linux in real-time systems.
Each section in this chapter objectively describes:
- The topic/mechanism
- Default configuration
- The real-time impact
It does not describe how to optimize for real-time behavior; that is done in Chapter 3, Improving the Real-Time Properties. This separation makes it possible for a reader already familiar with the basics to skip this chapter.
- 1 Kernel Preemption Model
- 2 Scheduling
- 3 Real-Time Throttling
- 4 The PREEMPT_RT Patch
- 5 CPU Load Balancing
- 6 CPU Isolation
- 7 Dynamic Ticks (NO_HZ)
- 8 Power Save
- 9 I/O Scheduling
- 10 System Calls
- 11 Interrupts
- 12 Soft IRQs and Tasklets
- 13 Work Queues
- 14 Threaded Interrupts
- 15 Ticks
- 16 Memory Overcommit
- 17 RCU - Read-Copy-Update
Kernel Preemption Model
A task switch occurs when the currently running task is replaced by another task. In Linux, a task switch can be the result of two types of events:
1. As a side effect of a kernel interaction, e.g. a system call or when the kernel function schedule() is called. This type of task switch is referred to as yield. The function schedule() can be used by kernel threads to explicitly suggest a yield.
2. As a result of an asynchronous event, e.g. an interrupt. This is referred to as preemption and occurs asynchronously from the preempted task's point of view.
In the kernel documentation, the term voluntary preemption is used instead of yield, and the term forced preemption for what is here called preemption. The terms were chosen since, strictly speaking, preemption means "to interrupt without the interrupted thread's cooperation", see http://en.wikipedia.org/wiki/Preemption_%28computing%29.
Note that the preemption model only determines when a task switch may occur. The algorithms used to determine if a switch shall be done and then which task to swap in belong to the scheduler and are not affected by the preemption model. See Section 2.2, Scheduling for more info.
Where a task can be preempted depends on if it executes in user space or in kernel space. A task executes in user space if it is a thread in a user-space application, except when in a system call. Otherwise it executes in kernel space, i.e. system calls, kernel threads, etc.
Tasks can always be preempted in user space. In kernel space, you either allow or disallow preemption at specific places, depending on the preemption model in the kernel configuration.
Simplified, the choice of preemption model is a balance between responsiveness (latency) and scheduler overhead. Lower latency requires more frequent opportunities for task switches which results in higher overhead and possibly more frequent task switches.
Linux offers several different models, specified at build time:
- No Forced Preemption (Server)
- Voluntary Kernel Preemption (Desktop)
- Preemptible Kernel (Low-Latency Desktop)
- Preemptible Kernel (Basic RT) [requires the PREEMPT_RT patch plus kernel configuration]
- Fully Preemptible Kernel (RT) [requires the PREEMPT_RT patch plus kernel configuration]
The server and desktop configurations both rely entirely on yield (voluntary preemption). The difference is mainly that with the desktop option there are more system calls that may yield.
Low-latency desktop introduces kernel preemption. This means that the code is preemptible everywhere except in parts of the kernel where preemption has been explicitly disabled, as for example in spinlocks.
The preemption models RT and basic RT require the PREEMPT_RT patch, https://rt.wiki.kernel.org/index.php/Main_Page. They are not only additional preemption models as they also add a number of modifications that further improve the worst-case latency. Read more about PREEMPT_RT in Section 2.4, The PREEMPT_RT Patch.
The basic RT model is mainly intended for debugging of the RT model; use RT instead. RT aims to minimize the parts of the kernel where preemption is explicitly disabled.
Scheduling
A scheduling policy is a set of rules that determines if a task shall be swapped out and then which task to swap in. Linux supports a number of different scheduling policies:
- SCHED_FIFO (First In, First Out scheduling)
- SCHED_RR (also called Round-Robin scheduling)
- SCHED_OTHER (also called SCHED_NORMAL)
- SCHED_BATCH
- SCHED_IDLE
The scheduling policy is a task property, i.e. different tasks can have different policies. SCHED_FIFO and SCHED_RR are sometimes referred to as the real-time scheduling policies.
The standard way of scheduling tasks in Linux is known as fair scheduling. This means that Linux aims to give each task a fair share of the CPU time available in the system.
In a real-time system, work is deadline-constrained and the most important quality of task scheduling is meeting deadlines rather than fairness of CPU utilization. Fair scheduling may affect the time passed from when a task becomes ready for execution to when it actually starts to execute, since there may be other tasks that have not yet been given their share of the processor time. As an additional complication, the actual length of the delay depends on the number of tasks being active in the system and is therefore difficult to predict. In this way, the system becomes indeterministic.
There are other, non-fair, scheduling methods in Linux. For instance, it is possible to use priority-based scheduling, similar to the scheduling methods used in real-time operating systems. This method increases the determinism when interacting tasks perform work together. It requires, however, that explicit priorities are assigned to the tasks. The corresponding scheduling policy is named SCHED_FIFO.
SCHED_FIFO and SCHED_RR
SCHED_FIFO and SCHED_RR are the two real-time scheduling policies. Each task that is scheduled according to one of these policies has an associated static priority value that ranges from 1 (lowest priority) to 99 (highest priority).
The scheduler keeps a list of ready-to-run tasks for each priority level. Using these lists, the scheduling principles are quite simple:
- SCHED_FIFO tasks are allowed to run until they have completed their work or voluntarily yield.
- SCHED_RR tasks are allowed to run until they have completed their work, until they voluntarily yield, or until they have consumed a specified amount of CPU time.
- When the currently running task is to be swapped out, it is put at the end of the list and the task to swap in is selected as follows: a. From all non-empty lists, pick the one with the highest priority; b. From that list, pick the task at the beginning of the list.
As long as there are real-time tasks that are ready to run, they might consume all CPU power. A mechanism called RT throttling can help the system to avoid that problem; see Section 2.3, Real-Time Throttling.
SCHED_OTHER, SCHED_BATCH and SCHED_IDLE
SCHED_OTHER is the most widely used policy. These tasks do not have static priorities. Instead they have a "nice" value ranging from -20 (highest) to +19 (lowest). This scheduling policy is quite different from the real-time policies in that the scheduler aims at a "fair" distribution of the CPU. "Fair" means that each task shall get an average share of the execution time according to its nice value. See http://en.wikipedia.org/wiki/Completely_Fair_Scheduler and https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt for more info.
SCHED_BATCH is very similar to SCHED_OTHER. The difference is that SCHED_BATCH is optimized for throughput. The scheduler will assume that the process is CPU-intensive and treat it slightly differently. The tasks will get the same CPU share as SCHED_OTHER tasks, but will experience worse latency.
SCHED_IDLE is also similar to SCHED_OTHER, but is even weaker than a nice value of +19.
Real-Time Throttling
As long as there are real-time tasks, i.e. tasks scheduled as SCHED_FIFO or SCHED_RR, that are ready to run, they would consume all CPU power if the scheduling principles were followed. Sometimes that is the wanted behavior, but it would also allow bugs in real-time threads to completely block the system.
To prevent this from happening, there is a real-time throttling mechanism which makes it possible to limit the amount of CPU power that the real-time threads can consume.
The mechanism is controlled by two parameters: rt_period and rt_runtime. The semantics is that the total execution time for all real-time threads may not exceed rt_runtime during each rt_period. As a special case, rt_runtime can be set to -1 to disable the real-time throttling.
The throttling mechanism allows the real-time tasks to consume rt_runtime times the number of CPUs for every rt_period of elapsed time. A consequence is that a real-time task can utilize 100% of a single CPU as long as the total utilization does not exceed the limit.
The default settings rt_period=1000000 µs (1 s) and rt_runtime=950000 µs (0.95 s) give a limit of 95% CPU utilization.
The parameters are associated with two files in the /proc file system: /proc/sys/kernel/sched_rt_period_us and /proc/sys/kernel/sched_rt_runtime_us.
Changing a value is done by writing the new number to the corresponding file, for example:
$ echo 900000 > /proc/sys/kernel/sched_rt_runtime_us
$ echo 1200000 > /proc/sys/kernel/sched_rt_period_us
The PREEMPT_RT Patch
PREEMPT_RT is a set of changes to the Linux kernel source code. When applied, these changes will make the Linux kernel more responsive to external interrupts and more deterministic in the time for performing a task involving cooperating processes.
PREEMPT_RT aims to minimize the amount of kernel code that is non-preemptible (http://lwn.net/Articles/146861). This is accomplished by adding and modifying functionality in the Linux kernel.
The main functional changes done by PREEMPT_RT are:
- Converting spin locks to sleeping locks. This allows preemption while holding a lock.
- Running interrupt handlers as threads. This allows preemption while servicing an interrupt.
- Adding priority inheritance to different kinds of semaphores and sleeping locks. This avoids scenarios where a lower prioritized process hinders the progress of a higher prioritized process, due to the lower priority process holding a lock. See http://info.quadros.com/blog/bid/103505/RTOS-Explained-Understanding-Priority-Inversion.
- Lazy preemption. This increases throughput for applications with tasks using ordinary SCHED_OTHER scheduling.
PREEMPT_RT is managed as a set of patches for the Linux kernel. It is available for a selection of kernel versions. They can be downloaded from https://www.kernel.org/pub/linux/kernel/projects/rt. General information about the patches can be found from the PREEMPT_RT wiki https://rt.wiki.kernel.org/index.php/Main_Page.
It is possible to obtain a Linux kernel with the patches already applied. A central source is https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git. Another source is Linux from a hardware vendor, providing a Linux kernel with the patches applied. Enea Linux can be configured with or without the patches, depending on the specific customer use case.
Compared with the total number of source code lines in the Linux kernel, the PREEMPT_RT patch set affects just a small percentage. However, it patches central parts of the kernel: the latest patch set for the Linux 3.12 kernel contains 32 patches that change the central file kernel/sched/core.c. The size of the PREEMPT_RT patches can be estimated from the total number of affected files and source code lines. As an example, the latest patch set for the Linux 3.12 kernel https://www.kernel.org/pub/linux/kernel/projects/rt/3.12 contains 321 patches which affect (by adding or removing lines of code) 14241 source code lines.
Over time, functionality has been moved from the PREEMPT_RT patches into the mainline kernel, so the patch set has become smaller. As an example, the latest patch set for the older Linux 3.0 kernel https://www.kernel.org/pub/linux/kernel/projects/rt/3.0 contains 385 patches, adding or removing 16928 lines of source code, compared with 321 patches and 14241 lines for the Linux 3.12 kernel.
After the PREEMPT_RT patches have been applied, the PREEMPT_RT functionality is activated by selecting the kernel configuration menu alternative named Fully Preemptible Kernel (RT) and rebuilding the kernel.
The performance of a Linux kernel with the PREEMPT_RT patches applied can then be evaluated. A common evaluation methodology involves measuring the interrupt latency. The interrupt latency is typically measured from the time of an interrupt until the time when a task, that is activated as a result of the interrupt, begins executing in user space. A commonly used tool for this purpose is cyclictest.
Additional information on results from evaluating the PREEMPT_RT performance is provided in e.g. http://www.mpi-sws.org/~bbb/papers/pdf/ospert13.pdf and http://sigbed.seas.upenn.edu/archives/2014-02/ewili13_submission_1.pdf.
When deciding on the use of PREEMPT_RT, its costs and benefits should be evaluated. The use of PREEMPT_RT implies a significant change to the kernel source code. The changes involve code that might be less tested and therefore less proven in use than the remaining parts of the mainline kernel.
Another aspect is maintenance. The development of the PREEMPT_RT patch set follows the development of the kernel. Each time the kernel version is changed, the corresponding PREEMPT_RT patch set, which may be available only after a certain time, must then be applied to the new kernel and the associated tests be performed again.
On the other hand, the use of a PREEMPT_RT-enabled kernel can lead to a system with a decreased worst-case interrupt latency and more deterministic scheduling, which may be necessary to fulfil the real-time requirements for a specific product. For a single-core system, many of the other methods, e.g. full dynamic ticks and CPU isolation, cannot be used, which can make PREEMPT_RT an attractive alternative.
For additional information about the technical aspects of PREEMPT_RT see e.g. http://elinux.org/images/b/ba/Elc2013_Rostedt.pdf.
For additional information about the PREEMPT_RT development status, see e.g. http://lwn.net/Articles/572740.
CPU Load Balancing
Linux performs allocation of tasks to CPUs. In the default setting, this is done automatically.
The scheduler is free to move tasks around to evenly distribute the processing load among the available CPUs. This is similar to the default fair task scheduling, which divides the CPU time on an individual CPU so that each task gets its fair share of processing time.
The moving of a task is referred to as task migration. The migration is done by a part of the scheduler called the load balancer. It is invoked, on a regular basis, as a part of the processing done during the scheduler tick. The decision to move a task is based on a variety of input data such as CPU load, task behavior, etc.
For an application which requires determinism, load balancing is problematic. The time to activate a certain task, as a result of an interrupt being serviced, may depend, not only on the scheduling method used, but also on where this task is currently executing.
It is possible to statically assign tasks to CPUs. One reason for doing this is to increase determinism, for example by making the response time to external events more predictable. Assigning a task to a CPU, or set of CPUs, is referred to as setting the affinity of the task.
A mechanism used both for setting affinity of tasks and for turning off automatic load balancing, is cpuset, which makes use of the Linux subsystem cgroup. See https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt and https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt.
CPU Isolation
To get real-time performance on a single-CPU system, it is necessary to adapt the entire system, e.g. by using the PREEMPT_RT patch or an RTOS. This is not always necessary in a multicore system. Recently added features in the Linux kernel make it possible to aggressively migrate sources of kernel-introduced jitter away from selected CPUs. See Section 3.3.1, Isolating CPUs for more information. Doing this provides bare-metal-like performance on the CPUs where sources of jitter have been removed. You can also use a multicore system with CPU isolation to achieve higher throughput, although that is not the focus of this document.
One way to get real-time performance in Linux is by creating a bare-metal-like environment in Linux user space. On a default setup, this is not possible since the Linux kernel needs to do some regular housekeeping. It is possible to move much of this housekeeping to some dedicated CPUs, provided that we have a multicore system. That leaves the other CPUs relatively untouched by the Linux kernel, unless a user-space task triggers some kernel activity. The application that executes in this bare metal environment should avoid using libc calls and Linux system calls. See Chapter 4, Designing Real-Time Applications for more information. Since the kernel is not real-time safe, executing kernel code can have serious impact on real-time performance.
The biggest sources of kernel jitter are the scheduler and external interrupts. The scheduler's main duty is to switch between tasks, and switching between tasks can of course cause a lot of jitter. This is only a problem if the performance-critical task runs with the Completely Fair Scheduler (CFS) policy and therefore gets preempted because of time slicing. A solution for this problem can be to move all non-critical tasks to other CPUs and/or run the critical task with a real-time policy and an appropriate priority.
The load balancer will try to effectively utilize all the CPUs. That might be good for throughput, but it could damage real-time performance. The obvious problem is that general-purpose tasks could be moved to the real-time CPUs, and real-time tasks could be moved to general-purpose CPUs. The other problem is the actual work of migrating threads. This is easily solved by disabling load balancing on the CPUs that should be isolated.
The scheduler tick causes significant jitter and has negative real-time impact unless the PREEMPT_RT kernel is used. The tick can be removed with the CONFIG_NO_HZ_FULL kernel configuration. Read more about NO_HZ and the tick in Section 2.7, Dynamic Ticks (NO_HZ).
Interrupts can be a significant source of jitter. Some interrupts, like inter-processor interrupts (IPI) and per-CPU timer interrupts, need to be bound to a certain CPU. Other interrupts may be handled by any CPU in a multicore system and should be moved away from the isolated CPUs. Many timer interrupts can be removed by changing kernel configurations. See Section 2.11, Interrupts for more info.
Dynamic Ticks (NO_HZ)
The purpose of the tick is to balance CPU execution time between several tasks running on the same CPU. The tick is also used as a timer source for timeouts. Ticks are interrupts generated by a hardware timer and occur at a regular interval determined by the CONFIG_HZ kernel configuration, which for most architectures can be configured when compiling the kernel. The tick interrupt is a per-CPU interrupt.
Starting from Linux 2.6.21, the idle dynamic ticks feature can be configured by using the CONFIG_NO_HZ kernel configuration option. The goal was to eliminate tick interrupts while in idle state, to be able to go into deeper sleep modes. This is important for laptops but can also cut down power bills for server rooms.
Linux 3.10.0 introduced the full dynamic ticks feature to eliminate tick interrupts when running a single task on a CPU. The goal here was to better support high performance computing and real-time use cases by making sure that the thread could run undisturbed. The earlier configuration CONFIG_NO_HZ was renamed to CONFIG_NO_HZ_IDLE, and the new feature got the new configuration option CONFIG_NO_HZ_FULL.
The current implementation requires that ticks are kept on CPU 0 when using full dynamic ticks; this is not required for idle dynamic ticks. The only exception is when the whole system is idle, in which case the tick can be turned off on CPU 0 as well.
Whether dynamic ticks turn tick interrupts off is a per-CPU decision.
The available timer tick options are described below, in the help text copied from the kernel configuration program. Note that the recommendations are valid for the default preemption model, Preemptible Kernel (Low-Latency Desktop):
- Periodic timer ticks (constant rate, no dynticks): This option keeps the tick running periodically at a constant rate, even when the CPU doesn't need it.
- Idle dynticks system (tickless idle): This option enables a tickless idle system, where timer interrupts will only trigger on an as-needed basis when the system is idle. A tickless idle system is usually interesting for power saving. Most probably you would like to enable this option.
- Full dynticks system (tickless): Adaptively try to shutdown the tick whenever possible, even when the CPU is running tasks. Typically this requires running a single task on the CPU. Chances for running tickless are maximized when the task mostly runs in user space and has few kernel activities. You need to fill up the nohz_full boot parameter with the desired range of dynticks CPUs. This is implemented at the expense of some overhead in user <-> kernel transitions: syscalls, exceptions and interrupts. Even when it's dynamically off. Say N.
In order to make use of the full dynamic ticks system configuration you must ensure that only one task, including kernel threads, is running on a given CPU at any time. Furthermore there must not be any pending RCU callbacks or timeouts attached to the tick.
The official documentation for dynamic ticks can be found in the Linux kernel source tree https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt. There is also an excellent article about it at LWN, http://lwn.net/Articles/549580.
Power Save
This section describes several power saving techniques available in Linux. These techniques often affect the system's real-time properties.
A technique not described here is hibernation, e.g. suspend to RAM or disk. The reason is that hibernation is difficult to combine with real-time properties and therefore outside the scope of this manual.
Dynamic Frequency Scaling
When there is little CPU-bound work to be done, the CPU frequency can be reduced as a way to reduce power consumption. This is known as dynamic frequency scaling, see http://en.wikipedia.org/wiki/Dynamic_frequency_scaling.
The function is enabled at compile time by the configuration parameter CONFIG_CPU_FREQ, see https://www.kernel.org/doc/Documentation/cpu-freq/user-guide.txt. If enabled, the system will include functionality, called a governor, for controlling the frequency. There are several governors optimized for different types of systems. Governors are made available to the system by compile-time configurations named as CONFIG_CPU_FREQ_GOV_<xxx>.
The possibility to use dynamic frequency scaling in a real-time system is strongly related to the time it takes to ramp up the frequency and that time's relation to the latency requirements.
CPU Power States
If the CPU is idle, i.e. there are no tasks ready to run, the CPU can be put in a sleep state. A sleep state means that the CPU does not execute anything, while still being ready to respond to certain events, e.g. an external interrupt.
CPUs usually have a range of power modes. See http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Processor_states for an example. Deeper sleep means lower power consumption at the price of increased wake-up time. As with dynamic frequency scaling, the transition between the power states is controlled by a governor.
To configure the kernel to enter sleeping state when idle, use the compile-time configuration parameter CONFIG_CPU_IDLE.
I/O Scheduling
I/O scheduling determines in which order block I/O reads and writes are done. The algorithms require collection of statistics concerning block device activity, which decreases determinism for any real-time application writing to or reading from a block device.
In such a scenario, you may want to select the noop I/O elevator for the block device which your determinism-sensitive application is reading from or writing to. However, the effect is expected to be small, and this will have side effects for other applications accessing the same block device. It may even have negative side effects on your application, depending on the type of block device and the read/write behavior of the application. Fortunately, the I/O scheduler can be switched at runtime and should be selected based on the user-specific I/O load scenario.
Table 2.1 Available I/O schedulers in kernel 3.14
- noop
- deadline
- cfq (Completely Fair Queuing)
System Calls
User-space programs access services provided by the kernel by using system calls. Usually, applications do not invoke system calls directly, but rather use library wrapper functions that in turn execute system calls. A system call always involves a transition from user space to kernel space and back. It also involves passing parameters from the application program into the kernel, and back.
A system call becomes a source of indeterminism for the calling task, since the duration of a system call depends on the work done while executing in the kernel. The work may for example involve allocation of memory or other resources, and the time needed for the allocation depends on the availability of the resources. A system call may even result in the calling task being blocked as a result of a resource not being available. For example, a task reading data from a file may be blocked if there is no data immediately available. At some later time, when the requested resource becomes available, the task resumes its execution. The system call can then be completed and the task can return to user space.
There may also be other tasks wanting to execute during the execution of a system call. If it is possible to switch task during the system call, one or more of these other tasks can be allowed to execute. This can clearly be advantageous if these tasks have deadlines. This type of task switch is referred to as kernel preemption.
It is possible to configure the Linux kernel so that it allows more or fewer possibilities of kernel preemption. In general, if the level of preemption is increased, the Linux kernel becomes more complex and consumes a larger part of the CPU time. As a consequence, the application gets a smaller part of the CPU time, and the throughput is decreased.
Interrupts
The hardware indicates events to the software by using interrupts. When an interrupt occurs, the following is done:
1. The interrupt is mapped to a registered interrupt handler.
2. The registered interrupt handler is run. There can be several handlers for the same interrupt. Therefore, this step is retried for all handlers registered for this interrupt as long as they return IRQ_NONE.
3. If a registered interrupt handler returns IRQ_WAKE_THREAD, the interrupt thread corresponding to the registered interrupt handler is set in ready state, i.e. is now a schedulable task.
4. The interrupt is acknowledged.
All steps above are executed with all interrupts disabled, i.e. interrupt nesting is not supported. See http://lwn.net/Articles/380931/ for a discussion of why nested interrupts were removed from Linux. The patch that removed nested interrupts can be found at https://lkml.org/lkml/2010/3/25/434.
Interrupt handlers can either be registered using request_irq() or request_threaded_irq(), the latter registering a threaded interrupt. In both cases the interrupt handler will determine whether the interrupt is to be handled by this interrupt handler or not. The handler returns IRQ_NONE if not.
Interrupt work is normally divided into two parts: Top half and bottom half. The top half is implemented by the interrupt handler. The bottom half is implemented by soft IRQs, tasklets or work queues initiated from the top half, or by the interrupt thread in case of threaded interrupt.
See also Section 2.12, Soft IRQs and Tasklets, Section 2.13, Work Queues and Section 2.14, Threaded Interrupts.
The latency experienced by a real-time application is affected by interrupt top halves, soft IRQs and tasklets. For threaded interrupts, the priority of the interrupt thread can be adjusted so that only less critical tasks are affected.
Soft IRQs and Tasklets
Soft IRQs are intended for deferred interrupt work that should be run without all interrupts being disabled.
Soft IRQ work is executed at certain occasions, such as when interrupts are enabled, or when calling certain functions in the kernel, e.g. local_bh_enable() and spin_unlock_bh(). The soft IRQs can also be executed in the ksoftirqd kernel thread. All this makes it very hard to know when the soft-IRQ work will actually be executed. See https://lwn.net/Articles/520076/ for more information about soft IRQ and real-time.
Tasklets are built on top of soft IRQs, and they basically provide a generic interface for soft IRQs. While tasklets are much preferred over soft IRQs, the synchronization between interrupt handlers and corresponding soft IRQ or tasklet is non-trivial. For this reason, and for achieving better real-time properties, it is recommended to use work queues or threaded interrupts whenever possible.
Work Queues
Work queues execute in kernel threads. This means that preemption can occur for work queues. It also means that work performed in a work queue may block, which may be desirable in situations where resources are requested but not currently available. The reason for using work queues rather than having your own kernel thread is basically to keep things simple. Small jobs are better handled in batches in a few kernel threads rather than having a large number of kernel threads, each doing their little thing.
See http://lwn.net/Articles/239633/ for a discussion about why tasklets and soft IRQs are bad and why work queues are better.
Threaded Interrupts
When an interrupt handler is registered using request_threaded_irq() it will also have an associated interrupt thread. In this case, the interrupt handler returns IRQ_WAKE_THREAD to invoke the associated interrupt thread. If the interrupt thread is already running, the interrupt will simply be ignored.
Even if the interrupt handler has an associated interrupt thread, it may still return IRQ_HANDLED to indicate that the interrupt thread does not need to be invoked this time.
There are two main advantages to using threaded interrupts:
1. No synchronization is needed between the top half and the bottom half of the interrupt handler; the kernel handles it for you.
2. The priority of each interrupt handler can be adjusted individually, even when interrupt handlers share the same IRQ.
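The registration described above can be sketched as follows. This is kernel-module code, and the "mydev" name and the device_raised_irq() helper are hypothetical:

```c
/* Kernel-module sketch: a threaded interrupt handler. The "mydev" name
 * and the device_raised_irq() helper are hypothetical. */
#include <linux/interrupt.h>

static irqreturn_t my_quick_check(int irq, void *dev_id)
{
	/* Hard-IRQ context: only acknowledge the device and decide. */
	if (!device_raised_irq(dev_id))	/* hypothetical helper */
		return IRQ_NONE;	/* the interrupt was not ours */
	return IRQ_WAKE_THREAD;		/* defer the real work to the thread */
}

static irqreturn_t my_thread_fn(int irq, void *dev_id)
{
	/* Runs in a schedulable kernel thread: preemptible, may sleep. */
	return IRQ_HANDLED;
}

static int my_probe(unsigned int irq, void *dev)
{
	return request_threaded_irq(irq, my_quick_check, my_thread_fn,
				    IRQF_ONESHOT, "mydev", dev);
}
```

The resulting irq/&lt;nr&gt;-mydev kernel thread can then be given a real-time priority with chrt, independently of other handlers on the same IRQ line.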
An interrupt handler that was registered using request_irq() can be forced to run as a threaded interrupt, using a kernel boot parameter named threadirqs. This results in a short default interrupt handler being executed in place of the registered one, while the registered interrupt handler runs in an interrupt thread. Since the registered interrupt handler was not designed for this, it may still raise soft IRQs, causing hard-to-predict latencies. Interrupt handlers forced to be threaded run with soft IRQ handling disabled, since they do not expect to be preempted by soft IRQs.
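Whether the running kernel was booted with this parameter can be checked from the command line; a minimal sketch:

```shell
# Check whether the kernel was booted with forced interrupt threading.
if grep -qw threadirqs /proc/cmdline; then
    echo "threadirqs is set: registered handlers run in interrupt threads"
else
    echo "threadirqs is not set"
fi
```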
The Linux scheduler, which handles the process scheduling, is invoked periodically. This invocation is referred to as the Linux scheduler tick.
The Linux scheduler is also invoked on demand, for example when a task voluntarily gives up the CPU. This happens when a task decides to block, for example when data that the task needs are not available.
The frequency of the scheduler tick can be configured when building the kernel. The value typically varies between 100 and 1000 Hz, depending on the target.
The scheduler tick is triggered by an interrupt. During the execution of this interrupt, kernel preemption is disabled. The amount of time passing before the tick execution is finished depends on the amount of work that needs to be done, which in turn depends on the number of tasks in the system and how these tasks are allocated among the CPUs.
In this way, the presence of ticks, with varying completion times, contributes to the indeterminism of Linux: a task with a deadline cannot know beforehand how long it will take to complete its work, since it is repeatedly interrupted by the periodic tick.
By default, the Linux kernel allows applications to allocate (but not use) more memory than is actually available in the system. This feature is known as memory overcommit. The idea is to provide a more efficient memory usage, under the assumption that processes typically ask for more memory than they will actually need.
However, overcommitting also means there is a risk that processes will try to utilize more memory than is available. If this happens, the kernel invokes the Out-Of-Memory Killer (OOM killer). The OOM killer scans through the tasklist and selects a task to kill to reclaim memory, based on a set of heuristics.
When an out-of-memory situation occurs, the whole system may become unresponsive for a significant amount of time, or even end up in a deadlock.
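A process can influence how the OOM killer ranks it by adjusting its badness score through /proc. A minimal sketch (raising the score needs no privileges; lowering it below zero requires CAP_SYS_RESOURCE):

```shell
# Make the current process a preferred OOM victim (score range -1000..1000).
echo 500 > /proc/self/oom_score_adj
cat /proc/self/oom_score_adj
```

Conversely, a critical real-time process would typically be given a negative score by a privileged supervisor.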
For embedded and real-time critical systems, the allocation policy should be changed so that memory overcommit is not allowed. In this mode, the commit limit is the size of swap plus a configurable percentage (vm.overcommit_ratio) of physical RAM; malloc() fails if an application tries to allocate more memory than is strictly available, and the OOM killer is avoided. To disable memory overcommit:
$ echo 2 > /proc/sys/vm/overcommit_memory
For more information, see the man page for proc (5) and https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
RCU - Read-Copy-Update
RCU is an algorithm for updating non-trivial data structures, e.g. linked lists, without forcing readers to take any locks. This is done by making modifications on a copy of the structure, and then using a publishing method that atomically replaces the old version with the new one.
After the data has been replaced there will still be readers that hold references to the old data. The time during which the readers can hold references to the old data is called a grace period.
This grace period ends when the last reader seeing the old version has finished reading. When this happens, an RCU callback is issued in order to free resources allocated by the old version of the structure.
By default, these callbacks are invoked in soft IRQ context, adding hard-to-predict latencies to applications. There is a kernel configuration option named CONFIG_RCU_NOCB_CPU which, combined with the boot parameter rcu_nocbs=<cpu list>, relocates the RCU callbacks of the listed CPUs to kernel threads. These threads can be migrated away from the CPUs in the <cpu list>, giving those CPUs better real-time properties.
Using the nohz_full kernel boot parameter implies rcu_nocbs.
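On a GRUB-based system, the parameters could be added as in the following configuration fragment (the CPU list 2-3 is hypothetical, and the kernel must be built with CONFIG_RCU_NOCB_CPU):

```shell
# /etc/default/grub fragment: offload RCU callbacks from CPUs 2 and 3
GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=2-3 nohz_full=2-3"
# regenerate the boot configuration afterwards, e.g. with update-grub
```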
For further reading, see the following links:
- RCU concepts: https://www.kernel.org/doc/Documentation/RCU/rcu.txt
- What is RCU (contains a lot of good links to LWN articles): https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt
- Relocating RCU callbacks: http://lwn.net/Articles/522262/