Improving the Real-Time Properties


Improving the real-time properties is very much about reducing latency. In a hard real-time system, it is all about the worst case, while in a soft real-time system, it is about reducing the probability of high latencies. Often, the improvements come at a price, such as lower average throughput, increased average latency, or reduced functionality.

This chapter starts with a set of actions that are of low complexity, easy to apply, and/or well established. For some systems these actions will be enough. If not, it is suggested to isolate the real-time critical application(s); the idea is to shield the real-time applications from sources of jitter. Finally, a set of actions with larger impact is presented.

Evaluating Real-Time Properties

Finding the right balance is an iterative process that preferably starts with establishing a benchmark. A common way to evaluate real-time properties is to measure interrupt latency with the test application cyclictest, which sets a timer and measures the time from the expected expiration time until the task that set the timer becomes ready for execution.

For the result to be relevant, the system should also be put under stress, e.g. with the stress tool (http://people.seas.harvard.edu/~apw/stress), which can generate various types of load.
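
For example, a measurement run could combine the two tools as follows. The options shown here mirror the ones used for the benchmarks later in this chapter and are only an illustration:

$ ./stress -d 4 --hdd-bytes 20M -c 4 -i 4 -m 4 --vm-bytes 15M &
$ ./cyclictest -m -p 99 -l 100000 -q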

Since benchmark results are based on samples rather than theoretical values, it is important to have a sufficiently large number of samples. What "sufficiently large" is depends on the distribution of the sampled properties, the requested accuracy etc.

NOTE: The benchmarks presented in this chapter are intended as an indication of the impact of the discussed measures, and the requirements on accuracy are relatively low. Therefore, a low number of samples (100 000) has been used. As a consequence, there can be significant variations between benchmarks with the same setup, e.g. the io benchmark for Low-Latency Desktop in Table 3.2 Worst-Case Latency in [µs] for the Different Preemption Models on P2041RDB and the Regular irqs io benchmark in Table 3.3 Benchmarks of Threaded Interrupts on P2041RDB, Times in µs.

First Attempt

Configure the Proper Kernel Preemption Model

Lack of preemption can result in very high, possibly unbounded, latencies. This makes the models without forced preemption, Server and Desktop, unsuitable for real-time systems. See Section 2.1, Kernel Preemption Model for a description of the kernel preemption models.

The recommendation is to start with the default preemption model, Preemptible Kernel (Low-Latency Desktop). The Fully Preemptible Kernel model, referred to as the RT model, in many cases gives a lower worst-case latency but at the cost of higher overhead, which might have implications on quality and functional stability. Furthermore, the RT model requires the PREEMPT_RT patch, in contrast to the Low-Latency Desktop model which is a part of the standard kernel.

The preemption model is set at compile time by kernel configuration parameters. Modify the kernel configuration by using bitbake -c menuconfig, and select the menu Kernel options followed by the menu Preemption Model in the menuconfig tool. To configure the kernel to use the Low-Latency Desktop model, set CONFIG_PREEMPT=y or CONFIG_PREEMPT__LL=y (the latter from the PREEMPT_RT patch), and set all other CONFIG_PREEMPT_* parameters to n.

Table 3.1 Kernel Configuration Parameters for the Preemption Model

Name                                       Configuration Parameter
No Forced Preemption (Server)              CONFIG_PREEMPT_NONE
Voluntary Kernel Preemption (Desktop)      CONFIG_PREEMPT_VOLUNTARY
Preemptible Kernel (Low-Latency Desktop)   CONFIG_PREEMPT or CONFIG_PREEMPT__LL*
Preemptible Kernel (Basic RT)*             CONFIG_PREEMPT_RTB
Fully Preemptible Kernel (RT)*             CONFIG_PREEMPT_RT_FULL

* Requires the PREEMPT_RT patch.
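
As an illustration, the resulting fragment of the kernel configuration for the Low-Latency Desktop model on a standard kernel (without the PREEMPT_RT patch) could look like this:

# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y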


Benchmarks of preemption models

The table below shows the worst-case latency for the different preemption models, measured under various stress scenarios.

Table 3.2 Worst-Case Latency in [µs] for the Different Preemption Models on P2041RDB

                              Stress type (see Appendix B)
Preemption Model              cpu    hdd    io     vm     full
Server                        48     431    155    6247   8473
Desktop                       52     161    139    6343   8423
Low-Latency Desktop           72     396    711    1084   977
RT (from PREEMPT_RT patch)    29     72     62     74     64

Optimize Power Saving

As described in Section 2.8, Power Save, power saving mechanisms interact poorly with real-time requirements. The reason is that exiting a power saving state cannot be done instantly; for example, the wake-up latency is about 200 µs from sleep mode C3 and 3 µs from C1 on a 2 GHz Intel i5. This does not have to be a problem, e.g. in a soft real-time system where the accepted latency is longer than the wake-up time, or in a multicore system where power saving techniques are used only on a subset of the cores. It is however recommended to start with the power saving mechanisms disabled in the kernel configuration.
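
A minimal sketch of such a configuration fragment is shown below; the exact set of relevant options depends on the architecture and kernel version, and CONFIG_CPU_IDLE and CONFIG_CPU_FREQ are used here as the typical CPU power management options (CONFIG_CPU_FREQ is discussed further under Disable power management below):

# CONFIG_CPU_IDLE is not set
# CONFIG_CPU_FREQ is not set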

It may be noted that the dynamic ticks function described in Section 2.7, Dynamic Ticks (NO_HZ), originated as a power saving function named idle dynamic ticks. We will later describe how to use full dynamic ticks for real-time purposes, see Section 3.3, Isolating the Application.

Use Threaded Interrupts

Interrupt handlers can, when registering, choose to run most of their interrupt handling in a kernel thread, as so-called threaded interrupts. Interrupt handlers that have not made this choice can also be forced to run threaded, provided that the threadirqs boot parameter was used when booting the kernel. Interrupt handlers that have registered using IRQF_NO_THREAD will not be threaded even if the boot parameter is given.

Threading interrupts will make interrupt handlers more preemptible, but some interrupt handlers might not work with this option, so make sure to test well when using it.
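
For reference, a driver opts in to threaded handling when registering its interrupt with request_threaded_irq(). A minimal sketch (handler names and the device cookie are illustrative):

#include <linux/interrupt.h>

static irqreturn_t my_hard_handler(int irq, void *dev_id)
{
        /* Runs in hard interrupt context: only acknowledge the hardware here. */
        return IRQ_WAKE_THREAD;              /* defer the bulk of the work */
}

static irqreturn_t my_thread_handler(int irq, void *dev_id)
{
        /* Runs in a preemptible kernel thread. */
        return IRQ_HANDLED;
}

static int my_driver_setup_irq(unsigned int irq, void *dev_id)
{
        return request_threaded_irq(irq, my_hard_handler, my_thread_handler,
                                    0, "my_device", dev_id);
}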

Benchmarks of threaded interrupts

Table 3.3 Benchmarks of Threaded Interrupts on P2041RDB, Times in µs

                cpu     full    hdd     io      no_stress   vm
Threaded irqs   46      994     705     88      185         1018
Regular irqs    59      961     481     217     110         1029
Ratio %         78.0    103.4   146.6   40.6    168.2       98.9

Optimize Real-Time Throttling

As described in Section 2.3, Real-Time Throttling, the default settings for real-time throttling allow real-time tasks to run for 950 000 µs every 1 000 000 µs. This may lead to a situation where real-time tasks are blocked for 50 000 µs at the end of the throttling period.

In the generic case, execution of the real-time tasks may be blocked for a time equal to the difference between rt_runtime and rt_period. This situation should however be quite rare since it requires that there are real-time tasks (i.e. tasks scheduled as SCHED_FIFO or SCHED_RR) that are ready to run on all CPUs, a condition that should rarely be met since real-time systems are typically designed to have an average real-time load of significantly less than 100%.

Consequently, it is recommended to keep the real-time throttling enabled. For systems that do not have any real-time tasks, the real-time throttling will never be activated and the settings will not have any impact.

An alternative when using CPU isolation is to avoid using SCHED_FIFO and SCHED_RR, since the isolated CPU is supposed to run a single task anyway. In this case, real-time throttling will never be activated.
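
The throttling parameters can be inspected and changed at runtime through /proc. A short example (the values read back are the defaults mentioned above; 980000 is just an illustrative new value):

$ cat /proc/sys/kernel/sched_rt_period_us
1000000
$ cat /proc/sys/kernel/sched_rt_runtime_us
950000
$ echo 980000 > /proc/sys/kernel/sched_rt_runtime_us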

Benchmarks of real-time throttling

The benchmarks were done using an application that repeatedly reads the time base register and keeps track of the largest increment of it:

set cpu affinity
set scheduling class to FIFO/99

get current time (T0) and current value of the time base register (TB0)

for some number of cycles {
    read the time base register
    calculate the diff between the current and the previous value (delta_TB)
    if the diff is the largest so far, update delta_TBmax
}

get current time (T1) and current value of the time base register (TB1)

use T0, T1, TB0 and TB1 to translate delta_TBmax to a delay in microseconds
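
A runnable approximation of this measurement in C is sketched below. It uses clock_gettime(CLOCK_MONOTONIC) instead of a platform-specific time base register, and the CPU number and loop count are arbitrary examples:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void)
{
        struct sched_param sp = { .sched_priority = 99 };
        cpu_set_t set;
        uint64_t prev, max_delta = 0;
        long i;

        CPU_ZERO(&set);
        CPU_SET(1, &set);                        /* pin to one CPU */
        sched_setaffinity(0, sizeof(set), &set);
        sched_setscheduler(0, SCHED_FIFO, &sp);  /* scheduling class FIFO/99 */

        prev = now_ns();
        for (i = 0; i < 100000000L; i++) {       /* "some number of cycles" */
                uint64_t t = now_ns();
                if (t - prev > max_delta)
                        max_delta = t - prev;
                prev = t;
        }
        printf("longest preemption gap: %llu us\n",
               (unsigned long long)(max_delta / 1000));
        return 0;
}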

The longest delay is interpreted as the longest time the task has been preempted. Multiple instances of the test application are started on different cores. The results are summarized below:

Table 3.4 Max Preemption Period for Different Throttling Parameter Settings, Run on P2041RDB

              Throttling parameters [µs]     Max preemption period [µs]
Test Id       rt_runtime   rt_period         CPU 0     CPU 1     CPU 2     CPU 3     Average
1             900 000      1 000 000         101 961   101 951   97 883    101 961   100 939
2 (default)   950 000      1 000 000         48 959    48 958    53 043    53 032    50 998
3             1 000 000    1 000 000         44        56        118       59        69
4             -1           1 000 000         45        144       38        56        71
5             900 000      1 100 000         203 860   203 856   203 862   203 847   203 856
6             95 000       100 000           4 132     4 132     4 121     8 210     5 149
7             500 000      1 000 000         342 445   342 440   338 369   N/A*      341 085 (CPU 0-2)
                                                                                     505 814 (CPU 0-3)

*no RT task was started on this CPU

Some observations can be made:

  • The real-time throttling is evenly distributed among the cores. This will be the case as long as there is real-time work available for all cores; see test 7 for an exception.
  • There is no significant difference between setting rt_runtime to -1 and setting rt_runtime = rt_period (tests 3 and 4).
  • In test 6, the preemption periods are quite far from the expected 5 000 µs. The reason is that the system tick frequency was set to 250 Hz, which implies a granularity of 4 000 µs.
  • Test 7, in which no real-time task is started on CPU 3, is intended to show how the limit is applied to a system with an asymmetric real-time load. It can be observed that the real-time tasks on CPU 0 to CPU 2 use on average 66% of their CPUs, which is higher than 50%, while CPU 3 uses 0%, which keeps the system as a whole in line with the 50% limit.

Isolating the Application

Isolating CPUs

CPU isolation is a topic that has been discussed in Section 2.6, CPU Isolation. When using this method correctly it is possible to enhance throughput and real-time performance by reducing the overhead of interrupt handling. This is beneficial e.g. for applications that require very high throughput, and for device drivers that handle frequent interrupts, such as 10GbE drivers. This section explains how to achieve basic CPU isolation.

Linux has a function called cpuset that associates tasks with CPUs and memory nodes. By adding and configuring cpusets on a system, a user can control what memory and CPU resources a task is allowed to use. It is also possible to tune how the scheduler behaves for different cpusets. The configuration parameter is called CONFIG_CPUSETS.

This section describes how to set up CPU isolation using an imaginary SMP system with 4 CPUs, where CPU0-CPU1 will be used for general purpose tasks and CPU2-CPU3 will be dedicated to real-time tasks. The system is a NUMA (non-uniform memory access) system with CPU0-CPU1 belonging to node 0 and CPU2-CPU3 belonging to node 1. The cpuset used for the general purpose domain is here called nRT (the non-RT set).

Setting up a Partitioned System

This section describes how to set up basic CPU isolation. Two cpusets will be created: one for real-time (RT) tasks and one for general-purpose, non-real-time (nRT) tasks.

As a quicker alternative, the setup done manually in this section can instead be done using a tool that wraps the steps in a script, see Section 3.3.4, The CPU Partitioning Tool - partrt.

Configure the CPU sets

First create the RT and nRT cpusets:

# mount a tmpfs where the cgroup hierarchies can be created
$ mount -t tmpfs none /sys/fs/cgroup
# create the cpuset directory and mount the cpuset cgroup controller
$ mkdir /sys/fs/cgroup/cpuset/
$ mount -t cgroup -o cpuset none /sys/fs/cgroup/cpuset/
# create the RT and nRT partitions
$ mkdir /sys/fs/cgroup/cpuset/rt
$ mkdir /sys/fs/cgroup/cpuset/nrt

Then add the general purpose CPUs to the nRT set:

$ echo 0,1 > /sys/fs/cgroup/cpuset/nrt/cpuset.cpus

Add the real-time CPUs to the RT set:

$ echo 2,3 > /sys/fs/cgroup/cpuset/rt/cpuset.cpus

Finally make the CPUs in the RT set exclusive, i.e. do not let tasks in other sets use them:

$ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.cpu_exclusive

Restart real-time CPUs with CPU hotplug

The Linux kernel's CPU hotplug facility allows CPUs to be added to or removed from a running kernel. CPU hotplug is typically used to isolate failing CPUs, but it can also be used to restart the CPUs in the real-time partition in order to migrate all CPU-specific timers away from them. The CPU hotplug implementation in Linux is based on notifiers, which are callbacks into the subsystems that need to be aware of CPUs coming and going.

CPU hotplug is enabled if the CONFIG_HOTPLUG_CPU parameter is set in the kernel configuration. The file /sys/devices/system/cpu/cpu«id»/online will then be available, making it possible to achieve runtime isolation.

If you choose to restart the hotplug CPUs, you need to recreate the RT partition afterwards. The reason for this double migration is that it might not be possible to take a CPU offline while tasks are running on it. Restart the hotplug CPUs like this:

Turn off all CPUs in the real-time partition, as in the following example for CPU3:

$ echo 0 > /sys/devices/system/cpu/cpu3/online

Then turn them on:

$ echo 1 > /sys/devices/system/cpu/cpu3/online

Configure memory nodes

The RT and nRT domains need to be associated with memory nodes. In NUMA-enabled systems, each domain should use its own memory node. In systems where NUMA is not enabled, both domains should be associated with memory node 0, which is created by default at cpuset creation.

NUMA-enabled

The following sequence sets up a suitable configuration for NUMA-enabled systems.

First associate the nRT set with NUMA node 0:

$ echo 0 > /sys/fs/cgroup/cpuset/nrt/cpuset.mems

Then associate the RT set with NUMA node 1:

$ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.mems

Finally make NUMA node 1 exclusive to the RT cpuset. That is, only tasks in the real-time cpuset will be able to allocate memory from node 1:

$ echo 1 > /sys/fs/cgroup/cpuset/rt/cpuset.mem_exclusive

Note that also tasks in nRT can "access" memory controlled by NUMA node 1.

Not NUMA-enabled

Even if the system is not NUMA-enabled, it is important to configure the memory nodes:

Associate the nRT set with NUMA node 0:

$ echo 0 > /sys/fs/cgroup/cpuset/nrt/cpuset.mems

Associate the RT set with NUMA node 0:

$ echo 0 > /sys/fs/cgroup/cpuset/rt/cpuset.mems

Configure load balancing

Load balancing, i.e. task migration, is an activity that introduces non-deterministic jitter. It is therefore necessary to disable load balancing in the real-time cpuset. As a consequence, the affinity of the threads that should execute on the real-time CPUs must be set explicitly.

Disable load balancing in the root cpuset. This is necessary for settings in the child cpusets to take effect:

$ echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Then disable load balancing in the RT cpuset:

$ echo 0 > /sys/fs/cgroup/cpuset/rt/cpuset.sched_load_balance

Finally enable load balancing in the nRT cpuset:

$ echo 1 > /sys/fs/cgroup/cpuset/nrt/cpuset.sched_load_balance

Move general purpose tasks to the nRT partition

For each task in the root cpuset, run the following command, replacing pid_of_task with the PID of the task (one PID per write):

$ echo pid_of_task > /sys/fs/cgroup/cpuset/nrt/tasks

Note that it is not possible to move all tasks; some tasks must be able to execute on all available CPUs and therefore cannot be moved. All future child tasks created from the nRT partition will also be placed in the nRT partition. That includes tasks started from the current shell, since it should have been moved to nRT as well.
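
A simple way to attempt moving everything at once is a shell loop over the task list of the root cpuset (a sketch; write errors for tasks that cannot be moved can be ignored):

$ for pid in $(cat /sys/fs/cgroup/cpuset/tasks); do echo $pid > /sys/fs/cgroup/cpuset/nrt/tasks; done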

Move IRQs to the general purpose CPUs

Some interrupts are not CPU-bound. Unwanted interrupts introduce jitter and can have serious negative impact on real-time performance. They should be handled on the general purpose CPUs whenever possible. The affinity of these interrupts can be controlled using the /proc file system.

First set the default affinity to CPU0 or CPU1 to make sure that new interrupts won't be handled by the real-time CPUs. The set {CPU0, CPU1} is represented as the bitmask 3 (2^0 + 2^1).

$ echo 3 > /proc/irq/default_smp_affinity

Move IRQs to the nRT partition:

$ echo 3 > /proc/irq/<irq>/smp_affinity

Attempts to move interrupts that cannot be moved will result in an error message on stderr. When it is known which interrupts cannot be moved, consult the hardware and driver documentation to see if this will be an issue. It might be possible to disable the device that causes the interrupt.

Typical interrupts that should and can be moved are: certain timer interrupts, network related interrupts and serial interface interrupts. If there are any interrupts that are part of the real-time application, they should of course be configured to fire in the real-time partition.
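
The active interrupts and the CPUs on which they currently fire can be inspected in /proc/interrupts. A sketch that tries to move every IRQ to the nRT CPUs (immovable interrupts will produce write errors, as noted above):

$ cat /proc/interrupts
$ for irq in /proc/irq/[0-9]*; do echo 3 > $irq/smp_affinity; done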

Execute a task in the real-time partition

Now it is possible to run a real-time task in the real-time partition:

$ echo pid_of_task > /sys/fs/cgroup/cpuset/rt/tasks

Since we have an RT partition with more than one CPU, we might want to choose a specific CPU to run on. Change the task affinity to only include CPU3 in the real-time partition:

$ taskset -p 0x8 pid_of_task

The system should now be partitioned into two sets. The next step to further improve real-time properties is to get rid of the tick interrupts, which is described in Section 3.3.2, Full Dynamic Ticks.

Benchmarks for CPU Isolation

Measurements comparing latencies with and without CPU isolation on a stressed system are presented below. The benchmark program used is cyclictest (https://rt.wiki.kernel.org/index.php/Cyclictest), and the load was generated with the stress application.

This is how the non-isolated benchmark was executed:

$ ./stress -d 4 --hdd-bytes 20M -c 4 -i 4 -m 4 --vm-bytes 15M &
$ ./cyclictest -m -p 99 -l 300000 -q

In the isolated case the following boot parameter is used:

isolcpus=3

This is how the isolated benchmark was executed:

$ partrt create 0x8
$ partrt run nrt ./stress -d 4 --hdd-bytes 20M -c 4 -i 4 -m 4 --vm-bytes 15M &
$ partrt run rt ./cyclictest -m -p 99 -l 300000 -q

Read more about partrt in Section 3.3.4, The CPU Partitioning Tool - partrt. The benchmark was executed on a TCI6638k2k board.

Table 3.5 Benchmarked Latencies in a Partitioned System

                    Latency (µs)
                    Min     Max     Avg
Not isolated        43      1428    68
Isolated (CPU 3)    40      151     59

The worst-case latency thus drops by roughly a factor of 10 when the benchmark runs on an isolated CPU, compared to the results from a non-isolated CPU.


Full Dynamic Ticks

The full dynamic ticks feature is described in Section 2.7, Dynamic Ticks (NO_HZ). Several conditions must be met to really turn off the tick on a CPU while a task is running on it. Some of these conditions are static, such as kernel configuration options, and some are runtime conditions, such as having only one runnable task, no POSIX timers, etc.

The current implementation of full dynamic ticks does not disable the tick entirely, but rather reduces it to 1 Hz, because the system still needs to synchronize every now and then. For those wanting to experiment with turning off the tick entirely, there is a patch from Kevin Hilman that does this.

Note that CPU partitioning will be needed to make sure that only one task is running on a specific CPU, see Section 3.3.1.1, Setting up a Partitioned System.

Prerequisites for full dynamic ticks

To be able to enable full dynamic ticks, the following prerequisites need to be met:

  • Linux kernel 3.10.0 or newer.
  • SMP capable hardware with at least two real cores, excluding hyperthreads if any.
  • No more perf events active on the CPU than the hardware supports.

Kernel configuration

To select at boot time which CPUs should use the full dynamic ticks feature, the kernel configurations listed below need to be enabled. CONFIG_NO_HZ_FULL is the kernel parameter for full dynamic ticks:

CONFIG_NO_HZ_FULL=y
CONFIG_RCU_NOCB_CPU=y

If you want all CPUs except CPU 0 to use the full dynamic ticks feature, enable the following kernel configuration:

CONFIG_NO_HZ_FULL_ALL=y

In this latter case, CONFIG_RCU_NOCB_CPU_ALL should be selected by default. RCU is a synchronization mechanism used by the Linux kernel that uses kernel helper threads to finish updates to shared data. For more information, read the LWN article What is RCU, Fundamentally?.

Kernel boot parameters

Linux has a number of boot parameters that enhances CPU isolation:

  • isolcpus=<cpu set> This parameter specifies a set of CPUs that will be excluded from the Linux scheduler load balancing algorithm. The set is specified as a comma-separated list of CPU numbers or ranges, e.g. "0", "1-2" or "0,3-4". The set specification must not contain any spaces. It is recommended to use this parameter if the target kernel lacks support for CPU hotplug.
  • nohz_full=<cpu set> A list of CPUs for which full dynamic ticks should be enabled. If the kernel configuration CONFIG_NO_HZ_FULL_ALL was given, then this list will be all CPUs except CPU 0, and this boot option is not needed.

To achieve isolation in the RT domain (CPU2 and CPU3), use the following parameters:

isolcpus=2,3 nohz_full=2,3

After the system has booted, check the boot messages to verify that full dynamic ticks was enabled, e.g. using the shell command dmesg. Search for entries similar to the following:

NO_HZ: Full dynticks CPUs: 2-3.

Also make sure there is an entry similar to the following:

Experimental no-CBs CPUs: 0-7.

The no-CB CPU list must include the CPU list for full dynticks.

When choosing the CPU lists on hardware with simulated CPUs, such as hyperthreads, ensure that you include whole physical cores and not half a core. The latter can occur if one hyperthread is in the set of CPUs using the full dynamic ticks feature while the other hyperthread on the same core is not. This can cause problems when pinning interrupts to a CPU, and the two hyperthreads might also affect each other depending on the load.
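
One way to check which hyperthreads share a physical core is to read the CPU topology exported in sysfs, for example for CPU 2:

$ cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list

The listed CPUs are hyperthread siblings on the same core and should be placed on the same side of the partitioning.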

Application considerations

To achieve full dynamic ticks on a CPU, there are some requirements on the application being run on this CPU. First of all, it must not run more than one thread on each CPU. It must also not use any POSIX timers, directly or indirectly. This usually excludes any kernel calls that will access the network, but also excludes a number of other kernel calls. Keeping the kernel calls to a minimum will maximize the likelihood of achieving full dynamic ticks.

The application must utilize the CPU partitioning described in the previous section, which is done by writing the application thread's PID into the file /sys/fs/cgroup/cpuset/rt/tasks (assuming the real-time partition is called "rt"). After this, the shell command taskset can be used to bind the task to a specific CPU in the "rt" partition. Binding can also be done by the application itself using the sched_setaffinity() function declared in sched.h.
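
A minimal sketch of the latter, binding the calling thread to CPU 3 (the CPU number is just an example from the partitioning above):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(3, &set);   /* CPU 3 is part of the "rt" partition in this example */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return 1;
        }
        /* the real-time work runs here, pinned to CPU 3 */
        return 0;
}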

Cost of enabling full dynamic ticks

Full dynamic ticks incurs the following costs:

  • Transitions to and from idle are more expensive. This is inherited from CONFIG_NO_HZ_IDLE since CONFIG_NO_HZ_FULL builds on the same code as CONFIG_NO_HZ_IDLE.
  • Transitions between user space and kernel space are slightly more expensive, since some book-keeping must be done.
  • Scheduling statistics normally involve periodic timeouts, and are therefore implemented slightly differently for CONFIG_NO_HZ_FULL.

Benchmark for full dynamic ticks

Below is an example trace log for a call to the scheduler_tick() function in the kernel:

         0)               |  scheduler_tick() {
         0)               |    _raw_spin_lock() {
         0)   0.113 us    |      add_preempt_count();
         0)   0.830 us    |    }
         0)   0.085 us    |    update_rq_clock.part.72();
         0)   0.146 us    |    __update_cpu_load();
         0)   0.071 us    |    task_tick_idle();
         0)               |    _raw_spin_unlock() {
         0)   0.076 us    |      sub_preempt_count();
         0)   0.577 us    |    }
         0)               |    trigger_load_balance() {
         0)   0.098 us    |      raise_softirq();
         0)   0.065 us    |      idle_cpu();
         0)   1.715 us    |    }
         0)   6.585 us    |  }

As can be seen from the trace above, the tick took more than 6 µs, excluding interrupt overhead. This was a typical value for this target, an HP Compaq Elite 8300 with an Intel Core i5 3570.

Optimizing a Partitioned System

If the measures in the sections above do not provide sufficient real-time properties, this section gives some additional hints on what can be done.

tsc boot parameter - x86 only

The time stamp counter is a per-CPU counter for producing time stamps. Since the counters might drift a bit, Linux will periodically check that they are synchronized. But this periodicity means that the tick might appear despite using full dynamic ticks.

By telling Linux that the counters are reliable, Linux will no longer perform the periodic synchronization. The side effect of this is that the counters may start to drift, something that can be visible in trace logs for example.

Here is an example of how to use it:

isolcpus=2,3 nohz_full=2,3 tsc=reliable

Delay vmstat timer

The vmstat timer is used for collecting virtual memory statistics. The statistics are updated at an interval specified in seconds in /proc/sys/vm/stat_interval. The amount of jitter can be reduced by writing a large value to this file. However, that will not solve the issue with worst-case latency.

Example (1000 seconds):

$ echo 1000 > /proc/sys/vm/stat_interval

There is a kernel patch that removes the periodic statistics collection and replaces it with a solution that only triggers if there is actual activity that needs to be monitored.

BDI writeback affinity

It is possible to configure the affinity of the block device writeback flusher threads. Since block I/O can have a serious negative impact on real-time performance, it should be moved to the general purpose partition. Disable NUMA affinity for the writeback threads:

$ echo 0 > /sys/bus/workqueue/devices/writeback/numa

Set the affinity to only include the general purpose CPUs (CPU0 and CPU1).

$ echo 3 > /sys/bus/workqueue/devices/writeback/cpumask

Real-time throttling in partitioned system

Real-time throttling (RTT) is a kernel feature that limits the amount of CPU time given to Linux tasks with real-time priority. If any process that executes on an isolated CPU runs with real-time priority, the CPU will get interrupts with the interval specified in /proc/sys/kernel/sched_rt_period_us. If the system is configured with CONFIG_NO_HZ_FULL and a real-time process executes on a CONFIG_NO_HZ_FULL CPU, note that real-time throttling will cause the kernel to schedule extra ticks. See Section 2.3, Real-Time Throttling and Section 3.2.4, Optimize Real-Time Throttling for more information.

Disable real-time throttling by the following command:

$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us

Disable power management

The CPU frequency governor causes jitter because it periodically monitors the CPUs. The actual activity of changing the frequency can also have a serious impact. Disable the frequency governor with the following kernel configuration: CONFIG_CPU_FREQ=n.

An alternative is to change the governor policy to performance at runtime. The advantage is that the policy can be set per CPU:

$ echo "performance" > /sys/devices/system/cpu/cpu<cpu_id>/cpufreq/scaling_governor

An example based on the RT partition:

$ echo "performance" > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
$ echo "performance" > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

*Note that this could damage your hardware because of overheating. Make sure that you understand what works for your specific hardware.

Machine check - x86 only

The x86 architecture has a periodic check for corrected machine check errors (MCE). The periodic machine check requires a timer that causes unwanted jitter. The check can be disabled, but note that this might lead to silently corrected MCEs going unlogged. Turn it off on the RT CPUs by doing the following for each CPU in the real-time partition:

$ echo 0 >/sys/devices/system/machinecheck/machinecheck<cpu>/check_interval
$ echo 0 >/sys/devices/system/machinecheck/machinecheck2/check_interval
$ echo 0 >/sys/devices/system/machinecheck/machinecheck3/check_interval

It has been observed that it is enough to disable this for CPU0 only; it will then be disabled on all CPUs.

Disable the watchdog

The watchdog timer is used to detect and recover from software faults. It requires a regular timer interrupt, which is a jitter source that can be removed, at the obvious cost of reduced error detection.

The watchdog can be disabled at compile time by setting CONFIG_LOCKUP_DETECTOR=n, or at runtime as follows:

$ echo 0 > /proc/sys/kernel/watchdog

Disabling the NMI Watchdog - x86 only

Disable the NMI watchdog, a debugging feature for catching hardware hangs that cause a kernel panic. On some systems it can generate a lot of interrupts, causing a noticeable increase in power usage:

$ echo 0 > /proc/sys/kernel/nmi_watchdog

Increase flush time to disk

To make write-backs of dirty memory pages occur less often than the default, increase the interval, specified in centiseconds (1500 corresponds to 15 seconds):

$ echo 1500 > /proc/sys/vm/dirty_writeback_centisecs

Disable tick maximum deferment

To get a fully tickless configuration, this patch should be included. It makes it possible to maximize the tick interval by setting the sched_tick_max_deferment variable in the debugfs filesystem. To disable the maximum deferment, set it to -1:

$ echo -1 > /sys/kernel/debug/sched_tick_max_deferment

Network queues affinity

Linux can distribute packet processing over different CPUs in an SMP system. This handling can also create timers on specific CPUs; one example is the ARP timer management, based on neigh_timer. There are a couple of ways to minimize the effect of packet processing on the real-time CPUs, such as migrating all the timers to the non-real-time partition if possible, or specifying the affinity of the network queues on architectures that support it.

If the application needs the packets to be received only in the nRT partition then the affinity should be set as follows:

$ echo <nRT CPU mask> > /sys/class/net/<interface>/queues/tx-<n>/xps_cpus
$ echo <nRT CPU mask> > /sys/class/net/<interface>/queues/rx-<n>/rps_cpus

Example for TCI6638k2k board:

$ echo 1 > /sys/class/net/eth1/queues/tx-0/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-1/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-2/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-3/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-4/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-5/xps_cpus
$ echo 1 > /sys/class/net/eth1/queues/tx-6/xps_cpus 

$ echo 1 > /sys/class/net/eth1/queues/rx-0/rps_cpus

*Note that if there is a need for network traffic on both partitions, the affinity should not be set. In this case, the neigh_timer can be handled by any CPU, including those in the RT partition.

The CPU Partitioning Tool - partrt

Enea Linux includes a tool, partrt, for dividing an SMP Linux system into partitions. Using the methods described in the previous sections, this tool provides an easy-to-use interface for setting up CPU isolation. The tool can be downloaded from https://github.com/OpenEneaLinux/rt-tools.

Usage Examples:

Create the RT partition on CPU 2 and CPU 3. Default names will be "rt" for the real-time partition and "nrt" for the general purpose partition:

$ partrt create 0xc

Or, on the NUMA system assumed in the previous section:

$ partrt create -n 1

Show the recently created partitions like this:

$ partrt list

Run cyclictest on CPU 3 in the RT partition:

$ partrt run -c 0x8 rt cyclictest -n -i 10000 -l 10000

Move cyclictest to the nRT partition:

$ partrt move «pid of cyclictest» nrt

Undo partitioning (restore environment):

$ partrt undo

See full partrt help text like this:

$ partrt -h

Further Actions if Needed

If the attempts in the previous sections to improve the real-time performance are not enough, consider those described in this section.

User-Space Runtime

As mentioned in several places in this manual, the standard Linux kernel is not real-time safe. Read more about this in Chapter 2, Basic Linux from a Real-Time Perspective.

Two proposed solutions to tackle real-time in Linux are PREEMPT_RT and CPU isolation, see Section 2.4, The PREEMPT_RT Patch and Section 3.3.1, Isolating CPUs. Both solutions have consequences. PREEMPT_RT has throughput issues, and CPU isolation on a standard kernel allows only a subset of libc to be used if real-time behavior is to be maintained.

One approach that can improve throughput and real-time properties is to not use the kernel at all. Instead a runtime entirely in user space can be used.

Below are some issues that appear with PREEMPT_RT and/or CPU isolation:

  • Linux kernel adds too much overhead
    • For application writers that move from a bare metal or RTOS environment to Linux, the overhead of the Linux API might be unacceptable. PREEMPT_RT adds even more overhead.
  • Glibc does not provide a real-time deterministic API
    • CPU isolation can be a good way to get real-time performance on a standard Linux kernel. The big problem is that the Linux API is not real-time safe and has to be avoided whenever real-time determinism is required. This limits the application developer, since standard APIs and programming models can not be used.
    • IPC between real-time tasks and general purpose tasks is another issue. Most IPC implementations on Linux rely on system calls and can introduce real-time problems if used on a standard kernel. Some IPC implementations might be unsafe to use on PREEMPT_RT, depending on how the IPC implementation handles dynamic memory. This subject is discussed in Chapter 4, Designing Real-Time Applications.

Running a real-time deterministic runtime environment completely in user space can be a good way to increase determinism and throughput. One example is Enea LWRT which provides deterministic multithreading, memory management and LINX IPC.

A specialized user space runtime system can solve the mentioned problems:

  • Decrease overhead by avoiding the Linux kernel
    • The runtime system can implement lightweight threads in user space. This can greatly decrease overhead and increase determinism. Voluntary context switches can be done entirely in user space.
    • A comparison between LWRT lightweight user-space threads and standard Linux pthreads can be seen below. The benchmark is done on a Texas Instruments TCI6638k2k (1.2 GHz) board.

Table 3.6 Task Switch Latency, LWRT vs. Pthreads [µs]

                  Latency (µs)
                  Min     Max     Avg
Linux pthread     4.25    31.35   8.58
LWRT process      0.26    1.88    0.41

*The table shows that a pthread context switch has roughly a factor of 20 higher overhead than an LWRT process switch in the average case (8.58 µs vs. 0.41 µs). Applications that use a large number of threads, such as telecommunication applications, could thus benefit from a design based on LWRT instead of pthreads.

  • Replace glibc API with a real-time deterministic API
    • A real-time safe user-space runtime can provide a deterministic API that can replace the non-deterministic parts of the glibc API. Typical non-deterministic calls that can be replaced are related to dynamic memory management, multi-threading, and IPC.

Use an RTOS

For completeness, it should be mentioned that using an RTOS may be the best alternative if all ways to improve the real-time properties in a Linux system have been exhausted without reaching acceptable performance. Doing that is outside the scope of this manual, but an example is using one of the RTOS products provided by Enea.