Hardware Aspects


The real-time properties of a system do not depend only on the operating system; the hardware is also important. This chapter briefly discusses what to consider regarding real-time capable hardware.

Real-time capable hardware requires, at a minimum, that resources can be accessed within a deterministic time interval. If there is a risk of resource congestion, the hardware must implement deterministic arbitration of concurrent accesses, guaranteeing that no requester is starved. Another typical hardware requirement in a real-time system is that the CPUs have access to reliable high-resolution timers.
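
A quick sanity check on Linux is to inspect which clock source the kernel has selected; on real-time capable hardware this is expected to be a reliable high-resolution source, such as the TSC on x86 or the architected timer on ARM. The sysfs paths below are standard on mainline kernels:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource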

CPU

Modern CPUs use a number of techniques to speed up code execution, such as instruction pipelines, out-of-order execution, and branch prediction. All of these improve the average execution speed, but because the time taken by a given instruction sequence then depends on the dynamic state of these mechanisms, they also introduce latency jitter.

Shared Resources

Designing real-time applications for SMP systems is more complex than designing for single-CPU systems. On a typical SMP system, all hardware resources, such as I/O devices and memory, are shared by the CPUs. Since the resources are shared, a low-priority task can starve a high-priority task, i.e. cause a priority inversion.

There are methods for scheduling multiple access to I/O devices, see Section 2.9, I/O Scheduling, but shared memory is trickier to manage. There are currently no good tools for controlling multiple access to shared memory comparable to those available for I/O devices. Arbitration of shared memory accesses is largely implemented in hardware and is therefore mostly hidden from a software perspective.

A proper real-time design should consider how to deal with shared memory and memory congestion, how hyper-threading affects real-time performance, how to use NUMA to improve execution time, and how to decrease the impact of shared resource congestion. All of these topics are covered in this section.

For a deeper study of resource sharing on multi-core devices, see the article Multicore In Real-Time Systems – Temporal Isolation Challenges Due To Shared Resources.

Shared Memory and Memory Congestion

SMP systems typically share a system bus, cache, MMU, and one or several memory channels. An application that puts a heavy load on a shared resource can significantly degrade performance for the other CPUs. Not only CPUs but also DMA-capable devices add to the congestion.

Memory congestion can have various sources. Chapter 4, Designing Real-Time Applications, covers the software-driven congestion caused by a shared-memory programming model. This section covers the impact of hardware resource congestion due to shared memory and how it affects the worst-case execution time.

A real-time application designer will typically want to test the system to find a worst-case execution time. The methods below describe how to stress the shared memory and measure an approximate worst-case execution time. They are suitable for soft and firm real-time systems.

A pessimistic indication of the impact of shared memory congestion on worst-case latency can be obtained as follows:

1. Turn off the caches. On some architectures it is possible to disable the caches. Doing so simulates 100% cache misses, making it possible to measure the worst-case impact of cache misses caused by e.g. congestion on the memory bus and the off-chip memory.

Start a memory load on each general-purpose CPU with the command specified below. This indicates the impact of memory bus and off-chip memory congestion on the worst-case execution time, and it gives a good indication even if the caches cannot be disabled.

The application used to generate the load is called stress and is described in Appendix B, Benchmark Details. Start the stress application on each non-real-time CPU. Use memory stress with a stride equal to the cache line size of the target architecture, and make sure that the amount of memory allocated per worker (--vm-bytes) is larger than the last-level shared cache.

taskset <GP-CPU-n> ./stress --vm 1 --vm-bytes <LAST_LEVEL_CACHE_SIZE> --vm-stride <CACHE_LINE_SIZE>

This will likely thrash the shared cache and cause a minimal number of cache hits for the real-time application.

2. If the impact of MMU congestion is of interest, repeat step 1 but use a stride equal to the system page size:

taskset <GP-CPU-n> ./stress --vm 1 --vm-bytes <LAST_LEVEL_CACHE_SIZE> --vm-stride <PAGE_SIZE>
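
While the stress load is running on the general-purpose CPUs, the worst-case latency on the real-time CPU can be estimated with, for example, cyclictest. This is a sketch; <RT-CPU> is the CPU reserved for real-time work, the priority is an arbitrary high real-time priority, -n uses clock_nanosleep, and -m locks the measurement thread's memory:

cyclictest -a <RT-CPU> -t 1 -p 99 -n -m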

The impact of the generated load in the above examples will vary significantly depending on the CPU clock speed, memory clock speed, cache size, coherency algorithm and cache/memory hierarchy. Changing these hardware parameters will create different congestion thresholds. Processor architectures that cannot guarantee CPU access to a specific bus or device within a deterministic amount of time cannot be used for real-time applications.

Note: For hard real-time systems, Linux is probably not a suitable operating system. If you choose Linux anyway, a static analysis should be done instead of using the methods above. The static analysis is needed to calculate a theoretical worst-case execution time, based on the number of clock cycles in a worst-case scenario and taking hardware latency into account.

Hyper-Threading

Hyper-threading means that there are multiple virtual CPUs per physical CPU: two or more hardware threads share the execution resources of one core. A low-priority process and a high-priority process can therefore run on separate virtual CPUs belonging to the same physical CPU, in which case the low-priority process can degrade the performance of the high-priority process. It is recommended to disable hyper-threading when real-time performance is required. Another approach is to make sure that each real-time task has exclusive access to a physical CPU.
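
On recent kernels (4.19 and later) the SMT state can be inspected via sysfs and, assuming the control file is writable on the platform, turned off at runtime. Disabling hyper-threading in the firmware setup is the more common alternative:

cat /sys/devices/system/cpu/smt/active
echo off > /sys/devices/system/cpu/smt/control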

NUMA

The negative impact on worst-case execution time can largely be eliminated if the target hardware supports NUMA (non-uniform memory access). By using the Linux cpuset feature it is easy to give the real-time application its own memory node; read more about this in Section 3.3.1, Isolating CPUs. Note that memory congestion will still occur if the real-time application runs on multiple CPUs in the real-time partition, but that should be more manageable.
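
As a minimal sketch, numactl can be used to bind both execution and memory allocation to a single NUMA node. Here node 1 is assumed to be the node dedicated to the real-time partition, and rt_app is a placeholder for the real-time application:

numactl --hardware                                        # list available nodes and their memory
numactl --cpunodebind=1 --membind=1 chrt -f 80 ./rt_app   # run with SCHED_FIFO, CPUs and memory on node 1 only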

Decrease Impact of Shared Resource Congestion

Below is a list of suggestions on how to decrease the impact of shared resource congestion.

  1. If the platform has NUMA: Dedicate one NUMA node to the RT application. See Section 3.3.1, Isolating CPUs.
  2. Disable hyper-threading. If that isn't possible, use CPU isolation with static affinity so that only one real-time task executes per physical CPU (see the sketch after this list).
  3. Disable the cache if the architecture allows it, to avoid the indeterminism added by cache misses. If this step is necessary, it may indicate that Linux, or the hardware platform, is unsuitable for the application.
  4. On some architectures it might be possible to lock the real-time application's pages into the cache. Consult the processor manual and, if available, the SDK manual for the hardware platform.
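
As a sketch of item 2, the virtual CPUs that share a physical core can be listed via sysfs, and the real-time task can then be pinned to one of them with a real-time scheduling policy, leaving its siblings unused. CPU 2, priority 80, and rt_app are placeholder values:

cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list   # virtual CPUs sharing a core with CPU 2
taskset -c 2 chrt -f 80 ./rt_app                                 # pin to CPU 2 with SCHED_FIFO priority 80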

The System Management Mode (x86)

The x86 architecture has an operating mode called System Management Mode, also known as SMM. It is "a special-purpose operating mode provided for handling system-wide functions like power management, system hardware control, or proprietary OEM designed code."

SMM is entered via an event called a system management interrupt (SMI). SMM/SMI has a higher priority than the operating system and will therefore affect latency. It cannot be disabled by the OS, and even if there might be other ways to disable it, it should probably be left enabled, since it also handles thermal protection. Consequently, there is not much that can be done about it except adding enough margin to tolerate it.
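
On Intel processors, a rough idea of how often SMIs occur can be obtained by reading the SMI count MSR (0x34, MSR_SMI_COUNT) with rdmsr from msr-tools; this sketch assumes the msr kernel module is available. The turbostat utility also reports an SMI column per measurement interval:

modprobe msr        # expose MSRs through /dev/cpu/*/msr
rdmsr -a 0x34       # MSR_SMI_COUNT, read on every CPU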