Designing Real-Time Applications

Optimizing the Linux system itself for real-time is only half of the solution. The applications themselves must also be designed to preserve real-time properties. This section provides some hints on how to do this.

The C function library, libc, is part of the runtime for applications under Linux. It provides basic facilities like fopen, malloc, printf, and exit. The C library provides all functions specified by the ISO C standard, and usually additional POSIX-specific functions as well. GNU libc is the most widely used implementation, but there are alternatives like newlib and uClibc, all of them supporting the POSIX.1b (IEEE 1003.1b) real-time extensions.

libc provides a powerful toolbox with many useful and frequently used features, but from a real-time perspective it must be used with some caution. The first problem to deal with is the level of real-time support in the libc code itself. The code is often considered proven in use and is therefore used without deeper analysis. This is probably a valid assumption for a typical Linux system, where average performance is more important than worst-case behavior, but for real-time systems it can be an unjustified one. It need not be a major issue, however, since the source code is available for analysis. Another issue is that the functions in libc may use system calls that interact with the kernel. Depending on the kernel preemption model, this may lead to the execution of non-preemptible sections. The kernel can also decide to execute other work, such as soft IRQs, on its way back to user space.

Further application design challenges are:

  • Memory handling - memory access latency, shared memory
  • Multi-threaded applications - task switch overhead, locks
  • I/O - blocking, driver-induced jitter

For further reading, see the RT-Preempt How-to wiki.

Application Memory Handling

Linux applications access memory by using virtual addresses. Each virtual address translates into a physical address with the help of hardware that uses translation tables. This feature makes it possible to address more virtual memory than there is physical memory available, assuming that not all applications need all their allocated memory at the same time.

Allocating memory will by default only reserve a virtual memory range. The first access to this newly allocated range causes a page fault, a hardware exception indicating that the translation table does not contain a mapping for the addressed virtual memory. The page fault is handled by the Linux kernel, which sets up the virtual-to-physical memory mapping, after which program execution continues.

Most architectures use a cache for the translation table, called the translation lookaside buffer, TLB, which speeds up virtual-to-physical memory translations. A translation causes latency jitter when the looked-up address is not in the TLB, which is referred to as a TLB miss.

Virtual memory makes it possible for Linux to keep memory content in a file, e.g. by loading an executed binary on demand or by swapping out seldom used memory. This is called demand paging, and it can cause unbounded latency since it involves accessing a file or a device. A real-time application therefore needs to disable demand paging by using the mlockall() function call:

mlockall(MCL_CURRENT | MCL_FUTURE)

MCL_CURRENT makes sure that all currently allocated physical memory has the expected content and that the translation table contains the needed virtual-to-physical mappings. This includes code, global variables, shared libraries, shared memory, stack, and heap. MCL_FUTURE means that, for future allocations, the translation table update and the initialization of physical memory, if applicable, are done at allocation time rather than on first access. Future allocations include stack growth, heap growth, and calls like shm_open(), malloc(), or mmap().
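
As a minimal sketch, the locking can be done early in main(), before any real-time deadlines apply. The error handling, the STACK_RESERVE size, and the prefault_stack() helper below are illustrative assumptions, not part of any standard API:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define STACK_RESERVE (64 * 1024)  /* assumed worst-case stack depth */

static void prefault_stack(void)
{
    /* Touch the stack down to the assumed maximum depth, one page at
       a time (4 KiB pages assumed), so that it is already grown and
       locked when the real-time work starts. */
    volatile char dummy[STACK_RESERVE];
    for (unsigned i = 0; i < sizeof(dummy); i += 4096)
        dummy[i] = 0;
}

int main(void)
{
    /* Lock all current and future mappings into RAM. This fails if,
       e.g., RLIMIT_MEMLOCK is too low or physical memory runs out. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        exit(EXIT_FAILURE);
    }
    prefault_stack();

    /* ... real-time work ... */
    return 0;
}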

When using mlockall(), there is a risk that Linux allows less memory to be allocated, since all allocated memory must be immediately backed by physical memory.

Note that a call to malloc() can still show large latency variation, since the translation table update is now done within the malloc() call instead of on first access. In addition, malloc() may or may not need to request more virtual memory from the kernel. It is therefore recommended to allocate all needed dynamic memory at start, to avoid this jitter.

If dynamic memory allocation needs to be done within the real-time application, some actions can be taken to mitigate the malloc() latency variation. glibc has a function called mallopt() which can be used to change the behavior of malloc(). Two interesting options are M_MMAP_MAX and M_TRIM_THRESHOLD, described in the list below; a usage sketch follows after the list.

  • M_MMAP_MAX controls the maximum number of allocations that are served by mmap() rather than sbrk(); setting it to 0 makes malloc() use sbrk() only. The advantage with mmap() is that the memory can be returned to the system as soon as free() is called, since each allocation is separate. The disadvantage is that mmap() is slower than sbrk().
  • M_TRIM_THRESHOLD controls when an sbrk()-allocated memory area, where a large contiguous area at the top has been freed, is returned to the system. Setting it to -1 turns trimming off, so the application never releases memory back to the system once it has been allocated, which can somewhat improve the real-time properties.
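
A usage sketch, assuming glibc's mallopt() interface; the warm-up allocation size is an arbitrary illustration:

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    /* Never use mmap() for allocations; everything is served from
       the sbrk() heap, so freed memory can be reused directly. */
    mallopt(M_MMAP_MAX, 0);

    /* Never trim the heap, i.e. never return freed memory to the
       kernel, so the pages stay mapped (and locked) once faulted in. */
    mallopt(M_TRIM_THRESHOLD, -1);

    /* Illustrative warm-up: grow the heap once at start so that
       later allocations are served from already mapped memory. */
    void *warmup = malloc(1024 * 1024);
    free(warmup);

    /* ... real-time work ... */
    return 0;
}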

The following table shows measurements of malloc() call time and memory access time, made on a system with CPU isolation.

Table 4.1 malloc() and Memory Access Measurements on TCI6638k2k Board

Partition  Scenario                                         Operation     Min   Max    Avg
---------  -----------------------------------------------  -----------  ----  ----  -----
RT         normal                                           Mem access      4   178  108.8
RT         normal                                           Malloc call     4   331   19.6
RT         mlockall()                                       Mem access      4    13    5.0
RT         mlockall()                                       Malloc call     4   515   96.5
RT         mlockall(), M_MMAP_MAX=0, M_TRIM_THRESHOLD=-1    Mem access      4    11    4.8
RT         mlockall(), M_MMAP_MAX=0, M_TRIM_THRESHOLD=-1    Malloc call     4   417    5.7
GP         normal                                           Mem access      4  1384  109.0
GP         normal                                           Malloc call     7   164   19.7
GP         mlockall()                                       Mem access      4   125    5.1
GP         mlockall()                                       Malloc call     4  1607   95.4
GP         mlockall(), M_MMAP_MAX=0, M_TRIM_THRESHOLD=-1    Mem access      4    91    4.8
GP         mlockall(), M_MMAP_MAX=0, M_TRIM_THRESHOLD=-1    Malloc call     4   463    5.7

All latency values are in µs.

The many 4 µs values in the minimum latency column are due to the timer resolution, which does not allow smaller values to be measured. For a real-time application, the maximum latency is of the greatest importance. The measurements show that with mlockall(), memory access becomes more predictable, while malloc() always has a latency that is hard to predict. Letting the real-time tasks run on dedicated CPUs, here referred to as an RT partition, also results in lower latency jitter.

An alternative is to use a heap allocator other than the one provided by glibc, one better adapted to embedded and real-time requirements. Here are two examples:

TLSF - Two-level segregated fit
O(1) complexity for allocation and free calls and only 4 bytes overhead per allocation. It can be slower than the glibc default allocator on average, but should have better worst-case behavior.
HF - Half-fit
O(1) complexity for allocation and free, using bitmaps and segregated lists.

glibc itself uses ptmalloc, which is a more SMP-friendly version of dlmalloc.

Since memory is slow compared to the CPU, there is usually a small but fast memory area called cache between the CPU and the main memory. When a non-cached memory location is accessed, the CPU must wait until the cache has been updated. Accessing a memory location that has not been used for a long time therefore takes more time than accessing a recently used one.

One obvious way to address the cache problem is to disable the cache, provided the architecture supports doing so. While this helps the worst-case latency, it makes the average latency horrible. An alternative is to use CPU partitioning to let the real-time application run alone on a CPU, and to make sure that the application does not access more memory than fits in the cache.

Another consequence of the per-CPU caches becomes apparent when memory is shared among tasks running on different CPUs, either by threads in the same process or through the shm_open() function call. When such memory is updated by one task and read by another task on another CPU, the memory content must be copied between the caches. This copy has a minimum size, called a cache line, which on many architectures is 32 bytes. A one-byte write can therefore cause 32 bytes to be copied, 32 times more than might be expected.
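
As an illustration, frequently updated per-task data can be placed on separate cache lines so that tasks on different CPUs do not share a line by accident. The 32-byte line size follows the figure above and is an assumption; real code should use the line size of the actual architecture:

#include <stdint.h>

#define CACHE_LINE 32  /* assumed line size; check the actual architecture */

/* Give each task's counter its own cache line, so tasks on different
   CPUs can update their own counter without the line being copied
   back and forth between the CPU caches. */
struct per_cpu_counter {
    _Alignas(CACHE_LINE) uint64_t count;
};

static struct per_cpu_counter counters[4];  /* one per CPU, illustrative */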

Summary about application memory handling:

  • Use mlockall(MCL_CURRENT | MCL_FUTURE) to lower memory access latency jitter.
  • Pre-allocate all needed dynamic memory, since the time needed for memory allocation is hard to predict.
  • Avoid sparse memory accesses to better utilize hardware caches.
  • Be careful about sharing memory between tasks running on different CPUs, as each access may end up copying 32 bytes on many architectures.
  • Consider using the M_MMAP_MAX and M_TRIM_THRESHOLD options in mallopt() in case dynamic memory cannot be pre-allocated.
  • Consider using the TLSF heap allocator if lower worst-case latency for malloc() is needed.

Multi-Threaded Applications

There are two driving forces to make applications multi-threaded:

  1. Easier to design event-driven applications.
  2. Make use of concurrency offered by multicore systems.

A rule of thumb when designing a multi-threaded application is to assign one thread to each event source, each output destination, and each state machine in the controller logic. This usually leads to many threads, which can make the application spend a lot of time on task switches between them. A user-space scheduler can solve that problem, but then the threads cannot run on different cores. Another option is to merge threads, keeping work with higher real-time requirements in separate threads from work with lower real-time requirements.

To make the application truly concurrent, a good choice is to use POSIX threads, pthreads. Each pthread can run on its own core, and all pthreads within the same process share memory, making inter-thread communication easy. Be aware, however, that if the threads need a lot of synchronization, the application is no longer concurrent despite using several CPUs. In that case it might even become slower than a single-threaded design.
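
A sketch of creating a pthread with an explicit real-time scheduling policy; the priority value 80 and the rt_work() function are arbitrary illustrations, and setting SCHED_FIFO requires appropriate privileges, e.g. a sufficient RLIMIT_RTPRIO:

#include <pthread.h>
#include <sched.h>
#include <stddef.h>

static void *rt_work(void *arg)
{
    (void)arg;
    /* ... real-time processing ... */
    return NULL;
}

int start_rt_thread(pthread_t *tid)
{
    pthread_attr_t attr;
    struct sched_param param = { .sched_priority = 80 };
    int ret;

    pthread_attr_init(&attr);
    /* Do not inherit the creator's policy; use the one set below. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &param);

    ret = pthread_create(tid, &attr, rt_work, NULL);
    pthread_attr_destroy(&attr);
    return ret;
}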

Asynchronous message passing can solve some of the synchronization problems, especially in use cases where a thread with high real-time requirements can delegate work to a thread with lower real-time requirements. An example of such a mechanism is the POSIX functions mq_send() and mq_receive(). One challenge with message passing is how to handle flow control, i.e. what to do when the queue is full.
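
A sketch of delegating work through a POSIX message queue; the queue name, depth, and message size are illustrative assumptions. Opening the sending side with O_NONBLOCK makes mq_send() fail with EAGAIN instead of blocking when the queue is full, which is one way to implement flow control. On glibc, link with -lrt:

#include <fcntl.h>
#include <mqueue.h>
#include <string.h>

#define QUEUE_NAME "/rt_work"  /* illustrative queue name */

/* Open the sending side non-blocking: a full queue then makes
   mq_send() fail with EAGAIN instead of blocking the sender. */
mqd_t open_sender(void)
{
    struct mq_attr attr = {
        .mq_maxmsg  = 16,   /* illustrative queue depth */
        .mq_msgsize = 128   /* illustrative message size */
    };

    return mq_open(QUEUE_NAME, O_CREAT | O_WRONLY | O_NONBLOCK,
                   0600, &attr);
}

/* Returns -1 with errno == EAGAIN when the queue is full; the caller
   decides whether to drop the message or retry later. */
int send_work(mqd_t mq, const char *msg)
{
    return mq_send(mq, msg, strlen(msg) + 1, 0);
}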

Message passing as well as other synchronization mechanisms, e.g. the POSIX pthread_mutex_lock(), can suffer from priority inversion where a high priority task is forced to wait for a low priority task. For pthread_mutex_lock() it is possible to enable priority inheritance to avoid this problem, while for message passing this needs to be considered when designing the messaging protocol.

Example of how to set priority inheritance for a pthread mutex:

pthread_mutex_t mutex;
pthread_mutexattr_t attr;

/* Enable priority inheritance before initializing the mutex */
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&mutex, &attr);
pthread_mutexattr_destroy(&attr);  /* the attribute object is no longer needed */

When using gcc, you need to compile the code with options "-pthread -D_GNU_SOURCE".

Summary about multi-threaded applications:

  • When designing an event-driven application, it is often easier to use multiple threads rather than callback functions.
  • Compared to single-threaded applications, multi-threaded applications using pthreads can scale better on multicore systems.
  • In a multi-threaded application where the threads require heavy synchronization, the threads will spend most of the time waiting for each other. This will make the application less scalable.
  • A user-space scheduler, compared to scheduling in the kernel, will allow more synchronization and more threads before efficiency goes down.
  • Properly designed message passing applications can make the synchronization less painful.
  • When synchronizing threads, beware of priority inversion. A message passing protocol needs proper design, while POSIX mutexes can use priority inheritance to avoid priority inversion.

Application I/O

I/O accesses usually end up as system calls accessing a kernel driver. The driver then accesses hardware, which can have unbounded latency. By default, the calling task blocks during this period. It is, however, possible to perform asynchronous I/O to avoid being blocked. See the aio(7) man page for more information about asynchronous I/O.
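
A minimal sketch of an asynchronous write using the POSIX AIO interface from aio(7); the helper functions are illustrative, and on glibc the program is linked with -lrt:

#include <aio.h>
#include <errno.h>
#include <string.h>

/* Start an asynchronous write; the call returns immediately and the
   actual I/O is performed in the background. */
int start_async_write(int fd, char *buf, size_t len, struct aiocb *cb)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    return aio_write(cb);
}

/* Poll for completion without blocking; aio_error() returns
   EINPROGRESS until the request has finished. */
int async_write_done(const struct aiocb *cb)
{
    return aio_error(cb) != EINPROGRESS;
}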

The driver can add deferred work to a work queue or to soft IRQs, which can make it hard to predict latencies for a real-time application.

Furthermore, the driver might need timeouts. When the full dynamic ticks feature is used, such timeouts may cause tick interrupts to be triggered. One possible solution is to delegate the I/O calls to a task running on another CPU than the real-time tasks. The latencies caused by deferred driver work then only affect the delegated task, not the real-time task.
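
A sketch of pinning such a delegated I/O task to a housekeeping CPU, here CPU 0 as an arbitrary example, using the GNU-specific pthread_setaffinity_np(), hence the -D_GNU_SOURCE option mentioned earlier:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to CPU 0, assumed here to be a housekeeping
   CPU outside the real-time partition. */
int pin_to_housekeeping_cpu(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}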

Note that these I/O concerns also apply to text written to stdout or stderr, and to text read from stdin. If a device driver writes a diagnostic message from the kernel, e.g. using the kernel function printk(), the same concerns apply to that message as well.

Summary about application I/O:

  • Delegate I/O to a task running on another CPU than the real-time task.
  • If delegation is not possible, asynchronous I/O might help. See the aio(7) man page.