Appendix A - Optimizing Example - P2041

From linuxrealtime
Revision as of 08:00, 9 October 2015 by Lejo (Talk | contribs)


When someone states the goal to "optimize a specific Linux target for real-time" and provides a benchmark result, it is very important to be clear on what capabilities the measured system actually has. Benchmark results may be interesting to read, but they are only valid and relevant if they are somewhat comparable with each other and if the setup is relevant for real-world use cases.

This appendix states real-time optimization as its goal, but it actually aims for both high throughput and low worst-case latency. The use case we focus on is an embedded control system within the communications domain, which normally has fairly demanding soft real-time requirements as well as performance requirements.

In Reality, "Real-Time" is Not Only About Interrupt Latency

In a real-time system, the characteristic behavior of the entire operating system is very important. To start with, a deterministic response time from an external event until the application is invoked is what we normally refer to when talking about real-time characteristics of an operating system. This implies not only the interrupt latency, but also the event chain until the application gets scheduled.

Since a chain is not stronger than its weakest link, it is also important to provide a deterministic runtime environment for the entire application execution flow, so that it can respond within the specified operational deadline at system level. This implies that the task scheduler and the resource handling API in the OS must also behave in a deterministic way.

When designing a system for high-throughput performance, the goal is to keep down the average latency, while designing a real-time system aims to keep the worst-case latency under a specified limit. As a consequence, the design of a system required to be both high-performing and real-time capable must take both the average and maximum latency into account, and this is something we will strive for in this application note.
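The difference between the two design goals can be made concrete with a few numbers. The following is a minimal sketch using made-up latency samples (they are not measurements from this appendix); the function name latency_stats is hypothetical. It shows how a single outlier barely moves the average while it completely determines the worst case.

```shell
#!/bin/sh
# latency_stats: read one latency sample (in microseconds) per line on
# stdin and print "avg=<mean> max=<worst>". The mean is truncated to an
# integer.
latency_stats() {
    awk '
        NF == 0 { next }                       # skip blank lines
        { sum += $1; n++; if (n == 1 || $1 > max) max = $1 }
        END { if (n > 0) printf "avg=%d max=%d\n", int(sum / n), max }
    '
}

# Hypothetical samples: nine fast responses and one outlier.
printf '%s\n' 5 5 5 5 5 5 5 5 5 155 | latency_stats
# prints: avg=20 max=155
```

A throughput-oriented design would be judged on the 20 µs figure, a real-time design on the 155 µs figure.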

Performance and Latency Tuning Strategy

The selected target for this exercise is the Enea Linux PowerPC kernel for the p2041rdb board. The p2041rdb kernel (and most other Enea Linux PPC kernels) is built in two different ways:

  • A high-performance kernel image that is optimized for high throughput performance and low footprint, for example intended for IP benchmarking. This image, and its corresponding kernel configuration file, is tagged "RELEASE".
  • A demo/debug kernel image that aims to be fully configured with all kinds of online and offline debug capabilities. The demo/debug image is not tagged. It is also the one that you can modify with the command bitbake -c menuconfig virtual/kernel and rebuild with the command bitbake virtual/kernel.

Both kernels are configured as a Low-Latency Desktop (LLD) kernel, i.e. the most preemptive standard kernel variant, selected by CONFIG_PREEMPT.

The tuning strategy is to go through a number of steps, and for each step briefly describe the configuration level and the latency benchmark result:

  1. The first attempt uses the default "demo/debug" kernel image, mainly to highlight how much the end result differs because of both debug overhead and selective tuning of important parameters.
  2. The performance-optimized "RELEASE" kernel image, which is clearly configured for speed. However, the results will show that it is not optimized for latency, and that additional tuning can improve both performance and worst-case latency.
  3. A standard LLD kernel image highly tuned and optimized for real-time AND performance. This kernel is based on the RELEASE kernel configuration but with additional build configuration changes and boot time configuration options that give the smallest worst-case latency figure possible while not compromising the performance.
  4. Finally, we enable the PREEMPT_RT patch as an "overlay" on the previous LLD kernel in 3) in order to see our possible best result regarding worst-case latency.

Note that the goal here is to optimize for performance and real-time in all phases of a real-life production system: development, deployment, and maintenance. We will very likely have to add back some of the tracing and debugging features that we have now removed, because otherwise the system will become unmaintainable. This has a price in overhead, and the trade-off between performance and traceability has to be weighed from case to case.

Configuring and Building the Enea Linux PowerPC p2041rdb Kernel(s)

The Enea Linux PowerPC kernel is built in stages: first the "RELEASE" image is configured and built, and then the normal demo/debug image is configured and built. The recipe for the kernel build can be found under poky/meta-enea/recipes-kernel/linux. The file linux-qoriq-sdk.bbappend is essential here; it describes exactly which configuration goes into the RELEASE kernel and which goes into the demo kernel. The kernel configuration file (.config) is built up incrementally by merging configuration fragments in a specific order from the sub-directory files/cfg according to specific variable definitions.

The actual file name might contain the kernel version instead of "-sdk", e.g. linux-qoriq-3.12.bbappend.

The configuration file for the high-performance RELEASE kernel is defined by the incremental merge of the fragments specified in the KERNEL_FEATURES variable, and the resulting .config file can also be found as the config-p2041rdb-RELEASE.config file in the deployment images directory. The default demo/debug kernel has additional configuration fragments merged into its .config file, specified by the STAGING_KERNEL_FEATURES variable, and the aggregated .config file is named config-p2041rdb.config.
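To see what the second build stage actually adds, the two resulting configuration files can be compared. The sketch below is one possible way to do this with standard tools; the function name config_only_in is hypothetical, and the sketch assumes that both files use the usual kernel .config syntax.

```shell
#!/bin/sh
# config_only_in A B: print the "CONFIG_...=..." lines that are present
# in config file A but not present with the same value in config file B.
config_only_in() {
    a_sorted=$(mktemp) && b_sorted=$(mktemp) || return 1
    # Keep only lines that set an option, and sort them for comm.
    grep '^CONFIG_' "$1" | sort > "$a_sorted"
    grep '^CONFIG_' "$2" | sort > "$b_sorted"
    # -23 suppresses lines unique to B and lines common to both.
    comm -23 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}
```

Run from the deployment images directory, config_only_in config-p2041rdb.config config-p2041rdb-RELEASE.config would list the options that only the demo/debug kernel sets.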

Benchmark Description

The worst-case latency benchmark uses a combination of cyclictest and stress. The buffer sizes used in the stress scenarios are chosen both to generate a heavy network load via NFS traffic in the hdd test, and to resemble a real-life embedded application. The values are presented in the table below.

Table A.1 Details of Stress Test Scenarios

Scenario   Corresponding stress Options and Arguments
hdd        ./stress -d 4 --hdd-bytes 1M
vm         ./stress -m 4 --vm-bytes 4096
full       ./stress -c 4 -i 4 -m 4 --vm-bytes 4096 -d 4 --hdd-bytes 4096

The benchmark runs one stress instance per core in parallel with the cyclictest program:

./cyclictest -S -m -p99 -l 100000
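With -S, cyclictest reports one line of Min/Avg/Max figures per measurement thread. The text does not state how these per-thread lines were reduced to the single figures in the result tables, so the sketch below makes an assumption: it takes the lowest Min, the mean of the Avg values, and the highest Max. The function name summarize_cyclictest is hypothetical.

```shell
#!/bin/sh
# summarize_cyclictest: read cyclictest output on stdin and print one
# line "min=<n> avg=<n> max=<n>" aggregated over all "T:" thread lines.
# Assumption: min of Min, truncated mean of Avg, max of Max.
summarize_cyclictest() {
    awk '
        /^T:/ {
            gsub(/:/, ": ")                # force a space after every label
            for (i = 1; i < NF; i++) {
                if ($i == "Min:") mn = $(i + 1) + 0
                if ($i == "Avg:") av = $(i + 1) + 0
                if ($i == "Max:") mx = $(i + 1) + 0
            }
            if (n == 0 || mn < lo) lo = mn
            if (n == 0 || mx > hi) hi = mx
            sum += av; n++
        }
        END { if (n > 0) printf "min=%d avg=%d max=%d\n", lo, int(sum / n), hi }
    '
}
```

One way to use it is to add the -q option, which makes cyclictest print only the summary on exit, and pipe the output through the function: ./cyclictest -S -m -p99 -l 100000 -q | summarize_cyclictest. The -q option is an addition made for this sketch and is not part of the benchmark command above.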

The "Default" LLD Kernel Targeting Demo/Debug

This kernel is configured to contain all kinds of debug features, and thus it has a lot of overhead. Below is an enumeration of the features added, briefly described by the name of the configuration fragment:

  • files/cfg/00020-debug.cfg: misc collection of numerous ftrace options, performance events, counters, stacktrace support, file system debug options.
  • files/cfg/00033-kprobes.cfg
  • files/cfg/00007-oprofile.cfg
  • files/cfg/00019-i2c.cfg
  • files/cfg/00027-lttng.cfg
  • files/cfg/00025-powertop.cfg
  • files/cfg/00004-systemtap.cfg
  • files/cfg/00014-kgdb.cfg

Benchmark results for the default-built uImage-p2041rdb.bin image, prebuilt in the distribution:

Table A.2 Default LLD Kernel (Demo/Debug)

Latency [µs] by stress type
           no stress   cpu    io    vm   hdd  full
Min               12    10    19    10    11    10
Average           24    17    32    18    30    25
Max              382    33    90    39   388   230

The table above shows the min, average, and max latency in microseconds as printed by the cyclictest program, which runs one instance on each core in parallel with the stress program.

The result shows a fairly long and fluctuating worst-case latency, as well as a significant overhead in min- and average values. The conclusion is that this is neither a suitable kernel configuration for production systems with any kind of real-time requirements, nor for systems where the performance is important. This kernel is fully-featured with its pros and cons.

The Performance-Optimized RELEASE Kernel

As described earlier, this is a kernel configuration without the debug features found in the default demo/debug kernel. Benchmark results for the image uImage-p2041rdb-RELEASE.bin, prebuilt in the distribution:

Table A.3 Performance-Optimized RELEASE Kernel

Latency [µs] by stress type
           no stress   cpu    io    vm   hdd  full
Min                4     4     5     4     4     4
Average            7     5     6     4     7     6
Max               67    32    18    25   126    56

The result still shows a somewhat fluctuating worst-case latency, but the min and average values are now significantly improved. The conclusion is that this is still not a suitable kernel configuration for production systems with any kind of real-time requirements. However, for systems where you want to minimize the kernel overhead in order to maximize the application performance, this is a potential base configuration to which you may add the debug features deemed necessary in the field.

The "RT-Optimized" Mainstream Low-Latency Desktop Kernel

The two kernel builds previously described exist "out-of-the-box" in the Enea Linux p2041rdb distribution, so in order to be able to do further benchmarking we need to describe here how we can modify the kernel configuration for new builds. We can do this either by using the command

bitbake -c menuconfig virtual/kernel

or we can temporarily modify the kernel recipe in the meta-enea layer. We choose to modify the kernel recipe, mainly to enable reproduction, but also because we would otherwise have to do a substantial amount of "reversing" of options in menuconfig, since the .config file we have to work with is the demo/debug one with all debug features enabled. The modifications can be described in three steps:

Step 1

Add a new, latency-optimizing, fragment file in meta-enea/recipes-kernel/linux/files/cfg. Fragment file names contain leading numbers and the new file shall have a higher number than the existing files. We assume here that the name 00051-latency.cfg is suitable for the new file, which shall contain:

# CONFIG_HIGHMEM is not set, but you can set CONFIG_HIGHMEM=y
#                if access to the entire memory is needed
# CONFIG_FSL_ERRATUM_A_004801 is not set
# CONFIG_FSL_ERRATUM_A_005337 is not set
# CONFIG_RCU_TRACE is not set

Our intended target system for the exercise is still an embedded 32-bit target, and we want a tick of 1 ms, i.e. a kernel timer frequency of 1000 Hz. As of today, most p2041 processor devices are rev 2 or later, which means that we can safely disable the HW errata workaround that disables HW MMU table walk and reverts to the slower SW MMU table walk.

GCC for PowerPC produces slightly better code using likely/unlikely, so we enable this.

We will also remove some additional, potential tracing overhead.

A fundamental contribution to improving the real-time characteristics and reducing OS jitter is to enable priority boosting for RCU readers and to offload RCU callback invocation to kernel threads (from kernel version 3.7).
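After a rebuild it is worth checking that the resulting .config really contains the intended settings. The sketch below is one way to do so; the function names are hypothetical. The first four symbols are taken from the fragment above. The last three (CONFIG_HZ, CONFIG_RCU_BOOST, CONFIG_RCU_NOCB_CPU) are mainline Kconfig symbols that we assume correspond to the timer frequency, the RCU priority boosting, and the RCU callback offloading described in the text; they do not appear in the fragment as shown here.

```shell
#!/bin/sh
# check_option CONFIG_FILE SYMBOL: print how SYMBOL is set in CONFIG_FILE,
# or "<SYMBOL>: not present" if the file does not mention it.
check_option() {
    # Match either "CONFIG_X=value" or "# CONFIG_X is not set".
    line=$(grep -E "^(# )?$2(=| is not set)" "$1" | head -n 1)
    if [ -n "$line" ]; then
        printf '%s\n' "$line"
    else
        printf '%s: not present\n' "$2"
    fi
}

# check_latency_config CONFIG_FILE: report the options discussed in step 1.
# The last three symbols are assumptions, see the text above.
check_latency_config() {
    for sym in CONFIG_HIGHMEM CONFIG_FSL_ERRATUM_A_004801 \
               CONFIG_FSL_ERRATUM_A_005337 CONFIG_RCU_TRACE \
               CONFIG_HZ CONFIG_RCU_BOOST CONFIG_RCU_NOCB_CPU; do
        check_option "$1" "$sym"
    done
}
```

For example, check_latency_config config-p2041rdb.config prints one line per symbol, which makes it easy to spot an option that a later fragment has overridden.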

Step 2

Edit the file linux-qoriq-sdk.bbappend in meta-enea/recipes-kernel/linux. Replace the content of STAGING_KERNEL_FEATURES (debug, kprobes, oprofile, i2c, lttng, powertop, systemtap, kgdb) with the single fragment from step 1 above, e.g. cfg/00051-latency. The actual bbappend file name might contain the kernel version instead of "-sdk", e.g. linux-qoriq-3.12.bbappend.

Step 3

Add the argument threadirqs to the Linux kernel boot argument list in U-Boot, for example:

setenv bootargs threadirqs root=/dev/nfs rw …

The actions above will further improve the worst-case latency figures as much as possible for a standard LLD PowerPC Linux kernel. The RCU callbacks can be fairly heavy, and if their execution is offloaded to and balanced across preemptible kernel threads, we get lower jitter in the response to external interrupts and thus better worst-case latency figures. Similarly, some ISRs (Interrupt Service Routines) can be very long, and since these normally execute with interrupts (and thus preemption) disabled, the risk that such ISRs add to the worst-case latency is high.

Since kernel version 2.6.39 (May 2011, as a spin-off from the PREEMPT_RT patch) it is possible to give the boot time parameter threadirqs, and this will instruct the kernel to defer the execution of the ISR from the hardirq context to a preemptible kernel thread instead, one per ISR. This will remove much of the driver’s ISR execution time from the sources of latency jitter, and thus it contributes to the improvement of the overall determinism. It will, however, due to the added context switch potentially increase the overhead slightly, but since this is a subset of the PREEMPT_RT patch we know that the corresponding overhead also is less than in the PREEMPT_RT case.
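Once the target has booted, it is easy to confirm that the parameter took effect. The sketch below checks the kernel command line and counts the interrupt handler threads, which the kernel names irq/<number>-<name>. The function names are hypothetical, and the ps options used are the POSIX ones; a minimal embedded userland may need a different ps invocation.

```shell
#!/bin/sh
# has_threadirqs [CMDLINE_FILE]: succeed if the kernel command line
# contains the stand-alone word "threadirqs". Defaults to /proc/cmdline.
has_threadirqs() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -qx 'threadirqs'
}

# count_irq_threads: print the number of kernel threads whose name
# starts with "irq/", i.e. the threaded interrupt handlers.
count_irq_threads() {
    ps -e -o comm= | grep -c '^irq/'
}
```

On the target, has_threadirqs && count_irq_threads should print a count that is clearly higher than on the same kernel booted without the parameter, since only drivers that explicitly request threaded handlers have such threads by default.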

Benchmark results for the built uImage configured according to above and with boot time threadirqs are presented in the table here below:

Table A.4 RT-Optimized Mainstream LLD Kernel

Latency [µs] by stress type
           no stress   cpu    io    vm   hdd  full
Min                3     3     4     3     3     3
Average            6     3     5     3     6     4
Max               31    16    19    20    45    44

The resulting figures demonstrate a significant decrease in jitter, and we end up with a fairly good worst-case latency of 45 µs even while loading the system with network traffic via the NFS file system operations in the hdd test. The minimal and average values are also very low, which indicates that the average code path executed when preemption is disabled has become shorter. The conclusion here is that this kernel configuration would be a possible option for production systems with soft real-time requirements in the area of 100 µs.

The "RT-Optimized" PREEMPT_RT Kernel

Just as for the RT-optimized LLD kernel in the previous section, we have to modify the kernel recipe temporarily in order to configure and build this kernel. The modifications we have to do can be described as follows:

  1. Repeat steps 1 & 2 in Section A.7, The "RT-Optimized" Mainstream Low-Latency Desktop Kernel.
  2. Make another copy of the STAGING_KERNEL_FEATURES_RT_p4080ds statement and call it STAGING_KERNEL_FEATURES_RT_p2041rdb. This will merge the fragment 00018-rt on top of the LLD kernel and enable the preempt_rt patch.
  3. The existence of the _RT statement triggers a third stage kernel build, named uImage-p2041rdb-rt.bin and a corresponding config file.

Benchmark results for the built uImage-p2041rdb-rt.bin:

Table A.5 RT-Optimized PREEMPT_RT Kernel

Latency [µs] by stress type
           no stress   cpu    io    vm   hdd  full
Min                3     3     3     3     3     3
Average            6     3     7     3     7     4
Max               16    12    19    13    27    18

The benchmark results show that the preempt_rt patched kernel improves the worst-case latency figures even further, to 27 µs compared to the 45 µs we could reach with the standard LLD kernel. The other observation is that the minimal and average figures are very similar, perhaps with a slightly longer average latency, and thus overhead, for the preempt_rt kernel, but this is not significant. The conclusion here is that a preempt_rt kernel configuration would be a possible option for production systems with soft real-time requirements in the area of 50-100 µs.

Summary and Conclusion

As a short summary, the results range from almost 400 µs in worst-case latency for the demo/debug LLD kernel down to as low as 40-45 µs with the RT-optimized LLD kernel with threadirqs, and 25-30 µs when PREEMPT_RT is enabled. We have also seen a significant decrease of the minimal and average latency from around 20-25 µs down to about 3-6 µs, which implies that we have also gained a significant overall increase in throughput performance.
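The worst-case figures quoted in this summary can be reproduced directly from the Max rows of Tables A.2 to A.5. The short sketch below does this; the function name worst_case is hypothetical, and the numbers are copied from the tables.

```shell
#!/bin/sh
# worst_case LABEL VALUE...: print LABEL and the largest of the values.
worst_case() {
    label=$1
    shift
    max=0
    for v in "$@"; do
        if [ "$v" -gt "$max" ]; then
            max=$v
        fi
    done
    printf '%s: %d\n' "$label" "$max"
}

# Max rows (no stress, cpu, io, vm, hdd, full) from Tables A.2 to A.5.
worst_case "demo/debug LLD"   382 33 90 39 388 230
worst_case "RELEASE LLD"       67 32 18 25 126  56
worst_case "RT-optimized LLD"  31 16 19 20  45  44
worst_case "PREEMPT_RT"        16 12 19 13  27  18
# prints:
# demo/debug LLD: 388
# RELEASE LLD: 126
# RT-optimized LLD: 45
# PREEMPT_RT: 27
```

In all four configurations the worst case occurs in the hdd scenario, i.e. the one that generates NFS traffic.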

The benchmark indicates that recent years' development in the mainline Low-Latency Desktop kernel, for example the threaded IRQs feature and the offloaded RCU callback feature, has made it possible to reduce the OS jitter and worst-case latency to a level where this kernel starts to be a real alternative to the preempt_rt patched kernel in an embedded Linux system with soft real-time requirements.

The benchmarks above are constructed only to indicate potential ways forward to reach soft real-time requirements. The chosen test case does not in any way guarantee that the results are valid for any type of BSP in any type of system. It is important to note that other or different versions of drivers may affect the result, as well as different versions of the kernel or application use case pattern.