Hírolvasó
Security updates for Tuesday
[$] ID-mapped mounts
[$] The Clever Audio Plugin
Four stable kernel releases
Perl 5.36.0 released
Security updates for Monday
Linux Plumbers Conference: Microconferences at Linux Plumbers Conference: Service Management and systemd
Linux Plumbers Conference 2022 is pleased to host the Service Management and systemd Microconference.
The focus of this microconference will be on topics related to the current
state of host-level service management and ideas for the future.
Most of the topics will be around the systemd ecosystem, as it is the most widely adopted service manager. The Service Management and systemd microconference also welcomes proposals that are not specific to systemd, so we can discover and share new ideas on how to improve service management in general.
Please come and join us in the discussion about the future of service management.
We hope to see you there!
Paul E. McKenney: Stupid RCU Tricks: How Read-Intensive is The Kernel's Use of RCU?
One way to determine this would be to use something like ftrace to record all the calls to these functions. This works, but trace messages can be lost, especially when applied to frequently invoked functions. Also, dumping out the trace buffer can perturb the system. Another approach is to modify the kernel source code to count these function invocations in a cache-friendly manner, then come up with some way to dump this to userspace. This works, but I am lazy. Yet another approach is to ask the tracing folks for advice.
This last is what I actually did, and because the tracing person I happened to ask happened to be Andrii Nakryiko, I learned quite a bit about BPF in general and the bpftrace command in particular. If you don't happen to have Andrii on hand, you can do quite well with Appendix A and Appendix B of Brendan Gregg's “BPF Performance Tools”. You will of course need to install bpftrace itself, which is reasonably straightforward on many Linux distributions.
Linux-Kernel RCU Read Intensity

Those of you who have used sed and awk have a bit of a running start because you can invoke bpftrace with a -e argument and a series of tracepoint/program pairs, where a program is bpftrace code enclosed in curly braces. This code is compiled, verified, and loaded into the running kernel as a kernel module. When the code finishes executing, the results are printed right there for you on stdout. For example:
bpftrace -e 'kprobe:__rcu_read_lock { @rcu_reader = count(); } kprobe:rcu_gp_fqs_loop { @gp = count(); } interval:s:10 { exit(); }'
This command uses the kprobe facility to attach a program to the __rcu_read_lock() function and to attach a very similar program to the rcu_gp_fqs_loop() function, which happens to be invoked exactly once per RCU grace period. Both programs count the number of calls, with @gp being the bpftrace “variable” accumulating the count, and the count() function doing the counting in a cache-friendly manner. The final interval:s:10 in effect attaches a program to a timer, so that this last program will execute every 10 seconds (“s:10”). Except that the program invokes the exit() function that terminates this bpftrace program at the end of the very first 10-second time interval. Upon termination, bpftrace outputs the following on an idle system:
Attaching 3 probes...
@gp: 977
@rcu_reader: 6435368
In other words, there were about a thousand grace periods and more than six million RCU readers during that 10-second time period, for a read-to-grace-period ratio of more than six thousand. This certainly qualifies as read-intensive.
But what if the system is busy? Much depends on exactly how busy the system is, as well as exactly how it is busy, but let's use that old standby, the kernel build (but using the nice command to avoid delaying bpftrace). Let's also put the bpftrace script into a creatively named file rcu1.bpf like so:
kprobe:__rcu_read_lock { @rcu_reader = count(); }
kprobe:rcu_gp_fqs_loop { @gp = count(); }
interval:s:10 { exit(); }
This allows the command bpftrace rcu1.bpf to produce the following output:
Attaching 3 probes...
@gp: 274
@rcu_reader: 78211260
Where the idle system had about one thousand grace periods over the course of ten seconds, the busy system had only 274. On the other hand, the busy system had 78 million RCU read-side critical sections, more than ten times that of the idle system. The busy system had more than one quarter million RCU read-side critical sections per grace period, which is seriously read-intensive.
RCU works hard to make the same grace-period computation cover multiple requests. Because synchronize_rcu() invokes call_rcu(), we can use the number of call_rcu() invocations as a rough proxy for the number of updates, that is, the number of requests for a grace period. (The more invocations of synchronize_rcu_expedited() and kfree_rcu(), the rougher this proxy will be.)
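To make the update side concrete, here is a minimal, purely illustrative sketch of the two kinds of grace-period request counted by the script below; struct foo, foo_reclaim(), and the two foo_retire_*() helpers are hypothetical names rather than code from the kernel or from this post:

struct foo {
	struct rcu_head rh;
	int data;
};

/* Invoked once a grace period has elapsed; frees the old structure. */
static void foo_reclaim(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rh));
}

/* Asynchronous grace-period request. */
static void foo_retire_async(struct foo *old)
{
	call_rcu(&old->rh, foo_reclaim);
}

/* Synchronous grace-period request, which (as noted above) in turn invokes call_rcu(). */
static void foo_retire_sync(struct foo *old)
{
	synchronize_rcu();
	kfree(old);
}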
We can make the bpftrace script more concise by assigning the same action to a group of tracepoints, as in the rcu2.bpf file shown here:
kprobe:__rcu_read_lock, kprobe:call_rcu, kprobe:rcu_gp_fqs_loop { @[func] = count(); }
interval:s:10 { exit(); }
With this file in place, bpftrace rcu2.bpf produces the following output in the midst of a kernel build:
Attaching 4 probes...
@[rcu_gp_fqs_loop]: 128
@[call_rcu]: 195721
@[__rcu_read_lock]: 21985946
These results look quite different from the earlier kernel-build results, confirming any suspicions you might harbor about the suitability of kernel builds as a repeatable benchmark. Nevertheless, there are about 180K RCU read-side critical sections per grace period, which is still seriously read-intensive. Furthermore, there are also almost 2K call_rcu() invocations per RCU grace period, which means that RCU is able to amortize the overhead of a given grace period down to almost nothing per grace-period request.
Linux-Kernel RCU Grace-Period Latency

The following bpftrace program makes a histogram of grace-period latencies, that is, the time from the call to rcu_gp_init() to the return from rcu_gp_cleanup():
kprobe:rcu_gp_init { @start = nsecs; }
kretprobe:rcu_gp_cleanup { if (@start) { @gplat = hist((nsecs - @start)/1000000); } }
interval:s:10 { printf("Internal grace-period latency, milliseconds:\n"); exit(); }
The kretprobe attaches the program to the return from rcu_gp_cleanup(). The hist() function computes a log-scale histogram. The check of the @start variable avoids using a beginning-of-time value for this variable in the common case where this script starts in the middle of a grace period. (Try it without that check!)
The output is as follows:
Attaching 3 probes...
Internal grace-period latency, milliseconds:

@gplat:
[2, 4)    259 |@@@@@@@@@@@@@@@@@@@@@@                              |
[4, 8)    591 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)   137 |@@@@@@@@@@@@                                        |
[16, 32)    3 |                                                    |
[32, 64)    5 |                                                    |

@start: 95694642573968
Most of the grace periods complete within four to eight milliseconds, with most of the remainder completing within either two to four milliseconds or eight to sixteen milliseconds, but with a few stragglers taking up to 64 milliseconds. The final @start line shows that bpftrace simply dumps out all the variables. You can use the delete(@start) function to prevent printing of @start, but please note that the next invocation of rcu_gp_init() will re-create it.
It is nice to know the internal latency of an RCU grace period, but most in-kernel users will be more concerned about the latency of the synchronize_rcu() function, which will need to wait for the current grace period to complete and also for callback invocation. We can measure this function's latency with the following bpftrace script:
kprobe:synchronize_rcu { @start[tid] = nsecs; }
kretprobe:synchronize_rcu { if (@start[tid]) { @srlat = hist((nsecs - @start[tid])/1000000); delete(@start[tid]); } }
interval:s:10 { printf("synchronize_rcu() latency, milliseconds:\n"); exit(); }
The tid variable contains the ID of the currently running task, which allows this script to associate a given return from synchronize_rcu() with the corresponding call by using tid as an index to the @start variable.
As you would expect, the resulting histogram is weighted towards somewhat longer latencies, though without the stragglers:
Attaching 3 probes...
synchronize_rcu() latency, milliseconds:

@srlat:
[4, 8)     9 |@@@@@@@@@@@@@@@                                     |
[8, 16)   31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)  31 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@start[4075307]: 96560784497352
In addition, we see not one but two values for @start. The delete statement gets rid of old ones, but any new call to synchronize_rcu() will create more of them.
Linux-Kernel Expedited RCU Grace-Period Latency

Linux kernels will sometimes execute synchronize_rcu_expedited() to obtain a faster grace period, and the following command will further cause synchronize_rcu() to act like synchronize_rcu_expedited():
echo 1 > /sys/kernel/rcu_expedited
Doing this on a dual-socket system with 80 hardware threads might be ill-advised, but you only live once!
Ill-advised or not, the following bpftrace script measures synchronize_rcu_expedited() latency, but in microseconds rather than milliseconds:
kprobe:synchronize_rcu_expedited { @start[tid] = nsecs; }
kretprobe:synchronize_rcu_expedited { if (@start[tid]) { @srelat = hist((nsecs - @start[tid])/1000); delete(@start[tid]); } }
interval:s:10 { printf("synchronize_rcu() latency, microseconds:\n"); exit(); }
The output of this script run concurrently with a kernel build is as follows:
Attaching 3 probes...
synchronize_rcu() latency, microseconds:

@srelat:
[128, 256)  57 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)  14 |@@@@@@@@@@@@                                        |
[512, 1K)    1 |                                                    |
[1K, 2K)     2 |@                                                   |
[2K, 4K)     7 |@@@@@@                                              |
[4K, 8K)     2 |@                                                   |
[8K, 16K)    3 |@@                                                  |

@start[4140285]: 97489845318700
Most synchronize_rcu_expedited() invocations complete within a few hundred microseconds, but with a few stragglers around ten milliseconds.
But what about linear histograms? This is what the lhist() function is for, with added minimum, maximum, and bucket-size arguments:
kprobe:synchronize_rcu_expedited { @start[tid] = nsecs; }
kretprobe:synchronize_rcu_expedited { if (@start[tid]) { @srelat = lhist((nsecs - @start[tid])/1000, 0, 1000, 100); delete(@start[tid]); } }
interval:s:10 { printf("synchronize_rcu() latency, microseconds:\n"); exit(); }
Running this with the usual kernel build in the background:
Attaching 3 probes...
synchronize_rcu() latency, microseconds:

@srelat:
[100, 200)   26 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 300)   13 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[300, 400)    5 |@@@@@@@@@@                                          |
[400, 500)    1 |@@                                                  |
[500, 600)    0 |                                                    |
[600, 700)    2 |@@@@                                                |
[700, 800)    0 |                                                    |
[800, 900)    1 |@@                                                  |
[900, 1000)   1 |@@                                                  |
[1000, ...)  18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |

@start[4184562]: 98032023641157
The final bucket is overflow, containing measurements that exceeded the one-millisecond limit.
The above histogram had only a few empty buckets, but that is mostly because the 18 synchronize_rcu_expedited() instances that overflowed the one-millisecond limit are consolidated into a single [1000, ...) overflow bucket. This is sometimes what is needed, but other times losing the maximum latency can be a problem. This can be dealt with given the following bpftrace program:
kprobe:synchronize_rcu_expedited { @start[tid] = nsecs; }
kretprobe:synchronize_rcu_expedited { if (@start[tid]) { @srelat[(nsecs - @start[tid])/100000*100] = count(); delete(@start[tid]); } }
interval:s:10 { printf("synchronize_rcu() latency, microseconds:\n"); exit(); }
Given the usual kernel-build background load, this produces the following output:
Attaching 3 probes...
synchronize_rcu() latency, microseconds:

@srelat[1600]: 1
@srelat[500]: 1
@srelat[1000]: 1
@srelat[700]: 1
@srelat[1100]: 1
@srelat[2300]: 1
@srelat[300]: 1
@srelat[400]: 2
@srelat[600]: 3
@srelat[200]: 4
@srelat[100]: 20
@start[763214]: 17487881311831
This is a bit hard to read, but simple scripting can be applied to this output to produce something like this:
100: 20
200: 4
300: 1
400: 2
500: 1
600: 3
700: 1
1000: 1
1100: 1
1600: 1
This produces compact output despite outliers such as the last entry, corresponding to an invocation that took somewhere between 1.6 and 1.7 milliseconds.
Summary

The bpftrace command can be used to quickly and easily script compiled in-kernel programs that can measure and monitor a wide variety of things. This post focused on a few aspects of RCU, but quite a bit more material may be found in Brendan Gregg's “BPF Performance Tools” book.
Linux Plumbers Conference: Linux Plumbers Conference Refereed-Track Deadlines
The proposal deadline is June 12, which is right around the corner. We have excellent submissions, for which we gratefully thank our submitters! For the rest of you, we do have one problem, namely that we do not yet have your submission. So please point your browser at the call-for-proposals page and submit your proposal. After all, if you don’t submit it, we won’t accept it!
McKenney: Stupid RCU Tricks: Is RCU Watching?
Unfortunately, an eternally watchful RCU is impractical in the Linux kernel due to energy-efficiency considerations. The problem is that if RCU watches an idle CPU, RCU needs that CPU to execute instructions. And making an idle CPU unnecessarily execute instructions (for a rather broad definition of the word “unnecessarily”) will terminally annoy a great many people in the battery-powered embedded world. And for good reason: Making RCU avoid watching idle CPUs can provide 30-40% increases in battery lifetime.
Dave Airlie (blogspot): lavapipe Vulkan 1.2 conformant
The software Vulkan renderer in Mesa, lavapipe, has achieved official Vulkan 1.2 conformance. The non-obvious entry in the table is here.
Thanks to all the Mesa team who helped achieve this. Shout-outs to Mike of Zink fame, who drove a bunch of pieces over the line, and to Roland, who helped review some of the funkier changes.
We will be submitting for 1.3 conformance soon; there are just a few things left to iron out.
[$] 5.19 Merge window, part 1
Security updates for Friday
AlmaLinux 9 Now Available
[...] The AlmaLinux OS Foundation would like to thank all those involved in the CentOS Stream 9 efforts, CentOS SIGs and others that made this release possible. Thank you to the Fedora and RHEL teams, as well as upstream projects and contributors everywhere. You Rock!
OpenIKED 7.1 released
OpenIKED 7.1 was released on May 23rd, 2022.
The complete release notes may be read here:
https://ftp.openbsd.org/pub/OpenBSD/OpenIKED/openiked-7.1-relnotes.txt
Paul E. McKenney: Stupid RCU Tricks: Is RCU Watching?
Unfortunately, an eternally watchful RCU is impractical in the Linux kernel due to energy-efficiency considerations. The problem is that if RCU watches an idle CPU, RCU needs that CPU to execute instructions. And making an idle CPU unnecessarily execute instructions (for a rather broad definition of the word “unnecessarily”) will terminally annoy a great many people in the battery-powered embedded world. And for good reason: Making RCU avoid watching idle CPUs can provide 30-40% increases in battery lifetime.
In this, CPUs are not all that different from people. Interrupting someone who is deep in thought can cause them to lose 20 minutes of work. Similarly, when a CPU is deeply idle, asking it to execute instructions will consume not only the energy required for those instructions, but also much more energy to work its way out of that deep idle state, and then to return back to that deep idle state.
And this is why CPUs must tell RCU to stop watching them when they go idle. This allows RCU to ignore them completely, in particular, to refrain from asking them to execute instructions.
In some kernel configurations, RCU also ignores portions of the kernel's entry/exit code, that is, the last bits of kernel code before switching to userspace and the first bits of kernel code after switching away from userspace. This happens only in kernels built with CONFIG_NO_HZ_FULL=y, and even then only on CPUs mentioned in the CPU list passed to the nohz_full kernel parameter. This enables carefully configured HPC applications and CPU-bound real-time applications to get near-bare-metal performance from such CPUs, while still having the entire Linux kernel at their beck and call. Because RCU is not watching such applications, the scheduling-clock interrupt can be turned off entirely, thus avoiding disturbing such performance-critical applications.
But if RCU is not watching a given CPU, rcu_read_lock() has no effect on that CPU, which can come as a nasty shock to the corresponding RCU read-side critical section, which naively expected to be able to safely traverse an RCU-protected data structure. This can be a trap for the unwary, which is why kernels built with CONFIG_PROVE_LOCKING=y (lockdep) complain bitterly when rcu_read_lock() is invoked on CPUs that RCU is not watching.
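For concreteness, here is a minimal, purely illustrative sketch of the kind of reader that can fall into this trap; struct mydata, mydata_ptr, and read_mydata() are hypothetical names, not code from the kernel or from this post:

/* Purely illustrative; all names here are hypothetical. */
struct mydata {
	int field;
};

static struct mydata __rcu *mydata_ptr;

/* Fine when called from a normal task, but if invoked where RCU is not
 * watching, rcu_read_lock() provides no protection and lockdep
 * (CONFIG_PROVE_LOCKING=y) will complain. */
static int read_mydata(void)
{
	struct mydata *p;
	int val = -1;

	rcu_read_lock();
	p = rcu_dereference(mydata_ptr);
	if (p)
		val = p->field;
	rcu_read_unlock();
	return val;
}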
But suppose that you have code using RCU that is invoked both from deep within the idle loop and from normal tasks.
Back in the day, this was not much of a problem. True to its name, the idle loop was not much more than a loop, and the deep architecture-specific code on the kernel entry/exit paths had no need of RCU. This has changed, especially with the advent of idle drivers and governors, to say nothing of tracing. So what can you do?
First, you can invoke rcu_is_watching(), which, as its name suggests, will return true if RCU is watching. And, as you might expect, lockdep uses this function to figure out when it should complain bitterly. The following example code lays out the current possibilities:
if (rcu_is_watching())
	printk("Invoked from normal or idle task with RCU watching.\n");
else if (is_idle_task(current))
	printk("Invoked from deep within the idle task where RCU is not watching.\n");
else
	printk("Invoked from nohz_full entry/exit code where RCU is not watching.\n");
Except that even invoking printk() is an iffy proposition while RCU is not watching.
So suppose that you invoke rcu_is_watching() and it helpfully returns false, indicating that you cannot invoke rcu_read_lock() and friends. What now?
You could do what the v5.18 Linux kernel's kernel_text_address() function does, which can be abbreviated as follows:
no_rcu = !rcu_is_watching();
if (no_rcu)
	rcu_nmi_enter(); // Make RCU watch!!!
do_rcu_traversals();
if (no_rcu)
	rcu_nmi_exit(); // Return RCU to its prior watchfulness state.
If your code is not so performance-critical, you can do what the arm64 implementation of the cpu_suspend() function does:
RCU_NONIDLE(__cpu_suspend_exit());
This macro forces RCU to watch while it executes its argument as follows:
#define RCU_NONIDLE(a) \
	do { \
		rcu_irq_enter_irqson(); \
		do { a; } while (0); \
		rcu_irq_exit_irqson(); \
	} while (0)
The rcu_irq_enter_irqson() and rcu_irq_exit_irqson() functions are essentially wrappers around the aforementioned rcu_nmi_enter() and rcu_nmi_exit() functions.
Although RCU_NONIDLE() is more compact than the kernel_text_address() approach, it is still annoying to have to pass your code to a macro. And this is why Peter Zijlstra has been reworking the various idle loops to cause RCU to be watching a much greater fraction of their code. This might well be an ongoing process as the idle loops continue gaining functionality, but Peter's good work thus far at least makes RCU watch the idle governors and a much larger fraction of the idle loop's trace events. When combined with the kernel entry/exit work by Peter, Thomas Gleixner, Mark Rutland, and many others, it is hoped that the functions not watched by RCU will all eventually be decorated with something like noinstr, for example:
static noinline noinstr unsigned long rcu_dynticks_inc(int incby)
{
	return arch_atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
}
We don't need to worry about exactly what this function does. For this blog entry, it is enough to know that its noinstr tag prevents tracing this function, making it less problematic for RCU to not be watching it.
What exactly are you prohibited from doing while RCU is not watching your code?
As noted before, RCU readers are a no-go. If you try invoking rcu_read_lock(), rcu_read_unlock(), rcu_read_lock_bh(), rcu_read_unlock_bh(), rcu_read_lock_sched(), or rcu_read_unlock_sched() from regions of code where rcu_is_watching() would return false, lockdep will complain.
On the other hand, using SRCU (srcu_read_lock() and srcu_read_unlock()) is just fine, as is RCU Tasks Trace (rcu_read_lock_trace() and rcu_read_unlock_trace()). RCU Tasks Rude does not have explicit read-side markers, but anything that disables preemption acts as an RCU Tasks Rude reader no matter what rcu_is_watching() would return at the time.
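For example, here is a minimal, purely illustrative sketch of an SRCU reader, which remains legal even where rcu_is_watching() would return false; my_srcu, my_ptr, struct mydata, and read_via_srcu() are hypothetical names:

/* Purely illustrative; all names here are hypothetical. */
DEFINE_STATIC_SRCU(my_srcu);

struct mydata {
	int field;
};

static struct mydata __rcu *my_ptr;

static int read_via_srcu(void)
{
	struct mydata *p;
	int idx, val = -1;

	idx = srcu_read_lock(&my_srcu);		/* Legal even where RCU is not watching. */
	p = srcu_dereference(my_ptr, &my_srcu);
	if (p)
		val = p->field;
	srcu_read_unlock(&my_srcu, idx);
	return val;
}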
RCU Tasks is an odd special case. Like RCU Tasks Rude, RCU Tasks has implicit read-side markers, which are any region of non-idle-task kernel code that does not do a voluntary context switch (the idle tasks are instead handled by RCU Tasks Rude). Except that in kernels built with CONFIG_PREEMPTION=n and without any of RCU's test suite, the RCU Tasks API maps to plain old RCU. This means that code not watched by RCU is ignored by the remapped RCU Tasks in such kernels. Given that RCU Tasks ignores the idle tasks, this affects only user entry/exit code in kernels built with CONFIG_NO_HZ_FULL=y, and even then, only on CPUs mentioned in the list given to the nohz_full kernel boot parameter. However, this situation can nevertheless be a trap for the unwary.
Therefore, in post-v5.18 mainline, you can build your kernel with CONFIG_FORCE_TASKS_RCU=y, in which case RCU Tasks will always be built into your kernel, avoiding this trap.
In summary, energy-efficiency, battery-lifetime, and application-performance/latency concerns force RCU to avert its gaze from idle CPUs, and, in kernels built with CONFIG_NO_HZ_FULL=y, also from nohz_full CPUs on the low-level kernel entry/exit code paths. Fortunately, recent changes have allowed RCU to watch more code, but this being the kernel, corner cases will always be with us. This corner-case code from which RCU must avert its gaze requires the special handling described in this blog post.
[$] splice() and the ghost of set_fs()
What happened to Perl 7?
For now, our plan is to continue introducing new features and to resolve all existing experimental features, so they're either dropped, or become non-experimental features (and so are included in the version bundle). The downside with this is that people often can't remember which version of Perl introduced which feature(s). At some point in the future, the PSC may decide that the set of features, taken together, represent a big enough step forward to justify a new baseline for Perl. If that happens, then the version will be bumped to 7.0.