Kernel Planet

Brendan Gregg: The Speed of Time

3 nap 7 óra óta
How long does it take to read the time? How would you _time_ time? These strange questions came to the fore back in 2014 when Netflix was switching services from CentOS Linux to Ubuntu, and I helped debug several weird performance issues including one I'll describe here. While you're unlikely to run into this specific issue anymore, what is interesting is this type of issue and the simple method of debugging it: a pragmatic mix of observability and experimentation tools. I've shared many posts about superpower observability tools, but often humble hacking is just as effective. A Cassandra database cluster had switched to Ubuntu and noticed write latency increased by over 30%. A quick check of basic performance statistics showed over 30% higher CPU consumption. What on Earth is Ubuntu doing that results in 30% higher CPU time!? ## 1. CLI tools The Cassandra systems were EC2 virtual machine (Xen) instances. I logged into one and went through some basic CLI tools to get started (my [60s checklist]). Was there some other program consuming CPU, like a misbehaving Ubuntu service that wasn't in CentOS? top(1) showed that only the Cassandra database was consuming CPU. What about short-lived processes, like a service restarting in a loop? These can be invisible to top(8). My execsnoop(8) tool (back then my [Ftrace version]) showed nothing. It seemed that the extra CPU time really was in Cassandra, but how? ## 2. CPU profile Understanding a CPU time should be easy by comparing [CPU flame graphs]. Since instances of both CentOS and Ubuntu were running in parallel, I could collect flame graphs at the same time (same time-of-day traffic mix) and compare them side by side. The CentOS flame graph: The Ubuntu flame graph: Darn, they didn't work. There's no Java stack—there should be a tower of green Java methods—instead there's only a single green frame or two. This is how Java flame graphs looked at the time. Later that year I prototyped the c2 frame pointer fix that became -XX:+PreserveFramePointer, which fixes Java stacks in these profiles. Even with the broken Java stacks, I noticed a big difference: On Ubuntu, there's a massive amount of CPU time in a libjvm call: os::javaTimeMillis(). 30.14% in the middle of the flame graph. Searching shows it was elsewhere as well for a total of 32.1%. This server is spending about a third of its CPU cycles just checking the time! This was a weird problem to think about: Time itself had now become a resource and target of performance analysis. The broken Java stacks turned out to be beneficial: They helped group together the os::javaTimeMillis() calls which otherwise might have have been scattered on top of different Java code paths, appearing as thin stacks everywhere. If that were the case, I'd have zoomed in to see what they were then zoomed out and searched for them for the cumulative percent; or flipped the merge order so it's an icicle graph, merging leaf to root. But in this case I didn't have to do anything as it was mostly merged by accident. (This is one of the motivating reasons for switching to a d3 version of flame graphs, as I want the interactivity of d3 to do things like collapse all the Java frames, all the user-mode frames, etc., to expose different groupings like this.) ## 3. Studying the flame graph os::javaTimeMillis fetches the current time. Browsing the flame graph shows it is calling the gettimeofday(2) syscall which enters the tracesys() and syscall_trace_enter/exit() kernel functions. This gave me two theories: A) Some syscall tracing is enabled in Ubuntu (auditing? apparmor?). B) Fetching time was somehow slower on Ubuntu, which could be a library change or a kernel/clocksource change. Theory (A) is most likely based on the frame widths in the flame graph. But I'm not completely sure. As a Xen guest, this profile was gathered using perf(1) and the kernel's software cpu-clock soft interrupts, not the hardware NMI. Without NMI, some kernel code paths (interrupts disabled) can't be profiled. Also, since it's a Xen guest, hypervisor time can never be profiled. These two factors mean that there can be missing kernel and hypervisor time in the flame graph, to the true breakdown of time in os::javaTimeMillis may be a little different. Note that Ubuntu also has a frame to show entry into vDSO (virtual dynamic shared object). This is a user-mode syscall accelerator, and gettimeofday(2) is a classic use case that is cited in the vdso(7) man page. At the time, the Xen pvclock source didn't support vDSO, so you can see the syscall code above the vdso frame. It's the same on CentOS, although it doesn't include a vdso frame in the flame graph (I'd guess due to a perf(1) difference alone). ## 4. Colleagues/Internet I love using [Linux performance tools]. But I also love solving issues quickly, and sometimes that means just asking colleagues or searching the Internet. I'm including this step as a reminder for anyone following this kind of analysis. Others I asked hadn't hit this issue, and the Internet at the time had nothing using the search terms os::javaTimeMillis, clocksource, tracesys(), Ubuntu, EC2, Xen, etc. (That changed by the end of the year.) ## 5. Experimentation To further analysis this with observability tools, I could:
  1. Fix the Java stacks to see if there's a difference in how time is used on Ubuntu. Maybe Java is calling it more often for some reason.
  2. Trace the gettimeofday() and related syscall paths, to see if there's some difference there: E.g., errors.
But as I summarized in my [What is Observability] post, the term observability can be a reminder not to get stuck on that one type of analysis. Here's some experimental approaches I could also explore:
  1. Disable tracesys/syscall_trace.
  2. Microbenchmark os::javaTimeMillis() on both systems.
  3. Try changing the kernel clocksource.
As (C) looked like a kernel rebuild, I started with (D) and (E). ## 5. Measuring the speed of time Is there already a microbenchmark for os::javaTimeMillis()? This would help confirm that these calls really were slower on Ubuntu. I couldn't find such a microbenchmark so I wrote something simple. I'm not going to try to make before and after time calls to time the duration in time (which I guess would work if you factored in the extra time in the timing calls). Instead, I'm just going to call time millions of times in a loop and time how long it takes (sorry, that's two many different usages of the word "time" in one paragraph): $ cat public class TimeBench { public static void main(String[] args) { for (int i = 0; i < 100 * 1000 * 1000; i++) { long t0 = System.currentTimeMillis(); if (t0 == 87362) { System.out.println("Bingo"); } } } } This does 100 million calls of currentTimeMillis(). I then executed it via the shell time(1) command to give an overall runtime for those 100 million calls. (There's also a test and println() in the loop to, hopefully, convince the compiler not to optimize-out an otherwise empty loop. This will slow this test a little.) Trying it out: centos$ time java TimeBench real 0m12.989s user 0m3.368s sys 0m18.561s ubuntu# time java TimeBench real 1m8.300s user 0m38.337s sys 0m29.875s How long is each time call? Assuming the loop is dominated by the time call, it works out to be about 0.13 us on Centos and 0.68 us on Ubuntu. Ubuntu is 5x slower. As I'm interested in the relative comparison I can just compare the total runtimes (the "real" time) for the same result. I also rewrote this in C and called gettimeofday(2) directly: $ cat gettimeofdaybench.c #include <sys/time.h> int main(int argc, char *argv[]) { int i, ret; struct timeval tv; struct timezone tz; for (i = 0; i < 100 * 1000 * 1000; i++) { ret = gettimeofday(&tv, &tz); } return (0); } I compiled this with -O0 to avoid dropping the loop. Running this on the two systems saw similar results. I love short benchmarks like this as I can disassemble the resulting binary and ensure that the compiled instructions match my expectations, and the compiler hasen't messed with it. ## 6. clocksource Experimentation My second experiment was to change the clocksource. Checking those available: $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource xen tsc hpet acpi_pm $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource xen Ok, so it's defaulted to xen, which we saw in the flame graph (the tower ends with pvclock_clocksource_read()). Let's try tsc, which should be the fastest: # echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource tsc $ time java TimeBench real 0m3.370s user 0m3.353s sys 0m0.026s The change is immediate, and my Java microbenchmark is now running over 20x faster than before! (And nearly 4x faster than on CentOS.) Now that it's reaching 33 ns, the loop instructions are likely inflating this result. If I wanted more accuracy, I'd partially [unroll the loop] so that the loop instructions become negligible. ## 6. Workaround The time stamp counter (TSC) clocksource is fast as it retrieves time using just an RDTSC instruction, and with vDSO it can do this without the syscall. TSC traditionally was not the default because of concerns about time drift. Software-based clocksources could fix those issues and provide accurate monotonically-increasing time. I happened to be speaking at a technical confering while still debugging this, and mentioned what I was working on to a processor engineer. He said that tsc had been stable for years, and any advise about avoiding it was old. I asked if he knew of a public reference saying so, but he didn't. That chance encounter, coupled with the Netflix's fault-tolerant cloud, gave me enough confidence to suggest trying tsc in production as a workaround for the issue. The change was obvious in the production graphs, showing a drop in write latencies: Once tested more broadly, it showed the write latencies dropped by 43%, delivering slightly better performance than on CentOS. The CPU flame graph for Ubuntu now looked like: os::javaTimeMillis() was now 1.6% in total. Note that it now enters "[[vdso]]" and nothing more: No kernel calls above it. ## 7. Aftermath I provided details to AWS and Canonical, and then moved onto the other performance issues as part of the migration. A colleague, Mike Huang, also hit this for a different service at Netflix and enabled tsc. We ended up setting it in the BaseAMI for all cloud services. Later that year (2014), Anthony Liguori from AWS gave a [re:Invent talk] recommending users switch the clocksource to tsc to improve performance. I also shared setting the clocksource in my talks and in my 2015 [Linux tunables] post. Over the years, more and more articles have been published about clocksource in virtual machines, and it's now a well-known issue. Amazon even provides an official [recommendation] \(2021\): "For EC2 instances launched on the AWS Xen Hypervisor, it's a best practice to use the tsc clock source. Other EC2 instance types, such as C5 or M5, use the AWS Nitro Hypervisor. The recommended clock source for the AWS Nitro Hypervisor is kvm-clock." As this indicates, things have changed with [Nitro] where clocksources are much faster (thanks!). In 2019 myself and others tested kvm-clock and found it was only about 20% slower than tsc. That's much better than the xen clocksource, but still slow enough to resist switching over absent a reason (such as an reemergence of tsc clock drift issues). I'm not sure if Intel ever published something to clarify tsc stability on newer processors. If you know they did, please drop a comment. The JMH benchmark suite can also now test System.currentTimeMillis(), so it's no longer necessary to roll your own (unless you want to dissassemble it, in which case it's easier to have something short and simple). As for tracesys: I investigated the overhead for other syscalls and found it to be negligible, and before I returned to work on it further the kernel code paths changed and it was no longer present in the stacks. Did that Ubuntu release have a misconfiguration of auditing that was later fixed? I like to get to the rock bottom of issues, so it was a bit unsatisfying that the problem went away before I did. Even if I did figure it out, we'd still have preferred to go with tsc instead of the xen clocksource for the 4x improvement. ## 8. Summary Reading time itself can become a bottleneck for some clocksources. This was much worse many years ago on Xen virtual machine guests. For Linux I've been recommending the faster tsc clocksource for years, altough I'm not a processor vendor so I can't make assurances about tsc issues of clock drift. At least AWS have now included it in their recommendations. Also, while I often post about superpower tracing tools, sometimes some humble hacking is best. In this case it was a couple of ad hoc microbenchmarks, only several lines of code each. Any time you're investigating performance of some small discrete system component, consider finding or rolling your own microbenchmark to get more information on it experimentally. You have two hands: observation and experimentation. [both hands]: /blog/2021-05-23/what-is-observability.html [What is Observability]: /blog/2021-05-23/what-is-observability.html [Ftrace version]: [CPU flame graphs]: /FlameGraphs/cpuflamegraphs.html [Linux performance tools]: /linuxperf.html [Linux tunables]: /blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html [60s checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf [re:Invent talk]: [recommendation]: [Nitro]: /blog/2017-11-29/aws-ec2-virtualization-2017.html [unroll the loop]: /blog/2014-04-26/the-noploop-cpu-benchmark.html

Linux Plumbers Conference: Get ready for LPC 2021!

1 hét 3 nap óta

The LPC 2021 conference is just around the corner. We wanted to share the logistics on how to participate and watch the virtual conference.

For those that are not registered for the conference, we will have live streaming of the sessions on YouTube, like last year. This is free of charge. We will provide the URLs where to watch each day, on this page. The only limitation is that you cannot participate and ask questions live with audio. However this year we will have the chat in each Big Blue Button room also available externally via the Matrix open communication network. Anyone is invited to join with their personal Matrix account.

Those who are registered for the conference will be able to log into our Big Blue Button server through our front end page, starting Monday September 20 at 7:00AM US Pacific time.
To log in to BBB, please go to You will find a front end showing the schedule for the current day with all the active sessions you can join. Your credentials are the email address you used for registration, and the confirmation code you received in email when you registered. Please make sure you have those available in advance of trying to log in.

Please review the LPC 2021 Participant Guide before you join the conference.

Looking forward to seeing you there!

Linux Plumbers Conference: Linux Plumbers Conference 2021 is Almost Here

1 hét 4 nap óta

We are only three days away from the start of LPC 2021!

Thank you to all that made our conference possible:
– Our generous Sponsors, listed here on the right
– The Linux Foundation, which provides as always impeccable support
– Our speakers and leaders, who are providing a lot of great content and planning great discussions

As you can see, the schedule is finalized now. There are going to be seven parallel tracks each day, lasting four hours each. We have a total of 23 different tracks and Microconferences, with 191 sessions.

At this time we are closing the CfPs for all tracks. We have still room for a limited number of Birds of a Feather sessions. If you want to propose one, even during the conference, and the necessary participants are all registered, please send an email to our mailing list.

Take a look at all the great technical content at this year virtual LPC.
You can view the schedule by main blocks , or by track, or as a complete detailed view.

Note that at the end of the first day we’ll have a plenary keynote by Jon “maddog” Hall.
Additionally, at the end of the last day we’ll have a plenary session as a wrap up for this year conference.

The conference will be entirely virtual, offered on a completely free and open software stack.

We look forward to five days filled with great discussions, and we hope that LPC 2021 will provide once again a creative and productive environment where ideas can be exchanged and problems tackled. Many great ideas have sprung in the past from these meetings, driving innovation in the Linux plumbing layer!

Pete Zaitcev: Scalability of a varying degree

2 hét óta

Seen at official site of Qumulo:


Platforms must be able to serve petabytes of data, billions of files, millions of operations, and thousands of users.

Thousands of users...? Isn't it a little too low? Typical Swift clusters in Telcos have tens of millions of users, of which tens or hundreds of thousands are active simultaneously.

Google's Chumby paper has a little section on scalability problem with talking to a cluster over TCP/IP. Basically at low tens of thousands you're starting to have serious issues with kernel sockets and TIME_WAIT. So maybe that.

Paul E. Mc Kenney: Stupid RCU Tricks: Making Race Conditions More Probable

2 hét óta
Given that it is much more comfortable chasing down race conditions reported by rcutorture than those reported from the field, it would be good to make race conditions more probable during rcutorture runs than in production. A number of tricks are used to make this happen, including making rare events (such as CPU-hotplug operations) happen more frequently, testing the in-kernel RCU API directly from within the kernel, and so on.

Another approach is to change timing. Back at Sequent in the 1990s, one way that this was accomplished was by plugging different-speed CPUs into the same system and then testing on that system. It was observed that for certain types of race conditions, the probability of the race occurring increased by the ratio of the CPU speeds. One such race condition is when a timed event on the slow CPU races with a workload-driven event on the fast CPU. If the fast CPU is (say) two times faster than the slow CPU, then the timed event will provide two times greater “collision cross section” than if the same workload was running on CPUs running at the same speed.

Given that modern CPUs can easily adjust their core clock rates at runtime, it is tempting to try this same trick on present-day systems. Unfortunately, everything and its dog is adjusting CPU clock rates for various purposes, plus a number of modern CPUs are quite happy to let you set their core clock rates to a value sufficient to result in physical damage. Throwing rcutorture into this fray might be entertaining, but it is unlikely to be all that productive.

Another approach is to make use of memory latency. The idea is for the rcutorture scripting to place one pair of a given scenario's vCPUs in the hyperthreads of a single core and to place another pair of that same scenario's vCPUs in the hyperthreads of a different single core, and preferably a core on some other socket. The theory is that the different communications latencies and bandwidths within a core on the one hand and between cores (or, better yet, between sockets) on the other should have roughly the same effect as does varying CPU core clock rates.

OK, theory is all well and good, but what happens in practice?

As it turns out, on dual-socket systems, quite a bit.

With this small change to the rcutorture scripting, RCU Tasks Trace suddenly started triggering assertions. These test failures led to no fewer than 12 fixes, perhaps most notably surrounding proper handling of the count of tasks from which quiescent states are needed. This caused me to undertake a full review of RCU Tasks Trace, greatly assisted by Boqun Feng, Frederic Weisbecker, and Neeraj Upadhyay, with Neeraj providing half of the fixes. There is likely to be another fix or three, but then again isn't that always the case?

More puzzling were the 2,199.0-second RCU CPU stall warnings (described in more detail here). These were puzzling for a number of reasons:

  1. The RCU CPU stall warning timeout is set to only 21 seconds.
  2. There was absolutely no console output during the full stall duration.
  3. The stall duration was never 2,199.1 seconds and never 2,198.9 seconds, but always exactly 2,199.0 seconds, give or take a (very) few tens of milliseconds.
  4. The stalled CPU usually took only a handful of scheduling-clock interrupts during the stall, but would sometimes take them at a rate of 100,000 per second, which seemed just a bit excessive for a kernel built with HZ=1000.
  5. At the end of the stall, the kernel happily continued, usually with no other complaints.
These stall warnings appeared most frequently when running rcutorture's TREE04 scenario.

But perhaps this is not a stall, but instead a case of time jumping forward. This might explain the precision of the stall duration, and would definitely explain the lack of intervening console output, the lack of other complaints, and the kernel's being happy to continue at the end of the stall. Not so much the occasional extreme rate of scheduling-clock interrupts, but perhaps that is a separate problem.

However, running large numbers (as in 200) of concurrent shorter one-hour TREE04 runs often resulted in the run terminating (forcibly) in the middle of the stall. Now this might be due to the host's and the guests' clocks all jumping forward at the same time, except that different guests stalled at different times, and even when running TREE04, most guests didn't stall at all. Therefore, the stalls really did stall, and for a very long time.

But then it should be possible to work out what the CPUs were doing in the meantime. One approach would be to use tracing, but previous experience with massive volumes of trace messages (and thus lost trace messages) suggested a more surgical approach. Furthermore, the last console message before the stall was always of the form “kvm-clock: cpu 3, msr d4a80c1, secondary cpu clock” and the first console message after the stall was always of the form “kvm-guest: stealtime: cpu 3, msr 1f597140”. These are widely separated and and are often printed from different CPUs, which also suggests a more surgical approach. This situation also implicates CPU hotplug, but this is not at all unusual.

The first attempt at exploratory surgery used the jiffies counter to check for segments of code taking more than 100 seconds to complete. Unfortunately, these checks never triggered, even in runs having stall warnings. So maybe the jiffies counter is not being updated. It is easy enough to switch to ktime_get_mono_fast_ns(), right? Except that this did not trigger, either.

Maybe there is a long-running interrupt handler? Mark Rutland recently posted a patchset to detect exactly that, so I applied it. But it did not trigger.

I switched to ktime_get() in order to do cross-CPU time comparisons, and out of shear paranoia added checks for time going backwards. And these backwards-time checks really did trigger just before the stall warnings appeared, once again demonstrating the concurrent-programming value of a healthy level paranoia, and also explaining why many of my earlier checks were not triggering. Time moved forward, and then jumped backwards, making it appear that no time had passed. (Time did jump forward again, but that happened after the last of my debug code had executed.)

Adding yet more checks showed that the temporal issues were occurring within stop_machine_from_inactive_cpu(). This invocation takes the mtrr_rendezvous_handler() function as an argument, and it really does take 2,199.0 seconds (that is, about 36 minutes) from the time that stop_machine_from_inactive_cpu() is called until the time that mtrr_rendezvous_handler() is called. But only sometimes.

Further testing confirmed that increasing the frequency of CPU-hotplug operations increased the frequency of 2,199.0-second stall warnings.

A extended stint of code inspection suggested further diagnostics, which showed that one of the CPUs would be stuck in the multi_cpu_stop() state machine. The stuck CPU was never CPU 0 and was never the incoming CPU. Further tests showed that the scheduler always thought that all of the CPUs, including the stuck CPU, were in the TASK_RUNNING state. Even more instrumentation showed that the stuck CPU was failing to advance to state 2 (MULTI_STOP_DISABLE_IRQ), meaning that all of the other CPUs were spinning in a reasonably tight loop with interrupts disabled. This could of course explain the lack of console messages, at least from the non-stuck CPUs.

Might qemu and KVM be to blame? A quick check of the code revealed that vCPUs are preserved across CPU-hotplug events, that is, taking a CPU offline does not cause qemu to terminate the corresponding user-level thread. Furthermore, the distribution of stuck CPUs was uniform across the CPUs other than CPU 0. The next step was to find out where CPUs were getting stuck within the multi_cpu_stop() state machine. The answer was “at random places”. Further testing also showed that the identity of the CPU orchestrating the onlining of the incoming CPU had nothing to do with the problem.

Now TREE04 marks all but CPU 0 as nohz_full CPUs, meaning that they disable their scheduling-clock interrupts when running in userspace when only one task is runnable on that CPU. Maybe the CPUs need to manually enable their scheduling-clock interrupt when starting multi_cpu_stop()? This did not fix the problem, but it did manage to shorten some of the stalls, in a few cases to less than ten minutes.

The next trick was to send an IPI to the stalled CPU every 100 seconds during multi_cpu_stop() execution. To my surprise, this IPI was handled by the stuck CPU, although with surprisingly long delays ranging from just a bit less than one millisecond to more than eight milliseconds.

This suggests that the stuck CPUs might be suffering from an interrupt storm, so that the IPI had to wait for its turn among a great many other interrupts. Further testing therefore sent an NMI backtrace at 100 seconds into multi_cpu_stop() execution. The resulting stack traces showed that the stuck CPU was always executing within sysvec_apic_timer_interrupt() or some function that it calls. Further checking showed that the stuck CPU was in fact suffering from an interrupt storm, namely an interrupt storm of scheduling-clock interrupts. This spurred another code-inspection session.

Subsequent testing showed that the interrupt duration was about 3.5 microseconds, which corresponded to about one third of the stuck CPU's time. It appears that the other two-thirds is consumed repeatedly entering and exiting the interrupt.

The retriggering of the scheduling-clock interrupt does have some potential error conditions, including setting times in the past and various overflow possibilities. Unfortunately, further diagnostics showed that none of this was happening. However, they also showed that the code was trying to schedule the next interrupt at time KTIME_MAX, so that an immediate relative-time-zero interrupt is a rather surprising result.

So maybe this confusion occurs only when multi_cpu_stop() preempts some timekeeping activity. Now TREE04 builds its kernels with CONFIG_PREEMPT=n, but maybe there is an unfortunately placed call to schedule() or some such. Except that further code inspection found no such possibility. Furthermore, another test run that dumped the previous task running on each CPU showed nothing suspicious (aside from rcutorture, which some might argue is always suspicious).

And further debugging showed that tick_program_event() thought that it was asking for the scheduling-clock interrupt to be turned off completely. This seemed like a good time to check with the experts, and Frederic Weisbecker, noting that all of the action was happening within multi_cpu_stop() and its called functions, ran the following command to enlist ftrace, while also limiting its output to something that the console might reasonably keep up with:

./ --configs "18*TREE04" --allcpus --bootargs "ftrace=function_graph ftrace_graph_filter=multi_cpu_stop" --kconfig "CONFIG_FUNCTION_TRACER=y CONFIG_FUNCTION_GRAPH_TRACER=y"
This showed that there was no hrtimer pending (consistent with KTIME_MAX), and that the timer was nevertheless being set to fire immediately. Frederic then proposed the following small patch:

--- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -595,7 +595,8 @@ void irq_enter_rcu(void)  {         __irq_enter_raw();   -       if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)) +       if (tick_nohz_full_cpu(smp_processor_id()) || +           (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))                 tick_irq_enter();           account_hardirq_enter(current);
This forces the jiffies counter to be recomputed upon interrupt from nohz_full CPUs in addition to idle CPUs, which avoids the timekeeping confusion that caused KTIME_MAX to be interpreted as zero.

And a 20-hour run for each of 200 instances of TREE04 was free of RCU CPU stall warnings! (This represents 4,000 hours of testing consuming 32,000 CPU-hours.)

This was an example of that rare form of deadlock, a temporary deadlock. The stuck CPU was stuck because timekeeping wasn't happening. Timekeeping wasn't happening because all the timekeeping CPUs were spinning in multi_cpu_stop() with interrupts disabled. The other CPUs could not exit their spinloops (and thus could not update timekeeping information) because the stuck CPU did not advance through the multi_cpu_stop() state machine.

So what caused this situation to be temporary? I must confess that I have not dug into it (nor do I intend to), but my guess is that integer overflow resulted in KTIME_MAX once again taking on its proper role, thus ending the stuck CPU's interrupt storm and in turn allowing the multi_cpu_stop() state machine to advance.

Nevertheless, this completely explains the mystery. Assuming integer overflow, the extremely repeatable stall durations make perfect sense. The RCU CPU stall warning did not happen at the expected 21 seconds because all the CPUs were either spinning with interrupts disabled on the one hand or being interrupt stormed on the other. The interrupt-stormed CPU did not report the RCU CPU stall because the jiffies counter wasn't incrementing. A random CPU would report the stall, depending on which took the first scheduling-clock tick after time jumped backwards (again, presumably due to integer overflow) and back forwards. In the relatively rare case where this CPU was the stuck CPU, it reported an amazing number of scheduling clock ticks, otherwise very few. Since everything was stuck, it is only a little surprising that the kernel continued blithely on after the stall ended. TREE04 reproduced the problem best because it had the largest proportion of nohz_full CPUs.

All in all, this experience was a powerful (if sometimes a bit painful) demonstration of the ability of controlled memory latencies to flush out rare race conditions!

Brendan Gregg: ZFS Is Mysteriously Eating My CPU

3 hét 2 nap óta
A microservice team asked me for help with a mysterious issue. They claimed that the ZFS file system was consuming 30% of CPU capacity. I summarized this case study at [Kernel Recipes] in 2017; it is an old story that's worth resharing here. ## 1. Problem Statement The microservice was for metrics ingestion and had recently updated their base OS image (BaseAMI). After doing so, they claimed that ZFS was now eating over 30% of CPU capacity. My first thought was that they were somehow mistaken: I worked on ZFS internals at Sun Microsystems, and unless it is badly misconfigured there's no way it can consume that much CPU. I have been surprised many times by unexpected performance issues, so I thought I should check their instances anyway. At the very least, I could show that I took it seriously enough to check it myself. I should also be able to identify the real CPU consumer. ## 2. Monitoring I started with the cloud-wide monitoring tool, [Atlas], to check high-level CPU metrics. These included a breakdown of CPU time into percentages for "usr" (user: applications) and "sys" (system: the kernel). I was surprised to find a whopping 38% of CPU time was in sys, which is highly unusual for cloud workloads at Netflix. This supported the claim that ZFS was eating CPU, but how? Surely this is some other kernel activity, and not ZFS. ## 3. Next Steps I'd usually SSH to instances for deeper analysis, where I could use mpstat(1) to confirm the usr/sys breakdown and perf(1) to begin profiling on-CPU kernel code paths. But since Netflix has tools (previously [Vector], now FlameCommander) that allow us to easily fetch flame graphs from our cloud deployment UI, I thought I'd jump to the chase. Just for illustration, this shows the Vector UI and a typical cloud flame graph: Note that this sample flame graph is dominated by Java, shown by the green frames. ## 4. Flame Graph Here's the CPU flame graph from one of the problem instances: The kernel CPU time pretty obvious, shown as two orange towers on the left and right. (The other colors are yellow for C++, and red for other user-level code.) Zooming in to the left kernel tower: This is arc_reclaim_thread! I worked on this code back at Sun. So this really is ZFS, they were right! The ZFS Adapative Replacement Cache (ARC) is the main memory cache for the file system. The arc_reclaim_thread runs arc_adjust() to evict memory from the cache to keep it from growing too large, and to maintain a threshold of free memory that applications can quickly use. It does this periodically or when woken up by low memory conditions. In the past I've seen the arc_reclaim_thread eat too much CPU when a tiny file system record size was used (e.g., 512 bytes) creating millions of tiny buffers. But that was basically a misconfiguration. The default size is 128 Kbytes, and people shouldn't be tuning below 8 Kbytes. The right kernel tower enters spl_kmem_cache_reap_now(), another ZFS memory freeing function. I imagine this is related to the left tower (e.g., contending for the same locks). But first: Why is ZFS in use? ## 5. Configuration There was only one use of ZFS so far at Netflix that I knew of: A new infrastructure project was using ZFS for containers. This led me to a theory: If they were quickly churning through containers, they would also be churning through ZFS file systems, and this might mean that many old pages needed to be cleaned out of the cache. Ahh, it makes sense now. I told them my theory, confident I was on the right path. But they replied: "We aren't using containers." Ok, then how _are_ you using ZFS? I did not expect their answer: "We aren't using ZFS." What!? Yes you are, I can see the arc_reclaim_thread in the flame graph. It doesn't run for fun! It's only on CPU because it's evicting pages from the ZFS ARC. If you aren't using ZFS, there are no pages in the ARC, so it shouldn't run. They were confident that they weren't using ZFS at all. The flame graph defied logic. I needed to prove to them that they were indeed using ZFS somehow, and figure out why. ## 6. cd & ls I should be able to debug this using nothing more than the cd and ls(1) commands. cd to the file system and ls(1) to see what's there. The file names should be a big clue as to its use. First, finding out where the ZFS file systems are mounted: df -h mount zfs list This showed nothing! No ZFS file systems were currently mounted. I tried another instance and saw the same thing. Huh? Ah, but containers may have been created previously and since destroyed, hence no remaining file systems now. How can I tell if ZFS has ever been used? ## 7. arcstats I know, arcstats! The kernel counters that track ZFS statistics, including ARC hits and misses. Viewing them: # cat /proc/spl/kstat/zfs/arcstats name type data hits 4 0 misses 4 0 demand_data_hits 4 0 demand_data_misses 4 0 demand_metadata_hits 4 0 demand_metadata_misses 4 0 prefetch_data_hits 4 0 prefetch_data_misses 4 0 prefetch_metadata_hits 4 0 prefetch_metadata_misses 4 0 mru_hits 4 0 mru_ghost_hits 4 0 mfu_hits 4 0 mfu_ghost_hits 4 0 deleted 4 0 mutex_miss 4 0 evict_skip 4 0 evict_not_enough 4 0 evict_l2_cached 4 0 evict_l2_eligible 4 0 [...] Unbelievable! All the counters were zero! ZFS really wasn't in use, ever! But at the same time, it was eating over 30% of CPU capacity! Whaaat?? The customer had been right all along. ZFS was straight up eating CPU, and for no reason. How can a file system _that's not in use at all_ consume 38% CPU? I'd never seen this before. This was a mystery. ## 8. Code Analysis I took a closer look at the flame graph and the paths involved, and saw that the CPU code paths led to get_random_bytes() and extract_entropy(). These were new to me. Browsing the [source code] and change history I found the culprit. The ARC contains lists of cached buffers for different memory types. A performance feature ("multilist") had been added that split the ARC lists into one per CPU, to reduce lock contention on multi-CPU systems. Sounds good, as that should improve performance. But what happens when you want to evict some memory? You need to pick one of those CPU lists. Which one? You could go through them in a round-robin fashion, but the developer thought it better to pick one at random. _Cryptographically-secure random._ The kicker was that ZFS wasn't even in use. The ARC was detecting low system memory and then adjusting its size accordingly, at which point it'd discover it was zero size already and didn't need to do anything. But this was after randomly selecting a zero-sized list, using a CPU-expensive random number generator. I filed this as ZFS [issue #6531]. I believe the first fix was to have the arc_reclaim_thread bail out earlier if ZFS wasn't in use, and not enter list selection. The ARC has since had many changes, and I haven't heard of this issue again. [Kernel Recipes]: [Kernel Recipes2]: [talks]: /index.html#Videos [issue #6531]: [source code]: [Atlas]: [Vector]:

Brendan Gregg: Analyzing a High Rate of Paging

4 hét 2 nap óta
These are rough notes. A service team was debugging a performance issue and noticed it coincided with a high rate of paging. I was asked to help, and used a variety of Linux performance tools to solve this including those that use eBPF and Ftrace. This is a rough post to share this old but good case study of using these tools, and to help justify their further development. No editing, spell checking, or comments. Mostly screenshots. ## 1. Problem Statement The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes. A cloud-wide monitoring tool, Atlas, showed a high rate of paging for the larger file uploads: The blue is pageins (page ins). Pageins are a type of disk I/O where a page of memory is read in from disk, and are normal for many workloads. You might be able to guess the issue from the problem statement alone. ## 2. iostat Starting with my [60-second performance checklist], the iostat(1) tool showed a high rate of disk I/O during a large file upload: # iostat -xz 1 Linux 4.4.0-1072-aws (xxx) 12/18/2018 _x86_64_ (16 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 5.03 0.00 0.83 1.94 0.02 92.18 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.29 0.21 0.17 6.29 3.09 49.32 0.00 12.74 6.96 19.87 3.96 0.15 xvdb 0.00 0.08 44.39 9.98 5507.39 1110.55 243.43 2.28 41.96 41.75 42.88 1.52 8.25 avg-cpu: %user %nice %system %iowait %steal %idle 14.81 0.00 1.08 29.94 0.06 54.11 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 745.00 0.00 91656.00 0.00 246.06 25.32 33.84 33.84 0.00 1.35 100.40 avg-cpu: %user %nice %system %iowait %steal %idle 14.86 0.00 0.89 24.76 0.06 59.43 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 739.00 0.00 92152.00 0.00 249.40 24.75 33.49 33.49 0.00 1.35 100.00 avg-cpu: %user %nice %system %iowait %steal %idle 14.95 0.00 0.89 28.75 0.06 55.35 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 734.00 0.00 91704.00 0.00 249.87 24.93 34.04 34.04 0.00 1.36 100.00 avg-cpu: %user %nice %system %iowait %steal %idle 14.54 0.00 1.14 29.40 0.06 54.86 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 750.00 0.00 92104.00 0.00 245.61 25.14 33.37 33.37 0.00 1.33 100.00 ^C I'm looking at the r_await column in particular: the average wait time for reads in milliseconds. Reads usually have apps waiting on them; writes may not (write-back caching). An r_wait of 33 ms is kinda high, and likely due to the queueing (avgqu-sz). They are largeish I/O, about 128 Kbytes (divide rkB/s by r/s). ## 3. biolatency From [bcc], this eBPF tool shows a latency histogram of disk I/O. I'm running it in case the averages are hiding outliers, which could be a device issue: # /usr/share/bcc/tools/biolatency -m Tracing block device I/O... Hit Ctrl-C to end. ^C msecs : count distribution 0 -> 1 : 83 | | 2 -> 3 : 20 | | 4 -> 7 : 0 | | 8 -> 15 : 41 | | 16 -> 31 : 1620 |******* | 32 -> 63 : 8139 |****************************************| 64 -> 127 : 176 | | 128 -> 255 : 95 | | 256 -> 511 : 61 | | 512 -> 1023 : 93 | | This doesn't look too bad. Most I/O are between 16 and 127 ms. Some outliers reaching the 0.5 to 1.0 second range, but again, there's quite a bit of queueing here seen in the earlier iostat(1) output. I don't think this is a device issue. I think it's the workload. ## 4. bitesize As I think it's a workload issue, I want a better look at the I/O sizes in case there's something odd: # /usr/share/bcc/tools/bitesize Tracing... Hit Ctrl-C to end. ^C Process Name = java Kbytes : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 31 | | 16 -> 31 : 15 | | 32 -> 63 : 15 | | 64 -> 127 : 15 | | 128 -> 255 : 1682 |****************************************| The I/O is mostly in the 128 to 255 Kbyte bucket, as expected from the iostat(1) output. Nothing odd here. ## 5. free Also from the 60-second checklist: # free -m total used free shared buff/cache available Mem: 64414 15421 349 5 48643 48409 Swap: 0 0 0 There's not much memory left, 349 Mbytes, but more interesting is the amount in the buffer/page cache: 48,643 Mbytes (48 Gbytes). This is a 64-Gbyte memory system, and 48 Gbytes is in the page cache (the file system cache). Along with the numbers from the problem statement, this gives me a theory: Do the 100-Gbyte files bust the page cache, whereas 40-Gbyte files fit? ## 6. cachestat [cachestat] is an experimental tool I developed that uses Ftrace and has since been ported to bcc/eBPF. It shows statistics for the page cache: # /apps/perf-tools/bin/cachestat Counting cache functions... Output every 1 seconds. HITS MISSES DIRTIES RATIO BUFFERS_MB CACHE_MB 1811 632 2 74.1% 17 48009 1630 15132 92 9.7% 17 48033 1634 23341 63 6.5% 17 48029 1851 13599 17 12.0% 17 48019 1941 3689 33 34.5% 17 48007 1733 23007 154 7.0% 17 48034 1195 9566 31 11.1% 17 48011 [...] This shows many cache misses, with a hit ratio varying between 6.5 and 74%. I usually like to see that in the upper 90's. This is "cache busting." The 100 Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance. The quickest fix is to move to a larger-memory instance that does fit 100 Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file). ## 7. Smaller File Test For further confirmation, I collected the same outputs for a 32 Gbyte file upload. cachestat shows a ~100% cache hit ratio: # /apps/perf-tools/bin/cachestat Counting cache functions... Output every 1 seconds. HITS MISSES DIRTIES RATIO BUFFERS_MB CACHE_MB 61831 0 126 100.0% 41 33680 53408 0 78 100.0% 41 33680 65056 0 173 100.0% 41 33680 65158 0 79 100.0% 41 33680 55052 0 107 100.0% 41 33680 61227 0 149 100.0% 41 33681 58669 0 71 100.0% 41 33681 33424 0 73 100.0% 41 33681 ^C This smaller size allows the service to process the file (however many passes it takes) entirely from memory, without re-reading it from disk. free(1) shows it fitting in the page cache: # free -m total used free shared buff/cache available Mem: 64414 18421 11218 5 34773 45407 Swap: 0 0 0 And iostat(1) shows little disk I/O, as expected: # iostat -xz 1 Linux 4.4.0-1072-aws (xxx) 12/19/2018 _x86_64_ (16 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 12.25 0.00 1.24 0.19 0.03 86.29 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvda 0.00 0.32 0.31 0.19 7.09 4.85 47.59 0.01 12.58 5.44 23.90 3.09 0.15 xvdb 0.00 0.07 0.01 11.13 0.10 1264.35 227.09 0.91 82.16 3.49 82.20 2.80 3.11 avg-cpu: %user %nice %system %iowait %steal %idle 57.43 0.00 2.95 0.00 0.00 39.62 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util avg-cpu: %user %nice %system %iowait %steal %idle 53.50 0.00 2.32 0.00 0.00 44.18 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util xvdb 0.00 0.00 0.00 2.00 0.00 19.50 19.50 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 39.02 0.00 2.14 0.00 0.00 58.84 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util [...] ## 8. Final Notes cachestat was the killer tool here, but I should stress that it's still experimental. I wrote it for Ftrace with the constraints that it must be low-overhead and use the Ftrace function profiler only. As I mentioned in my LSFMMBPF 2019 keynote in Puerto Rico, where the Linux mm kernel engineers were present, I think the cachestat statistics are so commonly needed that they should be in /proc and not need my experimental tool. They pointed out challenges with providing them properly, and I think any robust solution is going to need their help and expertise. I hope this case study helps show why it is useful and worth the effort. Until the kernel does support page cache statistics (which may be never: they are hot-path, so adding counters isn't free) we can use my cachestat tool, even if it does require frequent maintenance to keep working. [cachestat]: /blog/2014-12-31/linux-page-cache-hit-ratio.html [60-second performance checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf [bcc]:

Michael Kerrisk (manpages): man-pages-5.13 released

1 hónap óta
Alex Colomar and I have released released man-pages-5.13. The release tarball is available on The browsable online pages can be found on The Git repository for man-pages is available on

This release resulted from patches, bug reports, reviews, and comments from 40 contributors. The release includes around 200 commits that changed around 120 manual pages.

The most notable of the changes in man-pages-5.13 are the following:
  • A new mount_setattr(2) page, written by Christian Brauner, that documents the mount_setattr() system call added in Linux 5.12.
  • I added documentation of the /proc/PID/projid_map file to the user_namespaces(7) page.
  • I added a greatly expanded description of the concept of locked mounts in the mount_namespaces(7) page.

Special thanks again to Alex, who kept track of a lot of patches while I was unavailable.

Brendan Gregg: Slack's Secret STDERR Messages

1 hónap óta
These are rough notes. I run the Slack messaging application on Ubuntu Linux, and it recently started mysteriously crashing. I'd Alt-Tab and find it was no longer there. No error message, no dialog, just gone. It usually happened when locking and unlocking the screen. A quick internet search revealed nothing. These are rough notes for how I debugged it, in case it's useful for someone searching on this topic. I spend many hours documenting advanced debugging stories for books, talks, and blog posts, but many things I never have time to share. I'm experimenting with this "rough notes" format as a way to quickly share things. No editing, spell checking, or comments. Mostly screenshots. Dead ends included. Note that I don't know anything about Slack internals, and there may be better ways to solve this. ## 1. Enabling core dumps I'm guessing it's core dumping and Ubuntu's apport is eating them. Redirecting them to the file system so I can then do core dump analysis using gdb(1), as root: # cat /proc/sys/kernel/core_pattern |/usr/share/apport/apport %p %s %c %d %P # mkdir /var/cores # echo "/var/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern [...another crash...] # ls /var/cores # This didn't work: No core file showed up. I may need to increase the core file size ulimits for Slack, but that might mean mucking around with its startup scripts; I'll try some other tracing first. ## 2. exitsnoop Using an eBPF/bcc tool to look for exit reasons: # exitsnoop -t TIME-AEST PCOMM PID PPID TID AGE(s) EXIT_CODE 13:51:19.432 kworker/dying 3663305 2 3663305 1241.59 0 13:51:30.948 kworker/dying 3663626 2 3663626 835.76 0 13:51:33.296 systemd-udevd 3664149 2054939 3664149 3.55 0 13:53:09.256 kworker/dying 3662179 2 3662179 2681.15 0 13:53:25.636 kworker/dying 3663520 2 3663520 1122.60 0 13:53:30.705 grep 3664239 6009 3664239 0.08 0 13:53:30.705 ps 3664238 6009 3664238 0.08 0 13:53:40.297 slack 3663135 1786 3663135 1459.54 signal 6 (ABRT) 13:53:40.298 slack 3663208 3663140 3663208 1457.86 0 13:53:40.302 slack 3663140 1786 3663140 1459.18 0 13:53:40.302 slack 3663139 1786 3663139 1459.18 0 13:53:40.303 slack 3663171 1786 3663171 1458.22 0 13:53:40.317 slack 3663197 1786 3663197 1458.03 0 13:53:44.827 gdm-session-wor 3664269 1778 3664269 0.02 0 [...] This traced a Slack SIGABRT which happened around the same time as a crash. A strong lead. ## 3. killsnoop Running killsnoop (eBPF/bcc) to get more info: # killsnoop TIME PID COMM SIG TPID RESULT 13:45:01 2053366 systemd-journal 0 1024 0 13:45:01 2053366 systemd-journal 0 3663525 -3 13:45:01 2053366 systemd-journal 0 3663528 -3 13:49:00 2054939 systemd-udevd 15 3664053 0 13:51:33 2054939 systemd-udevd 15 3664149 0 13:53:44 2053366 systemd-journal 0 4265 -1 13:53:44 2053366 systemd-journal 0 972 0 13:53:44 2053366 systemd-journal 0 1778 0 13:53:44 2053366 systemd-journal 0 6414 -1 [...] A crash happened, but killsnoop(8) didn't see it. A quick look at the killsnoop(8) source reminded me that I wrote it back in 2015, which is practically ancient in eBPF years. Back then there wasn't tracepoint support yet so I was using kprobes for everything. Kprobes aren't a stable interface, which might be the problem. ## 4. perf trace Nowadays this can be done as a perf one-liner: # perf list syscalls:sys_enter_*kill List of pre-defined events (to be used in -e): syscalls:sys_enter_kill [Tracepoint event] syscalls:sys_enter_tgkill [Tracepoint event] syscalls:sys_enter_tkill [Tracepoint event] # perf trace -e 'syscalls:sys_enter_*kill' 15755.483 slack/3684015 syscalls:sys_enter_tgkill(tgid: 3684015 (slack), pid: 3684015 (slack), sig: ABRT) Ok, so there's our slack SIGABRT, sent via tgkill(2). (And I filed an issue to update bcc killsnoop(8) to use tracepoints.) This output doesn't really tell me much more about it though. I want to see a stack trace. I can use perf record or bpftrace...and that reminds me, didn't I write another signal tool using bpftrace? ## 5. The bpftrace tool from my BPF book traces the signal:signal_generate tracepoint, which should catch every type of generated signal, including tgkill(2). Trying it out: # bpftrace /home/bgregg/Git/bpf-perf-tools-book/originals/Ch13_Applications/ Attaching 3 probes... Counting signals. Hit Ctrl-C to end. ^C @[SIGNAL, PID, COMM] = COUNT @[SIGPIPE, 1883, Xorg]: 1 @[SIGCHLD, 1797, dbus-daemon]: 1 @[SIGINT, 3665167, bpftrace]: 1 @[SIGTERM, 3665198, gdm-session-wor]: 1 @[SIGCHLD, 3665197, gdm-session-wor]: 1 @[SIGABRT, 3664940, slack]: 1 @[SIGTERM, 3665197, gdm-session-wor]: 1 @[SIGKILL, 3665207, dbus-daemon]: 1 @[SIGWINCH, 859450, bash]: 2 @[SIGCHLD, 1778, gdm-session-wor]: 2 @[, 3665201, gdbus]: 2 @[, 3665199, gmain]: 2 @[SIGWINCH, 3665167, bpftrace]: 2 @[SIGWINCH, 3663319, vi]: 2 @[SIGCHLD, 1786, systemd]: 6 @[SIGALRM, 1883, Xorg]: 106 Ok, there's the SIGABRT for slack. (There's a new sigsnoop(8) tool for bcc that uses this tracepoint as well.) ## 6. Signal Stacks Moving from to a bpftrace one-liner: # bpftrace -e 't:signal:signal_generate /comm == "slack"/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }' Attaching 1 probe... 6, __send_signal+579 __send_signal+579 send_signal+221 do_send_sig_info+81 do_send_specific+110 do_tkill+171 __x64_sys_tgkill+34 do_syscall_64+73 entry_SYSCALL_64_after_hwframe+68 0x7f4a2e2e2f95 This was supposed to print both the kernel and user stack traces that led to the SIGABRT, but the user stack is broken, showing 0x7f4a2e2e2f95 only. Grumble. There's ways to fix this, but it usually gets time consuming, so let me try something else first. Logs! ## 7. Logs Does Slack have logs? I have no idea. Maybe they contain the error message. # lsof -p `pgrep -n slack` | grep -i log lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. lsof: WARNING: can't stat() fuse file system /run/user/1000/doc Output information may be incomplete. Ignore the lsof(8) warnings. There's no output here, nothing containing "log". Although I'm looking at the most recent process called "slack" and maybe that's the wrong one. # pstree -ps `pgrep -n slack` systemd(1)───systemd(1786)───slack(3666477)───slack(3666481)───slack(3666548)─┬─{slack}(3666549) ├─{slack}(3666551) ├─{slack}(3666552) ├─{slack}(3666553) ├─{slack}(3666554) ├─{slack}(3666555) ├─{slack}(3666556) ├─{slack}(3666557) ├─{slack}(3666558) ├─{slack}(3666559) ├─{slack}(3666560) ├─{slack}(3666564) ├─{slack}(3666566) ├─{slack}(3666568) └─{slack}(3666609) Ok, how about I try the great-grandparent, PID 3666477: # lsof -p 3666477 | grep -i log lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. lsof: WARNING: can't stat() fuse file system /run/user/1000/doc Output information may be incomplete. slack 3666477 bgregg 37r REG 253,1 32768 140468 /home/bgregg/.local/share/gvfs-metadata/home-8fd8d123.log slack 3666477 bgregg 40r REG 253,1 32768 131314 /home/bgregg/.local/share/gvfs-metadata/trash:-85854456.log slack 3666477 bgregg 71w REG 253,1 15566 1707316 /home/bgregg/.config/Slack/Local Storage/leveldb/013430.log slack 3666477 bgregg 72w REG 253,1 347 1704816 /home/bgregg/.config/Slack/Local Storage/leveldb/LOG slack 3666477 bgregg 73w REG 253,1 2324236 1718407 /home/bgregg/.config/Slack/logs/browser.log slack 3666477 bgregg 90w REG 253,1 363600 1713625 /home/bgregg/.config/Slack/Service Worker/Database/000004.log slack 3666477 bgregg 92w REG 253,1 274 1704249 /home/bgregg/.config/Slack/Service Worker/Database/LOG slack 3666477 bgregg 108w REG 253,1 4182513 1749672 /home/bgregg/.config/Slack/logs/webapp-service-worker-console.log slack 3666477 bgregg 116w REG 253,1 259 1704369 /home/bgregg/.config/Slack/Session Storage/LOG slack 3666477 bgregg 122w REG 253,1 31536 1749325 /home/bgregg/.config/Slack/Session Storage/000036.log slack 3666477 bgregg 126w REG 253,1 3970909 1704566 /home/bgregg/.config/Slack/logs/webapp-console.log slack 3666477 bgregg 127w REG 253,1 2330006 1748923 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/023732.log slack 3666477 bgregg 131w REG 253,1 330 1704230 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/LOG slack 3666477 bgregg 640r REG 253,1 32768 140378 /home/bgregg/.local/share/gvfs-metadata/root-7d269acf.log (deleted) Lots of logs in ~/config/Slack/logs! # cd ~/.config/Slack/logs # ls -lrth total 26M -rw-rw-r-- 1 bgregg bgregg 5.0M Aug 20 07:54 webapp-service-worker-console1.log -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 23 19:30 webapp-console2.log -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 25 16:07 webapp-console1.log drwxrwxr-x 2 bgregg bgregg 4.0K Aug 27 14:34 recorded-trace/ -rw-rw-r-- 1 bgregg bgregg 4.0M Aug 27 14:46 webapp-service-worker-console.log -rw-rw-r-- 1 bgregg bgregg 2.3M Aug 27 14:46 browser.log -rw-rw-r-- 1 bgregg bgregg 3.9M Aug 27 14:46 webapp-console.log Ok, so how about this one: # cat webapp-console.log [...] [08/27/21, 14:46:36:238] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.801 dnd.teamInfo is RESOLVED [08/27/21, 14:46:36:240] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.930 users.interactions.list is RESOLVED [08/27/21, 14:46:36:242] info: [DND] (TKZ41AXQD) Fetched DND info for the following member: ULG5H012L [08/27/21, 14:46:36:251] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L [08/27/21, 14:46:36:254] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Fetched DND info for the following members: UL0US3455 [08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L,UL0US3455 [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.10ms [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S9267BE over 2.60ms [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S9267BE over 18.60ms [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.00ms No, I don't see any errors jumping out at me. How about searching for errors: # egrep -i 'error|fail' webapp-console.log [08/25/21, 16:07:13:051] info: [DESKTOP-SIDE-EFFECT] (TKZ41AXQD) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:13:651] info: [DESKTOP-SIDE-EFFECT] (T7GLTMS0P) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:14:249] info: [DESKTOP-SIDE-EFFECT] (T0DS04W11) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:14:646] info: [DESKTOP-SIDE-EFFECT] (T0375HBGA) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [...] # egrep -i 'error|fail' browser.log [07/16/21, 08:18:27:621] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/16/21, 08:18:27:653] error: Failed to load empty window url in window "error": { "stack": "Error: ERR_ABORTED (-3) loading 'about:blank'\n at rejectAndCleanup (electron/js2c/browser_init.js:217:1457)\n at Object.navigationListener (electron/js2c/browser_init.js:217:1763)\n at Object.emit (events.js:315:20)\n at Object.EventEmitter.emit (domain.js:467:12)", [07/16/21, 08:18:31:355] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/16/21, 08:18:31:419] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/24/21, 09:00:52:252] error: Failed to load calls-desktop-interop.WindowBorderPanel "error": { "stack": "Error: Module did not self-register: '/snap/slack/42/usr/lib/slack/resources/app.asar.unpacked/node_modules/@tinyspeck/calls-desktop-interop/build/Release/CallsDesktopInterop.node'.\n at process.func [as dlopen] (electron/js2c/asar_bundle.js:5:1846)\n at Object.Module._extensions..node (internal/modules/cjs/loader.js:1138:18)\n at Object.func [as .node] (electron/js2c/asar_bundle.js:5:2073)\n at Module.load (internal/modules/cjs/loader.js:935:32)\n at Module._load (internal/modules/cjs/loader.js:776:14)\n at Function.f._load (electron/js2c/asar_bundle.js:5:12684)\n at Module.require (internal/modules/cjs/loader.js:959:19)\n at require (internal/modules/cjs/helpers.js:88:18)\n at bindings (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/bindings/bindings.js:112:48)\n at Object. (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/@tinyspeck/calls-desktop-interop/lib/index.js:1:34)", [07/24/21, 09:00:52:260] warn: Failed to install protocol handler for slack:// links [07/24/21, 09:00:52:440] error: Cannot override webPreferences key(s): webviewTag [...] I browsed the logs for a while but didn't see a smoking gun. Surely it spits out some error message when crashing, like to STDERR... ## 8. STDERR Tracing Where is STDERR written? # lsof -p 3666477 [...] slack 3666477 bgregg mem REG 7,16 141930 7165 /snap/slack/44/usr/lib/slack/chrome_100_percent.pak slack 3666477 bgregg mem REG 7,16 165680 7433 /snap/slack/44/usr/lib/slack/v8_context_snapshot.bin slack 3666477 bgregg 0r CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 1w CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 2w CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 3r FIFO 0,12 0t0 29532192 pipe slack 3666477 bgregg 4u unix 0x00000000134e3c45 0t0 29526717 type=SEQPACKET slack 3666477 bgregg 5r REG 7,16 10413488 7167 /snap/slack/44/usr/lib/slack/icudtl.dat [...] /dev/null? Like that's going to stop me. I could trace writes to STDERR, but I think my old shellsnoop(8) tool (another from eBPF/bcc) already does that: # shellsnoop 3666477 [...] [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.10ms [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S928EBE over 2.60ms [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S928EBE over 18.60ms [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.00ms (slack:3666477): Gtk-WARNING **: 14:46:45.525: Could not load a pixbuf from icon theme. This may indicate that pixbuf loaders or the mime database could not be found. ** Gtk:ERROR:../../../../gtk/gtkiconhelper.c:494:ensure_surface_for_gicon: assertion failed (error == NULL): Failed to load /usr/share/icons/Yaru/16x16/status/image-missing.png: Unable to load image-loading module: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5) Ah-ha! The last message printed is an error about a .png file and a .so file. As it's Slack's final mesage before crashing, this is a smoking gun. Note that this was not in any log!: # grep image-missing.png * grep: recorded-trace: Is a directory It's the .so file that is missing, not the .png: # ls -lh /usr/share/icons/Yaru/16x16/status/image-missing.png -rw-r--r-- 1 root root 535 Nov 6 2020 /usr/share/icons/Yaru/16x16/status/image-missing.png # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ ls: cannot access '/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/': No such file or directory But there is a .so file with a similar path: # ls -lh /snap/slack/ total 0 drwxrwxr-x 8 root root 123 Jul 14 02:49 43/ drwxrwxr-x 8 root root 123 Aug 18 10:27 44/ lrwxrwxrwx 1 root root 2 Aug 24 09:48 current -> 44/ # ls -lh /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ Hmm, I wonder... ## 9. Workaround This is obviously a hack and is not guaranteed to be safe: # cd /snap/slack # ln -s current 42 # ls -lh total 0 lrwxrwxrwx 1 root root 7 Aug 27 15:01 42 -> current/ drwxrwxr-x 8 root root 123 Jul 14 02:49 43/ drwxrwxr-x 8 root root 123 Aug 18 10:27 44/ lrwxrwxrwx 1 root root 2 Aug 24 09:48 current -> 44/ # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ I don't know why Slack was looking up this library via the old directory version, but linking the new version to the old path did the trick. Slack has stopped crashing! I'm guessing this is a problem with how the snap is built. Needs more debugging. ## 10. Other debugging In case you're wondering what I'd do if I didn't find the error in STDERR, I'd go back to setting ulimits to see if I could get a core dump, and if that still didn't work, I'd try to run Slack from a gdb(1) session. I'd also work on fixing the user stack trace and symbols to see what that revealed. ## 11. Bonus: opensnoop I often wonder how I could have debugged things sooner, and I'm kicking myself I didn't run opensnoop(8) as I usually do. Tracing just failed opens: # opensnoop -Tx TIME(s) PID COMM FD ERR PATH [...] 11.412358000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/subclasses 11.412360000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/icons 11.412363000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/generic-icons 11.412495000 3673057 slack -1 2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ 11.412527000 3673057 slack -1 2 /usr/share/locale/en_AU/LC_MESSAGES/ 11.412537000 3673057 slack -1 2 /usr/share/locale/en/LC_MESSAGES/ 11.412559000 3673057 slack -1 2 /usr/share/locale-langpack/en/LC_MESSAGES/ 11.412916000 3673057 slack -1 2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/ 11.425405000 1786 systemd -1 2 /sys/fs/cgroup/memory/user.slice/user-1000.slice/user@1000.service/snap.slack.slack.402dde03-7f71-48a0-98a5-33fd695ccbde.scope/ That shows its last failed open was to the .so file. Which would have been a good lead. But the best clue was the secret STDERR messages Slack sends to /dev/null, rescued using shellsnoop(8).

Linux Plumbers Conference: BOFs Call for Proposals Now Open

1 hónap óta

We have formally opened the CfP for Birds of a Feather. Select the BOFs track when submitting a BOF here.

As a reminder:

  • Submissions for Microconferences topics are still open.
  • The GNU Tools Track CfP will close in a few days, on August 31.
  • Kernel Summit track is open for late breaking topics.
  • All the other tracks are now closed for additional submissions.


Linux Plumbers Conference: Diversity, Equity & Inclusion Microconference Accepted into 2021 Linux Plumbers Conference

1 hónap 1 hét óta

We are pleased to announce that the Diversity, Equity & Inclusion Microconference has been accepted into the 2021 Linux Plumbers Conference.

Creating diverse communities requires effort and commitment to creating inclusive and welcoming spaces. Recognizing that communities which adopt inclusive language and actions attract and retain more individuals from diverse backgrounds, the Linux kernel community adopted inclusive language in Linux 5.8 release. Understanding if this sort of change has been effective is a topic of active research. This MC will take a pulse of the Linux kernel community as it turns 30 this year and discuss some next steps. Experts from the DEI research community will share their perspectives, together with the perspectives from the Linux community members. This microconference will build on what was started at the LPC 2020 BoF session on Improving Diversity.

This year’s topics to be discussed include:

  • What are the challenges in attracting and retaining a diverse group of developers that are worth focusing on.
  • Does the Code of Conduct and Inclusive naming efforts help people of diverse groups feel at home? What else is missing?
  • How effective have the kernel mentoring initiatives been? Are there best practices emerging that help the limited pool of mentors be more effective?
  • What will be the most effective next steps for advancing Diversity, Equity and Inclusion that will improve the trends, and help us scale?

Come and join us in the discussion of how we can improve the diversity of the Linux Kernel community and help keep it vibrant for the next 30 years!

We hope to see you there.

Linux Plumbers Conference: GPU/media/AI buffer management and interop Microconference Accepted into 2021 Linux Plumbers Conference

1 hónap 1 hét óta

We are pleased to announce that the GPU/media/AI buffer management and interop Microconference has been accepted into the 2021 Linux Plumbers Conference. The Linux GPU subsystem has long had three major tenets:

  • Kernel mediation, validation, submission, and scheduling of GPU jobs
  • Implicit synchronisation between multiple user space accessors
  • Open-source user space

Forthcoming hardware makes the former two difficult, if not impossible, to achieve. In order to give user space the fastest possible path to support modern complex workloads, forthcoming hardware is removing the notion of a small number of kernel-controlled job queues, replacing it with direct user space access to command queues to submit and control their own jobs.

This, and evolution in the Vulkan API, make it difficult to retain the existing implicit synchronization model, where the kernel tracks all access and ensures that the hardware executes jobs in the order of user space submission, so that multiple independent clients can reuse the same buffers without data hazards. As all of these changes impact both media and neural-network accelerators, this Linux Plumbers Conference microconference allows us to open the discussion past the graphics community and into the wider kernel community.

This year’s topics to be discussed include:

Come and join us in the discussion of keeping Linux a first class citizen
in the would of graphics and media.

We hope to see you there.

Linux Plumbers Conference: Android Microconference Accepted into 2021 Linux Plumbers Conference

1 hónap 3 hét óta

We are pleased to announce that the Android Microconference has been accepted into the 2021 Linux Plumbers Conference. The past Android microconferences have been centered around the idea that it was primarily a synchronization point between the Android kernel team and the rest of the community to inform them on what they have been doing. With the help of last year’s focus on the Generic Kernel Image[1] (GKI), this year’s Android microconference will instead be an opportunity to foster a higher level of collaboration between the Android and Linux kernel communities. Discussions will be centered on the goal of ensuring that both the Android and Linux development moves in a lockstep fashion going forward.

Last year’s meetup achieved the following:

This year’s topics to be discussed include:

  • Alignment issues between Android and Cgroups v2: Issues in refactoring Android’s use of cgroups to utilize cgroups v2
  • Thermal: Issues around performance and thermal handling between the kernel and Android’s HAL
  • Fuse/device-mapper/other storage: Discuss out-of-tree dm/storage drivers and how they might go upstream or better align with upstream efforts
  • In kernel memory tracking: Tracking/account GPU (and other multi-device shared) memory and how it might fit with cgroups
  • fw_devlink: Remaining fw_devlink issues to resolve, now that its enabled by default.
  • Hermetic builds/Kbuild
  • GKI updates: Whats new this year in GKI and where its going next year
  • Rust in AOSP / Kernel / Binder: How Android is adopting rust for userland and potentially for kernel drivers
  • Android Automotive OS Reference Platform: Details on recent Android Automotive work
  • Community devboard/device Collaboration: Ways to better collaborate on enabling various devboard against AOSP, without needing close interlock with Google

Come and join us in the discussion of making Android a better partner with Linux.

We hope to see you there.

Dave Airlie (blogspot): crocus misrendering of the week

1 hónap 3 hét óta

 I've been chasing a crocus misrendering bug show in a qt trace.

The bottom image is crocus vs 965 on top. This only happened on Gen4->5, so Ironlake and GM45 were my test machines. I burned a lot of time trying to work this out. I trimmed the traces down, dumped a stupendous amount of batchbuffers, turned off UBO push constants, dump all the index and vertex buffers, tried some RGBx changes, but nothing was rushing to hit me, except that the vertex shaders produced were different.

However they were different for many reasons, due to the optimization pipelines the mesa state tracker runs vs the 965 driver. Inputs and UBO loads were in different places so there was a lot of noise in the shaders.

I ported the trace to a piglit GL application so I could easier hack on the shaders and GL, with that I trimmed it down even further (even if I did burn some time on a misplace */+ typo).

Using the ported app, I removed all uniform buffer loads and then split the vertex shader in half (it was quite large, but had two chunks). I finally then could spot the difference in the NIR shaders.

What stood out was the 965 shader had an if which the crocus shader has converted to a bcsel. This is part of peephole optimization and the mesa/st calls it, and sure enough removing that call fixed the rendering, but why? it is a valid optimization.

In a parallel thread on another part of the planet, Ian Romanick filed a MR to mesa fixing a bug in the gen4/5 fs backend with conditional selects. This was something he noticed while debugging elsewhere. However his fix was for the fragment shader backend, and my bug was in the vec4 vertex shader backend. I tracked down where the same changes were needed in the vec4 backend and tested a fix on top of his branch, and the misrendering disappeared.

It's a strange coincidence we both started hitting the same bug in different backends in the same week via different tests, but he's definitely saved me a lot of pain in working this out! Hopefully we can combine them and get it merged this week.

Also thanks to Angelo on the initial MR for testing crocus with some real workloads.

Linux Plumbers Conference: System Boot and Security Microconference Accepted into 2021 Linux Plumbers Conference

1 hónap 4 hét óta

We are pleased to announce that the System Boot and Security Microconference has been accepted into the 2021 Linux Plumbers Conference. This microconference brings together those that are interested in the firmware, bootloaders, system boot and security. The events around last year’s BootHole showed how crucial platform initialization is for the overall system security. Those events may have showed the shortcomings in the current boot process, but they have also tightened the cooperation between various companies and organizations. Now is the time to use this opportunity to discuss the lessons learned and what can be done to improve in the future. Other cooperation discussions are also welcomed like those based on legal and organizational issues which may hinder working together.

Last year’s meetup achieved the following:

This year’s topics to be discussed include:

Come and join us in the discussion about how to keep your system secure even at bootup.

We hope to see you there.

Linux Plumbers Conference: Kernel Dependability and Assurance Microconference Accepted into 2021 Linux Plumbers Conference

2 hónap óta

We are pleased to announce that the Kernel Dependability and Assurance Microconference has been accepted into the 2021 Linux Plumbers Conference.

Linux development is producing kernels at an ever increasing rate, and at the same time with arguably increasing software quality. The process of kernel development has been adapting to handle the increasing number of contributors over the years to ensure a sufficient software quality. This quality is key in that Linux is now being used in applications that require a high degree of trust that the kernel is going to behave as expected. Some of the key areas we’re seeing Linux start to be used are in medical devices, civil infrastructure, caregiving robots, automotives, etc.

Last year’s miniconference raised awareness about this topic with the wider community. Since then the ELISA team has made contributions to the Documentation and tools. The team has deployed a CI server that runs static analysis tools and syzkaller on the Linux kernel repos and is making the results of last 10 days of linux-next are made available to the community.

This year’s topics to be discussed include:

  • Identify missing features that will provide assurance in safety critical systems.
  • Which test coverage infrastructures are most effective to provide evidence for kernel quality assurance? How should it be measured?
  • Explore ways to improve testing framework and tests in the kernel with a specific goal to increase traceability and code coverage.
  • Regression Testing for safety: Prioritize configurations and tests critical and important for quality and dependability

Come and join us in the discussion on how we can assure that Linux becomes the most trusted and dependable software in the world!

We hope to see you there.

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XVIII: Preventing Involuntary Generosity

2 hónap óta
I recently learned that all that is required for someone to take out a loan in some random USA citizen's name is that citizen's full name, postal address, email address, date of birth, and social security number. If you are above a certain age, all of these are for all intents and purposes a matter of public record. If you are younger, then your social security number is of course supposed to be secret—and it will be, right up to that data breach that makes it available to all the wrong people.

This sort of thing can of course be a bit annoying to our involuntarily generous USA citizen. Some unknown person out there gets a fancy toy, and our citizen gets some bank's dunning notices. Fortunately, there are quite a few things you can do, although I will not try to reproduce the entirety of the volumes of good advice that are available out there. Especially given that laws, processes, and procedures are all subject to change.

But at present, one important way to prevent this is to put a hold on your credit information through either of Experian, Equifax, or Transunion. I strongly suggest that you save yourself considerable time and hassle by doing this, which is free of charge for a no-questions-asked one-year hold. Taking this step is especially important if you are among the all too many of us whose finances don't have much slack, as was the case with my family back when my children were small. After all, it is one thing to have to deal with a few hassles in the form of phone calls, email, and paperwork, but it is quite another if you and your loved ones end up missing meals. Thankfully, it never came to that for my family, although one of my children did complain bitterly to a medical professional about the woefully insufficient stores of candy in our house.

Of course, I also have some advice for the vendor, retailer, digital-finance company, and bank that were involved in my case of attempted involuntary generosity:

  1. Put just a little more effort into determining who you are really doing business with.
  2. If the toy contains a computer and connects to the internet, consider the option of directing your dunning notices through the toy rather than to the email and phone of your involuntarily generous USA citizen.
  3. A loan application for a toy that is shipped to a non-residential address should be viewed with great suspicion.
  4. In fact, such a loan application should be viewed with with some suspicion even if the addresses match. Porch pirates and all that.
  5. If the toy is of a type that must connect to the internet to do anything useful, you have an easy method of dealing with non-payment, don't you?

I should hasten to add that after only a little discussion, these companies have thus far proven quite cooperative in my particular case, which is why they are thus far going nameless.

Longer term, it is hard to be optimistic, especially given advances in various easy-to-abuse areas of information technology. In the meantime, I respectfully suggest that you learn from my experience and put a hold on your credit information!

Linux Plumbers Conference: RISC-V Microconference Accepted into 2021 Linux Plumbers Conference

2 hónap óta

We are pleased to announce that the RISC-V Microconference has been accepted into the 2021 Linux Plumbers Conference. The RISC-V software eco-system is gaining momentum at breakneck speed with three new Linux development platforms available this year. The new platforms bring new issues to deal with.

Last year’s meetup achieved the following:

This year’s topics to be discussed include:

  • Platform specification progress, including SBI-0.3 and the future plans for SBI-0.4. There has been significant progress on the platform specifications, including a server profile, that needs discussion.
  • Privileged specification progress, possible 1.12 (which is a work in progress at the foundation).
  • Support for the V and B specifications, along with questions about the drafts. The V extension is of particular interest, as there are implementation of the draft extensions that are likely to be incompatible with what will eventually be ratified so we need to discuss what exactly user ABI compatibility means.
  • H extension / KVM discussion, which is probably part of the drafts.  The KVM port has been hung up on the H extension ratification process, which is unlikely to proceed any time soon. We should discuss other options for a KVM port that avoid waiting for the H extension.
  • Support for the batch of SOCs currently landing (JH7100, D1)
  • Support for non-coherent systems
  • How to handle compliance.

Come join us and participate in the discussion on how we can improve the support for RISC-V in the Linux kernel.

We hope to see you there.

Pete Zaitcev: MinIO liberates your storage from rebalancing

2 hónap óta

MinIO posted a blog entry a few days ago where the bragged about adding capacity without a need to re-balance.

First, they went into a full marketoid mode, whipping up the fear:

Rebalancing a massive distributed storage system can be a nightmare. There’s nothing worse than adding a storage node and watching helplessly as user response time increases while the system taxes its own resources rebalancing to include the new node.

Seems like MinIO folks assume that operators of distributed storage such as Swift and Ceph have no tools to regulate the resource consumption of rebalancing. So they have no choice but to "wait helplessly". Very funny.

But it gets worse when obviously senseless statements are made:

Rebalancing doesn’t just affect performance - moving many objects between many nodes across a network can be risky. Devices and components fail and that often leads to data loss or corruption.

Often, man! Also, a commit protocol? Never heard of her!

Then, we talk about some unrelated matters:

A group of drives is an erasure set and MinIO uses a Reed-Solomon algorithm to split objects into data and parity blocks based on the size of the erasure set and then uniformly distributes them across all of the drives in the erasure such that each drive in the set contains no more than one block per object.

Understood, your erasure set is what we call "partition" in Swift or a placement group in Ceph.

Finally, we get to the matter at hand:

To enable rapid growth, MinIO scales by adding server pools and erasure sets. If we had built MinIO to allow you to add a drive or even a single hardware node to an existing server pool, then you would have to suffer through rebalancing.

MinIO scales up quickly by adding server pools, each an independent set of compute, network and storage resources.

Add hardware, run MinIO server to create and name server processes, then update MinIO with the name of the new server pool. MinIO leaves existing data in their original server pools while exposing the new server pools to incoming data.

My hot take on the social media was: "Placing new sets on new storage impacts utilization and risks hotspotting because of time affinity. There's no free lunch." Even on the second thought, I think that is about right. But let us not ignore the cost of the data movement associated with rebalancing. What if the operator wants to implement in Swift what MinIO blog post talks about?

It is possible to emulate MinIO, to an extent. Some operators add a new storage policy when they expand the cluster, configure all the new nodes and/or volumes in its ring, then make it default, so newly-created objects end on the new hardware. This works to accomplish the same goals that MinIO outline above, but it's a kludge. Swift was not intended for this originally and it shows. In particular, storage policies were intended for low numbers of storage classes, such as rotating media and SSD, or Silver/Gold/Platinum. Once you make a new policy for each new forklift visit, you run a risk of finding scalability issues. Well, most clusters only upgrade a few times over their lifetime, but potentially it's a problem. Also, policies are customer visible, they are intended to be.

In the end, I still think that balanced cluster is the way to go. Just think rationally about it.

Interestingly, the reverse emulation appears to be not possible for MinIO: if you wanted to rebalance your storage, you would not be able to. Or at least the blog post above says: "If we had built MinIO to allow you to add a drive or ... a node to an existing server pool". I take it to mean that they don't allow, and the blog post is very much a case of sour grapes, then.

15 perc 49 másodperc ago
Kernel Planet -
Feliratkozás a következőre: Kernel Planet hírcsatorna