Hírolvasó

(open)rsync gains include/exclude support

3 years 11 months ago

Claudio Jeker (claudio@) has committed support for simple include and exclude cases in (open)rsync:

```
CVSROOT:	/cvs
Module name:	src
Changes by:	claudio@cvs.openbsd.org	2021/08/29 07:43:46

Modified files:
	usr.bin/rsync  : Makefile extern.h flist.c main.c receiver.c
	                 sender.c
Added files:
	usr.bin/rsync  : charclass.h rmatch.c rules.c

Log message:
Implement --exclude/exclude-file and --include/include-file.
Currently only simple include and excludes work, the advanced filters
introduced later in rsync are not implemented. It is unclear if the per
directory filters are something we want to implement. This requires more
modern protocols which openrsync is not able to handle right now.

This adds a special matching function to allow the ** matching which
behaves mostly like rsyncs version with the exception of how bad []
patterns are expanded. For bad patterns openrsync follows more how
fnmatch behaves and not the somewhat strange rsync behaviour.

Not perfect but committing now so people can test and provide feedback
```

Great stuff! This should further reduce the need for installing the net/rsync port.
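To illustrate the "simple include and exclude" semantics the commit describes, here is a rough first-match-wins rule walk sketched in Python. This is only an illustration of the rsync-style filtering model, not openrsync's actual matcher (which lives in rules.c/rmatch.c and handles `**` itself); the function and rule names are hypothetical.

```python
from fnmatch import fnmatch

def filter_path(path, rules):
    """Hedged sketch of simple rsync-style filtering.

    rules: list of ('+'|'-', pattern) pairs, as built from
    --include/--exclude in order. The first matching rule decides;
    files matched by no rule are included. Models only the simple
    cases (no per-directory filters, no real '**' handling).
    """
    for action, pattern in rules:
        # A pattern without '/' is matched against the basename,
        # roughly mirroring rsync's behaviour for simple patterns.
        target = path if "/" in pattern else path.rsplit("/", 1)[-1]
        if fnmatch(target, pattern):
            return action == "+"
    return True  # default: include

rules = [("+", "*.c"), ("-", "*.o"), ("-", "tmp/*")]
print(filter_path("src/main.c", rules))   # True (included by the '+' rule)
print(filter_path("src/main.o", rules))   # False (excluded)
print(filter_path("tmp/scratch", rules))  # False (excluded)
```

Note how rule order matters: the `+ *.c` rule shields C sources from any later exclude that might otherwise match them.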

Brendan Gregg: Analyzing a High Rate of Paging

3 years 11 months ago
These are rough notes. A service team was debugging a performance issue and noticed it coincided with a high rate of paging. I was asked to help, and used a variety of Linux performance tools to solve this, including those that use eBPF and Ftrace. This is a rough post to share this old but good case study of using these tools, and to help justify their further development. No editing, spell checking, or comments. Mostly screenshots.

## 1. Problem Statement

The microservice managed and processed large files, including encrypting them and then storing them on S3. The problem was that large files, such as 100 Gbytes, seemed to take forever to upload. Hours. Smaller files, as large as 40 Gbytes, were relatively quick, only taking minutes. A cloud-wide monitoring tool, Atlas, showed a high rate of paging for the larger file uploads (the blue in its chart is pageins: page ins). Pageins are a type of disk I/O where a page of memory is read in from disk, and are normal for many workloads. You might be able to guess the issue from the problem statement alone.

## 2. iostat

Starting with my [60-second performance checklist], the iostat(1) tool showed a high rate of disk I/O during a large file upload:

```
# iostat -xz 1
Linux 4.4.0-1072-aws (xxx)   12/18/2018   _x86_64_   (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.03    0.00    0.83    1.94    0.02   92.18

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvda       0.00   0.29   0.21  0.17     6.29    3.09    49.32     0.00  12.74    6.96   19.87   3.96   0.15
xvdb       0.00   0.08  44.39  9.98  5507.39 1110.55   243.43     2.28  41.96   41.75   42.88   1.52   8.25

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.81    0.00    1.08   29.94    0.06   54.11

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvdb       0.00   0.00 745.00  0.00 91656.00    0.00   246.06    25.32  33.84   33.84    0.00   1.35 100.40

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.86    0.00    0.89   24.76    0.06   59.43

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvdb       0.00   0.00 739.00  0.00 92152.00    0.00   249.40    24.75  33.49   33.49    0.00   1.35 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.95    0.00    0.89   28.75    0.06   55.35

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvdb       0.00   0.00 734.00  0.00 91704.00    0.00   249.87    24.93  34.04   34.04    0.00   1.36 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.54    0.00    1.14   29.40    0.06   54.86

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvdb       0.00   0.00 750.00  0.00 92104.00    0.00   245.61    25.14  33.37   33.37    0.00   1.33 100.00
^C
```

I'm looking at the r_await column in particular: the average wait time for reads in milliseconds. Reads usually have apps waiting on them; writes may not (write-back caching). An r_await of 33 ms is kinda high, and likely due to the queueing (avgqu-sz). They are largeish I/O, about 128 Kbytes (divide rkB/s by r/s).

## 3. biolatency

From [bcc], this eBPF tool shows a latency histogram of disk I/O.
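As a quick check of the "divide rkB/s by r/s" arithmetic, here is a sketch using the figures from one of the saturated iostat intervals above (not part of the original post):

```python
# Average read size from iostat: rkB/s divided by r/s.
# Figures from one of the busy intervals: 91656.00 rkB/s at 745.00 r/s.
rkb_per_s = 91656.00
reads_per_s = 745.00

avg_read_kb = rkb_per_s / reads_per_s
print(round(avg_read_kb))  # 123 -- largeish I/O, close to 128 Kbytes
```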
I'm running it in case the averages are hiding outliers, which could be a device issue:

```
# /usr/share/bcc/tools/biolatency -m
Tracing block device I/O... Hit Ctrl-C to end.
^C
     msecs               : count     distribution
         0 -> 1          : 83       |                                        |
         2 -> 3          : 20       |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 41       |                                        |
        16 -> 31         : 1620     |*******                                 |
        32 -> 63         : 8139     |****************************************|
        64 -> 127        : 176      |                                        |
       128 -> 255        : 95       |                                        |
       256 -> 511        : 61       |                                        |
       512 -> 1023       : 93       |                                        |
```

This doesn't look too bad. Most I/O are between 16 and 127 ms. Some outliers reaching the 0.5 to 1.0 second range, but again, there's quite a bit of queueing here seen in the earlier iostat(1) output. I don't think this is a device issue. I think it's the workload.

## 4. bitesize

As I think it's a workload issue, I want a better look at the I/O sizes in case there's something odd:

```
# /usr/share/bcc/tools/bitesize
Tracing... Hit Ctrl-C to end.
^C
Process Name = java
     Kbytes              : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 31       |                                        |
        16 -> 31         : 15       |                                        |
        32 -> 63         : 15       |                                        |
        64 -> 127        : 15       |                                        |
       128 -> 255        : 1682     |****************************************|
```

The I/O is mostly in the 128 to 255 Kbyte bucket, as expected from the iostat(1) output. Nothing odd here.

## 5. free

Also from the 60-second checklist:

```
# free -m
              total        used        free      shared  buff/cache   available
Mem:          64414       15421         349           5       48643       48409
Swap:             0           0           0
```

There's not much memory left, 349 Mbytes, but more interesting is the amount in the buffer/page cache: 48,643 Mbytes (48 Gbytes). This is a 64-Gbyte memory system, and 48 Gbytes is in the page cache (the file system cache). Along with the numbers from the problem statement, this gives me a theory: do the 100-Gbyte files bust the page cache, whereas 40-Gbyte files fit?

## 6. cachestat

[cachestat] is an experimental tool I developed that uses Ftrace and has since been ported to bcc/eBPF. It shows statistics for the page cache:

```
# /apps/perf-tools/bin/cachestat
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
    1811      632        2    74.1%           17      48009
    1630    15132       92     9.7%           17      48033
    1634    23341       63     6.5%           17      48029
    1851    13599       17    12.0%           17      48019
    1941     3689       33    34.5%           17      48007
    1733    23007      154     7.0%           17      48034
    1195     9566       31    11.1%           17      48011
[...]
```

This shows many cache misses, with a hit ratio varying between 6.5% and 74%. I usually like to see that in the upper 90s. This is "cache busting." The 100-Gbyte file doesn't fit in the 48 Gbytes of page cache, so we have many page cache misses that will cause disk I/O and relatively poor performance. The quickest fix is to move to a larger-memory instance that does fit 100-Gbyte files. The developers can also rework the code with the memory constraint in mind to improve performance (e.g., processing parts of the file, instead of making multiple passes over the entire file).

## 7. Smaller File Test

For further confirmation, I collected the same outputs for a 32-Gbyte file upload. cachestat shows a ~100% cache hit ratio:

```
# /apps/perf-tools/bin/cachestat
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
   61831        0      126   100.0%           41      33680
   53408        0       78   100.0%           41      33680
   65056        0      173   100.0%           41      33680
   65158        0       79   100.0%           41      33680
   55052        0      107   100.0%           41      33680
   61227        0      149   100.0%           41      33681
   58669        0       71   100.0%           41      33681
   33424        0       73   100.0%           41      33681
^C
```

This smaller size allows the service to process the file (however many passes it takes) entirely from memory, without re-reading it from disk.
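The RATIO column is simply hits over total cache lookups. Verifying the first line of the large-file cachestat output (1811 hits, 632 misses) as a quick sketch:

```python
# cachestat's RATIO column: HITS / (HITS + MISSES), as a percentage.
# Figures from the first line of the large-file upload output.
hits, misses = 1811, 632

ratio_pct = 100.0 * hits / (hits + misses)
print(f"{ratio_pct:.1f}%")  # 74.1%
```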
free(1) shows it fitting in the page cache:

```
# free -m
              total        used        free      shared  buff/cache   available
Mem:          64414       18421       11218           5       34773       45407
Swap:             0           0           0
```

And iostat(1) shows little disk I/O, as expected:

```
# iostat -xz 1
Linux 4.4.0-1072-aws (xxx)   12/19/2018   _x86_64_   (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.25    0.00    1.24    0.19    0.03   86.29

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvda       0.00   0.32   0.31  0.19     7.09    4.85    47.59     0.01  12.58    5.44   23.90   3.09   0.15
xvdb       0.00   0.07   0.01 11.13     0.10 1264.35   227.09     0.91  82.16    3.49   82.20   2.80   3.11

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          57.43    0.00    2.95    0.00    0.00   39.62

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          53.50    0.00    2.32    0.00    0.00   44.18

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvdb       0.00   0.00   0.00  2.00     0.00   19.50    19.50     0.00   0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          39.02    0.00    2.14    0.00    0.00   58.84

Device:  rrqm/s wrqm/s    r/s   w/s    rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util

[...]
```

## 8. Final Notes

cachestat was the killer tool here, but I should stress that it's still experimental. I wrote it for Ftrace with the constraints that it must be low-overhead and use the Ftrace function profiler only. As I mentioned in my LSFMMBPF 2019 keynote in Puerto Rico, where the Linux mm kernel engineers were present, I think the cachestat statistics are so commonly needed that they should be in /proc and not need my experimental tool. They pointed out challenges with providing them properly, and I think any robust solution is going to need their help and expertise. I hope this case study helps show why it is useful and worth the effort. Until the kernel does support page cache statistics (which may be never: they are hot-path, so adding counters isn't free), we can use my cachestat tool, even if it does require frequent maintenance to keep working.

[cachestat]: /blog/2014-12-31/linux-page-cache-hit-ratio.html
[60-second performance checklist]: /Articles/Netflix_Linux_Perf_Analysis_60s.pdf
[bcc]: https://github.com/iovisor/bcc
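The code rework suggested above (process parts of the file, instead of making multiple passes over the entire file) can be sketched as follows. This is a hedged illustration only, not the service's actual code; the chunk size and the two example "passes" are assumptions:

```python
import hashlib
import os
import tempfile

# Illustrative chunk size: far smaller than the page cache, so each chunk
# stays hot in cache while every processing step runs over it.
CHUNK = 64 * 1024 * 1024  # 64 Mbytes

def process_in_one_pass(path, steps):
    """Apply every processing step to each chunk in turn.

    Instead of N full passes over a 100-Gbyte file (which busts the page
    cache), each chunk is read from disk once and all steps consume it
    while it is still cached.
    """
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            for step in steps:  # all "passes" per chunk, not per file
                step(chunk)

# Tiny demo: two "passes" (hashing and size accounting) done chunk by chunk.
digest, sizes = hashlib.sha256(), []
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"example data")
    name = tf.name
process_in_one_pass(name, [digest.update, lambda c: sizes.append(len(c))])
os.unlink(name)
print(sizes)  # [12]
```

The design point is simply loop order: iterating steps inside the chunk loop keeps the working set bounded by CHUNK, regardless of total file size.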

The 5.14 kernel has been released

3 years 11 months ago
Linus has released the 5.14 kernel.

So I realize you must all still be busy with all the galas and fancy balls and all the other 30th anniversary events, but at some point you must be getting tired of the constant glitz, the fireworks, and the champagne. That ball gown or tailcoat isn't the most comfortable thing, either. The celebrations will go on for a few more weeks yet, but you all may just need a breather from them.

And when that happens, I have just the thing for you - a new kernel release to test and enjoy.

Headline features in 5.14 include: core scheduling (at last), the burstable CFS bandwidth controller, some initial infrastructure for BPF program loaders, the rq_qos I/O priority policy, some improvements to the SO_REUSEPORT networking option, the control-group "kill" button, the memfd_secret() system call, the quotactl_fd() system call, and much more. See the LWN merge-window summaries (part 1, part 2) for more details.

corbet

Michael Kerrisk (manpages): man-pages-5.13 released

3 years 11 months ago
Alex Colomar and I have released man-pages-5.13. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from 40 contributors. The release includes around 200 commits that changed around 120 manual pages.

The most notable of the changes in man-pages-5.13 are the following:
  • A new mount_setattr(2) page, written by Christian Brauner, that documents the mount_setattr() system call added in Linux 5.12.
  • I added documentation of the /proc/PID/projid_map file to the user_namespaces(7) page.
  • I added a greatly expanded description of the concept of locked mounts in the mount_namespaces(7) page.

Special thanks again to Alex, who kept track of a lot of patches while I was unavailable.

[$] Nftables reaches 1.0

3 years 11 months ago
The Linux kernel is a fast-moving project, but change can still be surprisingly slow to come at times. The nftables project to replace the kernel's packet-filtering subsystem has its origins in 2008, but is still not being used by most (or perhaps even many) production firewalls. The transition may be getting closer, though, as highlighted by the release of nftables 1.0.0 on August 19.
corbet

Security updates for Friday

3 years 11 months ago
Security updates have been issued by Fedora (haproxy and libopenmpt), openSUSE (aws-cli, python-boto3, python-botocore, python-service_identity, python-trustme, python-urllib3, dbus-1, and qemu), Oracle (rh-postgresql10-postgresql), Red Hat (compat-exiv2-023, compat-exiv2-026, exiv2, libsndfile, microcode_ctl, python27, rh-nodejs12-nodejs and rh-nodejs12-nodejs-nodemon, rh-nodejs14-nodejs and rh-nodejs14-nodejs-nodemon, and rh-python38), Scientific Linux (compat-exiv2-023 and compat-exiv2-026), SUSE (compat-openssl098), and Ubuntu (libssh, openssl, and openssl1.0).
jake

Brendan Gregg: Slack's Secret STDERR Messages

3 years 11 months ago
These are rough notes. I run the Slack messaging application on Ubuntu Linux, and it recently started mysteriously crashing. I'd Alt-Tab and find it was no longer there. No error message, no dialog, just gone. It usually happened when locking and unlocking the screen. A quick internet search revealed nothing. These are rough notes for how I debugged it, in case it's useful for someone searching on this topic. I spend many hours documenting advanced debugging stories for books, talks, and blog posts, but many things I never have time to share. I'm experimenting with this "rough notes" format as a way to quickly share things. No editing, spell checking, or comments. Mostly screenshots. Dead ends included. Note that I don't know anything about Slack internals, and there may be better ways to solve this.

## 1. Enabling core dumps

I'm guessing it's core dumping and Ubuntu's apport is eating them. Redirecting them to the file system so I can then do core dump analysis using gdb(1), as root:

```
# cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport %p %s %c %d %P
# mkdir /var/cores
# echo "/var/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern
[...another crash...]
# ls /var/cores
#
```

This didn't work: no core file showed up. I may need to increase the core file size ulimits for Slack, but that might mean mucking around with its startup scripts; I'll try some other tracing first.

## 2. exitsnoop

Using an eBPF/bcc tool to look for exit reasons:

```
# exitsnoop -t
TIME-AEST     PCOMM            PID     PPID    TID     AGE(s)   EXIT_CODE
13:51:19.432  kworker/dying    3663305 2       3663305 1241.59  0
13:51:30.948  kworker/dying    3663626 2       3663626 835.76   0
13:51:33.296  systemd-udevd    3664149 2054939 3664149 3.55     0
13:53:09.256  kworker/dying    3662179 2       3662179 2681.15  0
13:53:25.636  kworker/dying    3663520 2       3663520 1122.60  0
13:53:30.705  grep             3664239 6009    3664239 0.08     0
13:53:30.705  ps               3664238 6009    3664238 0.08     0
13:53:40.297  slack            3663135 1786    3663135 1459.54  signal 6 (ABRT)
13:53:40.298  slack            3663208 3663140 3663208 1457.86  0
13:53:40.302  slack            3663140 1786    3663140 1459.18  0
13:53:40.302  slack            3663139 1786    3663139 1459.18  0
13:53:40.303  slack            3663171 1786    3663171 1458.22  0
13:53:40.317  slack            3663197 1786    3663197 1458.03  0
13:53:44.827  gdm-session-wor  3664269 1778    3664269 0.02     0
[...]
```

This traced a Slack SIGABRT which happened around the same time as a crash. A strong lead.

## 3. killsnoop

Running killsnoop (eBPF/bcc) to get more info:

```
# killsnoop
TIME      PID     COMM             SIG  TPID    RESULT
13:45:01  2053366 systemd-journal  0    1024    0
13:45:01  2053366 systemd-journal  0    3663525 -3
13:45:01  2053366 systemd-journal  0    3663528 -3
13:49:00  2054939 systemd-udevd    15   3664053 0
13:51:33  2054939 systemd-udevd    15   3664149 0
13:53:44  2053366 systemd-journal  0    4265    -1
13:53:44  2053366 systemd-journal  0    972     0
13:53:44  2053366 systemd-journal  0    1778    0
13:53:44  2053366 systemd-journal  0    6414    -1
[...]
```

A crash happened, but killsnoop(8) didn't see it. A quick look at the killsnoop(8) source reminded me that I wrote it back in 2015, which is practically ancient in eBPF years. Back then there wasn't tracepoint support yet, so I was using kprobes for everything. Kprobes aren't a stable interface, which might be the problem.

## 4. perf trace

Nowadays this can be done as a perf one-liner:

```
# perf list syscalls:sys_enter_*kill

List of pre-defined events (to be used in -e):

  syscalls:sys_enter_kill                            [Tracepoint event]
  syscalls:sys_enter_tgkill                          [Tracepoint event]
  syscalls:sys_enter_tkill                           [Tracepoint event]

# perf trace -e 'syscalls:sys_enter_*kill'
 15755.483 slack/3684015 syscalls:sys_enter_tgkill(tgid: 3684015 (slack), pid: 3684015 (slack), sig: ABRT)
```

Ok, so there's our slack SIGABRT, sent via tgkill(2). (And I filed an issue to update bcc killsnoop(8) to use tracepoints.) This output doesn't really tell me much more about it though. I want to see a stack trace. I can use perf record or bpftrace... and that reminds me, didn't I write another signal tool using bpftrace?

## 5. signals.bt

The signals.bt bpftrace tool from my BPF book traces the signal:signal_generate tracepoint, which should catch every type of generated signal, including tgkill(2). Trying it out:

```
# bpftrace /home/bgregg/Git/bpf-perf-tools-book/originals/Ch13_Applications/signals.bt
Attaching 3 probes...
Counting signals. Hit Ctrl-C to end.
^C
@[SIGNAL, PID, COMM] = COUNT
@[SIGPIPE, 1883, Xorg]: 1
@[SIGCHLD, 1797, dbus-daemon]: 1
@[SIGINT, 3665167, bpftrace]: 1
@[SIGTERM, 3665198, gdm-session-wor]: 1
@[SIGCHLD, 3665197, gdm-session-wor]: 1
@[SIGABRT, 3664940, slack]: 1
@[SIGTERM, 3665197, gdm-session-wor]: 1
@[SIGKILL, 3665207, dbus-daemon]: 1
@[SIGWINCH, 859450, bash]: 2
@[SIGCHLD, 1778, gdm-session-wor]: 2
@[, 3665201, gdbus]: 2
@[, 3665199, gmain]: 2
@[SIGWINCH, 3665167, bpftrace]: 2
@[SIGWINCH, 3663319, vi]: 2
@[SIGCHLD, 1786, systemd]: 6
@[SIGALRM, 1883, Xorg]: 106
```

Ok, there's the SIGABRT for slack. (There's a new sigsnoop(8) tool for bcc that uses this tracepoint as well.)

## 6. Signal Stacks

Moving from signals.bt to a bpftrace one-liner:

```
# bpftrace -e 't:signal:signal_generate /comm == "slack"/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }'
Attaching 1 probe...
6,
        __send_signal+579
        __send_signal+579
        send_signal+221
        do_send_sig_info+81
        do_send_specific+110
        do_tkill+171
        __x64_sys_tgkill+34
        do_syscall_64+73
        entry_SYSCALL_64_after_hwframe+68

        0x7f4a2e2e2f95
```

This was supposed to print both the kernel and user stack traces that led to the SIGABRT, but the user stack is broken, showing 0x7f4a2e2e2f95 only. Grumble. There's ways to fix this, but it usually gets time consuming, so let me try something else first. Logs!

## 7. Logs

Does Slack have logs? I have no idea. Maybe they contain the error message.

```
# lsof -p `pgrep -n slack` | grep -i log
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
      Output information may be incomplete.
```

Ignore the lsof(8) warnings. There's no output here, nothing containing "log". Although I'm looking at the most recent process called "slack" and maybe that's the wrong one.

```
# pstree -ps `pgrep -n slack`
systemd(1)───systemd(1786)───slack(3666477)───slack(3666481)───slack(3666548)─┬─{slack}(3666549)
                                                                              ├─{slack}(3666551)
                                                                              ├─{slack}(3666552)
                                                                              ├─{slack}(3666553)
                                                                              ├─{slack}(3666554)
                                                                              ├─{slack}(3666555)
                                                                              ├─{slack}(3666556)
                                                                              ├─{slack}(3666557)
                                                                              ├─{slack}(3666558)
                                                                              ├─{slack}(3666559)
                                                                              ├─{slack}(3666560)
                                                                              ├─{slack}(3666564)
                                                                              ├─{slack}(3666566)
                                                                              ├─{slack}(3666568)
                                                                              └─{slack}(3666609)
```

Ok, how about I try the great-grandparent, PID 3666477:

```
# lsof -p 3666477 | grep -i log
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
      Output information may be incomplete.
slack   3666477 bgregg   37r  REG 253,1    32768  140468 /home/bgregg/.local/share/gvfs-metadata/home-8fd8d123.log
slack   3666477 bgregg   40r  REG 253,1    32768  131314 /home/bgregg/.local/share/gvfs-metadata/trash:-85854456.log
slack   3666477 bgregg   71w  REG 253,1    15566 1707316 /home/bgregg/.config/Slack/Local Storage/leveldb/013430.log
slack   3666477 bgregg   72w  REG 253,1      347 1704816 /home/bgregg/.config/Slack/Local Storage/leveldb/LOG
slack   3666477 bgregg   73w  REG 253,1  2324236 1718407 /home/bgregg/.config/Slack/logs/browser.log
slack   3666477 bgregg   90w  REG 253,1   363600 1713625 /home/bgregg/.config/Slack/Service Worker/Database/000004.log
slack   3666477 bgregg   92w  REG 253,1      274 1704249 /home/bgregg/.config/Slack/Service Worker/Database/LOG
slack   3666477 bgregg  108w  REG 253,1  4182513 1749672 /home/bgregg/.config/Slack/logs/webapp-service-worker-console.log
slack   3666477 bgregg  116w  REG 253,1      259 1704369 /home/bgregg/.config/Slack/Session Storage/LOG
slack   3666477 bgregg  122w  REG 253,1    31536 1749325 /home/bgregg/.config/Slack/Session Storage/000036.log
slack   3666477 bgregg  126w  REG 253,1  3970909 1704566 /home/bgregg/.config/Slack/logs/webapp-console.log
slack   3666477 bgregg  127w  REG 253,1  2330006 1748923 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/023732.log
slack   3666477 bgregg  131w  REG 253,1      330 1704230 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/LOG
slack   3666477 bgregg  640r  REG 253,1    32768  140378 /home/bgregg/.local/share/gvfs-metadata/root-7d269acf.log (deleted)
```

Lots of logs in ~/.config/Slack/logs!
```
# cd ~/.config/Slack/logs
# ls -lrth
total 26M
-rw-rw-r-- 1 bgregg bgregg 5.0M Aug 20 07:54 webapp-service-worker-console1.log
-rw-rw-r-- 1 bgregg bgregg 5.1M Aug 23 19:30 webapp-console2.log
-rw-rw-r-- 1 bgregg bgregg 5.1M Aug 25 16:07 webapp-console1.log
drwxrwxr-x 2 bgregg bgregg 4.0K Aug 27 14:34 recorded-trace/
-rw-rw-r-- 1 bgregg bgregg 4.0M Aug 27 14:46 webapp-service-worker-console.log
-rw-rw-r-- 1 bgregg bgregg 2.3M Aug 27 14:46 browser.log
-rw-rw-r-- 1 bgregg bgregg 3.9M Aug 27 14:46 webapp-console.log
```

Ok, so how about this one:

```
# cat webapp-console.log
[...]
[08/27/21, 14:46:36:238] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.801 dnd.teamInfo is RESOLVED
[08/27/21, 14:46:36:240] info: [API-Q] (TKZ41AXQD) 614b3789-1630039595.930 users.interactions.list is RESOLVED
[08/27/21, 14:46:36:242] info: [DND] (TKZ41AXQD) Fetched DND info for the following member: ULG5H012L
[08/27/21, 14:46:36:251] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L
[08/27/21, 14:46:36:254] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes
[08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Fetched DND info for the following members: UL0US3455
[08/27/21, 14:46:36:313] info: [DND] (TKZ41AXQD) Checking for changes in DND status for the following members: ULG5H012L,UL0US3455
[08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes
[08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred
[08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.10ms
[08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S9267BE over 2.60ms
[08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events
[08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now
[08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S9267BE over 18.60ms
[08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S9267BE over 0.00ms
```

No, I don't see any errors jumping out at me. How about searching for errors:

```
# egrep -i 'error|fail' webapp-console.log
[08/25/21, 16:07:13:051] info: [DESKTOP-SIDE-EFFECT] (TKZ41AXQD) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false}
[08/25/21, 16:07:13:651] info: [DESKTOP-SIDE-EFFECT] (T7GLTMS0P) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false}
[08/25/21, 16:07:14:249] info: [DESKTOP-SIDE-EFFECT] (T0DS04W11) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false}
[08/25/21, 16:07:14:646] info: [DESKTOP-SIDE-EFFECT] (T0375HBGA) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false}
[...]

# egrep -i 'error|fail' browser.log
[07/16/21, 08:18:27:621] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox
[07/16/21, 08:18:27:653] error: Failed to load empty window url in window "error": { "stack": "Error: ERR_ABORTED (-3) loading 'about:blank'\n    at rejectAndCleanup (electron/js2c/browser_init.js:217:1457)\n    at Object.navigationListener (electron/js2c/browser_init.js:217:1763)\n    at Object.emit (events.js:315:20)\n    at Object.EventEmitter.emit (domain.js:467:12)",
[07/16/21, 08:18:31:355] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox
[07/16/21, 08:18:31:419] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox
[07/24/21, 09:00:52:252] error: Failed to load calls-desktop-interop.WindowBorderPanel "error": { "stack": "Error: Module did not self-register: '/snap/slack/42/usr/lib/slack/resources/app.asar.unpacked/node_modules/@tinyspeck/calls-desktop-interop/build/Release/CallsDesktopInterop.node'.\n    at process.func [as dlopen] (electron/js2c/asar_bundle.js:5:1846)\n    at Object.Module._extensions..node (internal/modules/cjs/loader.js:1138:18)\n    at Object.func [as .node] (electron/js2c/asar_bundle.js:5:2073)\n    at Module.load (internal/modules/cjs/loader.js:935:32)\n    at Module._load (internal/modules/cjs/loader.js:776:14)\n    at Function.f._load (electron/js2c/asar_bundle.js:5:12684)\n    at Module.require (internal/modules/cjs/loader.js:959:19)\n    at require (internal/modules/cjs/helpers.js:88:18)\n    at bindings (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/bindings/bindings.js:112:48)\n    at Object. (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/@tinyspeck/calls-desktop-interop/lib/index.js:1:34)",
[07/24/21, 09:00:52:260] warn: Failed to install protocol handler for slack:// links
[07/24/21, 09:00:52:440] error: Cannot override webPreferences key(s): webviewTag
[...]
```

I browsed the logs for a while but didn't see a smoking gun. Surely it spits out some error message when crashing, like to STDERR...

## 8. STDERR Tracing

Where is STDERR written?

```
# lsof -p 3666477
[...]
slack   3666477 bgregg  mem   REG   7,16   141930     7165 /snap/slack/44/usr/lib/slack/chrome_100_percent.pak
slack   3666477 bgregg  mem   REG   7,16   165680     7433 /snap/slack/44/usr/lib/slack/v8_context_snapshot.bin
slack   3666477 bgregg    0r  CHR    1,3      0t0        6 /dev/null
slack   3666477 bgregg    1w  CHR    1,3      0t0        6 /dev/null
slack   3666477 bgregg    2w  CHR    1,3      0t0        6 /dev/null
slack   3666477 bgregg    3r FIFO   0,12      0t0 29532192 pipe
slack   3666477 bgregg    4u unix 0x00000000134e3c45 0t0 29526717 type=SEQPACKET
slack   3666477 bgregg    5r  REG   7,16 10413488     7167 /snap/slack/44/usr/lib/slack/icudtl.dat
[...]
```

/dev/null? Like that's going to stop me. I could trace writes to STDERR, but I think my old shellsnoop(8) tool (another from eBPF/bcc) already does that:

```
# shellsnoop 3666477
[...]
[08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes
[08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred
[08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.10ms
[08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S928EBE over 2.60ms
[08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events
[08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now
[08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S928EBE over 18.60ms
[08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.00ms

(slack:3666477): Gtk-WARNING **: 14:46:45.525: Could not load a pixbuf from icon theme.
This may indicate that pixbuf loaders or the mime database could not be found.
**
Gtk:ERROR:../../../../gtk/gtkiconhelper.c:494:ensure_surface_for_gicon: assertion failed (error == NULL): Failed to load /usr/share/icons/Yaru/16x16/status/image-missing.png: Unable to load image-loading module: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5)
```

Ah-ha! The last message printed is an error about a .png file and a .so file. As it's Slack's final message before crashing, this is a smoking gun.
Note that this was not in any log!:

```
# grep image-missing.png *
grep: recorded-trace: Is a directory
```

It's the .so file that is missing, not the .png:

```
# ls -lh /usr/share/icons/Yaru/16x16/status/image-missing.png
-rw-r--r-- 1 root root 535 Nov  6  2020 /usr/share/icons/Yaru/16x16/status/image-missing.png
# ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
ls: cannot access '/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so': No such file or directory
```

But there is a .so file with a similar path:

```
# ls -lh /snap/slack/
total 0
drwxrwxr-x 8 root root 123 Jul 14 02:49 43/
drwxrwxr-x 8 root root 123 Aug 18 10:27 44/
lrwxrwxrwx 1 root root   2 Aug 24 09:48 current -> 44/
# ls -lh /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
-rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
```

Hmm, I wonder...

## 9. Workaround

This is obviously a hack and is not guaranteed to be safe:

```
# cd /snap/slack
# ln -s current 42
# ls -lh
total 0
lrwxrwxrwx 1 root root   7 Aug 27 15:01 42 -> current/
drwxrwxr-x 8 root root 123 Jul 14 02:49 43/
drwxrwxr-x 8 root root 123 Aug 18 10:27 44/
lrwxrwxrwx 1 root root   2 Aug 24 09:48 current -> 44/
# ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
-rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
```

I don't know why Slack was looking up this library via the old directory version, but linking the new version to the old path did the trick. Slack has stopped crashing! I'm guessing this is a problem with how the snap is built. Needs more debugging.

## 10. Other debugging

In case you're wondering what I'd do if I didn't find the error in STDERR, I'd go back to setting ulimits to see if I could get a core dump, and if that still didn't work, I'd try to run Slack from a gdb(1) session. I'd also work on fixing the user stack trace and symbols to see what that revealed.

## 11. Bonus: opensnoop

I often wonder how I could have debugged things sooner, and I'm kicking myself I didn't run opensnoop(8) as I usually do. Tracing just failed opens:

```
# opensnoop -Tx
TIME(s)        PID     COMM     FD ERR PATH
[...]
11.412358000   3673057 slack    -1   2 /var/lib/snapd/desktop/mime/subclasses
11.412360000   3673057 slack    -1   2 /var/lib/snapd/desktop/mime/icons
11.412363000   3673057 slack    -1   2 /var/lib/snapd/desktop/mime/generic-icons
11.412495000   3673057 slack    -1   2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
11.412527000   3673057 slack    -1   2 /usr/share/locale/en_AU/LC_MESSAGES/gdk-pixbuf.mo
11.412537000   3673057 slack    -1   2 /usr/share/locale/en/LC_MESSAGES/gdk-pixbuf.mo
11.412559000   3673057 slack    -1   2 /usr/share/locale-langpack/en/LC_MESSAGES/gdk-pixbuf.mo
11.412916000   3673057 slack    -1   2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
11.425405000   1786    systemd  -1   2 /sys/fs/cgroup/memory/user.slice/user-1000.slice/user@1000.service/snap.slack.slack.402dde03-7f71-48a0-98a5-33fd695ccbde.scope/memory.events
```

That shows its last failed open was to the .so file. Which would have been a good lead. But the best clue was the secret STDERR messages Slack sends to /dev/null, rescued using shellsnoop(8).
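A note on reading that opensnoop output: the ERR column is the errno from the failed open(2), and ERR 2 is ENOENT ("No such file or directory"), exactly what the missing pixbuf loader .so would produce. A small sketch (the path below is hypothetical):

```python
import errno
import os

# opensnoop's ERR column holds the errno from the failed open(2).
# ERR 2 is ENOENT, i.e. the path does not exist.
print(errno.ENOENT, os.strerror(errno.ENOENT))

# Reproducing the kind of failure opensnoop was showing, using a
# hypothetical (nonexistent) path:
try:
    os.open("/snap/slack/42/usr/lib/nonexistent.so", os.O_RDONLY)
    err = 0
except OSError as e:
    err = e.errno
print(err)  # 2 (ENOENT)
```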

[$] Not-a-GPU accelerator drivers cross the line

3 év 11 hónap óta
As a general rule, the kernel community is happy to merge working device drivers without much concern for the availability of any associated user-space code. What happens in user space is beyond the kernel's concern and unaffected by the kernel's license. There is an exception, though, in the form of drivers for graphics processing units (GPUs), which cannot be merged in the absence of a working, freely-licensed user-space component. The question of which drivers are subject to that rule has come up a few times in recent years; that discussion has now come to a decision point with an effort to block some Habana Labs driver updates from entry into the 5.15 kernel.
corbet

Security updates for Thursday

3 év 11 hónap óta
Security updates have been issued by Fedora (community-mysql, containerd, dotnet3.1, dotnet5.0, perl-Encode, and tor), Mageia (gpsd), openSUSE (cacti, cacti-spine, go1.16, jetty-minimal, libmspack, mariadb, openexr, and tor), SUSE (aspell, jetty-minimal, libesmtp, mariadb, and unrar), and Ubuntu (firefox and mongodb).
jake

Recent and not so recent changes in OpenBSD that make life better (and may turn up elsewhere too)

3 év 11 hónap óta
OpenBSD Journal co-editor Peter Hansteen writes in, saying

When Jonathan at SEMIBUG asked me whether I could do a talk for them, it took me a few moments to come up with a good subject and title. But for what would be a lunchtime session in the US East time zone on Sunday, August 22nd, 2021, the subject became

Recent and not so recent changes in OpenBSD that make life better (and may turn up elsewhere too)

Once I had come up with the title, I wrote it up as a full article, now hosted on my blog.

If you prefer the hyper-condensed version, the slides for the talk are available, too.

Linux Plumbers Conference: BOFs Call for Proposals Now Open

3 év 11 hónap óta

We have formally opened the CfP for Birds of a Feather. Select the BOFs track when submitting a BOF here.

As a reminder:

  • Submissions for Microconference topics are still open.
  • The GNU Tools Track CfP will close in a few days, on August 31.
  • The Kernel Summit track is open for late-breaking topics.
  • All the other tracks are now closed for additional submissions.

 

Reminder: The Kernel Report on August 26

3 év 11 hónap óta
One last reminder that LWN editor Jonathan Corbet will be presenting a version of The Kernel Report at 9:00 US/Mountain (15:00 UTC) on August 26. This live presentation is part of a test of the infrastructure for the 2021 Linux Plumbers Conference, but anybody is welcome to attend regardless of whether they are registered for LPC or not. The meeting "room" will open one hour ahead of the talk at meet.lpc.events; we hope to see you there.
corbet

[$] DVB, header files, and user-space regressions

3 év 11 hónap óta
A regression that was recently reported for 5.14 in the media subsystem is a bit of a strange beast. The kernel's user-space binary interface (ABI) was not changed, which is the usual test for a patch to get reverted, but the report still led to a reversion. The change did lead to problems building a user-space application because it moved some header files to staging/ as part of a cleanup for a deprecated—though apparently still functioning—driver for a Digital Video Broadcasting (DVB) device. There are a few different issues tangled together here, but the reversion of a regression in the user-space API (and not ABI) is a new wrinkle.
jake

Security updates for Wednesday

3 év 11 hónap óta
Security updates have been issued by Debian (openssl), openSUSE (libspf2, openssl-1_0_0, and openssl-1_1), Oracle (libsndfile), SUSE (nodejs10, nodejs12, openssl, openssl-1_0_0, openssl-1_1, and openssl1), and Ubuntu (openssl).
ris

Linux Foundation Technical Advisory Board election: nominees sought

3 év 11 hónap óta
The call for nominees for the 2021 Linux Foundation Technical Advisory Board election has gone out.

The TAB serves as the interface between the kernel development community and the Linux Foundation, advising the Foundation on kernel-related matters, helping member companies learn to work with the community, and working to resolve community-related problems (preferably before they get out of hand). We also support the Code of Conduct committee in their mission.

The election itself will be held during the Linux Plumbers Conference, September 20 to 24. Note that the election procedures have changed this year with an eye toward broadening the community that is eligible to vote.

corbet

30 Years ago...

3 év 11 hónap óta
On August 25, 1991, Linus Torvalds posted his famous message to the comp.os.minix USENET group:

I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready.

After all of these years, it still feels like it's "starting to get ready"; what a ride it has been.

corbet

[$] Adding a "duress" password with PAM Duress

3 év 11 hónap óta
Users often store a lot of sensitive information on their computers—from credentials to banned texts to family photos—that they might normally expect to be protected by the login password of their account. Under some circumstances, though, users can be required to log into their system so that some third party (e.g. government agent) can examine and potentially copy said data. A new project, PAM Duress, provides a way to add other passwords to an account, each with its own behavior, which might be a way to avoid granting full access to the system, though the legality is in question.
jake

Security updates for Tuesday

3 év 11 hónap óta
Security updates have been issued by Debian (ledgersmb, tnef, and tor), Fedora (nodejs-underscore and tor), openSUSE (aws-cli, python-boto3, python-botocore, fetchmail, firefox, and isync), SUSE (aws-cli, python-boto3, python-botocore, python-service_identity, python-trustme, python-urllib3, and python-PyYAML), and Ubuntu (linux-aws-5.8, linux-azure-5.8, linux-gcp-5.8, linux-oracle-5.8).
ris