Kernel Planet

Harald Welte: OsmoDevCon 2024: "Anatomy of the eSIM Profile"

1 év 3 hónap óta

I've presented a talk Anatomy of the eSIM Profile as part of the OsmoDevCon 2024 conference on Open Source Mobile Communications.

In the eSIM universe, eSIM profiles are the virtualised content of a classic USIM (possibly with ISIM, CSIM, applets, etc.).

Let's have a look what an eSIM profile is:

  • how is the data structured / organized?

  • what data can be represented in it?

  • how to handle features provided by eUICC, how can the eSIM profile mandate some of them?

  • how does personalization of eSIM profiles work?

There is also hands-on navigation through profiles, based on the pySim.esim.saip module.

You can find the video recording at https://media.ccc.de/v/osmodevcon2024-174-anatomy-of-the-esim-profile

Harald Welte: OsmoDevCon 2024: "Detailed workings of OTA for SIM/USIM/eUICC"

1 év 3 hónap óta

I've presented a talk Detailed workings of OTA for SIM/USIM/eUICC as part of the OsmoDevCon 2024 conference on Open Source Mobile Communications.

Everyone knows that OTA (over the air) access to SIM cards exists for decades, and that somehow authenticated APDUs can be sent via SMS.

But let's look at the OTA architecture in more detail:

  • OTA transport (SCP80) over SMS, USSD, CellBroadcast, CAT-TP, BIP

  • The new SCP81 transport (HTTPS via TLS-PSK)

  • how to address individal applications on the card via their TAR

  • common applications like RFM and RAM

  • custom applications on the card

  • OTA in the world of eUICCs

  • talking to the ECASD

  • talking to the ISD-R

  • talking to the ISD-P/MNO-SD or applications therein

You can find the video recording at https://media.ccc.de/v/osmodevcon2024-175-detailed-workings-of-ota-for-sim-usim-euicc

Harald Welte: OsmoDevCon 2024: "GlobalPlatform in USIM and eUICC"

1 év 3 hónap óta

I've presented a talk GlobalPlatform in USIM and eUICC as part of the OsmoDevCon 2024 conference on Open Source Mobile Communications.

The GlobalPlatform Card Specification and its many amendments play a significant role in most real-world USIM/ISIM, and even more so in eUICC.

The talk will try to provide an overview of what GlobalPlatform does in the telecommunications context.

Topics include:

  • security domains

  • key loading

  • card and application life cycle

  • loading and installation of applications

  • Secure Channel Protocols SCP02, SCP03

You can find the video recording at https://media.ccc.de/v/osmodevcon2024-173-globalplatform-in-usim-and-euicc

Linux Plumbers Conference: Awesome amount of Microconference submissions!

1 év 3 hónap óta

The Call-for-Proposals for Microconferences has come to a close, and with that, this year’s list of Microconferences is to be decided. A Microconference is a 3 and a half hour session with a half hour break (giving a total of 3 hours of content). Linux Plumbers has three Microconference tracks running per day, with each track having two Microconferences (one in the morning and one in the afternoon). Linux Plumbers runs for three days allowing for 18 Microconferences total (2 per track, with 3 tracks a day for 3 days).

This year we had a total of 26 quality submissions! Linux Plumbers is known as the conference that gets work done, and its success is proof of that. But sometimes success brings its own problems. How can we accept 26 Microconferences when we only have 18 slots to place them? Two of the Microconferences have agreed to merge as one bringing the total down to just 25. But that still is 7 more than we can handle.

We want to avoid rejecting 7 microconferences, but to do so, we need to make compromises. The first idea we have is to add a 4th Microconference track. But that still only gives us 6 more slots. As it will also require more A/V and manpower, the cost will increase and may not be within the budget to do so.

Pros to a 4 track are:

  • Have 24 full Microconferences and reject one (or 23 and keep 2 as per the next option).

Cons to a 4 track are:

  • Increased costs.
  • Having 4 Microconference tracks running simultaneously will cause more conflicts in the schedule. People have complained in the past about conflicts between sessions with just 3 tracks, having 4 will exacerbate the situation.
  • Still may need to reject 1 Microconference.

Another solution is to create a half Microconference (Nanoconference?). That is an hour and a half session, run the same as the full sessions, with the 30 minute break between two Nanoconferences. Doing so will allow for 11 full Microconferences and 14 Nanoconferences which will allow for all submissions to be accepted and fit within the 3 tracks.

The difference between a Nanoconference and a BOF is that a Nanoconference still has all the rules of a Microconference. That is, all sessions should be strictly discussion focus. If presentations are needed, they should be submitted as Refereed talks (the CFP for them are still open). A BOF is usually focused on a single issue. A Nanoconference should still be broken up into small discussions about different issues with sessions lasting 15 to 20 minutes each.

Pros for Nanoconferences are:

  • Can accommodate all submitted Microconferences.

Cons for Nanoconferences are:

  • Shortened time for Microconferences, even for topics in the past that had filled a full Microconference may now only get half the time.

Note, as BOFs will be in a separate track, a Nanoconference may be able to still submit for topics there (BOF submissions are still open).

Currently, we also give out 3 free passes to each Microconference that can be handed to anyone in their session. For 18 Microconferences, that is 54 passes. This will not be feasible to give out 3 passes to 25 Microconferences (totaling 75 passes), thus one solution is to drop it down to 2 free passes. The problem with passes is still an issue with the Nanoconference approach, as you can not give out 1 and a half passes. Thus, the Nanoconferences may only get 1 pass each, or perhaps have both the Microconferences and Nanoconferences all get just 2 passes each.

Anyway, since the above solutions still allow for 11 full Microconferences, we have accepted 9 so far. They are:

  • Android
  • KVM
  • Kernel Memory Management
  • Scheduling
  • Rust
  • Kernel Testing and Dependability
  • Graphics and DRM
  • Safe Systems with Linux
  • Tracing / Perf events (This is the merged Microconference)

We are still weighing our options so stay tuned for updates on the situation, and thank you to all the Microconference submitters that make
Linux Plumbers the best technical conference around!

Harald Welte: OsmoDevCon 2024: "High-performance I/O using io_uring via osmo_io"

1 év 3 hónap óta

I've co-presented a talk (together with Andreas Eversberg High-performance I/O using io_uring via osmo_io as part of the OsmoDevCon 2024 conference on Open Source Mobile Communications.

Traditional socket I/O via read/write/recvfrom/sendto/recvmsg/sendmsg and friends creates a very high system call load. A highly-loaded osmo-bsc spends most of its time in syscall entry and syscall exit.

io_uring is a modern Linux kernel mechanism to avoid this syscall overhead. We have introduced the osmo_io`API to libosmocore as a generic back-end for non-blocking/asynchronous I/O and a back-end for our classic `osmo_fd / poll approach as well as a new backend for io_uring.

The talk will cover

  • a very basic io_uring introduction

  • a description of the osmo_io API

  • the difficulties porting from osmo_fd to osmo_io

  • status of porting various sub-systems over to osmo_io

You can find the video recording at https://media.ccc.de/v/osmodevcon2024-209-high-performance-i-o-using-iouring-via-osmoio

Pete Zaitcev: Export to STEP in OpenSCAD

1 év 3 hónap óta
The only way to obtain STEP from OpenSCAD that I know is an external tool that someone made. It's pretty crazy actually: it parses OpenSCAD's native export, CSG, and issues commands to OpenCASCADE's CLI, OCC-CSG. The biggest issue for me here is that his approach cannot handle transformations that the CLI does not support. I use hull all over the place and a tool that does not support hull is of no use for me.

So I came up with a mad lad idea: just add a native export of STEP to OpenSCAD. The language itself is constructive, and an export to CSG exists. I just need to duplicate whatever it does, and then at each node, transform it into something that can be expressed in STEP.

As it turned out, STEP does not have any operations. It only has manifolds assembled from faces, which are assembled from planes and lines, which are assembled from cartesian points and vectors. Thus, I need to walk the CSG, compile it into a STEP representation, and only then write it out. Operations like union, difference, or hull have to be computed by my code. The plan is to borrow from OpenSCAD's compiler that builds the mesh, only build with larger pieces - possibly square or round.

Not sure if this is sane and can be made to work, but it's pretty fun at least.

Pete Zaitcev: sup Python you okay bro

1 év 3 hónap óta
What do you think this does:

class A(object):
 def aa(self):
 return 'A1'
class A(object):
 def aa(self):
 return 'A2'
a = A()
print("%s" % a.aa())

It prints "A2".

But before you think "what's the big deal, the __dict__ of A is getting updated", how about this:

class A(object):
 def aa(self):
 return 'A1'
class A(object):
 def bb(self):
 return 'A2'
a = A()
print("%s" % a.aa())

This fails with "AttributeError: 'A' object has no attribute 'aa'".

Apparently, the latter definition replaces the former completely. This is darkly amusing.

Python 3.12.2

Pete Zaitcev: Trailing whitespace in vim

1 év 3 hónap óta
Problem:
When copying from tmux in gnome-terminal, the text is full of whitespace. How do I delete it in gvim?

Solution:
/ \+$

Obviously.

This is an area where tmux is a big regression from screen. Too bad.

Pete Zaitcev: Boot management magic in Fedora 39

1 év 3 hónap óta
Problem: After an update to F39, a system continues to boot F38 kernels

The /bin/kernel-install generates entries in /boot/efi/loader/entries instead of /boot/loader/entries. Also, they are in BLS Type 1 format, and not in the legacy GRUB format. So I cannot copy them over.

Solution:
[root@chihiro zaitcev]# dnf install ostree
[root@chihiro zaitcev]# rm -rf /boot/efi/$(cat /etc/machine-id) /boot/efi/loader/

I've read a bunch of docs and the man page for kernel-install(8), but they are incomprehensible. Still the key insight was that all that Systemd stuff loves to autodetect by finding this directory or that.

The way to test is:
[root@chihiro zaitcev]# /bin/kernel-install -v add 6.8.5-201.fc39.x86_64 /lib/modules/6.8.5-201.fc39.x86_64/vmlinuz

Harald Welte: Gradual migration of IP address/port between servers

1 év 4 hónap óta

I'm a strong proponent of self-hosting all your services, if not on your own hardware than at least on dedicated rented hardware. For IT nerds of my generation, this has been the norm sicne the early 1990s: If you wante to run your own webserver/mailserver/... back then, the only way was to self-host.

So over the last 30 years, I've always been running a fleet of machines, some my own hardware colocated, and during the past ~18 years also some rented dedicated "root servers". They run a plethora of services for either my personal stuff (like this blog, or my personal email server), or any of the IT services of the open source projects I'm involved in (like osmocom) or the company I co-founded and run (sysmocom).

Every few years there's the need to migrate to new hardware. Either due to power consumption/efficiency, or to increase performance, or to simply avoid aging hardware that may be dying soon.

When upgrading from one [hosted] server to another [hosted] server, there's always the question of how to manage the migration with minimal interruption to services. For very simple services like http/https, it can usually be done entirely within DNS: You reduce the TTL of the records, bring up the service on the new server (with a new IP), make the change in the DNS and once the TTL of the DNS record is expired in all caches, everybody will access the new server/IP.

However, there are services where the IP address must be retained. SMTP is a prime example of that. Given how spam filtering works, you certainly will not want to give up your years if not decadeds of good reputation for your IP address. As a result, you will want to keep the IP address while doing the migration.

If it's a physical machine in colocation or your home, you can of course do that all rather easily under your control. You can synchronize the various steps from stopping the services on the old machine, rsync'ing over the spool files to the new, then migrate the IP over to the new machine.

However, if it's a rented "root" server at a company like Hetzner or KVH, then you do not have full control over when exactly the IP address will be migrated over to the new server.

Also, if there are many different services on that same physical machine, running on a variety of different IPv4/IPv6 addresess and ports, it may be difficult to migrate all of them at once. It would be much more practical, if individual services could be migrated step by step.

The poor man's approach would be to use port-forwarding / reverse-proxying. In this case, the client establishes a TCP connection to the old IP address on the old server, and a small port-forward proxy accepts that TCP connectin, creates a second TCP connection to the new server, and bridges those two together. This approach only works for the most simplistic of services (like web servers), where

  • there are only inbound connections from remote clients (as outbound connections from the new server would originate from the new IP, not the old one), and

  • where the source IP of the client doesn't matter. To the new server all connections' source IP addresses suddenly are masked and there's only one source IP (the old server) for all connections.

For more sophisticated serviecs (like e-mail/SMTP, again), this is not an option. The SMTP client IP address matters for whitelists/blacklists/relay rules, etc. And of course there are also plenty of outbound SMTP connections which need to originate from the old IP, not the new IP.

So in bed last night [don't ask], I was brainstorming if the goal of fully transparent migration of individual TCP+UDP/IP (IPv4+IPv6) services could be made between and old and new server. In theory it's rather simple, but in practice the IP stack is not really designed for this, and we're breaking a lot of the assumptions and principles of IP networking.

After some playing around earlier today, I was actually able to create a working setup!

It fulfills the followign goals / exhibits the following properties:

  • old and new server run concurrently for any amount of time

  • individual IP:port tuples can be migrated from old to new server, as services are migrated step by step

  • fully transparent to any remote peer: Old IP:port of server visible to client

  • fully transparent to the local service: Real client IP:port of client visible to server

  • no constraints on whether or not the old and new IPs are in the same subnet, link-layer, data centre, ...

  • use only stock features of the Linux kernel, no custom software, kernel patches, ...

  • no requirement for controlling a router in front of either old or new server

General Idea

The general idea is to receive and classify incoming packets on the old server, and then selectively tunnel some of them via a GRE tunnel from the old machine to the new machine, where they are decapsulated and passed to local processes on the new server. Any packets generated by the service on the new server (responses to clients or outbound connections to remote serveers) will take the opposite route: They will be encapsulated on the new server, passed through that GRE tunnel back to the old server, from where they will be sent off to the internet.

That sounds simple in theory, but it poses a number of challenges:

  • packets destined for a local IP address of the old server need to be re-routed/forwarded, not delivered to local sockets. This is easily done with fwmark, multiple routing tables and a rule, similar ro many other policy routing setups.

  • FIXME

Linux Plumbers Conference: Networking Track

1 év 4 hónap óta

Linux Plumbers Conference 2024 is pleased to host the Networking Track!

LPC Networking track is an in-person manifestation of the netdev mailing list, bringing together developers, users and vendors to discuss topics related to Linux networking. Relevant topics span from proposals for kernel changes, through user space tooling, netdev testing and CI, to presenting interesting use cases, new protocols or new, interesting problems waiting for a solution.

The goal is to allow gathering early feedback on proposals, reach consensus on long running mailing list discussions and raise awareness of interesting work and use cases.

After four years of co-locating BPF & Networking Tracks together this year we separated the two, again. Please submit to the track which feels suitable, the committee will transfer submissions between tracks as it deems necessary.

Please come and join us in the discussion. We hope to see you there!

Brendan Gregg: Linux Crisis Tools

1 év 4 hónap óta

When you have an outage caused by a performance issue, you don't want to lose precious time just to install the tools needed to diagnose it. Here is a list of "crisis tools" I recommend installing on your Linux servers by default (if they aren't already), along with the (Ubuntu) package names that they come from:

    PackageProvidesNotes procpsps(1), vmstat(8), uptime(1), top(1)basic stats util-linuxdmesg(1), lsblk(1), lscpu(1)system log, device info sysstatiostat(1), mpstat(1), pidstat(1), sar(1)device stats iproute2ip(8), ss(8), nstat(8), tc(8)preferred net tools numactlnumastat(8)NUMA stats tcpdumptcpdump(8)Network sniffer linux-tools-common
    linux-tools-$(uname -r)perf(1), turbostat(8)profiler and PMU stats bpfcc-tools (bcc)opensnoop(8), execsnoop(8), runqlat(8), softirqs(8),
    hardirqs(8), ext4slower(8), ext4dist(8), biotop(8),
    biosnoop(8), biolatency(8), tcptop(8), tcplife(8),
    trace(8), argdist(8), funccount(8), profile(8), etc.canned eBPF tools[1] bpftracebpftrace, basic versions of opensnoop(8),
    execsnoop(8), runqlat(8), biosnoop(8), etc.eBPF scripting[1] trace-cmdtrace-cmd(1)Ftrace CLI nicstatnicstat(1)net device stats ethtoolethtool(8)net device info tiptoptiptop(1)PMU/PMC top cpuidcpuid(1)CPU details msr-toolsrdmsr(8), wrmsr(8)CPU digging

(This is based on Table 4.1 "Linux Crisis Tools" in SysPerf 2.)

    Some longer notes: [1] bcc and bpftrace have many overlapping tools: the bcc ones are more capable (e.g., CLI options), and the bpftrace ones can be edited on the fly. But that's not to say that one is better or faster than the other: They emit the same BPF bytecode and are equally fast once running. Also note that bcc is evolving and migrating tools from Python to libbpf C (with CO-RE and BTF) but we haven't reworked the package yet. In the future "bpfcc-tools" should get replaced with a much smaller "libbpf-tools" package that's just tool binaries.

This list is a minimum. Some servers have accelerators and you'll want their analysis tools installed as well: e.g., on Intel GPU servers, the intel-gpu-tools package; on NVIDIA, nvidia-smi. Debugging tools, like gdb(1), can also be pre-installed for immediate use in a crisis.

Essential analysis tools like these don't change that often, so this list may only need updating every few years. If you think I missed a package that is important today, please let me know (e.g., in the comments).

The main downside of adding these packages is their on-disk size. On cloud instances, adding Mbytes to the base server image can add seconds, or fractions of a second, to instance deployment time. Fortunately the packages I've listed are mostly quite small (and bcc will get smaller) and should cost little size and time. I have seen this size concern prevent debuginfo (totaling around 1 Gbyte) from being included by default.

Can't I just install them later when needed?

Many problems can occur when trying to install software during a production crisis. I'll step through a made-up example that combines some of the things I've learned the hard way:

  • 4:00pm: Alert! Your company's site goes down. No, some people say it's still up. Is it up? It's up but too slow to be usable.
  • 4:01pm: You look at your monitoring dashboards and a group of backend servers are abnormal. Is that high disk I/O? What's causing that?
  • 4:02pm: You SSH to one server to dig deeper, but SSH takes forever to login.
  • 4:03pm: You get a login prompt and type "iostat -xz 1" for basic disk stats to begin with. There is a long pause, and finally "Command 'iostat' not found...Try: sudo apt install sysstat". Ugh. Given how slow the system is, installing this package could take several minutes. You run the install command.
  • 4:07pm: The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration. Since the server owners are now in the SRE chatroom to help with the outage, you ask: "how do you install system packages?" They respond "We never do. We only update our app." Ugh. You find a different server and copy its working /etc/apt config over.
  • 4:10pm: You need to run "apt-get update" first with the fixed config, but it's miserably slow.
  • 4:12pm: ...should it really be taking this long??
  • 4:13pm: apt returned "failed: Connection timed out." Maybe this system is too slow with the performance issue? Or can't it connect to the repos? You begin network debugging and ask the server team: "Do you use a firewall?" They say they don't know, ask the network security team.
  • 4:17pm: The network security team have responded: Yes, they have blocked any unexpected traffic, including HTTP/HTTPS/FTP outbound apt requests. Gah. "Can you edit the rules right now?" "It's not that easy." "What about turning off the firewall completely?" "Uh, in an emergency, sure."
  • 4:20pm: The firewall is disabled. You run apt-get update again. It's slow, but works! Then apt-get install, and...permission errors. What!? I'm root, this makes no sense. You share your error in the SRE chatroom and someone points out: Didn't the platform security team make the system immutable?
  • 4:24pm: The platform security team are now in the SRE chatroom explaining that some parts of the file system can be written to, but others, especially for executable binaries, are blocked. Gah! "How do we disable this?" "You can't, that's the point. You'd have to create new server images with it disabled."
  • 4:27pm: By now the SRE team has announced a major outage and informed the executive team, who want regular status updates and an ETA for when it will be fixed. Status: Haven't done much yet.
  • 4:30pm: You start running "cat /proc/diskstats" as a rudimentary iostat(1), but have to spend time reading the Linux source (admin-guide/iostats.rst) to make sense of it. It just confirms the disks are busy which you knew anyway from the monitoring dashboard. You really need the disk and file system tracing tools, like biosnoop(8), but you can't install them either. Unless you can hack up rudimentary tracing tools as well...You "cd /sys/kernel/debug/tracing" and start looking for the FTrace docs.
  • 4:55pm: New server images finally launch with all writable file systems. You login – gee it's fast – and "apt-get install sysstat". Before you can even run iostat there are messages in the chatroom: "Website's back up! Thanks! What did you do?" "We restarted the servers but we haven't fixed anything yet." You have the feeling that the outage will return exactly 10 minutes after you've fallen asleep tonight.
  • 12:50am: Ping! I knew this would happen. You get out of bed and open your work laptop. The site is down – it's been hacked – someone disabled the firewall and file system security.

I've fortunately not experienced the 12:50am event, but the others are based on real world experiences. In my prior job this sequence can often take a different turn: a "traffic team" may initiate a cloud region failover by about the 15 minute mark, so I'd eventually get iostat installed but then these systems would be idle.

Default install

The above scenario explains why you ideally want to pre-install crisis tools so you can start debugging a production issue quickly during an outage. Some companies already do this, and have OS teams that create custom server images with everything included. But there are many sites still running default versions of Linux that learn this the hard way. I'd recommend Linux distros add these crisis tools to their enterprise Linux variants, so that companies large and small can hit the ground running when performance outages occur.

James Bottomley: Figuring out how ipsec transforms work in Linux

1 év 4 hónap óta

I’ve had a couple of reasons recently to wonder about ipsec: one was doing private overlay networks in confidential VMs and the other was trying to be more efficient than my IPv4 openVPN when I’m remote on an IPv6 capable network. Usually ipsec descriptions begin with tools like raccoon or strong/open/libreswan; however, I’m going to try to explain how you do ipsec at a very basic level within Linux networking stack without using an ipsec toolkit. I’m going to concentrate on my latter use case, so this post is going to be ipsec over IPv6 (although most of the concepts should be applicable to IPv4). To attempt to do this, I’ll be delving into the ip xfrm commands extensively and trying to explain how the transform filters and policy work with the rest of the Linux networking stack.

The basics of ipsec

ipsec has two “protocols”: Authentication Header (AH) which means the packet is fully authenticated (or integrity protected) by an HMAC but not encrypted; and Encapsulating Security Payload (ESP), where the packet is encrypted but not necessarily integrity protected (ipsec was invented before AEAD ciphers, so previously you used both protocols to ensure confidentiality and integrity but, in the modern world, you can use ESP with an AEAD cipher and dispense entirely with AH). Once you have the protocol set, you encapsulate either in transport or tunnel mode. Transport mode means that the protocol headers are simply added to an existing IP packet (so the source and destination address remain the same) in the case of AH and an added header plus an encryption transformed payload with ESP and tunnel mode means that the entire IP packet is encapsulated and a new outer source and destination address is added (this is sometimes referred to as an ipsec VPN).

Understanding ipsec flows

This diagram should help understand how ipsec transforms work. There are two aspects to this: policy which basically does accept/reject and tagging and state, which does the encode/decode

The square boxes are the firewall filters and the ellipses are the ip xfrm policy and state transforms. xfrm decode is unconditionally activated whenever an ipsec packet reaches the input flow (provided there’s a matching state rule), output encoding only occurs if a matching output policy says it should (otherwise the packet is passed unencoded) and the xfrm policy fwd has no matching encode/decode, so it’s not possible to ipsec transform forwarded packets.

ip xfrm policy

To decapsulate, there is no requirement for a policy: every ipsec packet coming into the INPUT flow will be checked for a state match and decapsulated if one is found. However, for the packet to progress further you may need a policy. For transport mode, the decapsulated packet will go back around the input loop, so a dir in policy likely isn’t required but for tunnel mode, the decapsulated packet will likely traverse the FORWARD table and a dir fwd policy plus a firewall rule is likely required to permit the packet. Forwarded packets also hit the dir out policy as well.

A policy is specified with two parts: a selector which contains a set of matches (the only mandatory part of which is the direction dir) and which may match on partial addresses an action (block or allow, with allow being default) and a template (tmpl) which can specify the encoding (for dir out) or additional rules based on encapsulation.

Encoding Templates

Encoding only applies to dir out policies. Transport mode simply requires a statement of which encapsulation to use (proto ah or esp) and doesn’t require IP addresses in the ID section. In tunnel mode, the template must also have the source and destination outer IP addresses (the current source and destination become the inner addresses).

Every packet matching an encoding policy must also have a corresponding ip xfrm state match to specify the encapsulation parameters. Note that if a state transform is missing, the kernel will signal this on a netlink socket (which you can monitor with ip xfrm monitor). This socket is mostly used by ipsec toolkits to add state transforms just in time.

Other Policy Templates

For the in and fwd directions, the template acts as an additional filter on what packets to allow and what to block. For instance, if only decapsulated packets should be forwarded, then there should be a policy like

ip xfrm policy add dst net/netmask dir fwd tmpl proto ah mode tunnel level required

Which says the only allowed packets are those which were encapsulated in tunnel mode. Note that for this policy to be reached at all, the FORWARD table must allow the packets to pass. For most network security people, having a blanket forward permit rule is an anathema, so they often achieve the same thing by applying a firewall mark in decapsulation (the output-mark option of ip xfrm state) and only allowing market packets to pass the FORWARD chain (which dispenses with the need for a xfrm dir fwd policy). The two levers for controlling filter policies are the action (allow or block) the default is allow which is why this statement usually doesn’t appear) and the template level (required or use). The default level is required, which means for the allow rule to match the packet must be decapsulate (level use means pass regardless of decapsulation status)

Security Parameters Index (SPI) and reqid

For ipsec to work, every encapsulated packet must have a SPI value. You can specify this in the state transformation. The standards (RFC2409, RFC4303, etc) specify that SPI values 1-255 are “reserved”. Additionally the standards allow SPI value 0 to be used internally, which the Linux Kernel takes advantage of. SPI is mostly used to distinguish packet streams from the same host for complex ipsec policy, and don’t have much use in a simple policy situation. However, you must provide a value that isn’t 0-255 otherwise strange things can happen. In particular 0, the value you’ll get if you don’t specify spi, often causes the packet to get lost after decapsulation, so always specify a large number for spi. requid is a label which is attached to an unencapsulated packet that effectively remembers what the SPI value was; it’s mostly used as a label based discriminator in policy template to state transforms.

For the purposes of the following example I’ll simply use the randomly chosen 4321 for the spi value (but you could choose anything outside the 0-255 range).

ip xfrm state

Unlike policy, which attaches to a particular location (in, out or fwd), state is location agnostic and the same state match could theoretically be used both to encapsulate or decapsulate. State matching also isn’t subnet based: the address matching is either exact or fully wild card (match everything). However, a state encapsulation transform rule must match on dst and a decapsulation one may match on either src or dst, but must have an exact match on one or other. The main thing the state specifies is the algorithm to encapsulate (see man ip-xfrm for a full list). Remember you also must specify spi. The only other thing you might want to specify is the sel parameter. The selector applies to the inner address of decapsulated packets and is to ensure that a mode tunnel packet is going to an address you approve of.

Simple Example: HMAC authentication between two nodes

Assume a [4321::]/64 subnet with two nodes [4321::1] and [4321::2]. To set up authenticated headers one way (from 1->2) you need a policy specifying AH (can be specific or subnet based, so this is subnet)

ip xfrm policy add dst 4321::/64 dir out tmpl proto ah mode tunnel

Followed by a state that’s specific to the destination (using random spi 4321 and short key 1234):

ip xfrm state add dst 4321::2 proto ah spi 4321 auth "hmac(sha1)" 1234 mode transport

If you ping from [4321::1] and do a tcp dump from [4321::2] you’ll see

IP6 4321::1 > 4321::2: AH(spi=0x000010e1,seq=0x1b,icv=0x530cdd96149288da7a35fc6d): ICMP6, echo request, id 4, seq 104, length 64

But nothing will come back until you add on [4321::2]

ip xfrm state add dst 4321::2 proto ah spi 4321 auth "hmac(sha1)" 1234 mode transport

Which will cause a ping response to be seen. Note the ping packet has an authentication header, but the response is a simple icmp6 response packet (no AH) demonstrating ipsec can be set up asymmetrically.

Example: Private network for Cloud Nodes

Assume we have N nodes with public IP addresses [4321::1]…[4321::N] (which could be provided by the cloud overlay or simply by virtue of the physical network the nodes are on) and we want to connect them in a private mesh network using encryption. There are two ways of doing this: the first is to encrypt all traffic between the nodes on the public network using transport mode and the second would be to set up overlay tunnels between the nodes (this latter can be used even if the public addresses aren’t on a single network segment).

Simple Transport Mode Encryption

Firstly each node needs a policy to require encryption both to and from the private network

ip xfrm policy add dst 4321::0/64 dir out tmpl proto esp ip xfrm policy add scr 4321::/64 dir in tmpl proto esp

And then for each node on the star and encrypt and a decrypt policy (the aes cipher type is taken from the key length, so I’ve chosen a 128 bit key “1234567890123456”)

ip xfrm state add dst 4321::1 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode transport ip xfrm state add src 4321::1 proto esp sip 4321 enc "cbc(aes)" 1234567890123456 mode transport ... ip xfrm state add dst 4321::N proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode transport ip xfrm state add src 4321::N proto esp sip 4321 enc "cbc(aes)" 1234567890123456 mode transport

This encryption scheme has one key for the entire network, but you could use 1 key per node if you wished (although this wouldn’t necessarily increase security that much). Note that what’s described above is not an overlay network because it relies on using the characteristics of the underlying network (in this case that all nodes are on an IPv6 /64 segment) to do opportunistic transport encryption. To get a true single subnet overlay on top of a disjoint network there must be some sort of tunnel. One way to get the tunnel is simply to use ipsec in tunnel mode, but another is to set up gre tunnels (or another, not necessarily trusted, network overlay which the cloud can likely provide) for the virtual overlay and then use ipsec in transport mode to ensure the packets are always encrypted.

Tunnel Mode Overlay Network

For this example, we’ll allow unencrypted packets to flow over a routed network [4321::1] and [4322::2] (assume network device eth0 on each node) but set up an overlay network on [6666::N]/64 which is fully encrypted. Firstly, each node requires a local address addition for the [6666::N] address. So on node [4321::1] do

ip addr add 6666::1/128 dev eth0

Now add policies and state transforms in both directions (in this case a dir in policy is required otherwise the decapsulated packets won’t get sent up the input flow):

# required policy for encapsulation ip xfrm policy add dst 6666::2 dir out tmpl src 4321::1 dst 4322::2 proto esp mode tunnel # state transform for encapsulation ip xfrm state add src 4321::1 dst 4322::2 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode tunnel # policy to allow passing of decapsulated packets ip xfrm policy add dst 6666::1 dir in tmpl proto esp mode tunnel level required # automatic decapsulation. sel ensures addresses after decapsulation ip xfrm state add src 4322::2 dst 4321::1 proto esp sip 4321 enc "cbc(aes)" 1234567890123456 mode tunnel sel src 6666::2 dst 6666::1

And on the other node [4322::2] do the same in reverse

ip addr add 6666::2/128 dev eth0 ip xfrm policy add dst 6666::1 dir out tmpl src 4321::1 dst 4322::2 proto esp mode tunnel ip xfrm state add src 4322::2 dst 4321::1 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode tunnel ip xfrm policy add dst 6666::2 dir in tmpl proto esp mode tunnel level required ip xfrm state add src 4321::1 dst 4322::2 proto esp sip 4321 enc "cbc(aes)" 1234567890123456 mode tunnel sel src 6666::1 dst 6666::2

Note there’s no need to add any routing entries because the decapsulation is point to point (incoming decapsulated packets always end up in the input flow destined for the local [6666::N] address). With the above rules you should be able to ping from [6666::1] to [6666::2] and tcpdump should show fully encrypted packets going over the wire.

Obviously, you can add more nodes but each time you have to add rules for all the other nodes, making this an N(N-1) scaling problem. The need for a specific source and destination template in the out policy means you must have one for each connection. The in policy can be subnet based. The reason why people use ipsec toolkits is that they can add the transforms just in time for the subset of nodes you’re actually communicating with rather than having to add all the rules up front.

Final Example: correcly keyed AH Inbound Packet Acceptance

The final example is me trying to penetrate my router firewall when on an external IPv6 connection. I have a class /60 set of IPv6 space, so each of my systems has its own IPv6 address but, as is usual, inbound packets in the NEW state are blocked. It occurred to me that I should be able to use ipsec AH (no real need for encryption since most of the protocols I use are encrypted anyway) to accept packets to an internal destination with the NEW state. This would be an asymmetric use of ipsec because inbound would have AH but return wouldn’t.

My initial thought was to use AH in transport mode, but as you can see from the diagram above that won’t work because the router merely forwards the packets and to get to a state decapsulation on the router they have to go up the INPUT flow.

The next attempt used tunnel mode, so the packet was aimed at the router and then the inner destination was the real node. The next problem was the fwd policy to permit this: the xfrm policy has no connection tracker and return packets from connections originating in the interior node also have to pass this filter. The solution to this conundrum is to install a level use policy and rely on the FORWARD table firewall rules to allow RELATED,ESTABLISHED and MARKed packets (so the decapsulation can add the MARK to pass this rule).

IPSEC on the Router

Assume my external router address is [4321::1] and my internal network is [4444::]/60. On the router, I install a catch all state transform

ip xfrm state add add dst 4321::1 proto ah spi 4321 auth "hmac(sha1)" 1234 mode tunnel sel dst 4444::/60 output-mark 0x1

But I also need a policy to permit the decapsulated packets (and the state RELATED,ESTABLISHED unencapsulated ones) to pass:

ip xfrm policy add dst 4444::/60 dir fwd tmpl proto ah spi 4321 mode tunnel level use

And finally I need an addition to the firewall rules to allow packets in state NEW but with mark 0x1 to pass

ip6tables -A FORWARD -m conntrack --ctstate NEW -m mark --mark 0x1/0x1 -j ACCEPT

Which should be placed directly after the RELATED,ESTABLISHED state check.

Since there’s no encapsulation on outbound, the return packets simply pass through the firewall as normal. This means that any external entity wishing to use this AH packet acceptance simply needs a policy and state to tunnel

ip xfrm policy dst 4444::/60 dir out tmpl proto ah dst 4321::1 mode tunnel ip xfrm state add dst 4321::1 spi 4321 proto ah auth "hmac(sha1)" 1234 mode tunnel

And with that, any machine known by inner IPv6 address can be reached (for an IPv6 connected remote machine).

Conclusion

Hopefully this post has demystified some of the ip xfrm rules for you. I’m afraid the commands have a huge range of options, so I’ve only covered the essential ones above and there are still loads of interesting but not at all well documented ones remaining, but thanks to the examples you should have some scope now for playing with them.

Linux Plumbers Conference: eBPF Track

1 év 4 hónap óta

Linux Plumbers Conference 2024 is pleased to host the eBPF Track!

After four years in a row of co-locating eBPF & Networking Tracks together, this year we separated the two in order to allow for both tracks to grow further individually as well as to bring more diversity into LPC by attracting more developers from each community.

The eBPF Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s BPF subsystem and its surrounding user space ecosystem such as libraries, loaders, compiler backends, and other related low-level system tooling.

The gathering is designed to foster collaboration and face to face discussion of ongoing development topics as well as to encourage bringing new ideas into the development community for the advancement of the BPF subsystem.

Proposals can cover a wide range of topics related to BPF covering improvements in areas such as (but not limited to) BPF infrastructure and its use in tracing, security, networking, scheduling and beyond, as well as non-kernel components like libraries, compilers, testing infra and tools.

Please come and join us in the discussion. We hope to see you there!

Brendan Gregg: The Return of the Frame Pointers

1 év 4 hónap óta

Sometimes debuggers and profilers are obviously broken, sometimes it's subtle and hard to spot. From my flame graphs page:


CPU flame graph (partly broken)

(Click for original SVG.) This is pretty common and usually goes unnoticed as the flame graph looks ok at first glance. But there are 15% of samples on the left, above "[unknown]", that are in the wrong place and missing frames. The problem is that this system has a default libc that has been compiled without frame pointers, so any stack walking stops at the libc layer, producing a partial stack that's missing the application frames. These partial stacks get grouped together on the left.

Click here for a longer explanation.

    To explain this example in more detail: The profiler periodically interrupts software execution, and for those disconnected stacks it happens to be the execution of the kernel software ("vfs*", "ext*", etc.). Once interrupted, the profiler begins at the top edge of the flame graph (or bottom edge if you are using the icicle layout) and then "stack walks" down through the rectangles to collect a stack trace. It finally gets through the kernel frames then steps through the syscall (sys_write()) to userspace, hits the libc syscall wrapper (__GI___libc_write()), then tries to resolve the symbol for the next frame but fails and records "[unknown]". It fails because of a compiler optimization where the frame pointer register is used to store data instead of the frame pointer, but it's just a number so the profiler is unaware this happened and tries to match that address to a function symbol and fails (it is therefore an unknown symbol). Then, the profiler is usually unable to walk any more frames because that data doesn't point to the next frame either (and likely doesn't point to any valid mapping, because to the profiler it's effectively a random number), so it stops before it can reach the application frames. There's probably several frames missing from that left disconnected tower, similar to the application frames you see on the right (this example happens to be the bash(1) shell). What happens if the random data is a valid pointer by coincidence? You usually get an extra junk frame. I've seen situations where the random data ends up pointing to itself, so the profiler gets stuck in a loop and you get a tower of junk frames until perf hits its max frame limit.

Other types of profiling hit this more often. Off-CPU flame graphs, for example, can be dominated by libc read/write and mutex functions, so without frame pointers end up mostly broken. Apart from library code, maybe your application doesn't have frame pointers either, in which case everything is broken.

I'm posting about this problem now because Fedora and Ubuntu are releasing versions that fix it, by compiling libc and more with frame pointers by default. This is great news as it not only fixes these flame graphs, but makes off-CPU flame graphs far more practical. This is also a win for continuous profilers (my employer, Intel, just announced one) as it makes customer adoption easier.

What are frame pointers?

The x86-64 ABI documentation shows how a CPU register, %rbp, can be used as a "base pointer" to a stack frame, aka the "frame pointer." I pictured how this is used to walk stack traces in my BPF book.


Figure 3.3: Stack Frame with
Base Pointer (x86-64 ABI)

Figure 2-6: Frame Pointer-based
Stack Walking (BPF book)

This stack-walking technique is commonly used by external profilers and debuggers, including Linux perf and eBPF, and ultimately visualized by flame graphs. However, the x86-64 ABI has a footnote [12] to say that this register use is optional:

"The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp (the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and epilogue and makes one additional general-purpose register (%rbp) available."

(Trivia: I had penciled the frame pointer function prologue and epilogue on my Netflix office wall, lower left.)

2004: Their removal

In 2004 a compiler developer, Roger Sayle, changed gcc to stop generating frame pointers, writing:

"The simple patch below tweaks the i386 backend, such that we now default to the equivalent of "-fomit-frame-pointer -ffixed-ebp" on 32-bit targets"

i386 (32-bit microprocessors) only have four general purpose registers, so freeing up %ebp takes you from four to five (or if you include %si and %di, from six to seven). I'm sure this delivered large performance improvements and I wouldn't try arguing against it. Roger cited two other reasons for this change: The desire to outperform Intel's icc compiler, and the belief that it didn't break debuggers (of the time) since they supported other stack walking techniques.

2005-2023: The winter of broken profilers

However, the change was then applied to x86-64 (64-bit) as well, which had over a dozen registers and didn't benefit so much from freeing up one more. And there are debuggers/profilers that this change did break (typically system profilers, not language specific ones), more so today with eBPF, which didn't exist back then. As my former Sun Microsystems colleague Eric Schrock (nickname Schrock) wrote in November 2004:

"On i386, you at least had the advantage of increasing the number of usable registers by 20%. On amd64, adding a 17th general purpose register isn't going to open up a whole new world of compiler optimizations. You're just saving a pushl, movl, an series of operations that (for obvious reasons) is highly optimized on x86. And for leaf routines (which never establish a frame), this is a non-issue. Only in extreme circumstances does the cost (in processor time and I-cache footprint) translate to a tangible benefit - circumstances which usually resort to hand-coded assembly anyway. Given the benefit and the relative cost of losing debuggability, this hardly seems worth it."

In Schrock's conclusion:

"it's when people start compiling /usr/bin/ without frame pointers that it gets out of control."

This is exactly what happened on Linux, not just /usr/bin but also /usr/lib and application code! I'm sure there are people who are too new to the industry to remember the pre-2004 days when profilers would "just work" without OS and runtime changes.

2014: Java in Flames
Broken Java Stacks (2014)

When I joined Netflix in 2014, I found Java's lack of frame pointer support broke all application stacks (pictured in my 2014 Surge talk on the right). I ended up developing a fix for the JVM c2 compiler which Oracle reworked and added as the -XX:+PreserveFramePointer option in JDK8u60 (see my Java in Flames post for details [PDF]).

While that Java change led to discovering countless performance wins in application code, libc was still breaking some portion of the samples (as pictured in the example at the top of this post) and was breaking most stacks in off-CPU flame graphs. I started by compiling my own libc for production use with frame pointers, and then worked with Canonical to have one prebuilt for Ubuntu. For a while I was promoting the use of Canonical's libc6-prof, which was libc6 with frame pointers.

2015-2020: Overhead

As part of production rollout I did many performance overhead tests, which I've described publicly before: The overhead of adding frame pointers to everything (libc and Java) was usually less than 1%, with one exception of 10%. That 10% was an unusual application that was generating stack traces over 1000 frames deep (via Groovy), so deep that it broke Linux's perf profiler. Arnaldo Carvalho de Melo (Red Hat) added the kernel.perf_event_max_stack sysctl just for this Netflix workload. It was also a virtual machine that lacked low-level hardware profiling capabilities, so I wasn't able to do cycle analysis to confirm that the 10% was entirely frame pointer-based.

The actual overhead depends on your workload. Others have reported around 1% and around 2%. Microbenchmarks can be the worst, hitting 10%: This doesn't surprise me since they resolve to running a small funciton in a loop, and adding any instructions to that function can cause it to spill out of L1 cache warmth (or cache lines) causing a drop in performance. If I were analyzing such a microbenchmark, apart from observability anaylsis (cycles, instructions, PMU, PMCs, PEBS) there is also an experiment I'd like to try:

    To test the theory of I-cache spillover: Compile the microbenchmark with and without frame pointers and find the performance delta. Then flame graph the microbenchmark to understand the hot function. Then add some inline assembly to the hot function where you add enough NOPs to the start and end to mimic the frame pointer prologue and epilogue (I recommend writing them on your office wall in pencil), compile it without frame pointers, disassemble the compiled binary to confirm those NOPs weren't stripped, and now test that. If the performance delta is still large (10%) you've confirmed that it is due to cache effects, and anyone who was worked at this level in production will tell you that it's the straw that broke the camel's back. Don't blame the straw, in this case, the frame pointers. Adding anything will cause the same effect. Having done this before, it reminds me of CSS programming: you make a little change here and everything breaks, and you spend hours chasing your own tail.

Another extreme example of overhead was the Python scimark_sparse_mat_mult benchmark, which could reach 10%. Fortunately this was analyzed by Andrii Nakryiko (Meta) who found it was a unusual case of a large function where gcc switched from %rsp offsets to %rbp-relative offsets, which took more bytes to store, causing performance issues. I've heard this has since been fixed so that Python can reenable frame pointers by default.

As I've seen frame pointers help find performance wins ranging from 5% to 500%, the typical "less than 1%" cost (or even 1% or 2% cost) is easily justified. But I'd rather the cost be zero, of course! We may get there with future technologies I'll cover later. In the meantime, frame pointers are the most practical way to find performance wins today.

What about Linux on devices where there is no chance of profiling or debugging, like electric toothbrushes? (I made that up, AFAIK they don't run Linux, but I may be wrong!) Sure, compile without frame pointers. The main users of this change are enterprise Linux. Back-end servers.

2022: Upstreaming, first attempt

Other large companies with OS and perf teams (Meta, Google) hinted strongly that they had already enabled frame pointers for everything years earlier. (Google should be no surprise because they pioneered continuous profiling.) So at this point you had Google, Meta, and Netflix running their own libc with frame pointers and able to enjoy profiling capabilities that most other companies – without dedicated OS teams – couldn't get working. Can't we just upstream this so everyone can benefit?

There's a bunch of difficulties when taking "works well for me" changes and trying to make them the default for everyone. Among the difficulties is that end-user companies don't have a clear return on the investment from telling their Linux vendor what they fixed, since they already fixed it. I guess the investment is quite small, we're talking about a single email, right?...Wrong! Your suggestion is now a 116-post thread where everyone is sharing different opinions and demanding this and that, as we found out the hard way. For Fedora, one person requested:

"Meta and/or Netflix should provide infrastructure for a side repository in which the change can be tested and benchmarked and the code size measured."

(Bear in mind that Netflix doesn't even use Fedora!)

Jonathan Corbet, who writes the best Linux articles, summarized this in "Fedora's tempest in a stack frame" which is so detailed that I feel PTSD when reading it. It's good that the Fedora community wants to be so careful, but I'd rather spend time discussing building something better than frame pointers, perhaps involving ORC, LBR, eBPF, and other technologies, than so much worry about looking bad in kitchen-sink benchmarks that I wouldn't trust in the first place.

2023, 2024: Frame Pointers in Fedora and Ubuntu!

Fedora revisited the proposal and has accepted it this time, making it the first distro to reenable frame pointers. Thank you!

Ubuntu has also announced frame pointers by default in Ubuntu 24.04 LTS. Thank you!

UPDATE: I've now heard that Arch Linux is also enabling frame pointers! Thanks Daan De Meyer (Meta).

While this fixes stack walking through OS libraries, you might find your application still doesn't support stack tracing, but that's typically much easier to fix. Java, for example, has the -XX:+PreserveFramePointer option. There were ways to get Golang to support frame pointers, but that became the default years ago. Just to name a couple of languages.

2034+: Beyond Frame Pointers

There's more than one way to walk a stack. These could be separate blog posts, but I want to comment briefly on alternates:

  • LBR (Last Branch Record): Intel's hardware feature that was limited to 16 or 32 frames. Most application stacks are deeper, so this can't be used to build flame graphs, but it is better than nothing. I use it as a last resort as it gives me some stack insights.
  • BTS (Branch Trace Store): Another Intel thing. Not so limited to stack depth, but has overhead from memory load/stores and BTS buffer overflow interrupt handling.
  • AET (Archetectural Event Trace): Another Intel thing. It's a JTAG-based tracer that can trace low-level CPU, BIOS, and device events, and apparently can be used for stack traces as well. I haven't used it. (I spent years as a cloud customer where I couldn't access many HW-level things.) I hope it can be configured to output to main memory, and not just a physical debug port.
  • DWARF: Binary debuginfo, has been used forever with debuggers. Update: I'd said it doesn't exist for JIT'd runtimes like the Java JVM, but others have pointed out there has been some JIT->DWARF work done. I still don't expect it to be practical on busy production servers that are constantly in c2. The overhead just to walk DWARF is also high, as it was designed for non-realtime use. Javier Honduvilla Coto (Polar Signals) did some interesting work using an eBPF walker to reduce the overhead, but...Java.
  • eBPF stack walking: Mark Wielaard (Red Hat) demonstrated a Java JVM stack walker using SystemTap back at LinuxCon 2014, where an external tracer walked a runtime with no runtime support or help. Very cool. This can be done using eBPF as well. The performance overhead could be too high, however, as it may mean a lot of user space reads of runtime internals depending on the runtime. It would also be brittle; such eBPF stack walkers should ship with the language code base and be maintained with it.
  • ORC (oops rewind capability): The Linux kernel's new lightweight stack unwinder by Josh Poimboeuf (Red Hat) that has allowed newer kernels to remove frame pointers yet retain stack walking. You may be using ORC without realizing it; the rollout was smooth as the kernel profiler code was updated to support ORC (perf_callchain_kernel()->unwind_orc.c) at the same time as it was compiled to support ORC. Can't ORCs invade user space as well?
  • SFrames (Stack Frames): ...which is what SFrames does: lightweight user stack unwinding based on ORC. There have been recent talks to explain them by Indu Bhagat (Oracle) and Steven Rostedt (Google). I should do a blog post just on SFrames.
  • Shadow Stacks: A newer Intel and AMD security feature that can be configured to push function return addresses onto a separate HW stack so that they can be double checked when the return happens. Sounds like such a HW stack could also provide a stack trace, without frame pointers.
  • (And this isn't even all of them.)

Daan De Meyer (Meta) did a nice summary as well of different stack walkers on the Fedora wiki.

So what's next? Here's my guesses:

  • 2029: Ubuntu and Fedora release new versions with SFrames for OS components (including libc) and ditches frame pointers again. We'll have had five years of frame pointer-based performance wins and new innovations that make use of user space stacks (e.g., better automated bug reporting), and will hit the ground running with SFrames.
  • 2034: Shadow stacks have been enabled by default for security, and then are used for all stack tracing.
Conclusion

I could say that times have changed and now the original 2004 reasons for omitting frame pointers are no longer valid in 2024. Those reasons were that it improved performance significantly on i386, that it didn't break the debuggers of the day (prior to eBPF), and that competing with another compiler (icc) was deemed important. Yes, times have indeed changed. But I should note that one engineer, Eric Schrock, claimed that it didn't make sense back in 2004 either when it was applied to x86-64, and I agree with him. Profiling has been broken for 20 years and we've only now just fixed it.

Fedora and Ubuntu have now returned frame pointers, which is great news. People should start running these releases in 2024 and will find that CPU flame graphs make more sense, Off-CPU flame graphs work for the first time, and other new things become possible. It's also a win for continuous profilers, as they don't need to convince their customers to make OS changes to get profiles to fully work.

Thanks

The online threads about this change aren't even everything, there have been many discussions, meetings, and work put into this, not just for frame pointers but other recent advances including ORC and SFrames. Special thanks to Andrii Nakryiko (Meta), Daan De Meyer (Meta), Davide Cavalca (Meta), Neal Gompa (Velocity Limitless), Ian Rogers (Google), Steven Rostedt (Google), Josh Poimboeuf (Red Hat), Arjan Van De Ven (Intel), Indu Bhagat (Oracle), Mark Shuttleworth (Canonical), Jon Seager (Canonical), Oliver Smith (Canonical), Javier Honduvilla Coto (Polar Signals), Mark Wielaard (Red Hat), Ben Cotton (Red Hat), and many others (see the Fedora discussions). And thanks to Schrock.

Appendix: Fedora

For reference, here's my writeup for the Fedora change:

I enabled frame pointers at Netflix, for Java and glibc, and summarized the effect in BPF Performance Tools (page 40): "Last time I studied the performance gain from frame pointer omission in our production environment, it was usually less than one percent, and it was often so close to zero that it was difficult to measure. Many microservices at Netflix are running with the frame pointer reenabled, as the performance wins found by CPU profiling outweigh the tiny loss of performance." I've spent a lot of time analyzing frame pointer performance, and I did the original work to add them to the JVM (which became -XX:+PreserveFramePoiner). I was also working with another major Linux distro to make frame pointers the default in glibc, although I since changed jobs and that work has stalled. I'll pick it up again, but I'd be happy to see Fedora enable it in the meantime and be the first to do so. We need frame pointers enabled by default because of performance. Enterprise environments are monitored, continuously profiled, and analyzed on a regular basis, so this capability will indeed be put to use. It enables a world of debugging and new performance tools, and once you find a 500% perf win you have a different perspective about the <1% cost. Off-CPU flame graphs in particular need to walk the pthread functions in glibc as most blocking paths go through them; CPU flame graphs need them as well to reconnect the floating glibc tower of futex/pthread functions with the developers code frames. I see the comments about benchmark results of up to 10% slowdowns. It's good to look out for regressions, although in my experience all benchmarks are wrong or deeply misleading. You'll need to do cycle analysis (PEBS-based) to see where the extra cycles are, and if that makes any sense. Benchmarks can be super sensitive to degrading a single hot function (like "CPU benchmarks" that really just hammer one function in a loop), and if extra instructions (function prologue) bump it over a cache line or beyond L1 cache-warmth, then you can get a noticeable hit. This will happen to the next developer who adds code anyway (assuming such a hot function is real world) so the code change gets unfairly blamed. It will only regress in this particular scenario, and regression is inevitable. Hence why you need the cycle analysis ("active benchmarking") to make sense of this. There was one microservice that was an outlier and had a 10% performance loss with Java frame pointers enabled (not glibc, I've never seen a big loss there). 10% is huge. This was before PMCs were available in the cloud, so I could do little to debug it. Initially the microservice ran a "flame graph canary" instance with FPs for flame graphs, but the developers eventually just enabled FPs across the whole microservice as the gains they were finding outweighed the 10% cost. This was the only noticeable (as in, >1%) production regression we saw, and it was a microservice that was bonkers for a variety of reasons, including stack traces that were over 1000 frames deep (and that was after inlining! Over 3000 deep without. ACME added the perf_event_max_stack sysctl just so Netflix could profile this microservice, as the prior limit was 128). So one possibility is that the extra function prologue instructions add up if you frequently walk 1000 frames of stack (although I still don't entirely buy it). Another attribute was that the microservice had over 1 Gbyte of instruction text (!), and we may have been flying close to the edge of hardware cache warmth, where adding a bit more instructions caused a big drop. Both scenarios are debuggable with PMCs/PEBS, but we had none at the time. So while I think we need to debug those rare 10%s, we should also bear in mind that customers can recompile without FPs to get that performance back. (Although for that microservice, the developers chose to eat the 10% because it was so valuable!) I think frame pointers should be the default for enterprise OSes, and to opt out if/when necessary, and not the other way around. It's possible that some math functions in glibc should opt out of frame pointers (possibly fixing scimark, FWIW), but the rest (especially pthread) needs them. In the distant future, all runtimes should come with an eBPF stack walker, and the kernel should support hopping between FPs, ORC, LBR, and eBPF stack walking as necessary. We may reach a point where we can turn off FPs again. Or maybe that work will never get done. Turning on FPs now is an improvement we can do, and then we can improve it more later. For some more background: Eric Schrock (my former colleague at Sun Microsystems) described the then-recent gcc change in 2004 as "a dubious optimization that severely hinders debuggability" and that "it's when people start compiling /usr/bin/* without frame pointers that it gets out of control" I recommend reading his post: [0]. The original omit FP change was done for i386 that only had four general-purpose registers and saw big gains freeing up a fifth, and it assumed stack walking was a solved problem thanks to gdb(1) without considering real-time tracers, and the original change cites the need to compete with icc [1]. We have a different circumstance today -- 18 years later -- and it's time we updated this change. [0] http://web.archive.org/web/20131215093042/https://blogs.oracle.com/eschrock/entry/debugging_on_amd64_part_one [1] https://gcc.gnu.org/ml/gcc-patches/2004-08/msg01033.html

Matthew Garrett: Digital forgeries are hard

1 év 4 hónap óta
Closing arguments in the trial between various people and Craig Wright over whether he's Satoshi Nakamoto are wrapping up today, amongst a bewildering array of presented evidence. But one utterly astonishing aspect of this lawsuit is that expert witnesses for both sides agreed that much of the digital evidence provided by Craig Wright was unreliable in one way or another, generally including indications that it wasn't produced at the point in time it claimed to be. And it's fascinating reading through the subtle (and, in some cases, not so subtle) ways that that's revealed.

One of the pieces of evidence entered is screenshots of data from Mind Your Own Business, a business management product that's been around for some time. Craig Wright relied on screenshots of various entries from this product to support his claims around having controlled meaningful number of bitcoin before he was publicly linked to being Satoshi. If these were authentic then they'd be strong evidence linking him to the mining of coins before Bitcoin's public availability. Unfortunately the screenshots themselves weren't contemporary - the metadata shows them being created in 2020. This wouldn't fundamentally be a problem (it's entirely reasonable to create new screenshots of old material), as long as it's possible to establish that the material shown in the screenshots was created at that point. Sadly, well.

One part of the disclosed information was an email that contained a zip file that contained a raw database in the format used by MYOB. Importing that into the tool allowed an audit record to be extracted - this record showed that the relevant entries had been added to the database in 2020, shortly before the screenshots were created. This was, obviously, not strong evidence that Craig had held Bitcoin in 2009. This evidence was reported, and was responded to with a couple of additional databases that had an audit trail that was consistent with the dates in the records in question. Well, partially. The audit record included session data, showing an administrator logging into the data base in 2011 and then, uh, logging out in 2023, which is rather more consistent with someone changing their system clock to 2011 to create an entry, and switching it back to present day before logging out. In addition, the audit log included fields that didn't exist in versions of the product released before 2016, strongly suggesting that the entries dated 2009-2011 were created in software released after 2016. And even worse, the order of insertions into the database didn't line up with calendar time - an entry dated before another entry may appear in the database afterwards, indicating that it was created later. But even more obvious? The database schema used for these old entries corresponded to a version of the software released in 2023.

This is all consistent with the idea that these records were created after the fact and backdated to 2009-2011, and that after this evidence was made available further evidence was created and backdated to obfuscate that. In an unusual turn of events, during the trial Craig Wright introduced further evidence in the form of a chain of emails to his former lawyers that indicated he had provided them with login details to his MYOB instance in 2019 - before the metadata associated with the screenshots. The implication isn't entirely clear, but it suggests that either they had an opportunity to examine this data before the metadata suggests it was created, or that they faked the data? So, well, the obvious thing happened, and his former lawyers were asked whether they received these emails. The chain consisted of three emails, two of which they confirmed they'd received. And they received a third email in the chain, but it was different to the one entered in evidence. And, uh, weirdly, they'd received a copy of the email that was submitted - but they'd received it a few days earlier. In 2024.

And again, the forensic evidence is helpful here! It turns out that the email client used associates a timestamp with any attachments, which in this case included an image in the email footer - and the mysterious time travelling email had a timestamp in 2024, not 2019. This was created by the client, so was consistent with the email having been sent in 2024, not being sent in 2019 and somehow getting stuck somewhere before delivery. The date header indicates 2019, as do encoded timestamps in the MIME headers - consistent with the mail being sent by a computer with the clock set to 2019.

But there's a very weird difference between the copy of the email that was submitted in evidence and the copy that was located afterwards! The first included a header inserted by gmail that included a 2019 timestamp, while the latter had a 2024 timestamp. Is there a way to determine which of these could be the truth? It turns out there is! The format of that header changed in 2022, and the version in the email is the new version. The version with the 2019 timestamp is anachronistic - the format simply doesn't match the header that gmail would have introduced in 2019, suggesting that an email sent in 2022 or later was modified to include a timestamp of 2019.

This is by no means the only indication that Craig Wright's evidence may be misleading (there's the whole argument that the Bitcoin white paper was written in LaTeX when general consensus is that it's written in OpenOffice, given that's what the metadata claims), but it's a lovely example of a more general issue.

Our technology chains are complicated. So many moving parts end up influencing the content of the data we generate, and those parts develop over time. It's fantastically difficult to generate an artifact now that precisely corresponds to how it would look in the past, even if we go to the effort of installing an old OS on an old PC and setting the clock appropriately (are you sure you're going to be able to mimic an entirely period appropriate patch level?). Even the version of the font you use in a document may indicate it's anachronistic. I'm pretty good at computers and I no longer have any belief I could fake an old document.

(References: this Dropbox, under "Expert reports", "Patrick Madden". Initial MYOB data is in "Appendix PM7", further analysis is in "Appendix PM42", email analysis is "Sixth Expert Report of Mr Patrick Madden")

comments

Daniel Vetter: Upstream, Why & How

1 év 4 hónap óta

In a different epoch, before the pandemic, I’ve done a presentation about upstream first at the Siemens Linux Community Event 2018, where I’ve tried to explain the fundamentals of open source using microeconomics. Unfortunately that talk didn’t work out too well with an audience that isn’t well-versed in upstream and open source concepts, largely because it was just too much material crammed into too little time.

Last year I got the opportunity to try again for an Intel-internal event series, and this time I’ve split the material into two parts. I think that worked a lot better. For obvious reasons I cannot publish the recordings, but I can publish the slides.

The first part “Upstream, Why?” covers a few concepts from microeconomcis 101, and then applies them to upstream stream open source. The key concept is on one hand that open source achieves an efficient software market in the microeconomic sense by driving margins and prices to zero. And the only way to make money in such a market is to either have more-or-less unstable barriers to entry that prevent the efficient market from forming and destroying all monetary value. Or to sell a complementary product.

The second part”Upstream, How?” then looks at what this all means for the different stakeholders involved:

  • Individual engineers, who have skills and create a product with zero economic value, and might still be stupid enough and try to build a career on that.

  • Upstream communities, often with a formal structure as a foundation, and what exactly their goals should be to build a thriving upstream open source project that can actually pay some bills, generate some revenue somewhere else and get engineers paid. Because without that you’re not going to have much of a project with a long term future.

  • Engineering organizations, what exactly their incentives and goals should be, and the fundamental conflicts of interest this causes. Specifically on this I’ve only seen bad solutions, and ugly solutions, but not yet a really good one. A relevant pre-pandemic talk of mine on this topic is also “Upstream Graphics: Too Little, Too Late”

  • And finally the overall business and more importantly, what kind of business strategy is needed to really thrive with an open source upstream first approach: You need to clearly understand which software market’s economic value you want to destroy by driving margins and prices to zero, and which complemenetary product you’re selling to still earn money.

At least judging by the feedback I’ve received internally taking more time and going a bit more in-depth on the various concept worked much better than the keynote presentation I’ve done at Siemens, hence I decided to publish at the least the slides.

Brendan Gregg: eBPF Documentary

1 év 5 hónap óta

eBPF is a crazy technology – like putting JavaScript into the Linux kernel – and getting it accepted had so far been an untold story of strategy and ingenuity. The eBPF documentary, published late last year, tells this story by interviewing key players from 2014 including myself, and touches on new developments including Windows. (If you are new to eBPF, it is the name of a kernel execution engine that runs a variety of new programs in a performant and safe sandbox in the kernel, like how JavaScript can run programs safely in a browser sandbox; it is also no longer an acronym.) The documentary was played at KubeCon and is on youtube:

Watching this brings me right back to 2014, to see the faces and hear their voices discussing the problems we were trying to fix. Thanks to Speakeasy Productions for doing such a great job with this documentary, and letting you experience what it was like in those early days. This is also a great example of all the work that goes on behind the scenes to get code merged in a large codebase like Linux.

When Alexei Starovoitov visited Netflix in 2014 to discuss eBPF with myself and Amer Ather, we were so entranced that we lost track of time and were eventually kicked out of the meeting room as another meeting was starting. It was then I realized that we had missed lunch! Alexei sounded so confident that I was convinced that eBPF was the future, but a couple of times he added "if the patches get merged." If they get merged?? They have to get merged, this idea is too good to waste.

While only several of us worked on eBPF in 2014, more joined in 2015 and later, and there are now hundreds contributing to make it what it is. A longer documentary could also interview Brendan Blanco (bcc), Yonghong Song (bcc), Sasha Goldshtein (bcc), Alastair Robertson (bpftrace), Tobais Waldekranz (ply), Andrii Nakryiko, Joe Stringer, Jakub Kicinski, Martin KaFai Lau, John Fastabend, Quentin Monnet, Jesper Dangaard Brouer, Andrey Ignatov, Stanislav Fomichev, Teng Qin, Paul Chaignon, Vicent Marti, Dan Xu, Bas Smit, Viktor Malik, Mary Marchini, and many more. Thanks to everyone for all the work.

Ten years later it still feels like it's early days for eBPF, and a great time to get involved: It's likely already available in your production kernels, and there are tools, libraries, and documentation to help you get started.

I hope you enjoy the documentary. PS. Congrats to Isovalent, the role-model eBPF startup, as Cisco recently announced they would acquire them!

Linux Plumbers Conference: Toolchains Track

1 év 5 hónap óta

Linux Plumbers Conference 2024 is pleased to host the Toolchains Track!

The aim of the Toolchains track is to fix particular toolchain issues which are of the interest of the kernel and, ideally, find solutions in situ, making the best use of the opportunity of live discussion with kernel developers and maintainers. In particular, this is not about presenting research nor abstract/miscellaneous toolchain work.

The track will be composed of activities, of variable length depending on the topic being discussed. Each activity is intended to cover a particular topic or issue involving both the Linux kernel and one or more of its associated toolchains and development tools. This includes compiling, linking, assemblers, debuggers and debugging formats, ABI analysis tools, object manipulation, etc. Few slides shall be necessary, and most of the time shall be devoted to actual discussion, brainstorming and seeking agreement.

Please come and join us in the discussion. We hope to see you there!

Pete Zaitcev: Running OpenDKIM on Fedora 39

1 év 5 hónap óta
postfix-3.8.1-5.fc39.x86_64
opendkim-2.11.0-0.35.fc39.x86_64

Following generic guides (e.g. at Akamai Linode) almost got it all working with ease. There were a few minor problems with permissions.

Problem:
Feb 28 11:45:17 takane postfix/smtpd[1137214]: warning: connect to Milter service local:/run/opendkim/opendkim.sock: Permission denied
Solution:
add postfix to opendkim group; no change to Umask etc.

Problem:
Feb 28 13:36:39 takane opendkim[1136756]: 5F1F4DB81: no signing table match for 'zaitcev@kotori.zaitcev.us'
Solution:
change SigningTable from refile: to a normal file

Problem:
Feb 28 13:52:05 takane opendkim[1138782]: can't load key from /etc/opendkim/keys/dkim001.private: Permission denied
Feb 28 13:52:05 takane opendkim[1138782]: 93FE7D0E3: error loading key 'dkim001._domainkey.kotori.zaitcev.us'
Solution:
[root@takane postfix]# chmod 400 /etc/opendkim/keys/dkim001.private
[root@takane postfix]# chown opendkim /etc/opendkim/keys/dkim001.private
Ellenőrizve
15 perc 46 másodperc ago
Kernel Planet - http://planet.kernel.org
Feliratkozás a következőre: Kernel Planet hírcsatorna