Kernel Planet

Matthew Garrett: Debugging an odd inability to stream video

1 year, 5 months ago
We have a cabin out in the forest, and when I say "out in the forest" I mean "in a national forest subject to regulation by the US Forest Service" which means there's an extremely thick book describing the things we're allowed to do and (somewhat longer) not allowed to do. It's also down in the bottom of a valley surrounded by tall trees (the whole "forest" bit). There used to be AT&T copper but all that infrastructure burned down in a big fire back in 2021 and AT&T no longer supply new copper links, and Starlink isn't viable because of the whole "bottom of a valley surrounded by tall trees" thing along with regulations that prohibit us from putting up a big pole with a dish on top. Thankfully there's LTE towers nearby, so I'm simply using cellular data. Unfortunately my provider rate limits connections to video streaming services in order to push them down to roughly SD resolution. The easy workaround is just to VPN back to somewhere else, which in my case is just a Wireguard link back to San Francisco.

This worked perfectly for most things, but some streaming services simply wouldn't work at all. Attempting to load the video would just spin forever. Running tcpdump at the local end of the VPN endpoint showed a connection being established, some packets being exchanged, and then… nothing. The remote service appeared to just stop sending packets. Tcpdumping the remote end of the VPN showed the same thing. It wasn't until I looked at the traffic on the VPN endpoint's external interface that things began to become clear.

This probably needs some background. Most network infrastructure has a maximum allowable packet size, which is referred to as the Maximum Transmission Unit or MTU. For ethernet this defaults to 1500 bytes, and these days most links are able to handle packets of at least this size, so it's pretty typical to just assume that you'll be able to send a 1500 byte packet. But what's important to remember is that that doesn't mean you have 1500 bytes of packet payload - that 1500 bytes includes whatever protocol level headers are on there. For TCP/IP you're typically looking at spending around 40 bytes on the headers, leaving somewhere around 1460 bytes of usable payload. And if you're using a VPN, things get annoying. In this case the original packet becomes the payload of a new packet, which means it needs another set of TCP (or UDP) and IP headers, and probably also some VPN header. This still all needs to fit inside the MTU of the link the VPN packet is being sent over, so if the MTU of that is 1500, the effective MTU of the VPN interface has to be lower. For Wireguard, this works out to an effective MTU of 1420 bytes. That means simply sending a 1500 byte packet over a Wireguard (or any other VPN) link won't work - adding the additional headers gives you a total packet size of over 1500 bytes, and that won't fit into the underlying link's MTU of 1500.
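
To put numbers on that overhead (my arithmetic, using the usual header sizes rather than anything quoted in this post): a 1420 byte inner packet plus a 20 byte outer IPv4 header, an 8 byte UDP header and 32 bytes of Wireguard framing comes to 1480 bytes, and the default of 1420 also leaves exactly enough room for a 40 byte outer IPv6 header. Setting the interface MTU is a one-liner if your tooling doesn't already do it:

# wg0 is an assumed interface name - wg-quick normally sets this for you
ip link set dev wg0 mtu 1420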

And yet, things work. But how? Faced with a packet that's too big to fit into a link, there are two choices - break the packet up into multiple smaller packets ("fragmentation") or tell whoever's sending the packet to send smaller packets. Fragmentation seems like the obvious answer, so I'd encourage you to read Valerie Aurora's article on how fragmentation is more complicated than you think. tl;dr - if you can avoid fragmentation then you're going to have a better life. You can explicitly indicate that you don't want your packets to be fragmented by setting the Don't Fragment bit in your IP header, and then when your packet hits a link whose MTU it exceeds, the router will send back an ICMP packet telling the sender that the packet is too big and what the actual MTU is, and the sender will retransmit a smaller packet. This avoids all the hassle of handling fragments in exchange for the cost of a retransmit the first time the MTU is exceeded. It also typically works these days, which wasn't always the case - people had a nasty habit of dropping the ICMP packets telling the remote that the packet was too big, which broke everything.
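
If you want to watch this negotiation happen, the "fragmentation needed" messages are ICMP type 3, code 4, which is easy to filter for - a sketch, with the interface name assumed:

# show only ICMP destination-unreachable/fragmentation-needed packets
tcpdump -ni eth0 'icmp[0] == 3 and icmp[1] == 4'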

What I saw when I tcpdumped on the remote VPN endpoint's external interface was that the connection was getting established, and then a 1500 byte packet would arrive (this is kind of the behaviour you'd expect for video - the connection handshaking involves a bunch of relatively small packets, and then once you start sending the video stream itself you start sending packets that are as large as possible in order to minimise overhead). This 1500 byte packet wouldn't fit down the Wireguard link, so the endpoint sent back an ICMP packet to the remote telling it to send smaller packets. The remote should then have sent a new, smaller packet - instead, about a second after sending the first 1500 byte packet, it sent that same 1500 byte packet. This is consistent with it ignoring the ICMP notification and just behaving as if the packet had been dropped.

All the services that were failing were failing in identical ways, and all were using Fastly as their CDN. I complained about this on social media and then somehow ended up in contact with the engineering team responsible for this sort of thing - I sent them a packet dump of the failure, they were able to reproduce it, and it got fixed. Hurray!

(Between me identifying the problem and it getting fixed I was able to work around it. The TCP header includes a Maximum Segment Size (MSS) field, which indicates the maximum size of the payload for this connection. iptables allows you to rewrite this, so on the VPN endpoint I simply rewrote the MSS to be small enough that the packets would fit inside the Wireguard MTU. This isn't a complete fix since it's done at the TCP level rather than the IP level - so any large UDP packets would still end up breaking)
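
For reference, that rewrite is a single iptables rule; a sketch of the sort of thing I mean, with the interface name and MSS value being assumptions (1380 is 1420 minus the usual 40 bytes of TCP/IP headers, and --clamp-mss-to-pmtu is the more automatic variant of --set-mss):

iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1380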

I've no idea what the underlying issue was, and at the client end the failure was entirely opaque: the remote simply stopped sending me packets. The only reason I was able to debug this at all was because I controlled the other end of the VPN as well, and even then I wouldn't have been able to do anything about it other than being in the fortuitous situation of someone able to do something about it seeing my post. How many people go through their lives dealing with things just being broken and having no idea why, and how do we fix that?

(Edit: thanks to this comment, it sounds like the underlying issue was a kernel bug that Fastly developed a fix for - under certain configurations, the kernel fails to associate the MTU update with the egress interface and so it continues sending overly large packets)


Pete Zaitcev: Strongly consistent S3

1 year, 5 months ago

Speaking of S3 becoming strongly consistent, Swift was strongly consistent for objects in practice from the very beginning. All the "in practice" assumes a reasonably healthy cluster. It's very easy, really. When your client puts an object into a cluster and receives a 201, it means that a quorum of back-end nodes stored that object. Therefore, for you to get a stale version, you need to find a proxy that is willing to fetch you an old copy. That can only happen if the proxy has no access to any of the back-end nodes with the new object.

Somewhat unfortunately for the lovers of consistency, we made a decision some 10 years ago that makes the observation of eventual consistency easier. We allowed half of an even replication factor to satisfy quorum. So, if you have a distributed cluster with a factor of 4, a client can write an object into the 2 nearest nodes and receive a success. That opens a window for another client to read an old object from the other nodes.

Oh, well. Original Swift defaulted to odd replication factors, such as 3 and 5. They provided a bit of resistance in intercontinental scenarios, at the cost of the client knowing immediately if a partition is occurring. But a number of operators insisted that they preferred the observable eventual consistency. Remember that the replication factor is configurable, so there's no harm, right?

Alas, being flexible that way helps private clusters only. Because Swift generally hides the architecture of the cluster from clients, they cannot know if they can rely on the consistency of a random public cluster.

Either way, S3 does something that Swift cannot do here: the consistency of bucket listings. These things start lagging in Swift at the drop of a hat. If the container servers fail to reply in milliseconds, storage nodes push the job onto updaters and proceed. The delay in listings is often observable in Swift, which is one reason why people doing POSIX overlays often have their own manifests.

I'm quite impressed by S3 doing this, given their scale especially. Curious, too. I wish I could hack on their system a little bit. But it's all proprietary, alas.

Greg Kroah-Hartman: Linux is a CNA

1 year, 5 months ago
As was recently announced, the Linux kernel project has been accepted as a CVE Numbering Authority (CNA) for vulnerabilities found in Linux. This is part of a trend of more open source projects taking over the previously haphazard assignment of CVEs against their project by becoming a CNA, so that no other group can assign CVEs without their involvement. Here's the curl project doing much the same thing for the same reasons.

Linux Plumbers Conference: Linux Plumbers Conference CFP announced

1 year, 5 months ago

The Linux Plumbers Conference is proud to announce that its website for 2024 is up and the CFP has been issued. We will be running a hybrid conference as usual, but the in-person venue will be Vienna, Austria, from 18-20 September. Deadlines to submit are 4 April for Microconferences and 16 June for Refereed and Kernel Summit track presentations. Details for other tracks and accepted Microconferences will be posted later.

Dave Airlie (blogspot): anv: vulkan av1 decode status

1 year, 6 months ago

Vulkan Video AV1 decode has been released, and I had some partly working support in the Intel ANV driver previously, but I let it lapse.

The branch is currently at [1]. It builds, but is totally untested; I'll get some time next week to plug in my DG2 and see if I can persuade it to decode some frames.

Update: the current branch decodes one frame properly; reference frames need more work, unfortunately.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/anv-vulkan-video-decode-av1

Paul E. Mc Kenney: Stupid RCU Tricks: So You Want to Torture RCU With a Non-Trivial Userspace?

1 year, 6 months ago

In order to save mass-storage space and to reduce boot times, rcutorture runs out of a tiny initrd filesystem that consists only of a root directory and a statically linked init program based on nolibc.  This init program binds itself to a randomly chosen CPU, spins for a short time, sleeps for a short time, and repeats, the point being to inject at least a little userspace execution for the benefit of nohz_full CPUs.

This works very well most of the time.  But what if you want to use a full userspace when torturing RCU, perhaps because you want to test runtime changes to RCU's many sysfs parameters, run heavier userspace loads on nohz_full CPUs, or even to run BPF programs?

What you do is go back to the old way of building rcutorture's initrd.

Which raises the question as to what the new way might be.

What rcutorture does is to look at the tools/testing/selftests/rcutorture/initrd directory.  If this directory does not already contain a file named init, the tools/testing/selftests/rcutorture/bin/mkinitrd.sh script builds the aforementioned statically linked init program.

Which means that you can put whatever initrd file tree you wish into that initrd directory, and as long as it contains a /init program, rcutorture will happily bundle that file tree into an initrd in each of the resulting rcutorture kernel images.

And back in the old days, that is exactly what I did.  I grabbed any convenient initrd and expanded it into my tools/testing/selftests/rcutorture/initrd directory.  This still works, so you can do this too!
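
For example, one way to populate that directory is to unpack whatever initramfs your distribution already has lying around - a sketch, with the path and compression being assumptions (yours may be zstd compressed or a concatenated archive):

cd tools/testing/selftests/rcutorture
mkdir -p initrd && cd initrd
# the unpacked tree must end up containing an executable /init
zcat /boot/initrd.img | cpio -idm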

Dave Airlie (blogspot): radv: vulkan av1 video decode status

1 year, 6 months ago

The Khronos Group announced VK_KHR_video_decode_av1 [1], this extension adds AV1 decoding to the Vulkan specification. There is a radv branch [2] and merge request [3]. I did some AV1 work on this in the past, but I need to take some time to see if it has made any progress since. I'll post an ANV update once I figure that out.

This extension is one of the ones I've been wanting for a long time, since having a royalty-free codec is something I can actually care about and ship, as opposed to the painful ones. I started working on a Mesa extension for this a year or so ago with Lynne from the ffmpeg project and we made great progress with it. We submitted that to Khronos and it has gone through the committee process and been refined and validated amongst the hardware vendors.

I'd like to say thanks to Charlie Turner and Igalia for taking over a lot of the porting to the Khronos extension and fixing up bugs that their CTS development brought up. This is a great feature of having open source drivers, it allows a lot quicker turn around time in bug fixes when devs can fix them themselves!

[1]: https://www.khronos.org/blog/khronos-releases-vulkan-video-av1-decode-extension-vulkan-sdk-now-supports-h.264-h.265-encode 

[2]  https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-decode-av1

[3] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27424

James Bottomley: Securing a Rooted Android Phone

1 year, 6 months ago

This article will discuss securing your phone after you've rooted it and installed your preferred OS (it will not discuss how to root your phone or change the OS). Re-securing your phone requires the installation of a custom AVB key, which not all phones support, so I'll only be discussing Google Pixel phones (which support this) and the LineageOS distribution (which is the one I use). The reason for wanting to do this is that by default LineageOS runs with the debug keys (i.e. known to everyone) with an unlocked bootloader, meaning OS updates and even binary changes to the system partition are easy to do. Since most android phones are fully locked, this isn't a standard attack vector for malicious apps, but if someone is targeting you directly it may become one.

This article will cover how android verified boot (AVB) works, how to install your own custom AVB key and get a stock LineageOS distribution to use it and how to turn DM verity back on to make /system immutable.

A Brief Tour of Android Verified Boot (AVB)

We'll actually be covering the 2.0 version of AVB, which is what most phones use today. The proprietary bootloader of a Pixel (the fastboot capable one you get to with adb reboot bootloader) uses a vbmeta partition to find the boot/recovery system and from there either enter recovery or boot the standard OS. vbmeta contains hashes for both this boot partition and the system partition. If your phone is unlocked the bootloader will simply boot the partitions vbmeta points to without any verification. If it is locked, the vbmeta partition must be signed by a key the phone knows. Pixel phones have two keyslots: a built in one which houses either the Google key or an OEM one, and the custom slot, which is blank. In the unlocked mode, you can flash your own key into the custom slot using the fastboot flash avb_custom_key command.

The vbmeta partition also contains a boot flags region which tells the bootloader how to boot the OS. The two flags the OS knows about are in external/avb/libavb/avb_vbmeta_image.h:

/* Flags for the vbmeta image.
 *
 * AVB_VBMETA_IMAGE_FLAGS_HASHTREE_DISABLED: If this flag is set,
 * hashtree image verification will be disabled.
 *
 * AVB_VBMETA_IMAGE_FLAGS_VERIFICATION_DISABLED: If this flag is set,
 * verification will be disabled and descriptors will not be parsed.
 */
typedef enum {
  AVB_VBMETA_IMAGE_FLAGS_HASHTREE_DISABLED = (1 << 0),
  AVB_VBMETA_IMAGE_FLAGS_VERIFICATION_DISABLED = (1 << 1)
} AvbVBMetaImageFlags;

If the first flag is set then dm-verity is disabled, and if the second one is set the bootloader doesn't pass the hash of the vbmeta partition on the kernel command line. In a standard LineageOS build, both these flags are set.
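
If you want to check which of these flags a given build has set, avbtool can dump a vbmeta image - a quick sketch, assuming you have the partition image to hand:

avbtool info_image --image vbmeta.img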

The reason for passing the vbmeta hash on the command line is so the android init process can load the vbmeta partition, hash it and verify it against what the bootloader passed in, thus confirming there hasn't been a time of check to time of use (TOCTOU) security breach. The init process cannot verify the signature for itself because the vbmeta signing public key isn't built into the OS (which allows the OS to be signed after the images are built).

The description of the AVB_VBMETA_IMAGE_FLAGS_HASHTREE_DISABLED flag is slightly wrong. A standard android build always seems to calculate the dm-verity hash tree and insert it into the vbmeta partition (where it is verified by the vbmeta signature) it’s just that if this flag is set, the android init process won’t load the dm-verity hash tree and the system partition will thus be mutable.

Creating and Using your own custom Boot Key

Obviously android doesn't use any standard form for its keys, so you have to create your own RSA 2048 AVB custom key (the literature implies it will work with 4096 as well, but I haven't tried it) using, say, openssl, then use avbtool (found in external/avb as a python script) to convert your RSA public key to a form that can be flashed into the phone:

avbtool extract_public_key --key pubkey.pem --output pkmd.bin

This can then be flashed to the unlocked phone (in the bootloader fastboot) with

fastboot flash avb_custom_key pkmd.bin

And you’re all set up to boot a custom signed OS.
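
For completeness, the key generation step mentioned above is plain openssl; a minimal sketch, with the file names chosen to line up with the avbtool invocation above (and 2048 bits as discussed):

openssl genrsa -out avb.key 2048
openssl rsa -in avb.key -pubout -out pubkey.pem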

Signing your LineageOS

There is a wrinkle to this: to re-sign the OS, you need the target-files.zip intermediate build, not the ROM install file. Unfortunately, this is pretty big (38GB for lineage-19.1) and doesn't seem to be available for download any more. If you can find it, you can re-sign the stock LineageOS, but if not you have to build it yourself. Instructions for both building and re-signing can be found here. You need to follow those, but in addition you must add two extra flags to the sign_target_files_apks command:

--avb_vbmeta_key=/path/to/private/avb.key --avb_vbmeta_algorithm=SHA256_RSA2048

Which will ensure the vbmeta partition is signed with the key you created above.

Optionally Enabling dm-verity

If you want to enable dm-verity, you have to change the vbmeta flags to 0 (enable both hashtree and vbmeta verification) before you execute the signing command above. These flags are stored in the META/misc_info.txt file which you can extract from target-files.zip with

unzip target-files.zip META/misc_info.txt

And once done you can vi this file to find the line

avb_vbmeta_args=--flags 3 --padding_size 4096 --rollback_index 1804212800

If you update the 3 to 0 this will unset the two disable flags and allow you to do a dm-verity verified boot. Then use zip to replace this updated file

zip -u target-files.zip META/misc_info.txt

And then proceed with signing the updated target-files.zip

Wrinkle for Android-12 (lineage-19.1) and above

For all these versions, this patch ensures that if the vbmeta was signed then the vbmeta hash must be verified, otherwise the system will crash in early init, so you have no choice and must alter the avb_vbmeta_args above to either --flags 1 or --flags 0 so the vbmeta hash is passed in to init. Since you have to alter the flags anyway, you might as well enable dm-verity (set to 0) at the same time.

Re-Lock the Bootloader

Once you have installed both your custom keys and your custom signed boot image, you are ready to re-lock the bootloader. Beware that some phones will erase your data partition when you do this (the Google advice says they shouldn’t, but not all manufacturers bother to follow it), so make sure you have a backup (or are playing with a newly rooted phone).

fastboot flashing lock

Check with a reboot (the phone should now display a yellow warning triangle saying it is booting a custom OS rather than the orange unsigned OS one). If everything goes according to plan you can enter the developer settings and set the “OEM Unlocking” setting to disabled, and your phone can no longer be unlocked without your say so.

Conclusions and Caveats

Following the above instructions, you can update your phone so it will verify images you signed with your AVB key, turn on dm-verity if you wish, and generally make your phone much more secure. However, remember that you haven't erased the original AVB key, so the phone can still be updated to an image signed with that key and, worse, the recovery partition of LineageOS is modified to allow rollback, so it will allow the flashing of any signed image without triggering an erase of the data partition. There are also a few more problems: for instance, thanks to a bug in AOSP, the recovery version of fastboot will actually allow commands that are usually forbidden if the phone is locked.

James Bottomley: Debugging Android Early Boot Failures

1 year, 6 months ago

Back in my blog post about Securing the Google SIP Stack, I did say I’d look at re-enabling SIP in Android-12, so with a view to doing that I tried building and booting LineageOS 19.1, but it crashed really early in the boot sequence (after the boot splash but before the boot animation started). It turns out that information on debugging the android early boot sequence is a bit scarce, so I thought I should write a post about how I did it just in case it helps someone else who’s struggling with a similar early boot problem.

How I usually Build and Boot Android

My builds are standard LineageOS with my patches to fix SIP and not much else. However, I do replace the debug keys with my signing keys and I also have an AVB key installed in the phone’s third party keyslot with which I sign the vbmeta for boot. This actually means that my phone is effectively locked but with a user supplied key (Yellow as google puts it).

My phone is now a pixel 3 (I had to say goodbye to the old Nexus One thanks to the US 3G turn off) and I do have a slightly broken Pixel 3 I play with for experimental patches, which is where I was trying to install Android-12.

Signing Seems to be the Problem

Just to verify my phone could actually boot a stock LineageOS (it could) I had to unlock it, and this led to the discovery that once unlocked, it would also boot my custom rom as well, so whatever was failing in early boot seemed to be connected with the device being locked.

I also discovered an interesting bug in the recovery rom fastboot: If you’re booting locked with your own keys, it will still let you perform all the usually forbidden fastboot commands (the one I was using was set_active). It turns out to be because of a bug in AOSP which treats yellow devices as unlocked in fastboot. Somewhat handy for debugging, but not so hot for security …

And so to Debugging Early Boot

The big problem with Android is there’s no way to get the console messages for early boot. Even if you enable adb early, it doesn’t get started until quite far in to the boot animation (which was way after the crash I was tripping over). However, android does have a pstore (previously ramoops) driver that can give you access to the previously crashed boot’s kernel messages (early init, fortunately, mostly logs to the kernel message log).

Forcing init to crash on failure

Ordinarily an init failure prints a message and reboots (to the bootloader), which doesn't excite pstore into saving the kernel message log. Fortunately there is a boot option (androidboot.init_fatal_panic) which can be set in the boot options (or kernel command line for a pixel-3, which can only boot the 4.9 kernel). If you build your own android, it's fairly easy to add things to the android commandline (which is in boot.img) because all you need to do is extract BOOT/cmdline from the intermediate zip file you sign, add any boot options you need, and place it back in the zip file (before you sign it).
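
Concretely, that editing step might look something like this (the zip member name is from above; the =true value is my assumption about how init parses the option):

unzip target-files.zip BOOT/cmdline
# append the option to the single command line in the file
sed -i '1s/$/ androidboot.init_fatal_panic=true/' BOOT/cmdline
zip -u target-files.zip BOOT/cmdline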

Unfortunately, this expedient didn’t work (no console logs appear in pstore). I did check that init was correctly panic’ing on failure by inducing an init failure in recovery mode and observing the panic (recovery mode allows you to run adb). But this induced panic also didn’t show up in pstore, meaning there’s actually some problem with pstore and early panics.

Security is the problem (as usual)

The actual problem turned out to be security (as usual): The pixel-3 does encrypted boot panic logs. The way this seems to work (at least in my reading of the google additional pstore patches) is that the bootloader itself encrypts the pstore ram area with a key on the /data partition, which means it only becomes visible after the device is unlocked. Unfortunately, if you trigger a panic before the device is unlocked (by echoing ‘c’ to /proc/sysrq-trigger) the panic message is lost, so pstore itself is useless for debugging early boot. There seems to be some communication of the keys by the vendor proprietary ramoops binary making it very difficult to figure out how it’s being done.

Why the early panic message is lost is a bit mysterious, but unfortunately pstore on the pixel-3 has several proprietary components around the encrypted message handling that make it hard to debug. I suspect if you don’t set up the pstore encryption keys, the bootloader erases the pstore ram area instead of encrypting it, but I can’t prove that.

Although it might be possible to fix the pstore drivers to preserve the ramoops from before device unlock, the participation of the proprietary bootloader in preserving the memory doesn’t make that look like a promising avenue to explore.

Anatomy of the Pixel-3 Boot Sequence

The Pixel-3 device boots through recovery. What this means is that the initial ramdisk (from boot.img) init is what boots both the recovery and normal boot paths. The only difference is that for recovery (and fastboot), the device stays in the ramdisk and for normal boot it mounts the /system partition and pivots to it. What makes this happen or not is the boot flag androidboot.force_normal_boot=1 which is added by the bootloader. Pretty much all the binary content and init rc files in the ramdisk are for recovery and its allied menus.

The boot paths are pretty radically different, because the normal boot first pivots to a first stage before going on to a second; but, in the manner of containers, it might be possible to boot recovery first, start a dmesg logger, and then re-exec init through the normal path.

Forcing Re-Exec

The idea is to signal init to re-exec itself for the normal path. Of course, there have to be a few changes to do this: An item has to be added to the recovery menu to signal init and init itself has to be modified to do the re-exec on the signal (note you can’t just kick off an init with a new command line because init must be pid 1 for booting). Once this is done, there are problems with selinux (it won’t actually allow init to re-exec) and some mount moves. The selinux problem is fixable by switching it from enforcing to permissive (boot option androidboot.selinux=permissive) and the mount moves (which are forbidden if you’re running binaries from the mount points being moved) can instead become bind mounts. The whole patch becomes 31 insertions across 7 files in android_system_core.

The signal I chose was SIGUSR1, which isn’t usually used by anything in the bootloader and the addition of a menu item to recovery to send this signal to init was also another trivial patch. So finally we have a system from which I can start adb to trace the kernel log (adb shell dmesg -w) and then signal to init to re-exec. Surprisingly this worked and produced as the last message fragment:

[ 190.966881] init: [libfs_mgr]Created logical partition system_a on device /dev/block/dm-0
[ 190.967697] init: [libfs_mgr]Created logical partition vendor_a on device /dev/block/dm-1
[ 190.968367] init: [libfs_mgr]Created logical partition product_a on device /dev/block/dm-2
[ 190.969024] init: [libfs_mgr]Created logical partition system_ext_a on device /dev/block/dm-3
[ 190.969067] init: DSU not detected, proceeding with normal boot
[ 190.982957] init: [libfs_avb]Invalid hash size:
[ 190.982967] init: [libfs_avb]Failed to verify vbmeta digest
[ 190.982972] init: [libfs_avb]vbmeta digest error isn't allowed
[ 190.982980] init: Failed to open AvbHandle: No such file or directory
[ 190.982987] init: Failed to setup verity for '/system': No such file or directory
[ 190.982993] init: Failed to mount /system: No such file or directory
[ 190.983030] init: Failed to mount required partitions early …
[ 190.983483] init: InitFatalReboot: signal 6
[ 190.984849] init: #00 pc 0000000000123b38 /system/bin/init
[ 190.984857] init: #01 pc 00000000000bc9a8 /system/bin/init
[ 190.984864] init: #02 pc 000000000001595c /system/lib64/libbase.so
[ 190.984869] init: #03 pc 0000000000014f8c /system/lib64/libbase.so
[ 190.984874] init: #04 pc 00000000000e6984 /system/bin/init
[ 190.984878] init: #05 pc 00000000000aa144 /system/bin/init
[ 190.984883] init: #06 pc 00000000000487dc /system/lib64/libc.so
[ 190.984889] init: Reboot ending, jumping to kernel

Which indicates exactly where the problem is.

Fixing the problem

Once the messages are identified, the problem turns out to be in system/core ec10d3cf6 “libfs_avb: verifying vbmeta digest early”, which is inherited from AOSP and which even says in its commit message “the device will not boot if: 1. The image is signed with FLAGS_VERIFICATION_DISABLED is set 2. The device state is locked”, which is basically my boot state, so thanks for that one google. Reverting this commit can be done cleanly and now the signed image boots without a problem.

I note that I could also simply add hashtree verification to my boot, but LineageOS is based on the eng target, which has FLAGS_VERIFICATION_DISABLED built into the main build Makefile. It might be possible to change it, but not easily I’m guessing … although I might try fixing it this way at some point, since it would make my phones much more secure.

Conclusion

Debugging android early boot is still a terribly hard problem. Probably someone with more patience for disassembling proprietary binaries could take apart pixel-3 vendor ramoops and figure out if it’s possible to get a pstore oops log out of early boot (which would be the easiest way to debug problems). But failing that the simple hack to re-exec init worked enough to show me where the problem was (of course, if init had continued longer it would likely have run into other issues caused by the way I hacked it).

Rusty Russell: OP_SEGMENT: Allowing Introspection to Check Partial Scripts

1 year, 7 months ago

In my previous post on Examining scriptpubkeys in Script I pointed out that there are cases where we want to require a certain script condition, but not an exact script: an example would be a vault-like covenant which requires a delay, but doesn’t care what else is in the script.

The problem with this is that in Taproot scripts, any unknown opcode (OP_SUCCESSx) will cause the entire script to succeed without being executed, so we need to hobble this slightly. My previous proposal of some kind of separator was awkward, so I’ve developed a new idea which is simpler.

Introducing OP_SEGMENT

Currently, the entire tapscript is scanned for the OP_SUCCESS opcodes, and the script succeeds immediately if one is found. This would be modified:

  1. The tapscript is scanned for either OP_SEGMENT or OP_SUCCESSx.
  2. If OP_SEGMENT is found, the script up to that point is executed. If the script does not fail, scanning continues from that point.
  3. If OP_SUCCESSx is found, the script succeeds.

This basically divides the script into segments, each executed serially. It’s not quite as simple as “cut into pieces by OP_SEGMENT and examine one at a time” because the tapscript is allowed to contain things which would fail to decode altogether, after an OP_SUCCESSx, and we want to retain that property.

When OP_SEGMENT is executed, it does nothing: it simply limits the range of OP_SUCCESS opcodes.
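
As an invented illustration, riffing on the vault example above: a covenant could insist that the tapscript start with

<delay> OP_CHECKSEQUENCEVERIFY OP_DROP OP_SEGMENT

and not care what follows. Even if the remainder of the script contains an OP_SUCCESSx, the delay check still has to execute and pass before the anything-goes behaviour kicks in.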

Implementation

The ExecuteWitnessScript would have to be refactored (probably as a separate ExecuteTapScript since 21 of its 38 lines are an “if Tapscript” anyway), and it also implies that the stack limits for the current tapscript would be enforced upon encountering OP_SEGMENT, even if OP_SUCCESS were to follow after.

Interestingly, the core EvalScript function wouldn’t change except to ignore OP_SEGMENT, as it’s already fairly flexible.

Note that I haven’t implemented it yet, so there may yet be surprises, but I plan to prototype after the idea has received some review!

Enjoy!

Matthew Garrett: Dealing with weird ELF libraries

1 year, 7 months ago
Libraries are collections of code that are intended to be usable by multiple consumers (if you're interested in the etymology, watch this video). In the old days we had what we now refer to as "static" libraries, collections of code that existed on disk but which would be copied into newly compiled binaries. We've moved beyond that, thankfully, and now make use of what we call "dynamic" or "shared" libraries - instead of the code being copied into the binary, a reference to the library function is incorporated, and at runtime the code is mapped from the on-disk copy of the shared object[1]. This allows libraries to be upgraded without needing to modify the binaries using them, and if multiple applications are using the same library at once it only requires that one copy of the code be kept in RAM.

But for this to work, two things are necessary: when we build a binary, there has to be a way to reference the relevant library functions in the binary; and when we run a binary, the library code needs to be mapped into the process.

(I'm going to somewhat simplify the explanations from here on - things like symbol versioning make this a bit more complicated but aren't strictly relevant to what I was working on here)

For the first of these, the goal is to replace a call to a function (eg, printf()) with a reference to the actual implementation. This is the job of the linker rather than the compiler (eg, if you use the -c argument to tell gcc to simply compile to an object rather than linking an executable, it's not going to care whether every function called in your code actually exists - that'll be figured out when you link all the objects together), and the linker needs to know which symbols (which aren't just functions - libraries can export variables or structures and so on) are available in which libraries. You give the linker a list of libraries, it extracts the symbols available, and resolves the references in your code with references to the library.

But how is that information extracted? Each ELF object has a fixed-size header that contains references to various things, including a reference to a list of "section headers". Each section has a name and a type, but the ones we're interested in are .dynstr and .dynsym. .dynstr contains a list of strings, representing the name of each exported symbol. .dynsym is where things get more interesting - it's a list of structs that contain information about each symbol. This includes a bunch of fairly complicated stuff that you need to care about if you're actually writing a linker, but the relevant entries for this discussion are an index into .dynstr (which means the .dynsym entry isn't sufficient to know the name of a symbol, you need to extract that from .dynstr), along with the location of that symbol within the library. The linker can parse this information and obtain a list of symbol names and addresses, and can now replace the call to printf() with a reference to libc instead.

(Note that it's not possible to simply encode this as "Call this address in this library" - if the library is rebuilt or is a different version, the function could move to a different location)

Experimentally, .dynstr and .dynsym appear to be sufficient for linking a dynamic library at build time - there are other sections related to dynamic linking, but you can link against a library that's missing them. Runtime is where things get more complicated.

When you run a binary that makes use of dynamic libraries, the code from those libraries needs to be mapped into the resulting process. This is the job of the runtime dynamic linker, or RTLD[2]. The RTLD needs to open every library the process requires, map the relevant code into the process's address space, and then rewrite the references in the binary into calls to the library code. This requires more information than is present in .dynstr and .dynsym - at the very least, it needs to know the list of required libraries.

There's a separate section called .dynamic that contains another list of structures, and it's the data here that's used for this purpose. For example, .dynamic contains a bunch of entries of type DT_NEEDED - this is the list of libraries that an executable requires. There's also a bunch of other stuff that's required to actually make all of this work, but the only thing I'm going to touch on is DT_HASH. Doing all this re-linking at runtime involves resolving the locations of a large number of symbols, and if the only way you can do that is by reading a list from .dynsym and then looking up every name in .dynstr that's going to take some time. The DT_HASH entry points to a hash table - the RTLD hashes the symbol name it's trying to resolve, looks it up in that hash table, and gets the symbol entry directly (it still needs to resolve that against .dynstr to make sure it hasn't hit a hash collision - if it has it needs to look up the next hash entry, but this is still generally faster than walking the entire .dynsym list to find the relevant symbol). There's also DT_GNU_HASH which fulfills the same purpose as DT_HASH but uses a more complicated algorithm that performs even better. .dynamic also contains entries pointing at .dynstr and .dynsym, which seems redundant but will become relevant shortly.
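
As a quick illustration, the .dynamic entries - DT_NEEDED included - are easy to inspect for any binary with readelf:

readelf -d /bin/ls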

So, .dynsym and .dynstr are required at build time, and both are required along with .dynamic at runtime. This seems simple enough, but obviously there's a twist and I'm sorry it's taken so long to get to this point.

I bought a Synology NAS for home backup purposes (my previous solution was a single external USB drive plugged into a small server, which had uncomfortable single point of failure properties). Obviously I decided to poke around at it, and I found something odd - all the libraries Synology ships were entirely lacking any ELF section headers. This meant no .dynstr, .dynsym or .dynamic sections, so how was any of this working? nm asserted that the libraries exported no symbols, and readelf agreed. If I wrote a small app that called a function in one of the libraries and built it, gcc complained that the function was undefined. But executables on the device were clearly resolving the symbols at runtime, and if I loaded them into ghidra the exported functions were visible. If I dlopen()ed them, dlsym() couldn't resolve the symbols - but if I hardcoded the offset into my code, I could call them directly.

Things finally made sense when I discovered that if I passed the --use-dynamic argument to readelf, I did get a list of exported symbols. It turns out that ELF is weirder than I realised. As well as the aforementioned section headers, ELF objects also include a set of program headers. One of the program header types is PT_DYNAMIC. This typically points to the same data that's present in the .dynamic section. Remember when I mentioned that .dynamic contained references to .dynsym and .dynstr? This means that simply pointing at .dynamic is sufficient, there's no need to have separate entries for them.
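
In other words, the two views of the same library disagree - something along these lines shows the difference (library name assumed):

readelf --syms libweird.so                 # walks the section headers: no symbols
readelf --use-dynamic --syms libweird.so   # walks PT_DYNAMIC: the real symbol list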

The same information can be reached from two different locations. The information in the section headers is used at build time, and the information in the program headers at run time[3]. I do not have an explanation for this. But if the information is present in two places, it seemed obvious that it should be possible to reconstruct the missing section headers in my weird libraries. So that's what this does. It extracts information from the DYNAMIC entry in the program headers and creates equivalent section headers.

There's one thing that makes this more difficult than it might seem. The section header for .dynsym has to contain the number of symbols present in the section. And that information doesn't directly exist in DYNAMIC - to figure out how many symbols exist, you're expected to walk the hash tables and keep track of the largest number you've seen. Since every symbol has to be referenced in the hash table, once you've hit every entry the largest number is the number of exported symbols. This seemed annoying to implement, so instead I cheated, added code to simply pass in the number of symbols on the command line, and then just parsed the output of readelf against the original binaries to extract that information and pass it to my tool.

Somehow, this worked. I now have a bunch of library files that I can link into my own binaries to make it easier to figure out how various things on the Synology work. Now, could someone explain (a) why this information is present in two locations, and (b) why the build-time linker and run-time linker disagree on the canonical source of truth?

[1] "Shared object" is the source of the .so filename extension used in various Unix-style operating systems
[2] You'll note that "RTLD" is not an acryonym for "runtime dynamic linker", because reasons
[3] For environments using the GNU RTLD, at least - I have no idea whether this is the case in all ELF environments


James Bottomley: Securing the Google SIP Stack

1 year, 7 months ago

A while ago I mentioned I use Android-10 with the built in SIP stack, and that the Google stack was pretty buggy and I had to fix it simply to get it to function without disconnecting all the time. Since then I’ve upported my fixes to Android-11 (the jejb-11 branch in the repositories) by using LineageOS-19.1. However, another major deficiency in the Google SIP stack is its complete lack of security: both the SIP signalling and the media streams are unencrypted, meaning they can be intercepted and tapped by pretty much anyone in the network path running tcpdump. Why this is so, particularly for a company that keeps touting its security credentials, is anyone’s guess. I personally suspect they added SIP in Android-4 with a view to basing Google Voice on it, decided later that proprietary VoIP protocols were the way to go, but got stuck with people actually using the SIP stack for other calling services so they couldn’t rip it out, and instead simply neglected it hoping it would die quietly due to lack of features and updates.

This blog post is a guide to how I took the fully unsecured Google SIP stack and added security to it. It also gives a brief overview of some of the security protocols you need to understand to get secure VoIP working.

What is SIP

What I’m calling SIP (but really a VoIP system using SIP) is a protocol consisting of several pieces. SIP (Session Initiation Protocol), RFC 3261, is really only one piece: it is the “signalling” layer, meaning that call initiation, response and parameters are all communicated this way. However, simple SIP isn’t enough for a complete VoIP stack; once a call moves to in progress, there must be an agreement on where the media streams are and how they’re encoded. This piece is called an SDP (Session Description Protocol) agreement and is usually negotiated in the body of the SIP INVITE and response messages, and finally, once agreement is reached, the actual media stream for call audio goes over a different protocol called RTP (Real-time Transport Protocol).

How Google did SIP

The trick to adding protocols fast is to take them from someone else (if you’re open source, this is encouraged) so google actually chose the NIST-SIP java stack (which later became the JAIN-SIP stack) as the basis for SIP in android. However, that only covered signalling and they had to plumb it in to the android Phone model. One essential glue piece is frameworks/opt/net/voip which supplies the SDP negotiating layer and interfaces the network codec to the phone audio. This isn’t quite enough because the telephony service and the Dialer also need to be involved to do the account setup and call routing. It always interested me that SIP was essentially special cased inside these services and apps instead of being a plug in, but that’s due to the fact that some of the classes that need extending to add phone protocols are internal only; presumably so only manufacturers can add phone features.

Securing SIP

This is pretty easy, following the time honoured path of sending messages over TLS instead of in the clear by simply wrapping the transport in secure sockets; indeed, this is how RFC 3261 says to do it. However, even this minor re-engineering proved unnecessary because the nist-sip stack was already TLS capable; it simply wasn’t allowed to be activated that way by the configuration options Google presented. A simple 10 line patch in a couple of repositories (external/nist_sip, packages/services/Telephony and frameworks/opt/net/voip) fixed this, and the SIP stack messaging was secured, leaving only the voice stream insecure.

SDP

As I said above, the google frameworks/opt/net/voip does all the SDP negotiation. This isn’t actually part of SIP. The SDP negotiation is conducted over SIP messages (which means it’s secured thanks to the above) but how this should be done isn’t part of the SIP RFC. Instead SDP has its own RFC 4566, which is what the class follows (mainly for codec and port negotiation). You’d think that if it’s already secured by SIP, there’s no additional problem but, unfortunately, using SRTP as the audio stream requires the exchange of additional security parameters, which are added to SDP by RFC 4568. To incorporate this into the Google SIP stack, it has to be integrated into the voip class. The essential additions in this RFC are a separate media description protocol (RTP/SAVP) for the secure stream and the addition of a set of tagged a=crypto: lines for key negotiation.
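
To make that concrete, an offer using these additions looks something like this (the inline key is the example one from RFC 4568 itself, not a real secret; the |2^20|1:32 tail is the key lifetime and master key index discussed below):

m=audio 49170 RTP/SAVP 0
a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:PS1uQCVeeCFCanVmcjkpPywjNWhcYD0mXXtxaVBR|2^20|1:32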

As will be a common theme: not all of RFC 4568 has to be implemented to get a secure RTP stream. It goes into great detail about key lifetime and master key indexes, neither of which are used by the asterisk SIP stack (which is the one my phone communicates with) so they’re not implemented. Briefly, it is best practice in TLS to rekey the transport periodically, so part of key negotiation should be key lifetime (actually, this isn’t as important to SRTP as it is to TLS, see below, which is why asterisk ignores it) and the people writing the spec thought it would be great to have a set of keys to choose from instead of just a single one (The Master Key Identifier) but realistically that simply adds a load of complexity for not much security benefit and, again, is ignored by asterisk which uses a single key.

In the end, it was a case of adding a new class for parsing the a=crypto: lines of SDP and doing a loop in the audio protocol for RTP/SAVP if TLS were set as the transport. This ended up being a ~400 line patch.

Secure RTP

RTP itself is governed by RFC 3550 which actually contains two separate stream descriptions: the actual media over RTP and a control protocol over RTCP. RTCP is mostly used for multi-party and video calls (where you want reports on reception quality to up/downshift the video resolution) and really serves no purpose for audio, so it isn’t implemented in the Google SIP stack (and isn’t really used by asterisk for audio only either).

When it comes to securing RTP (and RTCP) you’d think the time honoured mechanism (using secure sockets) would have applied although, since RTP is transmitted over UDP, one would have to use DTLS instead of TLS. Apparently the IETF did consider this, but elected to define a new protocol instead (or actually two: SRTP and SRTCP) in RFC 3711. One of the problems with this new protocol is that it also defines a new ciphersuite (AES_CM_…) which isn’t found in any of the standard SSL implementations. Although the AES_CM ciphers are very similar in operation to the AES_GCM ciphers of TLS (Indeed AES_GCM was adopted for SRTP in a later RFC 7714) they were never incorporated into the TLS ciphersuite definition.

So now there are two problems: adding code for the new protocol and performing the new encryption/decryption scheme. Fortunately, there already exists a library (libsrtp) that can do this and even more fortunately it’s shipped in android (external/libsrtp2) although it looks to be one of those throwaway additions where the library hasn’t really been updated since it was added (for cuttlefish gcastv2) in 2019 and so is still at a pre 2.3.0 version (I did check and there doesn’t look to be any essential bug fixes missing vs upstream, so it seems usable as is).

One of the really great things about libsrtp is that it has srtp_protect and srtp_unprotect functions which transform SRTP to RTP and vice versa, so it’s easily possible to layer this library directly into an existing RTP implementation. When doing this you have to remember that the encryption also includes authentication, so the size of the packet expands which is why the initial allocation size of the buffers has to be increased. One of the not so great things is that it implements all its own crypto primitives including AES and SHA1 (which most cryptographers think is always a bad idea) but on the plus side, it’s the same library asterisk uses so how much of a real problem could this be …

Following the simple layering approach, I constructed a patch to do the RTP<->SRTP transform in the JNI code if a key is passed in, so now everything just works and setting asterisk to SRTP only confirms the phone is able to send and receive encrypted audio streams. This ends up being a ~140 line patch.

So where does DTLS come in?

Anyone spending any time at all looking at protocols which use RTP, like webRTC, sees RTP and DTLS always mentioned in the same breath. Even asterisk has support for DTLS, so why is this? The answer is that if you use RTP outside the SIP framework, you still need a way of agreeing on the keys using SDP. That key agreement must be protected (and can’t go over RTCP because that would cause a chicken and egg problem) so implementations like webRTC use DTLS to exchange the initial SDP offer and answer negotiation. This is actually referred to as DTLS-SRTP even though it’s an initial DTLS handshake followed by SRTP (with no further DTLS in sight). However, this DTLS handshake is completely unnecessary for SIP, since the SDP handshake can be done over TLS protected SIP messaging instead (although I’ve yet to find anyone who can convincingly explain why this initial handshake has to go over DTLS and not TLS like SIP … I suspect it has something to do with wanting the protocol to be all UDP and go over the same initial port).

Conclusion

This whole exercise ended up producing less than 1000 lines in patches and taking a couple of days over Christmas to complete. That’s actually much simpler and way less time than I expected (given the complexity in the RFCs involved), which is why I didn’t look at doing this until now. I suppose the next thing I need to look at is reinserting the SIP stack into Android-12, but I’ll save that for when Android-11 falls out of support.

Rusty Russell: Arithmetic Opcodes: What Could They Look Like?

1 year, 7 months ago

As noted in my previous post on Dealing with Amounts in Bitcoin Script, it’s possible to use the current opcodes to deal with satoshi values in script, but it’s horribly ugly and inefficient.

This makes it obvious that we want 64-bit-capable opcodes, and it makes sense to restore disabled opcodes such as OP_AND/OP_OR/OP_XOR/OP_INVERT, as well as OP_LSHIFT, OP_RSHIFT and maybe OP_LEFT, OP_RIGHT and OP_SUBSTR.

Blockstream’s Elements Arithmetic Opcodes

Blockstream’s Elements codebase is ahead here, with a full set of 64-bit opcodes available for Liquid. All these operations push a “success” flag on the stack, designed so that you can follow them with an OP_VERIFY if you want to assert against overflow. They then provide three conversion routines, to convert to and from CScriptNum, and one for converting from 32-bit/4-byte values.

But frankly, these are ugly. We need more than 31 bits for satoshi amounts, and we may never need more than 64 bits, but the conversions and new opcodes feel messy. Can we do better?

Generic Variable Length Amounts

Script amounts are variable length, which is nice when space matters, but terribly awkward for signed values: using the high bit of the last byte, this being little-endian, requires a 0 padding byte if the high bit would otherwise be set. And using negative numbers is not very natural (pun intended!), since we have OP_SUB already.

What if we simply used unsigned, variable-length little-endian numbers? These are efficient, retain compatibility with current unsigned Script numbers, and yet can be extended to 64 bits or beyond, without worrying about overflow (as much).

I propose OP_ADDV, which simply adds two unsigned little-endian numbers of arbitrary length. 64 bits is a little simpler to implement, but there are a number of useful bit tricks which can be done with wide values and I can’t see a good reason to add a limit we might regret in future.
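
As a worked example (mine, not part of the proposal): 511 encodes little-endian as ff 01, and OP_ADDV of that with the single byte 01 gives 00 02, i.e. 512. With no sign bit there's never a need for a padding byte, and overflow just grows the result: ff ff plus 01 becomes 00 00 01 (65536).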

OP_SUBV would have to be Elements-style: pushing the absolute result on the stack, then a “success” flag if the result isn’t negative (effectively, a positive sign bit).

Limitations on Variable Operations

If OP_CAT is not limited to 520 bytes as per OP_CAT beyond 520 bytes, then we could make OP_ADDV never fail, since the effective total stack size as per that proposal could not be increased by OP_ADDV (it consumes its inputs). But that introduces the problem of DoS, since on my laptop 2 million iterations (an entire block) of a mildly optimized OP_ADDV of 260,000 bytes would take 120 seconds.

Thus, even if we were to allow more than 520 byte stack elements, I would propose either limiting the inputs and outputs to 520 bytes, or simply re-using the dynamic hash budget proposed in that post for arithmetic operations.

Multiply And Divide

Restoring OP_MUL to apply to variable values is harder, since it’s O(N^2) in the size of the operands: it performs multiple shifts and additions in one operation. Yet both this and OP_DIV are useful (consider the case of calculating a 1% fee).

I suggest this is a good case for using an Elements-style success flag, rather than aborting the script. This might look like:

  1. Normalize the inputs to eliminate leading zeros.
  2. If either input exceeds 128 bits, push 0 0.
  3. For OP_DIV, if the divisor is 0, push 0 0.
  4. For OP_MUL, if the output overflows, push 0 0.
  5. Otherwise, push the (normalized) result and 1.

Both GCC and clang support an extension for 128 bit operations via __uint128_t, so this implementation is fairly trivial (OP_MUL overflow detection is assisted greatly by __builtin_mul_overflow extension in GCC and clang).

Shift Operations, Splitting and Comparison

OP_LSHIFT and OP_RSHIFT would now be restored as simple bit operations (there’s no sign bit), treating their argument as unsigned rather than a signed amount. OP_LSHIFT would fail if the result would be excessive (either a 520 byte limit, or a dynamic hash budget limit). Neither would normalize, leaving zero bytes at the front. This is useful when combined with OP_INVERT to make a mask (0 OP_ADDV can be used to normalize if you want).

OP_LEFT, OP_SUBSTR and OP_RIGHT should probably return empty on out-of-range arguments rather than failing (I’m actually not sure what the original semantics were, but this is generally sensible).

OP_EQUAL just works, but we need at least a new OP_GREATERTHANV, and I suggest the full complement: OP_GREATERTHANOREQUALV, OP_LESSTHANV and OP_LESSTHANOREQUALV.

Use for Merkle Tree Construction.

Regretfully, our comparison opcodes don’t give us the less-than test required for Merkle Tree construction to examine scripts: the SHA256 hashes there need to be compared big-endian, not little-endian! So I suggest OP_BYTEREV: this adds some overhead (reverse both merkle hashes to compare them), but is generally a useful opcode to have anyway.

Summary

We can extend current bitcoin script numbers fairly cleanly by having new opcodes which deal with variable-length little-endian unsigned numbers. The limits on most of these can be quite large, but if we want excessive limits (> 520 bytes) we need to have a budget like the checksig budget.

We can deal easily with current CScriptNum numbers, 32-bit numbers used by nLocktime, and 64-bit numbers used by satoshi amounts.

The soft fork would restore the following opcodes:

  • OP_LSHIFT[^simplified]
  • OP_RSHIFT[^simplified]
  • OP_AND
  • OP_OR
  • OP_XOR
  • OP_INVERT
  • OP_MUL[^extended]
  • OP_DIV[^extended]
  • OP_LEFT
  • OP_RIGHT
  • OP_SUBSTR

And add the following new ones:

  • OP_ADDV: add two little-endian unsigned numbers
  • OP_SUBV: sub two little-endian unsigned numbers, push non-negative flag
  • OP_GREATERTHANV: compare two little-endian unsigned numbers
  • OP_GREATERTHANOREQUALV: compare two little-endian unsigned numbers
  • OP_LESSTHANV: compare two little-endian unsigned numbers
  • OP_LESSTHANOREQUALV: compare two little-endian unsigned numbers
  • OP_BYTEREV: reverse bytes in the top stack element

This would make it far easier to deal with numeric fields to bring covenant introspection to its full potential (e.g. OP_TXHASH).

[^simplified]: The original versions preserved sign, but we don’t have that.

[^extended]: These apply to unsigned numbers up to 128 bits, not just the signed 31-bit values the originals did.

Rusty Russell: OP_CAT beyond 520 bytes

1 year 7 months ago

The original OP_CAT proposal limited the result to 520 bytes, but we want more for covenants which analyze scripts (especially since OP_CAT itself makes scripting more interesting).

My post showed that it’s fairly simple to allow larger sizes: instead of a limit of 1000 stack elements of at most 520 bytes each, we reduce the element-count limit for each element over 520 bytes such that the total is still capped the same.

Hashing of Large Stack Objects

But consider hashing operations, such as OP_SHA256. Prior to OP_CAT, the most it can be made to hash is 520 bytes (9 hash rounds), using three opcodes:

OP_DUP OP_SHA256 OP_DROP

That’s 3 hash rounds per opcode. With OP_CAT and no 520-byte stack limit we can make a 260,000-byte stack element: that’s 4062 hash rounds, or 1354 per opcode, which is about 450x as expensive, so we definitely need to think about it!

A quick benchmark shows OpenSSL’s sha256 on my laptop takes about 115 microseconds to hash 260k: a block full of such instructions would take about 150 seconds to validate!
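
A rough sketch of such a measurement, using OpenSSL’s EVP interface (not the exact benchmark harness used above, and timings will of course vary by machine):

/* Build with: cc bench.c -lcrypto */
#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    static unsigned char buf[260000];
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int mdlen;
    struct timespec t0, t1;
    const int iters = 1000;

    memset(buf, 0xa5, sizeof(buf));   /* arbitrary fill */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        EVP_Digest(buf, sizeof(buf), md, &mdlen, EVP_sha256(), NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = ((t1.tv_sec - t0.tv_sec) * 1e9
                 + (t1.tv_nsec - t0.tv_nsec)) / 1e3 / iters;
    printf("%.1f microseconds per 260k hash\n", us);
    return 0;
}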

So we need to have some limit, and there are three obvious ways to do it:

  1. Give up, and don’t allow OP_CAT to produce more than 520 bytes.
  2. Use some higher limit for OP_CAT, but still less than 260,000.
  3. Use a weight budget, like BIP-342 does for checksig.

A quick benchmark on my laptop shows that we can hash about 48k (using the OpenSSL routines) in the time we do a single ECDSA signature verification (and Taproot charges 50 witness weight for each signature routine).

A simple limit would be to say “1 instruction lets you hash about 1k” and “the tightest loop we can find is three instructions” and so limit OP_CAT to producing 3k. But that seems a little arbitrary, and still quite restrictive for future complex scripts.

My Proposal: A Dynamic Limit for Hashing

A dynamic BIP-342-style approach would be to have a “hashing budget” of some number times the total witness weight. SHA256 uses blocks of 64 bytes, but it is easier to simply count bytes, and we don’t need this level of precision.

I propose we allow a budget of 520 bytes of hashing for each witness byte: this gives us some headroom from the ~1k measurement above, and cannot make any currently-legal script illegal, since the weight of the hashing opcode itself is enough to cover the maximal currently-possible (520-byte) stack element.

This budget is easy to calculate: 520 times total witness weight, and would be consumed by every byte hashed by OP_RIPEMD160, OP_SHA1, OP_SHA256, OP_HASH160, OP_HASH256. I’ve ignored that some of these hash twice, since the second hash amounts to a single block.
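
In implementation terms the accounting is tiny; a sketch (the structure and names here are illustrative only):

#include <stdbool.h>
#include <stdint.h>

/* Sketch of the proposed budget: initialized once per script execution
 * to 520 * witness weight, then drawn down by every byte hashed. */
struct hash_budget {
    uint64_t remaining;
};

static void budget_init(struct hash_budget *b, uint64_t witness_weight)
{
    b->remaining = 520 * witness_weight;
}

/* Called with the input length of each OP_SHA256 / OP_HASH160 / etc.;
 * returning false fails the script. */
static bool budget_consume(struct hash_budget *b, uint64_t nbytes)
{
    if (nbytes > b->remaining)
        return false;
    b->remaining -= nbytes;
    return true;
}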

Is 520 Bytes of hashing per Witness Weight Too Limiting?

Could this budget ever be limiting to future scripts? Not for the case of “your script must look like {merkle tree of some template}”, since the presence of the template itself weighs more than enough to allow the hashing. Similarly for Merkle calculations, where the neighbor hashes contribute more than enough weight for the hash operations.

If you provide the data you’re hashing in your witness, you can’t reasonably hit the limit. One could imagine a future OP_TX which let you query some (unhashed) witness script of (another) input, but even in this case the limit is generous, allowing several kilobytes of hashing.

What Other Opcodes are Proportional to Element Length?

Hashing is the obvious case, but several other opcodes work on arbitrary length elements and should be considered. In particular, OP_EQUAL, OP_EQUALVERIFY, the various DUP opcodes, and OP_CAT itself.

I would really need to hack bitcoind to run exact tests, but modifying my benchmark to memcmp two 260,000-byte blobs takes 3100ns, and allocating and freeing a copy takes 3600ns.

The worst case seems to be arranging for a 173k element on the stack, then repeatedly doing:

OP_DUP OP_DUP OP_EQUALVERIFY

4MB of this would take about 8.9 seconds to evaluate on my laptop. Mitigating this further would be possible (copy-on-write for stack objects, or adding a budget for linear ops), but 10 seconds is probably not enough to worry about.

Dave Airlie (blogspot): radv: vulkan video encode status

1 year 7 months ago

Vulkan 1.3.274 moves the Vulkan encode work out of BETA and moves h264 and h265 into KHR extensions. radv support for the Vulkan video encode extensions has been in progress for a while.

The latest branch is at [1]. This branch has been updated for the new final headers.

Updated: It passes all of h265 CTS now, but it is failing one h264 test.

Initial ffmpeg support is [2].

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/radv-vulkan-video-encode-h2645-spec-latest?ref_type=heads

[2] https://github.com/cyanreg/FFmpeg/commits/vulkan/

Matthew Garrett: Making SSH host certificates more usable

1 year 7 months ago
Earlier this year, after Github accidentally committed their private RSA SSH host key to a public repository, I wrote about how better support for SSH host certificates would allow this sort of situation to be handled in a user-transparent way without any negative impact on security. I was hoping that someone would read this and be inspired to fix the problem but sadly that didn't happen so I've actually written some code myself.

The core part of this is straightforward - if a server presents you with a certificate associated with a host key, then make the trust in that host be whoever signed the certificate rather than just trusting the host key. This means that if someone needs to replace the host key for any reason (such as, for example, them having published the private half), you can replace the host key with a new key and a new certificate, and as long as the new certificate is signed by the same key that the previous certificate was, you'll trust the new key and key rotation can be carried out without any user errors. Hurrah!

So obviously I wrote that bit and then thought about the failure modes and it turns out there's an obvious one - if an attacker obtained both the private key and the certificate, what stops them from continuing to use it? The certificate isn't a secret, so we basically have to assume that anyone who possesses the private key has access to it. We may have silently transitioned to a new host key on the legitimate servers, but a hostile actor able to MITM a user can keep on presenting the old key and the old certificate until it expires.

There are two ways to deal with this - either have short-lived certificates (ie, issue a new certificate every 24 hours or so even if you haven't changed the key, and specify that the certificate is invalid after those 24 hours), or have a mechanism to revoke the certificates. The former is viable if you have a very well-engineered certificate issuing operation, but still leaves a window for an attacker to make use of the certificate before it expires. The latter is something SSH has support for, but the spec doesn't define any mechanism for distributing revocation data.

So, I've implemented a new SSH protocol extension that allows a host to send a key revocation list to a client. The idea is that the client authenticates to the server, receives a key revocation list, and will no longer trust any certificates that are contained within that list. This seems simple enough, but a naive implementation opens the client to various DoS attacks. For instance, if you simply revoke any key contained within the received KRL, a hostile server could revoke any certificates that were otherwise trusted by the client. The easy way around this is for the client to ensure that any revoked keys are associated with the same CA that signed the host certificate - that way a compromised host can only revoke certificates associated with that CA, and can't interfere with anyone else.

Unfortunately that still means that a single compromised host can still trigger revocation of certificates inside that trust domain (ie, a compromised host a.test.com could push a KRL that invalidated the certificate for b.test.com), because there's no way in the KRL format to indicate that a given revocation is associated with a specific hostname. This means we need a mechanism to verify that the KRL update is legitimate, and the easiest way to handle that is to sign it. The KRL format specifies an in-band signature but this was deprecated earlier this year - instead KRLs are supposed to be signed with the sshsig format. But we control both the server and the client, which means it's easy enough to send a detached signature as part of the extension data.

Putting this all together: you ssh to a server you've never contacted before, and it presents you with a host certificate. Instead of the host key being added to known_hosts, the CA key associated with the certificate is added. From now on, if you ssh to that host and it presents a certificate signed by that CA, it'll be trusted. Optionally, the host can also send you a KRL and a signature. If the signature is generated by the CA key that you already trust, any certificates in that KRL associated with that CA key will be incorporated into local storage. The expected flow if a key is compromised is that the owner of the host generates a new keypair, obtains a new certificate for the new key, and adds the old certificate to a KRL that is signed with the CA key. The next time the user connects to that host, they receive the new key and new certificate, trust it because it's signed by the same CA key, and also receive a KRL signed with the same CA that revokes trust in the old certificate.

Obviously this breaks down if a user is MITMed with a compromised key and certificate immediately after the host is compromised - they'll see a legitimate certificate and won't receive any revocation list, so will trust the host. But this is the same failure mode that would occur in the absence of keys, where the attacker simply presents the compromised key to the client before trust in the new key has been created. This seems no worse than the status quo, but means that most users will seamlessly transition to a new key and revoke trust in the old key with no effort on their part.

The work in progress tree for this is here - at the point of writing I've merely implemented this and made sure it builds, not verified that it actually works or anything. Cleanup should happen over the next few days, and I'll propose this to upstream if it doesn't look like there's any showstopper design issues.


James Bottomley: Solving the Looming Developer Liability Problem

1 year 7 months ago

Even if you’re a developer with legal leanings like me, you probably haven’t given much thought to the warranty disclaimer and the liability disclaimer that appears in almost every Open Source licence (see sections 14 and 15 of GPLv3). This post is designed to help you understand what they are, why they’re there and why we might need stronger defences in future thanks to a changing legal landscape.

History: Why no Warranty or Liability

It seems obvious, when considered in terms of what downstream gets from Open Source, that an open-ended obligation on upstream to fix your problems isn’t included, because it wouldn’t be sustainable. Effectively the no warranty clause is notice that, since you’re getting the code for free, it comes with absolutely no obligations on developers: if it breaks, you get to fix it. This is why no warranty clauses have been present since the beginning of Open Source (and Free Software: GPLv1 included one). There’s also a historical commercial reason for this. Before the explosion of Open Source business models in the last decade, the Free Software Foundation (FSF) considered paid support for otherwise unsupported, no-warranty Open Source software to be the standard business model for making money on Open Source. Based on this, Cygnus Support (later Cygnus Solutions – earliest web archive capture 1997) was started in 1989 with a business model of providing paid support and bespoke development for the compiler and toolchain.

Before 2000, most public opinion (when it thought about Open Source at all) was happy with this, because Open Source was seen by and large as the uncommercialized offerings of random groups of hackers. Even the largest Open Source project, the Linux kernel, was seen as the scrappy volunteer upstart challenging both Microsoft and the proprietary UNIXes for control of the Data Centre. On the back of this, distributions (Red Hat, SUSE, etc.) arose to commercialize support offerings around Linux, to further its competition with UNIX and Windows and push it to win the war for the Data Centre (and later the Cloud).

The Rise of The Foundations: Public Perception Changes

The heyday explosion of volunteer Open Source happened in the first decade of the new Millennium. But volunteer Open Source also became a victim of this success: the more it penetrated industry, the more control of the end product industry wanted. And, whenever there’s a Business Need, something always arises to fulfill it: in this case, the Foundation Model for exerting influence in exchange for cash. The model is fairly simple: interested parties form a foundation (or, more likely, go to a foundation-forming entity like the Linux Foundation). They get seats on the governing board, usually in proportion to their annual expenditure on the foundation, and the foundation sets up a notionally independent Technical Oversight Body, staffed by developers, which is still somewhat beholden to the board and its financial interests. The net result is rising commercial franchise in Open Source.

The point of the above isn’t to say whether this commercial influence is good or bad; it’s to say that the rise of the Foundations has changed the public perception of Open Source. No longer is Open Source seen as the home of scrappy volunteers battling for technological innovation against entrenched commercial interests; now Open Source is seen as one more development tool of the tech industry. This change in attitude is pretty profound, because now when a problem is found in Open Source, the public has no real hesitation in assuming the tech industry in general should be responsible; the perception that the no warranty clause protects innocent individual developers is supplanted by the perception that it’s simply one more tool big tech deploys to evade liability for the problems it creates. Some Open Source developers have inadvertently supported this notion by publicly demanding to be paid for working on their projects, often in the name of sustainability. Again, none of this is necessarily wrong, but it furthers the public perception that Open Source developers are participating in a commercial, not a volunteer, enterprise.

Liability via Fiduciary Duty: The Bitcoin Case

An ongoing case in the UK courts (BL-2021-000313) between Tulip Trading and various bitcoin developers centers around the disputed ownership of about US$4bn in bitcoin. Essentially Tulip contends that it lost access to the bitcoins due to a computer hack, but says that the bitcoin developers have a fiduciary duty to it to alter the blockchain code to recover its lost bitcoins. The unusual feature of this case is that Tulip sued the developers of the bitcoin code, not the operators of the bitcoin network (it’s rather like the Bank losing your money and then you trying to sue the Mint for recovery). The reason for this is that all the operators (the miners) use the same code base for the same blockchain and thus could rightly claim that it’s technologically impossible for them to recover the lost bitcoin, because that would necessitate a change to the fundamental blockchain code, which only the developers control. The suit was initially lost by Tulip on the grounds of the no-liability disclaimer, but reinstated by the UK appeal court, which showed considerable interest in the idea that developers could pick up fiduciary liability in some cases, even though the suit may eventually be dismissed on the grounds that Tulip can’t prove it ever owned the US$4bn in bitcoins in the first place.

Why does all this matter? Well, even if this case resolves successfully, thanks to the appeal court ruling, the door is still open to others with less shady claims that they’ve suffered an injury due to some coding issue that gives developers fiduciary liability to them. The no warranty disclaimer is already judged not to be sufficient to prevent this, so the cracks are starting to appear in it as a defence against all liability claims.

The EU Cyber Resilience Act: Legally Piercing No Warranty Clauses

The EU Cyber Resilience Act (CRA) at its heart imposes a duty of care on all “digital components” incorporated into products or software offered on the EU market: they must adhere to prescribed cybersecurity requirements, with an obligation to maintain that care over the whole lifecycle of such products or software. Essentially this is developer liability, notwithstanding any no warranty clauses, writ large. To be fair, there is currently a carve-out for “noncommercial” Open Source but, as I pointed out above, most Open Source today is commercial and wouldn’t actually benefit from it. I’m not proposing to give a detailed analysis (many people have already done this, and your favourite search engine will turn up dozens without even trying); I just want to note that this is a legislative act designed to pierce the no warranty clauses Open Source has relied on for so long.

EU CRA Politics: Why is this Popular?

Politicians don’t set out to effectively override licensing terms and contract law unless there’s a significant popularity upside and, if you actually canvass the general public, there is: people are tired of endless cybersecurity breaches compromising their private information, or even their bank accounts, and want someone to be held responsible. Making corporations pay for breaches that damage individuals is enormously popular (and not just in the EU). After all, big Tech profits enormously from all this, so big Tech should pay for the clean-up when things go wrong.

Unfortunately, self-serving arguments that this will place undue burdens on Foundations funded by starving corporations rather undermine the same arguments made on behalf of individual developers. To the public at large, such arguments merely serve to reinforce the idea that big Tech has been getting away with too much for too long. Trying to separate individual developer Open Source from corporate Open Source is too subtle a concept to introduce now, particularly when we, and the general public, have bought into the idea that they’re the same thing for so long.

So what should we do about this?

It’s clear that even if a massive (and expensive) lobbying effort succeeds in blunting the effect of the CRA on Open Source this time around, there will always be a next time, because of the public desire for accountability and safety guarantees in cybersecurity practices. It is also clear that individual developer Open Source has to make common cause with commercial Open Source to solve this issue. Even though individuals hate being seen as synonymous with corporations, one of the true distinctions between Open Source and Free Software has always been the ability to make common cause over smaller goals rather than bigger philosophies and aspirations, so this is definitely a goal we can make common cause over. This common cause means the eventual solution must apply to individual and commercial Open Source equally. And, since we’ve already lost the perception war, it will have to be something more legally based.

Indemnification: the Legal solution to Developer Liability

Indemnification means one party, in particular circumstances, agreeing to be on the hook for the legal responsibilities of another party. This is actually a well-known way not of avoiding liability but of transferring it to where it belongs. As such, it’s easily sellable in the court of public opinion: we’re not looking to avoid liability, merely trying to make sure it lands on those who are making all the money from the code.

The best mechanism for transmitting this is obviously the Licence and, ironically, a licence already exists with a developer indemnity clause: Apache-2 (clause 9). Unfortunately, the Apache-2 clause only attaches to an entity offering support for a fee, which doesn’t quite cover the intention of the CRA, which is that anyone offering a product in the EU market (whether free or for sale) should be responsible for its cybersecurity lifecycle, whether they offer support or not. However, it does provide a roadmap for what such a clause would look like:

If you choose to offer this work in whole or part as a component or product in a jurisdiction requiring lifecycle duty of care you agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your actions in such a jurisdiction.

Probably the wording would need some tweaking by an actual lawyer, but you get the idea.

Applying Indemnity to existing Licences

Obviously for a new project the above clause can simply be added to the licence, but for any existing project, since the clause is compatible with the standard no-warranty statements, it can be added after the fact without interfering with the existing operation of the licence or needing buy-in from current copyright holders (there is an argument that this would represent an additional restriction within the meaning of GPL, but I addressed that here). This makes it very easy to add by anyone offering, for instance, a download over Github or Gitlab that could be incorporated by someone into a product in the EU.

Conclusion

Thanks to public perception, the issue of developer liability isn’t going to go away, and lobbying will not forestall it forever, so a robust indemnity defence needs to be incorporated into Open Source licences so that liability is seen to be accepted where it belongs (with the people or corporations utilizing the code).

Pete Zaitcev: git cp orig copy

1 year 7 months ago

Problem: I want to copy a file in git and preserve its history.

Solution: Holy guacamole!:

git checkout -b dup                             # work on a temporary branch
git mv orig copy                                # rename, so git records a move
git commit --author="Greg " -m "cp orig copy"
git checkout HEAD~ orig                         # bring the original file back
git commit --author="Greg " -m "restore orig"
git checkout -                                  # return to the original branch
git merge --no-ff dup                           # both files now share orig's history

Pete Zaitcev: RHEL 9 on libvirt and KVM

1 year 8 months ago

Problem: you create a VM like you always did, but RHEL 9 bombs with:

Fatal glibc error: CPU does not support x86-64-v2

Solution: as Dan Berrange explains in bug #2060839, the traditional default CPU model, qemu64, is no longer sufficient. Unfortunately, there's no "qemu64-v2". Instead, you must select one of the real CPU models.

<cpu mode='host-model' match='exact' check='none'>
  <model fallback='forbid'>Broadwell-v4</model>
</cpu>