Kernel Planet

Dave Airlie (blogspot): LPC 2022 Accelerators BOF outcomes summary

1 year, 6 months ago

 At Linux Plumbers Conference 2022, we held a BoF session around accelerators.

This is a summary made from memory and notes taken by John Hubbard.

We started with defining categories of accelerator devices.

1. Single-shot data processors: one-off jobs submitted to a device (simpler image processors).

2. Single-user, single-task offload devices (ML training devices).

3. Multi-app devices (GPUs, ML/inference execution engines).

One of the main points made is that common device frameworks are normally about targeting a common userspace (e.g. mesa for GPUs). Since a common userspace doesn't exist for accelerators, this presents the problem of what sort of common things can be targeted. There was discussion of TensorFlow and PyTorch being that userspace, but also camera image processing and OpenCL. OpenXLA was also named as a userspace API that might be of interest as a target for implementations.

There was a discussion on what to call the subsystem and where to place it in the tree. It was agreed that the drivers would likely need to use DRM subsystem functionality, but having things live in drivers/gpu/drm would not be great. Moving things around now for current drivers is too hard to deal with for backports etc. Adding a new directory for accel drivers would be a good plan, even if they used the drm framework. There was a lot of naming discussion; I think we landed on drivers/skynet or drivers/accel (Greg and I like skynet more).

We had a discussion about RAS (Reliability, Availability, Serviceability), which is how hardware is monitored in data centers. GPU and acceleration drivers for datacentre operations define their own RAS interfaces that get plugged into monitoring systems. This seems like an area that could be standardised across drivers. Netlink was suggested as a possible solution for this area.

Namespacing for devices was brought up. I think the advice was if you ever think you are going to namespace something in the future, you should probably consider creating a namespace for it up front, as designing one in later might prove difficult to secure properly.

We should use the drm framework with another major number to avoid some of the pain points and lifetime issues other frameworks see.

There was discussion about who could drive this forward, and Oded Gabbay from Intel's Habana Labs team was the obvious and best-placed person to move it forward; Oded said he would endeavor to make it happen.

This is mostly a summary of the notes. I think we have a fair idea of a path forward; we just need to start bringing the pieces together upstream now.

Paul E. Mc Kenney: Stupid RCU Tricks: CPP Summit Presentation

1 year, 6 months ago

I had the privilege of presenting Unraveling Fence & RCU Mysteries (C++ Concurrency Fundamentals) to the CPP Summit.  As the title suggests, this covered RCU from a C++ viewpoint.  At the end, there were several excellent questions:

  1. How does the rcu_synchronize() wait-for-readers operation work?
  2. But the rcu_domain class contains lock() and unlock() member functions!!!
  3. Lockless things make me nervous!

There was limited time for questions, and each question's answer could easily have consumed the full 50 minutes allotted for the full talk.  Therefore, I address these questions in the following sections.

How Does rcu_synchronize() Work?

There are a great many ways to make this work.  Very roughly speaking, userspace RCU implementations usually have per-thread counters that are updated by readers and sampled by updaters, with the updaters waiting for all of the counters to reach zero.  There are a large number of pitfalls and optimizations, some of which are covered in the 2012 Transactions On Parallel and Distributed Systems paper (non-paywalled draft).  The most detailed discussion is in the supplementary materials.
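To make the mechanism concrete, here is a deliberately simplified C++ sketch of the per-thread-counter idea. It uses default (sequentially consistent) atomics and a fixed-size reader registry purely for brevity; real userspace RCU implementations avoid exactly this overhead using the pitfall-avoiding optimizations the paper describes, and all names below are illustrative rather than any real API.

// Toy userspace RCU: one nesting counter per registered reader thread.
// Readers bump their counter on entry and drop it on exit; the updater
// waits until it has seen every counter at zero, at which point all
// pre-existing readers must have finished.
#include <atomic>
#include <mutex>
#include <thread>

constexpr int kMaxReaders = 128;                   // no bounds checking, for brevity
std::atomic<int> reader_nesting[kMaxReaders];
std::atomic<int> nreaders{0};
thread_local const int my_reader_id = nreaders.fetch_add(1);

void rcu_read_lock()   { reader_nesting[my_reader_id].fetch_add(1); }
void rcu_read_unlock() { reader_nesting[my_reader_id].fetch_sub(1); }

std::mutex updater_mutex;                          // serialize grace periods

void synchronize_rcu() {
    std::lock_guard<std::mutex> guard(updater_mutex);
    for (int i = 0; i < nreaders.load(); ++i)
        while (reader_nesting[i].load() != 0)      // wait for this reader to drain
            std::this_thread::yield();
}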

More recent information may be found in Section 9.5 of Is Parallel Programming Hard, And, If So, What Can You Do About It?

The rcu_domain Class Contains lock() and unlock() Member Functions?

Indeed it does!

But names notwithstanding, these lock() and unlock() member functions need not contain memory-barrier instructions, let alone read-modify-write atomic operations, let alone acquisition and release of actual locks.

So why these misleading names???  These misleading names exist so that the rcu_domain class meets the requirements of Cpp17BasicLockable, which provides RAII capability for C++ RCU readers.  Earlier versions of the RCU proposal for the C++ standard rolled their own RAII capability, but the committee wisely insisted that Cpp17BasicLockable's existing RAII capabilities be used instead.

So it is that rcu_domain::lock() simply enters an RCU read-side critical section and rcu_domain::unlock() exits that critical section.  Yes, RCU read-side critical sections can be nested.
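As a rough illustration of how that reads in practice, here is a sketch using names from the C++ RCU proposal (std::rcu_default_domain() and std::rcu_synchronize() in a proposed <rcu> header); the Config type is invented for the example, and shipping standard libraries may not provide these facilities yet.

#include <rcu>        // proposed header; may not exist in your toolchain yet
#include <atomic>
#include <mutex>      // std::scoped_lock

struct Config { int value; };
std::atomic<Config*> current_config{new Config{42}};

int read_config() {
    // rcu_domain satisfies Cpp17BasicLockable, so scoped_lock works: lock()
    // merely enters an RCU read-side critical section, unlock() exits it.
    std::scoped_lock reader(std::rcu_default_domain());
    return current_config.load(std::memory_order_acquire)->value;
}

void update_config(Config* next) {
    Config* old = current_config.exchange(next, std::memory_order_acq_rel);
    std::rcu_synchronize();   // wait for all pre-existing readers
    delete old;               // now safe to reclaim
}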

Lockless Things Make Me Nervous!!!

As well they should!

The wise developer will be at least somewhat nervous when implementing lockless code because that nervousness will help motivate the developer to be careful, to test and stress-test carefully, and, when necessary, make good use of formal verification.

In fact, one of the purposes of RCU is to package lockless code so as to make it easier to use.  This presentation dove into one RCU use case, and other materials called out in this CPP Summit presentation looked into many other RCU use cases.

So proper use of RCU should enable developers to be less nervous.  But hopefully not completely lacking in nervousness!  :-)

Matthew Garrett: Cloud desktops aren't as good as you'd think

1 year, 6 months ago
Fast laptops are expensive, cheap laptops are slow. But even a fast laptop is slower than a decent workstation, and if your developers want a local build environment they're probably going to want a decent workstation. They'll want a fast (and expensive) laptop as well, though, because they're not going to carry their workstation home with them and obviously you expect them to be able to work from home. And in two or three years they'll probably want a new laptop and a new workstation, and that's even more money. Not to mention the risks associated with them doing development work on their laptop and then drunkenly leaving it in a bar or having it stolen or the contents being copied off it while they're passing through immigration at an airport. Surely there's a better way?

This is the thinking that leads to "Let's give developers a Chromebook and a VM running in the cloud". And it's an appealing option! You spend far less on the laptop, and the VM is probably cheaper than the workstation - you can shut it down when it's idle, you can upgrade it to have more CPUs and RAM as necessary, and you get to impose all sorts of additional neat security policies because you have full control over the network. You can run a full desktop environment on the VM, stream it to a cheap laptop, and get the fast workstation experience on something that weighs about a kilogram. Your developers get the benefit of a fast machine wherever they are, and everyone's happy.

But having worked at more than one company that's tried this approach, my experience is that very few people end up happy. I'm going to give a few reasons here, but I can't guarantee that they cover everything - and, to be clear, many (possibly most) of the reasons I'm going to describe aren't impossible to fix, they're simply not priorities. I'm also going to restrict this discussion to the case of "We run a full graphical environment on the VM, and stream that to the laptop" - an approach that only offers SSH access is much more manageable, but also significantly more restricted in certain ways. With those details mentioned, let's begin.

The first thing to note is that the overall experience is heavily tied to the protocol you use for the remote display. Chrome Remote Desktop is extremely appealing from a simplicity perspective, but is also lacking some extremely key features (eg, letting you use multiple displays on the local system), so from a developer perspective it's suboptimal. If you read the rest of this post and want to try this anyway, spend some time working with your users to find out what their requirements are and figure out which technology best suits them.

Second, let's talk about GPUs. Trying to run a modern desktop environment without any GPU acceleration is going to be a miserable experience. Sure, throwing enough CPU at the problem will get you past the worst of this, but you're still going to end up with users who need to do 3D visualisation, or who are doing VR development, or who expect WebGL to work without burning up every single one of the CPU cores you so graciously allocated to their VM. Cloud providers will happily give you GPU instances, but that's going to cost more and you're going to need to re-run your numbers to verify that this is still a financial win. "But most of my users don't need that!" you might say, and we'll get to that later on.

Next! Video quality! This seems like a trivial point, but if you're giving your users a VM as their primary interface, then they're going to do things like try to use Youtube inside it because there's a conference presentation that's relevant to their interests. The obvious solution here is "Do your video streaming in a browser on the local system, not on the VM" but from personal experience that's a super awkward pain point! If I click on a link inside the VM it's going to open a browser there, and now I have a browser in the VM and a local browser and which of them contains the tab I'm looking for WHO CAN SAY. So your users are going to watch stuff inside their VM, and re-compressing decompressed video is going to look like shit unless you're throwing a huge amount of bandwidth at the problem. And this is ignoring the additional irritation of your browser being unreadable while you're rapidly scrolling through pages, or terminal output from build processes being a muddy blur of artifacts, or the corner case of "I work for Youtube and I need to be able to examine 4K streams to determine whether changes have resulted in a degraded experience" which is a very real job and one that becomes impossible when you pass their lovingly crafted optimisations through whatever codec your remote desktop protocol has decided to pick based on some random guesses about the local network, and look everyone is going to have a bad time.

The browser experience. As mentioned before, you'll have local browsers and remote browsers. Do they have the same security policy? Who knows! Are all the third party services you depend on going to be ok with the same user being logged in from two different IPs simultaneously because they lost track of which browser they had an open session in? Who knows! Are your users going to become frustrated? Who knows oh wait no I know the answer to this one, it's "yes".

Accessibility! More of your users than you expect rely on various accessibility interfaces, be those mechanisms for increasing contrast, screen magnifiers, text-to-speech, speech-to-text, alternative input mechanisms and so on. And you probably don't know this, but most of these mechanisms involve having accessibility software be able to introspect the UI of applications in order to provide appropriate input or expose available options and the like. So, I'm running a local text-to-speech agent. How does it know what's happening in the remote VM? It doesn't because it's just getting an a/v stream, so you need to run another accessibility stack inside the remote VM and the two of them are unaware of each others existence and this works just as badly as you'd think. Alternative input mechanism? Good fucking luck with that, you're at best going to fall back to "Send synthesized keyboard inputs" and that is nowhere near as good as "Set the contents of this text box to this unicode string" and yeah I used to work on accessibility software maybe you can tell. And how is the VM going to send data to a braille output device? Anyway, good luck with the lawsuits over arbitrarily making life harder for a bunch of members of a protected class.

One of the benefits here is supposed to be a security improvement, so let's talk about WebAuthn. I'm a big fan of WebAuthn, given that it's a multi-factor authentication mechanism that actually does a good job of protecting against phishing, but if my users are running stuff inside a VM, how do I use it? If you work at Google there's a solution, but that does mean limiting yourself to Chrome Remote Desktop (there are extremely good reasons why this isn't generally available). Microsoft have apparently just specced a mechanism for doing this over RDP, but otherwise you're left doing stuff like forwarding USB over IP, and that means that your USB WebAuthn no longer works locally. It also doesn't work for any other type of WebAuthn token, such as a bluetooth device, or an Apple TouchID sensor, or any of the Windows Hello support. If you're planning on moving to WebAuthn and also planning on moving to remote VM desktops, you're going to have a bad time.

That's the stuff that comes to mind immediately. And sure, maybe each of these issues is irrelevant to most of your users. But the actual question you need to ask is what percentage of your users will hit one or more of these, because if that's more than an insignificant percentage you'll still be staffing all the teams that dealt with hardware, handling local OS installs, worrying about lost or stolen devices, and the glorious future of just being able to stop worrying about this is going to be gone and the financial benefits you promised would appear are probably not going to work out in the same way.

A lot of this falls back to the usual story of corporate IT - understand the needs of your users and whether what you're proposing actually meets them. Almost everything I've described here is a corner case, but if your company is larger than about 20 people there's a high probability that at least one person is going to fall into at least one of these corner cases. You're going to need to spend a lot of time understanding your user population to have a real understanding of what the actual costs are here, and I haven't seen anyone do that work before trying to launch this and (inevitably) going back to just giving people actual computers.

There are alternatives! Modern IDEs tend to support SSHing out to remote hosts to perform builds there, so as long as you're ok with source code being visible on laptops you can at least shift the "I need a workstation with a bunch of CPU" problem out to the cloud. The laptops are going to need to be more expensive because they're also going to need to run more software locally, but it wouldn't surprise me if this ends up being cheaper than the full-on cloud desktop experience in most cases.

Overall, the most important thing to take into account here is that your users almost certainly have more use cases than you expect, and this sort of change is going to have direct impact on the workflow of every single one of your users. Make sure you know how much that's going to be, and take that into consideration when suggesting it'll save you money.


James Bottomley: Paying Maintainers isn’t a Magic Bullet

1 year, 7 months ago

Over the last few years it’s become popular to suggest that open source maintainers should be paid. There are a few stated motivations for this: one is that it would improve the security and reliability of the ecosystem (an argument pioneered by several companies like Tidelift), another is that it would be a solution to maintainer burnout, and finally that it would solve the open source free rider problem. The purpose of this blog is to examine each of these in turn to see whether paying maintainers actually would solve the problem (or, for some, whether the problem even exists in the first place).

Free Riders

The free rider problem is simply expressed: there’s a class of corporations which consume open source, for free, as the foundation of their profits but don’t give back enough of their allegedly ill-gotten gains. In fact, a version of this problem is as old as time: the “workers” don’t get paid enough (or at all) by the “bosses”; greedy people disproportionately exploit the free but limited resources of the planet. Open Source is uniquely vulnerable to this problem because of the free (as in beer) nature of the software: people who don’t have to pay for something often don’t. Part of the problem also comes from the general philosophy of open source, which tries to explain that it’s free (as in freedom) that matters, not free (as in beer), and that everyone, both producers and consumers, should care about the former. In fact, in economic terms, the biggest impact open source has had on industry is from the free (as in beer) effect.

Open Source as a Destroyer of Market Value

Open Source is often portrayed as a “disrupter” of the market, but it’s not often appreciated that a huge part of that disruption is value destruction. Consider one of the older Open Source systems: Linux. As an operating system (when coupled with GNU or other user space software) it competed in the early days with proprietary UNIX. However, it’s impossible to maintain your margin competing against free, and the net result was that one by one the existing players were forced out of the market or refocussed on other offerings; now, other than for historical or niche markets, there’s really no proprietary UNIX maker left … essentially the value contained within the OS market was destroyed. This value-destruction effect was exploited brilliantly by Google with Android: entering and disrupting an existing lucrative smartphone market, created and owned by Apple, with a free OS based on Open Source created a load of undercutting handset manufacturers eager to be cheaper than Apple, and they went on to carve out an 80% market share. Here the value isn’t completely destroyed, but it has been significantly reduced (smartphones going from a huge-margin business to a medium-to-low-margin one).

All of this value destruction is achieved by the free (as in beer) effect of open source: the innovator who uses it doesn’t have to pay the full economic cost for developing everything from scratch, they just have to pay the innovation expense of adapting it (such adaptation being made far easier by access to the source code). This effect is also the reason why Microsoft and other companies railed about Open Source being a cancer on intellectual property: because it is[1]. However, this view is also the product of rigid and incorrect thinking: by destroying value in existing markets, open source presents far more varied and unique opportunities in newly created ones. The cardinal economic benefit of value destruction is that it lowers the barrier to entry (as Google demonstrated with Android) thus opening the market up to new and varied competition (or turning monopoly markets into competitive ones).

Envy isn’t a Good Look

If you follow the above, you’ll see the supposed “free rider” problem is simply a natural consequence of open source being free as in beer (someone is creating a market out of the thing you offered for free precisely because they didn’t have to pay for it): it’s not a problem to be solved, it’s a consequence to be embraced and exploited (if you’re clever enough). Not all of us possess the business acumen to exploit market opportunities like this, but if you don’t, envying those who do definitely won’t cause your quality of life to improve.

The bottom line is that having a mechanism to pay maintainers isn’t going to do anything about this supposed “free rider” problem because the companies that exploit open source and don’t give back have no motivation to use it.

Maintainer Burnout

This has become a hot topic over recent years, with many blog posts and support groups devoted to it. From my observation it seems to matter what kind of maintainer you are: if you only have hobby projects you maintain on an as-time-becomes-available basis, it seems the chances of burnout aren’t high. On the other hand, if you’re effectively a full-time Maintainer, burnout becomes a distinct possibility. I should point out I’m the former, not the latter, type of maintainer, so this is observation not experience, but it does seem to me that burnout at any job (not just that of a Maintainer) happens when delivery expectations exceed your ability to deliver and you start to get depressed about the ever-increasing backlog and vocal complaints. In industry, when someone starts to burn out, the usual way of rectifying it is either to lighten the load or to provide assistance. I have noticed that full-time Maintainers are remarkably reluctant to give up projects (presumably because each one is part of their core value), so helping with tooling to decrease the load is about the only possible intervention here.

As an aside about tooling, from parallels with Industry: although tools correctly used can provide useful assistance, there are sure-fire ways to increase the possibility of burnout with inappropriate demands:

It does strike me that some of our much-venerated open source systems, like GitHub, have some of these same management anti-patterns: encouraging Maintainers to chase repository stars to show value, having a daily reminder of outstanding PRs and Issues, and showing everyone who visits your home page your contribution records for every project over the last year.

To get back to the main point, again by parallel with Industry, paying people more doesn’t decrease industrial burnout; it may produce a temporary feel-good high, but the backlog pile eventually overcomes this. If someone is already working at full stretch at something they want to do, giving them more money isn’t going to make them stretch further. For hobby maintainers like me, even if you could find a way to pay me that my current employer wouldn’t object to, I’m already devoting as much time as I can spare to my Maintainer projects, so I’m unlikely to find more (although I’m not going to refuse free money …).

Security and Reliability

Everyone wants Maintainers to code securely and reliably and also to respond to bug reports within a fixed SLA. Obviously, open source Maintainers are usually already trying to code securely and reliably, and they aren’t going to do the SLA thing because they don’t have to (as the licence says “NO WARRANTY …”), so paying them won’t improve the former and, if they’re already devoting all the time they can to Maintenance, it won’t achieve the latter either. So how could Security and Reliability be improved? All a maintainer can really do is keep up with current coding techniques (when was the last time someone offered a free course to Maintainers to help with this?). Suggesting to a project that if they truly believed in security they’d rewrite it in Rust from tens of thousands of lines of C really is annoying and unhelpful.

One of the key ways to keep software secure and reliable is to run checkers over the code and do extensive unit and integration testing. The great thing about this is that it can be done as a side project from the main Maintenance task provided someone triages and fixes the issues generated. This latter is key; simply dumping issue reports on an overloaded maintainer makes the overload problem worse and adds to a pile of things they might never get around to. So if you are thinking of providing checker or tester resources, please also think how any generated issues might get resolved without generating more work for a Maintainer.

Business Models around Security and Reliability

A pretty old business model for code checking and testing is to have a distribution do it. The good ones tend to send patches upstream, and their business model is to sell the software (or at least a support licence for it), which gives the recipients an SLA as well. So what’s the problem? Mainly the economics of this tried and trusted model. Firstly, what you want supported must be shipped by a distribution, which means it must have a big enough audience for a distribution to consider it a fairly essential item. Secondly, you end up paying a per-instance use cost that’s an average of everything the distribution ships. The main killer is this per-instance cost, particularly if you are a hyperscaler, so it’s no wonder there’s a lot of pressure to shift the cost from being per instance to per project.

As I said above, Maintainers often really need more help than more money. One good way to start would potentially be to add testing and checking (including bug fixing and upstreaming) services to a project. This would necessarily involve liaising with the maintainer (and could involve an honorarium) but the object should be to be assistive (and thus scalably added) to what the Maintainer is already doing and prevent the service becoming a Maintainer time sink.

Professional Maintainers

Most of the above analysis assumed Maintainers are giving all the time they have available to the project. However, in the case where a Maintainer is doing the project in their spare time or is an Employee of a Company and being paid partly to work on the project and partly on other things, paying them to become a full time Maintainer (thus leaving their current employment) has the potential to add the hours spent on “other things” to the time spent on the project and would thus be a net win. However, you have to also remember that turning from employee to independent contractor also comes with costs in terms of red tape (like health care, tax filings, accounting and so on), which can become a significant time sink, so the net gain in hours to the project might not be as many as one would think. In an ideal world, entities paying maintainers would also consider this problem and offer services to offload the burden (although none currently seem to consider this). Additionally, turning from part time to full time can increase the problem of burn out, particularly if you spend increasing portions of your newly acquired time worrying about admin issues or other problems associated with running your own consulting business.

Conclusions

The obvious conclusion from the above analysis is that paying maintainers mostly doesn’t achieve its stated goals. However, you have to remember that this is looking at the problem through the prism of claimed end results. One thing paying maintainers definitely does do is increase the mechanisms by which maintainers themselves make a living (which is a kind of essential existential precursor). Before paying maintainers became a thing, the only real way of making a living as a maintainer was reputation monetization (corporations paid you to have a maintainer on staff, or because being a maintainer demonstrated a skill set they needed in other aspects of their business), but now a Maintainer also has the option to turn Professional. Increasing the number of ways of rewarding Maintainership should therefore be a net benefit to attracting people into all ecosystems.

In general, I think that paying maintainers is a good thing, but should be the beginning of the search for ways of remunerating Open Source contributors, not the end.

Paul E. Mc Kenney: Parallel Programming: September 2022 Update

1 year, 7 months ago
The v2022.09.25a release of Is Parallel Programming Hard, And, If So, What Can You Do About It? is now available! The double-column version is also available from arXiv.org.

This version boasts an expanded index and API index, and also adds a number of improvements, perhaps most notably boldface for the most pertinent pages for a given index entry, courtesy of Akira Yokosawa. Akira also further improved the new ebook-friendly PDFs, expanded the list of acronyms, updated the build system to allow different perfbook formats to be built concurrently, adjusted for Ghostscript changes, carried out per-Linux-version updates, and did a great deal of formatting and other cleanup.

One of the code samples now uses C11 thread-local storage instead of the GCC __thread storage class, courtesy of Elad Lahav. Elad also added support for building code samples on QNX.

Johann Klähn, SeongJae Park, Xuwei Fu, and Zhouyi Zhou provided many welcome fixes throughout the book.

This release also includes a number of updates to RCU, memory ordering, locking, and non-blocking synchronization, as well as additional information on the combined use of synchronization mechanisms.

Matthew Garrett: Handling WebAuthn over remote SSH connections

1 year, 7 months ago
Being able to SSH into remote machines and do work there is great. Using hardware security tokens for 2FA is also great. But trying to use them both at the same time doesn't work super well, because if you hit a WebAuthn request on the remote machine it doesn't matter how much you mash your token - it's not going to work.

But could it?

The SSH agent protocol abstracts key management out of SSH itself and into a separate process. When you run "ssh-add .ssh/id_rsa", that key is being loaded into the SSH agent. When SSH wants to use that key to authenticate to a remote system, it asks the SSH agent to perform the cryptographic signatures on its behalf. SSH also supports forwarding the SSH agent protocol over SSH itself, so if you SSH into a remote system then remote clients can also access your keys - this allows you to bounce through one remote system into another without having to copy your keys to those remote systems.

More recently, SSH gained the ability to store SSH keys on hardware tokens such as Yubikeys. If configured appropriately, this means that even if you forward your agent to a remote site, that site can't do anything with your keys unless you physically touch the token. But out of the box, this is only useful for SSH keys - you can't do anything else with this support.

Well, that's what I thought, at least. And then I looked at the code and realised that SSH is communicating with the security tokens using the same library that a browser would, except it ensures that any signature request starts with the string "ssh:" (which a genuine WebAuthn request never will). This constraint can actually be disabled by passing -O no-restrict-websafe to ssh-agent, except that was broken until this weekend. But let's assume there's a glorious future where that patch gets backported everywhere, and see what we can do with it.

First we need to load the key into the security token. For this I ended up hacking up the Go SSH agent support. Annoyingly it doesn't seem to be possible to make calls to the agent without going via one of the exported methods here, so I don't think this logic can be implemented without modifying the agent module itself. But this is basically as simple as adding another key message type that looks something like:
type ecdsaSkKeyMsg struct {
	Type        string `sshtype:"17|25"`
	Curve       string
	PubKeyBytes []byte
	RpId        string
	Flags       uint8
	KeyHandle   []byte
	Reserved    []byte
	Comments    string
	Constraints []byte `ssh:"rest"`
}

Where Type is ssh.KeyAlgoSKECDSA256, Curve is "nistp256", RpId is the identity of the relying party (eg, "webauthn.io"), Flags is 0x1 if you want the user to have to touch the key, KeyHandle is the hardware token's representation of the key (basically an opaque blob that's sufficient for the token to regenerate the keypair - this is generally stored by the remote site and handed back to you when it wants you to authenticate). The other fields can be ignored, other than PubKeyBytes, which is supposed to be the public half of the keypair.

This causes an obvious problem. We have an opaque blob that represents a keypair. We don't have the public key. And OpenSSH verifies that PubKeyBytes is a legitimate ecdsa public key before it'll load the key. Fortunately it only verifies that it's a legitimate ecdsa public key, and does nothing to verify that it's related to the private key in any way. So, just generate a new ECDSA key (ecdsa.GenerateKey(elliptic.P256(), rand.Reader)) and marshal it (elliptic.Marshal(ecKey.Curve, ecKey.X, ecKey.Y)) and we're good. Pass that struct to ssh.Marshal() and then make an agent call.

Now you can use the standard agent interfaces to trigger a signature event. You want to pass the raw challenge (not the hash of the challenge!) - the SSH code will do the hashing itself. If you're using agent forwarding this will be forwarded from the remote system to your local one, and your security token should start blinking - touch it and you'll get back an ssh.Signature blob. ssh.Unmarshal() the Blob member to a struct like
type ecSig struct {
	R *big.Int
	S *big.Int
}

and then ssh.Unmarshal the Rest member to

type authData struct {
	Flags    uint8
	SigCount uint32
}

The signature needs to be converted back to a DER-encoded ASN.1 structure, eg:

var b cryptobyte.Builder
b.AddASN1(asn1.SEQUENCE, func(b *cryptobyte.Builder) {
	b.AddASN1BigInt(ecSig.R)
	b.AddASN1BigInt(ecSig.S)
})
signatureDER, _ := b.Bytes()

and then you need to construct the Authenticator Data structure. For this, take the RpId used earlier and generate the sha256. Append the one byte Flags variable, and then convert SigCount to big endian and append those 4 bytes. You should now have a 37 byte structure. This needs to be CBOR encoded (I used github.com/fxamacker/cbor and just called cbor.Marshal(data, cbor.EncOptions{})).

Now base64 encode the sha256 of the challenge data, the DER-encoded signature and the CBOR-encoded authenticator data and you've got everything you need to provide to the remote site to satisfy the challenge.

There are alternative approaches - you can use USB/IP to forward the hardware token directly to the remote system. But that means you can't use it locally, so it's less than ideal. Or you could implement a proxy that communicates with the key locally and have that tunneled through to the remote host, but at that point you're just reinventing ssh-agent.

And you should bear in mind that the default behaviour of blocking this sort of request is for a good reason! If someone is able to compromise a remote system that you're SSHed into, they can potentially trick you into hitting the key to sign a request they've made on behalf of an arbitrary site. Obviously they could do the same without any of this if they've compromised your local system, but there is some additional risk to this. It would be nice to have sensible MAC policies that default-denied access to the SSH agent socket and only allowed trustworthy binaries to do so, or maybe have some sort of reasonable flatpak-style portal to gate access. For my threat model I think it's a worthwhile security tradeoff, but you should evaluate that carefully yourself.

Anyway. Now to figure out whether there's a reasonable way to get browsers to work with this.


Matthew Garrett: Bring Your Own Disaster

1 year, 7 months ago
After my last post, someone suggested that having employers be able to restrict keys to machines they control is a bad thing. So here's why I think Bring Your Own Device (BYOD) scenarios are bad not only for employers, but also for users.

There's obvious mutual appeal to having developers use their own hardware rather than rely on employer-provided hardware. The user gets to use hardware they're familiar with, and which matches their ergonomic desires. The employer gets to save on the money required to buy new hardware for the employee. From this perspective, there's a clear win-win outcome.

But once you start thinking about security, it gets more complicated. If I, as an employer, want to ensure that any systems that can access my resources meet a certain security baseline (eg, I don't want my developers using unpatched Windows ME), I need some of my own software installed on there. And that software doesn't magically go away when the user is doing their own thing. If a user lends their machine to their partner, is the partner fully informed about what level of access I have? Are they going to feel that their privacy has been violated if they find out afterwards?

But it's not just about monitoring. If an employee's machine is compromised and the compromise is detected, what happens next? If the employer owns the system then it's easy - you pick up the device for forensic analysis and give the employee a new machine to use while that's going on. If the employee owns the system, they're probably not going to be super enthusiastic about handing over a machine that also contains a bunch of their personal data. In much of the world the law is probably on their side, and even if it isn't then telling the employee that they have a choice between handing over their laptop or getting fired probably isn't going to end well.

But obviously this is all predicated on the idea that an employer needs visibility into what's happening on systems that have access to their systems, or which are used to develop code that they'll be deploying. And I think it's fair to say that not everyone needs that! But if you hold any sort of personal data (including passwords) for any external users, I really do think you need to protect against compromised employee machines, and that does mean having some degree of insight into what's happening on those machines. If you don't want to deal with the complicated consequences of allowing employees to use their own hardware, it's rational to ensure that only employer-owned hardware can be used.

But what about the employers that don't currently need that? If there's no plausible future where you'll host user data, or where you'll sell products to others who'll host user data, then sure! But if that might happen in future (even if it doesn't right now), what's your transition plan? How are you going to deal with employees who are happily using their personal systems right now? At what point are you going to buy new laptops for everyone? BYOD might work for you now, but will it always?

And if your employer insists on employees using their own hardware, those employees should ask what happens in the event of a security breach. Whose responsibility is it to ensure that hardware is kept up to date? Is there an expectation that security can insist on the hardware being handed over for investigation? What information about the employee's use of their own hardware is going to be logged, who has access to those logs, and how long are those logs going to be kept for? If those questions can't be answered in a reasonable way, it's a huge red flag. You shouldn't have to give up your privacy and (potentially) your hardware for a job.

Using technical mechanisms to ensure that employees only use employer-provided hardware is understandably icky, but it's something that allows employers to impose appropriate security policies without violating employee privacy.


Dave Airlie (blogspot): LPC 2022 GPU BOF (user console and cgroups)

1 year, 7 months ago

 At LPC 2022 we had a BoF session for GPUs with two topics.

Moving to userspace consoles:

Currently most mainline distros still use the kernel console, provided by the VT subsystem. We'd like to move to CONFIG_VT=n, as the console and VT subsystem have historically been a source of bugs and are also nasty places for locking etc. It can also cause an oops to go missing when it takes out the panic path, with locking bugs stopping other paths (like pstore or serial) from completely processing the oops.

The session started by discussing what things would look like. Lennart gave a great summary of the work David did a few years ago and the train of thought involved.

Once you think through all the paths and things you want supported, you realise the best user console is going to be one that supports emojis and non-Latin scripts. This probably means you want a lightweight wayland compositor running a fullscreen VTE based terminal. Working back from the consequences of this means you probably aren't going to want this in systemd, and it should be a separate development.

The other area discussed was the requirements for a panic/emergency kernel console, likely called drmlog; this would just be something to output to the display whenever the kernel panics, or during boot before the user console is loaded.

cgroups for GPU

This has been an ongoing topic where vendors often suggest cgroup patches and, on review, are told they are too vendor-specific and to simplify and try again, only never to try again. We went over the problem space of both memory allocation accounting/enforcement and processing restrictions. We all agreed the local memory accounting was probably the easiest place to start doing something. We also agreed that accounting should be implemented and solid before we start digging into enforcement. We had a discussion about where CPU memory should be accounted, especially around integrated vs discrete graphics, since users might care about the application memory usage apart from the GPU objects usage, which would be CPU on integrated and GPU on discrete. I believe a follow-up hallway discussion also covered a bunch of this with the upstream cgroups maintainer.

Matthew Garrett: git signatures with SSH certificates

1 year, 7 months ago
Last night I complained that git's SSH signature format didn't support using SSH certificates rather than raw keys, and was swiftly corrected, once again highlighting that the best way to make something happen is to complain about it on the internet in order to trigger the universe to retcon it into existence to make you look like a fool. But anyway. Let's talk about making this work!

git's SSH signing support is actually just it shelling out to ssh-keygen with a specific set of options, so let's go through an example of this with ssh-keygen. First, here's my certificate:

$ ssh-keygen -L -f id_aurora-cert.pub
id_aurora-cert.pub:
Type: ecdsa-sha2-nistp256-cert-v01@openssh.com user certificate
Public key: ECDSA-CERT SHA256:(elided)
Signing CA: RSA SHA256:(elided)
Key ID: "mgarrett@aurora.tech"
Serial: 10505979558050566331
Valid: from 2022-09-13T17:23:53 to 2022-09-14T13:24:23
Principals:
mgarrett@aurora.tech
Critical Options: (none)
Extensions:
permit-agent-forwarding
permit-port-forwarding
permit-pty

Ok! Now let's sign something:

$ ssh-keygen -Y sign -f ~/.ssh/id_aurora-cert.pub -n git /tmp/testfile
Signing file /tmp/testfile
Write signature to /tmp/testfile.sig

To verify this we need an allowed signatures file, which should look something like:

*@aurora.tech cert-authority ssh-rsa AAA(elided)

Perfect. Let's verify it:

$ cat /tmp/testfile | ssh-keygen -Y verify -f /tmp/allowed_signers -I mgarrett@aurora.tech -n git -s /tmp/testfile.sig
Good "git" signature for mgarrett@aurora.tech with ECDSA-CERT key SHA256:(elided)

Woo! So, how do we make use of this in git? Generating the signatures is as simple as

$ git config --global commit.gpgsign true
$ git config --global gpg.format ssh
$ git config --global user.signingkey /home/mjg59/.ssh/id_aurora-cert.pub

and then getting on with life. Any commits will now be signed with the provided certificate. Unfortunately, git itself won't handle verification of these - it calls ssh-keygen -Y find-principals which doesn't deal with wildcards in the allowed signers file correctly, and then falls back to verifying the signature without making any assertions about identity. Which means you're going to have to implement this in your own CI by extracting the commit and the signature, extracting the identity from the commit metadata and calling ssh-keygen on your own. But it can be made to work!

But why would you want to? The current approach of managing keys for git isn't ideal - you can kind of piggy-back off github/gitlab SSH key infrastructure, but if you're an enterprise using SSH certificates for access then your users don't necessarily have enrolled keys to start with. And using certificates gives you extra benefits, such as having your CA verify that keys are hardware-backed before issuing a cert. Want to ensure that whoever made a commit was actually on an authorised laptop? Now you can!

I'll probably spend a little while looking into whether it's plausible to make the git verification code work with certificates or whether the right thing is to fix up ssh-keygen -Y find-principals to work with wildcard identities, but either way it's probably not much effort to get this working out of the box.

Edit to add: thanks to this commenter for pointing out that current OpenSSH git actually makes this work already!


Paul E. Mc Kenney: Kangrejos 2022: The Rust for Linux Workshop

1 year, 7 months ago
I had the honor of attending this workshop; however, this is not a full report. Instead, I am reporting on what I learned and how it relates to my So You Want to Rust the Linux Kernel? blog series that I started in 2021, and of which this post is a member. I will leave more complete coverage of a great many interesting sessions to others (for example, here, here, and here), who are probably better versed in Rust than am I in any case.

Device Communication and Memory Ordering

Andreas Hindborg talked about his work on a Rust-language PCI NVMe driver for the Linux kernel x86 architecture. I focus on the driver's interaction with device firmware in main memory. As is often the case, there is a shared structure in normal memory in which the device firmware reports the status of an I/O request. This structure has a special 16-bit field that is used to report that all of the other fields are now filled in, so that the Linux device driver can now safely access them.

This of course requires some sort of memory ordering, both on the part of the device firmware and on the part of the device driver. One straightforward approach is for the driver to use smp_load_acquire() to access that 16-bit field, and only if the returned value indicates that the other fields have been filled in, access those other fields.
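For readers who want to see the shape of that acquire-based approach, here is a C++ sketch; the structure layout and field names are invented for illustration (not the actual NVMe completion format), and the firmware side is assumed to publish the status field last with release ordering.

#include <atomic>
#include <cstdint>

// Hypothetical completion record shared with device firmware.
struct Completion {
    uint32_t result;                // written by the firmware first
    uint16_t extra;
    std::atomic<uint16_t> status;   // written by the firmware last
};

// Returns true and copies out the result once the firmware has marked the
// entry complete.  The acquire load orders the subsequent plain reads after
// the status check, so they see the firmware's earlier writes.
bool try_consume(const Completion& c, uint32_t& result_out) {
    uint16_t s = c.status.load(std::memory_order_acquire);
    if ((s & 0x1) == 0)             // "done" bit not set yet
        return false;
    result_out = c.result;
    return true;
}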

To the surprise of several Linux-kernel developers in attendance, Andreas's driver instead does a volatile load of the entire structure, 16-bit field and all. If the 16-bit field indicates that all the fields have been updated by the firmware, it then does a second volatile load of the entire structure and then proceeds to use the values from this second load. Apparently, the performance degradation from these duplicate accesses, though measurable, is quite small. However, given the possibility of larger structures for which the performance degradation might be more profound, Andreas was urged to come up with a more elaborate Rust construction that would permit separate access to the 16-bit field, thus avoiding the redundant memory traffic.

In a subsequent conversation, Andreas noted that the overall performance degradation of this Rust-language prototype compared to the C-language version is at most 2.61%, and that in one case there is an as-yet-unexplained performance increase of 0.67%. Of course, given that both C and Rust are compiled, statically typed, non-garbage-collected, and handled by the same compiler back-end, it should not be a surprise that they achieve similar performance. An interesting question is whether the larger amount of information available to the Rust compiler will ever allow it to consistently provide better performance than does C.

Rust and Linux-Kernel Filesystems

Wedson Almeida Filho described his work on a Rust implementation of portions of the Linux kernel's 9p filesystem. The part of this effort that caught my attention was the use of the Rust language's asynchronous facilities to implement some of the required state machines and the use of a Rust-language type-aware packet-decoding facility.

So what do these two features have to do with concurrency in general, let alone memory-ordering in particular?

Not much. Sure, call_rcu() is (most of) the async equivalent of synchronize_rcu(), but that should be well-known and thus not particularly newsworthy.

But these two features might have considerable influence over the success or lack thereof for the Rust-for-Linux effort.

In the past, much of the justification for use of Rust in Linux has been memory safety, based on the fact that 70% of the Linux exploits have involved memory unsafety. Suppose that Rust manages to address the entire 70%, as opposed to relegating (say) half of that to Rust-language unsafe code. Then use of Rust might at best provide a factor-of-three improvement.

Of course, a factor of three is quite good in absolute terms. However, in my experience, at least an order of magnitude is usually required to motivate, with high probability, the kind of large change represented by the C-to-Rust transition. Therefore, in order for me to rate the Rust-for-Linux project's success as highly probable, another factor of three is required.

One possible source of the additional factor of three might come from the Rust-for-Linux community giving neglected drivers some much-needed attention. Perhaps the 9p work would qualify, but it seems unlikely that the NVMe drivers are lacking attention.

Perhaps Rust async-based state machines and type-aware packet decoding will go some ways towards providing more of the additional factor of three. Or perhaps a single factor of three will suffice in this particular case. Who knows?

“Pinning” for the Linux Kernel in Rust

Benno Lossin and Gary Guo reported on their respective efforts to implement pinning in Rust for the Linux kernel (a more detailed report may be found here).

But perhaps you, like me, are wondering just what pinning is and why the Linux kernel needs it.

It turns out that the Rust language allows the compiler great freedom to rearrange data structures by default, similar to what would happen in C-language code if each and every structure was marked with the Linux kernel's __randomize_layout directive. In addition, by default, Rust-language data can be moved at runtime.

This apparently allows additional optimizations, but it is not particularly friendly to Linux-kernel idioms that take addresses of fields in structures. Such idioms include pretty much all of Linux's linked-list code, as well as things like RCU callbacks. And who knows, perhaps this situation helps to explain recent hostility towards linked lists expressed on social media. ;-)
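For anyone who has not met the idiom, the following C++ rendering of a Linux-style intrusive list (names invented for the example) shows why address stability matters: neighbouring nodes store the address of the list_head embedded in an object, so silently moving that object would leave those pointers dangling.

struct list_head {
    list_head* prev;
    list_head* next;
};

inline void list_init(list_head* head) { head->prev = head->next = head; }

// Insert 'item' immediately after 'head'.  After this call, 'head' and its
// old successor both hold the raw address of 'item'; relocating the object
// that contains 'item' (as can happen to unpinned Rust data) would leave
// them pointing at stale memory.
inline void list_add(list_head* item, list_head* head) {
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

struct job {
    int id;
    list_head node;   // other jobs point at &job.node inside this object
};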

The Rust language accommodates these idioms by allowing data to be pinned, thus preventing the compiler from moving it around.

However, pinning data has consequences, including the fact that such pinning is visible in Rust's type system. This visibility allows Rust compilers to complain when (for example) list_for_each_entry() is passed an unpinned data structure. Such complaints are important, given that passing unpinned data to primitives expecting pinned data would result in memory unsafety. The code managing the pinning is expected to be optimized away, but there are concerns that it would make itself all too apparent during code review and debugging.

Benno's and Gary's approaches were compared and contrasted, but my guess is that additional work will be required to completely solve this problem. Some attendees would like to see pinning addressed at the Rust-language level rather than at the library level, but time will tell what approach is most appropriate.

RCU Use Cases

Although RCU was not an official topic, for some reason it came up frequently during the hallway tracks that I participated in. Which is probably a good thing. ;-)

One question is exactly how the various RCU use cases should be represented in Rust. Rust's atomic-reference-count facility has been put forward, and it is quite possible that, in combination with liberal use of atomics, it will serve for some of the more popular use cases. Wedson suggested that Revocable covers another group of use cases, and at first glance, it appears that Wedson might be quite right. Still others might be handled by wait-wakeup mechanisms. There are still some use cases to be addressed, but some progress is being made.

Rust and Python

Some people suggested that Rust might take over from Python, adding that Rust solves many problems that Python is said to encounter in large-scale programs, and that Rust should prove more ergonomic than Python. The first is hard to argue with, given that Python seems to be most successful in smaller-scale projects that make good use of Python's extensive libraries. The second seems quite unlikely to me; in fact, it is hard for me to imagine that anything about Rust would seem in any way ergonomic to a dyed-in-the-wool Python developer.

One particularly non-ergonomic attribute of current Rust is likely to be its libraries, which last I checked were considerably less extensive than those of Python. Although the software-engineering shortcomings of large-scale Python projects can and have motivated conversion to Rust, it appears to me that smaller Python projects making heavier use of a wider variety of libraries might need more motivation.

Automatic conversion of Python libraries, anyone? ;-)

Conclusions

I found Kangrejos 2022 to be extremely educational and thought-provoking, and am very glad that I was able to attend. I am still uncertain about Rust's prospects in the Linux kernel, but the session provided some good examples of actual work done and problems solved. Perhaps Rust's async and type-sensitive packet-decoding features will help increase its probability of success, and perhaps there are additional Rust features waiting to be applied, for example, use of Rust iterators to simplify lockdep cycle detection. And who knows? Perhaps future versions of Rust will be able to offer consistent performance advantages. Stranger things have happened!

History

September 19, 2022: Changes based on LPC and LWN reports.

Linux Plumbers Conference: LPC 2022 Open for Last-Minute On-Site Registrations

1 year, 8 months ago

Against all expectations, we have now worked through the entire waitlist, courtesy of some last-minute cancellations due to late-breaking corporate travel restrictions. We are therefore re-opening general registration. So if you can somehow arrange to be in Dublin on September 12-14 at this late date, please register.

Virtual registration has never closed, and is still open.

Either way, we look forward to seeing you there!

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XIX: Concurrent Computational Models

1 year, 8 months ago
The September 2022 issue of the Communications of the ACM features an entertaining pair of point/counterpoint articles, with William Dally advocating for concurrent computational models that more accurately model both energy costs and latency here and Uzi Vishkin advocating for incremental modifications to the existing Parallel Random Access Machine (PRAM) model here. Some cynics might summarize Dally's article as “please develop software that makes better use of our hardware so that we can sell you more hardware” while other cynics might summarize Vishkin's article as “please keep funding our research project”. In contrast, I claim that both articles are worth a deeper look. The next section looks at Dally's article, the section after that at Vishkin's article, followed by discussion.

TL;DR: In complete consonance with a depressingly large fraction of 21st-century discourse, they are talking right past each other.

On the Model of Computation: Point: We Must Extend Our Model of Computation to Account for Cost and Location

Dally makes a number of good points. First, he notes that past decades have seen a shift from computation being more expensive than memory references to memory references being more expensive than computation, and by a good many orders of magnitude. Second, he points out that in production, constant factors can be more important than asymptotic behavior. Third, he describes situations where hardware caches are not a silver bullet. Fourth, he calls out the necessity of appropriate computational models for performance programming, though to his credit, he does clearly state that not all programming is performance programming. Fifth, he calls out the global need for reduced energy expended on computation, motivated by projected increases in computation-driven energy demand. Sixth and finally, he advocates for a new computational model that builds on PRAM by (1) assigning different costs to computation and to memory references and (2) making memory-reference costs a function of a distance metric based on locations assigned to processing units and memories.

On the Model of Computation: Counterpoint: Parallel Programming Wall and Multicore Software Spiral: Denial Hence Crisis

Vishkin also makes some interesting points. First, he calls out the days of Moore's-Law frequency scaling, calling for the establishment of a similar “free lunch” scaling in terms of numbers of CPUs. He echoes Dally's point about not all programming being performance programming, but argues that in addition to a “memory wall” and an “energy wall”, there is also a “parallel programming wall” because programming multicore systems is “simply too difficult”. Second, Vishkin calls on hardware vendors to take more account of ease of use when creating computer systems, including systems with GPUs. Third, Vishkin argues that compilers should take up much of the burden of squeezing performance out of hardware. Fourth, he makes several arguments that concurrent systems should adapt to PRAM rather than adapting PRAM to existing hardware. Fifth and finally, he calls for funding for a “multicore software spiral” that he hopes will bring back free-lunch performance increases driven by Moore's Law, but based on increasing numbers of cores rather than increasing clock frequency.

Discussion
Figure 2.3 of Is Parallel Programming Hard, And, If So, What Can You Do About It? (AKA “perfbook”) shows the “Iron Triangle” of concurrent programming. Vishkin focuses primarily on the upper edge of this triangle, which is primarily about productivity. Dally focuses primarily on the lower left edge of this triangle, which is primarily about performance. Except that Vishkin's points about the cost of performance programming are all too valid, which means that defraying the cost of performance programming in the work-a-day world requires very large numbers of users, which is often accomplished by making performance-oriented concurrent software quite general, thus amortizing its cost over large numbers of use cases and users. This puts much performance-oriented concurrent software at the lower tip of the iron triangle, where performance meets generality.

One could argue that Vishkin's proposed relegation of performance-oriented code to compilers is one way to break the iron triangle of concurrent programming, but this argument fails because compilers are themselves low-level software (thus at the lower tip of the triangle). Worse yet, there have been many attempts to put the full concurrent-programming burden on the compiler (or onto transactional memory), but rarely to good effect. Perhaps SQL is the exception that proves this rule, but this is not an example that Vishkin calls out.

Both Dally and Vishkin correctly point to the ubiquity of multicore systems, so it is only reasonable to look at how these systems are handled in practice. After all, it is only fair to check their apparent assumption of “badly”. And Section 2.3 of perfbook lists several straightforward approaches: (1) Using multiple instances of a sequential application, (2) Using the large quantity of existing parallel software, ranging from operating-system kernels to database servers (SQL again!) to numerical software to image-processing libraries to machine-learning libraries, to name but a few of the areas that Python developers could teach us about, and (3) Applying sequential performance optimizations such that single-threaded performance suffices. Those taking door 1 or door 2 will need only a few parallel performance programmers, and it seems to have proven eminently feasible to have those few use a sufficiently accurate computational model. The remaining users simply invoke the software produced by these parallel performance programmers, and need not worry much about concurrency. The cases where none of these three doors apply are more challenging, but surmounting the occasional challenge is part of the programming game, not just the parallel-programming game.

One additional historical trend is the sharply decreasing cost of concurrent systems, both in terms of purchase price and energy consumption. Where an entry-level late-1980s parallel system cost millions of dollars and consumed kilowatts, an entry-level 2022 system can be had for less than $100. This means that there is no longer an overwhelming economic imperative to extract every ounce of performance from most year-2022 concurrent systems. For example, much smartphone code attains the required performance while running single-threaded, which means that additional performance from parallelism could well be a waste of energy. Instead, the smartphone automatically powers down the unused hardware, thus providing only the performance that is actually needed, and does so in an energy-efficient manner. The occasional heavy-weight application might turn on all the CPUs, but only such applications need the ministrations of parallel performance programmers and complex computational models.

Thus, Dally and Vishkin are talking past each other, describing different types of software that is produced by different types of developers, who can each use whatever computational model is appropriate to the task at hand.

Which raises the question of whether there might be a one-size-fits-all computational model. I have my doubts. In my 45 years of supporting myself and (later) my family by developing software, I have seen many models, frameworks, languages, and libraries come and go. In contrast, to the best of my knowledge, the speed of light and the sizes of the various types of atoms have remained constant. Don't get me wrong, I do hope that the trend towards vertical stacking of electronics within CPU chips will continue, and I also hope that this trend will result in smaller systems that are correspondingly less constrained by speed-of-light and size-of-atoms limitations. However, the laws of physics appear to be exceedingly stubborn things, and we would therefore be well-advised to work together. Dally's attempt to dictate current hardware conveniences to software developers is about as likely to succeed as is Vishkin's attempt to dictate current software conveniences to hardware developers. Both of these approaches are increasingly likely to fail should the trend towards special-purpose hardware continue full force. Software developers cannot always ignore the laws of physics, and by the same token, hardware developers cannot always ignore the large quantity of existing software and the large number of existing software developers. To be sure, in order to continue to break new ground, both software and hardware developers will need to continue to learn new tricks, but many of these tricks will require coordinated hardware/software effort.

Sadly, there is no hint in either Dally's or Vishkin's article of any attempt to learn what the many successful developers of concurrent software actually do. Thus, it is entirely possible that neither Dally's nor Vishkin's preferred concurrent computational model is applicable to actual parallel programming practice. In fact, in my experience, developers often use the actual hardware and software as the model, driving their development and optimization efforts from actual measurements and from intuitions built up from those measurements. Some might look down on this approach as being both non-portable and non-mathematical, but portability is often achieved simply by testing on a variety of hardware, courtesy of the relatively low prices of modern computer systems. As for mathematics, those of us who appreciate mathematics would do well to admit that it is often much more efficient to let the objective universe carry out the mathematical calculations in its normal implicit and ineffable manner, especially given the complexity of present day hardware and software.

Circling back to the cynics, there are no doubt some cynics that would summarize this blog post as “please download and read my book”. Except that in contrast to the hypothesized cynical summarization of Dally's and Vishkin's articles, you can download and read my book free of charge. Which is by design, given that the whole point is to promulgate information. Cue cynics calling out which of the three of us is the greater fool! ;-)

Linux Plumbers Conference: LPC 2022 Evening Events Announcement

1 year 8 months ago

Linux Plumbers Conference 2022 will have evening events on Monday September 12 from 19:30 to 22:30 and on Wednesday September 14, again from 19:30 to 22:30. Tuesday is on your own, and we anticipate that a fair number of the people registered both for LPC and OSS EU might choose to attend their evening event on that day. Looking forward to seeing you all in Dublin the week after this coming one!

Paul E. Mc Kenney: Stupid SMP Tricks: A Review of Locking Engineering Principles and Hierarchy

1 year 8 months ago
Daniel Vetter put together a pair of intriguing blog posts entitled Locking Engineering Principles and Locking Engineering Hierarchy. These appear to be an attempt to establish a set of GPU-wide or perhaps even driver-tree-wide concurrency coding conventions.

Which would normally be none of my business. After all, to establish such conventions, Daniel needs to negotiate with the driver subsystem's developers and maintainers, and I am neither. Except that he did call me out on Twitter on this topic. So here I am, as promised, offering color commentary and the occasional suggestion for improvement, both of Daniel's proposal and of the kernel itself. The following sections review his two posts, and then summarize and amplify suggestions for improvement.

Locking Engineering Principles
“Make it Dumb” should be at least somewhat uncontroversial, as this is quite similar to a rather more flamboyant sound bite from my 1970s Mechanical Engineering university coursework. Kernighan's Law, asserting that debugging is twice as hard as coding, should also serve as a caution.

This leads us to “Make it Correct”, which many might argue should come first in the list. After all, if correctness is not a higher priority than dumbness (or simplicity, if you prefer), then the do-nothing null solution would always be preferred. And to his credit, Daniel does explicitly state that “simple doesn't necessarily mean correct”.

It is hard to argue with Daniel's choosing to give lockdep place of pride, especially given how many times it has saved me over the years, and not just from deadlock. I never have felt the need to teach lockdep RCU's locking rules, but then again, RCU is much more monolithic than is the GPU subsystem. Perhaps if RCU's locking becomes more ornate, I would also feel the need to acquire RCU's key locks in the intended order at boot time. Give or take the fact that much of RCU can be used very early, even before the call to rcu_init().

The validation point is also a good one, with Daniel calling out might_lock(), might_sleep(), and might_alloc(). I go further and add lockdep_assert_irqs_disabled(), lockdep_assert_irqs_enabled(), and lockdep_assert_held(), but perhaps these additional functions are needed in low-level code like RCU more than they are in GPU drivers.
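
For example, here is a minimal sketch, with hypothetical structure and function names, of how lockdep_assert_held() and might_alloc() can document and enforce a function's locking and calling-context requirements:

struct foo_dev {                        /* hypothetical device structure */
        struct mutex lock;
        int state;
};

/* Callers must hold foo->lock; lockdep verifies this in debug builds. */
static void foo_set_state(struct foo_dev *foo, int new_state)
{
        lockdep_assert_held(&foo->lock);
        foo->state = new_state;
}

/* Complain at the call site if a sleeping allocation might be attempted
 * from atomic context, even when this particular allocation would succeed. */
static void *foo_alloc_scratch(gfp_t gfp)
{
        might_alloc(gfp);
        return kmalloc(64, gfp);
}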

Daniel's admonishments to avoid reinventing various synchronization wheels of course comprise excellent common-case advice. And if you believe that your code is the uncommon case, you are most likely mistaken. Furthermore, it is hard to argue with “pick the simplest lock that works”.

It is hard to argue with the overall message of “Make it Fast”, especially the bit about the pointlessness of crashing faster. However, a minimum level of speed is absolutely required. For a 1977 example, an old school project required keeping up with the display. If the code failed to keep up with the display, that code was useless. More recent examples include deep sub-second request-response latency requirements in many Internet datacenters. In other words, up to a point, performance is not simply a “Make it Fast” issue, but also a “Make it Correct” issue. On the other hand, once the code meets its performance requirements, any further performance improvements instead fall into the lower-priority “Make it Fast” bucket.

Except that things are not always quite so simple. Not all performance improvements can be optimized into existence late in the game. In fact, if your software's performance requirements are expected to tax your system's resources, it will be necessary to make performance a first-class consideration all the way up to and including the software-architecture level, as discussed here. And, to his credit, Daniel hints at this when he notes that “the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts”. He also helpfully calls out io_uring as one avenue for radical updates.

The “Protect Data, not Code” principle is good general advice and goes back to Jack Inman's 1985 USENIX paper entitled “Implementing Loosely Coupled Functions on Tightly Coupled Engines”, if not further. However, there are times when what needs to be locked is not data, but perhaps a particular set of state transitions, or maybe a rarely accessed state, which might be best represented by the code for that state. That said, there can be no denying the big-kernel-lock hazards of excessive reliance on code locking. Furthermore, I have only rarely needed abstract non-data locking.

Nevertheless, a body of code as large as the full set of Linux-kernel device drivers should be expected to have a large number of exceptions to the “Protect Data” rule of thumb, reliable though that rule has proven in my own experience. The usual way of handling this is a waiver process. After all, given that every set of safety-critical coding guidelines I have seen has a waiver process, it only seems reasonable that device-driver concurrency coding conventions should also involve a waiver process. But that decision belongs to the device-driver developers and maintainers, not with me.

Locking Engineering Hierarchy
Most of the Level 0: No Locking section should be uncontroversial: Do things the easy way, at least when the easy way works. This approach goes back decades, for but one example, single-threaded user-interface code delegating concurrency to a database management system. And it should be well-understood that complicated things can be made easy (or at least easier) through use of carefully constructed and time-tested APIs, for example, the queue_work() family of APIs called out in this section. One important question for all such APIs is this: “Can someone with access to only the kernel source tree and a browser quickly find and learn how to use the given API?” After all, people are able to properly use the APIs they see elsewhere if they can quickly and easily learn about them.

And yes, given Daniel's comments regarding struct dma_fence later in this section, the answer for SLAB_TYPESAFE_BY_RCU appears to be “no” (but see Documentation/RCU/rculist_nulls.rst). The trick with SLAB_TYPESAFE_BY_RCU is that readers must validate the allocation. A common way of doing this is to have a reference count that is set to the value one immediately after obtaining the object from kmem_cache_alloc() and set to the value zero immediately before passing the object to kmem_cache_free(). And yes, I have a patch queued to fix the misleading text in Documentation/RCU/whatisRCU.rst, and I apologize to anyone who might have attempted to use locking without a reference count. But please also let the record show that there was no bug report.

A reader attempting to obtain a new lookup-based reference to a SLAB_TYPESAFE_BY_RCU object must do something like this (sketched in code just after this list):

  1. atomic_inc_not_zero(), which will return false and not do the increment if the initial value was zero. Presumably dma_fence_get_rcu() handles this for struct dma_fence.
  2. If the value returned above was false, pretend that the lookup failed. Otherwise, continue with the following steps.
  3. Check the identity of the object. If this turns out to not be the desired object, release the reference (cleaning up if needed) and pretend that the lookup failed. Otherwise, continue. Presumably dma_fence_get_rcu_safe() handles this for struct dma_fence, in combination with its call to dma_fence_put().
  4. Use the object.
  5. Release the reference, cleaning up if needed.
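
Pulling these steps together, a minimal sketch might look as follows; the object layout, obj_hash_find(), and obj_put() are hypothetical stand-ins rather than anything in the dma_fence code:

struct obj {                                    /* lives in a SLAB_TYPESAFE_BY_RCU cache */
        atomic_t ref;                           /* zero while free, nonzero while live */
        unsigned long key;                      /* identity, rechecked after the lookup */
};

static struct obj *obj_lookup(unsigned long key)
{
        struct obj *p;

        rcu_read_lock();
        p = obj_hash_find(key);                 /* hypothetical lockless search */
        if (p && !atomic_inc_not_zero(&p->ref)) /* step 1 */
                p = NULL;                       /* step 2: the object was being freed */
        rcu_read_unlock();

        if (p && READ_ONCE(p->key) != key) {    /* step 3: recheck the identity */
                obj_put(p);                     /* hypothetical: drop the reference */
                p = NULL;
        }
        return p;                               /* steps 4 and 5 are up to the caller */
}
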
This of course underscores Daniel's point (leaving aside the snark), which is that you should not use SLAB_TYPESAFE_BY_RCU unless you really need to, and even then you should make sure that you know how to use it properly.

But just when do you need to use SLAB_TYPESAFE_BY_RCU? One example is when you need RCU protection on such a hot fastpath that you cannot tolerate RCU-freed objects becoming cold in the CPU caches due to RCU's grace-period delays. Another example is where objects are being allocated and freed at such a high rate that if each and every kmem_cache_free() was subject to grace-period delays, excessive memory would be tied up waiting for grace periods, perhaps even resulting in out-of-memory (OOM) situations.

In addition, carefully constructed semantics are required. Semantics based purely on ordering tend to result in excessive conceptual complexity and thus in confusion. More appropriate semantics tend to be conditional, for example: (1) Objects that remain in existence during the full extent of the lookup are guaranteed to be found, (2) Objects that do not exist at any time during the full extent of the lookup are guaranteed not to be found, and (3) Objects that exist for only a portion of the lookup might or might not be found.

Does struct dma_fence need SLAB_TYPESAFE_BY_RCU? Is the code handling struct dma_fence doing the right thing? For the moment, I will just say that: (1) The existence of dma_fence_get_rcu_safe() gives me at least some hope, (2) Having two different slab allocators stored into different static variables having the same name is a bit code-reader-unfriendly, and (3) The use of SLAB_TYPESAFE_BY_RCU for a structure that contains an rcu_head structure is a bit unconventional, but perhaps these structures are sometimes obtained from kmalloc() instead of from kmem_cache_alloc().

One thing this section missed (or perhaps intentionally omitted): If you do need memory barriers, you are almost always better off using smp_store_release() and smp_load_acquire() than the old-style smp_wmb() and smp_rmb().
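
For instance, a rough sketch of a release/acquire pairing, with made-up names:

struct msg {
        int data;
        int ready;
};

/* Writer: initialize the payload, then publish ->ready with release semantics. */
static void msg_publish(struct msg *m, int v)
{
        m->data = v;
        smp_store_release(&m->ready, 1);        /* pairs with the acquire below */
}

/* Reader: an acquire load of ->ready guarantees that ->data is initialized
 * whenever ->ready is observed to be nonzero. */
static bool msg_try_read(struct msg *m, int *v)
{
        if (!smp_load_acquire(&m->ready))       /* pairs with the release above */
                return false;
        *v = m->data;
        return true;
}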

The Level 1: Big Dumb Lock section certainly summarizes the pain of finding the right scope for your locks. Despite Daniel's suggesting that lockless tricks are almost always the wrong approach, such tricks are in common use. I would certainly agree that randomly hacking lockless tricks into your code will almost always result in disaster. In fact, this process will use your own intelligence against you: The smarter you think you are, the deeper a hole you will have dug for yourself before you realize that you are in trouble.

Therefore, if you think that you need a lockless trick, first actually measure the performance. Second, if there really is a performance issue, do the work to figure out which code and data are actually responsible, because your blind guesses will often be wrong. Third, read documentation, look at code, and talk to people to learn and fully understand how this problem has been solved in the past, then carefully apply those solutions. Fourth, if there is no solution, review your overall design, because an ounce of proper partitioning is worth some tons of clever lockless code. Fifth, if you must use a new solution, verify it beyond all reason. The code in kernel/rcu/rcutorture.c will give you some idea of the level of effort required.

The Level 2: Fine-grained Locking sections provide some excellent advice, guidance, and cautionary tales. Yes, lockdep is a great thing, but there are deadlocks that it does not detect, so some caution is still required.

The yellow-highlighted Locking Antipattern: Confusing Object Lifetime and Data Consistency describes how holding locks across non-memory-style barrier functions, that is, things like flush_work() as opposed to things like smp_mb(), can cause problems, up to and including deadlock. In contrast, whatever other problems memory-barrier functions such as smp_mb() cause, it does take some creativity to add them to a lock-based critical section so as to cause them to participate in a deadlock cycle. I would not consider flush_work() to be a memory barrier in disguise, although it is quite true that a correct high-performance implementation of flush_work() will require careful memory ordering.

I would have expected a rule such as “Don't hold a lock across flush_work() that is acquired in any corresponding workqueue handler”, but perhaps Daniel is worried about future deadlocks as well as present-day deadlocks. After all, if you do hold a lock across a call to flush_work(), perhaps there will soon be some compelling reason to acquire that lock in a workqueue handler, at which point it is game over due to deadlock.

This issue is of course by no means limited to workqueues. For but one example, it is quite possible to generate similar deadlocks by holding spinlocks across calls to del_timer_sync() that are acquired within timer handlers.
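
As a rough illustration, with hypothetical names, the following teardown path can deadlock because del_timer_sync() must wait for a timer handler that may itself be spinning on the very lock the caller holds:

struct my_dev {
        spinlock_t lock;
        struct timer_list timer;
};

static void my_dev_timer_fn(struct timer_list *t)
{
        struct my_dev *dev = from_timer(dev, t, timer);

        spin_lock(&dev->lock);
        /* ... update device state ... */
        spin_unlock(&dev->lock);
}

static void my_dev_teardown(struct my_dev *dev)
{
        spin_lock(&dev->lock);
        /* ... */
        del_timer_sync(&dev->timer);    /* BAD: may wait forever for my_dev_timer_fn() */
        spin_unlock(&dev->lock);
}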

And that is why both flush_work() and del_timer_sync() tell lockdep what they are up to. For example, flush_work() invokes __flush_work() which in turn invokes lock_map_acquire() and lock_map_release() on a fictitious lock, and this same fictitious lock is acquired and released by process_one_work(). Thus, if you acquire a lock in a workqueue handler that is held across a corresponding flush_work(), lockdep will complain, as shown in the 2018 commit 87915adc3f0a ("workqueue: re-add lockdep dependencies for flushing") by Johannes Berg. Of course, del_timer_sync() uses this same trick, as shown in the 2009 commit 6f2b9b9a9d75 ("timer: implement lockdep deadlock detection"), also by Johannes Berg.

Nevertheless, deadlocks involving flush_work() and workqueue handlers can be subtle. It is therefore worth investing some up-front effort to avoid them.

The orange-highlighted Level 2.5: Splitting Locks for Performance Reasons discusses splitting locks. As the section says, there are complications. For one thing, lock acquisitions are anything but free, having overheads of hundreds or thousands of instructions at best, which means that adding additional levels of finer-grained locks can actually slow things down. Therefore, Daniel's point about prioritizing architectural restructuring over low-level synchronization changes is an extremely good one, and one that is all too often ignored.

As Daniel says, when moving to finer-grained locking, it is necessary to avoid increasing the common-case number of locks being acquired. As one might guess from the tone of this section, this is not necessarily easy. Daniel suggests reader-writer locking, which can work well in some cases, but suffers from performance and scalability limitations, especially in situations involving short read-side critical sections. The fact that readers and writers exclude each other can also result in latency/response-time issues. But again, reader-writer locking can be a good choice in some cases.

The last paragraph is an excellent cautionary tale. Take it from Daniel: Never let userspace dictate the order in which the kernel acquires locks. Otherwise, you, too, might find yourself using wait/wound mutexes. Worse yet, if you cannot set up a two-phase locking discipline in which all required locks are acquired before any work is done, you might find yourself writing deadlock-recovery code, or wounded-mutex recovery code, if you prefer. This recovery code can be surprisingly complex. In fact, one of the motivations for the early 1990s introduction of RCU into DYNIX/ptx was that doing so allowed deletion of many thousands of lines of such recovery code, along with all the yet-undiscovered bugs that code contained.

The red-highlighted Level 3: Lockless Tricks section begins with the ominous sentence “Do not go here wanderer!”

And I agree. After all, if you are using the facilities discussed in this section, namely RCU, atomics, preempt_disable(), local_bh_disable(), local_irq_save(), or the various memory barriers, you had jolly well better not be wandering!!!

Instead, you need to understand what you are getting into and you need to have a map in the form of a principled design and a careful well-validated implementation. And new situations may require additional maps to be made, although there are quite a few well-used maps already in Documentation/rcu and over here. In addition, as Daniel says, algorithmic and architectural fixes can often provide much better results than can lockless tricks applied at low levels of abstraction. Past experience suggests that some of those algorithmic and architectural fixes will involve lockless tricks on fastpaths, but life is like that sometimes.

It is now time to take a look at Daniel's alleged antipatterns.

The “Locking Antipattern: Using RCU” section does have some “interesting” statements:

  1. RCU is said to mix up lifetime and consistency concerns. It is not clear exactly what motivated this statement, but this view is often a symptom of a strictly temporal view of RCU. To use RCU effectively, one must instead take a combined spatio-temporal view, as described here and here.
  2. rcu_read_lock() is said to provide both a read-side critical section and to extend the lifetime of any RCU-protected object. This is true in common use cases because the whole purpose of an RCU read-side critical section is to ensure that any RCU-protected object that was in existence at any time during that critical section remains in existence up to the end of that critical section. It is not clear what distinction Daniel is attempting to draw here. Perhaps he likes the determinism provided by full mutual exclusion. If so, never forget that the laws of physics dictate that such determinism is often surprisingly expensive.
  3. The next paragraph is a surprising assertion that RCU readers' deadlock immunity is a bad thing! Never forget that although RCU can be used to replace reader-writer locking in a great many situations, RCU is not reader-writer locking. Which is a good thing from a performance and scalability viewpoint as well as from a deadlock-immunity viewpoint.
  4. In a properly designed system, locks and RCU are not “papering over” lifetime issues, but instead properly managing object lifetimes. And if your system is not properly designed, then any and all facilities, concurrent or not, are weapons-grade dangerous. Again, perhaps more maps are needed to help those who might otherwise wander into improper designs. Or perhaps existing maps need to be more consistently used.
  5. RCU is said to practically force you to deal with “zombie objects”. Twitter discussions with Daniel determined that such “zombie objects” can no longer be looked up, but are still in use. Which means that many use cases of good old reference counting also force you to deal with zombie objects: After all, removing an object from its search structure does not invalidate the references already held on that object. But it is quite possible to avoid zombie objects for both RCU and for reference counting through use of per-object locks, as is done in the Linux-kernel code that maps from a System-V semaphore ID to the corresponding in-kernel data structure, first described in Section of this paper. In short, if you don't like zombies, there are simple RCU use cases that avoid them. You get RCU's speed and deadlock immunity within the search structure, but full ordering, consistency, and zombie-freedom within the searched-for object. (A minimal sketch of this pattern appears just after this list.)
  6. The last bullet in this section seems to argue that people should upgrade from RCU read-side critical sections to locking or reference counting as quickly as possible. Of course, such locking or reference counting can add problematic atomic-operation and cache-miss overhead to those critical sections, so specific examples would be helpful. One non-GPU example is the System-V semaphore ID example mentioned above, where immediate lock acquisition is necessary to provide the necessary System-V semaphore semantics.
  7. On freely using RCU, again, proper design is required, and not just with RCU. One can only sympathize with a driver dying in synchronize_rcu(). Presumably Daniel means that one of the driver's tasks hung in synchronize_rcu(), in which case there should have been an RCU CPU stall warning message which would point out what CPU or task was stuck in an RCU read-side critical section. It is quite easy to believe that diagnostics could be improved, both within RCU and elsewhere, but if this was intended to be a bug report, it is woefully insufficient.
  8. It is good to see that Daniel found at least one RCU use case that he likes (or at least doesn't hate too intensely), namely xarray lookups combined with kref_get_unless_zero() and kfree_rcu(). Perhaps this is a start, especially if the code following that kref_get_unless_zero() invocation does additional atomic operations on the xarray object, which would hide at least some of the kref_get_unless_zero() overhead.
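
As promised in the zombie-objects item above, here is a minimal sketch of the RCU-plus-per-object-lock pattern; the xarray-based table and the names are illustrative rather than being the actual System-V semaphore code:

struct sem_obj {
        spinlock_t lock;
        bool deleted;                   /* set under ->lock before removal from the table */
        /* ... payload ... */
};

static struct sem_obj *sem_obj_lock(struct xarray *table, unsigned long id)
{
        struct sem_obj *obj;

        rcu_read_lock();
        obj = xa_load(table, id);
        if (!obj) {
                rcu_read_unlock();
                return NULL;
        }
        spin_lock(&obj->lock);
        if (obj->deleted) {             /* lost a race with removal: no zombie access */
                spin_unlock(&obj->lock);
                rcu_read_unlock();
                return NULL;
        }
        rcu_read_unlock();              /* ->lock now pins the object */
        return obj;                     /* the caller operates on obj, then unlocks it */
}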

The following table shows how intensively the RCU API is used by various v5.19 kernel subsystems:

Subsystem    Uses         LoC  Uses/KLoC
---------   -----  ----------  ---------
ipc            91       9,822       9.26
virt           68       9,013       7.54
net          7457   1,221,681       6.10
security      599     107,622       5.57
kernel       1796     423,581       4.24
mm            324     170,176       1.90
init            8       4,236       1.89
block         108      65,291       1.65
lib           319     214,291       1.49
fs           1416   1,470,567       0.96
include       836   1,167,274       0.72
drivers      5596  20,861,746       0.27
arch          546   2,189,975       0.25
crypto          6     102,307       0.06
sound          21   1,378,546       0.02
---------   -----  ----------  ---------
Total       19191  29,396,128       0.65
As you can see, the drivers subsystem has the second-highest total number of RCU API uses, but it also has by far the largest number of lines of code. As a result, the RCU usage intensity in the drivers subsystem is quite low, at about 27 RCU uses per 100,000 lines of code. But this view is skewed, as can be seen by looking more deeply within the drivers subsystem, leaving out (aside from in the Total line) any drivers containing fewer than 100 instances of the RCU API:

Subsystem            Uses         LoC  Uses/KLoC
------------------   ----  ----------  ---------
drivers/target        293      62,532       4.69
drivers/block         355      97,265       3.65
drivers/md            334     147,881       2.26
drivers/infiniband    548     434,430       1.26
drivers/net          2607   4,381,955       0.59
drivers/staging       114     618,763       0.18
drivers/scsi          150   1,011,146       0.15
drivers/gpu           399   5,753,571       0.07
------------------   ----  ----------  ---------
Total                5596  20,861,746       0.27
The drivers/infiniband and drivers/net subtrees account for more than half of the RCU usage in the Linux kernel's drivers, and could be argued to be more about networking than about generic device drivers. And although drivers/gpu comes in third in terms of RCU usage, it also comes in first in terms of lines of code, making it one of the least intense users of RCU. So one could argue that Daniel already has his wish, at least within the confines of drivers/gpu.

This data suggests that Daniel might usefully consult with the networking folks in order to gain valuable guidelines on the use of RCU and perhaps atomics and memory barriers as well. On the other hand, it is quite possible that such consultations actually caused some of Daniel's frustration. You see, a system implementing networking must track the state of the external network, and it can take many seconds or even minutes for changes in that external state to propagate to that system. Therefore, expensive synchronization within that system is less useful than one might think: No matter how many locks and mutexes that system acquires, it cannot prevent external networking hardware from being reconfigured or even from failing completely.

Moving to drivers/gpu, in theory, if there are state changes initiated by the GPU hardware without full-system synchronization, networking RCU usage patterns should apply directly to GPU drivers. In contrast, there might be significant benefits from more tightly synchronizing state changes initiated by the system. Again, perhaps the aforementioned System-V semaphore ID example can help in such cases. But to be fair, given Daniel's preference for immediately acquiring a reference to RCU-protected data, perhaps drivers/gpu code is already taking this approach.

The “Locking Antipattern: Atomics” section is best taken point by point:

  1. For good or for ill, the ordering (or lack thereof) of Linux kernel atomics predates C++ atomics by about a decade. One big advantage of the Linux-kernel approach is that all accesses to atomic*_t variables are marked. This is a great improvement over C++, where a sequentially consistent load or store looks just like a normal access to a local variable, which can cause a surprising amount of confusion.
  2. Please please please do not “sprinkle” memory barriers over the code!!! Instead, actually design the required communication and ordering, and then use the best primitives for the job. For example, instead of “smp_mb__before_atomic(); atomic_inc(&myctr); smp_mb__after_atomic();”, maybe you should consider invoking atomic_inc_return(), discarding the return value if it is not needed. (See the sketch following this list.)
  3. Indeed, some atomic functions operate on non-atomic*_t variables. But in many cases, you can use their atomic*_t counterparts. For example, instead of READ_ONCE(), atomic_read(). Instead of WRITE_ONCE(), atomic_set(). Instead of cmpxchg(), atomic_cmpxchg(). Instead of set_bit(), in many situations, atomic_or(). On the other hand, I will make no attempt to defend the naming of set_bit() and __set_bit().
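
A small sketch of both points, using a hypothetical event counter:

static atomic_t nr_events = ATOMIC_INIT(0);

/* One fully ordered read-modify-write instead of barriers sprinkled
 * around atomic_inc(); the return value is simply discarded. */
static void record_event(void)
{
        atomic_inc_return(&nr_events);
}

/* Marked accessors on an atomic_t rather than READ_ONCE() on a plain int. */
static int current_event_count(void)
{
        return atomic_read(&nr_events);
}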

To Daniel's discussion of “unnecessary trap doors”, I can only agree that reinventing read-write semaphores is a very bad thing.

I will also make no attempt to defend ill-thought-out hacks involving weak references or RCU. Sure, a quick fix to get your production system running is all well and good, but the real fix should be properly designed.

And Daniel makes an excellent argument when he says that if a counter can be protected by an already held lock, that counter should be implemented using normal C-language accesses to normal integral variables. For those situations where no such lock is at hand, there are a lot of atomic and per-CPU counting examples that can be followed, both in the Linux kernel and in Chapter 5 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?”. Again, why unnecessarily re-invent the wheel?
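
For example, a minimal sketch, with hypothetical names, of a counter that needs no atomics because its lock is already held:

struct foo_stats {
        spinlock_t lock;
        int nr_active;                  /* protected by ->lock */
};

static void foo_stats_inc(struct foo_stats *stats)
{
        lockdep_assert_held(&stats->lock);
        stats->nr_active++;             /* ordinary C access, no atomics required */
}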

However, the last paragraph, stating that atomic operations should only be used for locking and synchronization primitives in the core kernel is a bridge too far. After all, a later section allows for memory barriers to be used in libraries (at least driver-hacker-proof libraries), so it seems reasonable that atomic operations can also be used in libraries.

Some help is provided by the executable Linux-kernel memory model (LKMM) in tools/memory-model along with the kernel concurrency sanitizer (KCSAN), which is documented in Documentation/dev-tools/kcsan.rst, but there is no denying that these tools currently require significant expertise. Help notwithstanding, it almost always makes a lot of sense to hide complex operations, including complex operations involving concurrency, behind well-designed APIs.

And help is definitely needed, given that there are more than 10,000 invocations of atomic operations in the drivers tree, more than a thousand of which are in drivers/gpu.

There is not much to say about the “Locking Antipattern: preempt/local_irq/bh_disable() and Friends” section. These primitives are not heavily used in the drivers tree. However, lockdep does have enough understanding of these primitives to diagnose misuse of irq-disabled and bh-disabled spinlocks. Which is a good thing, given that there are some thousands of uses of irq-disabled spinlocks in the drivers tree, along with a good thousand uses of bh-disabled spinlocks.

The “Locking Antipattern: Memory Barriers” suggests that memory barriers should be packaged in a library or core kernel service, which is in the common case excellent advice. Again, the executable LKMM and KCSAN can help, but again these tools currently require some expertise. I was amused by Daniel's “I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly”, and I am glad that my articles and talks provide at least a little entertainment value, if nothing else. ;-)

Summing up my view of these two blog posts, Daniel recommends that most driver code avoid concurrency entirely. Failing that, he recommends sticking to certain locking and reference-counting use cases, albeit including a rather complex acquire-locks-in-any-order use case. For the most part, he recommends against atomics, RCU, and memory barriers, with a very few exceptions.

For me, reading these blog posts induced great nostalgia, taking me back to my early 1990s days at Sequent, when the guidelines were quite similar, give or take a large number of non-atomically manipulated per-CPU counters. But a few short years later, many Sequent engineers were using atomic operations, and yes, a few were even using RCU. Including one RCU use case in a device driver, though that use case could instead be served by the Linux kernel's synchronize_irq() primitive.

Still, the heavy use of RCU and (even more so) of atomics within the drivers tree, combined with Daniel's distaste for these primitives, suggests that some sort of change might be in order.

Can We Fix This?
Of course we can!!!

But will a given fix actually improve the situation? That is the question.

Reading through this reminded me that I need to take another pass through the RCU documentation. I have queued a commit to fix the misleading wording for SLAB_TYPESAFE_BY_RCU on the -rcu tree: 08f8f09b2a9e ("doc: SLAB_TYPESAFE_BY_RCU uses cannot rely on spinlocks"). I also expect to improve the documentation of reference counting and its relation to SLAB_TYPESAFE_BY_RCU, and will likely find a number of other things in need of improvement.

This is also as good a time as any to announce that I will be holding an RCU Office Hours birds-of-a-feather session at the 2022 Linux Plumbers Conference, in case that is helpful.

However, the RCU documentation must of necessity remain fairly high level. And to that end, GPU-specific advice about use of xarray, kref_get_unless_zero(), and kfree_rcu() really needs to be in Documentation/gpu as opposed to Documentation/RCU. This would allow that advice to be much more specific and thus much more helpful to the GPU developers and maintainers. Alternatively, perhaps improved GPU-related APIs are required in order to confine concurrency to functions designed for that purpose. This alternative approach has the benefit of allowing GPU device drivers to focus more on GPU-specific issues and less on concurrency. On the other hand, given that the GPU drivers comprise some millions of lines of code, this might be easier said than done.

It is all too easy to believe that it is possible to improve the documentation for a number of other facilities that Daniel called on the carpet. At the same time, it is important to remember that the intent of documentation is communication, and that the optimal mode of communication depends on the target audience. At its best, documentation builds a bridge from where the target audience currently is to where they need to go. Which means the broader the target audience, the more difficult it is to construct that bridge. Which in turn means that a given subsystem likely needs usage advice and coding standards specific to that subsystem. One size does not fit all.

The SLAB_TYPESAFE_BY_RCU facility was called on the carpet, and perhaps understandably so. Would it help if SLAB_TYPESAFE_BY_RCU were to be changed so as to allow locks to be acquired on objects that might at any time be passed to kmem_cache_free() and then reallocated via kmem_cache_alloc()? In theory, this is easy: Just have the kmem_cache in question zero pages allocated from the system before splitting them up into objects. Then a given object could have an “initialized” flag, and if that flag was cleared in an object just returned from kmem_cache_alloc(), then and only then would that lock be initialized. This would allow a lock to be acquired (under rcu_read_lock(), of course) on a freed object, and would allow a lock to be held on an object despite its being passed to kmem_cache_free() and returned from kmem_cache_alloc() in the meantime.
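
A purely illustrative sketch of the allocation-side check being proposed here, with made-up field and variable names:

obj = kmem_cache_alloc(typesafe_cache, GFP_KERNEL);
if (obj && !obj->initialized) {
        /* Pages arrive zeroed, so a clear flag means that this is the very
         * first allocation of this memory. */
        spin_lock_init(&obj->lock);
        obj->initialized = true;        /* never cleared, not even by kmem_cache_free() */
}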

In practice, some existing SLAB_TYPESAFE_BY_RCU users might not be happy with the added overhead of page zeroing, so this might require an additional GFP_ flag to allow zeroing on a kmem_cache-by-kmem_cache basis. However, the first question is “Would this really help?”, and answering that question requires feedback from developers and maintainers who are actually using SLAB_TYPESAFE_BY_RCU.

Some might argue that the device drivers should all be rewritten in Rust, and cynics might argue that Daniel wrote his pair of blog posts with exactly that thought in mind. I am happy to let those cynics make that argument, especially given that I have already held forth on Linux-kernel concurrency in Rust here. However, a possible desire to rust Linux-kernel device drivers does not explain Daniel's distaste for what he calls “zombie objects” because Rust is in fact quite capable of maintaining references to objects that have been removed from their search structure.

Summary and Conclusions
As noted earlier, reading these blog posts induced great nostalgia, taking me back to my time at Sequent in the early 1990s. A lot has happened in the ensuing three decades, including habitual use of locking across the industry, and sometimes even correct use of locking.

But will generic developers ever be able to handle more esoteric techniques involving atomic operations and RCU?

I believe that the answer to this question is “yes”, as laid out in my 2012 paper Beyond Expert-Only Parallel Programming?. As in the past, tooling (including carefully designed APIs), economic forces (including continued ubiquitous multi-core systems), and acculturation (assisted by a vast quantity of open-source software) have done the trick, and I see no reason why these trends will not continue.

But what happens in drivers/gpu is up to the GPU developers and maintainers!

References
Atomic Operations and Memory Barriers, though more description than reference:

  1. Documentation/atomic_t.txt
  2. Documentation/atomic_bitops.txt
  3. Documentation/memory-barriers.txt
  4. Sometimes the docbook header is on the x86 arch_ function, for example, arch_atomic_inc() in arch/x86/include/asm/atomic.h rather than atomic_inc().
  5. The LKMM references below can also be helpful.

Kernel Concurrency Sanitizer (KCSAN):

  1. Documentation/dev-tools/kcsan.rst
  2. Finding race conditions with KCSAN
  3. Concurrency bugs should fear the big bad data-race detector (part 1)
  4. Concurrency bugs should fear the big bad data-race detector (part 2)
  5. Detecting missing memory barriers with KCSAN

Linux-Kernel Memory Model (LKMM):

  1. tools/memory-model, including its Documentation subdirectory.
  2. A formal kernel memory-ordering model (part 1)
  3. A formal kernel memory-ordering model (part 2)
  4. Who's afraid of a big bad optimizing compiler?
  5. Calibrating your fear of big bad optimizing compilers
  6. Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel (non-paywalled extended edition)

Read-Copy Update (RCU):

  1. Documentation/RCU
  2. The RCU API, 2019 edition
  3. Unraveling RCU-Usage Mysteries (Fundamentals)
  4. Unraveling RCU-Usage Mysteries (Additional Use Cases)
  5. Sections 9.5 and 9.6 of “Is Parallel Programming Hard, And, If So, What Can You Do About It?”
  6. Many other references, some of which are listed here.

Daniel Vetter: Locking Engineering Hierarchy

1 year 8 months ago

The first part of this series covered principles of locking engineering. This part goes through a pile of locking patterns and designs, from the most favourable and easiest to adjust, and hence resulting in a long-term maintainable code base, to the least favourable, since it is hardest to ensure that it works correctly and stays that way while the code evolves. For convenience it is even color coded, with the dangerous levels getting progressively more crispy red, indicating how close to the burning fire you are! Think of it as Dante’s Inferno, but for locking.

As a reminder from the intro of the first part, by locking engineering I mean the art of ensuring that there’s sufficient consistency in reading and manipulating data structures, and not just sprinkling mutex_lock() and mutex_unlock() calls around until the result looks reasonable and lockdep has gone quiet.

Level 0: No Locking

The dumbest possible locking is no need for locking at all. Which does not mean extremely clever lockless tricks for a “look, no calls to mutex_lock()” feint, but an overall design which guarantees that any writers cannot exist concurrently with any other access at all. This removes the need for consistency guarantees while accessing an object at the architectural level.

There’s a few standard patterns to achieve locking nirvana.

Locking Pattern: Immutable State

The lesson in graphics API design over the last decade is that immutable state objects rule, because they both lead to simpler driver stacks and also better performance. Vulkan instead of OpenGL, with its ridiculous amount of mutable and implicit state, is the big example, but atomic instead of legacy kernel mode setting, or Wayland instead of X11, are also built on the assumption that immutable state objects are a Great Thing (tm).

The usual pattern is as follows (a code sketch appears after the list):

  1. A single thread fully constructs an object, including any sub structures and anything else you might need. Often subsystems provide initialization helpers for objects that drivers can subclass through embedding, e.g. drm_connector_init() for initializing a kernel modesetting output object. Additional functions can set up different or optional aspects of an object, e.g. drm_connector_attach_encoder() sets up the invariant links to the preceding element in a kernel modesetting display chain.

  2. The fully formed object is published to the world, in the kernel this often happens by registering it under some kind of identifier. This could be a global identifier like register_chrdev() for character devices, something attached to a device like registering a new display output on a driver with drm_connector_register() or some struct xarray in the file private structure. Note that this step here requires memory barriers of some sort. If you hand roll the data structure like a list or lookup tree with your own fancy locking scheme instead of using existing standard interfaces you are on a fast path to level 3 locking hell. Don’t do that.

  3. From this point on there are no consistency issues anymore and all threads can access the object without any locking.
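
A minimal sketch of this construct-then-publish sequence, using made-up widget names and an xarray as the standard publication interface:

struct widget {
        const struct widget_ops *ops;   /* immutable after publication */
        u32 features;
};

static DEFINE_XARRAY_ALLOC(widget_table);

static int widget_create(const struct widget_ops *ops, u32 features, u32 *id)
{
        struct widget *w = kzalloc(sizeof(*w), GFP_KERNEL);
        int ret;

        if (!w)
                return -ENOMEM;

        /* Step 1: fully construct while only this thread can see the object. */
        w->ops = ops;
        w->features = features;

        /* Step 2: publish through a standard interface; xa_alloc() supplies
         * the locking and memory ordering, so nobody sees a half-built widget. */
        ret = xa_alloc(&widget_table, id, w, xa_limit_32b, GFP_KERNEL);
        if (ret)
                kfree(w);

        /* Step 3: readers look the widget up with xa_load() and need no
         * further locking for these immutable fields. */
        return ret;
}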

Locking Pattern: Single Owner

Another way to ensure there’s no concurrent access is by only allowing one thread to own an object at a given point of time, and have well defined handover points if that is necessary.

Most often this pattern is used for asynchronously processing a userspace request:

  1. The syscall or IOCTL constructs an object with sufficient information to process the userspace’s request.

  2. That object is handed over to a worker thread with e.g. queue_work().

  3. The worker thread is now the sole owner of that piece of memory and can do whatever it feels like with it.

Again the second step requires memory barriers, which means if you hand roll your own lockless queue you’re firmly in level 3 territory and won’t get rid of the burned in red hot afterglow in your retina for quite some time. Use standard interfaces like struct completion or even better libraries like the workqueue subsystem here.
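
A rough sketch of the single-owner handover, with hypothetical request and function names:

struct fetch_request {
        struct work_struct work;
        u64 offset;
        u32 len;
};

static void fetch_work_fn(struct work_struct *work)
{
        struct fetch_request *req = container_of(work, struct fetch_request, work);

        /* Sole owner: no locking needed to use or modify req here. */
        /* ... process req->offset and req->len ... */
        kfree(req);                             /* the owner also frees */
}

static int fetch_ioctl(u64 offset, u32 len)
{
        struct fetch_request *req = kzalloc(sizeof(*req), GFP_KERNEL);

        if (!req)
                return -ENOMEM;
        req->offset = offset;
        req->len = len;
        INIT_WORK(&req->work, fetch_work_fn);
        queue_work(system_wq, &req->work);      /* ownership is handed over here */
        return 0;                               /* req must not be touched anymore */
}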

Note that the handover can also be chained or split up, e.g. for a nonblocking atomic kernel modeset request there are three asynchronous processing pieces involved:

  • The main worker, which pushes the display state update to the hardware and which is enqueued with queue_work().

  • The userspace completion event handling built around struct drm_pending_event and generally handed off to the interrupt handler of the driver from the main worker and processed in the interrupt handler.

  • The cleanup of the no longer used old scanout buffers from the preceding update. The synchronization between the preceding update and the cleanup is done through struct completion to ensure that there’s only ever a single worker which owns a state structure and is allowed to change it.

Locking Pattern: Reference Counting

Users generally don’t appreciate if the kernel leaks memory too much, and cleaning up objects by freeing their memory and releasing any other resources tends to be an operation of the very much mutable kind. Reference counting to the rescue!

  • Every pointer to the reference counted object must guarantee that a reference exists for as long as the pointer is in use. Usually that’s done by calling kref_get() when making a copy of the pointer, but implied references by e.g. continuing to hold a lock that protects a different pointer are often good enough too for a temporary pointer.

  • The cleanup code runs when the last reference is released with kref_put(). Note that this again requires memory barriers to work correctly, which means if you’re not using struct kref then it’s safe to assume you’ve screwed up.

Note that this scheme falls apart when released objects are put into some kind of cache and can be resurrected. In that case your cleanup code needs to somehow deal with these zombies and ensure there’s no confusion, and vice versa any code that resurrects a zombie needs to deal with the wooden spikes the cleanup code might throw at an inopportune time. The worst example of this kind is SLAB_TYPESAFE_BY_RCU, where readers that are only protected with rcu_read_lock() may need to deal with objects potentially going through simultaneous zombie resurrections, potentially multiple times, while the readers are trying to figure out what is going on. This generally leads to lots of sorrow, wailing and ill-tempered maintainers, as the GPU subsystem has experienced and continues to experience with struct dma_fence.

Hence use standard reference counting, and don’t be tempted by the siren of trying to implement clever caching of any kind.
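
For completeness, a minimal sketch of the standard struct kref pattern, with made-up names:

struct buffer {
        struct kref ref;
        void *data;
};

static void buffer_release(struct kref *ref)
{
        struct buffer *buf = container_of(ref, struct buffer, ref);

        kfree(buf->data);
        kfree(buf);
}

/* Every copied pointer takes a reference ... */
static struct buffer *buffer_get(struct buffer *buf)
{
        kref_get(&buf->ref);
        return buf;
}

/* ... and the final put runs the cleanup, with kref supplying the barriers. */
static void buffer_put(struct buffer *buf)
{
        kref_put(&buf->ref, buffer_release);
}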

Level 1: Big Dumb Lock

It would be great if nothing ever changes, but sometimes that cannot be avoided. At that point you add a single lock for each logical object. An object could be just a single structure, but it could also be multiple structures that are dynamically allocated and freed under the protection of that single big dumb lock, e.g. when managing GPU virtual address space with different mappings.

The tricky part is figuring out what is an object to ensure that your lock is neither too big nor too small:

  • If you make your lock too big you run the risk of creating a dreaded subsystem lock, or violating the “Protect Data, not Code” principle in some other way. Split your locking further so that a single lock really only protects a single object, and not a random collection of unrelated ones. So one lock per device instance, not one lock for all the device instances in a driver or worse in an entire subsystem.

    The trouble is that once a lock is too big and has firmly moved into “protects some vague collection of code” territory, it’s very hard to get out of that hole.

  • Different problems strike when the locking scheme is too fine-grained, e.g. in the GPU virtual memory management example when every address mapping in the big vma tree has its own private lock. Or when a structure has a lot of different locks for different member fields.

    One issue is that locks aren’t free, the overhead of fine-grained locking can seriously hurt, especially when common operations have to take most of the locks anyway and so there’s no chance of any concurrency benefit. Furthermore fine-grained locking leads to the temptation of solving locking overhead with ever more clever lockless tricks, instead of radically simplifying the design.

    The other issue is that more locks improve the odds for locking inversions, and those can be tough nuts to crack. Again trying to solve this with more lockless tricks to avoid inversions is tempting, and again in most cases the wrong approach.

Ideally, your big dumb lock would always be right-sized every time the requirements on the datastructures change. But working magic 8 balls tend to be in short supply, and you tend to only find out that your guess was wrong when the pain of the lock being too big or too small is already substantial. The inherent struggles of resizing a lock as the code evolves then keep pushing you further away from the optimum instead of closer. Good luck!

Level 2: Fine-grained Locking

It would be great if this is all the locking we ever need, but sometimes there’s functional reasons that force us to go beyond the single lock for each logical object approach. This section will go through a few of the common examples, and the usual pitfalls to avoid.

But before we delve into details remember to document in kerneldoc with the inline per-member kerneldoc comment style once you go beyond a simple single lock per object approach. It’s the best place for future bug fixers and reviewers - meaning you - to find the rules for how at least things were meant to work.

Locking Pattern: Object Tracking Lists

One of the main duties of the kernel is to track everything, not least to make sure there are no leaks and everything gets cleaned up again. But there are other reasons to maintain lists (or other container structures) of objects.

Now sometimes there’s a clear parent object, with its own lock, which could also protect the list with all the objects, but this does not always work:

  • It might force the lock of the parent object to essentially become a subsystem lock and so protect much more than it should when following the “Protect Data, not Code” principle. In that case it’s better to have a separate (spin-)lock just for the list to be able to clearly untangle what the parent and subordinate object’s lock each protect.

  • Different code paths might need to walk and possibly manipulate the list both from the container object and contained object, which would lead to locking inversion if the list isn’t protected by its own stand-alone (nested) lock. This tends to especially happen when an object can be attached to multiple other objects, like a GPU buffer object can be mapped into multiple GPU virtual address spaces of different processes.

  • The constraints of calling contexts for adding or removing objects from the list could be different and incompatible from the requirements when walking the list itself. The main example here are LRU lists where the shrinker needs to be able to walk the list from reclaim context, whereas the superior object locks often have a need to allocate memory while holding each lock. Those object locks the shrinker can then only trylock, which is generally good enough, but only being able to trylock the LRU list lock itself is not.

Simplicity should still win, therefore only add a (nested) lock for lists or other container objects if there’s really no suitable object lock that could do the job instead.

Locking Pattern: Interrupt Handler State

Another example that requires nested locking is when part of the object is manipulated from a different execution context. The prime example here are interrupt handlers. Interrupt handlers can only use interrupt safe spinlocks, but often the main object lock must be a mutex to allow sleeping or allocating memory or nesting with other mutexes.

Hence the need for a nested spinlock to just protect the object state shared between the interrupt handler and code running from process context. Process context should generally only acquire the spinlock nested with the main object lock, to avoid surprises and limit any concurrency issues to just the singleton interrupt handler.
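
A sketch of this arrangement, with a hypothetical device structure and register read:

struct my_hw {
        struct mutex lock;              /* main object lock, process context only */
        spinlock_t irq_lock;            /* protects only irq_status and completed */
        u32 irq_status;
        bool completed;
};

static irqreturn_t my_hw_irq(int irq, void *arg)
{
        struct my_hw *hw = arg;

        spin_lock(&hw->irq_lock);       /* hard-irq context, interrupts already off */
        hw->irq_status = my_hw_read_status(hw);         /* hypothetical register read */
        hw->completed = true;
        spin_unlock(&hw->irq_lock);
        return IRQ_HANDLED;
}

static bool my_hw_poll_completed(struct my_hw *hw)
{
        bool done;

        mutex_lock(&hw->lock);                  /* the outer object lock first ... */
        spin_lock_irq(&hw->irq_lock);           /* ... then the nested sublock */
        done = hw->completed;
        spin_unlock_irq(&hw->irq_lock);
        mutex_unlock(&hw->lock);
        return done;
}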

Locking Pattern: Async Processing

Very similar to the interrupt handler problems is coordination with async workers. The best approach is the single owner pattern, but often state needs to be shared between the worker and other threads operating on the same object.

The naive approach of just using a single object lock tends to deadlock:

start_processing(obj)
{
        mutex_lock(&obj->lock);
        /* set up the data for the async work */
        schedule_work(&obj->work);
        mutex_unlock(&obj->lock);
}

stop_processing(obj)
{
        mutex_lock(&obj->lock);
        /* clear the data for the async work */
        cancel_work_sync(&obj->work);
        mutex_unlock(&obj->lock);
}

work_fn(work)
{
        obj = container_of(work, work);
        mutex_lock(&obj->lock);
        /* do some processing */
        mutex_unlock(&obj->lock);
}

Do not worry if you don’t spot the deadlock, because it is a cross-release dependency between the entire work_fn() and cancel_work_sync(), and these are a lot trickier to spot. Since cross-release dependencies are an entire huge topic of their own I won’t go into more details; a good starting point is this LWN article.

There’s a bunch of variations of this theme, with problems in different scenarios:

  • Replacing the cancel_work_sync() with cancel_work() avoids the deadlock, but often means the work_fn() is prone to use-after-free issues.

  • Calling cancel_work_sync() before taking the mutex can work in some cases, but falls apart when the work is self-rearming. Or maybe the work or overall object isn’t guaranteed to exist without holding its lock, e.g. if this is part of an async processing queue for a parent structure.

  • Cancelling the work after the call to mutex_unlock() might race with concurrent restarting of the work and upset the bookkeeping.

Like with interrupt handlers, the clean solution tends to be an additional nested lock which protects just the mutable state shared with the work function and nests within the main object lock. That way the work can be cancelled while the main object lock is held, which avoids a ton of races, but without holding the sublock that work_fn() needs, which avoids the deadlock.
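
For illustration, here is the same pseudocode with the nested lock applied; obj->worker_lock is a made-up name for the sublock protecting just the state shared with the worker.

start_processing(obj)
{
        mutex_lock(&obj->lock);

        spin_lock(&obj->worker_lock);
        /* set up the data for the async work */
        spin_unlock(&obj->worker_lock);

        schedule_work(&obj->work);
        mutex_unlock(&obj->lock);
}

stop_processing(obj)
{
        mutex_lock(&obj->lock);

        /* no deadlock: work_fn() never takes obj->lock */
        cancel_work_sync(&obj->work);

        spin_lock(&obj->worker_lock);
        /* clear the data for the async work */
        spin_unlock(&obj->worker_lock);

        mutex_unlock(&obj->lock);
}

work_fn(work)
{
        obj = container_of(work, typeof(*obj), work);

        spin_lock(&obj->worker_lock);
        /* do some processing, touching only the shared state */
        spin_unlock(&obj->worker_lock);
}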

Note that in some cases the superior lock doesn’t need to exist, e.g. struct drm_connector_state is protected by the single owner pattern, but drivers might have a need for some further decoupled asynchronous processing, e.g. for handling the content protection or link training machinery. In that case only the sublock for the mutable driver-private state shared with the worker exists.

Locking Pattern: Weak References

Reference counting is a great pattern, but sometimes you need to be able to store pointers without them holding a full reference. This could be for lookup caches, or because your userspace API mandates that some references do not keep the object alive - we’ve unfortunately committed that mistake in the GPU world. Or because holding full references everywhere would lead to unreclaimable reference loops and there’s no better way to break them than to make some of the references weak. In languages with a garbage collector weak references are implemented by the runtime and are no real worry. But in the kernel the concept has to be implemented by hand.

Since weak references are such a standard pattern, struct kref has ready-made support for them. The simple approach is using kref_put_mutex() with the same lock that also protects the structure containing the weak reference. This guarantees that either the weak reference pointer is gone too, or there is still a strong reference around somewhere and it is therefore safe to call kref_get(). But there are some issues with this approach (a minimal sketch follows after the list):

  • It doesn’t compose to multiple weak references, at least if they are protected by different locks - all the locks need to be taken before the final kref_put() is called, which means minimally some pain with lock nesting and you get to hand-roll it all to boot.

  • The mutex required to be held during the final put is the one which protects the structure with the weak reference, and often has little to do with the object that’s being destroyed. So it’s a pretty nasty violation of the big dumb lock pattern. Furthermore the lock is held over the entire cleanup function, which defeats the point of the reference counting pattern, which is meant to enable “no locking” cleanup code. It becomes very tempting to stuff random other pieces of code under the protection of this lock, making it a sprawling mess and violating the principle to protect data, not code: the lock held during the entire cleanup operation is protecting against that cleanup code doing things, and no longer a specific data structure.
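
As a rough sketch of what this approach looks like in practice, assuming a cache structure whose cache->lock also protects the weak cache->entry pointer (all names made up; with kref_put_mutex() the release callback runs with that lock held, and in this sketch the caller drops it afterwards):

obj_put(obj)
{
        /* if the refcount hits zero, obj_release() runs with cache->lock held */
        if (kref_put_mutex(&obj->kref, obj_release, &cache->lock))
                mutex_unlock(&cache->lock);
}

obj_release(kref)
{
        obj = container_of(kref, typeof(*obj), kref);

        /* the weak pointer is cleared with cache->lock held ... */
        cache->entry = NULL;

        /* ... and so is, unfortunately, everything else in the cleanup,
         * since we only return to obj_put() afterwards */
        cleanup(obj);
}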

The much better approach is using kref_get_unless_zero(), together with a spinlock for your data structure containing the weak reference. This looks especially nifty in combination with struct xarray.

obj_find_in_cache(id)
{
        xa_lock();
        obj = xa_find(id);

        /* either we get a strong reference here, or the object is
         * already on its way out and the lookup fails */
        if (obj && !kref_get_unless_zero(&obj->kref))
                obj = NULL;

        xa_unlock();

        return obj;
}

With this all the issues are resolved:

  • Arbitrary numbers of weak references in any kind of structure protected by its own spinlock can be added, without causing dependencies between them.

  • In the object’s cleanup function the same spinlock only needs to be held right around where the weak references are removed from the lookup structure. The lock critical section is no longer needlessly enlarged; we’re back to protecting data instead of code.

With both together the locking no longer leaks beyond the lookup structure and its associated code, unlike with kref_put_mutex() and similar approaches. Thankfully kref_get_unless_zero() has become the much more popular approach since it was added 10 years ago!
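
For completeness, a minimal sketch of the matching cleanup side under these assumptions, with obj_cache being a hypothetical xarray holding the weak references by obj->id:

obj_release(kref)
{
        obj = container_of(kref, typeof(*obj), kref);

        /* the lookup structure's lock is only held inside xa_erase(),
         * right around removing the weak reference */
        xa_erase(&obj_cache, obj->id);

        /* the rest of the cleanup runs without any locks held */
        cleanup(obj);
}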

Locking Antipattern: Confusing Object Lifetime and Data Consistency

We’ve now seen a few examples where the “no locking” patterns from level 0 collide in annoying ways when more locking is added, to the point where we seem to violate the principle to protect data, not code. It’s worth looking at this a bit closer, since we can generalize what’s going on here to a fairly high-level antipattern.

The key insight is that the “no locking” patterns all rely on memory barrier primitives in disguise, not classic locks, to synchronize access between multiple threads. In the case of the single owner pattern there might also be blocking semantics involved, when the next owner needs to wait for the previous owner to finish processing first. These are functions like flush_work() or the various wait functions like wait_event() or wait_for_completion().

Calling these barrier functions while holding locks commonly leads to issues:

  • Blocking functions like flush_work() pull in every lock or other dependency that the work we wait on - or, more generally, any of the previous owners of an object - needed, as a so-called cross-release dependency. Unfortunately lockdep does not understand these natively, and the usual tricks to add manual annotations have severe limitations. There’s work ongoing to add cross-release dependency tracking to lockdep, but nothing looks anywhere near ready to merge. Since these dependency chains can be really long and get ever longer when more code is added to a worker - dependencies are pulled in even if only a single lock is held at any given time - this can quickly become a nightmare to untangle.

  • Often the requirement to hold a lock over these barrier type functions comes from the fact that the object would otherwise disappear, or otherwise undergo some serious confusion about its lifetime state - not just whether it’s still alive or getting destroyed, but also who exactly owns it or whether it’s maybe a resurrected zombie representing a different instance now. This encourages the lock to morph from a “protects some specific data” design to a “protects specific code from running” design, leading to all the code maintenance issues discussed in the protect data, not code principle.

For these reasons try as hard as possible to not hold any locks, or as few as feasible, when calling any of these memory-barriers-in-disguise functions used to manage object lifetime or ownership in general. The antipattern here is abusing locks to fix lifetime issues. We have seen two specific instances thus far: holding the object lock around cancel_work_sync() in the async processing pattern, and holding the weak reference’s lock over the entire cleanup function with kref_put_mutex().

We will see some more, but the antipattern holds in general as a source of troubles.

Level 2.5: Splitting Locks for Performance Reasons

We’ve looked at a pile of functional reasons for complicating the locking design, but sometimes you need to add more fine-grained locking for performance reasons. This is already getting dangerous, because it’s very tempting to tune some microbenchmark just because we can, or maybe delude ourselves that it will be needed in the future. Therefore only complicate your locking if:

  • You have actual real world benchmarks with workloads relevant to users that show measurable gains outside of statistical noise.

  • You’ve fully exhausted architectural changes to outright avoid the overhead, like io_uring pre-registering file descriptors locally to avoid manipulating the file descriptor table.

  • You’ve fully exhausted algorithm improvements like batching up operations to amortize locking overhead better.

Only then make your future maintenance pain guaranteed worse by applying more tricky locking than the bare minimum necessary for correctness. Still, go with the simplest approach; often converting a lock to its read-write variant is good enough.

Sometimes this isn’t enough, and you actually have to split up a lock into more fine-grained locks to achieve more parallelism and less contention among threads. Note that doing so blindly will backfire because locks are not free. When common operations still have to take most of the locks anyway, even if it’s only for a short time and in strict succession, the performance hit on single threaded workloads will not justify any benefit in more threaded use-cases.

Another issue with more fine-grained locking is that often you cannot define a strict nesting hierarchy, or worse might need to take multiple locks of the same object or lock class. I’ve written previously about this specific issue, and more importantly, how to teach lockdep about lock nesting, the bad and the good ways.

One really entertaining story from the GPU subsystem, for bystanders at least, is that we really screwed this up for good by de facto allowing userspace to control the lock order of all the objects involved in an IOCTL - while disjoint operations should still proceed without contention. If you ever manage to repeat this feat you can take a look at the wait-wound mutexes. Or if you just want some pretty graphs, LWN has an old article about wait-wound mutexes too.
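
For the curious, a rough sketch of what the wait-wound mutex dance looks like for just two objects; my_ww_class and the obj fields are made up, and the real thing for arbitrary object lists needs a bit more bookkeeping (see the kernel’s ww-mutex documentation).

static DEFINE_WW_CLASS(my_ww_class);

struct obj {
        struct ww_mutex ww_lock;
        /* ... */
};

int lock_both(struct obj *a, struct obj *b, struct ww_acquire_ctx *ctx)
{
        int ret;

        ww_acquire_init(ctx, &my_ww_class);

        /* the first lock of a fresh acquire context never sees -EDEADLK */
        ret = ww_mutex_lock(&a->ww_lock, ctx);

again:
        ret = ww_mutex_lock(&b->ww_lock, ctx);
        if (ret == -EDEADLK) {
                /* we lost against an older transaction: back off by dropping
                 * what we hold, sleep on the contended lock, then retry */
                ww_mutex_unlock(&a->ww_lock);
                ww_mutex_lock_slow(&b->ww_lock, ctx);
                swap(a, b);     /* "a" is always the lock we currently hold */
                goto again;
        }

        ww_acquire_done(ctx);
        return ret;
}

Unlocking is the boring part: ww_mutex_unlock() on both locks followed by ww_acquire_fini(ctx).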

Level 3: Lockless Tricks

Do not go here wanderer!

Seriously, I have seen a lot of very fancy driver subsystem locking designs, but I have not yet found many that were actually justified. Only real world, non-contrived performance issues can ever justify reaching for this level, and in almost all cases algorithmic or architectural fixes yield much better improvements than any kind of (locking) micro-optimization could ever hope for.

Hence this is just a long list of antipatterns, so that people who do not yet have a grumpy expression permanently chiseled into their facial structure know when they’re in trouble.

Note that this section isn’t limited to lockless tricks in the academic sense of guaranteed constant overhead forward progress, meaning no spinning or retrying anywhere at all. It’s for everything which doesn’t use standard locks like struct mutex, spinlock_t, struct rw_semaphore, or any of the others provided in the Linux kernel.

Locking Antipattern: Using RCU

Yeah RCU is really awesome and impressive, but it comes at serious costs:

  • By design, at least with standard usage, RCU elevates mixing up lifetime and consistency concerns to a virtue. rcu_read_lock() gives you both a read-side critical section and it extends the lifetime of any RCU protected object. There’s absolutely no way you can avoid that antipattern, it’s built in.

    Worse, RCU read-side critical sections nest rather freely, which means unlike with real locks abusing them to keep objects alive won’t run into nasty locking inversion issues when you pull that stunt with nesting different objects or classes of objects. Using locks to paper over lifetime issues is bad, but with RCU it’s weapons-grade levels of dangerous.

  • Equally nasty, RCU practically forces you to deal with zombie objects, which breaks the reference counting pattern in annoying ways.

  • On top of all this breaking out of RCU is costly and kinda defeats the point, and hence there’s a huge temptation to delay this as long as possible. Meaning check as many things and dereference as many pointers under RCU protection as you can, before you take a real lock or upgrade to a proper reference with kref_get_unless_zero().

    Unless extreme restraint is applied this results in RCU leading you towards locking antipatterns. Worse, RCU tends to spread them to ever more objects and ever more fields within them.

Altogether, all that freely using RCU achieves is proving that there really is no bottom on the code maintainability scale. It is not a great day when your driver dies in synchronize_rcu() and lockdep has no idea what’s going on, and I’ve seen such days.

Personally I think the most that’s still a legit and justified use of RCU in a driver subsystem is object lookup with struct xarray and kref_get_unless_zero(), with cleanup handled entirely by kfree_rcu(). Anything more and you’re very likely chasing a rabbit down its hole and have not realized it yet.
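
For reference, a minimal sketch of that pattern under the stated constraints; obj_cache, obj->id and the rcu_head member are made-up names for this example.

struct obj {
        struct kref kref;
        struct rcu_head rcu_head;
        unsigned long id;
};

static DEFINE_XARRAY(obj_cache);

struct obj *obj_lookup(unsigned long id)
{
        struct obj *obj;

        rcu_read_lock();
        obj = xa_load(&obj_cache, id);
        /* upgrade to a real reference, or give up if the object is
         * already on its way out */
        if (obj && !kref_get_unless_zero(&obj->kref))
                obj = NULL;
        rcu_read_unlock();

        return obj;
}

void obj_release(struct kref *kref)
{
        struct obj *obj = container_of(kref, struct obj, kref);

        xa_erase(&obj_cache, obj->id);
        /* defer the actual free until all RCU readers are done */
        kfree_rcu(obj, rcu_head);
}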

Locking Antipattern: Atomics

To start with, Linux atomics have two annoying properties:

  • Unlike e.g. C++ atomics in userspace they are unordered or weakly ordered by default in a lot of cases. A lot of people are surprised by that, and then have an even harder time understanding the memory barriers they need to sprinkle over the code to make it work correctly.

  • Worse, many atomic functions neither operate on the atomic types atomic_t and atomic64_t nor have atomic anywhere in their names, and so pose serious pitfalls to reviewers:

    • READ_ONCE() and WRITE_ONCE() for volatile loads and stores.
    • cmpxchg() and the various variants of atomic exchange with or without a compare operation.
    • Atomic bitops like set_bit() are all atomic. Worse, their non-atomic variants have the __set_bit() double underscores to scare you away from using them, even though these are the ones you really want by default (see the short snippet after this list).
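
A short snippet of these naming traps in action; shared and flags_word are just local variables invented for this example:

void naming_traps_demo(void)
{
        static unsigned long shared, flags_word;
        unsigned long val, old;

        set_bit(0, &flags_word);        /* atomic RMW despite the plain name */
        __set_bit(1, &flags_word);      /* non-atomic, usually what you want under a lock */

        WRITE_ONCE(shared, 1);          /* marked store, but no ordering by itself */
        val = READ_ONCE(shared);        /* marked load, equally unordered */

        old = cmpxchg(&shared, 1, 2);   /* atomic compare-exchange, ordered only on success */
}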

Those are a lot of unnecessary trap doors, but the real bad part is what people tend to build with atomic instructions:

  • I’ve seen at least three different, incomplete and ill-defined reimplementations of read write semaphores without lockdep support. Reinventing completions is also pretty popular. Worse, the folks involved didn’t realize what they built. That’s an impressive violation of the “Make it Correct” principle.

  • It seems very tempting to build terrible variations of the “no locking” patterns. It’s very easy to screw them up by extending them in a bad way; e.g. reference counting with weak references or RCU optimizations done wrong very quickly leads to a complete mess. There are reasons why you should never deviate from these.

  • Statistical counters with atomics look innocent, but almost always there’s already a lock you could take instead of doing unordered counter updates. Often that results in better code organization to boot, since the statistics for a list and its manipulation are then closer together. There are some exceptions with real performance justification; a recent one I’ve seen is memory shrinkers, where you really want your shrinker->count_objects() to not have to acquire any locks. Otherwise in a memory intense workload all threads are stuck on the one thread doing actual reclaim holding the same lock in your shrinker->scan_objects() function.

In short, unless you’re actually building a new locking or synchronization primitive in the core kernel, you most likely do not want to get seen even looking at atomic operations as an option.

Locking Antipattern: preempt/local_irq/bh_disable() and Friends …

This one is simple: lockdep doesn’t understand them. The real-time folks hate them. Whatever it is you’re doing, use proper primitives instead, and at least read up on the LWN coverage on why these are problematic and what to do instead. If you need some kind of synchronization primitive - maybe to avoid the lifetime vs. consistency antipattern pitfalls - then use the proper functions for that, like synchronize_irq().

Locking Antipattern: Memory Barriers

Or more often, lack of them, incorrect or imbalanced use of barriers, badly or wrongly or just not at all documented memory barriers, or …

The fact is that the vast majority of kernel hackers, and even more so driver people, have no useful understanding of the Linux kernel’s memory model, and should never be caught entertaining use of explicit memory barriers in production code. Personally I’m pretty good at spotting holes, but I’ve had to learn the hard way that I’m not even close to being able to positively prove correctness. And for better or worse, nothing short of that tends to cut it.

For a still fairly cursory discussion read the LWN series on lockless algorithms. If the code comments and commit message are anything less rigorous than that it’s fairly safe to assume there’s an issue.

Now don’t get me wrong, I love to read an article or watch a talk by Paul McKenney on RCU like anyone else to get my brain fried properly. But aside from extreme exceptions this kind of maintenance cost has simply no justification in a driver subsystem. At least unless it’s packaged in a driver hacker proof library or core kernel service of some sorts with all the memory barriers well hidden away where ordinary fools like me can’t touch them.

Closing Thoughts

I hope you enjoyed this little tour of progressively more worrying levels of locking engineering, with really just one key take away:

Simple, dumb locking is good locking, since with that you have a fighting chance to make it correct locking.

Thanks to Daniel Stone and Jason Ekstrand for reading and commenting on drafts of this text.

Linux Plumbers Conference: LPC 2022 Schedule is posted!

1 year 9 months ago

 

The schedule for when the miniconferences and tracks are going to occur is now posted at: https://lpc.events/event/16/timetable/#all

The runners for the miniconferences will be adding more details to each of their schedules over the coming weeks.

The Linux Plumbers Refereed track schedule and the Kernel Summit schedule are now available at: https://lpc.events/event/16/timetable/#all.detailed

The leads for the networking and toolchain tracks will be adding more details to each of their schedules over the coming weeks, as well.

For those that are registered to attend in person, you are free to continue to submit Birds of a Feather (BoF) sessions. They will be allocated space in the BoF rooms on a first come, first served basis. Please note that the BoFs will not be recorded.

We’re looking forward to a great 3 days of presentations and discussions. We hope you can join us either in-person or virtually!

Matthew Garrett: UEFI rootkits and UEFI secure boot

1 year 9 months ago
Kaspersky describes a UEFI implant used to attack Windows systems. Based on it appearing to require patching of the system firmware image, they hypothesise that it's propagated by manually dumping the contents of the system flash, modifying it, and then reflashing it back to the board. This probably requires physical access to the board, so it's not especially terrifying - if you're in a situation where someone's sufficiently enthusiastic about targeting you that they're reflashing your computer by hand, it's likely that you're going to have a bad time regardless.

But let's think about why this is in the firmware at all. Sophos previously discussed an implant that's sufficiently similar in some technical details that Kaspersky suggest they may be related to some degree. One notable difference is that the MyKings implant described by Sophos installs itself into the boot block of legacy MBR partitioned disks. This code will only be executed on old-style BIOS systems (or UEFI systems booting in BIOS compatibility mode), and they have no support for code signatures, so there's no need to be especially clever. Run malicious code in the boot block, patch the next stage loader, follow that chain all the way up to the kernel. Simple.

One notable distinction here is that the MBR boot block approach won't be persistent - if you reinstall the OS, the MBR will be rewritten[1] and the infection is gone. UEFI doesn't really change much here - if you reinstall Windows a new copy of the bootloader will be written out and the UEFI boot variables (that tell the firmware which bootloader to execute) will be updated to point at that. The implant may still be on disk somewhere, but it won't be run.

But there's a way to avoid this. UEFI supports loading firmware-level drivers from disk. If, rather than providing a backdoored bootloader, the implant takes the form of a UEFI driver, the attacker can set a different set of variables that tell the firmware to load that driver at boot time, before running the bootloader. OS reinstalls won't modify these variables, which means the implant will survive and can reinfect the new OS install. The only way to get rid of the implant is to either reformat the drive entirely (which most OS installers won't do by default) or replace the drive before installation.

This is much easier than patching the system firmware, and achieves similar outcomes - the number of infected users who are going to wipe their drives to reinstall is fairly low, and the kernel could be patched to hide the presence of the implant on the filesystem[2]. It's possible that the goal was to make identification as hard as possible, but there's a simpler argument here - if the firmware has UEFI Secure Boot enabled, the firmware will refuse to load such a driver, and the implant won't work. You could certainly just patch the firmware to disable secure boot and lie about it, but if you're at the point of patching the firmware anyway you may as well just do the extra work of installing your implant there.

I think there's a reasonable argument that the existence of firmware-level rootkits suggests that UEFI Secure Boot is doing its job and is pushing attackers into lower levels of the stack in order to obtain the same outcomes. Technologies like Intel's Boot Guard may (in their current form) tend to block user choice, but in theory should be effective in blocking attacks of this form and making things even harder for attackers. It should already be impossible to perform attacks like the one Kaspersky describes on more modern hardware (the system should identify that the firmware has been tampered with and fail to boot), which pushes things even further - attackers will have to take advantage of vulnerabilities in the specific firmware they're targeting. This obviously means there's an incentive to find more firmware vulnerabilities, which means the ability to apply security updates for system firmware as easily as security updates for OS components is vital (hint hint if your system firmware updates aren't available via LVFS you're probably doing it wrong).

We've known that UEFI rootkits have existed for a while (Hacking Team had one in 2015), but it's interesting to see a fairly widespread one out in the wild. Protecting against this kind of attack involves securing the entire boot chain, including the firmware itself. The industry has clearly been making progress in this respect, and it'll be interesting to see whether such attacks become more common (because Secure Boot works but firmware security is bad) or not.

[1] As we all remember from Windows installs overwriting Linux bootloaders
[2] Although this does run the risk of an infected user booting another OS instead, and being able to see the implant


Daniel Vetter: Locking Engineering Principles

1 year 9 months ago

For various reasons I spent way too much of the last two years looking at code with terrible locking design and trying to rectify it, instead of actually building a lot more cool things. Symptomatic that the last post here on my neglected blog is also a rant on lockdep abuse.

I tried to distill all the lessons learned into some training slides, and this two-part series is the writeup of the same. There are some GPU specific rules, but I think the key points should at least apply to kernel drivers in general.

The first part here lays out some principles, the second part builds a locking engineering design pattern hierarchy from the easiest to understand and maintain to the most nightmare inducing approaches.

Also, with locking engineering I mean the general problem of protecting data structures against concurrent access by multiple threads, and trying to ensure that each thread sees a sufficiently consistent view of the data it reads and that the updates it commits won’t result in confusion. Of course it highly depends upon the precise requirements what exactly sufficiently consistent means, but figuring out these kinds of questions is out of scope for this little series here.

Priorities in Locking Engineering

Designing a correct locking scheme is hard, validating that your code actually implements your design is harder, and then debugging when - not if! - you screwed up is even worse. Therefore the absolute most important rule in locking engineering, at least if you want to have any chance at winning this game, is to make the design as simple and dumb as possible.

1. Make it Dumb

Since this is the key principle the entire second part of this series will go through a lot of different locking design patterns, from the simplest and dumbest and easiest to understand, to the most hair-raising horrors of complexity and trickiness.

Meanwhile let’s continue to look at everything else that matters.

2. Make it Correct

Since simple doesn’t necessarily mean correct, especially when transferring a concept from design to code, we need guidelines. On the design front the most important one is to design for lockdep, and not fight it, for which I already wrote a full length rant. Here I will only go through the main lessons: validating locking by hand against all the other locking designs and nesting rules the kernel has overall is nigh impossible, extremely slow, something only a few people can do with any chance of success, and hence in almost all cases a complete waste of time. We need tools to automate this, and in the Linux kernel this is lockdep.

Therefore if lockdep doesn’t understand your locking design your design is at fault, not lockdep. Adjust accordingly.

A corollary is that you actually need to teach lockdep your locking rules, because otherwise different drivers or subsystems will end up with de facto incompatible nesting and dependencies. Which, as long as you never exercise them on the same kernel boot-up, much less the same machine, won’t make lockdep grumpy. But it will make maintainers very much question why they are doing what they’re doing.

Hence at driver/subsystem/whatever load time, when CONFIG_LOCKDEP is enabled, take all key locks in the correct order. One example for this relevant to GPU drivers is in the dma-buf subsystem.
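
A hypothetical sketch of what such lockdep priming can look like at init time; the two locks are made up for this example and stand in for whatever key locks your subsystem defines.

static DEFINE_MUTEX(my_outer_lock);
static DEFINE_MUTEX(my_inner_lock);     /* documented to nest inside my_outer_lock */

static void my_subsys_prime_lockdep(void)
{
        if (!IS_ENABLED(CONFIG_LOCKDEP))
                return;

        /* take the key locks once in the documented order, so lockdep
         * learns the nesting rules on the very first boot */
        mutex_lock(&my_outer_lock);
        mutex_lock(&my_inner_lock);
        mutex_unlock(&my_inner_lock);
        mutex_unlock(&my_outer_lock);
}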

In the same spirit, at every entry point to your library or subsystem, or anything else big, validate that the callers hold up the locking contract with might_lock(), might_sleep(), might_alloc() and all the variants and more specific implementations of this. Note that there’s a huge overlap between locking contracts and calling context in general (like interrupt safety, or whether memory allocation is allowed to call into direct reclaim), and since all these functions compile away to nothing when debugging is disabled there’s really no cost in sprinkling them around very liberally.
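
A hedged sketch of what such an entry-point contract check might look like; the function, the structure and the flush parameter are made up for this example.

struct my_obj {
        struct mutex lock;
        /* ... */
};

int my_subsys_do_stuff(struct my_obj *obj, bool flush)
{
        /* document the calling context the contract requires */
        might_sleep();
        might_alloc(GFP_KERNEL);
        /* obj->lock may be taken further down, even on calls that end up
         * not needing it - this also catches callers that already hold it */
        might_lock(&obj->lock);

        if (flush) {
                mutex_lock(&obj->lock);
                /* flush pending state */
                mutex_unlock(&obj->lock);
        }

        return 0;
}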

On the implementation and coding side there’s a few rules of thumb to follow:

  • Never invent your own locking primitives; you’ll get them wrong, or at least build something that’s slow. The kernel’s locks are built and tuned by people who’ve done nothing else their entire career, you won’t beat them except in bug count, and that by a lot.

  • The same holds for synchronization primitives - don’t build your own with a struct wait_queue_head, or worse, hand-roll your own wait queue. Instead use the most specific existing function that provides the synchronization you need, e.g. flush_work() or flush_workqueue() and the enormous pile of variants available for synchronizing against scheduled work items.

    A key reason here is that very often these more specific functions already come with elaborate lockdep annotations, whereas anything hand-rolled tends to require much more manual design validation.

  • Finally, at the intersection of “make it dumb” and “make it correct”, pick the simplest lock that works, like a normal mutex instead of a read-write semaphore. This is because in general, stricter rules catch bugs and design issues quicker, hence picking a very fancy “anything goes” locking primitive is a bad choice.

    As another example, pick spinlocks over mutexes because spinlocks are a lot more strict in what code they allow in their critical section. Hence there is much less risk that you put something silly in there by accident and close a dependency loop that could lead to a deadlock.

3. Make it Fast

Speed doesn’t matter if you don’t understand the design anymore in the future, you need simplicity first.

Speed doesn’t matter if all you’re doing is crashing faster. You need correctness before speed.

Finally speed doesn’t matter where users don’t notice it. If you micro-optimize a path that doesn’t even show up in real world workloads users care about, all you’ve done is wasted time and committed to future maintenance pain for no gain at all.

Similarly, optimizing code paths which would not need to run at all with an improved design is not worth it. This holds especially for GPU drivers, where the real application interfaces are OpenGL, Vulkan or similar, and there’s an entire driver on the userspace side - the right fix for performance issues is very often to radically update the contract and sharing of responsibilities between the userspace and kernel driver parts.

The big example here is GPU address patch list processing at command submission time, which was necessary for old hardware that completely lacked any useful concept of a per-process virtual address space. But that has changed, which means virtual addresses can stay constant, while the kernel can still freely manage the physical memory by manipulating pagetables, like on the CPU. Unfortunately one driver in the DRM subsystem instead spent easily an engineer-decade of effort to tune relocations, write lots of testcases for the resulting corner cases in the multi-level fastpath fallbacks, and even more time handling the impressive amounts of fallout in the form of bugs and future headaches due to the resulting unmaintainable code complexity …

In other subsystems where the kernel ABI is the actual application contract these kind of design simplifications might instead need to be handled between the subsystem’s code and driver implementations. This is what we’ve done when moving from the old kernel modesetting infrastructure to atomic modesetting. But sometimes no clever tricks at all help and you only get true speed with a radically revamped uAPI - io_uring is a great example here.

Protect Data, not Code

A common pitfall is to design locking by looking at the code, perhaps just sprinkling locking calls over it until it feels like it’s good enough. The right approach is to design locking for the data structures, which means specifying for each structure or member field how it is protected against concurrent changes, and how the necessary amount of consistency is maintained across the entire data structure with rules that stay invariant, irrespective of how code operates on the data. Then roll it out consistently to all the functions, because the code-first approach tends to have a lot of issues:

  • A code centric approach to locking often leads to locking rules changing over the lifetime of an object, e.g. with different rules for a structure or member field depending upon whether an object is in active use, maybe just cached or undergoing reclaim. This is hard to teach to lockdep, especially when the nesting rules change for different states. Lockdep assumes that the locking rules are completely invariant over the lifetime of the entire kernel, not just over the lifetime of an individual object or structure even.

    Starting from the data structures on the other hand encourages that locking rules stay the same for a structure or member field.

  • Locking design that changes depending upon the code that can touch the data would need either complicated documentation entirely separate from the code - with a high risk of becoming stale - or the explanations, if there are any, end up sprinkled over the various functions, which means reviewers need to reacquaint themselves with all the relevant chunks of the code base again to make sure they don’t miss odd corner cases.

    With data structure driven locking design there’s a perfect - because unique - place to document the rules: in the kerneldoc of each structure or member field (a small example follows after this list).

  • A consequence for code review is that to recheck the locking design with a code-first approach, every function and flow has to be checked against all others, and changes need to be checked against all the existing code. If this is not done you might miss a corner case where the locking falls apart with a race condition or could deadlock.

    With a data first approach to locking changes can be reviewed incrementally against the invariant rules, which means review of especially big or complex subsystems actually scales.

  • When facing a locking bug it’s tempting to try and fix it just in the affected code. By repeating that often enough a locking scheme that protects data acquires code specific special cases. Therefore locking issues always need to be first mapped back to new or changed requirements on the data structures and how they are protected.
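
As promised above, a small sketch of what such kerneldoc-driven locking documentation can look like; the structure and its fields are invented for this example.

/**
 * struct my_obj - hypothetical object with documented locking rules
 * @lock: protects @state and @pending, invariant over the object's lifetime
 * @state: current state machine state, updates require @lock
 * @pending: list of queued requests, protected by @lock
 * @id: assigned at creation and immutable afterwards, no locking needed
 */
struct my_obj {
        struct mutex lock;
        unsigned int state;
        struct list_head pending;
        u32 id;
};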

The big antipattern of how you end up with code centric locking is to protect an entire subsystem (or worse, a group of related subsystems) with a single huge lock. The canonical example was the big kernel lock BKL; that’s gone, but in many cases it has just been replaced by smaller, yet still huge, locks like console_lock().

This results in a lot of long term problems when trying to adjust the locking design later on:

  • Since the big lock protects everything, it’s often very hard to tell what it does not protect. Locking at the fringes tends to be inconsistent, and due to that its coverage tends to creep ever further when people try to fix bugs where a given structure is not consistently protected by the same lock.

  • Also often subsystems have different entry points, e.g. consoles can be reached through the console subsystem directly, through vt, tty subsystems and also through an enormous pile of driver specific interfaces with the fbcon IOCTLs as an example. Attempting to split the big lock into smaller per-structure locks pretty much guarantees that different entry points have to take the per-object locks in opposite order, which often can only be resolved through a large-scale rewrite of all impacted subsystems.

    Worse, as long as the big subsystem lock continues to be in use no one is spotting these design issues in the code flow. Hence they will slowly get worse instead of the code moving towards a better structure.

For these reasons big subsystem locks tend to live way past their justified usefulness until code maintenance becomes nigh impossible: Because no individual bugfix is worth the task to really rectify the design, but each bugfix tends to make the situation worse.

From Principles to Practice

Stay tuned for next week’s installment, which will cover what these principles mean when applying to practice: Going through a large pile of locking design patterns from the most desirable to the most hair raising complex.
