Show me what you run
Can I trust you?
When using a service or infrastructure on the internet, this is probably one of the hardest questions to answer. In this article, I’ll walk you through my attempt to answer it.
This starts with one question. NixOS has a couple of public servers running on the Internet, some of which are used to build packages for everyone and populate a cache. They don’t have any particular configuration, but we need to ensure they have not been tampered with before allowing them to upload packages to the binary cache. This is essential for the whole community to trust the binary cache, which everyone downloads packages from.
But as an individual without access to those servers, why should I trust them? Would there be a way for me to audit what they’re running without exposing the key?
Fortunately, yes! Read on, and we’ll try to assemble the pieces required together.
Build reproducibility is an essential part of this infrastructure. Fortunately, a lot of smart people have already worked on it, and NixOS has a couple of advantages in this fight. The fact that whenever a derivation is changed, all of its dependents are rebuilt as well makes a huge difference.
The folks involved in this effort have made so much progress that, as I write, we’re nearing completion of the effort to make the minimal installation ISO fully build-reproducible.
See https://r13y.com to follow those efforts.
Why do we need build-reproducibility? The essential idea is to expose the source of the build and all its parameters for anyone to rebuild and audit. Since the build is reproducible, rebuilding should yield an output with the same checksum, and checksums are easy to compare.
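To make the comparison concrete, here is a minimal Python sketch of what "same checksum" means in practice. The function names and the idea of comparing a published artifact against a local rebuild are illustrative assumptions, not part of any NixOS tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large images do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_build_output(published: Path, rebuilt: Path) -> bool:
    """True when the published artifact and a local rebuild are bit-for-bit identical."""
    return sha256_of_file(published) == sha256_of_file(rebuilt)
```

If the two digests differ, either the build is not reproducible yet or one of the two artifacts is not what it claims to be. Both cases deserve a closer look.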
The Unified Extensible Firmware Interface, or UEFI, is the specification that glues low-level firmware to your everyday Operating Systems. It’s meant as a replacement for the legacy BIOS.
Why do we need UEFI? First, because it’s required for TPM integration (see below), but more than that, UEFI provides its users with an event log. This event log records every component that has been involved in the boot process. Every time a component is loaded, the element before it will hash it, hand that hash to the TPM for the PCR to be extended, and record that in the event log.
Once Linux boots, it will get a copy of the event log recording each and every component involved in loading our Linux Kernel.
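The extend operation is what lets an auditor check an event log against a PCR: replaying the logged digests in order should land on the same value the TPM holds. Below is a minimal sketch of that replay for a SHA-256 bank. It assumes the PCR starts at all zeroes and that the digests have already been parsed out of the event log; parsing the binary log format is left out, and the helper names are hypothetical.

```python
import hashlib

PCR_SIZE = 32  # size in bytes of a PCR in the SHA-256 bank


def measure(component: bytes) -> bytes:
    """Hash a boot component, as the previous boot stage would."""
    return hashlib.sha256(component).digest()


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """TPM extend: new PCR = SHA-256(old PCR || measurement)."""
    return hashlib.sha256(pcr + measurement).digest()


def replay(event_digests: list[bytes]) -> bytes:
    """Recompute a PCR value from the ordered digests found in an event log."""
    pcr = bytes(PCR_SIZE)  # assumption: the PCR starts zeroed at reset
    for digest in event_digests:
        pcr = extend(pcr, digest)
    return pcr
```

Because each new value depends on the previous one, the result depends on both the content and the order of every measured component. There is no way to "undo" a measurement short of resetting the machine.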
Running stateless infrastructure
I’m a proponent of running stateless infrastructure or, more accurately, of running the operating system purely stateless. You’re more than welcome to store data on disks, but that data should belong to your business logic. It’s not the responsibility of your operating system.
Running stateless also suggests you keep rebuilding system images over and over, and you have full control over that build process. All input parameters are accounted for, and all your software is a derivation of those inputs.
Should your system crash, you just reboot and recover from a blank slate. This takes some practice, but in my experience, this has proven to be extremely reliable, especially on remote systems.
But that being said, that’s not why it’s important here. Running the Operating System stateless allows us to run it as a single image: one that bundles the kernel and the initramfs and, in that initramfs, all the software required to run production. Because we’re bundling everything in a single image, computing the checksum of all moving parts is easy as pie, since there is only one.
One other critical component is the TPM (Trusted Platform Module). Essentially, a TPM is a cryptographic co-processor which records the different components loaded from the initialization of a machine down to the Operating System. It holds cryptographic keys, use of which can be tied to a given state of the system.
Whenever a system boots, starting from CPU initialization, every component involved in running a system (firmware, CPU microcode, bootloader, kernel, and all their configuration) is hashed and registered in the TPM. Two subsequent boots of the same machine in the same state should yield the same state in the TPM. This state is represented by the PCRs (Platform Configuration Registers). There are 16 system/reserved PCRs which can only go one way: they can be extended, but never set directly. The only way to reset them is to reset the machine itself.
Those 16 PCRs have standardized uses, such as PCR 0 for the BIOS, PCR 1 for its configuration, etc. Here we’ll focus on the one measuring the Operating System. In my lab, booting from UEFI HTTP boot, this is PCR 4, usually reserved for the MBR, but your mileage may vary.
Among the features offered by TPMs is the remote attestation protocol, which can be used with any TPM. It is usually used to provision servers or remote devices: those servers authenticate to a central authority holding keys and secrets and, provided they display the expected PCRs, that central authority grants them access to critical resources.
The general idea of remote attestation is this: you send a challenge to the attestor, the attestor signs the challenge, and you verify the signature along with the key that signed it. The signed challenge is called a quote, and it includes the PCRs you selected. Sadly, because of limitations, the scheme is a bit more complicated than that.
There are essentially 2 keys involved here:
- Endorsement Key (EK). This key is signed by a TPM manufacturer and is stored securely in each TPM. This allows you to check what TPM you’re talking to and verify its authenticity by verifying the certificate chain. This would be the perfect key to sign data, except, most of the time, this key can only decrypt data.
- Attestation Key (AK). This key can only sign data, but it is not cross-signed by any authority.
At this point, we can verify the authenticity of the TPM, and we know that an Attestation Key can sign data, but we still haven’t tied both keys together.
There is a solution for that! The TPM exposes a mechanism called Endorsement Key Credentials. The verifier encrypts a secret with the public part of the Endorsement Key and encodes it along with a reference to the Attestation Key. Should the referenced Attestation Key not be present, the TPM will refuse to decrypt (or activate, in the TPM terminology) the credential. This process is described below in Figure 1:
At this point we have a few guarantees:
- The Endorsement Key we received is only held by the TPM, and can not be extracted. It is cross-signed by the TPM manufacturer.
- The Attestation Key is only held by the TPM, and can not be extracted. The last step, where we activate the credential received, ties both the Endorsement Key and the Attestation Key together on the same TPM. The TPM will only decrypt if the referenced Attestation Key is valid in this TPM.
- The Quote received is signed by the Attestation Key, which we trust provided the TPM can decrypt the secret.
The tricky part of this whole process is that you can’t start trusting the Attestation Key before the TPM proved it knew the secret you encoded in the Credential.
Safeboot remote attestation protocol
But there is a flaw. There is a round-trip between the quote being signed and the credential being shared. If you were to generate the Attestation Key the usual way, which stores it persistently across reboots, an attacker could reboot between the Quote and ActivateCredential phases. The Attestation Key that just signed the quote would then no longer be tied to the PCRs you just verified.
The smart folks of the Safeboot project came up with an elegant solution to this problem. It is possible to create ephemeral Attestation Keys that do not survive reboots. If you set the ST_CLEAR parameter when creating the key, the TPM ensures the key is cleared upon reset. This guarantees that the TPM was not reset mid-transaction, and that the key that signed the PCRs can be trusted.
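To see why an ephemeral key closes the gap, here is a logic-only toy model in Python. It involves no real cryptography and no real TPM API: `ToyTPM`, its methods, and the dictionary used as a "credential" are all hypothetical stand-ins that only track which keys are still loaded after a reset.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class ToyTPM:
    """Bookkeeping-only model of a TPM: it tracks loaded keys, nothing else."""

    ephemeral_keys: set[str] = field(default_factory=set)
    persistent_keys: set[str] = field(default_factory=set)

    def create_ak(self, ephemeral: bool) -> str:
        """Create an Attestation Key and return a stand-in for its name."""
        name = secrets.token_hex(8)
        (self.ephemeral_keys if ephemeral else self.persistent_keys).add(name)
        return name

    def reset(self) -> None:
        """A reboot wipes keys created as clear-on-reset; persistent ones survive."""
        self.ephemeral_keys.clear()

    def activate_credential(self, credential: dict) -> bytes:
        """Release the secret only if the referenced AK is still present."""
        loaded = self.ephemeral_keys | self.persistent_keys
        if credential["ak_name"] not in loaded:
            raise PermissionError("referenced attestation key is not present")
        return credential["secret"]


def survives_reboot(ephemeral: bool) -> bool:
    """Quote, reboot, then try to activate: does the attack still work?"""
    tpm = ToyTPM()
    credential = {"ak_name": tpm.create_ak(ephemeral), "secret": secrets.token_bytes(16)}
    tpm.reset()  # attacker reboots between Quote and ActivateCredential
    try:
        tpm.activate_credential(credential)
    except PermissionError:
        return False
    return True
```

In this model, `survives_reboot(ephemeral=False)` succeeds, which is the flaw described above, while `survives_reboot(ephemeral=True)` fails because the key that signed the quote no longer exists after the reset.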
Tying it all together
The remote attestation protocol described above is usually used for one thing: servers booting and fetching secrets. So, here the attestor would be the client, and the server would be the verifier. But what if we just reversed the roles? What if you, or I, or anyone curious for that matter, were able to reach out to another machine and just ask “What software are you running?” or “What is your current state/configuration?”
Could we make each Hydra build server run a TCP service that does the following?
- Receive a nonce
- Sign the nonce with its Attestation Key
- Reply with:
  - the UEFI event log,
  - a reference to the build image and its derivation,
  - the signed nonce, tied to the relevant PCR
- The verifier would then send back a credential
- Finally, the attestor would send back the secret in the clear, proving ownership of the Attestation Key.
The verifier could then go and rebuild the whole image, and audit the source code as they please.
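Seen from the verifier's side, the exchange above could be sketched roughly as follows. This is not the prototype's code: the `AttestationReply` fields, the function names, and the `verify_signature` callback (standing in for a real check of the quote against the Attestation Key) are all assumptions made for illustration.

```python
import hmac
import secrets
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AttestationReply:
    event_log: bytes       # raw UEFI event log
    image_reference: str   # pointer to the build image and its derivation
    quoted_nonce: bytes    # nonce echoed inside the signed quote
    quoted_pcr: bytes      # PCR value covered by the quote
    signature: bytes       # signature made with the Attestation Key


def new_nonce(size: int = 32) -> bytes:
    """A fresh random challenge, so an old quote cannot be replayed."""
    return secrets.token_bytes(size)


def check_reply(
    nonce: bytes,
    reply: AttestationReply,
    expected_pcr: bytes,
    verify_signature: Callable[[bytes, bytes], bool],  # hypothetical callback
) -> bool:
    """Check freshness, the PCR value, and the signature over both."""
    fresh = hmac.compare_digest(reply.quoted_nonce, nonce)
    pcr_ok = hmac.compare_digest(reply.quoted_pcr, expected_pcr)
    signed = verify_signature(reply.quoted_nonce + reply.quoted_pcr, reply.signature)
    return fresh and pcr_ok and signed


def check_activation(sent_secret: bytes, returned_secret: bytes) -> bool:
    """Last step: the attestor proves it could activate the credential."""
    return hmac.compare_digest(sent_secret, returned_secret)
```

Here `expected_pcr` is whatever value the verifier computed on their own, for instance by rebuilding the referenced image and replaying the event log. Only once both `check_reply` and `check_activation` pass does the quote tell you anything about the machine.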
Since I like answering my own questions: I believe that yes, it is possible, and I made a prototype. I’ve been playing with this idea for the last couple of weeks now, and the demo is available on my GitHub: Build-reproducibility lab
Here is a recording of it running:
There is still a question to answer: should we have to provide sensitive information to those servers (private keys, upload ssh host keys), how do we do that? Well, there is an obvious answer. Have you heard about remote attestation?
Is it done yet? No, of course not, there is still a lot of work, and we’ll most likely uncover a lot of new fun bugs along the way. But the future is promising and I’m excited about it.
If you have questions, comments, or want to mention things I missed, please reach out either on social media or file issues on the GitHub repository. I’ll try to do my best to answer.
Is this enough to trust a remote server running somewhere on the Internet? No. If you want to fully trust those servers, you should audit the event log in full, review all the checksums for the firmware (all recorded in there), and compare them to known good values as much as possible.
I’d like to thank Graham Christensen for his help and the insight into the NixOS infrastructure he provided.