TL;DR: override any vDSO, without root, without any LD_* hacking, even in static binaries
What is a vDSO?
vDSO (virtual dynamic shared object) was introduced in Linux some years ago following a simple observation: some workloads generate a lot of syscalls, and every syscall means a context switch to kernel space, which has a significant performance impact.
Could we possibly reduce the number of syscalls? Or have them somehow not context-switch?
There are a couple of simple syscalls, like gettimeofday(2) or clock_gettime(2), that only do a quick lookup and return immediately.
To avoid that, couldn’t the kernel just share a segment of memory and have the consumer look up what they need in that shared segment? How do we achieve ABI stability for that segment of memory? Is there a shared object format designed for ABI stability?
Yes there is: it’s called ELF. There is a specification, and it has been around since SunOS 5.0. So this is what was picked. The vDSO is just an ELF object injected by the kernel into every ELF binary that is executed, and it provides an alternative syscall implementation that looks up clocks in a shared memory segment.
How is this injected?
A simple example:
What is interesting here is that it is an ELF binary that is dynamically linked. What that means is that it does not contain all the code it requires to run on its own. It will need a set of libraries to be linked at runtime:
What pieces everything together?
It’s the ELF interpreter (/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2). This one is provided by glibc, but you could imagine any other interpreter; you can look up the one used by each program.
What is an ELF interpreter?
The ELF interpreter is a static binary that runs first, loads the shared libraries for you, and then jumps to your code (main or any other entry point).
We can see it do its job at the start of each program:
It will read the dynamic section of the bash ELF, then recursively load every library and rewrite the symbol tables of each one, preparing the memory as needed for your program to run.
How is it invoked?
What is this instruction position?
As we’ve seen, the interpreter has to run first. So the kernel has to load it and jump to its start position.
The dynamic loader memory layout will look like:
That means different segments will need to be loaded. The first segment, 0x0000000000000000 -> 0x0000000000000b98, is to be mapped PROT_READ; the second, 0x0000000000001000 -> 0x000000000002394e, PROT_READ | PROT_EXEC (and so on and so forth).
Once this is done, we should set the instruction pointer to 0x0000000000001050 from the start of the mapping (remember this could be mapped anywhere in memory: Position Independent Executable (PIE) plus ASLR).
What is the kernel doing?
We’re seeing the different segments of /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so being mapped in memory at base address 0x7ffff7fcc000, and it then sets the instruction pointer to 0x00007ffff7fcd050: ($rip) = 0x7ffff7fcc000 (mapping base) + 0x0000000000001050 (start position).
Starting the program
Now we’re running the interpreter; how does the kernel tell the interpreter that the user wanted to execute bash?
This is actually set up directly by the kernel (the logic is to be found in fs/binfmt_elf.c), and there are a couple of interesting things on the stack.
The stack is laid out like:
argc + argv[argc] to pass arguments to a program
envp the execution environment of said program
after that comes the lesser-known auxiliary vector, which is a list of (id, value) pairs.
The best documentation I found for the stack layout can be found on an LWN article. Here is a reproduction:
In our bash example that would look like:
We’ll find our argv:
We’ll find our environment:
But most importantly, we’ll find our vDSO. It is provided as a memory mapping (as we can see in /proc/$pid/maps above), and its address is in the AT_SYSINFO_EHDR value:
Note: we’ll also find the entry point of our program in the AT_ENTRY entry of the auxiliary vector.
Can we override a vDSO?
Here comes the not-a-million dollar question. Can we actually override a vDSO and provide a different implementation?
Yes, using ptrace, and injecting a new mapping just when execve returns. How do we do that?
I made a simple implementation in https://github.com/baloo/emmett/. Emmett will inject a vDSO to intercept calls to the clock_gettime(2) and gettimeofday(2) syscalls and return a static value.
On run, this will look like:
For each new process, Emmett will attach, intercept calls to execve, let the execve go through, then on return stop at the syscall-exit-stop state, parse the stack, inject a new mapping, copy in a new implementation, and rewrite the stack so that our own vDSO is linked.
How does that work for static binaries?
Go, for example, even in static builds, will parse the ELF provided via AT_SYSINFO_EHDR and call into the vDSO implementation.