Is vDSO black magic?
TLDR: override any vDSO without root or LD_* hacking, even in a static binary
What is a vDSO?
The vDSO (virtual dynamic shared object) was introduced in Linux years ago, following a simple observation: some workloads generate a lot of syscalls, and every syscall means switching into kernel space, which has a significant performance cost.
Could we possibly reduce the number of syscalls? Or have them somehow not switch into the kernel?
There are a few simpler syscalls, like gettimeofday(2) or clock_gettime(2), that only do a simple lookup and return immediately.
To avoid that cost, couldn’t the kernel just share a segment of memory and have the consumer look up what they need in that shared segment? How do we achieve ABI stability for that segment of memory? Is there a shared object format that is designed with a stable ABI in mind?
Yes, there is: it’s called ELF. It has a specification and it’s been around since SunOS 5.0, so this is what was picked. The vDSO is just an ELF object that the kernel injects into every ELF binary it executes, and it provides an alternative implementation of those syscalls that looks up clocks in a shared memory segment.
How is this injected?
A simple example:
$ file /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
/nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, not stripped
What is interesting here is that it is an ELF binary that is dynamically linked. What that means is that it does not contain all the code it requires to run on its own. It will need a set of libraries to be linked at runtime:
$ ldd /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
linux-vdso.so.1 (0x00007ffe303ce000)
libreadline.so.7 => /nix/store/r759ls1ah0cmqc8009q7b1xn91sg57q0-readline-7.0p5/lib/libreadline.so.7 (0x00007f721bb2b000)
libhistory.so.7 => /nix/store/r759ls1ah0cmqc8009q7b1xn91sg57q0-readline-7.0p5/lib/libhistory.so.7 (0x00007f721bb1f000)
libncursesw.so.6 => /nix/store/iz7bgfvjmwdz3vgmzryhfkqwbn8zr2jn-ncurses-6.2/lib/libncursesw.so.6 (0x00007f721baad000)
libdl.so.2 => /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/libdl.so.2 (0x00007f721baa8000)
libc.so.6 => /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/libc.so.6 (0x00007f721b8e3000)
/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2 => /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib64/ld-linux-x86-64.so.2 (0x00007f721bb7a000)
What pieces everything together?
It’s the ELF interpreter (/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2). This one is provided by glibc, but you could imagine any other interpreter. Each program names the interpreter it wants, which is what the file output above reports.
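To make this concrete, here is a minimal Python sketch of how the interpreter path can be read out of a binary: it walks the ELF program headers and returns the contents of the PT_INTERP entry. The function name elf_interpreter is made up for illustration, and the sketch assumes a 64-bit little-endian ELF such as the x86-64 bash above.

```python
import struct

PT_INTERP = 3  # program header type that holds the interpreter path


def elf_interpreter(data: bytes):
    """Return the PT_INTERP path of a 64-bit little-endian ELF image, or None."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    # In the ELF64 header, e_phoff is at offset 0x20,
    # e_phentsize at 0x36 and e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        # Elf64_Phdr begins with p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz.
        p_type, _flags, p_offset, _vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", data, off
        )
        if p_type == PT_INTERP:
            return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None  # no PT_INTERP entry, e.g. a static binary
```

Calling elf_interpreter(open(path, "rb").read()) on a dynamically linked binary should return the same interpreter path that file prints; on a static binary it returns None.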
What is an ELF interpreter?
$ file /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2
/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2: symbolic link to ld-2.33.so
$ file /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), static-pie linked, BuildID[sha1]=9331a61721911fb6517d093c404f34a4d197d403, not stripped
The ELF interpreter is a static binary that runs first, loads the shared libraries for you, and then jumps to your code (main or any other entry point).
We can see it do its job at the start of each program:
$ LD_DEBUG=symbols bash
581026: symbol=__vdso_clock_gettime; lookup in file=linux-vdso.so.1 [0]
581026: symbol=__vdso_gettimeofday; lookup in file=linux-vdso.so.1 [0]
581026: symbol=__vdso_time; lookup in file=linux-vdso.so.1 [0]
581026: symbol=__vdso_getcpu; lookup in file=linux-vdso.so.1 [0]
581026: symbol=__vdso_clock_getres; lookup in file=linux-vdso.so.1 [0]
581026: symbol=_res; lookup in file=ls [0]
581026: symbol=_res; lookup in file=/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/librt.so.1 [0]
581026: symbol=_res; lookup in file=/nix/store/pni52adzp1cq83syj5wy1rgallv4nary-acl-2.3.1/lib/libacl.so.1 [0]
581026: symbol=_res; lookup in file=/nix/store/i0hmd3b0m5w501cdq1sx7ip19ba8mc26-attr-2.5.1/lib/libattr.so.1 [0]
[...]
It reads the dynamic section of the bash ELF, then recursively loads every library and rewrites the symbol tables of each one, preparing the memory as needed for your program to run.
How is it invoked?
$ gdb /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
(gdb) starti
Starting program: /nix/store/pxbkil5843kl55sgbhb0pl3gxgfx1s0m-system-path/bin/bash
Program stopped.
0x00007ffff7fcd050 in _start ()
from /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2
What is this instruction position?
As we’ve seen, the interpreter has to run first. So the kernel has to load it and jump to its start address.
The dynamic loader memory layout will look like:
$ objdump -x /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so | grep -A 10 'start address'
start address 0x0000000000001050
Program Header:
LOAD off 0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**12
filesz 0x0000000000000b98 memsz 0x0000000000000b98 flags r--
LOAD off 0x0000000000001000 vaddr 0x0000000000001000 paddr 0x0000000000001000 align 2**12
filesz 0x000000000002394e memsz 0x000000000002394e flags r-x
LOAD off 0x0000000000025000 vaddr 0x0000000000025000 paddr 0x0000000000025000 align 2**12
filesz 0x00000000000097f4 memsz 0x00000000000097f4 flags r--
LOAD off 0x000000000002ec00 vaddr 0x000000000002fc00 paddr 0x000000000002fc00 align 2**12
filesz 0x0000000000002438 memsz 0x00000000000025f0 flags rw-
That means different segments need to be loaded. The first segment (0xb98 bytes starting at 0x0000000000000000) is to be loaded as PROT_READ, the second segment (0x2394e bytes starting at 0x0000000000001000) is to be loaded as PROT_READ | PROT_EXEC, and so on.
Once this is done, the instruction pointer should be set to 0x0000000000001050 relative to wherever the first segment was mapped (remember that this could be anywhere in memory, because of Position Independent Executables (PIE) and ASLR).
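As a small worked example, the Python sketch below takes the vaddr, memsz and flags columns from the objdump output above and prints, for each LOAD entry, where it ends and which protection bits it asks for. The helper names prot_names and describe are hypothetical; only the numbers come from the listing.

```python
# (vaddr, memsz, flags) copied from the objdump program headers above.
LOAD_SEGMENTS = [
    (0x00000, 0x00b98, "r--"),
    (0x01000, 0x2394e, "r-x"),
    (0x25000, 0x097f4, "r--"),
    (0x2fc00, 0x025f0, "rw-"),
]


def prot_names(flags: str) -> str:
    """Translate objdump's 'rwx' column into the matching mmap protection names."""
    names = []
    if "r" in flags:
        names.append("PROT_READ")
    if "w" in flags:
        names.append("PROT_WRITE")
    if "x" in flags:
        names.append("PROT_EXEC")
    return " | ".join(names) or "PROT_NONE"


def describe(segments):
    """Return one 'start -> end  protection' line per LOAD entry."""
    return [
        f"{vaddr:#07x} -> {vaddr + memsz:#07x}  {prot_names(flags)}"
        for vaddr, memsz, flags in segments
    ]


for line in describe(LOAD_SEGMENTS):
    print(line)
```

Note that memsz is a size, not an end address: the executable entry covers 0x01000 up to 0x2494e.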
What is the kernel doing?
$ cat /proc/537466/maps
00400000-0041b000 r--p 00000000 fe:01 6990903 /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
0041b000-004a8000 r-xp 0001b000 fe:01 6990903 /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
004a8000-004d6000 r--p 000a8000 fe:01 6990903 /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
004d7000-004de000 rw-p 000d6000 fe:01 6990903 /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
004de000-004e6000 rw-p 00000000 00:00 0 [heap]
7ffff7fc6000-7ffff7fca000 r--p 00000000 00:00 0 [vvar]
7ffff7fca000-7ffff7fcc000 r-xp 00000000 00:00 0 [vdso]
7ffff7fcc000-7ffff7fcd000 r--p 00000000 fe:01 6042474 /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
7ffff7fcd000-7ffff7ff1000 r-xp 00001000 fe:01 6042474 /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
7ffff7ff1000-7ffff7ffb000 r--p 00025000 fe:01 6042474 /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
7ffff7ffb000-7ffff7fff000 rw-p 0002e000 fe:01 6042474 /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
7ffffffdd000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
We’re seeing the different segments of /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so being mapped in memory at the base address 0x7ffff7fcc000. The kernel then sets the instruction pointer ($rip) to 0x7ffff7fcc000 (mapping base) + 0x0000000000001050 (start address) = 0x00007ffff7fcd050.
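The arithmetic can be checked with a short Python sketch. It uses a simplified model (an assumption on my side, not the kernel’s actual code): each LOAD entry is rounded out to page boundaries and shifted by the base address. With the numbers from the objdump and /proc/537466/maps listings above, that reproduces the four ld-2.33.so mappings and the value of $rip. The names mappings, page_down and page_up are hypothetical.

```python
PAGE = 0x1000            # 4096-byte pages (AT_PAGESZ in the stack dumps)
BASE = 0x7ffff7fcc000    # where ld-2.33.so was mapped in this run
START = 0x1050           # "start address" reported by objdump

# (file offset, vaddr, memsz, permissions) from the objdump program headers.
SEGMENTS = [
    (0x00000, 0x00000, 0x00b98, "r--"),
    (0x01000, 0x01000, 0x2394e, "r-x"),
    (0x25000, 0x25000, 0x097f4, "r--"),
    (0x2ec00, 0x2fc00, 0x025f0, "rw-"),
]


def page_down(x: int) -> int:
    return x & ~(PAGE - 1)


def page_up(x: int) -> int:
    return (x + PAGE - 1) & ~(PAGE - 1)


def mappings(base, segments):
    """Return (start, end, file offset, perms) for each segment, page aligned."""
    return [
        (base + page_down(vaddr), base + page_up(vaddr + memsz), page_down(off), perms)
        for off, vaddr, memsz, perms in segments
    ]


for start, end, off, perms in mappings(BASE, SEGMENTS):
    print(f"{start:x}-{end:x} {perms}p {off:08x}")
print(f"rip = {BASE + START:#018x}")
```

The last segment is the interesting one: its vaddr 0x2fc00 is not page aligned, so the mapping starts at 0x2f000 and the file offset drops from 0x2ec00 to 0x2e000, which matches the 0002e000 column in the maps output.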
Starting the program
Now that we’re running the interpreter, how does the kernel tell it that the user wanted to execute bash?
The interpreter is invoked directly by the kernel (the logic is to be found in fs/binfmt_elf.c), and the kernel leaves a couple of interesting things on the stack.
The stack is laid out like:
- argc, then argv[0] to argv[argc-1] followed by a NULL pointer, to pass arguments to a program
- envp, the execution environment of said program, also NULL-terminated
- after that comes the lesser-known auxiliary vector, which is a list of (id, value) pairs.
The best documentation I found for the stack layout is in an LWN article. Here is a reproduction:
------------------------------------------------------------- 0x7fff6c845000
0x7fff6c844ff8: 0x0000000000000000
_ 4fec: './stackdump\0' <------+
env / 4fe2: 'ENVVAR2=2\0' | <----+
\_ 4fd8: 'ENVVAR1=1\0' | <---+ |
/ 4fd4: 'two\0' | | | <----+
args | 4fd0: 'one\0' | | | <---+ |
\_ 4fcb: 'zero\0' | | | <--+ | |
3020: random gap padded to 16B boundary | | | | | |
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -| | | | | |
3019: 'x86_64\0' <-+ | | | | | |
auxv 3009: random data: ed99b6...2adcc7 | <-+ | | | | | |
data 3000: zero padding to align stack | | | | | | | |
. . . . . . . . . . . . . . . . . . . . . . . . . . .|. .|. .| | | | | |
2ff0: AT_NULL(0)=0 | | | | | | | |
2fe0: AT_PLATFORM(15)=0x7fff6c843019 --+ | | | | | | |
2fd0: AT_EXECFN(31)=0x7fff6c844fec ------|---+ | | | | |
2fc0: AT_RANDOM(25)=0x7fff6c843009 ------+ | | | | |
ELF 2fb0: AT_SECURE(23)=0 | | | | |
auxiliary 2fa0: AT_EGID(14)=1000 | | | | |
vector: 2f90: AT_GID(13)=1000 | | | | |
(id,val) 2f80: AT_EUID(12)=1000 | | | | |
pairs 2f70: AT_UID(11)=1000 | | | | |
2f60: AT_ENTRY(9)=0x4010c0 | | | | |
2f50: AT_FLAGS(8)=0 | | | | |
2f40: AT_BASE(7)=0x7ff6c1122000 | | | | |
2f30: AT_PHNUM(5)=9 | | | | |
2f20: AT_PHENT(4)=56 | | | | |
2f10: AT_PHDR(3)=0x400040 | | | | |
2f00: AT_CLKTCK(17)=100 | | | | |
2ef0: AT_PAGESZ(6)=4096 | | | | |
2ee0: AT_HWCAP(16)=0xbfebfbff | | | | |
2ed0: AT_SYSINFO_EHDR(33)=0x7fff6c86b000 | | | | |
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . | | | | |
2ec8: environ[2]=(nil) | | | | |
2ec0: environ[1]=0x7fff6c844fe2 ------------------|-+ | | |
2eb8: environ[0]=0x7fff6c844fd8 ------------------+ | | |
2eb0: argv[3]=(nil) | | |
2ea8: argv[2]=0x7fff6c844fd4 ---------------------------|-|-+
2ea0: argv[1]=0x7fff6c844fd0 ---------------------------|-+
2e98: argv[0]=0x7fff6c844fcb ---------------------------+
0x7fff6c842e90: argc=3
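Read from the bottom up, the diagram is just a flat array of 64-bit words starting at the stack pointer. The Python sketch below walks such an array and splits it into argv pointers, environment pointers and auxiliary vector pairs. The function name split_initial_stack is hypothetical, and the sample words are taken from the diagram with only three auxiliary vector entries kept for brevity.

```python
def split_initial_stack(words):
    """Split the words found at the initial stack pointer.

    Returns (argv_pointers, envp_pointers, auxv_pairs).
    """
    argc = words[0]
    argv = list(words[1:1 + argc])
    i = 1 + argc
    if words[i] != 0:
        raise ValueError("argv is not NULL-terminated")
    i += 1

    envp = []
    while words[i] != 0:          # environment pointers run until a NULL
        envp.append(words[i])
        i += 1
    i += 1

    auxv = []
    while True:                   # then (id, value) pairs until AT_NULL (0)
        key, value = words[i], words[i + 1]
        i += 2
        if key == 0:
            break
        auxv.append((key, value))
    return argv, envp, auxv


# Words from the diagram, lowest address first (auxv shortened).
LWN_WORDS = [
    3,                                                # argc
    0x7fff6c844fcb, 0x7fff6c844fd0, 0x7fff6c844fd4,   # argv[0..2]
    0,                                                # end of argv
    0x7fff6c844fd8, 0x7fff6c844fe2,                   # environ[0..1]
    0,                                                # end of environ
    33, 0x7fff6c86b000,                               # AT_SYSINFO_EHDR
    16, 0xbfebfbff,                                   # AT_HWCAP
    6, 4096,                                          # AT_PAGESZ
    0, 0,                                             # AT_NULL
]

argv, envp, auxv = split_initial_stack(LWN_WORDS)
print(len(argv), len(envp), [key for key, _ in auxv])
```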
In our bash example that would look like:
(gdb) x/128a ((long *) $rsp)
0x7fffffffd3d0: 0x1 0x7fffffffd75a # argc argv[0]
0x7fffffffd3e0: 0x0 0x7fffffffd79b # null (end of argv) envp[0]
0x7fffffffd3f0: 0x7fffffffd7c3 0x7fffffffd7d7 # envp[n] envp[n]
[...]
0x7fffffffd5e0: 0x7fffffffefa3 0x7fffffffefac # envp[n] envp[n]
0x7fffffffd5f0: 0x0 0x21 # null (end of envp) AT_SYSINFO_EHDR
0x7fffffffd600: 0x7ffff7fca000 0x10 # value of AT_SYSINFO_EHDR AT_HWCAP
0x7fffffffd610: 0x178bfbff 0x6 # value of AT_HWCAP AT_PAGESZ
0x7fffffffd620: 0x1000 0x11 # value of AT_PAGESZ (4096) AT_CLKTCK
[...]
0x7fffffffd680: 0x0 0x9 # AT_ENTRY
0x7fffffffd690: 0x41db20 <_start> 0xb # value of AT_ENTRY
[...]
0x7fffffffd710: 0x7fffffffefb7 0xf
0x7fffffffd720: 0x7fffffffd749 0x0 # null (end of auxv)
0x7fffffffd730: 0x0 0x96c57f04b7f1a500
0x7fffffffd740: 0x869226983f3e7e3b 0x34365f363878c6
0x7fffffffd750: 0x0 0x732f78696e2f0000
0x7fffffffd760: 0x6278702f65726f74 0x6b333438356c696b
0x7fffffffd770: 0x626862677335356c 0x66677867336c7030
0x7fffffffd780: 0x79732d6d30733178 0x7461702d6d657473
0x7fffffffd790: 0x61622f6e69622f68 0x5243414c41006873
0x7fffffffd7a0: 0x474f4c5f59545449 0x6c412f706d742f3d
0x7fffffffd7b0: 0x2d79747469726361 0x6c2e323931373434
0x7fffffffd7c0: 0x524f4c4f4300676f 0x7572743d4d524554
We’ll find our argv:
(gdb) x /s *(((long *) $rsp) + 1)
0x7fffffffd75a: "/nix/store/pxbkil5843kl55sgbhb0pl3gxgfx1s0m-system-path/bin/bash"
We’ll find our environment:
(gdb) x /s 0x7fffffffd79b
0x7fffffffd79b: "ALACRITTY_LOG=/tmp/Alacritty-447192.log"
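Those strings are already visible in the raw dump: the words from 0x7fffffffd740 onwards are the string area, printed by gdb as little-endian 64-bit integers. The Python sketch below packs the last lines of the dump back into bytes and reads C strings out of them by address. The helper name c_string is made up; the words and addresses are copied from the dump above.

```python
import struct

DUMP_BASE = 0x7fffffffd740
# The last nine lines of the gdb dump, two words per line.
DUMP_WORDS = [
    0x869226983f3e7e3b, 0x0034365f363878c6,
    0x0000000000000000, 0x732f78696e2f0000,
    0x6278702f65726f74, 0x6b333438356c696b,
    0x626862677335356c, 0x66677867336c7030,
    0x79732d6d30733178, 0x7461702d6d657473,
    0x61622f6e69622f68, 0x5243414c41006873,
    0x474f4c5f59545449, 0x6c412f706d742f3d,
    0x2d79747469726361, 0x6c2e323931373434,
    0x524f4c4f4300676f, 0x7572743d4d524554,
]
# x86-64 is little-endian, so "<Q" restores the original byte order.
MEMORY = struct.pack(f"<{len(DUMP_WORDS)}Q", *DUMP_WORDS)


def c_string(addr: int) -> str:
    """Read the NUL-terminated string stored at addr inside the dump."""
    start = addr - DUMP_BASE
    end = MEMORY.find(b"\0", start)
    if end == -1:                 # the string continues past the dump
        end = len(MEMORY)
    return MEMORY[start:end].decode("ascii", errors="replace")


print(c_string(0x7fffffffd75a))   # argv[0]
print(c_string(0x7fffffffd79b))   # envp[0]
print(c_string(0x7fffffffd749))   # value of auxv entry 0xf (AT_PLATFORM)
```

The third lookup uses the pointer stored next to id 0xf near the end of the auxiliary vector and yields "x86_64", the same platform string that appears in the LWN diagram.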
But most importantly, we’ll find our vDSO. It is provided as a memory mapping (as we can see in the /proc/$pid/maps output above), and its address is the value of the AT_SYSINFO_EHDR entry:
(gdb) x /s 0x7ffff7fca000
0x7ffff7fca000: "\177ELF\002\001\001"
Note: we’ll also find the entry point of our program in the AT_ENTRY entry of the auxiliary vector.
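A process can do the same lookup on itself without a debugger. On Linux the auxiliary vector is also exposed as raw bytes in /proc/self/auxv, so a minimal Python sketch, assuming a 64-bit little-endian platform, looks like this. The function names parse_auxv and own_vdso_address are hypothetical.

```python
import struct

AT_NULL = 0
AT_ENTRY = 9
AT_SYSINFO_EHDR = 33


def parse_auxv(raw: bytes) -> dict:
    """Decode a 64-bit little-endian auxiliary vector into {id: value}."""
    auxv = {}
    for key, value in struct.iter_unpack("<QQ", raw):
        if key == AT_NULL:        # the vector ends with an AT_NULL pair
            break
        auxv[key] = value
    return auxv


def own_vdso_address():
    """Return the vDSO base address of the current process (Linux only)."""
    with open("/proc/self/auxv", "rb") as f:
        return parse_auxv(f.read()).get(AT_SYSINFO_EHDR)
```

Fed the pairs from the bash dump, parse_auxv maps 33 to 0x7ffff7fca000 and 9 to 0x41db20.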
Can we override a vDSO?
Here comes the not-a-million-dollar question. Can we actually override a vDSO and provide a different implementation?
Yes, by using ptrace and injecting a new mapping right when execve returns. How do we do that?
I made a simple implementation in https://github.com/baloo/emmett/. Emmett injects a vDSO that intercepts calls to the clock_gettime(2) and gettimeofday(2) syscalls and returns a static value.
When run, this looks like:
$ ./emmett bash -xc "date; sleep 2; date" 2>&1
+ date
Thu Jan 1 12:00:42 AM UTC 1970
+ sleep 2
+ date
Thu Jan 1 12:00:42 AM UTC 1970
For each new process, Emmett attaches and intercepts calls to execve. It lets the execve go through, then, once the tracee stops in the syscall-exit-stop state on return, it parses the stack, injects a new mapping, copies a new implementation into it, and rewrites the stack so that our own vDSO is the one that gets linked.
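The "rewrite the stack" step boils down to replacing one value in the auxiliary vector. The Python sketch below shows that idea on a plain byte string. It is not Emmett’s code: the function name patch_vdso_pointer is hypothetical, and a real tracer would have to read and write these bytes in the traced process through ptrace rather than on a local copy.

```python
import struct

AT_NULL = 0
AT_SYSINFO_EHDR = 33


def patch_vdso_pointer(auxv: bytes, new_base: int) -> bytes:
    """Return a copy of auxv whose AT_SYSINFO_EHDR value is new_base.

    Assumes 64-bit little-endian (id, value) pairs, as on x86-64.
    """
    out = bytearray(auxv)
    for off in range(0, len(out) - 15, 16):
        key, _old = struct.unpack_from("<QQ", out, off)
        if key == AT_NULL:
            break
        if key == AT_SYSINFO_EHDR:
            # Overwrite only the value half of the pair.
            struct.pack_into("<Q", out, off + 8, new_base)
            return bytes(out)
    raise LookupError("no AT_SYSINFO_EHDR entry in this auxiliary vector")
```

Everything else in the vector is left untouched, so whatever later reads the AT_SYSINFO_EHDR entry simply finds the address of the replacement ELF object instead of the kernel’s.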
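How does that work for static binaries?
Golang, for example, even in static builds, parses the ELF provided via AT_SYSINFO_EHDR and calls into the vDSO implementation. Since the override happens in the auxiliary vector rather than in the dynamic loader, static binaries pick up the injected vDSO as well.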
Thanks
Thanks to Adrien for his help and review.