TL;DR: override any vDSO without root or LD_* hacks, even in a static binary

What is a vDSO?

vDSO (virtual dynamic shared object) was introduced in Linux some years ago following a simple observation: some workloads generate a lot of syscalls, and every syscall means switching to kernel space, which has a significant performance impact.

Could we possibly reduce the number of syscalls? Or have them somehow not context-switch?

There are a couple of simpler syscalls, like gettimeofday(2) or clock_gettime(2), that only do a simple lookup and return immediately.

To avoid the switch, couldn’t the kernel just share a segment of memory and have the consumer look up what they need in that shared segment? How do we achieve ABI stability for that segment of memory? Is there a shared object format that is designed with a stable ABI in mind?

Yes there is: it’s called ELF, there is a specification, and it’s been around since SunOS 5.0. So this is what was picked. The vDSO is just an ELF object that the kernel injects into every ELF binary it executes, and it provides an alternative implementation of those syscalls that looks up clocks in a shared memory segment.

How is this injected?

A simple example:

$ file /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
/nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, not stripped
file bash output

What is interesting here is that it is an ELF binary that is dynamically linked. What that means is that it does not contain all the code it requires to run on its own. It will need a set of libraries to be linked at runtime:

$ ldd /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
        linux-vdso.so.1 (0x00007ffe303ce000)
        libreadline.so.7 => /nix/store/r759ls1ah0cmqc8009q7b1xn91sg57q0-readline-7.0p5/lib/libreadline.so.7 (0x00007f721bb2b000)
        libhistory.so.7 => /nix/store/r759ls1ah0cmqc8009q7b1xn91sg57q0-readline-7.0p5/lib/libhistory.so.7 (0x00007f721bb1f000)
        libncursesw.so.6 => /nix/store/iz7bgfvjmwdz3vgmzryhfkqwbn8zr2jn-ncurses-6.2/lib/libncursesw.so.6 (0x00007f721baad000)
        libdl.so.2 => /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/libdl.so.2 (0x00007f721baa8000)
        libc.so.6 => /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/libc.so.6 (0x00007f721b8e3000)
        /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2 => /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib64/ld-linux-x86-64.so.2 (0x00007f721bb7a000)
ldd bash

What pieces everything together?

It’s the ELF interpreter (/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2). This one is provided by glibc, but you could imagine any other interpreter. Each program records the one it wants; it is the "interpreter" field in the file output above.
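
The interpreter path lives in the PT_INTERP program header of the binary. Here is a minimal sketch of how one could read it with nothing but the Python standard library. It assumes a 64-bit little-endian ELF, and the helper name elf_interpreter is made up for this example:

```python
import struct

PT_INTERP = 3  # program header type that holds the interpreter path


def elf_interpreter(data):
    """Return the PT_INTERP path of a 64-bit little-endian ELF image, or None."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("only 64-bit little-endian ELF is handled here")
    # In Elf64_Ehdr, e_phoff is at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # Elf64_Phdr: p_type (u32), p_flags (u32), p_offset (u64) come first,
        # p_filesz (u64) sits 32 bytes into the entry.
        p_type, _p_flags, p_offset = struct.unpack_from("<IIQ", data, base)
        (p_filesz,) = struct.unpack_from("<Q", data, base + 32)
        if p_type == PT_INTERP:
            raw = data[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode()
    return None  # no PT_INTERP: the binary does not ask for an interpreter
```

Feeding it the bytes of the bash binary (open(path, "rb").read()) should give back the same ld-linux-x86-64.so.2 path that file prints. A statically linked binary has no PT_INTERP entry and yields None.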

What is an ELF interpreter?

$ file /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2
/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2: symbolic link to ld-2.33.so
$ file /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), static-pie linked, BuildID[sha1]=9331a61721911fb6517d093c404f34a4d197d403, not stripped
file interp ld-linux.so

The ELF interpreter is a static binary that runs first, loads the shared libraries for you, and then jumps to your code (main or any other entrypoint).

We can see it do its job at the start of each program:

$ LD_DEBUG=symbols ls
    581026:     symbol=__vdso_clock_gettime;  lookup in file=linux-vdso.so.1 [0]
    581026:     symbol=__vdso_gettimeofday;  lookup in file=linux-vdso.so.1 [0]
    581026:     symbol=__vdso_time;  lookup in file=linux-vdso.so.1 [0]
    581026:     symbol=__vdso_getcpu;  lookup in file=linux-vdso.so.1 [0]
    581026:     symbol=__vdso_clock_getres;  lookup in file=linux-vdso.so.1 [0]
    581026:     symbol=_res;  lookup in file=ls [0]
    581026:     symbol=_res;  lookup in file=/nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/librt.so.1 [0]
    581026:     symbol=_res;  lookup in file=/nix/store/pni52adzp1cq83syj5wy1rgallv4nary-acl-2.3.1/lib/libacl.so.1 [0]
    581026:     symbol=_res;  lookup in file=/nix/store/i0hmd3b0m5w501cdq1sx7ip19ba8mc26-attr-2.5.1/lib/libattr.so.1 [0]
[...]
LD_DEBUG=symbols ls

It reads the dynamic section of the program’s ELF, then recursively loads every library and rewrites the symbol tables of each one, preparing the memory as needed for your program to run.

How is it invoked?

$ gdb /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
(gdb) starti
Starting program: /nix/store/pxbkil5843kl55sgbhb0pl3gxgfx1s0m-system-path/bin/bash

Program stopped.
0x00007ffff7fcd050 in _start ()
   from /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-linux-x86-64.so.2
gdb bash

What is this instruction position?

As we’ve seen, the interpreter has to run first. So the kernel has to load it and jump to its start address.

The dynamic loader memory layout will look like:

$ objdump -x /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so | grep -A 10 'start address'
start address 0x0000000000001050

Program Header:
    LOAD off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**12
         filesz 0x0000000000000b98 memsz 0x0000000000000b98 flags r--
    LOAD off    0x0000000000001000 vaddr 0x0000000000001000 paddr 0x0000000000001000 align 2**12
         filesz 0x000000000002394e memsz 0x000000000002394e flags r-x
    LOAD off    0x0000000000025000 vaddr 0x0000000000025000 paddr 0x0000000000025000 align 2**12
         filesz 0x00000000000097f4 memsz 0x00000000000097f4 flags r--
    LOAD off    0x000000000002ec00 vaddr 0x000000000002fc00 paddr 0x000000000002fc00 align 2**12
         filesz 0x0000000000002438 memsz 0x00000000000025f0 flags rw-
objdump interp

That means different segments need to be loaded. The first one (0xb98 bytes at virtual address 0x0) is to be mapped as PROT_READ, the second one (0x2394e bytes at virtual address 0x1000) as PROT_READ | PROT_EXEC, and so on.

Once this is done, the instruction pointer should be set 0x0000000000001050 bytes past the load address. Remember that the interpreter could be mapped anywhere in memory: it is position independent (PIE) and subject to ASLR.

What is the kernel doing?

$ cat /proc/537466/maps
00400000-0041b000 r--p 00000000 fe:01 6990903                            /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
0041b000-004a8000 r-xp 0001b000 fe:01 6990903                            /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
004a8000-004d6000 r--p 000a8000 fe:01 6990903                            /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
004d7000-004de000 rw-p 000d6000 fe:01 6990903                            /nix/store/w07a7k61dw5gnsyxj3kgcq3shr76jax8-bash-interactive-4.4-p23/bin/bash
004de000-004e6000 rw-p 00000000 00:00 0                                  [heap]
7ffff7fc6000-7ffff7fca000 r--p 00000000 00:00 0                          [vvar]
7ffff7fca000-7ffff7fcc000 r-xp 00000000 00:00 0                          [vdso]
7ffff7fcc000-7ffff7fcd000 r--p 00000000 fe:01 6042474                    /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
7ffff7fcd000-7ffff7ff1000 r-xp 00001000 fe:01 6042474                    /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
7ffff7ff1000-7ffff7ffb000 r--p 00025000 fe:01 6042474                    /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
7ffff7ffb000-7ffff7fff000 rw-p 0002e000 fe:01 6042474                    /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so
7ffffffdd000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
/proc memory maps

We can see the different segments of /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/ld-2.33.so mapped in memory starting at 0x7ffff7fcc000. The kernel then sets the instruction pointer ($rip) to 0x00007ffff7fcd050 = 0x7ffff7fcc000 (load address) + 0x0000000000001050 (start address).
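
To double-check the arithmetic, here is a small sketch that recomputes the ld-2.33.so mappings from the LOAD entries printed by objdump. It is a simplification of what the kernel does (it only rounds each segment to page boundaries and assumes 4 KiB pages), and the names in it are mine:

```python
PAGE = 0x1000  # assuming 4 KiB pages

# (vaddr, memsz, flags) of the four LOAD entries from the objdump output above.
LOAD_SEGMENTS = [
    (0x00000, 0x00b98, "r--"),
    (0x01000, 0x2394e, "r-x"),
    (0x25000, 0x097f4, "r--"),
    (0x2fc00, 0x025f0, "rw-"),
]

LOAD_BASE = 0x7ffff7fcc000  # first ld-2.33.so line in /proc/<pid>/maps
ENTRY_OFFSET = 0x1050       # "start address" reported by objdump


def page_down(addr):
    return addr & ~(PAGE - 1)


def page_up(addr):
    return (addr + PAGE - 1) & ~(PAGE - 1)


def mappings(load_base, segments):
    """Yield (start, end, flags) for each segment once loaded at load_base."""
    for vaddr, memsz, flags in segments:
        start = load_base + page_down(vaddr)
        end = load_base + page_up(vaddr + memsz)
        yield start, end, flags


for start, end, flags in mappings(LOAD_BASE, LOAD_SEGMENTS):
    print(f"{start:012x}-{end:012x} {flags}")
print(f"entry point: {LOAD_BASE + ENTRY_OFFSET:#x}")
```

The four ranges it prints line up with the four ld-2.33.so lines in the maps output, and the entry point comes out as 0x7ffff7fcd050, the address gdb stopped at.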

Starting the program

Now that the interpreter is running, how does the kernel tell it that the user wanted to execute bash?

The interpreter is started directly by the kernel (the logic is to be found in fs/binfmt_elf.c), and the kernel leaves a couple of interesting things on the stack.

The stack is laid out like:

  • argc + argv[argc] to pass arguments to a program
  • envp the execution environment of said program
  • after that comes the lesser-known auxiliary vector, which is a list of (id, value) pairs.
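
Walking that layout is mostly a matter of following NULL terminators. A minimal sketch, assuming the stack has already been read as a list of 64-bit words starting at the initial stack pointer (the function name and the abridged sample are only for illustration; the sample values are taken from the LWN diagram reproduced below):

```python
AT_NULL = 0  # id that terminates the auxiliary vector


def walk_initial_stack(words):
    """Split the 64-bit words found at the initial stack pointer.

    Returns (argv_pointers, envp_pointers, auxv) where auxv maps id -> value.
    """
    it = iter(words)
    argc = next(it)
    argv = [next(it) for _ in range(argc)]
    if next(it) != 0:
        raise ValueError("argv is not NULL-terminated")

    envp = []
    for ptr in it:
        if ptr == 0:  # NULL ends the environment pointers
            break
        envp.append(ptr)

    auxv = {}
    for key in it:
        value = next(it)
        if key == AT_NULL:
            break
        auxv[key] = value
    return argv, envp, auxv


# Abridged sample: argc=3, three argv pointers, two environ pointers,
# then AT_SYSINFO_EHDR (33), AT_PAGESZ (6), AT_ENTRY (9) and AT_NULL.
SAMPLE = [
    3, 0x7fff6c844fcb, 0x7fff6c844fd0, 0x7fff6c844fd4, 0,
    0x7fff6c844fd8, 0x7fff6c844fe2, 0,
    33, 0x7fff6c86b000, 6, 4096, 9, 0x4010c0, 0, 0,
]

argv, envp, auxv = walk_initial_stack(SAMPLE)
print(len(argv), len(envp), hex(auxv[33]))
```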

The best documentation I found for the stack layout is in an LWN article. Here is a reproduction:

    ------------------------------------------------------------- 0x7fff6c845000
     0x7fff6c844ff8: 0x0000000000000000
            _  4fec: './stackdump\0'                      <------+
      env  /   4fe2: 'ENVVAR2=2\0'                               |    <----+
           \_  4fd8: 'ENVVAR1=1\0'                               |   <---+ |
           /   4fd4: 'two\0'                                     |       | |     <----+
     args |    4fd0: 'one\0'                                     |       | |    <---+ |
           \_  4fcb: 'zero\0'                                    |       | |   <--+ | |
               3020: random gap padded to 16B boundary           |       | |      | | |
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -|       | |      | | |
               3019: 'x86_64\0'                        <-+       |       | |      | | |
     auxv      3009: random data: ed99b6...2adcc7        | <-+   |       | |      | | |
     data      3000: zero padding to align stack         |   |   |       | |      | | |
    . . . . . . . . . . . . . . . . . . . . . . . . . . .|. .|. .|       | |      | | |
               2ff0: AT_NULL(0)=0                        |   |   |       | |      | | |
               2fe0: AT_PLATFORM(15)=0x7fff6c843019    --+   |   |       | |      | | |
               2fd0: AT_EXECFN(31)=0x7fff6c844fec      ------|---+       | |      | | |
               2fc0: AT_RANDOM(25)=0x7fff6c843009      ------+           | |      | | |
      ELF      2fb0: AT_SECURE(23)=0                                     | |      | | |
    auxiliary  2fa0: AT_EGID(14)=1000                                    | |      | | |
     vector:   2f90: AT_GID(13)=1000                                     | |      | | |
    (id,val)   2f80: AT_EUID(12)=1000                                    | |      | | |
      pairs    2f70: AT_UID(11)=1000                                     | |      | | |
               2f60: AT_ENTRY(9)=0x4010c0                                | |      | | |
               2f50: AT_FLAGS(8)=0                                       | |      | | |
               2f40: AT_BASE(7)=0x7ff6c1122000                           | |      | | |
               2f30: AT_PHNUM(5)=9                                       | |      | | |
               2f20: AT_PHENT(4)=56                                      | |      | | |
               2f10: AT_PHDR(3)=0x400040                                 | |      | | |
               2f00: AT_CLKTCK(17)=100                                   | |      | | |
               2ef0: AT_PAGESZ(6)=4096                                   | |      | | |
               2ee0: AT_HWCAP(16)=0xbfebfbff                             | |      | | |
               2ed0: AT_SYSINFO_EHDR(33)=0x7fff6c86b000                  | |      | | |
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        | |      | | |
               2ec8: environ[2]=(nil)                                    | |      | | |
               2ec0: environ[1]=0x7fff6c844fe2         ------------------|-+      | | |
               2eb8: environ[0]=0x7fff6c844fd8         ------------------+        | | |
               2eb0: argv[3]=(nil)                                                | | |
               2ea8: argv[2]=0x7fff6c844fd4            ---------------------------|-|-+
               2ea0: argv[1]=0x7fff6c844fd0            ---------------------------|-+
               2e98: argv[0]=0x7fff6c844fcb            ---------------------------+
     0x7fff6c842e90: argc=3
Stack layout source=https://lwn.net/Articles/631631/

In our bash example that would look like:

(gdb) x/128a ((long *) $rsp)
0x7fffffffd3d0:        0x1                   0x7fffffffd75a  # argc                      argv[0]
0x7fffffffd3e0:        0x0                   0x7fffffffd79b  # null (end of argv)        envp[0]
0x7fffffffd3f0:        0x7fffffffd7c3        0x7fffffffd7d7  # envp[n]                   envp[n]
[...]
0x7fffffffd5e0:        0x7fffffffefa3        0x7fffffffefac  # envp[n]                   envp[n]
0x7fffffffd5f0:        0x0                   0x21            # null (end of envp)        AT_SYSINFO_EHDR
0x7fffffffd600:        0x7ffff7fca000        0x10            # value of AT_SYSINFO_EHDR  AT_HWCAP
0x7fffffffd610:        0x178bfbff            0x6             #                           AT_PAGESZ
0x7fffffffd620:        0x1000                0x11            # value of AT_PAGESZ (4096)
[...]
0x7fffffffd680:        0x0                   0x9             #                           AT_ENTRY
0x7fffffffd690:        0x41db20 <_start>     0xb             # value of AT_ENTRY
[...]
0x7fffffffd710:        0x7fffffffefb7        0xf
0x7fffffffd720:        0x7fffffffd749        0x0             # null (end of auxv)
0x7fffffffd730:        0x0                   0x96c57f04b7f1a500
0x7fffffffd740:        0x869226983f3e7e3b    0x34365f363878c6
0x7fffffffd750:        0x0                   0x732f78696e2f0000
0x7fffffffd760:        0x6278702f65726f74    0x6b333438356c696b
0x7fffffffd770:        0x626862677335356c    0x66677867336c7030
0x7fffffffd780:        0x79732d6d30733178    0x7461702d6d657473
0x7fffffffd790:        0x61622f6e69622f68    0x5243414c41006873
0x7fffffffd7a0:        0x474f4c5f59545449    0x6c412f706d742f3d
0x7fffffffd7b0:        0x2d79747469726361    0x6c2e323931373434
0x7fffffffd7c0:        0x524f4c4f4300676f    0x7572743d4d524554
gdb stack

We’ll find our argv:

(gdb) x /s *(((long *) $rsp) + 1)
0x7fffffffd75a:	"/nix/store/pxbkil5843kl55sgbhb0pl3gxgfx1s0m-system-path/bin/bash"
gdb argv

We’ll find our environment:

(gdb) x /s 0x7fffffffd79b
0x7fffffffd79b:	"ALACRITTY_LOG=/tmp/Alacritty-447192.log"
gdb envp

But most importantly, we’ll find our vDSO. It is provided as a memory mapping (as we can see in the /proc/$pid/maps output above), and its address is the value of AT_SYSINFO_EHDR:

(gdb) x /s 0x7ffff7fca000
0x7ffff7fca000:	"\177ELF\002\001\001"
hello hello, here is magics

Note: we’ll also find the entrypoint for our program in the AT_ENTRY of the auxiliary vector.
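
You don’t need gdb to look at this: Linux also exposes the auxiliary vector of a process as raw (id, value) pairs in /proc/<pid>/auxv. Here is a small sketch that decodes it, assuming a 64-bit little-endian system. The helper names are mine, and own_vdso_magic peeks at the process’s own memory through ctypes, so it only makes sense on Linux:

```python
import ctypes
import struct

AT_NULL = 0
AT_ENTRY = 9
AT_SYSINFO_EHDR = 33


def parse_auxv(raw):
    """Decode a raw auxiliary vector (as found in /proc/<pid>/auxv) into a dict."""
    auxv = {}
    for key, value in struct.iter_unpack("<QQ", raw):
        if key == AT_NULL:
            break
        auxv[key] = value
    return auxv


def own_vdso_magic():
    """Return the first 4 bytes at this process's AT_SYSINFO_EHDR (Linux only)."""
    with open("/proc/self/auxv", "rb") as f:
        auxv = parse_auxv(f.read())
    # The vDSO is mapped in our own address space, so we can read it directly.
    return ctypes.string_at(auxv[AT_SYSINFO_EHDR], 4)
```

On a system where the vDSO is enabled, own_vdso_magic() should return the same "\177ELF" magic gdb showed above.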

Can we override a vDSO?

Here comes the not-a-million dollar question. Can we actually override a vDSO and provide a different implementation?

Yes, using ptrace, and injecting a new mapping just when execve returns. How do we do that?

I made a simple implementation in https://github.com/baloo/emmett/. Emmett injects a vDSO that intercepts calls to clock_gettime(2) and gettimeofday(2) and returns a static value.

On run, this will look like:

$ ./emmett bash -xc "date; sleep 2; date" 2>&1
+ date
Thu Jan  1 12:00:42 AM UTC 1970
+ sleep 2
+ date
Thu Jan  1 12:00:42 AM UTC 1970
emmett bash

For each new process, Emmett attaches and intercepts calls to execve. It lets the execve go through, then stops the process in the syscall-exit-stop state on return. At that point it parses the stack, injects a new mapping, copies a new implementation into it, and rewrites the stack so that our own vDSO is the one that gets linked.
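
The "rewrite the stack" step boils down to patching one value in the auxiliary vector. This is not Emmett’s actual code, just a sketch of the idea on a raw auxv buffer, assuming 64-bit little-endian entries. A real tracer would read and write those words in the tracee’s memory, for example with PTRACE_PEEKDATA and PTRACE_POKEDATA:

```python
import struct

AT_NULL = 0
AT_SYSINFO_EHDR = 33


def retarget_vdso(auxv_bytes, new_base):
    """Return a copy of a raw auxv with AT_SYSINFO_EHDR pointing at new_base."""
    out = bytearray(auxv_bytes)
    # Each entry is 16 bytes: a 64-bit id followed by a 64-bit value.
    for offset in range(0, len(out) - 15, 16):
        key, _value = struct.unpack_from("<QQ", out, offset)
        if key == AT_NULL:
            break
        if key == AT_SYSINFO_EHDR:
            struct.pack_into("<Q", out, offset + 8, new_base)
            return bytes(out)
    raise LookupError("no AT_SYSINFO_EHDR entry in this auxv")
```

Anything that later looks up AT_SYSINFO_EHDR then lands on the replacement mapping instead of the kernel-provided one.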

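How does that work for static binaries?

A static binary has no ELF interpreter, but the kernel still maps the vDSO and passes its address in the auxiliary vector. Go, for example, even in static builds, parses the ELF image provided as AT_SYSINFO_EHDR and calls into the vDSO implementation. Since the override happens in the auxiliary vector, it works there too.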

Thanks

Thanks to Adrien for his help and review.