memfd_secret() and the illusion of secret memory

Introduced in Linux 5.14, memfd_secret is a system call that allows a userspace process to create memory regions inaccessible to anything outside of it – including the kernel itself. According to the man page:

memfd_secret() creates an anonymous RAM-based file and returns a file descriptor that refers to it. The memory areas backing the file created with memfd_secret(2) are visible only to the processes that have access to the file descriptor. The memory region is removed from the kernel page tables and only the page tables of the processes holding the file descriptor map the corresponding physical memory.

The original patch has sparked some contention in the mm community. On the one hand, reduction in attack surface is seen as a welcome change. On the other hand, memfd_secret is inherently easy to circumvent via malicious kernel modules or custom kexec images. One such approach – remapping pages back into the kernel page table – has been demonstrated by Jonathon Reinhart in nosecmem.

This article, however, presents a different, much less invasive approach. Instead of manipulating kernel memory, we are going to make a process spill its secrets of its own accord.

Kernel developers will rightly point out that none of the following constitutes a bypass. memfd_secret was never designed to withstand a privileged attacker; it is fundamentally a mitigation that attempts to protect sensitive memory against kernel-level exploits.

For this reason, the approach covered in this article is not an attack and should not be taken as such. There is no vulnerability, no architectural oversight, and no exploitation is taking place.

But since this memory cannot be trivially accessed from userspace, it still leaves a real forensic gap. It’s a legitimate blind spot for IR and forensics tooling, and from the adversarial side – a convenient place to stash secrets that resists the standard memory inspection playbook.

The point of this article is to bridge that gap and present a more grounded outlook on what the mechanism does and doesn’t guarantee.

Setting the scene: what card am I thinking of?

As memfd_secret was disabled by default before Linux 6.5, its real-world adoption remains practically nonexistent. For this reason, an example program is used to demonstrate the concepts.

For the sake of this demonstration, we assume the role of a mentalist. The following program generates a random playing card and prompts us to guess it. The buffer for the playing card is backed by secretmem.

static const char *pick_card(void) {
    const char *rank = select_random_rank();
    const char *suit = select_random_suit();

    char *card = alloc_secret(PAGE_SIZE);
    snprintf(card, PAGE_SIZE, "%s of %s", rank, suit);

    return card;
}

int main(void) {
    const char *card = pick_card();	
    const char *guess = readline("What card am I thinking of? ");

    if (strcasecmp(guess, card) == 0)
        puts("Lucky guess...");
    else
        puts("Nice try");
}

The goal is simple – guess the playing card.
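
The alloc_secret helper is not shown in the listing. A minimal sketch of what it might look like follows, assuming the raw system call is used (the man page notes that glibc provides no wrapper) and assuming the two-argument form that appears later in the article. The fallback syscall number and the error handling are assumptions of this sketch, not the author's code.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447 /* syscall number on x86_64 and arm64 */
#endif

/* Hypothetical helper: map `size` bytes of secretmem.
 * Returns the mapping, or NULL on failure (for example when the kernel
 * lacks secretmem support). The descriptor stays open; closing it is the
 * caller's decision and does not tear down the mapping. */
static void *alloc_secret(size_t size, int *fd_out) {
    int fd = (int)syscall(SYS_memfd_secret, 0);
    if (fd == -1)
        return NULL;

    /* A fresh secretmem file is empty; give it a size before mapping. */
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return NULL;
    }

    /* The maps output in this article shows the region as shared (rw-s). */
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    if (fd_out != NULL)
        *fd_out = fd;
    return mem;
}
```

A caller that gets NULL back should fall back to ordinary memory or bail out; on kernels where secretmem is disabled the system call itself fails.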

Inspecting the deck: Cross-memory operations

Reading data via `/proc/PID/mem`

A privileged process can access the address space of a different process via its /proc/PID/mem interface. This is the mechanism debuggers like gdb use to provide memory inspection capabilities (see gdb/linux-nat.c, line 180).

Only valid memory areas can be accessed this way. In this context, valid means backed by a virtual memory area (VMA), be it heap, stack or memory explicitly mapped by a prior mmap system call. The list of VMAs for a particular process can be obtained from its /proc/PID/maps or /proc/PID/map_files interface. For instance:

# Find the VMA corresponding to the heap 
$ grep heap /proc/103501/maps
5d16813c5000-5d16813e6000 rw-p 00000000 00:00 0    [heap]

# Read the first 16 bytes of the heap
$ xxd -s 0x5d16813c5000 -l 16 /proc/103501/mem
00000000: 0000 0000 0000 0000 1104 0000 0000 0000  ................

The read succeeds. Now, what happens if you do the same for secretmem?

# Find the VMA corresponding to secretmem
$ grep secretmem /proc/103501/maps
7a698e98e000-7a698e98f000 rw-s 00000000 00:0e 149128  /secretmem (deleted)

# Read the first 16 bytes of secretmem 
$ xxd -s 0x7a698e98e000 -l 16 /proc/103501/mem
xxd: Input/output error

The kernel rejects our read attempt with EIO, which is triggered by the read system call:

$ perf trace -e openat,read xxd -s 0x7a698e98e000 -l 16 /proc/103501/mem
openat(dfd: CWD, filename: "/proc/103501/mem") = 3
read(fd: 3, buf: 0x615233044000, count: 1) = -1 (unknown) (Input/output error)
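
The same read can be done programmatically. The sketch below is one possible way to do it, assuming a hypothetical helper named read_proc_mem: it opens /proc/PID/mem and uses the file offset as the target address, which is what xxd -s does above. For a secretmem address it would be expected to return -1 with errno set to EIO, matching the trace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read `len` bytes at virtual address `addr` of
 * process `pid`. Returns the number of bytes read, or -1 with errno set. */
static ssize_t read_proc_mem(pid_t pid, uintptr_t addr, void *buf, size_t len) {
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/mem", (long)pid);

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* The file offset is interpreted as an address in the target. */
    ssize_t n = pread(fd, buf, len, (off_t)addr);

    /* Keep the errno from pread, not the one from close. */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return n;
}
```

Reading another process this way is subject to the usual ptrace access checks, so the caller typically needs to be privileged or to own the target.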

Reading data with `process_vm_readv` and `ptrace(PTRACE_PEEKDATA)`

These two system calls can also be used to access memory of a different process, not unlike the /proc/PID/mem interface covered above. They too can be used to implement debugger capabilities, but their limitations make them a less common choice in practice. For the outline of such limitations, an interested reader should refer to gdb/linux-nat.c, line 180.

Let’s run the following snippet (omitting out-of-scope API details and ptrace boilerplate):

long word1 = ptrace_peek(pid, addr);
show_data_or_error(word1, "ptrace");

long word2 = process_vm_readv_word(pid, addr);
show_data_or_error(word2, "process_vm_readv");

Let’s first try to access some non-protected region:

# Find the VMA corresponding to an ELF segment
$ grep a.out /proc/103501/maps
5586cc8c4000-5586cc8e0000 r--p 00000000 00:1c 3882881   a.out

# Read the first 8 bytes of the segment
$ ./reader 103501 0x5586cc8c4000
          ptrace: 0x7f 0x45 0x4c 0x46 0x02 0x01 0x01 0x00
process_vm_readv: 0x7f 0x45 0x4c 0x46 0x02 0x01 0x01 0x00

Now we pass the address of secretmem:

# Find the VMA corresponding to secretmem
$ grep secretmem /proc/103501/maps
7a698e98e000-7a698e98f000 rw-s 00000000 00:0e 149128    /secretmem (deleted)

# Read the first 8 bytes of secretmem 
$ ./reader 103501 0x7a698e98e000 
ptrace: Input/output error
process_vm_readv: Bad address

Yet again, the kernel simply rejects our read attempts.

A kernel view into cross-memory operations

So far, we have tried three different ways to access secretmem, to no avail. /proc/PID/mem, process_vm_readv and PTRACE_PEEKDATA - that’s three, right? But don’t they all look suspiciously similar? So similar, in fact, that it makes you wonder - aren’t they actually the same thing?

To answer this question, let’s trace each one of them and see if they converge into a shared code path. Full ftrace outputs on Linux 6.12: mem_read, ptrace_access_vm, process_vm_readv.

Abbreviated call chains:

[Figure: abbreviated call chains for mem_read, ptrace_access_vm and process_vm_readv]

Looking at the call graphs, we can conclude they do indeed converge into a shared code path – one starting with __get_user_pages.

Get User Pages, or GUP for short, is a kernel interface that allows kernel code to obtain references to physical pages belonging to userspace, enabling direct access to memory of another process. It is also used to pin user pages in memory, preventing them from being reclaimed or swapped out. If you are familiar with the mlock system call, GUP is the mechanism used to implement it.

Going back to the man page:

Once a region for a memfd_secret() memory mapping is allocated, the user can’t accidentally pass it into the kernel to be transmitted somewhere. The memory pages in this region cannot be accessed via the direct map and they are disallowed in get_user_pages.

The last bit directly explains why all of our read attempts have failed - /proc/PID/mem, PTRACE_PEEKDATA and process_vm_readv all internally land in get_user_pages, which explicitly disallows secretmem accesses.

But where exactly does this check take place? From the ftrace output linked above, we can see that before a page table walk is initiated, get_user_pages calls into check_vma_flags, which in turn checks whether the specified VMA corresponds to a secret memory region:

  6)               |    __get_user_pages() {
  6)               |      gup_vma_lookup() {
  6)               |        find_vma() {
  6)   0.291 us    |          __rcu_read_lock();
  6)   0.291 us    |          __rcu_read_unlock();
  6)   1.893 us    |        }
  6)   2.465 us    |      }
  6)               |      check_vma_flags() {
  6)   0.361 us    |        vma_is_secretmem();  <-------
  6)   0.942 us    |      }
  6)   0.290 us    |      __cond_resched();
  6)   0.290 us    |      vma_pgtable_walk_begin();
  6)               |      follow_page_pte();
  6)   0.290 us    |      vma_pgtable_walk_end();
  6) + 12.414 us   |    }

Let’s look at the Linux 6.12 source code:

// https://elixir.bootlin.com/linux/v6.12.80/source/mm/gup.c#L1270
static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
{
	vm_flags_t vm_flags = vma->vm_flags;
	int write = (gup_flags & FOLL_WRITE);
	int foreign = (gup_flags & FOLL_REMOTE);
	bool vma_anon = vma_is_anonymous(vma);

    /* -- snip -- */

	if (vma_is_secretmem(vma))
		return -EFAULT;
    
    /* -- snip -- */
}

// https://elixir.bootlin.com/linux/v6.12.80/source/mm/secretmem.c#L139
bool vma_is_secretmem(struct vm_area_struct *vma)
{
	return vma->vm_ops == &secretmem_vm_ops;
}

According to the snippets, if the requested VMA corresponds to secretmem, check_vma_flags returns an error, which makes __get_user_pages fail early:

// https://elixir.bootlin.com/linux/v6.12.80/source/mm/gup.c#L1427
static long __get_user_pages(struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		int *locked)
{
	long ret = 0, i = 0;
	struct vm_area_struct *vma = NULL;

	/* -- snip -- */

	/* the error from check_vma_flags (-EFAULT for secretmem) is propagated as-is */
	ret = check_vma_flags(vma, gup_flags);
	if (ret)
		goto out;

	/* -- snip -- */
out:
	return i ? i : ret;
}

Inspecting the deck: Accessing open files

Reading out file descriptors via `/proc/PID/fd`

We have hit a wall with get_user_pages, but cross-memory operations are not the only way. What happens if we try to exfiltrate secret memory via the /proc/PID/fd interface? After all, there’s no reason for it to go through get_user_pages.

/proc/PID/fd is a subdirectory containing one entry for each file which the process has open, named by its file descriptor, and which is a symbolic link to the actual file.

Our target process contains the following entries:

total 0
lrwx------ 1 root root 64 Apr  6 01:55 0 -> /dev/pts/4
lrwx------ 1 root root 64 Apr  6 01:55 1 -> /dev/pts/4
lrwx------ 1 root root 64 Apr  6 01:55 2 -> /dev/pts/4
lrwx------ 1 root root 64 Apr  6 01:55 3 -> '/secretmem (deleted)'

Normal files can be accessed or executed via these symbolic links. For example, in older versions of glibc (2.26 and earlier) this was used to implement fexecve, a function that executes a program specified via the file descriptor (see fexecve.c).

Fun fact: since an open file descriptor keeps its inode alive, deleting a file from the filesystem prior to fexecve is often used as a rudimentary form of fileless execution in adversarial code.

But could you read /proc/PID/fd to access secretmem?

$ perf trace -e openat,read xxd /proc/103501/fd/3
openat(dfd: CWD, filename: "/proc/103501/fd/3") = -1 (unknown) (No such device or address)

This time, the caller cannot even open the file, let alone read it. What’s going on?

The VFS and secretmem_fops

The Linux kernel supports a large number of different filesystems: some disk-based, some in-memory and others – purely virtual. In order to abstract implementation details of such filesystems and to provide a unified interface to userspace, the kernel defines a layer called the Virtual File System, or VFS. To register with VFS, kernel code that wants to implement a file-like interface provides a table of handlers that map high-level VFS operations to implementation-specific code.

For instance, the ext2 filesystem provides the following interface:

// https://elixir.bootlin.com/linux/v6.12.80/source/fs/ext2/file.c#L311
const struct file_operations ext2_file_operations = {
	.llseek		= generic_file_llseek,
	.read_iter	= ext2_file_read_iter,
	.write_iter	= ext2_file_write_iter,
	.unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl	= ext2_compat_ioctl,
#endif
	.mmap		= ext2_file_mmap,
	.open		= ext2_file_open,
	.release	= ext2_release_file,
	.fsync		= ext2_fsync,
	.get_unmapped_area = thp_get_unmapped_area,
	.splice_read	= filemap_splice_read,
	.splice_write	= iter_file_splice_write,
};

For opening files, there is ext2_file_open, and for closing them – ext2_release_file. You get the idea. The full list of supported VFS operations can be viewed here. A filesystem can choose to implement only a subset of these operations.
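
To make the idea of a handler table with optional entries concrete, here is a toy userspace model. It is not kernel code: the struct, the function names and the choice of -EINVAL for a missing handler are all assumptions made for illustration. The point is only that a generic layer can dispatch through the table and has nothing to call when a slot is left empty.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Toy stand-in for struct file_operations: unset handlers stay NULL. */
struct toy_file_ops {
    ssize_t (*read)(char *buf, size_t len);
    int (*release)(void);
};

static ssize_t hello_read(char *buf, size_t len) {
    static const char msg[] = "hello";
    size_t n = len < sizeof msg ? len : sizeof msg;
    memcpy(buf, msg, n);
    return (ssize_t)n;
}

static int noop_release(void) {
    return 0;
}

/* Generic layer: dispatch to the handler if the implementation has one. */
static ssize_t toy_vfs_read(const struct toy_file_ops *ops, char *buf, size_t len) {
    if (ops->read == NULL)
        return -EINVAL; /* nothing to call: the operation is unsupported */
    return ops->read(buf, len);
}

/* One implementation fills in every slot, the other only release. */
static const struct toy_file_ops full_ops = {
    .read = hello_read,
    .release = noop_release,
};
static const struct toy_file_ops release_only_ops = {
    .release = noop_release,
};
```

Calling toy_vfs_read with full_ops copies the message, while the same call with release_only_ops fails without any explicit "deny" logic in the implementation.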

With that in mind, let’s take a look at the interface defined for files created with memfd_secret:

// https://elixir.bootlin.com/linux/v6.12.80/source/mm/secretmem.c#L144
static const struct file_operations secretmem_fops = {
	.release	= secretmem_release,
	.mmap		= secretmem_mmap,
};

Oh, well. That explains it. In our last experiment, we tried to open a secretmem file from its symbolic link in /proc/PID/fd – but there was never an interface to open such files to begin with. The kernel didn’t have to prevent us from opening the file – it itself doesn’t know how to do it. The only operations it knows about are mmap and release. If only there were a way to get a hold of an open file descriptor for us to map.

Stealing open file descriptors

Let’s revisit the man page for memfd_secret:

The memory areas backing the file created with memfd_secret(2) are visible only to the processes that have access to the file descriptor.

According to this sentence, if a process somehow acquires this file descriptor, it should be able to access the memory areas backing it. To test this, we are going to use the pidfd_getfd system call. From the man page:

The pidfd_getfd() system call allocates a new file descriptor in the calling process. This new file descriptor is a duplicate of an existing file descriptor, targetfd, in the process referred to by the PID file descriptor pidfd. The duplicate file descriptor refers to the same open file description as the original file descriptor in the process referred to by pidfd.

The following program calls pidfd_getfd to acquire the file descriptor, then tries to map it and read the contents directly from its own local memory:

static int steal_fd(pid_t pid, int targetfd) {
    int pidfd  = pidfd_open(pid, 0);
    int copied_fd = pidfd_getfd(pidfd, targetfd, 0);
    
    close(pidfd);
    return copied_fd;
}

int main(int argc, char *argv[]) {
    pid_t pid = strtoul(argv[1], NULL, 10);
    int targetfd = strtoul(argv[2], NULL, 10);

    int fd = steal_fd(pid, targetfd);
    const char *addr = map_file_pages(fd); 

    if (addr == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    printf("%s\n", addr);
}

Will this work?

$ ./steal-and-map-fd 103501 3
Ace of Spades
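
The listing above leaves out error handling and the map_file_pages helper, and it assumes pidfd_open and pidfd_getfd wrappers are available, which is not the case on older glibc versions. A more defensive sketch might look like this. The sys_ prefixed wrappers, the fallback syscall numbers and the extra len_out parameter of map_file_pages are assumptions of the sketch.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Thin raw-syscall wrappers, usable regardless of the glibc version. */
static int sys_pidfd_open(pid_t pid, unsigned int flags) {
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

static int sys_pidfd_getfd(int pidfd, int targetfd, unsigned int flags) {
    return (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, flags);
}

/* Duplicate descriptor `targetfd` of process `pid` into the caller.
 * Returns the new descriptor, or -1 with errno set. */
static int steal_fd(pid_t pid, int targetfd) {
    int pidfd = sys_pidfd_open(pid, 0);
    if (pidfd == -1)
        return -1;

    int copied_fd = sys_pidfd_getfd(pidfd, targetfd, 0);
    close(pidfd);
    return copied_fd;
}

/* Hypothetical map_file_pages: map the whole file read-only and shared.
 * Returns MAP_FAILED on error; the mapped length goes to *len_out. */
static const char *map_file_pages(int fd, size_t *len_out) {
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size <= 0)
        return MAP_FAILED;

    void *mem = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (mem != MAP_FAILED && len_out != NULL)
        *len_out = (size_t)st.st_size;
    return mem;
}
```

pidfd_getfd is gated by a ptrace attach-level access check, so stealing a descriptor from an unrelated process generally requires the same privileges as attaching a debugger to it.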

Intermission

We have just discovered a practical way to access the contents of a secret memory region, but there’s a catch. Unsurprisingly, a process does not need to keep a file descriptor open after it has mapped it. And without an open file descriptor there is nothing for us to steal.

Let’s tweak pick_card a little and restart the program:

static const char *pick_card(void) {
    const char *rank = select_random_rank();
    const char *suit = select_random_suit();

 ++ int secret_fd;

 ++ char *card = alloc_secret(PAGE_SIZE, &secret_fd);
    snprintf(card, PAGE_SIZE, "%s of %s", rank, suit);

 ++ close(secret_fd);
    return card;
}

The file descriptor is gone, yet the VMA is still there:

$ grep secretmem /proc/220842/maps
78ddc590e000-78ddc590f000 rw-s 00000000 00:0e 340064    /secretmem (deleted)

The Broken Illusion

We have tried using ptrace for its cross-memory capabilities earlier, but now it’s time to give it a proper introduction:

The ptrace() system call provides a means by which one process (the “tracer”) may observe and control the execution of another process (the “tracee”), and examine and change the tracee’s memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

Since we cannot rely on cross-memory operations, and the target process is the only entity in the entire system holding a reference to the memory, our last resort is to actually read its mind.

We are now entering the territory of runtime process manipulation. The idea is far from new – ptrace has a long track record in offensive security. Userspace rootkits and C2 implants have used it extensively to inject into system processes and manipulate their behavior, to exfiltrate sensitive data and to establish covert communication channels.

But today our motives are benign. We simply want to guess a playing card. Assuming we can make the process do whatever we want, what is it exactly that we want from it?

Fundamentally, our process must spill its secrets somewhere: to standard output, to files on disk, to another mapping, or to another process via cross-memory operations. Any of these would work for a simple proof of concept, and it is certainly enough to guess the playing card our process has in mind.

Now, what if there are many such secret regions? Many playing cards, if you will. It makes sense to store each one into a unique file, possibly prefixed with a PID and some kind of timestamp. Making the process itself responsible for opening and closing files is rather inconsiderate. We could inject arbitrarily complex shellcode and data into the address space – but we don’t need to.

All the process needs is a gentle nudge. In stark contrast to shellcode chains, where all instructions are related, system calls can be injected in isolation from one another, with minimal footprint. To minimize the number of system calls, the task of file management is best offloaded to the tracer. The tracee, in turn, simply funnels everything through an IPC interface, freed from laborious resource management.

There are many ways to implement this idea, and desecrate – a companion tool I developed for this demo – is one of them. How exactly it scans secretmem regions and what IPC it employs for exfiltration is not particularly relevant, but here is the outline:

1. Obtain the list of secretmem regions from /proc/PID/map_files.
2. Inject a memfd_create() system call to get a file descriptor for an anonymous memory region.
3. Steal the file descriptor from the tracee with pidfd_getfd().
4. In the tracer, set up a mapping for the stolen file descriptor.
5. Inject a pwrite() system call to dump data from secretmem into the file descriptor. Since the tracer now has a mapping for this anonymous memory, the contents are immediately visible and do not require additional system calls to retrieve.
6. Dump the mapping into a file that uniquely identifies the secretmem region.
7. Repeat steps 5 and 6 for all regions obtained during step 1.
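
The first step of the outline can be sketched as follows. This is not how desecrate does it: the sketch parses the text format of /proc/PID/maps (as the grep commands earlier do) rather than /proc/PID/map_files, simply because it is easier to show. The function name find_secret_areas is hypothetical, and the secret_area fields are assumed to be just a base and a size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout: only the two fields the exfiltration code needs. */
typedef struct {
    uintptr_t base;
    size_t size;
} secret_area;

/* Collect up to `max` secretmem regions from a maps-formatted stream.
 * Returns the number of regions stored in `out`. */
static size_t find_secret_areas(FILE *maps, secret_area *out, size_t max) {
    char line[512];
    size_t found = 0;

    while (found < max && fgets(line, sizeof line, maps) != NULL) {
        unsigned long start, end;
        int path_off = 0;

        /* Line format: start-end perms offset dev inode [path] */
        if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %n",
                   &start, &end, &path_off) < 2 || path_off == 0)
            continue;

        /* memfd_secret mappings are listed under the name "/secretmem". */
        if (strncmp(line + path_off, "/secretmem", 10) != 0)
            continue;

        out[found].base = (uintptr_t)start;
        out[found].size = (size_t)(end - start);
        found++;
    }
    return found;
}
```

In a tracer this would be fed the stream returned by fopen on the target's maps file. Matching on the path name alone is a simplification; a regular file whose path starts with /secretmem would also match.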

A brief note on the reasoning behind the use of memfd_create. It would indeed be possible to simply inject open and use write to directly dump secretmem pages to files. However, that would mean injecting filename strings into the tracee, either by overwriting existing regions and restoring them later, or by setting up a new scratch page with mmap. This works, but it’s inelegant. It would also be possible to inject a pipe system call, obtain the read side and exchange data, but the default pipe capacity is 16 pages, which doesn’t scale well. Using pipes also requires at least 2 system calls per secretmem region (tracee write, tracer read).

memfd_create() resolves both of the concerns:

It creates an unnamed region, which obviates filename string injection.
Mapping this region in the tracer makes the contents directly visible, without additional system calls.
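
The second property is easy to check in isolation. In the sketch below a single process plays both roles: it creates the memfd, maps it, writes through the descriptor with pwrite, and then reads the data back through the mapping without any further system call. The helper name make_exchange_buffer and the up-front ftruncate are assumptions of this sketch; in the real flow the descriptor is created in the tracee and stolen by the tracer.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous file of `size` bytes and map
 * it shared and read-only. Returns the mapping (or MAP_FAILED) and stores
 * the descriptor in *fd_out. */
static void *make_exchange_buffer(size_t size, int *fd_out) {
    /* The name is only a label shown in /proc; it is not a path. */
    int fd = memfd_create("exchange", MFD_CLOEXEC);
    if (fd == -1)
        return MAP_FAILED;

    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return MAP_FAILED;
    }

    /* A shared mapping observes every write made through the descriptor. */
    void *buf = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }

    *fd_out = fd;
    return buf;
}
```

After a pwrite to the descriptor at offset 0, the first bytes of the mapping hold the written data, which is the property the exfiltration step relies on.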

This was a mouthful, but once you understand how system calls are injected, the rest becomes boilerplate. The core architecture-agnostic function that takes care of system call injection looks as follows:

long target_syscall(target_ctx *target, int syscall, size_t argc, long *argv) {
    if (target_get_regset(target) == -1)
        return -1;

    /* backup process context to restore later */
    arch_regs saved_regs = target->regs;

    /* insert arch-dependent syscall instruction at PC */
    long saved_insn = target_peek_pc(target);
    if (target_poke_pc(target, ARCH_SYSCALL_INSN) == -1)
        return -1;

    /* setup syscall arguments according to the calling convention */
    arch_syscall_setup(&target->regs, syscall, argc, argv);

    /* commit syscall arguments, singlestep, get results */
    if (target_set_regset(target) == -1
        || ptrace_singlestep(target->pid) == -1
        || target_wait_tid(target->pid) == -1
        || target_get_regset(target) == -1)
        return -1;

    /* save syscall return code */
    long retval = arch_syscall_ret(&target->regs);
    target->regs = saved_regs;

    /* restore saved context without resuming */
    if (target_poke_pc(target, saved_insn) == -1
        || target_set_regset(target) == -1)
        return -1;

    return retval;
}

The architecture-specific bits – like opcodes, calling conventions, and the choice of register for the return value – are hidden behind a small set of arch_* helpers.

x86_64

/* Syscalls for x86_64 :
 *   - registers are 64-bit
 *   - syscall number is passed in rax
 *   - arguments are in rdi, rsi, rdx, r10, r8, r9 respectively
 *   - the system call is performed by calling the syscall instruction
 *   - syscall return comes in rax
 *   - rcx and r11 are clobbered, others are preserved.
 */

#define ARCH_SYSCALL_INSN 0x050F

static inline void arch_syscall_setup(arch_regs *regs,
    int syscall, size_t argc, long *argv) {
    regs->rax = syscall;

    switch (argc) {
        default:
        case 6: regs->r9  = argv[5];
        case 5: regs->r8  = argv[4];
        case 4: regs->r10 = argv[3];
        case 3: regs->rdx = argv[2];
        case 2: regs->rsi = argv[1];
        case 1: regs->rdi = argv[0];
        case 0: break;
    }
}

static inline long arch_syscall_ret(arch_regs *regs) {
    return regs->rax;
}

static inline long arch_pc(arch_regs *regs) {
    return regs->rip;
}
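
A note on the constant: the syscall instruction is encoded as the two bytes 0F 05, and since x86_64 is little-endian, a word with the value 0x050F has exactly those bytes at its lowest addresses. Poking that word also zeroes the bytes that follow, which is fine here because the original word is restored afterwards. An alternative, shown below purely as a sketch with a hypothetical helper name, is to merge the opcode into the saved word so the trailing bytes stay untouched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opcode bytes in memory order. */
static const uint8_t X86_64_SYSCALL[] = { 0x0F, 0x05 };

/* Hypothetical helper: overwrite the first `len` bytes of `word`
 * (in memory order) with `insn`, leaving the remaining bytes as they were. */
static long patch_insn(long word, const uint8_t *insn, size_t len) {
    if (len > sizeof word)
        len = sizeof word;
    memcpy(&word, insn, len);
    return word;
}
```

On a little-endian host, patching a zero word this way yields 0x050F, the same value as ARCH_SYSCALL_INSN above.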

ARM64

/* Syscalls for AARCH64 :
 *   - registers are 64-bit
 *   - stack is 16-byte aligned
 *   - syscall number is passed in x8
 *   - arguments are in x0, x1, x2, x3, x4, x5
 *   - the system call is performed by calling svc 0
 *   - syscall return comes in x0.
 */

static inline void arch_syscall_setup(arch_regs *regs,
    int syscall, size_t argc, long *argv) {
    regs->x8 = syscall;
    regs->sp = align_down(regs->sp, 16);

    switch (argc) {
        default:
        case 6: regs->x5 = argv[5];
        case 5: regs->x4 = argv[4];
        case 4: regs->x3 = argv[3];
        case 3: regs->x2 = argv[2];
        case 2: regs->x1 = argv[1];
        case 1: regs->x0 = argv[0];
        case 0: break;
    }
}

static inline long arch_syscall_ret(arch_regs *regs) {
    return regs->x0;
}

target_syscall is only called indirectly via helpers. For example:

static inline int target_sys_pwrite(target_ctx *target, int fd,
    uintptr_t buffer, size_t size, off_t off) {
    /* target_syscall takes its arguments as an array, not as varargs */
    long argv[] = { fd, (long)buffer, (long)size, (long)off };
    return target_syscall(target, SYS_pwrite64, 4, argv);
}

static inline void target_sys_close(target_ctx *target, int fd) {
    long argv[] = { fd };
    target_syscall(target, SYS_close, 1, argv);
}

For the sake of completeness, here’s the function that is ultimately responsible for exfiltration. It injects pwrite() to dump secretmem regions into the file descriptor obtained from memfd_create() and, since the contents are directly reflected in tracer’s memory, immediately dumps them to disk.

ssize_t exfil_secret_to_fd(exfil_ctx *exfil, secret_area *area, int fd) {
    int ret = target_sys_pwrite(exfil->target,
        exfil->remote_fd, area->base, area->size, 0);

    /* an injected raw syscall reports failure as a negative errno value */
    return ret >= 0 ? write(fd, exfil->buffer, area->size) : -1;
}

At long last, going back to the playing card:

$ ./desecrate 220842
Found 1 secret region(s):
[78ddc590e000-78ddc590f000] 4096 bytes

$ strings secretmem.220842/78ddc590e000-78ddc590f000
Seven of Hearts

Lucky guess?