Lately I have been thinking about methods to provide a stripped down, secured environment for running untrusted code on GNU/Linux. With this post I would like to present you with the first results of my research.
ujail - brief introduction
I have chosen ujail as the name for the technique I am proposing. ujail stands for micro jail in userspace and, in itself, describes the concept briefly. The main idea is to have a userspace process monitor system calls of one of its childs and emulate some calls, if needed. This is done using ptrace and namely both PTRACE_SYSEMU and PTRACE_SYSCALL.
The ujail process should not be able to monitor syscalls, like strace does, but also intercept and emulate them.
This sounds a lot like user mode linux (uml), but the method is different. Whilst uml comes with a complete kernel, emulates all system calls and this way provides a virtualized system, ujail is intended to only emulate some systemcalls, without emulating the kernel.
Revisiting PTRACE_SYSCALL & PTRACE_SYSEMU
To better explain how the ujail technique works I would like to have a quick look at PTRACE_SYSCALL and PTRACE_SYSEMU again.
PTRACE_SYSCALL allows a userspace process to be notified whenever a traced process enters or leaves a system call. This means that two notifications are normally sent: one before system call entry and one afterwards. Even though one is able to change the parameters of system calls this method does not allow system calls to be fully emulated (think virtual filesystem here).
PTRACE_SYSEMU on the other hand provides one notification on syscall entry and expects the receiver of the notification to emulate the syscall. This method alone sounds great, but this also means that memory allocation needs to be emulated too, which is quite complex in userspace.
A hybrid of PTRACE_SYSCALL & PTRACE_SYSEMU
Now on to the concept behind ujail. The method I am describing works by calling PTRACE_SYSEMU for a specific process and this way taking over emulation of all system calls. However, some system calls are complex to emulate in userspace, and so a hybrid of both PTRACE_SYSEMU and PTRACE_SYSCALL is needed. In short this works by checking whether the syscall needs to be emulated when the PTRACE_SYSEMU event is received.
Now one way is emulating the syscall, filling the processes' registers and resuming execution of the process. This is simple and straight-forward.
The second way is forwarding the system call to the kernel. The problem here is that calling the syscall in the monitoring process will make the new resources available to that very process, and not the process to be jailed. This is where the hybrid method kicks in.
The proof of concept code creates a backup of the next instruction to be executed along with a copy of the instruction pointer at this point and patches it with the opcodes for "int $0x80", causing the syscall to be made again. After that it resumes execution with PTRACE_SYSCALL and waits again. The first event to be received now is the program leaving the emulated system call, which can be ignored. Resuming yet again will give use two PTRACE_SYSCALL events, one for syscall entry and one for syscall exit.
The first event is not really interesting, but at the second event the opcode backup is restored and the eip set from the saved value. Now the kernel has handled the syscall and the result is ready for the child process. A final call of PTRACE_SYSEMU resumes execution of the child and waits for the next syscall.
Proof of concept
The proof of concept code can be downloaded from its bazaar branch at launchpad.net. It is intended to be used on i386 systems only and works with simple programs, but is known not to work with anything using fork, vfork and most likely will not work for binaries using threading.
Finally, I would like to thank Pradeep Padala for his "Playing with ptrace" articles , which were fun to read and worked as a great introduction of ptrace for me.
Now there is only one thing left to say: if you are interested in this method, see loopholes or problems or want to contribute, please go ahead and contact me:
debian at sp dot or dot at