In 2009 I wrote about building a ptrace-based sandboxing system named "ujail", including some basic proofs of concept.
I have been thinking about this idea for a long time now, but sadly did not have the time to implement it - until now.
Right now I am working on this idea again, and whilst doing some research I came across a thread on the linux-kernel mailing list.
At first, a problem with 64-bit binaries trapping into the 32-bit syscall handling code via int 0x80 got me there. While this is awkward and keeps one from implementing a sandbox in userspace (the tracer cannot access TS_COMPAT, as described in the thread), it led me to something else: a more severe problem.
Unfortunately I cannot remember who wrote this and am unable to recover the actual mail (if someone finds it I would be happy if you notified me), but someone mentioned race conditions when using ptrace as a security measure.
In short, I came up with a proof of concept which circumvents the restrictions imposed by a ptrace-based security mechanism. For those in a hurry: you can find the code of the proof of concept on GitHub.
In the following parts of this article I would like to elaborate on the problem and how the proof of concept code exploits it.
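The problem here is that the syscall-entry stop reported via PTRACE_SYSCALL happens before the kernel actually fetches the syscall's data from userspace. Any pointer argument still refers to memory the tracee can write to.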
Let me illustrate that with sys_open. Assume we are running a tracer which makes use of ptrace to get a SIGTRAP each time a tracee invokes a syscall and we want to impose limits on sys_open calls.
After a syscall has been invoked it would roughly work like this:
The tracer is notified, reads the tracee's registers using PTRACE_GETREGS and takes the first syscall argument from them. For sys_open this argument is a pointer to the path, so the tracer then reads the path string itself from the tracee's memory. Finally it evaluates that string and decides whether to allow the syscall or not.
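To make these steps a bit more concrete, here is a minimal sketch of the check, not the actual ujail code. It assumes Linux and a tracee that is already stopped at syscall entry. The names path_allowed, read_tracee_string and check_open, as well as the "/tmp/ only" policy, are made up for illustration. On x86-64 the address passed as path_addr would come from the rdi register as returned by PTRACE_GETREGS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Hypothetical policy: only paths below /tmp/ without ".." are allowed. */
static int path_allowed(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0 && strstr(path, "..") == NULL;
}

/* Copy a NUL-terminated string out of the stopped tracee, one word at a
 * time, using PTRACE_PEEKDATA. Returns 0 on success, -1 on a read error
 * or if the string does not fit into buf. */
int read_tracee_string(pid_t pid, unsigned long addr, char *buf, size_t len)
{
    size_t off = 0;

    assert(buf != NULL && len > 0);
    while (off < len) {
        errno = 0;
        long word = ptrace(PTRACE_PEEKDATA, pid, (void *)(addr + off), NULL);
        if (word == -1 && errno != 0)
            return -1; /* -1 can be valid data, so errno decides */
        for (size_t i = 0; i < sizeof(word) && off < len; i++, off++) {
            buf[off] = ((char *)&word)[i];
            if (buf[off] == '\0')
                return 0;
        }
    }
    return -1;
}

/* Decide about a pending sys_open. path_addr is the first syscall
 * argument as read from the tracee's registers. Returns 1 to allow. */
int check_open(pid_t pid, unsigned long path_addr)
{
    char path[4096];

    if (read_tracee_string(pid, path_addr, path, sizeof(path)) != 0)
        return 0; /* deny if the path cannot be read */
    return path_allowed(path);
}
```

Note that check_open decides on a private copy of the path, while the tracee's own memory stays writable the whole time.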
Now this is exactly the way ujail would have worked in its initial design. However, using this method there is a not-so-small attack vector which involves all values read from the tracee's memory.
You may now ask yourself what I am writing about, but it will make sense in a few moments, I promise.
There is a window between the tracer reading the path from the tracee's memory and the tracer resuming the tracee using PTRACE_SYSCALL. Only the thread that invoked the syscall is stopped during this window; other threads of the tracee keep running. A potentially malicious thread inside the tracee can therefore change the memory that path points to and thus circumvent any restriction imposed by the tracer. Changing the value is as simple as writing to the process memory, which is shared between threads, at just the right moment and at just the right position.
As writing to memory does not generate a trap the tracer could act upon, the tracer remains unaware of the modification and resumes the tracee's execution with the swapped path in place - jail broken.
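The sequence can be replayed in a small, deterministic simulation. This is only a sketch and not the proof of concept mentioned above: there is no ptrace involved, the "tracer" and the "kernel" are plain function steps, and the thread is joined inside the window so that the outcome does not depend on timing. All names (shared_path, swap_path, simulate_race) and the "/tmp/ only" policy are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* The memory the syscall's path argument points to. It is shared
 * between all threads of the (simulated) tracee. */
static char shared_path[64] = "/tmp/harmless";

/* Hypothetical policy: only paths below /tmp/ are allowed. */
static int path_allowed(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0;
}

/* The malicious thread: overwrite the path in place. */
static void *swap_path(void *arg)
{
    (void)arg;
    strcpy(shared_path, "/etc/shadow");
    return NULL;
}

/* Returns 1 if the tracer approved a path that differs from the one the
 * kernel would end up using. checked and used receive both paths. */
int simulate_race(char *checked, char *used, size_t len)
{
    pthread_t attacker;

    assert(checked != NULL && used != NULL && len > 0);

    /* 1. Syscall-entry stop: the tracer reads and checks the path. */
    snprintf(checked, len, "%s", shared_path);
    int verdict = path_allowed(checked);

    /* 2. The window: the tracer is still deciding while another thread
     *    of the tracee writes to the shared memory. */
    if (pthread_create(&attacker, NULL, swap_path, NULL) != 0)
        return 0;
    pthread_join(attacker, NULL);

    /* 3. The tracee is resumed and only now does the kernel fetch the
     *    path from userspace. */
    snprintf(used, len, "%s", shared_path);

    return verdict && !path_allowed(used);
}
```

Depending on the toolchain this may need to be compiled with -pthread. In a real attack step 2 is of course not synchronized; the thread simply keeps rewriting the buffer and hopes to hit the window.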
What is important here is just the right timing. The write has to happen after the tracer has read from the tracee's memory and before it resumes execution of the tracee. However, the tracer is most likely to employ some kind of decision-making process here. This process will take time. It may actually involve some syscalls (think mutexes, semaphores and condition variables here). All in all enough time to swap values.
You may now think to yourself that it might be really hard to actually pull this one off, and under normal circumstances it probably is. However, the mere possibility of doing this should rule out ptrace as a security measure completely.
The only way I believe this could be handled is triggering a hook inside the system call handlers themselves, just after all information has been pulled from userspace. These values are guaranteed not to be modifiable from within userspace and thus only these should be considered for making decisions. As a consequence ujail (and every other similar security measure out there) will have to be realized at least partly in kernel-space.
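The principle behind such a hook is "fetch once, then decide and act on the same copy". The following is a userspace sketch of that idea only, not kernel code and not a design for ujail. guarded_open is a hypothetical stand-in for a syscall handler with the check placed after the copy, and the snprintf call stands in for whatever the kernel uses to copy the string from userspace.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PATH_LIMIT 256

/* Hypothetical policy: only paths below /tmp/ are allowed. */
static int path_allowed(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0;
}

/* user_path is memory that other threads may modify at any time.
 * On success the path that would be opened is written to opened.
 * Returns 0 if the open is allowed and -1 if it is denied. */
int guarded_open(const char *user_path, char *opened, size_t len)
{
    char snapshot[PATH_LIMIT];

    assert(user_path != NULL && opened != NULL && len > 0);

    /* Fetch the value from "userspace" exactly once. */
    snprintf(snapshot, sizeof(snapshot), "%s", user_path);

    /* The hook: decide on the private copy, never on user_path. */
    if (!path_allowed(snapshot))
        return -1;

    /* Act on the same copy. Later writes to user_path have no effect. */
    snprintf(opened, len, "%s", snapshot);
    return 0;
}
```

Because the decision and the action share one copy, overwriting the original buffer after the call started no longer changes which path is used.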
Feel free to leave comments, send me an email and/or point out any issues with the proof of concept code or my idea.