The closest thing I know of might be CreateRestrictedToken, AdjustTokenGroups, and AdjustTokenPrivileges, but those don't quite fit the bill. There's a bit more info on Pico processes at https://fourcore.io/blogs/how-a-windows-process-is-created-part-1 that you might find interesting.
In terms of what you can sort-of do to sort-of get something like this, there are some good resources available, like https://github.com/dblohm7/sandbox-win32
Unfortunately, in my opinion, what these sorts of resources demonstrate is just how difficult it is to do something which should be very simple. You should not even have to know what a "token" is, or a "job object", or ASLR, or a mitigation policy, etc., to make a process that cannot call Windows APIs. It should be as simple as saying "CreateProcessThatCannotDoAnythingAtAll()", so you can't screw it up :)
I have only read this article and the comments section. And I'm not a systems programmer, so I know basically nothing about OSes. So, what I'm asking might be trivial or just nonsense...
1. You said that Chrome doesn't use anything like this right now, and if such a thing existed it would allow someone to execute arbitrary code written as JavaScript or wasm in a safe way without doing any security checks. So, do you mean that one can actually try to make a system call (like acquiring a lock on the camera) through JavaScript/wasm in Chrome and it will only be stopped because there are security checks?
I have tried to read the code of the V8 engine with limited success. I have not seen anything along the lines of checking for syscalls. All I saw was the language they have (Torque) implementing the ECMA spec in phrasing that is close to the spec itself, and then C++ implementations of the algorithms that support functions like `indexOf`, `replace`, etc.
So, I was under the impression that it would actually be impossible to do something malicious inside JavaScript or wasm itself because it doesn't have any privileges. The only system access possible is through the HTML5 APIs, over which the person writing the arbitrary code has very little control. Is this not true? Is it possible to try to inject a syscall through V8 that will only be stopped because V8 would do a security check on the source code?
2. How did you know where to look to determine that Chromium doesn't use anything like the process you are suggesting? I also tried to read the code of Chromium, but it's so huge I couldn't figure out where to even start. I guess if I had to look for it I could grep the whole project for something like a platform layer and look in those files. I think I saw at least one switch statement somewhere that looked like it was checking for macOS, Windows and Linux. Is that what you did?
A simple communication API would be to make the thread inside the sandbox terminate any time it tries to execute an illegal instruction, and have the main thread be notified of that (like a thread join).
Then the main thread can inspect the memory and perhaps see that a designated area for syscall communication has been filled out, execute the pseudo syscall, and then restart the sandbox thread (possibly only allowing a restart from specific entry points).
This then won't rely on platform-specific knowledge for inspecting the registers or syscall number, and needs only one bit of assembly or an intrinsic when doing the syscall. The register inspection machinery can still be added, but I believe that would only be useful for debugging.
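To make the trap-and-restart idea above a bit more concrete, here is a toy, single-process simulation in Python. It is only a sketch under loud assumptions: the sandboxed thread is stood in for by a generator, each `yield` models the illegal instruction that stops it, and `Mailbox`, `sandboxed_job`, `run_host` and the call numbers 1 and 2 are all made-up names for illustration, not anything an OS provides.

```python
from dataclasses import dataclass
from typing import Callable, Generator, List


@dataclass
class Mailbox:
    """Designated shared area the sandboxed code fills in before it traps."""
    number: int = 0
    args: tuple = ()
    result: object = None


def sandboxed_job(mailbox: Mailbox) -> Generator[None, None, None]:
    """Stand-in for sandboxed code. Each `yield` models the illegal
    instruction that stops the thread and hands control to the host."""
    mailbox.number, mailbox.args = 1, ("hello",)   # hypothetical call 1: log text
    yield
    mailbox.number, mailbox.args = 2, (20, 22)     # hypothetical call 2: add
    yield
    mailbox.number, mailbox.args = 1, (f"sum={mailbox.result}",)
    yield


def run_host(job_factory: Callable[[Mailbox], Generator]) -> List[str]:
    """Host loop: run the job until it traps, inspect the mailbox,
    perform the pseudo syscall, then resume the job."""
    log: List[str] = []
    handlers = {
        1: lambda text: log.append(text),
        2: lambda a, b: a + b,
    }
    mailbox = Mailbox()
    for _ in job_factory(mailbox):        # each iteration ends at a trap
        handler = handlers.get(mailbox.number)
        if handler is None:
            break                         # unknown request: never restart the job
        mailbox.result = handler(*mailbox.args)
    return log


print(run_host(sandboxed_job))
```

Note that the host never looks at registers here, only at the mailbox, which is the point of the proposal. A real version would also have to decide which entry points a restart is allowed to resume from; the generator hides that detail.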
For a wait, maybe, but for a notify, wouldn't that be too expensive? I guess I've never thought too hard about how expensive an int 3 is, but it seems like you wouldn't want that to be the standard...
From what I understand of a syscall, the interrupt itself is at worst only as bad as a mispredicted branch, after which the kernel keeps the existing virtual memory mapping if it can (this is less true after the Spectre and Meltdown mitigations). The kernel interrupt handler code is always mapped but marked as inaccessible to user code. It's the context switch (changing to a new memory mapping) and the cold caches afterwards that are the bad thing.
But I had assumed that the only signals the sandboxed job would want to send are blocking. I hadn't considered that there might be non-blocking calls to make. With a signal and a terminate/restart or wait, you can use an io_uring for communicating with the main process. Though then you will end up paying for the context switch each time.
Having the sandboxed code syscall the kernel directly is cheaper for sure, but it makes the sandbox integrity much harder to verify.
One of the issues here is that CPUs maybe don't support the best options here, because they are built around the current model. There's no real reason there can't be a fast "knock" method (this exists in various hardware for other reasons, etc., so it is not unusual). x64 also has MONITOR/MWAIT and such. But yeah. Unless we actually had a prototype working, it's hard to say how "bad" any of this would actually be in practice.
In fact now that I am randomly mulling it, might it be possible for the OS layer to mask an interrupt so that it _doesn't_ fire when the OS thread is active and pulling from a queue, and then it just unmasks it before it goes to sleep? That would pretty much solve the problem, I should think, because then the interrupt would never fire except in circumstances where the queue ran out anyway, so it couldn't have been very "high bandwidth"...
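As a rough illustration of the masking idea, here is a small single-threaded Python model that just counts how many "knocks" (interrupts) would fire. Everything in it is hypothetical: `KnockQueue`, `knock_masked` and `knocks` are names invented for the sketch, and a real implementation would need atomics and memory barriers rather than plain fields.

```python
from collections import deque


class KnockQueue:
    """Toy model of a queue whose wake-up interrupt is masked while
    the consumer is actively draining it."""

    def __init__(self):
        self.items = deque()
        self.knock_masked = False  # True while the consumer is awake
        self.knocks = 0            # how many expensive interrupts fired

    def push(self, item):
        """Producer side: enqueue, and knock only if the consumer sleeps."""
        self.items.append(item)
        if not self.knock_masked:
            self.knocks += 1
            self.knock_masked = True   # consumer wakes up and masks the knock

    def drain(self):
        """Consumer side: pull until empty, then unmask before sleeping."""
        out = []
        while True:
            while self.items:
                out.append(self.items.popleft())
            self.knock_masked = False  # unmask just before going to sleep
            if not self.items:         # re-check in case something arrived
                return out             # right before the unmask
            self.knock_masked = True


q = KnockQueue()
for i in range(1000):
    q.push(i)
print(q.knocks)        # one knock for the whole burst
print(len(q.drain()))
```

In this model a burst of pushes costs a single knock, and the knock only fires again once the consumer has run dry and gone back to sleep. The re-check after unmasking is there because, with real concurrency, an item could land between the last pop and the unmask.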
The name "CreateUnprivilegedProcess" would be more descriptive. Then you could have "RequestPrivilege" and "RelinquishPrivilege" as other calls.
I deliberately avoided this, because I feel like this should not be a general concept for accessing Windows APIs. That is a much more dangerous thing to enable, and would require the kernel team to handle calls meant to be exactly the same as Windows calls, which may be difficult (or impossible) in this kind of process.
So I think the naming, and the function call structure, should make it clear that this is not a real Windows process that can call Windows APIs ever. It is a special kind of blank process that can never, ever, under any circumstances call a real Windows API. It can only make a syscall to a special kind of syscall handler that allows a few basic functions, but those functions are specific to this kind of process and are not meant to be completely equivalent to any existing Windows function.
Hopefully that makes some sense... I could write a separate article on this if need be. I think it's very important that this be distinct from "you can call regular Windows APIs from a whitelist", because I think having that ability means that the kernel has to expose a much larger attack surface underneath than if they can just write a very specific, less capable syscall handler for these processes.
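To sketch what "a very specific, less capable syscall handler" could look like, here is a toy Python model. The call names and numbers (`BlankCall.WRITE_OUTPUT`, `GET_TICKS`, `EXIT`) and the `BlankProcessHandler` class are invented for illustration and do not correspond to any real Windows interface; the only point is that the table is closed and every argument is checked against the process's own memory.

```python
from enum import IntEnum


class BlankCall(IntEnum):
    """Hypothetical call numbers for the blank process; nothing here
    maps onto an existing OS API."""
    WRITE_OUTPUT = 0
    GET_TICKS = 1
    EXIT = 2


class BlankProcessHandler:
    """Closed dispatch table: three calls, everything else is rejected."""

    def __init__(self, memory_size: int):
        self.memory = bytearray(memory_size)  # the process's only memory
        self.output = bytearray()             # what the host receives
        self.ticks = 0                        # counts every dispatch attempt
        self.exited = False

    def dispatch(self, number: int, a: int = 0, b: int = 0) -> int:
        self.ticks += 1
        if number == BlankCall.WRITE_OUTPUT:
            offset, length = a, b
            # Reject any range that leaves the process's own memory.
            if offset < 0 or length < 0 or offset + length > len(self.memory):
                return -1
            self.output += self.memory[offset:offset + length]
            return length
        if number == BlankCall.GET_TICKS:
            return self.ticks
        if number == BlankCall.EXIT:
            self.exited = True
            return 0
        return -1  # no fallthrough to any other kind of call


h = BlankProcessHandler(16)
h.memory[0:5] = b"hello"
print(h.dispatch(BlankCall.WRITE_OUTPUT, 0, 5), bytes(h.output))
print(h.dispatch(1234))  # an unknown call number is simply refused
```

Because there is no path from `dispatch` to anything outside these three branches, the surface that has to be audited is this one function, which is the contrast with a whitelist over regular Windows APIs.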