Finding vulnerabilities in small, challenge-like C programs
June 3, 2014
First in a series of technical essays by chief scientist Pascal Cuoq
About your hosts
I’m Pascal Cuoq, chief scientist at TrustInSoft. This is the first of a short series of technical essays, published here on a trial basis. The essays may resemble, in style, those that I contributed over the years to the Frama-C blog.
TrustInSoft is a young company, created to provide products and services based on Frama-C, working in close collaboration with CEA LIST, where Frama-C continues to be developed.
Short, challenging programs
We have received several short C programs, each embodying one difficult security vulnerability. These programs may start with:
int main(int argc, char *argv[]) {
Here we encounter the first difficulty. As long as we were analyzing safety-critical C code, the programs that were verified with Frama-C were well-understood sub-components of larger systems. They had been developed according to specifications written in advance, and our responsibility was to check that these sub-components worked according to their specifications. We could be verifying that the post-conditions were established before control was returned to the caller. We could also simply be verifying that the code did not contain implementation flaws leading to undefined behavior. Even in the latter case, we were making use of the sub-component’s specifications: we were only verifying the absence of undefined behavior when the sub-component was invoked according to the pre-conditions it had been designed to expect. The pre-conditions were simple and unambiguous, because the system designers of safety-critical systems value simplicity and unambiguity.
In security, cheating is the rule. The vulnerabilities in the small examples we received are revealed by a malicious user invoking the program with a 2000-character-long argument, or with 500 arguments. It is a different world; but is there something that, with our safety-oriented habits, we can do?
Assumptions about main()’s arguments
Section 5.1.2.2.1 of the C11 standard, Program startup, describes the conditions that a hosted program can expect. Quoting clause 2 from that section:
If they are declared, the parameters to the main function shall obey the following constraints:
- The value of argc shall be nonnegative.
- argv[argc] shall be a null pointer.
- If the value of argc is greater than zero, the array members argv[0] through argv[argc-1] inclusive shall contain pointers to strings, which are given implementation-defined values by the host environment prior to program startup. The intent is to supply to the program information determined prior to program startup from elsewhere in the hosted environment. If the host environment is not capable of supplying strings with letters in both uppercase and lowercase, the implementation shall ensure that the strings are received in lowercase.
- If the value of argc is greater than zero, the string pointed to by argv[0] represents the program name; argv[0][0] shall be the null character if the program name is not available from the host environment. If the value of argc is greater than one, the strings pointed to by argv[1] through argv[argc-1] represent the program parameters.
- The parameters argc and argv and the strings pointed to by the argv array shall be modifiable by the program, and retain their last-stored values between program startup and program termination.
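As an illustration of these guarantees, a hosted program can walk its arguments relying only on the null pointer at argv[argc], without consulting argc at all:

#include <stdio.h>

int main(int argc, char *argv[])
{
    /* argv[argc] is guaranteed to be a null pointer, so this loop
       terminates even though argc is never consulted. */
    for (char **p = argv; *p != NULL; p++)
        printf("argument: %s\n", *p);
    return 0;
}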
Say that we are writing a privileged application (it could be a Unix program with the setuid bit set). How much of the above specification can we trust? The above clause says that argv[0] through argv[argc-1] contain “pointers to strings”, by which it means pointers to zero-terminated sequences of characters. However, a malicious user could perhaps pass a non-zero-terminated sequence of characters as argument. Bear in mind that the malicious user of our hypothetical Unix system is not even limited to what eir shell allows: this user can also call the function execve() directly.
The function execve() has the prototype:
int execve(const char *path, char *const argv[], char *const envp[]);
The argv argument has the same type and name as the argv argument passed to the main() function of the privileged program. This is worrisome: if the caller of execve() invokes it on purpose with an array of pointers to ill-formed strings, can this cause a buffer overflow in the program being executed? Can an argument contain an embedded '\0'? Can the caller maliciously make argc negative, in the hope of making the executed program misbehave somehow?
The fact that execve() expects an array of well-formed strings means that the behavior is undefined if execve() is passed an array of pointers that do not point to well-formed strings. But we are working in security now. We cannot simply ask collaborative sibling components of the system to please not do that: the caller of execve() is our adversary and will use any trick that works in order to derail execution of the program. The question for us is to determine whether the undefined behavior of calling execve() with malformed arguments is harmful to security, especially in light of the fact that it is impossible to implement in pure C an execve() function that determines whether a pointer points to a well-formed string.
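To make the threat concrete, here is a sketch of what such an adversarial caller might look like. The path and the names used are placeholders; the only point is that nothing prevents an attacker from compiling and running something like this:

#include <unistd.h>

int main(void)
{
    /* Deliberately not a string: there is no terminating '\0'. */
    char not_a_string[4] = { 'e', 'v', 'i', 'l' };
    char prog_name[] = "prog";
    char *argv[] = { prog_name, not_a_string, NULL };
    char *envp[] = { NULL };
    /* Placeholder path for some setuid program. Per POSIX, this call has
       undefined behavior because not_a_string is not a string, and that
       is exactly the point of the question above. */
    execve("/usr/bin/some-setuid-program", argv, envp);
    return 1;   /* only reached if execve() fails */
}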
How execve() would work if I were in charge
I cannot tell you how your operating system does it—the applicable standards only seem to describe properties the language and the OS must have to make programs work, not security guarantees, or perhaps I am looking in the wrong place. But here is how execve() would work if I had to implement it.
POSIX documents a size limit on the list of arguments, ARG_MAX. What execve() could do is:
– reserve a counter argc and set it to 0;
– reserve a memory area M of size ARG_MAX, and while it is not full:
  – obtain a pointer from argv. If the pointer is null, we are done;
  – otherwise, read characters from that pointer and copy them to M until either '\0' has been encountered or M is full. If M becomes full, fail with E2BIG; otherwise, increment argc and continue with the next pointer.
If a segmentation violation occurs while reading from argv or while reading characters from a pointer obtained from argv, it means that execve()’s arguments were ill-formed. So much the better: they did not have consequences, in the sense that we avoided transferring control to the program with ill-formed main() arguments. As long as ARG_MAX is chosen smaller than INT_MAX, then argc cannot overflow during the above loop, because at least one character is copied to M for each time argc is incremented.
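Here is a minimal user-space sketch of that copying loop, with MY_ARG_MAX standing in for ARG_MAX and copy_args() as a hypothetical helper; a real implementation would run in the kernel and would additionally have to catch the faults mentioned above rather than crash on them:

#include <errno.h>
#include <stddef.h>

#define MY_ARG_MAX 4096   /* stand-in for the system's ARG_MAX */

/* Copies the strings designated by argv into the fixed-size area m.
   Returns the number of arguments copied, or -1 with errno set to E2BIG
   when the arguments do not fit in MY_ARG_MAX bytes. */
static int copy_args(char *const argv[], char m[MY_ARG_MAX])
{
    size_t used = 0;
    int argc = 0;
    for (; argv[argc] != NULL; argc++) {
        const char *p = argv[argc];
        for (;;) {
            if (used == MY_ARG_MAX) {   /* M is full: refuse to execute */
                errno = E2BIG;
                return -1;
            }
            char c = *p++;
            m[used++] = c;
            if (c == '\0')
                break;
        }
    }
    /* At least one byte is copied per argument and MY_ARG_MAX is smaller
       than INT_MAX, so argc cannot overflow in this loop. */
    return argc;
}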
As a side note, I did try to create a 2147483648-argument-long argv on the IL32P64 Unix-like operating system I routinely use, and to pass it to execve(). I was hoping that the invoked program would receive a negative argc, which would have been a minor security issue justifying further investigation inside privileged programs. It did not work: the execve() call failed instead. According to this table, GNU Hurd has an unlimited ARG_MAX, which almost motivates me to install it just to see if it is possible to make argc negative there.
It should also be noted that many programs actually fail if invoked with argc == 0 and argv containing only a null pointer. Bearing in mind that this is within specifications, and that this is easy to achieve by calling execve() directly, finding a widely-used setuid program that exhibits interesting behavior when invoked with these values of argc and argv is left as an exercise to the reader.
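For instance, the following caller, again using a placeholder path, starts a program with argc equal to 0 and an argv that contains only the terminating null pointer, staying entirely within the rules quoted earlier:

#include <unistd.h>

int main(void)
{
    char *argv[] = { NULL };   /* argv[0] is already the terminating null pointer */
    char *envp[] = { NULL };
    execve("/usr/bin/some-setuid-program", argv, envp);   /* placeholder path */
    return 1;   /* only reached if execve() fails */
}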
An analysis entry point
To summarize what we have seen so far, the guarantees offered by the C standard on the arguments of main() can be relied on: a careful execve() implementation refuses to transfer control to the program at all when the arguments it is given are ill-formed.
In order to analyze C programs with main() functions for security, using verification tools that reason from pre-conditions, one can always use an analysis entry point such as the following:
void analysis_entry_point(void)
{
  int argc = nondeterministic_choice_between(0, N-1);
  char **argv = malloc((argc + 1) * sizeof(char *));
  for (int i = 0; i < argc; i++) {
    int size = nondeterministic_choice_between(0, N-1);
    argv[i] = malloc(size + 1);
    for (int j = 0; j < size; j++)
      argv[i][j] = nondeterministic_choice_between(1, 255);
    argv[i][size] = 0;
  }
  argv[argc] = 0;
  main(argc, argv);
}
The above analysis entry point has no arguments and thus does not need to assume anything about them. It creates all possible values of argc and argv and passes them to the actual main() function. For a large enough value of N (depending on the operating system), the above piece of code creates all possible arguments that the program can effectively receive.
This could be a good analysis entry point to use in some contexts, including when discussing the notion of analyzer soundness (absence of false negatives), which always seems to give interlocutors a hard time. Obviously, in practice, the above entry point contains a lot of difficulties for only theoretical gains, and it is better to obtain a more specific description of the arguments the program is supposed to expect, or to reverse-engineer, from the first lines of its main() function, interesting numbers of arguments with which to analyze the program.
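As a concrete illustration, with Frama-C’s value analysis the nondeterministic choice could plausibly be expressed with the Frama_C_interval() built-in; the mapping below is only a sketch, and other analyzers would substitute their own source of nondeterminism:

/* Sketch: stand-in declaration for the analyzer's own built-in. */
extern int Frama_C_interval(int min, int max);

#define nondeterministic_choice_between(lo, hi) Frama_C_interval((lo), (hi))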
Instead of a bit of program that builds all possible arguments the program may be invoked with, the allowable assumptions about main()’s arguments can also be expressed as an ACSL contract for the program’s main() function. The contract may look something like this:
/*@ requires 0 ≤ argc < N;
    requires \valid(argv + (0 .. argc));
    requires argv[argc] == \null;
    requires \forall integer i; 0 ≤ i < argc ==>
               \valid_string(argv[i]) && \string_length(argv[i]) ≤ N;
*/
int main(int argc, char **argv) {
  …
The two versions express the same assumptions. The practical differences between one version and the other are not important. The important idea is that both versions capture any argument list a malicious user can throw at a program. If a privileged application is invoked with argc and argv arguments that are not captured by the formalizations above, it is a flaw in the operating system, not in the application. If the application misbehaves when invoked with argc and argv corresponding to our formalizations, then it is a flaw in the application. But we have a chance to detect it, and the first step before detecting it was to write down the hypotheses under which we are going to work.
Conclusion
Security is perhaps not too different from safety after all: in both fields, one must be aware of one’s assumptions. The aura of cybersecurity is that the villain wins by cheating, by circumventing the rules. Instead, a better way to look at the difference may be that in security the rules exist and are enforced seriously, but that they are more complex and that the average software developer is not aware of them as acutely as the developer of safety-critical software is aware of the rules that apply to em.
The example of the arguments of main() is mostly useful to illustrate this. It comes up in ten-line programs submitted to us during technical discussions, but not so much in the actual study of security-critical code. As a data point, when analyzing a generic SSL server based on PolarSSL, the question of main()’s arguments did not even come up. There were unknown inputs against which the code had to be shown robust, but these inputs originated from the network instead, arriving through a reception function that had to be registered via PolarSSL’s API, as alluded to in this previous blog post about PolarSSL.
Acknowledgements: Julien Cretin, Lukas Biton, Anne Pacalet, StackOverflow users Duck and Oli Charlesworth, and Twitter users lahosken, xexd and pestophagous.