An implementation of strings that you can trust to the end of the world

October 30, 2014

Using simple implementations with the right functionality 

An Implementation of strings

Last week, security researcher lcamtuf blogged about CVE-2014-8485, a vulnerability in the Unix strings command. The strings command is a popular way to get hints when stuck in the venerable ADVENT game. Some may also use it to reveal sequences of printable characters other than the verbs in ADVENT, but the truth is that running strings on a modern binary file is a depressing experience. On my computer, strings `which frama-c` shows that the 15MiB binary contains 111 copies of the string “volatile”, 53 copies of the identifier name “vlogic_ctor_info_decl”, and two copies of “verify_unmodified: %s has conflicting definitions. cur: %a”, a debug message not intended to ever be seen by the user. And I am only going through the Vs!

In fact, the entire Turbo Pascal compiler could probably fit in the space occupied by useless messages that start with “V” in Frama-C:

$ strings `which frama-c` | grep ^[vV] | wc
3638 4788 46623

Coming back to the vulnerability, it came as a shock to Unix users—unfortunately that vulnerability name was already taken. The purpose and function of the strings command is to scan binary files of unknown origin and/or mysterious design, and suddenly lcamtuf tells us that doing so could lead to arbitrary code execution!

 This has prompted a series of very short implementations of the basic functionality that strings is supposed to provide, in Python and in C. Depressing references to the evolution of the sizes of binaries notwithstanding, there is no reason for strings to be written in C, except perhaps if you anticipate that you will need to run it from a boot disk. Python seems, on the face of it, like an excellent choice. Java or OCaml would also have been reasonable. Regardless, Twitter user nils offered at one point the C implementation below:

a;main(c){for(;~c;putchar(isprint(c=getchar())?a++,c:a<4|(a=0)?13:10));}

And at another point, athoshun proposed the shorter:

c;main(a){while(~c)putchar(isprint(c=getchar(++a))?c:(6>a)*3+9+(a=1));}

These two definitions are remarkably concise; they both rely on the same trick of moving some of the responsibility onto the terminal by clever use of the “carriage return” character 13 in ASCII. This means that these implementations fool the user into thinking that they work, but the illusion doesn’t hold when the output is not directly interpreted by a terminal:

$ ./a.out < a.out
__PAGEZERO
…
dyld_stub_binder
$ ./a.out < a.out | more
^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M^M ^M^M^M^M^M^M^M^M^MH^M^M^M__PAGEZERO
…

Ignoring the non-standard prototype for main and the concise declarations a; and c; that are par for the C code golf course, the versions above are remarkable in that they both invoke undefined behavior in line 1:

 warning: undefined multiple accesses in expression. assert \separated(&a, &a);

This is just a blip caused by the race to conciseness. The expression a<4|(a=0)?13:10 is intended as a shorter alternative to a<4||(a=0)?13:10, but unfortunately, | does not introduce the sequence point that is needed between the assignment a=0 and the use a<4. The situation is similar with (6>a)*3+9+(a=1) in the second implementation.

 Other versions of strings, written by the same authors, that do not suffer from the unsequenced-side-effects undefined behavior, are:

a,c;main(){for(;c+1;putchar((isprint(c=getchar()))?a++,c:a<4||(a=0)?13:10));}
a=9;main(c){while(~c)putchar(isprint(c=getchar())?a/=2,c:(a=a?13:10));}

 

Are these safe? We need a specification for getchar() to find out:

/*@ ensures -1 ≤ \result ≤ 255 ; */
int getchar(void);

 

Without this specification, it looks like c+1 in the first implementation might be a signed arithmetic overflow (which is undefined behavior). With the specification, it is clear that it isn’t, since c is initially 0 and then the result of getchar(), which is at most 255.

Once the specification is provided:

$ tis-analyzer good_string1.c
…
warning: signed overflow. assert a+1 ≤ 2147483647;
…
$ tis-analyzer good_string2.c
…

The first implementation still has a slight issue: tis-analyzer is unable to demonstrate that, with the information provided, the variable a does not eventually overflow. Looking at the code, we can see that this can only happen for sequences of printable characters longer than 231 characters. A simple way to ensure that it doesn’t is not to trust the outside world with files larger than 2GiB.

The second implementation is free of undefined behavior for as many characters as you want to feed it. Since the functionality it intentionally provides is not the remote execution of arbitrary code, you can pass any sequence of character to it confident in the knowledge that it will not accidentally provide this functionality through undefined behavior to the malicious author of a perplexing file. Consider this implementation recommended. In fact, I’ll run it on frama-c right now:

$ ./a.out < `which frama-c`
…
…
…
… oh, no!
…
…
…
_unlink
_usleep
_utime
_wait
_waitpid
_write
dyld_stub_binder

In conclusion, much of the software around us seems to be too complex for its own good: OpenSSL, Bash, strings, … but we do not have to use the complex versions. Simpler implementations of just the right functionality are also available or can be written—though aiming for less than 80 characters is a bit extreme. It is easier to trust that the simpler versions do what you want and only what you want. When necessary, the simpler implementations can be formally verified to reach greater yet levels of confidence.

Newsletter