So far we have concentrated on the type checking problem: given a program fully annotated with type specifiers on all expressions, confirm that the types are consistent. Typechecking a program is straightforward. For example, the assignment x = y typechecks if and only if the type of y is a subtype of the type of x. The function call f(x) typechecks if and only if the type of x is a subtype of the type of the formal parameter of f. More detailed rules, and a proof of soundness, can be found in .
The type checking system described so far, however, is not useful in practice. The problem is that it requires all types to be annotated with qualifiers: for our running example, all types would need to be marked as either tainted or untainted at every level of each type. Clearly this is an undesirable property for two reasons. First, we are interested in finding bugs in legacy code that does not have any type qualifier annotations. Second, even if we are writing a program with type qualifiers in mind, adding and maintaining annotations on every type in the program would be prohibitively expensive for programmers.
The solution to this problem is type inference. In this model, the user introduces a small number of annotations at key places in the program, and cqual infers the types of the other expressions in the program. cqual generates fresh qualifier variables (variables which range over type qualifiers) at every position in a type, constrained by any annotations specified in the program. cqual analyses the program and generates subtyping constraints--i.e., inequalities of the form for qualified types and .
A solution to a set of subtyping constraints is a mapping from qualifier variables to qualifiers such that all of the constraints are valid according to our subtyping rules. Thus, in our system, we solve the constraints by assigning every qualifier variable to either tainted or untainted.
In our type inference algorithm, qualifier variables are introduced at every position in a type. We write qualifier variables in italics, and name them after the corresponding program variables. The argument of function f has associated qualifier variable f_argi, and the return value of function f has qualifier variable f_ret.
Since qualifiers are implicitly introduced on all levels of a type by the type inference algorithm, to name them we modify the name of the outermost qualifier of a type. For example, given the declaration char *x, cqual generates two qualifier variables: the variable x qualifies the reference x itself, and the variable x_p qualifies the location *x. Moreover, the programmer may also explicitly introduce named qualifier variables into the program; in this case, they begin with a dollar sign (``$'') in the source code to distinguish them lexically from other tokens.
For example, after the declaration char *x; we assign the qualified type to x. Similarly, a function declared with the prototype
We give next a more detailed example. Figure 3 shows a fragment of code that manipulates tainted data in an unsafe way, along with the typing constraints generated by the type inference algorithm. The constraint encodes the conclusion that the return value of getenv() is treated as tainted (as discussed above). The prototype for printf() (typically found in the global prelude file) specifies that printf() must not be called with a tainted format string argument, by requiring that its first argument be a subtype of untainted char *.
The call s = getenv("LD_LIBRARY_PATH") generates the constraints
Taking the transitive closure of these constraints, we have a chain of deductions
In our implementation, the subtyping constraints are solved on-line as they are generated. If the constraint system ever becomes unsatisfiable, an error is flagged at the first illegal expression in the code. This allows us to pinpoint the location of unsafe operations on tainted data. The inference then continues after any errors, though in this case the quality of the remaining error messages can vary tremendously.
We observe that efficient algorithms for this type inference problem are known. Given a fixed-size qualifier lattice and constraints of the form , , or , where is a lattice element and , , and are qualifier variables, a solution to the constraints can be computed in time using well-known algorithms . The idea is to express these constraints as a directed graph with qualifier variables as vertices and subtyping constraints as directed edges: the constraint induces an edge from to . The constant qualifiers tainted and untainted are also vertices in this graph, and a directed path from tainted to untainted corresponds to a possible format string bug. We call this path a taint flow path. See Figure 4 for an example.