2.4 Type Inference

So far we have concentrated on the *type checking* problem:
given a program fully annotated with type specifiers on all expressions,
confirm that the types are consistent.
Typechecking a program is
straightforward. For example, the assignment `x = y` typechecks
if and only if the type of `y` is a subtype of the type of
`x`. The function call `f(x)` typechecks if and only if
the type of `x` is a subtype of the type of the formal
parameter of `f`. More detailed rules, and a proof of
soundness, can be found in [19].

The type checking system described so far, however, is not useful in
practice. The problem is that it requires all types to be annotated
with qualifiers: for our running example, all types would need to be
marked as either `tainted` or `untainted` at every level of each type.
Clearly this is an
undesirable property for two reasons. First, we are interested in
finding bugs in legacy code that does not have any type qualifier
annotations. Second, even if we are writing a program with type
qualifiers in mind, adding and maintaining annotations on every type
in the program would be prohibitively expensive for programmers.

The solution to this problem is *type inference*. In this
model, the user introduces a small number of annotations at key places
in the program, and `cqual` infers the types of the other expressions
in the program. `cqual` generates fresh *qualifier variables*
(variables which range over type qualifiers) at every position in a
type, constrained by any annotations specified in the program.
`cqual` analyses the program and generates *subtyping
constraints*--i.e., inequalities of the form
for qualified
types and .

A *solution* to a set of subtyping constraints is a mapping
from qualifier variables to qualifiers such that all of the
constraints are valid according to our subtyping rules. Thus,
in our system, we solve the constraints by assigning every qualifier
variable to either `tainted` or `untainted`.

In our type inference algorithm, qualifier variables are introduced at
every position in a type. We write qualifier variables in italics,
and name them after the corresponding program variables. The
argument of function `f` has associated
qualifier variable *f_arg*i, and the return value of function
`f` has qualifier variable *f_ret*.

Since qualifiers are implicitly introduced on all levels of a type by
the type inference algorithm, to name them we modify the name of the
outermost qualifier of a type. For example, given the declaration
`char *x`, `cqual` generates two qualifier variables: the
variable *x* qualifies the reference `x` itself, and the
variable *x_p* qualifies the location
`*x`.
Moreover, the programmer may also
explicitly introduce named qualifier variables into the program; in
this case, they begin with a dollar sign (```$`'') in the
source code to distinguish them lexically from other tokens.

For example, after the declaration `char *x;`
we assign the qualified type
to `x`.
Similarly, a function declared with the prototype

We give next a more detailed example. Figure 3
shows a fragment of code that manipulates tainted data in an unsafe
way, along with the typing constraints generated by the type inference
algorithm. The constraint
encodes the conclusion that the return value of `getenv()` is
treated as tainted (as discussed above). The prototype for
`printf()` (typically found in the global prelude file)
specifies that `printf()` must not be called with a tainted
format string argument, by requiring that its first argument be a
subtype of `untainted char *`.

The call `s = getenv("LD_LIBRARY_PATH")` generates the
constraints

Taking the transitive closure of these constraints, we have a chain of deductions

implying that for this example to type check, we would need . As explained in Section 2.2, this does not hold in our lattice, so this code fragment does not type check, indicating a possible format string bug. This demonstrates how our type inference algorithm can be used to identify unsafe manipulation of format strings.

In our implementation, the subtyping constraints are solved on-line as they are generated. If the constraint system ever becomes unsatisfiable, an error is flagged at the first illegal expression in the code. This allows us to pinpoint the location of unsafe operations on tainted data. The inference then continues after any errors, though in this case the quality of the remaining error messages can vary tremendously.

We observe that efficient algorithms for this type inference problem
are known. Given a fixed-size qualifier lattice and constraints
of the form , , or
, where is a
lattice element and , , and are qualifier variables, a
solution to the constraints can be computed in time using
well-known algorithms [21]. The idea is to express these
constraints as a directed graph with qualifier variables as vertices
and subtyping constraints as directed edges: the constraint
induces an edge from to . The constant qualifiers
`tainted` and `untainted` are also vertices in this
graph, and a directed path from `tainted` to `untainted`
corresponds to a possible format string bug. We call this path a
*taint flow path*. See Figure 4 for an example.