next up previous
Next: 3 User Interface Up: 2 Background Previous: 2.3 The Qualifier Lattice

2.4 Type Inference

So far we have concentrated on the type checking problem: given a program fully annotated with type specifiers on all expressions, confirm that the types are consistent. Typechecking a program is straightforward. For example, the assignment x = y typechecks if and only if the type of y is a subtype of the type of x. The function call f(x) typechecks if and only if the type of x is a subtype of the type of the formal parameter of f. More detailed rules, and a proof of soundness, can be found in [19].

The type checking system described so far, however, is not useful in practice. The problem is that it requires all types to be annotated with qualifiers: for our running example, all types would need to be marked as either tainted or untainted at every level of each type. Clearly this is an undesirable property for two reasons. First, we are interested in finding bugs in legacy code that does not have any type qualifier annotations. Second, even if we are writing a program with type qualifiers in mind, adding and maintaining annotations on every type in the program would be prohibitively expensive for programmers.

The solution to this problem is type inference. In this model, the user introduces a small number of annotations at key places in the program, and cqual infers the types of the other expressions in the program. cqual generates fresh qualifier variables (variables which range over type qualifiers) at every position in a type, constrained by any annotations specified in the program. cqual analyses the program and generates subtyping constraints--i.e., inequalities of the form $ T_1\leq T_2$ for qualified types $ T_1$ and $ T_2$.

A solution to a set of subtyping constraints is a mapping from qualifier variables to qualifiers such that all of the constraints are valid according to our subtyping rules. Thus, in our system, we solve the constraints by assigning every qualifier variable to either tainted or untainted.

In our type inference algorithm, qualifier variables are introduced at every position in a type. We write qualifier variables in italics, and name them after the corresponding program variables. The $ i^{\hbox{th}}$ argument of function f has associated qualifier variable f_argi, and the return value of function f has qualifier variable f_ret.

Since qualifiers are implicitly introduced on all levels of a type by the type inference algorithm, to name them we modify the name of the outermost qualifier of a type. For example, given the declaration char *x, cqual generates two qualifier variables: the variable x qualifies the reference x itself, and the variable x_p qualifies the location *x. Moreover, the programmer may also explicitly introduce named qualifier variables into the program; in this case, they begin with a dollar sign (``$'') in the source code to distinguish them lexically from other tokens.

For example, after the declaration char *x; we assign the qualified type $ \texttt{{\it x\_p} char * {\it x}}$ to x. Similarly, a function declared with the prototype

$\displaystyle \texttt{tainted char *getenv(char *name);} $

is assigned the following fully qualified type:

$\displaystyle \begin{array}{l}
\texttt{{\it getenv\_ret\_p} char * {\it getenv\...
\textrm{(where $\textit{getenv\_ret\_p} = \texttt{tainted}$)}
\end{array} $

If we then encounter an assignment x = getenv(...), our type inference algorithm will conclude that the type of getenv()'s return value must be a subtype of the type of x, i.e.,

\texttt{{\it getenv\_ret\_p} } \;\texttt{ch...
...\, \texttt{{\it x\_p} } \;\texttt{char * {\it x}}}.

As a consequence, we can infer (using the subtyping rules introduced in Section 2.2 and 2.3) that we must have the following constraints on the qualifier variables:

$\displaystyle \textit{getenv\_ret\_p} = \textit{x\_p} = \texttt{tainted},
\,\quad \textit{getenv\_ret} \le \textit{x}. $

In essence, our declaration of getenv() has ensured that whatever it returns will be labeled as tainted. Note that this might be used to model, for instance, a scenario where environment variables are under the adversary's control.

We give next a more detailed example. Figure 3 shows a fragment of code that manipulates tainted data in an unsafe way, along with the typing constraints generated by the type inference algorithm. The constraint $ \textit{getenv\_ret\_p} = \textit{s\_p}$ encodes the conclusion that the return value of getenv() is treated as tainted (as discussed above). The prototype for printf() (typically found in the global prelude file) specifies that printf() must not be called with a tainted format string argument, by requiring that its first argument be a subtype of untainted char *.

The call s = getenv("LD_LIBRARY_PATH") generates the constraints

\textit{getenv\_ret} \leq \textit{s} \\
\textit{getenv\_ret\_p} = \textit{s\_p}

Notice the equality constraint, arising from our corrected rule for subtyping pointer types. The assignment t = s generates a similar constraint. Finally, the call printf(t) generates a subtyping constraint on the printf_arg0_p because printf's first argument is const (see Section 4.4).

Taking the transitive closure of these constraints, we have a chain of deductions

$\displaystyle \texttt{tainted}$ $\displaystyle = \textit{getenv\_ret\_p} = \textit{s\_p} = \textit{t\_p}$    
  $\displaystyle \leq \textit{printf\_arg0\_p} = \texttt{untainted},$    

implying that for this example to type check, we would need $ \texttt{tainted}{}\leq\texttt{untainted}{}$. As explained in Section 2.2, this does not hold in our lattice, so this code fragment does not type check, indicating a possible format string bug. This demonstrates how our type inference algorithm can be used to identify unsafe manipulation of format strings.

In our implementation, the subtyping constraints are solved on-line as they are generated. If the constraint system ever becomes unsatisfiable, an error is flagged at the first illegal expression in the code. This allows us to pinpoint the location of unsafe operations on tainted data. The inference then continues after any errors, though in this case the quality of the remaining error messages can vary tremendously.

Figure 4: An example of a taint flow path. The string unclean is tainted by the call to getenv at Point 0, and ultimately that data is passed to printf at Point 5.
\begin{figure}\small\begin{verbatim}tainted char *getenv(const char *name);
...f1(unclean); /* Point 4 */
printf(s); /* Point 5 */

We observe that efficient algorithms for this type inference problem are known. Given a fixed-size qualifier lattice and $ n$ constraints of the form $ l\leq q$, $ q\leq l$, or $ q_1\leq q_2$, where $ l$ is a lattice element and $ q$, $ q_1$, and $ q_2$ are qualifier variables, a solution to the constraints can be computed in $ O(n)$ time using well-known algorithms [21]. The idea is to express these constraints as a directed graph with qualifier variables as vertices and subtyping constraints as directed edges: the constraint $ v_1 \le
v_2$ induces an edge from $ v_1$ to $ v_2$. The constant qualifiers tainted and untainted are also vertices in this graph, and a directed path from tainted to untainted corresponds to a possible format string bug. We call this path a taint flow path. See Figure 4 for an example.

next up previous
Next: 3 User Interface Up: 2 Background Previous: 2.3 The Qualifier Lattice
Umesh Shankar 2001-05-16