2.4 Type Inference

So far we have concentrated on the type checking problem: given a program fully annotated with type specifiers on all expressions, confirm that the types are consistent. Typechecking a program is straightforward. For example, the assignment x = y typechecks if and only if the type of y is a subtype of the type of x. The function call f(x) typechecks if and only if the type of x is a subtype of the type of the formal parameter of f. More detailed rules, and a proof of soundness, can be found in [19].

The type checking system described so far, however, is not useful in practice. The problem is that it requires all types to be annotated with qualifiers: for our running example, all types would need to be marked as either tainted or untainted at every level of each type. Clearly this is an undesirable property for two reasons. First, we are interested in finding bugs in legacy code that does not have any type qualifier annotations. Second, even if we are writing a program with type qualifiers in mind, adding and maintaining annotations on every type in the program would be prohibitively expensive for programmers.

The solution to this problem is type inference. In this model, the user introduces a small number of annotations at key places in the program, and cqual infers the types of the other expressions in the program. cqual generates fresh qualifier variables (variables which range over type qualifiers) at every position in a type, constrained by any annotations specified in the program. cqual analyses the program and generates subtyping constraints--i.e., inequalities of the form $T_1\leq T_2$ for qualified types

and

A solution to a set of subtyping constraints is a mapping from qualifier variables to qualifiers such that all of the constraints are valid according to our subtyping rules. Thus, in our system, we solve the constraints by assigning every qualifier variable to either tainted or untainted.

In our type inference algorithm, qualifier variables are introduced at every position in a type. We write qualifier variables in italics, and name them after the corresponding program variables. The $i^{\hbox{th}}$ argument of function f has associated qualifier variable f_argi, and the return value of function f has qualifier variable f_ret.

Since qualifiers are implicitly introduced on all levels of a type by the type inference algorithm, to name them we modify the name of the outermost qualifier of a type. For example, given the declaration char *x, cqual generates two qualifier variables: the variable x qualifies the reference x itself, and the variable x_p qualifies the location *x. Moreover, the programmer may also explicitly introduce named qualifier variables into the program; in this case, they begin with a dollar sign (``$'') in the source code to distinguish them lexically from other tokens.

For example, after the declaration char *x; we assign the qualified type $\texttt{{\it x\_p} char * {\it x}}$ to x. Similarly, a function declared with the prototype

$\displaystyle \begin{array}{l} \texttt{{\it getenv\_ret\_p} char * {\it getenv\... ...5in} \textrm{(where $\textit{getenv\_ret\_p} = \texttt{tainted}$)} \end{array}$

$\begin{displaymath} \begin{array}{l} \texttt{{\it getenv\_ret\_p} } \;\texttt{ch... ...\, \texttt{{\it x\_p} } \;\texttt{char * {\it x}}}. \end{array}\end{displaymath}$

$\displaystyle \textit{getenv\_ret\_p} = \textit{x\_p} = \texttt{tainted}, \,\quad \textit{getenv\_ret} \le \textit{x}.$

We give next a more detailed example. Figure 3 shows a fragment of code that manipulates tainted data in an unsafe way, along with the typing constraints generated by the type inference algorithm. The constraint $\textit{getenv\_ret\_p} = \textit{s\_p}$ encodes the conclusion that the return value of getenv() is treated as tainted (as discussed above). The prototype for printf() (typically found in the global prelude file) specifies that printf() must not be called with a tainted format string argument, by requiring that its first argument be a subtype of untainted char *.

$\begin{displaymath} \begin{array}{cc} \textit{getenv\_ret} \leq \textit{s} \\ \textit{getenv\_ret\_p} = \textit{s\_p} \end{array}\end{displaymath}$

Taking the transitive closure of these constraints, we have a chain of deductions

$\displaystyle \texttt{tainted}$	$\displaystyle = \textit{getenv\_ret\_p} = \textit{s\_p} = \textit{t\_p}$
	$\displaystyle \leq \textit{printf\_arg0\_p} = \texttt{untainted},$

In our implementation, the subtyping constraints are solved on-line as they are generated. If the constraint system ever becomes unsatisfiable, an error is flagged at the first illegal expression in the code. This allows us to pinpoint the location of unsafe operations on tainted data. The inference then continues after any errors, though in this case the quality of the remaining error messages can vary tremendously.

**Figure 4:** An example of a taint flow path. The string `unclean` is tainted by the call to `getenv` at Point 0, and ultimately that data is passed to `printf` at Point 5.
$\begin{figure}\small\begin{verbatim}tainted char getenv(const char name); in... ...f1(unclean); /* Point 4 / printf(s); / Point 5 */ }\end{verbatim}\end{figure}$

We observe that efficient algorithms for this type inference problem are known. Given a fixed-size qualifier lattice and

constraints of the form $l\leq q$ , $q\leq l$ , or $q_1\leq q_2$ , where

is a lattice element and

, and

are qualifier variables, a solution to the constraints can be computed in

time using well-known algorithms [21]. The idea is to express these constraints as a directed graph with qualifier variables as vertices and subtyping constraints as directed edges: the constraint $v_1 \le v_2$ induces an edge from

. The constant qualifiers tainted and untainted are also vertices in this graph, and a directed path from tainted to untainted corresponds to a possible format string bug. We call this path a taint flow path. See Figure 4 for an example.