-
Notifications
You must be signed in to change notification settings - Fork 253
Type Analysis
In order to recover the original type information that may have been present in the subject binary file, Reko uses a type inference method based on Raimar Falke's 2004 master's thesis "Entwicklung eines Typanalysesystem für einen Decompiler" ('Development of a type analysis system for a decompiler'), which is an interprocedural (whole-program) analysis. Reko provides some additions to accommodate real-world code that reko encounters.
One aspect that I like about Falke's approach is that, unlike other researchers, he doesn't approach the type analysis problem as a constraint satisfaction problem, which has always struck me as being a very heavy handed approach. His algorithm has the advantage of being relatively easy to implement (all mistakes in the reko codebase are of course my own, and not Falke's).
Each Expression
in reko must have a DataType
-- currently the property
Expression.DataType
. At the very minimum, the DataType
is Primitive.Byte
or Primitive.WordN
where N
is the number of bits taken up by the expression.
Every processor architecture must define the operand sizes of every operand
of each machine instruction, so we're guaranteed at least this. Naturally,
if the processor architecture can yield more hints about the types of its
operands, reko will take advantage of it. However, the majority of the instructions
rewritten / lifted will remain as PrimitiveType
until more contextual analysis
is brought to bear. In a highly abstract way, we have the set E
of all expresions
each of which has a DT
property:
E
.DT
This is where Falke's type analysis comes into play. Reko scans through the
IR code and assigns to each expression a TypeVariable
(currently
Expression.TypeVariable
but it may get moved elsewhere in the future). Type
variables represent the as-yet undetermined data type of an expression, and
the implementation is used to store information gathered from the binary for
later stages of the analysis. After type variable assignment we have:
E
.DT
.TV ---> TV
Once each expression has a TypeVariable
, reko scans again, this time collecting
"evidence" (what Falke calls "traits") based on how the expressions are used in
the code. Obviously the initial data type is evidence but, as mentioned above, it
is normally quite unspecific. However, the operators involved in the expressions
will allow reko to refine the initial PrimitiveType
types into more specific
types. For example, if reko encounters an expression i_1 + i_1
(where '+' is integer
addition), it is highly unlikely that i_1
is either a floating point value,
character, or pointer. Also, any user supplied or platform supplied information
is introduced here as well. All this work is carried out by the
ExpressionTypeAscender
and ExpressionTypeDescender
classes.
Each TypeVariable
has an OriginalDataType
property, which is the datatype
that reko can deduce for that type variable, without taking into consideration
assignments (either directly in code or through procedure parameters /
return values).
E
.DT
.TV ---> TV
.DT_orig
To be fair, my German is rusty, so it's possible I may have misinterpreted Falke's section 3.8.1:
Um Konflikte später aufzulösen zu können, wird zu jeder Typvariable t_i ein Ursprungsdatentyp t_i' berechnet. Wenn es sich bei dem Typen um einen Summentyp handelt, wird des Alternativetyp mit Hilfe des Ursprungstyp bestimmt.
which I translate freely to:
In order to resolve later conflicts, an original data type t_i' is computed for each type variable t_i. When dealing with a union type, the alternative type is determined with the help of the original type.
It's possible that Falke implies that reko should just copy E.DT
into TV.DT_orig
.
I will try to confirm with him.
Statements and expressions that assign values to one another introduce an
equivalence relation over the type variables which allows reko to group
them into equivalence classes (described in Falke section 3.8.5.4). The
type variables are partitioned so that each one belongs to exactly one
equivalence class, using a standard disjoint set union algorithm. Each type variable
records the EquivalenceClass
it belongs to, and the EquivalenceClass
contains the
union of all the original data types of the type variables that it consists of. Naturally, the union
is simplified if all of its alternatives are compatible.
E
.DT
.TV ---> TV ---> EQ
.DT_Orig .DT
The resulting unified datatype is copied back to the type variable as well:
E
.DT
.TV ---> TV ---> EQ
.DT .DT
.DT_Orig
Now, the datatypes in EQ.DT
are "cooked" to EQ.DT'
attempting to
simplify them as much as possible. Falke's thesis 3.8.5.7 specifies a number
of transformations which reko performs. We finally have:
E
.DT
.T ---> TV ---> EQ
.DT .DT'
.DT_Orig
Now we're ready to take the data types obtained through type analysis and
apply them to the expressions. This is done in reko's TypedExpressionRewriter
class. The end result is that DT'
is "pushed" back into the E
s, but many
of those E
s will be rewritten as a consequence of their types being discovered.
Once the TypedExpressionRewriter
is done, type analysis is complete, and all
type variables should be discarded. Our end result is:
E'
.DT'
where E'
represents the fact that expressions like Mem0[r0_1 + 0x42:word32]
have now changed to r0_1->dw0042
and DT'
reflects that the original data type
of the Mem0 expression, which was word32
, is very likely now reduced to either
a primitive type like int32
or float32
, or a complex type like e.g. ptr(Eq_3)
where Eq_3
is a TypeReference
to a structure or union.