srcQL Queries

Joshua Behler edited this page Mar 21, 2024 · 15 revisions

srcQL (source Query Language) is a query language for searching source code for snippets that match some syntactic or semantic pattern. srcQL is integrated into the srcML toolkit, which processes any given srcQL query and produces a corresponding XPath search.

The language is composed of three main components - source code expressions, scoping operators, and set operators.

NOTICE

Most things on this page are a work in progress. Currently, the only officially supported part of srcQL is source code patterns. The section on source code patterns is valid with the most current version of srcML.

Source Code Expressions

A source code expression is an expression used by srcQL to specify what syntactic or semantic structure the query should search for.

Source Code Patterns

A source code pattern (previously referred to as srcPat) is a small fragment of source code that is used to search for matches within the target source code. Source code patterns use the syntax of the language of the source code that the query is being run on. For example, the following C++ pattern will match with any declaration of int x that is found within the target source code.

int x

Specifically, this query looks for the following srcML structure:

<decl><type><name>int</name></type> <name>x</name></decl>

All of the examples on this page are in C++, but source code patterns work for all languages that srcML supports.
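To make the srcML structure shown above concrete, here is a minimal Python sketch that searches a hand-written srcML fragment for a declaration of type int named x, using the standard library's ElementTree. The sample XML and the helper name `find_decls` are assumptions for illustration; srcQL generates its own XPath internally, and that XPath may look quite different.

```python
import xml.etree.ElementTree as ET

# srcML's base namespace, bound here to the conventional "src" prefix.
NS = {"src": "http://www.srcML.org/srcML/src"}

# Hand-written, simplified srcML for: int x; int y; long x;
SAMPLE = """
<unit xmlns="http://www.srcML.org/srcML/src" language="C++">
<decl_stmt><decl><type><name>int</name></type> <name>x</name></decl>;</decl_stmt>
<decl_stmt><decl><type><name>int</name></type> <name>y</name></decl>;</decl_stmt>
<decl_stmt><decl><type><name>long</name></type> <name>x</name></decl>;</decl_stmt>
</unit>
"""


def find_decls(root, type_name, var_name):
    """Return <decl> elements whose type name and variable name match exactly."""
    hits = []
    for decl in root.iterfind(".//src:decl", NS):
        found_type = decl.findtext("src:type/src:name", default=None, namespaces=NS)
        found_name = decl.findtext("src:name", default=None, namespaces=NS)
        if found_type == type_name and found_name == var_name:
            hits.append(decl)
    return hits


root = ET.fromstring(SAMPLE)
matches = find_decls(root, "int", "x")
print(len(matches))  # only the first declaration has both type int and name x
```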

Code Fragments

Any fragment of source code can be used as a pattern. For example, the following pattern will match with any if statement within C++.

if () {}

Because source code fragments work as patterns, any scoping within a fragment will be taken into consideration while searching. For example, the following pattern will match with any if statement that has a while loop directly inside of its block.

if () { while() {} }

The order of syntax will also be taken into consideration. For example, the following query will match with any if statement that directly contains a while loop and a switch statement, with the while loop appearing before the switch statement.

if () { 
  while () {}
  switch () {}
}

This pattern could be applied to the following source code, and will match only some of the if statements:

if (true) { // Will Match
  while (true) { }
  switch (val) { }
}

if (true) { // Will NOT Match
  switch (val) { }
  while (true) { }
}

if (true) { // Will Match
  while (true) { }
  int x = 0;
  switch (val) { }
}

As shown in the 3rd example above, the presence of something between the while and switch does not affect the pattern, so long as general ordering and scoping matches.
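One way to think about this ordering rule is as an ordered-subsequence test over the statements in a block. The sketch below is only a model of the behavior described above, not how srcQL is implemented; the function name `matches_in_order` and the statement labels are hypothetical.

```python
def matches_in_order(pattern, statements):
    """True if every item of `pattern` occurs in `statements` in the same
    relative order, with any number of other statements in between."""
    remaining = iter(statements)
    # `kind in remaining` consumes the iterator up to the first hit, so each
    # later pattern item can only be found after the previous one.
    return all(kind in remaining for kind in pattern)


pattern = ["while", "switch"]

first = matches_in_order(pattern, ["while", "switch"])           # while, then switch
second = matches_in_order(pattern, ["switch", "while"])          # wrong order
third = matches_in_order(pattern, ["while", "decl", "switch"])   # extra statement between
```

Under this model the three if statements above give `True`, `False`, and `True` respectively.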

Logical Variables

In addition to source code fragments, srcQL also allows the presence of Logical Variables to replace certain names and expressions within the source code.

Looking at the first example again, this pattern will only return declarations of type int with name x:

int x

This is very limited in what it will return. To expand the capabilities of patterns, x can be replaced with a logical variable:

int $N

This pattern will match any integer variable declaration, regardless of the name of the variable.


Logical variables can also replace the type in the declaration. The following pattern will match with any variable declaration of any type and name.

$T $N

A bare logical variable will match with any expression or declaration within the source code.

$E

When matching an expression, a logical variable will represent anything that can be found within an expression, not just names. To see all possible subelements of an expression, visit this page. The following elements are an exception to this rule, and will never be used to match: operator, comment, modifier, specifier.


Logical variables can also act as a wildcard for generic text, with strings being the primary example:

"" // Will only match with empty strings
"x" // Will only match with strings that exactly match "x"
"$T" // Will match any string
"prefix$T" // Will match any string that starts with "prefix"

Subsumption

Because a source code pattern looks only for the bare minimum needed to match, patterns that contain a series will match with series larger than themselves. For example, the following pattern will match with any function declaration,

$TYPE $FUNC();

This following pattern will match with any function declaration that has at least one parameter,

$TYPE $FUNC($A);

This following pattern will match with any function declaration that has at least two parameters, and so on.

$TYPE $FUNC($A,$B);

Subsumption is an intrinsic part of the source code pattern - for example, the if pattern from earlier will match with any if statement, regardless of what is in the condition or block.

if () {}

The following pattern from earlier matches with any if statement that has at least one while loop, but it could have many! It also allows for any other statements to appear within the block. In addition, the while loop can have anything inside of its condition and block.

if () { while() {} }

Nesting this further yields the same results.

Unification

If the same logical variable is used multiple times, the query will only return code fragments where every occurrence of that variable matches the same thing. This process is called Unification, and allows patterns to search for more specific matches.

The following simple unification pattern will return any expression that adds two copies of a value together,

$X + $X

1 + 1 // Will Match
1 + 2 // Will NOT Match
x + x // Will Match
foo() + foo() // Will Match
bar(1) + bar(2) // Will NOT Match
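The following Python sketch models unification on flat token lists: a variable binds to whatever it first meets, and every later occurrence must agree with that binding. It treats each operand as an opaque string, which is a simplification; srcQL unifies over srcML elements, and the function name `unify` is hypothetical.

```python
def unify(pattern, candidate):
    """Match two equal-length token lists.

    Tokens starting with "$" are logical variables. Returns the bindings
    on success, or None if a literal differs or a variable would need two
    different values.
    """
    if len(pattern) != len(candidate):
        return None
    bindings = {}
    for expected, actual in zip(pattern, candidate):
        if expected.startswith("$"):
            # setdefault binds on first use and returns the old value afterwards.
            if bindings.setdefault(expected, actual) != actual:
                return None
        elif expected != actual:
            return None
    return bindings


pattern = ["$X", "+", "$X"]
same_literals = unify(pattern, ["1", "+", "1"])
different_literals = unify(pattern, ["1", "+", "2"])
same_calls = unify(pattern, ["foo()", "+", "foo()"])
different_calls = unify(pattern, ["bar(1)", "+", "bar(2)"])
```

A pattern with several distinct variables works the same way: each variable gets its own entry in the bindings, and only repeated variables constrain the match.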

Unification can occur on different variables at once. For example, the following pattern will find all functions that assign their parameter to a new variable:

$FTYPE $FUNC($TYPE $PARAM) { $TYPE $NEW = $PARAM; }

In that example, only the $TYPE and $PARAM variables undergo unification, because they appear twice. $FTYPE, $FUNC, and $NEW just act as wildcards since they only appear once.


Unification can be used to search for highly specific patterns. For example:

$TYPE $FUNC($TYPE $PARAM) { $TYPE $RTN = $CALL($PARAM); return $RTN; }

The above pattern will match with all functions that:

  • Have a parameter of the same type as their return type
  • Create a new variable with an initial value from a function call that passes the parameter
  • Return the new variable

XPath

Instead of providing a source code pattern for an expression, a valid XPath can be substituted. XPaths can be used to specify a wider range of hierarchical structures within a source code expression, but cannot make use of logical variables or unification.

An example valid XPath would be,

//src:function

which will match to anything srcML marks up as a function.

The provided XPath must start with either / or //, and cannot return anything other than a node set. All other features of XPath 1.0 are usable, including predicates and axes. When querying for tags introduced by srcML, make sure the correct namespace is used - cpp: for any C-Preprocessor features, omp: for any OpenMP features, and src: for any base srcML tag.

srcML Tag

Lastly, a bare srcML tag can be used, which acts as a shortcut to simple XPaths. The tags will use the same namespaces as XPaths.

The following 2 source expressions will match to the same source code:

//src:function -- XPath
src:function -- srcML Tag
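Assuming the shortcut simply prepends `//` to the tag, as the paired example suggests, the expansion can be sketched in Python as follows. The helper name `tag_to_xpath` is hypothetical, and the prefix check only reflects the three namespaces named in the XPath section; the real parser may accept or reject other input.

```python
# Namespace prefixes named on this page for srcML-introduced tags.
KNOWN_PREFIXES = {"src", "cpp", "omp"}


def tag_to_xpath(tag):
    """Expand a bare srcML tag such as "src:function" into "//src:function"."""
    prefix, separator, local_name = tag.partition(":")
    if not separator or not local_name or prefix not in KNOWN_PREFIXES:
        raise ValueError(f"not a namespaced srcML tag: {tag!r}")
    return "//" + tag


expanded = tag_to_xpath("src:function")
print(expanded)
```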

FIND Statement

FIND [source_expr]

The FIND keyword is a no-op operator that indicates other operators are going to be used in the query. The FIND keyword itself does not perform any operation, but is needed for the usage of other operators, especially the set operations.

FIND statements are the arguments for the set operations. Unification can only occur within a FIND statement, and will not work at the set operation level.

The source code expression next to the FIND keyword is known as the "context" of the FIND statement, and is important for the order of operations when using operators.

Both [source_expr] and FIND [source_expr] are equivalent srcQL queries - however, any further operations cannot be used on the former.

Source Expression Specifiers

srcQL will try to auto-determine the type of a source code expression. However, the type can be explicitly set by adding a source expression specifier beforehand.

TAG - used to specify the expression is a srcML tag - FIND TAG [source_expr]
XPATH - used to specify the expression is an XPath expression - FIND XPATH [source_expr]
PATTERN - used to specify the expression is a source code pattern - FIND PATTERN [source_expr]

Operations

Beyond just the source code expressions, srcQL provides operators to either narrow the scope of a query's search or to perform set operations on two query results. This section covers the scoping operations, and the next covers the set operations.

Any occurrence of [source_expr] can be replaced with any valid source code expression.

CONTAINS

FIND [source_expr] CONTAINS [source_expr]
FIND [source_expr] CONTAINS [source_expr] CONTAINS [source_expr]

The CONTAINS keyword will scope the query to only select source code that contains source code which matches the CONTAINS expression.

For example, the following query will match all functions that contain an if statement:

FIND $T $U() {} CONTAINS if() {}

The CONTAINS keyword is not constrained to just direct children of the parent expression, and will match anywhere within the descendants. This means that the above query will produce different results from FIND $T $U() { if() {} }

Expression A - FIND $T $U() {} CONTAINS if() {}
Expression B - FIND $T $U() { if() {} }

// Will match to A and B
void foo() { 
    if(true) { int x; } 
}

// Will match ONLY to A
void bar() { 
    while(true) { 
        if(true) { int y; } 
    }
}

// Will match NEITHER A nor B
void baz(int z) { 
    while(true) { 
        ++z;
    }
}
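The difference between Expression A and Expression B is the difference between a descendant search and a direct-child search. The Python sketch below shows this on a toy tree that mirrors the three functions above; the element names are simplified and are not real srcML markup, and both helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy tree, NOT real srcML: one <function> per example function above.
TREE = ET.fromstring("""
<unit>
  <function name="foo"><block><if/></block></function>
  <function name="bar"><block><while><block><if/></block></while></block></function>
  <function name="baz"><block><while><block><expr/></block></while></block></function>
</unit>
""")


def contains_if(function):
    """Model of Expression A: an <if> anywhere among the descendants."""
    return function.find(".//if") is not None


def directly_contains_if(function):
    """Model of Expression B: an <if> directly inside the function's block."""
    return function.find("./block/if") is not None


matches_a = [f.get("name") for f in TREE.iter("function") if contains_if(f)]
matches_b = [f.get("name") for f in TREE.iter("function") if directly_contains_if(f)]
print(matches_a, matches_b)
```

In this model `foo` and `bar` satisfy A, only `foo` satisfies B, and `baz` satisfies neither.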

Multiple CONTAINS can be used in a query, but they will only ever scope the FIND's local context, and will not scope each other. For example, the following query will get any function that contains a call to an open function and a call to a close function.

FIND $T $U() {} CONTAINS open(); CONTAINS close();

// Will match the query
void foo() { 
    open("file.txt");
    close("file.txt");
}

// Will NOT match - no open call
void bar() { 
    close("file.txt");
}

// Will NOT match - no close call
void baz() { 
    open("file.txt");
}

// Will match - CONTAINS does not check for order, only containment
void qux() { 
    close("file.txt");
    open("file.txt");
}

FOLLOWED BY

FIND [source_expr] CONTAINS [source_expr] FOLLOWED BY [source_expr]
FIND [source_expr] CONTAINS [source_expr] FOLLOWED BY [source_expr] FOLLOWED BY [source_expr]

The FOLLOWED BY keyword is used alongside CONTAINS to check for specific ordering of source code expressions.

As mentioned above, the query FIND $T $U() {} CONTAINS open(); CONTAINS close(); will match with any function that has a call to open() and close() regardless of the order of the function calls. However, if one wanted to ensure that the close() call appears somewhere after the open() call, the following query can be made:

FIND $T $U() {} CONTAINS open() FOLLOWED BY close()

This query will ensure that both the open() and close() calls are within a function, and that the close() call appears after the open() call.

FOLLOWED BY determines precedence using two criteria:

  1. The FOLLOWED BY expression and CONTAINS expression have a common ancestor that matches with the FIND expression
  2. The FOLLOWED BY expression appears after the CONTAINS expression in tree-order

These rules mean that FOLLOWED BY is not limited to the expressions being at the same scope, and can check within different scopes and structures to match.

For example, here is how the query would react to some source code functions:

void foo1() { // Will Match
    open();
    close();
}

void foo2() { // Will NOT Match
    close();
    open();
}

void foo3() { // Will Match
    open();
    if(condition) {
        close();
    }
}

void foo4() { // Will Match
    if(open()) {
        close();
    }
}

void foo5() { // Will Match
    if(condition) {
        open();
    }
    close();
}
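The two criteria can be modeled as a document-order check inside each function. The Python sketch below does this on a toy tree that mirrors foo1 through foo5; the markup is simplified and is not real srcML, and the helper name `open_then_close` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy tree, NOT real srcML: each <call> carries the called name as an attribute.
TREE = ET.fromstring("""
<unit>
  <function name="foo1"><block><call name="open"/><call name="close"/></block></function>
  <function name="foo2"><block><call name="close"/><call name="open"/></block></function>
  <function name="foo3"><block><call name="open"/>
    <if><block><call name="close"/></block></if></block></function>
  <function name="foo4"><block>
    <if><condition><call name="open"/></condition><block><call name="close"/></block></if>
  </block></function>
  <function name="foo5"><block>
    <if><block><call name="open"/></block></if><call name="close"/></block></function>
</unit>
""")


def open_then_close(function):
    """True if some close call comes after the first open call in tree order."""
    # Element.iter walks the subtree in document (pre-order) order, so the
    # list position of a call reflects where it appears in the source.
    calls = [call.get("name") for call in function.iter("call")]
    if "open" not in calls:
        return False
    return "close" in calls[calls.index("open") + 1:]


matched = [f.get("name") for f in TREE.iter("function") if open_then_close(f)]
print(matched)
```

Because the check uses tree order rather than sibling order, nesting either call inside an if statement does not change the outcome, which matches foo3, foo4, and foo5.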

Multiple FOLLOWED BYs can be chained together, with each subsequent FOLLOWED BY constraining the preceding one in terms of order. FOLLOWED BY can only affect CONTAINS or other FOLLOWED BYs, and it cannot directly affect FIND or any other operator.

WITHIN

FIND [source_expr] WITHIN [source_expr]
FIND [source_expr] WITHIN [source_expr] WITHIN [source_expr]

The WITHIN keyword will scope the query to only select source code that appears inside of code which matches the WITHIN's expression. It is effectively the opposite direction of the CONTAINS operation.

For example, the following query will locate all functions defined within any class:

FIND $T $U() {} WITHIN class $C {};

Just like with CONTAINS, direct descent is not required - any level of descent is valid for the WITHIN operation.

Unlike CONTAINS, the WITHIN operation will scope any other source code expression within the query. A WITHIN scopes the expression directly to its left - except for other WITHINs. When multiple WITHINs are present, they function similarly to CONTAINS, where the order does not matter. The following are all valid srcQL queries which use WITHIN:

// Finds all ifs that are in a while loop
FIND if() {} WITHIN while() {}

// Finds all variable declarations inside functions which are inside a class
FIND $T $U; WITHIN $R $F() {} WITHIN class $C {};

// Finds all ifs that are in a while and contain an integer declaration
FIND if() {} WITHIN while() {} CONTAINS int $x;

// Finds all functions that contain an if statement which is inside a while loop
FIND $T $U() {} CONTAINS if() {} WITHIN while() {}

// Finds all declarations inside a while loop and an if statement. The order of hierarchy of the loop and if is irrelevant
FIND $T $U WITHIN if() {} WITHIN while() {}

// Finds all functions that contain a call to the `open()` function, which is followed by a call to the `close()` function. The `close()` call must be inside an if statement
FIND $T $U() {} CONTAINS open(); FOLLOWED BY close(); WITHIN if() {}
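WITHIN can be pictured as an ancestor test, the mirror image of the descendant test behind CONTAINS. The Python sketch below models `FIND if() {} WITHIN while() {}` on a toy tree; the markup is simplified and is not real srcML, and the helper name `ancestors` is hypothetical. ElementTree elements do not store a parent pointer, so the sketch builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET

# Toy tree, NOT real srcML.
TREE = ET.fromstring("""
<unit>
  <while><block><if id="a"/></block></while>
  <if id="b"><block><while/></block></if>
  <function><block><while><block><for><block><if id="c"/></block></for></block></while></block></function>
</unit>
""")

# Map every element to its parent so ancestors can be walked upwards.
PARENTS = {child: parent for parent in TREE.iter() for child in parent}


def ancestors(element):
    """Yield the parent, grandparent, and so on up to the root."""
    while element in PARENTS:
        element = PARENTS[element]
        yield element


ifs_within_while = [
    node.get("id")
    for node in TREE.iter("if")
    if any(ancestor.tag == "while" for ancestor in ancestors(node))
]
print(ifs_within_while)
```

The if with id `b` contains a while loop but is not inside one, so it is excluded; the if with id `c` qualifies even though a for loop sits between it and the while loop.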

WHERE

FIND [source_expr] WHERE [where_clause]

The WHERE keyword in srcQL allows for more specific checks and patterns to be evaluated that would otherwise be difficult to achieve through srcQL. WHERE provides a set of custom calls which can be invoked to provide final-level checks on queries.

MATCH

FIND [source_expr] WHERE MATCH([logical_variable],[regex])

The MATCH call in a WHERE clause will check the current value of the logical variable and determine whether its text matches the provided regex.

For example, the following query will locate any function whose name begins with srcml_:

FIND $TYPE $NAME() {} WHERE MATCH($NAME,"srcml_.*")
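The effect of such a filter can be previewed with Python's `re` module. This is a sketch under two assumptions that this page does not confirm: that the regex must match the whole name, and that the regex dialect behaves like Python's for these simple patterns. The function names in the list are hypothetical.

```python
import re


def match_names(names, regex):
    """Keep the names that the regex matches in full."""
    return [name for name in names if re.fullmatch(regex, name)]


# Hypothetical function names bound to $NAME by the pattern.
names = ["srcml_parse", "srcmlfoo", "my_srcml_helper", "main"]

with_underscore = match_names(names, r"srcml_.*")
without_underscore = match_names(names, r"srcml.*")
print(with_underscore, without_underscore)
```

Note that a regex without the underscore also accepts names such as `srcmlfoo`, so the underscore matters when the intent is to select only names beginning with srcml_.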

NOT

FIND [source_expr] WHERE NOT([source_code_pattern])

The NOT call in a WHERE clause will check the immediately preceding source code expression and verify it does not match with the given source code expression. This expression will be of a different scope, and thus have its own unification.

For example, the following query will locate all if statements that do not use true inside their condition:

FIND if() {} WHERE NOT(if(true) {})

Currently, the NOT call only supports source code patterns, and does not support the other source code expressions.

COUNT

FIND [source_expr] WHERE COUNT([source_expr])

The COUNT call can be used alongside comparison operators to limit results based on the number of certain elements. The COUNT call's expression is run on the preceding source code expression.

For example, the following query will select all integer function declarations with more than 5 parameters:

FIND int $FNAME(); WHERE COUNT(src:parameter) > 5

FROM

[find_expression] FROM [find_expression]

The FROM keyword allows for changing where a FIND query starts its search from. By default, a FIND statement starts by searching the top level of the srcML. Chaining together two FIND expressions allows this searching position to be changed. For example, the following query will get the names of all classes that contain a destructor.

FIND /src:class//src:name FROM FIND class $T { ~$T() {} };

This operation is similar to chaining together two XPaths with //, such as //src:class//src:name, but allows for the finer-grain control of srcQL Queries.

The value of the FROM operation is to allow grabbing specific sub-elements of a code structure that you also need to check against specific criteria. Without this, one would need to write a query that gathers all the classes that contain a destructor and write an additional post-processing step to extract the name of the class by running an additional XPath, srcQL Query, or string processing step.

Multiple FROMs can be chained together, and will operate right-to-left, with the left-most FIND containing the final local context which the query will return.
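FROM behaves like a two-stage search: the right-hand FIND selects the starting elements, and the left-hand FIND then runs relative to each of them. The Python sketch below models that on a toy tree; the markup is simplified and is not real srcML, the helper names are hypothetical, and the second stage reads only the class's own name for brevity.

```python
import xml.etree.ElementTree as ET

# Toy tree, NOT real srcML: one class with a destructor, one without.
TREE = ET.fromstring("""
<unit>
  <class><name>File</name><block><destructor><name>File</name></destructor></block></class>
  <class><name>Point</name><block><decl><name>x</name></decl></block></class>
</unit>
""")


def classes_with_destructor(root):
    """Stage 1 (right-hand FIND): select classes that declare a destructor."""
    return [cls for cls in root.iter("class") if cls.find(".//destructor") is not None]


def class_names(classes):
    """Stage 2 (left-hand FIND): search relative to each stage-1 result."""
    return [cls.findtext("name") for cls in classes]


names = class_names(classes_with_destructor(TREE))
print(names)
```

Without the first stage, the second stage would also return `Point`; chaining the two is what restricts the names to classes that satisfy the destructor condition.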

Set Operations

Set operations occur above the FIND level of a srcQL query, and allow set math to be performed on the results from multiple FIND operations. Set operations can also be chained together, and always operate on a left-to-right basis. Unification does NOT occur across FINDs, and will remain local to each FIND.

UNION

[find_expression] UNION [find_expression]
[find_expression] UNION [find_expression] UNION [find_expression]

UNION allows the results of two different FIND expressions to be combined into a single pool. This works on any type of results, even results with different top-level tags. For example, the following query will get all functions and all classes within the source code:

FIND $T $U() {} UNION FIND class $C {};

INTERSECTION

[find_expression] INTERSECTION [find_expression]
[find_expression] INTERSECTION [find_expression] INTERSECTION [find_expression]

INTERSECTION allows common results within two different FIND expressions to be collected, and anything not in both discarded. This can be used to easily select specific code constructs that need to meet two very different conditions. For example, the following query will select all functions that have at least 3 parameters and are recursive:

FIND $T $U($A,$B,$C) {} INTERSECTION FIND $T $U() {} CONTAINS $U();

DIFFERENCE

[find_expression] DIFFERENCE [find_expression]
[find_expression] DIFFERENCE [find_expression] DIFFERENCE [find_expression]

DIFFERENCE allows the removal of specific results from a FIND expression by subtracting the results of a different FIND expression. This allows for a subtractive approach to querying source code. For example, the following query will select only functions that have exactly 2 parameters:

FIND $T $U($A, $B) {} DIFFERENCE FIND $T $U($A, $B, $C) {}

This query works by taking the set of all functions with at least 2 parameters, and then removing the functions with at least 3.
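The same reasoning can be checked with ordinary Python sets, where each FIND result is modeled as a set of function names. The function names and parameter counts below are hypothetical, and `at_least` stands in for a subsuming pattern with that many parameters.

```python
# Hypothetical functions mapped to their parameter counts.
PARAM_COUNTS = {"f0": 0, "f1": 1, "f2": 2, "f3": 3, "f4": 4}


def at_least(count):
    """Model of a pattern with `count` parameters: subsumption makes it
    match every function with that many parameters or more."""
    return {name for name, params in PARAM_COUNTS.items() if params >= count}


# A DIFFERENCE B: at least 2 parameters, minus at least 3, leaves exactly 2.
exactly_two = at_least(2) - at_least(3)

# Chained operations evaluate left to right: (A DIFFERENCE B) DIFFERENCE C.
left_to_right = (at_least(1) - at_least(2)) - at_least(3)
# Grouping from the right instead would give a different answer.
right_to_left = at_least(1) - (at_least(2) - at_least(3))

# UNION and INTERSECTION map onto the usual set operators.
union = at_least(4) | exactly_two
intersection = at_least(2) & at_least(3)
print(exactly_two, left_to_right, right_to_left)
```

The two groupings of the chained DIFFERENCE produce different sets, which is why the left-to-right rule matters when chaining set operations.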

Example Queries

// Find all expressions
FIND $T
FIND src:expr

// Find all expressions which use the `new` operator
FIND new $T
FIND new

// Find all function definitions
FIND src:function
FIND $T $N() {}

// Find all function declarations
FIND $T $N();

// Find all variable declarations
FIND $T $N
FIND src:decl

// Find all variable declaration statements
FIND $T $N;
FIND src:decl_stmt

// Find all variable declarations with a pointer type
FIND $T * $N;

// Find all functions which use `new`
FIND src:function CONTAINS new

// Find all functions which use `new` and then delete the variable
FIND src:function CONTAINS $T $N = new $T FOLLOWED BY delete $N

// Find all functions which call an `open` function, and then call a `close` function
FIND src:function CONTAINS open() FOLLOWED BY close()

// Find all functions which call themselves
FIND $T $U() {} CONTAINS $U()

// Find all functions within a class
FIND //src:class//src:function
FIND src:function WITHIN src:class
FIND $T $U() {} WITHIN class $C {};

// Find all classes that use the name `foo` followed by a digit
FIND class $C {}; WHERE MATCH($C,"^foo[0-9]")

// Find all the types used in declarations
FIND //src:decl/src:type

// Find all the types used in declarations with the `new` operator
FIND //src:decl/src:type FROM FIND $T $U = new $T

// Find the names of all classes
FIND //src:class/src:name
FIND /src:class/src:name FROM class $C {};

// Find the names of all classes which declare a destructor
FIND /src:class/src:name FROM class $C { ~$C(); };

// Find all declarations that aren't of type int
FIND $T $U DIFFERENCE FIND int $U

// Find all declarations that don't use an init
FIND $T $U DIFFERENCE FIND $T $U = $I
FIND $T $U WHERE not($T $U = $I)
FIND //src:decl[not(src:init)]