Type-awareness in semantic grep

by Emma Jin

tl;dr: Semgrep now lets you specify types in the code patterns you write, so you can find bugs and antipatterns or enforce best practices even more precisely.

  • In Java, specify types like this: (Type $VAR), like (Runtime $RT).exec(...).
  • In Golang, specify types like this: ($VAR : Type), like ($RT : Runtime).exec(...).

If you’re like me, you grep your code all the time, whether to find a function or to panic-search for that antipattern you suddenly remembered. And if you’re like me, you hate it when your convoluted regex still produces hundreds of results you have to sift through. After all, there’s (PopularTVShow $X) to watch.

We at r2c got tired of that happening, so we’re developing Semgrep, an open-source, lightweight static analysis tool that lets you search code the way you write it.

We created Semgrep to understand code the way you do. Recently, I added support for specifying types in Semgrep patterns, so that you can reduce the noise in your results. In this blog post, I’ll explain why this is useful, show how to use this feature yourself, and, if compilers and programming languages are also your jam, give a peek behind the curtain about how it works.

What is Semgrep and how can I use it?

To start, I'll explain what Semgrep and metavariables are, before getting into typed metavariables and what they do.

With all the frameworks, libraries, and odd quirks of language that are out there, many bugs are as simple as calling a function without the right argument. Semgrep is an open-source, lightweight static analysis tool that helps programmers find bugs by providing an intuitive syntax for writing checks. It uses the existing syntax of the language, with a few easy-to-learn operators that add generality.

The ellipsis operator (...) lets you say, “I don’t care what’s in here,” to match 0 or more arguments, statements, etc.

[Figure: exec example]
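To make the “0 or more” part concrete, here’s a minimal Python sketch of how an ellipsis could match an argument list. It is only an illustration of the idea, not Semgrep’s implementation; match_args and the string-based arguments are made up for this example.

```python
def match_args(pattern, args):
    """Return True if args matches pattern, where "..." matches zero or more items."""
    if not pattern:
        return not args
    head, rest = pattern[0], pattern[1:]
    if head == "...":
        # Let the ellipsis absorb 0, 1, 2, ... leading arguments and try the rest.
        return any(match_args(rest, args[i:]) for i in range(len(args) + 1))
    return bool(args) and args[0] == head and match_args(rest, args[1:])


print(match_args(["..."], []))                                            # True
print(match_args(["...", "secure=True", "..."], ["key", "secure=True"]))  # True
print(match_args(["...", "secure=True", "..."], ["key"]))                 # False
```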

Metavariables are like capture groups in regular expressions; they give you a named reference to something you’re matching in the source code, whether it’s a function name, variable name, method body, or more.

[Figure: metavariable example]
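If the capture-group comparison feels abstract, this small Python regex shows the analogy. A named group plays the role of $X, and a backreference plays the role of using $X a second time. The regex itself is just an illustration; Semgrep matches on code structure, not on text.

```python
import re

# (?P<X>...) captures some text under the name X; (?P=X) requires the same
# text again, much like reusing $X twice in a pattern such as $X == $X.
same_operand = re.compile(r"(?P<X>\w+) == (?P=X)")

print(bool(same_operand.fullmatch("a == a")))  # True
print(bool(same_operand.fullmatch("a == b")))  # False
```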

To see a pattern in action, let’s say you’re using flask, a Python framework for building web applications, and you want to make sure that when you call response.set_cookie, you use httponly=True and secure=True. To do this, you can search for all calls to response.set_cookie except the ones with the arguments you need.

That means you want everything matching

flask.response.set_cookie(...)

unless it specifically matches

flask.response.set_cookie(..., httponly=True, secure=True, ...);

You can use YAML to write a pattern that combines the two like this:

patterns:
  - pattern-not: |
      flask.response.set_cookie(..., httponly=True, secure=True, ...)
  - pattern: |
      flask.response.set_cookie(...)

See this pattern work in the live editor!
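To show what the combination of pattern and pattern-not selects, here’s a rough Python sketch of the same logic using the standard ast module (it needs Python 3.9+ for ast.unparse). This is not how Semgrep evaluates the rule; SOURCE, is_true, and insecure_set_cookie_lines are hypothetical names for this illustration.

```python
import ast

SOURCE = '''
flask.response.set_cookie("id", value)
flask.response.set_cookie("id", value, httponly=True, secure=True)
flask.response.set_cookie("id", value, secure=True)
'''


def is_true(node):
    return isinstance(node, ast.Constant) and node.value is True


def insecure_set_cookie_lines(source):
    """Line numbers of set_cookie calls lacking httponly=True and secure=True."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        # "pattern": any call to flask.response.set_cookie
        if not (isinstance(node, ast.Call)
                and ast.unparse(node.func) == "flask.response.set_cookie"):
            continue
        # "pattern-not": skip calls where both keyword arguments are True
        kwargs = {kw.arg: kw.value for kw in node.keywords}
        if is_true(kwargs.get("httponly")) and is_true(kwargs.get("secure")):
            continue
        flagged.append(node.lineno)
    return sorted(flagged)


print(insecure_set_cookie_lines(SOURCE))  # [2, 4]
```

The first and third calls are reported; the second is filtered out because it sets both flags.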

If you want to learn more about Semgrep, go here to learn how we used Semgrep to find serious JWT-related security issues in open source repos and here for an overview of why we’re building Semgrep and how to use it. You can also learn to use Semgrep right from your browser using this interactive tutorial.

Why care about types?

Not all variables are created equal. Sometimes, we might want to know more about our arguments. Consider this Java code which queries a database:

public ResultSet findUser(int userId) { return queryUserWrapper(Integer.toString(userId)); }

public ResultSet findUser(String userId) { return queryUserWrapper(userId); }

private ResultSet queryUserWrapper(String userId) {
    // The query is built by concatenation, so userId reaches the database unescaped.
    String query = "select * from users where user_id=" + userId;

    Statement st = dbConnection.createStatement();
    return st.executeQuery(query);
}

where we run the queries:

public @ResponseBody int getUser(@PathVariable(value="userID") String id) {
    // Query 1: unsanitized String
    findUser(id);

    // Query 2: processed int
    int userId = Integer.parseInt(id);
    findUser(userId);

    return 200;
}

The second example, where we query on an int, is perfectly safe. We’ll send in the integer, add it to the query, and run the call. However, the first example, where we query on a String, is vulnerable to a SQL injection attack: running queries built from arbitrary strings could let a user supply input that the database then runs as SQL. When we write a pattern to search for this issue, we want to take this distinction into consideration.
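Here’s a tiny sketch of why the two paths differ, written in Python for brevity. build_query is a hypothetical stand-in for the string concatenation in queryUserWrapper, not code from the example above.

```python
def build_query(user_id):
    # Mirrors the concatenation in queryUserWrapper.
    return "select * from users where user_id=" + str(user_id)


# Int path: parsing succeeds only for a plain number.
print(build_query(int("42")))        # select * from users where user_id=42

# String path: the attacker's text becomes part of the SQL.
print(build_query("42 or 1=1"))      # select * from users where user_id=42 or 1=1

# The same input never reaches the query on the int path.
try:
    int("42 or 1=1")
except ValueError:
    print("rejected before the query is built")
```

So a pattern for this bug should flag the String path and leave the int path alone.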

One way to do this would be to track the type ourselves. There are only so many ways a variable can be created, so we could explicitly enumerate them:

- pattern-inside: |
    $T $FUNC(..., String $X, ...) {
      ...
    }
- pattern-either:
    - pattern: findUser($X)
    - pattern: |
        String $Y = $X;
        ...
        findUser($Y);

This says that, within a function where a variable $X is passed in as a String, the code either calls findUser on $X, or it assigns it to an intermediate variable $Y and calls findUser on $Y.

The pattern is already getting long, and it’s still missing cases. $X could be a field of a class, or the code could have declared String $X; and set $X = $Y later, and so on. Ultimately, this is a clunky solution. What this really needs is type checking.

Cutting out the noise with Semgrep typed metavariables

Semgrep now allows you to indicate that metavariables should only match code with a certain type. This allows your patterns to be even more precise, as it filters out potential matches you don’t care about.

Specifically, if you have a local variable or an argument, Semgrep will remember its type for when you use it later. Currently only Java and Golang are supported, though other languages are in progress.

This section will show you how to use typed metavariables; later, I’ll go into more detail about how it works.

To specify that the metavariable $X must be a String, we simply need to replace it with (String $X).

The pattern from earlier is greatly simplified. Instead of using a pattern-inside and having multiple cases, we can find this error using a pattern with only one clause:

- pattern: findUser((String $X))

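That’s it! Note the double parentheses: they’re necessary. The inner pair marks (String $X) as a typed metavariable, and the outer pair belongs to the function call.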

To see it in action, try it out on semgrep.dev! This is the original version, and this is the version that uses typechecking.

In the above example, only variables really mattered, since a hardcoded string would be safe to pass into a query. However, there might be functions where any call against a string is suspect, or you might want to flag any call with a string, just to check. Therefore, typed metavariables will also understand the type of a string literal and match it. In the above example, a call that passes a string literal directly to findUser would also be matched.

To exclude that case, you could include in your config

- pattern-not: findUser("...")

Metavariables as types

In the previous example, we used String as our type, but it’s also possible to use a metavariable. Let’s say that we had a function add_new(), which was overloaded to take in objects of various types, and we wanted to make sure add_new() was never called on the same type twice. We could write a pattern like:

pattern: |
  add_new(($T $X));
  ...
  add_new(($T $Y));

This captures the idea that once add_new has been called on a variable of some type, it cannot be later called on a variable of the same type.

See it run on the live editor.

We also support Go!

The previous example was for Java, and the syntax for pattern matching there was based on variable declaration syntax in Java. As a general rule, in Semgrep, we want to keep pattern syntax as close as possible to the original language’s syntax, so that anyone can write patterns as easily as they can write code. There’s no need to learn a complicated syntax or domain specific language.

Because of this philosophy, the Golang pattern syntax is a little different. In Go, a variable declaration might look like:

var i float = 1.1
i == 1.1

Therefore, the typed metavariable syntax for Go matches the original code as closely as possible without causing any ambiguities. To match the above, we would write something like

- pattern: ($X : float) == $Y

Try this one on the live editor.

If you look at the above link, you’ll also notice that you don’t have to explicitly state the type of the variable. Since Golang allows variables to be implicitly typed when initialized with a literal, our typed metavariables also perform that inference. Suppose the code had instead been:

var i = 1.1
i == 1.1

Our pattern ($X : float) == $Y would recognize i as a float and match it.

How do typed metavariables work?

Alright, so we’ve discussed how to use typed metavariables. Here’s how Semgrep does this under the hood.

At its core, code is structured, not as a string, but as a tree of constructs which relate hierarchically to each other. For example, the expression x + (y * z) == 2 is better understood as:

[Figure: simple AST]

This representation shows how we interpret the code structurally. First, we multiply y and z. Then, we add x. Finally, we compare the result to 2.

Notice that the parentheses aren’t included in the tree. To simplify the representation, we only store the content-related details. This is called an Abstract Syntax Tree, or AST. (Actually, this isn’t quite true: our tree includes some tokens to help us go from the tree back to code, but for the most part we abstract it away.)
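You can poke at a tree like this yourself. The sketch below uses Python’s built-in ast module on the same expression. Python’s AST is not Semgrep’s generic AST, so the node names are Python’s own, but the shape is the same: a comparison at the root, an addition under it, and a multiplication under that.

```python
import ast

tree = ast.parse("x + (y * z) == 2", mode="eval").body

# The root is the comparison; its left side is the addition,
# whose right operand is the multiplication.
print(type(tree).__name__)                # Compare
print(type(tree.left.op).__name__)        # Add
print(type(tree.left.right.op).__name__)  # Mult

# The full dump contains no node for the parentheses.
print(ast.dump(tree))
```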

At r2c, we use a “generic AST” which is, roughly speaking, the union of the ASTs for all the languages we support.

At a high level, Semgrep works like this:

  1. The pattern and code you give to Semgrep are each parsed into the “generic AST”.
  2. Semgrep traverses the code AST to perform a structural comparison of the pattern to the parts of the code.
  3. Semgrep outputs the matched results.

[Figure: Semgrep flow]
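The parse, traverse, and report flow can be sketched in a few lines of Python with the standard ast module. This is a deliberately simplified stand-in (no metavariables, no ellipsis, and Python’s AST instead of the generic AST), and find_matches is a hypothetical helper for this illustration.

```python
import ast


def find_matches(pattern_src, code_src):
    # Parse both the pattern and the code into trees.
    pattern = ast.parse(pattern_src, mode="eval").body
    code = ast.parse(code_src)
    wanted = ast.dump(pattern)
    # Visit every expression in the code tree and compare structures.
    hits = [node.lineno for node in ast.walk(code)
            if isinstance(node, ast.expr) and ast.dump(node) == wanted]
    # Report the matching line numbers.
    return sorted(hits)


code = "a = x+(y*z)\nb = x + y * z\nc = (x + y) * z\n"
print(find_matches("x + y * z", code))  # [1, 2]
```

Lines 1 and 2 match even though their spacing and parentheses differ, because the comparison happens on trees rather than on text. Line 3 has a different structure, so it is skipped.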

When Semgrep goes through the variables, it assigns each of them a unique id, so that later, if that same variable shows up again, it’ll know it’s the same variable. In the code base, we call this “naming”. Since variables are declared with their types, when we see a variable declaration, we can save that type in addition to the unique id.

When those variables are later used, Semgrep remembers the type of the variable, and can therefore match on it.
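As a rough sketch of this bookkeeping, assume declarations arrive as simple (name, declared type, initializer) tuples. That tuple format, name_variables, and LITERAL_TYPES are all made up for this example, and a real implementation also has to handle scopes and shadowing, which this one ignores.

```python
LITERAL_TYPES = {int: "int", float: "float", str: "string"}


def name_variables(declarations):
    """Give each declared variable a unique id and remember its type.

    declared_type is None when the source omits it, as in Go's `var i = 1.1`.
    """
    table = {}
    for uid, (name, declared_type, initializer) in enumerate(declarations):
        # Prefer the declared type; otherwise fall back to the literal's type.
        var_type = declared_type or LITERAL_TYPES.get(type(initializer))
        table[name] = {"id": uid, "type": var_type}
    return table


table = name_variables([("s", "String", None), ("i", None, 1.1), ("r", None, None)])
print(table["s"])  # {'id': 0, 'type': 'String'}
print(table["i"])  # {'id': 1, 'type': 'float'}
print(table["r"])  # {'id': 2, 'type': None}
```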

Let’s go back to the “compare ASTs to find matching structures” step. Without types, if we were comparing a pattern to code and reached a pair of nodes, one a metavariable (say, $X) and the other a variable (say, a), we would bind $X to a. Later on, when we encountered $X again, we would expect it to match a.

[Figure: matching, no type]
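Here’s a minimal sketch of that binding step, with trees written as nested Python tuples. The tuple encoding and the match function are assumptions for illustration only; Semgrep’s real matcher works on its generic AST.

```python
def match(pattern, code, bindings):
    """Structurally compare two trees; strings starting with "$" are metavariables."""
    if isinstance(pattern, str) and pattern.startswith("$"):
        if pattern in bindings:
            return bindings[pattern] == code  # must agree with the earlier binding
        bindings[pattern] = code              # first time we see it: bind it
        return True
    if isinstance(pattern, tuple) and isinstance(code, tuple):
        return len(pattern) == len(code) and all(
            match(p, c, bindings) for p, c in zip(pattern, code))
    return pattern == code


bindings = {}
print(match(("==", "$X", "$X"), ("==", "a", "a"), bindings), bindings)  # True {'$X': 'a'}
print(match(("==", "$X", "$X"), ("==", "a", "b"), {}))                  # False
```

Each match attempt should start from a fresh bindings dictionary, since a failed attempt can leave partial bindings behind.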

However, if $X is a typed metavariable, we also need to consider the type of a, which, in this example, we have saved to be a String. If $X is supposed to be an int, the two cannot match, and the comparison fails.

[Figure: matching with a type]

Once a metavariable is bound, typed or otherwise, the binding is used in later matchings. This includes types, which is how you can use metavariables as your types.
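The type check and the “metavariable as a type” case can be sketched together. In this hypothetical Python version, var_types stands for the types saved at declaration time, and refusing to match variables of unknown type is an assumption of the sketch rather than a statement about Semgrep’s behavior.

```python
def bind(name, value, bindings):
    """Bind metavariable `name` to `value`, or check it against an earlier binding."""
    return bindings.setdefault(name, value) == value


def match_typed(metavar, wanted_type, var, var_types, bindings):
    """Match a typed metavariable such as (String $X) or ($T $X) against `var`."""
    actual = var_types.get(var)       # type saved when `var` was declared
    if actual is None:
        return False                  # assumption: unknown types never match
    if wanted_type.startswith("$"):
        if not bind(wanted_type, actual, bindings):  # the type is a metavariable
            return False
    elif wanted_type != actual:
        return False
    return bind(metavar, var, bindings)


var_types = {"a": "String", "b": "String", "n": "int"}
print(match_typed("$X", "int", "a", var_types, {}))     # False
print(match_typed("$X", "String", "a", var_types, {}))  # True

shared = {}
print(match_typed("$X", "$T", "a", var_types, shared))  # True, binds $T to String
print(match_typed("$Y", "$T", "b", var_types, shared))  # True, b is also a String
print(match_typed("$Z", "$T", "n", var_types, shared))  # False, n is an int
```

The last three calls mirror the add_new example: once $T is bound to String, a later ($T $Y) only matches another String.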

Work in progress

While we’ve already found using typed metavariables quite useful in practice, this is still an early iteration, and there’s much left to do.

Specifically, our implementation currently only propagates declared types of variables through all usages, with some limited inference of literals. Cases Semgrep does not currently handle include:

  • Function application
  • Array indexing
  • Explicit type casts (e.g. (String) var)
  • Field accesses for structs in Golang or Objects in Java, even if the struct is in the same file

Some of these features are easier to add than others. For example, adding support for array indexing and casts should be fairly straightforward. For understanding imported functions and structs, Semgrep would have to be able to understand how multiple files are connected, which would need extra information.

Specific to Golang, if multiple variables are declared in the same line, like this:

var i, j = 1, 1

we are not currently able to recognize both i and j as ints, since the parser doesn’t realize both i and j are being assigned 1. In this case, only i would be assigned a type.

What about typechecking for < my favorite language >?

It depends on the language.

Certain languages are much easier to do this for than others. Java and Golang are statically typed languages, so the type of every variable can be inferred without running the program. Java is particularly easy because every variable is declared with its type. This means we always know what the type of a variable is, whereas with Golang, the compiler will sometimes infer the type based on the value assigned, which we had to do as well. This gets more difficult for other statically typed languages like TypeScript or OCaml, since it would take more type inference and likely require us to link other files.

Languages like Python and JavaScript are dynamically typed, which means they determine types at runtime. Our method for type propagation actually falls apart for these languages, since it assumes that the type of each variable remains the same throughout the program. In Python, this does not have to be true.
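A three-line Python example shows the problem. The same name holds an int and then a string, so a type recorded at the first assignment would be wrong by the second.

```python
x = 1
first = type(x).__name__
x = "one"                 # the same variable now holds a string
second = type(x).__name__

print(first, second)      # int str
```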

A lot of work is being done on type inference for Python (see Pyre, for example), and building on this work, we might one day be able to have typed metavariables for dynamically typed languages. For now, however, we’ll work on implementing it for statically typed languages.

As always, let us know what you need!

About the author

I’m a summer intern at r2c and typed metavariables were my first project! I came in intimidated by the size of the AST, confused about menhir (the OCaml parser generator we use), and unsure where to really start. With mentorship help and a lot of print statements, I figured out how to extend the AST values, pinpointed where I needed to make changes, and figured out why I was getting conflicts with certain implementations. Changing the grammar was a matter of tracing the logic to find where a clause belonged, but it was a hunt that ended in a cool new feature.

You can find me on LinkedIn: https://www.linkedin.com/in/emma-j-99366512a/


Huge thanks to Yoann Padioleau (r2c), who worked with me this summer, Brendon Go (r2c), my mentor/rubber duck, and to Clint Gibler (r2c) and Jean Yang (Akita) for their help with this blog post.