Taint mode is now in beta

by Iago Abal on October 21, 2021

As you may know, we at r2c are strong believers in guardrails. When we began developing Semgrep, that was our main focus, and we knew that lightweight static analysis, based on syntax-aware matching, would excel at enforcing secure defaults. But since then, we have found the community and ourselves often abusing Semgrep to write rules that find (rather than prevent) injection vulnerabilities.

For sure, you can use Semgrep's matching engine (that is, the default search mode) to find injection vulnerabilities, and a good proof of that is Nodejsscan by Ajin Abraham, powered by Semgrep. But there is a technique designed for this very purpose: taint analysis, which is a specific kind of data-flow analysis. With Semgrep's search mode you would typically write rules like this one, which try to simulate a data-flow analysis and which we (lovingly) call "fake-taint" rules. The drawback is that these rules tend to be large and cumbersome to maintain, and they miss bugs like this one:

const s = new Sandbox();
var user_input = "lol(" + req.query.userInput + ")";
var code = Math.random() > 0.5 ? user_input : "all good";
// ruleid:express-sandbox-code-injection
s.run(code, cb);

We think of Semgrep as a powerful tool, not only for enforcing secure defaults, but also for lightweight intra-procedural bug finding. And we believe that there is a need for a lightweight approach to taint analysis. That is why we added taint mode to Semgrep back in 2020, first as an experimental feature. More recently, in Q3 2021, we made great improvements to taint mode, upgrading it from an experiment to something that you can now use to write simpler yet powerful taint-tracking rules like this one, and easily find bugs like the one above.

We are happy to announce that taint mode has officially been in beta since Semgrep v0.70.0, and we want to encourage you all to try it out! And, of course, please let us know what you like, what you miss, or any other thoughts you may have on it.

Taint-tracking rules should perform on par with your equivalent fake-taint rules, but they are more succinct, easier to maintain, and catch more bugs, including more complex ones! At r2c we are definitely "eating our own dog food". We have started using taint mode for new rules added to the Semgrep registry, and we have been porting some of the existing rules too.

Before going into further details, we want to thank Erwan Le Rousseau, one of the authors of WPScan, for being an early adopter and giving us plenty of valuable feedback!

Finding injection vulnerabilities with fake-taint rules

Let’s say that we want to find XSS vulnerabilities when creating an angular.element using untrusted input and then calling methods such as html or wrap on it (see slide 30 of this presentation by Lewis Arden).

If we use Semgrep to find this kind of bug then we may start with a rule that looks like this one:
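The original rule is not reproduced here, so the following is only a minimal sketch of what such a first attempt might look like. The rule id is taken from the test annotation in the snippet further down; the message, severity, and exact patterns are assumptions made for illustration.

```yaml
rules:
  - id: detect-angular-element-methods
    languages: [javascript]
    severity: WARNING
    message: "Untrusted scope input is passed to angular.element (possible XSS)"
    patterns:
      # Only look inside functions that receive a scope-like parameter.
      - pattern-inside: |
          function(..., $SCOPE, ...) { ... }
      # Flag only the direct, single-expression form of the bug.
      - pattern-either:
          - pattern: angular.element($SCOPE.input).html(...)
          - pattern: angular.element($SCOPE.input).wrap(...)
```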

Essentially, we are trying to specify what insecure code looks like. This is not trivial! For sure, we can use this very simple rule to catch (similarly) very simple cases of XSS-vulnerable code like:

app.controller("myCtrl", function ($scope) {
  // ruleid: detect-angular-element-methods
  return angular.element($scope.input).html();
});

But we quickly realize that $scope.input may reach angular.element(...) through an intermediate variable, so we may refine our rule this way:
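As a hedged sketch (not the original rule from this post), the refinement could add a second alternative that follows the input through one assignment. The metavariables `$X` and `$METHOD` and the use of `metavariable-regex` are illustrative choices.

```yaml
rules:
  - id: detect-angular-element-methods
    languages: [javascript]
    severity: WARNING
    message: "Untrusted scope input is passed to angular.element (possible XSS)"
    patterns:
      - pattern-inside: |
          function(..., $SCOPE, ...) { ... }
      - pattern-either:
          # Direct use of the untrusted input.
          - pattern: angular.element($SCOPE.input).$METHOD(...)
          # Use through exactly one intermediate variable.
          - patterns:
              - pattern-inside: |
                  $X = $SCOPE.input;
                  ...
              - pattern: angular.element($X).$METHOD(...)
      # Restrict the method to the XSS-prone ones.
      - metavariable-regex:
          metavariable: $METHOD
          regex: ^(html|wrap)$
```

Even this version only covers a single hop: a second intermediate variable, or a conditional expression, would need yet another alternative.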

Unfortunately, there are virtually infinite ways we could end up creating an angular.element using untrusted input. As we try to cover as many cases as possible, our rule will become more complicated and harder to maintain (even though metavariable-pattern may help here). This does not scale, and it is a symptom that we are not using the best tool for the problem. That said, you can get quite far and be rather effective just by using Semgrep’s search engine. In fact, the Semgrep registry has many fake-taint injection rules that have done a good job. But, in order to catch more complex injection bugs while keeping rules simpler, we need an engine that can track the flow of data through the code, and that is Semgrep’s taint mode!

Finding injection vulnerabilities with taint mode

Taint analysis, or taint tracking, is a kind of data-flow analysis that tracks the flow of untrusted (aka "tainted") data throughout a program. The analysis raises an alarm whenever such data goes into a vulnerable function (aka "sink") without first having been checked or transformed accordingly (aka "sanitized"). In our example, the untrusted data (or "source of taint") is $scope.input, the sinks are the methods .html() and .wrap() of the object returned by angular.element, and sanitization is possible through the function $sanitize().

Taint mode is a lightweight but powerful implementation of taint analysis. You start by using Semgrep’s intuitive pattern syntax to specify what the sources, sanitizers, and sinks are. For that, you can now use any pattern operator, and whatever matches will be annotated accordingly. (In earlier versions of taint mode you were limited to a single pattern, and removing this limitation was one of the key changes we made in Q3.) Then we run a lightweight intra-procedural taint analysis based on those annotations, and report the findings back to you.

Getting back to our running example, and thanks to taint mode, we can now write this rule:
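The rule itself is not reproduced here. A minimal sketch, consistent with the description in the following paragraphs, might look like the one below; the rule id, message, severity, and the exact shape of the function pattern are assumptions.

```yaml
rules:
  - id: detect-angular-element-taint   # hypothetical id
    mode: taint
    languages: [javascript]
    severity: WARNING
    message: "Untrusted scope input reaches an angular.element method (possible XSS)"
    pattern-sources:
      # $SCOPE.input is a source, but only inside a function taking $SCOPE.
      - patterns:
          - pattern-inside: |
              function(..., $SCOPE, ...) { ... }
          - pattern: $SCOPE.input
    pattern-sanitizers:
      # Lowercase $sanitize is a literal JavaScript identifier, not a metavariable.
      - pattern: $sanitize(...)
    pattern-sinks:
      - pattern: angular.element(...).html(...)
      - pattern: angular.element(...).wrap(...)
```

Note that the rule no longer describes how the input reaches the sink; that part is left to the data-flow engine.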

A taint rule is made of three sets of annotations: sources, sanitizers, and sinks. With pattern-sources you specify what the sources of taint in your code are. Each source is an arbitrary pattern, with the same power as a search rule. In our example we are interested in the controller functions function($SCOPE) {...} and within those functions we want $SCOPE.input to be a source of taint. Think of Semgrep running this pattern, finding all matches of $SCOPE.input in that context, and labeling them as sources of taint.

Sanitizers are specified within pattern-sanitizers, which again consists of a list of patterns that will be run, and the code that is matched will be annotated as sanitized. Our rule specifies the sanitizer $sanitize(...), so any expression that matches that pattern will be considered sanitized. For example, given the code $sanitize($SCOPE.input), the occurrence of $SCOPE.input will be a source of taint but, because it sits inside a piece of sanitized code, it will not produce any findings.

Finally, we use pattern-sinks to specify the XSS-vulnerable functions into which we do not want any tainted data to flow. Note that pattern-sinks acts as a pattern-either operator (and the same applies to pattern-sources and pattern-sanitizers).

And, with such a simple rule, you can now catch rather convoluted bugs!

Minimizing false positives via sanitizers

One advantage of fake-taint rules is that they barely produce any false positives. This is natural, as these rules try to be very precise about what vulnerable code looks like. Taint rule specifications are more succinct, but in turn they rely on the data-flow engine having an understanding of the semantics of the code. Since taint mode is intra-procedural, it does not know what other functions do, and Semgrep conservatively assumes that taint could propagate through other functions, for example:
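The example is not reproduced here, so the sketch below is a stand-in. The names `source` and the rule id are hypothetical; `sink` and `some_safe_function` are the names used in the surrounding text. The target code is shown as comments.

```yaml
rules:
  - id: taint-through-unknown-call   # hypothetical id
    mode: taint
    languages: [javascript]
    severity: WARNING
    message: "Tainted data reaches sink"
    pattern-sources:
      - pattern: source(...)   # hypothetical source of taint
    pattern-sinks:
      - pattern: sink(...)

# Target code (JavaScript) to run the rule against:
#
#   var x = source();
#   var y = some_safe_function(x);  // body unknown to an intra-procedural analysis
#   sink(y);                        // reported: y is assumed to be tainted
```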

Here, some_safe_function receives tainted data as input and, to be on the safe side, Semgrep assumes that it will also return tainted data as output. Therefore we get a match.

In some codebases this assumption may produce too many false positives. In our example, some_safe_function may not be returning tainted data after all. If that is the case, you could first consider enumerating such functions as sanitizers (which in a sense they are):
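As a sketch, the fragment below would go inside a taint rule alongside pattern-sources and pattern-sinks; `another_safe_function` is a hypothetical second entry that shows how the list grows.

```yaml
pattern-sanitizers:
  # Each entry names one function whose result is trusted to be clean.
  - pattern: some_safe_function(...)
  - pattern: another_safe_function(...)   # hypothetical
```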

If this is too cumbersome, then you can easily "turn it around" and instead assume that every function call is a sanitizer by default:
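A minimal sketch of such a default sanitizer follows, using the not_conflicting flag explained in the next paragraph. The placement of the flag next to the pattern is an assumption about the beta syntax, so check it against the documentation for your Semgrep version.

```yaml
pattern-sanitizers:
  # Treat the result of every function call as sanitized,
  # without overriding source and sink annotations.
  - not_conflicting: true
    pattern: $F(...)
```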

For this purpose, we have a special kind of non-conflicting sanitizer, specified with not_conflicting: true, that will not conflict with source and sink annotations. A pattern like $F(...) matches every function call. If it were acting as a regular sanitizer, it would also apply to any source or sink that had the same function-call shape. In our example this would sanitize all calls to sink and we would get no matches at all.

With this approach, you instead have to enumerate your taint propagators, using pattern-not, if there are any:
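For illustration, and assuming that not_conflicting can be combined with a patterns block, a known propagator could be excluded as shown below. The function name `some_taint_propagator` is hypothetical.

```yaml
pattern-sanitizers:
  - not_conflicting: true
    patterns:
      # Every call sanitizes by default...
      - pattern: $F(...)
      # ...except calls known to pass taint through to their result.
      - pattern-not: some_taint_propagator(...)
```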

Another scenario where we have found some security researchers wanting to alter Semgrep's tendency to err on the safe side has been array indexing. Semgrep considers that indexing an array with a tainted index leads to tainted data. Again, it is easy to disable this via sanitizers:

  - patterns:
    - pattern-inside: $ARRAY[$INDEX]
    - pattern: $INDEX

Note that previously we would have advised you against using a pattern like $INDEX, since executing such single-metavariable patterns can be very expensive (a metavariable matches anything). However, partly due to its use in taint mode, we have optimized the engine to efficiently run this kind of pattern when it is combined with one or more pattern-inside.

Taint mini cookbook

The new taint mode is very powerful, but it may not be obvious how to specify what you want. Here are some recipes that we hope you will find useful, and we will be working on adding these and more to our docs. We also recommend that you take a look at our registry for examples of taint rules, as we add new ones on a regular basis.

Again, keep in mind that sources, sanitizers and sinks are given by arbitrary pattern operators, so they can be anything that you can match with Semgrep. You can get very creative!

Function argument as a source

Taint may come from specific functions that read user input such as window.prompt() but it is also easy to specify that taint comes from anywhere else, for example, a specific argument within a function definition (as we already saw in our running example):

  - patterns:
    - pattern-inside: function ($REQ, ...) {...}
    - pattern: $REQ

Function argument as a sink

If you specify a sink such as sink(...) then any tainted data passed to sink, through any of its arguments, will result in a match. You can narrow it down to a specific parameter this way:

  - patterns:
    - pattern-inside: $S = new Sandbox(); ...
    - pattern-inside: $S.run($SINK, ...)
    - pattern: $SINK

Here we are telling Semgrep to only annotate the first parameter passed to $S.run as the sink, rather than the method $S.run itself. If taint goes into any other parameter of $S.run, then that will not be considered a problem.
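To show how these fragments fit into a complete rule, here is a hedged sketch that combines the function-argument source from the previous recipe with this sink. The rule id, message, and severity are assumptions, and treating the whole first parameter as tainted is a deliberately coarse choice.

```yaml
rules:
  - id: sandbox-code-injection-sketch   # hypothetical id
    mode: taint
    languages: [javascript]
    severity: ERROR
    message: "Data derived from a function argument is run in a Sandbox"
    pattern-sources:
      # The first parameter of any function is a source of taint.
      - patterns:
          - pattern-inside: function ($REQ, ...) {...}
          - pattern: $REQ
    pattern-sinks:
      # Only the first argument of run() on a Sandbox instance is a sink.
      - patterns:
          - pattern-inside: $S = new Sandbox(); ...
          - pattern-inside: $S.run($SINK, ...)
          - pattern: $SINK
```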

Remember, anything can be a sink, even the index of an array access:

  - patterns:
    - pattern-inside: $ARRAY[$SINK]
    - pattern: $SINK

This way we tell Semgrep that we do not want arrays to be accessed with tainted indexes.

Sanitized by side-effect

Typically a sanitizer will be some function that takes tainted data and returns untainted data. But it does not need to be that way. Sometimes data gets sanitized via a side effect, and taint mode can handle this too, for example:

  - patterns:
    - pattern-inside: |
        $JWT.verify($TOKEN, ...)
        ...
    - pattern: $TOKEN

Here we just annotate as sanitized all the occurrences of a $TOKEN that happen after calling verify on it.

You can also use the presence of a guard, such as an if conditional, as a sanitizer. For example, in Go:

  - patterns:
    - pattern-inside: |
        if !strings.HasPrefix($PATH, <... $TARGET ...>, ...) {...}
    - pattern: $PATH

What is coming next

Taint mode plays an important role in our vision of the future of Semgrep, and we have bigger plans for it! We will be working towards making taint mode a GA feature on a per-language basis, starting with JavaScript. Stay tuned!