How we made Semgrep rules run on Semgrep rules

by Emma Jin on April 02, 2021

We’ve been running Semgrep on Semgrep source code for a while, but what about scanning Semgrep rules? Yes, now you can scan rules themselves, Kubernetes configs, CircleCI workflows, and more!

Introducing YAML support

As programmers, we make mistakes. That’s why we’re developing Semgrep, a tool that makes it easy to search your code for security bugs, anti-patterns, and really just anything you don’t want in there. We write rules for Semgrep to find and flag errors in our code, like how one of our engineers wrote a rule to ensure we never accidentally leak sensitive info in logs.

Though we like to say Semgrep has superpowers, it (sadly) can’t actually read minds (yet). That means that Semgrep rules are just code, understood by a machine. That means we can make mistakes while writing Semgrep rules. That means we can search Semgrep rules for...I’m sure you see where this is going.

Semgrep rules and targets

As a matter of fact, Semgrep configurations are written in YAML. For example, we might have:

rules:
- id: bad-exec
  languages: [python]  # assumed target language for this example
  patterns:
    - pattern: |
        exec(...)
    - pattern-not: |
        exec('...')
  message: |
    Don't exec on arbitrary code!
  severity: WARNING

which checks that we only exec hardcoded string literals, so that we never run the risk of executing dangerous code.

YAML can be a little difficult to work with, but it provides a simple syntax to express Semgrep rules. The YAML fields allow us to specify how patterns should be composed (e.g. code can match either pattern), as well as rule information, like the rule ID and a helpful message explaining what the error is.

That being said, it’s not so simple that we can’t make mistakes, and rules can get long. That’s where Semgrep comes in!

More rules

Before we can write rules for Semgrep rules, however, we need YAML support. Semgrep can already do a little bit of YAML pattern-matching using its “generic” pattern-matching mode. Generic pattern matching works by inferring meaning from whitespace, braces, quotes, etc., but it doesn’t really understand the code, so it’s fairly limited. To express the rules we want, we need more than generic pattern matching.

Adding a YAML Parser

You might be wondering at this point, can’t Semgrep already parse YAML? Since Semgrep rules are written in YAML, it must be able to read YAML already. Sure, we’d need to add some special features. For example, Semgrep has an ellipsis operator (“...”), which lets you easily match 0 or more statements, function call arguments, and more. If Semgrep is already parsing YAML files, though, maybe we could just add special features in a preprocessing pass (hold that thought).

Unfortunately, the YAML parser we use doesn’t include location information, which we need so Semgrep can report findings on the correct lines. It returns an OCaml object with all the semantic information, but no ranges for where the YAML fields are.

However, the YAML library behind the parser we use also provides a low-level API that processes a YAML file into a stream of tokens and lets us read from that stream. The tokens in that stream do carry location information, which we can use to generate an abstract syntax tree (AST) that knows where each field is.
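To illustrate why the token stream matters, here is a minimal sketch in Python. It is not the OCaml library mentioned above: the `Token` type and the `tokenize` function are hypothetical, and they only handle flat `key: value` lines. The point is that each token carries its own line and column, which a plain parsed dictionary would lose.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Token:
    kind: str  # "key" or "value"
    text: str
    line: int  # 1-based line number
    col: int   # 1-based column number


def tokenize(source: str) -> Iterator[Token]:
    """Yield key/value tokens for flat `key: value` lines, with positions.

    Hypothetical toy tokenizer: real YAML needs a real parser.
    """
    for line_no, raw in enumerate(source.splitlines(), start=1):
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # not a `key: value` line; ignored in this sketch
        key_col = len(key) - len(key.lstrip()) + 1
        yield Token("key", key.strip(), line_no, key_col)
        if value.strip():
            # Column of the first non-space character after the colon.
            value_col = len(key) + 1 + (len(value) - len(value.lstrip())) + 1
            yield Token("value", value.strip(), line_no, value_col)


tokens = list(tokenize("id: bad-exec\nseverity: WARNING\n"))
for tok in tokens:
    print(f"{tok.line}:{tok.col} {tok.kind} {tok.text}")
```

With positions attached to every token, a finding on the `severity` value can be reported at its exact line and column instead of just "somewhere in this rule".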

So we did exactly that. This works great for YAML target files! Now let’s try parsing a pattern:

...
- language: ...
...
Fatal error: exception Yaml_to_generic.ParseError("error calling parser: did not find expected node content character 0 position 0 returned: 0")

But we still have some tricks up our sleeve. Ellipses don’t have to look like .... We could choose a different syntax, which is more YAML-compatible. Instead of:

...
language: ...
...

which the parser can’t parse, we can use the eminently readable

__semgrep_ellipsis__: __semgrep_ellipsis__
language: ...
__semgrep_ellipsis__: __semgrep_ellipsis__

This will be correctly converted to tokens, which can then be parsed as ellipses, as long as you haven’t chosen to name your fields __semgrep_ellipsis__.
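As a rough illustration of how a sentinel key can stand in for an ellipsis once the file is parsed, here is a small Python sketch. The `matches` function is hypothetical and works on plain dictionaries rather than Semgrep's AST. It also assumes that a mapping pattern without the sentinel must match the target's fields exactly, which may differ from what Semgrep actually does.

```python
ELLIPSIS = "__semgrep_ellipsis__"  # sentinel name taken from the post


def matches(pattern, target) -> bool:
    """Return True if `target` matches `pattern`.

    Simplified assumption: in a mapping pattern, an ELLIPSIS key means
    "any other fields are allowed"; an ELLIPSIS value means "any value".
    """
    if pattern == ELLIPSIS:
        return True  # ellipsis in value position matches anything
    if isinstance(pattern, dict):
        if not isinstance(target, dict):
            return False
        required = {k: v for k, v in pattern.items() if k != ELLIPSIS}
        allow_extra = ELLIPSIS in pattern
        if not allow_extra and set(target) != set(required):
            return False
        return all(
            k in target and matches(v, target[k]) for k, v in required.items()
        )
    return pattern == target


# Pattern: "a mapping with a `language` field, plus anything else".
pattern = {ELLIPSIS: ELLIPSIS, "language": ELLIPSIS}
rule = {"id": "bad-exec", "language": "python", "severity": "WARNING"}
print(matches(pattern, rule))
print(matches(pattern, {"id": "no-language"}))
```

The sketch also shows the caveat from the text: a target that really has a field named `__semgrep_ellipsis__` could not be told apart from the sentinel.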

In case you don’t want to write __semgrep_ellipsis__: __semgrep_ellipsis__ every time you match dictionaries with a particular field, we (you guessed it) preprocess the YAML file to convert ... into the appropriate __semgrep_ellipsis__ equivalent.

Converting isn’t quite as easy as a simple find-replace, sadly. Consider

...
language: $X
...

vs

...
- language
...

In the first, the ellipses need to be replaced with __semgrep_ellipsis__: __semgrep_ellipsis__, but in the second, they need to be replaced with - __semgrep_ellipsis__, or the parser will get confused. This means we need to keep track of how the non-ellipsis fields before and after each ... are formatted, and use their format to infer what to replace each ellipsis with. Then, we can parse the YAML into an AST.
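The inference step can be sketched in Python as follows. This is a simplified illustration, not Semgrep's actual preprocessing pass: `expand_ellipses` is a hypothetical helper that only looks at the nearest non-ellipsis line, ignores nesting depth, and leaves inline ellipses such as `language: ...` untouched.

```python
ELLIPSIS = "__semgrep_ellipsis__"  # sentinel name taken from the post


def expand_ellipses(pattern: str) -> str:
    """Replace bare `...` lines with a YAML-parseable placeholder.

    Hypothetical sketch: the nearest non-ellipsis line decides whether
    the placeholder is a list item or a mapping entry.
    """
    lines = pattern.splitlines()

    def is_ellipsis(line: str) -> bool:
        return line.strip() == "..."

    def neighbor(index: int):
        # Look at following lines first, then preceding lines, for the
        # nearest line that is neither blank nor an ellipsis.
        order = list(range(index + 1, len(lines))) + list(range(index - 1, -1, -1))
        for j in order:
            if lines[j].strip() and not is_ellipsis(lines[j]):
                return lines[j]
        return None

    out = []
    for i, line in enumerate(lines):
        if not is_ellipsis(line):
            out.append(line)
            continue
        indent = line[: len(line) - len(line.lstrip())]
        ref = neighbor(i)
        if ref is not None and ref.lstrip().startswith("- "):
            out.append(f"{indent}- {ELLIPSIS}")  # list context
        else:
            out.append(f"{indent}{ELLIPSIS}: {ELLIPSIS}")  # mapping context
    return "\n".join(out)


print(expand_ellipses("...\nlanguage: $X\n..."))
print(expand_ellipses("...\n- language\n..."))
```

A real implementation would also need to compare indentation so that an ellipsis inside a nested block is matched against its siblings rather than a line from a different level.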

YAML support is still in alpha. There are a few limitations—as of yet, we don’t support aliases and anchors, though we’re happy to add them if they’re useful. Our error reporting for misaligned ellipses is also not ideal, and there may be some other bugs. Please file issues as you find errors, so that you can use Semgrep to secure your YAML code.

With YAML added to Semgrep, though, we can write the YAML patterns we want. And that means we can write Semgrep rules for Semgrep rules!

Writing Semgrep Rules

We can write a number of useful rules using Semgrep. For instance, we can check that we never have two identical patterns, since that’s always a mistake (we’ve actually separately checked for and fixed this in our rules in the past).

rules:
- id: identical-pattern
  message: |
    you used the same pattern multiple times
  severity: ERROR
  languages: [yaml]
  patterns:
  - pattern-inside: |
     ...
     - pattern: $X
     ...
     - pattern: $X
     ...
  - pattern: |
     pattern: $X

This rule works by checking the pattern fields within a target YAML rule. The $X is a metavariable, which will remember the string associated with the pattern key. If it finds another pattern with that same string, it’ll raise an error. To have the error location report the pattern that was repeated (instead of the entire patterns block), we wrote the rule to check for a pattern within a patterns block that has repeating patterns.
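For readers who find it easier to think procedurally, the check that the metavariable performs is roughly equivalent to the Python sketch below. The `find_repeated_pattern` function is hypothetical and operates on an already-parsed `patterns` list; Semgrep itself does this by binding $X during AST matching, not by running code like this.

```python
from typing import Dict, List, Optional, Tuple


def find_repeated_pattern(patterns: List[dict]) -> Optional[Tuple[int, int]]:
    """Return the indices (first, repeat) of the first repeated `pattern` value.

    Mimics the rule's logic: bind $X at one `pattern` entry and look for a
    later entry whose value equals the same binding.
    """
    seen: Dict[str, int] = {}
    for index, entry in enumerate(patterns):
        value = entry.get("pattern")
        if not isinstance(value, str):
            continue  # e.g. pattern-not or pattern-inside: not checked here
        if value in seen:
            return seen[value], index
        seen[value] = index
    return None


patterns_block = [
    {"pattern": "exec(...)"},
    {"pattern-not": "exec('...')"},
    {"pattern": "exec(...)"},
]
print(find_repeated_pattern(patterns_block))
```

The returned pair plays the same role as the rule's final `pattern: $X` clause: it points at the individual entries that repeat, not at the whole block.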

Similarly, we could check that we never have two patterns that contradict each other, or that we aren’t missing an expected field. See our meta rules for all the checks we’ve written.

And best of all, they can check themselves!

Semgrep on Semgrep demo

(In case you don’t want to squint at the gif, it’s a rule that matches when the language field isn’t there, matching on itself.)

Other Work

Writing rules about the structure of Semgrep rules was super easy, but there’s other useful work that this enables.

YAML is a common language for configuration files, such as Kubernetes configurations, so having YAML support will enable other people to write rules for their configuration files. Right now, YAML support is new, so we don’t have a lot of YAML rules in the registry, but we’ll be writing more.

Since JSON files parse to the same underlying generic AST as YAML files, we can also run existing JSON rules on YAML files to check for the same vulnerabilities.

Finally, with a YAML parser that gives location information, we can improve other aspects of how we analyze Semgrep rules. We can report the line numbers where rules cause errors, and go deeper into the patterns to figure out if they could be written more efficiently.

We’re excited for what we can do with YAML in Semgrep, and also for what you’ll be able to do. Head over to the editor to check it out yourself, and let us know if there’s anything we could do better!