Introducing DeepSemgrep

by Isaac Evans and Iago Abal on May 24, 2022

Overview

The philosophy behind Semgrep has always been to build a lightweight, fast tool optimized for enforcing good coding practices. Because of this, though r2c has continually made the engine smarter, Semgrep rules only run on a single file at a time, and Semgrep taint rules (source-to-sink analysis) run within a single function. This allows Semgrep to run as fast as most linters and to work even on incomplete code.

Sometimes, however, analyzing one file at a time is too limited for finding complex bugs. That is why we have created a proprietary extension to Semgrep called DeepSemgrep. It uses global analysis to return better results with the exact same rules. Unlike many SAST tools, it does not need to build the code, so it still works on incomplete code.

DeepSemgrep trades analysis time for more accurate results:

  • Fewer false negatives, for instance by finding more matches to a pattern with inter-file constant propagation.
  • Reduced false positives, for instance when taint tracking determines whether tainted user input actually reaches an unsafe SQL statement through a long chain of function calls.

DeepSemgrep is available for Semgrep Team and Enterprise tiers.

We're thrilled to partner with early users and push Semgrep's analysis capabilities even further. If you'd like to join the private beta, request access here.

We have focused on extending three analyses to be inter-file and interprocedural:

  • Type inference with class inheritance analysis (typed metavariables are already interprocedural in Semgrep)
  • Constant propagation
  • Taint analysis

This blog post includes a quick start guide to DeepSemgrep and a comparison between Semgrep and DeepSemgrep.

Quick start guide to DeepSemgrep

DeepSemgrep performs a global analysis of all the files in a project, resolving names globally and extracting key data such as the type of each variable and method, or the known values of constant class fields. The global data is passed to the Semgrep engine, which then uses it to refine its findings.
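The two-phase idea described above (collect global facts first, then consult them while analyzing each file) can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names GlobalFacts and ToyGlobalAnalysis, the map-based representation of a file, and the sample constant values are all hypothetical and say nothing about how DeepSemgrep is actually implemented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical types, not DeepSemgrep's real data structures.
class GlobalFacts {
    // Qualified constant name -> known string value.
    final Map<String, String> constants = new HashMap<>();
}

class ToyGlobalAnalysis {
    // Phase 1: visit every "file" and record the constants it declares.
    // Each file is modeled as a map from qualified field name to value.
    static GlobalFacts collect(List<Map<String, String>> files) {
        GlobalFacts facts = new GlobalFacts();
        for (Map<String, String> declaredConstants : files) {
            facts.constants.putAll(declaredConstants);
        }
        return facts;
    }

    // Phase 2: while analyzing a single file, look a name up in the global facts.
    // Returns null when the name is not a known constant.
    static String resolve(GlobalFacts facts, String qualifiedName) {
        return facts.constants.get(qualifiedName);
    }
}

class ToyGlobalAnalysisDemo {
    public static void main(String[] args) {
        // Two hypothetical files, each declaring one constant (values are made up).
        GlobalFacts facts = ToyGlobalAnalysis.collect(List.of(
            Map.of("Constants.EMPLOYEE_TABLE_NAME", "employees"),
            Map.of("Config.MAX_RETRIES", "3")));

        // A constant declared in another file resolves to its value.
        System.out.println(ToyGlobalAnalysis.resolve(facts, "Constants.EMPLOYEE_TABLE_NAME"));
        // A name that is not a known constant resolves to null.
        System.out.println(ToyGlobalAnalysis.resolve(facts, "Request.userInput"));
    }
}
```

The point of the split is that the per-file step stays as simple as before; it just gets to ask questions whose answers live in other files.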

To run deep analyses via the CLI, simply pass --deep to Semgrep:

$ semgrep --deep

In the Editor, you should see a "DeepSemgrep" toggle switch if you have DeepSemgrep enabled (currently in private beta for Team and Enterprise tier users).

Here are some examples that illustrate the global analyses currently implemented by DeepSemgrep.

Constant propagation

When enforcing guardrails, Semgrep rules can be used to flag dangerous functions that may receive potentially non-constant data, which could be user-controlled, thus posing a security risk. These rules follow the template below, where we find all calls to some dangerous function, except those calls where dangerous only receives a constant string.

rules:
- id: dangerous-call
  patterns:
    - pattern: dangerous(...)
    - pattern-not: dangerous("...")
  message: Call of dangerous on non-constant value
  languages: [java]
  severity: WARNING

The Semgrep engine can perform constant folding within a single file. However, in the Java example in Figure 1, there is a constant, EMPLOYEE_TABLE_NAME, that is defined in a Constants class in another file. The Semgrep engine cannot see the value of Constants.EMPLOYEE_TABLE_NAME on its own, as it only performs intra-file analyses, so the dangerous-call rule will incorrectly flag dangerous("Select * FROM " + EMPLOYEE_TABLE_NAME).

DeepSemgrep will not return a false positive in this case. DeepSemgrep looks into Constants.java and picks up the constant value of EMPLOYEE_TABLE_NAME. This value is passed to the Semgrep engine, which then knows that "Select * FROM " + EMPLOYEE_TABLE_NAME is a constant string.

Figure 1: Constant propagation in DeepSemgrep
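A minimal Java sketch of this scenario might look like the following. It is an illustration, not the original figure: the value "employees", the EmployeeQueries class, its method names, and the stub body of dangerous are assumptions. The two classes are shown in one listing so the sketch compiles as is; in the situation described above, Constants would live in its own file, and a static import would let the caller use the bare name EMPLOYEE_TABLE_NAME.

```java
// Constants.java in the scenario above (kept in one listing here for brevity).
class Constants {
    // Hypothetical value: the post does not say what the table is called.
    static final String EMPLOYEE_TABLE_NAME = "employees";
}

// EmployeeQueries.java (hypothetical class name).
class EmployeeQueries {
    // Stand-in for the guarded function; here it simply returns its argument.
    static String dangerous(String query) {
        return query;
    }

    static String listEmployees() {
        // String literal + constant defined elsewhere: the argument is still constant.
        // Seeing that requires knowing the value of Constants.EMPLOYEE_TABLE_NAME.
        return dangerous("Select * FROM " + Constants.EMPLOYEE_TABLE_NAME);
    }

    static String findByName(String userSuppliedName) {
        // Non-constant argument: this is the kind of call the rule is meant to flag.
        return dangerous("Select * FROM " + Constants.EMPLOYEE_TABLE_NAME
            + " WHERE name = '" + userSuppliedName + "'");
    }
}

class ConstantPropagationDemo {
    public static void main(String[] args) {
        System.out.println(EmployeeQueries.listEmployees());
        System.out.println(EmployeeQueries.findByName("alice"));
    }
}
```

With the two classes in separate files, only an inter-file analysis can tell the first call apart from the second.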

Typed metavariables

Following the disclosure of the Log4Shell vulnerability in Apache Log4j, the Semgrep community quickly came up with a rule for it (shown below). This rule uses typed metavariables to find objects of the Logger class, and flags any dangerous-looking call to any of its methods.

rules:
- id: log4j2_tainted_argument
 patterns:
   - pattern-either:
     - pattern: (Logger $LOGGER).$METHOD($ARG);
     - pattern: (Logger $LOGGER).$METHOD($ARG,...);
   - pattern-inside: |
       import org.apache.log4j.$PKG;
       ...
   - pattern-not: (Logger $LOGGER).$METHOD("...");
 message: log4j $LOGGER.$METHOD tainted argument
 languages: [java]
 severity: WARNING

Unfortunately, if a project defines a wrapper logger class MyLogger that extends org.apache.log4j.Logger, as in Figure 2, Semgrep will not report calls made through MyLogger! Semgrep is unaware of the inheritance relationship between classes, even if the information is contained within a single file.

Using the DeepSemgrep extension, however, Semgrep will flag logger.error(user_input), because it builds a class inheritance tree and it is aware that MyLogger extends org.apache.log4j.Logger.

Figure 2: Typed metavariables in DeepSemgrep
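The wrapper-class situation can be sketched roughly as below. To keep the listing compilable without log4j on the classpath, Logger here is a local stand-in for org.apache.log4j.Logger, so the import that the rule's pattern-inside clause looks for is absent; the RequestHandler class, the message prefix, and the method bodies are likewise assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.log4j.Logger (assumption: real code would import it).
class Logger {
    final List<String> messages = new ArrayList<>();

    void error(String message) {
        messages.add("ERROR " + message);
    }
}

// Project-specific wrapper, as described in the text.
class MyLogger extends Logger {
    @Override
    void error(String message) {
        super.error("[my-app] " + message);
    }
}

// Hypothetical caller.
class RequestHandler {
    static MyLogger handle(String user_input) {
        // The declared type is MyLogger, not Logger. Matching (Logger $LOGGER)
        // here requires knowing that MyLogger is a subclass of Logger.
        MyLogger logger = new MyLogger();
        logger.error(user_input);
        return logger;
    }
}

class TypedMetavariableDemo {
    public static void main(String[] args) {
        MyLogger logger = RequestHandler.handle("hello");
        System.out.println(logger.messages);
    }
}
```

Nothing in the call site mentions Logger by name, which is why the inheritance relationship has to come from the analysis rather than from the pattern.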

Taint tracking

Last but not least, DeepSemgrep extends taint mode to perform inter-file and interprocedural taint tracking.

Using a taint rule like the one below, we want to find data flowing from get_user_input() into vulnerable_function().

rules:
- id: unsafe-data-processing
 mode: taint
 pattern-sources:
   - pattern: get_user_input(...)
 pattern-sinks:
   - pattern: vulnerable_function(...)
 message: User input reaches vulnerable function
 languages: [java]
 severity: WARNING

Without DeepSemgrep, this rule only looks at each function or class method in isolation, so it is fairly limited in what it can find. With DeepSemgrep, get_user_input() and vulnerable_function() may be called in different packages and classes, but if there is a flow of data from the former to the latter, DeepSemgrep will find it!

Figure 3: Taint tracking in DeepSemgrep
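As a rough illustration of such a flow, the sketch below passes a value from get_user_input() to vulnerable_function() through several classes that would normally sit in separate files or packages. Only the two function names come from the rule above; the class names, the intermediate methods, the query text, and the hard-coded input are hypothetical.

```java
// Source side (hypothetical class).
class RequestReader {
    // Stand-in for real user input; hard-coded so the sketch runs on its own.
    static String get_user_input() {
        return "alice";
    }

    static String readName() {
        return get_user_input(); // taint source
    }
}

// Intermediate layer (hypothetical class): the tainted value passes through here.
class ReportService {
    static String buildReport(String name) {
        String query = "SELECT * FROM reports WHERE owner = '" + name + "'";
        return Database.run(query);
    }
}

// Sink side (hypothetical class).
class Database {
    // Stand-in for the vulnerable function; here it simply returns its argument.
    static String vulnerable_function(String query) {
        return query;
    }

    static String run(String query) {
        return vulnerable_function(query); // taint sink
    }
}

// Entry point that connects source and sink across classes.
class ReportController {
    static String handle() {
        return ReportService.buildReport(RequestReader.readName());
    }
}

class TaintFlowDemo {
    public static void main(String[] args) {
        System.out.println(ReportController.handle());
    }
}
```

No single function in this sketch contains both the source and the sink, so an analysis that looks at each function in isolation has nothing to connect.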

Semgrep vs. DeepSemgrep

While Semgrep is a popular tool and very powerful in its own right, its intra-file (within-a-single-file) nature limits its usefulness on codebases that spread logic across many files. For example, in most object-oriented programming styles, classes are expected to be in different files, including ones that inherit from each other, making it difficult to write intra-file rules that cover all the cases. Though you can work around this limitation, many of our users and customers have asked us to expand the engine to handle it natively.

Although DeepSemgrep is proprietary, it uses the exact same rules and pattern syntax as the open-source Semgrep (and vice versa).

Feature summary

| Feature | Semgrep | DeepSemgrep |
| --- | --- | --- |
| All existing Semgrep features (join mode, within-file taint mode, etc.) | yes | yes |
| Analyze across multiple files | no | yes |
| → Interfile constant propagation | no | yes |
| → Interfile type inference | no | yes |
| → Interfile taint tracking | no | yes |
| License | LGPL 2.1 | proprietary |
| Rule syntax & schema | no difference | no difference |
| Languages supported | 24+ languages | Java and Ruby |

Conclusion

Our goal with DeepSemgrep is to create an engine that keeps rule-writing as simple as it is with Semgrep while understanding your entire program instead of a single file. We're thrilled to partner with early users and push Semgrep's analysis capabilities further. Check out the DeepSemgrep documentation for more examples. If you'd like to join the private beta, request access here.