The philosophy behind Semgrep has always been to build a lightweight, fast tool optimized for enforcing good coding practices. Because of this, though r2c has continually made the engine smarter, Semgrep rules only run on a single file, and Semgrep taint rules (source-to-sink analysis) run within a single function. This allows Semgrep to run as fast as most linters and to work even on incomplete code.
Sometimes, however, running on a single file is too limited for finding complex bugs. That is why we have created a proprietary extension to Semgrep called DeepSemgrep. It uses global analysis to return better results with the exact same rules. Unlike many SAST tools, it does not need to build the code, so it still works on incomplete code.
DeepSemgrep trades analysis time for more accurate results:
- Fewer false negatives, for instance by finding more matches to a pattern with inter-file constant propagation.
- Reduced false positives: for instance, taint tracking will determine whether tainted user input may reach an unsafe SQL statement through a long chain of function calls.
DeepSemgrep is available for Semgrep Team and Enterprise tiers.
We're thrilled to partner with early users and push Semgrep's analysis capabilities even further. If you'd like to join the private beta, request access here.
We have focused on extending three analyses to be inter-file and interprocedural:
- Type inference with class inheritance analysis (typed metavariables are already interprocedural in Semgrep)
- Constant propagation
- Taint analysis
This blog post includes a quick start guide to DeepSemgrep and a comparison between Semgrep and DeepSemgrep.
DeepSemgrep performs a global analysis of all the files in a project, resolving names globally and extracting key data such as the type of each variable and method, or the known values of constant class fields. The global data is passed to the Semgrep engine, which then uses it to refine its findings.
To run deep analyses via the CLI, pass --deep to Semgrep:
$ semgrep --deep
In the Editor, you should see a "DeepSemgrep" toggle switch if you have DeepSemgrep enabled (currently in private beta for Team and Enterprise tier users).
Here are some examples that illustrate the global analyses currently implemented by DeepSemgrep.
When enforcing guardrails, Semgrep rules can be used to flag dangerous functions that may receive potentially non-constant data, which could be user-controlled and therefore pose a security risk. These rules follow the template below, where we find all calls to some dangerous function, except those calls where dangerous only receives a constant string.
rules:
  - id: dangerous-call
    patterns:
      - pattern: dangerous(...)
      - pattern-not: dangerous("...")
    message: Call of dangerous on non-constant value
    languages: [java]
    severity: WARNING
The Semgrep engine can perform constant folding within a single file. But, in the following Java example, there is a constant, EMPLOYEE_TABLE_NAME, that is defined in a Constants class in another file. The Semgrep engine cannot see the constant value of Constants.EMPLOYEE_TABLE_NAME by itself, as it only performs intra-file analyses, and the dangerous-call rule will incorrectly flag dangerous("Select * FROM " + EMPLOYEE_TABLE_NAME).
DeepSemgrep will not return a false positive in this case. DeepSemgrep looks into Constants.java and picks up the constant value of EMPLOYEE_TABLE_NAME. This constant value is passed to the Semgrep engine, which then knows that "Select * FROM " + EMPLOYEE_TABLE_NAME is a constant string.
Figure 1: Constant propagation in DeepSemgrep
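The cross-file constant described above might look like the following sketch. It is only an illustration: the table name value, the EmployeeQueries class, and the body of dangerous are assumptions made so the snippet compiles, and in a real project each class would live in its own file.

```java
// Constants.java (hypothetical contents; the actual table name is an assumption)
final class Constants {
    static final String EMPLOYEE_TABLE_NAME = "employees";

    private Constants() {}
}

// EmployeeQueries.java (hypothetical caller, normally a separate file)
final class EmployeeQueries {
    // Stand-in for the dangerous function matched by the rule; it only echoes its argument.
    static String dangerous(String query) {
        return query;
    }

    static String listEmployees() {
        // Both operands are compile-time constants, but one is defined in another file,
        // so an intra-file analysis cannot prove the argument is a constant string.
        return dangerous("Select * FROM " + Constants.EMPLOYEE_TABLE_NAME);
    }
}
```

An analysis limited to EmployeeQueries.java sees only an opaque field reference in the call, which is why the single-file engine reports it.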
Following the disclosure of the Log4Shell vulnerability in Apache Log4j, the Semgrep community quickly came up with the rule below. This rule uses typed metavariables to find objects of the Logger class, and flags any dangerous-looking call to any of its methods.
rules:
  - id: log4j2_tainted_argument
    patterns:
      - pattern-either:
          - pattern: (Logger $LOGGER).$METHOD($ARG);
          - pattern: (Logger $LOGGER).$METHOD($ARG,...);
      - pattern-inside: |
          import org.apache.log4j.$PKG;
          ...
      - pattern-not: (Logger $LOGGER).$METHOD("...");
    message: log4j $LOGGER.$METHOD tainted argument
    languages: [java]
    severity: WARNING
Unfortunately, if a project defines a wrapper logger class MyLogger that extends org.apache.log4j.Logger as exemplified below, Semgrep will not report this! Semgrep is unaware of the inheritance relationship between classes, even if the information is contained within a single file.
Using the DeepSemgrep extension, however, Semgrep will flag logger.error(user_input), because it builds a class inheritance tree and so knows that MyLogger is a subclass of Logger.
Figure 2: Typed metavariables in DeepSemgrep
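A wrapper logger of the kind described above could be sketched as follows. To keep the snippet compilable without log4j on the classpath, the Logger class here is a local stand-in for org.apache.log4j.Logger, which is an assumption of this sketch; the LoginHandler class and the method bodies are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.log4j.Logger (assumption: the real class comes from log4j).
class Logger {
    final List<String> messages = new ArrayList<>();

    void error(String message) {
        messages.add("ERROR " + message);
    }
}

// Project-specific wrapper; the declared type of variables is MyLogger, not Logger.
class MyLogger extends Logger {
    @Override
    void error(String message) {
        super.error("[app] " + message);
    }
}

// Hypothetical caller that logs user-controlled data through the wrapper.
class LoginHandler {
    static MyLogger handle(String user_input) {
        MyLogger logger = new MyLogger();
        // A typed metavariable (Logger $LOGGER) only matches this call if the
        // analysis knows that MyLogger inherits from Logger.
        logger.error(user_input);
        return logger;
    }
}
```

The call site never mentions Logger by name, so matching it depends entirely on resolving the class hierarchy.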
Last but not least, DeepSemgrep extends taint mode to perform inter-file and inter-procedural taint tracking.
Using a taint rule like the one below, we want to find data flowing from get_user_input() to vulnerable_function().
rules:
  - id: unsafe-data-processing
    mode: taint
    pattern-sources:
      - pattern: get_user_input(...)
    pattern-sinks:
      - pattern: vulnerable_function(...)
    message: User input reaches vulnerable function
    languages: [java]
    severity: WARNING
Without DeepSemgrep, this rule only looks at each function or class method in isolation, so it is fairly limited in what it can find. With DeepSemgrep, get_user_input() and vulnerable_function() may be called in different packages and classes, but if there is a flow of data from the former to the latter, DeepSemgrep will find it!
Figure 3: Taint tracking in DeepSemgrep
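As a rough illustration of the kind of flow the rule above targets, the sketch below passes the result of get_user_input() through an intermediate class before it reaches vulnerable_function(). The class names, method bodies, and returned values are all hypothetical, and each class would normally sit in its own file; the snippet shows the shape of an inter-file flow and makes no claim about exactly what the engine reports.

```java
// RequestHandler.java (hypothetical): the taint source is called here.
class RequestHandler {
    // Stand-in source; a real implementation would read from a request.
    static String get_user_input() {
        return "user-controlled";
    }

    static String handle() {
        String input = get_user_input();
        return new ReportService().render(input);
    }
}

// ReportService.java (hypothetical): forwards the data without sanitizing it.
class ReportService {
    String render(String title) {
        return new QueryRunner().run("report:" + title);
    }
}

// QueryRunner.java (hypothetical): the sink is called here, two calls away from the source.
class QueryRunner {
    // Stand-in sink; it only echoes its argument.
    static String vulnerable_function(String data) {
        return data;
    }

    String run(String data) {
        return vulnerable_function(data);
    }
}
```

No single method contains both the source and the sink, so an analysis that inspects each method in isolation has nothing to connect.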
While Semgrep is a popular tool and very powerful in its own right, its intra-file (within-a-single-file) analysis limits its usefulness on codebases whose logic is spread across many files. For example, in most object-oriented programming styles, classes are expected to live in different files, including classes that inherit from each other, which makes it difficult to write intra-file rules that cover all the cases. Though you can work around this limitation, many of our users and customers have asked us to expand the engine to handle it natively.
Although DeepSemgrep is proprietary, it uses the exact same rules and pattern syntax as the open-source Semgrep (and vice versa).
| Feature | Semgrep | DeepSemgrep |
|---|---|---|
| All existing Semgrep features (join mode, within-file taint mode, etc.) | yes | yes |
| Analyze across multiple files | no | yes |
| → Interfile constant propagation | no | yes |
| → Interfile type inference | no | yes |
| → Interfile taint tracking | no | yes |
| Rule syntax & schema | no difference | no difference |
| Languages supported | 24+ languages | Java and Ruby |
Our goal with DeepSemgrep is to create an engine that keeps rule-writing as simple as it is with Semgrep while understanding your entire program instead of a single file. We're thrilled to partner with early users and push Semgrep's analysis capabilities further. Check out the DeepSemgrep documentation for more examples. If you'd like to join the private beta, request access here.