Scanning Shell Scripts With Semgrep

by Martin Jambon on December 13, 2021

Everyone knows how to write shell scripts, or so they think. In case this wasn’t always true, we’ve just released experimental support for Bash in Semgrep. This lets you write rules that catch many misuses of shell syntax as well as unsafe uses of various commands. Without further ado, here are three examples where Semgrep works better than plain grep.

Detecting a call to a forbidden command
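To see what a plain-text search is up against here, consider a minimal sketch in which a hypothetical policy forbids calling eval. The sample script and both helper functions below are made up for illustration; only the last line of the sample is an actual call to eval.

```shell
# Hypothetical script to audit. Only the last line really calls eval.
sample_script() {
  cat <<'EOF'
# never use eval here
echo "eval is forbidden"
evaluate_input "$1"
eval "$user_input"
EOF
}

# Naive search: counts every line containing the substring "eval".
naive_hits() {
  sample_script | grep -c 'eval'
}

# Anchored search: counts lines where eval is the first word.
# Still only an approximation of "a call to eval".
anchored_hits() {
  sample_script | grep -cE '^[[:space:]]*eval([[:space:]]|$)'
}

naive_hits      # prints 4
anchored_hits   # prints 1
```

The naive search reports all four lines, including the comment, the string, and the unrelated evaluate_input command. Anchoring the regular expression brings the count down to one, but it would miss a call such as foo && eval ..., which is the kind of case a syntax-aware pattern is meant to handle.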

Detecting variable splitting

Many would expect $X or ${X} to be replaced by the value of the X variable. This is not quite what happens: when the expansion is unquoted, the value is split on whitespace, or on whichever other characters the IFS variable specifies.
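A quick way to observe the splitting is to count the arguments a command receives. The sketch below assumes the default IFS; count_args is a hypothetical helper, not a standard command.

```shell
# Hypothetical helper: print how many arguments were received.
count_args() {
  echo "$#"
}

X='one two   three'

count_args $X     # unquoted: the value is split into words, prints 3
count_args "$X"   # quoted: the value stays in one piece, prints 1
```

The same value reaches the command as three arguments or as one, depending only on the double quotes.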

First, let's protect ourselves against the obscure problem of the IFS variable. IFS is a special shell variable that determines the separators used by Bash when splitting strings. The default value is whitespace (space, tab, or newline). We want to make sure IFS is not set globally, to avoid splitting strings where it's not intended. Here's what happens when it is:

$ docker run -it ubuntu
root@d43da008a9b3:/# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
root@d43da008a9b3:/# IFS=:
root@d43da008a9b3:/# echo $PATH
/usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin

Uh oh, the colon separators are now missing. A legitimate use of this feature would be to read comma-separated values from standard input, setting IFS for the read command only:

$ IFS="," read -a values   # read values from stdin into an array
1,23,456
$ echo "${values[@]}"      # print array
1 23 456

IFS hasn't changed for the commands that follow, as you can see:

$ x=hello,world
$ echo $x
hello,world

Great. So, we only need to prevent IFS from being set globally. Here's a simple Semgrep rule that takes care of it:
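A minimal sketch of such a rule is shown below, wrapped in a shell here-document so it can be saved and run in one go. It assumes that the pattern IFS=... matches an assignment to IFS; the rule id, message, severity, and file names are placeholders chosen for illustration. Whether the command-scoped form IFS="," read ... is also reported is worth checking against your Semgrep version.

```shell
# Save a hypothetical rule that flags assignments to IFS.
# The id, message and severity are placeholders.
cat > ifs-assignment.yaml <<'EOF'
rules:
  - id: ifs-assignment
    languages: [bash]
    severity: WARNING
    message: IFS is being set, which changes how unquoted expansions are split.
    pattern: IFS=...
EOF

# To scan a script with it (requires Semgrep to be installed):
#   semgrep --config ifs-assignment.yaml script.sh
```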

Now that we have IFS issues out of the way, let's try to catch variable expansions that are unquoted and would get split when they contain whitespace. All we have to do is express "an expansion of any shell variable not surrounded by double quotes". Here's a solution:
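A minimal sketch of such a rule combines two patterns: ${$VAR} to find an expansion, and pattern-not-inside to discard the expansions that sit inside double quotes. The rule id, message, severity, and file names are placeholders chosen for illustration.

```shell
# Save a hypothetical rule that flags unquoted variable expansions.
# The id, message and severity are placeholders.
cat > unquoted-expansion.yaml <<'EOF'
rules:
  - id: unquoted-variable-expansion
    languages: [bash]
    severity: INFO
    message: Unquoted expansion of a shell variable; wrap it in double quotes.
    patterns:
      - pattern: ${$VAR}
      - pattern-not-inside: |
          "...${$VAR}..."
EOF

# To scan a script with it (requires Semgrep to be installed):
#   semgrep --config unquoted-expansion.yaml script.sh
```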

There are two subtleties in this approach. First, the pattern ${$VAR} can be surprising:

  • ${} is the expansion of a shell variable, as usual in Bash.
  • $VAR in a pattern is a Semgrep metavariable. It's not the expansion of a shell variable. Here it stands for any shell variable.

Therefore, ${$VAR} means "the expansion of any variable", which Semgrep captures under the name $VAR. The captured value of $VAR is recalled in the pattern-not-inside: "...${$VAR}..." which filters out matches where the variable expansion is double-quoted.

Here's a key to what different Semgrep patterns mean:

  • $METAVAR: a Semgrep metavariable, matches any expression.
  • ${SHELLVAR}: the expansion of the shell variable SHELLVAR. It will match both ${SHELLVAR} and $SHELLVAR in a script. The syntax $SHELLVAR can't be used in a pattern because it conflicts with the syntax for metavariables.
  • $shellvar or ${shellvar}: the expansion of the shell variable shellvar. A lowercase name can't be mistaken for a metavariable, so the short form is allowed in a pattern.
  • ...: a Semgrep ellipsis, matches any sequence of items.

The second gotcha in this rule is the YAML syntax. The following wouldn't work because YAML itself understands double-quoted strings:

      - pattern-not-inside: "...${$VAR}..."

YAML strips the double quotes, so the pattern above is interpreted as ...${$VAR}..., which is not what we want. To keep the quotes (and line breaks) verbatim, we use the pipe | syntax:

      - pattern-not-inside: |
          "...${$VAR}..."

Detecting an iteration over the output of ls

This example implements ShellCheck rule SC2045. It should be self-explanatory:
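The sketch below first shows why the construct is fragile, then saves a hypothetical rule for it. It assumes the default IFS, and it assumes that a for loop can be written as a Semgrep pattern with a metavariable and ellipses; the rule id, message, severity, and file names are placeholders chosen for illustration.

```shell
# A file name containing a space, in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/my notes.txt"

# Iterating over the output of ls splits the name into two items.
count_with_ls() {
  n=0
  for f in $(ls "$demo_dir"); do n=$((n + 1)); done
  echo "$n"
}

# Iterating over a glob keeps the name in one piece.
count_with_glob() {
  n=0
  for f in "$demo_dir"/*; do n=$((n + 1)); done
  echo "$n"
}

count_with_ls     # prints 2
count_with_glob   # prints 1

# Save a hypothetical rule that flags the fragile form.
cat > ls-iteration.yaml <<'EOF'
rules:
  - id: iteration-over-ls-output
    languages: [bash]
    severity: WARNING
    message: Iterating over the output of ls is fragile; use a glob such as dir/* instead.
    pattern: |
      for $VAR in $(ls ...); do
        ...
      done
EOF

# To scan a script with it (requires Semgrep to be installed):
#   semgrep --config ls-iteration.yaml script.sh
```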

A word of caution

As of this week, Bash support in Semgrep is still experimental. Many bugs exist and some constructs can't be matched against. We've been trying to implement the most essential features first. Here's where we're at:

  • Parsing: about 92% of the Bash/sh code is parsed successfully.
  • Searching for the following constructs should mostly work:

    • simple commands
    • pipelines foo | bar | baz
    • if, for, while, case
    • function definitions
    • assignments
    • simple variable expansions $X, ${X}
    • double-quoted strings
    • command substitution $(cmd)
    • subshells (cmd) and command grouping { cmd; }
  • The following Semgrep patterns are supported in most places where they make sense:

    • ellipsis ...
    • metavariables $MV
    • deep ellipsis <... foo ...>

Features that aren't supported yet include:

  • matching over file redirections, e.g. cmd > file
  • matching over background jobs specifically, i.e. cmd &
  • scanning scripts without a .sh or .bash extension
  • understanding the syntax of popular commands, e.g. set -eu and set -u -e aren't treated as equivalent for now
  • matching over array accesses, e.g. ${arr[$i]}, arr[$i]=foo, ${#arr[@]}, etc.
  • matching over arithmetic expressions
  • matching over C-style loops