Thanks to Clint Gibler, Grayson Hardaway, and Pablo Estrada at r2c for their contributions to this piece. And thanks to r2c for contracting me to write it! This article is cross-posted on jacobian.org.
When I ran the security team at Heroku, I had this recurring nightmare: my PagerDuty alarm goes off, alerting me to some sort of security incident. In my dream, I’d look at my phone and realize “oh no, this is the big one” — and then I’d wake up.
I’m still not sure exactly what the attack in my dream was, but it may very well have been a Denial-of-Service (DoS) attack. DoS attacks are simple but can be devastating: an attacker crafts and sends traffic to your app in a way that overwhelms your servers. While this is arguably not as bad as a remote code execution or a data breach, it’s still pretty terrible. If your customers can’t use your app, you’ll lose their money and their trust.
Typically, we talk about two kinds of Denial-of-Service attacks:
- “Normal” Denial-of-Service (DoS) attacks, where a single machine is sufficient to cause downtime. The classic, old-school version of this attack is the zip bomb: an attacker tricks your server into expanding a specially-crafted ZIP file that is tiny when compressed but expands to entirely fill your disk space.
- Distributed Denial-of-Service (DDoS) attacks. These attacks rely on an attacker sending a huge flood of traffic to your site from multiple machines (that’s the “Distributed” part). Often, these attacks come from Botnets — fleets of compromised machines controlled by an attacker. These botnets are available to purchase in certain corners of the Internet, making a DDoS attack well within the reach of anyone with a credit card.
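The zip bomb defense is a good illustration of how cheap these checks can be. Here’s a hedged sketch of one approach: before extracting an untrusted archive, inspect the uncompressed sizes its headers declare and refuse to proceed past a limit. The limits here are illustrative, not prescriptive, and archive headers can lie, so a robust defense would also cap bytes written during extraction itself.

```python
import zipfile

MAX_UNCOMPRESSED = 100 * 1024 * 1024  # 100 MB -- illustrative limit
MAX_RATIO = 100                       # refuse > 100:1 expansion

def safe_extract(path, dest):
    """Extract a ZIP archive, refusing obvious zip bombs up front."""
    with zipfile.ZipFile(path) as zf:
        # Sizes declared in the archive's central directory.
        total = sum(info.file_size for info in zf.infolist())
        compressed = sum(info.compress_size for info in zf.infolist()) or 1
        if total > MAX_UNCOMPRESSED:
            raise ValueError(f'archive expands to {total} bytes; refusing')
        if total / compressed > MAX_RATIO:
            raise ValueError('suspicious compression ratio; refusing')
        zf.extractall(dest)
```

This won’t stop every variant (nested archives, for instance, need recursive checks), but it raises the attacker’s cost dramatically for a few lines of code.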
Engineers who work on web applications frequently run into vulnerabilities that could be used in a DoS/DDoS attack. Unfortunately, there’s broad disagreement in the industry about how to treat these vulnerabilities. The risk can be difficult to analyze: I’ve seen development teams argue for weeks over how to handle a DoS vector.
This article tries to cut through those arguments. It provides a framework for engineering and application security teams to think about denial-of-service risk, breaks down DoS vulnerabilities into high-, medium-, and low-risk classes, and has recommendations for mitigations at each layer.
The primary focus of this post is on the big picture, and should apply to any kind of web app. But to make things concrete, I’ve added a few specific Django-related examples. (I helped create it, so it’s what I’m most familiar with.)
Evaluating the risk of a DoS vulnerability at the application layer can be difficult. There’s widespread disagreement among security professionals: you’ll often see two different appsec teams treat similar issues very differently.
Some argue: it’s nearly impossible to entirely mitigate a focused DDoS — a dedicated enough attacker can throw more bandwidth at you than your app can handle. You can never fully mitigate a DDoS attack without serious support from an upstream network provider with specific tools to protect against bot attacks (e.g., Cloudflare). Thus, chasing and fixing hypothetical DoS vulnerabilities can seem like a waste of developer time. These teams treat most potential DoS vectors as acceptable risk, and focus their energy on preparing mitigations at the network level.
Other teams point out that the traditional risk model has three potential problem areas: Confidentiality, Integrity, and Availability. We’ve long understood that uptime is a security issue. It’s becoming increasingly common for attackers to take a service down and then demand a ransom to stop the attack. The recent attack against Garmin is a highly notable example; attackers took down nearly all of Garmin’s services, and reportedly demanded US $1 million to stop the attack. (In this case the attack was ransomware, but it’s easy to see how a DoS attack could have a similar effect). Thus, DoS vulnerabilities are risks like any other, and it’s easy to understand the argument that they should all be mitigated.
It’s important to recognize that both of these positions are valid! It’s reasonable to see DoS as out-of-scope for application security; it’s similarly reasonable to scope it in. I’ve often seen security teams get completely stuck arguing between these two positions. Since neither is “right” or “wrong”, it can be impossible to figure out how to move forward.
The model I use to cut through this argument is the concept of attacker leverage. Levers amplify force: a small amount of force applied to the long end of the lever is multiplied at the short end. In the context of a DoS attack, if a vulnerability has high leverage it means attackers can consume a ton of your server resources with minimal resources.
For example, if a bug in your web app allows a single GET request to consume 100% CPU, that’s a terrific amount of leverage. Just a small handful of requests, and your web servers will grind to a halt. A low-leverage vulnerability, on the other hand, requires a large amount of attacker resources to cause even minor availability degradation. If an attacker has to spend thousands of dollars to bring a single server to its knees, you can probably scale up faster than they can.
The higher the leverage, the higher the risk, and the more likely I am to address the issue directly. The lower the leverage, the more likely I’ll accept the risk and/or lean on network-level mitigations.
Let’s get specific. I’ve broken down DoS risk into high, medium, and low risk classes, based on leverage. For each class, I’ll look at how to recognize that a vulnerability falls into this class, discuss a few examples, and give some suggestions for mitigation.
The classic high-risk DoS vulnerability is one where an attacker can cause resource starvation using very little resources themselves. This could mean exhaustion of any number of types of resources, including:
- Disk space — e.g., a vulnerability that magnifies uploaded data and fills the disk, as in the case of the classic zip bomb.
- Network bandwidth — e.g., a vulnerability that amplifies input traffic, where a single incoming request consumes tons of bandwidth, causing network starvation. I’ve seen this happen with a bug in a microservices system, where a single incoming request triggered millions of internal API requests (including moving some fairly large files around the network), and choked off the internal network bandwidth.
- CPU utilization — e.g., an exploit that triggers an accidentally quadratic algorithm, causing web servers to grind to a halt.
- Concurrency limits — most servers have a maximum concurrency limit (e.g., max threads or processes, or max connections for a database); an exploit that causes a process to run very slowly (or never exit) can cause the server to hit those limits and start rejecting requests.
In all these cases, the unifying factor is that a bug in the application will allow significant amplification.
When considering the risk of a resource amplification DoS vector, an important factor is the level of authentication required to trigger the vulnerability. If a completely anonymous user can easily trigger a resource starvation attack, it’ll be extremely easy for an attacker to bring you to your knees. Unauthenticated DoS vectors should be considered very high risk.

On the other hand, if only users who authenticate against your corporate Single Sign-On server can trigger the vulnerability, it’s far lower risk. Most attackers aren’t insiders (though, some are!). And, if an attack does occur, it’s easy to attribute and block. In many cases, “we can attribute and block this attack” is a reasonable, if not complete, mitigation strategy.

Many vulnerabilities fall between these two extremes: most services make creating new accounts fairly trivial (e.g., you just need an email address). This does give minimal ability to attribute and block, but often not enough.
Generally, I recommend that this class of DoS vulnerabilities — especially unauthenticated ones — be treated as high risk, and eliminated. If exploited, these vulnerabilities can be devastating; they allow a single attacker to completely overwhelm your app. I’d put the same level of effort into finding and eliminating these kinds of bugs that I do other high-risk security vulnerabilities like XSS and CSRF.
A common example of this last type of resource starvation, concurrency limits, is the regular expression denial-of-service, aka ReDoS. ReDoS bugs occur when certain types of strings can cause improperly crafted regular expressions to perform extremely poorly. These vulnerabilities are unfortunately relatively common in Python: the built-in regular expression module (`re`) has no inherent protection against them (unlike libraries like `re2`, or Go’s built-in regexp package, whose linear-time matching renders the language more or less immune to this class of attack).
(Django itself has had several of these vulnerabilities over the years; for example, CVE-2019-14232 and CVE-2019-14233 were both ReDoS vulnerabilities).
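To see why ReDoS is so dangerous, here’s a minimal demonstration of catastrophic backtracking with Python’s `re` module. The pattern is deliberately pathological: nested quantifiers like `(a+)+` force the engine, on a near-miss input, to try exponentially many ways of splitting the string before giving up.

```python
import re
import time

# A classic ReDoS trap: nested quantifiers with overlapping matches.
EVIL = re.compile(r'^(a+)+$')

def time_match(s):
    """Return how long EVIL takes to match (or fail to match) s."""
    start = time.perf_counter()
    EVIL.match(s)
    return time.perf_counter() - start

fast = time_match('a' * 20)        # matches almost instantly
slow = time_match('a' * 20 + 'b')  # fails, but only after heavy backtracking

print(f'match: {fast:.6f}s, near-miss: {slow:.6f}s')
```

Add a few more `a`s to the near-miss input and the match time roughly doubles with each one — exactly the kind of leverage an attacker wants: a few bytes of input tying up an entire worker process.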
In Django, these vulnerabilities most often show up in two places — regex-based URL parsing and custom validators — and more broadly anywhere an application uses regular expressions. Luckily, this class of vulnerabilities is fairly easy to find; see the following r2c articles:
- Finding Python ReDoS bugs at scale using Dlint and r2c, and
- Improving ReDoS detection and finding more bugs using Dlint and r2c
If you’re using Python, you can easily scan for ReDoS in your application using Semgrep, which has ReDoS detection ported from Dlint. The detection requires some extra logic written using Semgrep’s powerful pattern-where-python clause, which enables rules to leverage the full power of Python, so you’ll have to pass the `--dangerously-allow-arbitrary-code-execution-from-rules` flag:

```
$ semgrep --config https://semgrep.dev/r/contrib.dlint.redos \
    --dangerously-allow-arbitrary-code-execution-from-rules
```
Somewhat further down the risk spectrum, we find a different flavor of resource starvation: areas of your app that are inherently slower or more resource-intensive. For example:
- Complex reporting, where quite a lot of data needs to be read and calculated. Think about ad-hoc reporting on aggregated metrics over a long time period, or a quarterly financial report summarizing millions of transactions.
- Database or search engine writes that require expensive re-indexing. Typical web applications are tuned for fast reads, at the expense of slow writes. This can be especially true of consistent writes to distributed databases (thanks, CAP theorem!).
- APIs like GraphQL that can generate arbitrarily-deep database joins. This is a deeper topic than can be covered here; for a good introduction, see Securing Your GraphQL API from Malicious Queries by the Apollo team.
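For the GraphQL case, one common mitigation is depth limiting: reject queries whose nesting exceeds a threshold before executing them. Here’s a toy sketch of the idea, using nested dicts to stand in for a parsed query — the helper names and the limit are hypothetical, not from any particular GraphQL library.

```python
def query_depth(selection):
    """Depth of a query represented as nested dicts of field selections."""
    if not selection:
        return 0
    return 1 + max(query_depth(children) for children in selection.values())

MAX_DEPTH = 5  # illustrative; tune to your schema's legitimate shapes

def check_query(selection):
    """Refuse to execute queries nested deeper than MAX_DEPTH."""
    if query_depth(selection) > MAX_DEPTH:
        raise ValueError('query too deep; rejecting to limit DoS risk')

# e.g. { user { posts { comments { author } } } }
query = {'user': {'posts': {'comments': {'author': {}}}}}
check_query(query)          # depth 4: allowed
print(query_depth(query))   # → 4
```

Real GraphQL servers usually offer this as a plugin or validation rule (along with query cost analysis), which is preferable to rolling your own.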
An attacker that finds an area that’s significantly slower than normal can spam that endpoint, causing similar resource exhaustion as above. But usually these aren’t bugs; they’re features of the application. Some features will always be slower or more resource-heavy; there’s rarely a “fix” for something that just takes time. Sometimes there are performance optimizations that can lower the risk, but often they require serious investment or unacceptable trade-offs like giving up on consistent writes. However, there are a few mitigating factors that make these kinds of issues lower risk:
- Typically, these kinds of endpoints are behind some sort of authentication or sign-in. E.g., the GraphQL API requires an API key; the financial report is only available to privileged users; writes to the database can only be triggered by logged-in customers. This lowers the risk, as discussed above.
- It usually takes more attack traffic to overwhelm these kinds of features than the high-leverage class. E.g., while in a typical app writes are slower than reads, they’re not that slow; a well-tuned database can still handle thousands of writes per second. So, an attacker will have to work harder, and devote more of their resources, to causing resource starvation.
Taken together, I think this means it’s much more reasonable to see potential vulnerabilities in this category as acceptable risks. “We’ll just block an API key that tries to overwhelm us” seems like a reasonable decision.
That said, there’s a common architectural mitigation worth considering: rate limiting. Rate limits set a threshold on the number of requests to a particular endpoint over some short time window. Rate limits can be pretty easy to set up and apply, and are often simply good engineering practice. As long as you’re setting the limits high enough to not block normal use, they can help prevent a bunch of issues, including DoS.
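To make the mechanism concrete, here’s a minimal sketch of the simplest variant, a fixed-window counter. The class and parameter names are illustrative, not from any particular library, and a real deployment would back the counters with a shared store like Redis rather than in-process memory:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        # Bucket each request by which window it falls into.
        bucket = (key, int(now // self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit

limiter = FixedWindowLimiter(limit=10, window=1)
results = [limiter.allow('user-1', now=100.0) for _ in range(12)]
print(results.count(True))  # → 10: the first 10 requests pass, the rest are rejected
```

Fixed windows have a known edge case (a burst straddling a window boundary can briefly double the allowed rate), which is why production systems often prefer sliding-window or token-bucket variants — but the core idea is the same.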
In Django, django-ratelimit provides a simple decorator-based API that makes it super easy to add rate limiting to views:

```python
from ratelimit.decorators import ratelimit

@ratelimit(key='user', rate='10/s')
def my_view(request):
    ...
```
Or, if you’re using Django REST Framework, it’s got built-in rate limiting with a bunch of options.
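If you go the DRF route, throttling is configured in your Django settings module. Here’s a sketch of the built-in configuration — the rates are illustrative and should be tuned to your application’s real traffic:

```python
# Goes in your Django settings module; rates are illustrative.
REST_FRAMEWORK = {
    'DEFAULT_THROTTLE_CLASSES': [
        'rest_framework.throttling.AnonRateThrottle',
        'rest_framework.throttling.UserRateThrottle',
    ],
    'DEFAULT_THROTTLE_RATES': {
        'anon': '100/hour',   # unauthenticated clients, keyed by IP
        'user': '1000/hour',  # authenticated users, keyed by user id
    },
}
```

Note how this maps onto the authentication discussion above: anonymous traffic gets a much tighter limit than authenticated traffic, because it’s harder to attribute and block.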
For some applications, it makes sense to apply rate limits widely -- even as widely as on every view. In those cases, you could use Semgrep to find and warn about un-decorated views. Here’s an example of a Semgrep config that can find views without the `@ratelimit` decorator applied:

```yaml
rules:
  - id: my_pattern_id
    patterns:
      - pattern-either:
          - pattern: |
              def $FUNC(..., request, ...):
                  ...
      - pattern-not: |
          @ratelimit.decorators.ratelimit(...)
          def $FUNC(..., request, ...):
              ...
    message: |
      This view appears not to have a rate limit applied.
      Consider applying one with the @ratelimit decorator.
    severity: WARNING
```
You’d probably want to modify this ruleset for your specific application; this is just a starting point. A good way to iteratively develop a custom ruleset that works for you is by starting with this ruleset in the interactive Semgrep playground.
Finally, we get to the last category of DoS attacks: true Distributed Denial-of-Service attacks, where an attacker directs a large fleet of computers (often a botnet) to send massive waves of traffic to your application. This traffic isn’t always application-specific; it’s often a flood of nonsense TCP or UDP packets, designed to overwhelm the network itself. The size of a DDoS attack is usually only limited by your attacker’s budget. This is the class of attack that makes application security engineers throw up their hands — myself included! There isn’t really much you can do to mitigate these, certainly not at the application level. I tend to agree that true DDoS is out of scope for application security.
That said, there is some work that can be done at the network level, mostly in terms of preparation:
- You should consider putting your application behind a service like Cloudflare that can protect against DDoS. You’ll also get some substantial performance benefits from Cloudflare’s CDN, so this is usually well worth the time.
- You should understand networking layers and where network rules can be applied. Many DDoS attacks can be identified (by IP, source port, traffic type, or some combination). Knowing how to quickly apply network rules to drop or throttle malicious traffic can help make sure you can quickly respond to an attack.
- Beyond the systems you control yourself, you should know who your network providers are and what mitigations they may be able to apply. Often, your network provider can block these attacks more effectively than you can. For example, if you host on AWS, you can get 24x7 access to the AWS DDoS Response Team as part of AWS Shield Advanced. It starts at $36,000 per year, but depending on your business that may look ridiculously expensive or absurdly cheap.
If you’d like to read more about preparing for and mitigating DDoS attacks, Chapter 10 of Google’s Building Secure and Reliable Systems is a great starting point.
Denial-of-service vulnerabilities can manifest in a number of different ways. Some should be prioritized and fixed immediately, but others are reasonably deemed “acceptable risk”. There’s no one-size-fits-all approach; you need to consider the relative risk of the vulnerability before choosing an appropriate response.
The best framework I’ve found for evaluating this risk is amplification: considering how much attacker traffic is needed to trigger some level of service degradation. If a couple trivial requests can bring your server to its knees, that’s a very high risk and should be treated appropriately. On the other hand, if a terrific amount of traffic can cause modest slow-downs, it’s reasonable to prioritize your time elsewhere.
The next time you face uncertainty about a DoS vector, try using this framework. I hope it prevents one of those frustrating arguments!