How Saleor Handles Incidents

The Stakes of Q4

Q4 is when online stores face their highest traffic. Black Friday, Cyber Monday, and the holiday sales all land within a few weeks. Last Cyber Monday alone brought in more than $13.3 billion in online spending in a single day. For many merchants, this period can make or break their year.

Downtime during peak traffic can be disastrous. It leads to lost sales, abandoned carts, and frustrated customers. Large retailers set up “war rooms” with engineers ready to handle issues.

Saleor isn’t a huge company with endless resources. Still, our customers count on their stores staying online during traffic spikes. That’s why we focus on strong monitoring, clear processes, and a culture that sees incidents as chances to learn.

Here’s how we make it work:

Monitoring: Foundation of Incident Detection

You can’t fix problems if you don’t know they’re there. Our monitoring follows one rule: cover everything important, but don’t overwhelm the team. Too few alerts leave you in the dark, but too many make it hard to spot real issues.

How we use Datadog

We use Datadog as our observability platform. It brings together all the key information—metrics, logs, traces, and errors—in one place. When you’re troubleshooting at 2 AM, you don’t want to switch between five different tools.

Here's what we track:

APM (Application Performance Monitoring) - how our services perform and talk to each other. When checkout slows down, we can see exactly which service is the bottleneck.

Infrastructure metrics - database performance, worker queues, CPU, memory, and disk space. These are the basic health checks. We want to catch when memory reaches 75%, not wait until it hits 100% and causes a crash.

Centralized logs - all application logs are stored together. We use structured logging, which makes it much easier to filter and find what we need.

Distributed tracing - lets us follow a single request as it moves through the whole system. Sometimes, one API call goes through five different services before it responds.

Error tracking - sends real-time alerts and groups similar errors together. Patterns matter more than individual errors.

Having everything in one platform is crucial during incidents. Instead of just saying, "the API is slow," you can explain, "the API is slow because this database query takes 10 seconds. Here are the logs and the trace to prove it."
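As a rough illustration of the structured logging mentioned above, here is a minimal sketch using only Python's standard library. The `JsonFormatter` class, the `context` field, and the `tenant_id` and `order_id` keys are assumptions made for the example, not our actual logging setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Keys passed via `extra=` become attributes on the record;
        # "context" is a hypothetical name chosen for this sketch.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")
logger.info(
    "payment failed",
    extra={"context": {"tenant_id": "t-42", "order_id": "o-1"}},
)
```

Because every field is a key rather than free text, a log search can filter on something like `tenant_id` directly instead of relying on pattern matching.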

Dashboards vs. Monitors

Dashboards and monitors do different things.

Dashboards are for people. They show read-only views of system health. We have dashboards for infrastructure, business metrics, and each service. Engineers use them to spot trends or dig into issues. Everyone in the company can see them, so all teams know how things are running.

Monitors are for automation. They watch for problems and send alerts when something crosses a set limit—like error rates spiking, response times slowing down, or disk space running low.

Good monitors give you context. When an alert goes off, it should tell you:

What's wrong
Who owns this system
What is the impact
What might be causing it
How to respond (a runbook; more on that later)
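One way to keep alerts from shipping without that context is to model the five items as required fields. The sketch below is hypothetical: the `AlertContext` name, its fields, and every example value (including the runbook URL) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertContext:
    what: str          # what's wrong
    owner: str         # who owns this system
    impact: str        # what the impact is
    likely_cause: str  # what might be causing it
    runbook_url: str   # how to respond

    def render(self) -> str:
        """Format the context as the body of an alert notification."""
        return "\n".join(
            [
                f"What: {self.what}",
                f"Owner: {self.owner}",
                f"Impact: {self.impact}",
                f"Likely cause: {self.likely_cause}",
                f"Runbook: {self.runbook_url}",
            ]
        )


# All values below are placeholders, not real monitors or URLs.
alert = AlertContext(
    what="Checkout p95 latency above threshold",
    owner="checkout-team",
    impact="Customers may abandon carts",
    likely_cause="Slow database query or exhausted connection pool",
    runbook_url="https://example.com/runbooks/checkout-latency",
)
print(alert.render())
```

Since the dataclass has no defaults, leaving out any of the five fields raises a `TypeError` when the alert is defined, not when it fires.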

We use multi-alert monitors to handle scale. Rather than setting up 50 separate monitors for 50 worker queues, one monitor keeps an eye on all of them and alerts us if any queue has a problem.
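The idea behind a multi-alert monitor can be sketched in a few lines: one rule, evaluated per group, producing one alert per breaching group. This is a plain-Python illustration of the concept, not Datadog's implementation; the queue names and the threshold are invented.

```python
def breached_groups(depth_by_queue: dict[str, int], threshold: int) -> list[str]:
    """Apply one threshold to every queue and return those that exceed it."""
    return sorted(
        queue for queue, depth in depth_by_queue.items() if depth > threshold
    )


# Hypothetical queue depths; a real monitor would read these from metrics.
depths = {"webhooks": 120, "emails": 15_000, "exports": 40}

for queue in breached_groups(depths, threshold=10_000):
    print(f"ALERT queue={queue} depth={depths[queue]}")
```

Adding a 51st queue needs no new monitor here: it is simply another key in the grouped data.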

When a critical monitor detects an issue, it automatically creates an incident in incident.io so we can respond quickly.

How we use incident.io

We use incident.io for incident management. We chose it because it works directly in Slack, which we already use every day. There’s no extra dashboard to check or new tool to learn.

It takes care of the repetitive tasks:

Creating dedicated Slack channels for each incident
Assigning roles (who's leading, who's fixing, who's communicating)
Tracking status updates
Managing postmortem creation
Generating AI-powered incident updates

This is important because you want to spend your time fixing the problem, not managing Slack channels.

The integration flow is simple:

Datadog (or Sentry, or any other alert source) fires an alert
Incident gets created in incident.io
Slack channel appears
On-call team gets notified based on severity
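The four steps above can be simulated in memory to make the order of operations concrete. This sketch does not call incident.io, Slack, or Datadog; the `Incident` class, the `open_incident` function, and the channel naming scheme are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Incident:
    title: str
    severity: str
    source: str
    channel: str
    timeline: list[str] = field(default_factory=list)


def open_incident(
    source: str,
    title: str,
    severity: str,
    incident_number: int,
    notify: Callable[[Incident], None],
) -> Incident:
    """Walk through the flow: alert -> incident -> channel -> notification."""
    # Hypothetical channel naming scheme.
    channel = f"inc-{incident_number}-{title.lower().replace(' ', '-')}"
    incident = Incident(title, severity, source, channel)
    incident.timeline.append(f"alert received from {source}")
    incident.timeline.append(f"channel #{channel} created")
    notify(incident)  # who gets notified is decided by the caller's policy
    incident.timeline.append(f"on-call notified for {severity}")
    return incident


paged: list[str] = []
incident = open_incident(
    source="datadog",
    title="Checkout errors",
    severity="SEV-1",
    incident_number=101,
    notify=lambda inc: paged.append(inc.channel),
)
print(incident.timeline)
```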

Incident Response: When Things Go Wrong

So far, we’ve talked about how we detect incidents. Now, let’s look at how we respond when one happens:

Severity Classification

We use three severity levels so everyone can quickly understand how urgent an incident is:

SEV-1 (Critical) - merchants can't sell. Checkout is down, the API is dead, or the platform is unavailable. On-call responds immediately, 24/7, nights and holidays included. No "we'll look at it in the morning." Every SEV-1 gets a full post-incident review.

SEV-2 (Major) - merchant operations are degraded but not dead. Search isn't working, or checkout has significant delays. Real business impact, but not catastrophic. We respond urgently during extended hours: we won't wake someone at 3 AM, but we won't wait until Monday if it happens on Saturday.

SEV-3 (Minor) - usually invisible to merchants. Early warning signals: memory consumption climbing, error rates increasing on non-critical endpoints. Handled during working hours.

We actually want SEV-3 incidents. They let us fix problems during the day, when the team is alert, before they become SEV-1 incidents at 2 AM.

As one of our engineers puts it:

It's not necessarily bad if we have more incidents - more SEV-3 means we were able to act earlier. One SEV-1 counts as 100 SEV-3s.

Our rule is simple: when in doubt, choose a higher severity. It’s much easier to lower the severity later than to escalate in the middle of a crisis. The Google SRE book says it best: "It is better to declare an incident early and then find a simple fix and close out the incident than to have to spin up the incident management framework hours into a burgeoning problem."

On-Call Rotation

We maintain a rotating on-call schedule with at least three people who can respond. We always have a backup assigned; if the primary is unreachable, the backup gets paged.

Response times match severity:

SEV-1: Immediate, including nights and holidays
SEV-2: Daytime response, including weekends
SEV-3: Working hours
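The response policy above can be written down as a small function. The exact hour boundaries are not stated in this post, so the values below (08:00 to 22:00 for "daytime", 09:00 to 17:00 on weekdays for "working hours") are assumptions chosen only to make the sketch runnable.

```python
from datetime import datetime
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def should_respond_now(severity: Severity, now: datetime) -> bool:
    """Decide whether an incident needs a response at this moment."""
    is_weekday = now.weekday() < 5  # Monday is 0
    if severity is Severity.SEV1:
        return True  # immediate, including nights and holidays
    if severity is Severity.SEV2:
        return 8 <= now.hour < 22  # assumed daytime window, weekends included
    return is_weekday and 9 <= now.hour < 17  # assumed working hours


saturday_3am = datetime(2024, 11, 30, 3, 0)
print(should_respond_now(Severity.SEV1, saturday_3am))  # True
print(should_respond_now(Severity.SEV2, saturday_3am))  # False
```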

We design the rotation to be sustainable. Our goal is fair coverage, clear escalation paths, and making sure no one is left handling incidents alone.

Mitigation and Resolution

When an incident occurs, we focus on two main activities:

Mitigation - stopping the impact on customers right away. This is quick and sometimes not perfect. It might mean rolling back a release, scaling up infrastructure, or redirecting traffic.

Resolution - finding a permanent fix for the root cause. This takes more time. It means figuring out what went wrong, creating a solution, testing it, and deploying it carefully.

Mitigation always comes first. Software fixes often take longer than expected and can sometimes cause new problems. While you work on the best solution, customers could be losing transactions.

Here is a real-life example: We run a multi-tenant system. One of our queues got stuck because one tenant's issues were blocking everyone else (a "noisy neighbor" problem).

We didn’t waste hours figuring out why that tenant’s events were causing trouble. Instead, we quickly redirected their events to a separate queue. Their queue drained slowly, but everyone else was able to keep working. Once things were stable, we looked into the root cause without affecting other customers.
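A simplified sketch of that kind of mitigation: a router that sends events to a shared queue by default and lets an operator isolate one tenant at runtime. The `EventRouter` class and the queue names are hypothetical and stand in for whatever the real queueing layer provides.

```python
from typing import Optional

DEFAULT_QUEUE = "events"  # hypothetical shared queue name


class EventRouter:
    """Pick a queue per tenant, with an override for isolated tenants."""

    def __init__(self) -> None:
        self._isolated: dict[str, str] = {}

    def isolate(self, tenant_id: str, queue: Optional[str] = None) -> None:
        """Mitigation: move one tenant's events off the shared queue."""
        self._isolated[tenant_id] = queue or f"events-isolated-{tenant_id}"

    def release(self, tenant_id: str) -> None:
        """After the root cause is fixed, return the tenant to the shared queue."""
        self._isolated.pop(tenant_id, None)

    def queue_for(self, tenant_id: str) -> str:
        return self._isolated.get(tenant_id, DEFAULT_QUEUE)


router = EventRouter()
router.isolate("tenant-a")
print(router.queue_for("tenant-a"))  # events-isolated-tenant-a
print(router.queue_for("tenant-b"))  # events
```

The override is deliberately cheap to apply and cheap to undo, which is what you want from a mitigation: it buys time for the investigation without committing to a permanent design.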

Incident Response Roles

We use three roles during incidents:

Incident Lead - responsible for the incident from start to finish. They coordinate the response, manage Slack and incident.io, and make key decisions. The Incident Lead keeps the team focused, tracks what’s been tried, and leads the post-incident review.

The Incident Lead doesn’t need deep technical skills. Their job is to coordinate people, not to fix code. Think of them like an air traffic controller rather than a pilot.

Fixer - handles the technical side. They investigate the root cause, carry out mitigation and resolution, and update the Incident Lead. The Fixer knows the affected systems inside and out.

Communication Lead - manages all external communication. They update customers, notify stakeholders, keep the status page current, and explain technical issues in simple terms. This lets the Fixers focus on solving the problem.

Measuring Incident Response

We track two metrics:

Time to Detect (TTD): How quickly we notice issues
Time to Resolve (TTR): How quickly we fix them

Both are important, but TTD matters more. You can’t fix a problem if you don’t know it exists.
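Both metrics fall out of three timestamps per incident. The sketch below assumes TTD is measured from when the problem started to when it was detected, and TTR from detection to resolution; other teams measure TTR from the start of the problem instead. The timestamps are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimes:
    started_at: datetime   # when the problem actually began
    detected_at: datetime  # when monitoring or a person noticed it
    resolved_at: datetime  # when the incident was resolved

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from detection, not from the start.
        return self.resolved_at - self.detected_at


times = IncidentTimes(
    started_at=datetime(2024, 12, 2, 14, 0),
    detected_at=datetime(2024, 12, 2, 14, 4),
    resolved_at=datetime(2024, 12, 2, 14, 49),
)
print(times.time_to_detect)   # 0:04:00
print(times.time_to_resolve)  # 0:45:00
```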

Up Next: Prevention

This is our approach when things go wrong. Of course, we’d much rather prevent incidents from happening in the first place.

Next time, we’ll cover how we do that.