How Saleor Prevents Incidents

Preventing Incidents: Proactive > Reactive

The best incident is the one that never happens.

In the early days, we mostly fixed issues as they appeared. With a small number of stores, it was barely manageable. As traffic grew, especially during Q4, reacting wasn't enough. Small issues turned into outages faster, and the same patterns kept coming back.

We decided to change our approach. Rather than just getting better at responding, we made prevention a regular part of our engineering work. It’s not something we put off for later anymore.

Here's what that looks like.

Spot Anomalies Early with SEV-3 Alerts

Our lowest-severity incidents, SEV-3, are usually not visible to merchants. They’re small signals that something is drifting away from normal: memory slowly climbing, API latency increasing, database queries taking longer than usual, or a queue that isn’t draining as fast as it normally does.

We treat those alerts as the first line of prevention. If they happen during working hours, someone looks into them right away. Usually, it’s a quick look to confirm the trend, figure out what changed, and fix it if needed while things are still stable.

Ignoring those early signs is what turns them into real outages.

The pattern is always the same: the trend continues, traffic increases, limits get hit, and suddenly we're dealing with a production incident that could have been avoided entirely.

Handling these signals during the day is far cheaper than dealing with the same issue at peak traffic. It also keeps problems on our schedule instead of the on-call’s. That’s why our monitoring is tuned to warn us before things break, not only after.
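As a rough illustration of what "drifting away from normal" can mean in practice, the sketch below compares the average of the most recent samples against an earlier baseline window. The function names, window sizes, and the 20% warning ratio are assumptions chosen for the example, not values taken from our monitoring setup.

```python
from statistics import mean


def drift_ratio(samples, baseline_size, recent_size):
    """Return the recent average divided by the baseline average."""
    if len(samples) < baseline_size + recent_size:
        raise ValueError("not enough samples for both windows")
    baseline = mean(samples[:baseline_size])
    recent = mean(samples[-recent_size:])
    if baseline <= 0:
        raise ValueError("baseline average must be positive")
    return recent / baseline


def is_sev3_drift(samples, baseline_size=6, recent_size=3, warn_ratio=1.2):
    """Flag a low-severity alert when recent values sit well above baseline.

    warn_ratio=1.2 (20% above baseline) is a hypothetical threshold.
    """
    return drift_ratio(samples, baseline_size, recent_size) >= warn_ratio


# Hypothetical memory readings in MB, oldest first.
climbing = [510, 505, 512, 508, 511, 509, 580, 620, 660]
stable = [510, 505, 512, 508, 511, 509, 507, 512, 510]

print(is_sev3_drift(climbing))  # True
print(is_sev3_drift(stable))    # False
```

A check like this fires while the service is still healthy, which is the point: the alert describes a trend, not a failure.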

Staging Environment + Sentry Quality Gate

Every change we ship goes to staging first, and Sentry monitors it in the same way it monitors production. If a new release starts throwing errors in staging, we fix them before promoting it. That’s the whole rule.

Our process is straightforward: we deploy to staging, check Sentry, and look for any new issues linked to that release. If we find something, we fix it while the change is still fresh. We don’t ship until staging is clear.
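The rule can be sketched as a small decision function. The `Issue` record and its fields below are hypothetical stand-ins for whatever an error tracker reports, not Sentry's actual data model, and fetching the issues from Sentry is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    # Hypothetical fields; a real error tracker exposes its own schema.
    issue_id: str
    environment: str
    first_seen_release: str


def blocking_issues(issues, release, environment="staging"):
    """Return issues first seen in this release in the given environment."""
    return [
        issue
        for issue in issues
        if issue.environment == environment
        and issue.first_seen_release == release
    ]


def can_promote(issues, release):
    """A release may be promoted only when staging shows no new issues."""
    return not blocking_issues(issues, release)


issues = [
    Issue("A-1", "staging", "3.20.1"),
    Issue("A-2", "production", "3.19.0"),
]
print(can_promote(issues, "3.20.1"))  # False: A-1 is new in staging
print(can_promote(issues, "3.20.2"))  # True: nothing new for this release
```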

The Sentry integration provides several features that make this effective:

Environment-specific tagging (staging vs. production), so we can filter issues by where they occurred
Release tracking, which ties errors to specific code versions and makes it obvious which change introduced a problem
Automatic grouping of similar errors, which cuts noise: ten instances of the same bug show up as one issue
Source code context in error reports, which shows the exact line that failed along with the surrounding code

This process isn’t complicated, but it only works if everyone sticks to it. It’s tempting to push a small fix at the end of the day. Having a clear rule makes the decision simple: no guessing, no debate, just wait until staging is clean.

Most bugs never make it past this step, which means customers never see them.

Monitoring and Observability

To catch issues early, we rely on a few signals that give us enough context to see when something starts drifting.

Comprehensive logging

All services use structured JSON logs with consistent fields. That makes it easy to filter by tenant, request ID, release, or endpoint without scanning raw text.
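A minimal sketch of this idea with Python's standard `logging` module is shown below. The field names mirror the ones mentioned above; the `JsonFormatter` class and `build_logger` helper are illustrative assumptions, not our production logging code.

```python
import json
import logging

# Context fields every log entry should carry (None when not supplied).
CONTEXT_FIELDS = ("tenant", "request_id", "release", "endpoint")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in CONTEXT_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout")
log.info(
    "checkout completed",
    extra={"tenant": "store-a", "request_id": "r-1", "endpoint": "/graphql/"},
)
```

Because every entry has the same keys, filtering by tenant or request ID becomes a field lookup rather than a text search.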

Distributed tracing

Tracing shows how a request moves through the system. When latency increases gradually, we can see which call or query is responsible instead of guessing.
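To make "which call or query is responsible" concrete, the sketch below compares per-span durations from a current trace against a baseline and reports the span that grew the most. The span names and timings are invented for the example, and a real tracing backend would supply this data in its own format.

```python
def slowest_growth(baseline_ms, current_ms):
    """Return (span_name, added_ms) for the span whose duration grew most.

    Spans missing from the baseline are treated as having taken 0 ms.
    """
    if not current_ms:
        raise ValueError("current trace has no spans")
    growth = {
        name: duration - baseline_ms.get(name, 0.0)
        for name, duration in current_ms.items()
    }
    name = max(growth, key=growth.get)
    return name, growth[name]


# Hypothetical span durations in milliseconds.
baseline = {
    "graphql.resolve": 12.0,
    "db.query.products": 30.0,
    "http.payment_gateway": 80.0,
}
current = {
    "graphql.resolve": 13.0,
    "db.query.products": 95.0,
    "http.payment_gateway": 84.0,
}

print(slowest_growth(baseline, current))  # ('db.query.products', 65.0)
```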

Business metrics monitoring

We track things like checkout completion, payment success, and inventory sync time. These often change before technical metrics and are an early sign that something isn’t behaving normally.
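One simple way to watch a business metric is to compare the current rate with a recent baseline and flag a relative drop. The sketch below does this for checkout completion; the 10% tolerance and the sample numbers are assumptions for illustration only.

```python
def completion_rate(completed, started):
    """Share of started checkouts that completed (0.0 when none started)."""
    return completed / started if started else 0.0


def rate_dropped(current_rate, baseline_rate, max_relative_drop=0.10):
    """True when the rate fell by more than max_relative_drop of baseline.

    max_relative_drop=0.10 is a hypothetical tolerance.
    """
    if baseline_rate <= 0:
        return False
    return (baseline_rate - current_rate) / baseline_rate > max_relative_drop


baseline = 0.62  # hypothetical completion rate over the previous week
current = completion_rate(completed=510, started=1000)

print(rate_dropped(current, baseline))  # True: 0.51 is about 18% below 0.62
print(rate_dropped(0.60, baseline))     # False: about 3% below baseline
```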

Custom monitors for domain-specific concerns

Payments, webhooks, and inventory have different failure patterns. We maintain monitors that match how each of these parts actually behaves instead of relying only on generic thresholds.
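The difference from a generic threshold is that each monitor encodes what "unhealthy" means for its own domain. A minimal sketch follows; the metric names and every threshold in it are hypothetical and exist only to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Monitor:
    name: str
    breached: Callable[[dict], bool]  # takes a metrics snapshot


# All metric keys and thresholds below are illustrative assumptions.
MONITORS = [
    Monitor(
        "payments: success rate below 97%",
        lambda m: m["payment_success_rate"] < 0.97,
    ),
    Monitor(
        "webhooks: oldest undelivered event older than 5 minutes",
        lambda m: m["webhook_oldest_pending_s"] > 300,
    ),
    Monitor(
        "inventory: sync taking longer than 10 minutes",
        lambda m: m["inventory_sync_s"] > 600,
    ),
]


def breached_monitors(metrics):
    """Return the names of all monitors whose condition is met."""
    return [monitor.name for monitor in MONITORS if monitor.breached(metrics)]


snapshot = {
    "payment_success_rate": 0.99,
    "webhook_oldest_pending_s": 420,
    "inventory_sync_s": 95,
}
print(breached_monitors(snapshot))
# ['webhooks: oldest undelivered event older than 5 minutes']
```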

Capacity planning

We look at usage trends over weeks and months and scale before hitting limits, rather than reacting after services start failing.
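A very simple version of this forecast fits a straight line to recent usage and estimates how long until it crosses a limit. The sketch below uses an ordinary least-squares fit; real usage is rarely this linear, and the function name and sample data are assumptions for illustration.

```python
def days_until_limit(daily_usage, limit):
    """Estimate days until usage reaches `limit`, based on a linear trend.

    Returns None when usage is flat or falling. Day 0 is the first sample.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    day_at_limit = (limit - intercept) / slope
    # Count from the most recent sample, never returning a negative value.
    return max(0.0, day_at_limit - (n - 1))


# Hypothetical daily disk usage in GB, growing 10 GB per day.
usage_gb = [100, 110, 120, 130, 140]
print(days_until_limit(usage_gb, limit=200))  # 6.0
```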

After every incident, we check whether anything was missing and update our signals by adding a monitor, a log field, or another trace. This makes recurring issues easier to catch and fix the next time.

Other Prevention Practices

Monitoring is just one way we stay ahead of incidents.

Before Q4, we run load tests that simulate high traffic. This shows where queues build up, which endpoints slow down, and how autoscaling behaves, so we can fix bottlenecks ahead of time.
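The core of a load test is small: issue many concurrent requests and look at the latency distribution rather than the average. The sketch below shows that shape with a stub `fake_endpoint` in place of a real HTTP call. It is an assumption-laden toy, not our load-testing setup, and dedicated tools handle ramp-up, realistic traffic mixes, and reporting far better.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def run_load(handler, total_requests, concurrency):
    """Call `handler` total_requests times and return latencies in ms."""

    def timed_call(_):
        start = time.perf_counter()
        handler()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def p95(latencies_ms):
    """95th percentile latency; needs at least two samples."""
    return quantiles(latencies_ms, n=100)[94]


def fake_endpoint():
    # Stand-in for a real request; sleeps about 2 ms.
    time.sleep(0.002)


latencies = run_load(fake_endpoint, total_requests=50, concurrency=10)
print(f"requests: {len(latencies)}, p95: {p95(latencies):.1f} ms")
```

Running the same scenario at increasing concurrency and watching where p95 starts to climb is one way to find the first bottleneck.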

To avoid doing the same work over and over, we keep short runbooks for common issues. When a familiar problem comes up, the on-call person doesn’t have to start from scratch or guess what to do next.

From time to time, we run a simulated incident to make sure our process still works. This helps us find missing dashboards, unclear ownership, or outdated steps before a real outage reveals them.

Retrospectives always lead to follow-up tasks. We add these to our regular backlog and track them like any other work, so fixes don’t get lost after the incident is over.

Learning Through Blameless Retrospectives

Some issues still make it through. When that happens, we look at what allowed it rather than who triggered it. The goal is to understand the conditions that made the failure possible.

John Allspaw, who helped popularize this approach at Etsy, described it as removing the fear that stops people from being honest about what happened. Without that, timelines get edited and useful details disappear.

We run retrospectives soon after an incident while the context is still fresh. They’re short and focused. The people who were involved walk through what they saw at the time: which alerts fired, what they expected to happen, and what information they didn’t have.

The discussion stays on the system, not the individual. We look at things like:

alerts that didn’t fire or fired too late
signals that were missing entirely
steps that weren’t documented anywhere
ownership that wasn’t clear
assumptions that made sense in the moment but turned out to be wrong

The output is a list of changes that would have made the incident easier to detect or avoid: new monitors, adjusted thresholds, updates to runbooks, additional logging or tracing, or changes to how something is deployed.

These tasks go into the regular backlog with owners and priority, not into a separate “to fix later” list. We track them until they’re done, which prevents the same problem from resurfacing a few months later.

Over time, this process makes incidents less repetitive and easier to manage. Instead of depending on someone’s memory, we rely on a system and documentation that get better in lasting ways.

Conclusion

We didn’t start with all of this. In the early days, we fixed things as they broke, and that was enough for a small number of stores. As traffic grew, reacting stopped working, so we started putting prevention into our regular engineering work instead of treating it as something to do later.

The practices in this article are simple: watch for early signals, hold releases until staging is clean, test realistically, and learn from the incidents that still happen. Doing these things regularly is what made our incidents less repetitive and easier to handle.

This is ongoing work. We keep updating monitors based on what we learn, improving documentation, and closing the gaps that surface during real incidents. As more merchants rely on Saleor during peak traffic, this approach scales better than depending on fast reaction or individual heroics.

Q4 will always be a busy and stressful time for commerce. Our goal isn’t to get rid of every failure, but to make failures smaller, less frequent, and easier to recover from. We also want to keep improving the system so the same problems don’t return.