Every SaaS application encounters failures: APIs that stop responding, databases that briefly go offline, external services that slow down under peak load. The difference between a professional SaaS and a hobby project lies in how you handle those failures.
In this article, we'll take a deep dive into the patterns and strategies that keep your SaaS reliable — even when everything around it is failing.
Why Error Handling Is Critical for SaaS
In a traditional web application, an error is annoying. In SaaS, it's potentially fatal. Your customers run their business processes on your platform. Downtime means lost revenue — not just for you, but for all your customers simultaneously.
Some numbers that underscore this:
- 53% of users abandon an app that takes more than 3 seconds to load
- A single minute of downtime can cost enterprise SaaS thousands of euros
- 80% of churn is caused by poor reliability, not missing features
The Retry Pattern: Retrying Intelligently
The simplest form of resilience is retrying. But naïve retries can make the problem worse: if every client retries immediately and in lockstep, you hammer a service that is already struggling.
Exponential Backoff with Jitter
```typescript
async function withRetry<T>(
  fn: () => Promise<T>,
  options: {
    maxRetries?: number;
    baseDelay?: number;
    maxDelay?: number;
  } = {}
): Promise<T> {
  const { maxRetries = 3, baseDelay = 1000, maxDelay = 30000 } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;

      // Exponential backoff with jitter
      const exponentialDelay = baseDelay * Math.pow(2, attempt);
      const jitter = Math.random() * exponentialDelay * 0.5;
      const delay = Math.min(exponentialDelay + jitter, maxDelay);

      console.warn(
        `Attempt ${attempt + 1} failed, retrying in ${Math.round(delay)}ms`
      );
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw new Error('Unreachable');
}

// Usage
const data = await withRetry(
  () => fetch('https://api.payment-provider.com/charge'),
  { maxRetries: 3, baseDelay: 500 }
);
```
Which Errors Should You Retry?
Not every error deserves a retry. Make a distinction:
| Retry | Don't retry |
|---|---|
| 429 Too Many Requests | 400 Bad Request |
| 500 Internal Server Error | 401 Unauthorized |
| 503 Service Unavailable | 403 Forbidden |
| Network timeouts | 404 Not Found |
| DNS resolution failures | 422 Validation Error |
Rule of thumb: retry transient errors (that may resolve on their own), not permanent errors (that will produce the same result every time).
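The rule of thumb can be turned into a small classifier that a retry loop consults before scheduling another attempt. A minimal sketch; the assumption that your HTTP client attaches a numeric `status` field to thrown errors may not match your stack, so adapt the shape accordingly:

```typescript
// Transient errors from the table above: rate limiting and server-side
// failures. Extend with 502/504 if your stack sees gateway errors.
const RETRYABLE_STATUS = new Set([429, 500, 503]);

function isRetryable(error: unknown): boolean {
  // fetch surfaces network-level failures (timeouts, DNS) as TypeError,
  // without any HTTP status; treat those as transient.
  if (error instanceof TypeError) return true;

  // Assumed shape: the HTTP client attaches a numeric `status`.
  const status = (error as { status?: number }).status;
  return status !== undefined && RETRYABLE_STATUS.has(status);
}
```

Inside the `catch` block of a retry loop, rethrow immediately when `isRetryable(error)` is false instead of burning the remaining attempts on a permanent error.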
The Circuit Breaker Pattern
Imagine a circuit breaker in your electrical panel. When there's a short circuit, the breaker trips to prevent further damage. The same principle works for API calls.
```typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure: number | null = null;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 5,
    private resetTimeout: number = 60000
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - (this.lastFailure || 0) > this.resetTimeout) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker is open — service temporarily unavailable');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
      console.error(`Circuit breaker opened after ${this.failures} failures`);
    }
  }
}

// One circuit breaker per external service
const paymentCircuit = new CircuitBreaker(5, 30000);
const emailCircuit = new CircuitBreaker(3, 60000);

// Usage
const charge = await paymentCircuit.execute(
  () => stripe.charges.create({ amount: 2000, currency: 'eur' })
);
```
The Three States of a Circuit Breaker
- Closed (normal): All requests go through. Failures are counted.
- Open (blocked): All requests fail immediately without contacting the service. This prevents you from further overloading an already stressed service.
- Half-open (testing): After a cooldown period, one request is allowed through to test if the service is available again.
Graceful Degradation: Better Half Than Nothing
Not every feature is equally important. If your payment provider goes down, that's critical. If your analytics service doesn't respond? Your app can keep running just fine.
```typescript
interface DashboardData {
  revenue: number;
  activeUsers: number;
  analytics?: AnalyticsData;   // Optional — not critical
  recommendations?: Product[]; // Optional — not critical
}

async function getDashboardData(tenantId: string): Promise<DashboardData> {
  // Critical data — must succeed
  const [revenue, activeUsers] = await Promise.all([
    getRevenue(tenantId),
    getActiveUsers(tenantId),
  ]);

  // Non-critical data — allowed to fail
  const [analytics, recommendations] = await Promise.allSettled([
    getAnalytics(tenantId),
    getRecommendations(tenantId),
  ]);

  return {
    revenue,
    activeUsers,
    analytics: analytics.status === 'fulfilled' ? analytics.value : undefined,
    recommendations:
      recommendations.status === 'fulfilled' ? recommendations.value : undefined,
  };
}
```
In your frontend, show a subtle notice: "Some data is temporarily unavailable" instead of a full error page.
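That notice can be derived directly from which optional fields came back `undefined`. A minimal sketch, assuming the optional-field convention of the `DashboardData` shape above:

```typescript
// List the non-critical sections that failed, so the UI can render one
// subtle banner instead of a full error page.
function degradedSections(data: {
  analytics?: unknown;
  recommendations?: unknown;
}): string[] {
  const sections: string[] = [];
  if (data.analytics === undefined) sections.push('analytics');
  if (data.recommendations === undefined) sections.push('recommendations');
  return sections;
}

// degradedSections({ recommendations: [] }) → ['analytics']
```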
Timeouts: The Forgotten Hero
A missing timeout is one of the most dangerous bugs in distributed systems. Without a timeout, your application hangs indefinitely waiting for a service that will never respond.
```typescript
async function fetchWithTimeout(
  url: string,
  options: RequestInit & { timeoutMs?: number } = {}
): Promise<Response> {
  const { timeoutMs = 5000, ...fetchOptions } = options;
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetch(url, {
      ...fetchOptions,
      signal: controller.signal,
    });
    return response;
  } catch (error) {
    if (error instanceof DOMException && error.name === 'AbortError') {
      throw new Error(`Request to ${url} exceeded ${timeoutMs}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timeout);
  }
}
```
Timeout Budgets
In a microservices architecture, you need to think about your total timeout budget:
- API Gateway: 30s total timeout
- Service A → Service B: 10s
- Service B → Database: 5s
- Service B → Cache: 500ms
Each layer must have a shorter timeout than the layer above it. Otherwise, you get cascading timeouts.
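One way to enforce that rule is to pass an absolute deadline down the call chain and let each layer derive its timeout from whatever budget remains. A sketch; the 10% safety margin for serialization and network overhead is an assumption:

```typescript
// A deadline is an absolute timestamp (ms since epoch). Each layer
// computes its own timeout from the time remaining, keeping a margin
// so it can still return a clean error before its caller gives up.
function remainingBudgetMs(deadline: number, margin = 0.1): number {
  const remaining = deadline - Date.now();
  if (remaining <= 0) throw new Error('Deadline already exceeded');
  return Math.floor(remaining * (1 - margin));
}

// Usage: the gateway sets one deadline; inner calls derive from it.
// const deadline = Date.now() + 30_000;               // gateway: 30s
// const serviceTimeout = remainingBudgetMs(deadline); // service A: ~27s
```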
Dead Letter Queues: No Message Left Behind
For asynchronous processing (webhooks, background jobs), dead letter queues are essential. If a message can't be processed after multiple attempts, you store it for later inspection.
```typescript
async function processWebhook(event: WebhookEvent): Promise<void> {
  const MAX_ATTEMPTS = 3;

  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await handleEvent(event);
      return; // Success
    } catch (error) {
      console.error(`Webhook attempt ${attempt}/${MAX_ATTEMPTS} failed:`, error);

      if (attempt === MAX_ATTEMPTS) {
        // Send to dead letter queue
        await db.deadLetterQueue.create({
          data: {
            eventType: event.type,
            payload: JSON.stringify(event),
            error: error instanceof Error ? error.message : String(error),
            failedAt: new Date(),
            attempts: MAX_ATTEMPTS,
          },
        });

        // Alert the team
        await notifyOpsTeam(
          `Webhook ${event.type} permanently failed after ${MAX_ATTEMPTS} attempts`
        );
      }
    }
  }
}
```
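The other half of the pattern is replay: once the underlying issue is fixed, an operator re-runs the stored messages. A sketch decoupled from any particular database; the `DeadLetter` shape and the handler signature are assumptions:

```typescript
type DeadLetter = { id: string; payload: string };

// Replay each dead letter and return the ids that succeeded, so the
// caller can delete those rows. Messages that still fail stay in the
// queue for a later pass.
async function replayDeadLetters(
  letters: DeadLetter[],
  handle: (payload: string) => Promise<void>
): Promise<string[]> {
  const replayed: string[] = [];
  for (const letter of letters) {
    try {
      await handle(letter.payload);
      replayed.push(letter.id);
    } catch {
      // Still failing; leave it in the queue.
    }
  }
  return replayed;
}
```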
Idempotency: Safe to Reprocess
If you implement retries, you must guarantee that executing the same operation multiple times is safe. This is called idempotency.
```typescript
async function processPayment(idempotencyKey: string, amount: number) {
  // Check if we've already processed this
  const existing = await db.payment.findUnique({
    where: { idempotencyKey },
  });
  if (existing) {
    console.log(`Payment ${idempotencyKey} already processed, skipping`);
    return existing;
  }

  // Process the payment
  const payment = await db.payment.create({
    data: {
      idempotencyKey,
      amount,
      status: 'pending',
    },
  });

  try {
    const charge = await stripe.charges.create(
      { amount, currency: 'eur' },
      { idempotencyKey }
    );
    return await db.payment.update({
      where: { id: payment.id },
      data: { status: 'completed', stripeChargeId: charge.id },
    });
  } catch (error) {
    await db.payment.update({
      where: { id: payment.id },
      data: {
        status: 'failed',
        error: error instanceof Error ? error.message : String(error),
      },
    });
    throw error;
  }
}
```
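Note that check-then-create has a race: two concurrent requests with the same key can both pass the lookup before either inserts. A unique constraint on `idempotencyKey` closes the gap; create first, then recover from the violation. A generic sketch; the predicate for detecting the violation is an assumption you map to your driver (Prisma reports code `P2002`, Postgres SQLSTATE `23505`):

```typescript
// Create-first idempotency: attempt the insert, and if the unique
// constraint fires, another request won the race, so return its row.
async function findOrCreate<T>(
  create: () => Promise<T>,
  findExisting: () => Promise<T>,
  isUniqueViolation: (e: unknown) => boolean
): Promise<T> {
  try {
    return await create();
  } catch (e) {
    if (isUniqueViolation(e)) return findExisting();
    throw e;
  }
}
```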
Health Checks and Readiness Probes
Your application should be able to communicate whether it's healthy:
```typescript
interface HealthCheck {
  status: 'ok' | 'error';
  latencyMs?: number;
  message?: string;
}

// Health check endpoint
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    stripe: await checkStripe(),
  };
  const healthy = Object.values(checks).every(c => c.status === 'ok');

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});

async function checkDatabase(): Promise<HealthCheck> {
  try {
    const start = Date.now();
    await db.$queryRaw`SELECT 1`;
    return { status: 'ok', latencyMs: Date.now() - start };
  } catch {
    return { status: 'error', message: 'Database unreachable' };
  }
}
```
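Orchestrators like Kubernetes distinguish liveness from readiness, and the two deserve separate logic: liveness never inspects dependencies (a dependency outage should not get the process restarted), while readiness gates traffic on the hard ones. A sketch of the readiness decision; which dependencies count as hard is an assumption:

```typescript
type ProbeResult = { status: 'ok' | 'error' };

// Return the HTTP status a /readyz endpoint should serve: 200 when all
// hard dependencies pass, 503 otherwise. Soft dependencies (analytics,
// email) are ignored so their outages don't drain traffic.
function readinessStatus(
  checks: Record<string, ProbeResult>,
  hardDependencies: string[]
): number {
  const ready = hardDependencies.every(
    (name) => checks[name]?.status === 'ok'
  );
  return ready ? 200 : 503;
}
```

Wire this into a `/readyz` route next to `/health`, and keep a trivial `/livez` that returns 200 for as long as the process is up.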
Practical Checklist for Your SaaS
Before going live, run through this checklist:
- Retries with exponential backoff on all external API calls
- Circuit breakers on critical dependencies
- Timeouts on every network call (no exceptions!)
- Graceful degradation for non-critical features
- Dead letter queues for asynchronous processing
- Idempotency keys on all state-changing operations
- Health check endpoints for monitoring and orchestration
- Structured logging so you can trace errors
- Alerting on circuit breaker state changes
- Runbook for the team on common failure scenarios
Conclusion
Error handling isn't an afterthought — it's a core feature of every serious SaaS application. The patterns in this article (retries, circuit breakers, graceful degradation, idempotency) are proven techniques used by companies like Netflix, Stripe, and AWS.
Start simple: add timeouts and retries to your external calls. Then build circuit breakers around your most critical dependencies. And implement graceful degradation so your users always see a working application — even when not everything is running perfectly behind the scenes.
Your users don't need to know that something is going wrong behind the curtain. They just need to notice that your app keeps working.