Every SaaS application encounters failures: APIs that stop responding, databases that briefly go offline, external services that slow down under peak load. The difference between a professional SaaS and a hobby project lies in how you handle those failures.
In this article, we'll take a deep dive into the patterns and strategies that keep your SaaS reliable — even when everything around it is failing.
Why Error Handling Is Critical for SaaS
In a traditional web application, an error is annoying. In SaaS, it's potentially fatal. Your customers run their business processes on your platform. Downtime means lost revenue — not just for you, but for all your customers simultaneously.
Some numbers that underscore this:
- 53% of users abandon an app that takes more than 3 seconds to load
- A single minute of downtime can cost enterprise SaaS thousands of euros
- 80% of churn is caused by poor reliability, not missing features
The Retry Pattern: Retrying Intelligently
The simplest form of resilience is retrying. But naïve retries can make the problem worse: if every client retries immediately and in lockstep, you hammer a service that is already struggling.
Exponential Backoff with Jitter
```typescript
async function withRetry<T>(
  fn: () => Promise<T>,
  options: {
    maxRetries?: number;
    baseDelay?: number;
    maxDelay?: number;
  } = {}
): Promise<T> {
  const { maxRetries = 3, baseDelay = 1000, maxDelay = 30000 } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error;

      // Exponential backoff with jitter
      const exponentialDelay = baseDelay * Math.pow(2, attempt);
      const jitter = Math.random() * exponentialDelay * 0.5;
      const delay = Math.min(exponentialDelay + jitter, maxDelay);

      console.warn(
        `Attempt ${attempt + 1} failed, retrying in ${Math.round(delay)}ms`
      );
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw new Error('Unreachable');
}

// Usage
const data = await withRetry(
  () => fetch('https://api.payment-provider.com/charge'),
  { maxRetries: 3, baseDelay: 500 }
);
```
Which Errors Should You Retry?
Not every error deserves a retry. Make a distinction:
| Retry | Don't retry |
|---|---|
| 429 Too Many Requests | 400 Bad Request |
| 500 Internal Server Error | 401 Unauthorized |
| 503 Service Unavailable | 403 Forbidden |
| Network timeouts | 404 Not Found |
| DNS resolution failures | 422 Validation Error |
Rule of thumb: retry transient errors (that may resolve on their own), not permanent errors (that will produce the same result every time).
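The rule of thumb can be turned into a small classifier that a retry loop consults before scheduling another attempt. A minimal sketch; the assumption that your HTTP client attaches a numeric `status` field to thrown errors may not match your stack, so adapt the shape accordingly:

```typescript
// Transient errors from the table above: rate limiting and server-side
// failures. Extend with 502/504 if your stack sees gateway errors.
const RETRYABLE_STATUS = new Set([429, 500, 503]);

function isRetryable(error: unknown): boolean {
  // fetch surfaces network-level failures (timeouts, DNS) as TypeError,
  // without any HTTP status; treat those as transient.
  if (error instanceof TypeError) return true;

  // Assumed shape: the HTTP client attaches a numeric `status`.
  const status = (error as { status?: number }).status;
  return status !== undefined && RETRYABLE_STATUS.has(status);
}
```

Inside the `catch` block of a retry loop, rethrow immediately when `isRetryable(error)` is false instead of burning the remaining attempts on a permanent error.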
The Circuit Breaker Pattern
Imagine a circuit breaker in your electrical panel. When there's a short circuit, the breaker trips to prevent further damage. The same principle works for API calls.
```typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure: number | null = null;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 5,
    private resetTimeout: number = 60000
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - (this.lastFailure || 0) > this.resetTimeout) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker is open — service temporarily unavailable');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
      console.error(`Circuit breaker opened after ${this.failures} failures`);
    }
  }
}

// One circuit breaker per external service
const paymentCircuit = new CircuitBreaker(5, 30000);
const emailCircuit = new CircuitBreaker(3, 60000);

// Usage
const charge = await paymentCircuit.execute(
  () => stripe.charges.create({ amount: 2000, currency: 'eur' })
);
```
The Three States of a Circuit Breaker
- Closed (normal): All requests go through. Failures are counted.
- Open (blocked): All requests fail immediately without contacting the service. This prevents you from further overloading an already stressed service.
- Half-open (testing): After a cooldown period, one request is allowed through to test if the service is available again.
Graceful Degradation: Better Half Than Nothing
Not every feature is equally important. If your payment provider goes down, that's critical. If your analytics service doesn't respond? Your app can keep running just fine.
```typescript
interface DashboardData {
  revenue: number;
  activeUsers: number;
  analytics?: AnalyticsData;   // Optional — not critical
  recommendations?: Product[]; // Optional — not critical
}

async function getDashboardData(tenantId: string): Promise<DashboardData> {
  // Critical data — must succeed
  const [revenue, activeUsers] = await Promise.all([
    getRevenue(tenantId),
    getActiveUsers(tenantId),
  ]);

  // Non-critical data — allowed to fail
  const [analytics, recommendations] = await Promise.allSettled([
    getAnalytics(tenantId),
    getRecommendations(tenantId),
  ]);

  return {
    revenue,
    activeUsers,
    analytics: analytics.status === 'fulfilled' ? analytics.value : undefined,
    recommendations:
      recommendations.status === 'fulfilled' ? recommendations.value : undefined,
  };
}
```
In your frontend, show a subtle notice: "Some data is temporarily unavailable" instead of a full error page.
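That notice can be derived directly from which optional fields came back `undefined`. A minimal sketch, assuming the optional-field convention of the `DashboardData` shape above:

```typescript
// List the non-critical sections that failed, so the UI can render one
// subtle banner instead of a full error page.
function degradedSections(data: {
  analytics?: unknown;
  recommendations?: unknown;
}): string[] {
  const sections: string[] = [];
  if (data.analytics === undefined) sections.push('analytics');
  if (data.recommendations === undefined) sections.push('recommendations');
  return sections;
}

// degradedSections({ recommendations: [] }) → ['analytics']
```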
Timeouts: The Forgotten Hero
A missing timeout is one of the most dangerous bugs in distributed systems. Without a timeout, your application hangs indefinitely waiting for a service that will never respond.
```typescript
async function fetchWithTimeout(
  url: string,
  options: RequestInit & { timeoutMs?: number } = {}
): Promise<Response> {
  const { timeoutMs = 5000, ...fetchOptions } = options;
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetch(url, {
      ...fetchOptions,
      signal: controller.signal,
    });
    return response;
  } catch (error) {
    if (error instanceof DOMException && error.name === 'AbortError') {
      throw new Error(`Request to ${url} exceeded ${timeoutMs}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timeout);
  }
}
```
Timeout Budgets
In a microservices architecture, you need to think about your total timeout budget:
- API Gateway: 30s total timeout
- Service A → Service B: 10s
- Service B → Database: 5s
- Service B → Cache: 500ms
Each layer must have a shorter timeout than the layer above it. Otherwise, you get cascading timeouts.
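One way to enforce that rule is to pass an absolute deadline down the call chain and let each layer derive its timeout from whatever budget remains. A sketch; the 10% safety margin for serialization and network overhead is an assumption:

```typescript
// A deadline is an absolute timestamp (ms since epoch). Each layer
// computes its own timeout from the time remaining, keeping a margin
// so it can still return a clean error before its caller gives up.
function remainingBudgetMs(deadline: number, margin = 0.1): number {
  const remaining = deadline - Date.now();
  if (remaining <= 0) throw new Error('Deadline already exceeded');
  return Math.floor(remaining * (1 - margin));
}

// Usage: the gateway sets one deadline; inner calls derive from it.
// const deadline = Date.now() + 30_000;               // gateway: 30s
// const serviceTimeout = remainingBudgetMs(deadline); // service A: ~27s
```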
Dead Letter Queues: No Message Left Behind
For asynchronous processing (webhooks, background jobs), dead letter queues are essential. If a message can't be processed after multiple attempts, you store it for later inspection.
```typescript
async function processWebhook(event: WebhookEvent): Promise<void> {
  const MAX_ATTEMPTS = 3;

  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await handleEvent(event);
      return; // Success
    } catch (error) {
      console.error(`Webhook attempt ${attempt}/${MAX_ATTEMPTS} failed:`, error);

      if (attempt === MAX_ATTEMPTS) {
        // Send to dead letter queue
        await db.deadLetterQueue.create({
          data: {
            eventType: event.type,
            payload: JSON.stringify(event),
            error: error instanceof Error ? error.message : String(error),
            failedAt: new Date(),
            attempts: MAX_ATTEMPTS,
          },
        });

        // Alert the team
        await notifyOpsTeam(
          `Webhook ${event.type} permanently failed after ${MAX_ATTEMPTS} attempts`
        );
      }
    }
  }
}
```
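The other half of the pattern is replay: once the underlying issue is fixed, an operator re-runs the stored messages. A sketch decoupled from any particular database; the `DeadLetter` shape and the handler signature are assumptions:

```typescript
type DeadLetter = { id: string; payload: string };

// Replay each dead letter and return the ids that succeeded, so the
// caller can delete those rows. Messages that still fail stay in the
// queue for a later pass.
async function replayDeadLetters(
  letters: DeadLetter[],
  handle: (payload: string) => Promise<void>
): Promise<string[]> {
  const replayed: string[] = [];
  for (const letter of letters) {
    try {
      await handle(letter.payload);
      replayed.push(letter.id);
    } catch {
      // Still failing; leave it in the queue.
    }
  }
  return replayed;
}
```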
Idempotency: Safe to Reprocess
If you implement retries, you must guarantee that executing the same operation multiple times is safe. This is called idempotency.
```typescript
async function processPayment(idempotencyKey: string, amount: number) {
  // Check if we've already processed this
  const existing = await db.payment.findUnique({
    where: { idempotencyKey },
  });
  if (existing) {
    console.log(`Payment ${idempotencyKey} already processed, skipping`);
    return existing;
  }

  // Process the payment
  const payment = await db.payment.create({
    data: {
      idempotencyKey,
      amount,
      status: 'pending',
    },
  });

  try {
    const charge = await stripe.charges.create(
      { amount, currency: 'eur' },
      { idempotencyKey }
    );
    return await db.payment.update({
      where: { id: payment.id },
      data: { status: 'completed', stripeChargeId: charge.id },
    });
  } catch (error) {
    await db.payment.update({
      where: { id: payment.id },
      data: {
        status: 'failed',
        error: error instanceof Error ? error.message : String(error),
      },
    });
    throw error;
  }
}
```
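Note that check-then-create has a race: two concurrent requests with the same key can both pass the lookup before either inserts. A unique constraint on `idempotencyKey` closes the gap; create first, then recover from the violation. A generic sketch; the predicate for detecting the violation is an assumption you map to your driver (Prisma reports code `P2002`, Postgres SQLSTATE `23505`):

```typescript
// Create-first idempotency: attempt the insert, and if the unique
// constraint fires, another request won the race, so return its row.
async function findOrCreate<T>(
  create: () => Promise<T>,
  findExisting: () => Promise<T>,
  isUniqueViolation: (e: unknown) => boolean
): Promise<T> {
  try {
    return await create();
  } catch (e) {
    if (isUniqueViolation(e)) return findExisting();
    throw e;
  }
}
```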
Health Checks and Readiness Probes
Your application should be able to communicate whether it's healthy:
```typescript
interface HealthCheck {
  status: 'ok' | 'error';
  latencyMs?: number;
  message?: string;
}

// Health check endpoint
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    stripe: await checkStripe(),
  };
  const healthy = Object.values(checks).every(c => c.status === 'ok');

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});

async function checkDatabase(): Promise<HealthCheck> {
  try {
    const start = Date.now();
    await db.$queryRaw`SELECT 1`;
    return { status: 'ok', latencyMs: Date.now() - start };
  } catch {
    return { status: 'error', message: 'Database unreachable' };
  }
}
```
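Orchestrators like Kubernetes distinguish liveness from readiness, and the two deserve separate logic: liveness never inspects dependencies (a dependency outage should not get the process restarted), while readiness gates traffic on the hard ones. A sketch of the readiness decision; which dependencies count as hard is an assumption:

```typescript
type ProbeResult = { status: 'ok' | 'error' };

// Return the HTTP status a /readyz endpoint should serve: 200 when all
// hard dependencies pass, 503 otherwise. Soft dependencies (analytics,
// email) are ignored so their outages don't drain traffic.
function readinessStatus(
  checks: Record<string, ProbeResult>,
  hardDependencies: string[]
): number {
  const ready = hardDependencies.every(
    (name) => checks[name]?.status === 'ok'
  );
  return ready ? 200 : 503;
}
```

Wire this into a `/readyz` route next to `/health`, and keep a trivial `/livez` that returns 200 for as long as the process is up.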
Practical Checklist for Your SaaS
Before going live, run through this checklist:
- Retries with exponential backoff on all external API calls
- Circuit breakers on critical dependencies
- Timeouts on every network call (no exceptions!)
- Graceful degradation for non-critical features
- Dead letter queues for asynchronous processing
- Idempotency keys on all state-changing operations
- Health check endpoints for monitoring and orchestration
- Structured logging so you can trace errors
- Alerting on circuit breaker state changes
- Runbook for the team on common failure scenarios
Conclusion
Error handling isn't an afterthought — it's a core feature of every serious SaaS application. The patterns in this article (retries, circuit breakers, graceful degradation, idempotency) are proven techniques used by companies like Netflix, Stripe, and AWS.
Start simple: add timeouts and retries to your external calls. Then build circuit breakers around your most critical dependencies. And implement graceful degradation so your users always see a working application — even when not everything is running perfectly behind the scenes.
Your users don't need to know that something is going wrong behind the curtain. They just need to notice that your app keeps working.