
Load balancer incident on March 25, 2026

Mar 25, 2026 at 5:46am UTC
Affected services
bpo.idocus.com
my.idocus.com


Incident Post-Mortem — March 25, 2026
Incident Title: Network failure causing load balancer split brain and virtual IP flapping
Severity: Critical
Date: March 25, 2026
Duration: 17 minutes (06:18 – 06:35 UTC)
Status: Resolved

Summary
On March 25, 2026 at 06:18 UTC, a network disruption caused a split-brain condition in the load balancer cluster, resulting in virtual IP (VIP) flapping between nodes. Traffic was routed inconsistently or dropped entirely, making the affected services unavailable. The issue was detected via monitoring alerts, investigated, and resolved within 17 minutes.

Timeline

06:18 — Incident detected. Monitoring alerts fired on VIP flapping and health check failures across the load balancer cluster.
06:25 — Incident acknowledged. On-call engineer picked up the alert and began initial triage.
06:32 — Investigation in progress. Root cause identified as a network partition between load balancer nodes, causing both nodes to claim VIP ownership simultaneously (split-brain). VRRP/Keepalived heartbeats were failing due to upstream network instability.
06:35 — Incident resolved. Network connectivity was restored, the VRRP election reconverged, and a single master was re-established. Traffic resumed normal routing. Health checks confirmed full service recovery.

Root Cause
A transient network failure disrupted communication between the HAProxy/Keepalived nodes in the load balancer cluster. With VRRP heartbeats no longer reaching the backup node, both nodes promoted themselves to master and began announcing the same virtual IP. This split-brain condition caused VIP flapping, leading to inconsistent traffic routing and intermittent client-facing errors.
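For context, the VRRP election described above is driven by a Keepalived instance definition along the lines of the sketch below. This is an illustrative configuration, not the one from the affected cluster; interface name, router ID, priority, and addresses are assumptions. The master multicasts an advertisement every advert_int seconds; when the backup stops receiving them, it promotes itself, which is what happened on both sides of the partition.

```
vrrp_instance VI_1 {
    state MASTER            # this node's preferred role
    interface eth0          # interface carrying VRRP advertisements (illustrative)
    virtual_router_id 51    # must match on both nodes (illustrative)
    priority 150            # higher priority wins the master election
    advert_int 1            # master advertises every 1 second
    virtual_ipaddress {
        203.0.113.10/24     # the VIP that flapped (illustrative address)
    }
}
```

By default these advertisements travel over multicast, so anything that disrupts multicast delivery between the nodes, as the upstream instability did here, silences the heartbeat and triggers a promotion.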

Impact

Intermittent HTTP 502/503 errors for end users during the 17-minute window, followed by a period of complete service unavailability while both nodes announced the VIP.

What Went Well

Fast detection: alerting triggered within seconds of the first VIP flap.
Quick acknowledgment and focused triage — root cause was identified in under 10 minutes.
Total time to resolution was 17 minutes.

What Could Be Improved

The cluster had no split-brain fencing mechanism in place (e.g., no unicast VRRP fallback, no quorum-based arbitration).
No automated remediation kicked in to force a single-master state during the flapping window.
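As a sketch of the missing unicast fallback: Keepalived can be told to send advertisements directly to a named peer instead of relying on multicast, so heartbeat delivery depends only on point-to-point reachability. The addresses below are illustrative, not the cluster's real ones.

```
vrrp_instance VI_1 {
    # (state, interface, virtual_router_id, priority, advert_int,
    #  and virtual_ipaddress as in the base instance)

    # Send VRRP advertisements directly to the peer instead of via
    # multicast, which the upstream network was dropping.
    unicast_src_ip 10.0.0.1     # this node (illustrative)
    unicast_peer {
        10.0.0.2                # the other load balancer node (illustrative)
    }
}
```

Unicast peering alone does not arbitrate a true partition, which is why a quorum or witness mechanism is still listed separately below.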

Action Items

Implement a fencing/quorum mechanism to prevent dual-master scenarios (e.g., add a third lightweight witness node or configure unicast VRRP peers as fallback).
Add automated remediation: if VIP flapping is detected for more than N seconds, force-demote one node.
Add a runbook entry for split-brain recovery procedures.
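The automated-remediation item could be prototyped along the lines of the minimal sketch below: a detector that counts VIP master transitions in a sliding time window and flags when a node should be force-demoted. The class name, window size, and threshold are assumptions for illustration; the actual demotion hook (e.g. lowering Keepalived's priority or stopping the service on one node) would be wired in by the operator.

```python
from collections import deque


class VipFlapDetector:
    """Flag VIP flapping: too many master transitions inside a time window.

    Hypothetical helper sketching the remediation action item; the
    thresholds here are illustrative, not tuned production values.
    """

    def __init__(self, window_seconds=30.0, max_transitions=3):
        self.window = window_seconds
        self.max_transitions = max_transitions
        self.events = deque()  # timestamps of observed master transitions

    def record_transition(self, ts):
        """Record a master transition at time ts (seconds), pruning old events."""
        self.events.append(ts)
        while self.events and ts - self.events[0] > self.window:
            self.events.popleft()

    def should_demote(self):
        """True when transitions in the current window exceed the threshold."""
        return len(self.events) > self.max_transitions
```

A monitoring loop would call record_transition on each observed master change and, once should_demote returns True, trigger the demotion hook and page the on-call engineer.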