Incident: Service unavailability on Talkspirit and Holaspirit
On April 2, 2026, Talkspirit and Holaspirit experienced two brief periods of service unavailability. Users were unable to access the platforms for approximately 5-8 minutes during each event.
| Time | Event |
|---|---|
| 07:00 | First incident — services unavailable for ~5 minutes, self-recovered |
| 13:34 | Second incident — connectivity issues detected |
| 13:37 | User-facing impact begins — platform access disrupted |
| 13:39 | Core services begin recovering (API, accounts) |
| 13:42 | All services fully restored |
An internal DNS resolver, which our infrastructure components rely on to locate one another, became temporarily overloaded. When DNS resolution slowed down, our load balancers could no longer resolve the addresses of backend application servers, causing requests to fail.
The issue was intermittent: the resolver never went fully down, but it dropped a portion of queries during short bursts of high demand. This set off a cascade, as sketched below: health checks that depend on DNS resolution failed, backend servers were marked unavailable, and user requests were rejected.
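To make the cascade concrete, here is a minimal, hypothetical simulation. Every name and number in it (the drop rate, the failure threshold, the `backend.internal` hostname) is illustrative and does not come from our actual configuration.

```python
import random

DROP_RATE_DURING_BURST = 0.5   # illustrative: fraction of queries dropped in a burst
UNHEALTHY_THRESHOLD = 3        # illustrative: consecutive failures before ejection

def dns_resolve(name: str, burst: bool) -> bool:
    """Return True if the query is answered, False if silently dropped."""
    return not (burst and random.random() < DROP_RATE_DURING_BURST)

def run_health_checks(checks: int, burst: bool) -> None:
    state, consecutive_failures = "up", 0
    for i in range(checks):
        if dns_resolve("backend.internal", burst):
            consecutive_failures = 0
            if state == "down":
                # Resolution succeeds again: the load balancer re-enables
                # the backend, which is how the services self-recovered.
                state = "up"
                print(f"check {i}: backend marked UP again")
        else:
            consecutive_failures += 1
            if state == "up" and consecutive_failures >= UNHEALTHY_THRESHOLD:
                # The backend itself is healthy; only its name lookup failed.
                state = "down"
                print(f"check {i}: backend marked DOWN (DNS failure, not the backend)")

if __name__ == "__main__":
    random.seed(7)
    run_health_checks(checks=50, burst=True)
```

The point of the toy model is that a backend can flap between up and down purely on resolver behavior, which matches the intermittent, self-recovering pattern we observed.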
Services recovered automatically as DNS resolution stabilized and load balancers re-enabled backend servers.
The DNS resolver serving our production infrastructure had insufficient queue capacity for the volume of queries generated by our growing number of services. During peak query bursts, the resolver's internal queue overflowed and excess queries were silently dropped; from the client's perspective, lookups simply timed out with no error response. This prevented our load balancers from reaching application servers.
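The mechanics of a bounded queue overflowing can be sketched in a few lines. This is purely illustrative: the queue size and burst size below are made up, and the real resolver's internals are not detailed in this report.

```python
import queue

def enqueue_query(q: "queue.Queue[str]", name: str) -> bool:
    """Accept a query if there is room; otherwise drop it silently."""
    try:
        q.put_nowait(name)
        return True
    except queue.Full:
        # Nothing goes back to the client: from the load balancer's point
        # of view the lookup never completes and simply times out.
        return False

resolver_queue: "queue.Queue[str]" = queue.Queue(maxsize=100)
accepted = sum(enqueue_query(resolver_queue, f"svc-{i}.internal") for i in range(250))
print(f"accepted={accepted}, dropped={250 - accepted}")  # accepted=100, dropped=150
```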
This incident highlighted that our DNS infrastructure — a critical dependency for all services — lacked dedicated monitoring and alerting. While our application-level monitoring detected the outage within minutes, we had no visibility into the DNS resolver's internal health. We are addressing this gap as a priority.
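One straightforward way to gain that visibility is to probe the resolver directly and alert on slow or failed lookups, independently of application traffic. The sketch below shows the general shape of such a probe; the record names, latency budget, and `alert()` hook are hypothetical placeholders, not a description of our monitoring stack.

```python
import socket
import time

PROBE_NAMES = ["api.internal", "accounts.internal"]  # hypothetical records
LATENCY_BUDGET_S = 0.2                               # illustrative threshold

def alert(message: str) -> None:
    # Placeholder: in practice this would emit a metric or page on-call.
    print(f"[dns-probe] {message}")

def probe_resolver() -> None:
    for name in PROBE_NAMES:
        start = time.monotonic()
        try:
            socket.getaddrinfo(name, 443, type=socket.SOCK_STREAM)
        except socket.gaierror as exc:
            alert(f"resolution failed for {name}: {exc}")
            continue
        elapsed = time.monotonic() - start
        if elapsed > LATENCY_BUDGET_S:
            alert(f"slow resolution for {name}: {elapsed:.3f}s")

if __name__ == "__main__":
    probe_resolver()
```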
We apologize for the disruption and are committed to preventing recurrence.