DNS resolver degradation - Talkspirit & Holaspirit services unavailable

Incident Report for Talkspirit

Postmortem

Public Postmortem — April 2, 2026

Incident: Service unavailability on Talkspirit and Holaspirit

Summary

On April 2, 2026, Talkspirit and Holaspirit experienced two brief periods of service unavailability. Users were unable to access the platforms for approximately 5-8 minutes during each event.

Timeline (UTC)

Time     Event
07:00    First incident — services unavailable for ~5 minutes, self-recovered
13:34    Second incident — connectivity issues detected
13:37    User-facing impact begins — platform access disrupted
13:39    Core services begin recovering (API, accounts)
13:42    All services fully restored

What happened

An internal DNS resolver that our infrastructure components rely on to locate one another became temporarily overloaded. When DNS resolution slowed down, our load balancers could no longer look up and reach the backend application servers, causing requests to fail.

The issue was intermittent — the DNS resolver was not down but was dropping a portion of queries during short bursts of high demand. This caused a cascading effect where health checks failed, backend servers were marked unavailable, and user requests were rejected.
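To illustrate that cascade, the sketch below shows how a health check that has to resolve a backend's name before connecting will report a perfectly healthy server as down whenever the lookup misses its deadline. This is not our load balancer's actual code; the hostnames, port, and timeouts are hypothetical.

import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError

DNS_TIMEOUT_S = 1.0       # hypothetical per-check resolution budget
CONNECT_TIMEOUT_S = 2.0   # hypothetical TCP connect budget

_resolver_pool = ThreadPoolExecutor(max_workers=4)

def resolve_with_deadline(hostname: str) -> str | None:
    """Resolve a hostname, giving up after DNS_TIMEOUT_S seconds."""
    future = _resolver_pool.submit(socket.gethostbyname, hostname)
    try:
        return future.result(timeout=DNS_TIMEOUT_S)
    except (TimeoutError, socket.gaierror):
        return None   # resolver too slow, or the name could not be resolved

def backend_is_healthy(hostname: str, port: int) -> bool:
    """A slow or dropped DNS answer fails the check on its own, even though
    the application server behind the name is perfectly healthy."""
    address = resolve_with_deadline(hostname)
    if address is None:
        return False          # backend gets marked unavailable
    try:
        with socket.create_connection((address, port), timeout=CONNECT_TIMEOUT_S):
            return True
    except OSError:
        return False

# Example with a hypothetical backend name used only for illustration.
print(backend_is_healthy("app-backend.internal.example", 443))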

Services recovered automatically as DNS resolution stabilized and load balancers re-enabled backend servers.

What was not affected

  • No data was lost or corrupted
  • No security breach occurred
  • Internal application servers remained healthy throughout
  • The issue was limited to network-level routing, not application logic

Root cause

The DNS resolver serving our production infrastructure had insufficient queue capacity for the volume of queries generated by our growing number of services. During peak query bursts, the resolver's internal queue overflowed, causing DNS lookups to time out silently. This prevented our load balancers from reaching application servers.
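The sketch below is a simplified illustration of that failure mode, not the resolver's real implementation; the queue size, service time, and timeout are invented numbers. It shows how a bounded queue that silently drops excess work looks, from the client's side, like lookups that occasionally time out rather than a resolver that is down.

import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

QUEUE_CAPACITY = 8        # hypothetical resolver queue size
SERVICE_TIME_S = 0.05     # hypothetical time to answer one query
CLIENT_TIMEOUT_S = 0.5    # hypothetical client-side lookup timeout

pending: queue.Queue = queue.Queue(maxsize=QUEUE_CAPACITY)

def resolver_worker() -> None:
    """Drain the queue, 'answering' each query after SERVICE_TIME_S."""
    while True:
        _name, answered = pending.get()
        time.sleep(SERVICE_TIME_S)
        answered.set()

def lookup(name: str) -> bool:
    """Return True if the query is answered before CLIENT_TIMEOUT_S."""
    answered = threading.Event()
    try:
        pending.put_nowait((name, answered))  # dropped when the queue is full
    except queue.Full:
        pass                                  # the client is never told why
    return answered.wait(timeout=CLIENT_TIMEOUT_S)

threading.Thread(target=resolver_worker, daemon=True).start()

# A burst larger than the queue can absorb: early queries succeed, the rest
# are dropped and simply time out - the resolver is never actually "down".
names = [f"svc-{i}.internal.example" for i in range(30)]
with ThreadPoolExecutor(max_workers=len(names)) as clients:
    results = list(clients.map(lookup, names))
print(f"answered {sum(results)} of {len(results)} lookups")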

What we are doing about it

  • Monitoring improvements: We are deploying dedicated DNS resolver monitoring with real-time alerting on query failures and queue saturation. This was previously a blind spot.
  • Capacity tuning: We are increasing the DNS resolver's processing capacity to handle peak query volumes with adequate headroom.
  • Reducing DNS load: We are cleaning up unnecessary DNS traffic from deprecated services that were still generating queries.
  • DNS caching: We are implementing DNS caching at the local level, so repeated lookups of the same name no longer depend on the central resolver for every request (a minimal sketch of the idea follows this list).
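The following sketch illustrates the local-caching idea, assuming a simple per-process cache with a fixed TTL; the actual deployment would more likely use a host-level caching resolver, and the TTL and hostname below are illustrative only.

import socket
import time

CACHE_TTL_S = 30.0                              # hypothetical cache lifetime
_cache: dict[str, tuple[str, float]] = {}       # name -> (address, expiry)

def cached_resolve(hostname: str) -> str:
    """Serve hot names from the local cache so that a short resolver
    slowdown does not turn every request into a failed lookup."""
    now = time.monotonic()
    entry = _cache.get(hostname)
    if entry is not None and entry[1] > now:
        return entry[0]                         # cache hit, resolver not touched
    address = socket.gethostbyname(hostname)    # only on miss or expiry
    _cache[hostname] = (address, now + CACHE_TTL_S)
    return address

# The second call is answered from the cache instead of the resolver.
print(cached_resolve("example.com"))
print(cached_resolve("example.com"))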

Lessons learned

This incident highlighted that our DNS infrastructure — a critical dependency for all services — lacked dedicated monitoring and alerting. While our application-level monitoring detected the outage within minutes, we had no visibility into the DNS resolver's internal health. We are addressing this gap as a priority.

We apologize for the disruption and are committed to preventing recurrence.

Posted Apr 02, 2026 - 19:06 CEST

Resolved

Talkspirit endpoints are unreachable
Posted Apr 02, 2026 - 01:00 CEST