RCA - Query and Write Outage on May 28, 2025
Summary
On May 28, 2025, at 14:33 UTC, we deployed an infrastructure change to multiple production InfluxDB Cloud (TSM and IOx) clusters. At 15:07 UTC, the continuous deployment (CD) process required a manual step to update an immutable Kubernetes resource, which involves deleting and recreating the resource. Steps like this are not uncommon and had previously been executed in staging environments without issue.
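For context, a manual step of this kind typically amounts to deleting the immutable resource and re-applying its manifest from source control. The resource type, names, and paths below are hypothetical, not the actual objects involved:

```shell
# Hypothetical sketch of the manual CD step: Kubernetes rejects in-place
# updates to immutable fields, so the resource is deleted and re-created.
# The resource type, names, and namespace are illustrative only.
kubectl delete serviceaccount app-auth --namespace influxdb

# Re-apply the manifest from the IaC repository to recreate the resource.
kubectl apply -f infra/auth/service-account.yaml --namespace influxdb
```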
Shortly after the deployment completed, an alert at 15:09 UTC indicated that certain microservices were unable to authenticate with each other across multiple production clusters. These services are critical to data ingestion and querying, and the authentication failure prevented them from processing incoming writes and queries.
Writes and queries began to fail at 15:09 UTC and remained unavailable until authentication paths were restored on each affected cluster. The outage duration differed for each cluster because each cluster was examined individually to confirm our diagnosis, apply a fix, and ensure all services restored communication properly.
The full outage timeframe ranged from 15:09 UTC to 17:43 UTC, with the first cluster fully recovering at 17:04 UTC. We define a full recovery as the point when a cluster can accept writes, return queries, and resume normal software deployments.
Cause of the Incident
Our software is deployed via a CD pipeline to three staging clusters (one per cloud provider), where a suite of automated tests is run. If the tests pass, the change is deployed simultaneously to all production clusters. This is our standard cloud service deployment process.
On May 28, 2025, an engineer submitted a Kubernetes infrastructure change aimed at improving the resiliency of the service responsible for secure application authentication. The change was peer reviewed and successfully applied, and initial deployments across several clusters appeared to complete without issue.
We were alerted to a malfunction when the change landed on larger, more active clusters. Upon investigation, we found that while the manual CD step to remove and recreate the Kubernetes resource completed successfully, it did not behave as expected. After Kubernetes 1.29, Secret-based service account tokens are no longer generated automatically. As a result, the token required for microservices to authenticate with each other was missing, preventing them from processing writes and queries.
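As an illustration, a missing token of this kind is visible with kubectl: on older clusters, creating a ServiceAccount also produced a token Secret, which no longer appears on newer clusters. The names and namespace below are hypothetical:

```shell
# Hypothetical check: on older clusters, a ServiceAccount listed an
# auto-generated token Secret under "Tokens"; on newer clusters this
# field reads <none>, so anything that mounts the Secret directly fails.
kubectl describe serviceaccount app-auth --namespace influxdb

# List Secrets of the legacy service-account-token type; an empty result
# means no token Secret exists for services that still expect one.
kubectl get secrets --namespace influxdb \
  --field-selector type=kubernetes.io/service-account-token
```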
Investigation and Recovery
Our initial focus was to verify that the PR itself had not unintentionally altered the resource. Next, we examined the CD mechanics to ensure that the CD platform and pipeline were performing as designed. Our attention then turned to Kubernetes itself, as the resource was missing in-cluster yet remained in our IaC (Infrastructure-as-Code) repository. Further investigation revealed that the missing resource was the result of a deliberate change in a recent Kubernetes release.
Once we identified the root cause, we quickly mitigated the outage by recreating and applying the new resource to each cluster. As soon as this mitigation change was deployed within each cluster, success rates improved from 0 percent to 100 percent within minutes.
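On clusters where auto-generation no longer occurs, a token Secret can be created explicitly, and Kubernetes populates it for the referenced ServiceAccount. A minimal sketch of such a mitigation, with hypothetical names:

```shell
# Minimal sketch of recreating a legacy token Secret by hand; the token
# controller fills in the token once the Secret carries the ServiceAccount
# annotation. The ServiceAccount and Secret names are illustrative only.
kubectl apply --namespace influxdb -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: app-auth-token
  annotations:
    kubernetes.io/service-account.name: app-auth
type: kubernetes.io/service-account-token
EOF
```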
Each cluster was examined to ensure that failure rates returned to pre-incident levels. We verified that all cluster components were consuming the new resource, and each cluster was peer reviewed to confirm full functionality before being marked as restored (both internally and on the InfluxData status page).
Future Mitigations
We are implementing several methods to reduce the likelihood of a similar incident in the future: