RCA - Query and Write Outage on May 28, 2025
Summary
On May 28, 2025, at 14:33 UTC, we deployed an infrastructure change to multiple production InfluxDB Cloud (TSM and IOx) clusters. At 15:07 UTC, the continuous deployment (CD) process required a manual step to update an immutable Kubernetes resource, which involves deleting and recreating the resource. Steps like this are not uncommon and had previously been executed in staging environments without issue.
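For context, a manual step of this kind typically amounts to deleting the immutable resource and re-applying its manifest from source control. The resource type, names, and paths below are hypothetical, not the actual objects involved:

```shell
# Hypothetical sketch of the manual CD step: Kubernetes rejects in-place
# updates to immutable fields, so the resource is deleted and re-created.
# The resource type, names, and namespace are illustrative only.
kubectl delete serviceaccount app-auth --namespace influxdb

# Re-apply the manifest from the IaC repository to recreate the resource.
kubectl apply -f infra/auth/service-account.yaml --namespace influxdb
```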
Shortly after the deployment completed, an alert at 15:09 UTC indicated that certain microservices were unable to authenticate with each other across multiple production clusters. These services are critical to data ingestion and querying, and the authentication failure prevented them from processing incoming writes and queries.
Writes and queries began to fail at 15:09 UTC and remained unavailable until authentication paths were restored on each affected cluster. The outage duration differed for each cluster because each cluster was examined individually to confirm our diagnosis, apply a fix, and ensure all services restored communication properly.
The full outage timeframe ranged from 15:09 UTC to 17:43 UTC, with the first cluster fully recovering at 17:04 UTC. We define a full recovery as the point when a cluster can accept writes, return queries, and resume normal software deployments.
Cause of the Incident
Our software is deployed via a CD pipeline to three staging clusters (one per cloud provider), where a suite of automated tests is run. If the tests pass, the change is deployed simultaneously to all production clusters. This is our standard cloud service deployment process.
On May 28, 2025, an engineer submitted a Kubernetes infrastructure change aimed at improving the resiliency of the service responsible for secure application authentication. The change was peer reviewed and successfully applied, and initial deployments across several clusters appeared to complete without issue.
We were alerted to a malfunction when the change landed on larger, more active clusters. Upon investigation, we found that while the manual CD step to remove and recreate the Kubernetes resource completed successfully, it did not behave as expected. After Kubernetes 1.29, Secret-based service account tokens are no longer generated automatically. As a result, the token required for microservices to authenticate with each other was missing, preventing them from processing writes and queries.
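As an illustration, a missing token of this kind is visible with kubectl: on older clusters, creating a ServiceAccount also produced a token Secret, which no longer appears on newer clusters. The names and namespace below are hypothetical:

```shell
# Hypothetical check: on older clusters, a ServiceAccount listed an
# auto-generated token Secret under "Tokens"; on newer clusters this
# field reads <none>, so anything that mounts the Secret directly fails.
kubectl describe serviceaccount app-auth --namespace influxdb

# List Secrets of the legacy service-account-token type; an empty result
# means no token Secret exists for services that still expect one.
kubectl get secrets --namespace influxdb \
  --field-selector type=kubernetes.io/service-account-token
```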
Investigation and Recovery
Our initial focus was to verify that the PR itself had not unintentionally altered the resource. Next, we examined the CD mechanics to ensure that the CD platform and pipeline were performing as designed. Our attention then turned to Kubernetes itself, as the resource was missing in-cluster yet remained in our IaC (Infrastructure-as-Code) repository. Further investigation revealed that the missing resource was the result of a deliberate change in a recent Kubernetes release.
Once we identified the root cause, we quickly mitigated the outage by recreating and applying the new resource to each cluster. As soon as this mitigation change was deployed within each cluster, success rates improved from 0 percent to 100 percent within minutes.
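On clusters where auto-generation no longer occurs, a token Secret can be created explicitly, and Kubernetes populates it for the referenced ServiceAccount. A minimal sketch of such a mitigation, with hypothetical names:

```shell
# Minimal sketch of recreating a legacy token Secret by hand; the token
# controller fills in the token once the Secret carries the ServiceAccount
# annotation. The ServiceAccount and Secret names are illustrative only.
kubectl apply --namespace influxdb -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: app-auth-token
  annotations:
    kubernetes.io/service-account.name: app-auth
type: kubernetes.io/service-account-token
EOF
```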
Each cluster was examined to ensure that failure rates returned to pre-incident levels. We verified that all cluster components were consuming the new resource, and each cluster was peer reviewed to confirm full functionality before being marked as restored (both internally and on the InfluxData status page).
Future Mitigations
We are implementing several methods to reduce the likelihood of a similar incident in the future: