Write and read outage in AWS: Frankfurt, EU-Central-1, AWS: Oregon, US-West-2-1 and AWS: Virginia, US-East-1

Incident Report for InfluxDB Cloud

Postmortem

Incident RCA

Summary

On Feb 24, 2023 at 19:30 UTC, we deployed a software change to multiple production clusters. The change caused a significant percentage of writes and queries to fail in our larger clusters. Both the duration of the outage and the level of disruption (the percentage of writes and queries that failed during the incident) differed from cluster to cluster. The table below summarizes the time ranges during which the service was impacted in each cluster (all times in UTC).

Cluster	Write failure start	Write failure end	Query failure start	Query failure end
prod01-us-west-2	19:38	22:17	19:36	22:20
prod01-eu-central-1	19:36	23:49	19:34	23:38
prod101-us-east-1	19:34	22:44	19:34	00:44 (Feb 25)
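
The impact durations implied by the table can be worked out with a short script. This is only a sketch: the times are copied from the table, the function names are illustrative, and it assumes that an end time earlier than its start time falls on the following day (Feb 25).

```python
from datetime import datetime, timedelta

# Times copied from the table (UTC): write start, write end, query start, query end.
WINDOWS = {
    "prod01-us-west-2": ("19:38", "22:17", "19:36", "22:20"),
    "prod01-eu-central-1": ("19:36", "23:49", "19:34", "23:38"),
    "prod101-us-east-1": ("19:34", "22:44", "19:34", "00:44"),
}


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end; an earlier end is assumed to be the next day."""
    t0 = datetime.strptime(start, "%H:%M")
    t1 = datetime.strptime(end, "%H:%M")
    if t1 < t0:
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds() // 60)


def impact_minutes(windows=WINDOWS):
    """Return write and query impact durations in minutes for each cluster."""
    return {
        cluster: {
            "write": minutes_between(w_start, w_end),
            "query": minutes_between(q_start, q_end),
        }
        for cluster, (w_start, w_end, q_start, q_end) in windows.items()
    }


if __name__ == "__main__":
    for cluster, d in impact_minutes().items():
        print(f"{cluster}: writes {d['write']} min, queries {d['query']} min")
```

Under that assumption, the longest window is the query impact in prod101-us-east-1 (310 minutes) and the shortest is the write impact in prod01-us-west-2 (159 minutes).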

Cause of the Incident

Our software is deployed via a CD pipeline to three staging clusters (one per cloud provider), where a suite of automated tests is run. If those tests pass, the software is deployed to an internal cluster, where another round of testing occurs. Finally, it is deployed to all of our production clusters in parallel. This is our standard software deployment methodology for our cloud service.

On February 24, 2023, an engineer changed a health check to ensure that our query and write pods can reach the vault within the cluster (where credentials are managed). In the past, a query or write pod could get stuck if it lost access to the vault. To address that problem, a health check was added so that a pod that could not reach the vault would stop and restart automatically.

This health check was tested in all three staging clusters and worked fine. The change was promoted to our internal cluster, where it also worked fine, and then to our production clusters. In the larger clusters, when the pods were restarted with the new health check in place, too many pods made health-check calls to the vault in quick succession. These calls overwhelmed the vault, which was unable to service all of the requests. As the health checks failed, the pods attempted to recover by restarting, which put an even heavier load on the vault, from which it was unable to recover.
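
The feedback loop described above can be illustrated with a toy model. This is not a model of our actual vault or pods: the pod counts, the vault capacity, the per-restart request cost, and the `graceful` switch are all hypothetical values chosen only to show why the same change can be harmless in a small cluster and self-sustaining in a large one.

```python
def simulate(pods, vault_capacity, restart_cost=3, ticks=5, graceful=False):
    """Return the number of failing pods per tick in a toy restart loop.

    Assumptions (all illustrative):
    - every pod sends one health-check request to the vault per tick;
    - a pod that restarted on the previous tick sends `restart_cost` extra requests;
    - an overloaded vault serves nothing unless `graceful` is True, in which
      case it still serves `vault_capacity` requests and rejects the rest.
    """
    restarting = pods  # the rollout restarts every pod at once
    history = []
    for _ in range(ticks):
        demand = pods + restart_cost * restarting
        if demand <= vault_capacity:
            served_fraction = 1.0
        elif graceful:
            served_fraction = vault_capacity / demand
        else:
            served_fraction = 0.0
        failing = round(pods * (1.0 - served_fraction))
        history.append(failing)
        restarting = failing  # failed health checks trigger restarts
    return history


if __name__ == "__main__":
    print("small cluster:", simulate(pods=50, vault_capacity=1000))
    print("large cluster:", simulate(pods=500, vault_capacity=1000))
    print("large, graceful vault:", simulate(pods=500, vault_capacity=1000, graceful=True))
```

In this sketch the small cluster never exceeds the vault's capacity, the large cluster stays fully failed because every restart recreates the overload, and the large cluster with a gracefully degrading vault drains the backlog within a few ticks.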

Recovery

As soon as we detected the problem and identified the offending software change, we rolled back to an earlier version of our production software and redeployed it in all production clusters. In our smaller clusters, this happened quickly, without any significant customer impact. In our three largest clusters (the clusters listed above), the vault was deadlocked, so we were unable to deploy the rolled-back software without first manually restarting the vault instances and then gradually restarting the services that depend on the vault. This is why recovery took longer in these clusters.
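
The gradual restart of dependent services can be sketched as a batched loop that waits for each batch to become healthy before starting the next. This is a minimal illustration, not our actual tooling: `restart` and `is_healthy` are hypothetical callables supplied by the caller, and the batch size and retry limits are arbitrary.

```python
import time
from typing import Callable, Iterable, List


def batches(items: List[str], size: int) -> Iterable[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def restart_gradually(
    services: List[str],
    restart: Callable[[str], None],      # hypothetical: restarts one service
    is_healthy: Callable[[str], bool],   # hypothetical: reports service health
    batch_size: int = 2,
    pause_s: float = 0.0,
    max_checks: int = 10,
) -> List[str]:
    """Restart services batch by batch; stop if a batch never becomes healthy."""
    done: List[str] = []
    for batch in batches(services, batch_size):
        for name in batch:
            restart(name)
        for _ in range(max_checks):
            if all(is_healthy(name) for name in batch):
                break
            time.sleep(pause_s)
        else:
            # No break: the batch did not recover within max_checks polls.
            raise RuntimeError(f"batch {batch} did not become healthy")
        done.extend(batch)
    return done
```

Limiting how many services restart at once bounds the burst of requests that reaches the vault at any moment, which is the point of restarting gradually rather than all at once.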

Future mitigations

We are re-implementing the offending health check so that we can detect a stuck pod without putting such a burden on the vault.
As the vault is a critical element of our service, we are adding an extra peer review step to all software changes that interact with the vault.
We are enhancing the vault configuration so that the vault degrades more gracefully when overloaded.
We are enhancing our runbooks so that we can more quickly intervene with manual steps if the regular deployment/rollback process fails, to reduce our overall time-to-recover when a cluster fails to recover normally.
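
One way a health check can detect a pod that has lost vault access without placing a heavy load on the vault is to cache the result of the last probe, add random jitter to the probe interval, and tolerate a few consecutive failures before reporting the pod unhealthy. The sketch below illustrates that general technique only. It is not the re-implemented health check itself, and the class name, the `probe` callable, and all default values are hypothetical.

```python
import random
import time
from typing import Callable


class VaultHealthCheck:
    """Health check that reuses a cached vault probe result between probes."""

    def __init__(
        self,
        probe: Callable[[], bool],       # hypothetical: True if the vault is reachable
        ttl_s: float = 30.0,             # minimum time between real probes
        jitter_s: float = 10.0,          # random extra delay to desynchronize pods
        failure_threshold: int = 3,      # consecutive failures before unhealthy
        clock: Callable[[], float] = time.monotonic,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self._probe = probe
        self._ttl_s = ttl_s
        self._jitter_s = jitter_s
        self._failure_threshold = failure_threshold
        self._clock = clock
        self._rng = rng
        self._next_probe_at = float("-inf")  # forces a probe on the first call
        self._failures = 0

    def healthy(self) -> bool:
        """Return pod health, contacting the vault only when the cache expires."""
        now = self._clock()
        if now >= self._next_probe_at:
            try:
                ok = bool(self._probe())
            except Exception:
                ok = False
            self._failures = 0 if ok else self._failures + 1
            # Jitter spreads the probes of many pods over time.
            self._next_probe_at = now + self._ttl_s + self._rng() * self._jitter_s
        return self._failures < self._failure_threshold
```

With these choices, frequent health-check calls hit the vault at most once per cache interval per pod, and a single failed probe during a brief vault slowdown does not immediately trigger a restart.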

Posted Mar 01, 2023 - 01:38 UTC

Resolved

The issue has been fully resolved in all regions. We will continue to monitor.

Posted Feb 25, 2023 - 01:55 UTC

Update

The issue has been fully resolved in all regions. We will continue to monitor.

Posted Feb 25, 2023 - 01:40 UTC

Update

Write and read are back in AWS: Virginia, US-East-1 and we are continuing to monitor for any further issues.

Posted Feb 25, 2023 - 01:21 UTC

Update

We are continuing to monitor for any further issues.

Posted Feb 25, 2023 - 01:19 UTC

Update

Write and read are down in AWS: Virginia, US-East-1.

Posted Feb 25, 2023 - 01:04 UTC

Update

We are continuing to monitor for any further issues.

Posted Feb 25, 2023 - 01:02 UTC

Monitoring

Write and read are working in all regions now.

Posted Feb 25, 2023 - 00:32 UTC

Update

Write and read outage: AWS: Oregon, US-West-2-1 and AWS: Virginia, US-East-1 are recovering, and we are still working on AWS: Frankfurt, EU-Central-1.

Posted Feb 24, 2023 - 22:50 UTC

Update

We are continuing to investigate this issue.

Posted Feb 24, 2023 - 22:44 UTC

Investigating

We are currently investigating this issue.

Posted Feb 24, 2023 - 20:00 UTC

This incident affected: Cloud Serverless: AWS, US-West-2-1 (API Writes, API Queries, Tasks), Cloud Serverless: AWS, EU-Central (API Writes, API Queries, Tasks), and Cloud Serverless: AWS, US-East-1 (API Writes, API Queries, Tasks).