EU-Central-1 is down
Incident Report for InfluxDB Cloud
Postmortem

Background

  • InfluxDB Cloud runs on Kubernetes, a cloud application orchestration platform.
  • InfluxData uses an automated Continuous Delivery system to deploy both code and configuration changes to production. On a typical work day, the engineering team delivers between 5 and 15 separate changes to production.
  • The team uses a tool called “Argo CD”, developed by Intuit, to deploy code and configuration changes to Kubernetes clusters. Argo CD does this by reading a YAML configuration file and then using the Kubernetes API to make the cluster consistent with what is specified in the YAML.
  • In order to maintain multiple clusters, a language called jsonnet is used to templatize the configuration. The CD system detects changes in the jsonnet, converts the jsonnet into YAML, and then Argo CD applies the changes. (A sketch of what the rendered YAML looks like follows this list.)
  • There is a code review process in place to inspect the resulting YAML and ensure that it will do what is expected before it is applied.
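
For readers unfamiliar with Argo CD, the rendered YAML is a set of Kubernetes “Application” resources that tell Argo CD which manifests to deploy and where. The following is a minimal illustrative sketch only; the repository URL, paths, and namespaces are placeholders, not InfluxData’s actual configuration:

     apiVersion: argoproj.io/v1alpha1
     kind: Application
     metadata:
       name: iox-cd-aws-prod01-eu-central-1      # the ApplicationName defined in the jsonnet
       namespace: argocd
     spec:
       project: default
       source:
         repoURL: https://example.com/influxdata/k8s-config.git   # placeholder
         path: rendered/aws-prod01-eu-central-1/iox               # placeholder
         targetRevision: main
       destination:
         server: https://kubernetes.default.svc
         namespace: iox                                           # placeholder
       syncPolicy:
         automated: {}   # Argo CD keeps the cluster in sync with this definition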

Trigger

In this particular case, the development team applied a change intended to add an “application” to the aws-eu-central cluster. In this context an “application” can be thought of as a service or system component that contributes functionality to the overall InfluxDB Cloud application. The additional service was meant to collect performance and correctness data on the new IOx storage engine.

When the developers added the application, they mistyped the name field, using “idpe” instead of “iox”. “idpe” is the name of the larger application in which most of the InfluxDB Cloud 2 services run.

In the jsonnet, the application name should have been defined like this:

     ApplicationName: 'iox-cd-aws-prod01-eu-central-1',

But was defined like this:

     ApplicationName: 'idpe-cd-aws-prod01-eu-central-1',

Argo CD interpreted the re-use of the application name as a redefinition of the existing application. Because the new definition appeared below the existing one in the file, Argo CD deleted the existing application and installed the new one in its place.

This, essentially, removed InfluxDB Cloud 2 from production.
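
As a simplified illustration (the resource contents below are hypothetical; only the names come from the jsonnet shown above), the rendered YAML effectively contained two Application resources with the same name, so Argo CD treated the later one as the new desired state for “idpe” rather than as a separate application:

     # Existing application: manages the core InfluxDB Cloud 2 ("idpe") services
     apiVersion: argoproj.io/v1alpha1
     kind: Application
     metadata:
       name: idpe-cd-aws-prod01-eu-central-1
     spec:
       project: default
       source:
         repoURL: https://example.com/influxdata/k8s-config.git   # placeholder
         path: rendered/idpe                                      # placeholder
       destination:
         server: https://kubernetes.default.svc
     ---
     # New application intended for IOx telemetry, but carrying the same name
     apiVersion: argoproj.io/v1alpha1
     kind: Application
     metadata:
       name: idpe-cd-aws-prod01-eu-central-1   # should have been iox-cd-aws-prod01-eu-central-1
     spec:
       project: default
       source:
         repoURL: https://example.com/influxdata/k8s-config.git   # placeholder
         path: rendered/iox                                       # placeholder
       destination:
         server: https://kubernetes.default.svc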

Contributing Factors

  1. The rendered YAML appeared to be adding functionality to the existing application; it was not clear that it was redefining it.
  2. Automated tests that run in order to detect such problems were misconfigured and were not running in the location where this change was made.
  3. Argo CD made the change with no warning, and no failsafe was in place. (A sketch of the kind of opt-in safeguards that can be configured follows this list.)
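
For context, Argo CD does offer some opt-in protections against automated deletion. The sketch below shows two of them; the resource names are illustrative, and the report does not say whether these options were available in, or appropriate for, the configuration InfluxData was running at the time:

     # An Application whose automated sync is not allowed to delete live resources
     apiVersion: argoproj.io/v1alpha1
     kind: Application
     metadata:
       name: idpe-cd-aws-prod01-eu-central-1
     spec:
       syncPolicy:
         automated:
           prune: false   # resources removed from the YAML stay running until pruned manually
     ---
     # An individual object opted out of pruning via a sync-option annotation
     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: critical-service                            # illustrative name
       annotations:
         argocd.argoproj.io/sync-options: Prune=false    # Argo CD will not prune this object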

Customer Impact

  1. All API endpoints, including all writes and reads, returned 404 or equivalent. No data was collected during this time.
  2. No tasks ran during the outage, and all task runs scheduled during that window failed.
  3. Data could not be queried externally.

While new writes were not collected, no existing data was lost. All time series data and all configuration metadata written before the incident were preserved.

Timeline

The following times are in UTC:

10:22: The PR was merged

10:25: InfluxData monitoring started reporting API failures

10:37: Support Team responded to customer escalations and started the incident response process

10:42: The PR was reverted

10:45: Engineering teams started to plan the recovery process

10:49: status.influxdata.com was updated to reflect the issue

10:56: All senior managers were engaged

11:17: Teams started to redeploy base services (starting with Kafka) carefully in order to connect them with existing volumes, preserve state, and accelerate recovery. Additional services were re-deployed in parallel as data integrity was verified. In some cases, services were restored from backup instead of being redeployed, where the team deemed this the safer strategy. Expecting a surge in traffic once service was restored, the team proactively scaled up ingress points.

15:26: The team enabled the write service, tested, and validated functionality. They then re-enabled tasks to allow the backlog of task runs to complete in order to avoid overloading the cluster. Once backlogged tasks had completed, they re-enabled the query service.

16:04: All reads, and therefore full functionality, were restored.

Future Mitigations

  1. Simplify the directory structure of the jsonnet configuration codebase and confirm that all pre-deployment tests are running on it.
  2. Enhance the code review process by reviewing a diff generated by an Argo CD dry run. Develop and contribute this functionality upstream if needed. (A sketch of such a review step follows this list.)
  3. Review disaster recovery runbooks in light of what was learned during the incident.
  4. Review “near misses” and ensure that our backup strategy is sufficient.
  5. Implement “game days” to practice disaster recovery on a cadence in the internal production cluster.
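
As one possible shape for mitigation 2, a CI job could render the jsonnet locally and ask Argo CD to diff it against the live cluster, attaching the output to the pull request for reviewers. The sketch below is hypothetical: the pipeline syntax, image, paths, and application name are assumptions, and the exact argocd flags depend on the Argo CD version in use.

     # Hypothetical CI job: render the jsonnet and show what Argo CD would change
     render-and-diff:
       image: quay.io/argoproj/argocd        # assumes an image with both the jsonnet and argocd CLIs
       script:
         - mkdir -p rendered
         - jsonnet -m rendered clusters/aws-prod01-eu-central-1.jsonnet   # placeholder paths
         # authentication to the Argo CD API server omitted for brevity
         - argocd app diff idpe-cd-aws-prod01-eu-central-1 --local rendered

A non-empty diff that deletes an entire application would then be an obvious red flag during review.
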
Posted Sep 20, 2021 - 22:28 UTC

Resolved
This incident has been resolved.
Posted Sep 17, 2021 - 19:41 UTC
Monitoring
Service has been restored for reads, writes, queries and tasks. We are monitoring the situation actively.
Posted Sep 17, 2021 - 18:14 UTC
Update
The team is monitoring writes, and full service seems to have been restored for writes. The team continues to re-enable services necessary for reads, and expects to recover reads within hours. The team does not have a more precise ETA for recovering reads yet, but will update when they have more data.
Posted Sep 17, 2021 - 17:23 UTC
Update
The team is monitoring writes, and full service seems to have been restored for writes. The team continues to re-enable services necessary for reads, and expects to recover reads within hours. The team does not have a more precise ETA for recovering reads yet, but will update when they have more data.
Posted Sep 17, 2021 - 16:49 UTC
Update
Write availability has been restored. Customers should not see any further errors on the write path.

We are beginning the process of restoring read availability. This will involve provisioning further resources and will take some time. We won't restore read availability until we are confident writes are available.
Posted Sep 17, 2021 - 15:50 UTC
Update
We are finalising the provisioning of resources necessary to completely restore write availability. We expect this to be completed very soon. Reads remain unavailable on the cluster, and the team is working to provision all resources necessary to fully restore read availability.
Posted Sep 17, 2021 - 15:35 UTC
Update
The team continues to implement the recovery process. Currently, they estimate that write availability will be recovered in the next 2 hours, though please be advised this is only their best good faith estimate. We expect read availability to be restored after write availability. The team expects that will take hours as well, but does not have a more specific estimate at this time.
Posted Sep 17, 2021 - 15:02 UTC
Update
The team continues to make progress restoring service. However, they are still unable to supply an updated estimate for full system recovery. They still anticipate it will be measured in hours.
Posted Sep 17, 2021 - 14:29 UTC
Update
The team continues to make progress restoring service. However, they are still unable to supply an updated estimate for full system recovery. They still anticipate it will be measured in hours.
Posted Sep 17, 2021 - 13:51 UTC
Update
aws-eu-central region is still down. The team is making progress. The current plan is that writes will be back online before reads are available. We have re-confirmed that there does not appear to be any data loss.
Posted Sep 17, 2021 - 12:57 UTC
Update
aws-eu-central region is still down. The team is making progress, but we don't have an updated estimate for when service will be restored yet. We have re-confirmed that there does not appear to be any data loss.
Posted Sep 17, 2021 - 12:46 UTC
Update
A configuration/operator error caused all compute in our aws-eu-central region to be disabled. This incident is limited to aws-eu-central. Data that has already been written is safe. The engineering team is working on restoring functionality to the cluster. Unfortunately, current estimates for recovery time are measured in hours. We will continue to update as our estimates improve.
Posted Sep 17, 2021 - 12:09 UTC
Identified
The cluster is down. We have identified the issue and are working to resolve it.
Posted Sep 17, 2021 - 11:36 UTC
Update
The cluster is down and we are working to resolve the issue.
Posted Sep 17, 2021 - 11:15 UTC
Update
The cluster is down and we are working to resolve the issue.
Posted Sep 17, 2021 - 11:00 UTC
Update
We are continuing to investigate this issue.
Posted Sep 17, 2021 - 10:49 UTC
Investigating
We are currently investigating this issue.
Posted Sep 17, 2021 - 10:49 UTC
This incident affected: AWS: Frankfurt, EU-Central-1 (Web UI, API Writes, API Queries, Tasks, Persistent Storage, Compute).