In this particular case, the development team applied a change intended to add an “application” to the aws-eu-central cluster. In this context, an “application” can be thought of as a service or system component that contributes functionality to the overall InfluxDB Cloud application. The purpose of the new service was to collect performance and correctness data on the new IOx storage engine.
When the developers added the application, they mistyped the name field, using “idpe” instead of “IOx” for the name. “Idpe” is the name of the larger application in which most of the InfluxDB Cloud 2 services run.
In the jsonnet, the application name should have been defined like this:
ApplicationName: 'iox-cd-aws-prod01-eu-central-1',
But it was instead defined like this:
ApplicationName: 'idpe-cd-aws-prod01-eu-central-1',
Argo CD interpreted the re-use of the application name as a redefinition of the application. Because the new configuration was added below the existing functionality in the file, Argo CD deleted the existing application and installed the new one in its place.
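The effect can be pictured with a minimal sketch. The structure below is hypothetical (only the ApplicationName values come from the actual change); because Argo CD keys applications by name, the second entry reads as a redefinition of the first rather than as a new application:

[
  {
    // Existing definition for the main InfluxDB Cloud 2 application.
    ApplicationName: 'idpe-cd-aws-prod01-eu-central-1',
  },
  {
    // Intended name: 'iox-cd-aws-prod01-eu-central-1'.
    // The typo re-uses the idpe name, so this entry redefines the
    // application above instead of adding a new one.
    ApplicationName: 'idpe-cd-aws-prod01-eu-central-1',
  },
]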
This, essentially, removed InfluxDB Cloud 2 from production.
While new writes were not collected during the outage, no existing data was lost. All time series data and all configuration metadata written before the incident were preserved.
The following times are in UTC:
10:22: The PR was merged
10:25: InfluxData monitoring started reporting API failures
10:37: Support Team responded to customer escalations and started the incident response process
10:42: The PR was reverted
10:45: Engineering teams started to plan the recovery process
10:49: status.influxdata.com was updated to reflect the issue
10:56: All senior managers were engaged
11:17: Teams started to carefully redeploy base services (starting with Kafka), connecting them to existing volumes in order to preserve state and accelerate recovery; a sketch of this volume-reuse approach appears after the timeline. Additional services were redeployed in parallel as data integrity was verified. In some cases, services were restored from backup instead of redeployed, where the team deemed this the safer strategy. Because a surge in traffic was expected once service returned, ingress points were proactively scaled up.
15:26: The team enabled the write service, then tested and validated functionality. They then re-enabled tasks to let the backlog of task runs complete, in order to avoid overloading the cluster. Once the backlogged tasks had completed, they re-enabled the query service.
16:04: All reads, and therefore full functionality, were restored.
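For illustration, reattaching a redeployed service to a surviving volume in Kubernetes can be done by pointing a new PersistentVolumeClaim at the retained PersistentVolume instead of letting a fresh one be provisioned. The sketch below is hypothetical (the names, namespace, size, and storage class are not InfluxData's) and is written in jsonnet to match the rest of the configuration:

{
  apiVersion: 'v1',
  kind: 'PersistentVolumeClaim',
  metadata: {
    name: 'data-kafka-0',  // hypothetical claim name for a Kafka broker pod
    namespace: 'kafka',    // hypothetical namespace
  },
  spec: {
    accessModes: ['ReadWriteOnce'],
    resources: { requests: { storage: '500Gi' } },  // hypothetical size
    storageClassName: 'gp2',   // must match the class on the preserved PV
    volumeName: 'pv-kafka-0',  // bind explicitly to the retained PersistentVolume
  },
}

For the binding to succeed, the preserved PersistentVolume needs a Retain reclaim policy, and any stale claimRef left over from the deleted claim has to be cleared so the volume becomes Available again.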