Summary
From 16:32 UTC to 18:04 UTC on 2022-04-28, users in the GCP US Central region experienced query and task unavailability because the nodes hosting both replicas of two storage partitions went offline at the same time. Writes were not affected during this incident.
Background
InfluxDB Cloud stores data on multiple nodes so that it remains highly available. Each storage partition has two replicas, and queries can be served as long as at least one replica of a partition is available.
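The availability rule above can be sketched in a few lines. This is an illustration only, not InfluxDB Cloud code: the `Replica` type, its fields, and the partition and pod names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    # Hypothetical model of one copy of a storage partition.
    partition: str
    pod: str
    healthy: bool


def unavailable_partitions(replicas):
    """Return the sorted names of partitions that have no healthy replica."""
    any_healthy = {}
    for r in replicas:
        # A partition can serve queries if at least one of its replicas is healthy.
        any_healthy[r.partition] = any_healthy.get(r.partition, False) or r.healthy
    return sorted(p for p, ok in any_healthy.items() if not ok)


# Hypothetical state: p1 has lost one replica, p2 has lost both.
state = [
    Replica("p1", "storage-0", False),
    Replica("p1", "storage-1", True),
    Replica("p2", "storage-2", False),
    Replica("p2", "storage-3", False),
]
print(unavailable_partitions(state))
```

Under this model, losing one replica of a partition is tolerated, and queries fail only for partitions where every replica is down, which is the situation this incident reached for two partitions.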
Trigger
This incident was triggered by two independent events coinciding:
- Routine scaling operations
- GCP maintenance operations
Contributing Factors
- Disk scaling operations increase the pod recovery time, because rebuilding the storage pods after replacing the disks is slower than a typical restart.
- When GCP replaces a node, after a 1h timeout expires, GCP proceeds
Timeline
- 16:32 UTC "partition not found" errors begin
- 16:52 UTC both partition replicas recovering
- 16:58 UTC GCP node replacement identified as the cause of the second replica going down
- 16:59 UTC storage node killed to force rescheduling
- 17:34 UTC storage replica killed (two partitions are down)
- 17:51 UTC first storage partition back online
- 18:04 UTC second storage partition back online
- 18:04 UTC service restored
Future Mitigations
- Update storage-controller to not recreate persistent volume claims while the other pod is unhealthy
- Accelerate pod recovery during disk scaling operations
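The first mitigation amounts to a guard that runs before a persistent volume claim is recreated. The following is a minimal sketch of such a guard under stated assumptions, not the storage-controller's actual implementation: `Replica`, `may_recreate_pvc`, and the pod names are hypothetical, and health is assumed to be already known for each replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    # Hypothetical model of one copy of a storage partition.
    partition: str
    pod: str
    healthy: bool


def may_recreate_pvc(replicas, target_pod):
    """Allow recreating target_pod's PVC only if every peer replica is healthy.

    Peers are the other replicas of the same partition. If there are no
    peers, or any peer is unhealthy, the disruptive operation is refused.
    """
    target = next((r for r in replicas if r.pod == target_pod), None)
    if target is None:
        raise KeyError(f"unknown pod: {target_pod}")
    peers = [
        r for r in replicas
        if r.partition == target.partition and r.pod != target_pod
    ]
    return bool(peers) and all(r.healthy for r in peers)


# Hypothetical state: the peer of storage-0 is down, the peer of storage-2 is up.
state = [
    Replica("p1", "storage-0", True),
    Replica("p1", "storage-1", False),
    Replica("p2", "storage-2", True),
    Replica("p2", "storage-3", True),
]
print(may_recreate_pvc(state, "storage-0"))  # peer unhealthy, so refuse
print(may_recreate_pvc(state, "storage-2"))  # peer healthy, so allow
```

The guard deliberately fails closed: if the peer's state is missing or unhealthy, the disk operation waits, so a scaling operation cannot take down the last available replica of a partition.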