Summary
From Fri, 06 May 2022 10:06:18 UTC to Fri, 06 May 2022 14:45:22 UTC, users in the GCP US Central regions experienced read and write unavailability caused by a GCP outage. The outage made Persistent Volume disks unavailable and partially impacted Kafka pod availability, which in turn made storage pods unavailable.
Background
InfluxDB Cloud uses a Kafka message queue to durably buffer writes before they are persisted to the storage tier. Storage pods require this queue to be available in order to receive writes.
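As an illustration of this write path (a generic sketch, not InfluxDB Cloud's actual code), the following uses the kafka-python client, assuming a local broker and a hypothetical `line-protocol-writes` topic:

```python
# Generic sketch of a durable-queue write path (illustrative only; not
# InfluxDB Cloud's implementation). Assumes a local Kafka broker and a
# hypothetical "line-protocol-writes" topic.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",  # require acknowledgement from all in-sync replicas
)

# An incoming write is appended to the queue; storage pods consume it later.
point = b"cpu,host=server01 usage_idle=87.2 1651831578000000000"
future = producer.send("line-protocol-writes", value=point)
future.get(timeout=10)  # raises if the write cannot be acknowledged,
                        # e.g. when the target partition is offline
producer.flush()
```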
Trigger
This incident was triggered by:
- A GCP outage caused Persistent Volume disks to be unavailable.
- A subset of Kafka partitions went offline.
Contributing Factors
Sub-optimal configuration of Kafka replica allocation:
- Due to a sub-optimal Kafka configuration in our cluster deployment, about 25% of the Kafka partitions in each cluster went offline, moving the Kafka pods into a not-ready state.
- Our external write gateway failed almost 100% of write operations.
- Storage pods failed due to the unavailability of Kafka partitions.
- Because Kafka was not configured to be topology-aware, replicas were assigned in simple broker order within the same zone, making that zone a single point of failure (illustrated in the sketch after this list).
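To make this concrete, here is a simplified model (an illustrative sketch, not Kafka's exact assignment algorithm) using hypothetical broker IDs and zone names. When the broker list is ordered by zone, assigning replicas to consecutive broker IDs places every copy of a partition in one zone, whereas rack-aware placement stripes the copies across zones:

```python
# Simplified model of replica placement (illustrative; not Kafka's exact
# assignment algorithm). Hypothetical broker IDs grouped by zone, so the
# sorted broker list mirrors a zone-ordered deployment.
BROKERS = {
    0: "zone-a", 1: "zone-a", 2: "zone-a",
    3: "zone-b", 4: "zone-b", 5: "zone-b",
    6: "zone-c", 7: "zone-c", 8: "zone-c",
}
ZONES = sorted(set(BROKERS.values()))

def naive_assignment(partition, rf=3):
    """Place replicas on consecutive broker IDs, ignoring topology."""
    ids = sorted(BROKERS)
    start = (partition * rf) % len(ids)
    return [ids[(start + i) % len(ids)] for i in range(rf)]

def rack_aware_assignment(partition, rf=3):
    """Place each replica in a different zone, rotating the start zone."""
    by_zone = {z: [b for b, bz in sorted(BROKERS.items()) if bz == z]
               for z in ZONES}
    # Assumes 3 brokers per zone for brevity.
    return [by_zone[ZONES[(partition + i) % len(ZONES)]][partition % 3]
            for i in range(rf)]

for p in range(2):
    for name, assign in (("naive", naive_assignment),
                         ("rack-aware", rack_aware_assignment)):
        replicas = assign(p)
        zones = sorted({BROKERS[b] for b in replicas})
        print(f"partition {p} {name:>10}: brokers={replicas} zones={zones}")

# Naive placement puts every replica of partition 0 in zone-a, so losing
# that zone takes the partition fully offline; rack-aware placement
# survives the loss of any single zone.
```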
Deletion of a Kafka pod during troubleshooting (intended to manually force the disks to be mounted in zone b) extended the outage on one cluster by about an hour.
Timeline
- 10:06:18 UTC Kafka queue partially unavailable
- 10:17:49 UTC storage pod unavailable
- 11:00:40 UTC Outage on the GCP Persistent Disk service; Kafka services unavailable
- 14:45:22 UTC writes available
Learnings & Mitigation Actions
Make sure Kafka replica allocation is rack-aware.
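The standard Kafka mechanism for this is the `broker.rack` property: when each broker advertises the rack it runs in (here, its availability zone), Kafka's rack-aware assignment spreads the replicas of each new partition across as many racks as possible. A minimal sketch, assuming hypothetical zone names; note that `broker.rack` only influences new assignments, so existing partitions would also need a manual reassignment (e.g., with kafka-reassign-partitions.sh):

```properties
# server.properties on each Kafka broker (hypothetical zone names).
# Brokers in zone a:
broker.rack=us-central1-a
# Brokers in zone b would set broker.rack=us-central1-b, and so on.
```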