Cloud2 GCP US-Central Cluster write and query issues
Incident Report for InfluxDB Cloud
Postmortem

Summary

From Fri, 06 May 2022 10:06:18 UTC to Fri, 06 May 2022 14:45:22 UTC, users in the GCP US Central region experienced read and write unavailability. A GCP outage made Persistent Volume disks unavailable and partially impacted Kafka pod availability, which in turn made storage pods unavailable.

Background

InfluxDB Cloud uses a Kafka message queue to store writes before they are written to InfluxDB Cloud. Storage pods require this queue to be available to receive writes.

Trigger

This incident was triggered by:

  1. A GCP outage that made Persistent Volume disks unavailable.
  2. A subset of Kafka partitions going offline.

Contributing Factors

  • Sub-optimal configuration of Kafka replication allocation:

    • Because of this configuration in our cluster deployment, about 25% of the Kafka partitions went offline in each cluster, moving the Kafka pods into a not-ready state.
    • Our external write gateway failed almost 100% of the write operations.
    • Storage pods failed because the Kafka partitions were unavailable.
    • Since Kafka was not configured to be topology aware, replicas were assigned in simple order to the same zone, which made that zone a single point of failure.
  • Deletion of a Kafka pod during troubleshooting (the intention was to manually force the mounting of disks to zone b) extended the outage on one cluster by about an hour.
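
To illustrate the replica allocation factor above, here is a minimal sketch of how assigning replicas in simple broker order can place every replica of a partition in one zone, and how interleaving brokers by zone avoids it. The broker IDs, zone names, partition count, and replication factor are hypothetical and were chosen only for illustration; they are not taken from the affected clusters, and the functions are a simplification rather than Kafka's actual assignment algorithm.

```python
from collections import defaultdict

# Hypothetical layout (not from the incident): 6 brokers, numbered in zone order.
BROKER_ZONE = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}


def assign_in_order(num_partitions, replication_factor, brokers):
    """Place each partition's replicas on consecutive brokers in the given order."""
    n = len(brokers)
    return {
        p: [brokers[(p + r) % n] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def interleave_by_zone(broker_zone):
    """Reorder brokers so that neighbours in the list sit in different zones."""
    by_zone = defaultdict(list)
    for broker, zone in sorted(broker_zone.items()):
        by_zone[zone].append(broker)
    zones = sorted(by_zone)
    ordered = []
    while any(by_zone[z] for z in zones):
        for z in zones:
            if by_zone[z]:
                ordered.append(by_zone[z].pop(0))
    return ordered


def offline_partitions(assignment, broker_zone, failed_zone):
    """Partitions whose replicas all live in the failed zone."""
    return [
        p for p, replicas in assignment.items()
        if all(broker_zone[b] == failed_zone for b in replicas)
    ]


if __name__ == "__main__":
    simple = assign_in_order(6, 2, sorted(BROKER_ZONE))
    aware = assign_in_order(6, 2, interleave_by_zone(BROKER_ZONE))
    print(offline_partitions(simple, BROKER_ZONE, "b"))  # [2]
    print(offline_partitions(aware, BROKER_ZONE, "b"))   # []
```

In this toy layout, losing zone b takes partition 2 fully offline under the simple ordering, while the zone-interleaved ordering leaves every partition with at least one replica outside the failed zone.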

Timeline

  • 10:06:18 UTC Kafka queue partially unavailable
  • 10:17:49 UTC storage pod unavailable
  • 11:00:40 UTC outage on GCP Persistent Disk service causing Kafka services to be unavailable
  • 14:45:22 UTC writes available

Learnings & Mitigation Actions

Make sure Kafka replica allocation is rack aware.
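
This report does not describe how the change was implemented. In Kafka, rack awareness is normally enabled by setting the `broker.rack` property on each broker. As a complementary check, the sketch below audits a partition-to-broker assignment and flags partitions whose replicas do not span enough zones. The function name, the `min_zones` threshold, and the sample data are assumptions made for illustration.

```python
def audit_zone_spread(assignment, broker_zone, min_zones=2):
    """Return {partition: zones} for partitions spanning fewer than min_zones zones.

    assignment:  {partition_id: [broker_id, ...]}
    broker_zone: {broker_id: zone_name}
    """
    violations = {}
    for partition, replicas in sorted(assignment.items()):
        zones = sorted({broker_zone[b] for b in replicas})
        if len(zones) < min_zones:
            violations[partition] = zones
    return violations


if __name__ == "__main__":
    # Hypothetical data: brokers 0 and 1 share zone "a", broker 2 is in zone "b".
    broker_zone = {0: "a", 1: "a", 2: "b"}
    assignment = {0: [0, 1], 1: [1, 2]}
    print(audit_zone_spread(assignment, broker_zone))  # {0: ['a']}
```

A check like this could run after any partition reassignment, so that a regression in replica placement is caught before a zone failure exposes it.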

Posted May 19, 2022 - 21:54 UTC

Resolved
This incident has been resolved.
Posted May 06, 2022 - 19:20 UTC
Update
The service remains stable, and our engineering team continues to monitor it. This incident will remain open until the GCP "Persistent Disk" incident is resolved:
https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1
Posted May 06, 2022 - 18:36 UTC
Update
The service remains stable, and our engineering team continues to monitor it. This incident will remain open until the GCP "Persistent Disk" incident is resolved:
https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1
Posted May 06, 2022 - 16:36 UTC
Update
The service has remained stable, and our engineering team are continuing to monitor it.
Posted May 06, 2022 - 15:03 UTC
Update
We are continuing to monitor for any further issues.
Posted May 06, 2022 - 14:02 UTC
Monitoring
GCP have made progress on resolving their issues and we can see that queries and writes are recovering. Our engineers are continuing to monitor the situation.
Posted May 06, 2022 - 13:50 UTC
Update
We are continuing to investigate this issue. We are currently aware of an issue in our Google Cloud US-Central region impacting writes and queries.

GCP are having issues with their storage layer (https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1) which is disrupting operations.

Our team are investigating whether these issues can be mitigated whilst Google resolve their issue.
Posted May 06, 2022 - 13:25 UTC
Update
We are continuing to investigate this issue. We are currently aware of an issue in our Google Cloud US-Central region impacting writes and queries.

GCP are having issues with their storage layer (https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1) which is disrupting operations.

Our team are investigating whether these issues can be mitigated whilst Google resolve their issue.
Posted May 06, 2022 - 12:33 UTC
Investigating
We are currently aware of an issue in our Google Cloud US-Central region impacting writes and queries.

GCP are having issues with their storage layer (https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1) which is disrupting operations.

Our team are investigating whether these issues can be mitigated whilst Google resolve their issues.
Posted May 06, 2022 - 11:38 UTC
This incident affected: Google Cloud: Iowa, US-Central (API Writes, API Queries).