Cloud2 GCP US-Central Cluster write and query issues

Incident Report for InfluxDB Cloud

Postmortem

Summary

From Fri, 06 May 2022 10:06:18 UTC to Fri, 06 May 2022 14:45:22 UTC, users in the GCP US Central region experienced read and write unavailability caused by a GCP outage. The outage made Persistent Volume disks unavailable and partially impacted Kafka pod availability, which in turn made storage pods unavailable.

Background

InfluxDB Cloud uses a Kafka message queue to buffer writes before they are written to storage. Storage pods require this queue to be available to receive writes.

Trigger

This incident was triggered by:

  1. A GCP outage that made Persistent Volume disks unavailable.
  2. A subset of Kafka partitions going offline.

Contributing Factors

  • Sub-optimal configuration of Kafka replication allocation:

    • Due to a sub-optimal configuration of Kafka in our cluster deployment, about 25% of the Kafka partitions went offline in each cluster, moving the Kafka pods into a not-ready state.
    • Our external write gateway failed almost 100% of write operations.
    • Storage pods failed because Kafka partitions were unavailable.
    • Because Kafka was not configured to be topology aware, replicas were assigned in simple order to the same zone, making that zone a single point of failure.
  • Deletion of a Kafka pod during troubleshooting (intended to manually force the disks to mount in zone b) extended the outage on one cluster by about an hour.
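
The single-zone failure mode described above can be sketched in a few lines of Python. This is a simplified illustration, not Kafka's actual assignment algorithm and not our production layout: the broker IDs, zone names, partition count, and replication factor are all assumptions chosen for readability, and the numbers are not meant to reproduce the 25% figure. It compares assigning replicas to brokers in plain ID order against assigning them over a broker list interleaved by zone.

```python
from itertools import zip_longest


def assign_in_order(broker_order, num_partitions, replication_factor):
    """Place each partition's replicas on consecutive brokers in the given order."""
    n = len(broker_order)
    return {
        p: [broker_order[(p + i) % n] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def interleave_by_zone(brokers, zone_of):
    """Reorder brokers so that neighbours in the list sit in different zones."""
    by_zone = {}
    for broker in brokers:
        by_zone.setdefault(zone_of[broker], []).append(broker)
    columns = zip_longest(*(by_zone[z] for z in sorted(by_zone)))
    return [b for column in columns for b in column if b is not None]


def offline_partitions(assignment, zone_of, failed_zone):
    """Partitions whose replicas all live in the failed zone."""
    return [
        p for p, replicas in assignment.items()
        if all(zone_of[b] == failed_zone for b in replicas)
    ]


# Hypothetical layout: six brokers, two per zone, numbered zone by zone.
zone_of = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}
brokers = sorted(zone_of)

simple = assign_in_order(brokers, num_partitions=12, replication_factor=2)
aware = assign_in_order(interleave_by_zone(brokers, zone_of), 12, 2)

print("in-order, zone b down:   ", offline_partitions(simple, zone_of, "b"))
print("interleaved, zone b down:", offline_partitions(aware, zone_of, "b"))
```

With these illustrative inputs, the in-order layout loses partitions 2 and 8 when zone b fails, while the interleaved layout loses none, because no partition keeps both replicas in one zone.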

Timeline

  • 10:06:18 UTC Kafka queue partially unavailable
  • 10:17:49 UTC Storage pod unavailable
  • 11:00:40 UTC Outage on GCP Persistent Disk service, causing Kafka services to be unavailable
  • 14:45:22 UTC Writes available

Learnings & Mitigation Actions

Make sure Kafka replica allocation is rack aware.
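
One way to verify this kind of property is to audit an existing replica assignment against the rack (zone) of each broker. The sketch below is a minimal, hypothetical check: the topic-partition names, broker IDs, and zone labels are invented for illustration, and in practice the two input maps would have to be exported from the cluster. In Kafka itself, the rack of a broker is declared with the `broker.rack` broker setting.

```python
def single_rack_partitions(assignment, rack_of):
    """Return partitions whose replicas all share one rack, mapped to that rack."""
    flagged = {}
    for partition, replicas in assignment.items():
        racks = {rack_of[broker] for broker in replicas}
        if len(racks) < 2:
            flagged[partition] = sorted(racks)
    return flagged


# Hypothetical broker-to-rack map and partition-to-replica map.
rack_of = {
    0: "us-central1-a",
    1: "us-central1-b",
    2: "us-central1-b",
    3: "us-central1-c",
}
assignment = {
    "writes-0": [0, 1],
    "writes-1": [1, 2],  # both replicas in the same zone
    "writes-2": [2, 3],
}

for partition, racks in single_rack_partitions(assignment, rack_of).items():
    print(f"{partition}: all replicas in {racks[0]}")
```

A check like this could run after each deployment or partition reassignment, so that a layout which is not rack aware is caught before a zone fails.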

Posted 3 years ago. May 19, 2022 - 21:54 UTC

Resolved

This incident has been resolved.
Posted 3 years ago. May 06, 2022 - 19:20 UTC

Update

The service remains stable, and our engineering team continues to monitor. This incident will remain open until the GCP "Persistent Disk" incident is resolved:
https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1
Posted 3 years ago. May 06, 2022 - 18:36 UTC

Update

The service remains stable, and our engineering team continues to monitor. This incident will remain open until the GCP "Persistent Disk" incident is resolved:
https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1
Posted 3 years ago. May 06, 2022 - 16:36 UTC

Update

The service has remained stable, and our engineering team are continuing to monitor.
Posted 3 years ago. May 06, 2022 - 15:03 UTC

Update

We are continuing to monitor for any further issues.
Posted 3 years ago. May 06, 2022 - 14:02 UTC

Monitoring

GCP have made progress on resolving their issues and we can see that queries and writes are recovering. Our engineers are continuing to monitor the situation.
Posted 3 years ago. May 06, 2022 - 13:50 UTC

Update

We are continuing to investigate this issue. We are currently aware of an issue in our Google Cloud US-Central region impacting writes and queries.

GCP are having issues with their storage layer (https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1) which is disrupting operations.

Our team are investigating to see whether these issues can be mitigated whilst Google resolve their issue.
Posted 3 years ago. May 06, 2022 - 13:25 UTC

Update

We are continuing to investigate this issue. We are currently aware of an issue in our Google Cloud US-Central region impacting writes and queries.

GCP are having issues with their storage layer (https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1) which is disrupting operations.

Our team are investigating to see whether these issues can be mitigated whilst Google resolve their issue.
Posted 3 years ago. May 06, 2022 - 12:33 UTC

Investigating

We are currently aware of an issue in our Google Cloud US-Central region impacting writes and queries.

GCP are having issues with their storage layer (https://status.cloud.google.com/incidents/pQohCqBLfFapHrkjY5Mh#RP1d9aZLNFZEJmTBk8e1) which is disrupting operations.

Our team are investigating to see whether these issues can be mitigated whilst Google resolve their issues.
Posted 3 years ago. May 06, 2022 - 11:38 UTC
This incident affected: Cloud Serverless: GCP (API Writes, API Queries).