Summary
From Fri, 06 May 2022 10:06:18 UTC to Fri, 06 May 2022 14:45:22 UTC, users in the GCP US Central regions experienced read and write unavailability caused by a GCP outage. The outage made Persistent Volume disks unavailable and partially impacted Kafka pod availability, which in turn made storage pods unavailable.
Background
InfluxDB Cloud uses a Kafka message queue to durably buffer writes before they are persisted to the storage tier. Storage pods require this queue to be available in order to receive writes.
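As an illustration of this write path (a generic sketch, not InfluxDB Cloud's actual code), the following uses the kafka-python client, assuming a local broker and a hypothetical `line-protocol-writes` topic:

```python
# Generic sketch of a durable-queue write path (illustrative only; not
# InfluxDB Cloud's implementation). Assumes a local Kafka broker and a
# hypothetical "line-protocol-writes" topic.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",  # require acknowledgement from all in-sync replicas
)

# An incoming write is appended to the queue; storage pods consume it later.
point = b"cpu,host=server01 usage_idle=87.2 1651831578000000000"
future = producer.send("line-protocol-writes", value=point)
future.get(timeout=10)  # raises if the write cannot be acknowledged,
                        # e.g. when the target partition is offline
producer.flush()
```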
Trigger
This incident was triggered by:
- A GCP outage caused Persistent Volume disks to be unavailable.
- A subset of Kafka partitions went offline.
Contributing Factors
Sub-optimal configuration of Kafka replica allocation:
- Due to a sub-optimal Kafka configuration in our cluster deployment, about 25% of the Kafka partitions in each cluster went offline, moving the Kafka pods into a not-ready state.
- Our external write gateway failed almost 100% of write operations.
- Storage pods failed due to the unavailability of Kafka partitions.
- Because Kafka was not configured to be topology-aware, replicas were assigned in simple broker order within the same zone, making that zone a single point of failure (illustrated in the sketch after this list).
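To make this concrete, here is a simplified model (an illustrative sketch, not Kafka's exact assignment algorithm) using hypothetical broker IDs and zone names. When the broker list is ordered by zone, assigning replicas to consecutive broker IDs places every copy of a partition in one zone, whereas rack-aware placement stripes the copies across zones:

```python
# Simplified model of replica placement (illustrative; not Kafka's exact
# assignment algorithm). Hypothetical broker IDs grouped by zone, so the
# sorted broker list mirrors a zone-ordered deployment.
BROKERS = {
    0: "zone-a", 1: "zone-a", 2: "zone-a",
    3: "zone-b", 4: "zone-b", 5: "zone-b",
    6: "zone-c", 7: "zone-c", 8: "zone-c",
}
ZONES = sorted(set(BROKERS.values()))

def naive_assignment(partition, rf=3):
    """Place replicas on consecutive broker IDs, ignoring topology."""
    ids = sorted(BROKERS)
    start = (partition * rf) % len(ids)
    return [ids[(start + i) % len(ids)] for i in range(rf)]

def rack_aware_assignment(partition, rf=3):
    """Place each replica in a different zone, rotating the start zone."""
    by_zone = {z: [b for b, bz in sorted(BROKERS.items()) if bz == z]
               for z in ZONES}
    # Assumes 3 brokers per zone for brevity.
    return [by_zone[ZONES[(partition + i) % len(ZONES)]][partition % 3]
            for i in range(rf)]

for p in range(2):
    for name, assign in (("naive", naive_assignment),
                         ("rack-aware", rack_aware_assignment)):
        replicas = assign(p)
        zones = sorted({BROKERS[b] for b in replicas})
        print(f"partition {p} {name:>10}: brokers={replicas} zones={zones}")

# Naive placement puts every replica of partition 0 in zone-a, so losing
# that zone takes the partition fully offline; rack-aware placement
# survives the loss of any single zone.
```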
Deletion of a Kafka pod during troubleshooting (intended to manually force the disks to be mounted in zone b) extended the outage on one cluster by about an hour.
Timeline
- 10:06:18 UTC Kafka queue partially unavailable
- 10:17:49 UTC storage pod unavailable
- 11:00:40 UTC Outage on the GCP Persistent Disk service; Kafka services unavailable
- 14:45:22 UTC writes available
Learnings & Mitigation Actions
Make sure Kafka replica allocation is rack-aware.
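The standard Kafka mechanism for this is the `broker.rack` property: when each broker advertises the rack it runs in (here, its availability zone), Kafka's rack-aware assignment spreads the replicas of each new partition across as many racks as possible. A minimal sketch, assuming hypothetical zone names; note that `broker.rack` only influences new assignments, so existing partitions would also need a manual reassignment (e.g., with kafka-reassign-partitions.sh):

```properties
# server.properties on each Kafka broker (hypothetical zone names).
# Brokers in zone a:
broker.rack=us-central1-a
# Brokers in zone b would set broker.rack=us-central1-b, and so on.
```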