Elevated rate of 500 Errors while Querying in GCP US Central region

Incident Report for InfluxDB Cloud

Postmortem

Summary

From 16:32 UTC to 18:04 UTC on 2022-04-28, users in the GCP US Central region experienced query and task unavailability because the nodes hosting both replicas of two storage partitions went offline. Writes were not affected during this incident.

Background

InfluxDB Cloud uses multiple nodes to store data in a highly available way. Data is stored in duplicate, and queries can be served as long as at least one of the two copies is available.
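
The availability rule above can be illustrated with a minimal sketch. The `Partition` class, its field names, and the partition names are hypothetical and are not taken from InfluxDB Cloud; the only assumption carried over from this report is that a partition stays queryable while at least one of its two copies is healthy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Partition:
    """Hypothetical model of one storage partition and its two copies."""
    name: str
    replica_healthy: Tuple[bool, bool]  # health of each of the two copies


def is_queryable(partition: Partition) -> bool:
    # Queries can be served if at least one copy is available.
    return any(partition.replica_healthy)


def unqueryable(partitions: List[Partition]) -> List[str]:
    # Names of partitions that have no healthy copy left.
    return [p.name for p in partitions if not is_queryable(p)]


partitions = [
    Partition("p0", (True, True)),    # fully healthy
    Partition("p1", (True, False)),   # degraded, still queryable
    Partition("p2", (False, False)),  # both copies down, queries fail
]
print(unqueryable(partitions))  # prints ['p2']
```

In this model, losing a single copy only reduces redundancy. Queries fail only for a partition whose two copies are down at the same time, which is the situation this incident describes.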

Trigger

This incident was triggered by two independent events coinciding:

Routine scaling operations
GCP maintenance operations

Contributing Factors

Disk scaling operations increase the pod recovery time, because rebuilding the storage pods after replacing the disks is slower than a typical restart.
When GCP replaces a node, after a 1h timeout expires, GCP proceeds

Timeline

16:32 UTC "partition not found" errors begin
16:52 UTC both partition replicas recovering
16:58 UTC GCP node replacement identified as the cause of the second replica going down
16:59 UTC storage node killed to force rescheduling
17:34 UTC storage replica killed (two partitions are down)
17:51 UTC first storage partition back online
18:04 UTC second storage partition back online
18:04 UTC service restored

Future Mitigations

Update storage-controller so that it does not recreate persistent volume claims while the other replica's pod is unhealthy
Accelerate pod recovery during disk scaling operations
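
The first mitigation amounts to a guard before a disruptive disk operation. The sketch below is illustrative only: the `Pod` class, the `next_action` function, and the pod names are hypothetical, and it does not reflect the actual storage-controller code or any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """Hypothetical view of one storage pod."""
    name: str
    healthy: bool


def next_action(target: Pod, peer: Pod) -> str:
    # Recreating the volume claim takes `target` offline while its disk
    # is rebuilt, so proceed only while the peer can keep serving queries.
    if peer.healthy:
        return f"recreate volume claim for {target.name}"
    return f"defer {target.name}: peer {peer.name} is unhealthy"


# Peer is healthy: the disk operation may proceed.
print(next_action(Pod("storage-a", True), Pod("storage-b", True)))
# Peer is down (for example, its node is being replaced): wait.
print(next_action(Pod("storage-a", True), Pod("storage-b", False)))
```

Deferring in the second case trades a slower scaling operation for keeping one copy online, which is the trade-off this mitigation aims for.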

Posted May 10, 2022 - 16:54 UTC

Resolved

This incident has been resolved.

Posted Apr 28, 2022 - 19:23 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Apr 28, 2022 - 18:53 UTC

Investigating

We are seeing elevated rates of 500 errors on queries in the GCP US Central region. Tasks that run during this outage will fail and will need to be retried manually once the outage is over.

Posted Apr 28, 2022 - 17:09 UTC

This incident affected: Cloud Serverless: GCP (API Queries, Tasks).