Elevated rate of 500 Errors while Querying in GCP US Central region
Incident Report for InfluxDB Cloud
Postmortem

Summary

From 16:32 UTC to 18:04 UTC on 2022-04-28, queries and tasks were unavailable for users in the GCP US Central region because the nodes hosting both replicas of two storage partitions went offline. Writes were not affected during this incident.

Background

InfluxDB Cloud stores data on multiple nodes so that it remains highly available. Each storage partition has two replicas, and at least one of them must be available for queries against that partition to succeed.
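
The availability rule above can be illustrated with a minimal sketch. It is not InfluxDB Cloud code: the `Replica` type, the node and partition names, and both helper functions are hypothetical and only model the idea that a partition stays queryable while at least one of its replicas is healthy.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    node: str      # hypothetical node name
    healthy: bool  # True if the replica can currently serve queries


def partition_queryable(replicas):
    """A partition can serve queries if at least one replica is healthy."""
    return any(r.healthy for r in replicas)


def unavailable_partitions(partitions):
    """Return the names of partitions that have no healthy replica."""
    return [
        name
        for name, replicas in partitions.items()
        if not partition_queryable(replicas)
    ]


# Illustrative state: partition-a has lost one replica and still serves
# queries; partition-b has lost both and does not.
partitions = {
    "partition-a": [Replica("node-1", True), Replica("node-2", False)],
    "partition-b": [Replica("node-3", False), Replica("node-4", False)],
}

print(unavailable_partitions(partitions))  # ['partition-b']
```

Losing a single replica is tolerated in this model. Queries fail only when both replicas of the same partition are down, which is the condition this incident reached for two partitions.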

Trigger

This incident was triggered by two independent events coinciding:

  1. Routine scaling operations 
  2. GCP maintenance operations

Contributing Factors

  1. Disk scaling operations increase the pod recovery time, because rebuilding the storage pods after replacing the disks is slower than a typical restart. 
  2. When GCP replaces a node, after a 1h timeout expires, GCP proceeds

Timeline

  • 16:32 UTC "partition not found" errors begin
  • 16:52 UTC both partition replicas recovering
  • 16:58 UTC GCP node replacement identified as the cause of the second replica going down
  • 16:59 UTC storage node killed to force rescheduling
  • 17:34 UTC storage replica killed (two partitions are down)
  • 17:51 UTC first storage partition back online
  • 18:04 UTC second storage partition back online
  • 18:04 UTC service restored

Future Mitigations

  • Update storage-controller to not recreate persistent volume claims while the other pod is unhealthy
  • Accelerate pod recovery during disk scaling operations
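
The first mitigation amounts to a guard in the controller's decision logic. The sketch below shows one way such a check could look. It makes no claim about how storage-controller is implemented: the function name, the replica names, and the "recreate"/"defer" results are assumptions made for illustration.

```python
def plan_volume_claim_recreation(replica_health, target):
    """Decide whether the target replica's volume claim may be recreated.

    replica_health maps a (hypothetical) replica name to True when that
    replica is healthy. Recreation is allowed only when every other
    replica of the partition is healthy, so that a disk scaling operation
    never takes down the last copy able to serve queries.
    """
    peers = [
        healthy for name, healthy in replica_health.items() if name != target
    ]
    if peers and all(peers):
        return "recreate"
    return "defer"


# Peer replica healthy: safe to proceed with the disk scaling operation.
print(plan_volume_claim_recreation(
    {"storage-0": True, "storage-1": True}, "storage-0"))   # recreate

# Peer replica down (for example, its node is being replaced): wait.
print(plan_volume_claim_recreation(
    {"storage-0": True, "storage-1": False}, "storage-0"))  # defer
```

Deferring rather than failing lets the scaling operation resume once the peer replica has recovered.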
Posted May 10, 2022 - 16:54 UTC

Resolved
This incident has been resolved.
Posted Apr 28, 2022 - 19:23 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 28, 2022 - 18:53 UTC
Investigating
There are elevated rates of 500 Errors while querying in GCP US Central region. Tasks during this outage will fail and need to be manually retried after the outage.
Posted Apr 28, 2022 - 17:09 UTC
This incident affected: Google Cloud: Iowa, US-Central (API Queries, Tasks).