Reads-Writes affected in eu-central-1
Incident Report for InfluxDB Cloud
Postmortem

Incident RCA

RCA - 503 errors in eu-central on March 7, 2023

Summary

We use HashiCorp’s Vault product to secure access to secrets within our clusters. Vault is deployed in a highly available active-standby configuration, with one active instance and two standby instances (the recommended HA configuration). On March 7, 2023, at 09:26 UTC, we saw a high rate of errors on the write and read paths in eu-central because Vault had become unreachable. We restarted Vault, then restarted the query, write, and storage pods in a controlled manner, and the cluster recovered.

Cause of the Incident

The root cause was our Vault infrastructure becoming overloaded. We use Vault to securely handle authorization tokens and store customer secrets. The number of requests became too high for Vault to respond in a timely manner, causing requests to fail. Because the write and read services rely on the secrets in Vault to process customer data, these services started failing health checks and were restarted in an attempt to recover. Each restart increased the load on Vault further, creating a positive feedback loop that meant the system could not recover without intervention.
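The feedback loop described above can be illustrated with a toy model. This is only a sketch under simplified assumptions, not a model of our production system: the function name `simulate_restart_loop`, the notion of a fixed per-step `capacity`, and all numbers are hypothetical and were not measured during the incident.

```python
def simulate_restart_loop(capacity, base_load, restart_cost, steps):
    """Toy model of a restart-driven feedback loop (all values hypothetical).

    Each step, requests beyond `capacity` fail. Every failed request is
    assumed to trigger a restart that adds `restart_cost` extra requests
    on top of `base_load` in the next step.
    Returns a list of (load, failed) tuples, one per step.
    """
    load = base_load
    history = []
    for _ in range(steps):
        failed = max(0, load - capacity)
        history.append((load, failed))
        # Restarts caused by failures add to the next step's load.
        load = base_load + failed * restart_cost
    return history


# Below capacity: nothing fails, so the load never grows.
stable = simulate_restart_loop(capacity=100, base_load=90, restart_cost=2, steps=4)

# Slightly above capacity: failures cause restarts, which cause more failures.
runaway = simulate_restart_loop(capacity=100, base_load=110, restart_cost=2, steps=4)
```

In this simplified model, load that starts below capacity stays flat, while load that starts only slightly above capacity grows at every step until something outside the loop reduces it.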

Recovery

As we learned in the incident of Feb 24, 2023, Vault can be overwhelmed when too many pods try to connect to it at the same time. To remediate, we scaled down the worker pools for all customer-facing services, which reduced the load on Vault enough for it to recover. We then slowly increased the pools back to their previous levels in a controlled manner to avoid stressing Vault further. This returned the cluster to normal operation.
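The scale-down and gradual scale-up described above could be sketched roughly as follows. This is an illustration only and not the tooling we used: `ramp_schedule`, `ramp_up`, the `set_pool_size` and `vault_healthy` callables, and the pause length are all hypothetical names and values.

```python
import time


def ramp_schedule(reduced, target, step):
    """Return pool sizes from `reduced` up to `target` in increments of `step`."""
    if step <= 0:
        raise ValueError("step must be positive")
    if reduced > target:
        raise ValueError("reduced size must not exceed target size")
    sizes = list(range(reduced, target, step))
    sizes.append(target)  # always finish exactly at the previous level
    return sizes


def ramp_up(sizes, set_pool_size, vault_healthy, wait=time.sleep, pause_seconds=60):
    """Apply each pool size in turn, pausing between steps.

    `set_pool_size` and `vault_healthy` are hypothetical callables supplied
    by the caller. The ramp stops early if Vault reports unhealthy, so an
    operator can intervene. Returns the list of sizes actually applied.
    """
    applied = []
    for size in sizes:
        if not vault_healthy():
            break
        set_pool_size(size)
        applied.append(size)
        wait(pause_seconds)
    return applied


schedule = ramp_schedule(reduced=2, target=10, step=3)
```

Stopping the ramp when the health check fails, instead of retrying automatically, is a deliberate choice in this sketch: it keeps the recovery procedure from adding load to an already struggling Vault.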

Future mitigations

  1. We are reviewing our Vault readiness check to determine why the active Vault instance did not fail over to a standby when it was no longer able to respond to incoming requests.
  2. We are investigating strategies to reduce the number of components that rely on Vault access to work correctly.
Posted Mar 17, 2023 - 21:50 UTC

Resolved
This incident has been resolved.
Posted Mar 07, 2023 - 20:48 UTC
Update
We are continuing to monitor for any further issues.
Posted Mar 07, 2023 - 17:09 UTC
Monitoring
Reads and writes are back to normal levels. We are continuing to monitor.
Posted Mar 07, 2023 - 13:21 UTC
Update
Writes have been returned to full capacity. We are now slowly increasing the query capacity back to previous levels.
Posted Mar 07, 2023 - 12:31 UTC
Identified
The dev team has identified the issue and is working on a fix. We will have further updates soon.
Posted Mar 07, 2023 - 11:19 UTC
Update
The investigation continues. There are no further updates.
Posted Mar 07, 2023 - 10:28 UTC
Investigating
The dev team is currently working on this. We will keep you updated on the progress.
Posted Mar 07, 2023 - 09:53 UTC
This incident affected: AWS: Frankfurt, EU-Central-1 (API Writes, API Queries).