Queries timing Out
Incident Report for InfluxDB Cloud
Postmortem

Incident RCA

Queries timing Out

Summary

A bulk export job was running to export a customer’s data.  As the bulk export was taking a long time, we allocated more resources to the export job.  The export job uses ephemeral disks, and runs on the same nodes as other services, such as the query nodes. As the job ran on many nodes, it consumed the ephemeral disks on the shared nodes, which impacted the other services, causing queries to fail. When we were alerted to the query failures, we stopped the bulk export job and the cluster recovered.  We will be reworking the bulk export to run on its dedicated PVC, so that it cannot impact the other services in the cluster.

We are sorry for the service disruption that this caused.

Posted Feb 22, 2023 - 22:40 UTC

Resolved
This incident has been resolved. Error rates are back to normal across the cluster.
Posted Feb 22, 2023 - 21:47 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 22, 2023 - 19:45 UTC
Investigating
We are currently investigating this issue.
Posted Feb 22, 2023 - 19:23 UTC
This incident affected: AWS: Oregon, US-West-2-1 (API Queries).