Queries Timing Out

Incident Report for InfluxDB Cloud

Postmortem

Incident RCA

Summary

A bulk export job was running to export a customer’s data. Because the export was taking a long time, we allocated more resources to the job. The export job uses ephemeral disks and runs on the same nodes as other services, such as the query nodes. Once the job was running on many nodes, it consumed the ephemeral disks on those shared nodes, which impacted the other services and caused queries to fail. When we were alerted to the query failures, we stopped the bulk export job and the cluster recovered. We will be reworking the bulk export to run on its own dedicated PVC (PersistentVolumeClaim), so that it cannot impact the other services in the cluster.
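
As a rough illustration of the planned change, the sketch below builds a Kubernetes Job manifest whose scratch space is backed by a PersistentVolumeClaim instead of node-local ephemeral storage. It is not the actual InfluxDB Cloud configuration: the function name, job name, claim name, image, and mount path are all hypothetical placeholders chosen for the example.

```python
import json


def bulk_export_job(name, claim_name, image):
    """Return a Kubernetes Job manifest (plain dict) that writes to a PVC.

    All names passed in are hypothetical; this only shows the shape of a
    Job whose working volume is a PersistentVolumeClaim rather than an
    emptyDir on the shared node's ephemeral disk.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "export",
                            "image": image,
                            # The export writes its working files here.
                            "volumeMounts": [
                                {"name": "scratch", "mountPath": "/scratch"}
                            ],
                        }
                    ],
                    "volumes": [
                        {
                            "name": "scratch",
                            # Dedicated claim: usage is capped by the PVC's
                            # size and does not draw on the node's disk.
                            "persistentVolumeClaim": {"claimName": claim_name},
                        }
                    ],
                }
            }
        },
    }


if __name__ == "__main__":
    manifest = bulk_export_job(
        "bulk-export",                         # hypothetical job name
        "bulk-export-scratch",                 # hypothetical PVC name
        "example.invalid/bulk-export:latest",  # hypothetical image
    )
    print(json.dumps(manifest, indent=2))
```

With a layout along these lines, a runaway export can at most fill its own claim, while the query nodes keep their ephemeral disk to themselves.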

We are sorry for the service disruption that this caused.

Posted Feb 22, 2023 - 22:40 UTC

Resolved

This incident has been resolved. Error rates are back to normal across the cluster.
Posted Feb 22, 2023 - 21:47 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Feb 22, 2023 - 19:45 UTC

Investigating

We are currently investigating this issue.
Posted Feb 22, 2023 - 19:23 UTC

This incident affected: Cloud Serverless: AWS, US-West-2-1 (API Queries).