Degraded query performance on AWS us-east-1
Incident Report for InfluxDB Cloud
Postmortem

Background

InfluxDB is a Kubernetes application. Among its core services are a compute tier and a storage tier. Roughly, when a user submits a query, the compute tier queues the request, calculates what to request from the storage tier, sends that request to the storage tier, completes any computations not fulfilled by the storage tier, and returns the data to the requestor.
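
For orientation only, that flow can be sketched in Go. Everything in this snippet (service names, types, and signatures) is hypothetical and is not InfluxDB's actual code:

```go
// Hypothetical sketch of the query path described above. All names,
// types, and signatures are illustrative only.
package main

import (
	"context"
	"fmt"
)

// StorageTier serves the parts of a query that can be answered
// directly from stored data.
type StorageTier interface {
	Read(ctx context.Context, req StorageRequest) (PartialResult, error)
}

type StorageRequest struct{ Predicate string }
type PartialResult struct{ Rows []string }
type Result struct{ Rows []string }

// ComputeTier queues incoming queries, plans storage reads, and
// finishes any computation the storage tier does not perform.
type ComputeTier struct {
	queue   chan struct{} // bounds the number of in-flight queries
	storage StorageTier
}

// HandleQuery follows the steps outlined above.
func (c *ComputeTier) HandleQuery(ctx context.Context, q string) (Result, error) {
	// 1. Queue the request (wait for a slot).
	select {
	case c.queue <- struct{}{}:
	case <-ctx.Done():
		return Result{}, ctx.Err()
	}
	defer func() { <-c.queue }()

	// 2. Calculate what to request from the storage tier.
	req := StorageRequest{Predicate: q}

	// 3. Send the request to the storage tier.
	partial, err := c.storage.Read(ctx, req)
	if err != nil {
		return Result{}, fmt.Errorf("storage read: %w", err)
	}

	// 4. Complete any computation not fulfilled by storage and
	//    return the data to the requestor.
	return Result{Rows: partial.Rows}, nil
}

// fakeStorage stands in for the real storage tier in this sketch.
type fakeStorage struct{}

func (fakeStorage) Read(_ context.Context, req StorageRequest) (PartialResult, error) {
	return PartialResult{Rows: []string{"rows matching " + req.Predicate}}, nil
}

func main() {
	ct := &ComputeTier{queue: make(chan struct{}, 8), storage: fakeStorage{}}
	res, err := ct.HandleQuery(context.Background(), "temperature > 30")
	fmt.Println(res, err)
}
```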

Trigger

Unknown at this time. The team’s current diagnosis is that an as-yet-unidentified anomaly in the storage tier may have caused query processing to slow down (for example, by starving processes of CPU), and that the anomaly was resolved when an unrelated deployment caused all of the containers in the storage tier to restart.

Contributing Factors

No known contributing factors. The team was able to rule out common causes and contributing factors such as:

  1. High query workload
  2. High rate of deletes across the cluster
  3. Buggy or malicious queries or related workloads
  4. New workloads triggering previously known bugs

The team did observe some other factors which were correlated in time with the incident:

  1. External requests reported a 0% error rate for the duration of the incident. This suggests that metrics were not being collected during the event, so this is most likely another symptom.
  2. A massive 25x spike in query requests was reported in the Kubernetes metrics but was not reflected in the billing or other metrics. Again, this appears to be a further symptom.
  3. A significant number of long-running processes in the storage containers. These may be related to the anomaly, or may be the anomaly itself, but the processes had been running for hours before the incident began.

Customer Impact

Query duration slowed dramatically for all users on the cluster. Long-running queries that previously completed timed out during the incident. Writes were unaffected, and the API was otherwise unaffected.

Future Mitigations

The team continues to investigate the root cause. The next step is to enhance logging of query behavior in the storage tier to assist in ruling out the role of long-running processes in the incident.
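
As a rough illustration of the kind of instrumentation this could involve (the function, names, and threshold below are hypothetical and not the team's actual plan), the storage tier could log per-query durations so that long-running work is visible in the logs and can be correlated with incidents:

```go
// Hypothetical sketch of per-query duration logging in a storage
// service; not InfluxDB's actual instrumentation.
package main

import (
	"context"
	"log"
	"time"
)

// logQueryDuration wraps a storage read and records how long it ran,
// flagging anything over a threshold so long-running work shows up
// in the logs and can be correlated with incidents later.
func logQueryDuration(ctx context.Context, queryID string, read func(context.Context) error) error {
	start := time.Now()
	err := read(ctx)
	elapsed := time.Since(start)

	const slowThreshold = 30 * time.Second // arbitrary example cutoff
	if elapsed > slowThreshold {
		log.Printf("SLOW storage query id=%s duration=%s err=%v", queryID, elapsed, err)
	} else {
		log.Printf("storage query id=%s duration=%s err=%v", queryID, elapsed, err)
	}
	return err
}

func main() {
	_ = logQueryDuration(context.Background(), "q-123", func(ctx context.Context) error {
		time.Sleep(50 * time.Millisecond) // stand-in for a real storage read
		return nil
	})
}
```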

Posted Oct 06, 2021 - 23:45 UTC

Resolved
The incident is now resolved and normal operations have resumed; however, the team will continue to monitor.
Posted Oct 05, 2021 - 21:13 UTC
Update
Queries are running more slowly than normal. The team is investigating. Slower queries that normally complete may be timing out. Writes appear unaffected. There is no data loss.
Posted Oct 05, 2021 - 20:48 UTC
Investigating
There's been a report of degraded query performance, and we are investigating.
Posted Oct 05, 2021 - 20:44 UTC
This incident affected: AWS: Virginia, US-East-1 (API Queries).