Query degradation in eu-central-1
Incident Report for InfluxDB Cloud
Postmortem

RCA: Query Degradation in eu-central-1 on Jan 9, 2024

Background

Data stored in InfluxDB Cloud is distributed across 64 partitions. Partition assignment uses a persistent hash of the series key, spreading write and query load evenly.
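To make the mechanism concrete, here is a minimal sketch of deterministic partition selection over a fixed 64-partition layout. The function name and hash choice (FNV-1a) are assumptions for illustration, not the actual implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numPartitions = 64

// partitionFor maps a series key to one of 64 partitions.
// Hypothetical sketch: the real hash function and key layout are
// internal details; FNV-1a stands in here for illustration.
func partitionFor(seriesKey string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(seriesKey))
	return h.Sum32() % numPartitions
}

func main() {
	// The same series key always lands on the same partition.
	fmt.Println(partitionFor("cpu,host=server-01,region=eu-central-1"))
}
```

Because the mapping is stable, a dense series concentrates its load on one partition, which is consistent with this incident affecting exactly one partition.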

When users write data into InfluxDB Cloud, their writes first enter a durable queue. Storage pods consume ingested data from the queue, allowing writes to be accepted even during storage issues.
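As a rough sketch of this decoupling, assume a channel stands in for the durable queue (in production the queue is a persisted log, so accepted writes survive storage outages):

```go
package main

import (
	"fmt"
	"time"
)

// writeQueue stands in for the durable queue; assumed shape only.
var writeQueue = make(chan string, 1024)

// acceptWrite acknowledges the write as soon as it is enqueued,
// independent of storage health.
func acceptWrite(lineProtocol string) {
	writeQueue <- lineProtocol
}

// storageConsumer drains the queue at whatever rate storage can
// sustain; if it falls behind (e.g. CPU starvation), TTBR grows.
func storageConsumer() {
	for w := range writeQueue {
		time.Sleep(10 * time.Millisecond) // stand-in for persisting the point
		fmt.Println("readable:", w)
	}
}

func main() {
	go storageConsumer()
	acceptWrite("cpu,host=a value=1")
	time.Sleep(100 * time.Millisecond)
}
```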

Time To Become Readable (TTBR) measures the time between a write being accepted and its data becoming available for queries.
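A minimal sketch of how such a measurement could be taken, assuming a hypothetical isReadable probe that would issue a real query for the written point:

```go
package main

import (
	"fmt"
	"time"
)

// measureTTBR polls until a write becomes queryable and returns the
// elapsed time since the write was accepted. isReadable is a
// hypothetical probe, not part of any real API.
func measureTTBR(accepted time.Time, isReadable func() bool) time.Duration {
	for !isReadable() {
		time.Sleep(50 * time.Millisecond)
	}
	return time.Since(accepted)
}

func main() {
	accepted := time.Now()
	deadline := accepted.Add(200 * time.Millisecond)
	// Stand-in probe: pretend the point becomes readable after 200ms.
	ttbr := measureTTBR(accepted, func() bool { return time.Now().After(deadline) })
	fmt.Println("TTBR:", ttbr)
}
```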

Summary

On January 9, 2024, a single partition experienced significant increases in TTBR, causing delays in data availability for queries. CPU usage on the pods responsible for this partition rose to high levels.

An investigation revealed a noisy neighbor issue caused by a small organization with infinite retention running resource-intensive queries.

Internal Visibility of Issue

Identifying the affected queries took longer than usual due to:

Queries timing out in the query tier but continuing to run on storage, creating a disconnect in observed logs (see the sketch after this list).

Because the organization was small, it did not appear prominently in aggregate metrics.

The failing queries represented a tiny proportion of the organization's usage, so shifts in query success ratios were minimal.

Metrics and logs relied on completed gRPC calls, which were not completing for the problematic queries.
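The first and last points above combine badly for observability. The sketch below, under the assumption that the storage-side scan never checks for cancellation, shows how a query can be logged as timed out on the query tier while its work keeps consuming CPU on storage, invisible to completion-based metrics:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scanPartition stands in for the storage-side work. Because the
// loop never checks ctx.Done(), cancellation from the query tier
// does not stop it: the CPU keeps burning after the timeout.
func scanPartition(ctx context.Context) {
	for i := 0; ; i++ {
		_ = i * i // stand-in for scanning a dense series
		// A cooperative implementation would do:
		// select { case <-ctx.Done(): return; default: }
	}
}

func main() {
	// Query tier: attach a deadline, as a gRPC client would.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	go scanPartition(ctx)

	<-ctx.Done()
	// The call is logged as timed out here, while the goroutine above
	// keeps consuming CPU; metrics tied to completed calls never see it.
	fmt.Println("query tier gave up:", ctx.Err())
}
```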

Cause

The issue was identified as a noisy neighbor problem: the resource usage of a single user impacting other users.

Resources were consumed by a relatively small organization attempting to run an expensive function against all data in a dense series.

The timed-out queries continued to consume resources on the storage tier, eventually leaving insufficient CPU for reliable queue consumption and pushing TTBR up.

Mitigation

Additional compute resources were deployed to absorb the extra load and allow a smooth recovery without further customer-visible impact.

Prevention

Planned or ongoing changes include:

Improvements to profiling, to report usage per organization (one possible shape is sketched after this list).

Enhancements in visualization to facilitate easier identification of noisy neighbors.
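One plausible shape for the per-organization reporting mentioned above, sketched with the Prometheus client library; the metric name, labels, and helper are assumptions, not the actual instrumentation:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// queryCPUSeconds is a hypothetical per-organization counter; with it,
// a small org running expensive queries surfaces by resource usage
// rather than by query volume or success ratio.
var queryCPUSeconds = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "storage_query_cpu_seconds_total",
		Help: "CPU seconds spent serving queries, by organization.",
	},
	[]string{"org_id"},
)

func init() {
	prometheus.MustRegister(queryCPUSeconds)
}

// recordQueryUsage would be called from the query path, including for
// queries that time out, so abandoned work is still attributed.
func recordQueryUsage(orgID string, cpu time.Duration) {
	queryCPUSeconds.WithLabelValues(orgID).Add(cpu.Seconds())
}

func main() {
	recordQueryUsage("org-123", 2500*time.Millisecond)
}
```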

Posted Mar 29, 2024 - 00:42 UTC

Resolved
This incident has been resolved.
Posted Jan 09, 2024 - 22:20 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 09, 2024 - 19:49 UTC
Investigating
We are aware of query degradation in eu-central-1; the team is currently investigating.
Posted Jan 09, 2024 - 18:25 UTC
This incident affected: AWS: Frankfurt, EU-Central-1 (API Queries, Tasks).