Degraded query performance: EU-Central-1
Incident Report for InfluxDB Cloud


In simplified terms, the InfluxDB storage engine has two essential elements.

  1. The storage nodes, which run the code that writes new data to disk and retrieves requested data from disk.
  2. The storage partitions, the files in which the data is actually stored.

The storage nodes pick up new data from a Kafka queue, process it, and write it to disk. They also process queries and return data to the user.

When data is written, it is stored in memory for a period of time in addition to being persisted. In certain cases, such as when data is being quickly overwritten and duplicated, the process of merging the data from disk and memory during queries can be computationally expensive. 
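The cost described above can be sketched in a few lines. This is a hypothetical illustration, not the actual engine code: at query time, points from disk and points still in memory must be merged, and when the same timestamps have been overwritten repeatedly, every query pays for the deduplication.

```python
# Hypothetical sketch (not the actual engine code): merging in-memory and
# on-disk points at query time, keeping the newest value per timestamp.
def merge_for_query(disk_points, memory_points):
    """Both inputs: dicts of timestamp -> (value, write_sequence)."""
    merged = dict(disk_points)
    for ts, (value, seq) in memory_points.items():
        # An overwrite wins if it was written later (higher sequence number).
        if ts not in merged or seq > merged[ts][1]:
            merged[ts] = (value, seq)
    return {ts: v for ts, (v, _) in sorted(merged.items())}

disk = {100: (1.0, 1), 200: (2.0, 2)}
mem = {100: (9.9, 3), 300: (3.0, 4)}   # timestamp 100 was overwritten in memory
print(merge_for_query(disk, mem))      # {100: 9.9, 200: 2.0, 300: 3.0}
```

When data is being rapidly overwritten, this merge must be repeated for every query, which is where the computational expense comes from.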


A user wrote a very tight loop of code to backfill historical data. The loop also queried the data as it was written, running a query after every write. Like every user, their data was stored on a subset of the storage nodes.
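The pattern can be illustrated with a short sketch. The client API names here are assumptions for illustration, not the user's actual code; the point is the contrast between a read-back after every write and a batched backfill.

```python
# The problematic pattern: one query issued after every single write.
def backfill_with_readback(client, points):
    for p in points:
        client.write([p])      # one point per request
        client.query(p["ts"])  # unnecessary read-back after every write

# The intended pattern: batch the writes, verify once at the end if needed.
def backfill_batched(client, points, batch_size=5000):
    for i in range(0, len(points), batch_size):
        client.write(points[i:i + batch_size])
    client.query(points[-1]["ts"])  # a single spot check, if needed at all
```

For a backfill of millions of points, the first pattern multiplies the query load by the number of points written, which is what drove the expensive merges described above.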

Due to the nature of the data being backfilled and the queries against it, the impacted storage nodes were performing an extreme number of these expensive merges, causing a dramatic spike in CPU utilization on those nodes.

This high CPU utilization left the storage nodes without enough CPU to process new writes from the Kafka queue at the normal rate. As a result, there was a long delay before written data became available for reads, and queries returned incomplete data from the impacted storage nodes during the incident.
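The mechanism behind the incomplete reads can be shown with a minimal sketch (not the real pipeline): writes land in a queue first and only become readable after a storage node processes them, so a node that falls behind answers queries without the most recent data.

```python
from collections import deque

queue = deque()   # stand-in for the Kafka partition
readable = []     # points the storage node has persisted and can serve

def write(point):
    queue.append(point)

def process(n):
    # A CPU-starved node can only process a few queued writes per cycle.
    for _ in range(min(n, len(queue))):
        readable.append(queue.popleft())

for p in range(10):   # writes arrive faster than the node can drain them
    write(p)
process(4)
print(readable)       # [0, 1, 2, 3] -- points 4..9 written but not yet readable
```

No data is lost in this situation: the queued points are still in the queue and become readable once the node catches up, which matches the recovery observed later in the timeline.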

Contributing Factors

  1. The user who ran this very tight loop misunderstood the technical documentation for writing historical data and therefore believed that a read after every write was necessary.

Customer Impact

Customers with data on the impacted nodes experienced:

  1. Increased query latencies for the duration of the incident.
  2. Incomplete data returned in their queries for recently written data.

Other customers were not impacted. No data was lost. The API continued to operate normally. 


Timeline

All times are in UTC:

23:45 - Storage Team received a TTBR (“time to become readable”) alert, signifying a delay before written data became readable. Upon investigation, the TTBR had already recovered to normal levels.

00:14 - Storage Team received another TTBR alert, but upon investigation saw that the TTBR was continuing to climb on a single storage node. The team made the investigation their top priority.

00:45 - The team determined that a single customer's workload must be causing the issue. They contacted Support to help identify the user and to update the status page to reflect that performance was degraded for some users.

00:49 - The team initiated the incident response process. More engineers joined the investigation and identified the user who triggered the incident.

01:21 - The team used a feature flag to target the specific customer and stop their queries from running. Support reached out to the customer and initiated a discussion.

01:46 - The team observed that the impacted storage nodes were quickly catching up with the queued writes. The team continued to monitor before finally closing the incident.

Future Mitigations

  1. We are already implementing a governor on total query time and expect to roll it out over the coming weeks. A customer consuming too much query time will be automatically rate limited, which would have arrested the behavior observed in this incident by denying the customer further query time.
  2. We are adjusting the documentation and providing improved sample code to help customers who are backfilling data use the API correctly.
  3. The team will investigate other direct mitigations within the storage nodes themselves.
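The query-time governor described in mitigation 1 could take a shape like the following. This is a hypothetical sketch under assumed thresholds, not the actual implementation: track each customer's total query time over a sliding window and reject further queries once a budget is exhausted.

```python
import time

# Hypothetical sketch of a per-customer query-time governor. The budget and
# window values are made-up assumptions, not production thresholds.
class QueryTimeGovernor:
    def __init__(self, budget_seconds=30.0, window_seconds=60.0):
        self.budget = budget_seconds
        self.window = window_seconds
        self.usage = {}  # customer_id -> list of (timestamp, duration)

    def _spent(self, customer_id, now):
        # Keep only query durations recorded inside the sliding window.
        recent = [(t, d) for t, d in self.usage.get(customer_id, [])
                  if now - t < self.window]
        self.usage[customer_id] = recent
        return sum(d for _, d in recent)

    def allow(self, customer_id, now=None):
        now = time.monotonic() if now is None else now
        return self._spent(customer_id, now) < self.budget

    def record(self, customer_id, duration, now=None):
        now = time.monotonic() if now is None else now
        self.usage.setdefault(customer_id, []).append((now, duration))

gov = QueryTimeGovernor(budget_seconds=10.0)
gov.record("cust-a", 12.0, now=0.0)    # one very expensive query
print(gov.allow("cust-a", now=1.0))    # False: budget exhausted, rate limited
print(gov.allow("cust-b", now=1.0))    # True: other customers unaffected
```

A governor of this shape isolates the noisy customer without affecting anyone else, which is the property the incident called for.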
Posted Oct 05, 2021 - 02:26 UTC

This incident has been resolved.
Posted Oct 04, 2021 - 01:59 UTC
Users in the EU-Central region are experiencing degraded query performance. Our engineering team is actively investigating the issue and will provide further updates as they become available.
Posted Oct 04, 2021 - 00:59 UTC
This incident affected: AWS: Frankfurt, EU-Central-1 (API Queries).