Degraded query performance: EU-Central-1
Incident Report for InfluxDB Cloud


In simplified terms, the InfluxDB storage engine has two essential elements.

  1. The storage nodes, which run the code that writes new data to disk and retrieves requested data from disk.
  2. The storage partitions, the files in which the data is actually stored.

The storage nodes pick up new data from a Kafka queue, process it, and write it to disk. They also process queries and return data to the user.

When data is written, it is stored in memory for a period of time in addition to being persisted. In certain cases, such as when data is being quickly overwritten and duplicated, the process of merging the data from disk and memory during queries can be computationally expensive. 
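The cost described above can be sketched in a few lines. This is a hypothetical illustration, not the actual engine code: at query time, points from disk and points still in memory must be merged, and when the same timestamps have been overwritten repeatedly, every query pays for the deduplication.

```python
# Hypothetical sketch (not the actual engine code): merging in-memory and
# on-disk points at query time, keeping the newest value per timestamp.
def merge_for_query(disk_points, memory_points):
    """Both inputs: dicts of timestamp -> (value, write_sequence)."""
    merged = dict(disk_points)
    for ts, (value, seq) in memory_points.items():
        # An overwrite wins if it was written later (higher sequence number).
        if ts not in merged or seq > merged[ts][1]:
            merged[ts] = (value, seq)
    return {ts: v for ts, (v, _) in sorted(merged.items())}

disk = {100: (1.0, 1), 200: (2.0, 2)}
mem = {100: (9.9, 3), 300: (3.0, 4)}   # timestamp 100 was overwritten in memory
print(merge_for_query(disk, mem))      # {100: 9.9, 200: 2.0, 300: 3.0}
```

When data is being rapidly overwritten, this merge must be repeated for every query, which is where the computational expense comes from.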


A user wrote a very tight loop of code to backfill historical data. The loop also queried the data as it was written, running a query after every write. Like every user, their data was stored on a subset of the storage nodes.
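The pattern can be illustrated with a short sketch. The client API names here are assumptions for illustration, not the user's actual code; the point is the contrast between a read-back after every write and a batched backfill.

```python
# The problematic pattern: one query issued after every single write.
def backfill_with_readback(client, points):
    for p in points:
        client.write([p])      # one point per request
        client.query(p["ts"])  # unnecessary read-back after every write

# The intended pattern: batch the writes, verify once at the end if needed.
def backfill_batched(client, points, batch_size=5000):
    for i in range(0, len(points), batch_size):
        client.write(points[i:i + batch_size])
    client.query(points[-1]["ts"])  # a single spot check, if needed at all
```

For a backfill of millions of points, the first pattern multiplies the query load by the number of points written, which is what drove the expensive merges described above.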

Due to the nature of the data being backfilled and the queries against it, the impacted storage nodes were performing an extreme number of these expensive merges, causing a dramatic spike in CPU utilization on those nodes.

This high CPU utilization left the storage nodes without enough CPU to process new writes from the Kafka queue at the normal rate. As a result, there was a long delay before written data became available for reads, and queries returned incomplete data from the impacted storage nodes during the incident.
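The mechanism behind the incomplete reads can be shown with a minimal sketch (not the real pipeline): writes land in a queue first and only become readable after a storage node processes them, so a node that falls behind answers queries without the most recent data.

```python
from collections import deque

queue = deque()   # stand-in for the Kafka partition
readable = []     # points the storage node has persisted and can serve

def write(point):
    queue.append(point)

def process(n):
    # A CPU-starved node can only process a few queued writes per cycle.
    for _ in range(min(n, len(queue))):
        readable.append(queue.popleft())

for p in range(10):   # writes arrive faster than the node can drain them
    write(p)
process(4)
print(readable)       # [0, 1, 2, 3] -- points 4..9 written but not yet readable
```

No data is lost in this situation: the queued points are still in the queue and become readable once the node catches up, which matches the recovery observed later in the timeline.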

Contributing Factors

  1. The user who ran this very tight loop misunderstood the technical documentation for writing historical data and therefore believed that a read after every write was necessary.

Customer Impact

Customers with data on the impacted nodes experienced:

  1. Increased query latencies for the duration of the incident.
  2. Incomplete data returned in their queries for recently written data.

Other customers were not impacted. No data was lost. The API continued to operate normally. 


Timeline

All times are in UTC:

23:45 - Storage Team received a TTBR (“time to become readable”) alert, signifying a delay before written data became readable. Upon investigation, the TTBR had already recovered to normal levels.

00:14 - Storage Team received another TTBR alert, but upon investigation saw that the TTBR was continuing to climb on a single storage node. The team made the investigation their top priority.

00:45 - The team determined that a single customer's workload must be causing the issue. They contacted Support to help identify the user and to update the status page to reflect that performance was degraded for some users.

00:49 - The team initiated the incident response process. More engineers joined the investigation and identified the user who triggered the incident.

01:21 - The team used a feature flag to target the specific customer and stop their queries from running. Support reached out to the customer and initiated a discussion.

01:46 - The team observed that the impacted storage nodes were quickly catching up with the queued writes. The team continued to monitor before finally closing the incident.

Future Mitigations

  1. We are already implementing a governor on total query time and expect to roll it out over the coming weeks. A customer consuming too much query time will be automatically rate limited, which would have arrested the behavior observed in this incident by denying the customer further query time.
  2. We are adjusting the documentation and providing improved sample code to help customers who are backfilling data use the API correctly.
  3. The team will investigate other direct mitigations within the storage nodes themselves.
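The query-time governor described in mitigation 1 could take a shape like the following. This is a hypothetical sketch under assumed thresholds, not the actual implementation: track each customer's total query time over a sliding window and reject further queries once a budget is exhausted.

```python
import time

# Hypothetical sketch of a per-customer query-time governor. The budget and
# window values are made-up assumptions, not production thresholds.
class QueryTimeGovernor:
    def __init__(self, budget_seconds=30.0, window_seconds=60.0):
        self.budget = budget_seconds
        self.window = window_seconds
        self.usage = {}  # customer_id -> list of (timestamp, duration)

    def _spent(self, customer_id, now):
        # Keep only query durations recorded inside the sliding window.
        recent = [(t, d) for t, d in self.usage.get(customer_id, [])
                  if now - t < self.window]
        self.usage[customer_id] = recent
        return sum(d for _, d in recent)

    def allow(self, customer_id, now=None):
        now = time.monotonic() if now is None else now
        return self._spent(customer_id, now) < self.budget

    def record(self, customer_id, duration, now=None):
        now = time.monotonic() if now is None else now
        self.usage.setdefault(customer_id, []).append((now, duration))

gov = QueryTimeGovernor(budget_seconds=10.0)
gov.record("cust-a", 12.0, now=0.0)    # one very expensive query
print(gov.allow("cust-a", now=1.0))    # False: budget exhausted, rate limited
print(gov.allow("cust-b", now=1.0))    # True: other customers unaffected
```

A governor of this shape isolates the noisy customer without affecting anyone else, which is the property the incident called for.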
Posted Oct 05, 2021 - 02:26 UTC

This incident has been resolved.
Posted Oct 04, 2021 - 01:59 UTC
Users in the EU-Central region are experiencing degraded query performance. Our engineering team is actively investigating the issue and will provide further updates as they become available.
Posted Oct 04, 2021 - 00:59 UTC
This incident affected: AWS: Frankfurt, EU-Central-1 (API Queries).