Sustained query error-rate in AWS eu-central-1 on June 7, 2024
Data stored in InfluxDB Cloud is distributed across 64 partitions. Distribution is performed using a persistent hash of the series key, with the intent that the write and query load will, on average, be distributed evenly across partitions.
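As a rough sketch of this scheme (the real hash function and series-key format are internal details, so the specifics below are illustrative), partition assignment amounts to a stable hash of the series key reduced modulo the partition count:

```python
import hashlib

NUM_PARTITIONS = 64

def partition_for(series_key: str) -> int:
    """Illustrative only: map a series key to one of the 64 partitions
    using a stable hash, so the same series always lands on the same
    partition and load spreads evenly across partitions on average."""
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# The same series key always maps to the same partition.
print(partition_for("cpu,host=server01,region=eu-central-1"))
```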
When an InfluxDB Cloud user writes data, their writes first go into a durable queue. Rather than being written to directly by users, storage pods consume ingested data from the other end of this queue; amongst other things, this allows writes to continue being accepted even if storage is encountering issues.
One of the metrics used to reflect the status of this pipeline is Time To Become Readable (TTBR): the time between a write being accepted and its data passing through the queue to become available to queries.
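A minimal sketch of how such a metric could be derived on the storage side (the types and field names here are ours, not InfluxDB's):

```python
import time
from dataclasses import dataclass

@dataclass
class QueuedWrite:
    payload: bytes
    accepted_at: float   # when the write entered the durable queue

def consume(writes: list[QueuedWrite]) -> list[float]:
    """Illustrative storage-side consumer: apply each queued write, then
    record how long it took to become readable. If consumers stall (for
    example, because pods are restarting), TTBR rises even though writes
    are still being safely accepted into the queue."""
    ttbr_samples = []
    for w in writes:
        # ... apply w.payload to local storage so it becomes queryable ...
        ttbr_samples.append(time.time() - w.accepted_at)
    return ttbr_samples
```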
To respond to a query, the compute tier needs to request the relevant data from each of the 64 partitions. For a query to succeed, the compute tier must receive a response from every partition; this ensures that incomplete results are never returned. Each partition has multiple pods responsible for it, and query activity is distributed across them.
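A hedged sketch of that fan-out requirement (partition addressing and transport are simplified away here):

```python
import asyncio

NUM_PARTITIONS = 64

async def fetch_partition(partition: int) -> list:
    """Stand-in for a call to one of the pods responsible for a partition."""
    await asyncio.sleep(0)   # placeholder for network I/O
    return []                # placeholder for that partition's rows

async def run_query() -> list:
    # Fan out to every partition and require every answer: if any
    # partition cannot respond, gather() raises and the query fails
    # rather than returning an incomplete result.
    results = await asyncio.gather(*(fetch_partition(p) for p in range(NUM_PARTITIONS)))
    return [row for part in results for row in part]

asyncio.run(run_query())
```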
On the 7th of June 2024, partition 44 started to report large increases in TTBR. This meant that, whilst customers' writes were being safely accepted into the durable queue, they were delayed in becoming available to queries.
At around the same time, alerts were received indicating an elevated query failure rate, accompanied by an increase in the query queue depth.
Investigation showed that the pods responsible for partition 44 were periodically trying to consume more RAM than permitted, causing them to exit and report an out-of-memory (OOM) event.
InfluxData allocated additional RAM to the pods to try to mitigate the customer-facing impact quickly. However, they continued to OOM, so the investigation moved on to identifying the source of the excessive resource usage.
In a multi-tenant system, one user's resource usage impacting other users is known as a noisy neighbor issue. The best way to address it is to identify the tenant whose query is causing the problem and temporarily block their queries while we engage with them to correct it.
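In practice that mitigation can be as simple as an admission check at the query tier; the sketch below only illustrates the idea and is not InfluxDB Cloud's actual mechanism (the names and tenant ID are hypothetical):

```python
class QueryTemporarilyBlocked(Exception):
    pass

# Hypothetical blocklist of tenant (organisation) IDs whose queries are
# rejected before they reach storage, protecting other tenants while the
# problematic query is corrected.
BLOCKED_ORGS: set[str] = {"org-1234"}

def admit_query(org_id: str, query_text: str) -> None:
    if org_id in BLOCKED_ORGS:
        raise QueryTemporarilyBlocked(
            f"queries for {org_id} are temporarily blocked"
        )
    # ... otherwise dispatch the query to the compute tier ...
```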
In this case, the customer had automated the execution of a query that attempted to run an in-memory sort() against data taken from a particularly dense series.
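To see why that pattern is risky: a sort cannot emit its first result until it has materialised every point, so memory grows with the size of the series. This toy Python example (not the customer's actual query) shows the shape of the problem:

```python
def dense_series(n: int):
    """Stand-in for streaming points out of a particularly dense series."""
    for i in range(n):
        yield (i, (i * 2654435761) % 1000)   # (timestamp, value)

# sorted() must buffer every point before producing any output, so the
# memory needed scales with the number of points in the series; on a
# dense enough series this exceeds the pod's memory limit.
top_values = sorted(dense_series(1_000_000), key=lambda p: p[1], reverse=True)
```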
Because the problematic query was submitted regularly, RAM consumption kept climbing until the storage pods ultimately OOMed. As a result of these repeated OOMs, Kubernetes moved the pods into a CrashLoopBackOff state, which lengthened the recovery time after each OOM.
These extended recovery periods meant that, at times, every pod responsible for partition 44 was offline simultaneously, preventing the query tier from authoritatively answering queries.
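For context, Kubernetes backs off restarts of a crash-looping container exponentially (by default starting at roughly 10 seconds and doubling up to a five-minute cap), which is why each successive OOM lengthened the window in which the pods were unavailable. A rough sketch of that schedule:

```python
def crashloop_delays(restarts: int, base: float = 10.0, cap: float = 300.0) -> list[float]:
    """Approximate kubelet restart back-off (seconds) before each attempt."""
    return [min(base * 2 ** n, cap) for n in range(restarts)]

print(crashloop_delays(6))   # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0]
```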
We are working on several changes to better identify the source of incidents and reduce the likelihood of them occurring in the future. These changes include: