Degraded performance on reads and writes in azure-us-east on April 1, 2024
Summary
Alerts were received indicating an increase in Time To Be Readable (TTBR), followed by an increase in the number of queries failing within the region.
Cause
The cluster experienced a significant increase in workload which consumed the available CPU time within the storage tier. This led to the query queue growing deeper, with some queries timing out before being processed.
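To make the failure mode concrete, the following is a minimal, illustrative sketch (not the actual InfluxDB implementation) of how a query queue behaves once arrivals outpace the available workers: waiting time grows with queue depth, and queries whose client timeout elapses while they are still queued fail without ever being executed. The worker count, service time, and timeout values are assumptions chosen purely to demonstrate the effect.

```go
package main

import (
	"fmt"
	"time"
)

// query models a request with an arrival time and a client-side timeout.
type query struct {
	id       int
	enqueued time.Time
	timeout  time.Duration
}

func main() {
	const (
		workers     = 2                     // query workers available in the storage tier (assumed)
		serviceTime = 50 * time.Millisecond // CPU time each query holds a worker (assumed)
		timeout     = 200 * time.Millisecond
		arrivals    = 40 // a burst arriving faster than it can be served
	)

	queue := make(chan query, arrivals)
	done := make(chan string, arrivals)

	// Workers drain the queue; queries that exceeded their timeout while
	// waiting are dropped instead of being executed.
	for w := 0; w < workers; w++ {
		go func() {
			for q := range queue {
				waited := time.Since(q.enqueued)
				if waited > q.timeout {
					done <- fmt.Sprintf("query %d timed out after waiting %v", q.id, waited.Round(time.Millisecond))
					continue
				}
				time.Sleep(serviceTime) // simulate CPU time consumed by the query
				done <- fmt.Sprintf("query %d answered", q.id)
			}
		}()
	}

	// The burst deepens the queue faster than the workers can drain it.
	for i := 0; i < arrivals; i++ {
		queue <- query{id: i, enqueued: time.Now(), timeout: timeout}
	}
	close(queue)

	for i := 0; i < arrivals; i++ {
		fmt.Println(<-done)
	}
}
```

Running the sketch shows the first few queries being answered and the remainder timing out, which mirrors the sustained queue pressure observed during the incident.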
Whenever a multi-tenant cluster experiences performance issues that do not appear to correlate with any changes we have made, we first check the larger customers in the cluster to see whether their behavior has changed.
In this instance, a large customer had increased the number of queries being run throughout the day. The incident coincided with another large customer writing a large number of new series into the cluster, which would have led to indexes being locked more frequently than usual.
The combination of the two led to the storage tier answering queries far more slowly than normal, allowing more queries to queue and therefore sustaining pressure on the system.
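As a rough illustration of the index-locking effect, the sketch below (again an assumption-laden model, not the storage engine's real index code) uses a read-write mutex: creating new series takes the exclusive write lock, so concurrent lookups spend part of their time waiting even though each lookup is cheap on its own.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// seriesIndex stands in for a series index: lookups take a read lock,
// while creating new series takes an exclusive write lock.
type seriesIndex struct {
	mu     sync.RWMutex
	series map[string]struct{}
}

func (idx *seriesIndex) create(key string) {
	idx.mu.Lock()
	defer idx.mu.Unlock()
	time.Sleep(2 * time.Millisecond) // assumed cost of updating the index
	idx.series[key] = struct{}{}
}

func (idx *seriesIndex) lookup(key string) bool {
	idx.mu.RLock()
	defer idx.mu.RUnlock()
	_, ok := idx.series[key]
	return ok
}

func main() {
	idx := &seriesIndex{series: map[string]struct{}{"existing": {}}}

	// A writer creating many new series repeatedly takes the exclusive lock.
	go func() {
		for i := 0; i < 500; i++ {
			idx.create(fmt.Sprintf("new-series-%d", i))
		}
	}()

	// Reads now queue behind the write lock, so even cheap lookups slow
	// down while the series churn continues.
	start := time.Now()
	for i := 0; i < 100; i++ {
		idx.lookup("existing")
	}
	fmt.Printf("100 lookups took %v under write contention\n", time.Since(start).Round(time.Millisecond))
}
```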
On InfluxDB Cloud2, the performance of writes and queries is inextricably linked: a change in behavior on one path can affect the other (note: this is no longer the case in v3-based products such as InfluxDB Cloud Dedicated).
Future mitigations
Multi-tenant clusters carry an inherent risk that an increase or change in one customer's workload can negatively impact other users of the cluster. When we observe significant and persistent changes in workload, we adjust the cluster to handle the new workload.