Degraded performance on reads and writes in azure-us-east on April 1, 2024
Summary
Alerts were received indicating an increase in Time To Be Readable (TTBR), followed by an increase in the number of queries failing within the region.
Cause
The cluster experienced a significant increase in workload which consumed the available CPU time within the storage tier. This led to the query queue growing deeper, with some queries timing out before being processed.
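To make the failure mode concrete, the following is a minimal, illustrative sketch (not the actual InfluxDB implementation) of how a query queue behaves once arrivals outpace the available workers: waiting time grows with queue depth, and queries whose client timeout elapses while they are still queued fail without ever being executed. The worker count, service time, and timeout values are assumptions chosen purely to demonstrate the effect.

```go
package main

import (
	"fmt"
	"time"
)

// query models a request with an arrival time and a client-side timeout.
type query struct {
	id       int
	enqueued time.Time
	timeout  time.Duration
}

func main() {
	const (
		workers     = 2                     // query workers available in the storage tier (assumed)
		serviceTime = 50 * time.Millisecond // CPU time each query holds a worker (assumed)
		timeout     = 200 * time.Millisecond
		arrivals    = 40 // a burst arriving faster than it can be served
	)

	queue := make(chan query, arrivals)
	done := make(chan string, arrivals)

	// Workers drain the queue; queries that exceeded their timeout while
	// waiting are dropped instead of being executed.
	for w := 0; w < workers; w++ {
		go func() {
			for q := range queue {
				waited := time.Since(q.enqueued)
				if waited > q.timeout {
					done <- fmt.Sprintf("query %d timed out after waiting %v", q.id, waited.Round(time.Millisecond))
					continue
				}
				time.Sleep(serviceTime) // simulate CPU time consumed by the query
				done <- fmt.Sprintf("query %d answered", q.id)
			}
		}()
	}

	// The burst deepens the queue faster than the workers can drain it.
	for i := 0; i < arrivals; i++ {
		queue <- query{id: i, enqueued: time.Now(), timeout: timeout}
	}
	close(queue)

	for i := 0; i < arrivals; i++ {
		fmt.Println(<-done)
	}
}
```

Running the sketch shows the first few queries being answered and the remainder timing out, which mirrors the sustained queue pressure observed during the incident.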
Whenever a multi-tenant cluster experiences performance issues that do not appear to correlate with any changes we have made, we first check the larger customers in the cluster to see whether their behavior has changed.
In this instance, a large customer had increased the number of queries being run throughout the day. The incident coincided with another large customer writing a large number of new series into the cluster, which would have led to indexes being locked more frequently than usual.
The combination of the two led to the storage tier answering queries far more slowly than normal, allowing more queries to queue and therefore sustaining pressure on the system.
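As a rough illustration of the index-locking effect, the sketch below (again an assumption-laden model, not the storage engine's real index code) uses a read-write mutex: creating new series takes the exclusive write lock, so concurrent lookups spend part of their time waiting even though each lookup is cheap on its own.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// seriesIndex stands in for a series index: lookups take a read lock,
// while creating new series takes an exclusive write lock.
type seriesIndex struct {
	mu     sync.RWMutex
	series map[string]struct{}
}

func (idx *seriesIndex) create(key string) {
	idx.mu.Lock()
	defer idx.mu.Unlock()
	time.Sleep(2 * time.Millisecond) // assumed cost of updating the index
	idx.series[key] = struct{}{}
}

func (idx *seriesIndex) lookup(key string) bool {
	idx.mu.RLock()
	defer idx.mu.RUnlock()
	_, ok := idx.series[key]
	return ok
}

func main() {
	idx := &seriesIndex{series: map[string]struct{}{"existing": {}}}

	// A writer creating many new series repeatedly takes the exclusive lock.
	go func() {
		for i := 0; i < 500; i++ {
			idx.create(fmt.Sprintf("new-series-%d", i))
		}
	}()

	// Reads now queue behind the write lock, so even cheap lookups slow
	// down while the series churn continues.
	start := time.Now()
	for i := 0; i < 100; i++ {
		idx.lookup("existing")
	}
	fmt.Printf("100 lookups took %v under write contention\n", time.Since(start).Round(time.Millisecond))
}
```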
On InfluxDB Cloud2, the performance of writes and queries is inextricably linked: a change in behavior on one path can affect the other (note: this is no longer the case in v3-based products such as InfluxDB Cloud Dedicated).
Future mitigations
Multi-tenant clusters carry an inherent risk that an increase or change in one customer's workload can negatively impact other users of the cluster. When we observe significant and persistent changes in workload, we adjust the cluster to handle the new workload.