RCA - Query errors in Azure West Europe on May 13, 2025
Background
Data in customer buckets on the InfluxDB Cloud platform is typically retained for a defined period, controlled by a retention policy set per bucket. These retention policies allow users to specify how long data should be kept before InfluxDB automatically deletes it.
While retention policies are the recommended way to manage bucket size and data freshness, it is also possible to delete data by predicate or by individual point. However, large-scale deletes using these methods can put significant strain on the storage subsystem.
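As an illustration of what a predicate delete looks like, the InfluxDB v2 delete endpoint (`POST /api/v2/delete`) accepts a JSON body with `start`, `stop`, and `predicate` fields. The sketch below is a hypothetical client-side helper, not the guidance given to any customer: it assumes that splitting one wide delete into smaller time windows keeps each request bounded. The function name, the one-hour window, and the measurement and tag names are all assumptions made for the example, and the code only builds request bodies without sending them.

```python
from datetime import datetime, timedelta, timezone


def build_delete_requests(start, stop, predicate, window=timedelta(hours=1)):
    """Split one wide predicate delete into smaller time-windowed request bodies.

    Hypothetical helper: `start` and `stop` are assumed to be timezone-aware
    UTC datetimes. Each returned dict has the shape of a /api/v2/delete body.
    """
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    bodies = []
    cursor = start
    while cursor < stop:
        # Clamp the final window so it never runs past the requested stop time.
        window_end = min(cursor + window, stop)
        bodies.append({
            "start": cursor.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "stop": window_end.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "predicate": predicate,
        })
        cursor = window_end
    return bodies


# Example: a three-hour delete becomes three one-hour request bodies.
requests_to_send = build_delete_requests(
    datetime(2025, 5, 1, tzinfo=timezone.utc),
    datetime(2025, 5, 1, 3, tzinfo=timezone.utc),
    '_measurement="sensor_data" AND host="edge-01"',  # illustrative names
)
```

A caller would then send the bodies one at a time, pausing between requests, instead of submitting them in rapid succession.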
Incident Detail
On May 13, 2025, a customer sent multiple batches of inefficient delete requests in rapid succession. This led to increased latency and, in some cases, errors for other organizations interacting with InfluxDB Cloud. The impact was limited to organizations whose data was stored near the data being deleted. In multi-tenant systems, resource consumption by one tenant that degrades service for others is known as a “noisy neighbor” issue. The customer who submitted these requests has since received guidance on more efficient deletion strategies.
Internal alerts flagged the issue, prompting the operations team to quickly identify and isolate the deletes that were impacting adjacent customers. Once mitigations were put in place, the storage subsystem promptly returned to full functionality.
We recognize that this noisy neighbor condition is problematic. To help minimize its impact, we have implemented monitoring systems that detect large or potentially inefficient delete operations. While these systems are effective at catching issues, they are reactive by nature and are not a long-term solution.
Actions
We are actively working on a more efficient deletion mechanism to give customers more granular control over data removal beyond what retention policies offer.
In the meantime, we are reviewing the rate limits and other service protections currently applied, so that we can proactively identify and engage with customers whose deletion patterns may cause excessive storage subsystem activity that impacts adjacent customers.