Increase in query errors reported in Azure West Europe

Incident Report for InfluxDB Cloud

Postmortem

RCA - Query errors in Azure West Europe on May 13, 2025

Background

Data in customer buckets on the InfluxDB Cloud platform is typically retained for a defined period, controlled by a retention policy set per bucket. These retention policies allow users to specify how long data should be kept before InfluxDB automatically deletes it.
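To make the retention model concrete, the sketch below assumes the InfluxDB v2 bucket API, where a bucket's retention is expressed as an `expire` rule with an `everySeconds` duration (sent, for example, in the body of `PATCH /api/v2/buckets/{bucketID}`). The helper names `retention_patch_body`, `retention_cutoff`, and `is_expired` are hypothetical, and the expiry check only illustrates which points fall outside the retention window; when the storage engine physically removes them is an implementation detail not covered here.

```python
from datetime import datetime, timedelta, timezone


def retention_patch_body(retention: timedelta) -> dict:
    """Build a JSON-ready body setting a bucket's retention rule.

    Assumes the InfluxDB v2 bucket API shape: a list of retention rules,
    each with type "expire" and a duration in whole seconds.
    """
    return {
        "retentionRules": [
            {"type": "expire", "everySeconds": int(retention.total_seconds())}
        ]
    }


def retention_cutoff(now: datetime, retention: timedelta) -> datetime:
    """Return the instant before which points fall outside the retention window."""
    return now - retention


def is_expired(point_time: datetime, now: datetime, retention: timedelta) -> bool:
    """Hypothetical check: True if a point is older than the retention window."""
    return point_time < retention_cutoff(now, retention)


if __name__ == "__main__":
    now = datetime(2025, 5, 13, tzinfo=timezone.utc)
    thirty_days = timedelta(days=30)
    print(retention_patch_body(thirty_days))
    print(retention_cutoff(now, thirty_days))
```

With a 30-day retention period, any point timestamped before 13 April 2025 is outside the window as of 13 May 2025.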

While retention policies are the recommended way to manage bucket size and data freshness, data can also be deleted explicitly, either by predicate or by individual point. However, large-scale deletes using these methods can put significant strain on the storage subsystem.
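For illustration, a delete by predicate in the InfluxDB v2 API is a `POST` to `/api/v2/delete` with the organization and bucket as query parameters and a JSON body giving a time range (`start`, `stop`) and a predicate. The sketch below only builds such a request with Python's standard library and does not send it. The host, token, organization, bucket, and predicate values are hypothetical placeholders, and the helper `build_delete_request` is not part of any InfluxData client library. It is not the deletion guidance given to the affected customer.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def _rfc3339(ts: datetime) -> str:
    """Format a UTC datetime as an RFC 3339 string with a trailing Z."""
    return ts.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")


def build_delete_request(host: str, token: str, org: str, bucket: str,
                         start: datetime, stop: datetime,
                         predicate: str) -> urllib.request.Request:
    """Build (but do not send) a delete-by-predicate request.

    Every point in `bucket` whose timestamp falls between `start` and
    `stop` and that matches `predicate` would be deleted if sent.
    """
    query = urllib.parse.urlencode({"org": org, "bucket": bucket})
    body = {
        "start": _rfc3339(start),
        "stop": _rfc3339(stop),
        "predicate": predicate,
    }
    return urllib.request.Request(
        url=f"{host}/api/v2/delete?{query}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_delete_request(
        host="https://example.invalid",  # hypothetical placeholder host
        token="MY_TOKEN",                # hypothetical token
        org="my-org",
        bucket="my-bucket",
        start=datetime(2025, 1, 1, tzinfo=timezone.utc),
        stop=datetime(2025, 1, 2, tzinfo=timezone.utc),
        predicate='_measurement="cpu" AND host="server01"',
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Each such request asks the storage subsystem to find and remove every matching point in the given range, which is why many broad deletes issued in rapid succession can be expensive compared with letting a retention policy expire data.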

Incident Detail

On 13 May, multiple batches of inefficient delete requests were sent in rapid succession. This caused increased latency and, in some cases, errors for other customers interacting with InfluxDB Cloud. The impact was limited to organizations whose data was stored near the data being deleted. In multi-tenant systems, resource consumption by one tenant that degrades service for others is known as a “noisy neighbor” issue. The customer who submitted these requests has since received guidance on more efficient deletion strategies.

Internal alerts flagged the issue, prompting the operations team to quickly identify and isolate the deletes that were affecting adjacent customers. Once mitigations were in place, the storage subsystem promptly returned to full functionality.

We recognize that this noisy-neighbor condition is problematic. To help minimize its impact, we’ve implemented monitoring systems that detect large or potentially inefficient delete operations. While these systems are effective at catching such issues, they are reactive by nature and are not a long-term solution.

Actions

We are actively working on a more efficient deletion mechanism that will give customers more granular control over data removal than retention policies alone offer.

In the meantime, we are reviewing the rate limits and other service protections we apply, so that we can proactively identify and engage with customers whose deletion patterns may cause excessive storage subsystem activity that affects adjacent customers.

Posted Jun 05, 2025 - 22:53 UTC

Resolved

This incident has been resolved.
Posted May 13, 2025 - 15:07 UTC

Update

We are continuing to monitor for any further issues.
Posted May 13, 2025 - 14:32 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 13, 2025 - 13:36 UTC

Investigating

We are currently investigating this issue.
Posted May 13, 2025 - 13:06 UTC
This incident affected: Cloud Serverless: Azure, East US (API Queries).