Delayed reads and writes in eu-central-1
Incident Report for InfluxDB Cloud
Postmortem

Incident RCA

2023-01-23: Delayed writes and reads in eu-central-1

Summary

Beginning in early January, a new customer workload began periodically causing an elevated time to become readable (TTBR) in eu-central-1. This workload was characterized by a few (fewer than 10 per day) large spikes of writes, sometimes upwards of 50-60 MiB/s. Under normal circumstances, this would have posed little issue for a cluster of this size; however, these spikes consisted primarily of “upserts”, i.e. modifications to existing points. Merging the existing point values with those newly written during a spike of write traffic was highly CPU-intensive, causing all replicas of the most heavily impacted storage partitions to fall behind on the stream of new writes, sometimes by over an hour. After each spike of write traffic, the most severely impacted partitions would take several hours to recover. Frequently, spikes during UTC business hours were separated by as little as one hour, further extending the time needed to recover.
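
For context, the sketch below (using the influxdb-client Python package; the URL, credentials, bucket, measurement, and field names are placeholders, not the customer's actual schema or workload) illustrates what an “upsert” means here: a second write that reuses an existing point's measurement, tag set, and timestamp replaces that point's field values, so the storage engine must locate and merge existing data rather than simply append new rows.

    # Minimal sketch, not the customer's actual workload: all names and
    # credentials below are placeholders for illustration only.
    from datetime import datetime, timezone

    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    ts = datetime(2023, 1, 23, 12, 0, 0, tzinfo=timezone.utc)

    client = InfluxDBClient(url="https://eu-central-1-1.aws.cloud2.influxdata.com",
                            token="<API_TOKEN>", org="<ORG>")
    write_api = client.write_api(write_options=SYNCHRONOUS)

    # Initial write: one point for this series at timestamp ts.
    write_api.write(bucket="<BUCKET>", record=Point("sensor")
                    .tag("device", "d-001")
                    .field("temperature", 21.5)
                    .time(ts))

    # "Upsert": same measurement, tag set, and timestamp, but a new field value.
    # Instead of appending a new row, the storage engine must find and merge the
    # existing point, which is the CPU-intensive path described above.
    write_api.write(bucket="<BUCKET>", record=Point("sensor")
                    .tag("device", "d-001")
                    .field("temperature", 22.0)
                    .time(ts))

    client.close()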

The engineering team identified the customer in question and attempted to mitigate the impact of these write spikes by significantly increasing the provisioned resources for each storage partition in this cluster. After these efforts proved unsuccessful, we worked with the customer to pause the workload. Going forward, we will work with this customer to tune their workload.

Posted Feb 08, 2023 - 03:30 UTC

Resolved
This incident has been resolved.
Posted Feb 01, 2023 - 18:38 UTC
Update
We are continuing to monitor for any further issues.
Posted Jan 27, 2023 - 19:45 UTC
Update
The time taken for writes to become readable has returned to normal operating levels. We will continue to monitor and post updates.
Posted Jan 26, 2023 - 22:33 UTC
Update
We are continuing to monitor for any further issues.
Posted Jan 26, 2023 - 19:53 UTC
Update
The region is currently experiencing elevated times for written data to become queryable, along with elevated query run times. All written data is still being safely queued. We are working to minimize disruptions and will continue to post updates as the situation evolves.
Posted Jan 26, 2023 - 18:38 UTC
Monitoring
The storage engine has completed processing the burst of traffic. Writes are now becoming queryable within normal time ranges. Our team will continue to monitor.
Posted Jan 26, 2023 - 03:06 UTC
Identified
The region received another burst of write traffic, causing a backlog in the queueing system. Heavily bursting traffic has now been limited on a per-organization basis, and all organizations impacted by this limitation have been notified. The storage system is currently processing the backlogged traffic. We will post an update here once the storage engine completes this processing.
Posted Jan 26, 2023 - 02:29 UTC
Investigating
We are seeing a return of the delayed reads and writes issue from a few hours ago. Our team is actively investigating.
Posted Jan 26, 2023 - 01:21 UTC
Monitoring
We have identified the primary contributing factor behind the performance degradation (delayed writes and subsequent temporary read discrepancies). The impacted region has been receiving a large burst of writes twice daily, which has saturated the storage layer. All writes to this region have been successfully recorded within our queueing system but are taking much longer than expected to become queryable. In response, we have increased the resources available to the storage system and paused all tertiary background processes in the region to expedite recovery. We will continue to provide updates as the storage system recovers and will publish a complete root cause analysis once this incident has been resolved.
Posted Jan 25, 2023 - 19:53 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 18:12 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 16:06 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 14:45 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 13:50 UTC
Update
We have deployed some minor updates to the cluster but are continuing to investigate the issue.
Posted Jan 25, 2023 - 09:29 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 07:32 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 05:53 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 03:44 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 01:25 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 24, 2023 - 23:43 UTC
Update
We are continuing to work on this issue and a fix is being implemented.
Posted Jan 24, 2023 - 21:00 UTC
Update
The issue has been identified and a fix is still being implemented.
Posted Jan 24, 2023 - 19:50 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 24, 2023 - 17:41 UTC
Update
The affected AWS region is experiencing delayed write/query operations and intermittent query failures.
We are continuing to investigate the issue.
Posted Jan 24, 2023 - 15:57 UTC
Update
We are continuing to investigate this issue.
Posted Jan 24, 2023 - 13:50 UTC
Investigating
We are currently investigating this issue.
Posted Jan 24, 2023 - 13:11 UTC
This incident affected: AWS: Frankfurt, EU-Central-1 (API Writes, API Queries).