Delayed reads and writes in eu-central-1
Incident Report for InfluxDB Cloud
Postmortem

Incident RCA

2023-01-23: Delayed writes and reads in eu-central-1

Summary

Beginning in early January, a new customer workload began periodically causing an elevated time to become readable (TTBR) in eu-central-1. This workload was characterized by a few (fewer than 10 per day) large spikes of writes, sometimes upwards of 50-60 MiB/s. Under normal circumstances, this would have posed little issue for a cluster of this size; however, these spikes consisted primarily of “upserts”, i.e. modifications to existing points. Merging the existing point values with those newly written during a spike of write traffic was highly CPU-intensive, causing all replicas of the most heavily impacted storage partitions to fall behind on the stream of new writes, sometimes by over an hour. After each spike of write traffic, the most severely impacted partitions would take several hours to recover. Frequently, spikes during UTC business hours were separated by as little as one hour, further extending the time needed to recover.
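
For context, the sketch below (using the influxdb-client Python package; the URL, credentials, bucket, measurement, and field names are placeholders, not the customer's actual schema or workload) illustrates what an “upsert” means here: a second write that reuses an existing point's measurement, tag set, and timestamp replaces that point's field values, so the storage engine must locate and merge existing data rather than simply append new rows.

    # Minimal sketch, not the customer's actual workload: all names and
    # credentials below are placeholders for illustration only.
    from datetime import datetime, timezone

    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    ts = datetime(2023, 1, 23, 12, 0, 0, tzinfo=timezone.utc)

    client = InfluxDBClient(url="https://eu-central-1-1.aws.cloud2.influxdata.com",
                            token="<API_TOKEN>", org="<ORG>")
    write_api = client.write_api(write_options=SYNCHRONOUS)

    # Initial write: one point for this series at timestamp ts.
    write_api.write(bucket="<BUCKET>", record=Point("sensor")
                    .tag("device", "d-001")
                    .field("temperature", 21.5)
                    .time(ts))

    # "Upsert": same measurement, tag set, and timestamp, but a new field value.
    # Instead of appending a new row, the storage engine must find and merge the
    # existing point, which is the CPU-intensive path described above.
    write_api.write(bucket="<BUCKET>", record=Point("sensor")
                    .tag("device", "d-001")
                    .field("temperature", 22.0)
                    .time(ts))

    client.close()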

The engineering team identified the customer in question and attempted to mitigate the impact of these write spikes by significantly increasing the provisioned resources for each storage partition in this cluster. After these efforts proved unsuccessful, we worked with the customer to pause the workload. Going forward, we will work with this customer to tune their workload.

Posted Feb 08, 2023 - 03:30 UTC

Resolved
This incident has been resolved.
Posted Feb 01, 2023 - 18:38 UTC
Update
We are continuing to monitor for any further issues.
Posted Jan 27, 2023 - 19:45 UTC
Update
The time taken for writes to become readable has returned to normal operating levels. We will continue to monitor and post updates.
Posted Jan 26, 2023 - 22:33 UTC
Update
We are continuing to monitor for any further issues.
Posted Jan 26, 2023 - 19:53 UTC
Update
The region is currently experiencing elevated times for written data to become queryable, along with elevated query run times. All written data is still being safely queued. We are working to minimize disruptions and will continue to post updates as the situation evolves.
Posted Jan 26, 2023 - 18:38 UTC
Monitoring
The storage engine has completed processing the burst of traffic. Writes are now becoming queryable within normal time ranges. Our team will continue to monitor.
Posted Jan 26, 2023 - 03:06 UTC
Identified
The region received another burst of write traffic, causing a backlog in the queueing system. Heavily bursting traffic has now been limited on a per-organization basis, and all organizations impacted by this limitation have been notified. The storage system is currently processing the backlogged traffic. We will post an update here once the storage engine completes this processing.
Posted Jan 26, 2023 - 02:29 UTC
Investigating
We are seeing a return of the delayed reads and writes issue from a few hours ago. Our team is actively investigating.
Posted Jan 26, 2023 - 01:21 UTC
Monitoring
We have identified the primary contributing factor behind the performance degradation (delayed writes and subsequent temporary read discrepancies). The impacted region has been receiving a large burst of writes twice daily, which has saturated the storage layer. All writes to this region have been successfully recorded within our queueing system but are taking much longer than expected to become queryable. In response, we have increased the resources available to the storage system and paused all tertiary background processes in the region to expedite recovery. We will continue to provide updates as the storage system recovers and will publish a complete root cause analysis once this incident has been resolved.
Posted Jan 25, 2023 - 19:53 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 18:12 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 16:06 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 14:45 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 13:50 UTC
Update
We have deployed some minor updates to the cluster but are continuing to investigate the issue.
Posted Jan 25, 2023 - 09:29 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 07:32 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 05:53 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 03:44 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 25, 2023 - 01:25 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jan 24, 2023 - 23:43 UTC
Update
We are continuing to work on this issue and a fix is being implemented.
Posted Jan 24, 2023 - 21:00 UTC
Update
The issue has been identified and a fix is still being implemented.
Posted Jan 24, 2023 - 19:50 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 24, 2023 - 17:41 UTC
Update
The affected AWS region is experiencing delayed write/query operations and intermittent query failures.
We are continuing to investigate the issue.
Posted Jan 24, 2023 - 15:57 UTC
Update
We are continuing to investigate this issue.
Posted Jan 24, 2023 - 13:50 UTC
Investigating
We are currently investigating this issue.
Posted Jan 24, 2023 - 13:11 UTC
This incident affected: AWS: Frankfurt, EU-Central-1 (API Writes, API Queries).