2023-01-23: Delayed writes and reads in eu-central-1
Beginning in early January, a new customer workload began periodically causing high TTBR in eu-central-1. This workload was characterized by a few (< 10/day) large spikes of writes, sometimes upwards of 50-60 MiB/s. Under normal circumstances, this would have posed little issue for a cluster of this size; however, these spikes consisted primarily of “upserts” (modifications to existing points). Merging the existing point values with those newly written during a spike was highly CPU intensive, causing all replicas of the most impacted storage partitions to fall behind on the stream of new writes, sometimes by over an hour. After each spike, the most severely impacted partitions took several hours to recover. Spikes during UTC business hours were frequently separated by as little as one hour, so a new spike often arrived before recovery from the previous one had finished, further extending the time needed to recover.
The engineering team identified the customer in question and attempted to mitigate the impact of these write spikes by significantly increasing the provisioned resources for each storage partition in this cluster. After these efforts proved unsuccessful, we worked with the customer to pause the workload. Going forward, we will work with this customer to tune their workload.