November 30, 2022: Outage in prod01-eu-central-1
On November 30th 2022 at 15:24 UTC, a code change was applied to the eu-central-1 cluster that resulted in all storage pods being evicted. The pods could not reschedule because the sum of their resource requests exceeded the capacity of the underlying hardware. After investigation, a change was made to reduce the resource requests for storage services. With the reduced resource allocation, the cluster was able to schedule the pods by 16:10 UTC. While the service was recovering, customers on this cluster received an abnormally high rate of write request failures, because the cluster was servicing write requests while under recovery load. The service returned to normal operation by 17:00 UTC, as monitored by the TTBR of the pods in the cluster as well as the write request failure rate.
Our Kubernetes nodes host application services and daemonsets, each of which independently requests a certain amount of resources. The change made at 15:24 UTC added a new daemonset. In isolation this change was reasonable; in aggregate with all other services, however, it exceeded the capacity of the nodes. Daemonsets run at a higher priority than our application services, so the storage pods were the ones evicted. During this time writes continued to work correctly, as they are buffered by Kafka, but all queries failed because they depend on the storage pods.
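The failure mode above can be sketched with some back-of-the-envelope arithmetic. All numbers and pod names below are hypothetical, and the priority-ordered fitting loop is only a simplified stand-in for the Kubernetes scheduler's preemption logic, but it shows how a new daemonset can push aggregate requests past what a node can hold, squeezing out lower-priority pods:

```python
# Illustrative sketch only -- hypothetical numbers, not our real pod specs.
NODE_ALLOCATABLE_CPU = 16.0  # cores available on one node

# (name, cpu_request, priority) -- higher priority wins under resource pressure
pods = [
    ("storage", 10.0, 100),        # application service (lower priority)
    ("log-agent", 2.0, 1000),      # existing daemonset
    ("metrics-agent", 3.0, 1000),  # existing daemonset
    ("new-agent", 2.0, 1000),      # the daemonset added at 15:24 UTC
]

total = sum(cpu for _, cpu, _ in pods)
print(f"requested {total} of {NODE_ALLOCATABLE_CPU} cores")  # 17.0 of 16.0

# Under pressure, the highest-priority pods that fit are kept; the rest
# cannot be scheduled. Sorting by priority mimics daemonsets outranking
# application services.
scheduled, used = [], 0.0
for name, cpu, prio in sorted(pods, key=lambda p: -p[2]):
    if used + cpu <= NODE_ALLOCATABLE_CPU:
        scheduled.append(name)
        used += cpu

print("scheduled:", scheduled)  # storage does not fit and stays evicted
```

Before the new daemonset, the same pods totaled 15.0 cores and everything fit; adding 2.0 more cores of daemonset requests is what tips the node over.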
Kafka’s write buffer allows us to accept data even while the storage pods are unavailable; when the storage service restarts, it recovers that data from Kafka. This is exactly what happened during the window between 15:24 UTC and 16:10 UTC: all writes were stored in Kafka. When the storage pods began to recover at 16:10 UTC, the knock-on effect was increased load on our Kafka brokers, which caused an abnormally high rate of write requests to time out and return 503 responses. By 17:00 UTC the recovery workload had decreased enough that the Kafka brokers returned to normal performance and error levels.
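The buffering pattern described above can be shown in a minimal sketch. A plain list stands in for Kafka, and the class and method names are illustrative, not InfluxDB internals; the point is only that acknowledged writes survive a storage outage and are replayed on recovery:

```python
# Minimal sketch of write buffering with replay-on-recovery.
# A list plays Kafka's role as the durable buffer; names are hypothetical.
class BufferedStore:
    def __init__(self):
        self.buffer = []    # durable write buffer (Kafka's role)
        self.store = []     # storage pods' state
        self.online = True  # whether storage pods are schedulable

    def write(self, point):
        # Writes are acknowledged once buffered, even if storage is down.
        self.buffer.append(point)
        if self.online:
            self.recover()

    def recover(self):
        # Replay any buffered writes the store has not yet applied.
        while len(self.store) < len(self.buffer):
            self.store.append(self.buffer[len(self.store)])

    def query(self):
        if not self.online:
            raise RuntimeError("storage pods unavailable: queries fail")
        return list(self.store)

db = BufferedStore()
db.write("m1")
db.online = False   # 15:24 UTC: storage pods evicted
db.write("m2")      # still acknowledged -- buffered, not yet applied
db.online = True    # 16:10 UTC: pods rescheduled
db.recover()        # recovery replays the buffered writes
print(db.query())   # ['m1', 'm2'] -- no acknowledged write is lost
```

While `online` is false, `query` raises, mirroring the query outage, while `write` keeps succeeding; the recovery replay is also where the extra broker load in the real incident came from.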
While the storage pods were offline between 15:24 UTC and 16:10 UTC, all customers on the eu-central-1 cluster were unable to access the query APIs; all queries in this time span failed. Once the pods were restarted and recovering from Kafka between 16:10 UTC and 17:00 UTC, customers on this cluster may have experienced an elevated rate of write requests timing out and returning 503 responses. Throughout the incident, every write request that received a 200 OK response was successfully recorded within InfluxDB. No data that was successfully acknowledged during or before the incident was lost. Failed tasks, or tasks that executed during the TTBR recovery window, may need to be rerun.
17:00 UTC - TTBR of all nodes and write request response times return to healthy levels.