November 30, 2022: Outage in prod01-eu-central-1
On November 30th 2022 at 15:24 UTC, a code change was applied to the eu-central-1 cluster that resulted in all storage pods being evicted. The pods could not reschedule because the sum of their resource requests exceeded the capacity of the underlying hardware. After investigation, a change was made to reduce the resource requests for storage services. With the reduced resource allocation, the cluster was able to schedule the pods by 16:10 UTC. While the service was recovering, customers on this cluster received an abnormally high rate of write request failures, because the cluster was servicing write requests while under recovery load. The service returned to normal operation by 17:00 UTC, as monitored by the TTBR of the pods in the cluster as well as the write request failure rate.
Our Kubernetes nodes host application services and daemonsets, each of which independently requests a certain amount of resources. The change made at 15:24 UTC added a new daemonset. In isolation this change was reasonable; in aggregate with all other services, however, it exceeded the capacity of the nodes. Daemonsets run at a higher priority than our application services, so the storage pods were the ones evicted. During this time writes continued to work correctly, as they are buffered by Kafka, but all queries failed because they depend on the storage pods.
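The failure mode above can be sketched with some back-of-the-envelope arithmetic. All numbers and pod names below are hypothetical, and the priority-ordered fitting loop is only a simplified stand-in for the Kubernetes scheduler's preemption logic, but it shows how a new daemonset can push aggregate requests past what a node can hold, squeezing out lower-priority pods:

```python
# Illustrative sketch only -- hypothetical numbers, not our real pod specs.
NODE_ALLOCATABLE_CPU = 16.0  # cores available on one node

# (name, cpu_request, priority) -- higher priority wins under resource pressure
pods = [
    ("storage", 10.0, 100),        # application service (lower priority)
    ("log-agent", 2.0, 1000),      # existing daemonset
    ("metrics-agent", 3.0, 1000),  # existing daemonset
    ("new-agent", 2.0, 1000),      # the daemonset added at 15:24 UTC
]

total = sum(cpu for _, cpu, _ in pods)
print(f"requested {total} of {NODE_ALLOCATABLE_CPU} cores")  # 17.0 of 16.0

# Under pressure, the highest-priority pods that fit are kept; the rest
# cannot be scheduled. Sorting by priority mimics daemonsets outranking
# application services.
scheduled, used = [], 0.0
for name, cpu, prio in sorted(pods, key=lambda p: -p[2]):
    if used + cpu <= NODE_ALLOCATABLE_CPU:
        scheduled.append(name)
        used += cpu

print("scheduled:", scheduled)  # storage does not fit and stays evicted
```

Before the new daemonset, the same pods totaled 15.0 cores and everything fit; adding 2.0 more cores of daemonset requests is what tips the node over.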
Kafka’s write buffer allows us to accept data even while the storage pods are unavailable; when the storage service restarts, it recovers that data from Kafka. This is exactly what happened during the window between 15:24 UTC and 16:10 UTC: all writes were stored in Kafka. When the storage pods began to recover at 16:10 UTC, the knock-on effect was increased load on our Kafka brokers, which caused an abnormally high rate of write requests to time out and return 503 responses. By 17:00 UTC the recovery workload had decreased enough that the Kafka brokers returned to normal performance and error levels.
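The buffering pattern described above can be shown in a minimal sketch. A plain list stands in for Kafka, and the class and method names are illustrative, not InfluxDB internals; the point is only that acknowledged writes survive a storage outage and are replayed on recovery:

```python
# Minimal sketch of write buffering with replay-on-recovery.
# A list plays Kafka's role as the durable buffer; names are hypothetical.
class BufferedStore:
    def __init__(self):
        self.buffer = []    # durable write buffer (Kafka's role)
        self.store = []     # storage pods' state
        self.online = True  # whether storage pods are schedulable

    def write(self, point):
        # Writes are acknowledged once buffered, even if storage is down.
        self.buffer.append(point)
        if self.online:
            self.recover()

    def recover(self):
        # Replay any buffered writes the store has not yet applied.
        while len(self.store) < len(self.buffer):
            self.store.append(self.buffer[len(self.store)])

    def query(self):
        if not self.online:
            raise RuntimeError("storage pods unavailable: queries fail")
        return list(self.store)

db = BufferedStore()
db.write("m1")
db.online = False   # 15:24 UTC: storage pods evicted
db.write("m2")      # still acknowledged -- buffered, not yet applied
db.online = True    # 16:10 UTC: pods rescheduled
db.recover()        # recovery replays the buffered writes
print(db.query())   # ['m1', 'm2'] -- no acknowledged write is lost
```

While `online` is false, `query` raises, mirroring the query outage, while `write` keeps succeeding; the recovery replay is also where the extra broker load in the real incident came from.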
While the storage pods were offline between 15:24 UTC and 16:10 UTC, all customers on the eu-central-1 cluster were unable to access the query APIs; all queries in this time span failed. Once the pods were restarted and recovering from Kafka between 16:10 UTC and 17:00 UTC, customers on this cluster may have experienced an elevated rate of write requests timing out and returning 503 responses. Throughout the incident, every write request that received a 200 OK response was successfully recorded within InfluxDB. No data that was successfully acknowledged during or before the incident was lost. Failed tasks, or tasks that executed during the TTBR recovery window, may need to be rerun.
17:00 UTC - TTBR of all nodes and write request response times return to healthy levels.