RCA - query errors in us-central-1 on May 11, 2023
One of the customers on this cluster submitted a very large number of delete requests. By itself, this would not have caused an outage. However, at the same time, one of the storage pods ran out of disk space. We added more disk space, but because of the large number of tombstone files (created to keep track of deleted measurements), the pod was very slow to recover, and queries failed at a high rate until it did.
The immediate trigger was that a disk filled up. Under normal circumstances this is not service-impacting: we are alerted when a disk is close to filling up, and we have a runbook for adding capacity to the storage layer. The pod must be restarted after its disk is resized, and in this case the restarted pod was unavailable for a long time while it processed the backlog of tombstone files.
As soon as we identified that deletes were contributing to the slow recovery, we reached out to the customer that had generated the large volume of deletes and asked them to stop sending delete requests while the cluster recovered. While we waited to hear back from them, we blocked deletes for all customers on this cluster as a temporary measure. We also manually removed the tombstone files from one replica of the most heavily impacted storage partition so that it could recover quickly, which enabled the cluster to return to normal operation and serve queries. Meanwhile, the other replica of that partition continued to process the backlog of tombstone files; once it finished, we could restart both replicas and the data would be complete and correct.
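The trade-off described above can be illustrated with a small model. This is a sketch only: the names (StoragePod, drop_tombstones, etc.) are hypothetical and do not correspond to the actual storage engine code.

```python
# Illustrative model of the recovery trade-off, assuming a pod must
# replay all pending tombstone files before it can serve queries.
# All names here are hypothetical, not the real storage engine.

class StoragePod:
    def __init__(self, tombstones):
        # Tombstone files that must be replayed before serving queries.
        self.tombstones = list(tombstones)
        self.deleted = set()
        self.ready = False

    def restart(self):
        # On restart, every pending tombstone is processed before the
        # pod becomes available -- this is what made recovery slow.
        for t in self.tombstones:
            self.deleted.add(t)
        self.tombstones.clear()
        self.ready = True

    def drop_tombstones(self):
        # The emergency mitigation: discard pending tombstones so the
        # pod starts quickly, at the cost of "deleted" data remaining
        # visible until the other replica finishes and both re-sync.
        self.tombstones.clear()


# A replica with a large delete backlog recovers slowly, but correctly:
slow = StoragePod(tombstones=range(1_000_000))
slow.restart()  # replays the full backlog before becoming ready

# A replica whose tombstones were removed starts immediately, but data
# that should have been deleted is still queryable:
fast = StoragePod(tombstones=range(1_000_000))
fast.drop_tombstones()
fast.restart()
assert fast.ready and not fast.deleted
```

In this model, running one replica each way is exactly the mitigation above: the fast replica restores query availability while the slow one preserves correctness, and restarting both once the slow one finishes brings them back in sync.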
May 11, 2023 18:20 - Alerted that storage pod disk was close to capacity
May 11, 2023 18:43 - Added more disk capacity and restarted the pods.
May 11, 2023 19:05 - The pod was very slow to start, became unavailable, and queries started failing. Investigation showed that the problem partition was pegged processing the enormous number of deletes, but that it was making progress. We decided to let the process run its course so that the data would be correct. Continuous monitoring of the progress predicted an outage of around two hours. During this time writes were being accepted but queries were failing; this was seen as the least bad option. The partition completed its work but then, because of the delete requests that had continued to arrive, effectively had to start over.
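The two-hour prediction came from monitoring the backlog's processing rate; a linear extrapolation like the following is all that is needed (a sketch with invented numbers, not the actual tooling):

```python
def estimate_recovery_seconds(remaining_files, processed_in_window, window_seconds):
    """Linear ETA: remaining work divided by the observed processing rate.

    A sketch of a progress-based estimate; the figures used below are
    invented for illustration.
    """
    rate = processed_in_window / window_seconds  # tombstone files per second
    return remaining_files / rate


# e.g. 36,000 tombstone files remaining, 300 processed in the last minute:
eta = estimate_recovery_seconds(36_000, 300, 60)
assert eta == 7200.0  # about two hours
```

Note that this kind of estimate assumes the backlog is not growing; new delete requests arriving during recovery invalidate it, which is exactly what forced the partition to start over and why deletes were subsequently blocked.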
May 11, 2023 21:16 - Blocked all deletes to the cluster and continued to monitor the situation.
Progress was still being made. We examined the code paths and concluded that this was not a software fault as such, so the best course of action was to let the process continue to run. In parallel, we started investigating ways to restore service faster.
May 11, 2023 22:00 - Manually deleted the tombstone files on the secondary replica of the impacted storage partition to speed up recovery. This meant queries could be served, but data that should have been deleted was visible; we judged this the least bad option at the time. We intended to leave the primary replica to process the remaining deletes, expecting this to take many more hours; when that was complete, we would restart the secondary and the replicas would be back in sync with the correct data.
May 11, 2023 23:08 - Primary replica of the impacted storage partition recovered; query success rate returned to normal.