RCA - Write and Query failures on prod102-us-west-2 on September 7, 2023
We upgraded nginx from 1.3 to 1.7 in prod102-us-west-2 and the upgrade appeared to be fine, with the cluster functioning normally immediately after the upgrade. We had already done the same upgrade of nginx in other Cloud 2 clusters, with no negative impact. Approximately 8 hours later, we were alerted by a high rate of write and query errors in the cluster. We ultimately resolved the issue by reverting the nginx upgrade.
Cause of the Incident
We believe that the cause of the incident was an interaction between the nginx upgrade and stale connection handling, although we were not able to definitively prove this, as we could not cause the problem to occur in later testing. We also did not see this interaction in any of the other clusters where we upgraded nginx. Eight hours after nginx was upgraded, we did a normal deployment which caused all of our gateway pods to restart (normal behavior). When they restarted, we got a very high rate (though not quite 100%) of connection failures on the gateway pods. This was visible to our customers, who saw 100% failures for queries, and between 80% and 90% failures for writes.
On receiving the alerts, we investigated and identified the high failure rate to the gateway pods. As nginx had already been running for 8 hours, we didn’t immediately assume that nginx was at fault, and instead we investigated any other potential causes (including the changes that had been deployed in the deployment that had just occurred). When we could not identify any other potential source, we rolled back the nginx upgrade, and the cluster recovered. We have since upgraded nginx again, on this cluster, with no negative impact. Our assumption is not that there is a problem with the particular version of nginx, but that the action of upgrading nginx caused it to hold onto stale connections, so after the gateway pods restarted they could no longer connect successfully.
Sept 7, 2023 18:15 UTC - Alerted that we are seeing a high rate of write and query errors.
Sept 7, 2023 18:20 UTC - Engineering team began investigating.
Sept 7, 2023 18:20 UTC to 21:10 UTC - Our early investigation showed that all the gateway pods had restarted. We investigated any changes that had been in that deployment, but there was nothing in the updated code that could have caused an issue of this magnitude (impacting queries and writes). We manually restarted one gateway pod and it did not recover. We also investigated whether this could have been caused by something external (e.g. some network issue within the AWS infrastructure) but could not identify any cause there. As we could not find a root cause, we chose to undo all recent changes in that cluster, including the nginx upgrade.
Sept 7, 21:15 UTC - Rolled back nginx to v1.3.
Sept 7, 21:20 UTC - All gateway connections recovered. Query and Write latency was still high as there was a backlog of failed requests to catch up.
Sept 7, 22:30 UTC - Cluster fully recovered.
Sept 8, 10:00 UTC - We upgraded nginx to 1.7, and restarted all the gateway pods, without any negative consequences.
We will force a deployment directly after each nginx upgrade, to ensure that all connections are refreshed, to avoid any potential interaction with stale connections.