During this incident, all Users with Tasks in the AWS eu-central-1 cluster saw elevated lateness and, potentially, missing logs for Runs that should have executed during that period. Organizations will only have missing Run logs if they experienced rate limits while their Tasks were in a “finishing” state.
AWS eu-central-1 was undergoing an upgrade to Kubernetes v1.19. This caused Nodes to roll as they were being upgraded.
Services that depend on Etcd dial it via a Service definition. If a client Pod dials this DNS entry and the Etcd Pod it is routed to isn’t healthy, the client won’t attempt to contact the other Etcd Pods, because it is unaware of them. This lack of resilience in the client setup exacerbated the issue.
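As a rough illustration of the more resilient setup, a client that knows every Etcd member can skip an unhealthy one instead of failing outright. The sketch below is not the production client: the function name `pick_healthy_endpoint`, the `is_healthy` callback, and the per-Pod addresses in `ETCD_ENDPOINTS` are all hypothetical and exist only to show the failover idea.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes the health check, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except OSError:
            # Treat connection errors as "unhealthy" and try the next member.
            continue
    return None


# Hypothetical per-Pod addresses; the real names depend on the deployment.
ETCD_ENDPOINTS = [
    "etcd-0.etcd:2379",
    "etcd-1.etcd:2379",
    "etcd-2.etcd:2379",
]

if __name__ == "__main__":
    # Simulate etcd-0 being down, for example because its Node is rolling.
    down = {"etcd-0.etcd:2379"}
    print(pick_healthy_endpoint(ETCD_ENDPOINTS, lambda ep: ep not in down))
```

With a single Service DNS entry the client effectively has a one-element list, so one unhealthy Pod is enough to make the dial fail. Giving the client the full member list lets it move on to the next Pod.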
Learnings
Enable liveness and readiness probes in AWS eu-central-1 so that a potential problem with the Etcd cluster is detected before downstream services start failing.
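A probe ultimately reduces to a check that returns healthy or unhealthy. The sketch below shows one way such a check might interpret a health response, assuming the Etcd members expose an HTTP health endpoint that returns a JSON body with a `health` field. The function name `etcd_reports_healthy` is hypothetical, and the actual probe configuration used in the cluster is not described in this report.

```python
import json


def etcd_reports_healthy(body: str) -> bool:
    """Interpret the JSON body of an Etcd health response (assumed format)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # An unparseable body is treated as unhealthy.
        return False
    if not isinstance(payload, dict):
        return False
    health = payload.get("health")
    # Assumption: health is reported as the string "true"; a JSON boolean
    # true is accepted as well to be lenient.
    return health == "true" or health is True
```

A check like this could back a readiness probe, so that an unhealthy member is taken out of the Service before clients are routed to it.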