Users are experiencing elevated Task lateness in AWS eu-central-1

Incident Report for InfluxDB Cloud

Postmortem

Summary

During this incident, all users with Tasks in the AWS eu-central-1 cluster saw elevated Task lateness, and some may be missing logs for Runs that were scheduled to execute during that period. An organization will have missing Run logs only if it was rate limited while its Tasks were in a “finishing” state.

Contributors

AWS eu-central-1 was undergoing an upgrade to Kubernetes v1.19. This caused Nodes to roll as they were being upgraded.

Services that depend on Etcd dial it via a Service definition. When a client Pod dials the Service's DNS entry and the Etcd Pod it is routed to isn’t healthy, the client won’t attempt to contact the other Etcd Pods, because it’s unaware of them. This lack of resilience in the client setup exacerbated the issue.
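To illustrate the kind of client-side resilience that was missing, here is a minimal sketch in Python of a client that knows about several endpoints and falls through to the next one when a dial fails. It is not the code used by InfluxDB Cloud and does not use an Etcd client library; the function names and the endpoint hostnames in the usage note are hypothetical.

```python
import socket
from typing import Any, Callable, Iterable, List, Tuple

Endpoint = Tuple[str, int]


def tcp_dial(endpoint: Endpoint, timeout: float = 1.0) -> socket.socket:
    """Open a TCP connection to one endpoint; raises OSError on failure."""
    return socket.create_connection(endpoint, timeout=timeout)


def dial_first_healthy(
    endpoints: Iterable[Endpoint],
    dial: Callable[[Endpoint], Any] = tcp_dial,
) -> Tuple[Endpoint, Any]:
    """Try each endpoint in order and return the first that accepts a connection.

    A client that only knows a single address cannot do this: if that one
    address resolves to an unhealthy member, there is nothing left to try.
    """
    errors: List[Tuple[Endpoint, OSError]] = []
    for endpoint in endpoints:
        try:
            return endpoint, dial(endpoint)
        except OSError as exc:
            # Remember the failure and move on to the next known member.
            errors.append((endpoint, exc))
    raise ConnectionError(f"no healthy endpoint among {len(errors)} tried: {errors}")
```

A caller would pass every known member, for example `dial_first_healthy([("etcd-0.example", 2379), ("etcd-1.example", 2379)])`, where the hostnames are placeholders. The `dial` parameter is injectable so the failover logic can be exercised without a network.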

Learnings

Enable liveness and readiness probes in AWS eu-central-1 so that a potential problem with the Etcd cluster is detected before downstream services start failing.
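As a rough illustration of what such a probe checks, the Python sketch below polls an HTTP health endpoint and reports whether the member considers itself healthy. It assumes the member serves a JSON body with a `health` field, as Etcd's `/health` endpoint does; the URL is supplied by the caller and the function names are hypothetical. In Kubernetes the probe itself would normally be declared in the Pod spec rather than written as code, so this only shows the logic of the check.

```python
import json
import urllib.request


def parse_health(body: bytes) -> bool:
    """Return True if a JSON health body reports a healthy member.

    Accepts both {"health": "true"} and {"health": true}, since the field
    may be encoded as a string or as a boolean.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    return str(payload.get("health", "")).lower() == "true"


def probe(url: str, timeout: float = 2.0) -> bool:
    """Fetch the health URL; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and parse_health(resp.read())
    except OSError:
        return False
```

Treating every error as "unhealthy" is a deliberate choice in this sketch: a probe that cannot reach the member should fail rather than hide the problem.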

Posted Feb 09, 2022 - 00:12 UTC

Resolved

This incident has been resolved. An RCA will be posted shortly.

Posted Feb 07, 2022 - 19:22 UTC

Investigating

We are currently investigating this issue.

Posted Feb 07, 2022 - 18:26 UTC

This incident affected: Cloud Serverless: AWS, EU-Central (Tasks).