Summary
On January 15, 2026, user login for Cloud 2 failed intermittently in the eu-central AWS
cluster. Some users could not log in via the web UI. Others could log in but then could not
see their resources (buckets, dashboards, etc.) in the web UI. The issue was intermittent, did
not affect all users, and in some cases could be resolved by retrying. It was isolated to the
web UI and did not impact API-based writes and queries.
Cause of the Incident
The incident was caused by the clocks on some nodes running slightly behind the master
clock. When a user logs in, a JWT session token is signed with an nbf (not-before) timestamp
based on the signing server's clock. When a request reaches a gateway pod, the pod validates the
token, and one of the checks confirms that the token's nbf time has already passed. If the
gateway pod runs on a node whose clock is behind the signing node's clock, the JWT is rejected
because, from that node's perspective, the token's nbf time is in the future.
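The not-before check can be illustrated with a minimal sketch. This is not the gateway's actual
code: the function name, the timestamps, and the 3-second lag are hypothetical values chosen for
illustration. The optional leeway parameter reflects the small tolerance that RFC 7519 permits
verifiers to apply to account for clock skew; whether the gateway uses one is not stated here.

```python
def nbf_check_passes(nbf: float, verifier_now: float, leeway: float = 0.0) -> bool:
    """Return True if the token is already valid from the verifier's point of view."""
    return verifier_now + leeway >= nbf


# Hypothetical values, in Unix seconds.
signer_now = 1_768_500_000.0      # signing server's clock at login
token_nbf = signer_now            # nbf is stamped from the signer's clock

# The request reaches a gateway 50 ms after signing.
in_sync_gateway_now = signer_now + 0.05          # node clock agrees with the signer
lagging_gateway_now = signer_now + 0.05 - 3.0    # node clock runs 3 s behind (assumed)

print(nbf_check_passes(token_nbf, in_sync_gateway_now))              # True
print(nbf_check_passes(token_nbf, lagging_gateway_now))              # False
print(nbf_check_passes(token_nbf, lagging_gateway_now, leeway=5.0))  # True
```

In this sketch, the lagging gateway sees an nbf about 2.95 seconds in its own future and rejects
the token, while a leeway larger than the skew would let the same token through.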
The problem was intermittent because a load balancer distributed the authentication requests
across pods, and not all of the pods ran on nodes with clock skew. The same token was accepted
on nodes where the time was correct and rejected on the few nodes where the time was behind.
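The intermittent pattern can be reproduced with a small simulation. This is a sketch under
stated assumptions: the pod names, the clock offsets, and the round-robin policy are all
hypothetical, since the actual load-balancing policy and skew values are not described above.
Time is held fixed so that only the choice of pod changes between requests.

```python
from itertools import cycle

# Hypothetical clock offsets (seconds) of each gateway pod's node relative to the signer.
POD_OFFSETS = {"gw-a": 0.0, "gw-b": 0.0, "gw-c": -3.0, "gw-d": 0.0}


def pod_accepts(nbf: float, true_time: float, offset: float) -> bool:
    """A pod accepts the token if its local clock has reached the nbf time."""
    return true_time + offset >= nbf


def simulate(requests: int, nbf: float = 100.0, true_time: float = 100.1):
    """Send the same token to pods in round-robin order (assumed policy)."""
    pods = cycle(POD_OFFSETS.items())
    results = []
    for _ in range(requests):
        name, offset = next(pods)
        results.append((name, pod_accepts(nbf, true_time, offset)))
    return results


for name, accepted in simulate(8):
    print(name, "accepted" if accepted else "rejected")
```

Only requests routed to the lagging pod fail, and a retry that lands on a different pod succeeds,
which matches the behavior users reported.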
Recovery
We identified the nodes whose clocks had drifted and drained them. Once the service was
restarted on new nodes with the correct time, the problem was resolved.
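Identifying drifted nodes amounts to comparing each node's clock reading against a trusted
reference. The sketch below is illustrative only and does not describe the tooling used during
the incident: the node names, readings, and 0.5-second tolerance are hypothetical, and it
assumes all readings were taken at the same instant as the reference.

```python
def find_drifted_nodes(node_times, reference_time, tolerance=0.5):
    """Return {node: offset_seconds} for nodes whose clock differs from the
    reference by more than the tolerance. Negative offsets mean the node is behind."""
    drifted = {}
    for node, reading in node_times.items():
        offset = reading - reference_time
        if abs(offset) > tolerance:
            drifted[node] = offset
    return drifted


# Hypothetical readings in seconds; node-2 is roughly 2.9 s behind the reference.
readings = {"node-1": 1000.02, "node-2": 997.1, "node-3": 999.98}
print(find_drifted_nodes(readings, reference_time=1000.0))
```

Nodes returned by such a check would be the candidates for draining.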
Timeline
tasks in the UI in the eu-central AWS cluster, and engineering began to investigate the
issue.
cluster.
drained.
2026-01-15 21:45 UTC - Confirmed that authentications were successful.
2026-01-16 16:11 UTC - An additional node was showing signs of clock skew, so the
node was drained and the incident was closed.
Future mitigation
We are still investigating why the clocks drifted on these nodes. The nodes use the
AWS-provided time synchronization service, which should have kept their clocks in sync.