InfluxDB is a Kubernetes application. Among it’s core services are a compute tier and the storage tier. Roughly, when a user submits a query, the compute tier queues requests, calculates what to request from the storage tier, sends a request to the storage tier, completes any computations not fulfilled by the storage tier, and returns the data to the requestor.
Unknown at this time. The team’s current diagnosis is that an as yet unidentified anomaly in the storage tier may have caused processes to be processed slowly (for example by starving them of CPU), but the anomaly was resolved when an unrelated deployment caused all of the containers in the storage tier to reset.
No known contributing factors. The team was able to rule out common causes and contributing factors such as:
The team did observe some other factors which were correlated in time with the incident:
Query duration slowed down dramatically for all users on the cluster. Queries that may have taken a long time to finish previously timed out during the incident. Writes were unaffected. The API was otherwise unaffected.
The team continues to investigate the root cause. Next steps are to enhance the logging of query behavior in the storage tier to assist in ruling out the role of long running processes in the incident.