Queries Timing Out
A bulk export job was running to export a customer’s data. Because the export was taking a long time, we allocated more resources to the job. The export job uses ephemeral disks and runs on the same nodes as other services, including the query nodes. With the job now running on many nodes, it consumed the ephemeral disk space on those shared nodes, which impacted the other services and caused queries to fail. When we were alerted to the query failures, we stopped the bulk export job and the cluster recovered. We will be reworking the bulk export to run on its own dedicated PVC, so that it cannot impact other services in the cluster.
We are sorry for the service disruption that this caused.