A software change was introduced that was intended to improve data compaction for very large partitions. The change had a bug that, under certain circumstances, caused some data to be dropped during the compaction process. When compaction completed, the new compacted file (missing a portion of the source data) replaced the original files. This made some data unavailable until it was restored from backup.
Because the underlying data remained intact in backups, no permanent data loss occurred. After identifying the issue, we rolled back the software change and initiated a full data restoration from backups. While the restoration process is reliable, restoring data at cluster scale while maintaining normal system operations required an extended recovery period, during which some customers experienced elevated TTBR and degraded query performance.
Data in InfluxDB Cloud 2 is organized into 64 partitions. Each partition is regularly compacted to improve query performance since uncompacted data is slower to query.
In January, we identified that one of the partitions in prod101-us-east-1 (partition 47) was no longer compacting successfully. As a result, partially compacted files accumulated in this partition and began impacting performance. To address this, our engineering team designed and implemented an optimization to run multiple compaction jobs in parallel. Each job targeted a subset of the data in the partition (based on the org ID prefix), enabling heavily loaded partitions, such as partition 47, to be compacted more effectively.
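As an illustration, the parallelization might look like the following sketch. This is hypothetical Python (the real compactor is internal Go code), and all names here, such as `plan_jobs` and the `org_id` field, are assumptions for the example:

```python
# Hypothetical sketch of splitting one heavily loaded partition's compaction
# into parallel jobs keyed by org-ID prefix. All names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def plan_jobs(files, num_jobs=4):
    """Group a partition's files into num_jobs buckets by the first hex
    character of each file's org ID, dropping empty buckets."""
    buckets = [[] for _ in range(num_jobs)]
    for f in files:
        buckets[int(f["org_id"][0], 16) % num_jobs].append(f)
    return [b for b in buckets if b]

def compact_partition(files, compact_job, num_jobs=4):
    """Run one compaction job per bucket, in parallel."""
    jobs = plan_jobs(files, num_jobs)
    with ThreadPoolExecutor(max_workers=num_jobs) as pool:
        return list(pool.map(compact_job, jobs))
```

Sharding by org-ID prefix keeps each job's working set disjoint, so jobs never contend over the same source files.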
This code change was reviewed by two additional engineers as part of our standard release process. However, the optimization introduced a bug that, in certain circumstances, caused the compactor to fail when reading some of the source TSM data files. The compactor did not handle this failure correctly: it silently skipped the unreadable TSM files, compacted the rest of the data, and then replaced the original files with the incomplete output. The issue was not detected during code review or internal testing, and only surfaced after the change was applied to prod101-us-east-1.
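The difference between the buggy and the correct behavior can be sketched as follows. This is an illustrative Python reconstruction, not the actual compactor code; `read_tsm` and `CompactionError` are hypothetical names:

```python
# Illustrative reconstruction of the failure mode. The buggy version caught
# read errors, skipped the affected file, and carried on, so the compacted
# output silently omitted that file's data before replacing the originals.
# Correct handling aborts the job and leaves the original files untouched.
class CompactionError(Exception):
    pass

def compact(files, read_tsm):
    """Merge source files into one sorted, compacted output."""
    merged = []
    for f in files:
        try:
            merged.extend(read_tsm(f))
        except IOError as exc:
            # Fail loudly: never swap in an output that is missing source data.
            raise CompactionError(f"failed to read {f}; aborting compaction") from exc
    return sorted(merged)
```

The key property is that a read failure leaves the originals in place rather than producing a smaller output that still replaces them.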
Once the issue was identified, we immediately rolled back the compaction change to prevent any further impact. We then began restoring the affected data.
Cloud 2 maintains two separate backup sources. Each cluster is protected by weekly TSM snapshots, and all incoming writes are recorded in Kafka logs. Together, these provide a complete record of the data written to the cluster and allow us to fully restore the affected data.
Because the issue only impacted partitions that were compacted while the buggy code was running, not all data in the cluster was affected. However, as a precaution, the recovery process prioritized restoring the most recent data first as it is typically the most critical data for customer operations. Recent data is reconstructed from Kafka logs, which contain a complete record of writes into the cluster. Historical data is then restored from TSM snapshots.
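The two-source, recent-first strategy can be sketched roughly like this. The function name, the range representation, and the Kafka retention boundary are all assumptions for the example, not details of the actual restoration tooling:

```python
# Rough sketch of splitting affected time ranges between the two backup
# sources: Kafka replay for recent data, TSM snapshots for older data.
from datetime import date

def restore_plan(affected_ranges, kafka_retention_start):
    """Return (replay, snapshot) work lists. Replay covers data still in
    the Kafka logs and is ordered newest-first, because recent data is
    typically the most critical for customer operations."""
    replay, snapshot = [], []
    for start, end in affected_ranges:
        if end >= kafka_retention_start:
            replay.append((max(start, kafka_retention_start), end))
        if start < kafka_retention_start:
            snapshot.append((start, min(end, kafka_retention_start)))
    replay.sort(key=lambda r: r[1], reverse=True)
    return replay, snapshot
```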
The cluster remained online throughout the recovery process because customers rely on it for real-time workloads. As a result, restoration had to be performed incrementally so recovery activity would not disrupt ongoing ingest and query operations.
Although this approach takes longer than restoring the cluster offline from a full backup, it allows the system to remain available while the recovery progresses. During portions of the recovery, restoration activity resulted in elevated TTBR and slower query performance.
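The incremental approach amounts to pacing the restore work so it cannot starve live traffic. A minimal sketch, assuming a simple throughput cap (the actual mechanism used during recovery is not described in this report):

```python
# Minimal pacing sketch: apply restore batches sequentially, sleeping as
# needed so average restore throughput stays under a cap, leaving I/O
# headroom for live ingest and queries. Batch shape and cap are assumptions.
import time

def restore_incrementally(batches, apply_batch, max_rate_mb_s=50.0):
    for batch in batches:
        started = time.monotonic()
        apply_batch(batch)
        # Each batch must take at least size / rate seconds on average.
        min_duration = batch["size_mb"] / max_rate_mb_s
        elapsed = time.monotonic() - started
        if elapsed < min_duration:
            time.sleep(min_duration - elapsed)
```

The trade-off is exactly the one described above: a lower cap protects live workloads but stretches out the total recovery time.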
Feb 9 - Deployed compaction change to prod101-us-east-1.
Feb 9 - First ticket raised indicating potential data availability issues. Investigation began.
Feb 10 - Additional tickets raised. Root cause identified.
Feb 10 - Compaction change rolled back.
Feb 10 - Data restoration process initiated, starting with the most recent data.
Feb 12 - Recent data (Feb 2-10) fully restored for a test set of impacted customers. Recent data restoration began for all customers.
Feb 17 - Modified restoration process to improve efficiency for historical data recovery.
Feb 23 - Recent data (Feb 2-9) fully restored for all customers.
Feb 23 - Historical data restoration resumed for all customer data from Feb 2 and earlier.
Feb 25 - Historical data fully restored for the initial test set of customers.
Mar 8 - Historical data fully restored for all paying customers.
We recognize the disruption this incident caused and sincerely apologize to customers who were impacted. During portions of the recovery process, some customers experienced degraded query performance. Reliable access to your data is critical, and we take incidents like this very seriously.
Following this incident, we are implementing several improvements to reduce the risk of similar issues in the future.
We are expanding our testing coverage for changes that affect critical storage components such as the compactor, including additional test scenarios that better reflect production-scale workloads.
We are also enhancing safeguards around the compaction process to ensure failures are detected and handled safely.
In addition, we are reviewing our restoration workflows and operational procedures to reduce the potential for elevated TTBR or query performance degradation during large-scale restoration.
We remain committed to strengthening the resilience and operational safety of the platform.
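As a concrete example of the compaction safeguards described above, one option is a pre-swap verification step that refuses to replace the original files unless the compacted output accounts for every source series. This is a hypothetical sketch, not a committed design; all names are illustrative:

```python
# Hypothetical pre-swap safety check: never replace source files unless the
# compacted output accounts for every series present in the sources.
def verify_compaction(source_series_counts, output_series_counts):
    """Raise if any source series is missing or emptied in the compacted
    output; return True when the file swap is safe to perform."""
    missing = set(source_series_counts) - set(output_series_counts)
    if missing:
        raise RuntimeError(f"compaction dropped series: {sorted(missing)}")
    for series, count in source_series_counts.items():
        # Compaction may deduplicate points, but a non-empty source series
        # must never come out empty.
        if count > 0 and output_series_counts[series] == 0:
            raise RuntimeError(f"series {series} emptied during compaction")
    return True
```

A check of this shape would have turned the silent data loss in this incident into a loud failure before the original files were replaced.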