Restoration of historical data in AWS, US-East-1

Incident Report for InfluxDB Cloud

Postmortem

Summary

A software change was introduced that was intended to improve data compaction for very large partitions. The change had a bug that, under certain circumstances, caused some data to be dropped during the compaction process. When compaction completed, the new compacted file (missing a portion of the source data) replaced the original files. This made some data unavailable until it was restored from backup.

Because the underlying data remained intact in backups, no permanent data loss occurred. After identifying the issue, we rolled back the software change and initiated a full data restoration from backups. While the restoration process is reliable, restoring data at cluster scale while maintaining normal system operations required an extended recovery period. During restoration, some customers experienced elevated time-to-become-readable (TTBR) and degraded query performance.

Cause

Data in InfluxDB Cloud 2 is organized into 64 partitions. Each partition is compacted regularly to improve query performance, since uncompacted data is slower to query.

In January, we identified that one of the partitions in prod101-us-east-1 (partition 47) was no longer compacting successfully. As a result, partially compacted files accumulated in this partition and began impacting performance. To address this, our engineering team designed and implemented an optimization that runs multiple compaction jobs in parallel. Each job targeted a subset of the data in the partition (based on the org id prefix), enabling heavily loaded partitions, such as partition 47, to be compacted more effectively.
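The parallelization scheme described above can be sketched as follows. This is an illustrative model only; the function names and the single-character prefix split are our assumptions, not InfluxDB internals.

```python
from concurrent.futures import ThreadPoolExecutor

def shard_by_org_prefix(org_ids):
    """Group org ids by their first character (the 'prefix').

    Hypothetical sketch: each resulting group can be compacted by an
    independent job, so one heavily loaded partition is served by
    several smaller compaction jobs instead of one large one.
    """
    shards = {}
    for org_id in org_ids:
        shards.setdefault(org_id[0], []).append(org_id)
    return shards

def compact_shard(prefix, org_ids):
    """Placeholder for one compaction job over a subset of the data."""
    return (prefix, len(org_ids))

def compact_partition_parallel(org_ids, max_workers=4):
    """Run one compaction job per prefix shard, in parallel."""
    shards = shard_by_org_prefix(org_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda kv: compact_shard(*kv), shards.items())
        return sorted(results)
```

The design choice here is that each job's input set is disjoint, so jobs never contend for the same source files.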

This code change was reviewed by two additional engineers as part of our standard release process. However, the optimization introduced a bug that, in certain circumstances, caused the compactor to fail when reading some of the source TSM data files. The compactor did not handle this failure correctly: it silently skipped the unreadable TSM files, compacted the rest of the data, and then replaced the original files with the compacted output. This issue was not detected during code review or internal testing, and only surfaced after the change was applied to prod101-us-east-1.
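The failure mode can be illustrated with a simplified sketch (the function names are ours, not InfluxDB's). Because the compacted output replaces all of the source files, a read failure must abort the whole job; skipping the unreadable file, as the buggy code effectively did, drops that file's data.

```python
class CompactionError(Exception):
    """Raised when a compaction job must abort instead of producing output."""

def compact(source_files, read_file):
    """Merge source files into one compacted output.

    A read failure aborts the whole job: the compacted output replaces
    *all* source files, so silently skipping an unreadable file (the
    bug, roughly a `continue` here) would drop that file's data.
    """
    merged = []
    for f in source_files:
        try:
            merged.extend(read_file(f))
        except IOError as e:
            raise CompactionError(f"aborting: failed to read {f}") from e
    return merged
```

With this guard, the original files stay in place when any read fails, and the partition merely remains uncompacted until the underlying read error is fixed.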

Recovery

Once the issue was identified, we immediately rolled back the compaction change to prevent any further impact. We then began restoring the affected data. 

Cloud 2 maintains two separate backup sources. Each cluster is protected by weekly TSM snapshots, and all incoming writes are recorded in Kafka logs. Together, these provide a complete record of the data written to the cluster and allow us to fully restore the affected data.

Because the issue only impacted partitions that were compacted while the buggy code was running, not all data in the cluster was affected. However, as a precaution, the recovery process prioritized restoring the most recent data first, as it is typically the most critical for customer operations. Recent data is reconstructed from Kafka logs, which contain a complete record of writes into the cluster. Historical data is then restored from TSM snapshots.
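The two-phase ordering above can be sketched as a simple restore plan. The names and the 8-day Kafka horizon are assumptions chosen to match the February 2 phase boundary in this report, not actual retention settings.

```python
from datetime import date, timedelta

def plan_restore(incident_day, kafka_horizon_days):
    """Order restore phases: most recent data first, from the Kafka
    write log, then older data from weekly TSM snapshots.

    Illustrative only; the horizon is an assumed Kafka retention window.
    """
    kafka_start = incident_day - timedelta(days=kafka_horizon_days)
    return [
        # Phase 1: reconstruct recent writes from the Kafka log.
        ("kafka-log", kafka_start, incident_day),
        # Phase 2: restore everything older from TSM snapshots.
        ("tsm-snapshot", None, kafka_start),
    ]
```

Restoring the Kafka-backed range first matches the report's Feb 2-9 / pre-Feb 2 split: the freshest data comes back quickly, while the slower snapshot-based restore proceeds in the background.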

The cluster remained online throughout the recovery process because customers rely on it for real-time workloads. As a result, restoration had to be performed incrementally so recovery activity would not disrupt ongoing ingest and query operations. 

Although this approach takes longer than restoring the cluster offline from a full backup, it allows the system to remain available while the recovery progresses. During portions of the recovery, restoration activity resulted in elevated TTBR and slower query performance.

Timeline

Feb 9 - Deployed compaction change to prod101-us-east-1.

Feb 9 - First ticket raised indicating potential data availability issues. Investigation began.

Feb 10 - Additional tickets raised. Root cause identified.

Feb 10 - Compaction change rolled back.

Feb 10 - Data restoration process initiated, starting with the most recent data.

Feb 12 - Recent data (Feb 2-10) fully restored for a test set of impacted customers. Recent data restoration began for all customers. 

Feb 17 - Modified restoration process to improve efficiency for historical data recovery.

Feb 23 - Recent data (Feb 2-9) fully restored for all customers.

Feb 23 - Historical data restoration began for all customer data from Feb 2 and earlier.

Feb 25 - Historical data is fully restored for initial test set of customers.

Mar 8 - Historical data fully restored for all paying customers.

Future mitigation

We recognize the disruption this incident caused and sincerely apologize to customers who were impacted, including those who experienced degraded query performance during the recovery. Reliable access to your data is critical, and we take incidents like this very seriously.

Following this incident, we are implementing several improvements to reduce the risk of similar issues in the future. 

We are expanding our testing coverage for changes that affect critical storage components such as the compactor, including additional test scenarios that better reflect production-scale workloads. 

We are also enhancing safeguards around the compaction process to ensure failures are detected and handled safely.

In addition, we are reviewing our restoration workflows and operational procedures to reduce the potential for elevated TTBR or query performance degradation during large-scale restoration.

We remain committed to strengthening the resilience and operational safety of the platform.

Posted Mar 10, 2026 - 01:22 UTC

Resolved

This incident has been resolved.
Posted Mar 10, 2026 - 01:19 UTC

Update

We sincerely apologize for the length of this incident. Restoration of data from February 2 onwards is now substantially complete. The second phase of restoration, covering data prior to February 2, is underway and has already been completed for many of our customers. While we continue to restore the remaining historical data, some customers may notice elevated time-to-become-readable (TTBR) on queries. We apologize in advance for any inconvenience this may cause. We are working to maintain the responsiveness of the service while also restoring the last of the historical data as quickly as possible. Please contact support with any questions or concerns.
Posted Feb 26, 2026 - 19:10 UTC

Update

We apologize for the length of time that it has taken to restore customer data that was inadvertently deleted. The data is all available in our backups, but it is taking us an unacceptably long time to restore it, as we have had to balance the restoration of historical data with service availability of the cluster. After we have resolved this issue, we will be revisiting our backup/restore strategy to improve the performance of cluster-wide restorations. For this particular restoration, we have been restoring the data in two phases (data from February 2-9, and data prior to February 2nd). Most customers should now be able to see all of their data from February 2nd onwards, with the last remaining data from this time range expected to be available by Sunday February 20th. Once we have restored all the data from February 2nd onwards, we will be able to give a more accurate time estimate for restoring the remaining older data. Once again, we sincerely apologize for how long it is taking us to restore all of your data. After we have completed the restoration, we will provide a full RCA for this outage.
Posted Feb 20, 2026 - 23:41 UTC

Update

Our team continues to actively work to restore data for all affected users.
Posted Feb 19, 2026 - 17:07 UTC

Update

We apologize for the ongoing disruption to prod101-us-east-1. We are continuing to restore data from our backups. During the restore, there may occasionally be heightened TTBR.
Posted Feb 18, 2026 - 16:18 UTC

Update

Our team has been working around the clock to restore data availability for all affected users. Work will continue through the weekend with no interruption. We do not yet have a resolution timeline to share, but we are committed to providing updates as soon as they are available until data restoration is complete. We sincerely apologize for the continued impact.
Posted Feb 14, 2026 - 00:08 UTC

Update

Our team continues to actively work to restore data for all affected users.
Posted Feb 13, 2026 - 01:43 UTC

Update

Our team continues to actively work to restore data for all affected users.
Posted Feb 12, 2026 - 10:49 UTC

Update

Our team continues to actively work to restore data for all affected users.
Posted Feb 12, 2026 - 07:46 UTC

Update

Our team continues to actively work to restore data for all affected users.
Posted Feb 12, 2026 - 01:29 UTC

Update

Our team continues to actively work to restore data for all affected users.
Posted Feb 11, 2026 - 18:57 UTC

Update

Our team continues to actively work to restore data for all affected users.
Posted Feb 11, 2026 - 16:36 UTC

Update

Our team is actively working to restore data for affected users.
Posted Feb 11, 2026 - 10:11 UTC

Update

Our team has identified the issue and is actively working to restore data for affected users.
Posted Feb 11, 2026 - 07:32 UTC

Update

Our team has identified the issue and is actively working to restore data for affected users.
Posted Feb 11, 2026 - 04:46 UTC

Update

Our team has identified the issue and is actively working to restore data for affected users.
Posted Feb 11, 2026 - 01:37 UTC

Update

We are continuing to work on the fix for this issue.
Posted Feb 10, 2026 - 22:54 UTC

Update

We are continuing to work on the fix for this issue.
Posted Feb 10, 2026 - 21:40 UTC

Update

We are continuing to work on the fix for this issue.
Posted Feb 10, 2026 - 19:21 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 10, 2026 - 18:12 UTC

Investigating

We are investigating an issue related to queries. Some recent data is not showing up in query responses.
Posted Feb 10, 2026 - 17:37 UTC
This incident affected: Cloud Serverless: AWS, US-East-1 (Web UI, API Queries).