A refactoring intended to improve readability and support future improvements, combined with a limited amount of invalid data in AWS us-west-2, caused the series file and index code path to encounter read and write problems.
At the code level, a change was made to guarantee that certain files were closed on error; however, other code was relying on the file not being closed on error.
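As a rough illustration only (a hypothetical Go sketch, not the actual InfluxDB code), a function refactored to always close a file when an error occurs can quietly break callers that assumed the handle stayed open on error:

```go
package main

import (
	"fmt"
	"os"
)

// loadSeriesFile is a hypothetical sketch of the refactored pattern:
// if anything fails after opening, the file is now guaranteed to be
// closed before returning. Code elsewhere that assumed the handle
// stayed open on error (for example, to retry reading or to close it
// itself later) would then be operating on a closed file.
func loadSeriesFile(path string) (*os.File, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	if err := checkHeader(f); err != nil {
		f.Close() // refactor: always close on error
		return f, fmt.Errorf("loading %s: %w", path, err)
	}
	return f, nil
}

// checkHeader stands in for whatever validation the real code performs.
func checkHeader(f *os.File) error {
	info, err := f.Stat()
	if err != nil {
		return err
	}
	if info.Size() == 0 {
		return fmt.Errorf("empty series file")
	}
	return nil
}

func main() {
	f, err := loadSeriesFile("series.db")
	if err != nil {
		if f != nil {
			// A caller relying on the pre-refactor behavior might keep
			// using f here, but after the change it is already closed.
			_, readErr := f.Read(make([]byte, 1))
			fmt.Println("read on closed file:", readErr)
		}
		fmt.Println("load failed:", err)
		return
	}
	defer f.Close()
	fmt.Println("loaded", f.Name())
}
```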
Finding and fixing the issue was complicated by several additional factors:
Why weren't tests included with that refactoring? The team did include a new test with the refactoring, but the specific code that caused the problem was deliberately excluded from the refactor and was therefore not covered by the new test.
Why wasn’t the code immediately reverted? The team did revert the code immediately, but we follow the principle of not releasing changes to production unless we have positive confirmation that they will help. However, the above-mentioned secondary bug caused the tests to fail, which impaired verification testing until that bug was discovered.
Why did this only impact one region? As mentioned above, the problem was triggered by the combination of the code change to guarantee closing files and the presence of certain “illegal” data in storage. Only us-west-2 contained such data.
Did customers lose data? No. All writes succeeded, save for a few hours of data that was not written but was preserved in our backups (our data durability system is described here: https://docs.influxdata.com/influxdb/cloud/reference/internals/durability/). The missing data was replayed to guarantee that all data was written. Because certain series were not returned in queries, users’ tasks likely did not behave as expected. For example, checks may have stopped writing statuses for specific series, or downsampling may not have occurred as expected. End users can rerun tasks for the impacted time range if needed.
Where did the “bad” data come from? The team traced the source of the bad data (zero-length tag values) to a Flux function that bypasses our usual storage client, and with it the safeguards the client provides: specifically, the Flux function does not ensure that all data is valid line protocol.
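To make the failure mode concrete, here is a minimal, hypothetical Go sketch of the kind of safeguard a storage client can apply to a line-protocol tag set, and of the zero-length tag values that slip through when a write path bypasses it. The names and checks are illustrative, not the actual InfluxDB validation code:

```go
package main

import "fmt"

// validateTags is a simplified, hypothetical illustration of a storage
// client safeguard: reject zero-length tag values, the "bad" data
// described above. The real validation in InfluxDB is more involved;
// this only shows the principle.
func validateTags(tags map[string]string) error {
	for key, value := range tags {
		if len(key) == 0 {
			return fmt.Errorf("zero-length tag key")
		}
		if len(value) == 0 {
			return fmt.Errorf("zero-length value for tag %q", key)
		}
	}
	return nil
}

func main() {
	// Roughly equivalent to the line protocol "weather,city=pdx temp=81".
	fmt.Println(validateTags(map[string]string{"city": "pdx"}))

	// Roughly equivalent to "weather,city= temp=81": an empty tag value,
	// which is only caught if the write path runs this kind of check.
	fmt.Println(validateTags(map[string]string{"city": ""}))
}
```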
Friday June 11 - Refactoring PR is released, makes its way through CI and CD, and then into production.
Saturday June 12 - Alerts start showing that errors are occurring in the internal environment. Code is reverted there, and the problem appears mitigated. Series begin to be missed in production queries, but this goes unnoticed.
Sunday June 13 - The team believes production is stable and, for the most part, monitors the internal environment.
Monday June 14 - The team prepares and tests several new versions with different combinations of code revisions, and begins debugging files produced in production. Images with reverted code are built, but tests continue to fail. A small number of customers report that they are missing series in their queries. All of these customers are in the same region (AWS us-west-2). The team confirms that data continues to be written successfully, so there is no underlying data loss.
Tuesday June 15 - More customers report missing data in their queries. Later that evening, the secondary bug causing test failures is uncovered, and it is then confirmed that a code reversion will work.
Wednesday June 16 - The reverted code is released to the internal environment and heavily tested there to ensure that it fixes the problem and does not create new problems. The code is then released only to the impacted cluster. We experience a query outage of approximately 7 to 8 minutes during the deployment, but all series become queryable. It takes approximately 3 hours for all data to be recovered from S3. The team investigates and identifies a small amount of data that was collected but not written. The team isolates that data and writes a custom tool to replay this subset efficiently.
Thursday June 17 - The team replays the small amount of missing data and updates all production clusters with the reverted code. A final investigation confirms that all problems have been fixed and that no new problems have arisen. The incident is closed.