Microsoft's Azure cloud accidentally deletes customer DBs
Five-minute gap in which transactions for some punters are toast
Users of Microsoft’s Azure system lost database records as part of a mass outage on Tuesday, Jan 29th 2019. A combination of DNS problems and automated scripts was to blame.
Microsoft deleted several Transparent Data Encryption (TDE) databases in Azure, holding live customer information. TDE databases dynamically encrypt the information they store, decrypting it when customers access it. Keeping the data encrypted at rest stops an intruder with access to the database from reading the information.
While there are different approaches to encrypting these tables, many Azure users store their own encryption keys in Microsoft’s Key Vault encryption key management system, in a process called Bring Your Own Key (BYOK).
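The BYOK arrangement amounts to envelope encryption: the key the database actually encrypts with is itself wrapped by a customer-managed key that lives only in Key Vault, so losing the vault means losing the ability to decrypt. A minimal conceptual sketch of the idea, with a dict standing in for the vault and XOR standing in for a real key-wrap algorithm (all names and logic here are illustrative, not Azure's implementation):

```python
# Conceptual sketch of BYOK envelope encryption. The "vault" is a dict and
# "wrap" is XOR -- toy stand-ins for Azure Key Vault and real cryptography.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Toy stand-in for a key-wrap algorithm; NOT real cryptography."""
    return bytes(x ^ y for x, y in zip(a, b))

# The customer-managed key lives only in the (simulated) vault.
key_vault = {"customer-key": bytes(16)}          # toy 16-byte key
data_encryption_key = b"0123456789abcdef"        # key the DB actually uses

# The database stores only the *wrapped* DEK, never the customer key itself.
wrapped_dek = xor_bytes(data_encryption_key, key_vault["customer-key"])

def unwrap_dek(vault: dict) -> bytes:
    """The database can only recover its DEK while the vault is reachable."""
    if "customer-key" not in vault:
        raise ConnectionError("Key Vault unreachable: cannot decrypt data")
    return xor_bytes(wrapped_dek, vault["customer-key"])

print(unwrap_dek(key_vault) == data_encryption_key)  # True while vault is up
```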
The deletions were automated, triggered by a script that drops TDE database tables when their corresponding keys can no longer be accessed in the Key Vault, explained Microsoft in a letter reportedly sent to customers.
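The failure mode, as described, was a script that treated any failure to reach the key as grounds for dropping the database, conflating a transient vault outage with a deliberately revoked key. A hedged sketch of that conflation (the exception names and flow are assumptions, not Microsoft's actual code):

```python
# Sketch of the problematic cleanup logic: a transient vault outage is
# indistinguishable, to this code, from a deliberately revoked key.

class VaultUnreachable(Exception):
    """Transient: DNS/network failure reaching Key Vault."""

class KeyRevoked(Exception):
    """Deliberate: the customer revoked the key."""

def fetch_key(vault_status: str) -> bytes:
    if vault_status == "unreachable":
        raise VaultUnreachable
    if vault_status == "revoked":
        raise KeyRevoked
    return b"key-material"

def cleanup_tde_database(vault_status: str) -> str:
    """The flawed behaviour: any failure to fetch the key results in a drop."""
    try:
        fetch_key(vault_status)
        return "kept"
    except (VaultUnreachable, KeyRevoked):
        return "dropped"     # a DNS blip now deletes live customer data

print(cleanup_tde_database("unreachable"))  # dropped -- the incident
```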
The company quickly restored the tables from a five-minute snapshot backup, but that meant any transactions that customers had processed within five minutes of the table drop would have to be dealt with manually. In this case, customers would have to raise a support ticket and ask for the database copy to be renamed to the original name.
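Restoring from a snapshot taken up to five minutes before the drop means any transaction committed after the snapshot timestamp is simply absent from the restored copy. The arithmetic is straightforward; a sketch with invented timestamps and field names:

```python
from datetime import datetime, timedelta

# Hypothetical timeline: a snapshot is taken, then the table is dropped
# up to five minutes later.
snapshot_time = datetime(2019, 1, 29, 21, 0)
drop_time = snapshot_time + timedelta(minutes=5)

transactions = [
    {"id": 1, "committed": snapshot_time - timedelta(minutes=2)},  # safe
    {"id": 2, "committed": snapshot_time + timedelta(minutes=1)},  # lost
    {"id": 3, "committed": snapshot_time + timedelta(minutes=4)},  # lost
]

# Anything committed after the snapshot is missing from the restore and
# has to be reconciled manually (hence the support tickets).
lost = [t["id"] for t in transactions if t["committed"] > snapshot_time]
print(lost)  # [2, 3]
```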
TRACKING ID: Q7N7-VWG
STATUS: RCA 2/1/2019 2:20:02 AM (UTC)
SUMMARY OF IMPACT: Between 21:00 UTC on 29 Jan 2019 and 00:10 UTC on 30 Jan 2019, customers may have experienced issues accessing Microsoft Cloud resources, as well as intermittent authentication issues across multiple Azure services affecting the Azure Cloud, US Government Cloud, and German Cloud. This issue was mitigated for the Azure Cloud at 23:05 UTC on 29 Jan 2019. Residual impact to the Azure Government Cloud was mitigated at 00:00 UTC on 30 Jan 2019 and to the German Cloud at 00:10 UTC on 30 Jan 2019.
Azure AD authentication in the Azure Cloud was impacted between 21:07 - 21:48 UTC, and MFA was impacted between 21:07 - 22:25 UTC.
Customers using a subset of other Azure services including: Microsoft Azure portal, Azure Data Lake Store, Azure Data Lake Analytics, Application Insights, Azure Log Analytics, Azure DevOps, Azure Resource Graph, Azure Container Registries, and Azure Machine Learning may have experienced intermittent inability to access service resources during the incident. A limited subset of customers using SQL transparent data encryption with bring your own key support may have had their SQL database dropped after Azure Key Vault was not reachable. The SQL service restored all databases.
For customers using Azure Monitor, there was a period of time where alerts, including Service Health alerts, did not fire. Azure internal communications tooling was also affected by the external DNS incident. As a result, we were delayed in publishing communications to the Service health status on the Azure Status Dashboard. Customers may have also been unable to log into their Azure Management Portal to view portal communications.
ROOT CAUSE AND MITIGATION:
Root Cause: An external DNS service provider experienced a global outage after rolling out a software update that exposed a data corruption issue in its primary servers and affected its secondary servers, impacting network traffic. A subset of Azure services, including Azure Active Directory, was leveraging the external DNS provider at the time of the incident and subsequently experienced downstream impact as a result of the external DNS provider incident. Azure services that leverage Azure DNS were not impacted; however, customers may have been unable to access these services because of an inability to authenticate through Azure Active Directory. An extremely limited subset of SQL databases using "bring your own key" support was dropped after losing connectivity to Azure Key Vault: once connectivity was lost, the key was treated as revoked, and the SQL database was dropped.
Mitigation: DNS services were failed over to an alternative DNS provider which mitigated the issue. This issue was mitigated for the Azure Cloud at 23:05 UTC on 29 Jan 2019. Residual impact to the Azure Government Cloud was mitigated at 00:00 UTC on 30 Jan 2019 and to the German Cloud at 00:10 UTC on 30 Jan 2019. Authentication requests which occurred prior to the routing changes may have failed if the request was routed using the impacted DNS provider. While Azure Active Directory (AAD) leverages multiple DNS providers, manual intervention was required to route a portion of AAD traffic to a secondary provider.
The external DNS service provider has resolved the root cause issue by addressing the data corruption issue on their primary servers. They have also added filters to catch this class of issue in the future, lowering the risk of reoccurrence.
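The mitigation described above is conceptually a resolver that walks an ordered list of providers and returns the first answer, skipping any provider that is down. A toy sketch of that failover (provider names and records are invented):

```python
# Toy multi-provider resolver: try each DNS provider in order, falling back
# on failure. Stands in for the manual re-routing described in the RCA.

PROVIDERS = {
    "primary-dns": None,                          # outage: no answers at all
    "secondary-dns": {"login.example.com": "203.0.113.7"},
}

def resolve(name: str) -> str:
    for provider, records in PROVIDERS.items():
        if records is None:
            continue                              # provider down: fail over
        if name in records:
            return records[name]
    raise LookupError(f"no provider could resolve {name}")

print(resolve("login.example.com"))  # 203.0.113.7, via the secondary
```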
Azure SQL engineers restored all SQL databases that were dropped as a result of this incident.
NEXT STEPS: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
• Azure Active Directory and other Azure services that currently leverage an external DNS provider are being on-boarded to our fully redundant, Azure-native DNS hosting solution, alongside best practices around DNS usage. This will help protect against a similar recurrence [in progress].
• Azure SQL DB has deployed code that will prevent SQL DB from being dropped when Azure Key Vault is not accessible [complete]
• Azure SQL DB code is under development to correctly distinguish between Azure Key Vault being unreachable and a key being revoked [in progress]
• Azure SQL DB code is under development to introduce an "inaccessible" state, moving a SQL DB into that state instead of dropping it in the event of a key revocation [in progress]
• Azure communications tooling infrastructure resiliency improvements have been deployed to production, helping ensure an expedited failover in the event internal communications tooling infrastructure experiences a service interruption [complete]
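The second and third SQL DB items above amount to: separate the two failure causes, and replace "drop" with a recoverable inaccessible state. A sketch of that remediated flow (the state names and exception types are assumptions for illustration):

```python
# Sketch of the remediated behaviour: an unreachable vault means "do not
# drop, retry later"; a revoked key moves the DB into a recoverable
# inaccessible state rather than deleting it.

class VaultUnreachable(Exception): pass
class KeyRevoked(Exception): pass

def fetch_key(vault_status: str) -> bytes:
    if vault_status == "unreachable":
        raise VaultUnreachable
    if vault_status == "revoked":
        raise KeyRevoked
    return b"key-material"

def handle_key_check(vault_status: str) -> str:
    try:
        fetch_key(vault_status)
        return "online"
    except VaultUnreachable:
        return "retry"         # transient outage: never drop, try again
    except KeyRevoked:
        return "inaccessible"  # recoverable state instead of an outright drop

print(handle_key_check("unreachable"))  # retry
print(handle_key_check("revoked"))      # inaccessible
```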