The outage at Amazon Web Services in Northern Virginia was traced to a software bug in an automated DNS management system that led one automated component to delete the work of another.
The cloud provider published a comprehensive post-incident report late on Friday Australian time, shedding light on a disruption described as the biggest to hit internet infrastructure in more than a year.
The post-incident report notes that “there were three distinct periods of impact to customer applications,” although the initial issues with DynamoDB are likely to attract the most interest.
The official cause is attributed to “a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that…the automation was unable to recover.”
The race condition, a type of software bug, involved “an unlikely interaction” between two automated components of the same type in the DynamoDB DNS management architecture.
AWS said there are two different components in the architecture: a “DNS Planner, [which] …periodically creates a new DNS plan for each of the service’s endpoints,” and DNS Enactors, which “pick up the latest plan” and systematically apply it to the endpoints.
“This process is typically completed quickly and effectively keeps DNS status up to date,” AWS said.
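To make the planner/enactor split concrete, here is a minimal sketch of how such a system might be modelled. The class and function names, the generation counter, and the data structures are illustrative assumptions based on AWS’s description, not AWS’s actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    """One versioned DNS plan, as the 'DNS Planner' might emit it."""
    generation: int                 # increases each time the planner runs
    records: dict                   # endpoint hostname -> list of IP addresses


def plan_next(previous: DnsPlan, records: dict) -> DnsPlan:
    """DNS Planner role: periodically emit a new, higher-generation plan."""
    return DnsPlan(previous.generation + 1, records)


def enact(dns_state: dict, plan: DnsPlan) -> None:
    """DNS Enactor role: apply the latest plan's records to the endpoints."""
    dns_state.update(plan.records)
```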
AWS said that multiple DNS Enactors can run at the same time and sometimes contend with one another, normally without any problems.
But in this case, one DNS Enactor experienced “unusually large delays, forcing it to retry the update on several DNS endpoints,” while another Enactor picked up a newer plan and applied it “quickly” to endpoints.
“The timing of these events caused the latent race condition,” AWS said.
“When the second Enactor (applying the latest plan) completed the endpoint updates, it then invoked a clean-up process, which identifies plans that are significantly older than the one just applied and removes them,” AWS said.
“At the same time this cleanup process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DynamoDB endpoint and overwrote the newer plan.
“The cleanup process of the second Enactor then removed this older plan because it predated the plan it had just applied by many generations.
“When this plan was removed, all IP addresses for the regional endpoint were immediately removed.”
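The sequence can be reproduced in miniature. The sketch below is a self-contained, deliberately simplified simulation of the race as AWS describes it: a delayed Enactor overwrites a newer plan with a stale one, and the other Enactor’s clean-up then purges the stale plan that is now live, leaving the endpoint with no records. The plan generations, IP addresses, and function names are all illustrative assumptions.

```python
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

plans = {
    1: {ENDPOINT: ["10.0.0.1"]},   # old plan held by the delayed Enactor
    9: {ENDPOINT: ["10.0.0.9"]},   # much newer plan held by the fast Enactor
}
installed = {}   # generation -> records currently installed at the endpoint
dns = {}         # what resolvers would see for the endpoint


def apply_plan(generation: int) -> None:
    """Enactor step: replace whatever plan is installed with this one."""
    installed.clear()
    installed[generation] = plans[generation]
    dns.update(plans[generation])


def purge_older_than(applied_generation: int, keep: int = 3) -> None:
    """Clean-up step: delete plans many generations older than the one just applied."""
    for generation in list(installed):
        if generation <= applied_generation - keep:
            del installed[generation]
            for host in plans[generation]:
                dns.pop(host, None)   # removing a plan removes its DNS records


apply_plan(9)              # fast Enactor applies the newest plan
apply_plan(1)              # delayed Enactor then overwrites it with its stale plan
purge_older_than(9)        # fast Enactor's clean-up deletes plan 1 -- now the live plan
print(dns.get(ENDPOINT))   # -> None: the regional endpoint has no IP addresses left
```

Run in order, the final lookup returns nothing, which mirrors the empty DNS record for the regional endpoint that the automation could not recover from on its own.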
AWS said that ultimately “manual operator intervention” was required to mitigate the incident.
As an immediate step, AWS said it has disabled both “DNS Planner and DNS Enactor automation globally.”
“Before we re-enable this automation, we will resolve the race condition scenario and add additional protections to prevent the application of incorrect DNS plans,” the cloud provider said.
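AWS has not detailed what those protections will look like. One plausible safeguard, sketched below purely as an assumption, is for an Enactor to refuse to apply any plan that is not strictly newer than the one already in effect, which would stop a delayed Enactor from overwriting a newer plan with a stale one.

```python
class StalePlanError(Exception):
    """Raised when an Enactor tries to apply an out-of-date plan."""


def enact_guarded(current_generation: int, new_generation: int) -> int:
    """Apply a plan only if it is strictly newer than the one already active."""
    if new_generation <= current_generation:
        raise StalePlanError(
            f"refusing to apply generation {new_generation}; "
            f"generation {current_generation} is already active"
        )
    return new_generation
```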
The issues with DynamoDB in US-EAST-1 led to disruptions to other AWS cloud services that depend on it.
Issues with EC2 instances were caused by a subsystem that relies on DynamoDB being unable to reach the service; this disruption had flow-on effects of its own.
Other AWS services that rely on DynamoDB also experienced issues during the incident.


