On October 20, a large-scale service disruption hit AWS's US-EAST-1 (Northern Virginia) region. AWS has now officially published the results of its incident investigation. As previously determined, the root cause did lie in its core database service, DynamoDB; more specifically, a design flaw in the automation module that manages DynamoDB's DNS records triggered a catastrophic chain reaction.
The incident had a wide impact, affecting 142 AWS services and thousands of customers; full recovery took about 15 hours.
Automation conflict: DNS Enactor cleanup process mistakenly deletes critical IP addresses
According to AWS, DynamoDB's DNS management is handled by two automated modules: the DNS Planner, which generates new DNS plans, and the DNS Enactor, which deploys those plans to Amazon Route 53. To improve availability, AWS runs three independent DNS Enactors in three different Availability Zones (AZs).
Under normal circumstances, an Enactor confirms the plan version before deployment and updates the endpoints one by one; if it encounters a conflict it retries, and after completion it cleans up expired plans.
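The normal flow can be sketched roughly as follows. This is a minimal, hypothetical model: the function and variable names are illustrative and do not correspond to AWS's actual implementation.

```python
# Hypothetical sketch of the happy path: a version check guards each
# endpoint update, and cleanup only runs after a successful deployment.

def deploy(plan_version, endpoints, applied, plans):
    """Apply plan_version to each endpoint unless a newer plan already won."""
    for ep in endpoints:
        if applied.get(ep, 0) >= plan_version:
            continue               # conflict: a newer plan landed first, skip
        applied[ep] = plan_version  # "update" the endpoint's DNS record
    # after completion, remove plans this deployment made obsolete
    for v in list(plans):
        if v < plan_version:
            del plans[v]

plans = {1: "old-ips", 2: "new-ips"}   # plan version -> record contents
applied = {}                           # endpoint -> live plan version
deploy(2, ["dynamodb.us-east-1"], applied, plans)
print(applied)   # {'dynamodb.us-east-1': 2}
print(plans)     # {2: 'new-ips'}
```

In this happy path, the version check and the cleanup never race: cleanup only deletes plans older than one that is verifiably live.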
However, the incident was triggered by the following sequence of events:
• Enactor A began deploying a plan but hit significant delays while updating multiple DNS endpoints, making slow progress and retrying constantly. Meanwhile, the DNS Planner continued to release newer versions of the plan.
• Enactor B, operating independently, obtained the latest plan and quickly completed all endpoint updates. After completing its task, Enactor B immediately initiated the cleanup process.
• Key conflict point: the delayed Enactor A then applied its outdated plan to the US-EAST-1 regional service endpoint that Enactor B had just updated.
• Enactor B’s cleanup process then judged the “obsolete plan” Enactor A had just deployed to be invalid and deleted it.
As a result, all IP addresses for the US-EAST-1 regional service endpoint were removed, leaving the DNS record empty and unresolvable, and the automation could no longer apply any newer plan to repair it.
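The race above can be reproduced in a toy model. This is a hypothetical sketch, assuming a per-endpoint plan store and a cleanup step that deletes older plan versions; none of the class or method names correspond to AWS's real code.

```python
# Toy model of the Enactor race: the version check happened *before* the
# slow deployment, so a stale Enactor can still overwrite a newer plan.

class Endpoint:
    def __init__(self):
        self.plans = {}           # plan version -> list of IPs
        self.applied_version = 0  # version currently live in DNS
        self.records = []         # IPs the endpoint resolves to

    def apply_plan(self, version):
        # No re-check here: the stale Enactor validated its plan long ago.
        self.applied_version = version
        self.records = self.plans[version]

    def cleanup(self, latest_version):
        # Delete plans older than the newest one this Enactor applied.
        for v in list(self.plans):
            if v < latest_version:
                del self.plans[v]
                if self.applied_version == v:
                    # The "obsolete" plan was actually live: deleting it
                    # empties the DNS record entirely.
                    self.records = []

ep = Endpoint()
ep.plans = {1: ["10.0.0.1"], 2: ["10.0.0.2"]}

ep.apply_plan(2)    # Enactor B finishes quickly with the new plan
ep.apply_plan(1)    # delayed Enactor A overwrites it with the old plan
ep.cleanup(2)       # B's cleanup deletes "obsolete" plan 1

print(ep.records)   # → [] : the endpoint no longer resolves
```

The defect the model exposes is the gap between validation and application: each step is individually reasonable, but their interleaving leaves the record empty.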
Chain reaction: EC2 instance startup is blocked, NLB and Lambda are paralyzed
AWS pointed out that although the core DNS problem of DynamoDB was resolved in about 3 hours, the chain reaction it caused lasted for more than ten hours.
The primary reason was that many core services depend heavily on DynamoDB. One of them, the DropletWorkflow Manager (DWFM), manages the state of the underlying hosts for EC2 instances. During the DynamoDB outage a large number of DWFM leases expired, and once DNS was restored, DWFM attempted to re-establish hundreds of thousands of leases simultaneously. The overwhelming volume of requests congested the system until it failed.
The failure of DWFM directly prevented new EC2 instances from launching properly and caused network configuration delays. This further impacted downstream services such as the EC2-dependent Network Load Balancer (NLB) and the serverless computing service AWS Lambda, significantly extending the overall downtime.
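The "everything retries at once" pattern behind the DWFM congestion is a classic thundering herd. A rough sketch of the load arithmetic, with made-up capacity and window numbers, shows why spreading renewals over a jittered window matters:

```python
# Illustrative load arithmetic (not AWS's code): re-establishing every
# expired lease in the same instant vastly exceeds backend capacity,
# while random jitter over a recovery window keeps the peak manageable.
import random

random.seed(42)              # deterministic for the example
CAPACITY = 1_000             # renewals the backend can absorb per second
N_LEASES = 300_000           # "hundreds of thousands" of expired leases

def peak_load_all_at_once():
    # naive recovery: every lease retries in the same second
    return N_LEASES

def peak_load_with_jitter(window_seconds=600):
    # each lease picks a random second in a 10-minute recovery window
    load = [0] * window_seconds
    for _ in range(N_LEASES):
        load[random.randrange(window_seconds)] += 1
    return max(load)

print(peak_load_all_at_once() > CAPACITY)    # True: 300x over capacity
print(peak_load_with_jitter() < CAPACITY)    # True: peak stays near 500/s
```

The usual mitigations follow directly from this arithmetic: jittered retries, rate limiting on the renewal path, or admitting work in batches as the backend recovers.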
AWS emergency response: Global suspension of DynamoDB DNS automation module
This incident once again highlights that while automation in large cloud architectures improves efficiency, complex dependencies and latent race conditions can turn it into a source of disaster. Redundancy built from multiple independent automated actors, designed to improve reliability, instead produced unexpected conflicts under extreme conditions.
As a response, AWS announced the temporary suspension of the DynamoDB DNS Planner and DynamoDB DNS Enactor automation modules worldwide until relevant security checks, race condition corrections, and more complete control mechanisms are completed.
US-EAST-1's role as a core region amplified the disaster
As AWS's oldest, largest, and most central region, US-EAST-1 hosts many global control planes and management backends, making its stability crucial. DynamoDB is among the most heavily relied-upon NoSQL database services, both within AWS (e.g., Amazon.com, Alexa) and among external customers (e.g., Netflix), so even a brief DNS resolution failure was enough to trigger a chain reaction of this scale. The incident serves as a reminder to the industry of the potential risks of depending on critical infrastructure.