
Amazon Web Services (AWS) has announced that most services are returning to normal following a significant outage on Monday that caused a global internet disruption. While the core issue has been mitigated, the company stated that some services, such as AWS Config and Redshift, continue to process a backlog of messages, and full restoration may take a few more hours.
The event knocked thousands of websites and applications offline, affecting everything from social media and gaming platforms to financial services and enterprise tools, underscoring the critical role of AWS in the internet's backbone.
In an update cited by Reuters, AWS said its cloud services had returned to normal operations on Monday afternoon. “By 3:01 PM, all AWS services returned to normal operations,” the announcement said.
AWS identified the root cause as a Domain Name System (DNS) resolution problem within US-EAST-1, its heavily used region in Northern Virginia.
Specifically, the issue prevented applications from resolving the endpoint for the DynamoDB API, a fundamental database service used by countless clients. The initial problem originated from a subsystem responsible for monitoring the health of network load balancers within the EC2 internal network.
“After resolving the DynamoDB DNS issue at 2:24 AM, services began recovering but we had a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB,” the update said.
“Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch,” it added.
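To illustrate the failure mode described above, here is a minimal Python sketch of a check for whether a service endpoint's hostname resolves at all. It is an illustration only, not AWS tooling: the function name `can_resolve` and its injectable `resolver` parameter are assumptions made for this example, and only the standard library's `socket.getaddrinfo` is used.

```python
import socket


def can_resolve(hostname, port=443, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` is a hypothetical hook added for this sketch so the
    lookup can be replaced in tests; it defaults to the standard
    library's socket.getaddrinfo.
    """
    try:
        # getaddrinfo returns a list of address tuples on success.
        return len(resolver(hostname, port)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved, which is what a
        # client sees when DNS for an endpoint is broken.
        return False
```

A client could call, for example, `can_resolve("dynamodb.us-east-1.amazonaws.com")` before sending requests. During an event like this one, such a lookup would be expected to fail even if the service behind the name were otherwise healthy.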
The DNS failure had a cascading effect, leading to the widespread service downtime experienced globally. It is at least the third major outage in five years linked to the US-EAST-1 region.
The incident serves as a potent reminder of the internet's reliance on a few key cloud infrastructure providers. The failure in a single region impacted major platforms like Reddit, Snapchat, Zoom, Ring cameras, and Coinbase, causing significant operational and financial disruption.
Experts have pointed out that while cloud providers like AWS offer tools for building resilient, multi-region architectures, the cost and complexity can lead some organizations to cut corners.
The outage reinforces the need for businesses to implement robust fault tolerance and disaster recovery plans to protect against single points of failure in critical infrastructure.
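As a rough sketch of what avoiding a single regional point of failure can look like in application code, the Python example below tries an operation against an ordered list of regions and falls back to the next one when a network-level error occurs. Everything here is hypothetical and simplified: `call_with_failover`, `AllRegionsFailed`, and the `operation` callback are names invented for this illustration, and a real multi-region design also has to handle data replication, consistency, and cost, which this sketch ignores.

```python
from typing import Callable, Sequence


class AllRegionsFailed(Exception):
    """Raised when the operation failed in every region tried."""

    def __init__(self, errors):
        super().__init__(f"all regions failed: {list(errors)}")
        self.errors = errors  # maps region name -> exception seen there


def call_with_failover(regions: Sequence[str],
                       operation: Callable[[str], object]):
    """Run `operation(region)` for each region in order.

    Returns (region, result) for the first region that succeeds.
    Only OSError and its subclasses (ConnectionError, TimeoutError,
    socket.gaierror) trigger a fallback; other exceptions propagate.
    """
    errors = {}
    for region in regions:
        try:
            return region, operation(region)
        except OSError as exc:
            errors[region] = exc  # remember the failure, try the next region
    raise AllRegionsFailed(errors)
```

A caller would pass a function that performs the real request for a given region, for example `call_with_failover(["us-east-1", "us-west-2"], fetch_profile)`, where `fetch_profile` is an assumed application function. Limiting the fallback to network-level errors is a deliberate choice in this sketch, so that application bugs are not masked by silently switching regions.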
Last year, a CrowdStrike outage affected users globally, with a test software bug and EU agreements blamed.