Cloudflare has published a post-mortem on the recent 1.1.1.1 outage, and it's a doozy! You won't believe how a simple change in CNAME record ordering led to a major service disruption. But here's where it gets controversial...
The issue stems from an unclear RFC specification, which Cloudflare has now proposed to clarify. On January 8, a routine update to Cloudflare's resolver software that altered the order of CNAME records in responses set off a chain reaction of failures. Most modern software doesn't care about this order, but some implementations do, and that's where the problem began.
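To make the ordering concrete, here's a minimal sketch of the two answer-section layouts, in Python for readability. The names, addresses, and tuple layout are invented for illustration; real DNS answers also carry TTLs, classes, and wire-format encoding.

```python
# Hypothetical answer sections for a query "www.example.com A".
# Each record is (owner name, type, value); details like TTLs are omitted.

# Conventional ordering: the CNAME chain comes first, ending in the
# terminal A record it resolves to.
answers_cname_first = [
    ("www.example.com",  "CNAME", "edge.example.net"),
    ("edge.example.net", "CNAME", "lb.example.org"),
    ("lb.example.org",   "A",     "203.0.113.7"),
]

# Ordering after the change: the final A record comes first and the
# CNAME records trail at the bottom of the answer section.
answers_cname_last = [
    ("lb.example.org",   "A",     "203.0.113.7"),
    ("www.example.com",  "CNAME", "edge.example.net"),
    ("edge.example.net", "CNAME", "lb.example.org"),
]
```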
Cloudflare's Sebastiaan Neuteboom explains the subtle change to CNAME record ordering, which was introduced on December 2, 2025, and deployed starting January 7, 2026. When a DNS resolver follows a CNAME chain, it caches each step of the chain with its own expiry time. If part of the chain expires, the resolver re-fetches only the expired portion and combines it with the still-valid cached parts. When the ordering of CNAME records changed, responses assembled from these cached pieces no longer looked the way some clients expected, and that led to the outage.
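Here's a rough Python sketch of that per-step caching behavior, assuming a cache keyed by name; `fetch_from_upstream` is a hypothetical stand-in for a real upstream query, and none of this is Cloudflare's actual resolver code.

```python
import time

# A minimal sketch of per-step CNAME caching, as described above.
cache = {}  # name -> (record_type, value, expiry_timestamp)

def fetch_from_upstream(name):
    """Hypothetical stand-in for a query to an upstream nameserver."""
    raise NotImplementedError

def resolve_chain(name):
    chain = []
    while True:
        entry = cache.get(name)
        if entry is None or entry[2] < time.time():
            # Only the missing or expired step is re-fetched; any
            # still-valid steps already in the cache are reused as-is.
            entry = fetch_from_upstream(name)
            cache[name] = entry
        rtype, value, _expiry = entry
        chain.append((name, rtype, value))
        if rtype != "CNAME":
            return chain  # reached the terminal address record
        name = value  # follow the alias to the next step in the chain
```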
Neuteboom clarifies that the previous code created a new list, inserted the CNAME chain, and then appended the newly resolved records. To optimize memory usage, the code was changed to append the CNAMEs to the existing answer list instead. As a result, 1.1.1.1's responses now had CNAME records at the bottom, after the final resolved answer.
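In rough terms, the before-and-after might look something like the following sketch; the function names and shapes are invented to illustrate the described change, not taken from Cloudflare's codebase.

```python
def build_answer_before(cname_chain, resolved_records):
    # Old path: allocate a fresh list, insert the CNAME chain first,
    # then append the newly resolved records.
    answer = []
    answer.extend(cname_chain)
    answer.extend(resolved_records)
    return answer  # CNAMEs at the top, final answer at the bottom

def build_answer_after(answer, cname_chain):
    # Optimized path: reuse the existing answer list (which already
    # holds the resolved records) and append the CNAMEs to it.
    answer.extend(cname_chain)
    return answer  # final answer at the top, CNAMEs at the bottom
```

The optimization saves building a second list per response, but it silently changes an observable property of the output: record order.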
Some DNS client implementations, like systemd-resolved, are indifferent to this order. Others, however, including the getaddrinfo function in glibc, expect CNAME records to appear first. One Reddit commenter questioned how a change with this kind of global impact made it through Cloudflare's testing.
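To see why order sensitivity bites, consider a single forward scan over the answer section, using the two answer lists from the sketch above. This is only an illustration of the failure mode, not glibc's actual getaddrinfo implementation.

```python
def strict_lookup(query_name, answers):
    # One pass over the answer section: follow CNAMEs forward, never
    # revisiting earlier records.
    current = query_name
    for name, rtype, value in answers:
        if name != current:
            continue  # records for other names are skipped, not buffered
        if rtype == "CNAME":
            current = value  # follow the alias and keep scanning forward
        elif rtype == "A":
            return value
    return None

print(strict_lookup("www.example.com", answers_cname_first))  # 203.0.113.7
print(strict_lookup("www.example.com", answers_cname_last))   # None
```

With CNAMEs at the bottom, the A record appears before the alias that points to it, so the one-pass scan never matches it and the lookup comes up empty.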
The debate extends to whether the RFC is truly unclear or whether Cloudflare misinterpreted it. Patrick May invokes Hyrum's Law and Postel's Law: every observable behavior of a system will eventually be depended on by somebody, so implementations should be conservative in what they send and liberal in what they accept. Cloudflare has proposed an RFC to explicitly define the ordering of CNAME records in DNS responses, which will be discussed at the IETF.
The timeline shows that Cloudflare began the global rollout on January 7, reaching 90% of servers by January 8 at 17:40 UTC. The incident was declared soon after, and the change was reverted by 19:55 UTC on the same day.
This incident highlights how a seemingly harmless optimization can have far-reaching unintended consequences. It's a reminder of the importance of thorough testing and of understanding the downstream impact of changes, especially in a service as widely used as Cloudflare's 1.1.1.1. What do you think? Is this a case of an unclear RFC, or could better testing practices have prevented this outage? Let's discuss in the comments!