Facebook's Offline — It's Not Just You
When Facebook goes down, you know immediately — not because you were using Facebook, but because everyone starts talking about it. The October 2021 outage that took Facebook, Instagram, WhatsApp, and Messenger offline simultaneously was one of the most visible infrastructure failures in internet history. Here is what happened, why it happened, and what it teaches about building systems at scale.
What Happened
The outage began when a configuration change to Facebook’s backbone routers caused the withdrawal of its BGP (Border Gateway Protocol) routes. BGP is the protocol by which networks announce to the rest of the internet which address ranges they can reach. When Facebook’s announcements disappeared, the rest of the internet lost the ability to find Facebook’s servers — the machines were still running, but no router knew a path to them.
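The mechanics can be sketched with a toy routing table. This is a deliberate simplification — real BGP speakers exchange UPDATE messages between autonomous systems — but it captures the core point: reachability exists only while a route is announced. (AS32934 is Facebook’s real autonomous system number, and 157.240.0.0/16 is one of its prefixes; the table itself is illustrative.)

```python
# Toy model of BGP route withdrawal. A prefix is reachable only while
# some router still holds an announced route to it.
routes = {
    "157.240.0.0/16": "AS32934",  # a Facebook prefix, announced by Facebook's AS
    "8.8.8.0/24": "AS15169",      # a Google prefix, for contrast
}

def reachable(prefix: str) -> bool:
    """The internet can find a network only if a route to it exists."""
    return prefix in routes

assert reachable("157.240.0.0/16")

# The configuration change effectively withdrew Facebook's announcements:
del routes["157.240.0.0/16"]

# The servers behind the prefix are still up, but nobody can route to them.
assert not reachable("157.240.0.0/16")
```

Note what the model makes obvious: withdrawal is not a crash. Nothing inside Facebook’s data centers had to fail for the whole network to vanish from the internet’s map.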
This alone would have caused an outage, but the cascading effects made it worse. Facebook’s DNS servers became unreachable because they sit within Facebook’s network. Without DNS, even Facebook’s own engineers could not resolve internal hostnames. The tools they would normally use to diagnose and fix the problem — the internal dashboards, the configuration systems, the communication platforms — were all down.
Reports indicated that engineers had to physically travel to data centers to manually reconfigure hardware, because the remote management systems were also affected by the routing failure.
Why It Matters
The technical details are interesting to infrastructure engineers, but the broader lesson is more important: consolidation creates fragility.
Facebook’s family of apps — Facebook, Instagram, WhatsApp, Messenger — shares infrastructure. This is efficient. Shared data centers, shared networking, shared DNS, shared authentication. It reduces costs and simplifies operations. It also means a single infrastructure failure can take down services used by over three billion people simultaneously.
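The trade-off can be made concrete with a toy dependency map. The app names mirror the article; the shared components ("dns", "backbone", "auth") and the map itself are illustrative assumptions, not Facebook’s actual architecture.

```python
# Sketch: when apps share infrastructure, one component failure fans out
# to every app that depends on it. The dependency map is a simplification.
shared_deps = {
    "Facebook":  {"dns", "backbone", "auth"},
    "Instagram": {"dns", "backbone", "auth"},
    "WhatsApp":  {"dns", "backbone", "auth"},
    "Messenger": {"dns", "backbone", "auth"},
}

def blast_radius(failed_component: str) -> set:
    """Every app that depends on the failed component goes down with it."""
    return {app for app, deps in shared_deps.items() if failed_component in deps}

# One shared component failing takes out all four apps at once.
print(sorted(blast_radius("backbone")))
```

With fully independent stacks, each row of the map would name different components and the blast radius of any one failure would be a single app — at several times the operational cost.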
WhatsApp is the primary communication tool in many countries. When it went down, people could not contact family members, businesses could not reach customers, and emergency communications in some regions were disrupted. An infrastructure failure at one company became a communication crisis across continents.
The Cascading Failure Pattern
The most dangerous failures are not the ones where a server crashes — those are caught by redundancy. The most dangerous failures are the ones where the recovery mechanism itself fails.
Facebook’s outage followed this pattern exactly. The BGP withdrawal (initial failure) made DNS unreachable (second failure) which made internal tools unreachable (third failure) which prevented remote remediation (recovery failure). Each layer of failure blocked the fix for the previous layer.
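The chain described above can be modeled as a dependency graph: a component works only if it has not failed and everything it depends on works. The component names here are illustrative stand-ins, not Facebook’s internal systems.

```python
# Sketch of the cascade: each layer depends on the one before it, and the
# remote-remediation step depends on layers the initial failure took out.
deps = {
    "bgp_routes": [],
    "dns": ["bgp_routes"],                      # DNS servers sit inside the network
    "internal_tools": ["dns"],                  # dashboards, config systems
    "remote_remediation": ["internal_tools"],   # the fix path itself
}

def is_up(component: str, failed: set) -> bool:
    """A component works only if it hasn't failed and all its deps work."""
    if component in failed:
        return False
    return all(is_up(d, failed) for d in deps[component])

failed = {"bgp_routes"}  # the initial configuration error
assert not is_up("dns", failed)
assert not is_up("internal_tools", failed)
assert not is_up("remote_remediation", failed)  # the fix itself is blocked
```

The last assertion is the dangerous one: a single entry in `failed` is enough to disable the very mechanism meant to undo it, which is why engineers ended up driving to data centers.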
This is a well-known pattern in distributed systems, but it is extraordinarily difficult to prevent at Facebook’s scale. Testing for this kind of cascading failure requires simulating conditions that would themselves cause an outage — a catch-22 that makes thorough testing impractical.
The Lesson
The internet’s infrastructure is more fragile than it appears. A handful of companies — Facebook, Google, Amazon, Cloudflare, a few others — provide the backbone services that most of the internet depends on. When one of them has a bad day, billions of people are affected.
There is no easy fix. Decentralization would reduce the blast radius but increase complexity and cost. Redundancy helps but cannot prevent every cascading failure. The pragmatic takeaway: design your systems assuming that any external dependency — no matter how reliable — will eventually go down, and have a plan for when it does.
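One common way to act on that takeaway is a circuit breaker: stop calling a dependency that keeps failing and serve a degraded fallback instead of hanging. This is a minimal sketch, not Facebook’s approach; `fetch_remote` and `serve_stale_cache` are hypothetical stand-ins for any external call and its fallback, and production systems would typically reach for a hardened library rather than hand-rolling this.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, skip the
    dependency entirely for a cooldown period and use the fallback."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # set when the circuit opens

    def call(self, fn, fallback):
        # While the circuit is open, don't even try the dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # cooldown elapsed: probe the dependency again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0  # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)

def fetch_remote():
    raise ConnectionError("dependency is down")  # simulate the outage

def serve_stale_cache():
    return "stale-but-usable"

# Degrade gracefully instead of blocking on a dead dependency.
print(breaker.call(fetch_remote, serve_stale_cache))
```

The fallback will not match the real service — stale data, reduced features — but a degraded answer beats an outage that spreads to your own system.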