Simplifying what broke Facebook: BGP, DNS and the unified bucket idea

Those 6 hours were probably the longest for Facebook’s engineers, and they must have felt a lot longer for users, including the small and medium businesses that rely heavily on Facebook, WhatsApp, Messenger and Instagram. Technology outages are not uncommon, but then again, not many platforms have more than 3.5 billion users around the world. WhatsApp, the most popular messaging app in the world, along with Instagram, Messenger and Facebook Workplace, stopped working late Monday night, and the outage could only be fixed early on October 5. Facebook eventually said that a configuration change to its infrastructure was the reason its services had gone offline.

All of Facebook’s services went down together because they share the same infrastructure, in line with the company’s bid to integrate the technology behind its different apps into one. A single view of all users across all apps makes the company even more powerful in the world of targeted online advertising. Yet it also leaves the structure vulnerable to a single large failure, as this outage showed. We are now able to piece together the sequence of events that led to the outage and the corrective measures that followed, while simplifying the complex tech jargon.

Breaking BGP: This map didn’t exist, for a while

It was at about 11:39 am Eastern Time (about 9:10 pm IST on October 4 for us in India) when someone at Facebook made an update to the service’s Border Gateway Protocol (BGP) records. BGP is essentially the map of the networks that make up the internet, and it allows those networks to pick the best route or path to one another. Facebook announces its own BGP routes, as does pretty much every internet service provider. It is the map that the networks carrying traffic from your phone, computer or web browser use to find Facebook’s servers when you punch in facebook.com.

BGP is configured on the routers that allow Facebook’s servers to connect to the internet and speak with each other. The update broke these BGP routes. It simply meant that the Facebook app on your phone, or your web browser when you entered facebook.com, had no idea where to find what you were looking for. Hence the error messages such as “this site can’t be reached”.
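To make the "map" idea concrete, here is a minimal sketch of a routing table in Python. It is a toy model, not how real BGP routers work: the class name, the prefixes (taken from the address ranges reserved for documentation) and the network labels are all assumptions made for illustration. It shows only the core effect described above: once a route is withdrawn, a lookup for an address inside it finds no path at all.

```python
import ipaddress


class RouteTable:
    """Toy routing table: maps announced prefixes to a next-hop label."""

    def __init__(self):
        self.routes = {}  # ip_network -> label of the network announcing it

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        # Removing a prefix that is not present is silently ignored.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


# Hypothetical prefixes and network labels, for illustration only.
table = RouteTable()
table.announce("192.0.2.0/24", "AS64500")
table.announce("198.51.100.0/24", "AS64501")

before = table.lookup("192.0.2.53")   # a path exists
table.withdraw("192.0.2.0/24")        # the faulty update removes the route
after = table.lookup("192.0.2.53")    # the same address now has no path
```

The servers behind the withdrawn prefix are still running in this sketch; nobody outside simply knows how to reach them any more.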

The reasons the BGP routes, or simply put, the map to Facebook’s apps and services, broke down remain anyone’s guess. The company, in its clarification, insists that it was a “configuration change” which caused the problem to begin with. That indicates either a human error while configuring or applying the update, or a software glitch that kept the update from being applied as intended. And since every business at Facebook is unified, which puts Facebook, Instagram, WhatsApp, Messenger and even Facebook Workplace in one large bucket, breaking the single handle pretty much rendered everything inside it unusable.

Facebook’s domain, as it “went up for sale”

First things first, Facebook.com was not officially up for sale. Yet once the BGP routes broke, taking Facebook’s services off the web, its Domain Name System (DNS) records became unreachable. Think of DNS as the phonebook of the internet, while BGP is the map that gets you to the correct contact. While the BGP error was unfolding, configuration issues also caused the DNS records of Facebook and its family of apps to disappear. The path and the destination were both broken. That meant websites which sell domain names, and automatically search for inactive domain names, started listing facebook.com as up for sale. Manual corrections were made to take down those listings, though a lot of Twitter users did get in on the fun with bids for the domain.
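The phonebook-and-map relationship can be sketched as a two-step lookup. This is a simplified simulation that makes no network calls; the zone name, the addresses (documentation ranges) and the "SERVFAIL" and "NXDOMAIN" labels used as return values are assumptions chosen for illustration. The point is that the phonebook entry can still exist while the road to the phonebook itself is gone.

```python
import ipaddress

# Hypothetical data: one zone, one name server, one record.
NAME_SERVERS = {"example.test": "192.0.2.53"}
RECORDS = {"192.0.2.53": {"www.example.test": "192.0.2.80"}}


def is_reachable(address, prefixes):
    """True if any announced prefix covers the address (the BGP 'map')."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in prefixes)


def resolve(hostname, zone, prefixes):
    """Look up a hostname, but only if the zone's name server can be reached."""
    ns_ip = NAME_SERVERS[zone]
    if not is_reachable(ns_ip, prefixes):
        return "SERVFAIL"  # the record exists, but the server holding it is unreachable
    return RECORDS[ns_ip].get(hostname, "NXDOMAIN")


announced = {ipaddress.ip_network("192.0.2.0/24")}
normal = resolve("www.example.test", "example.test", announced)
during_outage = resolve("www.example.test", "example.test", set())
```

With the prefix announced, the name resolves to an address; with no routes announced, the same query fails even though the record was never deleted.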

So centralized that even physical access was restricted

To correct the mess, Facebook engineers rushed to the California data center. Since the servers were unavailable online, including to employees within the company’s own networks, they couldn’t remotely issue another patch to fix the configuration error. Then they had trouble getting into the building. That is because the authentication systems, wherein badges are scanned to access different parts of the physical premises, were also down, along with Facebook Workplace for users. Eventually they were able to get access to the servers. Facebook’s chief technology officer Mike Schroepfer said in an email to employees, after services had been restored, that the issue was “affecting our networking backbone that connects all our data centers together.”
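Why one failure could lock people out of both the software and the building is easier to see as a dependency chain. The sketch below is purely illustrative: the component names and the links between them are assumptions, not Facebook’s actual architecture. It simply marks everything that directly or indirectly relies on a failed component as failed too.

```python
# Hypothetical dependency map: component -> components it relies on.
DEPENDS_ON = {
    "backbone": [],
    "internal_dns": ["backbone"],
    "remote_admin_tools": ["internal_dns"],
    "badge_readers": ["internal_dns"],
    "public_apps": ["internal_dns"],
    "on_site_console": [],  # physical access that needs no network
}


def failed_components(root, depends_on):
    """Return every component that fails when `root` fails, directly or indirectly."""
    failed = {root}
    changed = True
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component not in failed and any(d in failed for d in deps):
                failed.add(component)
                changed = True
    return failed


down = failed_components("backbone", DEPENDS_ON)
still_working = set(DEPENDS_ON) - down
```

In this toy map, only the component with no network dependency survives, which mirrors why engineers ultimately had to reach the servers in person.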

It took almost 6 hours for services to make their presence felt online once again. For systems as massive as this, rebooting after an update takes time. And even then, access can remain inconsistent for a while as server databases are rebuilt. “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt,” explains Santosh Janardhan, vice president of Infrastructure at Facebook. He goes on to say, “We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime.”

Impact on businesses expected to be huge

Facebook doesn’t have any official numbers or data on how many businesses were impacted across Facebook, WhatsApp and Instagram, or on the magnitude of losses for small and medium businesses in that outage window. Or it isn’t saying them out loud just yet. UK-based cybersecurity watchdog NetBlocks suggests that the global economy lost as much as $160 million within the first hour of the outage. Its Cost Of Shutdown Tool (COST) estimates the cost of internet disruptions using specific indicators, including those from the World Bank and the International Telecommunication Union (ITU). With the outage being global, and across its family of apps, the impact would have been felt by the more than 200 million businesses which use Facebook’s tools around the world, according to the company’s latest insights data. A majority of these businesses, if not all, would have been negatively impacted in the hours of the outage. The company also says there are more than 10 million active advertisers across its platforms.

This isn’t Facebook’s first outage

The massive outage that Facebook suffered earlier today remains unmatched in magnitude. Yet it isn’t the first for the tech giant this year. In June, Facebook and its services suffered an outage, but it didn’t last as long. The company has also struggled with sporadic outages across Facebook, WhatsApp and Instagram in May, April, March and February this year, though their disruptive scope was limited in comparison.
