Paper Title: Robotron: Top-down Network Management at Facebook Scale
Authors: Yu-Wei Eric Sung, Xiaozheng Tie, Starsky H.Y. Wong, and Hongyi Zeng
Where published: ACM SIGCOMM 2016 Conference
Note: This post is part of a series that presents interesting academic papers and the latest research results in computer science in a brief format.
This paper presents Robotron, the system that has managed Facebook’s production network since 2008. The authors describe the challenges of managing a large-scale network (tens of thousands of network devices) and how Facebook addresses those challenges with Robotron, which automates network management and the deployment of device configurations.
Why is network management so important? Facebook serves over 1.5 billion active users and therefore cannot tolerate network outages. To illustrate the stakes, a high-profile network incident hit Google Compute Engine on April 11, 2016 and lasted 18 minutes, during which users could not connect to Google Compute Engine instances. Follow this link (https://status.cloud.google.com/incident/compute/16007) for a detailed description of the outage. With Robotron, Facebook’s network engineers adopt agile network management, which allows Facebook to support network changes dynamically and reliably.
Facebook’s production network, as illustrated in Figure 1, consists of a backbone and multiple POPs (Points of Presence) and DCs (data centers) distributed globally. The backbone carries internal traffic between POPs and DCs over optical transport links, using the MPLS and BGP routing protocols. The network also delivers a large volume of content traffic to users quickly and reliably. The architecture in POPs and DCs is a standard fat-tree that rarely changes. In contrast, the backbone network has an asymmetrical architecture that evolves continuously according to capacity needs.
Figure 1: The overview of Facebook's Network.
Network engineers at Facebook perform a wide range of management tasks to keep the network healthy. They constantly monitor the state of the components (e.g., switches and circuits) that make up the Facebook network. Common management tasks include adjusting link capacity, setting up a new POP, upgrading the capacity of an existing cluster, upgrading device operating systems, and adding or deleting backbone routers and circuits to provide service redundancy and improve the overall capacity of the network.
Network management is challenging for network engineers. A key challenge is resolving dependencies between network devices during device configuration updates. For example, adding a new router to the AS (autonomous system) requires changing the configuration of all other routers in the AS. An additional challenge is managing a complex network with multiple domains (the backbone network, POPs, and DCs), devices from multiple vendors, and an architecture that is constantly changing to adapt to the needs of Facebook users.
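The dependency problem above can be made concrete with a small sketch. It assumes the routers in the AS form a full mesh of peering sessions, which is one common reason a new router touches every existing one; the router names and both function names are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def full_mesh_sessions(routers):
    """Return every unordered router pair in a full mesh (assumed topology)."""
    return {frozenset(pair) for pair in combinations(routers, 2)}


def devices_to_reconfigure(existing, new_router):
    """List the devices whose configuration changes when new_router joins."""
    before = full_mesh_sessions(existing)
    after = full_mesh_sessions(existing + [new_router])
    touched = set()
    # Each newly required session touches both of its endpoints.
    for session in after - before:
        touched |= session
    return sorted(touched)


if __name__ == "__main__":
    # Hypothetical router names, for illustration only.
    print(devices_to_reconfigure(["r1", "r2", "r3"], "r4"))
```

Under this assumption, a single addition produces one new session per existing router, so the number of devices to update grows with the size of the AS. That is the kind of bookkeeping that is error-prone when done by hand.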
With Robotron, Facebook moved from heavily relying on manual configurations for network management to an automated system with little human intervention, continuous monitoring of the network state, and the possibility to support new device configurations and network topologies. The outcome is a deterministic and reproducible approach to network management.
Figure 2: Overview of Robotron.
Robotron manages Facebook’s massive production network following a top-down approach, as shown in Figure 2. During network design, engineers model the network. For instance, to build a POP cluster, they define the components of the network topology: the network devices (e.g., the number of racks per cluster) and the links that connect them (e.g., the number of uplinks per rack switch). Within minutes, Robotron translates the model into tens of thousands of FBNet objects that carry device-level, network-level, and topology attributes and follow an object-oriented paradigm. FBNet is vendor-agnostic and provides scalable APIs that allow engineers to read and modify device configurations from geographically distributed data centers. FBNet is implemented in MySQL. Robotron logs all design changes for peer review among engineers, debugging, and error tracking.
After the design stage, Robotron automatically generates vendor-specific device configurations and then deploys them across Facebook’s network worldwide. Robotron monitors the state of each network component in real time to detect anomalies such as hardware failures and device misconfigurations. Network performance metrics such as link, CPU, and memory utilization are stored in multiple back-ends that use FBNet, HBase, and Hive.
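The top-down flow described above (high-level design, then vendor-agnostic objects, then vendor-specific configuration) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Uplink` class, the device naming scheme, the vendor names, and both template syntaxes are invented for this example and do not reflect Robotron's actual data model or any real vendor's configuration language.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Uplink:
    """Hypothetical vendor-agnostic object: one uplink on a rack switch."""
    device: str
    name: str
    peer_device: str


def build_cluster(cluster, racks, uplinks_per_rack):
    """Expand a high-level design (racks, uplinks) into per-link objects."""
    objects = []
    for r in range(1, racks + 1):
        rack_switch = f"{cluster}-rsw{r:03d}"
        for u in range(1, uplinks_per_rack + 1):
            cluster_switch = f"{cluster}-csw{u:02d}"
            objects.append(Uplink(rack_switch, f"uplink{u}", cluster_switch))
    return objects


# Made-up configuration syntaxes standing in for two different vendors.
TEMPLATES = {
    "vendor_a": Template("interface $name\n description to $peer\n"),
    "vendor_b": Template("port $name { peer = $peer; }\n"),
}


def render_config(device, vendor, objects):
    """Render the vendor-specific configuration for a single device."""
    template = TEMPLATES[vendor]
    return "".join(
        template.substitute(name=o.name, peer=o.peer_device)
        for o in objects
        if o.device == device
    )


if __name__ == "__main__":
    design = build_cluster("pop1", racks=4, uplinks_per_rack=2)
    print(render_config("pop1-rsw001", "vendor_a", design))
    print(render_config("pop1-rsw001", "vendor_b", design))
```

The point of the sketch is the separation of concerns: the design is expressed once as objects, and supporting another vendor only means adding another template rather than redoing the design.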
Areas of improvement for Robotron include how to allow concurrent design changes to the network at Facebook’s scale, and how to avoid the manual configuration of devices that network engineers still do when they have to make urgent changes not supported by Robotron.
Founder & Director, LeLaboDigital