Facebook's data centres contain millions of PCIe-based hardware components, including ASIC-based accelerators for video and inference, GPUs, NICs, and SSDs. These components are connected either directly into a PCIe slot on a server’s motherboard or through a PCIe switch, such as a carrier card.
The sheer variety of PCIe hardware components makes studying PCIe issues a daunting task. These components can come from different vendors, run different firmware versions, and host different applications. On top of this, the applications themselves might have different compute and storage needs, usage profiles, and tolerances.
A number of open source and in-house tools are used to address these challenges, determine the root cause of PCIe hardware failures and performance degradation, and automate repairs.
An open source, Python-based command-line tool called PCIcrawler is used to display, filter, and export information about PCI or PCIe buses and devices, including PCI topology and PCIe Advanced Error Reporting (AER) errors. This tool produces tree-like output that is easy to read when debugging, as well as machine-parsable JSON output that other tools can consume for deployment at scale.
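As an illustration of how a downstream tool might consume such JSON, here is a minimal Python sketch. The schema is an assumption made for this example, not PCIcrawler's documented output: a top-level object keyed by PCI address, where each device may carry a hypothetical `aer` object mapping an error class to named counters. The function name `devices_with_aer_errors` and the sample data are likewise hypothetical.

```python
import json


def devices_with_aer_errors(raw_json):
    """Return {address: {error_class: total_count}} for devices with AER errors.

    Assumes a hypothetical schema: {address: {"aer": {error_class: {name: count}}}}.
    """
    devices = json.loads(raw_json)
    flagged = {}
    for addr, info in devices.items():
        totals = {}
        # A device without an "aer" entry is treated as having no errors.
        for error_class, counters in (info.get("aer") or {}).items():
            total = sum(counters.values())
            if total > 0:
                totals[error_class] = total
        if totals:
            flagged[addr] = totals
    return flagged


# Illustrative input only; the addresses and counter names are made up.
SAMPLE = json.dumps({
    "0000:00:01.0": {"aer": {"correctable": {"RxErr": 0}, "fatal": {}}},
    "0000:3b:00.0": {"aer": {"correctable": {"RxErr": 3, "BadTLP": 2},
                             "nonfatal": {"Timeout": 0}}},
    "0000:5e:00.0": {},
})

print(devices_with_aer_errors(SAMPLE))
```

A fleet-level collector could run a filter like this on each host's export and forward only the flagged devices, which keeps the volume of reported data small.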
An in-house tool called MachineChecker quickly evaluates the production worthiness of servers from a hardware standpoint and helps detect and diagnose hardware problems. It can be run as a command-line tool, and it is also available as a library and as a service.
Another in-house tool takes a snapshot of the target host’s hardware configuration along with hardware modelling, while a separate in-house utility service detects PCIe errors on millions of servers. This service parses the logs on each server at regular intervals and records the rate of correctable errors in a file on that server. The rate is recorded per 10 minutes, per 30 minutes, per hour, per six hours, and per day. It is then used to decide which servers have exceeded the tolerable PCIe correctable error rate threshold configured for their platform and service.
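The windowed rate check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the in-house service itself: the input is assumed to be a list of timestamps of correctable errors already parsed from the logs, and the function names, window labels, and threshold values are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# The five windows named in the text; the labels are chosen for this sketch.
WINDOWS = {
    "10min": timedelta(minutes=10),
    "30min": timedelta(minutes=30),
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
}


def error_counts(timestamps, now):
    """Count correctable errors in each trailing window (now - span, now]."""
    ts = sorted(t for t in timestamps if t <= now)
    counts = {}
    for name, span in WINDOWS.items():
        # Index of the first event strictly newer than the window start.
        start = bisect_right(ts, now - span)
        counts[name] = len(ts) - start
    return counts


def exceeded_windows(counts, thresholds):
    """Return the windows whose count is above the configured threshold."""
    return [name for name in WINDOWS
            if name in thresholds and counts[name] > thresholds[name]]


now = datetime(2021, 1, 1, 12, 0, 0)
events = [now - timedelta(minutes=m) for m in (5, 20, 45, 180, 720, 2880)]
counts = error_counts(events, now)
# Hypothetical thresholds; in practice they would vary by platform and service.
print(counts, exceeded_windows(counts, {"10min": 0, "1h": 5}))
```

Keeping the thresholds in a separate mapping reflects the point made in the text that the tolerable rate depends on the platform and the service, so the same counts can be evaluated against different limits.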
An open source utility is used for managing and configuring devices that support the Intelligent Platform Management Interface (IPMI). IPMI is an open standard for monitoring, logging, recovery, inventory, and control of hardware that is implemented independently of the main CPU, BIOS, and OS. The utility is mainly used to manually extract System Event Logs (SELs) for inspection, debugging, and study.
The Facebook auto remediation (FBAR) tool is a set of software daemons that execute code automatically in response to detected software and hardware signals on individual servers. Every day, without human intervention, FBAR takes faulty servers out of production and sends requests to the data centre teams to perform physical hardware repairs.
All the data is stored in Scuba, a fast, scalable, distributed, in-memory database built at Facebook. It is the data management system we use for most of the real-time analysis.