Facebook automates PCIe fault tracking across the data centre
Facebook’s data centres contain millions of PCIe-based hardware components, including ASIC-based accelerators for video and inference, GPUs, NICs, and SSDs. These are connected either directly to a PCIe slot on a server’s motherboard or through a PCIe switch such as a carrier card.
The sheer variety of PCIe hardware components makes studying PCIe issues a daunting task. These components can have different vendors, firmware versions, and different applications running on them. On top of this, the applications themselves might have different compute and storage needs, usage profiles, and tolerances.
A number of open source and in-house tools are used to address these challenges, determine the root cause of PCIe hardware failures and performance degradation, and automate repairs.
An open source, Python-based command line interface tool called PCIcrawler is used to display, filter, and export information about PCI or PCIe buses and devices, including PCI topology and PCIe Advanced Error Reporting (AER) errors. This tool produces visually appealing, treelike output for easy debugging as well as machine-parsable JSON output that can be consumed by other tools for deployment at scale.
An in-house tool called MachineChecker quickly evaluates the production worthiness of servers from a hardware standpoint, helping to detect and diagnose hardware problems. It can be run as a command line tool, and it is also available as a library and a service.
Another in-house tool takes a snapshot of the target host’s hardware configuration along with hardware modelling. Separately, an in-house utility service detects PCIe errors on millions of servers. This service parses the logs on each server at regular intervals and records the rate of correctable errors in a file on that server. The rate is recorded per 10 minutes, per 30 minutes, per hour, per six hours, and per day, and is used to decide which servers have exceeded the tolerable PCIe-corrected error rate threshold configured for the platform and the service.
An open source utility is used for managing and configuring devices that support the Intelligent Platform Management Interface (IPMI). IPMI is an open standard for monitoring, logging, recovery, inventory, and control of hardware that is implemented independently of the main CPU, BIOS, and OS. The utility is mainly used to manually extract System Event Logs (SELs) for inspection, debugging, and study.
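The windowed rate check described above can be sketched in a few lines of Python. This is only an illustration, not Facebook's service: the function names, the window labels, and the threshold values are all hypothetical, and the sketch assumes the error timestamps have already been parsed out of the server logs.

```python
from bisect import bisect_left, bisect_right

# Window lengths in seconds, matching the intervals named in the article.
WINDOWS = {
    "10min": 600,
    "30min": 1800,
    "1h": 3600,
    "6h": 21600,
    "1d": 86400,
}


def corrected_error_counts(timestamps, now):
    """Count corrected errors in each trailing window ending at `now`.

    `timestamps` is a sorted list of epoch seconds, one per logged error.
    """
    counts = {}
    for name, length in WINDOWS.items():
        first = bisect_left(timestamps, now - length)  # first event in window
        last = bisect_right(timestamps, now)           # one past the last event
        counts[name] = last - first
    return counts


def exceeded_windows(counts, thresholds):
    """Return the windows whose count is above its threshold.

    `thresholds` is a hypothetical per-platform, per-service mapping;
    windows without an entry are never flagged.
    """
    return [w for w, n in counts.items() if n > thresholds.get(w, float("inf"))]
```

A server would be flagged when `exceeded_windows` returns a non-empty list for the thresholds configured for its platform and service.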
The Facebook auto remediation (FBAR) tool is a set of software daemons that execute code automatically in response to detected software and hardware signals on individual servers. Every day, without human intervention, FBAR takes faulty servers out of production and sends requests to the data centre teams to perform physical hardware repairs.
All the data is stored in Scuba, a fast, scalable, distributed, in-memory database built at Facebook. It is the data management system Facebook uses for most of its real-time analysis.
Some of the issues were obvious. PCIe fatal uncorrected errors, for example, are definitely bad, even if there is only one instance on a particular server. MachineChecker can detect this and mark the faulty hardware (ultimately leading to it being replaced).
Depending on the error conditions, uncorrectable errors are further classified into nonfatal errors and fatal errors. Nonfatal errors are ones that cause a particular transaction to be unreliable, but the PCIe link itself is fully functional. Fatal errors, on the other hand, cause the link to be unreliable.
The engineers found that for any uncorrected PCIe error, swapping the hardware component (and sometimes the motherboard) is the most effective action.
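The error classes above, and the action the engineers settled on for uncorrected errors, can be written down as a small lookup. This is a minimal sketch: the enum, the function names, and the action strings are hypothetical labels chosen for illustration, not names from MachineChecker or FBAR.

```python
from enum import Enum


class Severity(Enum):
    CORRECTABLE = "correctable"
    NONFATAL = "uncorrectable_nonfatal"
    FATAL = "uncorrectable_fatal"


def is_uncorrectable(severity):
    """Nonfatal and fatal errors are both uncorrectable."""
    return severity in (Severity.NONFATAL, Severity.FATAL)


def link_is_reliable(severity):
    """Only fatal errors leave the PCIe link itself unreliable."""
    return severity is not Severity.FATAL


def suggested_action(severity):
    """Map a severity to an action label (labels are illustrative).

    Any uncorrected error leads to a component swap; corrected errors
    are judged by their rate, so they are only monitored here.
    """
    if is_uncorrectable(severity):
        return "swap_component"
    return "monitor_rate"
```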
Other issues can seem innocuous at first. PCIe-corrected errors, for example, are correctable by definition and are mostly corrected well in practice. Correctable errors are supposed to have no impact on the functionality of the interface. However, the rate at which correctable errors occur matters. If the rate is beyond a particular threshold, it leads to a degradation in performance that is not acceptable for certain applications.
An in-depth study related performance degradation and system stalls to PCIe-corrected error rates. Determining the threshold is another challenge, since different platforms and different applications have different profiles and needs.
The PCIe Error Logging Service observed the failures in the Scuba tables and correlated events, system stalls, and PCIe faults to determine the thresholds for each platform. Swapping hardware is the most effective solution when PCIe-corrected error rates cross a particular threshold.
PCIe defines two error-reporting paradigms: the baseline capability and the AER capability. The baseline capability is required of all PCIe components and provides a minimum defined set of error reporting requirements. The AER capability is implemented with a PCIe AER extended capability structure and provides more robust error reporting. The PCIe AER driver provides the infrastructure to support the PCIe AER capability, and the team used the PCIcrawler tool to take advantage of this.
As a result, the team recommends that every vendor adopt the PCIe AER functionality and PCIcrawler rather than relying on custom vendor tools, which lack generality. Output from custom tools is hard to parse, and the tools are even harder to maintain. Moreover, integrating new vendors, new kernel versions, or new types of hardware requires a lot of time and effort.
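To give a feel for what AER data looks like to a script, the sketch below reads the per-device correctable error counters that recent Linux kernels expose in sysfs. It assumes the `aer_dev_correctable` attribute is present under `/sys/bus/pci/devices/<address>/` and contains one `NAME COUNT` pair per line. It does not show how PCIcrawler itself collects AER data, and the function names are hypothetical.

```python
from pathlib import Path


def parse_aer_stats(text):
    """Parse 'NAME COUNT' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly a name followed by a count.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_aer_correctable(address, root="/sys/bus/pci/devices"):
    """Read correctable AER counters for one device.

    `address` is a PCI address such as '0000:3b:00.0'. Returns an empty
    dict if the device or the attribute does not exist (assumption: the
    kernel exposes AER statistics for this device).
    """
    path = Path(root) / address / "aer_dev_correctable"
    try:
        return parse_aer_stats(path.read_text())
    except OSError:
        return {}
```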
Bad (down-negotiated) link speed (usually running at half or a quarter of the expected speed) and bad (down-negotiated) link width (running at half, quarter or even an eighth of the expected width) were other concerning PCIe faults. These faults can be difficult to detect without some sort of automated tool because the hardware is working, just not as well as it could. Most of these faults could be corrected by reseating hardware components.
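A down-negotiated link is easy to check once the negotiated and expected values are in hand. The sketch below is an illustration under stated assumptions: the `LinkState` fields and fault labels are hypothetical, and the expected speed and width are taken to be the maximum the device reports (on Linux these values could come from sysfs attributes such as `current_link_speed` and `max_link_width`, or from PCIcrawler output).

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    cur_speed_gts: float  # negotiated speed in GT/s
    max_speed_gts: float  # expected speed in GT/s
    cur_width: int        # negotiated lane count
    max_width: int        # expected lane count


def link_faults(link):
    """Return a list of fault labels for a down-negotiated link."""
    faults = []
    if link.cur_speed_gts < link.max_speed_gts:
        faults.append("bad_link_speed")
    if link.cur_width < link.max_width:
        faults.append("bad_link_width")
    return faults
```

For example, a device expected to run at 16 GT/s on 16 lanes but negotiated at 8 GT/s on 4 lanes would report both faults, while a healthy link returns an empty list.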
The team also has special rules to identify repeat offenders. For example, if the same hardware component on the same server fails a predefined number of times in a predetermined time interval, after a predefined number of reseats, it is automatically marked as faulty and swapped out. In cases where the component swap does not fix the problem, a motherboard swap is necessary.
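The repeat-offender rule can be sketched as follows. The article does not give the actual counts or intervals, so the defaults below (three failures in seven days after two reseats) are placeholders, and the function names and action labels are hypothetical.

```python
def is_repeat_offender(failure_times, reseat_count, now,
                       max_failures=3, window_s=7 * 86400, min_reseats=2):
    """Return True if a component should be marked faulty and swapped.

    `failure_times` holds epoch seconds of failures for one component on
    one server. All default limits are placeholder values.
    """
    recent = [t for t in failure_times if now - window_s <= t <= now]
    return reseat_count >= min_reseats and len(recent) >= max_failures


def next_repair(failure_times, reseat_count, component_swapped, now):
    """Pick the next repair step after a new failure (labels illustrative)."""
    if component_swapped:
        # The component swap did not fix the problem.
        return "swap_motherboard"
    if is_repeat_offender(failure_times, reseat_count, now):
        return "swap_component"
    return "reseat"
```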
The process also monitors repair trends to identify atypical failure rates. For example, data from custom Scuba tables helped identify a problem in a specific firmware release from a specific vendor. The team then worked with the vendor to roll out new firmware that fixed the issue.
Using this overall methodology, the team added hardware health coverage and fixed several thousand servers and server components. Every week, they detect, diagnose, remediate, and repair various PCIe faults on hundreds of servers.