To get code to run as fast as possible, developers and compilers typically use performance models that run the code through a simulation of given chip architectures. Compilers use that information to automatically optimize code, while developers use it to tackle performance bottlenecks on the microprocessors that will run it. But performance models for machine code are handwritten by a relatively small group of experts and are not properly validated, the researchers argue. As a result, simulated performance measurements often deviate from real-life results.
Last summer, the researchers presented a novel machine-learning pipeline that automates the creation of a performance model. Ithemal is a neural-network model that trains on labelled data in the form of “basic blocks,” or fundamental snippets of computing instructions, to automatically predict how long it takes a given chip to execute previously unseen basic blocks.
Then, at the November IEEE International Symposium on Workload Characterization, the researchers presented a benchmark suite of basic blocks from a variety of domains, including machine learning, compilers, cryptography, and graphics, that can be used to validate performance models. They pooled more than 300,000 of the profiled blocks into an open-source dataset called BHive. During their evaluations, Ithemal predicted how fast Intel chips would run code even better than a performance model built by Intel itself.
Ultimately, given enough data, developers and compilers can use the tool to generate code that runs faster and more efficiently on an ever-growing number of diverse and “black box” chip designs.
“Modern computer processors are opaque, horrendously complicated, and difficult to understand. It is also incredibly challenging to write computer code that executes as fast as possible for these processors,” explains co-author Michael Carbin, an assistant professor in the Department of Electrical Engineering and Computer Science (EECS) and a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL). “This tool is a big step forward toward fully modelling the performance of these chips for improved efficiency.”
In another paper, the MIT researchers proposed a new technique to automatically generate compiler optimizations. Specifically, they automatically generate an algorithm, called Vemal, that converts certain code into vectors, which can be used for parallel computing. Vemal was demonstrated to outperform hand-crafted vectorization algorithms used in LLVM, a compiler widely used in industry.
Designing performance models by hand can be “a black art,” Carbin says. Intel provides extensive documentation of more than 3,000 pages describing its chips’ architectures. But only a small group of experts currently builds performance models that simulate the execution of code on those architectures.
“Intel’s documents are neither error-free nor complete, and Intel will omit certain things, because it’s proprietary,” adds Charith Mendis, first author of a paper on basic block throughput prediction. “However, when you use data, you don’t need to know the documentation. If there’s something hidden you can learn it directly from the data.”
To do so, the researchers clocked the average number of cycles a given microprocessor takes to compute basic block instructions, covering the full sequence of boot-up, execution, and shut-down. They automated this measurement process to rapidly profile hundreds of thousands or millions of blocks.
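The idea of averaging a block's cost over many runs can be sketched in a few lines. This is a minimal illustration, not the researchers' profiler: the function name `average_cost`, the `repeats` parameter, and the injectable `clock` are all assumptions made for the example, and Python's `time.perf_counter_ns` reports wall-clock nanoseconds rather than hardware cycle counts.

```python
import time
from statistics import mean


def average_cost(block, repeats=100, clock=time.perf_counter_ns):
    """Run `block` repeatedly and return its mean cost in clock units.

    Hypothetical helper: a real profiler would read a hardware cycle
    counter; here `clock` defaults to a nanosecond wall-clock timer.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = clock()          # read the clock before the block runs
        block()                  # execute the code under measurement
        samples.append(clock() - start)
    return mean(samples)
```

Calling `average_cost(lambda: (3 * 7 + 5) ^ 2)` would time a short stand-in computation. Passing a different `clock` makes the helper easy to check with a fake timer.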
In training, the Ithemal model analyzes millions of automatically profiled basic blocks to learn exactly how different chip architectures will execute computation. Importantly, Ithemal takes raw text as input and does not require manually adding features to the input data. In testing, Ithemal can be fed previously unseen basic blocks and a given chip, and will generate a single number indicating how fast the chip will execute that code.
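To make “raw text as input” concrete, the sketch below splits a basic block written as assembly text into one token list per instruction, the kind of sequence a neural network could consume without hand-built features. It is an illustration only: the function `tokenize_block`, the Intel-style syntax of the sample, and the tokenization scheme itself are assumptions, not Ithemal's actual preprocessing.

```python
import re

# Identifiers, hex literals, decimal literals, and a few punctuation marks.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|0x[0-9A-Fa-f]+|\d+|[\[\]+\-*]")


def tokenize_block(text):
    """Split assembly text into one token list per instruction.

    Hypothetical scheme: the first token is the opcode; the rest are
    operand pieces (registers, immediates, brackets, operators).
    Commas and anything after ';' are dropped.
    """
    instructions = []
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # strip comments and padding
        if not line:
            continue                          # skip blank lines
        instructions.append(TOKEN.findall(line))
    return instructions


# A made-up three-instruction basic block.
block = """
mov rax, [rbx+8]
add rax, rcx   ; accumulate
imul rax, 0x10
"""

for tokens in tokenize_block(block):
    print(tokens)
```

A learned model would map each token to a numeric embedding before processing the sequence. That step is omitted here.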
The Ithemal model was found to cut prediction error rates by 50 percent compared with traditional hand-crafted models. The tool now makes it easier to quickly learn performance speeds for any new chip architectures, Mendis says. For instance, domain-specific architectures, such as Google’s new Tensor Processing Unit used specifically for neural networks, are now being built but aren’t widely understood. “If you want to train a model on some new architecture, you just collect more data from that architecture, run it through our profiler, use that information to train Ithemal, and now you have a model that predicts performance,” Mendis says.
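The article does not spell out how error is measured, so the sketch below assumes one common choice: the mean absolute percentage error between predicted and measured cycle counts. The function name and all sample numbers are invented for illustration and are not results from the research.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean of |predicted - measured| / measured over paired samples."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return total / len(measured)


# Made-up cycle counts for three basic blocks.
measured = [100.0, 200.0, 400.0]
handcrafted = [120.0, 160.0, 480.0]  # each prediction is off by 20 percent
learned = [110.0, 180.0, 440.0]      # each prediction is off by 10 percent

print(mean_abs_pct_error(handcrafted, measured))
print(mean_abs_pct_error(learned, measured))
```

In this toy data the second model's error is half the first's, which is how a 50 percent reduction would read under the assumed metric.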
MIT – www.mit.edu