Transform slow software into fast hardware with Vivado HLS

Feature articles | September 2, 2014

By eeNews Europe

Have you ever written some software that, despite your best coding efforts, didn’t run as fast as desired? I have.

Have you thought, “If only there were an easy way to put some of the code into multiple custom processors or custom hardware that wasn’t so expensive?” After all, your application is one of many, and custom hardware takes time and money to create. Or does it?

I began rethinking this proposition recently when I heard about the Xilinx high-level synthesis tool, Vivado HLS. In combination with the Zynq-7000 All Programmable SoC, which combines a dual-core ARM Cortex-A9 processor with an FPGA fabric, high-level synthesis opens up new possibilities in design. This class of tools creates highly tuned RTL from C, C++ or SystemC source code. Many purveyors of this technology exist, and the rate of adoption has been increasing in recent years.

So, how hard would it be to migrate some of that slow code into hardware, if indeed I could simply use Vivado HLS to do the more demanding computations? After all, I usually write my code in C++, and Vivado HLS uses C/C++ as an input. The presence of the ARM processor cores means I could run the bulk of my software in a conventional environment. In fact, Xilinx has even made available a software development kit (SDK) and PetaLinux for this purpose.

Architectural concerns

As I started to think about this transformation from a software perspective, I grew concerned about the software interface. After all, HLS creates hardware dedicated to processing hardware interfaces. I needed something easy to access, such as a coprocessor or hardware accelerator, to make the software run faster. Also, I didn’t want to write a new compiler. To make it easy to exchange data with the rest of the software, the interface needed to look like simple memory locations where we could place the inputs and later read back the results.

Then I made a discovery. Vivado HLS supports creating an AXI slave with relatively little effort. This capability started me thinking that an accelerator might not be so difficult to create after all. Thus, I found myself coding up a simple example to explore the possibilities. I was pleasantly surprised with how it turned out.

Let’s take a walk through the approach I took and consider the results.

For my example, I chose to model a set of simple matrix operations such as add and multiply. I didn’t want it to be constrained to a fixed size, so I would have to provide both the input arrays and their respective sizes. An ideal interface would put all the values as simple arguments to a function, such as the code in Figure 1.

Figure 1 – Example call to accelerator

The interface to the hardware would need a simple way to map the function arguments to memory locations. Figure 2 shows a memory layout to support this mapping. The registers would hold information about how the matrices were laid out and what the desired operations would be. The command register would indicate which operation to perform, which would allow me to combine several simple operations into one piece of hardware. The status register would simply indicate whether the operation was still in progress or had finished successfully. Ideally, the device would also support an interrupt.

Figure 2 – Register summary table
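As a rough illustration of the kind of register map described above, the sketch below models it as a C++ struct. The field order, the offsets, the command and status codes, and the names `AccelRegs`, `start_operation` and `finished` are all assumptions made for this example; they are not taken from the original figure.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: field order, offsets and codes are illustrative
// assumptions, not the layout from the article's register table.
struct AccelRegs {
    volatile std::uint32_t command;  // which operation to perform
    volatile std::uint32_t status;   // progress of the current operation
    volatile std::uint32_t a_rows;   // layout of input matrix A
    volatile std::uint32_t a_cols;
    volatile std::uint32_t b_rows;   // layout of input matrix B
    volatile std::uint32_t b_cols;
};

// Check at compile time that the struct matches the assumed offsets.
static_assert(offsetof(AccelRegs, command) == 0x00, "command at 0x00");
static_assert(offsetof(AccelRegs, status) == 0x04, "status at 0x04");
static_assert(offsetof(AccelRegs, a_rows) == 0x08, "a_rows at 0x08");
static_assert(sizeof(AccelRegs) == 0x18, "six 32-bit registers");

// Hypothetical command and status codes.
enum Command : std::uint32_t { kCmdAdd = 1, kCmdMultiply = 2 };
enum Status : std::uint32_t { kIdle = 0, kBusy = 1, kDone = 2 };

// Write the shapes first and the command last, so that a device which
// starts work on a command write would already see consistent sizes.
inline void start_operation(AccelRegs& regs, Command cmd,
                            std::uint32_t a_rows, std::uint32_t a_cols,
                            std::uint32_t b_rows, std::uint32_t b_cols) {
    regs.a_rows = a_rows;
    regs.a_cols = a_cols;
    regs.b_rows = b_rows;
    regs.b_cols = b_cols;
    regs.command = cmd;
}

// True once the (assumed) status register reports completion.
inline bool finished(const AccelRegs& regs) {
    return regs.status == kDone;
}
```

On real hardware the struct would overlay the mapped device memory rather than an ordinary variable, and the device itself would update `status`.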

Going back to the hardware design, I learned that Vivado HLS allows array arguments to specify small memories. Thus, the functionality would be described with a function such as the one Figure 3 shows.

Figure 3 – Accelerator function API
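A function of this shape might look like the following minimal sketch, which runs as ordinary software. The capacity `kMaxElems`, the command and status codes, and the name `matrix_accel` are hypothetical; the original figure may differ in every detail. The point is only that fixed-size array arguments carry the data while scalar arguments carry the sizes.

```cpp
#include <cstdint>

// Hypothetical capacity of each matrix memory, in elements.
constexpr int kMaxElems = 64;

// Hypothetical command and status codes.
enum Command { kAdd = 1, kMultiply = 2 };
enum Status { kOk = 0, kBadShape = 1, kBadCommand = 2 };

// A shape is usable if both dimensions are positive and it fits the memory.
static bool valid_shape(int rows, int cols) {
    return rows > 0 && cols > 0 && rows <= kMaxElems && cols <= kMaxElems &&
           rows * cols <= kMaxElems;
}

// Candidate top-level function. Matrices are stored row-major in
// fixed-size arrays; the run-time sizes travel as scalar arguments.
int matrix_accel(int command, int a_rows, int a_cols, int b_rows, int b_cols,
                 const std::int32_t a[kMaxElems],
                 const std::int32_t b[kMaxElems],
                 std::int32_t result[kMaxElems]) {
    if (!valid_shape(a_rows, a_cols) || !valid_shape(b_rows, b_cols))
        return kBadShape;
    switch (command) {
    case kAdd:
        if (a_rows != b_rows || a_cols != b_cols) return kBadShape;
        for (int i = 0; i < a_rows * a_cols; ++i) result[i] = a[i] + b[i];
        return kOk;
    case kMultiply:
        if (a_cols != b_rows || !valid_shape(a_rows, b_cols)) return kBadShape;
        for (int r = 0; r < a_rows; ++r) {
            for (int c = 0; c < b_cols; ++c) {
                std::int32_t acc = 0;
                for (int k = 0; k < a_cols; ++k)
                    acc += a[r * a_cols + k] * b[k * b_cols + c];
                result[r * b_cols + c] = acc;
            }
        }
        return kOk;
    default:
        return kBadCommand;
    }
}
```

Returning a status code rather than throwing keeps the function free of exceptions and dynamic allocation, which suits the static nature of hardware.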

Assuming the ability to synthesise the AXI slave, how would this fit with the software? My normal coding environment assumes Linux. Fortunately, Xilinx provides PetaLinux and, conveniently, PetaLinux provides a mechanism known as the User I/O device (UIO). UIO offers a simple way to map the new hardware into user memory space, and provides the ability to wait for an interrupt. This avoids the time-consuming and awkward process of writing a device driver. Figure 4 illustrates the system.

Figure 4 – System diagram
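The user-space side of UIO can be sketched roughly as follows, using the standard Linux pattern of mapping the device with `mmap()` and blocking on `read()` until an interrupt arrives. The class name `UioDevice` is hypothetical, and the device path and map size passed to it are assumptions that depend on how the board is configured.

```cpp
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Minimal RAII wrapper around a UIO device node (hypothetical helper).
class UioDevice {
public:
    UioDevice(const char* path, std::size_t size) : size_(size) {
        fd_ = ::open(path, O_RDWR);
        if (fd_ < 0) return;
        // Map the first memory region of the device into user space.
        void* p = ::mmap(nullptr, size_, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd_, 0);
        if (p == MAP_FAILED) {
            ::close(fd_);
            fd_ = -1;
            return;
        }
        regs_ = static_cast<volatile std::uint32_t*>(p);
    }
    ~UioDevice() {
        if (regs_) ::munmap(const_cast<std::uint32_t*>(regs_), size_);
        if (fd_ >= 0) ::close(fd_);
    }
    UioDevice(const UioDevice&) = delete;
    UioDevice& operator=(const UioDevice&) = delete;

    bool ok() const { return regs_ != nullptr; }
    volatile std::uint32_t* regs() const { return regs_; }

    // Blocks until the device signals an interrupt. A UIO read delivers
    // a 32-bit event count; anything else is treated as failure.
    bool wait_for_interrupt() {
        std::uint32_t count = 0;
        return ::read(fd_, &count, sizeof count) ==
               static_cast<ssize_t>(sizeof count);
    }

private:
    int fd_ = -1;
    volatile std::uint32_t* regs_ = nullptr;
    std::size_t size_ = 0;
};
```

A caller might construct `UioDevice dev("/dev/uio0", 0x1000);`, check `dev.ok()`, write the inputs through `dev.regs()`, and then call `dev.wait_for_interrupt()` before reading back the results. Both the node name and the size here are placeholders.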

There are, of course, a few drawbacks to this approach. For instance, the UIO device cannot be used with DMA, so you must construct matrices in the device memory and manually copy them out when done. A custom device driver in the future could address that issue if needed.

Synthesising the hardware with Vivado HLS

Back to the topic of synthesising the AXI slave. How difficult would this be? I found the coding restrictions to be quite reasonable. Most of the C++ language could be used, with the exception of the dynamic allocation of memory. After all, hardware doesn’t manufacture itself during operation. This restricts use of the standard template library (STL) functions, because they make heavy use of dynamic allocation. As long as the data storage remains static, most features are available. At first, this restriction appeared onerous, but I realised it wasn’t a huge deal. Also, Vivado HLS allows for C++ classes, templates, and function and operator overloading. My matrix operations could easily be wrapped in a custom matrix class.
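A matrix class written within these restrictions might look like the sketch below: storage is a plain member array, so nothing is allocated at run time, while templates and operator overloading keep the calling code readable. This is an illustrative assumption rather than the class used in the project; in particular, it fixes the dimensions at compile time for simplicity, whereas the accelerator described here takes its sizes at run time.

```cpp
// Fixed-size matrix with static storage: no new, no STL containers.
template <typename T, int Rows, int Cols>
class Matrix {
public:
    T& operator()(int r, int c) { return data_[r][c]; }
    const T& operator()(int r, int c) const { return data_[r][c]; }

    // Element-wise sum; shapes match by construction of the type.
    Matrix operator+(const Matrix& rhs) const {
        Matrix out;
        for (int r = 0; r < Rows; ++r)
            for (int c = 0; c < Cols; ++c)
                out(r, c) = (*this)(r, c) + rhs(r, c);
        return out;
    }

    // Matrix product; the inner dimension is checked at compile time
    // because rhs must have exactly Cols rows.
    template <int K>
    Matrix<T, Rows, K> operator*(const Matrix<T, Cols, K>& rhs) const {
        Matrix<T, Rows, K> out;
        for (int r = 0; r < Rows; ++r) {
            for (int c = 0; c < K; ++c) {
                T acc = T();
                for (int k = 0; k < Cols; ++k)
                    acc += (*this)(r, k) * rhs(k, c);
                out(r, c) = acc;
            }
        }
        return out;
    }

private:
    T data_[Rows][Cols] = {};  // zero-initialised static storage
};
```

With this class, `a + b` and `a * b` read like ordinary arithmetic, and a mismatched multiplication fails to compile instead of failing at run time.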

Adding the I/O to create an AXI slave was easy: I simply added some pragmas to indicate which ports participate and what protocol they use.
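To give a feel for what such pragmas look like, here is a small sketch on a deliberately tiny function. It assumes the `INTERFACE s_axilite` form of the pragma; the exact spelling depends on the Vivado HLS version, so the tool's user guide should be treated as the authority. The function `vec_add`, the length `kLen` and the bundle name `ctrl` are hypothetical. An ordinary C++ compiler ignores the pragmas (possibly with a warning), so the same source still runs as software.

```cpp
// Hypothetical fixed capacity of each array port.
constexpr int kLen = 16;

// Adds the first n elements of a and b. Each argument, and the function's
// block-level control (port=return), is assigned to one AXI4-Lite slave
// interface. Pragma spelling is an assumption about the tool version.
void vec_add(const int a[kLen], const int b[kLen], int sum[kLen], int n) {
#pragma HLS INTERFACE s_axilite port=return bundle=ctrl
#pragma HLS INTERFACE s_axilite port=a bundle=ctrl
#pragma HLS INTERFACE s_axilite port=b bundle=ctrl
#pragma HLS INTERFACE s_axilite port=sum bundle=ctrl
#pragma HLS INTERFACE s_axilite port=n bundle=ctrl
    // A fixed trip count with a guard keeps the loop bound static.
    for (int i = 0; i < kLen; ++i) {
        if (i < n) sum[i] = a[i] + b[i];
    }
}
```

Grouping every port into one bundle would give the software a single block of memory-mapped locations to work with, which matches the simple interface the article set out to build.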

Running the synthesis tool was also fairly easy as long as I didn’t push all the knobs.

Step-by-step

Figure 5 shows the overall steps involved. I colour-coded them to identify the major stages.

The first stage, identification, codes and automates verification of the software-only version. In conjunction with profiling, this provides crucial information for identifying what to put into hardware. Verification code at this stage will be reused to verify the hardware implementation later.

The second stage, refactoring, moves and isolates the software to be transformed. Verification must be repeated during the refactoring to ensure that the changes do not affect functionality. A special requirement of Vivado HLS is that main() must return a value of zero only if verification is successful. Verification errors should result in a non-zero return.
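The return-value convention can be sketched as a self-checking testbench like the one below. The function under test, `vec_scale`, and the helper names are hypothetical stand-ins; the body is written as `run_testbench()` so it can be reused, and in an actual testbench `main()` would simply return its result.

```cpp
#include <cstdio>

// Hypothetical function under test, standing in for the refactored code.
void vec_scale(const int in[4], int out[4], int factor) {
    for (int i = 0; i < 4; ++i) out[i] = in[i] * factor;
}

// Counts positions where the result differs from the reference values.
int count_mismatches(const int* got, const int* want, int n) {
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (got[i] != want[i]) {
            std::printf("mismatch at %d: got %d, want %d\n", i, got[i], want[i]);
            ++errors;
        }
    }
    return errors;
}

// Returns 0 only when every check passes, non-zero otherwise.
// In a real testbench: int main() { return run_testbench(); }
int run_testbench() {
    const int in[4] = {1, 2, 3, 4};
    const int expected[4] = {3, 6, 9, 12};
    int out[4] = {0, 0, 0, 0};
    vec_scale(in, out, 3);
    return count_mismatches(out, expected, 4) == 0 ? 0 : 1;
}
```

Because the comparison is automatic, the same testbench can be rerun unchanged after each refactoring step and again once the hardware version exists.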

With the code for synthesis isolated, the high-level synthesis (HLS) stage may begin in the Eclipse-based Vivado HLS tool. Besides indicating which files constitute hardware versus testbench (verification software), Vivado HLS needs a certain amount of direction as to the target technology and clock speed. If using a development board, such as the ZedBoard, you can specify it rather than the specific FPGA.

Additional code transformations occur during the HLS stage to satisfy high-level synthesis tool requirements. Most transformations come easily, and make good sense. Some of what needs to be changed can be identified quickly by running the Vivado HLS tool and examining the error report.

During the HLS stage, various options may be enabled to provide better quality of results (QoR). Knowing what to try is best learned by taking a short two-day Xilinx technology class on Vivado HLS. A significant part of running the tools involves keeping an eye on the reports for violations of policy, and careful study of the analysis report to ensure Vivado HLS has done what was expected. Tool users need to have some appreciation for the hardware aspects, which are covered by the technology class. There is also the issue of running verification simulations, both before and after synthesis, to verify the expected behaviour. The process of verification is automated, but relies heavily on the user’s verification code from the first stage.

Figure 5 – Steps in design flow

The next stage, hardware synthesis, takes the output of HLS and connects it into the hardware platform. Vivado’s IP Integrator made connecting the AXI slave into the Zynq SoC hardware a breeze, and removed concerns that signals would be hooked up incorrectly. Xilinx even has a profile for my development system, the ZedBoard, and IP Integrator exports data for the software development kit.

The penultimate step is the software integration stage, which involves modifying the original software to use the newly designed hardware. This is where I took advantage of the UIO driver.

Finally, it is necessary to re-evaluate performance to ensure there were actual performance gains as a result of using the new hardware. Software profiling of the verification suite satisfies this requirement.
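A very simple way to compare the two versions, short of a full profiler, is to time the same verification workload against each. The sketch below uses `std::chrono::steady_clock`; the helper names `seconds_for` and `speedup` are hypothetical, and the workloads passed in would be the software-only and hardware-assisted code paths.

```cpp
#include <chrono>

// Runs work() the given number of times and returns the elapsed
// wall-clock time in seconds.
template <typename F>
double seconds_for(F&& work, int repeats) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Ratio of software time to hardware time; values above 1.0 mean the
// accelerated version was faster. Returns 0 if the input is unusable.
inline double speedup(double software_seconds, double hardware_seconds) {
    return hardware_seconds > 0.0 ? software_seconds / hardware_seconds : 0.0;
}
```

Timing should include the cost of copying the matrices to and from the device, since that overhead is part of what the software actually experiences.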

Summary

I am truly pleased with the results, and hope to do more with this chip/tool set combination. I have not explored all the possibilities. For instance, Vivado HLS also supports an AXI master interface. This would allow the accelerator to copy the matrices from external memory (although security issues might exist for this case). Nevertheless, I highly recommend that anyone looking at code bottlenecks in their software should look at this tool set.

Further training classes, resources and materials exist to enable a fast ramp-up, including those from Doulos. Expert-led online and face-to-face courses, including Vivado HLS and Zynq, are available to book now, as well as customised onsite team training. More from:

Doulos: www.doulos.com/xilinx

Xilinx: www.xilinx.com and https://www.xilinx.com/products/silicon-devices/soc/zynq-7000/index.htm

David C. Black is Senior Member of Technical Staff at Doulos.