Companies designing new system-on-chip (SoC) products face ongoing market pressure to do more with less and achieve higher returns. The result is shrinking engineering teams, reduced design tool budgets and shortened timelines to get new products to market. This has led companies designing complex SoCs to move increasingly toward licensing IP cores for the majority of the building blocks of their designs instead of building their own in-house custom versions. Selecting the right IP cores is the fundamental challenge of this developing paradigm, and the means of evaluating and presenting that IP is as important to the purchaser as it is to the developer.
The reality is that IP cores are offered with a huge variety of features and options. And even once you’ve sorted through the catalogue of potential vendors and products, there is still a vast range in IP quality. The trick is to separate the truly robust and capable from IP that is buggy, insufficiently tested, lacking in real-world performance, and without a wide and active base of successful users.
CogniVue is innovating with embedded vision, enabling small smart cameras that see and react to the world around them: cars that see and avoid accidents, cameras on our TVs that recognise our faces and gestures, and smart phones that see and give us an augmented view of the world around us. With its Image Cognition Processing, CogniVue is enabling dramatically new levels of embedded vision, making previously impossible small smart cameras possible. When it comes to vision processing, CogniVue aims not only to offer the highest quality IP, but also to ensure that it meets the needs of the widest range of applications, both for today and tomorrow. This is a field where use cases are still developing and where many customers won’t know their real needs until the design project is well underway.
Figure 1. Example of vision-enabled SoC architecture with a CogniVue APEX2-642 Core
CogniVue’s APEX image cognition processing core, shown in Figure 1, is designed for efficient pipelining of embedded image and vision processing algorithms. The Image Cognition Processor (ICP) is in production and used in many applications, including automotive cameras such as those from Zorg Industries for AudioVox, as well as new wearable-type consumer products such as the NEO.1 smart pen from NeoLAB Convergence Inc., shown in Figure 2. Its deployment in these kinds of consumer applications is made possible by its ability to deliver 100x better performance per area per power for vision processing compared with conventional processor architectures. For the NEO.1, the APEX core provides processing at rates of 120 frames per second while maintaining very low power dissipation, allowing this battery-powered device to last for many days on a single charge.
Figure 2. The CogniVue APEX core powers the NeoLAB Convergence Inc. NEO.1 Smart Pen
This kind of success is achieved both through a fundamental knowledge of image processing requirements and through an exhaustive testing and demonstration approach that targets customer needs within their industrial landscape. Before any core is delivered, extensive validation is needed, especially in markets such as automotive where compliance to industry standards for safety (e.g. ISO 26262 "Road vehicles – Functional safety") is required.
Although testing is necessitated by such requirements, there is also ancillary motivation for IP companies to provide validation and evaluation platforms that not only show functionality and compliance, but that also perform at levels capable of highlighting their true value to prospective customers.
As an example of this motivation, consider that it is comparatively easy to create vision IP that performs well for narrow, targeted applications that are known today. Building vision usefulness and flexibility into the technology from the ground up, however, is what ensures that the IP can perform at the highest levels across multiple applications. And talk is cheap; the IP’s quality and fit for the application may not be apparent without a real-world, “eyes-on” demonstration to prove that those capabilities actually exist.
The challenge for the fabless IP provider looking to enable their partners and customers is to demonstrate a real IP application running in the real world. Thankfully FPGA platforms continue to leap forward alongside the rest of the technology world, providing a vehicle for this demonstration. In other words, FPGAs can provide the necessary capacity and performance to demonstrate what is possible if the IP is selected for use in the next generation custom ASICs. In spite of this, it seems that we always operate on the edge, pushing the limits of FPGA capacity and performance, and always wanting a little bit more.
FPGA vendors are getting very good at software tool development, but such tools tie the use of IP to an individual FPGA company. A demo running on one FPGA vendor’s device today should stand ready to move to an entirely different FPGA vendor’s tomorrow. Such a move can be driven by internal teams or by the end customer, and can be due to a combination of factors such as preference and familiarity, legacy infrastructure (hardware and software components), and sometimes the availability of newer, faster, less costly, better-sized platforms. Moreover, a common RTL code base must work in both the eventual ASIC design flow and the FPGA “IP demonstration” design flow, as shown in Figure 3.
Figure 3. IP needs to be implemented on multiple ASIC and FPGA prototype demonstration platforms from multiple vendors
Within this operational model, we believe that Synopsys Synplify stands out; coupled with Synopsys DesignWare IP, it is the perfect partner for CogniVue’s complex IP development and the crucial IP demonstration that accompanies it. In the first place, Synplify provides the best capability for fit and performance of logic in the silicon devices to which we deliver IP. For a novice FPGA developer this is counter-intuitive; surely the vendor would know best how to map logic to its own constructs! Vendor tools are indeed getting very good at providing everything a basic developer may need, and in many cases it wouldn’t surprise us if they provided optimal results. The reality, however, is that the first stage of implementing an RTL design in an FPGA is logic synthesis, including timing and area optimisation. Synopsys has been a leader in that domain for a very long time, and its developers have focused on the fundamental synthesis problem of hardware realisation independent of the final technology mapping (whether an FPGA or an ASIC).
For us, proof of this is found in the fact that we routinely work with code bases that push the boundaries of the available FPGA devices and that do not fit when exclusively using FPGA vendor tools. In these instances, place and route won’t even be attempted after synthesis. Using Synplify often enables these borderline designs to complete by reducing the post-synthesis footprint and the corresponding space required within the target FPGA device. Table 1 shows the resource utilisation results obtained for a recent design using the vendor-provided tools for both synthesis and place and route, compared with the same design using Synplify for synthesis followed by place and route with the vendor’s tool. One key metric from Table 1 is that the design based on vendor-tool synthesis was at 116.91% utilisation and would not fit the FPGA device available on the platform. This is a real-world example of our IP and our need to consistently map the design onto the FPGA; it is critical from a system and software development perspective that we are able to reuse these FPGA platforms. The alternative would be to create an FPGA variant of the design with reduced functionality to achieve fit, but this would be far from ideal, as the RTL design validated in FPGA would diverge from the RTL design delivered for integration into ASIC SoC projects.
Most seasoned FPGA users might comment that, even after Synplify synthesis, 94.92% utilisation is a precarious position to be in, as even minor changes in FPGA designs, such as adding a few logic gates, can have large impacts on overall area and achievable clock speed. Nevertheless, our experience has shown this result to be reliably implementable while also achieving clock rates at the upper end of our expectations. This is clearly a testament to the increasing quality of vendor implementation tools.
Table 1. Vendor Tool Only vs. Synplify and Vendor Tool results comparison
Put together, then, the Synplify-to-vendor place and route flow works well for us; not only does it yield better results, it also achieves them in less overall time. This isn’t always immediately apparent, because the synthesis step in an FPGA vendor’s tool is sometimes quicker than the same step in Synplify. We have consistently seen, however, that the implementation steps following vendor-only synthesis take significantly longer than implementation from a Synplify-optimised netlist. The example above can’t illustrate this, since implementation is not possible in the vendor-only case. Instead, consider another common (much larger) build that CogniVue uses to showcase the power and expandability of our IP; the CogniVue IP alone is the equivalent of ~2.6M NAND2 ASIC gates. Building that configuration and its attendant system components (processor, memories, interconnect, etc.) with the Synplify flow takes about four hours and twenty minutes; the same build with vendor-only tools has been observed to require approximately five hours and forty-five minutes. That’s 33% longer, and it yields a less optimal result.
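As a quick sanity check on that figure, the runtime gap follows directly from the two build times quoted above; a minimal sketch of the arithmetic (the times are those reported in the text, everything else is illustrative):

```python
# Build times from the text, converted to minutes.
synplify_flow_min = 4 * 60 + 20   # Synplify synthesis + vendor place and route
vendor_only_min = 5 * 60 + 45     # vendor-only synthesis and place and route

# Relative overhead of the vendor-only flow versus the Synplify flow.
overhead = (vendor_only_min - synplify_flow_min) / synplify_flow_min
print(f"Vendor-only flow takes {overhead:.0%} longer")  # about 33%
```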
Part of the reason that Synplify synthesis can take more time is because it offers powerful quality of results (QoR) capabilities “under the hood” to improve performance. There are two features in this category that we routinely use to achieve the best performance from our implemented FPGA platforms: retiming and pipelining.
Retiming is the process of redistributing sequential elements (e.g. flops) to better balance the logic levels and/or routing distance between them. It improves overall timing by shortening long paths that limit achievable performance while lengthening shorter paths that would otherwise have unused margin. All of this is achieved without any RTL changes and without affecting the design behaviour observed at the primary inputs and outputs of the design; the total number of sequential stages remains the same and the functional operation is unchanged.
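The timing benefit can be sketched with a toy model: the achievable clock is set by the slowest register-to-register path, so moving registers to even out stage delays raises the clock rate even though the total logic delay is unchanged. This is an illustrative sketch, not tool output; the stage delays are invented numbers.

```python
def max_clock_mhz(stage_delays_ns):
    """Achievable clock is limited by the slowest register-to-register path."""
    return 1000.0 / max(stage_delays_ns)

# Before retiming: one long 9 ns combinational path dominates.
before = [9.0, 3.0, 4.0]
# After retiming: same total logic delay (16 ns), redistributed across stages.
after = [6.0, 5.0, 5.0]

print(f"before: {max_clock_mhz(before):.0f} MHz")  # limited by the 9 ns stage
print(f"after:  {max_clock_mhz(after):.0f} MHz")   # higher clock, same function
```

Note that the register count and the total delay are unchanged; only the placement of the registers moves, which is why the externally observed behaviour is identical.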
Pipelining is a related process whereby a complex function (such as multiplication) is broken into several stages so that the input stage can accept new inputs on each cycle while the output and intermediate stages continue processing the previous inputs. Through this staging, clock rates and throughput can be increased without any significant impact on latency. In terms of the synthesis function that Synplify applies to a calculation like multiplication, this means that the flops placed before and/or after the multiplication operation can be recognised as pipeline candidates and moved into the multiplier automatically by the tool. This achieves a similar kind of timing balancing to the retiming capability described above and enables higher clock rates and optimal efficiency for complex RTL functions.
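The throughput argument behind pipelining can also be sketched in a few lines: an operation split into N stages accepts a new input every cycle, so a stream of M operations completes in roughly M + (N − 1) cycles instead of M × N. The stage and operation counts below are illustrative, not drawn from any particular APEX configuration.

```python
def cycles_unpipelined(m_ops, stages):
    # Each operation must finish (stages cycles) before the next one starts.
    return m_ops * stages

def cycles_pipelined(m_ops, stages):
    # Fill the pipeline once, then one result emerges every cycle.
    return stages + (m_ops - 1)

# 1000 multiplies through a hypothetical 4-stage multiplier:
print(cycles_unpipelined(1000, 4))  # 4000 cycles
print(cycles_pipelined(1000, 4))    # 1003 cycles
```

For a long stream of data the fill cost is amortised, which is why throughput approaches one result per cycle while the latency of any individual operation stays at the pipeline depth.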
When you think about the fact that these QoR functions can be selected to analyse and improve a given design automatically, you can see that they are a real enabler of quicker, better design practices for engineers. In the same way that synthesis tools have become fundamentally dependable for the realisation of logic from high-level languages such as Verilog and VHDL (obsoleting schematic capture for integrated circuit design), this type of capability shows that Synopsys, at least, is also dependable for finding the optimal timing configuration from a complex sea of timing elements and combinational logic. This means that engineers using a tool like Synplify can capture their designs in a natural, clear way and then rely on the software tools to perform the optimisations that would otherwise confuse and obfuscate their code.
These optimisations definitely help to improve the achievable clock rates of our testing platforms, and they are further helped by the fact that Synplify significantly reduces our logic footprint (as shown in Table 1). Using less logic (fewer FPGA resources) means correspondingly shorter paths, which generally results in higher achievable clock speeds and less time and effort spent in timing closure. On these points alone, Synplify is a winner because it enables us to achieve the best fit and performance for our demonstration platforms. The fact that we can use the same synthesis step for multiple vendors simply seals the deal.
Vendor independence for synthesis is only part of the equation. CogniVue IP seeks to provide the very best vision processing performance for the broadest range of applications within the context of a system (or an SoC). And that means we need additional IP (e.g. host processor interface, DDR RAM controller, interconnect and so on) to build a useful demonstration platform. FPGA vendors have a lot to offer in this area, too, and some of their components are necessary to achieve optimal implementations. A high-speed DDR RAM controller, for example, is best chosen from those a vendor has matched to its devices as there are physical interface considerations such as I/O speeds and internal routing. Our experience has shown, however, that Synopsys’ DesignWare IP offers a host of fundamental building blocks that provide optimal performance (not just area and clock speed; interface efficiency, as well) and flexibility.
As an example, one of the most common SoC interconnects today is AMBA AXI from ARM. FPGA vendors are aware of this and generally offer all of the AXI components that might be necessary to stitch together an array of IP. The Synopsys DesignWare IP solutions for AMBA, however, are also very extensive and offer class-leading flexibility, efficiency, area, and speed in a format that is vendor independent and applicable not only to FPGAs, but also to the ASICs that may ultimately be realised. Their use enables an IP vendor to demonstrate interoperability beyond its own sphere and augments the guidance it can provide to customers.
After all, no matter how good your IP is, the quality will be missed if the logic surrounding it, driving it, and supporting it, is less than ideally matched. Building what you’re offering into an optimal, high performance demonstration platform is what will highlight its value and convince customers to keep looking to you for more.
We have an extensive history in vision processing, and what we have learned along the way underpins products that not only offer best-in-class performance per area per power, but are inherently flexible and applicable to the needs of today and tomorrow. The Synopsys synthesis tools and IP, which enable the use of any vendor’s FPGA and the automatic availability of IP cores targeted for ASICs while achieving the best QoR and runtimes, have been an important element of our design success. We have delivered our IP cores and have seen them in real automotive and consumer devices. Synopsys’ Synplify and DesignWare IP have been key players in this success and offer significant benefits to any company with broad application aspirations and a need for the best performance along the way.
About the Authors
Ali Osman Örs, Senior Director of Engineering, has over 16 years of experience in successfully bringing ASIC and IP products to market. In his role as Senior Director of Engineering, Ali is responsible for the architecture and implementation of all hardware and software aspects of the APEX processor IP cores and the Image Cognition Processor SoCs. Ali held various roles as architect and technical lead in the development of the first-generation APEX processor core as well as the proprietary video encoder/decoder engines at MtekVision and, prior to that, at Atsana Semiconductor. Before Atsana, Ali worked at Cadence and Westport Technologies, where he was in charge of system architecture and held various design and verification roles for ASIC and FPGA development. In 2014, Ali was selected as a “Forty Under 40” business leader by the Ottawa Chamber of Commerce and the Ottawa Business Journal. Ali has a Bachelor of Engineering degree in Electrical Engineering from Carleton University in Ottawa.
Daniel Reader, Senior ASIC Architect, has over 15 years of experience in circuit board design and ASIC development. His ASIC expertise is helping to steer CogniVue’s next-generation technology while he architects and maintains its FPGA demonstration and evaluation platforms. He has previously held roles as ASIC chip prime with Diablo Technologies, senior ASIC architect and developer with Neterion Corporation, and lead circuit board designer with Nortel. He holds a Bachelor of Science degree in Computer Engineering from the University of Manitoba in Winnipeg and a Master of Engineering degree from the Auckland University of Technology in Auckland, New Zealand.