ARM launches flagship cores in ‘DynamIQ’ style

Technology News | May 29, 2017

By Peter Clarke

The Cortex-A75 is ARM’s next-generation leading-performance 32/64bit processor core aimed at 16nm, 10nm and 7nm FinFET implementation and beyond, effectively eclipsing last year’s Cortex-A73. The Cortex-A55 is a new mid-range 32/64bit processor offering a balance between performance and power efficiency. It is the latest “little” core, intended to complement the Cortex-A75, the latest “big” core.

The Cortex-A75 offers more than 20 percent more peak and mobile performance than the Cortex-A73 at about the same energy efficiency. Benchmarks for advanced use cases show between 16 percent and 48 percent uplift compared with the Cortex-A73 at the same manufacturing process and clock frequency. ARM also expects its licensees to get a slight frequency boost at 10nm, reaching a 3.0GHz clock frequency where the previous core maxed out at 2.8GHz in that process.

The Cortex-A55 offers up to twice the performance of the Cortex-A53 with up to 15 percent better power efficiency, this time benchmarked for a 16nm manufacturing process. For the Cortex-A55 the benchmarks show an uplift of between 14 percent and 97 percent.

However, a key factor is that the Cortex-A75 and the Cortex-A55 are the first two cores to support version ARMv8.2 of the instruction set architecture and DynamIQ clustering (see ARM to boost processor performance by 50x with new AI instructions).

The ISA upgrade includes additional instructions that are expected to provide up to a 50x improvement in machine learning applications over the next three to five years, and the clustering support will allow a far richer set of heterogeneous cores to be used, in clusters of up to eight cores with up to 32 clusters on a chip.

One of the obvious benefits of this is that mobile devices will be able to handle more of the machine learning themselves, and certainly do inference more quickly, so less work will need to be passed up to the cloud, with a concomitant improvement in latency and overall system power consumption.

In the mobile space DynamIQ can support configurations such as 1B+3L (one big core plus three little cores) or 1B+7L, but DynamIQ is also aimed at a wide range of applications that could use multiple clusters of cores, including servers, networking, self-driving cars and advanced driver assistance systems (ADAS).
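
As a rough illustration of the limits quoted above (up to eight cores in a cluster, up to 32 clusters on a chip), the Python sketch below checks big/little mixes written in the 1B+3L style. The function names and the string format are hypothetical, chosen for this example only; any further restrictions a real DynamIQ implementation may impose on the mix are not modelled.

```python
import re
from typing import Tuple

MAX_CORES_PER_CLUSTER = 8   # figure quoted in the article
MAX_CLUSTERS_PER_CHIP = 32  # figure quoted in the article


def parse_cluster(config: str) -> Tuple[int, int]:
    """Parse a string such as '1B+3L' into (big, little) core counts.

    Hypothetical helper: only the per-cluster core limit is checked.
    """
    match = re.fullmatch(r"(\d+)B\+(\d+)L", config)
    if match is None:
        raise ValueError(f"unrecognised configuration: {config!r}")
    big, little = int(match.group(1)), int(match.group(2))
    if not 1 <= big + little <= MAX_CORES_PER_CLUSTER:
        raise ValueError(
            f"{config}: a cluster holds 1 to {MAX_CORES_PER_CLUSTER} cores"
        )
    return big, little


def max_cores_per_chip() -> int:
    """Upper bound on cores per chip implied by the two quoted limits."""
    return MAX_CORES_PER_CLUSTER * MAX_CLUSTERS_PER_CHIP


if __name__ == "__main__":
    for cfg in ("1B+3L", "1B+7L"):
        print(cfg, parse_cluster(cfg))
    print("upper bound on cores per chip:", max_cores_per_chip())
```

Taken together, the two limits imply a ceiling of 256 cores on one chip, which is why the multi-cluster case matters more for servers and networking than for handsets.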

The A75/A55 cores were released to early-adoption partners in 4Q16 and the first SoCs based on the cores, expected in 16nm and 10nm FinFET processes, should be available in 4Q17 or 1Q18. The company claims to have more than 10 licensees for Cortex-A75, Cortex-A55 and DynamIQ.

DynamIQ shared unit (DSU)

The key thing that marks out an ARM processor core as DynamIQ is the presence of a private L2 cache next to the core and access to an L3 cache shared within the cluster of cores, plus a DynamIQ shared unit (DSU) to implement functions around these features. The DSU contains asynchronous bridges and support for multiple clock domains, a snoop control unit and cluster interfaces.

The L3 cache is 16-way set associative and configurable up to 4Mbytes. ARM has also introduced an ability to partition and reserve certain parts of the L3 cache for certain functions. The cache can be partitioned into up to four areas to reduce the effect of cache thrashing, which can be important for markets such as network infrastructure and automotive, according to Peter Greenhalgh, ARM Fellow and senior director of technology at ARM’s CPU group.

Partitions can be reserved for a class of processes running on the CPUs or for external agents connected via the ACP [Accelerator Coherency Port] or the interconnect. The remaining partitions are shared between all other processes. Importantly, the partitions can be re-assigned by an operating system or hypervisor at runtime. The L3 cache data allocation policy can also change depending on the pattern of data use, which reduces memory access latency. As a result, it is more than 50 percent faster to access L2 cache memory, ARM said.
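
To make the cache figures concrete, here is a minimal Python sketch of the geometry of a 4Mbyte, 16-way set-associative cache and of an even split of its ways among four partitions. The 64-byte line size and the way-based partitioning scheme are assumptions made for illustration; the article does not say how the DSU implements partitioning, and the function names are hypothetical.

```python
CACHE_BYTES = 4 * 1024 * 1024  # maximum L3 size quoted in the article
WAYS = 16                      # associativity quoted in the article
PARTITIONS = 4                 # maximum partition count quoted in the article
LINE_BYTES = 64                # ASSUMPTION: line size is not given in the article


def cache_geometry(cache_bytes=CACHE_BYTES, ways=WAYS, line_bytes=LINE_BYTES):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, sets, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def ways_per_partition(ways=WAYS, partitions=PARTITIONS):
    """ASSUMED scheme: each partition owns a contiguous, equal group of ways."""
    group = ways // partitions
    return [list(range(p * group, (p + 1) * group)) for p in range(partitions)]


if __name__ == "__main__":
    sets, offset_bits, index_bits = cache_geometry()
    print("sets:", sets, "offset bits:", offset_bits, "index bits:", index_bits)
    print("address 0x12345678 ->", split_address(0x12345678, sets, offset_bits, index_bits))
    print("ways per partition:", ways_per_partition())
```

Under these assumptions a reserved partition confines its owner to a quarter of the ways in every set, so a streaming workload in one partition cannot evict lines held by another, which is the thrashing problem the partitioning is meant to address.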

Other innovations introduced with DynamIQ and implemented in the A75/A55 include: cache stashing, a faster power-down and asynchronous bridges to decouple cluster and core clock frequencies.

“Cache stashing enables read/writes into shared L3 or per-core L2 cache. The AMBA 5 CHI and ACP can be used. It is like an inverse pre-fetch, where it is not the CPU pre-fetching but allowing agents or software to nominate data that will be useful in the cache,” said Greenhalgh, in a meeting with analysts and journalists held in Cambridge last month.

The Cortex-A55

The Cortex-A55 includes the integrated L2 cache mentioned previously, an improved branch predictor and improved data pre-fetch, atomic memory instructions, RAS features and lower-latency floating point. It also retains the in-order, 8-stage pipeline of its Cortex-A53 predecessor. “As we go to 7nm we are not seeing much frequency entitlement. We are seeing area improvement and reduction in leakage current. The lack of frequency change is a reason not to change the pipeline depth,” said Greenhalgh.

Interestingly the branch predictor for this conventional von Neumann processor is a neural network based algorithm with a zero-cycle prediction to eliminate pipeline ‘bubbles’.
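
The article gives no detail of the algorithm ARM uses. For readers unfamiliar with the idea, the best-known neural approach in the academic literature is the perceptron predictor, sketched below in Python. This is not presented as ARM's design: the class name, table size, history length and training threshold are all illustrative assumptions, and the zero-cycle timing is not modelled.

```python
class PerceptronPredictor:
    """Textbook perceptron branch predictor (illustrative, NOT ARM's design)."""

    def __init__(self, table_size=64, history_len=8, threshold=16):
        self.table_size = table_size
        self.threshold = threshold
        # One weight vector per table entry; element 0 is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global branch history: +1 for taken, -1 for not taken, newest last.
        self.history = [-1] * history_len

    def _output(self, pc):
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        """Predict taken when the weighted sum of the history is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on a mispredict or a low-confidence output, then shift history."""
        y = self._output(pc)
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, hi in enumerate(self.history, start=1):
                w[i] += t * hi
        self.history = self.history[1:] + [t]


if __name__ == "__main__":
    predictor = PerceptronPredictor()
    outcome, hits = True, 0
    for step in range(220):
        if step >= 200:  # score only after a warm-up period
            hits += predictor.predict(0x80) == outcome
        predictor.update(0x80, outcome)
        outcome = not outcome  # a strictly alternating branch
    print("correct predictions after warm-up:", hits, "of 20")
```

The appeal of this family of predictors is that each weight learns how strongly one earlier branch outcome correlates with the current one, so long histories can be exploited with storage that grows linearly rather than exponentially.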

Neon is a SIMD (Single Instruction Multiple Data) accelerator that has been part of ARM cores since the ARMv7 architecture and the Cortex-A8 core. It covers integer data types up to 64bit width and various floating point formats. It is not part of the ALU pipeline but is a coprocessor that shares sixteen 128bit registers with the vector floating point unit.

The debut of ARMv8.2 in the Cortex-A55 brings several new Neon instructions, including rounding double MAC instructions for color-space conversion, FP16 instructions that perform eight 16bit operations per cycle, and dot product instructions that perform four 32bit operations per cycle. The latency of fused multiply-add operations has been halved to four cycles and a radix-16 integer divider has been added.
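
The arithmetic behind a dot product instruction can be sketched in a few lines of Python. The model below assumes the behaviour of the signed vector form on one pair of 128bit registers: sixteen 8bit elements per source, with each group of four products summed into one of four 32bit accumulator lanes. The function names are hypothetical and the sketch says nothing about timing.

```python
def wrap_int32(value):
    """Wrap an integer to signed 32-bit two's-complement range."""
    return (value + 2**31) % 2**32 - 2**31


def sdot(acc, a, b):
    """Emulate one signed 8-bit dot-product step over 128-bit vectors.

    acc: four 32-bit accumulator lanes
    a, b: sixteen signed 8-bit elements each
    Returns the four updated accumulator lanes.
    """
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected 4 accumulator lanes and 16 elements per source")
    if any(not -128 <= x <= 127 for x in list(a) + list(b)):
        raise ValueError("source elements must fit in a signed 8-bit value")
    result = []
    for lane in range(4):
        total = acc[lane]
        for k in range(4):
            total += a[4 * lane + k] * b[4 * lane + k]
        result.append(wrap_int32(total))
    return result


if __name__ == "__main__":
    print(sdot([0, 0, 0, 0], list(range(16)), [1] * 16))
```

Sixteen multiplies and sixteen adds collapse into a single step here, which is why this form suits the 8bit inner loops of neural network inference.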

Greenhalgh made the point that AlexNet, a popular form of convolutional neural network used for computer vision, visual classification and recognition, spends 80 percent of its time doing matrix multiplication, as supported by Neon. He added that industry seems to be migrating towards higher neural network complexity [layers] at reduced resolution, such as 16bit FP, now supported, and 8bit fixed point.
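
What reduced resolution costs in practice can be seen with a small Python sketch. It rounds a value to 16bit floating point using the standard library, and quantizes vectors to 8bit fixed point with a power-of-two scale before taking a dot product. The function names and the choice of scale are illustrative assumptions, not a description of any ARM library.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def quantize_int8(values, scale):
    """Map floats to signed 8-bit integers, saturating at the range limits."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def quantized_dot(a, b, scale_a, scale_b):
    """Dot product on 8-bit quantized inputs with a wide integer accumulator."""
    qa = quantize_int8(a, scale_a)
    qb = quantize_int8(b, scale_b)
    acc = sum(x * y for x, y in zip(qa, qb))
    return acc * scale_a * scale_b  # rescale the result to real units


if __name__ == "__main__":
    print("0.1 as FP16:", to_fp16(0.1))
    scale = 1 / 64  # ASSUMPTION: fixed point with six fractional bits
    print("int8 codes:", quantize_int8([0.5, -1.0, 2.0], scale))
    print("quantized dot:", quantized_dot([0.5, 0.25], [1.0, -0.5], scale, scale))
```

Values that fall outside the representable range saturate rather than wrap, and everything else is rounded to the nearest step, so the accuracy of a quantized network depends on choosing a scale that fits the data.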

Improvements to the 16kbyte to 64kbyte L1 cache include higher bandwidth and a larger 16-entry translation look-aside buffer (TLB). The L2 cache, now private to each DynamIQ core, is configurable in size up to 256kbytes and offers 50 percent lower latency, at about 6 cycles, compared with the shared L2 cache used by the Cortex-A53. There is also an improved L2 TLB with an increased size of 1,024 entries.

Cortex-A75

The Cortex-A75 is aimed at a broad set of applications, said Fred Piry, ARM Fellow and lead architect in the CPU group. It provides an upgrade from the Cortex-A73 for mobile and consumer applications and from the older Cortex-A72 for infrastructure and automotive applications such as ADAS and in-vehicle infotainment.

The key changes on the Cortex-A75 are that it is a three-way superscalar out-of-order machine, versus the two-way superscalar Cortex-A73, and that it has a private L2 cache compared to the shared L2 on the Cortex-A73. There is single-cycle decoding of instructions with instruction fusing and micro operations.

To cope with the increased superscalarity the instruction fetch has been widened to four instructions (from three) and the decoupling of fetch from decode has been enhanced with a deeper instruction queue.

Interestingly the branch predictor is a version of the Cortex-A73’s and not a neural network based implementation. “No neural network branch predictor. The A73’s was good enough,” said Piry.

As with the Cortex-A55, the Cortex-A75 supports dot product and half-precision floating point to better support artificial intelligence applications and machine learning processing. Its use of Virtualization Host Extensions (VHE) offers performance improvements for type-2 hypervisors such as the kernel-based virtual machine (KVM). Cache stashing and atomic operations improve multicore networking performance. The Cortex-A75 also sees the introduction of CPU activity monitoring for fine-grain thread control under power or thermal management.

A POP [processor optimization package] for the Cortex-A75 on TSMC’s 16FFC manufacturing process is already available and being used by early licensees. The Cortex-A75 and Cortex-A55 POP IP for TSMC 7FF will also be available by Q4 2017, according to ARM’s website.

Related links and articles:

www.arm.com

News articles:

ARM to boost processor performance by 50x with new AI instructions

Why ARM wants to do more

ARM’s soft launch for machine learning library

Kalray’s Coolidge processor adds deep learning acceleration

Intel cancels its developer forum