Intel with an old take on big.LITTLE for Alder Lake

Intel has revealed a dramatic change in its general-purpose chip architecture with a version of the big.LITTLE approach adopted by ARM a decade ago.
By Nick Flaherty


Intel’s next-generation desktop chip, code-named Alder Lake, is the company’s first hybrid architecture to integrate two core types – the Performance-core and Efficient-core. This is similar to ARM’s big.LITTLE approach, which used a small core optimised for low power consumption with lower performance alongside a larger, higher performance core. Both cores could run the same code depending on the context, avoiding the problems of having a scheduler to allocate tasks to multiple cores. This has traditionally been a limiting factor for the system-level performance of multicore chip designs.

Intel’s hybrid approach works at the level of threads, using a Thread Director. This is an improved scheduling technology that adds more monitoring of the core to determine the context. Intel hopes this increased monitoring, combined with the thread-based approach and three independent fabrics, will avoid a performance bottleneck.

The compute fabric can support up to 1,000 GBps (1 TBps), which is 100 GBps per core or per cluster, and connects the cores and graphics through the last-level cache to the memory. It has a high dynamic frequency range and can dynamically select the data path for latency versus bandwidth optimization based on actual fabric loads. It also dynamically adjusts the last-level cache policy to be inclusive or non-inclusive depending on the utilization.
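
As a rough sanity check on those figures, the per-core share can be divided into the aggregate. The sketch below takes the aggregate as 1,000 GBps and assumes every core or cluster stop on the fabric gets an equal 100 GBps share; the function name and the equal-share reading are illustrative assumptions, not part of Intel's disclosure.

```python
def fabric_stops(total_gbps: int, per_stop_gbps: int) -> int:
    """Number of core/cluster stops an aggregate bandwidth covers at an equal share each."""
    if per_stop_gbps <= 0:
        raise ValueError("per-stop bandwidth must be positive")
    return total_gbps // per_stop_gbps


# Figures quoted in the article; equal sharing between stops is an assumption.
COMPUTE_FABRIC_GBPS = 1000
PER_STOP_GBPS = 100

stops = fabric_stops(COMPUTE_FABRIC_GBPS, PER_STOP_GBPS)
print(f"{stops} core or cluster stops at {PER_STOP_GBPS} GBps each")
```

Under that reading the compute fabric would serve ten core or cluster stops at full rate, although the article does not state how many cores or clusters the chip has.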

The I/O fabric supports up to 64 GBps, connecting the different types of I/O as well as internal devices. It can change speed seamlessly without interfering with a device’s normal operation, selecting the fabric speed to match the required amount of data transfer.

The memory fabric can deliver up to 204 GBps of data and dynamically scale its bus width and speed to support multiple operating points for high bandwidth, low latency or low power.

These fabrics connect the different types of processor core, with scheduling guided by the Thread Director. This is built directly into the hardware and provides low-level telemetry on the state of the core and the instruction mix of the thread. Thread Director is dynamic and adaptive, adjusting scheduling decisions to real-time compute needs rather than using simple, static rules determined at compilation time. This allows the operating system to place the right thread on the right core at the right time.

Traditionally, the operating system would make decisions based on limited available statistics, such as whether a task is in the foreground or background. Thread Director uses the hardware telemetry to direct threads that require higher performance to the right Performance-core at that moment. Because it monitors the instruction mix, the state of the core and other relevant microarchitecture telemetry at a granular level, the operating system can make more intelligent scheduling decisions.
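
The idea of telemetry-guided placement can be illustrated with a toy model. In the sketch below, `ThreadHint`, its `p_core_speedup` field and `place_threads` are hypothetical names, and the single speedup number stands in for whatever the hardware actually reports; it is a simplified illustration, not Intel's algorithm or any operating system's scheduler.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ThreadHint:
    """Hypothetical per-thread telemetry: estimated speedup on a P-core vs an E-core."""
    name: str
    p_core_speedup: float  # e.g. 1.8 means 80 percent faster on a Performance-core


def place_threads(hints: list[ThreadHint], p_cores: int, e_cores: int) -> dict[str, str]:
    """Give Performance-cores to the threads that benefit most; the rest go to Efficient-cores."""
    ranked = sorted(hints, key=lambda h: h.p_core_speedup, reverse=True)
    placement: dict[str, str] = {}
    for i, hint in enumerate(ranked):
        if i < p_cores:
            placement[hint.name] = "P"
        elif i < p_cores + e_cores:
            placement[hint.name] = "E"
        else:
            placement[hint.name] = "wait"  # no free core in this toy model
    return placement


# Made-up workload names and speedup values, purely for illustration.
hints = [
    ThreadHint("game_render", 1.9),
    ThreadHint("file_indexer", 1.1),
    ThreadHint("video_encode", 1.6),
]
print(place_threads(hints, p_cores=2, e_cores=1))
```

With two Performance-cores and one Efficient-core available, the two threads with the largest estimated benefit take the Performance-cores and the third runs on the Efficient-core. A static foreground/background rule could not make that distinction between two foreground threads.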

The Windows ‘PowerThrottling’ API has also been extended with an EcoQoS classification that tells the scheduler when a thread prefers power efficiency, so that such threads are scheduled on Efficient-cores rather than Performance-cores.
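
The effect of such a classification can be sketched as a simple preference order. The `QoS` enum and `preferred_core` function below are hypothetical and do not call the actual API; the fallback to the other core type when the preferred one is busy is also an assumption.

```python
from enum import Enum


class QoS(Enum):
    HIGH = "high"  # thread wants maximum performance
    ECO = "eco"    # thread prefers power efficiency (an EcoQoS-style hint)


def preferred_core(qos: QoS, e_core_free: bool, p_core_free: bool) -> str:
    """Return the core type a toy scheduler would try first for this QoS hint."""
    # Efficiency-tagged threads try an Efficient-core first; others try a Performance-core.
    first, second = ("E", "P") if qos is QoS.ECO else ("P", "E")
    free = {"E": e_core_free, "P": p_core_free}
    if free[first]:
        return first
    if free[second]:
        return second
    return "none"  # nothing available in this toy model


print(preferred_core(QoS.ECO, e_core_free=True, p_core_free=True))
print(preferred_core(QoS.HIGH, e_core_free=True, p_core_free=True))
```

The point of the hint is that the application states its intent up front, so the scheduler does not have to infer from behaviour alone that a background task can tolerate a slower core.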

Efficient core

The Efficient-core microarchitecture, previously code-named “Gracemont,” is designed for throughput efficiency, enabling scalable multithreaded performance for modern multitasking. This is Intel’s most efficient x86 microarchitecture, with an aggressive silicon area target so that multicore workloads can scale out with the number of cores, and with a wide frequency range.

The core can run at a lower voltage to reduce overall power consumption, while creating the power headroom to operate at higher frequencies. This allows the Efficient-core to ramp up performance when needed.

The architecture includes a 5,000-entry branch target cache that results in more accurate branch prediction and a larger 64 kilobyte instruction cache to keep useful instructions close without expending memory subsystem power. It also has Intel’s first on-demand instruction length decoder, which generates pre-decode information.

A clustered out-of-order decoder enables decoding of up to six instructions per cycle while maintaining energy efficiency. This feeds a wide back end with five-wide allocation and eight-wide retire, a 256-entry out-of-order window and 17 execution ports.

This gives a 40 percent boost in single-thread performance over the previous Skylake CPU core at the same power, or the same performance while consuming less than 40 percent of the power. Four Efficient-cores offer 80 percent more performance while still consuming less power than two Skylake cores running four threads, or the same throughput performance while consuming 80 percent less power.
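
The throughput comparison can be restated as performance-per-watt ratios. The sketch below uses only the four-core figures quoted above, normalises the two Skylake cores to 1.0 for both performance and power, and treats "less power" as at most equal power, so the first result is a lower bound. The function name is an illustrative assumption.

```python
def perf_per_watt_gain(relative_perf: float, relative_power: float) -> float:
    """Performance-per-watt relative to a baseline normalised to 1.0."""
    if relative_power <= 0:
        raise ValueError("relative power must be positive")
    return relative_perf / relative_power


# Four Efficient-cores vs two Skylake cores running four threads (figures from the article).
# Case 1: 80 percent more performance at no more than the same power (a lower bound).
iso_power = perf_per_watt_gain(1.8, 1.0)
# Case 2: the same throughput at 80 percent less power, i.e. 20 percent of the baseline.
iso_perf = perf_per_watt_gain(1.0, 0.2)

print(f"at least {iso_power:.1f}x at equal power, {iso_perf:.1f}x at equal throughput")
```

The two cases describe different operating points on the same power curve, which is why the efficiency gain is much larger when the cores are run slower to match the baseline throughput than when they are pushed for extra performance.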

Performance core

The Performance-core, previously code-named “Golden Cove,” is designed for lower latency in instruction execution. It has six instruction decoders (up from four), an eight-wide micro-operation (µop) cache (up from six) and 12 execution ports (up from 10). This is supported by bigger physical register files and a deeper, 512-entry re-order buffer.

An improved branch prediction algorithm reduces the effective L1 latency, and the L2 cache gains full-write predictive bandwidth optimizations.

All of this gives a 19 percent improvement in performance across a wide range of workloads over the current 11th Gen Intel Core processor architecture (Cypress Cove).

Advanced Matrix Extensions have been added to boost AI performance further for deep learning inference and training. This includes dedicated hardware and a new instruction set architecture to perform matrix multiplication operations significantly faster, with lower latency and increased support for applications with large data and large code footprints.

SoC architecture

All of this comes together in a system-on-chip (SoC) architecture with three key design points.

The first is a maximum-performance, two-chip, socketed desktop design tuned for performance, power efficiency, memory and I/O.

The second is a high-performance mobile BGA package that adds imaging, larger Xe graphics and Thunderbolt 4 connectivity.

The third is a thin, lower-power, high-density package with optimized I/O and power delivery for ultra-mobile notebooks.
