How to overcome memory-imposed access rates and bandwidth constraints
More complex applications require external memory, and at today's processing rates they need the highest possible random access rate to that memory. Traditional memory interfaces burden performance with slow transfer speeds, long latency and high pin counts. As a result, conventional approaches to implementing external memory have already reached the point of diminishing returns.
Serial protocols & standards break the I/O bottleneck
Consider any modern System-on-Chip (SoC) available today and you will see that nearly all of its interfaces are serial, except for the one to traditional memory ICs. The transition to serial memory has already begun, and decisions need to be made about which serial interface protocols to support. Any interface can be delineated into its physical layer (PHY), transport protocol (PCS) and transaction layer (the command set). Standardisation can take place at each level independently.
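To make the layering concrete, here is a minimal sketch in C of how a link descriptor might keep the three levels separate so that one PHY can carry more than one transaction protocol. All the type and field names are hypothetical, for illustration only; they are not a vendor API.

/* Sketch only: the three independently standardised layers of a serial
 * memory interface, represented as separate fields of one descriptor. */

/* Physical layer: electrical/clocking definition shared by all protocols. */
typedef enum { PHY_CEI_11, PHY_CEI_25 } phy_standard_t;

/* Transaction layer: the command set differs per protocol (GCI, ILA, HMC)
 * while running over the same PHY. */
typedef enum { PROTO_GCI, PROTO_ILA, PROTO_HMC } protocol_t;

typedef struct {
    phy_standard_t phy;    /* physical layer (PHY)             */
    protocol_t     proto;  /* transaction layer / command set  */
    unsigned       lanes;  /* lane count negotiated by the PCS */
} serial_link_t;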
Regarding the serial PHY, the industry standards group, the Optical Internetworking Forum (OIF), published the Common Electrical I/O (CEI) standards, including CEI-11, in September 2011 [Ref. 1]. Standards development groups such as the OIF require three to five years to develop channel models, set clocking and jitter budgets, determine electrical signal coding, and encourage the development of the ecosystem. As a result, these standards are now being adopted for a broad range of applications.
In fact, three serial memory interface protocols have adopted the CEI-11 physical definition: the GigaChip Interface (GCI) [Ref. 2], Interlaken Look-Aside (ILA) [Ref. 3] and the Hybrid Memory Cube interface (HMC) [Ref. 4], as indicated in Figure 1. Design teams can expect these protocols also to conform to the CEI-25 standard [Ref. 5] in the future. Each of these protocols targets different applications and markets, as outlined in Table 1.
Designers therefore do not need to develop three different interface solutions to meet multiple use cases. Instead, host processors can incorporate two or more protocols running over the same physical layer. The interface need not be limited to memory but could also be used for general serial I/O, giving the system designer the ultimate flexibility in addressing a broad range of market applications.
Although it is possible to multiplex the protocols, a close look reveals distinct performance differences. Protocols leveraged from other applications carry unnecessary overhead and latency when used in point-to-point applications such as a high-performance memory interface. It may be necessary in the near term to include all three serial protocols on the SoC processor in order to support the performance capabilities of devices from different manufacturers. If a customer wants to consolidate to two interfaces, or just one, only GCI offers high efficiency for all the memory access patterns used on high-performance networking line cards.
Networking applications tend to have three types of memory access patterns, depending on the function being performed.
The first is a buffer application, where there is a fixed 1:1 ratio of reads to writes and low data persistence. A packet arrives and needs to be stored for a short time until it can be dispatched on the next leg of its journey. Depending on the end market, packet buffers might be implemented with or without error correction in the array. If for some reason a packet is corrupted, there is almost always an option inherent in networking to drop it, which triggers a retransmit from the origin. Packet buffering involves either high packet arrival rates for sizes in the sub-64B range, or long-lived ‘elephant flows’ of large or jumbo (9 kB) transmissions. Efficiency is paramount, but the ability to accommodate a wide range of packet sizes is also necessary. Figure 3 compares the efficiency of data transfer in a packet buffer application, including all the necessary command and transport overheads.
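As a rough illustration of why small packets stress efficiency, the following C snippet models transfer efficiency as payload divided by payload plus per-frame overhead. The 8-byte overhead figure is an assumption for illustration only; actual overheads are protocol-specific (see Figure 3).

#include <stdio.h>

/* Illustrative efficiency model for a packet-buffer transfer: payload
 * bytes divided by payload plus per-frame command/CRC/transport
 * overhead. The 8-byte overhead is an assumed figure. */
static double transfer_efficiency(unsigned payload_bytes, unsigned overhead_bytes)
{
    return (double)payload_bytes / (double)(payload_bytes + overhead_bytes);
}

int main(void)
{
    unsigned sizes[] = { 64, 256, 1500, 9000 };  /* sub-64B up to jumbo */
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("%5u B payload: %.1f%% efficient\n",
               sizes[i], 100.0 * transfer_efficiency(sizes[i], 8));
    return 0;
}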
The second significant use of memory is in lookup applications, where tables are written and updated infrequently but read at a very high random rate with a small access size, on the order of 4 to 8 bytes. Packet header processing can require multiple lookups for each packet. Data persistence is long-lived, and any disruption caused by corruption would interrupt the flow of traffic, so in most cases lookup tables are implemented with an error correction code (ECC) to protect the memory contents. The random access rate of DRAM-based memories is often slower than the packet arrival rate of a single 40G port. To emulate a faster lookup table, multiple copies of the table can be made and accessed in a ‘round-robin’ fashion. Table replication is reasonably efficient for two copies, but beyond that, efficiency declines quickly, limiting the technique's effectiveness. This underscores the need for high access rate memory devices in lookup applications.
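The replication trick is easy to see in code. Below is a minimal C sketch (names and sizes are illustrative) of round-robin reads across identical table copies; the write path shows why efficiency erodes as copies are added.

#include <stdint.h>

/* Sketch of 'round-robin' table replication: identical copies of a
 * lookup table are read in rotation so the aggregate random read rate
 * approaches NUM_COPIES times a single copy's rate. */
#define NUM_COPIES    2        /* beyond 2 copies, efficiency falls quickly */
#define TABLE_ENTRIES 1048576

static uint64_t table_copy[NUM_COPIES][TABLE_ENTRIES]; /* identical contents */
static unsigned next_copy;                             /* rotation state     */

uint64_t lookup(uint32_t index)
{
    unsigned copy = next_copy;
    next_copy = (next_copy + 1) % NUM_COPIES;  /* rotate to spread reads */
    return table_copy[copy][index % TABLE_ENTRIES];
}

void update(uint32_t index, uint64_t value)
{
    /* Writes are infrequent but must hit every copy: this is the cost
     * that erodes efficiency as NUM_COPIES grows. */
    for (unsigned c = 0; c < NUM_COPIES; c++)
        table_copy[c][index % TABLE_ENTRIES] = value;
}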
Considering only the serial interface for a table lookup application, the return (read) datapath is the bottleneck; therefore, maximising transport efficiency on the return path is paramount. Figure 4 illustrates the data return efficiency of the three protocols.
The third application in packet header processing requires true random read and write access. Traditionally, only SRAM could achieve this performance. The MoSys MSR720, with its Bank-Conflict Resolution logic, performs concurrent read and write accesses to any addresses while maintaining complete data coherency. These two functions, lookup and random access, require high efficiency and small data word transfers, which are addressed by the capabilities of the GigaChip Interface. As shown in Table 1, the CRC bits for GCI are implemented on a per-frame basis, minimising the overhead for small transfers. GCI supports the smallest payload size with the lowest overhead, ideal for small access transfers.
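The per-frame CRC advantage can be shown with simple arithmetic. This C snippet compares attaching a CRC to every small transfer against amortising one CRC over a frame of several transfers; the byte counts are assumptions for illustration, not published GCI figures.

#include <stdio.h>

/* Illustrative comparison (assumed numbers): one CRC amortised over a
 * multi-transfer frame versus a CRC on every small transfer. */
int main(void)
{
    const unsigned payload   = 8; /* bytes per small access             */
    const unsigned crc       = 2; /* assumed CRC bytes                  */
    const unsigned per_frame = 8; /* assumed transfers packed per frame */

    double per_transfer_eff = (double)payload / (payload + crc);
    double per_frame_eff    = (double)(per_frame * payload) /
                              (per_frame * payload + crc);

    printf("CRC per transfer: %.1f%%\n", 100.0 * per_transfer_eff);
    printf("CRC per frame:    %.1f%%\n", 100.0 * per_frame_eff);
    return 0;
}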
The Biggest Savings Occur at Board Level
The case for converting memory to a serial interface has so far highlighted performance advantages; however, significant cost savings are realised as well. Quite simply, high-efficiency serial data transfer increases the bandwidth density per pin which, in turn, reduces pin count, board complexity and energy per bit transferred. Because serial interfaces consume less power, they also produce lower overall thermal output.
A serial interconnect reduces the number of signals to be routed on the board, which can then reduce the number of layers in the board stack-up. This also lowers copper consumption and can reduce board area, both resulting in significant cost reductions. Serial communications also allow longer interconnects, establishing thermal and mechanical separation from the host. In combination, the efficiency and performance gained from implementing serial communications result in more cost-effective designs.
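A back-of-envelope pin-count comparison makes the board-level saving tangible. Every figure in this C snippet is an assumption for illustration: a 36-bit parallel bus plus address/control pins at a DDR-class rate, versus differential CEI-11 lanes with an assumed 70% transport efficiency.

#include <stdio.h>

/* Back-of-envelope pin-count comparison at equal bandwidth.
 * All figures below are illustrative assumptions, not datasheet values. */
int main(void)
{
    /* Parallel: 36 data + ~30 address/control/clock pins at 1 Gbit/s/pin */
    double parallel_pins = 66.0;
    double parallel_gbps = 36.0 * 1.0;

    /* Serial: CEI-11 lanes at 10 Gbit/s, ~70% usable after transport
     * overhead; 2 pins per differential pair, TX and RX counted. */
    double lane_gbps   = 10.0 * 0.70;
    double lanes       = parallel_gbps / lane_gbps;
    double serial_pins = 2.0 * lanes * 2.0;

    printf("Parallel: %.0f pins for %.0f Gbit/s\n", parallel_pins, parallel_gbps);
    printf("Serial:   ~%.0f pins for the same bandwidth\n", serial_pins);
    return 0;
}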
Mythbusting
Unfortunately, serial solutions are plagued by unsubstantiated myths. The most common involve serialisation/deserialisation latency, power consumption and bit error rates. The emergence and adoption of serial interfaces for memory applications proves their viability. To further debunk these misconceptions, MoSys' second-generation Bandwidth Engine IC (BE-2) exhibits as little as 12 ns of read latency, comparable to the highest-performance Reduced Latency DRAM. A secondary effect is that latency increases the need for buffering, which in turn adds further latency, a vicious circle. The latency of both the memory and the interface therefore profoundly impacts the overall system design.
While traditional SRAM may have lower latency, the BE-2 can deliver continuous data return at an effective read and write rate many times higher than an SRAM. Internally, the Bandwidth Engine architecture is capable of 16 concurrent memory accesses, performance the host can only realise through a serial interface. By comparison, traditional SRAM, despite its low latency, is still limited in access rate and bandwidth by its parallel bus interface and usually lacks the capacity for multiple 100G links. In short, for efficient high-throughput networking applications, serial solutions provide the only scalable path forward for multi-100G systems, as shown in Figure 4 above. Of the three serial protocols, GCI demonstrates the best data return efficiency.
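The arithmetic behind that multiplication is straightforward. The snippet below shows how 16 concurrent accesses scale the effective rate; the per-partition cycle rate is an assumed figure for illustration, not a BE-2 datasheet value.

#include <stdio.h>

/* Illustrative arithmetic: internal parallelism multiplies the effective
 * access rate seen by the host over the serial link. */
int main(void)
{
    const unsigned partitions       = 16;    /* concurrent accesses (from text) */
    const double per_partition_mhz  = 375.0; /* assumed cycle rate, illustrative */

    printf("Aggregate: %.1f billion accesses/s\n",
           partitions * per_partition_mhz * 1e6 / 1e9);
    return 0;
}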
For a given manufacturing technology, higher performance increases power consumption. In this case, absolute power may be higher, but the power-to-performance ratio is lower. The high transport efficiency of the GCI protocol results in correspondingly high energy efficiency and provides the means to generate high throughput. Even SoCs built using multi-chip module technology with on-package memory can still benefit from high-efficiency serial interfaces. And as noted above, a serial-based solution lowers pin counts and reduces board complexity, area and cost.
As data rates rise and electrical signal levels shrink, the probability of an incorrectly sampled data bit increases, regardless of whether the interface is serial or parallel. For networking interfaces operating at high frequencies, even parallel interfaces need some form of error checking and handling. With their higher data rates and longer reach, serial interfaces have always included mechanisms to ensure data integrity. GCI includes automatic error recovery through a replay mechanism, as well as host error recovery for applications that are intolerant of network jitter. The low bit error rate of a CEI interface, combined with the error checking and recovery of GCI, results in a robust solution for carrier- and enterprise-class applications.
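To illustrate the general shape of a replay mechanism of this kind, here is a minimal C sketch: transmitted frames are held in a ring until acknowledged, and a CRC failure reported by the receiver triggers retransmission from the failed sequence number. The frame layout, buffer depth and function names are assumptions, not the actual GCI design.

#include <stdint.h>
#include <string.h>

/* Sketch of sequence-numbered replay-based error recovery. */
#define REPLAY_DEPTH 16  /* power of two, so the 8-bit seq maps cleanly */

typedef struct {
    uint8_t seq;
    uint8_t payload[8];
} frame_t;

static frame_t replay_buf[REPLAY_DEPTH];
static uint8_t next_seq;

void tx_frame(const uint8_t payload[8])
{
    /* Retain each frame until acknowledged, in case a replay is needed. */
    frame_t *f = &replay_buf[next_seq % REPLAY_DEPTH];
    f->seq = next_seq++;
    memcpy(f->payload, payload, 8);
    /* ...serialise and transmit f over the link... */
}

/* Called when the receiver reports a CRC failure on sequence 'bad_seq':
 * retransmit everything from that frame onwards, in order. */
void replay_from(uint8_t bad_seq)
{
    for (uint8_t s = bad_seq; s != next_seq; s++) {
        frame_t *f = &replay_buf[s % REPLAY_DEPTH];
        /* ...retransmit f... */
        (void)f;
    }
}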
Highest Performing Network Memory Solutions
MoSys’ Bandwidth Engine today fulfils a wide range of system design requirements. This high-performance, serial-attached, discrete IC, combined with corresponding processors, allows network solutions to perform at their optimum. Its internal parallel architecture enables up to six billion accesses per second, six times that of traditional networking memories, making it an ideal solution for packet header processing in networking infrastructure hardware. The device achieves this level of performance by using the serial I/O efficiency of the GCI protocol over the widest range of payload sizes.
Increasing System Reliability
In addition to the error-protected interface and the ECC-protected memory array, the Bandwidth Engine architecture includes intelligent error management. This capability enhances the quality and reliability of data transmissions for carrier- and enterprise-class networking equipment. Specifically, it pre-empts errors: the self-test and self-repair technology detects, removes and replaces storage locations that are weaker than the baseline population, reducing the risk of an uncorrectable multi-bit error. It performs background built-in self-test (BIST), memory scrubbing and memory sparing, and is capable of persistent self-repair.
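In the spirit of that scrub-and-spare behaviour, the following C sketch shows one step of a background scrub loop that counts correctable errors per row and spares out rows weaker than the baseline. The ECC and remap hooks are hypothetical stubs; real devices implement this in hardware.

#include <stdint.h>
#include <stdbool.h>

#define ROWS 4096

/* Hypothetical hardware hooks, stubbed so the sketch compiles. */
static bool ecc_check_correct(uint32_t row) { (void)row; return true; }
static void remap_to_spare(uint32_t row)    { (void)row; }

static uint32_t scrub_cursor;
static uint8_t  error_count[ROWS];

/* One step of the background scrub: read-verify the next row, count
 * correctable errors, and spare out rows weaker than the baseline. */
void scrub_step(void)
{
    uint32_t row = scrub_cursor++ % ROWS;

    if (!ecc_check_correct(row) && ++error_count[row] > 2) {
        /* Repeated correctable errors mark a weak row: replace it before
         * it can contribute to an uncorrectable multi-bit error. */
        remap_to_spare(row);
        error_count[row] = 0;
    }
}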
Supporting all three protocols simultaneously allows designers to implement designs with external components that scale to meet system price/performance objectives. As a result, design teams can realise a high degree of reuse, which reduces the overall design effort and silicon area, accelerates the design cycle, and gives the end product greater flexibility to address broader and larger end-user markets.
For high-throughput networking infrastructure hardware, serial chip-to-chip protocols provide the only scalable approach. Each of the three protocols (GCI, ILA and HMC) is optimised for different applications and use cases, but they can be combined on the processor for ultimate flexibility. The GigaChip Interface provides the finest granularity and is best suited to processing packet headers, but that does not preclude its use in applications with larger payloads. GCI delivers an optimum point-to-point solution and provides scalable performance for higher-rate links in the future. By comparison, traditional memory solutions have reached the point of diminishing returns in performance scalability.
References
1. The Optical Internetworking Forum, “Common Electrical I/O (CEI) – Electrical and Jitter Interoperability Agreements for 6G+ bps, 11G+ bps and 25G+ bps I/O,” September 2011, pp. 165-201. Available online at: https://www.oiforum.com/public/documents/OIF_CEI_03.0.pdf
2. The GigaChip Alliance, online at: https://www.gigachipalliance.com/
3. The Interlaken Alliance, online at: https://www.interlakenalliance.com/
4. The Hybrid Memory Cube Consortium, online at: https://www.hybridmemorycube.org/
5. Ibid., The CEI-25 standard, pp. 219-234.
Also see, by this author: “Serialised memory interface gains momentum”