Since ancient times, nature has been a source of inspiration for humanity. We have long sought to build systems that mimic the elegant and efficient mechanisms nature has evolved to solve the challenges posed by the world we live in. Applications of this approach appear in a wide variety of domains, from pharmacology to transportation.
In recent years we have been witnessing a renaissance in the field of 'deep learning', a domain that ultimately aims to enable a level of reasoning and intelligence resembling human behavior. As the organ regarded as the seat of wisdom and intelligence, the brain is the natural subject to explore on the journey toward this goal.
Distilled mathematical formulations in the form of artificial neural networks (ANNs) are developed alongside physical devices able to run those networks efficiently. Although comparisons between computers and the human brain are commonplace, their underlying structures are quite distinct. One easily spotted property of neural networks is their cellular nature. The structure of the basic 'cell' is therefore among the most thoroughly explored aspects, for the obvious reason that it is repeated many times; hence the importance of its efficiency. That is the focus of this short article.
A taste of theory
The underpinning of an ANN is a huge collection of elements called neurons, typically arranged in heavily interconnected bundles. Briefly described, a neuron is a cell characterized by multiple inputs and a single output. The output of the cell is a direct function of its inputs, each of which receives a different amount of 'attention' in its overall contribution to the output. This level of 'attention' is usually referred to as a weight. Additionally, the output may carry a thresholding effect, generating a response only once the neuron has crossed the threshold, also known as 'firing'. The connected inputs of neurons down the line get 'excited' by a firing neuron, and the process carries on throughout the network to reach an eventual output.
Figure 1: The neuron biological inspiration (left) and its artificial, conceptual equivalent (right). The dendrites serve as the inputs; the axon is the output and the aggregation takes place within the cell.
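As a concrete illustration, the conceptual neuron of Figure 1 can be sketched in a few lines of Python. The weights, threshold and binary step activation below are illustrative choices, not values taken from the article:

```python
def neuron(inputs, weights, threshold=0.0):
    """Weighted sum of inputs followed by a thresholding effect.

    The neuron 'fires' (outputs 1) only once the weighted sum
    crosses the threshold; otherwise it stays silent (outputs 0).
    """
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation > threshold else 0

# Example: two inputs, the second receiving more 'attention' (weight)
out = neuron([0.5, 0.9], weights=[0.2, 1.0], threshold=0.5)
```

In a network, a firing neuron's output would feed the inputs of the neurons connected downstream, exciting them in turn.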
When defining the equivalent model, the most common approach is a weighted sum with a non-linearity applied to the output. This approach captures the essence of the concept in a simple and meaningful manner. However, in attempts to capture finer aspects of the biological behavior, more complex models are sought. These reflect additional properties that may yield a more complete description of the neuron and, for practical reasons, may offer implementation alternatives that overcome some performance barriers inherent to the basic representation.
Options to model neuron behavior involve time-domain, frequency-domain and amplitude-domain representations, each of which can be expressed in closed mathematical form as described below. The straightforward discrete model represents the neuron as a weighted sum of the inputs (figure 2a). A pulsed version uses pulse trains to represent activity, with their temporal rate determining the level of excitation; this is the closest representation of nerve-cell activity in the human body (figure 2b). Finally, there is a continuous representation (figure 2c).
Figure 2: Mathematical representation of a) discrete (b) pulsed and (c) continuous models.
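To make the contrast between the first two models concrete, here is a minimal sketch (illustrative, not taken from the article) of the discrete model alongside a rate-coded 'pulsed' reading of the same quantity, where a level is carried by the number of constant-amplitude pulses in a time window:

```python
def discrete_neuron(xs, ws):
    """Discrete model: the output is the weighted sum of the inputs."""
    return sum(x * w for x, w in zip(xs, ws))

def rate_encode(level, window=100):
    """Pulsed model: encode a level in [0, 1] as a train of
    constant-amplitude pulses; the pulse rate carries the level."""
    n_pulses = round(level * window)
    return [1] * n_pulses + [0] * (window - n_pulses)

def rate_decode(train):
    """Recover the level from the temporal rate of the pulse train."""
    return sum(train) / len(train)

level = rate_decode(rate_encode(0.25))  # recovers ~0.25
```

The continuous model (figure 2c) would replace the sums above with integrals over continuous-time signals.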
Implementation: analog and digital
Any method of neuron implementation needs to address two basic aspects, namely (i) processing, the part that computes the output from the inputs and weights, and (ii) data transfer, the part that takes care of data delivery and storage.
While digital implementation is more common in modern, large-scale IC design, recent approaches involve analog implementations. A digital realization of a neuron is based on a multiply-and-accumulate circuit. Each operation reads an input and a weight and updates an intermediate result, and the procedure is repeated multiple times. Once the summation ends, a nonlinearity is applied to the accumulated value, which is then rendered as the neuron output. A result is thus available once every N cycles and should be stored thereafter.
Figure 3: Digital building block
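The multiply-and-accumulate procedure can be sketched as a simple loop; one iteration corresponds to one hardware cycle. The ReLU nonlinearity here is an illustrative choice, as the article does not prescribe a particular one:

```python
def relu(x):
    # Example nonlinearity; any thresholding function could be used.
    return max(0, x)

def mac_neuron(inputs, weights, nonlinearity=relu):
    """Digital neuron: one multiply-and-accumulate per cycle.

    Each 'cycle' reads one input and one weight and updates the
    running sum; after N cycles the nonlinearity is applied and the
    result is ready to be stored.
    """
    acc = 0
    for x, w in zip(inputs, weights):  # N cycles for N inputs
        acc += x * w                   # one multiply-and-accumulate
    return nonlinearity(acc)

y = mac_neuron([1, -2, 3], [0.5, 0.5, 0.5])
```

Note that the serial nature of the loop is exactly why a result only becomes available once every N cycles in the hardware realization.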
Analog implementations leverage the continuous nature of signals to express the sum as some physical quantity (e.g. a sum of voltage potentials or a sum of currents), yielding a continuous signal that is free of finite word-length representation issues.
Figure 4: Analog building block (continuous operation)
Another variant of an analog circuit is a spiking-based circuit that leverages the concept of a pulse train of constant amplitude; the excitation level is rate-dependent in this case. This concept is the one that most resembles brain neuronal activity.
Figure 5: analog building block (spiking operation)
In the analog case, data storage is a non-trivial challenge. It can be addressed by translation to the digital domain, implying the need for some kind of analog-to-digital conversion when storing data and digital-to-analog conversion when fetching it. Alternatively, the output can directly feed the next stage, avoiding storage altogether. The latter approach is highly efficient provided that the design can support the needed bandwidth; some capacitance may be applied to allow bandwidth control if needed. (Note: the implementation diagrams in figures 3, 4 and 5 show one option for implementing each of the aforementioned approaches and do not carry all implementation details.)
When looking into the performance of the various methods, it is clear that while digital solutions are well established, they are limited by the barriers of CMOS technology: a transistor-level threshold voltage of ~0.4V, a process-dependent maximum clock frequency for standard cells of less than 3GHz, and duty-cycle limitations. This sets a lower bound for the processing node of roughly ~100fJ for a single 8-bit multiply-and-add operation.
Analog circuitry, in contrast, is theoretically bounded by thermal noise at roughly 0.01fJ, four orders of magnitude lower than the digital option; hence the interest in building circuits based on an analog compute fabric. Yet practical deployment is challenged by various issues, such as delivering data into a large array of compute elements as described, parasitic effects related to their connectivity, storing the output efficiently, and finally the ability to translate into large-scale design flows and mass-production techniques. In practice, the reported information indicates achievable energy for the compute element in the ballpark of 1 to 10fJ. In these implementations the compute-element energy indeed becomes negligible; however, the overall energy is largely dominated by the surrounding circuitry and storage elements. All in all, a practical efficiency gain of 10x to 100x over digital-based building blocks is achievable at small scale but rapidly drops away when scaling up the number of elements.
M. Horowitz, "Computing's Energy Problem (and what we can do about it)," ISSCC, 2014
Bavandpour et al., "Energy-Efficient Time-Domain Vector-by-Matrix Multiplier for Neurocomputing and Beyond," 2017
Figure 6: Relevant domain of operation illustration
Figure 6 is a qualitative description of the different approaches. The efficiency loss of analog circuits is primarily implementation loss (i.e. the detector circuit has internal noise of its own, which degrades the signal-to-noise ratio and requires extra margin); a spiking approach has a lower detection threshold in this respect. When analog solutions are scaled up, noise coupling is observed, an effect that grows with solution scale and is more dominant in continuous approaches; digital approaches suffer less from this coupling effect. The remaining energy gap between analog and digital is attributed to the higher voltage levels and operating frequencies of the digital case.
Practically, large-scale circuit design has matured over the last few decades, and the industry experience acquired along the way cannot be easily dismissed. The combination of the scalability issue and productization aspects therefore largely limits the ability of analog-based solutions to become the dominant approach to the general problem. Furthermore, at the system level, the secondary contributors cannot be overlooked: once the compute element's contribution is lowered to a reasonable level, further improvement of it becomes less important.
Thus far this discussion has been devoted to the building-block level, but an analysis that overlooks the rest of the system is incomplete. A system-level analysis should account for all contributors and recognize that, at a certain point, the improvement factor of the basic processing becomes negligible. Such is the case with the energy distribution. To date, state-of-the-art solutions struggle to achieve 0.1 to 1 TOPS/W when running machine-learning tasks, equivalent to 1 to 10pJ per operation. As mentioned earlier, since a digital implementation of a neuron plateaus at ~0.1pJ, 90 to 99 percent of the energy still lies in other domains, which include memory elements, the control fabric and the bus architecture. Therefore, to harness the potential, an architecture overhaul is of the essence; the energy recovered by a transition to an analog solution alone is upper-bounded by roughly 10 percent of the total energy consumed.
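The back-of-the-envelope arithmetic above can be checked directly, using only the figures quoted in the text:

```python
def pj_per_op(tops_per_watt):
    """Convert throughput efficiency (TOPS/W) to energy per operation (pJ).

    1 TOPS/W = 1e12 ops per joule, so energy/op = 1 / (1e12 * TOPS/W)
    joules, expressed here in picojoules.
    """
    ops_per_joule = tops_per_watt * 1e12
    return 1 / ops_per_joule / 1e-12

COMPUTE_PJ = 0.1  # digital MAC plateau (~100 fJ) quoted in the text

# 0.1 TOPS/W -> 10 pJ/op, so 1 - 0.1/10 = 99% lies outside the compute
# element; 1 TOPS/W -> 1 pJ/op, so 1 - 0.1/1 = 90% lies outside it.
overheads = {t: 1 - COMPUTE_PJ / pj_per_op(t) for t in (0.1, 1.0)}
```

This confirms the 90-to-99-percent range: even eliminating the compute element's energy entirely recovers at most about a tenth of the total budget.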
The following table captures some of the key properties of the various approaches and summarizes most of the items that were mentioned briefly above.
Table 1: Comparison of analog- and digital-based neural networks
To summarize, the vibrant nature of the machine-learning field will bring about new and interesting technologies that mature over generations to address various market needs. Analog solutions appear to open up significant potential in a subset of the overall field of neural-network compute engines. Once more established, they may well become a complementary element in a variety of neural compute solutions and address some challenging use cases. Nonetheless, it is hard to foresee analog-based solutions becoming dominant in this field, given their limited scalability, their technology-node sensitivity, and the fact that they are relevant to a relatively limited subset of applications, while digital solutions offer a valid alternative that is flexible, relatively easy to implement and good enough to meet many product needs.
Avi Baum, is co-founder and chief technology officer at Hailo Technologies Ltd. and a former CTO at Texas Instruments’ wireless technology group. Hailo, founded in 2017, is developing a processor architecture to accelerate neural network processing on edge devices that could be installed in autonomous vehicles, drones and smart home appliances such as personal assistants, smart cameras and smart TVs.