
New Multiplier-Accumulator accelerates polynomials
Classical MAC architecture
The classical MAC unit computes the product of two numbers and adds the product to an accumulator (Z), thus implementing the following operation:
Z ← Z + (A × B)
implemented as:
Figure 1. Classical MAC architecture
This is a general-purpose architecture widely used to solve algorithms that can be converted into a sum of multiplications (e.g., filters, polynomial computation).
Polynomial computation and Horner’s method
Being that every real function can be approximated by a polynomial of sufficient degree (think of a Taylor series), such a computation should be considered whenever function evaluation has to be performed. To this aim, let’s consider Horner’s method, which represents the best known algorithm to compute polynomials, requiring a minimum number of arithmetic operations.
Given the real polynomial:
p(x) = an·xn + an-1·xn-1 + … + a1·x + a0
Evaluating at x using Horner’s method, we proceed through the following steps:
zn = an
zn-1 = zn·x + an-1
…
z0 = z1·x + a0 = an·xn + an-1·xn-1 +…+ a1·x + a0 = p(x)
For polynomial degree n, calculation of p(x) requires only n additions and n multiplications (or n multiply-adds in Horner’s steps).
Using the standard MAC, such an algorithm could be implemented by the following assembly code:
; – – – – – – – – – – – – – – – – – – – – – – – – – –
; – – – polynomial computation with classic mac – – –
; – – – – – – – – – – – – – – – – – – – – – – – – – –
;
; – – – – – – – – – – initialize – – – – – – – – – –
;
; calculation: z = 1 * an + 0
clr z ; clear accumulator
mov b, #1 ; load operand B with 1
mov a, #an ; load operand A with an coefficient
mov ctr, #mac ; program multiply-accumulate
nop ; wait for calculation
… ; wait for calculation
nop ; z = an
;
; – – – – – – replicate with i = 1,..,n- – – – – – –
; calculation: z = zn-i+1*x + an-i in two steps
; 1) z = zn-i+1*x + 0 = zn-i+1*x
; 2) z = an-i * 1 + z’ = zn-i+1*x + an-i
mov a, z ; load operand A with zn-i+1 coefficient
mov b, #x ; load operand B with x
clr z ; clear accumulator
mov ctr, #mac ; program multiply-accumulate
nop ; wait for calculation
… ; wait for calculation
nop ; z = zn-i+1*x
mov a, #an-i ; load operand A with an-i
mov b, #1 ; load operand B with 1
mov ctr, #mac ; program multiply-accumulate
nop ; wait for calculation
… ; wait for calculation
nop ; z = zn-i+1*x + an-i
;
; – – – – – – – – – read result – – – – – – – – – – –
;
mov r0, z ; read result from accumulator
Here, we’ve assumed the MAC unit is equipped with a control register (ctr), which enables the accumulator to be updated according to the result of the multiply-accumulate operation. Nonetheless, since arithmetic operations usually include multiple pipeline stages (i.e., to reduce power consumption), we have also assumed that the core must wait for a corresponding number of clock cycles before having access to the result.
Let us define Ncalc as the number of cycles required for the calculation of the multiply-accumulate output starting from (and also including) access to the relevant control register. Moreover, we may consider that loading the operand registers could also be performed in multiple steps. As an example, the 8-bit 8051 would take 8 cycles to load/read a 64-bit double-precision floating-point operand/result, thus requiring each mov instruction to be actually replaced with a suitable set of 8 mov steps; thus, we define Nload as the number of instructions required by a core to wholly access a register holding a floating-point number of a given precision.
Defining Nmac-poly as the overall number of cycles required by the processor to evaluate an nth-degree polynomial with the standard MAC architecture, we have:
Nmac-poly(n) = 1 + 3 • Nload + Ncalc + n • (1 + 4 • Nload + 2 • Ncalc)
New MAC architecture
Here is our new MAC architecture:
Figure 2. New MAC architecture (patent pending)
The following differences will be seen with respect to a classical MAC implementation:
– An additional register for operand ‘C’ is introduced
– The possibility to store intermediate results for both multiplications and additions (‘M’ and ‘Z’ registers respectively)
– Multiplexing logic is defined to feed the operator’s inputs
All of these features deal with enhanced flexibility for the MAC unit, providing the definition of a set of 13 arithmetic operations that can be executed by SW atomically:
Addition |
Z ← B + C |
Subtraction |
Z ← B – C |
Recursive addition |
Z ← Z + C |
Recursive subtraction |
Z ← Z – C |
M-addition |
Z ← M + C |
M-subtraction |
Z ← M – C |
Multiplication |
M ← A × B |
Recursive multiplication |
M ← M × B |
Z-multiplication |
M ← Z × A |
Multiplication and addition |
Z ← A × B + C |
Multiplication and subtraction |
Z ← A × B – C |
Product accumulation |
Z ← Z + A × B |
Horner’s operation |
Z ← Z × A + C |
Table 1. New MAC unit’s operations
This MAC can be effectively used to compute polynomials. This comes from the support of Horner’s operation, for which operators are fed with inputs coming from the green branches of Figure 2. In more detail, the aforementioned steps defined to evaluate p(x) according to Horner’s method can be translated to software through the following procedure:
; – – – – – – – – – – – – – – – – – – – – – – – – –
; – – polynomial computation with proposed mac – – –
; – – – – – – – – – – – – – – – – – – – – – – – – –
;
; – – – – – – – – – – initialize – – – – – – – – – –
;
; calculation: z = x * an + an-1
mov a, #x ; load operand A with x
mov b, #an ; load operand B with an coefficient
mov c, #an-1 ; load operand C with an-1 coefficient
mov ctr, #mad ; program multiply-and-add operation
nop ; wait for calculation
… ; wait for calculation
nop ; z = an*x+ an-1
;
;- – – – – – – replicate with i = 1,..,n -1 – – – – –
;
; calculation: z = x * z’ + an-1-i
mov c, #an-1-i ; load operand C with an-1-i coefficient
mov ctr, #hor ; program Horner’s step computation
nop ; wait for calculation
… ; wait for calculation
nop ; z = zn-i*x + an-1-i
;
;- – – – – – – – – – read result – – – – – – – – – – –
;
mov r0, z ; read result from accumulator
Defining Nmac-poly(n) as the overall number of cycles required by the processor to evaluate an nth-degree polynomial with the new MAC architecture:
Nnew-mac-poly(n) = 3 • Nload + n • (Nload + Ncalc)
We may now compare Nnew-mac-poly and Nmac-poly metrics as a ratio:
Considering that in most applications the polynomial’s degree is greater than 10, we can assume the Rpoly ratio as an invariant with respect to the degree, thus obtaining:
Rpoly permits us to measure the performance enhancement provided by our new MAC architecture in regard to polynomial computation. As shown in Table 2, Rpoly is always greater than two: this means that our solution at least halves the number of computational steps to evaluate polynomials compared to a classical MAC architecture.
Rpoly |
Nload |
||||
1 |
2 |
4 |
8 |
||
Ncalc |
1 |
3 |
3.33 |
3.6 |
3.78 |
2 |
2.67 |
3 |
3.33 |
3.6 |
|
3 |
2.5 |
2.8 |
3.14 |
3.45 |
|
4 |
2.4 |
2.67 |
3 |
3.33 |
|
5 |
2.33 |
2.57 |
2.89 |
3.23 |
|
6 |
2.29 |
2.5 |
2.8 |
3.14 |
|
7 |
2.25 |
2.44 |
2.73 |
3.07 |
|
8 |
2.22 |
2.4 |
2.67 |
3 |
Table 2. Values of Rpoly as a function of a given set of Nload and Ncalc metrics.
We can also note that Rpoly increases with Nload, meaning that performance improvement is even magnified in case of small processors handling big floating-point numbers. In our specific case, we instantiated an IEEE 754 single precision (32 bits) MAC architecture under the ESFR space of an M8051EW microprocessor, thus obtaining Nload = 4; moreover, our MAC datapath included six pipeline stages, thus having Ncalc = 7. As a consequence, our implementation provided us an Rpoly performance enhancement of 2.73.
Conclusions
In this article, we have described an innovative MAC architecture. With a slight increase in logic, our solution offers distinct advantages. It is efficient for the evaluation of polynomials, permitting us to at least halve the number of required computational steps with respect to a common MAC architecture. Also, our solution exhibits much flexibility, with the support of a large set of arithmetic operations able to be executed by software atomically.
About the Authors
Samuele Raffaelli is a digital IC designer at Azcom Technology; David Vincenzoni is R&D Design Manager at STMicroelectronics, responsible for the design and verification of new chips for Broadband Power Line Modems and for new families of devices for industrial applications.
