MENU

New Multiplier-Accumulator accelerates polynomials

New Multiplier-Accumulator accelerates polynomials

Feature articles |
By Graham Prophet



Classical MAC architecture

The classical MAC unit computes the product of two numbers and adds the product to an accumulator (Z), thus implementing the following operation:

 

Z ← Z + (A × B)

 

implemented as:

 

 

Figure 1. Classical MAC architecture

 

This is a general-purpose architecture widely used to solve algorithms that can be converted into a sum of multiplications (e.g., filters, polynomial computation).

 

Polynomial computation and Horner’s method

Being that every real function can be approximated by a polynomial of sufficient degree (think of a Taylor series), such a computation should be considered whenever function evaluation has to be performed. To this aim, let’s consider Horner’s method, which represents the best known algorithm to compute polynomials, requiring a minimum number of arithmetic operations.

 

Given the real polynomial:

 

p(x) = an·xn + an-1·xn-1 + … + a1·x + a0

 

Evaluating at x using Horner’s method, we proceed through the following steps:

 

zn = an

zn-1 = zn·x + an-1

z0 = z1·x + a0 = an·xn + an-1·xn-1 +…+ a1·x + a0 = p(x)

 

For polynomial degree n, calculation of p(x) requires only n additions and n multiplications (or n multiply-adds in Horner’s steps).


Using the standard MAC, such an algorithm could be implemented by the following assembly code:

 

; – – – – – – – – – – – – – – – – – – – – – – – – – –

; – – – polynomial computation with classic mac – – –

; – – – – – – – – – – – – – – – – – – – – – – – – – –

;

; – – – – – – – – – – initialize – – – – – – – – – –

;

; calculation: z = 1 * an + 0

clr z ; clear accumulator

mov b, #1 ; load operand B with 1

mov a, #an ; load operand A with an coefficient

mov ctr, #mac ; program multiply-accumulate

nop ; wait for calculation

; wait for calculation

nop ; z = an

;

; – – – – – – replicate with i = 1,..,n- – – – – – –

; calculation: z = zn-i+1*x + an-i in two steps

; 1) z = zn-i+1*x + 0 = zn-i+1*x

; 2) z = an-i * 1 + z’ = zn-i+1*x + an-i

mov a, z ; load operand A with zn-i+1 coefficient

mov b, #x ; load operand B with x

clr z ; clear accumulator

mov ctr, #mac ; program multiply-accumulate

nop ; wait for calculation

; wait for calculation

nop ; z = zn-i+1*x

mov a, #an-i ; load operand A with an-i

mov b, #1 ; load operand B with 1

mov ctr, #mac ; program multiply-accumulate

nop ; wait for calculation

; wait for calculation

nop ; z = zn-i+1*x + an-i

;

; – – – – – – – – – read result – – – – – – – – – – –

;

mov r0, z ; read result from accumulator

 

Here, we’ve assumed the MAC unit is equipped with a control register (ctr), which enables the accumulator to be updated according to the result of the multiply-accumulate operation. Nonetheless, since arithmetic operations usually include multiple pipeline stages (i.e., to reduce power consumption), we have also assumed that the core must wait for a corresponding number of clock cycles before having access to the result.

 

Let us define Ncalc as the number of cycles required for the calculation of the multiply-accumulate output starting from (and also including) access to the relevant control register. Moreover, we may consider that loading the operand registers could also be performed in multiple steps. As an example, the 8-bit 8051 would take 8 cycles to load/read a 64-bit double-precision floating-point operand/result, thus requiring each mov instruction to be actually replaced with a suitable set of 8 mov steps; thus, we define Nload as the number of instructions required by a core to wholly access a register holding a floating-point number of a given precision.

Defining Nmac-poly as the overall number of cycles required by the processor to evaluate an nth-degree polynomial with the standard MAC architecture, we have:

 

Nmac-poly(n) = 1 + 3 • Nload + Ncalc + n • (1 + 4 • Nload + 2 • Ncalc)


New MAC architecture

Here is our new MAC architecture:

 

 

Figure 2. New MAC architecture (patent pending)

 

The following differences will be seen with respect to a classical MAC implementation:

 

– An additional register for operand ‘C’ is introduced

– The possibility to store intermediate results for both multiplications and additions (‘M’ and ‘Z’ registers respectively)

– Multiplexing logic is defined to feed the operator’s inputs

 

All of these features deal with enhanced flexibility for the MAC unit, providing the definition of a set of 13 arithmetic operations that can be executed by SW atomically:

 

 

Addition

Z  ← B + C

Subtraction

Z  ← B – C

Recursive addition

Z  ← Z + C

Recursive subtraction

Z  ← Z – C

M-addition

Z  ← M + C

M-subtraction

Z  ← M – C

Multiplication

M ← A × B

Recursive multiplication

M ← M × B

Z-multiplication

M ← Z × A

Multiplication and addition

Z  ← A × B + C

Multiplication and subtraction

Z  ← A × B – C

Product accumulation

Z  ← Z + A × B

Horner’s operation

Z  ← Z × A + C

 

 

Table 1. New MAC unit’s operations


This MAC can be effectively used to compute polynomials. This comes from the support of Horner’s operation, for which operators are fed with inputs coming from the green branches of Figure 2. In more detail, the aforementioned steps defined to evaluate p(x) according to Horner’s method can be translated to software through the following procedure:

 

; – – – – – – – – – – – – – – – – – – – – – – – – –

; – – polynomial computation with proposed mac – – –

; – – – – – – – – – – – – – – – – – – – – – – – – –

;

; – – – – – – – – – – initialize – – – – – – – – – –

;

; calculation: z = x * an + an-1

mov a, #x ; load operand A with x

mov b, #an ; load operand B with an coefficient

mov c, #an-1 ; load operand C with an-1 coefficient

mov ctr, #mad ; program multiply-and-add operation

nop ; wait for calculation

; wait for calculation

nop ; z = an*x+ an-1

;

;- – – – – – – replicate with i = 1,..,n -1 – – – – –

;

; calculation: z = x * z’ + an-1-i

mov c, #an-1-i ; load operand C with an-1-i coefficient

mov ctr, #hor ; program Horner’s step computation

nop ; wait for calculation

; wait for calculation

nop ; z = zn-i*x + an-1-i

;

;- – – – – – – – – – read result – – – – – – – – – – –

;

mov r0, z ; read result from accumulator

 

Defining Nmac-poly(n) as the overall number of cycles required by the processor to evaluate an nth-degree polynomial with the new MAC architecture:

 

Nnew-mac-poly(n) = 3 • Nload + n • (Nload + Ncalc)

 

We may now compare Nnew-mac-poly and Nmac-poly metrics as a ratio:

 

 

Considering that in most applications the polynomial’s degree is greater than 10, we can assume the Rpoly ratio as an invariant with respect to the degree, thus obtaining:

 

 

Rpoly permits us to measure the performance enhancement provided by our new MAC architecture in regard to polynomial computation. As shown in Table 2, Rpoly is always greater than two: this means that our solution at least halves the number of computational steps to evaluate polynomials compared to a classical MAC architecture.

 

 

Rpoly 

Nload

1

2

4

8

Ncalc

1

3

3.33

3.6

3.78

2

2.67

3

3.33

3.6

3

2.5

2.8

3.14

3.45

4

2.4

2.67

3

3.33

5

2.33

2.57

2.89

3.23

6

2.29

2.5

2.8

3.14

7

2.25

2.44

2.73

3.07

8

2.22

2.4

2.67

3

 

Table 2. Values of Rpoly as a function of a given set of Nload and Ncalc metrics.

 

We can also note that Rpoly increases with Nload, meaning that performance improvement is even magnified in case of small processors handling big floating-point numbers. In our specific case, we instantiated an IEEE 754 single precision (32 bits) MAC architecture under the ESFR space of an M8051EW microprocessor, thus obtaining Nload = 4; moreover, our MAC datapath included six pipeline stages, thus having Ncalc = 7. As a consequence, our implementation provided us an Rpoly performance enhancement of 2.73.

 

Conclusions

In this article, we have described an innovative MAC architecture. With a slight increase in logic, our solution offers distinct advantages. It is efficient for the evaluation of polynomials, permitting us to at least halve the number of required computational steps with respect to a common MAC architecture. Also, our solution exhibits much flexibility, with the support of a large set of arithmetic operations able to be executed by software atomically.

 

About the Authors

Samuele Raffaelli is a digital IC designer at Azcom Technology; David Vincenzoni is R&D Design Manager at STMicroelectronics, responsible for the design and verification of new chips for Broadband Power Line Modems and for new families of devices for industrial applications.

 

 

 

 

 

If you enjoyed this article, you will like the following ones: don't miss them by subscribing to :    eeNews on Google News

Share:

Linked Articles
10s