4.6. FMS Strassen and Winograd Algorithms

The FMS parameter IALGOR directs FMS to use alternative algorithms when performing matrix multiplies during factoring, solving and matrix-vector multiply. These algorithms are designed to reduce the number of floating point operations performed, thereby providing a faster solution. The following two algorithms are available:

Strassen's Algorithm
This algorithm modifies the traditional matrix multiply
```
   [C] = [A][B]
```
by first bisecting each of the matrices [A] and [B] into 4 equal parts. By appropriately adding, subtracting and multiplying the resulting quadrants of [A] and [B], it is possible to compute the matrix [C] with 7/8 of the multiplies required by the traditional technique. Because this algorithm is recursive, the matrices may be bisected several times, with each bisection reducing the remaining work by another factor of 7/8.
Ultimately, however, the matrices become too small and the overhead of adding and subtracting the quadrants exceeds the 7/8 benefit. The FMS parameter MINDIM is used to terminate this process by specifying the minimum dimension of a matrix quadrant at or below which traditional techniques are used.
Winograd's Algorithm (Complex data only)
(also called 3M Complex multiply)
This algorithm performs complex matrix multiply using 3 instead of the traditional 4 multiply and add operations. In the traditional case, when two complex numbers A and B are multiplied and added to C, 4 multiplies and 4 adds are required as follows:
```
   (Re_C + Im_C i) = (Re_A + Im_A i) * (Re_B + Im_B i)

   P1 = Re_A*Re_B
   P2 = Re_A*Im_B
   P3 = Im_A*Re_B
   P4 = Im_A*Im_B

   Re_C = Re_C + (P1 - P4)
   Im_C = Im_C + (P2 + P3)
```
In this reduced operation case, the same calculation is performed with 3 multiplies as follows:
```
   P1 = Re_A*(Re_B - Im_B)
   P2 = (Re_A + Im_A)*Re_B
   P3 = (Re_A - Im_A)*Im_B

   Re_C = Re_C + (P1 + P3)
   Im_C = Im_C + (P2 - P1)
```
At first it would appear that we have not gained anything because, while we have reduced the number of multiplies from 4 to 3, we have increased the number of adds from 4 to 7. Since most computers perform multiplies and adds at the same speed (and many concurrently), we have actually increased the number of cycles to perform the operation.
However when A, B and C become matrices [A], [B] and [C] the balance shifts. For matrices of size N by N, N square operations are required for addition while 2(N cube) operations are required for multiplication. For sufficiently large N, the extra additions become insignificant.

For complex data both these algorithms may be used. Winograd's algorithm is applied first, producing the three real matrices. Strassen's algorithm may then be applied to perform the real matrix multiplies.

There are, however, some downsides to these algorithms. First, because additions and subtractions are performed ahead of multiplications, some precision will be lost. The amount is problem dependent. You should always check your answer with the traditional technique, IALGOR=0.

Second, the adding and subtracting of the quadrants results in a more random memory access pattern than the traditional technique. For this reason, some of the benefit may be lost in cache thrashing.

Third, current floating point processors fuse the multiplier and adder so the output of the multiplier flows directly into the adder. This development has occurred after these algorithms were developed. When additions are performed, multiply cycles go unused.