There are many more matrix multiplication algorithms, some of them slightly more efficient than those presented here. Last but not least, we consider the implementation of Cannon's algorithm (1969).
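Cannon's algorithm uses a mesh of processes that are connected as a torus. Process $P_{i,j}$ at location $(i,j)$ initially begins with submatrices $A_{i,j}$ and $B_{i,j}$. As the algorithm progresses, the submatrices are passed left and upwards: the $A$ blocks move left along their row and the $B$ blocks move up along their column.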
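The block movement described above can be sketched as a sequential simulation of the torus. This is only an illustration under stated assumptions, not a message-passing implementation: the matrices are square lists of lists, the mesh is q x q with q dividing the matrix size, and the names `cannon_multiply`, `split_blocks`, `join_blocks`, `matmul_block` and `add_block` are hypothetical helpers introduced here. Each "send" is modelled by re-indexing the grid of blocks.

```python
def matmul_block(a, b):
    """Plain triple-loop product of two dense blocks (lists of lists)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add_block(c, d):
    """Element-wise sum of two blocks of equal shape."""
    return [[x + y for x, y in zip(rc, rd)] for rc, rd in zip(c, d)]


def split_blocks(m, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    s = len(m) // q
    return [[[row[j * s:(j + 1) * s] for row in m[i * s:(i + 1) * s]]
             for j in range(q)] for i in range(q)]


def join_blocks(blocks):
    """Reassemble a q x q grid of blocks into one matrix."""
    out = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            out.append([x for block in block_row for x in block[r]])
    return out


def cannon_multiply(a, b, q):
    """Simulate Cannon's algorithm on a q x q torus (assumes q divides n)."""
    n = len(a)
    assert n % q == 0, "mesh size must divide the matrix size"
    s = n // q
    A, B = split_blocks(a, q), split_blocks(b, q)

    # Alignment: row i of A moves left by i, column j of B moves up by j,
    # so position (i, j) holds A[i][k] and B[k][j] with k = (i + j) mod q.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]

    C = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]
    for step in range(q):
        # Every position multiplies the pair of blocks it currently holds.
        for i in range(q):
            for j in range(q):
                C[i][j] = add_block(C[i][j], matmul_block(A[i][j], B[i][j]))
        if step < q - 1:
            # Pass A one position left and B one position up (with wraparound).
            A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
            B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return join_blocks(C)
```

Calling `cannon_multiply(a, b, 2)` on two 4 x 4 matrices should give the same result as an ordinary product. In a real distributed implementation each re-indexing step would become a pair of neighbour exchanges on the torus.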
The information received by a process falls into three parts: the starting submatrices, the data received during the alignment phase, and the data received during the computation. The root process must then gather one result submatrix from each process to complete the result.
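To make the accounting concrete, the sketch below tallies the bytes under one set of assumptions that are not taken from the text: the matrices are n x n, the mesh is q x q with q dividing n, each element occupies `elem_bytes` bytes, the alignment is done as a single direct transfer per matrix, and the computation phase uses q - 1 shifts of each matrix. The function name `cannon_bytes_received` and its parameters are hypothetical, and a different implementation choice (for example, alignment by repeated single-step shifts) would change the figures.

```python
def cannon_bytes_received(n, q, elem_bytes=8):
    """Estimate bytes received per process under the assumptions stated above."""
    assert n % q == 0, "mesh size must divide the matrix size"
    block = (n // q) ** 2 * elem_bytes  # one (n/q) x (n/q) submatrix

    start = 2 * block               # starting A and B submatrices
    align = 2 * block               # at most one A and one B block (assumed direct shift)
    compute = 2 * (q - 1) * block   # one A and one B block per shift, q - 1 shifts (assumed)

    return {
        "start": start,
        "align_max": align,
        "compute": compute,
        "per_process_max": start + align + compute,
        "gather_per_process": block,  # one result block sent to the root by each process
    }
```

For example, with n = 8, q = 2 and 8-byte elements, a block is 128 bytes, so under these assumptions a process receives at most 256 + 256 + 256 = 768 bytes, and the root gathers 128 bytes from each process.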
Cannon's algorithm is suitable when the inputs of a matrix operation are generated locally and its outputs are used locally. In this case, the scatter and gather operations are not required.
New matrix multiplication algorithms continued to be published into the mid-1990s.