Matrix multiplication
Description of the problem
Let us consider a regular application multiplying 2 dense square n*n matrices X and Y.
Our mpC program will use a number of virtual processors, each of which computes a number of rows of the resulting matrix Z. Both dimension n of matrices and the number of virtual processors involved in computations are defined in run time. So, our application implements the following scheme.
The algorithm and program in mpC
Initializing X and Y on the virtual host–processor
Creating a network
Scattering rows of X over virtual processors of the network
Broadcasting Y over virtual processors of the network
Parallel computing submatrices of Z
Gathering the resulting matrix Z on virtual host-processor
The corresponding mpC program looks as follows:
nettype
SimpleNet(n) {coord I = n;};nettype
Star(m,n[m])void
[*]MxM(float *x, float *y, float *z, repl n)void
[net SimpleNet(p)v] ParMult(float *dx, float *dy, float *dz, repl *r, repl n)myn = r[I
coordof r];void
SeqMult(float *a, float *b, float *c, int m, int n)int i, j, k, ixn;
for(i=0; i<m; i++)
void Partition(
int p, double *v, int *r, int n)for(i=0, sv=0.0; i<p; i++) sv+v[i];
Formal parameters
x, y, and z of basic function MxM are distributed over entire computing space, and parameter n is replicated over the entire computing space. It is meant that n holds the dimension of matrices. It is also meant that x points to n*n-member array, and the component of this distributed array belonging to the virtual host-processor holds matrix X. Similarly, [host]y points to an array holding matrix Y, and [host]z point to array holding resulting matrix Z.int
[net SimpleNet(n)] MPC_Bcast(This call broadcasts matrix Y from the parent of v to all virtual processors of v. As a result, each component of the distributed array pointed by dy will contain this matrix.
An embedded network function looks like a library network function, but a compiler knows its semantics. In particular, it will generate different code for different types of arguments corresponding to source and destination buffers.
Next statements are asynchronous. They form two p-member arrays d and l distributed over v. after their execution, l[i] will hold the number of elements in the portion of resulting matrix which is computed by i-th virtual processor of v, and d[i] will hold the displacement which corresponds to this portion in the resulting matrix. Equivalently, l[i] will hold the number of elements in the portion of matrix X which is used by i-th virtual processor of v, and d[i] will hold the displacement which corresponds to this portion in matrix Y. Each component of c will hold the number of elements in the portion of the resulting matrix which is computed by the corresponding virtual processor (equivalently, the number of elements in the portion of matrix X which is used by this virtual processor.)
After that a call to embedded network function MPC_Scatter is placed. This function is declared as follows:
int
[net SimpleNet(n) w] MPC_Scatter(This call scatters matrix X from the parent of v to all virtual processors of v. As a result, each component of dx will point to an array containing the corresponding portion of matrix X.
After that a call to nodal function SeqMult is placed. The function computes the portions of the resulting matrix on each of virtual processors of v in parallel (SeqMult implements traditional sequential algorithm of matrix multiplication).
Finally there is a call to embedded network function MPC_Gather in ParMult. MPC_Gather is declared as follows:
int
[net SimpleNet(n) w] MPC_Gather (Experimental results
We measured the running time of our mpC program multiplying two dense square matrices. We used three Sun SPARCstations 5 (hostnames gamma, beta, and delta), SPARCclassic (omega), and HP 9000-712 (zeta) connected via 10Mbits Ethernet. There were more than 20 others computers in the same segment of the local network.
We used LAM MPI Version 6.0 as a particular communication platform as well as a new improved benchmark for detecting relative performances of workstations. In addition all executables, which took part in the experiment, were generated by GNU C compiler with optimization option –O2.
Eight virtual parallel machines were created:

The computing space of each of these virtual parallel machines was constituted by 5 processes running on each of workstations (that is for example, the computing space of gbdz constitute by 20 processes). As a base of comparison we used the running time of a sequential C program implementing the same algorithm which was used in function SeqMult.
Figure 1 illustrates how the mpC program allows to speed up the multiplication of two dense square matrices depending on the their dimensions. It compares it to the speed of the sequential C program.
Note, that the running time of the mpC program substantially depends on the network load. We monitored the network activity during our experiments. We have observed up to 32 collisions per second. The collisions occurred more often during broadcasting large data portions. The collisions resulted in visible degradation of the network bandwidth.
Table 1 compares contribution of communications and computations in the total running time of mpC program (results for gbdz are presented). The first column shows matrix dimension, and the second one shows percentage of communications in the total running time.
|
N |
Communications (%) |
|
100 |
40 |
|
200 |
55 |
|
300 |
48 |
|
400 |
38 |
|
500 |
35 |
|
600 |
26 |
|
700 |
24 |
|
800 |
21 |
Communications in our mpC program consist of three parts: scattering matrix X, broadcasting matrix Y, and gathering the resulting matrix. Table 2 compares contribution of each of these parts in the total communication time (for gbdz virtual parallel machine).
|
n |
broadcast |
Scatter |
gather |
|
100 |
70% |
18% |
12% |
|
200 |
78% |
11% |
11% |
|
300 |
78% |
10% |
12% |
|
400 |
79% |
10% |
11% |
|
500 |
79% |
10% |
11% |
|
600 |
79% |
10% |
11% |
|
700 |
79% |
10% |
11% |
|
800 |
76% |
13% |
11% |
While the present results, it is necessary to take into account some peculiarities of both the implementation of MPI, which we used, and our local network.
Our local network does not support fast communications. It is based on 10Mbis Ethernet and uses old-fashioned network equipment. In addition, there are 26 computers in our segment of the network connected via cascade of 4 hubs. To characterize our network, it is enough to say that ftp transfers data from gamma to alpha at the rate of 300-400 Kbytes/s. It means that real bandwidth of our network is aboun 25-30% of its maximum bandwidth.
On other hand, LAM MPI Version 6.0 ensures sending large floating arrays at the rate 50––60 Kbytes/s. In addition, it doesn’t use multicasting features of our network when broadcasting.
Nevertheless, even under these conditions, our mpC program has demonstrated good speedup comparing with the sequential C program.
If the implementation of MPI ensured the communication rate comparable with real bandwidth of local network and used its multicasting facilities, contribution of the communication of our mpC program would not exceed 5-7%. If, in addition we used 100Mbits Ethernet and up-to date network technologies (for example, replacing hubs with switching devices), contribution of communication in the total running time of the mpC program would not exceed 1-2%. That is, the mpC programming environment can ensure practically ideal speedup of the presented mpC program for up-to-date heterogeneous networks of workstations.
Table 3 gives the time of running the mpC program on four heterogeneous virtual parallel machines (zg, zgb, zgbd, and zo) dependent on dimension of multiplied matrices, and compares it to the time of running the sequential C program on workstation zeta.
|
n |
z |
zg |
zgb |
Zgbd |
zo |
zo |
|
|
C |
mpC |
mpC |
MpC |
mpC |
MPI |
|
100 |
0.18 |
0.43 |
0.52 |
0.57 |
0.67 |
0.91 |
|
200 |
1.52 |
1.67 |
1.70 |
1.79 |
2.36 |
4.29 |
|
300 |
6.80 |
5.66 |
5.08 |
4.90 |
7.09 |
14.2 |
|
400 |
17.3 |
14.2 |
11.7 |
11.1 |
16.4 |
33.0 |
|
500 |
36.2 |
26.0 |
21.0 |
19.0 |
32.8 |
68.0 |
|
600 |
66.8 |
53.0 |
41.0 |
37.0 |
58.5 |
120. |
|
700 |
113. |
83.0 |
64.0 |
56.0 |
97.0 |
200. |
|
800 |
180. |
134. |
102. |
88.0 |
152. |
306. |
In addition, the table compares the mpC program and its manually written MPI counterpart on machine zo.

Figure 2 illustrates how the mpC program allows to speed up the multiplication of two dense square matrices, if the user starts from single powerful workstation zeta and enhances his computing facilities step by step by means of adding less powerful workstations gamma, beta, and delta. One can see that the mpC programming environment ensures good speedup in these case also.
Another interesting result can be extracted from Figure 1 and table 3. One can see that the slow network consisting of workstations gamma and delta (virtual parallel machine gd), the performance each of which is about 60% of the performance of workstation zeta, demonstrates a little bit higher performance (when multiplying two dense square matrices) than single workstation zeta.

Finally, figure 3 shows clearly, that even for very heterogeneous distributed memory machine consisting of high-performance HP workstation and old low performance Sun workstation omega, the mpC program allows to utilize its parallel potential, speeding up the multiplication of two dense square matrices (comparing to sequential C program running on zeta). At the same time, the use its MPI counterpart, which distributes the workload equally, does not allow to do it slowing down the matrix multiplication essentially.
Back to The mpC Program Examples