Matrix multiplication

Description of the problem

Let us consider a regular application multiplying 2 dense square n*n matrices X and Y.
   Our mpC program will use a number of virtual processors, each of which computes a number of rows of the resulting matrix Z. Both dimension n of matrices and the number of virtual processors involved in computations are defined in run time. So, our application implements the following scheme.

The algorithm and program in mpC

Initializing X and Y on the virtual host–processor
Creating a network
Scattering rows of X over virtual processors of the network
Broadcasting Y over virtual processors of the network
Parallel computing submatrices of Z
Gathering the resulting matrix Z on virtual host-processor

The corresponding mpC program looks as follows:

nettype SimpleNet(n) {coord I = n;};

nettype Star(m,n[m])
{
   
coord I=m;
   
node {I>=0: fast*n[I] scalar;};
   
link {I>0: [I]->[0], [0]->[I];};
   
parent [0];
};

void [*]MxM(float *x, float *y, float *z, repl n)
{
   
repl double powers;
   
repl nprocs, nrows[MAXNPROCS], n;
   MPC_Processors_static_info(&nprocs, &powers)
   Partition(nprocs, powers, nrows, n);
   {
      
net Star(nprocs, nrows) w;
      ([([w]nprocs)w])ParMult([w]x, [w]y, [w]z, [w]nrows, [w]n);
   }
}

void [net SimpleNet(p)v] ParMult(float *dx, float *dy, float *dz, repl *r, repl n)
{
   
repl s = 0;
   
int myn, i;
   
int *d, *l c;

   myn = r[I coordof r];
   ([(p)v])MPC_Bcast(&s, dy, 1, n*n, dy, 1);
   d = calloc(p, sizeof(
int));
   l = calloc(p, sizeof(
int));
   
for(i=0, d[0] = 0; i<p; i++)
   {
      l[i] =r[i]*n;
      
if(i+1<p) d[i+1] =l[i]+d[i];
   }
   c=l[I
coordof c];
   ([(p)v])MPC_Scatter(&s, dx,d,l,c,dx);
   SeqMult(dx, dy, dz, myn, n);
   ([(p)v])MPC_Gather(&s, dz, d, l, c, dz);
}

void SeqMult(float *a, float *b, float *c, int m, int n)
{

   int i, j, k, ixn;
   
double s;

   for(i=0; i<m; i++)
      
for(j=0, ixn=i*n; j<n; j++) {
         s+=a[ixn+k]*(
double)(b[k*n+j]);
         c[ixn+j]=s;
      }
}

void Partition(int p, double *v, int *r, int n)
{
   
int sr, i;
   
double sv;

   for(i=0, sv=0.0; i<p; i++) sv+v[i];
   
for(i=0, sr=0; i<p; i++) {
      r[i]=(int)(v[i]/sv*n);
      sr+=r[i];
   }
   
if (sr!=n) r[0]+=n-sr;
}

Formal parameters x, y, and z of basic function MxM are distributed over entire computing space, and parameter n is replicated over the entire computing space. It is meant that n holds the dimension of matrices. It is also meant that x points to n*n-member array, and the component of this distributed array belonging to the virtual host-processor holds matrix X. Similarly, [host]y points to an array holding matrix Y, and [host]z point to array holding resulting matrix Z.
   Call to library nodal function MPC_Processors_static_info made on the entire computing space returns the number of actual processors and their relative performances. So, after this call replicated variable nprocs will hold the number of actual processors, and replicated array powers will hold their relative performances.
   Call to nodal function Partition is also performed on the entire computing space. Based on relative performances of actual processors, this function computes how many rows of the resulting matrix should be computed by every actual processor. So, after this call nrows[i] will hold the number of rows to compute by i-th actual processor.
   After that automatic network w is defined. Its type is defined completely only in run time. Network w, which executes the rest of computations and communications, is defined in such a way, that the more powerful the virtual processor, the greater number of rows it computes. The mpC environment will ensure the optimal mapping of the virtual processors constituting the entire computing space. So, just one process from processes running on each actual processors will be involved in multiplication of matrices, and the more powerful the actual processor is, the greater number of rows it computes.
There is a call to network function ParMult on network w two lines below. In this call, topological argument [w]nprocs specifies a network type as an instance of parametrized network type SimpleNet, and network argument w specifies a region of the computing space treated by function ParMult as a network of this type.
   Definition of function ParMult declares identifier v of a network being a special network formal parameter of the function. Since network v has a parametrized type, topological parameter p is also declared in this header. In the function body, special formal parameter p is treated as an unmodifiable variable of type int replicated over network formal parameter v the rest of formal parameters (regular formal parameters) of the function are also distributed over v.
   Actually, p holds the number of virtual processors in network v, n holds the dimension of matrices, r points to p-member array, i-th element of which holds the number of rows of resulting matrix that i-th virtual processor of network v computes. Each component of dy points to an array to contain n*n matrix Y. Each component of dz points to an array to contain rows of Z computed on the corresponding virtual processor. Each component of dx points to an array to contain the row of X used in computations on the corresponding virtual processor. In addition, throughout the function execution the components of dx, dy, dz belonging to the parent of network v are reputed to arrays holding matrices X, Y and Z correspondingly.
   Variable s is replicated over v. Variables myn, i, d, l and c all distributed over v.
After execution of the first executable (asynchronous) statement, each component of myn will contain the number of rows of the resulting matrix that computes the corresponding virtual processor. After it is a call to so-called embedded network function MPC_Bcast which declared in a standard mpC header as follows:

int [net SimpleNet(n)] MPC_Bcast(
   
repl const *coordinates_of_source,
   
void *source_buffer,
   
const source_step,
   
repl const count,
   
void *destination_buffer,
   
const *destination_step);

This call broadcasts matrix Y from the parent of v to all virtual processors of v. As a result, each component of the distributed array pointed by dy will contain this matrix.
   An embedded network function looks like a library network function, but a compiler knows its semantics. In particular, it will generate different code for different types of arguments corresponding to source and destination buffers.
   Next statements are asynchronous. They form two p-member arrays d and l distributed over v. after their execution, l[i] will hold the number of elements in the portion of resulting matrix which is computed by i-th virtual processor of v, and d[i] will hold the displacement which corresponds to this portion in the resulting matrix. Equivalently, l[i] will hold the number of elements in the portion of matrix X which is used by i-th virtual processor of v, and d[i] will hold the displacement which corresponds to this portion in matrix Y. Each component of c will hold the number of elements in the portion of the resulting matrix which is computed by the corresponding virtual processor (equivalently, the number of elements in the portion of matrix X which is used by this virtual processor.)
After that a call to embedded network function MPC_Scatter is placed. This function is declared as follows:

int [net SimpleNet(n) w] MPC_Scatter(
   
repl const *coordinates_of_source,
   
void *source_buffer,
   
const *displacements,
   
const *sendcounts,
   
int receivecount,
   
void *destination_buffer);

This call scatters matrix X from the parent of v to all virtual processors of v. As a result, each component of dx will point to an array containing the corresponding portion of matrix X.
After that a call to nodal function SeqMult is placed. The function computes the portions of the resulting matrix on each of virtual processors of v in parallel (SeqMult implements traditional sequential algorithm of matrix multiplication).
   Finally there is a call to embedded network function MPC_Gather in ParMult. MPC_Gather is declared as follows:

int [net SimpleNet(n) w] MPC_Gather (
   
repl const *coordinates_of_destination,
   
void *destination_buffer,
   
const *displacements,
   
const *receivecounts,
   
int sendcount,
   
void *source_buffer);
This call gathers resulting matrix Z each virtual processor of v sending its portion of the result to the parent of v.

Experimental results

We measured the running time of our mpC program multiplying two dense square matrices. We used three Sun SPARCstations 5 (hostnames gamma, beta, and delta), SPARCclassic (omega), and HP 9000-712 (zeta) connected via 10Mbits Ethernet. There were more than 20 others computers in the same segment of the local network.
   We used LAM MPI Version 6.0 as a particular communication platform as well as a new improved benchmark for detecting relative performances of workstations. In addition all executables, which took part in the experiment, were generated by GNU C compiler with optimization option –O2.
   Eight virtual parallel machines were created:

The computing space of each of these virtual parallel machines was constituted by 5 processes running on each of workstations (that is for example, the computing space of gbdz constitute by 20 processes). As a base of comparison we used the running time of a sequential C program implementing the same algorithm which was used in function SeqMult.
   Figure 1 illustrates how the mpC program allows to speed up the multiplication of two dense square matrices depending on the their dimensions. It compares it to the speed of the sequential C program.
Note, that the running time of the mpC program substantially depends on the network load. We monitored the network activity during our experiments. We have observed up to 32 collisions per second. The collisions occurred more often during broadcasting large data portions. The collisions resulted in visible degradation of the network bandwidth.
   Table 1 compares contribution of communications and computations in the total running time of mpC program (results for gbdz are presented). The first column shows matrix dimension, and the second one shows percentage of communications in the total running time.

N

Communications (%)

100

40

200

55

300

48

400

38

500

35

600

26

700

24

800

21

Communications in our mpC program consist of three parts: scattering matrix X, broadcasting matrix Y, and gathering the resulting matrix. Table 2 compares contribution of each of these parts in the total communication time (for gbdz virtual parallel machine).

n

broadcast

Scatter

gather

100

70%

18%

12%

200

78%

11%

11%

300

78%

10%

12%

400

79%

10%

11%

500

79%

10%

11%

600

79%

10%

11%

700

79%

10%

11%

800

76%

13%

11%

While the present results, it is necessary to take into account some peculiarities of both the implementation of MPI, which we used, and our local network.
   Our local network does not support fast communications. It is based on 10Mbis Ethernet and uses old-fashioned network equipment. In addition, there are 26 computers in our segment of the network connected via cascade of 4 hubs. To characterize our network, it is enough to say that ftp transfers data from gamma to alpha at the rate of 300-400 Kbytes/s. It means that real bandwidth of our network is aboun 25-30% of its maximum bandwidth.
   On other hand, LAM MPI Version 6.0 ensures sending large floating arrays at the rate 50––60 Kbytes/s. In addition, it doesn’t use multicasting features of our network when broadcasting.
   Nevertheless, even under these conditions, our mpC program has demonstrated good speedup comparing with the sequential C program.
   If the implementation of MPI ensured the communication rate comparable with real bandwidth of local network and used its multicasting facilities, contribution of the communication of our mpC program would not exceed 5-7%. If, in addition we used 100Mbits Ethernet and up-to date network technologies (for example, replacing hubs with switching devices), contribution of communication in the total running time of the mpC program would not exceed 1-2%. That is, the mpC programming environment can ensure practically ideal speedup of the presented mpC program for up-to-date heterogeneous networks of workstations.
   Table 3 gives the time of running the mpC program on four heterogeneous virtual parallel machines (zg, zgb, zgbd, and zo) dependent on dimension of multiplied matrices, and compares it to the time of running the sequential C program on workstation zeta.

n

z

zg

zgb

Zgbd

zo

zo

 

C

mpC

mpC

MpC

mpC

MPI

100

0.18

0.43

0.52

0.57

0.67

0.91

200

1.52

1.67

1.70

1.79

2.36

4.29

300

6.80

5.66

5.08

4.90

7.09

14.2

400

17.3

14.2

11.7

11.1

16.4

33.0

500

36.2

26.0

21.0

19.0

32.8

68.0

600

66.8

53.0

41.0

37.0

58.5

120.

700

113.

83.0

64.0

56.0

97.0

200.

800

180.

134.

102.

88.0

152.

306.

In addition, the table compares the mpC program and its manually written MPI counterpart on machine zo.


   Figure 2 illustrates how the mpC program allows to speed up the multiplication of two dense square matrices, if the user starts from single powerful workstation zeta and enhances his computing facilities step by step by means of adding less powerful workstations gamma, beta, and delta. One can see that the mpC programming environment ensures good speedup in these case also.
   Another interesting result can be extracted from Figure 1 and table 3. One can see that the slow network consisting of workstations gamma and delta (virtual parallel machine gd), the performance each of which is about 60% of the performance of workstation zeta, demonstrates a little bit higher performance (when multiplying two dense square matrices) than single workstation zeta.


   Finally, figure 3 shows clearly, that even for very heterogeneous distributed memory machine consisting of high-performance HP workstation and old low performance Sun workstation omega, the mpC program allows to utilize its parallel potential, speeding up the multiplication of two dense square matrices (comparing to sequential C program running on zeta). At the same time, the use its MPI counterpart, which distributes the workload equally, does not allow to do it slowing down the matrix multiplication essentially.

 

Back to The mpC Program Examples