Dynamic estimation of processor performances

Motivation

The mpC language allows the programmer to describe the abstract heterogeneous network most appropriate for executing a given parallel algorithm, and its programming environment maps the abstract network onto a real executing network. The mapping is performed at run time and is based on information about the performances of the processors and links of the real network, so the program adapts dynamically to the executing heterogeneous hardware. How efficiently the parallel program uses the performance potential of the real network depends directly on the quality of this mapping, which in turn depends on the accuracy of the estimation of processor and link performances.

The processor and link performances can be estimated by executing a special parallel benchmark on the network. Such an estimation procedure is part of the mpccreate command, one of the commands of the command-line user interface external to the mpC language.

This mechanism of dynamic adaptation of parallel programs to the executing heterogeneous hardware is necessary but rather restricted. Its main weakness lies in the use of a static integral estimation of hardware performance. The estimation does not depend on the executing program: it is obtained by executing a special parallel benchmark and saved in a file, from which the mpC run-time support system (RTSS) simply reads it while executing the corresponding mpC program.

An estimation of processor performances obtained in that way does not vary during program execution and is used when creating the networks defined in the program. As a rule, the parallel code executed on each of the network objects differs substantially from the instruction mix provided by the benchmark. This matters little when determining relative performances of microprocessors of the same architecture, but for microprocessors of different architectures, estimations of relative performance obtained with different instruction mixes can differ substantially. As a result, in each particular case of network object creation, the estimation used turns out to be rather rough, resulting in an insufficiently balanced processor workload and, hence, in lower efficiency of the program.

Several approaches to improving the accuracy of the estimation of relative processor performances were considered. The first lay in classifying mpC applications and using a separate benchmark for each application class. But experiments showed that the relative performance of processors could differ substantially even for applications of the same class. For example, the following piece of C code

          for(k=0; k<500; k++) {
            for(i=k, lkk=sqrt(a[k][k]); i<500; i++)
              a[i][k]/=lkk;
            for(j=k+1; j<500; j++)
              for(i=j; i<500; i++)
                a[i][j]-=a[i][k]*a[j][k];
          }
implemented Cholesky factorization of a 500x500 matrix. Used as a benchmark, it estimated the relative performance of SPARCstation-5 and SPARCstation-20 as 10:9. On the other hand, the following piece of code
          for(k=0; k<500; k++) {
            for(j=k, lkk=sqrt(a[k][k]); j<500; j++)
              a[k][j]/=lkk;
            for(i=k+1; i<500; i++)
              for(j=i; j<500; j++)
                a[i][j]-=a[k][j]*a[k][i];
          }
also implementing Cholesky factorization of a 500x500 matrix, estimated the relative performance of SPARCstation-5 and SPARCstation-20 as 10:14. Finally, the LAPACK routine dpotf2, solving the same problem, estimated their relative performance as 10:10.
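
To illustrate why such differences matter, the following C sketch distributes 500 units of work between two processors in proportion to an estimated speed ratio and then computes the resulting parallel time under the actual ratio. The names split_work and parallel_time are hypothetical and not part of mpC, and the cost model (time equals work divided by relative speed) is an assumption made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of mpC): split `total` work units among
   n processors in proportion to their estimated relative speeds `est`.
   Units lost to integer rounding are handed out one at a time,
   starting with processor 0. */
void split_work(int total, const int *est, size_t n, int *share)
{
    int sum = 0, assigned = 0;
    size_t i;

    assert(n > 0 && total >= 0);
    for (i = 0; i < n; i++) {
        assert(est[i] > 0);
        sum += est[i];
    }
    for (i = 0; i < n; i++) {
        share[i] = total * est[i] / sum;
        assigned += share[i];
    }
    for (i = 0; assigned < total; i = (i + 1) % n) {
        share[i]++;
        assigned++;
    }
}

/* Assumed cost model: a processor with relative speed s finishes w work
   units in w/s time units; the parallel time is that of the slowest one. */
double parallel_time(const int *share, const int *actual, size_t n)
{
    double worst = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double t = (double)share[i] / actual[i];
        if (t > worst)
            worst = t;
    }
    return worst;
}
```

Under these assumptions, an estimate of 10:9 applied to processors whose actual ratio is 10:14 yields a split of 264 and 236 units and a parallel time of 26.4 time units. With an estimate equal to the actual ratio, the split is 209 and 291 units and the parallel time drops to 20.9.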

The second approach lay in using a dedicated benchmark for each mpC application. In particular, the problem of automatically generating the benchmark from the source code of the application was investigated. This approach would have seriously complicated the mpC programming environment. Moreover, it did not work if the problem solved by the mpC program was divided into several subproblems, each solved on its own network object. In that case, the obtained estimation of relative processor performances averaged the real relative performances achieved in the separate parallel parts of the program and could differ from them as much as in the previous approach.

The third approach lay in automatically generating, for each particular mpC program, a benchmark that produces a vector estimation of relative processor performances: each parallel part of the program, executed on a separate network object, is characterized by its own estimation of relative performances, which the RTSS uses when creating the corresponding network object. This approach, besides being exceptionally complex to implement, did not work if the network executing the mpC program was also actively used for other computations. In that case, from the mpC program's point of view, the real relative performances of processors were a function of time. Therefore, using a static estimation that does not vary while the program is running often led to a qualitative distortion of the real ratio of processor powers and, hence, to a substantial slowdown of program execution.

So the fourth approach was selected for implementation in the mpC programming environment. It lay in introducing a new language construct that allows the programmer to refresh, at run time, the estimation of relative processor performances with the most appropriate (from the user's point of view) benchmark.

The recon statement

Thus, to make the estimation of relative processor performances, which is needed when creating network objects, as accurate as possible, a new statement of the form

recon benchmark
is introduced in the mpC language, where benchmark is either a null statement, consisting of just a semicolon, or a statement of general form, distributed over the entire computing space, that specifies execution of the same code on each virtual processor and does not specify communications between them. The recon statement refreshes the relative performances of the processors of the executing network of heterogeneous computers that are used by the RTSS. If benchmark is a null statement, a standard benchmark is used; otherwise, the code specified by the user is used as the benchmark.

In any case, the recalculated map of processor performances is saved inside the RTSS and used when creating network objects.
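
The text does not say how the RTSS turns benchmark timings into a performance map. As a rough illustration only, the following plain C sketch assumes that relative performance is inversely proportional to the time each processor needs to run the same benchmark code, and scales the result so that the fastest processor receives a chosen value. The function refresh_perf_map and its parameters are hypothetical and do not belong to the mpC library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of refreshing a performance map (not the RTSS code):
   given the time each processor took to run the same benchmark, store
   integer relative performances scaled so that the fastest gets `scale`. */
void refresh_perf_map(const double *seconds, size_t n, int scale, int *perf)
{
    double fastest;
    size_t i;

    assert(n > 0 && scale > 0);
    fastest = seconds[0];
    for (i = 0; i < n; i++) {
        assert(seconds[i] > 0.0);
        if (seconds[i] < fastest)
            fastest = seconds[i];
    }
    for (i = 0; i < n; i++) {
        /* assumed: speed is inversely proportional to time; round to nearest */
        perf[i] = (int)(scale * fastest / seconds[i] + 0.5);
        if (perf[i] < 1)
            perf[i] = 1;   /* never report a performance of zero */
    }
}
```

In this model, calling the function again with fresh timings corresponds to executing recon a second time: the stored map changes, and any workload split computed afterwards follows the new values.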

New library functions

Three new library functions, covering and extending the functionality of function MPC_Processors_static_info, are introduced.

Function MPC_Get_number_of_processors

Synopsis
     #include <mpc.h>
     repl int [*]MPC_Get_number_of_processors(void);
Description

The function detects the total number of physical processors of the underlying distributed memory machine.

Returned value

The function returns the total number N of physical processors.

Function MPC_Get_processors_info

Synopsis
     #include <mpc.h>
     void [*]MPC_Get_processors_info
                    (repl int *imap, repl double *dmap);
Description

Parameter imap should be either a null pointer or a pointer to the initial element of an N-element integer array, where N is the total number of physical processors of the underlying distributed memory machine. If imap!=NULL, then after a call to MPC_Get_processors_info the array will contain the relative performances of the processors. Parameter dmap should be either a null pointer or a pointer to the initial element of an N-element double array. If dmap!=NULL, then after a call to MPC_Get_processors_info the array will contain the relative performances of the processors.

Function MPC_Set_processors_info

Synopsis
     #include <mpc.h>
     void [*]MPC_Set_processors_info(int *[host]imap);
Description

Parameter imap should be a pointer to the initial element of an N-element integer array, where N is the total number of physical processors of the underlying distributed memory machine. The function sets new values of processor performances to be used by the RTSS when creating network objects. The new values are defined by array imap.