C代写:KIT318 Parallel Matrix Multiply

使用MPI库,代写程序,对矩阵进行并行计算。

Parallel Matrix Multiply

The objective of this exercise is to understand how we can design a parallel programs using Message Passing Interface (MPI).

In the lecture I merely discussed about the parallel matrix multiplication program: C = A * B. Here I have added details to make it easier for you.

The submitted program should be able to accept any size square matrix and utilize any number of processors, although I will test with a fixed number of processors.

Some comments on implementation

  1. Only one process (e.g., a process with rank = 0 acting as the master) should read the data/matrix (or assign values to matrix A and B) and then it should automatically share partitioned data among the worker processes.
  2. You can divide the work among processes in a number of ways based on how you partitioned the data. For this assignment, you need to implement Two (2) of the following (except the first one) strategies and compare them:
    • Data/Work Partitioning 1: Cyclicv1
      In this case, treat computation of one element of C matrix as one work unit. Assign these units with data to each processor equally (as much as possible) in cyclic manner. An example the implementation for this will be shared with you on Mylo.
    • Data/Work Partitioning 2: Cyclicv2
      In this case, treat computation of one row of C matrix as one work unit. Assign these work units with data to each processor in cyclic manner.
    • Data/Work Partitioning 3: blockv1
      In this case, treat a block of computation of matrix C as one work unit. Assign block of data of matrix A and B to different processes.
    • Data/Work Partitioning 4: blockv2
      Assume B is already transposed, work units for computing matrix can be divided into blocks as shown above.
  3. Try to assign the work units to worker processes such that the load/work distribution among them is balanced with least amount of data transfer between different processes. The master process should collect the results from worker processes and stores in matrix C.

Testing and Evaluation

(a) Take the size of square matrix as an input to the program by reading from the keyboard at runtime.

(b) Test your program to make sure that it works in parallel!. Also it computes the multiplication correctly. To run the program, you may use following commands (for C programs):

mpicc ParMatrixMul.c
mpirun -np numprocess ./a.out

where ParMatrixMul.c is name of your program file. numprocess is the number of processors you want to use.
(c) When the size of matrix (N) is larger than 5, it is NOT be feasible for you to input elements of matrix A and B from the keyboard. In such cases, simply assign some random values to elements matrix A and B within your program.

(d) Run your program for three different problem sizes: matrix sizes N (e.g. 100, 200, 300, 400, 500) on different combinations of processes: 1, 2, 4, 8, and 16 processes.

(e) Note the processing time by making use of appropriate MPI calls and draw a graph for varying values of matrix size with “Number of CPUs/nodes on X column” and “the time taken “ on Y column.

That is, use randomly generated matrices in a range of sizes and produces a report on the time taken for a range of different sized matrices in a parallel programming environment using MPI. You need to compare two data partitioning strategies chosen for parallelisation (discussed above). You need to do the experiment in two environments: a) in your own Personal Laptops and b) in MPI Cluster provided to you (details will be given in lecture).

Programming Environment

For initial development and testing, you can use your own personal computer. You can choose either C or Java for your implementation.

If you want to develop in C and use windows, you need to install Cygwin and OPENMPI libraries. If you want to develop in C and use a Unix system, you have to install MPICH2 library. If you want to develop in Java, you can use MPJ Express (http://mpj-express.org/). MPJ express can be plugged into eclipse also.

For performance testing, you will be using a cluster on Nectar cloud. Instructions for accessing this will be available in another document.

Submission

  • Write a report containing (a) details on your two chosen data partitioning and parallelisation algorithms including the way you did the load balancing with proper reasoning, and (b) performance graphs.

  • Program source code.

  • Please upload the assignment on Mylo (by deadline). Please do not zip the files. Please mention your name and student number on the first page of your report.

You are welcome to implement as indicated below if you like.
A better way of expressing the problem is: Write a function “square_dgemm” routine to compute the product of two matrices using MPI to distribute the task onto a number of processors. Note, only one processor should read the data/matrix and then the program should automatically share this among the processors.

There is a harness routine matmul.c on Mylo which provides checks on correctness and scalability.