C++ Assignment: COMP4300 Shared Memory Parallel Algorithms

Implement several parallel algorithms using CUDA and OpenMP.

Assignment Setup and Submission

This assignment must be submitted electronically. To obtain a copy of the template source files, you must fork the gitlab project ps-ass2 into your own gitlab area, add the user as a Reporter to your project, and push the final version of your files into your fork of the ps-ass2 gitlab project before the deadline.

The files that will be marked are a report ps-ass2Rep.pdf and the source files parAdvect.c and parAdvect.cu.

Learning Objectives

  • Implement shared memory parallel algorithms using the OpenMP and CUDA programming models.
  • Gain an understanding of some of the performance and programmability implications of SMP and GPU systems.

Setup

Your assignment project directory will contain a ps-ass2Rep.pdf, which you must overwrite with your own report. It also contains two sub-directories, openmp and cuda. The former contains a test program advectTest.c, a file serAdvect.c containing serial advection functions, some header files, and a template OpenMP advection solver parAdvect.c. The test programs can be built using the command make.

The usage for the test program is:

OMP_NUM_THREADS=p ./testAdvect [-P P] [-x] M N [r]

The test program operates much like that for Assignment 1, except as follows. The -P option invokes an optimization in which the parallel region extends over all timesteps of the simulation, and a P by Q block distribution is used to parallelize over the threads, where p=PQ. The -x option is used to invoke an optional extra optimization.

The directory cuda is similar, except the test program is called advectTest.cu, and the template CUDA parallel solver file is called parAdvect.cu. The usage for the test program is:

./testAdvect [-h] [-s] [-g Gx[,Gy]] [-b Bx[,By]] [-o] [-w w] M N [r]

with default values of Gx=Gy=Bx=By=r=1. (Gx,Gy) specifies the grid dimensions of the GPU kernels; (Bx,By) specifies the block dimensions.

The option -h runs the solver on the host; this may be useful for comparing the ‘error’ of the GPU runs (and for comparing GPU and CPU speeds). The option -s forces a serial implementation (run on a single GPU thread); all other options are ignored. If none of -h, -s, -o is given, a grid of Gx x Gy thread blocks, each of size Bx x By, is used in a 2D GPU parallelization of the solver. If -o is specified, an optimized GPU implementation is used, which may also use the parameter w.

You should read the files and familiarize yourself with what they contain.

Part 1 (OpenMP): Tasks

Experimentation for this section should be done on the Raijin supercomputer.

Results should use a batch job run on a single supercomputer node.

Parallelization via 1D Decomposition and Simple Directives

Parallelize the prototype function omp1dAdvect() in parAdvect.c. Each (advection update) loop nest must have its own parallel region, and the parallelization must be over one of the two loop indices (hint: this can be done via simple OMP parallel for directives).
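
As orientation only, the following sketch shows the intended pattern; the array layout, leading dimension ld, and the stencil itself are placeholders here, not the definitions used in serAdvect.c:

    /* Sketch only: placeholder array layout, leading dimension and stencil,
       not the definitions from serAdvect.c. */
    #include <omp.h>

    static void updateSketch(int M, int N, int ld, const double *u, double *v) {
      /* one parallel region for this loop nest, split over the outer (row) index */
      #pragma omp parallel for schedule(static)
      for (int i = 1; i <= M; i++)
        for (int j = 1; j <= N; j++)      /* inner index stays serial within each thread */
          v[i*ld + j] = 0.25 * (u[(i-1)*ld + j] + u[(i+1)*ld + j] +
                                u[i*ld + j-1]   + u[i*ld + j+1]);
    }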

There are various ways this can be done: (i) parallelize the outer or inner loop, (ii) interchange the loop order, and (iii) schedule the iterations in a block or cyclic fashion (directive fragments for these variants follow the list below). Choose combinations that:

  • maximize performance.
  • maximize the number of OpenMP parallel region entry/exits.
  • maximize shared cache misses involving coherent reads.
  • maximize shared cache misses involving coherent writes.
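
The fragments below illustrate variants (ii) and (iii); UPDATE(i, j) is a placeholder for the advection update statement, and the array names are assumed rather than taken from the template:

    /* Fragments only: UPDATE(i, j) stands for the advection update statement. */

    /* (iii) block scheduling over the outer index (default static chunks) */
    #pragma omp parallel for schedule(static)
    for (int i = 1; i <= M; i++)
      for (int j = 1; j <= N; j++)
        UPDATE(i, j);

    /* (iii) cyclic scheduling over the outer index (chunk size 1) */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 1; i <= M; i++)
      for (int j = 1; j <= N; j++)
        UPDATE(i, j);

    /* (ii) interchanged loop order, parallelized over the (now outer) column index */
    #pragma omp parallel for schedule(static)
    for (int j = 1; j <= N; j++)
      for (int i = 1; i <= M; i++)
        UPDATE(i, j);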

The boundary update loops can potentially be parallelized as well. Experiment with this for the purpose of improving the performance of case (i).

Cut and paste the directive / loop-nest combinations for each case into your report, and discuss why you chose them. Similarly, discuss your strategy in parallelizing the boundary update loops, and whether and by how much it improved the performance of case (i). Choosing suitable M, N, and r, record the performance at p = 8 and 16, and put the results in your report, discussing the differences.

Now, leave the best performing parallelization combinations in your file.

Performance Modelling of Shared Memory Programs

Let ts denote the cost of a parallel region entry/exit, and tw,R and tw,W denote the cost per (double) word of a cache miss for coherent reads and writes, respectively. From your measurements, derive values for these parameters for both p = 8 and 16.
Construct a performance model for the best-performing variant (hint: it should be very similar to the 1D model you used in Assignment 1, and you can use the same value of tf). Compare your model’s prediction with the previous results for the best-performing variant (these should match - why?), with the results for when MN is twice as large, and with both of these for p = 4 and 12.
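
For orientation only (the exact terms and coefficients must come from your own Assignment 1 model and your best-performing variant), one plausible shape for a 1D row-block decomposition with k parallel-region entries per timestep is

    T(p) \approx r \left( k\, t_s + \frac{MN}{p}\, t_f + N \left( t_{w,R} + t_{w,W} \right) \right)

where the N(tw,R + tw,W) term stands in for coherence misses on the rows adjacent to block boundaries; justify or replace each term from your own measurements.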

Parallelization via 2D Decomposition and an Extended Parallel Region

In the function omp2dAdvect(), write an advection solver such that there is only a single parallel region (over all timesteps), and the parallelization is over a P by Q block distribution. Hint: use the code of omp1dAdvect() minus the directives as a starting point.
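
A hedged sketch of the overall shape follows; the block-index arithmetic is illustrative, the update and boundary handling are placeholders, and it assumes OMP_NUM_THREADS = P*Q:

    /* Sketch: one parallel region over all r timesteps, P x Q block distribution.
       Assumes OMP_NUM_THREADS = P*Q; update and boundary code are placeholders. */
    #include <omp.h>

    void omp2dSketch(int M, int N, int r, int P, int Q, int ld, double *u, double *v) {
      #pragma omp parallel default(shared)
      {
        int id = omp_get_thread_num();
        int ip = id / Q, iq = id % Q;                        /* block coordinates */
        int i0 = 1 + (ip * M) / P, i1 = ((ip + 1) * M) / P;  /* my rows i0..i1    */
        int j0 = 1 + (iq * N) / Q, j1 = ((iq + 1) * N) / Q;  /* my cols j0..j1    */
        for (int t = 0; t < r; t++) {
          /* ... update the boundary (halo) elements this thread owns ... */
          #pragma omp barrier
          for (int i = i0; i <= i1; i++)
            for (int j = j0; j <= j1; j++)
              v[i*ld + j] = u[i*ld + j];       /* placeholder for the advection update */
          #pragma omp barrier
          /* ... copy v back into u (or swap the roles of u and v) ... */
        }
      }
    }
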
For suitable measurements with p=16 and varying P, compare performance. Discuss whether there is any gain from the 2D distribution. Compare this with the 1D vs 2D case for the MPI version of the solver, and explain any differences in the relative advantages. Comparing with your results in Q1 for the P=1 case, was there any advantage in having a single parallel region?

Optional.

Find and describe some new optimization that could potentially improve the performance of the OpenMP advection solver. If possible, implement and evaluate this on Raijin. Put your code in ompAdvectExtra(), which is activated by the -x option.

Part 2 (CUDA): Tasks

Unless otherwise specified, experiments for this section should be run on the Jetson system, as described in Prac 5.

Baseline GPU Implementation - 1D

Using the code of cudaSerAdvect() and its kernels in serAdvect.cu as a starting point, implement a solver whose field update kernels operate on a grid of Gx x Gy thread blocks, each of size Bx x By (without restrictions, except that you may assume Bx*By does not exceed the maximum thread block size). You may choose what, if any, parallelization strategy you apply for the boundary updates (justify this in your report).

In parAdvect.cu, you will need to create new kernel functions, and your outer-level solver calling these must be implemented in cuda2DAdvect(). Hint: to help in debugging, replace the kernels from serAdvect.cu one by one with your own, and do 1D parallelizations first.
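
As a sketch only (placeholder stencil and names; the real updates should be adapted from the serAdvect.cu kernels), a grid-stride kernel lets any Gx x Gy grid of Bx x By blocks cover the whole M x N field:

    /* Sketch only: grid-stride 2D update kernel with a placeholder stencil. */
    __global__ void updateKernelSketch(int M, int N, int ld,
                                       const double *u, double *v) {
      for (int i = 1 + blockIdx.y * blockDim.y + threadIdx.y; i <= M;
           i += gridDim.y * blockDim.y)
        for (int j = 1 + blockIdx.x * blockDim.x + threadIdx.x; j <= N;
             j += gridDim.x * blockDim.x)
          v[i*ld + j] = 0.25 * (u[(i-1)*ld + j] + u[(i+1)*ld + j] +
                                u[i*ld + j-1]   + u[i*ld + j+1]);
    }

    /* host-side launch shape, per timestep, from cuda2DAdvect() */
    void launchSketch(int Gx, int Gy, int Bx, int By, int M, int N, int ld,
                      const double *u_d, double *v_d) {
      dim3 grid(Gx, Gy), block(Bx, By);
      updateKernelSketch<<<grid, block>>>(M, N, ld, u_d, v_d);
    }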

Perform experiments to determine the effects of varying Gx, Gy, Bx, By (given some reasonably sized problem to solve). Report and discuss the optimal combination, including its speedups against the single-GPU-thread and host (ARM) runs, and any results of interest (including a comparison of 1 x B vs B x 1 blocks). Also perform suitable experiments to determine the overhead of a kernel invocation, and report your findings.
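
One way to estimate kernel invocation overhead (a sketch, assuming CUDA event timing; emptyKernel and launchOverheadMs are names invented here) is to average many launches of an empty kernel:

    /* Sketch: average the cost of many empty-kernel launches using CUDA events. */
    __global__ void emptyKernel(void) {}

    float launchOverheadMs(int nLaunches) {
      cudaEvent_t start, stop;
      cudaEventCreate(&start);  cudaEventCreate(&stop);
      cudaEventRecord(start);
      for (int k = 0; k < nLaunches; k++)
        emptyKernel<<<1, 1>>>();
      cudaEventRecord(stop);
      cudaEventSynchronize(stop);
      float ms;
      cudaEventElapsedTime(&ms, start, stop);
      cudaEventDestroy(start);  cudaEventDestroy(stop);
      return ms / nLaunches;                 /* average per-launch cost in ms */
    }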

Optimized GPU Implementation

In cudaOptAdvect(), create an optimized solver and its associated kernels. It should be significantly faster than the previous version. For ideas of what optimizations you might consider, read the paper Optimized Three-Dimensional Stencil Computation on Fermi and Kepler GPUs by Vizitiu et al. In your report, describe your optimization(s) and their rationale. Perform suitable experiments which should demonstrate the efficacy of your approach, and discuss them in your report.
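
One family of optimizations discussed in that paper is staging field tiles in shared memory so that each element is read from global memory roughly once per block. The sketch below illustrates only that idea, with a placeholder stencil and invented names; it is not a claim about which optimization works best for this solver:

    /* Sketch: each block stages a (BY+2) x (BX+2) tile (rows x columns, with halo)
       in shared memory, then updates its interior points.  Placeholder stencil;
       BX, BY must match the blockDim used at launch. */
    #define BX 32
    #define BY 8

    __global__ void tiledUpdateSketch(int M, int N, int ld,
                                      const double *u, double *v) {
      __shared__ double tile[BY + 2][BX + 2];
      int j = 1 + blockIdx.x * BX + threadIdx.x;      /* this thread's global column */
      int i = 1 + blockIdx.y * BY + threadIdx.y;      /* this thread's global row    */
      int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

      if (i <= M && j <= N) {
        tile[ty][tx] = u[i*ld + j];
        /* edge threads also load the halo cells they border */
        if (threadIdx.x == 0)                 tile[ty][0]      = u[i*ld + j - 1];
        if (threadIdx.x == BX - 1 || j == N)  tile[ty][tx + 1] = u[i*ld + j + 1];
        if (threadIdx.y == 0)                 tile[0][tx]      = u[(i-1)*ld + j];
        if (threadIdx.y == BY - 1 || i == M)  tile[ty + 1][tx] = u[(i+1)*ld + j];
      }
      __syncthreads();

      if (i <= M && j <= N)
        v[i*ld + j] = 0.25 * (tile[ty-1][tx] + tile[ty+1][tx] +
                              tile[ty][tx-1] + tile[ty][tx+1]);
    }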

Comparison of the Programming Models

Based on your experiences in this course, comment on the programmability (the relative ease of designing, implementing and debugging programs) of the three paradigms you have used so far: OpenMP (shared memory), CUDA (streaming device programming) and MPI (distributed memory). Compare this with the speedups obtained.

Optional

Port your codes to the Raijin GPUs, and perform some experiments to compare your results with those on Jetson (warning: the batch queues for the GPU nodes are known for their long wait times!). Any code changes should not go in your submitted version of parAdvect.cu; instead, describe these in your report.

Requirements

Your code must be written using C/OpenMP or CUDA, as appropriate. It should be written with good programming style. It should be well commented, so as to enable the reader (this includes the tutor/marker!) to understand how it works.

In your report, include:

  • A disclaimer with your name and student number stating what (if any) contributions from others (not including course staff) are in your submission. Note that significant contributions may require a revision of your final mark, as this is intended to be an individual assignment.
  • Your answers to all the questions given above.
  • Details of any notable deficiencies, bugs, or issues with your work.
  • Any feedback on what you found helpful in doing this work, or what from the course could be improved in helping you to carry out this work.