OpenMP代写:CS484 Matrix Transpose

完成基本的Matrix操作优化。包括使用Tiling等技术。

Tiling

Introduction

Learning Goals

You will learn the following:

  • How memory access patterns can affect the performance of serial algorithms.
  • How to reason about the relative performance of two algorithms based on machine characteristics.
  • An introductory OpenMP directive.
  • How to calculate Speedup and Parallel Efficiency.

Assignment Tasks

The basic workflow of this assignment is as follows:

  • Clone your turnin/mp1 repo
  • Implement the algorithms / code for the various programming tasks.
  • Build and test on Campus Cluster (see README.md) (It should pass all tests except “Cache.TileSizeChosen” , which you will handle later.)
  • Push your complete, correct code.
  • Do at least two rounds of benchmarking on Campus Cluster ( scripts/submit batch.slurm ):
    • To determine the best tile size for the Campus Cluster nodes.
    • A final round to get the benchmark results you will use for your writeup.
  • Push any benchmarking results that you wish to and final versions of code files. (At this point, you should pass all tests.)
  • Write and push your writeup.

Matrix Transpose

This exercise introduces tiling as a way of improving cache performance. There is a function named transpose tiled in transpose/transpose.cpp. Implement this function (matrix transposition) using the tiling concept from class.

For simplicity, we will only deal with square matrices and tiles, and you may assume that the matrices are optimally aligned in memory.

After implementing and verifying, benchmark this function on Campus Cluster (see Section 2). Then answer the following questions in your writeup:

  1. For a square matrix of size N = 2048, plot the average execution times as the tile size varies over 2, 4, 8, …512. Explain the trend in terms of the machine’s cache specifications. (If you use the provided scripts/batch script.slurm, you will have these results in writeup/tile results.txt . ) You can find details about the cache using the following command: ‘getconf -a | grep CACHE’ . (On campus cluster, the run script will do this automatically and place the output in writeup/cache properties.txt )
  2. On the basis of the above results, alter the definition of BEST TILE SIZE in transpose/best tile size.h (and add/commit), re-run the batch script, reexamine the tiling results, and plot the execution time varying the matrix size N over 128, 256, 512, 1024, 2048, 4096, 8192. Explain these new results in terms of the cache specifications of the machine on which you are benchmarking (Campus Cluster).

Matrix Multiplication

In this exercise, you will analyze the performance difference of two variants of matrix multiplication.

Implement the functions multiply tiled and multiply transposed in mulitply/multiply.cpp , verify their correctness, and answer the following questions.

  1. Explain performance differences that you observe between the two variants, Is one clearly better than the other? Why or why not? (Note that the benchmarks will automatically use the same BEST TILE SIZE that you determined in exercise 1.1.

Basic OpenMP

In this exercise you will apply simple OpenMP directives to two for loops (wrapped by functions squares serial and primes serial ) and analyze their effects on performance. You will use OpenMP pragmas to parallelized versions of these two loops inside the functions squares parallel and primes parallel located inside openmp/openmp.cpp .

  1. Explain performance differences that you observe between the two variants. Is one clearly better than the other? Why or why not?

Testing and Benchmarking

The process for running unit tests and benchmarks is described in see Section 2.1.

Though we reserve the right to subject your code to any tests of correctness, benchmarks, or human code examination, these are essentially the tests that we will use for grading. In some cases, tests that cannot be provided without revealing an answer (“Cache.TileSizeChosen”) have been redacted to only check if your submission is of the correct form, not to check the accuracy of your answer itself.

Below is the grading rubric based on expected speedups for each test. Keep in mind that you may need to run your code several times to find the best speedups. We will also be running your code several times to find the best speedups.

Compiling and Testing

To compile, just run the script compile script.sh in the scripts directory. To test and benchmark your code, just run the script submit batch.sh in the scripts directory.

You will have output files in your writeup directory that will give you performance measurements, and your slurm-[JOBID].out file will tell you which tests you passed or failed. We use Google Benchmarks for benchmarking your code. Feel free to create your own benchmarks in tests/student benchmarks.cpp if you wish, but you do not have to. If you do use your own benchmarks, then uncomment the student benchmark run command at the bottom of scripts/batch script.pbs.

Submission Guidelines

Your writeup must be a single PDF file that contains the tasks outlined above.

Please save the file as “./writeup/mp1 [NetID].pdf” , commit it to git, and push it to the main branch of your turnin repo before the submission deadline.

You must also commit at least the following code files. These files, and only these files, will be copied into a fresh repo, compiled, and tested at grading time.

  • transpose/transpose.cpp
  • transpose/best tile size.h
  • mulitply/multiply.cpp
  • openmp/openmp.cpp

Nothing prevents you from altering or adding any other file you like to help your debugging or to do additional experiments.
This includes the unit test and benchmark code. (Which will just be reverted anyway.)

It goes without saying, however, that any attempt to subvert our grading system through self-modifying code, linkage shenanigans, etc. in the above files will be caught and dealt with harshly. Fortunately, it is absolutely impossible to do any of these things unaware or by accident, so relax and enjoy the assignment.