CME213: Neural Networks on CUDA, Part 2

Introduction

This is the second part of the assignment on implementing a neural network on GPUs; this part requires performance optimization.

In this second part of the final project, we provide further details about the grading policy and introduce you to the starter code. You can also find instructions for running and profiling the code on the cluster and submitting your work.

Grading details

Please refer to Part I for overall grading information. Here we explain in detail how we determine the correctness of the code and test the performance. We have set up four test cases (with corresponding grading modes in the code) for testing correctness and performance. These test cases, or grading modes, can be run by passing command line arguments to the program. More details about them are given in later sections.

Outline

You can find the grading outline below. More details about them are in the subsections that follow.

  • Preliminary Report:
    • Correctness (15%)
      • GEMM correctness
      • Overall correctness
    • Profiling (5%)
  • Final Report:
    • Correctness (32%)
      • GEMM correctness
      • Overall correctness
    • Correctness analysis (3%)
    • Performance (20%)
      • GEMM performance
      • Overall performance
    • Profiling and analysis (20%)
    • Overall quality of report (5%)

GEMM correctness

Since the GEMM function is a building block of any neural network implementation and will be an important tool in your arsenal, we test the GEMM implementation separately from the overall code. We have provided a function prototype called myGEMM for you in gpu_func.cu, which takes as inputs two scalars a and b and three matrices A, B, and C, and computes D = a * A * B + b * C, returning the result in place in C.
Your job is to fill in this function; we will test your implementation on two sets of inputs that are relevant to this project. You are welcome to use this myGEMM function in your parallel training, but you don't have to; it exists only for grading purposes.
We test this correctness by running grading mode 4, which runs the myGEMM function alone. This myGEMM function is called only by rank 0 in the grading mode, i.e., for this part you just need to write kernels to do GEMM on a single GPU.
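As a starting point, this operation can be written naively with one thread per output element. The sketch below assumes column-major storage (Armadillo's layout) and an illustrative signature; the actual myGEMM prototype is the one declared in gpu_func.cu, and A is M x K, B is K x N, C is M x N:

```cuda
// Hypothetical naive sketch of myGEMM: C = alpha * A * B + beta * C,
// column-major layout, one thread per element of C.
__global__ void gemm_kernel(const double* A, const double* B, double* C,
                            double alpha, double beta, int M, int N, int K) {
    int row = blockIdx.x * blockDim.x + threadIdx.x; // row index into C
    int col = blockIdx.y * blockDim.y + threadIdx.y; // column index into C
    if (row < M && col < N) {
        double acc = 0.0;
        for (int k = 0; k < K; ++k)
            acc += A[k * M + row] * B[col * K + k]; // column-major indexing
        C[col * M + row] = alpha * acc + beta * C[col * M + row];
    }
}

int myGEMM(double* A, double* B, double* C, double* alpha, double* beta,
           int M, int N, int K) {
    dim3 block(16, 16);
    dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    gemm_kernel<<<grid, block>>>(A, B, C, *alpha, *beta, M, N, K);
    return 0;
}
```

A tiled version that stages blocks of A and B in shared memory is the natural next step for performance; the naive kernel above is only meant to establish correctness.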

Overall correctness

In large neural network problems, a common issue is the accumulation of rounding errors and implementation inconsistencies. Unfortunately, several operations are not implemented in exactly the same way on the CPU and the GPU. Sources of differences include the exp() operation used in the softmax and sigmoid functions, FMA (fused multiply-add), and the order of operations; there are also some differences at the hardware level. These discrepancies are typically on the order of 1e-16 for double-precision calculations, but they can build up over time. In general, the larger the learning rate, the more unstable the algorithm is with respect to roundoff errors. These discrepancies might not cause any parameters to blow up, but they can create significant differences between the CPU and GPU solutions, which makes determining correctness challenging.
To tackle this, we have set up three test cases for determining correctness, in the form of grading modes. In all of these modes, the max norm of the difference between the final CPU and GPU results (parameters W(1), W(2), b(1), b(2)) is considered. If this max norm exceeds a set threshold (1e-7) for any case, your code fails correctness for that case. The actual max-norm values we observe are much lower than this; the threshold has been relaxed to provide some leeway. Apart from passing the three correctness tests, the precision on the validation set of the CPU and GPU implementations must be very close.
The hyper-parameters for the three test cases are as follows:

  1. Low learning rate: 0.001, large # iterations: 40 epochs;
  2. Medium learning rate: 0.01, medium # iterations: 10 epochs;
  3. High learning rate: 0.025, small # iterations: 1 epoch.

Grading modes 1, 2, and 3 run the above three test cases, respectively.
Note: In order to get full credit on overall code correctness, the above threshold must be met by a fully parallel code running on 4 GPUs through four different processes (or CPU threads) using MPI and CUDA. If the code runs on a single GPU or does not use GPUs at all (just MPI), you will lose a significant portion of the grade. Similarly, if you run four processes but only one of them uses GPUs, you will again lose points. When we say running on GPUs, we expect all the GEMM, softmax, and sigmoid calculations to be done on GPUs.
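A common way to satisfy the all-four-GPUs requirement is for each MPI rank to bind itself to its own device right after MPI_Init. This is a hedged sketch of that pattern, not the starter code's structure:

```cuda
// Illustrative sketch: each of the 4 MPI processes claims its own GPU.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, num_devices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(rank % num_devices); // rank i -> GPU i on a 4-GPU node
    // ... all subsequent kernel launches from this rank go to its own GPU
    MPI_Finalize();
    return 0;
}
```

After this call, every kernel and cudaMemcpy issued by a rank targets its assigned device, so all four processes do GPU work rather than just rank 0.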

GEMM Performance

This refers to the performance of your myGEMM function. To test it, we run the code in grading mode 4. The grade will be based on the performance of your GEMM function (in terms of the time taken) relative to that of other students in the class. The exact method for calculating this relative grade will be determined later, depending on the range of performances we get.
In the code, we run this myGEMM function repeatedly for a number of iterations. This is currently set to 10, but we might change it based on the performance we see in the submissions; this should not affect your implementation.
Caveat: If your GEMM implementation does not pass the GEMM correctness test, you will not receive any points for performance.

Overall Performance

This refers to the performance of your full NN code. We use the default settings of the program to benchmark the performance (time taken). Here again, the grade is based on your performance relative to that of other students in the class, and the exact method for calculating this relative grade will be determined later, depending on the range of performances we get.
Caveat: If you do not pass the overall correctness tests, you will be penalized; the penalty will be determined on a case-by-case basis.

Starter-code

The starter code integrates the GPU CUDA code with the rest of the C++ code. The GPU code is first compiled by nvcc into object files, which are then linked with the other parts of the project and the libraries by the g++ linker. The project uses the Armadillo library for matrices and vectors. The files are described below. Those marked with a star (*) will not be submitted by the submission script; you are free to modify them for debugging purposes, but make sure you test with the original versions before you submit. In the other files, you may add as many functions as you wish.
Note: Please make sure you adequately comment your code and also structure it well. Although we will avoid going through the code in detail, it will help us read your code in case we have to.

  • sample_bashrc.txt: This file contains a list of modules to be loaded to run the program. You can choose to copy it into your ~/.bashrc on the cluster.
  • init.sh: This file contains the script to download the MNIST dataset and install Armadillo. This only needs to be run once. Please see the running instructions before you run this script.
  • run.sh: This file contains the script to run the program using sbatch. See 3.2 for further details.
  • main.cpp: This is the main file for the project. You do not need to change this file except for your own debugging purposes.
  • gpu_func.cu, gpu_func.h: You should implement your GPU CUDA wrapper functions and kernels in gpu_func.cu and declare them in gpu_func.h. This separates the source code so that nvcc only compiles the CUDA code into object files, which can be linked into other parts of the project by the g++ linker.
  • two_layer_net.h: This file contains a basic C++ class to implement the two layer neural network. Note that all members in two_layer_net are declared to be public, and you can access them directly, which allows an easier MPI implementation than with a more encapsulated class.
  • two_layer_net.cpp: This file already contains a serial implementation of the neural network. Your objective is to fill the parallel_train function with the parallel implementation.
  • utils/tests.cpp utils/tests.h: These files contain the tests used for determining correctness and testing performance.
  • utils/common.cpp, utils/common.h: These files contain common operations on arma::mat that may be useful. You can make your own GPU CUDA implementation accordingly in gpu_func.cu.
  • utils/test_utils.h: This file contains helper functions useful for debugging and testing, e.g., a function to compare a memory space representing a matrix with an Armadillo Matrix to check if the GPU implementation is correct.
  • utils/mnist.cpp, utils/mnist.h: These files contain code that reads in the MNIST dataset.
  • Outputs folder: All the output files go into this folder. There is another folder named CPUmats inside this folder. All the CPU matrices that are written out during debug mode go into this folder.
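For the parallel_train function listed above, a common data-parallel pattern is: each rank computes gradients on its shard of the mini-batch, then the gradients are averaged across ranks before every parameter update. A minimal sketch of the averaging step, assuming each gradient is stored contiguously (the helper name is hypothetical, not part of the starter code):

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical helper: average a local gradient across all MPI ranks.
void allreduce_average(std::vector<double>& grad, int num_procs) {
    // Sum the per-rank gradients in place, then divide by the rank count.
    MPI_Allreduce(MPI_IN_PLACE, grad.data(), static_cast<int>(grad.size()),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (double& g : grad) g /= num_procs;
}
```

Each rank would copy its GPU gradients back to the host, call a routine like this, and apply the averaged gradient to its local copies of W(1), W(2), b(1), b(2), so that all ranks keep identical parameters throughout training.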

Running instructions

We have provided a sample .bashrc file in sample_bashrc. You can replace your current ~/.bashrc (or bash profile) file on the cluster with it, or copy the relevant portions into your current bashrc file. The modules to be loaded are as follows:

module add shared
module add slurm
module add gcc/4.8.5
module add cuda75
module add mvapich2/gcc/64/2.1
module add intel-cluster-runtime/intel64/3.7

(Please load gcc/4.8.5 instead of the default gcc module, because nvcc does not support gcc versions 4.9 and up.)
Make sure all the above modules are loaded. If you changed your .bashrc file, you may have to source it for the changes to take effect. Alternatively you can exit your ssh session and log back in. You can see the modules that have been loaded by using

module list

With the correct modules loaded, run

./init.sh

This downloads the MNIST dataset and installs the Armadillo library. You only need to do this the first time after you download the code.
Edit the job script run.sh to add command line arguments or change the number of processes you want to run with. By default, we request 4 processes on a single node of the cluster, along with 4 GPUs. A single node is used to reduce MPI overhead, since communication across nodes is slower than within a node. Note that the program prints the number of MPI processes and CUDA devices at the very beginning, to help you make sure you are running it correctly.
Submit the job script run.sh using sbatch as follows

sbatch run.sh

You can check whether your job is still running via the command squeue.