## Introduction

结合了神经网络的高级的CUDA，用GPU进行Neural Network的算法编写进行数字识别。核心算法部分框架都已提供，需要实现的主要是相关数据结构在GPU中的申请和释放。

In this project, you will implement a neural network using CUDA to identify digits from handwritten images (a specific case of image classification problem). Neural networks are widely used in machine learning problems, specifically in the domains of image processing, computer vision and natural language processing. Recently, there has been a lot of excitement regarding deep learning, which basically uses more advanced variants of the simpler neural network (NN) we cover here. Therefore, being able to train large neural networks efficiently is important and is the goal of this project.

The main purpose of this Part I is to introduce you to the project and give you a picture of what needs to be done. Details of the starter code, submission instructions and full details of the grading steps will be handed out in Part II.

## Data and Notation

We will be using the MNIST dataset, which consists of greyscale 28 × 28 pixel images of handwritten digits from 0 to 9.

The dataset is divided into two parts — 60,000 training data and 10,000 test data. We will use the training data to learn the parameters of our neural network (described later), and the test data to measure the performance of the learned network.

In the training process, one issue will be overfitting on the training data. To avoid this, a standard practice is to perform a cross-validation — a technique to measure how the model will generalize on an independent dataset. Cross-validation is carried out by further dividing the training data into two sets - a training set and a validation set. The validation set is a small portion (usually 10%) of the training dataset. We then perform the training on the training set (excluding the validation set) and evaluate our model on the validation set. There are different types of cross-validation, and we will only do a single holdout for validation because of computational issues. We use insights from this validation to improve our model.

The goal of the test set is to perform a final evaluation of our model on the unseen data. It is not meant to be used to in the training process.

## Neural Networks

### Neurons

To describe neural networks, we will begin by describing the simplest possible neural network, one which comprises a single “neuron.”

This is a good time for us to discuss the calculation of the derivative of the sigmoid function with respect to its input, since we are going to use it significantly in the following sections.

The neuron performs a linear transform on the input and then applies a non-linear transformation (sigmoid, in this case).

When a single neuron is used, we are limited to a binary classification. Take the example of a cancer tumor. We may build a neural network that takes as input the size of the tumor, its location, and the length of time it has been there. Based on these three pieces of information, the network needs to determine whether the tumor is benign or malignant. This is a true/false type of determination. In our previous model using the sigmoid function, we can interpret the output of the network.

Our task is therefore to learn the parameters of the network W and b so that it achieves the best accuracy or precision on the test set.

In the project we need to extend this concept to multiple classes. Instead of a simple true/false output, we need to decide which digits from 0 to 9 is shown on the input image. This requires a full neural network.

### Fully-connected neural network

Figure 4 shows a fully-connected neural network with 1 input layer, 1 hidden layer, and 1 output layer. We call such a network to be a two-layered neural network (ignoring the input layer as it is trivially present)

In our problem, we are trying to determine the digit associated with each image. We will call this digit the “label” associated with the image (using the neural network terminology). The total number of labels is denoted C. In our case C = 10, since we are trying to determine digits 0 to 9.

The last layer is special. This is the output of our network. In the project, we have C = 10 output nodes. Each node represent a possible digit. We will see later on how the output vector y can be interpreted to determine the digit that is guessed by the network for a given image.

### Feed-forward

The nice thing about neural networks is that they are highly modular. Layer L[i] does not need to know whether its input is the input layer itself or the output of L[i−1]. L[i] computes its activations

### Training

Recall that our objective is to learn the parameters of the neural network such that it gets the best accuracy on a set of data points, which we call the dev set. Let y be the one-hot vector denoting the class of the input, i.e., y[c] = 1 if c is the correct label, 0, otherwise. We want P (label = c|x) to be the highest.

Without going into the mathematical details, we will use the following general expression to determine the error of our neural network.

The above cost measures the error or our “dissatisfaction” with the output of the network. The more certain the network is about the correct label (high P (y = c|x)), the lower our cost will be.

Clearly, we should choose the parameters that minimize this cost. This is an optimization problem, and is usually solved using the method of Gradient Descent (described below).

Our neural network applies a non-linear function to the input because of the sigmoid and softmax functions. However, if we make W very small, the network becomes nearly linear because T is itself very small. To optimize the neural network, we often add a penalization term for the magnitude of W in order to control the non-linearity of the network. There is no rigorous justification for this penalization. It is found to work well in practice and is easy to use. With the penalization term.

### Gradient Descent

Gradient Descent is an iterative algorithm for finding local minima of a function.

In practice, we often do not compute J using all the input images. Instead, we subdivide the input into mini-batches containing M images. We process one mini-batch at a time. For each mini-batch, we calculate J, and update the network coefficients. Then, process the next batch, until all images are processed. See Listing 5 for the pseudo-code. In the code, an epoch (in the machine learning lingo) is an iteration over the entire data set of N images.

This approach usually leads to faster convergence because we update the network coefficients more often and are able to learn faster.

### Backpropagation

Backpropagation is the process of updating the neural network coefficients. This involves computing the gradient of multi-variable functions using the chain rule.

## Your Tasks

You task is to implement a parallel training process for our two layer neural network utilizing both CUDA and MPI. You should focus on parallelizing each iteration that processes a batch of images. The work should be parallelized among a certain number of MPI nodes (e.g., 4 MPI nodes), and the computation on each node should be accelerated by GPU kernels. You’ll need to carefully analyze the neural network training algorithm, the overhead of the MPI and CPU-GPU communication and decide the granularity of parallelism. For example, you may choose to fork (distribute data among MPI nodes) and join (collect, reduce on one MPI node) on every matrix operation, or fork at the beginning, keep intermediate results local at each node and join at the end, or a combination of both.

Possible optimizations may include:

- GPU accelerated matrix operation: The training process contains GEMM (General Matrix-matrix Multiplication) and element-wise matrix operations, which should be accelerated by GPU. The key to optimizing the GPU kernels involves a wise choice of blockSize and gridSize, memory coalescing, usage of shared memory, etc.
- Communication overhead: Both MPI and GPU computation introduce communication overheads. While some of the communication are required, some can be avoided.

### Outline of parallelization strategies for CUDA and MPI

- GEMM CUDA implementation: A GEMM operation can be expressed as D = a * A * B + b * C. Some BLAS libraries perform in-place computation that saves the result D in the memory space of C, as cuBLAS does. This is possible if C is no longer used after the computation.

We first outline the basic implementation (“Algorithm 1”). The simplest implementation is to have one thread for each element in the result D. Each thread reads the required row of A, column of B, and an element of C to compute the output element in D.

For Algorithm 2, try to think of a better implementation. For example, use a blocking algorithm and take advantage of the shared memory. One approach is to have each thread block (e.g., a block of 32*32 threads) compute a sub-matrix (of size 32*32) in the output matrix. Blocks from matrices A and B can be loaded into shared memory, with each thread reading one element of each sub-matrix.

Each thread then updates its entry in the sub-matrix of D. A loop is used to multiply all the required entries in A and B. For an even more optimized and a possible “A+ grade” implementation, please refer to section 5.

We intentionally do not explain the details of these algorithms. It’s for you to fill the blanks and perhaps come up with better ideas! - MPI implementation: For each batch of images, you should subdivide the input images into smaller image batches and use MPI communication methods to distribute the input data and Neural Network parameters among MPI nodes, and perform GEMM and other computations in parallel. The resulting network coefficient updates should be reduced and sent to all nodes.