CUDA代写:CME213 PageRank




In this programming assignment you will use NVIDIA’s Compute Unified Device Architecture (CUDA) language to implement a basic string shift algorithm and the pagerank algorithm. In the process you will learn how to write general purpose GPU programming applications and consider some optimization techniques.
You must turn in your own copy of the assignment as described below. You may discuss the assignment with your peers, but you may not share answers. Please direct your questions about the assignment to the forum.


“C for CUDA” is a programming language subset and extension of the C programming language, and is commonly referenced as simply CUDA. Many languages support wrappers for CUDA, but in this class we will develop in C for CUDA and compile with nvcc.
The programmer creates a general purpose kernel to be run on a GPU, analogous to a function or method on a CPU. The compiler allows you to run C++ code on the CPU and the CUDA code on the device (GPU).
The first step you should take in any CUDA program is to move the data from the host memory to device memory. The function calls cudaMalloc and cudaMemcpy allocate and copy data, respectively. cudaMalloc will allocate a specified number of bytes in the device main memory and return a pointer to the memory block, similar to malloc in C. You should not try to dereference a pointer allocated with cudaMalloc from a host function.
The second step is to use cudaMemcpy from the CUDA API to transfer a block of memory from the host to the device. You can also use this function to copy memory from the device to the host. It takes four parameters, a pointer to the device memory, a pointer to the host memory, a size, and the direction to move data (cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost). We have already provided the code to copy the string from the host memory to the device memory space, and to copy it back after calling your shift kernel.
Kernels are launched in CUDA using the syntax kernelName. The arguments inside of the chevrons specify the number of thread blocks and thread per block to be launched for the kernel. The arguments to the kernel are passed by value like in normal C/C++ functions.
There are some read-only variables that all threads running on the device possess. The three most valuable to you for this assignment are blockIdx, blockDim, and threadIdx. Each of these variables contains fields x, y, and z. blockIdx contains the x, y, and z coordinates of the thread block where this thread is located. blockDim contains the dimensions of thread block where the thread resides. threadIdx contains the indices of this thread within the thread block.

Problem 1 String Shift

The purpose of this problem is to give you experience writing your first simple CUDA program. This program will help us examine how various factors can affect the achieved memory bandwidth. The program will take an input string and shift each character by constant amount. You will do this in three different ways:

  1. by shifting one byte at a time.
  2. by shifting 4 bytes at a time.
  3. and finally by shifting 8 bytes at a time.

You should be able to take the files we give you and type make main_q1 to build the executable. The executable takes 1 argument | the number of times to double the input file in size. For debugging it is recommended to use a value of 0. The executable will run, but since the CUDA code hasn’t been written yet (that’s your job), it will report errors and quit.

Problem 2 PageRank

PageRank was the link analysis algorithm responsible (in part) for the success of Google. It generates a score for every node in a graph by considering the number of in links and out links of a node. We are going to compute a simplified model of pagerank, which, in every iteration computes the pagerank score as a vector.
In the actual algorithm this operation is performed until the change between successive vectors is sufficiently small. In our case we will choose a fixed number of iterations to more easily compare performance across various numbers of nodes and edges. If you wish to learn more about the algorithm itself, check