C++ Programming Assignment: CS516 Parallelizing Single Source Shortest Path on GPU

Using the CUDA library, implement the Single Source Shortest Path algorithm; the code must be able to run on a GPU.

Requirement

In this project, you will parallelize the single source shortest path (SSSP) algorithm on GPUs. An SSSP algorithm finds the shortest paths from a source vertex to the other vertices in a graph. It is an important and interesting problem that admits many potential algorithmic choices. SSSP finds uses in many applications including route planning, AI, 3D modeling, and social networks [2].

In this project, both undergraduate and graduate students need to implement two versions of parallel SSSP in CUDA. Every graduate student also needs to implement a third version, which is optional for undergraduate students.

You need to submit both code and a report. The code comprises 60% and the report comprises 40% of the project 2 grade. We will provide real-world graphs. You need to evaluate your implementations on each of the graphs and report the results for the different stages of computation.

You will receive 0 credit if we cannot compile/run your code on ilab machines or if your code fails correctness tests. All grading will be done on ilab machines.

The interface of your implemented code needs to meet the requirements defined in Section 3.2. We have provided a sample code package for you to start with. The following three sections describe the three different versions of SSSP.

Bellman-Ford Algorithm

Before describing the Bellman-Ford algorithm, we will describe the input file format and some necessary data structures for the shortest path algorithm.

The input file is a plain text file that contains only edges. Each line in the input file corresponds to an edge in the graph and has at least two vertex indices separated by a space or tab. The first number specifies the start point of the edge and the second specifies the end point (the node that the edge points to). The third number (optional) contains the weight of the edge; if it is not specified, you need to set the edge weight to 1 by default.
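As a minimal sketch of reading this format (the Edge record and function name are illustrative, not part of the provided sample code):

#include <cstdio>
#include <vector>

struct Edge { int src; int dst; int weight; };   // hypothetical edge record

// Reads "src dst [weight]" lines; a missing weight defaults to 1.
std::vector<Edge> read_edge_list(const char *path) {
    std::vector<Edge> edges;
    FILE *f = fopen(path, "r");
    if (!f) return edges;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        Edge e;
        int n = sscanf(line, "%d %d %d", &e.src, &e.dst, &e.weight);
        if (n < 2) continue;        // skip empty or malformed lines
        if (n == 2) e.weight = 1;   // default weight when none is given
        edges.push_back(e);
    }
    fclose(f);
    return edges;
}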

An example graph and edge list are provided in Figure 1 (a) and (b) respectively. The source node by default is always the node with index 0 in the graph. Note that the edge list in the input file is not necessarily sorted by starting node index or destination node index. You might need to sort the list of edges yourself after reading in the edge list. A potential data structure you can use (after sorting the edge list by starting node index) is presented in Figure 1 (b), in which the outgoing edges from a node v are represented as the range of edges from index A[v] to A[v+1]-1 in the edge list E. You can also sort the edges by destination node index and represent contiguous segments of incoming edges, each of which corresponds to edges pointing to the same node. This is similar to the compressed sparse row (CSR) format for sparse matrices.
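A minimal sketch of building the offset array A after sorting the edges by starting node (the names are illustrative, not the ones used in the sample package):

#include <algorithm>
#include <vector>

struct Edge { int src; int dst; int weight; };   // repeated here so the sketch is self-contained

// Build A such that the outgoing edges of node v occupy E[A[v]] .. E[A[v+1]-1].
std::vector<int> build_offsets(std::vector<Edge> &E, int num_nodes) {
    std::sort(E.begin(), E.end(),
              [](const Edge &a, const Edge &b) { return a.src < b.src; });
    std::vector<int> A(num_nodes + 1, 0);
    for (const Edge &e : E) A[e.src + 1]++;                 // out-degree of each node
    for (int v = 0; v < num_nodes; v++) A[v + 1] += A[v];   // prefix sum gives start offsets
    return A;
}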

The sequential Bellman-Ford algorithm checks all edges at every iteration and updates a node if the current estimate of its distance from the source node can be reduced. If no negative cycle exists, the number of iterations needed is at most the number of vertices.
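A minimal sequential sketch of this relaxation loop (the Edge record and helper names are illustrative, not part of the provided package):

#include <climits>
#include <vector>

struct Edge { int src; int dst; int weight; };   // hypothetical edge record
const int INF = INT_MAX / 2;                     // "unreached" sentinel that cannot overflow

std::vector<int> bellman_ford(const std::vector<Edge> &E, int num_nodes, int source) {
    std::vector<int> dist(num_nodes, INF);
    dist[source] = 0;
    bool changed = true;
    // At most num_nodes - 1 rounds are needed when no negative cycle exists.
    for (int iter = 0; iter < num_nodes - 1 && changed; iter++) {
        changed = false;
        for (const Edge &e : E) {
            if (dist[e.src] != INF && dist[e.src] + e.weight < dist[e.dst]) {
                dist[e.dst] = dist[e.src] + e.weight;   // relax the edge
                changed = true;
            }
        }
    }
    return dist;
}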

The sequential Bellman-Ford algorithm is easy to implement, but not as efficient as Dijkstra's algorithm. However, it is amenable to parallelization. The first two required parallel implementations are both based on the Bellman-Ford algorithm. An example for the Bellman-Ford algorithm is presented in Figure 1 (c). In this example, it takes four iterations before the distance value of every node in the graph stops changing. The node values at the beginning of every iteration are provided in Figure 1.

In the parallel Bellman-Ford algorithm, we exploit the parallelism of edge processing at every iteration. For example, in Figure 1 (c), each of the seven edges is checked at every iteration; we can distribute these seven edges evenly to different processors so that each processor is responsible for the same number of edges.

The previous node list array distance_prev and the current node list array distance_cur in Listing 2 represent, respectively, the result from the last iteration and the updated result at the end of the current iteration. Every iteration here corresponds to one kernel invocation (Line 19 in Listing 2). This is an out-of-core implementation, which means that once these nodes are updated, their distances are not immediately visible to other co-running threads in the same kernel invocation. We swap the pointers to these two arrays in Line 21 of Listing 2 before calling the kernel in the next iteration.
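Listing 2 itself is part of the sample code package; purely as a sketch of the same out-of-core structure (all names here are illustrative, not the ones used in Listing 2), the kernel and the host-side pointer swap might look like the following:

#include <climits>

#define INF (INT_MAX / 2)   // "unreached" sentinel that cannot overflow when a weight is added

// One Bellman-Ford iteration (sketch): each thread relaxes a strided subset of edges.
// Reads come from dist_prev, writes go to dist_cur, so updates only become visible
// to other threads in the next kernel invocation (out-of-core / double buffering).
__global__ void relax_edges(const int *src, const int *dst, const int *w, int num_edges,
                            const int *dist_prev, int *dist_cur, int *any_change) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int e = tid; e < num_edges; e += stride) {
        int du = dist_prev[src[e]];
        if (du == INF) continue;                 // source end of this edge not reached yet
        int cand = du + w[e];
        if (cand < dist_prev[dst[e]]) {
            atomicMin(&dist_cur[dst[e]], cand);  // several edges may target the same node
            *any_change = 1;                     // benign race: any nonzero value means "iterate again"
        }
    }
}

// Host-side loop (sketch): before each iteration dist_cur starts as a device-to-device
// copy of dist_prev, so nodes that are not relaxed keep their old distance; after the
// kernel returns, the two device pointers are swapped before the next invocation:
//     int *tmp = d_dist_prev; d_dist_prev = d_dist_cur; d_dist_cur = tmp;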

What to Report?

  • Execution configuration: you need to report results for the following five (block size, block number) configurations: (256, 8), (384, 5), (512, 4), (768, 2), (1024, 2). The edge list will be evenly divided among different thread warps.
  • Edge list organization: you need to report the running time for three cases: when the edges are in the original input order, when the edge list is sorted by source node, and when the edge list is sorted by destination node.
  • In-core configuration: you need to implement the in-core method and report the running time for all graphs. Note that in Listing 2, we used two arrays for the node list: one represents the state before the iteration and the other represents the state after the iteration. You will need to use one single array "distance" and remove distance_prev and distance_cur, so that the two original lines in Listing 2 that read from distance_prev and write to distance_cur both operate on the single "distance" array (a sketch of this variant is given after the atomic-functions link below).

You need to make other changes correspondingly; for example, the parameters to the kernel function need to be updated, the code that swaps the two pointers needs to be removed, and the code that checks whether any node was updated needs to be updated.

You also need to report the running time for every graph in the same three cases: when the edges are in the original input order, when the edge list is sorted by starting node, and when the edge list is sorted by destination node.

  • Using shared memory: you will need to implement a segment-scan type method and report the running time. You only need to run the segment-scan method on the edge list sorted by destination index, in which the incoming edges of the same node can be processed by consecutive threads (or iterations). You only need to use the out-of-core processing mechanism. Since finding the minimum of a segment of values is similar to computing the sum of a segment of values, you are required to adapt the segment scan code from project 1 to find the minimum distance of every segment before you atomically update the node distance using atomicMin (a warp-level min-reduction building block is sketched after the link below).

The CUDA atomic intrinsic functions are provided here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
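As a rough illustration of the in-core variant described above (a sketch with hypothetical names, not the actual solution kernel), the single distance array is read and updated in the same kernel invocation through atomicMin, so improvements become visible to co-running threads immediately:

#include <climits>

#define INF (INT_MAX / 2)   // "unreached" sentinel that cannot overflow when a weight is added

// In-core sketch: one distance array, updated in place with atomicMin.
__global__ void relax_edges_incore(const int *src, const int *dst, const int *w,
                                   int num_edges, int *distance, int *any_change) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int e = tid; e < num_edges; e += stride) {
        int du = distance[src[e]];
        if (du == INF) continue;                        // source end not reached yet
        int cand = du + w[e];
        int old  = atomicMin(&distance[dst[e]], cand);  // returns the previous value
        if (cand < old) *any_change = 1;                // the distance actually improved
    }
}

For the shared-memory item, the required code follows the segmented scan from project 1; purely as an illustration of the min-reduction building block (not the required shared-memory segmented version), a warp-level minimum over 32 candidate values can be computed with shuffle intrinsics before issuing a single atomicMin per segment:

// Warp-level minimum (illustrative building block only).
__device__ int warp_min(int val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val = min(val, __shfl_down_sync(0xffffffffu, val, offset));
    return val;   // lane 0 ends up holding the minimum of all 32 lanes
}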

Work Efficient Algorithm

We can improve Implementation 1 by reducing the number of edges that need to be processed at each iteration.
For any directed edge, if the starting point of the edge (the source node) did not change in the last iteration, it will not affect the pointed-to end of the edge (the destination node on the other end). Thus, we can skip this edge when performing the "if (distance[u] + w < distance[v]) …" comparison and update.

In this implementation, you will need to filter out the edges whose starting point did not change in the last iteration and only keep the edges that might lead to a node distance update. We refer to the edges that need to be checked as to-process edges. To achieve this, you can use a parallel prefix sum operation.

Recall that the (exclusive) prefix sum is an operation that scans a sequence of numbers x0, x1, x2, …, xn and obtains a second sequence of numbers y0, y1, …, yn such that y0 = 0, y1 = x0, y2 = x0 + x1, and so on. Similar to the summation operation, the prefix sum operation can be performed in a logarithmic number of time steps [1]. Assuming xi stores the number of to-process edges identified by warp i and T stores all the to-process edges, yi gives the total number of to-process edges gathered by warp0, warp1, …, warp(i-1), and thus gives the index of the first to-process edge found by warp i in the list T. That is, the first to-process edge detected by warp i will be stored at T[yi]. An example is given in Figure 2.
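A minimal sequential reference for this operation (illustrative only; the project asks for the parallel, logarithmic-step version):

#include <vector>

// Exclusive prefix sum: y[0] = 0, y[i] = x[0] + ... + x[i-1].
std::vector<int> exclusive_scan(const std::vector<int> &x) {
    std::vector<int> y(x.size(), 0);
    for (size_t i = 1; i < x.size(); i++)
        y[i] = y[i - 1] + x[i - 1];
    return y;
}
// Example: x = {3, 0, 2, 5} gives y = {0, 3, 3, 5}; if x[i] is the number of
// to-process edges found by warp i, warp i writes its first to-process edge to T[y[i]].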

You may implement the prefix sum operation for gathering to-process edges in three steps (a sketch of all three steps is given after the intrinsics link below). In this project, we assume that the total number of threads will not be greater than 2048. We also assume that one thread warp will gather to-process edges from a contiguous range of edges in the edge list L, such that warp i will process the range from index (L.length/warp_num) * i to index (L.length/warp_num) * (i+1) - 1 (inclusive).

  1. Every warp counts the number of edges that need to be processed in its assigned edge range. x[warpid] specifies the number of to-process edges identified by warp warpid. Specifically, you can use warp voting functions for fast identification of the to-process edges for threads in the same warp. The "unsigned int __ballot(int predicate)" function in CUDA returns a value whose N-th bit is set if the predicate is nonzero for the thread in lane N of the warp. If you combine the __ballot function with the __popc function, you can count the number of threads in a warp whose predicate is nonzero. In our example, you can use "mask = __ballot(current_edge.src == changed); count = __popc(mask);" to get count, the number of lanes in the warp that have seen the predicate (current_edge.src == changed) evaluate to true. More information about the ballot function can be found here: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
  2. Since we assume the total number of threads is 2048, the number of warps is 64. In this step, we launch a kernel that has the same number of threads as the number of warps in the last step (and the block number is 1). At this step, we perform a block level parallel prefix sum to get the offset of every warp’s first to-process edge in the final to-process edge list T .
  3. We let every warp copy the indices of its identified to-process edges into the array T. If a warp's offset is y[i] and it has detected k to-process edges (not necessarily contiguous in L), it writes their indices into T[y[i]] through T[y[i]+k-1]. Specifically, you can again use the __ballot and the integer intrinsic functions to help you determine the position of each to-process edge in the final list T.

More details about the integer intrinsic functions can be found here: http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html
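As a rough sketch of the three steps above (all names are hypothetical; changed[] is assumed to be a per-node flag set when a node's distance changed in the previous iteration, the launch is assumed to total 2048 threads, e.g. <<<8, 256>>>, and the newer __ballot_sync form is used in place of the older __ballot intrinsic mentioned above):

#define WARP_SIZE   32
#define NUM_THREADS 2048                         // assumed total thread count from the handout
#define NUM_WARPS   (NUM_THREADS / WARP_SIZE)    // 64

// Step 1: every warp counts the to-process edges in its contiguous edge range.
__global__ void count_to_process(const int *src, const int *changed, int num_edges,
                                 int *x /* one entry per warp */) {
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int warpid   = tid / WARP_SIZE;
    int lane     = tid % WARP_SIZE;
    int per_warp = (num_edges + NUM_WARPS - 1) / NUM_WARPS;
    int begin    = warpid * per_warp;
    int end      = min(begin + per_warp, num_edges);

    int count = 0;
    for (int base = begin; base < end; base += WARP_SIZE) {
        int e = base + lane;
        int pred = (e < end) && changed[src[e]];
        unsigned mask = __ballot_sync(0xffffffffu, pred);   // one bit per lane whose edge is to-process
        count += __popc(mask);
    }
    if (lane == 0) x[warpid] = count;
}

// Step 2: block-level exclusive prefix sum over the NUM_WARPS counts
// (launched with a single block of NUM_WARPS threads).
__global__ void scan_offsets(const int *x, int *y) {
    __shared__ int tmp[NUM_WARPS];
    int tid = threadIdx.x;
    tmp[tid] = x[tid];
    __syncthreads();
    for (int offset = 1; offset < NUM_WARPS; offset <<= 1) {   // Hillis-Steele inclusive scan
        int val = (tid >= offset) ? tmp[tid - offset] : 0;
        __syncthreads();
        tmp[tid] += val;
        __syncthreads();
    }
    y[tid] = tmp[tid] - x[tid];                                // convert to an exclusive scan
}

// Step 3: every warp writes the indices of its to-process edges into T,
// starting at its offset y[warpid].
__global__ void gather_to_process(const int *src, const int *changed, int num_edges,
                                  const int *y, int *T) {
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int warpid   = tid / WARP_SIZE;
    int lane     = tid % WARP_SIZE;
    int per_warp = (num_edges + NUM_WARPS - 1) / NUM_WARPS;
    int begin    = warpid * per_warp;
    int end      = min(begin + per_warp, num_edges);

    int out = y[warpid];                        // warp's running write position in T
    unsigned below_me = (1u << lane) - 1u;      // mask of lanes with a lower lane index
    for (int base = begin; base < end; base += WARP_SIZE) {
        int e = base + lane;
        int pred = (e < end) && changed[src[e]];
        unsigned mask = __ballot_sync(0xffffffffu, pred);
        if (pred)
            T[out + __popc(mask & below_me)] = e;   // position = to-process lanes below this one
        out += __popc(mask);                        // same value in every lane of the warp
    }
}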

Note that we will focus on improving the performance of the computation stage, not the performance of the filtering stage. You can implement the filtering stage in different ways as long as the relative ordering of the to-process edges is the same as it is in the complete list L.

After filtering out the edges that do not need to be considered, you will evenly distribute the to-process edges among different thread warps and perform the same iterative process as described in Implementation 1.

What to Report?

  • Use the same execution configuration as specified in Implementation 1.
  • You need to report the running time for the computation stage and the filtering stage separately.
  • You need to report the results for the edge list sorted by starting point and sorted by destination, for both the out-of-core and the in-core processing kernels.
  • You need to perform data analysis, summarize what you have observed and discuss (reason about) the differences in various configurations.

A Further Optimization (Optional to Undergraduate Students)

In the third implementation, you will design and implement your own optimization strategy. The optimization should be a locality optimization, a priority queue based optimization, or a combination of both. You will need to specify clearly which optimization you have applied. Your implementation does not necessarily need to run faster than the first two implementations. However, you will need to describe why you think your proposed optimization might help and provide some intuition, i.e., by using a small example.

Similar to the previous two implementations, you only have to optimize the time for the computation stage, not the filtering stage or other overhead for the optimization approach. You will still have to report the filtering time or the overhead.

Here we provide some heuristics on locality optimization and priority queue based optimization that you can use.

Locality Optimization

You can use a graph partitioning method to divide one kernel invocation into multiple kernel invocations to improve last level cache performance. The GPUs we have on the ilab cluster (in Hill 120) each have a 256KB last level cache, and the entire data set cannot fit into the cache at one time. Therefore, we can divide one kernel (if there is sufficient workload) into multiple kernels such that each kernel's (frequently-used) data set fits into the cache.

To partition the workload, we partition the edges of the graph. We will provide a balanced graph edge partitioning library for you; details will be released later on Piazza. Note that the provided library only performs balanced partitioning. If you want to partition the tasks into variable-size components, you might need to perform hierarchical partitioning, for instance, using bisection one step at a time and only splitting the individual sub-partitions that you want to split.

Note that this optimization may only apply to Implementation 1, since only in Implementation 1 are the edges processed in every original kernel invocation the same.

Priority Queue

In our first two implementations in this project, we treat every edge as having the same priority. Dijkstra's algorithm and the near-far tile method [2] give higher priority to edges that potentially point to smaller-distance nodes. The basic idea is to maintain a priority queue. At each step, we take the first n tasks from the priority queue and process them, which in turn generates new tasks to be inserted into the priority queue (and might require some reordering or deletion of tasks in the queue).

If you choose the priority queue based approach, you will need to try at least one heuristic for the priority function. The near-far tile method uses the distance as the heuristic; that is, the edges in the work list that come from nodes whose distance is bounded by a threshold have higher priority.
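Purely as an illustration of this distance-bounded heuristic (host-side, with hypothetical names; the actual near-far tile method in [2] is more elaborate), a work list could be split into a high-priority "near" bucket and a deferred "far" bucket by a distance threshold:

#include <vector>

struct Edge { int src; int dst; int weight; };   // same hypothetical record as in the earlier sketches

// Edges whose source-node distance is below the threshold go to the "near" bucket
// and are processed first; the rest are deferred to the "far" bucket.
void near_far_split(const std::vector<Edge> &worklist, const std::vector<int> &dist,
                    int threshold, std::vector<Edge> &near, std::vector<Edge> &far) {
    near.clear();
    far.clear();
    for (const Edge &e : worklist) {
        if (dist[e.src] < threshold) near.push_back(e);
        else                         far.push_back(e);
    }
}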

Another potential heuristic you can try is to give the highest priority to the nodes that have the fewest un-updated predecessor nodes. Specifically, if all of a node's predecessor nodes have been updated in previous iteration(s), this node has the highest priority and its incoming edges are placed at the beginning of the priority queue. Next, we place the edges that point to nodes with only one un-updated (immediate) predecessor node; here an immediate predecessor node is a node that points directly to the node of interest. Then come the edges that point to nodes with two un-updated (immediate) predecessor nodes, and so on. Note that when ordering the edges based on this priority, the list should be the actual work-efficient work list; that is, we do not consider any edge whose starting point has not changed.

In this exploration, you can maintain the priority queue (insertion and deletion) on the GPU or the CPU. As mentioned before, our focus in this project is not on reducing the cost of a parallel priority queue implementation. Rather, we would like to explore heuristics that lead to a good parallel SSSP implementation. Once you discover a good heuristic, you can try to improve the parallelization of the priority queue on GPUs, which is not required in this project (but could be a good topic for your project 3).