C代写:COMP2129 PageRank

Introduction

用C实现PageRank算法,同时需要对大数据量节点的情况进行优化。

Task Description

In this assignment your task is to implement the PageRank algorithm in C using the power method described below and then optimise and parallelise your code to ensure peak performance is achieved.
The PageRank algorithm was developed in 1996 by Larry Page and Sergey Brin when they were graduate students at Stanford University. Google and other search engines compare words in search phrases to words in web pages and use ranking algorithms to determine the most relevant results.
PageRank assigns a score to a set of web pages that indicates their importance. The underlying idea behind PageRank is to model a user who is clicking on web pages and following links from one page to another. In this framework, important pages are those which have incoming links from many other pages, or have incoming links from other pages with a high PageRank score, or both.
You are encouraged to ask questions on Ed using the assignments category. As with any assignment, make sure that your work is your own, and you do not share your code or solutions with other students.

Working on your assignment

You can work on this assignment on your own computer or the lab machines. Students can also take advantage of the online code editor on Ed. Simply navigate to the assignment page and you are able to run, edit and submit code from within your browser. We recommend that you use Safari or Chrome.
It is important that you continually back up your assignment files onto your own machine, flash drives, external hard drives and cloud storage providers. You are encouraged to submit your assignment while you are in the process of completing it. By submitting you will obtain some feedback of your progress.

The PageRank algorithm

PageRank is an iterative algorithm that is repeated until a stopping criteria is met. The last iteration gives us the result of the search, which is a score per web page. A high score indicates a very relevant web page whereas a low score indicates a not so relevant web page for a search. Sorting the web pages by their scores in descending order gives us the order for the result list of a search query.

Example

Our example has four web pages: S = {A, B, C, D}. In this example, d = 0.85 and c = 0.005.
The referencing structure of the web pages A, B, C, and D is given in the graph below.
Each node represents a web page.
Edges in the graph indicate that the source of the edge is linking to the destination of the edge.
At this point the algorithm terminates and the values of P(4) are the final PageRank scores for each of the pages. If this were a query, the resulting ranks would be A, C, B, D where we arbitrarily rank page A before page C since they have the same score.

Implementation details

Your program must implement the PageRank algorithm using the power method described above.
The header file pagerank.h contains an init function that processes the list of pages and edges from standard input and stores a linked list of pages and corresponding inlinks in the following structs

1
2
3
4
5
6
7
8
9
10
11
struct page {
char* name; // name of this page
size_t index; // index of this page
size_t noutlinks; // number of outlinks from this page
node* inlinks; // linked list of pages with inlinks to this page
};

struct node {
page* page; // pointer to the page data structure
node* next; // pointer to the next page in the list
};

Your program must produce no errors when built on the lab machines and ed with the command:

$ clang -O1 -std=gnu11 -march=native -lm -pthread pagerank.c -o pagerank

Your program output must match the exact output format shown by the prototype and on Ed. You are encouraged to submit your assignment while you are working on it, so you can obtain some feedback.

Program input

<dampening factor>
<number of pages>
<name of web page>
...
<number of edges>
<source page> <destination page>
...

The first line specifies the dampening factor to be used by the program. The second line defines the number of pages to be declared, followed by a list of page names with one name per line. Then, the number of edges in the graph is specified, followed by a list of edges with one edge per line.
The init function reads the input and terminates outputting an error message if the input is invalid.
The input is considered invalid if a page name exceeds the maximum length, the dampening factor is not in the range 0 < d < 1, a page is declared twice, or an edge is defined to a nonexistent page.

Example

0.85
4
A
B
C
D
5
D A
D B
D C
B A
B C

This example has four web pages with names: A, B, C, and D
There are five edges: D - A, D - B, D - C, B - A, and B - C
The output is the list of scores for each web page, for example:

A 0.30778524
B 0.21606622
C 0.30778524
D 0.16836329

The scores are ordered in the same order they were defined in the input.
For printing the score, use the format string “%s %.8lf\n”
For all test cases set e = 0.005, use the constant defined as EPSILON