Your task is to write a MapReduce program in Java to calculate the maximum of the weights of all outgoing edges for each node in the graph.
You should have already loaded two graph files into HDFS. Each file stores a list of edges as tab-separated values.
Each line represents a single edge consisting of three columns: (source node ID, target node ID, edge weight), each of which is separated by a tab (\t). Node IDs are positive integers, and weights are also positive integers. Edges are ordered randomly.
src    tgt    weight
15     127    2
15     134    3
15     599    3
511    330    51
511    694    79
230    15     11
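The per-node maximum can be checked against the sample edges with a few lines of plain Python before writing the Java job. This is only a sketch of the map and reduce logic, not the MapReduce program itself. The function names `map_edge` and `max_outgoing_weights` are hypothetical, and it assumes the data lines carry no header row.

```python
def map_edge(line):
    """Map step: turn one tab-separated edge line into (source node, weight)."""
    src, _tgt, weight = line.rstrip("\n").split("\t")
    return int(src), int(weight)


def max_outgoing_weights(lines):
    """Reduce step: keep the largest weight seen for each source node."""
    best = {}
    for line in lines:
        src, weight = map_edge(line)
        if src not in best or weight > best[src]:
            best[src] = weight
    return best


# Sample edges from the table above (assumed to have no header line).
sample_edges = [
    "15\t127\t2",
    "15\t134\t3",
    "15\t599\t3",
    "511\t330\t51",
    "511\t694\t79",
    "230\t15\t11",
]

for node, weight in sorted(max_outgoing_weights(sample_edges).items()):
    print(f"{node}\t{weight}")
```

In a real job the mapper would emit the source node ID as the key, so that the framework groups all outgoing weights of one node at a single reducer.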
Your task is to cascade the edge weights in graph1.tsv and graph2.tsv to node weights, and finally determine the accumulated node weights using Spark, in Scala. Assume that 80% of the edge weight comes from the source node and 20% from the target node. When loading the edges, parse the edge weights using the toInt method and before cascading, filter out (ignore) all edges whose edge weights equal 1. That is, only consider edges whose edge weights do not equal 1.
Consider the following example:
src    tgt    weight
1      2      40
2      3      100
1      3      60
3      4      1
1    80.0 = 0.8*40 + 0.8*60
2    88.0 = 0.2*40 + 0.8*100
3    32.0 = 0.2*100 + 0.2*60
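The arithmetic in this example can be reproduced with a short plain-Python sketch. It is not the Spark/Scala solution; it only illustrates the filter-then-cascade logic. The function name `cascade` and its parameter names are hypothetical, and edges are assumed to be already parsed into (source, target, weight) integer tuples.

```python
from collections import defaultdict


def cascade(edges, src_share=0.8, tgt_share=0.2):
    """Accumulate node weights from edge weights.

    Edges whose weight equals 1 are ignored. For every remaining edge,
    80% of the weight goes to the source node and 20% to the target node.
    """
    totals = defaultdict(float)
    for src, tgt, weight in edges:
        if weight == 1:
            continue  # filter step: skip edges with weight 1
        totals[src] += src_share * weight
        totals[tgt] += tgt_share * weight
    return dict(totals)


# Example edges from the text, as (src, tgt, weight) tuples.
example_edges = [(1, 2, 40), (2, 3, 100), (1, 3, 60), (3, 4, 1)]

for node, weight in sorted(cascade(example_edges).items()):
    print(f"{node}\t{weight}")
```

Node 4 does not appear in the result because its only edge has weight 1 and is filtered out before cascading.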
For each unique bigram, compute its average number of appearances per book. For the above example, the results will be:
I am         (342 + 211) / (90 + 10) = 5.53
very cool    (500 + 3210 + 9994) / (10 + 1000 + 3020) = 3.40049628
Output the 10 bigrams having the highest average number of appearances per book along with their corresponding averages, in tab-separated format, sorted in descending order. If multiple bigrams have the same average, order them alphabetically. For the example above, the output will be:
I am         5.53
very cool    3.40049628
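The averaging and ordering rules can be sanity-checked in plain Python before writing the Pig script. This sketch is not Pig Latin. The function name `top_bigrams` is hypothetical, and it assumes each input record has already been reduced to (bigram, occurrences, books); the record values below are taken from the sums in the example.

```python
from collections import defaultdict


def top_bigrams(records, limit=10):
    """Return up to `limit` (bigram, average) pairs.

    The average is total occurrences divided by total books per bigram.
    Results are sorted by average in descending order; ties are broken
    alphabetically by bigram.
    """
    occurrences = defaultdict(int)
    books = defaultdict(int)
    for bigram, count, book_count in records:
        occurrences[bigram] += count
        books[bigram] += book_count
    averages = {b: occurrences[b] / books[b] for b in occurrences}
    ranked = sorted(averages.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Records assumed from the example: (bigram, occurrences, books).
example_records = [
    ("I am", 342, 90),
    ("I am", 211, 10),
    ("very cool", 500, 10),
    ("very cool", 3210, 1000),
    ("very cool", 9994, 3020),
]

for bigram, average in top_bigrams(example_records):
    print(f"{bigram}\t{average}")
```

Sorting on the pair (negated average, bigram) handles both rules in one pass: higher averages come first, and equal averages fall back to alphabetical order.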
You will solve this problem by writing a Pig script on Amazon EC2 and saving the output.
Download task4data.zip. The data that you download has a readme.txt file which has details about how the data is stored. Using pandas, load the data as dataframes and find the number of unique movies and number of unique users in the dataset.