Georgia Tech Assignment Help: CSE6232 Hadoop, Spark, Pig and Pandas


An assignment from Georgia Tech's Data and Visual Analytics course, made up of four small tasks to be implemented with Hadoop/Java, Spark/Scala, Pig/AWS, and Pandas/Python respectively.

Task 1: Analyzing a Large Graph with Hadoop/Java

Your task is to write a MapReduce program in Java to calculate the maximum of the weights of all outgoing edges for each node in the graph.
You should have already loaded two graph files into HDFS. Each file stores a list of edges as tab-separated-values.
Each line represents a single edge consisting of three columns: (source node ID, target node ID, edge weight), each of which is separated by a tab (\t). Node IDs are positive integers, and weights are also positive integers. Edges are ordered randomly.

src tgt weight
15  127 2
15  134 3
15  599 3
511 330 51
511 694 79
230 15  11
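
The map and reduce steps for Task 1 can be sketched in plain Python to check the expected result on the sample edges above. This is only an illustration of the logic, not the Hadoop/Java program the task asks for; the function names map_edge and reduce_max are hypothetical.

```python
def map_edge(line):
    """Mapper: emit (source node, weight) for one tab-separated edge line."""
    src, _tgt, weight = line.split("\t")
    return int(src), int(weight)


def reduce_max(pairs):
    """Reducer: keep the largest weight seen for each source node."""
    best = {}
    for src, weight in pairs:
        if src not in best or weight > best[src]:
            best[src] = weight
    return best


# The six sample edges from the table above, as tab-separated lines.
edges = [
    "15\t127\t2",
    "15\t134\t3",
    "15\t599\t3",
    "511\t330\t51",
    "511\t694\t79",
    "230\t15\t11",
]

max_out = reduce_max(map_edge(line) for line in edges)
print(max_out)  # {15: 3, 511: 79, 230: 11}
```

In a real MapReduce job the framework groups the mapper output by key before the reducer runs, so each reducer call would see all the weights for a single source node.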


Task 2: Analyzing a Large Graph with Spark/Scala

Your task is to cascade the edge weights in graph1.tsv and graph2.tsv to node weights, and finally determine the accumulated node weights using Spark, in Scala. Assume that 80% of the edge weight comes from the source node and 20% from the target node. When loading the edges, parse the edge weights using the toInt method and, before cascading, filter out (ignore) all edges whose edge weights equal 1. That is, only consider edges whose edge weights do not equal 1.
Consider the following example:

src tgt weight
1   2   40
2   3   100
1   3   60
3   4   1


1 80.0 = 0.8*40 + 0.8*60
2 88.0 = 0.2*40 + 0.8*100
3 32.0 = 0.2*100 + 0.2*60
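
The arithmetic in this example can be reproduced with a short plain-Python sketch. It is not the Spark/Scala solution the task requires; it only mirrors the filter-then-accumulate logic, and the function name node_weights is hypothetical.

```python
from collections import defaultdict


def node_weights(edges):
    """Accumulate 80% of each edge weight on the source and 20% on the target."""
    totals = defaultdict(float)
    for src, tgt, weight in edges:
        if weight == 1:
            continue  # edges with weight 1 are ignored, as the task states
        totals[src] += 0.8 * weight
        totals[tgt] += 0.2 * weight
    return dict(totals)


# The four edges from the example table: (src, tgt, weight).
example = [(1, 2, 40), (2, 3, 100), (1, 3, 60), (3, 4, 1)]
result = node_weights(example)

for node in sorted(result):
    print(f"{node}\t{result[node]:.1f}")
```

Node 4 does not appear in the result because its only edge (3 to 4) has weight 1 and is filtered out before cascading.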


Task 3: Analyzing Large Amount of Data with Pig on AWS

For each unique bigram, compute its average number of appearances per book. For the above example, the results will be:

I am (342 + 211) / (90 + 10) = 5.53
very cool (500 + 3210 + 9994) / (10 + 1000 + 3020) = 3.40049628

Output the 10 bigrams having the highest average number of appearances per book along with their corresponding averages, in tab-separated format, sorted in descending order. If multiple bigrams have the same average, order them alphabetically. For the example above, the output will be:

I am 5.53
very cool 3.40049628

You will solve this problem by writing a Pig script on Amazon EC2 and saving the output.
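
Before writing the Pig script, the averaging and ordering rules can be checked with a small plain-Python sketch. The row layout (bigram, occurrences, books) is an assumption, since the input example is not reproduced here; the numbers are taken from the two formulas above, and the function name top_bigrams is hypothetical.

```python
from collections import defaultdict


def top_bigrams(rows, k=10):
    """Return the k bigrams with the highest average appearances per book."""
    occurrences = defaultdict(int)
    books = defaultdict(int)
    for bigram, count, book_count in rows:
        occurrences[bigram] += count
        books[bigram] += book_count
    averages = [(b, occurrences[b] / books[b]) for b in occurrences]
    # Highest average first; ties are broken alphabetically by bigram.
    averages.sort(key=lambda item: (-item[1], item[0]))
    return averages[:k]


# Assumed rows: (bigram, occurrences, books), using the numbers quoted above.
rows = [
    ("I am", 342, 90),
    ("I am", 211, 10),
    ("very cool", 500, 10),
    ("very cool", 3210, 1000),
    ("very cool", 9994, 3020),
]

for bigram, average in top_bigrams(rows):
    print(f"{bigram}\t{average}")
```

In Pig the same steps would roughly correspond to a GROUP by bigram, a SUM of both columns, a division, and an ORDER on the average (descending) and then the bigram, followed by a LIMIT of 10.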

Task 4: Explore and Analyze data with Pandas

The data that you download includes a readme.txt file with details about how the data is stored. Using pandas, load the data as DataFrames and find the number of unique movies and the number of unique users in the dataset.
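
A minimal pandas sketch of the counting step follows. The real file names, separators, and column layout are described in readme.txt and are not reproduced here, so the sketch uses a small inline DataFrame with hypothetical column names (user_id, movie_id, rating) instead of loading the actual files.

```python
import pandas as pd

# Hypothetical stand-in for the ratings data; with the real files one would
# call pd.read_csv with the separator and column names given in readme.txt.
ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "movie_id": [10, 20, 10, 30],
        "rating": [4, 5, 3, 2],
    }
)

# nunique() counts the distinct values in a column.
n_movies = ratings["movie_id"].nunique()
n_users = ratings["user_id"].nunique()

print(n_movies, n_users)  # 3 3
```

Counting on the ratings table only covers movies and users that have at least one rating; if the dataset ships separate movie and user files, counting their rows may give different totals.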
Output format:




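Four entirely unrelated small tasks that still demand a fair amount of knowledge, just what you would expect from a Georgia Tech Master's assignment.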