Georgia Tech Assignment Help: CSE6232 Hadoop, Spark, Pig and Pandas

Introduction

This is the Data and Visual Analytics assignment from Georgia Tech. It consists of four small tasks, implemented with Hadoop/Java, Spark/Scala, Pig/AWS, and Pandas/Python respectively.
Although there are tutorials to follow, a relatively niche language like Scala still takes time to learn.
On top of that, setting up Hadoop and Spark, and running Pig on AWS, also take a significant amount of time.

Task 1: Analyzing a Large Graph with Hadoop/Java

Your task is to write a MapReduce program in Java to calculate the maximum of the weights of all outgoing edges for each node in the graph.
You should have already loaded the two graph files into HDFS. Each file stores a list of edges as tab-separated values.
Each line represents a single edge with three columns separated by tabs (\t): source node ID, target node ID, and edge weight. Node IDs and edge weights are positive integers. The edges appear in no particular order.

src tgt weight
15  127 2
15  134 3
15  599 3
511 330 51
511 694 79
230 15  11

The first task is to write a Hadoop MapReduce job that computes, for each node in the graph, the maximum weight among its outgoing edges: the map phase splits each input line into a (source node, weight) key/value pair, and the reduce phase takes the maximum per key. That is all there is to it.
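
The graded solution has to be a Java MapReduce job, but the map/reduce logic is easy to prototype first. Below is a minimal Python sketch (illustration only, not the required Java) that runs the same idea in memory over the sample edges above: the mapper emits (source node, weight) pairs and the reducer keeps the maximum per key.

# Illustration only: the real solution must be Java MapReduce; this sketch
# just mirrors the map/reduce logic on the sample edges shown above.
from collections import defaultdict

def map_edge(line):
    # one edge per line: "src \t tgt \t weight" -> emit (src, weight)
    src, tgt, weight = line.strip().split("\t")
    return src, int(weight)

def reduce_max(pairs):
    # keep the largest outgoing edge weight seen for each source node
    best = defaultdict(int)
    for src, w in pairs:
        best[src] = max(best[src], w)
    return best

if __name__ == "__main__":
    sample = ["15\t127\t2", "15\t134\t3", "15\t599\t3",
              "511\t330\t51", "511\t694\t79", "230\t15\t11"]
    result = reduce_max(map_edge(l) for l in sample)
    for src in sorted(result, key=int):
        print(f"{src}\t{result[src]}")   # 15 -> 3, 230 -> 11, 511 -> 79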

Task 2: Analyzing a Large Graph with Spark/Scala

Your task is to cascade the edge weights in graph1.tsv and graph2.tsv down to node weights, and determine the accumulated weight of each node using Spark, in Scala. Assume that 80% of an edge's weight goes to its source node and 20% to its target node. When loading the edges, parse the edge weights with the toInt method, and before cascading, filter out (ignore) all edges whose weight equals 1. That is, only consider edges whose edge weight does not equal 1.
Consider the following example:
Input:

src tgt weight
1   2   40
2   3   100
1   3   60
3   4   1

Output:

1 80.0 = 0.8*40 + 0.8*60
2 88.0 = 0.2*40 + 0.8*100
3 32.0 = 0.2*100 + 0.2*60

The second task is to compute a weight for each node in the graph with Spark; it only takes chaining a few transformations on Spark RDDs. The annoying part is that it has to be written in Scala.
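
The graded solution must be written in Scala, but to make the cascading rule concrete, here is a rough PySpark sketch of the same RDD pipeline. The file paths and the assumption that both graphs use the tab-separated (src, tgt, weight) format from Task 1 are illustrative only.

# Illustration only (PySpark instead of the required Scala): load the edges,
# drop those with weight 1, split each remaining weight 80/20 between the
# source and target nodes, and sum the shares per node.
from pyspark import SparkContext

sc = SparkContext(appName="cascade-node-weights")

edges = (sc.textFile("graph1.tsv,graph2.tsv")            # comma-separated paths
           .map(lambda line: line.split("\t"))
           .map(lambda c: (c[0], c[1], int(c[2])))        # (src, tgt, weight)
           .filter(lambda e: e[2] != 1))                  # ignore weight-1 edges

node_weights = (edges
    .flatMap(lambda e: [(e[0], 0.8 * e[2]),               # 80% to the source
                        (e[1], 0.2 * e[2])])              # 20% to the target
    .reduceByKey(lambda a, b: a + b))                      # accumulate per node

for node, w in sorted(node_weights.collect()):
    print(node, w)

sc.stop()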

Task 3: Analyzing Large Amount of Data with Pig on AWS

For each unique bigram, compute its average number of appearances per book. For the above example, the results will be:

I am (342 + 211) / (90 + 10) = 5.53
very cool (500 + 3210 + 9994) / (10 + 1000 + 3020) = 3.40049628

Output the 10 bigrams having the highest average number of appearances per book along with their corresponding averages, in tab-separated format, sorted in descending order. If multiple bigrams have the same average, order them alphabetically. For the example above, the output will be:

I am 5.53
very cool 3.40049628

You will solve this problem by writing a Pig script on Amazon EC2 and saving the output.
The third task is to run a simple data-processing job with Pig on AWS; the dataset is already provided. This one is pure grunt work: all kinds of AWS setup (Account/EC2/S3/EMR).
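
The actual submission is a Pig script run on AWS, but the group-sum-divide logic is easy to prototype locally. The pandas sketch below uses hypothetical column names (bigram, occurrences, books) and hard-codes rows consistent with the worked example above, so treat it purely as an illustration of the aggregation.

# Illustration only: the real solution is a Pig script on AWS; the column
# names are assumptions and the rows reproduce the worked example above.
import pandas as pd

rows = [("I am", 342, 90), ("I am", 211, 10),
        ("very cool", 500, 10), ("very cool", 3210, 1000),
        ("very cool", 9994, 3020)]
df = pd.DataFrame(rows, columns=["bigram", "occurrences", "books"])

# total occurrences and total books per bigram, then the per-book average
totals = df.groupby("bigram", as_index=False)[["occurrences", "books"]].sum()
totals["avg_per_book"] = totals["occurrences"] / totals["books"]

# top 10 by average; ties broken alphabetically by bigram
top10 = (totals.sort_values(["avg_per_book", "bigram"], ascending=[False, True])
               .head(10))
print(top10[["bigram", "avg_per_book"]].to_string(index=False))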

Task 4: Explore and Analyze data with Pandas

Download task4data.zip. The archive includes a readme.txt file with details about how the data is stored. Using pandas, load the data as dataframes and find the number of unique movies and the number of unique users in the dataset.
Output format:

Number_of_unique_movies
Number_of_unique_users

The fourth task is data processing with Pandas, which provides its own R-like data structures and a SQL-like way of thinking about the operations.
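
A minimal pandas sketch for this task, assuming the archive contains a single ratings table with movie and user ID columns; the real file names and column names are whatever the readme.txt inside task4data.zip says.

# Minimal sketch; "ratings.csv", "movie_id" and "user_id" are assumptions,
# the real layout is described in the readme.txt inside task4data.zip.
import pandas as pd

ratings = pd.read_csv("ratings.csv")

num_unique_movies = ratings["movie_id"].nunique()   # count of distinct movies
num_unique_users = ratings["user_id"].nunique()     # count of distinct users

print(num_unique_movies)
print(num_unique_users)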

Summary

Four completely unrelated mini-assignments that still require quite a range of knowledge; just what you would expect from a Georgia Tech Master's assignment.