Hadoop Assignment Help: CSC555 Mining Big Data 2

On the AWS environment you set up previously, use tools such as Hadoop, Hive, Pig, and Mahout to process big data.

Requirement

You will write various queries using Hive, Pig, and Hadoop streaming. The schema is available below, but remember that the schema should specify the delimiter.

The data is available at (note that the data is |-separated, not comma-separated):

Please note what instance and what cluster you are using (you can reuse your existing cluster for most of the questions).

Please be sure to submit all code (Pig, Python, and Hive). You should also submit the command lines you use and a screenshot of a completed run (just the last page, do not worry about capturing the whole output). You can use the time command to record the execution time of anything you run.

I highly recommend creating a small sample input and testing your code with it, e.g., by running head lineorder.tbl > lineorder.tbl.sample (you can use head -n 100 to get only the first 100 lines).

Part 1: Data Transformation

Using Scale4 data, perform the following data processing.

  • A. Transform the lineorder.tbl table into a CSV (comma-separated) file. Use Hive, MapReduce with Hadoop Streaming, and Pig (i.e., 3 different solutions); a streaming-mapper sketch follows the query below.
  • B. Extract five of the numeric columns, for the rows where lo_discount is between 4 and 6, into a space-separated text file (for K-Means clustering later). Use Hive and Pig (2 different solutions). (NOTE: you do not need to use your code to identify which columns are numeric; just go by what the data types say. You should manually pick any 5 columns that contain only numbers.)
  • C. Create a pre-join (i.e., a new data file) that corresponds to the query below. You can think of it as a materialized view. What is the size of the new file? Use Hive and Pig (2 different solutions).
SELECT lo_partkey, lo_suppkey, lo_discount, d_year, lo_revenue
FROM lineorder, dwdate
WHERE lo_orderdate = d_datekey;
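
For 1-A, the Hadoop Streaming solution only needs a mapper that rewrites the delimiter; a minimal sketch is shown below (assuming the input is |-separated as noted above, and run as a map-only job, e.g. with -numReduceTasks 0). Whether each record ends with a trailing '|' depends on how the .tbl files were generated, so the sketch strips one defensively.

#!/usr/bin/env python
# mapper.py -- convert |-separated lineorder records into comma-separated output.
import sys

for line in sys.stdin:
    line = line.rstrip('\n')
    if not line:
        continue                      # skip blank lines
    # dbgen-style .tbl files often end each record with a trailing '|';
    # strip it so the CSV does not gain an empty last column.
    line = line.rstrip('|')
    print(','.join(line.split('|')))

The Hive and Pig solutions amount to the same conversion: load the table with '|' as the field delimiter (FIELDS TERMINATED BY '|' in Hive, PigStorage('|') in Pig) and write it back out with ',' as the delimiter.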

Part 2: Querying

All queries from the SSBM benchmark are available here.

Using Scale4 data, perform the following data processing, and don't forget to time your results.

  • A. Run SSBM queries 2.2, 3.2 and 4.2 using Hive only.
  • B. For this part, use Hive and Pig (two different solutions) to run Q2.1 using what you have created in 1-C (i.e., use PreJoin1 instead of the lineorder and dwdate tables in the FROM clause). You will need to rewrite the query accordingly, e.g., something like:
select sum(lo_revenue), d_year, p_brand1
from MyNewStructureFrom1C, part, supplier
where lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;

Part 3: Clustering

Using the file you have created in 1-B, run K-Means clustering with 7 clusters.

  • A. Using Mahout synthetic clustering, as you did on sample data in a previous assignment.
  • B. Using Hadoop streaming, perform one iteration manually with randomly chosen input centers. (This requires passing a text file with cluster centers using the -file option, opening centers.txt in the mapper with open('centers.txt', 'r'), and assigning a key to each point based on which center is closest to that particular point.) Your reducer would then need to compute the new centers, and at that point the iteration is done; a mapper/reducer sketch follows this list.
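
A minimal sketch of the 3-B mapper and reducer follows; it assumes the 1-B file contains five space-separated numeric values per line and that centers.txt (shipped to each task with -file) holds one space-separated center per line. The file name and field layout are assumptions, so adapt them to whatever you produced in 1-B.

#!/usr/bin/env python
# kmeans_mapper.py -- assign each point to its nearest center (one iteration).
import sys

# centers.txt is shipped to every task via the -file option and opened locally.
centers = []
with open('centers.txt', 'r') as f:
    for line in f:
        if line.strip():
            centers.append([float(x) for x in line.split()])

for line in sys.stdin:
    if not line.strip():
        continue
    point = [float(x) for x in line.split()]
    # Find the index of the closest center (squared Euclidean distance).
    best, best_dist = 0, None
    for i, c in enumerate(centers):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if best_dist is None or d < best_dist:
            best, best_dist = i, d
    # Key = index of the nearest center, value = the point itself.
    print('%d\t%s' % (best, ' '.join(line.split())))

The reducer receives all points grouped by center index and averages them to produce the new centers:

#!/usr/bin/env python
# kmeans_reducer.py -- average the points assigned to each center index.
import sys

current_key, sums, count = None, None, 0

def emit(key, sums, count):
    # Print one averaged (new) center per cluster index.
    if key is not None and count > 0:
        print('%s\t%s' % (key, ' '.join(str(s / count) for s in sums)))

for line in sys.stdin:
    if not line.strip():
        continue
    key, _, value = line.rstrip('\n').partition('\t')
    point = [float(x) for x in value.split()]
    if key != current_key:
        emit(current_key, sums, count)          # flush the previous center
        current_key, sums, count = key, [0.0] * len(point), 0
    sums = [s + p for s, p in zip(sums, point)]
    count += 1

emit(current_key, sums, count)                   # flush the last center

The job is launched like any other streaming job, with -file centers.txt added so each mapper can open the centers file locally; the reducer output (one averaged center per cluster index) is the set of new centers, which completes the iteration.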

NOTE: if you get a java.lang.OutOfMemoryError, you will need to reconfigure Hadoop to give the Java virtual machine more memory. You can do this by editing mapred-site.xml (the mapper should not need much RAM):

<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1024m</value>
</property>

The amount of memory can be tweaked (you can go higher, but keep in mind how much physical memory your machine has). Do not forget to restart Hadoop after any configuration file change.

If you still run out of memory in 3-A, submit a screenshot of the error and you will get full credit for the question.

Part 4: Performance

Compare the performance of the following combinations. If you have already run a combination before, it is sufficient to copy that runtime for comparison.

  • A. All three of your solutions to Part 1-A with:
    • a. Scale4: single node and a cluster of at least 4 nodes
    • b. Scale14: a cluster of at least 4 nodes
  • B. Both of your solutions for 2-B.
    • a. Scale4: single node and a cluster of at least 4 nodes

Summarize the results and cluster performance/scaling in at least a paragraph.

Extra Credit

Research and describe the most affordable way to build a 1-Petabyte drive. Note that the setup has to be self-sufficient (i.e., easily usable), and include references. Buying 250 4TB drives is not enough, because you still need a way to use them. The drive should be built to own, not to rent (Dropbox or similar services don't count, even if they advertise "unlimited" storage).

Submit a single document containing your written answers.