In this task you are asked to create an inverted index that maps each word to the documents that contain it. Using this inverted index you can easily search for all the documents that contain a particular word. The data has been stored in a very compact form, split across two files. The first file is called docword.txt and contains the contents of all the documents, stored in the following format:
|Attribute number||Attribute name||Description|
|1||Doc id||The id of the document that contains the word.|
|2||Vocab Index||Instead of storing the word itself, we store its index into the vocabulary file. The index starts from 1.|
|3||count||An integer representing the number of times this word occurred in this document.|
The second file, called vocab.txt, contains each word in the vocabulary. Attribute 2 of the docword.txt file is an index into this file.
The data set used for this task can be found inside the Bag of words directory of the assignment_datafiles.zip on LMS.
If you want to test your solution with more data sets you can download other data sets of the same format from the following source.
Here is a small example of the contents of the docword.txt file.
|Doc id||Vocab Index||Count|
Here is an example of the vocab.txt file
Plane
Car
Motorbike
Truck
Boat
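To make the two file formats concrete, here is a minimal plain-Python sketch of how they could be parsed. It is not a Spark solution. The helper names `parse_vocab` and `parse_docword` are hypothetical, and the sketch assumes that vocab.txt holds one word per line and that docword.txt holds three whitespace-separated integers per line; check both assumptions against the actual files.

```python
def parse_vocab(lines):
    """Map the 1-based vocab index to its word (assumes one word per line)."""
    return {index: word.strip() for index, word in enumerate(lines, start=1)}


def parse_docword(lines):
    """Yield (doc_id, vocab_index, count) triples of integers.

    Assumption: fields are separated by whitespace. Lines that do not have
    exactly three fields are skipped rather than treated as errors.
    """
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            yield tuple(int(field) for field in fields)


vocab = parse_vocab(["Plane", "Car", "Motorbike", "Truck", "Boat"])
triples = list(parse_docword(["1 2 120", "3 5 2000"]))

# Resolve each vocab index back to its word.
for doc_id, vocab_index, count in triples:
    print(doc_id, vocab[vocab_index], count)
```

The point to notice is the 1-based index: vocab index 1 is the first line of vocab.txt, not the second.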
Using the input files docword.txt and vocab.txt downloaded from LMS, complete the following subtasks using Spark:
a) [spark] Output into a text file called “task3a.txt” a list of the total count of each word across all documents. List the words in ascending alphabetical order. So for the above small example input the output would be the following (the format of the text file can be different from below but the data content needs to be the same):
Boat 2200
Car 620
Motorbike 2502
Plane 1100
Truck 122
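The aggregation behind subtask a) can be checked locally with a short plain-Python sketch before you write it in Spark. This is only an illustration of the data flow, not the required Spark script. The `DOCWORD` triples are reconstructed from the expected outputs given in this task (they are consistent with them, but their original row order is unknown), and the function name `total_counts` is hypothetical. In Spark the same steps would roughly correspond to mapping each record to a (word, count) pair, summing per key (for example with `reduceByKey`) and sorting by key.

```python
from collections import defaultdict

# vocab.txt contents; the vocab index in docword.txt starts at 1.
VOCAB = ["Plane", "Car", "Motorbike", "Truck", "Boat"]

# (doc_id, vocab_index, count) triples, reconstructed from the expected outputs.
DOCWORD = [
    (1, 1, 1000), (1, 2, 120), (1, 3, 1200),
    (2, 2, 500), (2, 3, 702), (2, 5, 200),
    (3, 1, 100), (3, 3, 600), (3, 4, 122), (3, 5, 2000),
]


def total_counts(docword, vocab):
    """Sum the counts of each word over all documents, sorted by word."""
    totals = defaultdict(int)
    for _doc_id, vocab_index, count in docword:
        word = vocab[vocab_index - 1]  # 1-based index into a 0-based list
        totals[word] += count
    return sorted(totals.items())  # ascending alphabetical order


for word, total in total_counts(DOCWORD, VOCAB):
    print(word, total)
```

Note that Python's default string sort is case-sensitive, so a vocabulary with mixed upper and lower case may need an explicit sort key.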
b) [spark] Create an inverted index of the words in the docword.txt file and store the entire inverted index in binary format under the name InvertedIndex. Also store the output in text file format under the name task3b. The inverted index contains one line per word. Each line stores the word followed by a list of (Doc id, counts) pairs (one pair per document). So the output format is the following:
word, (Doc id, count), (Doc id, count), …
- Note you need to have the list of (Doc id, count) in decreasing order by count.
- Note you need to have the words in ascending alphabetical order
So for the above example input the output text file(s) would contain (the actual format can be a bit different but it should contain the same content):
Boat (3, 2000), (2, 200)
Car (2, 500), (1, 120)
Motorbike (1, 1200), (2, 702), (3, 600)
Plane (1, 1000), (3, 100)
Truck (3, 122)
For example, the following format for the text file would also be acceptable:
(Boat,ArrayBuffer((3,2000), (2,200)))
(Car,ArrayBuffer((2,500), (1,120)))
(Motorbike,ArrayBuffer((1,1200), (2,702), (3,600)))
(Plane,ArrayBuffer((1,1000), (3,100)))
(Truck,ArrayBuffer((3,122)))
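The two ordering requirements of subtask b) are easy to mix up, so here is a plain-Python sketch of the intended result. It is not Spark code and not the only acceptable approach. The `DOCWORD` triples are reconstructed from the expected output above, and the names `build_inverted_index` and `format_line` are hypothetical. In Spark, the grouping step would roughly correspond to collecting (Doc id, count) pairs per word (for example with `groupByKey`) and sorting each list before sorting by key.

```python
from collections import defaultdict

VOCAB = ["Plane", "Car", "Motorbike", "Truck", "Boat"]  # index starts at 1

# (doc_id, vocab_index, count) triples, reconstructed from the expected output.
DOCWORD = [
    (1, 1, 1000), (1, 2, 120), (1, 3, 1200),
    (2, 2, 500), (2, 3, 702), (2, 5, 200),
    (3, 1, 100), (3, 3, 600), (3, 4, 122), (3, 5, 2000),
]


def build_inverted_index(docword, vocab):
    """Return [(word, [(doc_id, count), ...]), ...].

    Words are in ascending alphabetical order; each posting list is in
    decreasing order by count.
    """
    postings = defaultdict(list)
    for doc_id, vocab_index, count in docword:
        postings[vocab[vocab_index - 1]].append((doc_id, count))
    return [
        (word, sorted(pairs, key=lambda pair: pair[1], reverse=True))
        for word, pairs in sorted(postings.items())
    ]


def format_line(word, pairs):
    """Render one index entry as: word, (Doc id, count), (Doc id, count), ..."""
    return ", ".join([word] + [f"({doc_id}, {count})" for doc_id, count in pairs])


INDEX = build_inverted_index(DOCWORD, VOCAB)
for word, pairs in INDEX:
    print(format_line(word, pairs))
```

Sorting inside each posting list and sorting the words are two separate steps; sorting the whole data set once by count would not give the required word order.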
c) [spark] Load the previously created inverted index stored in binary format from subtask b) and cache it in RAM. Search for a particular word in the inverted index and return the list of (Doc id, count) pairs for that word. The word can be hard coded inside the Spark script. In the execution test we will modify that word to search for some other word. For example, if we are searching for “Car” the output for the example dataset would be:
Car (2, 500), (1, 120)
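The save, load, cache and search cycle of subtask c) can be illustrated with a small plain-Python stand-in. Here `pickle` and an in-memory dictionary play the roles of Spark's binary output and the cached data set; this is an analogy under those assumptions, not the Spark API. The names `save_index`, `load_index`, `search` and `SEARCH_WORD` are hypothetical. In Spark you would instead load the binary output of subtask b), call `cache()` on it, and filter for the search word.

```python
import os
import pickle
import tempfile

# The inverted index for the small example, as given in the expected output.
INDEX = {
    "Boat": [(3, 2000), (2, 200)],
    "Car": [(2, 500), (1, 120)],
    "Motorbike": [(1, 1200), (2, 702), (3, 600)],
    "Plane": [(1, 1000), (3, 100)],
    "Truck": [(3, 122)],
}


def save_index(index, path):
    """Write the index in a binary format (pickle is a stand-in here)."""
    with open(path, "wb") as handle:
        pickle.dump(index, handle)


def load_index(path):
    """Read the binary index back into memory."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def search(index, word):
    """Return the (doc_id, count) pairs for word, or an empty list."""
    return index.get(word, [])


SEARCH_WORD = "Car"  # hard coded; change this to search for another word

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "InvertedIndex")
    save_index(INDEX, path)
    cached = load_index(path)  # kept in memory after the file is gone

print(SEARCH_WORD, search(cached, SEARCH_WORD))
```

Keeping the search word in a single variable near the top of the script makes it easy to change during the execution test.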
- Using Spark, perform the following task on the data set of task 2.
[spark] Across all hash tag names, find the hash tag name whose number of tweets increased the most between any two consecutive months. Consecutive months are, for example, 200801 to 200802, or 200902 to 200903, etc. Report the hash tag name, the 1st month count and the 2nd month count.
- Propose any other data processing task using real data you obtained from somewhere. You need to do the data processing using either MapReduce, Pig or Spark. You need to give me a proposal for this task by the end of week 9. I will tell you how many marks it is worth. You can get a maximum of 10 bonus marks for this task. Include the proposal as a PDF, TXT or Word document in the assignment submission.
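For the consecutive-month hash tag task above, the part that is easiest to get wrong is the year boundary: 200812 is followed by 200901, not 200813. The plain-Python sketch below illustrates one way to handle it. The format of the task 2 data set is not described in this section, so the record layout used here, a dictionary from (hash tag, month) to tweet count, is purely an assumption, and the sample values are made up for illustration. The names `next_month` and `biggest_increase` are hypothetical. The sketch also assumes that a month with no record for a hash tag is skipped rather than counted as zero.

```python
def next_month(yyyymm):
    """Return the month after yyyymm, rolling December over to January."""
    year, month = divmod(yyyymm, 100)
    if month == 12:
        return (year + 1) * 100 + 1
    return yyyymm + 1


def biggest_increase(monthly_counts):
    """Find the largest month-over-month increase for any hash tag.

    monthly_counts maps (hashtag, yyyymm) to a tweet count (assumed layout).
    Returns (hashtag, first_month_count, second_month_count), or None if no
    hash tag has two consecutive months of data.
    """
    best = None
    for (tag, month), count in monthly_counts.items():
        following = monthly_counts.get((tag, next_month(month)))
        if following is None:
            continue  # assumption: missing months are skipped, not zero
        increase = following - count
        if best is None or increase > best[0]:
            best = (increase, tag, count, following)
    return None if best is None else best[1:]


# Made-up sample values, for illustration only.
SAMPLE = {
    ("#spark", 200811): 10,
    ("#spark", 200812): 15,
    ("#spark", 200901): 90,
    ("#hadoop", 200801): 40,
    ("#hadoop", 200802): 60,
    ("#hadoop", 200804): 500,
}

print(biggest_increase(SAMPLE))
```

In this sample the jump from 60 to 500 for "#hadoop" is ignored because 200802 and 200804 are not consecutive months.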