- using MapReduce to process data
- expressing an algorithm in the MapReduce style
- choosing appropriate classes and methods from the MapReduce API
- testing and debugging
- writing clear, tidy, consistent and understandable code
The practical involves manipulating fairly large data files using the Hadoop implementation of MapReduce.
When working in the lab, it is highly recommended that you copy a small subset of these files to your local machine under /cs/scratch/username , and use them to develop and test your program. Do not use the input files directly from studres or your home folder as this will exhaust the network.
Your program should perform the following operations:
- Obtain the absolute paths of the input and output directories from the user. The input must be read in from files in the input directory, and must be written to files in the output directory.
- Find all character level or word level n-grams for text fragments contained within files in the given input directory written in a given language, depending on the user’s input
- Print the list of n-grams and their frequency to a file in the output directory, in alphabetical order
One possible interaction with the program could be as follows (assuming your username is mucs1):
Enter the input directory: /cs/scratch/mucs1/p5/data Enter the output directory: /cs/scratch/mucs1/p5/output Type of n-gram (C)haracter or (W)ord: W Value of N for n-grams: 2 Language: EN-GB
The first few lines of output will then look similar to that below:
a basis 7 a border 1 a central 1 a coating 1
You must use Hadoop MapReduce wherever applicable to compute the above, over conventional methods. You can reuse any code from your previous practicals, so long as you clearly identify it.
For your convenience, when finding n-grams (character or word-level), skip all words that contain anything other than uppercase or lowercase characters (numbers, symbols, parentheses, etc.). For character-level n-grams, you do not need to indicate word boundaries with an underscore as you did in Practical 2.
In this practical you need only run your code on the local machine. However, if you did have access to a large Hadoop cluster, your code would work without needing to be adapted.
Your program should deal gracefully with possible errors such as the web resource file being unavailable, or the response containing data in an unexpected format. The source code for your program should follow common style guidelines, including:
- formatting code neatly
- consistency in name formats for methods, fields, variables
- avoiding embedded “magic numbers” and string literals
- minimising code duplication
- avoiding long methods
- using comments and informative method/variable names to make the code clear to the reader
Hand in via MMS, a zip file containing the following:
- Your Java source files
- A brief report (maximum 3 pages) explaining the decisions you made, how you tested your program, and how you solved any difficulties that you encountered. Include instructions on how to run your program and any dependencies that need resolving. You can use any software you like to write your report, but your submitted version must be in PDF format.
- Also within your report:
- Highlight one piece of feedback from your previous submissions, and explain how you used it to improve this submission
- If you had to carry out a similar large-scale data processing task in future, would you choose Hadoop or basic file manipulation as you did in earlier practicals? Write a brief comparison of the two methods, explaining the advantages and disadvantages of each, and justify your choice.
If you wish to experiment further, you could try any or all of the following:
- Give the user an option to order the n-grams by occurrence frequency. Hint: You could use the ‘Most Popular Word’ example on studres as a starting point.
- Perform additional queries on the data, such as:
a. Count all text fragments containing a given string by the user in a given language
b. Find which word occurs in the highest number of different languages
c. Any other additional statistics/queries about the data you could generate using Hadoop
You are free to implement your own extensions, but clearly state these in your report. If you use any additional libraries to implement your extensions, ensure that you include these in your submission, with clear instructions to your tutor on how to resolve any dependencies and run your program.
- For a pass mark (7) it is necessary to show evidence of a serious attempt at both the programming task and the report.
- A mark of 13 can be achieved with a partial solution to the main problem.
- A mark of 17 can be achieved with a good and complete solution to the main problem and a well written report to match.
- For higher marks it is necessary to also attempt one or more of the mentioned extension activities, or suitable extension activities of your own.
Here is one possible sequence for developing your solution. It is recommended that you make a new method or class at each stage, so that you can easily go back to a previous stage if you run into problems. Please use demonstrator support in the labs whenever possible.
- You will need to examine the structure of the data files, to see how the text fragments and language specifications are represented.
- Tackle one problem at a time, beginning with selecting all text in a particular language, which requires a mapper class but no reducer. Initially, you can use a fixed search language String in your mapper class for testing purposes.
- To select only text in the required language, the difficulty is that the language is recorded in a different line from the text fragment, so will be processed in a different call to map. This can be solved using a field in the mapper object to record the most recently encountered language. The map method can then either update this field, if the current line contains a language, or check the language field value, if the current line contains a text fragment.
- To test, first make a new directory and copy 10 or 20 of the data files into it—the full data set will take inconveniently long to run.
- Once this works, refine your solution so that the search language is passed as a parameter. Recall that you can pass a parameter value to a mapper or reducer by calling the method.
- To return text in the specified language as n-grams, you will need to also pass the user’s specified n-gram type and size as parameters, using the same methods in Step 5. Following this, it is recommended that you reuse your n-gram creation code from Practical 2 to split each String into its corresponding n-grams. Remember that, unlike Practical 2, you do not need to represent word boundaries with an underscore.
- In order to output the n-gram frequencies alongside the n-grams themselves, you will need to implement a reducer class that groups duplicate n-grams and sums their total frequency. For a reminder of how to do this, review the ‘Word Count’ example on studres.
- For ordering your output n-grams, recall that sorting order is specified with the setOutputKeyComparatorClass method of the JobConf class.