In this assignment, we are going to create many multithreaded implementations of the MapReduce-like program we started Homework 1. The major difference with this assignment is that we are going to use multithreading to map and reduce data more efficiently and effectively than in Homework 1. There is nothing in this assignment that relies on your Homework 1 code. This assignment simply uses the same concepts.
Multithreading allows us to run the map functions concurrently and get results faster than mapping sequentially. There is hardly a device available today that doesn’t have a multicore processor. Up until now your C programs have only utilized a single core to execute your instructions.
Some difficulties arise with concurrent programming as there are now indeterminacy factors. This can lead to some disastrous consequences when manipulating data structures or waiting for resources. You will use techniques taught in lecture to avoid such situations.
Threads are made available by the POSIX Threading library or pthread.
Except there is one thing that is missing, naming threads. Naming your threads will make debugging and grading your program easier.
You are required to name your threads using the following format. “map %d” where %d is just an index of which map thread it is, and “reduce” for your reduce thread.
When running your program, you can launch the program htop and view threads as they are spawned, if you enable tree view in htop ‘s settings.
Terminator is a terminal window that allows you to split your terminal into many panes. This may be useful when testing. sudo apt-get install terminator
You will write map-reduce programs all with the same functionality but with different paradigms of thread safety. You will encounter a need for reetrant safe functions such as readdir_r() for use when iterating through directories.
For example, strtok(3) isn’t reentrant. This is becuase it contains a static variable that is reused on subsequent calls. This causes an issue when two threads call strtok(3) because the static variable will be changed. Thread unsafe functions like strtok(3) have reentrant versions such as strtok_r(3) . Make sure to use the reentrant versions of functions.
We challenge you to a journey in multithreading in a series we’re calling: LORD OF THE THREADS
In this part, you will implement simple map(…) and reduce(…) functions that will form the basis of each of the following parts.
- Navigate into your CSE320 git repository on your local machine and add the HW5 repository as a branch to your existing repository.
- Fetch information about this branch and merge it into the master branch. Refer back to HW1 for more details.
- There should be a folder called hw5 in your repository with all the assignment files. Inside the hw5 directory you can find the following files:
- src/part*.c - contains function prototypes for each of the assignment parts.
- src/lott.c - contains the main program to run your assignment
- include/lott.h - contains function prototypes and other dependencies.
- Makefile - this will build any .c and .h files you may have in the include and src directories.
- .gitignore - contains the directories to ignore with git
We will be using website data collected over the past two decades to find the highs, lows, and averages of the data. The queries are described in detail below.
Download the data directly from the link, and place it in a folder called data in your hw5 directory. DO NOT FETCH AND MERGE!
Note: the data folder is in the .gitignore meaning it won’t be pushed to your repo. DO NOT push the website data to your repo. Our disk space is precious. THE DATA IS 1.5GB IF EVERYONE IN THE COURSE PUSHES THE DATA IT WILL USE 300GB.
Your map() function should open each of the files it is tasked with handling and calculate the necessary values for a given query.
Your map() function should also take a void* as an argument and return a void* . The void* argument should be a pointer to a struct of your own design holding the necessary arguments.
Keep in mind how the sharing of data between threads is performed when passing arguments.
Code organization and commenting IS KEY. If we cannot understand your code, your grade will be affected.
We will use the provided main (unmodified) to test your assignment. It simply handles argument parsing and calls the functions that you implement.
- A/B: Max/Min Average duration;
- map(): Average duration for each website
- reduce(): Max/Min average duration of all of the websites
- C/D: Max/Min Average # of users per year
- map(): Find the average user count per year
- reduce(): Max/Min average between all sites
- E: Country with most users
- map(): Count country codes to find maximum occurence for a website
- reduce(): Find country with the most users
The help menu is already implemented and is:
Lord of the Threads USAGE: lott QUERY PART [THREADS] QUERY - The calculation the program is to execute: A, B, C, D, or E. PART - Part specification, parts 1,2,3,4, or 5 are valid choices. THREADS - Number of threads for parts that take a specified amount
TRIVIA: A decade ago for Return of the King, WETA’s renderwall had about 2,300 CPU cores and 5 terabytes of ram. To render everything for Desolation of Smaug, though WETA relied on 50,000 CPU cores and 170 terabytes of ram!
In this section, as the title suggests, you are going to test the limits of your CPU’s temperature throttling ability and your operating system’s context switching ability.
Each file found in the recursive search for files will spawn a thread ( pthread_create() ). This thread will be responsible for running the map() half of the program. After all threads have been spawned, the main function of your program should call pthread_join() for each thread, storing the return value of each thread in some structure of your design.
At this point, you should call the reduce() function to operate on the returned values of the map() threads. This way your program can find a result to the specified query.
Now you can output the results using the given format at the end of this document.
You can expirement with less files to avoid severe slowdowns and/or crashes while working on your VM.
TRIVIA: In the great war between Elves, Dwarves, Humans, Orcs, and Eagles there was something to be said about an efficiency factor of not involving all 24+ species imagined in the great Tolkien universe.
Since most computers have only 2 to 4 cores, it is ideal that we limit our program to utilize a specified number of threads.
From the function argument given you will limit the number of threads used. This time you divide up the work evenly per thread, taking some portion of the files and dividing it amongst your threads. You will then take the same action as the previous part and call pthread_join() for each of the spawned threads.
When all threads have completed you can call reduce() on the total collected data from each map() thread.
TRIVIA: All of Frodo’s friends working together to carry the ring across middle earth was hardly efficient. If only there was a way they could’ve worked in tandem maybe we wouldn’t have so many movies to watch. (19 hours and 31 minutes to be exact)
In this part, you will implement the reader/writer lock on a shared file between “writer” map threads and a “reader” reduce thread. You will open a single temporary file, mapred.tmp , which all map threads will write to.
Note: Your program should delete the temporary file when the map threads are complete. Look at unlink(…)
You will need to write a wrapper function for the write() syscall such that each map() thread will be guaranteed to write uninterrupted to the temporary file. In addition, you will need a wrapper for the read() syscall so that the reduce() thread can safely read while there is no writing being done. Your read/write lock MUST give Reader-preference. The reduce thread should read what has been written to the file so far and begin the reducing process on this subset of the data immediately. When the map() threads are able to obtain the lock they will write the data they have and end their execution.
NOTE: While testing you may encounter an unavoidable deadlock as the writer threads can be starved from writing when the reduce thread is constantly holding a lock for reading. You may want to introduce an artificial delay (man usleep ) in your reduce’s calculation code so that the lock is left unlocked long enough for a writer thread to obtain it.
When all map threads have joined with the main process you can pthread_cancel() your reduce() thread. Since your thread can be canceled at any time you’ll need to use pthread_cleanup_push() to handle printing the current results.
Since your map thread can be cancelled at any time, you’ll want to make sure that it is uninterruptable when calculating the query result. You should look into the following function.
TRIVIA: Tolkien actually typed the Lord of the Rings Fellowship of the Ring by chicken pecking two fingers on his typewriter. Be sure not to code like that. :laughing:.
In this part you will create a global buffer that each map thread will store their results into as they are collected. You will need to use mutexes or semaphores to implement a producer/consumer lock on your structure to ensure the stability of the structure.
The producer/consumer lock should be implemented such that the consumer will be able to read new data whenever it is available, but producers are given priority to insert data into the buffer as quickly as possible. Basically, you MUST give writer priority, look here for some ideas.
In this part, you will spawn N threads much like in the previous example for each of the map threads dividing the work evenly amongst each of them. You will spawn one reduce() thread which will be the consumer. Your reduce() thread should read data as it becomes available and compute the new result given the new data immediately and then wait for further data.
When each map thread has been joined with your main process again you can pthread_cancel() the reduce() thread. Since your thread can be canceled at any time you’ll need to use pthread_cleanup_push() to handle printing the current results. Refer to Part 3 for more details on these functions.
TRIVIA: Hadoop is actually the name of a toy Elephant owned by the creator’s son.
In the final part of the assignment you will spawn N threads giving each thread one end of a socketpair(). The other ends will be given to your reduce() thread. Your map() threads should do their job the same as the previous parts of the assignment, but at the conclusion of their work they will write the data to the unix socket they were passed.
Your reduce() thread will use I/O Multiplexing functions such as epoll(7) , poll(2) or select(2) to trigger on input from any of the given file descriptors processing the data available on that socket.
You are expected to hand in at minimum the following files in your git submission.
- All files relevant to hw5.
- The README should contain the following information:
- Name and SBU ID#
- If you did any Extra Credit tasks document them by writing EXTRA CREDIT # to the file where # is the number next to the task in the extra credit section. If you did multiple Extra Credit tasks, you must write them individually on their own lines.
- Anything relevant to grading your project when errors may occur.
- The README should contain the following information:
REMEMBER! DO NOT submit at the last minute. We will be using the time stamp of the commit associated with the hw5 tag to determine if your homework assignment is late or not. On top of that we will NOT grade late assignments.
- Make sure all files for this assignment are in the hw5 directory.
- Before you tag your submission, make sure that you have pulled from the remote and all committed changes have been pushed. Be sure you are done making any and all changes before proceeding as tags cannot be deleted.
- Log on to your gitlab account and navigate to the tags page from your repository’s main page.
- Enter hw5 for the tag name, the name of your current branch ( master only), and an optional completion message. Do not put important information in the message as we will not necessarily see it. Any relevant information to your submission should be in your README. Please tag hw5 in lowercase.
- To check if you submitted properly, you should now see a hw5 tag in the list of tags for your repository.