In order to program for shared memory systems using multi-threading, threads need to be synchronized. Various thread synchronization mechanisms exist which ensure that threads do not simultaneously execute a critical section of the program. Many languages provide high level abstractions for synchronization to ease parallel programming. Common synchronization mechanisms include mutexes (mutual exclusion locks), semaphores, reader/writer locks and condition variables. Mutex is a mutual exclusion lock which ensures exclusive access to the shared resource. Spinlock is a type of lock which waits in a busy loop if lock cannot be acquired. Atomics operations are instructions supported by hardware and they lock the memory bus to access the shared resource. These operations are inherently atomic and have limited support for data types on various architectures. Semaphores is a type of mutual exclusion where a thread can wait to get access to the critical section or do a post so other threads can get access.
The primary focus here is to analyze the cost of low-level thread synchronization mechanisms and for this purpose, you will benchmark pthread_mutex_lock, pthread_mutex_unlock, sem_wait/sem_post, fetchand-add, and spin_lock/spin_unlock to measure latency. spin_lock and spin_unlock should be implemented using compare_and_swap atomic primitive. Fetch-and-add is supported by x86 architectures using ‘lock xadd’ instruction. These benchmarks should be obtained by running a tight loop of 1 billion operations and collecting the aggregate of the results. Each iteration acquires the lock, increments a shared integer and releases the lock, excluding fetch-and-add which performs an increment operation atomically. You will perform a weak scaling study, unless otherwise noted; this means you will set the amount of work per thread (e.g. the number of operations to evaluate in your benchmark), and keep the amount of work per thread fixed as you increase the number of threads.
This project aims to teach you how to benchmark four different synchronization mechanism. You must use the C programming language and can use abstractions such as PThreads. Libraries such as STL or Boost cannot be used. You can use any Linux system for your development, but you must make sure it compiles and runs correctly on fourier. The performance evaluation should be done on your own computer in a Linux environment (e.g. think back at Lab #0). Given the large class size, we encourage all students to avoid using fourier for anything but testing functionality on small datasets.
Your syncbench program should support the following benchmarks:
- a. vanilla: increment an integer counter with no synchronization mechanisms
- b. mutex: use a mutex to protect an integer while it is being incremented from 1 to N concurrent threads
- c. semaphore: use a semaphore to protect an integer while it is being incremented from 1 to N concurrent threads
- d. spinlock: use a spinlock to protect an integer while it is being incremented from 1 to N concurrent threads
- e. atomic: use an atomic to allow the thread safe incrementing of an integer from 1 to N concurrent threads
- a. Small: this should be set to 100 million operations (10^8)
- b. Large: this should be set to 1 billion operations (10^9)
- a. 1: Use 1 thread to execute your benchmark on the specified number of operations
- b. 2: Use 2 threads to execute your benchmark in parallel on the specified number of operations per thread
- c. 4: Use 4 threads to execute your benchmark in parallel on the specified number of operations per thread
- a. Throughput: measure the rate at which the counter can be incremented per second using the large size (1 billion operations); throughput should be calculated for each thread individually and all the results should be aggregated to get the final throughput value for the experiment.
- b. Latency: measure each individual increment operation and store it in memory using the small size (100 million operations); see discussion for fine-grained measurements for more information on how to measure latency
You will not be able to use the timing mechanisms for the latency part of this evaluation due to the short duration of the synchronization mechanisms. On x86 architectures, latency is measured in CPU cycles using RDTSCP instruction for start time and RDTSC + CPUID instruction for the end time. RDTSCP is a serializing instruction and it prevents instruction reordering around the call. CPUID is also a serializing call and when it follows RDTSC instruction, it prevents any future instructions to be executed before timing information is read. The combination of these two timing functions gives the most accurate results for latency. Timing on ARM could be quite different than x86, so be careful if you are trying to run on anything other than x86. Time base register counts cycles at a fixed lower frequency and needs to be calibrated to convert the value to actual cycles at CPU clock frequency.
- You must write all benchmarks from scratch. Do not use code you find online, as you will get 0 credit for this assignment.
- All of the benchmarks will have to evaluate concurrency performance; concurrency can be achieved using threads and processes. For this assignment, you must use multi-threading. Use weak scaling in all experiments, unless it is not possible, in which case you need to explain why a strong scaling experiment was not done. Be aware of the thread synchronizing issues to avoid inconsistency or deadlock in your system.
- All your benchmarks can be run on a single machine. You should run these benchmarks on your own system under Linux. Make sure your work compiles and can run the small workloads on fourier.
- Not all timing functions have the same accuracy; you must find one that has cycle accuracy.
- Since there are many experiments to run, find ways (e.g. scripts) to automate the performance evaluation.
- There are likely some compiler flags that will help you execute your benchmarks faster.
- No GUIs are required. Simple command line interfaces are expected.
When you have finished implementing the complete assignment as described above, you should submit your solution to your private git repository. Each program must work correctly and be detailed in-line documented. You should hand in:
- Source code and compilation (70%): All of the source code in C and Bash; in order to get full credit for the source code, your code must have in-line documents, must compile (with a Makefile), and must be able to run a variety of benchmarks through command line arguments. Must have working code that compiles and runs on fourier.
- Report / Performance (30%): A separate (typed) design document (named labEC2-report.pdf) describing the results in a table format. You must evaluate the performance of the various parameters outlined and fill in the 2 tables specified to showcase the results. You must summarize your findings and explain why you achieve the performance you achieve, and how the results compare between the various approaches.