Compiler代写:ECE454 Profiling and Compiler Optimizations

Compiler的实验性质的作业,根据不同的编译选项进行实验,回答21个问题即可。
相比Compiler的编程作业,实验性质的作业除了工作量大,没什么难度。

Introduction

Upon graduation from Skule, and having become a high-performance program-optimizing guru, you decided to open your own high-performance code optimization consulting firm, called “OptsRus”. OptsRus’ first client is a reconfigurable computing company that is using the open source CAD software program VPR (developed at UofT) to map customer designs to their reconfigurable FPGA-like hardware. However, since the demise of single-threaded CPU performance, their customers have been complaining that VPR is not fast enough for their needs, especially since their hardware chips keep getting bigger and hence so do the designs VPR has to handle.
The company cannot afford to parallelize the software at this point, so they instead want you to analyze the VPR code as well as their compilation process to see if they can do better. As detailed below, you are to experiment with both, collect measurements and profiles, and write a report for the company (by answering the specific questions below).

Procedure

In your report, please answer all of the Q* questions below, with a few sentences each and any requested measurements.

HINT: you will likely find the use of a shell or perl/python script to automate your collection very helpful; this should be very easy to learn, I highly recommend that you take the opportunity to do so if you do not already.

Set up

Get a copy of and extract the homework tarball through the following commands:

mkdir ~/hw1
cd  ̃/hw1
cp /cad2/ece454f/hw1/vpr454.tar.gz .
tar -xvvf vpr454.tar.gz

Browse the VPR source code: starting at main.c (and knowing that we are focusing on placement), follow the chain of calls through the source code for a while to become familiar with it.

  • Q1 (1 marks) list 5 functions (by name, other than main) that you think might be important to optimize. Please do this honestly, before you do profiling. Explain in a few sentences total how you chose them.

Build and Test

In the Makefile modify the OPT FLAGS field (look for “454”) to compile the -g version (debug) of VPR (eg., change it to “OPT FLAGS = -g”). Type make (which should compile successfully).
This will result in a lot of output, the end of which should say “Completed placement consistency check successfully”. Run VPR using these arguments for all experiments below.

Measuring Compilation Time

In this assignment you will use /usr/bin/time to measure compilation and performance. In the output, note that the number that ends in “user” is runtime in seconds for “user-mode”, the time to use for this report, except Section 3.4. When measuring parallel compilation (or execution) time, you should take note of the “elapsed” time instead of the “user” time. The “elapsed” time represents the wall clock time. Note that since you are measuring real systems, measurements are a little different each time due to system variability. Try to measure on an unloaded machine. For every timing measurement always do 5 runs and average them (please only report the final average).
To build the gprof version of vpr, use the flags:

-g -pg

To build the gcov version of vpr, use the flags:

-g -fprofile-arcs -ftest-coverage

Measure the compilation time of the gprof, gcov, -g, -O2, -O3, and -Os compilation flags. Be sure to run “make clean” in between each build to ensure that all files are rebuilt.

  • Q2 (1 marks) Report the six measurements in a table; in the same table, using the slowest method of compilation as a baseline, report the speedup for each of the five measurements. Eg., if gcov was the slowest, and -g was twice as fast as gcov, then the speedup for -g relative to gcov would be 2.0.
  • Q3 (1 marks) Which is the slowest and why?
  • Q4 (1 marks) Which is the fastest and why?
  • Q5 (1 marks) Which of gprof and gcov is faster and why?

Measuring Parallel Compilation Time

Make can be configured to do a “parallel make”, which means it will build independent objects using multiple processes when possible. Using the -j option you can specify the max number of processes to use. Measure the compilation time (“elapsed time”) of ‘-j X’ where X has the value of 1,2,4, etc. In each case, the OPT FLAGS variable should be set to ‘-O3’ in the Makefile.

  • Q6 (1 marks) Report the measurements in a table; in the same table, using ‘-j 1’ as a baseline, report the speedup for each of the possibilities. Briefly explain the values of the maximum number of processes you decided to use and the speedup curve you obtained.

Measuring Program Size

Use “ls -l” to measure the size of each version of vpr from the previous section.

  • Q7 (1 marks) Report the six measurements in a table; in the same table, using the smallest method of compilation as a baseline, report the relative size increase for each of the six measurements. Eg., if -g was the smallest, and gprof was twice the size of -g, then the relative size increase for gprof relative to -g would be 2.0.
  • Q8 (1 marks) Which is the smallest and why?
  • Q9 (1 marks) Which is the largest and why?
  • Q10 (1 marks) Which of gprof and gcov is smaller and why?

Measuring Performance

Measure the run-time of VPR for all six versions compiled in the previous section.

  • Q11 (1 marks) Report the six measurements in a table. Again, using the slowest measurement as a baseline, also report the speedup for each version in the same table.
  • Q12 (1 marks) Which is the slowest and why?
  • Q13 (1 marks) Which is the fastest and why?
  • Q14 (1 marks) Which of gprof and gcov is faster and why?

Profiling with gprof

Compile gprof support for the -g, -O2, and -O3 versions, by using flags “-g -pg”, “-O2 -pg” and “-O3 -pg” respectively; run each of these versions to collect the gprof result; you don’t have to time any of this.

  • Q15 (1 marks) For each version, list the top 5 functions (give function name and percentage execution time).
  • Q16 (2 marks) For the “number-one” function for -O3 (the one with the greatest percentage execution time), how does its percentage execution time compare with the percentage execution time for the same function in the -g version? How is this possible? What transformation did the compiler do and to which functions?
  • Q17 (2 marks) For the transformation that the compiler did in the previous question, why didn’t the compiler do the same transformation to the number-two-ranked function from the -O3 version?

HINT: look for support for your argument in the VPR code, and explain what that is.

Inspecting Assembly

Use objdump to list the assembly for the -g and -O3 versions of vpr (eg., run “objdump -d OBJ/main.o” to examine the assembly instructions for the file main.c).

  • Q18 (1 marks) Count the instructions for the function update bb() in both versions (-g and -O3) and report those counts, as well as the reduction (reported as a ratio) in number of instructions for the -O3 version (ie., if the -O3 version has half as many instructions as the -g version, the reduction is 2.0x).
  • Q19 (1 marks) Using the gprof output from the previous section, compute and report the speedup in execution time of the -O3 version of update bb() over the -g version (i.e., the “self seconds” from the gprof output). How does the speedup compare to the size reduction, and how can this be the case?

Profiling with gcov

Use gcov to get the per-line execution counts of the number-one function from the -O3 version (but use the -g version to gather the gcov profile). After running the gcov version of VPR, execute the gcov program to generate a profile of the appropriate file (eg., run “gcov -o OBJ -b main.c” to profile the file main.c).
Running gcov will create main.c.gcov (for main.c).

NOTE: if you run the gcov program multiple times it will add to the counts in main.c.gcov; you have to remove the .gcda and .gcno files in OBJ/ to start counting from zero.

  • Q20 (2 marks) Based only on the gcov results (ie., don’t think too much about what the code says) list the loops in this function in the order that you would focus on optimizing them and why. Identify each loop by its line number in the original source file. If appropriate for any loop describe why you would not optimize it at all.

Optional BONUS

Can you modify the source code of the number-one function (from the -O3 version), without changing the algorithm/output, and improve the average performance (when still compiling with -O3)?

NOTE: do not eliminate code that checks for errors or prints—i.e., your code should still work the same for any input, even untested ones.

  • Q21 (3 marks) Clearly describe your modification (so the TA can repeat and test it), and why it improves performance. Measure it and report the speedup of your modification (relative to the original -O3 version of VPR).