Students will develop an in-order, fully functional C++ CPU simulator that executes appropriately binary-formatted programs. The CPU supports flexible hardware structures and program parameters, may run in debug or execution mode, and provides performance statistics on executed programs. The simulator will be coded and realized using various classes and structures in C++’s Standard
Template Library (STL).
Object-oriented programmers are often required to code simulators that model a simulated representation of a physical system. In this project, we will simulate a simple in-order Central Processing Unit (CPU) which will replicate very basic processor functionality in software. We will recreate various parts of the CPU and observe its states, instruction, and data-flow using a Debug Mode. This will allow us to trace instructions as they flow through the CPU to ensure proper functionality, while obtaining an appreciation for computer architecture. Once the simulator is completely coded and bug free, it may be run in execution mode, reflecting al terminal environment where the program will be executed on a simulated CPU.
There are numerous CPU simulators that have been created over the past decades. Many of the simulators are designed either for 1) educational purposes, or 2) for research/commercial purposes.
In the context of education, CPU simulators are mainly used to study single and multi-processor systems, where one may tune various parameters to view performance effects of design choices, and study CPU structure optimization. In the context of research/commercial simulators, computer architects must first simulate a physical system and observe potential performance gains before proceeding to create the actual hardware system and fabricating the chip. If performance is negligible, another route or design must be considered. Since tuning or redesigning a simulator is associated with minimal in cost in comparison to fabricating a processor chip (that may or may not work), simulators are widely used as a gateway for determining performance potential before proceeding to the actual hardware design.
The CPU simulator we will be designing in this project will not be modeling a complete physical system: a computing system comprises of many layers beyond the scope of this project. Instead, we will be simulating a very simple CPU for educational purposes, however the processor will still be capable of executing various programs and obtaining respective performance statistics from varying parameters. You may refer to this simulator and CPU design principles in your future engineering courses as well, such as ENSC254 and ENSC350.
An instruction which enters a CPU is processed incrementally in a series of steps. All instructions are assigned a number, or ID, as they enter the CPU. Our CPU is an in-order processor, meaning that instructions must be processed in the order which they enter the CPU. Therefore instructions must be identified using a numbering system to process instructions in-order; the younger the instruction, the greater the instruction’s ID value.
The series of steps required to execute an instruction is referred to as a pipeline. Since each instruction is divided and processed in a series of steps, several successive instructions may overlap in the pipeline at a given time.
Our CPU consists of two memory units: 1) instruction memory which contains all the programs instructions, and 2) a register file, which holds the program’s variable data (data is signed integers)
Consider the CPU as an assembly line for processing instructions with the following stages:
The first step is to read an instruction of our program (fetched) from instruction memory according to the address specified by a variable (or “register” in hardware terminology) called the Program Counter (PC). The PC is incremented every time an instruction is fetched from the instruction memory. Therefore, the value held by the PC dictates the location, or address, of the next instruction to be read from the instruction memory. We will assume the first instruction of a program is located at PC=0, where instruction_memory contains the first instruction that will be fetched and processed in the pipeline.
A “fetch width” parameter may be specified as well, indicating the number of instructions that may be obtained from instruction memory simultaneously at a given time. One unit of time is referred to as a clock cycle in CPU terminology.
Once an instruction is fetched, it is decoded according to the processor’s Instruction Set Architecture (ISA). The ISA stipulates the CPU’s supported “instruction set” i.e. instruction operations and their formats recognized and used by the processor to interpret instructions. The decoding process allows the CPU to extract information from the fetched instruction. Specifically, the decoder extracts an instruction’s i) input data locations to be read from the register file (referred to as source operands), ii) operation, and iii) the output destination/location where the result will be written to in the register file.
Once the instruction’s information is extracted, it is placed in two separate CPU structures: 1) Instruction Queue (IQ), and 2) Retire/Commit Buffer, or more commonly referred to as the Reorder Buffer (ROB).
The IQ is a finite entry queue which buffers instructions until they are ready for execution. An instruction is ready once all its source operands are marked as “valid”.
The ROB is a finite FIFO list which manages and ensures the safe eviction or “retirement” of instructions from the pipeline after execution. This retirement process is completed during the Retirement / Commit stage. The ROB has a much bigger role in complex CPUs, as its name implies, however in the context of our project, the ROB will only be used to guarantee that the instructions are executed and released from the pipeline in the order from which they were fetched.
Instruction fetch and decode will be performed as one step in our CPU simulator.
The term issue, interchangeable with the term dispatch, is the process of releasing an instruction from the IQ and proceeding to execution.
In this step, instructions in the IQ wait until all previous instruction have been issued. An instruction may be issued once all of its source operands are ready to be read from the register file, and resources are available for execution.
Accordingly in this second pipeline stage, when an instruction is ready in the IQ: i) the instruction is released from the IQ (dispatch) , ii) its input operands are read from the register file, the iii) operation is executed using source operand data, iv) the result is written back to the register file at the instruction’s specified destination register, and v) the destination register is broadcasted to the IQ to inform younger instructions that the contents in the register are ready to be read.
The CPU will implement all 5 of these substeps as one pipeline stage in the CPU simulator. More details pertaining to each step of this multi-step process are provided below:
Dispatch (Issue): The instructions in the IQ are monitored for operand readiness. Since instructions are dependent on one another, a “consumer” instruction can not execute before a “producer” instruction has finished executing. For instance:
Instruction 1: Z = A + B; ("produces" results of variable Z) Instruction 2: C = Z + F; ("consumes" the contents of variable Z)
As the example illustrates, Instruction 2 can not execute until the contents of Z are computed and written to the variable Z, signifying that instruction 1 must first execute. Instruction 2 may thereafter proceed in the next cycle to read the contents of the Z operand and executed its operation. Consequently, these two instructions can not dispatch nor execute in the same cycle due to this read-after-write (RAW) dependency. We refer to this issue validation process as monitoring an instruction for “valid” operands.
Once a given instruction’s operands are valid, the instruction may be dispatched (or “issued”) for execution. Since we are implementing an in-order CPU, this implies that the oldest instruction in the IQ must be ready for execution, implementing a FIFO (First-In, First-Out) scheme; if the oldest instruction is not ready, subsequent younger instructions can not proceed to execute even if they have valid operand data.
An “ISSUE_WIDTH” parameter may also be specified at this stage, indicating the maximum number of ready/valid in-order instructions that the may execute on the CPU in one clock cycle. This also implies that there are “ISSUE_WIDTH” number of functional units that exist to execute the instructions simultaneously. If multiple execution units exist, the processor model is referred to as Superscalar.
Read: As previously mentioned, the working set of program variables are held in the register file. Registers are the fastest access memory in a computing system, yet the most expensive to implement.
Once an instruction is dispatched, an instruction’s source operands must be read from the register file to obtain their respective data values. For instance, in the example above, variable Z is associated with a value, let’s say 5. This value must be read from the register file, and hence Z represents a memory address which holds the data content of our variable. In this case, Z would represent an address (or location) in the register file which must be accesses to obtain the value of the variable. In the case of our CPU simulator, we use a finite set of numbered “registers” representing our program variables, which are named with the prefix “r”. We will work with a set of 16 registers, and our operands will be named r0, r1… r15 accordingly.
Execute: Once the input operand contents have been read from the register file, they are passed to an Arithmetic Logic Unit (ALU). The ALU is a functional unit which performs a specified input operation on the values input to the unit. Our CPU’s ALU takes 3 inputs: two input operands, and a code representing the operation to be performed on the input values, referred to as an opcode. Based on this information, the ALU computes the operation on the input operand values, and produces an output result.
Writeback: The result generated by the ALU is redirected and written back to the register file, at the address specified by the “destination” operand extracted from the instruction at decode.
Once the contents have been written, the instruction is marked in the ROB as valid.
Retiring an instruction refers to the safe eviction of an instruction from the pipeline. As mentioned previously, the role of the ROB is to guarantee that the instructions are executed and released from the pipeline in the order from which they were fetched. Accordingly, in the Commit stage, the ROB’s head entry, considering a FIFO list structure, is accessed first. If the head of the list is valid, i.e. has been successfully executed, then it may be safely popped from the ROB. Once popped, the instruction is no longer present in the pipeline. If it is not valid, then it must wait in the ROB until it has been successfully executed.
A “COMMIT_WIDTH” parameter may also be specified in the CPU which indicates how many valid in-order instructions may be popped off the list simultaneously.
Referring back to Fig. 1, it is evident that since instructions are processed in a series of stages, assuming a FETCH_WIDTH and ISSUE_WIDTH of 1 instruction, up to three instructions may be in the midst of being processed by the CPU pipeline simultaneously. As FETCH_WIDTH and ISSUE_WIDTH parameter increase, it is possible that the CPU may increase performance as more instructions exist in the pipeline. This however depends on the number of RAW dependencies and “branch” instructions inherit of the program.
The ISA is a functional definition of the operations, instruction encodings, and storage that must be supported by a given processor. The role of the ISA is to divide labour between hardware and software1, allowing for universal software compatibility across various generations and styles of processor models.
Using the ISA, the programmer may design a software program which is compiled and converted to a set of instructions encoded for a given ISA as machine language. The processor which supports the target ISA obtains the formatted instructions and executes the software program. Thus, the ISA acts as a hardware/software interface allowing programmers and computer architects to work independently with the same computing objective.
The simulated CPU designed in this project will abide by an Instruction Set Architecture (ISA) referred to as the Two Fifty One (TFO) ISA. Instructions are read as machine language by the CPU, which we will interpret as assembly language in this document. For our sake, the general format for reading assembly is as follows: operation destination_reg, operand_source1, operand_source2
For instance, if an assembly statement were written as follows: add r5, r6, r7
This would translate to r5 = r6+r7, where our variables are expressed in terms of register locations.
Side Note: A word on binary for those who have not yet taken ENSC252: A binary number is expressed in the base-2 numeral system, and represents a decimal number as a set of 1s and 0s. Each 1 or 0 digit in a binary number is referred to as a bit. Depending on the position of the bit, b i , a given decimal number may be expressed using the following formula, assuming a binary number consisting of n bits.
Where bit i may only take the value of 1 or 0 and is multiplied by a power of 2 depending on the location of the digit. The left most bit represents the most significant bit/value, whereas the right-most bit signifies the least significant bit. If you were given the binary number “01010” for instance, this would equate to the decimal value.
The TFO ISA supports four instruction type formats: 1) Register type (R-type), 2) Immediate type (I-type),
3) Jump type (J-type) and 4) Prompt type (P-type). The instruction format type may be obtained from the two left-most bits (or Most Significant Bits (MSB) 31 and 30) of the given instruction as shown below. Instructions are represented in this section in binary.
All programs and their respective instructions will be provided to you for this project as unsigned integers.
To extract each instruction’s information, the unsigned integer must be converted to its binary equivalent, and processed as a string of 1s and 0s, each digit referred to as a bit. The bits must be decoded as shown above to obtain the field encodings, depending on the type of instruction extracted from bits[31,30]. Specifications for each instruction format are provided here.
If an instruction’s left most bits are “00”, the instruction is processed by the decoder as an R-Type instruction. Rtype instructions only operate on registers. In this case, both source operands must be read from the register file. The address of the input operands src1 and src2 may be obtained from instruction bits 20 through 16, and 15 through 11, respectively. The bits 29 through 26 represent the opcode operation performed on the input operand data, which are all passed to the functional unit during execute. Once the instruction has executed, the result is written back to the dest register specified in bits 25 through 21. Bits 10 through 0 remain unused for this instruction type.
Supported instructions: add, sub, mult, div, mod, mov (see Table I for specific formats and descriptions)
A special type of instruction is also supported by R-Types, called print, which prints the src1 value to standard output (cout). This is used to present the user with the program’s final result.
If an instruction’s left most bits are “01”, the instruction is processed by the decoder as an I-Type instruction. I-type instructions have one register source operand, and one 16bit immediate source operand. The address of the input operand src1 is obtained from bits 20 through 16, and the immediate value from bits 15 through 0, respectively. In the case, only one value must be read from the register file, and the second operand will be the immediate value extracted from the instruction, passed directly to the ALU. Bits 29 through 26 represent the opcode (operation code) to be passed to the functional unit during execute. Once the instruction is executed, the result is written back to the dest register specified in bits 25 through 21.
Note that there are special types of I-type instructions in the TFO ISA, referred to as branches, which redirect a program’s flow. More details and processing specifications of these types of I-instructions will be provided in the next section for a full elaboration on the topic.