Wednesday, September 28, 2011

DDR SDRAM Memory Bank

Because I'm getting into the section on memory (and some parts of yesterday's example made me curious, particularly the RAS-to-CAS delay) I have decided to research RAM again.  I did quite a bit of research in this area about a month ago, so a lot of this is review and should go by relatively quickly.

Data is stored in RAM in an array of bits called a "bank".

A horizontal line of bits makes up a "word line" and a vertical line of bits makes up a "bit line".

The array is split up into rows and columns.  A row is equivalent to a word line, so that's easy enough.  A column is made up of multiple contiguous bit lines, either 4, 8, or 16 depending on the architecture of the chip.  The number of bit lines which make up a column is the word width of the chip, or simply the "width".  This is usually written with an "x" followed by the width: x4, x8, and x16.

The number of rows depends on the generation of DDR and the "density" of the chip.  The density is the number of bits the chip has total across all banks.  The number of columns depends on the density and width of the chip.  The exact row and column counts for the possible densities and widths are specified by the DDR standard for the generation in question, although my experience is that you have to perform some calculations to find them.  Here is a table I put together of the row and column counts for the first generation of DDR:
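As a sketch of the kind of calculation I mean, here is how the column count falls out of the other numbers.  All of the chip parameters below (256 Mbit density, x8 width, 4 banks, 8192 rows) are just assumed values for a hypothetical DDR1 chip, not figures to take as gospel:

```python
# Hypothetical DDR1 chip: 256 Mbit density, x8 width, 4 banks, 8192 rows.
# rows * columns * width = bits per bank, so columns can be derived.
density_bits = 256 * 2**20   # total bits across all banks
banks = 4
width = 8                    # bit lines per column (x8)
rows = 8192                  # word lines per bank

bits_per_bank = density_bits // banks
columns = bits_per_bank // (rows * width)
print(columns)               # 1024 columns per bank under these assumptions
```

Under those assumptions each bank works out to 1024 columns, which is the sort of number the standard's tables spell out directly.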

Tuesday, September 27, 2011

Memory Access Latency

A predicated instruction is an instruction whose execution depends on the result of a true/false test.  Another way to look at it is a single instruction for code like the following: 

if (a > b) c = 6;

Predicated instructions can help to reduce the number of branches in your code, which may increase how fast your program executes.
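To make the idea concrete, here is a branch-free version of that snippet using a bitmask, which is roughly the effect a predicated (or conditional-move) instruction achieves in hardware.  This is my own illustration, not taken from any particular ISA:

```python
def predicated_assign(a, b, c):
    """Branch-free equivalent of: if (a > b) c = 6.  (Illustration only.)"""
    mask = -(a > b)                  # -1 (all ones) if the test is true, 0 if not
    return (6 & mask) | (c & ~mask)  # select 6 or the old c without branching
```

Either way the result is computed unconditionally; the test only decides which value survives, so there is no branch for the processor to predict.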

On a slight tangent, I also learned what a transistor is: a resistor whose value (resistance) changes.  I still don't know how they are used or why there are so many in a processor, but I've satisfied my curiosity for the moment.  I highly recommend this video on the subject:

You can classify a processor as having either a Brainiac design or a Speed-Demon design based on how much it pushes for ILP.  A Brainiac design throws as much hardware at the problem as possible, sacrificing simplicity and size for more ILP.  A Speed-Demon design relies on the compiler to schedule instructions in a way that extracts as much ILP out of the code as possible.  A Speed-Demon design is relatively simple and small, allowing for higher clock speeds (until the thermal brick wall was hit in the early 2000s) which is how it got its name.

I finally started learning about memory access.  One of the reasons I started researching CPU architecture was to find out why a Load-Hit-Store on a Xenon (Xbox 360 processor) could cause a stall of up to 80 cycles, and I think I am getting close to an answer.  To reiterate an example from Modern Processors - A 90 Minute Guide, let's say you have a 2.0GHz CPU and 400MHz SDRAM.  Let's also say that it takes 1 cycle for the memory address to be sent from the CPU to the memory controller, 1 cycle to get to the DIMM, 5 cycles from the RAS-to-CAS delay (assuming there is a page miss, as is likely with the memory hierarchy we have today), another 5 cycles from CAS, then 1 cycle to send the data to the prefetch buffer, 1 cycle to the memory controller, and finally 1 cycle to the CPU.  In total we have 15 memory-clock cycles (assuming the FSB:DRAM ratio is 1:1) to get data from the main memory.  To convert this into CPU clock cycles, multiply it by the CPU:FSB ratio (CPU multiplier), which in this case is 5.  So 15*5 = 75 CPU clock cycles before the data is received from the main memory.  A diagram is definitely easier for me to understand, so here is a drawing I created to help me understand how this latency was calculated:
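The arithmetic itself is simple enough to double-check in a few lines (the step names and cycle counts are just the assumed values from the example above):

```python
# Cycle counts from the worked example, all in memory-clock cycles.
cpu_hz, mem_hz = 2_000_000_000, 400_000_000
steps = {
    "CPU -> memory controller": 1,
    "memory controller -> DIMM": 1,
    "RAS-to-CAS delay (page miss)": 5,
    "CAS latency": 5,
    "DIMM -> prefetch buffer": 1,
    "prefetch buffer -> memory controller": 1,
    "memory controller -> CPU": 1,
}
mem_cycles = sum(steps.values())   # 15 memory-clock cycles end to end
multiplier = cpu_hz // mem_hz      # CPU:FSB ratio = 5
print(mem_cycles * multiplier)     # 75 CPU clock cycles
```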

VLIW and Researching Interlocks

Trying out a different work schedule, so my blog entries may be a little erratic while I work out some of the kinks.  So anyways, this blog post is for yesterday.

Today I learned about a couple of new techniques.  The first is something called "very long instruction word" (VLIW).  In VLIW, a single instruction is actually composed of multiple smaller instructions which have been packed together by the compiler.  The fetch and decode stages of the pipeline can effectively work with multiple instructions in parallel while only having to deal with a single (packed) instruction.  The decode stage unpacks the sub-instructions and sends them to the appropriate functional units.  The decode stage does not detect hazards and the pipeline can generally only be stalled on a cache miss, so it is the job of the compiler to insert NOP (no-operation) instructions to prevent hazards.  A processor which makes use of VLIW is said to have an explicitly parallel design.

The other technique is something called an "interlock".  I am still researching the specifics, but an interlock in general is some mechanism which prevents harm to either the operator or the machine (an example would be the mechanism which locks the oven door during a self-clean cycle).  In the case of processors, an interlock can detect a data hazard in the decode stage of the pipeline and stall the previous stages while sending NOPs out of the decode stage until the hazard has been resolved.  I am assuming that an interlock is also used to prevent one stage from passing its output to the next stage in the event that the next stage is stalled (for example, a particular instruction is taking multiple clock cycles to execute and the fetch and decode stages must be stalled).
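Here is a toy simulation of that first kind of interlock, based purely on my reading so far: the decode stage holds an instruction back and emits NOPs while one of its source registers is still being computed by an earlier instruction.  The instruction format and the 2-cycle result latency are invented for illustration:

```python
# Toy interlock: decode stalls a dependent instruction and issues NOPs
# until the register it reads has been written back.

def decode_with_interlock(program, latency=2):
    """program: list of (dest, src) register pairs, in program order."""
    in_flight = {}        # dest register -> cycles until it is written back
    issued, i = [], 0
    while i < len(program) or in_flight:
        dest, src = program[i] if i < len(program) else (None, None)
        if dest is not None and src not in in_flight:
            issued.append((dest, src))        # no hazard: issue normally
            in_flight[dest] = latency
            i += 1
        elif dest is not None:
            issued.append("NOP")              # stall: hazard not yet resolved
        # one clock tick passes; age the in-flight results
        in_flight = {d: c - 1 for d, c in in_flight.items() if c > 1}
    return issued
```

Feeding it `[("r1", "r2"), ("r3", "r1")]` produces one NOP bubble between the two dependent instructions, while two independent instructions issue back to back.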

Wednesday, September 21, 2011

Google Docs Organized

I spent a good chunk of the day organizing my Google Docs page; it was becoming impossible to find anything.  Whenever I came across a resource while researching something, I would tell the Google Viewer to save it to my documents page, but that caused things to become quite cluttered.

The remainder of the day was spent reading Pipelining: An Overview at ars technica.  It's mostly been review, so I don't have much to report.  However, I did learn that a non-pipelined processor is also called a "Single-Cycle Processor".  The article series also did a good job of explaining what a stall and a pipeline bubble are, including a few interesting graphs.

Tuesday, September 20, 2011

Instruction Throughput Decides All

I finally have an answer to my question!  PleasingFungus from the #tigIRC channel on Esper discussed the question with me for two hours straight today and this is the answer I ended up with:

If a scalar (not superscalar) processor has multiple nonredundant functional units (a single FPU and a single ALU, for example) and it issues an instruction to the FPU at a certain clock tick, could it issue another instruction to the ALU at the next clock tick so that both the FPU and the ALU are executing at the same time (let's say that FPU instructions take longer to execute than ALU instructions)?  Or would that make the processor superscalar?
A processor is only superscalar if its maximum instruction throughput is greater than 1.  Because only a single instruction is being issued each clock cycle, the maximum instruction throughput is 1, so the processor is not superscalar.  A processor can have multiple pipelines, but if the maximum instruction throughput is equal to or less than 1 the processor is not considered to be superscalar.

Instruction throughput is a measure of the number of instructions which can be processed by the pipeline (committed) per cycle.  Because instructions don't all take the same amount of time to get through the pipeline, the "average instruction throughput" can change depending on which instructions were executed.  The maximum instruction throughput, however, is a static measurement representing the best-case scenario.

Even if the instruction pipeline splits into two separate pipelines (one for each functional unit) at the issue stage (or some other point), if only one instruction is dispatched per clock cycle the processor has an instruction throughput less than or equal to 1.  Two instructions might complete at the same time (if one functional unit is slower than the other), thus giving the illusion of having an instruction throughput of 2, but it will take at least one cycle to "recover" from this, during which no instructions will complete; so it evens out.

I am still doing some research into this topic, although I currently consider myself to be about 50% finished with the question.  I plan on reading the Pipelining: An Overview series at ars technica after working through some example pipelines on paper.

Monday, September 19, 2011

Scalar But Multiple Functional Units?

I spent most of the day trying to find the answer to the following question:

If a scalar (not superscalar) processor has multiple nonredundant functional units (a single FPU and a single ALU, for example) and it issues an instruction to the FPU at a certain clock tick, could it issue another instruction to the ALU at the next clock tick so that both the FPU and the ALU are executing at the same time (let's say that FPU instructions take longer to execute than ALU instructions)?  Or would that make the processor superscalar?

Unfortunately, I have not yet found a definite answer.  After a rather long discussion in the #hardware channel on Freenode, one person said that such a processor would have to be superscalar.  I need to get the opinion of several people on this matter though, especially since he seemed a little unsure of the answer he gave.

I asked for help on a hardware forum, but there have been no replies as of yet.  I may search for a more technical hardware forum tomorrow.

Friday, September 16, 2011

Researching Thornton's Algorithm

"Scoreboarding" is also called "Thornton's Algorithm".  Tomasulo's Algorithm and Thornton's Algorithm were developed around the same time with roughly the same goals.
  • They are both dynamic scheduling methods.
  • They both resolve dependencies (RAW, WAR, and WAW).
  • They are both pipelined.
  • They both have limited out-of-order execution capabilities.
  • Neither was designed to be superscalar.
Tomasulo's Algorithm, while solving dependencies better, was more complicated than Thornton's Algorithm.

Thursday, September 15, 2011

Forwarding and Out-of-Order Scalar Trouble

Welp, I am still trying to figure out how an out-of-order scalar processor would work.  The out-of-order execution part is easy enough, but I am having trouble figuring out how in-order completion would work.  To this end, I am currently researching scoreboarding in the hopes of gleaning some hint as to how such a processor would work.  If anyone knows of a processor which had out-of-order execution and in-order completion (different from commit), please let me know!

Research Questions

What is forwarding?
Forwarding is what you call the technique of passing the result of an instruction directly to a dependent instruction before the result is committed.  This allows an instruction to execute before its operands are available in the register file.
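Here is a tiny model of what I understand forwarding to be.  The instruction format and the one-instruction commit delay are my own invention for illustration:

```python
# Tiny forwarding model: the register file is only updated one instruction
# later (a pretend commit delay), so a back-to-back dependent instruction
# must take its operand from the bypass instead of the register file.

def run_with_forwarding(program, regs):
    """program: list of (dest, src1, src2); each computes dest = src1 + src2."""
    pending = None                         # (dest, value) not yet committed
    for dest, s1, s2 in program:
        def read(r):
            if pending and pending[0] == r:
                return pending[1]          # forwarded from the bypass
            return regs[r]                 # normal register-file read
        value = read(s1) + read(s2)
        if pending:
            regs[pending[0]] = pending[1]  # the previous result commits now
        pending = (dest, value)
    if pending:
        regs[pending[0]] = pending[1]
    return regs
```

The second instruction in `[("r3", "r1", "r2"), ("r4", "r3", "r1")]` reads r3 before r3 has been committed, and still gets the right value thanks to the bypass.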

Wednesday, September 14, 2011

OOO Execution with In-Order Completion?

Today I was able to find an answer to the following research question:

Can a processor have out-of-order execution and in-order completion?
Yes; with one pipeline per active instruction.  In other words, for every instruction in the active instruction window (every instruction which can be selected for execution) there needs to be a functional unit.  This seems a little impractical and I doubt this sort of processor has been used much.

I am currently working out how the various combinations of in-order/out-of-order and scalar/superscalar affect the design of a processor.  I have the superscalar combinations figured out, and I am currently working on the scalar combinations.

Tuesday, September 13, 2011

Reorder Buffer, Public Project Tracking

Today I was able to find answers to the following questions:

What is the reorder buffer?
It is a buffer which is used to ensure that instructions commit in-order.  It is also used in one method of register renaming.  There is one entry for every active instruction.  An instruction is active after it has been dispatched and before it has been committed.  A reorder buffer is a circular FIFO queue, meaning it has a head pointer and a tail pointer which can wrap around once they have hit the end of the buffer, and entries are removed in the order they are added.  When an instruction is dispatched, an entry for it is placed in the reorder buffer where the tail is pointing (and the tail pointer is incremented to the next entry address).  When an instruction is being pointed to by the head pointer and the instruction has completed execution, it is committed and the head pointer is incremented to the next entry address.
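A minimal sketch of that structure, with the entry format and method names invented by me:

```python
# Minimal reorder buffer: a fixed-size circular FIFO with head and tail
# pointers. dispatch() allocates the tail entry; commit() retires the head
# entry only once it has finished executing, which forces in-order commit.

class ReorderBuffer:
    def __init__(self, size):
        self.entries = [None] * size
        self.head = self.tail = self.count = 0

    def dispatch(self, instr):
        if self.count == len(self.entries):
            raise RuntimeError("ROB full: dispatch must stall")
        self.entries[self.tail] = {"instr": instr, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)  # wrap around
        self.count += 1

    def mark_done(self, instr):       # completion may happen out of order
        for e in self.entries:
            if e and e["instr"] == instr:
                e["done"] = True

    def commit(self):
        """Retire the head entry if it has completed; return it or None."""
        e = self.entries[self.head]
        if e is None or not e["done"]:
            return None               # in-order: an unfinished head blocks commit
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return e["instr"]
```

Even if a younger instruction finishes first, it cannot commit until everything older at the head has committed, which is the whole point of the structure.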

Can a superscalar processor have in-order execution?
Yes.  If sequential instructions are being executed in parallel, it is still considered to be in-order execution.

I also added the following unanswered questions to my research list:
  • What does "dynamically encountered" mean regarding memory operations in the dynamic reordering process of a processor?
  • Where is the active instruction window?
  • Can a processor have out-of-order execution and in-order completion? 

One of our community members and moderators, Acaceol, created a rather cool IRC-based project tracking system.  I am currently using it to track the various projects and tasks I have to do, and the best part is that anyone can check up on the current status of development/research by simply joining our online chat room located at and either asking the project management system for the current status or watching as we update it throughout the day.  This will allow you to keep up-to-date on exactly what is being worked on, including how much time has been spent on the various tasks and projects!

Friday, September 9, 2011

Execution Scheduling and Memory Disambiguation

Unfortunately I was unable to get much research done today.  However, I did learn a few things.

A processor can be either statically scheduled or dynamically scheduled.

A statically scheduled processor depends on the compiler to create the execution schedule.  I am guessing this is either through hints or by actually reordering instructions at compile time.

A dynamically scheduled processor creates the execution schedule at run-time.  This is apparently the preferred method, although I can only guess as to the reasons at this point.  One thought is that you can change the hardware and the same program doesn't need to be recompiled.

Memory Disambiguation is used in the creation of the execution schedule to reorder memory references without running into a data hazard; it is the process of determining if two instructions access the same memory location.

As with creating the execution schedule, memory disambiguation comes in two flavors: Static Disambiguation and Dynamic Disambiguation.  As with scheduling, the static variation is performed by the compiler and the dynamic variation is performed by the processor.  Either or both can be used and they don't interfere with each other.

How dynamic disambiguation works with a statically scheduled processor, I don't know.  Hopefully that is one of those questions which I will find an answer to.

Thursday, September 8, 2011

Memory Hazards, Jaggies, and EC2

Memory data hazards are prevented by keeping a list of all the addresses with outstanding stores, and putting a load into a queue if it is for an address in that list.  I don't know how WAW hazards are prevented though...  I am currently reading "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References" to try and find an answer to that question.
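A quick sketch of that scheme as I understand it (the whole interface is invented for illustration):

```python
# Loads check a list of addresses with outstanding stores; a load to such
# an address is queued until the store drains, avoiding a memory RAW hazard.

class LoadStoreUnit:
    def __init__(self):
        self.outstanding_stores = set()  # addresses with a pending store
        self.waiting_loads = []          # loads held back by a store

    def issue_store(self, addr):
        self.outstanding_stores.add(addr)

    def issue_load(self, addr):
        """Return True if the load may execute now, False if it must wait."""
        if addr in self.outstanding_stores:
            self.waiting_loads.append(addr)  # RAW on memory: queue the load
            return False
        return True

    def store_complete(self, addr):
        """Drain a store; return the queued loads that may now execute."""
        self.outstanding_stores.discard(addr)
        ready = [a for a in self.waiting_loads if a not in self.outstanding_stores]
        self.waiting_loads = [a for a in self.waiting_loads if a in self.outstanding_stores]
        return ready
```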

Indirectly, I stumbled upon the reason why having "jaggies" in a graphic is called "aliasing": multiple parts of the picture the graphic is modeling "alias" the same pixel.  As things are wont to do, this seems obvious in retrospect.

I also did some research into Amazon EC2.  It would definitely be on a faster internet connection than our current server, but the CPU may be a bit slower.  I am also unsure how installing programs is performed, and whether there is some way of viewing the Windows GUI or if you have to do everything via command line.  There is also the matter of losing all your data if the server goes down...  Although that can be prevented by using Amazon's cloud storage service.

Wednesday, September 7, 2011

Memory Superscalar Problems

Today I started reading about memory access.  I don't really have much to say on the matter yet, except that it is apparently not as easy to execute multiple memory access instructions at once as it is for integer/float instructions.  This mostly stems from the fact that until the instruction is executed, you don't know what memory location it needs access to (this is because the address needs to be calculated by adding an offset to a register or some other such operation, as well as being translated into a physical address).  You also can't use renaming to solve artificial data dependencies because there is way too much memory.

Tuesday, September 6, 2011

Instruction Issue Buffer Q&A

Today I finally finished the section on instruction issuing.  This took me a while because I had questions which didn't have readily available answers.  Here are the questions I was attempting to find answers to, along with the answers I eventually found (if you have a better answer to any of these questions, please post it in the comments):

If there are no redundant functional units (ex. only a single integer unit and floating point unit) can it still be superscalar?
Yes, superscalar is just the ability to issue instructions to multiple functional units per clock cycle.

With the single queue method, how many instructions can be issued at once?
This depends on the depth of the superscalar architecture.  If the processor is k-way superscalar, then up to k instructions can be issued in one clock cycle.

How does the in-order single queue method avoid artificial data dependencies?
Instructions complete in-order.

Can the single queue method use out of order issuing?
Yes, although register renaming will be required to fix artificial data dependencies.

Why would you want to use the multiple queue method over the single queue method?
The multiple queue method allows you to cut down on the number of data lines, since you know for certain that a queue will never issue an instruction to a functional unit outside of its designated group.  Splitting up the queues also allows you to perform register renaming for a single queue rather than all of them.  This can allow a certain queue to slip ahead of the others to perform tasks ahead of time.  For example, a queue for the load/store unit could make use of renaming in order to load memory ahead of when it is required.

How does the multiple queue method work without renaming?  How does it avoid artificial data dependencies?
In-order issuing is enforced by preventing an instruction from being issued if instructions ahead of it (possibly in other queues) have not been issued yet.
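One way to sketch that rule, under my own assumption that every instruction carries a global program-order sequence number: only the globally oldest unissued instruction, whichever queue it sits in, is allowed to issue.

```python
# Toy in-order issue across multiple queues without renaming. Each queue
# holds program-order sequence numbers with the oldest at the front.

def next_issuable(queues):
    """Return (queue name, seq) of the one instruction allowed to issue,
    i.e. the globally oldest across all queues, or None if all are empty."""
    heads = {name: q[0] for name, q in queues.items() if q}
    if not heads:
        return None
    name = min(heads, key=heads.get)  # queue whose head is globally oldest
    return name, heads[name]
```

With `{"alu": [0, 3], "mem": [1, 2]}` the ALU queue issues first; once its head is gone, the memory queue's instruction 1 becomes the oldest and issues next.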

What is the difference between reservation stations and the multiple queue method?
Instruction operands are stored in the reservation station along with the rest of the instruction's information; a queue grabs the operand values out of the register file.  Reservation stations are typically always out-of-order and thus require renaming.

These are various names for the same concept:
  • Functional Unit Buffer
  • Instruction Issue Buffer
  • Issue Window
  • Window of Execution

This is a short list of some of the organizations of a functional unit buffer:
  • Single Instruction Queue
  • Multiple Instruction Queue
  • Reservation Station

Friday, September 2, 2011

Register Renaming

Register renaming is the action of replacing the registers used by an instruction with different registers.

When register renaming is used, there are more physical registers (actual hardware registers) than there are logical registers.  A logical register is a register which is defined to exist by the architecture, but this is purely an "on-paper" existence.  The actual meat of the register, what's called the physical register, can be different from the logical register the programmer works with.

The extra registers are used to solve the WAW and WAR data dependencies by rewriting the instructions to use an unused register for storing results, and updating which registers are used as inputs (because they could be depending on the result of another instruction which has had its result register moved somewhere else).

As an example, let's say you have the following code:
move   r1, r2
move   r2, r3

These instructions have a WAR data dependency.  Let's say that "r" represents a logical register and "R" represents a physical register.  After renaming, the code might look like this:
move   R7, R3
move   R4, R5

Alright, so let's explain what happened to the first instruction.  It takes as an input the value in the logical register r2.  The last instruction to store a value in r2 (which is the value we want) stored it into the physical register R3 (because that was an unused register at the time), so the instruction is rewritten to take the value in R3 as its input.  The instruction also has an output/result value, which is stored in the logical register r1.  To ensure that this instruction does not overwrite a value that another instruction is depending on, an unused register is selected as the destination, in this case R7.

The explanation for the second instruction is pretty much the same as for the first.  However, notice what happens to the result register of the second instruction; it is different from the register used as an input by the first instruction.  This means that there is no longer a WAR data dependency between the two instructions!  The register written to by the second instruction is physically different (even though it appears to be logically the same from the perspective of the program code) from the register the first instruction reads from.

WAW data dependencies are solved by renaming as well because both write instructions are changed to write to different physical registers, even though they are writing to the same logical register; so it doesn't matter in which order they are executed.
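The whole walk-through can be condensed into a little rename function.  The free list and the starting map (r2 living in R3, r3 in R5, with R7 and R4 free) are the same assumptions as in the example above:

```python
# Renaming sketch: a map from logical to physical registers plus a free
# list. Sources are rewritten through the current map; each destination
# gets a fresh physical register.

def rename(program, mapping, free):
    """program: list of (dest, src) logical pairs, i.e. "move dest, src"."""
    renamed = []
    for dest, src in program:
        phys_src = mapping[src]    # read through the current mapping
        phys_dest = free.pop(0)    # fresh register for the result
        mapping[dest] = phys_dest  # later readers of `dest` use this one
        renamed.append((phys_dest, phys_src))
    return renamed

# The r1/r2/r3 example from the post:
mapping = {"r2": "R3", "r3": "R5"}
out = rename([("r1", "r2"), ("r2", "r3")], mapping, ["R7", "R4"])
print(out)  # [('R7', 'R3'), ('R4', 'R5')] -- the WAR dependency is gone
```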

This is probably a good place to confuse you by saying that there are different methods of renaming.  The methods I am familiar with all work basically the way I described, so you should be able to take this general overview and learn the nittier-grittier details.

Thursday, September 1, 2011

Data Hazards

Today I finished learning about two of the common (or were common in 1995 at any rate) methods for renaming registers.

First off I should probably explain what "renaming registers" means.  In order to have multiple instructions executing in parallel and/or executing out of order (very common nowadays) you need to prevent something called "data hazards".

We seem to have encountered yet another term, so let me define it real quick: A data hazard is a situation where there is a chance that the data will end up being incorrect due to the instructions executing out of order.  There are three kinds of data hazards: Read-After-Write (RAW), Write-After-Read (WAR) and Write-After-Write (WAW).  A RAW occurs because an instruction expects a value to have been stored in a storage location (memory or register) by a certain previous write instruction, but the write instruction hasn't been executed yet.  A WAR occurs because an instruction writes over a storage location before another instruction is able to read the previous value.  A WAW occurs because an instruction A writes over the value stored in a storage location by an instruction B, where the value written by instruction B was actually supposed to overwrite the value stored by instruction A (the value stored by B was supposed to survive this encounter, but the one written by A survives instead).

As you can probably see, data hazards occur because one instruction depends on the data of another instruction.  A RAW is considered to be the only "true" data dependency; the others are called "artificial" data dependencies.  I'll try to explain: In a WAR hazard, the write instruction doesn't depend on the value read by the read instruction, and the read instruction doesn't depend on the value written by the write instruction.  However, the read instruction does depend on the value still being there from the previous (in program order) write instruction.  Similarly, in a WAW hazard one write instruction does not depend on the value written by the other, but the following (in program order) instructions do depend on the value from the second write instruction being present, not the value from the first.
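Those three definitions boil down to a few set intersections.  Here is a toy classifier over read/write register sets (the dict format is my own):

```python
# Classify the hazard (if any) between two instructions in program order,
# given the sets of registers each one reads and writes.

def hazard(first, second):
    """Each instruction is a dict: {"reads": set, "writes": set}."""
    if second["reads"] & first["writes"]:
        return "RAW"   # true dependency: second needs first's result
    if second["writes"] & first["reads"]:
        return "WAR"   # artificial: second must not clobber first's input
    if second["writes"] & first["writes"]:
        return "WAW"   # artificial: the final value must come from second
    return None
```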

Unfortunately it looks like I am out of time for today.  I'll create a second post tomorrow where I will discuss renaming registers.