Tuesday, September 27, 2011

Memory Access Latency

A predicated instruction is an instruction whose execution depends on the result of a true/false test.  Another way to look at it is as a single instruction that replaces code like the following:

if (a > b) c = 6;

Predicated instructions can reduce the number of branches in your code, which may speed up execution by avoiding costly branch mispredictions.
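Here is a small C sketch of the idea (the variable names are just the ones from the snippet above, and whether the compiler actually emits a predicated instruction or a conditional move depends on the target and optimization level):

#include <stdio.h>

int main(void)
{
    int a = 10, b = 3, c = 0;

    /* Branchy form: the compiler may emit a compare followed by a
     * conditional jump, which the branch predictor has to guess. */
    if (a > b)
        c = 6;

    /* Branchless form: the same logic as a conditional expression,
     * which compilers can often lower to a conditional move (x86 CMOV)
     * or a predicated instruction (ARM) instead of a branch. */
    c = (a > b) ? 6 : c;

    printf("c = %d\n", c);
    return 0;
}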

On a slight tangent, I also learned what a transistor is: a resistor whose value (resistance) can be changed electrically.  I still don't know how they are used or why there are so many in a processor, but I've satisfied my curiosity for the moment.  I highly recommend this video on the subject: http://www.youtube.com/watch?v=CkX8SkTgB0g

You can classify a processor as having either a Brainiac design or a Speed-Demon design based on how aggressively it pushes for ILP.  A Brainiac design throws as much hardware at the problem as possible, sacrificing simplicity and size for more ILP.  A Speed-Demon design relies on the compiler to schedule instructions in a way that extracts as much ILP from the code as possible.  A Speed-Demon design is relatively simple and small, allowing for higher clock speeds (until the thermal brick wall was hit in the early 2000s), which is how it got its name.

I finally started learning about memory access.  One of the reasons I started researching CPU architecture was to find out why a Load-Hit-Store on the Xenon (the Xbox 360's processor) could cause a stall of up to 80 cycles, and I think I am getting close to an answer.

To reiterate an example from Modern Processors - A 90 Minute Guide: let's say you have a 2.0GHz CPU and 400MHz SDRAM.  Let's also say that it takes 1 cycle for the memory address to be sent from the CPU to the memory controller, 1 cycle to get to the DIMM, 5 cycles for the RAS-to-CAS delay (assuming there is a page miss, as is likely with the memory hierarchy we have today), another 5 cycles for the CAS latency, then 1 cycle to send the data to the prefetch buffer, 1 cycle to the memory controller, and finally 1 cycle to the CPU.  In total we have 15 memory-clock cycles (assuming the FSB:DRAM ratio is 1:1) to get data from main memory.  To convert this into CPU clock cycles, multiply it by the CPU:FSB ratio (the CPU multiplier), which in this case is 5.  So 15 * 5 = 75 CPU clock cycles before the data is received from main memory.  A diagram is definitely easier for me to understand, so here is a drawing I created to help me understand how this latency was calculated:
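To sanity-check the arithmetic, here is a rough C sketch that works through the same calculation; every figure in it is just an assumed value from the example above, not a measurement:

#include <stdio.h>

int main(void)
{
    /* Assumed clock speeds from the example: 2.0GHz CPU, 400MHz SDRAM,
     * with an FSB:DRAM ratio of 1:1. */
    const double cpu_hz  = 2.0e9;
    const double dram_hz = 400.0e6;

    /* Memory-clock cycles for each step of the round trip. */
    const int addr_cpu_to_controller  = 1;  /* address: CPU -> memory controller */
    const int addr_controller_to_dimm = 1;  /* address: controller -> DIMM */
    const int ras_to_cas_delay        = 5;  /* RAS-to-CAS delay (page miss) */
    const int cas_latency             = 5;  /* CAS latency */
    const int data_to_prefetch_buffer = 1;  /* data -> prefetch buffer */
    const int data_to_controller      = 1;  /* data -> memory controller */
    const int data_to_cpu             = 1;  /* data -> CPU */

    int mem_cycles = addr_cpu_to_controller + addr_controller_to_dimm
                   + ras_to_cas_delay + cas_latency
                   + data_to_prefetch_buffer + data_to_controller
                   + data_to_cpu;                    /* 15 memory cycles */

    double multiplier = cpu_hz / dram_hz;            /* CPU:FSB ratio = 5 */
    double cpu_cycles = mem_cycles * multiplier;     /* 75 CPU cycles */

    printf("%d memory cycles x %.0f = %.0f CPU cycles\n",
           mem_cycles, multiplier, cpu_cycles);
    return 0;
}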
