Wednesday, September 28, 2011

DDR SDRAM Memory Bank

Because I'm getting into the section on memory (and some parts of yesterday's example made me curious, particularly the RAS-to-CAS delay) I have decided to research RAM again.  I did quite a bit of research in this area about a month ago, so a lot of this is review and should go by relatively quickly.

Data is stored in RAM in an array of bits called a "bank".










A horizontal line of bits make up a "word line" and a vertical line of bits make up a "bit line".


The array is split up into rows and columns.  A row is equivalent to a word line, so that's easy enough.  A column is made up of multiple contiguous bit lines, either 4, 8, or 16 depending on the architecture of the chip.  The number of bit lines which make up a column is the word width of the chip, or just simply the "width". This is usually written with an "x" followed by the width: x4, x8, and x16.












The number of rows depends on the generation of DDR and the "density" of the chip.  The density is the number of bits the chip has total across all banks.  The number of columns depends on the density and width of the chip.  The exact row and column counts for the possible densities and widths are specified by the DDR standard for the generation in question, although my experience is that you have to perform some calculations to find them.  Here is a table I put together of the row and column counts for the first generation of DDR:

Tuesday, September 27, 2011

Memory Access Latency

A predicated instruction is an instruction whose execution depends on the result of a true/false test.  Another way to look at it is a single instruction for code like the following: 

if (a > b) c = 6;

Predicated instructions can help to reduce the number of branches in your code, which may increase how fast your program executes.

On a slight tangent, I also learned what a transistor is: A resistor whose value (resistance) changes.  I still don't know how they are used or why there are so many in a processor, but I've satisfied my curiosity for the moment.  I highly recommend this video on the subject: http://www.youtube.com/watch?v=CkX8SkTgB0g

You can classify a processor as having either a Brainiac design or a Speed-Demon design based on how much it pushes for ILP.  A Brainiac design throws as much hardware at the problem as possible, sacrificing simplicity and size for more ILP.  A Speed-Demon design relies on the compiler to schedule instructions in a way that extracts as much ILP out of the code as possible.  A Speed-Demon design is relatively simple and small, allowing for higher clock speeds (until the thermal brick wall was hit in the early 2000s) which is how it got its name.

I finally started learning about memory access.  One of the reasons I started researching CPU architecture was to find out why a Load-Hit-Store on a Xenon (XBox360 processor) could cause a stall of up to 80 cycles, and I think I am getting close to an answer.  If I could reiterate an example from Modern Processors - A 90 Minute Guide, lets say you have a 2.0GHz CPU and 400MHz SDRAM.  Lets also say that it takes 1 cycle for the memory address to be sent from the CPU to the memory controller, 1 cycle to get to the DIMM, 5 cycles from the RAS-to-CAS delay (assuming there is a page miss, as is likely with the memory hierarchy we have today), another 5 cycles from CAS, then 1 cycle to send the data to the prefetch buffer, 1 cycle to the memory controller, and finally 1 cycle to the CPU.  In total we have 15 memory-clock cycles (assuming the FSB:DRAM ratio is 1:1) to get data from the main memory.  To convert this into CPU clock cycles, multiply it by the CPU:FSB ratio (CPU multiplier), which in this case is 5.  So 15*5 = 75 CPU clock cycles before the data is received from the main memory.  A diagram is definitely easier for me to understand, so here is a drawing I created to help me understand how this latency was calculated:

VLIW and Researching Interlocks

Trying out a different work schedule, so my blog entries may be a little erratic while I work out some of the kinks.  So anyways, this blog post is for yesterday.

Today I learned about a couple new techniques.  The first one is something called "very long instruction word" (VLIW).  In VLIW, a single instruction is actually composed of multiple smaller instructions which have been packed together by the compiler.  The fetch and decode stages of the pipeline can effectively work with multiple instructions in parallel, but they only have to deal with a single instruction.  The decode stage unpacks the sub-instructions and sends them to the appropriate functional units.  The decode stage does not detect hazards and the pipeline can generally only be stalled on a cache miss, so it is the job of the compiler to insert NOP instructions (no-operation) to prevent hazards.  A processor which makes use of VLIW is said to have an explicitly parallel design.

The other technique is something called an "interlock".  I am still researching the specifics, but an interlock in general is some mechanism which prevents harm to either the operator or the machine (an example would be the mechanism which locks the oven door during a self-clean cycle).  In the case of processors, an interlock can detect a data hazard in the decode stage of the pipeline and stall the previous stages while sending NOPs out of the decode stage until the hazard has been resolved.  I am assuming that an interlock is also used to prevent one stage from passing its output to the next stage in the event that the next stage is stalled (for example, a particular instruction is taking multiple clock cycles to execute and the fetch and decode stages must be stalled).

Wednesday, September 21, 2011

Google Docs Organized

I spent a good chunk of the day organizing my Google Docs page; it was becoming impossible to find anything.  Whenever I came across a resource while researching something, I would tell the Google Viewer to save it to my documents page but that caused things to become quite cluttered.

The remainder of the day was spent reading Pipelining: An Overview at ars technica.  It's mostly been review, so I don't have much to report.  However, I did learn that a non-pipelined processor is also called a "Single-Cycle Processor".  The article series also did a good job of explaining what a stall and pipeline bubble is, including a few interesting graphs.

Tuesday, September 20, 2011

Instruction Throughput Decides All

I finally have an answer to my question!  PleasingFungus from the #tigIRC channel on Esper discussed the question with me for two hours straight today and this is the answer I ended up with:

If a scalar (not superscalar) processor has multiple nonredundant functional units (a single FPU and a single ALU for example) and it issues an instruction to the FPU at a certain clock tick, could it issue another instruction to the ALU at the next clock tick so that both the FPU and the ALU are executing at the same time (lets say that FPU instructions take longer to execute than ALU instructions)?  Or would that make the processor superscalar?
A processor is only superscalar if its maximum instruction throughput is greater than 1.  Because only a single instruction is being issued each clock cycle, the maximum instruction throughput is 1, so the processor is not superscalar.  A processor can have multiple pipelines, but if the maximum instruction throughput is equal to or less than 1 the processor is not considered to be superscalar.

Instruction throughput is a measure of the number of instructions which can be processed by the pipeline (committed) per cycle.  Because all instructions don't always get through the pipeline in the same amount of time, the "average instruction throughput" can change depending on which instructions were executed.  The maximum instruction throughput is a static measurement however, representing the best case scenario.

Even if the instruction pipeline splits into two separate pipelines (one for each functional unit) at the issue stage (or some other point), if only one instruction is dispatched per clock cycle the processor has an instruction throughput less than or equal to 1.  Two instructions might complete at the same time (if one functional unit is slower than the other), thus giving the illusion of having an instruction throughput of 2, but it will take at least one cycle to "recover" from this, during which no instructions will complete; so it evens out.

I am still doing some research into this topic, although I currently consider myself to be about 50% finished with the question.  I plan on reading the Pipelining: An Overview series at ars technica after working through some example pipelines on paper.

Monday, September 19, 2011

Scalar But Multiple Functional Units?

I spent most of the day trying to find the answer to the following question:

If a scalar (not superscalar) processor has multiple nonredundant functional units (a single FPU and a single ALU for example) and it issues an instruction to the FPU at a certain clock tick, could it issue another instruction to the ALU at the next clock tick so that both the FPU and the ALU are executing at the same time (lets say that FPU instructions take longer to execute than ALU instructions)?  Or would that make the processor superscalar?

Unfortunately, I have not yet found a definite answer.  After a rather long discussion in the #hardware channel on Freenode, one person said that such a processor would have to be superscalar.  I need to get the opinion of several people on this matter though, especially since he seemed a little unsure of the answer he gave.

I asked for help on a hardware forum, but there have been no replies as of yet.  I may search for a more technical hardware forum tomorrow.

Friday, September 16, 2011

Researching Thornton's Algorithm

"Scoreboarding" is also called "Thornton's Algorithm".  Tomasulo's Algorithm and Thornton's Algorithm were developed around the same time with roughly the same goals.
  • They are both dynamic scheduling methods.
  • They both resolve dependencies (RAW, WAR, and WAW).
  • They are both pipelined.
  • They both have limited out-of-order execution capabilities.
  • Neither was designed to be superscalar.
Tomasulo's Algorithm, while solving dependencies better, was more complicated than Thornton's Algorithm.