A predicated instruction is an instruction whose execution depends on the result of a true/false test. Another way to look at it is as a single instruction that does the work of code like the following:
if (a > b) c = 6;
Predicated instructions can reduce the number of branches in your code, which can make your program run faster because every branch you remove is one the processor can no longer mispredict.
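To make the idea a bit more concrete, here is a rough C sketch of the if statement above written three ways. The function names (with_branch, with_select, with_mask) are made up for illustration, and whether a compiler actually emits a predicated or conditional-move instruction for any of them depends on the compiler, the optimization level, and the target CPU, so treat this as an illustration of the idea rather than a guarantee about the generated code.

```c
/* Branching form: typically a compare followed by a conditional jump. */
int with_branch(int a, int b, int c)
{
    if (a > b)
        c = 6;
    return c;
}

/* Select form: a shape that compilers can often turn into a
 * conditional move or predicated instruction (not guaranteed). */
int with_select(int a, int b, int c)
{
    return (a > b) ? 6 : c;
}

/* Mask form: computes the same result with no branch in the source.
 * (a > b) is 1 or 0, so mask is all ones or all zeros. */
int with_mask(int a, int b, int c)
{
    int mask = -(a > b);
    return (6 & mask) | (c & ~mask);
}
```

All three return 6 when a > b and leave c unchanged otherwise. The only difference is how much freedom the hardware or compiler has to avoid a jump.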
On a slight tangent, I also learned what a transistor is: a resistor whose value (resistance) can change. I still don't know how they are used or why there are so many in a processor, but I've satisfied my curiosity for the moment. I highly recommend this video on the subject: http://www.youtube.com/watch?v=CkX8SkTgB0g
You can classify a processor as having either a Brainiac design or a Speed-Demon design based on how hard its hardware pushes for ILP. A Brainiac design throws as much hardware at the problem as possible, sacrificing simplicity and size for more ILP. A Speed-Demon design instead relies on the compiler to schedule instructions in a way that extracts as much ILP from the code as possible. That keeps the design relatively simple and small, which allowed for higher clock speeds (until the thermal brick wall was hit in the early 2000s), and that is how it got its name.
I finally started learning about memory access. One of the reasons I started researching CPU architecture was to find out why a Load-Hit-Store on a Xenon (the Xbox 360 processor) could cause a stall of up to 80 cycles, and I think I am getting close to an answer. To reiterate an example from Modern Microprocessors - A 90-Minute Guide!, let's say you have a 2.0 GHz CPU and 400 MHz SDRAM. Let's also say that it takes 1 cycle for the memory address to be sent from the CPU to the memory controller, 1 cycle to get to the DIMM, 5 cycles for the RAS-to-CAS delay (assuming there is a page miss, as is likely with the memory hierarchy we have today), another 5 cycles for CAS, then 1 cycle to send the data to the prefetch buffer, 1 cycle to the memory controller, and finally 1 cycle to the CPU. In total that is 1 + 1 + 5 + 5 + 1 + 1 + 1 = 15 memory-clock cycles (assuming the FSB:DRAM ratio is 1:1) to get data from main memory. To convert this into CPU clock cycles, multiply it by the CPU:FSB ratio (the CPU multiplier), which in this case is 2.0 GHz / 400 MHz = 5. So 15 * 5 = 75 CPU clock cycles pass before the data is received from main memory. A diagram is definitely easier for me to understand, so here is a drawing I created to help me understand how this latency was calculated: