Wednesday, September 28, 2011

DDR SDRAM Memory Bank

Because I'm getting into the section on memory (and some parts of yesterday's example made me curious, particularly the RAS-to-CAS delay) I have decided to research RAM again.  I did quite a bit of research in this area about a month ago, so a lot of this is review and should go by relatively quickly.

Data is stored in RAM in an array of bits called a "bank".

A horizontal line of bits makes up a "word line" and a vertical line of bits makes up a "bit line".

The array is split up into rows and columns.  A row is equivalent to a word line, so that's easy enough.  A column is made up of multiple contiguous bit lines, either 4, 8, or 16 depending on the architecture of the chip.  The number of bit lines which make up a column is the word width of the chip, or just simply the "width". This is usually written with an "x" followed by the width: x4, x8, and x16.

The number of rows depends on the generation of DDR and the "density" of the chip.  The density is the number of bits the chip has total across all banks.  The number of columns depends on the density and width of the chip.  The exact row and column counts for the possible densities and widths are specified by the DDR standard for the generation in question, although my experience is that you have to perform some calculations to find them.  Here is a table I put together of the row and column counts for the first generation of DDR:
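The arithmetic behind such a table is simple enough to sketch in Python. This is my own sketch, not anything out of the standard; the 256Mb x8 example and the 13/10 row/column split are my assumptions based on typical datasheets, and first-generation DDR uses four banks per chip (if I'm reading things right):

```python
import math

def bank_geometry_bits(density_bits, banks, width):
    """Split a chip's capacity into per-bank address bits.

    density_bits: total capacity of the chip across all banks
    banks: number of banks (4 for first-generation DDR)
    width: word width of the chip (x4, x8, or x16)
    Returns the number of row+column address bits per bank.
    """
    addresses_per_bank = density_bits // (banks * width)
    return int(math.log2(addresses_per_bank))

# A hypothetical 256Mb x8 chip with 4 banks:
bits = bank_geometry_bits(256 * 2**20, 4, 8)
# 23 address bits per bank, which a datasheet might split as
# 13 row bits (8192 rows) and 10 column bits (1024 columns)
```

The split between row bits and column bits is what the standard actually dictates; the total is just capacity arithmetic.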

Tuesday, September 27, 2011

Memory Access Latency

A predicated instruction is an instruction whose execution depends on the result of a true/false test.  Another way to look at it is a single instruction for code like the following: 

if (a > b) c = 6;

Predicated instructions can help to reduce the number of branches in your code, which may increase how fast your program executes.
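To convince myself this made sense, here's a quick Python analogy (function names are mine): the predicated version always executes the assignment and uses a 1/0 predicate to select which value survives, much like a conditional-move instruction would in hardware:

```python
def with_branch(a, b, c):
    # branchy version: the CPU has to predict whether the jump is taken
    if a > b:
        c = 6
    return c

def predicated(a, b, c):
    # branch-free analogy of a predicated/conditional-move instruction:
    # always compute, let the predicate pick which value survives
    p = int(a > b)
    return p * 6 + (1 - p) * c
```

Both functions compute the same thing; the second just never branches.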

On a slight tangent, I also learned what a transistor is: A resistor whose value (resistance) changes.  I still don't know how they are used or why there are so many in a processor, but I've satisfied my curiosity for the moment.  I highly recommend this video on the subject:

You can classify a processor as having either a Brainiac design or a Speed-Demon design based on how much it pushes for ILP.  A Brainiac design throws as much hardware at the problem as possible, sacrificing simplicity and size for more ILP.  A Speed-Demon design relies on the compiler to schedule instructions in a way that extracts as much ILP out of the code as possible.  A Speed-Demon design is relatively simple and small, allowing for higher clock speeds (until the thermal brick wall was hit in the early 2000s) which is how it got its name.

I finally started learning about memory access.  One of the reasons I started researching CPU architecture was to find out why a Load-Hit-Store on a Xenon (XBox360 processor) could cause a stall of up to 80 cycles, and I think I am getting close to an answer.  To reiterate an example from Modern Microprocessors - A 90-Minute Guide, let's say you have a 2.0GHz CPU and 400MHz SDRAM.  Let's also say that it takes 1 cycle for the memory address to be sent from the CPU to the memory controller, 1 cycle to get to the DIMM, 5 cycles from the RAS-to-CAS delay (assuming there is a page miss, as is likely with the memory hierarchy we have today), another 5 cycles from CAS, then 1 cycle to send the data to the prefetch buffer, 1 cycle to the memory controller, and finally 1 cycle to the CPU.  In total we have 15 memory-clock cycles (assuming the FSB:DRAM ratio is 1:1) to get data from the main memory.  To convert this into CPU clock cycles, multiply by the CPU:FSB ratio (the CPU multiplier), which in this case is 5.  So 15 * 5 = 75 CPU clock cycles before the data is received from the main memory.  A diagram is definitely easier for me to understand, so here is a drawing I created to help me understand how this latency was calculated:
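The arithmetic itself is easy to check in Python (the cycle counts are the assumed values from the example, not measurements of any real system):

```python
# Assumed numbers from the example: 2.0GHz CPU, 400MHz SDRAM.
cpu_hz = 2_000_000_000
mem_hz = 400_000_000
multiplier = cpu_hz // mem_hz            # CPU:FSB ratio = 5

memory_cycles = (
    1    # address: CPU -> memory controller
    + 1  # memory controller -> DIMM
    + 5  # RAS-to-CAS delay (page miss assumed)
    + 5  # CAS latency
    + 1  # data -> prefetch buffer
    + 1  # prefetch buffer -> memory controller
    + 1  # memory controller -> CPU
)
cpu_cycles = memory_cycles * multiplier  # 15 * 5 = 75
```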

VLIW and Researching Interlocks

Trying out a different work schedule, so my blog entries may be a little erratic while I work out some of the kinks.  So anyways, this blog post is for yesterday.

Today I learned about a couple new techniques.  The first one is something called "very long instruction word" (VLIW).  In VLIW, a single instruction is actually composed of multiple smaller instructions which have been packed together by the compiler.  The fetch and decode stages of the pipeline can effectively work with multiple instructions in parallel, but they only have to deal with a single instruction.  The decode stage unpacks the sub-instructions and sends them to the appropriate functional units.  The decode stage does not detect hazards and the pipeline can generally only be stalled on a cache miss, so it is the job of the compiler to insert NOP instructions (no-operation) to prevent hazards.  A processor which makes use of VLIW is said to have an explicitly parallel design.

The other technique is something called an "interlock".  I am still researching the specifics, but an interlock in general is some mechanism which prevents harm to either the operator or the machine (an example would be the mechanism which locks the oven door during a self-clean cycle).  In the case of processors, an interlock can detect a data hazard in the decode stage of the pipeline and stall the previous stages while sending NOPs out of the decode stage until the hazard has been resolved.  I am assuming that an interlock is also used to prevent one stage from passing its output to the next stage in the event that the next stage is stalled (for example, a particular instruction is taking multiple clock cycles to execute and the fetch and decode stages must be stalled).

Wednesday, September 21, 2011

Google Docs Organized

I spent a good chunk of the day organizing my Google Docs page; it was becoming impossible to find anything.  Whenever I came across a resource while researching something, I would tell the Google Viewer to save it to my documents page but that caused things to become quite cluttered.

The remainder of the day was spent reading Pipelining: An Overview at ars technica.  It's mostly been review, so I don't have much to report.  However, I did learn that a non-pipelined processor is also called a "Single-Cycle Processor".  The article series also did a good job of explaining what stalls and pipeline bubbles are, including a few interesting graphs.

Tuesday, September 20, 2011

Instruction Throughput Decides All

I finally have an answer to my question!  PleasingFungus from the #tigIRC channel on Esper discussed the question with me for two hours straight today and this is the answer I ended up with:

If a scalar (not superscalar) processor has multiple nonredundant functional units (a single FPU and a single ALU for example) and it issues an instruction to the FPU at a certain clock tick, could it issue another instruction to the ALU at the next clock tick so that both the FPU and the ALU are executing at the same time (let's say that FPU instructions take longer to execute than ALU instructions)?  Or would that make the processor superscalar?
A processor is only superscalar if its maximum instruction throughput is greater than 1.  Because only a single instruction is being issued each clock cycle, the maximum instruction throughput is 1, so the processor is not superscalar.  A processor can have multiple pipelines, but if the maximum instruction throughput is equal to or less than 1 the processor is not considered to be superscalar.

Instruction throughput is a measure of the number of instructions which can be processed by the pipeline (committed) per cycle.  Because instructions don't all take the same amount of time to get through the pipeline, the "average instruction throughput" can change depending on which instructions were executed.  The maximum instruction throughput, however, is a static measurement representing the best-case scenario.

Even if the instruction pipeline splits into two separate pipelines (one for each functional unit) at the issue stage (or some other point), if only one instruction is dispatched per clock cycle the processor has an instruction throughput less than or equal to 1.  Two instructions might complete at the same time (if one functional unit is slower than the other), thus giving the illusion of having an instruction throughput of 2, but it will take at least one cycle to "recover" from this, during which no instructions will complete; so it evens out.
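Here's a toy Python model I put together to see this "evens out" effect for myself (the names and latencies are my own assumptions): one instruction is issued per cycle, the FPU op takes 2 cycles and the ALU op takes 1, so pairs of instructions complete together on even cycles with an empty cycle in between — the average never exceeds 1.

```python
from collections import Counter

def completion_histogram(latencies):
    """Instruction i is issued at cycle i (one issue per cycle) and
    completes at cycle i + latency.  Returns completions per cycle."""
    return Counter(i + lat for i, lat in enumerate(latencies))

# Alternate a slow FPU op (2 cycles) with a fast ALU op (1 cycle):
hist = completion_histogram([2, 1, 2, 1, 2, 1])
# completions land in pairs on cycles 2, 4, and 6, with nothing
# completing on cycles 3 and 5 -- an average of one per cycle
```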

I am still doing some research into this topic, although I currently consider myself to be about 50% finished with the question.  I plan on reading the Pipelining: An Overview series at ars technica after working through some example pipelines on paper.

Monday, September 19, 2011

Scalar But Multiple Functional Units?

I spent most of the day trying to find the answer to the following question:

If a scalar (not superscalar) processor has multiple nonredundant functional units (a single FPU and a single ALU for example) and it issues an instruction to the FPU at a certain clock tick, could it issue another instruction to the ALU at the next clock tick so that both the FPU and the ALU are executing at the same time (let's say that FPU instructions take longer to execute than ALU instructions)?  Or would that make the processor superscalar?

Unfortunately, I have not yet found a definite answer.  After a rather long discussion in the #hardware channel on Freenode, one person said that such a processor would have to be superscalar.  I need to get the opinion of several people on this matter though, especially since he seemed a little unsure of the answer he gave.

I asked for help on a hardware forum, but there have been no replies as of yet.  I may search for a more technical hardware forum tomorrow.

Friday, September 16, 2011

Researching Thornton's Algorithm

"Scoreboarding" is also called "Thornton's Algorithm".  Tomasulo's Algorithm and Thornton's Algorithm were developed around the same time with roughly the same goals.
  • They are both dynamic scheduling methods.
  • They both resolve dependencies (RAW, WAR, and WAW).
  • They are both pipelined.
  • They both have limited out-of-order execution capabilities.
  • Neither was designed to be superscalar.
Tomasulo's Algorithm, while solving dependencies better, was more complicated than Thornton's Algorithm.

Thursday, September 15, 2011

Forwarding and Out-of-Order Scalar Trouble

Welp, I am still trying to figure out how an out-of-order scalar processor would work.  The out-of-order execution part is easy enough, but I am having trouble figuring out how in-order completion would work.  To this end, I am currently researching scoreboarding in the hopes of gleaning some hint as to how such a processor would work.  If anyone knows of a processor which had out-of-order execution and in-order completion (different from commit), please let me know!

Research Questions

What is forwarding?
Forwarding is what you call the technique of passing the result of an instruction directly to a dependent instruction before the result is committed.  This allows an instruction to execute before its operands are available in the register file.

Wednesday, September 14, 2011

OOO Execution with In-Order Completion?

Today I was able to find an answer to the following research question:

Can a processor have out-of-order execution and in-order completion?
Yes; with one pipeline per active instruction.  In other words, for every instruction in the active instruction window (every instruction which can be selected for execution) there needs to be a functional unit.  This seems a little impractical and I doubt this sort of processor has been used much.

I am currently working out how the various combinations of in-order/out-of-order and scalar/superscalar affect the design of a processor.  I have the superscalar combinations figured out, and I am currently working on the scalar combinations.

Tuesday, September 13, 2011

Reorder Buffer, Public Project Tracking

Today I was able to find answers to the following questions:

What is the reorder buffer?
It is a buffer which is used to ensure that instructions commit in-order.  It is also used in one method of register renaming.  There is one entry for every active instruction.  An instruction is active after it has been dispatched and before it has been committed.  A reorder buffer is a circular FIFO queue, meaning it has a head pointer and a tail pointer which can wrap around once they have hit the end of the buffer, and entries are removed in the order they are added.  When an instruction is dispatched, an entry for it is placed in the reorder buffer where the tail is pointing (and the tail pointer is incremented to the next entry address).  When an instruction is being pointed to by the head pointer and the instruction has completed execution, it is committed and the head pointer is incremented to the next entry address.
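Here is a minimal Python sketch of that circular FIFO, using my own names (a real reorder buffer holds much more per entry — results, exception flags, and so on). The key property is that complete() can be called in any order, but commit() only ever releases the head:

```python
class ReorderBuffer:
    """Toy circular-FIFO reorder buffer."""

    def __init__(self, size):
        self.entries = [None] * size   # one slot per active instruction
        self.head = 0                  # oldest active instruction
        self.tail = 0                  # next free slot
        self.count = 0

    def dispatch(self, tag):
        if self.count == len(self.entries):
            raise RuntimeError("ROB full: dispatch must stall")
        self.entries[self.tail] = {"tag": tag, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1

    def complete(self, tag):
        # execution may finish in any order
        for e in self.entries:
            if e and e["tag"] == tag:
                e["done"] = True

    def commit(self):
        # only the head may commit, so commits happen in program order
        e = self.entries[self.head]
        if e is None or not e["done"]:
            return None
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return e["tag"]
```

Dispatching A then B, completing B first, and then calling commit() returns nothing until A also completes — exactly the in-order guarantee described above.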

Can a superscalar processor have in-order execution?
Yes.  If sequential instructions are being executed in parallel, it is still considered to be in-order execution.

I also added the following unanswered questions to my research list:
  • What does "dynamically encountered" mean regarding memory operations in the dynamic reordering process of a processor?
  • Where is the active instruction window?
  • Can a processor have out-of-order execution and in-order completion? 

One of our community members and moderators, Acaceol, created a rather cool IRC-based project tracking system.  I am currently using it to track the various projects and tasks I have to do, and the best part is that anyone can check up on the current status of development/research by simply joining our online chat room located at and either asking the project management system for the current status or watching as we update it throughout the day.  This will allow you to keep up-to-date on exactly what is being worked on, including how much time has been spent on the various tasks and projects!

Friday, September 9, 2011

Execution Scheduling and Memory Disambiguation

Unfortunately I was unable to get much research done today.  However, I did learn a few things.

A processor can be either statically scheduled or dynamically scheduled.

A statically scheduled processor depends on the compiler to create the execution schedule.  I am guessing this is either through hints or by actually reordering instructions at compile time.

A dynamically scheduled processor creates the execution schedule at run-time.  This is apparently the preferred method, although I can only guess as to the reasons at this point.  One thought is that you can change the hardware and the same program doesn't need to be recompiled.

Memory Disambiguation is used in the creation of the execution schedule to reorder memory references without running into a data hazard; it is the process of determining if two instructions access the same memory location.

As with creating the execution schedule, memory disambiguation comes in two flavors: Static Disambiguation and Dynamic Disambiguation.  As with scheduling, the static variation is performed by the compiler and the dynamic variation is performed by the processor.  Either or both can be used and they don't interfere with each other.
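The core test is easy to sketch (the function name is mine). The catch, as noted above, is that a dynamic disambiguator can only apply it once the addresses have actually been calculated:

```python
def may_swap(op1, op2):
    """Can two memory operations be reordered safely?

    Each op is a (kind, address) pair, kind being "load" or "store".
    Reordering is unsafe only when both ops touch the same address and
    at least one of them is a store (a RAW, WAR, or WAW on that
    location); two loads of the same address can always swap.
    """
    kind1, addr1 = op1
    kind2, addr2 = op2
    if addr1 != addr2:
        return True
    return kind1 == "load" and kind2 == "load"
```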

How dynamic disambiguation works with a statically scheduled processor, I don't know.  Hopefully that is one of those questions which I will find an answer to.

Thursday, September 8, 2011

Memory Hazards, Jaggies, and EC2

Memory data hazards are prevented by keeping a list of all the addresses with outstanding stores, and putting a load into a queue if it is for an address in that list.  I don't know how WAW hazards are prevented though...  I am currently reading "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References" to try and find an answer to that question.

Indirectly, I stumbled upon the reason why having "jaggies" in a graphic is called "aliasing": Multiple parts of the picture the graphic is modeling "alias" the same pixel.  As things are wont to do, this seems obvious in retrospect.

I also did some research into Amazon EC2.  It would definitely be on a faster internet connection than our current server, but the CPU may be a bit slower.  I am also unsure how installing programs is performed, and whether there is some way of viewing the Windows GUI or if you have to do everything via command line.  There is also the matter of losing all your data if the server goes down...  Although that can be prevented by using Amazon's cloud storage service.

Wednesday, September 7, 2011

Memory Superscalar Problems

Today I started reading about memory access.  I don't really have much to say on the matter yet, except that it is apparently not as easy to execute multiple memory access instructions at once as it is for integer/float instructions.  This mostly stems from the fact that until the instruction is executed, you don't know what memory location it needs access to (this is because the address needs to be calculated by adding an offset to a register or some other such operation, as well as being translated into a physical address).  You also can't use renaming to solve artificial data dependencies because there is way too much memory.

Tuesday, September 6, 2011

Instruction Issue Buffer Q&A

Today I finally finished the section on instruction issuing.  This took me a while because I had questions which didn't have readily available answers.  Here are the questions I was attempting to find answers to, along with the answers I eventually found (if you have a better answer to any of these questions, please post it in the comments):

If there are no redundant functional units (ex. only a single integer unit and floating point unit) can it still be superscalar?
Yes, superscalar is just the ability to issue instructions to multiple functional units per clock cycle.

With the single queue method, how many instructions can be issued at once?
This depends on the depth of the superscalar architecture.  If the processor is k-way superscalar, then up to k instructions can be issued in one clock cycle.

How does the in-order single queue method avoid artificial data dependencies?
Instructions complete in-order.

Can the single queue method use out of order issuing?
Yes, although register renaming will be required to fix artificial data dependencies.

Why would you want to use the multiple queue method over the single queue method?
The multiple queue method allows you to cut down on the number of data lines, since you know for certain that a queue will never issue an instruction to a functional unit outside of its designated group.  Splitting up the queues also allows you to perform register renaming for a single queue rather than all of them.  This can allow a certain queue to slip ahead of the others to perform tasks ahead of time.  For example, a queue for the load/store unit could make use of renaming in order to load memory ahead of when it is required.

How does the multiple queue method work without renaming?  How does it avoid artificial data dependencies?
In-order issuing is enforced by preventing an instruction from being issued if instructions ahead of it (possibly in other queues) have not been issued yet.

What is the difference between reservation stations and the multiple queue method?
Instruction operands are stored in the reservation station along with the rest of the instruction's information; a queue grabs the operand values out of the register file.  Reservation stations are typically always out-of-order and thus require renaming.

These are various names for the same concept:
  • Functional Unit Buffer
  • Instruction Issue Buffer
  • Issue Window
  • Window of Execution

This is a short list of some of the organizations of a functional unit buffer:
  • Single Instruction Queue
  • Multiple Instruction Queue
  • Reservation Station

Friday, September 2, 2011

Register Renaming

Register renaming is the action of replacing the registers used by an instruction with different registers.

When register renaming is used, there are more physical registers (actual hardware registers) than there are logical registers.  A logical register is a register which is defined to exist by the architecture, but this is purely an "on-paper" existence.  The actual meat of the register, what's called the physical register, can be different from the logical register the programmer works with.

The extra registers are used to solve the WAW and WAR data dependencies by rewriting the instructions to use an unused register for storing results, and updating which registers are used as inputs (because they could be depending on the result of another instruction which has had its result register moved somewhere else).

As an example, let's say you have the following code:
move   r1, r2
move   r2, r3

These instructions have a WAR data dependency.  Let's say that "r" represents a logical register, and "R" represents a physical register.  After renaming, the code might look like this:
move   R7, R3
move   R4, R5

Alright, so let's explain what happened to the first instruction.  It takes as an input the value in the logical register r2.  The last instruction to store a value in r2 (which is the value we want) stored it into the physical register R3 (because that was an unused register at the time), so the instruction is rewritten to take the value in R3 as the input.  The instruction also has an output/result value, which is stored in the logical register r1.  In order to ensure that this instruction does not overwrite a value that another instruction is depending on, an unused register is selected, in this case R7.

The explanation for the second instruction is pretty much the same as for the first.  However, notice what happens to the result register for the second instruction; it is different from the register used as an input in the first instruction.  This means that there is no longer a WAR data dependency between the two instructions!  The register written to by the second instruction is physically different (even though it appears to be logically the same from the perspective of the program code) from the register the first instruction reads from.

WAW data dependencies are solved by renaming as well because both write instructions are changed to write to different physical registers, even though they are writing to the same logical register; so it doesn't matter in which order they are executed.
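Here is a sketch of that rewriting process in Python.  The initial logical-to-physical mapping and the free list are arbitrary choices of mine, so the physical register numbers come out different from the example above, but the effect is the same:

```python
def rename(instructions, mapping, free_regs):
    """instructions: list of (dest, src) logical-register pairs.
    mapping: current logical -> physical assignments.
    free_regs: physical registers not holding a live value."""
    renamed = []
    for dest, src in instructions:
        phys_src = mapping[src]        # read the *current* home of src
        phys_dest = free_regs.pop(0)   # results always go to a fresh reg
        mapping[dest] = phys_dest      # later readers of dest use it
        renamed.append((phys_dest, phys_src))
    return renamed

code = [("r1", "r2"), ("r2", "r3")]    # WAR hazard on r2
out = rename(code, {"r1": "R1", "r2": "R2", "r3": "R3"}, ["R4", "R5"])
# out == [("R4", "R2"), ("R5", "R3")] -- the second write lands in R5
# while the first instruction reads R2, so the WAR hazard is gone
```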

This is probably a good place to confuse you by saying that there are different methods of renaming.  The methods I am familiar with all work basically the way I described, so you should be able to take this general overview and learn the nittier-grittier details.

Thursday, September 1, 2011

Data Hazards

Today I finished learning about two of the common (or were common in 1995 at any rate) methods for renaming registers.

First off I should probably explain what "renaming registers" means.  In order to have multiple instructions executing in parallel and/or executing out of order (very common nowadays) you need to prevent something called "data hazards".

We seem to have encountered yet another term, so let me define it real quick: A data hazard is a situation where there is a chance that the data will end up being incorrect due to the instructions executing out of order.  There are three kinds of data hazards: Read-After-Write (RAW), Write-After-Read (WAR) and Write-After-Write (WAW).  A RAW occurs when an instruction expects a value to have been stored in a storage location (memory or register) by a certain previous write instruction, but the write instruction hasn't been executed yet.  A WAR occurs when an instruction writes over a storage location before another instruction is able to read the previous value.  A WAW occurs when two instructions write to the same storage location out of order: the value from the later (in program order) write was supposed to survive, but the value from the earlier write survives instead.

As you can probably see, data hazards occur because one instruction depends on the data of another instruction.  A RAW is considered to be the only "true" data dependency; the others are called "artificial" data dependencies.  I'll try to explain: In a WAR hazard, the write instruction doesn't depend on the value read by the read instruction, and the read instruction doesn't depend on the value written by the write instruction.  However, the read instruction does depend on the value still being there from the previous (in program order) write instruction.  Similarly, in a WAW hazard one write instruction does not depend on the value written by the other, but the following (in program order) instructions do depend on the value from the second write instruction being present, not the value from the first.
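A two-instruction toy program makes the WAR case concrete (register names and values are mine): i1 reads r2 and i2 writes r2, so swapping them changes what i1 sees.

```python
def execute(order):
    """Run two instructions in the given order and return r1."""
    regs = {"r1": 0, "r2": 7, "r3": 5}
    program = {
        "i1": lambda r: r.update(r1=r["r2"]),  # i1: r1 = r2 (reads r2)
        "i2": lambda r: r.update(r2=r["r3"]),  # i2: r2 = r3 (writes r2)
    }
    for name in order:
        program[name](regs)
    return regs["r1"]

execute(["i1", "i2"])  # program order: r1 gets the old r2, which is 7
execute(["i2", "i1"])  # reordered: r1 wrongly gets the new r2, which is 5
```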

Unfortunately it looks like I am out of time for today.  I'll create a second post tomorrow where I will discuss renaming registers.

Wednesday, August 31, 2011

Researching CPU Architecture

I will be trying out a new idea while developing the SHMUP:  Develop it as though I am creating it for a console, where the console is my specific computer.  The plan is to add in support for other hardware afterwards.  This should also help me determine the system requirements, as I will be a lot more familiar with what the game requires.

So the first step is to learn about my computer's hardware, which I have been doing for the last couple weeks.  So far I have studied system memory and chipset architecture, although there is still more to learn of course.  I probably know enough in these areas for the moment though.  I am currently researching CPU architecture, and if you are doing the same I would highly recommend this article: The Microarchitecture of Superscalar Processors

Tuesday, June 28, 2011

Back to the SHMUP?

Well, I am finally through the prereqs section of the graph theory book.  Although I didn't do the final problem set, I should have a good enough grasp on the subjects to make use of them.

I am currently considering using kinematics for animation in the SHMUP rather than dynamics, which would allow me to temporarily skip most of this math.  Kinematics is basically hard-coded movement, where dynamics is more physics-based.  Gamers know these as arcade games and simulation games respectively.

We are also redesigning the SHMUP so that it will be easier to create.  This means that I shouldn't have to make many test projects, and I can jump right back in to coding the "real" game.

Friday, June 17, 2011

Mathematical functions

A function is similar to a relation.  It is a set of ordered pairs (a, b) where a is an element of some set A and b is an element of some set B.  For each element in A, there is exactly one ordered pair with that element in the first position.  Given such a function f, we say that it is a function from A to B, which is written f:A->B.  We could also say that f:A->B is a mapping from A to B, in which case we would say that f maps from A into B.  This is written b = f(a), similar to how a relation is written aRb.  We say that a is mapped into b by f, and that b is the image of a under f.

Something worth noting is that because there is exactly one ordered pair per element in A, there is exactly one image of each element in A under f.  In other words, an element of A has exactly one image, no more, no less.  However, multiple elements of A can share the same image.  If each image is the image of only a single element of A, we say that f is one-to-one.  If all the elements of B are images of elements in A under f, we say that the function f maps from A onto B, rather than "into".

A composition of two functions f:A->B and g:B->C is the function g∘f:A->C which maps from A into C.  Such a function is written c = g(f(a)) where a is an element of A and c is an element of C.  This works exactly how you think it would: First b = f(a) is calculated, and the resulting element b is fed into g.

And lastly we have something called a "permutation".  A permutation is a function f:A->A where A is not the empty set and f is both one-to-one and maps from A onto A.  What you end up with is a function which rearranges the elements in A.
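These definitions translate almost directly into Python (the names are mine; I represent a function on a finite set as a dict of ordered pairs):

```python
def compose(g, f):
    """g∘f, the function that maps a to g(f(a))."""
    return lambda a: g(f(a))

def is_permutation(f, A):
    """f (a dict) is a permutation of A if it maps A to A,
    one-to-one and onto: every element of A is the image of
    exactly one element of A."""
    return set(f.keys()) == A and set(f.values()) == A

# A permutation of A = {0, 1, 2}: it just rearranges the elements
perm = {0: 2, 1: 0, 2: 1}

# composing "double it" with "add one" gives a -> 2a + 1
double_then_inc = compose(lambda b: b + 1, lambda a: a * 2)
```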

Thursday, June 16, 2011

Equivalence relations

Today I learned about equivalence relations, equivalence classes, partitions, divides, and congruent to modulo n.

An equivalence relation is a relation that is reflexive, symmetric, and transitive.  Pretty straightforward.

An equivalence class (also known as an equivalence set) is denoted [a] = {x | x is an element of A, x R a}.  In other words, it is the set [a] consisting of all the elements of the set A which are related to a by the equivalence relation R.  To look at it in a slightly different way, [a] is the set of all elements in A which are related to a.  For example, if the relation is "has the same PvP score as" and A is the set of all the people in your gaming community and a is someone in that community (let's say it is you), then [a] would be the set of everyone in your gaming community who has the same PvP score as you do.

A partition is a set of sets.  So for example, S = {A1, A2, ..., An} where Ai is some set for 1 <= i <= n.  A partition has a few requirements for that set of sets though.  Each Ai is a subset of a single set A.  Also, every pair of distinct subsets (I will get to what "distinct" means shortly) is disjoint, meaning their intersection is the empty set (there are no overlapping elements between subsets).  And finally, the union of all Ai must be the set A.  This is kind of like partitioning a hard drive or separating the screen in a Tetris-like game into chunks for the various shapes and the empty space.

There is another noteworthy "set of sets": the equivalence classes.  You can actually make a set out of all the possible equivalence classes for a particular (nonempty) set and equivalence relation: S = {[a] | a is an element of A}  This is noteworthy because it turns out that the set of equivalence classes S is a partition of the set A.  However, if you are like me this can get confusing when you realize that two or more of the equivalence classes in the set can be equal.  For example, if A={1, 2, 3, 4} and the relation is set up so that [1] = {1, 3} and [3] = {1, 3}.  But isn't one of the requirements for a partition that all pairs of sets be disjoint?  How are {1, 3} and {1, 3} disjoint?  Well, the key bit of information I was missing is this: Only distinct sets count.  Two sets are distinct if they are not equal.  This makes sense, since you can have a set A = {a, b, a, c} but all of the a's are treated as the same element.

Let's define the set Z to be the set of all integers.  Given two variables d and m which are both elements of Z (they are both integers) and where d does not equal 0, if there is a variable q which is also an element of Z such that m = dq then you could say that d divides m, or in more "mathy" terms: d|m.  Note that q must be an element of Z, so in other words an integer.  If the division has a remainder, then q doesn't exist in Z and d does not divide m.

Let's look at d|m again, but this time using three variables instead of two: n, a, and b, which are all elements of Z.  If you arrange these so that it reads "n divides the difference between a and b", in other words n|(a - b), you can also say that "a is congruent to b modulo n".  The mathematical way to write it is a ≡ b (mod n).

Using congruent modulo n you get an interesting set called "integers modulo n", written Zn or Z/(n).  It is the set of distinct equivalence classes from the relation "is congruent modulo n" on the set of integers Z where n >= 2.  I stopped studying today about half-way through a problem which was designed to show uses for this set, so I'm not really sure how you could put it to use.  Based on the name however (and after seeing what a few of the equivalence classes are for a couple values of n), I'm going to guess that it can be used in division, possibly as a lookup table or for something related to remainders.  I'm sure I'll find out later!
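These definitions are easy to play with in Python (function names are mine, and I use a finite slice of Z since the whole set is a bit large for a demo):

```python
def divides(d, m):
    """d | m: there is an integer q with m = d*q (d nonzero)."""
    return d != 0 and m % d == 0

def congruent(a, b, n):
    """a ≡ b (mod n): n divides a - b."""
    return divides(n, a - b)

def z_mod(n, universe):
    """The distinct equivalence classes of "is congruent modulo n"
    over a finite slice of Z, grouped by remainder."""
    classes = {}
    for a in universe:
        classes.setdefault(a % n, set()).add(a)
    return list(classes.values())

# Z3 over -4..4 has exactly three distinct classes, one per remainder
classes = z_mod(3, range(-4, 5))
```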

Wednesday, June 15, 2011

GDC Vault and Relations

Apparently you can get a lot of the GDC presentations for free at the GDC Vault website.  It doesn't have everything unfortunately (I couldn't find Erin Catto's 2006 presentation), but there are some cool things in there.  I didn't have much time to browse around, but the Halo: Reach networking presentation caught my eye; it's rather good, and I recommend it to anyone who is writing networking code.

Today I learned about relations and Cartesian products.  The Cartesian product of two sets A and B is the set consisting of all the ordered pairs (a, b) where a is an element of A and b is an element of B.

A relation on a set V is any subset of the Cartesian product of V with itself.  Given an ordered pair (a, b) in a relation R, a is said to be related to b by R.

Relations can have several properties: reflexive, irreflexive, symmetric, asymmetric, and transitive.  A relation R on V is reflexive if for every x in V the pair (x, x) is in R; it is irreflexive if for no x in V is (x, x) in R.  A relation R is symmetric if whenever (a, b) is in R, (b, a) is also in R; it is asymmetric if whenever (a, b) is in R and b is not equal to a, (b, a) is not in R (this property is usually called antisymmetric).  And finally, a relation R is transitive if whenever (a, b) and (b, c) are both in R, (a, c) is also in R.
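These definitions are easy to brute-force check on a small set; a sketch in Python (the example relation is my own):

```python
from itertools import product

# Sketch: checking relation properties by brute force on a small set V.
def is_reflexive(V, R):
    return all((x, x) in R for x in V)

def is_symmetric(R):
    return all((b, a) in R for (a, b) in R)

def is_transitive(R):
    # whenever (a, b) and (b, c) are in R, (a, c) must be too
    return all((a, c) in R
               for (a, b) in R for (b2, c) in R if b == b2)

V = {1, 2, 3}
R = {(a, b) for a, b in product(V, V) if a % 2 == b % 2}  # same parity
assert is_reflexive(V, R) and is_symmetric(R) and is_transitive(R)
```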

I also learned that the cardinality of a set is the number of elements in that set.

Tuesday, June 14, 2011

De Morgan's Laws

I spent most of the morning creating a forum post asking for people's opinions of Chris Hecker's physics references list.  I posted it on four forums: The Bullet Physics Simulation Forum,, TIGSource, and Physics Forums.  It's been interesting to see how active the various forums are.  Bullet and Physics Forums are about equal at 32 views and 33 views with 0 replies. appears to be the least active at 0 views and 0 replies, although I suspect that this may have something to do with differences in the counting systems used.  TIGSource is definitely the most active at 181 views and 12 replies.

When I had time to study math, I finished off the section on sets.  I learned what a universal set is (the set which all the currently discussed sets are subsets of), as well as what the complement of a set is (it is the set of all elements in the universal set except the elements belonging to the set being operated on).  I also learned about De Morgan's laws in relation to set theory, including how to prove them.  The first law states that the complement of the union of two sets A and B is equal to the intersection of the complements of the two sets A and B.  The second law states that the complement of the intersection of two sets A and B is equal to the union of the complements of the two sets A and B.  One of the problems in the problem set was proving De Morgan's second law.  It took me about an hour to figure out, but I eventually proved it by doing a lot of substitution and logic.
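Both laws are easy to verify mechanically on a small example; a sketch in Python (the universal set and the sets A and B are my own):

```python
# Sketch: checking both De Morgan laws against a small universal set U.
U = set(range(10))
A = {1, 2, 3}
B = {3, 4, 5}
comp = lambda S: U - S   # complement relative to U

assert comp(A | B) == comp(A) & comp(B)   # first law
assert comp(A & B) == comp(A) | comp(B)   # second law
```

This only checks one pair of sets, of course; the proof in the book covers all of them.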

Monday, June 13, 2011

Starting graph theory

Today I started reading Introductory Graph Theory by Gary Chartrand.  It's starting off teaching about modeling.  Modeling in this case is turning something into math, such as the area of a square being A=W^2.

I reached the part in the book where it talks about actual graphs, but it was using terms from set theory which I'm not familiar with.  Fortunately the book has a section in the back which teaches you the basics, so I started going through that as well.  It's mostly been review so far (I picked up some bits of set theory here and there while studying other things), introducing you to the symbols/operations such as "element of set", "subset", "equal set", etc.

This made me curious about whether there is a good book devoted to set theory.  After some searching I found this thread which contains an abundance of recommendations.  Naive Set Theory by P. R. Halmos and Theory of Sets by E. Kamke look interesting, the first covering more area in less time and the second covering a few topics in more depth.

Friday, June 10, 2011

Well, apparently it's not a prereq

Today I decided to learn about the Laplace Transform and eigenvectors/values.

From what I gather, the Laplace transform transforms a function f(t) into a function F(s).  I'm not entirely sure what "s" is supposed to be though...  I suspect it is used as a temporary placeholder for going in between different Laplace transforms or something, but the videos weren't very clear.

An eigenvector of a matrix is a vector which, when transformed by that matrix, gives the same result as multiplying it by a certain scalar.  The scalar is called the eigenvalue.
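A quick numeric sketch of that definition in Python (the matrix and vectors are my own toy example):

```python
# Sketch: A*v equals lambda*v when v is an eigenvector with eigenvalue lambda.
def mat_vec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

A = [[2, 0],
     [0, 3]]
v = [1, 0]
lam = 2
# Transforming by the matrix and scaling by the eigenvalue agree:
assert mat_vec(A, v) == [lam * x for x in v]

# (1, 1) is NOT an eigenvector of this A: the transform changes its direction.
assert mat_vec(A, [1, 1]) == [2, 3]
```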

I also found out that the class I started going through was apparently not a prereq for the Convex Optimization classes.  I seem to have misread the prereqs, effectively "putting the comma in the wrong spot".  I'm not going to complain about having less on my plate though!

Speaking of having less on my plate; I seem to have spoken too soon.  Earlier today I remembered that Chris Hecker has a page listing resources for learning how to simulate physics, located here:  Now that I have a bit more experience, I decided to give it another look.  I think it might be a good idea to go through the steps he describes, so I will be looking into getting the first book in a month or so.  In the meantime, I ordered some books on graph theory, abstract algebra, and topology.  These should keep me busy for at least a month, and they cover topics I have been curious about for a while.

Thursday, June 9, 2011

Going back into study mode

Yesterday I made it so that you could control the pitch, heading, and roll of your ship using the mouse and cursor keys.  Pitch and heading adjustment works fine, but roll is dependent on the direction that your ship is facing; I actually suspect that the other two have similar problems as well, just not as pronounced.  This led me to start reading through physics articles to try and find a solution.

While reading the articles I was reminded that I still don't know much about mathematical programming (linear, quadratic...), which seems to pop up a lot in physics and makes an appearance in collision detection.  I figure this is as good a time as any to correct that, so I have started going through the free classes on the Stanford website located here:  Class EE263 is a prerequisite for the convex optimization classes, so I'm starting there.  If this starts taking too long, I may decide to postpone further study in this area, but hopefully I won't have to.

Writing blog entries will be a bit challenging for a variety of reasons, but the plan is to continue updating at least five times per week.  The last thing I want to do is just repeat what's said in the lectures, so I may draw from multiple sources when I discuss what I have learned.

Wednesday, June 8, 2011


A skybox is used to add a background to a 3D scene.  Things which are so far in the distance that they can be drawn into a 2D texture, such as the sun, distant mountains, and unmoving clouds, are part of the background.  It is literally a box with the camera at its center.  As the camera moves, the skybox follows it.  Because the background doesn't move as you move, it appears to be very far away.  The skybox doesn't rotate with the camera though.  Picture a cardboard box on top of your character's head and you have about the right idea.

Using a skybox is pretty easy.  Making the texture for it is a bit difficult, but fortunately we programmers can leave that to the artists.  To use a skybox, you first need a cube which has the sky texture applied to its inside.  You could create this in the game, or get the artist to create it as a model and load it in along with the other models.

There are two ways you can render a skybox.  The first is to render it last, which requires a skybox large enough to enclose all of the visible geometry.  The second is to render it first and then render the rest of the scene on top of it.  This method allows you to use a smaller skybox (just big enough to fully enclose the camera should be fine) but can be slower since there is more overdraw.  When rendering the skybox first, you need to disable writing to the z-buffer so that the rest of the scene can be drawn on top of it.  For either method, make sure to disable lighting before rendering the skybox, otherwise the edges will be much more visible.

Tuesday, June 7, 2011

First screenshot of new project, mouse look

You can now look around the cockpit using the mouse.

In order to get the mouse's relative movement, I used raw input.  After I have the relative movement vector, I multiply it by the mouse sensitivity, and multiply that by the rotation speed.  This gives the final angular velocity vector, which I use to integrate the orientation.
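That update can be sketched like this in Python.  The sensitivity and rotation-speed constants are made-up placeholders, and I've reduced the orientation to simple pitch/heading angles rather than the full orientation integration the post implies; only the order of operations (relative movement → sensitivity → rotation speed → angular velocity → integrate) comes from the post:

```python
# Hypothetical constants, for illustration only.
MOUSE_SENSITIVITY = 0.5
ROTATION_SPEED = 2.0   # radians per second at full sensitivity

def angular_velocity(rel_dx, rel_dy):
    # relative mouse movement, scaled twice as described in the post
    return (rel_dy * MOUSE_SENSITIVITY * ROTATION_SPEED,   # pitch rate
            rel_dx * MOUSE_SENSITIVITY * ROTATION_SPEED)   # heading rate

def integrate(orientation, rel_dx, rel_dy, dt):
    pitch_rate, heading_rate = angular_velocity(rel_dx, rel_dy)
    pitch, heading = orientation
    return (pitch + pitch_rate * dt, heading + heading_rate * dt)

# Moving the mouse right by 1 unit for 0.1s turns the heading by 0.1 rad.
assert integrate((0.0, 0.0), rel_dx=1.0, rel_dy=0.0, dt=0.1) == (0.0, 0.1)
```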

Here is a screenshot of what the game looks like so far.  I plan on removing the "blast shield" tomorrow and putting in a star-filled background.  Before I can make it so you can fly around, I need to make it so that you can tell you are moving!

Monday, June 6, 2011

Direct3D9 FFP Lighting

Today I finally managed to get the cockpit model into the game.  I've decided to use the fixed function pipeline this time rather than shaders; it's probably a good idea to be experienced with both.

To use FFP (fixed function pipeline) lighting, you first need to fill in a D3DLIGHT9 structure.  The elements are pretty straight-forward.  I've found that you need values much greater than 1.0 for the light's color though; somewhere around 300 in R, G, and B for white lights is working well for the moment.  After you have the structure filled out, pass it to IDirect3DDevice9::SetLight().  After setting the light information, you need to enable the light as well using IDirect3DDevice9::LightEnable().

To take full advantage of the lighting system, I recommend setting a material as well.  To do this, fill out a D3DMATERIAL9 structure.  The material describes what components of the incoming light are reflected, and by how much.  It also describes how shiny a surface is with the Power element, and how much light the surface is emitting with the Emissive element.  Note that the emissive element does not actually act as a D3D light; it only makes it seem like the material is glowing.  After the material structure has been filled, pass it to IDirect3DDevice9::SetMaterial().  Now that the actual material is set up, we need to tell Direct3D that the diffuse and specular information is coming from the material and not the vertex buffer.  To do this, set the D3DRS_DIFFUSEMATERIALSOURCE and D3DRS_SPECULARMATERIALSOURCE render states to D3DMCS_MATERIAL.

Friday, June 3, 2011

Researching Direct3D9 Lighting

While doing some further research into the relation between a graphics API and the graphics driver, I came across this article: Mode Switching.  I recommend reading it if you are interested in the things I discussed in the previous blog post.

I spent most of the day in research, mainly into the way lighting is performed in Direct3D9.  Apparently you can set a material to specify ambient, diffuse, specular, and emissive colors, as well as the specular power.  I originally thought that you had to pass this information in a vertex buffer.  The layout of the MS3D file format makes a lot more sense now, since all of the colors are stored together rather than per vertex.

Another interesting thing I found out is that material colors specify how much of the incoming light affects the resulting color.  Having a diffuse material color of (r:0.5 g:0.0 b:0.0) for example means that 50% of the red component of the light's diffuse color is used, and the green and blue components aren't used at all.

Thursday, June 2, 2011

The relation between graphics APIs and graphics drivers, also CRT security enhancements

While working on the model loading code, the compiler complained about fopen() being deprecated.  These warnings aren't new, but I've been wondering for a while what exactly caused the functions to become deprecated.  So after a Google search I found myself at the MSDN reading "Security Enhancements in the CRT", which led me to partially read "Repel Attacks on Your Code with the Visual Studio 2005 Safe C and C++ Libraries".  Basically the functions are deprecated because new ones were written which check for buffer overruns and various misc. checks such as making sure that pointers are not NULL.  The part which I'm the most interested in however are the changes they made to error handling.  More functions return error codes now, which makes my job a little easier since I don't have to guess at why the function returned a NULL pointer.

I also did a little research into the relation between Direct3D/OpenGL and the graphics driver.  I've wondered for a while how you could make your own graphics API, and I think I have an answer: You can't unless you write your own drivers or can convince driver developers to adopt your API.  This might be a bit vague, so I'll try to explain my understanding of how things work.

A graphics API is just an interface to the graphics driver.  You are effectively communicating "directly" with the graphics driver when you use Direct3D/OpenGL (I'll get to why that is in quotes shortly).  If you aren't familiar with the term in relation to programming, an "interface" is a set of functions which act as the only way of communicating with a "black box".  A "black box" in this case is a chunk of code which you do not have access to, and you do not know how it works exactly.  Direct3D/OpenGL is pretty much just a design which the driver developers use.  Because all of the driver developers will be working with the same design, you end up with a common "face" to the driver, allowing your program to easily communicate without trying to figure out which driver it is talking to.

This brings us to why I used quotes earlier.  There is actually another layer in between your application and the graphics driver when you use Direct3D.  Rather than plugging your application directly into the driver, you are attached to the Direct3D layer (I am assuming this is d3d*.dll).  The Direct3D layer performs some additional work (again, I am assuming here) and handles communication with the actual graphics driver.

In the case of OpenGL, I think you are plugged directly into the graphics driver...  I don't know very much about OpenGL though.

Wednesday, June 1, 2011

Systems layer makes a comeback

The first thing I did today was the removal of the initialization game state, replacing it with the systems layer.  The "systems layer" is a collection of objects/systems common to all game states.  Examples of these are the main window, Direct3D device, and timing system.  Currently the systems layer is a collection of scope pointers in the application layer's Run() function.

Something I realized while working on the systems layer is that I no longer need the window manager.  The purpose for the window manager was to have a constant place for the window procedure to exist.  Now that ownership of the main window isn't being transferred between game states repeatedly, I can put the window procedure in the application class.

With the systems layer in place, I was able to easily give the main window code a pointer to the current state safe pointer.  When the main window receives the WM_CLOSE message, it calls the IGameState::WindowClose() method on the pointer.  The actual code looks a bit ugly due to the whole pointer-to-a-safe-pointer thing, but at least it works: m_state.Get()->WindowClose();

I also put the basic rendering code in place.  This gave me the chance to try out the COM wrapper idea I mentioned in a previous blog post, and it's working great!  Using a generic wrapper is a lot easier than creating a wrapper for each interface.

While writing the timing system for the fixed timestep, I decided to take a closer look at the conversion from int64 to double which is required when using QueryPerformanceCounter/Frequency().  My main concern was losing precision.  I looked into doing integer-only math for the timer, but I quickly remembered that not using floating-point math for timing in a fixed timestep had caused problems in the past.  I eventually decided to just cast to double like usual, and fix the problem when it is actually a problem.

Tuesday, May 31, 2011

Passing input events

Today I finished up the basic windowing code.  The window procedure is contained in a window manager, which also handles registering the class and creating the window.

I also did some research into the definition of RAII to try and work out what I should be calling a particular technique I have been using.  I'm still not entirely sure if the technique I am using is valid RAII, but I did find a rather good explanation of what RAII is.  You can find the article over at The RAII Programming Idiom.

I am currently working out how I am going to pass user input events to the game state.  I think I might revisit the idea of a "systems layer", since I could easily give the input gathering systems a pointer to the current game state pointer.  Without using a systems layer, I would need to pass the pointer to the initialization state, which feels a bit awkward.  I am planning on using a pointer to a pointer because the game state could change, thus changing the pointer.

Friday, May 27, 2011

Researching airplane controls and state switching

Most of today's work was spent in researching airplane controls for the new test project.  I don't have much to say on the matter, but here are a couple resources which may be of interest:
Cockpit of the Cessna 172 Skyhawk
aircraft cockpit controls and instruments

As far as code goes, I finished the basic state switching code.  The current state is switched by returning an input object for the new state from the current state's update function.  This input object is used as the input for the new state, allowing the previous state to pass on information such as the character the player selected, open network connections, Direct3D device, etc.

I also finished the window class wrapper. I am actually wondering why I have the registration code inside of the wrapper's constructor rather than having external code directly alter the wrapped class atom...  I suspect this is RAII, but I do not know for sure.  I plan on researching RAII again tomorrow (I keep forgetting the exact definition of that term).

Thursday, May 26, 2011

COM interface pointer wrapping

Today I created the repository for the new test project and started on the framework.

One of the things I want to try out in this project is a better wrapper around COM interfaces.  Previously, each interface had its own wrapper class which handled creation of the associated object, as well as destruction.  This was a bit difficult to work with though, and caused the code to look like "m_dev->m_dev->Present()".

The new wrapper I am going to try out will be generic by way of a template.  One really nice thing about this new design is that I will be able to pass the wrapper directly to a function which is expecting a normal COM interface pointer.  This will be done by overloading the address-of operator to return the address of the interface pointer being wrapped.  With the old design I had to make such calls inside of the wrapper, which lead to some really rigid code.

Another thing I will be looking into is hooking the window to process input messages separate from the windowing code.  I have never liked having the WM_KEYDOWN message processing inside of the window manager.

Wednesday, May 25, 2011

Character jitter and a new test project

The character was jittering back and forth when it walked, which I originally assumed was due to the large number of lights I was testing with.  However, when I removed all of the test lights, the character was still jittering.  After a bit of debugging, it looks like the bug was caused by the camera update being performed before the character's position was integrated.  This caused the camera to point at where the character used to be, not where it currently was.

The artist GreyKnight has been creating models and textures for this test project I am working on.  However, this is taking time away from creating artwork for the SHMUP so I have decided to switch test projects a bit sooner than I was originally expecting.  This new test project will use art assets which the SHMUP will eventually need (please note that the test project will be different from the SHMUP though).

Tuesday, May 24, 2011

Cube shadow maps and a bug in PIX

Originally I was rendering the scene with a given shadow map to a temporary texture.  This texture was then rendered additively (alpha blending ONE, ONE, ADD) to the default render target in order to accumulate the various light and shadow map effects.  The reason I was rendering to a temporary texture rather than the default render target directly was that the additive rendering would cause part of the floor to show through your character (and other similar problems).

Promit in the #gamedev IRC channel pointed out that you can solve this by filling the default depth buffer in a pre-pass (this pass would also include adding any ambient lighting) and then disabling writes to the default depth buffer and setting the z compare function to "equal".  This causes only the pixels on the top to be rendered additively in subsequent passes, so you don't get any bleedthrough where there shouldn't be any.

After making that change, I moved on to implementing omni shadow mapping using a cube texture rather than 6 explicit textures.  For details on implementing this, I recommend these two articles: Cubic Shadow Mapping in Direct3D and GPU Gems - Chapter 12. Omnidirectional Shadow Mapping.

One really annoying "bug" I encountered turned out to be the fault of PIX.  While debugging the cube texture version of the shadow mapping code, I discovered (through PIX) that the depth buffer being generated in the pre-pass was empty.  I eventually narrowed the problem down to a single render state: D3DRS_COLORWRITEENABLE.  When I set this render state to 0, writing to the depth buffer appeared to be turned off as well.  Enabling all color writing also appeared to enable depth buffer writing.  This made no sense at all, and I was starting to suspect that it was an obscure problem with the ATI Radeon X1000 series (again).  Fortunately, I discovered the real problem quite by accident; apparently when you disable color writing in your application, it is also disabled in PIX.  So when PIX goes to display the depth buffer, nothing is actually displayed.  The depth buffer can be seen perfectly fine after enabling color writing, but this is hardly an ideal situation.  Oh well, considering the software is free I can't really complain too much.

After using a cubic shadow map and a depth pre-pass I was able to get 6 omni lights with shadow mapping to be displayed on the screen at about 28 FPS.  Most situations won't call for many omni lights, so I can probably work with this restriction.

Monday, May 23, 2011

Shadow Mapping Works! Devil's details and screenshot included

Shadow mapping works!  It is rather inefficient at the moment, but it does indeed work.

I think it's about time for another issue of "The Devil's Details", this time regarding shadow mapping.

Pixel shader doesn't know its own depth
You do not have access to the depth of the pixel you are working with in the pixel shader.  Because of this, when creating the shadow map you need to calculate the depth in the vertex shader and interpolate that value across the primitive.

No built-in method to go back to the default render target
After changing the current render target to create the shadow map, you want to switch back to the default render target so that you can render the scene from the camera's view.  Unfortunately, you cannot do this simply by passing NULL to SetRenderTarget().  To get around this problem, keep a reference to the default render target by calling GetRenderTarget() during initialization (or at least before the first call to SetRenderTarget()) and pass it to SetRenderTarget() when you want to switch back to it.

D3DCOLOR_ARGB accepts inputs in the range 0-255, not 0-1
This can catch you off guard (especially late at night after working all day trying to get shadow maps to work), because colors in the shaders are 0-1, not 0-255.  Just remember this and you won't be confused trying to figure out why the shadow map isn't being cleared to white.
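For reference, the packing D3DCOLOR_ARGB performs can be reimplemented standalone; this is a sketch of the 0xAARRGGBB layout, not the actual macro:

```python
# Standalone sketch of D3DCOLOR_ARGB: each channel is an integer 0-255,
# packed into one 32-bit value laid out as 0xAARRGGBB.
def d3dcolor_argb(a, r, g, b):
    assert all(0 <= c <= 255 for c in (a, r, g, b))
    return (a << 24) | (r << 16) | (g << 8) | b

# White, fully opaque -- what you want when clearing a shadow map to white:
assert d3dcolor_argb(255, 255, 255, 255) == 0xFFFFFFFF

# Passing shader-style 0-1 values instead silently produces near-black:
assert d3dcolor_argb(1, 1, 1, 1) == 0x01010101
```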

Transform from projection space to texture space
When calculating the location of the pixel in the shadow map for finding out if it is in shadow, you need to perform an additional transformation after the normal light view-projection transformation.  After the projection transformation (and homogeneous divide) you are in a coordinate space ranging in x and y from -1 to 1.  Texture coordinates have their origin at the top-left and range from 0 to 1, and the y axis points downwards (I am using a left-handed coordinate system where y points up).  Because of this, you need to negate the y coordinate, then translate the positions by (1, 1) and divide by 2.  You can do this before the homogeneous divide, so these transformations can be rolled into the light's view-projection matrix (just make sure you only do this for the lighting stage, not the shadow map creation stage).
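Concretely, the projection-space-to-texture-space step looks like this (a sketch; x and y are the post-divide coordinates, each in [-1, 1]):

```python
# Sketch: projection space (x, y in [-1, 1], y up) -> texture space
# (u, v in [0, 1], origin at the top-left, v pointing down).
def proj_to_tex(x, y):
    u = (x + 1.0) / 2.0    # [-1, 1] -> [0, 1], left to right
    v = (-y + 1.0) / 2.0   # y is negated because texture v points down
    return u, v

assert proj_to_tex(-1.0,  1.0) == (0.0, 0.0)   # top-left corner
assert proj_to_tex( 1.0, -1.0) == (1.0, 1.0)   # bottom-right corner
assert proj_to_tex( 0.0,  0.0) == (0.5, 0.5)   # center of the map
```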

Do not perform the homogeneous divide in the vertex shader!
I'm still not sure why this is the case (probably a problem with interpolation), but when you are doing the lighting stage (after the shadow map has been created), performing the homogeneous divide on the pixel's shadow map space coordinates must be done in the pixel shader.  I wasn't doing this originally, and the character's shadow on the ground was extremely thin and angled sharply away from the character.

Render targets and depth buffers are best in pairs
At least when the render targets are of differing size.  One of the most annoying problems I encountered while writing the shadow mapping system was that the entire shadow map was not being rendered to.  The shadow map I was using was 1024x1024 and the window was 800x600, which means that the default depth buffer was also 800x600.  Apparently if a pixel in the render target is outside the bounds of the depth buffer, the z-test automatically fails.  To fix this problem, a 1024x1024 depth buffer needs to be created along with the shadow map, and set when the shadow map is set.  There doesn't appear to be a built-in method to set the default depth buffer after you have set a new one, so in a similar fashion to a previous devil's detail you will need to keep track of a reference to the default depth buffer yourself.

Friday, May 20, 2011

Learning shadow mapping

I read through the section in RTR on shadow mapping and I think I know enough about the basics to get some kind of shadow system into the test project.  Shadow mapping involves rendering the scene's depth information from the light's perspective.  You then render the scene from the camera's perspective, detecting if a pixel is in shadow by transforming the pixel's world coordinates (passed to the pixel shader as an interpolated value from the vertex shader) to the light's clip space, projecting onto the depth map and comparing the depth of the pixel with the depth value in the depth map.  If the pixel you are drawing has a greater depth value (assuming z is directed into the screen) than the value in the depth map, it is in shadow since it is behind whatever caused that value to be placed in the depth map.
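The comparison at the heart of that test is tiny; a sketch in Python (the depth-bias term that real implementations add to avoid self-shadowing artifacts is omitted):

```python
# Sketch of the shadow-map test described above, with z increasing into
# the screen from the light's point of view.
def in_shadow(pixel_depth_from_light, shadow_map_depth):
    # If the pixel is deeper than what the light "saw" at this location,
    # something else is between it and the light.
    return pixel_depth_from_light > shadow_map_depth

assert in_shadow(0.8, 0.5)       # an occluder at depth 0.5 blocks the light
assert not in_shadow(0.5, 0.5)   # this pixel IS the closest surface
```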

Ambient light is probably achieved by making a normal render pass, only including ambient lighting, before you do the light shadowing passes.  Now that I think about it, multiple lights are likely handled this same way; perform a set of 2 passes per light, one for creating the depth map and one for filling the color buffer.  Each lighting pass would add to the color in the color buffer.

I should note that the above is a mixture of things I have read and "filling in the blanks".  I don't have any actual experience with shadow mapping yet, so I may find that it's not quite what I am expecting.  A perfect example of this is with creating the depth map.  I assumed that an actual depth buffer was used, which would be set as a texture for the camera rendering pass.  However, it appears as though Direct3D9 does not let you access the depth buffer like that.  Instead, you need to create a texture and set it as the render target, and output depth values rather than colors from the pixel shader.

These are some resources which I found while researching shadow mapping.
Soft-Edged Shadows
Cubic Shadow Mapping in Direct3D

Thursday, May 19, 2011

Lighting works! Screenshot included

Lighting works!  Per-pixel lighting was easier than I thought; it is literally just moving the lighting calculation from the vertex shader into the pixel shader.

If you want to learn how to do lighting, I recommend Real-Time Rendering, specifically chapter 5.  From what I have seen, the book goes into great detail on the subject of lighting in later chapters as well.

I currently have experience with two types of lights: directional and omni.  A directional light is basically a light which shines everywhere in one direction.  Directional lights are commonly used for the sun, and other far-off light sources.  An omni light is a type of point-light; it shines from a specific point in all directions.  These would be used for things like torches, fireballs, sci-fi weapon projectiles, lamps without a shade, etc.  There is another type of point-light called a "spotlight", which only shines within a specific set of directions (a cone).

Lighting requires two properties from a material: specular and diffuse.  Currently, the specular property is constant over the entire model.  Textures supply the diffuse property on a pixel-by-pixel basis, allowing the artist to finely tune how the model will look.  A texture is pretty much a diffuse map.

Today I learned about a couple things you want to watch out for when setting shader constant registers.  The first is that if you are passing in an array all in one call to Set*ShaderConstant*(), each array element must be packed to align with a 4-float/int boundary.  I tried passing in an array where each element was a float3 and I was initially rather confused as to why it wouldn't work.
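The packing requirement can be sketched like this; the helper name and the choice of 0.0 as the padding value are my own:

```python
# Sketch: shader constant registers are float4-sized, so an array of
# float3 elements must be padded to four floats per element before being
# handed to Set*ShaderConstantF() in one call.
def pack_float3_array(vecs):
    packed = []
    for x, y, z in vecs:
        packed.extend([x, y, z, 0.0])   # pad each element to a full register
    return packed

data = pack_float3_array([(1, 2, 3), (4, 5, 6)])
assert len(data) == 8                   # 2 registers, 4 floats each
assert data[3] == 0.0 and data[7] == 0.0
```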

The second thing I would like to point out is that there is apparently a separate set of registers for integers.  If you need to pass an integer to your shader, make sure you are using Set*ShaderConstantI() and not Set*ShaderConstantF().  This one took me a while to figure out, which was a little embarrassing.

I am now researching shadows.  Unfortunately, it appears as though creating shadows from omni lights is a bit of a black art...  It involves creating 6 shadow maps and cube mapping with them or something.  I should have more details tomorrow.

Wednesday, May 18, 2011

Multiple animations

Multiple animations are now supported.  Previously, only the animation stored in the model file was playable, so the character had no idle animation (or pose).  Because there was no pose for being "idle", I jury rigged the animation code to zero the bone position/orientation if the player didn't want the character moving.  This resulted in the character springing back to its "arms-wide and legs straight" modeling "pose", which looked really awkward as you moved around the room.

Each animation is stored in its own model file, so I created another array of models but called them "animations" instead.  I also added an animation ID to each instance.  Changing the current animation is as simple as changing the animation ID and resetting the animation time to zero.

Now that I have animations more defined, I created separate "skinned" and "not skinned" vertex shaders.  I'm not really looking for a speed increase with this change, just trying to separate things more so that I can understand them easier.

I also started researching lighting.  Lighting is somewhat confusing, but one interesting thing I learned was the definition of diffuse and specular colors.  Diffuse color is from light which has been partially absorbed by the material, bounced around a bit, and finally shot back out.  Specular color is from light which has bounced right off the material.  I'm a bit confused as to why each material can have its own specular color, since it should be the color of the light which bounced off, but I'm sure I'll figure that out eventually.

Tuesday, May 17, 2011

Current progress screenshot

Today I added in support for materials, multiple models, and instances.  A picture is supposedly worth a thousand words, and I seem to be at a loss for anything really interesting to say today, so here is a screenshot of the current "demo" I am working on:

Monday, May 16, 2011

Storing animations, converting RH orientations to LH, and world-space cursor

Now that I have a single animation working, it's time to start worrying about how to use multiple animations.  The Milkshape3D file format only supports a single animation, so I did a bit of investigating to find out how games normally store multiple animations.  From what I could tell, a lot of people just roll their own format.  I'm not interested in doing that at the moment, so I started discussing solutions with our artist GreyKnight.  What we eventually decided on was to have a base file which stores the geometry and base pose.  Each animation would be stored as its own "model", containing only keyframe information.  This system is a bit rigid in that adding/removing a bone would be really painful (no pun intended), but it should work well enough until I find a better alternative.

I finally discovered why my method for converting orientations from RH to LH seemed wrong.  Turns out it was!  My method (found through trial and error, since what should have been the correct method wasn't working) was to invert the z-axis orientations when positions along the z-axis are inverted for coordinate conversion between RH and LH.  If you draw this out, it doesn't make much sense: the z-axis orientations are correct between RH and LH (when the z-axis is inverted to do the conversion), but the x- and y-axis orientations are reversed.  In other words, the one orientation which should have been constant between the two coordinate systems was the only one which needed to be changed.  Well, while poking around in the math code I noticed something rather odd: the rotation matrix creation function negates the angle before using it to create the matrix.  So all x- and y-axis orientations were being inverted automagically, and the only way to prevent this from happening to z-axis orientations was to invert them before they were sent to the creation function.  I suspect this was a leftover from when I was creating the camera code, since those angles needed to be negated.

Today I managed to make it so that when you click on the screen, the game can detect where you clicked in the world.  This involved a ray intersection test with the world geometry, something I had never done before.  In order to define the ray, you need a point in world space and a normal vector pointing in the direction of the ray.  The easiest of these to calculate is the point, which should be right on the "lens" of the camera where your mouse cursor was when you clicked.  To get the point in 3D world space, you need to transform the screen-space coordinates of the cursor into clip space, which means reversing the viewport mapping and homogeneous divide.  This step can be a bit tricky, since you have to work the z-coordinate back in yourself.  Once safely in clip space, you can transform into world space using the inverses of your view and projection matrices.  Calculating the normal was a bit more tricky.  My original idea was to transform the vector (0, 0, 1) from screen space into world space, but that doesn't work because it needs to be attached to a point other than the origin in order to be affected by the pyramid shape of the view frustum.  That leads naturally to the solution I ended up using: transform a second point from screen to world space, one which is at ray_point+(0, 0, 1).  Once in world space, you can get the ray normal by normalizing the vector from the ray point to the second point.
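Here is a sketch of the unprojection half of that process, leaving out the screen-to-clip viewport reversal and using little math types of my own rather than the engine's:

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };

// Multiply a column vector by a 4x4 matrix.
Vec4 Transform(const Mat4& mat, const Vec4& v)
{
    Vec4 r;
    r.x = mat.m[0][0]*v.x + mat.m[0][1]*v.y + mat.m[0][2]*v.z + mat.m[0][3]*v.w;
    r.y = mat.m[1][0]*v.x + mat.m[1][1]*v.y + mat.m[1][2]*v.z + mat.m[1][3]*v.w;
    r.z = mat.m[2][0]*v.x + mat.m[2][1]*v.y + mat.m[2][2]*v.z + mat.m[2][3]*v.w;
    r.w = mat.m[3][0]*v.x + mat.m[3][1]*v.y + mat.m[3][2]*v.z + mat.m[3][3]*v.w;
    return r;
}

// Unproject a normalized-device-coordinate point (x, y in [-1, 1],
// z in [0, 1] for D3D) into world space, reversing the homogeneous
// divide with the inverse of the view-projection matrix.
Vec3 Unproject(const Mat4& invViewProj, float x, float y, float z)
{
    Vec4 p = Transform(invViewProj, Vec4{x, y, z, 1.0f});
    return Vec3{p.x / p.w, p.y / p.w, p.z / p.w};
}

// Build the pick ray: unproject the cursor twice, once on the near
// plane and once deeper in, then normalize the difference.
void MakePickRay(const Mat4& invViewProj, float ndcX, float ndcY,
                 Vec3& rayPoint, Vec3& rayDir)
{
    rayPoint   = Unproject(invViewProj, ndcX, ndcY, 0.0f);
    Vec3 ahead = Unproject(invViewProj, ndcX, ndcY, 1.0f);
    float dx = ahead.x - rayPoint.x;
    float dy = ahead.y - rayPoint.y;
    float dz = ahead.z - rayPoint.z;
    float len = std::sqrt(dx*dx + dy*dy + dz*dz);
    rayDir = Vec3{dx / len, dy / len, dz / len};
}
```

This isn't my exact engine code, just the idea in its simplest form: two unprojected points define the ray.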

Saturday, May 14, 2011

Optional Weekends

Weekends are rather unpredictable for me.  Sometimes I will be able to get a normal work day's worth of coding done, and sometimes I won't be anywhere near the computer.  Today is a perfect example of the little-to-no work day: I just made the animated model move along the z-axis to see how it would look.

Because of this, I have decided to make weekend blog posts optional.  There is no point in posting a single-sentence entry.  This of course won't affect work-week posts, so you can expect to get at least 5 blog posts from me per week.

Friday, May 13, 2011

Skinning works! The devil's details.

I finally have multiple-bone skinning working!  I think I can summarize the last few days as "Wow, that was a pain".  On the surface, a hierarchical animation system sounds easy: "Just go down the tree and accumulate matrices as you go", but as is usually the case, the devil is indeed in the details.

I would love to make this blog post a full tutorial on skinning, but unfortunately I still don't completely understand it myself.  I slapped in some foreign code for the nasty jobs so that I could just focus on getting the bones to animate and stay attached to the skin.  Also, because the code was built up through a lot of trial and error, it could be improved quite a bit.

Here are some of what I will affectionately call "the devil's details".  Hopefully these will help someone who is learning to write a skinning system.  Keep in mind that some of these might only apply to the Milkshape3D model format.

Not all vertices will be attached to a bone
While this might not be the case in most animations, you should make sure that your system won't explode when this situation is encountered.  One of the animations I tested my code on was a brick hurtling towards a brick wall, bouncing off the wall and falling onto the ground.  The wall was part of the animation, but was not attached to a bone.

Keyframes are defined per-bone, not globally
Rather than having a list of keyframes which each have a list of the bones that they affect, each bone has a list of keyframes.

Calculating keyframe indices: not as easy as you might think
The bones are animated by interpolating between two keyframes, so you need to figure out which two keyframes you are in between.  You also need to gracefully handle the situation where the current animation time is before the first keyframe for the specified joint, or past the last keyframe.  I am currently finding the second keyframe first by looking through the list of keyframes going forward in time and checking for the first keyframe which comes after the current animation time.  I then set the first keyframe to be the one previous to the second which I just found.  Both keyframes are initialized to the last keyframe, which is the value they take on if no other keyframe is found in the loop.  If the second keyframe is found to be index 0, the first keyframe will become negative, in which case the first keyframe is set to be equal to the second.  There are two cases where both keyframes will have the same index; if the time is before the first keyframe, or if the time is after the last keyframe.  If both keyframes have the same index, interpolation is not required.
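The scheme above boils down to a small search function.  Here is a sketch of it, simplified to just a sorted list of keyframe times (the names are my own):

```cpp
#include <cassert>
#include <cstddef>

// Given a bone's keyframe times (sorted ascending) and the current
// animation time, find the two keyframes to interpolate between.
// Both indices default to the last keyframe (time past the end);
// if time lands before the first keyframe, both become 0.
// When k0 == k1, no interpolation is needed.
void FindKeyframes(const float* times, std::size_t count, float time,
                   std::size_t& k0, std::size_t& k1)
{
    k0 = k1 = count - 1;                // default: past the last keyframe
    for (std::size_t i = 0; i < count; ++i)
    {
        if (times[i] > time)
        {
            k1 = i;
            k0 = (i == 0) ? 0 : i - 1;  // clamp when before the first keyframe
            break;
        }
    }
}
```

Handling the "first keyframe index would go negative" case with a clamp keeps all three situations (before, inside, and after the keyframe list) in one short function.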

Keyframe positions and orientations are relative to the bone's original position and orientation
This was one of the big ones for me.  I did not figure this out until well into development, so I had to basically rewrite the skinning system.  This means that in order to transform a point to where it should be for the given keyframe, the point must be in the same coordinate space as the bone.

Convert orientations from RH to LH, not just positions!
This was the big one.  Because of the previous "detail", things can become really out of sorts if you get rotation wrong.  If a parent joint is in a certain orientation, all movement is going to depend on that orientation being correct.  If the orientation is wrong, the vertices attached to the joint will be in the wrong spot.  This problem gets worse and worse the further down the hierarchy you go.  As far as converting the orientations goes, I'm not entirely sure how it works.  What ended up working for me was to invert z-axis positions and z-axis rotations, but I have read elsewhere that only the x and y axis rotations should be inverted when z-axis positions are inverted.  I'll have to do more research into this, but it is working fine for now!

Bone matrix, what is it?
This had me confused for a while.  Basically, the bone matrix needs to transform a vertex in model space (which is attached to the bone) into the coordinate space of the bone (without any animations), apply the keyframe transformation, then transform back into model space.
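As a sketch of that idea, here is the composition with my own helper names, ignoring the bone's rest orientation and treating the bone offset as a pure translation to keep it short:

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Mat4 { float m[4][4]; };

Mat4 Identity()
{
    Mat4 r = {};
    for (int i = 0; i < 4; ++i) r.m[i][i] = 1.0f;
    return r;
}

Mat4 Translation(float x, float y, float z)
{
    Mat4 r = Identity();
    r.m[0][3] = x; r.m[1][3] = y; r.m[2][3] = z;
    return r;
}

Mat4 RotationZ(float angle) // radians
{
    Mat4 r = Identity();
    r.m[0][0] = std::cos(angle); r.m[0][1] = -std::sin(angle);
    r.m[1][0] = std::sin(angle); r.m[1][1] =  std::cos(angle);
    return r;
}

// Column-vector convention: Mul(a, b) applies b first, then a.
Mat4 Mul(const Mat4& a, const Mat4& b)
{
    Mat4 r = {};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

Vec3 TransformPoint(const Mat4& mat, const Vec3& v)
{
    return Vec3{
        mat.m[0][0]*v.x + mat.m[0][1]*v.y + mat.m[0][2]*v.z + mat.m[0][3],
        mat.m[1][0]*v.x + mat.m[1][1]*v.y + mat.m[1][2]*v.z + mat.m[1][3],
        mat.m[2][0]*v.x + mat.m[2][1]*v.y + mat.m[2][2]*v.z + mat.m[2][3]};
}

// Bone matrix: move the vertex into the bone's space, apply the
// keyframe transform there, then move back into model space.
Mat4 BoneMatrix(const Vec3& bonePos, const Mat4& keyframe)
{
    Mat4 toBone  = Translation(-bonePos.x, -bonePos.y, -bonePos.z);
    Mat4 toModel = Translation( bonePos.x,  bonePos.y,  bonePos.z);
    return Mul(toModel, Mul(keyframe, toBone));
}
```

The payoff is that a keyframe rotation pivots around the bone, not around the model origin, which is exactly why the to-bone-space trip matters.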

Wednesday, May 11, 2011

Radeon X1000 actually NOT SM3.0 compliant?!

Today I worked on skinning.  Skinning is the term used for 3D bone-based animation, and comes from the idea of putting "skin" on the bones.

My initial plan for skinning was to store all of the joint matrices for each model instance in one big texture.  The advantage to this is that I can fit a lot more instances and bones into one draw call than I could if I was using constant registers; the constant registers would run out rather quickly.

The skinning system required me to create a texture manually for the first time, and I learned something rather interesting:  You cannot directly access a texture which is in video memory.  In order to put data into a VRAM texture, you need to create a second texture in system memory, fill it with the data, then update the texture in VRAM using IDirect3DDevice9::UpdateTexture().  Another interesting thing to note is that apparently a texture is a collection of surfaces.  I originally thought it was just one big 3D array of data (for the different miplevels), but this does make more sense since each mipmap will be a different size.

Accessing the texture from the vertex shader was a bit more problematic than I was originally expecting.  The first issue I encountered was that you cannot use tex*D() to access the texture (at least not in D3D9, it might work in D3D10+).  The reason for this is that in a pixel shader, the miplevel is calculated based on the texture coordinate partial derivatives.  Basically, if the texture coordinates changed a lot since the previous pixel, a lower miplevel is chosen in order to keep the pixel:texel ratio as close to 1:1 as possible.  The derivatives can't be calculated in the vertex shader though, because we are working with vertices and not pixels.  So to get around this, you need to ask for a specific miplevel by using tex*Dlod().

That should have been the end of the VTF (vertex texture fetch) problems, but then Present() started failing...  I couldn't find anything wrong with the code, so I enabled debugging through the DirectX Control Panel to see if it would tell me anything about the error.  Apparently the driver was failing internally, but no further information was being given.  I asked for help in the #gamedev IRC channel and we eventually figured out the problem.  Apparently the Radeon X1000 series of graphics cards, of which mine is a part, does not support VTF.  The really odd part of this is that one of the selling features of this series was its supposed support for SM3.0, which should have included support for VTF.  ATI has a workaround for the problem, called "Render to Vertex Buffer", but at this point, using constant registers is looking better and better.

In the end, I decided to scrap the instancing code and use constant registers for skinning.  At least the constant registers work like they should, so I can actually get 3D animation up and running.

I currently have single-joint skinning working, but the code for finding the keyframes to interpolate between is a real mess.  I have one or two ideas on how to fix this, so we'll see how that goes tomorrow.  It is kind of cool to see animation working, even if it is only a ball moving up and down.

Tuesday, May 10, 2011

RH to LH, texture coord oddities, and instancing

MS3D model geometry is now loaded correctly.  One problem I did encounter was that the models are stored in a right-handed coordinate system, but the game uses a left-handed coordinate system.  This means that the x coordinate is pointing in the wrong direction, basically.  I initially tried to solve this by inverting the x coordinate of all vertices in the model.  I was a bit surprised to find that the model appeared to have turned itself inside-out!  After a bit of discussion in the #gamedev IRC channel, I figured out what the problem was:  The triangles were wound in the wrong direction.  This was fixed easily enough by reversing the order of the vertices of each triangle, and the model was no longer inside-out.
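The whole fix amounts to a couple of loops.  Here's a sketch, assuming the vertices are stored three per triangle (the struct and function names are mine):

```cpp
#include <cassert>
#include <cstddef>

struct Vertex { float x, y, z; };

// Mirror a model across the YZ plane (flipping x to go from RH to LH)
// and reverse the winding of every triangle so the faces don't end up
// inside-out. Vertices are assumed to be stored three per triangle.
void MirrorX(Vertex* verts, std::size_t vertexCount)
{
    for (std::size_t i = 0; i < vertexCount; ++i)
        verts[i].x = -verts[i].x;

    // Swapping any two vertices of a triangle reverses its winding.
    for (std::size_t t = 0; t + 2 < vertexCount; t += 3)
    {
        Vertex tmp   = verts[t + 1];
        verts[t + 1] = verts[t + 2];
        verts[t + 2] = tmp;
    }
}
```

The key insight is the second loop: mirroring alone flips the winding order, which is what produced the inside-out look, so the vertex order has to be reversed to compensate.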

While adding in texture support, I discovered some rather odd things.  First off, the U and V texture coordinates are named "s" and "t".  After doing some poking around, I think this is an OpenGL thing.  Trying to find the texture coordinates was a bit confusing at first though.

The second odd thing is that texture coordinates for each vertex are stored per triangle, not per vertex.  This means that a single vertex can have multiple texture coordinates.  I suppose this makes sense, since you may want two triangles which share a vertex to have completely unrelated textures.  It did goof up my usual way of doing things though, and I had to scrap my indexed drawing code since each vertex of each triangle needs to be stored.  There may be a way for me to split up the geometry data and the material data using stream frequencies, but further investigation along those lines will have to wait for the moment.
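A sketch of the de-indexing I ended up with, using illustrative struct names rather than the real MS3D file layout:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Position { float x, y, z; };
struct TexCoord { float s, t; };

// MS3D-style triangle: three vertex indices plus three texture
// coordinates, one per corner. (Field names here are mine.)
struct Triangle
{
    std::uint16_t index[3];
    TexCoord      uv[3];
};

struct FinalVertex { Position pos; TexCoord uv; };

// Expand indexed geometry into a flat vertex array. Because texture
// coordinates live on the triangle corners, a shared vertex may need
// a different uv in each triangle, so every corner gets its own
// final vertex.
std::vector<FinalVertex> Deindex(const std::vector<Position>& positions,
                                 const std::vector<Triangle>& tris)
{
    std::vector<FinalVertex> out;
    out.reserve(tris.size() * 3);
    for (const Triangle& t : tris)
        for (int c = 0; c < 3; ++c)
            out.push_back(FinalVertex{positions[t.index[c]], t.uv[c]});
    return out;
}
```

The cost is duplicated positions for shared vertices, which is exactly the trade-off that made me scrap the indexed drawing code.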

Texturing isn't completely working yet.  Coordinates are stored fine, but the same texture is used for every model.  Also, no other material properties are currently in use.  I will probably get these things working correctly after skinning is in.

The method of instancing which I am using involves separating the model data and per-instance data into different vertex buffers.  The model data is stuff like vertex position, texture coordinates, etc. and per-instance data is the world transformation matrix and anything else which differentiates instances of the same model.  Stream frequencies are then setup so that one "vertex" of instance data is applied to the entire model.

This was my first time working with multiple streams, so I suppose it's not surprising that I made a newbie blunder.  After the instancing code was all finished, the model wouldn't appear.  I spent some time trying to figure out what went wrong, but everything checked out...  I eventually found the culprit though:  When creating the vertex declaration, I didn't start the offset over at 0 when I started specifying stream 1 elements.  This caused all of the "vertices" in stream 1 to have a chunk of padding at their beginning equal in size to the stream 0 vertex size.

Monday, May 9, 2011

Started loading models. Matrices and shaders.

We use Milkshape3D for modeling, so it makes sense to support loading .ms3d model files.  I implemented basic loading code this morning, but it needs a lot of work.  In its current state, it will only load a single triangle out of the model, and materials and animation are not taken into account.  After I get some other parts of the rendering side of things finished, I plan on getting it to correctly load and display a cool looking 3D model that GreyKnight created.

The rest of the day was spent in getting the camera and the projection matrix working.

Something you should look out for when using a vertex shader to transform vertices: When the matrix is stored in the constant registers, by default each register represents a column of the matrix, not a row.  This is called "column-major", and makes sense from an optimization point of view; if you are using row vectors, then the columns of the matrices will need to be accessed to perform transformations, and the rows will rarely be directly accessed (one exception would be in matrix multiplication).  You can account for this by transposing your matrices before you place them in the registers.
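The transpose itself is trivial; here is a minimal sketch of what happens before the matrix is copied into the registers:

```cpp
#include <cassert>

// Shader constants default to column-major storage, so transpose the
// engine's row-major matrix before handing it to the four constant
// registers (e.g. via Set*ShaderConstantF with a register count of 4).
void Transpose4x4(const float in[16], float out[16])
{
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[c * 4 + r] = in[r * 4 + c];
}
```

Alternatively, HLSL lets you declare a constant as row_major to skip the transpose, though transposing on the CPU is the more common convention.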

To make the code more readable (and easier to write), I have created four specialized derivations of the generic matrix: RotateX, RotateY, RotateZ, and Translate.  The names are pretty self-explanatory; RotateX represents a rotation about the x-axis, etc. and Translate represents a translation.  This allows me to create a rotation or translation matrix by providing only a single argument (an angle or a vector, depending on the matrix).

On the subject of matrices, I have decided to use mOut = Transpose(mIn); rather than mOut = mIn.Transpose();  The latter implies that the matrix being transposed is being altered, which isn't the case (a copy of it is created and transposed).  I don't much like mOut = Multiply(mL, mR); though, as this is a bit too messy in my opinion.  For multiplication and similar operations, I have decided to use this format: mOut = mL*mR;

Sunday, May 8, 2011

Anti-cheat shader works

I finally have the anti-cheat shader working!  These are the results (you may need to increase your monitor's brightness to see the difference, as this is what the shader is designed to combat):

The fixed version is to the left, and the original is to the right.

Ironically, I kind of cheated to get these examples (it is a simple quad with a screenshot slapped on top of it), so I can't see how it looks interactively.  But I plan on using this for future projects which require this kind of protection, so I should have an excuse to try it out eventually (probably during one of these quick test projects I will be creating).

P.S. Adding images to a post on Blogspot is somewhat annoying.  I had to restructure this post several times before it was formatted correctly.

Saturday, May 7, 2011

Trouble trying to create banding, and an IDE for shaders

I spent most of the day trying to get the anti-cheating shader to work.  I still don't have it working quite right, but it's getting closer.  The idea is to create worse and worse banding the darker something is, thus blurring any details.  I am able to create banding just fine, but making it vary depending on the brightness is proving to be a bit troublesome.
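The core idea can be sketched as plain scalar math (written in C++ here rather than HLSL, and the 4-to-64 band range is an arbitrary choice of mine, not the final shader's values):

```cpp
#include <cassert>
#include <cmath>

// Quantize a color channel (0..1) into bands, with fewer bands the
// darker the pixel is: bright areas keep many levels (little visible
// change), while dark areas collapse into a few flat bands that hide
// any detail a player might try to brighten their monitor to see.
float BandChannel(float value, float brightness)
{
    float levels = 4.0f + brightness * 60.0f; // dark -> ~4 bands, bright -> ~64
    return std::floor(value * levels) / levels;
}
```

The part I'm still fighting with is picking the brightness-to-band-count curve so the transition between band counts isn't itself visible.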

If you are looking for an IDE to write shaders in, I recommend RenderMonkey by AMD.  It supports both Direct3D and OpenGL shader languages, and shows you a preview of the shader you are writing.  Intellisense and syntax highlighting are also available.

Friday, May 6, 2011

HLSL and a cool Direct3D debugging tool

I worked on the rendering system today.  Because of the way I handle errors, there is a class for the D3D object, the device, vertex declarations, vertex buffers, vertex shaders, and pixel shaders.  This is a little inconvenient, as the following code shows:

hResult = m_renderer->GetDevice()->DrawPrimitive(D3DPT_TRIANGLELIST, 0, 1);
if (hResult != D3D_OK)
    throw (CException("Could not draw the triangle.\r\n\r\nError Code: 0x%08X", hResult));

I'm considering having the classes manage global variables rather than member variables.  That way, I can access the global directly in other parts of the code rather than through an object.  I may try that in the next iteration.

Today marks the first time I have ever written and used a shader.  This is something I have been wanting to do for a while, and it opens up some very interesting possibilities.  For example, I can now make the screen fade to grey.  I can also prevent players from being able to get around the darkness in a game by banding dark colors (I am currently working out how to do this exactly).

One thing which was a bit annoying for me while researching HLSL (a programming language you can use to write shaders) was being unable to find a simple, basic shader example.  So here is a set of bare-bones, pass-through shaders which will hopefully help someone who wants to learn how to write shaders:

// ***** Vertex Shader *****
struct tVSInput
{
    float3 pos        : POSITION;
    float4 diffuse    : COLOR;
};

struct tVSOutput
{
    float4 pos        : POSITION;
    float4 diffuse    : COLOR;
};

tVSOutput main(tVSInput input)
{
    tVSOutput output;
    output.pos.xyz = input.pos;
    output.pos.w = 1;
    output.diffuse = input.diffuse;
    return (output);
}
// *****

// ***** Pixel Shader *****
struct tVSOutput
{
    float4 pos        : POSITION;
    float4 diffuse    : COLOR;
};

float4 main(tVSOutput input) : COLOR
{
    return (input.diffuse);
}
// *****

I should probably note that this is my first time writing these, so an expert may take issue with something in there.  However, both shaders work perfectly fine in the tests I have run, and they are nice and simple to read.

On the subject of shaders: if you use a shader model 3.0 vertex shader, you apparently need to also use a shader model 3.0 pixel shader.  If you use a 2.0 vertex shader (or lower, I assume), you can let the fixed-function pipeline take care of the pixel shader part.  Before I figured this out, I was a bit confused as to why the triangle I was using as a test wasn't appearing.

If you are using Direct3D, I highly recommend trying out PIX.  PIX is a free utility that comes with the SDK.  You can find it in the utilities folder in the SDK installation directory.  It is pretty much a debugger for Direct3D; it monitors how the program you attach it to uses Direct3D and gives you a report after the program closes.  It gives you a list of all created resources and the arguments used when creating them, you can take snapshots at runtime and later view what calls were made during that snapshot, and quite a bit more from the looks of it.