Question 1:
----------

a)  Using the approach where we OR together a collection of ANDed terms,
    one for each set of inputs that makes the output true, we'd get:

    (!Op && A && !B) || (Op && !A && B) || (Op && A && !B) || (Op && A && B)
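
As a quick sanity check, the expression above can be evaluated for all eight input combinations. This is only a sketch in Python: the function names are made up for illustration, and the rows it reports are the ones implied by the expression itself, not copied from the exam's truth table.

```python
from itertools import product

def output(op, a, b):
    """Sum-of-products expression from part a, taking 0/1 inputs."""
    return bool((not op and a and not b) or
                (op and not a and b) or
                (op and a and not b) or
                (op and a and b))

def true_rows():
    """Return every (Op, A, B) combination that makes the output true."""
    return [row for row in product((0, 1), repeat=3) if output(*row)]

if __name__ == "__main__":
    for row in true_rows():
        print(row)
```

Each row printed corresponds to exactly one ANDed term in the expression, which is the point of the OR-of-ANDs construction.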
    
b)  No, the circuit is not equivalent to the truth table. You can demonstrate
    this by building the truth table for the circuit, but it's sufficient to
    just build the first ROW of the table, since it already disagrees with
    the table from part a:
    
        0 0 0 --> 1 for the circuit, vs 0 for the table
    

Question 2:
----------

a)

We need to store the three $a registers so that we can restore them after the
call to print_output, $ra since the call to print_output overwrites that as
well, and the three $s registers the body of this procedure uses. That's a
grand total of 7 registers or 28 bytes.
  
        addi $sp, $sp, -28  
        sw $a0, 0($sp)      # Have to store, since we call procedure before use
        sw $a1, 4($sp)      # Have to store, since we call procedure before use
        sw $a2, 8($sp)      # Have to store, since we call procedure before use
        sw $ra, 12($sp)     # jal overwrites $ra, so we need a backup copy
        sw $s0, 16($sp)     # We need to save and restore $s registers we use
        sw $s1, 20($sp)     # We need to save and restore $s registers we use
        sw $s2, 24($sp)     # We need to save and restore $s registers we use
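
The frame size and offsets used above follow directly from the number of saved registers. The sketch below (Python, with hypothetical helper names) recomputes them, assuming 4-byte words and the same store order as the code in part a.

```python
WORD_BYTES = 4  # each MIPS register holds one 32-bit word

# Registers in the order they are stored in part a.
SAVED_REGISTERS = ["$a0", "$a1", "$a2", "$ra", "$s0", "$s1", "$s2"]

def frame_layout(registers):
    """Map each saved register to its byte offset from the new $sp."""
    return {reg: i * WORD_BYTES for i, reg in enumerate(registers)}

def frame_size(registers):
    """Total number of bytes to subtract from $sp."""
    return len(registers) * WORD_BYTES

if __name__ == "__main__":
    print(frame_size(SAVED_REGISTERS))
    print(frame_layout(SAVED_REGISTERS))
```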
        
b)
        
We don't have to restore $a or $t registers, but we need to restore the
three $s registers, and $ra since we didn't reload it right after the call
to print_output. Then pop the stack and jr $ra to get back to the caller.

        lw $ra, 12($sp)     
        lw $s0, 16($sp)         
        lw $s1, 20($sp)         
        lw $s2, 24($sp)         
        addi $sp, $sp, 28   # "Pop" the stack before we leave
        jr $ra              # jr $ra to get back to caller

c)

Since we stored the $a registers above, we just need to reload their values
after the call to print_output. You could reload $ra here too, or at the
end before returning.
        
        jal print_output
        lw $a0, 0($sp)      # print_output might have changed $a registers,
        lw $a1, 4($sp)      # so we need to restore their values.
        lw $a2, 8($sp)
        # Need $ra too, but can wait until the end
        

Question 3:
----------

a)  It's the final version of the 32-bit ALU.

b)  The 32 Result outputs are connected to both the register file and main
    memory, so that the results can be written to a register or used as a
    memory address for lw and sw.
    
c)  It's connected that way to implement slt. In slt we configure the ALU to
    do subtraction, in which case the leftmost bit (bit 31) will be a 1 if
    the result of slt is true, and 0 if it should be false. That result is
    piped into bit 0 instead, where the output from slt is expected.
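
The slt mechanism can be modeled in a few lines. This is a rough Python sketch, not a description of the actual hardware: the function name is hypothetical, and like the simple scheme described above it looks only at the sign bit, so it gives the wrong answer when the subtraction overflows.

```python
MASK32 = 0xFFFFFFFF  # keep results to 32 bits, like the ALU

def slt_via_subtract(a, b):
    """Return 1 if a < b, using bit 31 of the 32-bit result of a - b.

    Overflow is ignored, so this is only correct when a - b fits in
    a signed 32-bit value.
    """
    diff = (a - b) & MASK32      # ALU configured for subtraction
    return (diff >> 31) & 1      # sign bit, routed to bit 0 of the result

if __name__ == "__main__":
    print(slt_via_subtract(3, 5))
    print(slt_via_subtract(5, 3))
```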


Question 4:
----------

a)  The multiplexer is there so that the CPU can execute R-type instructions
    like add, in which both inputs to the ALU come from registers, but also
    execute I-type instructions like addi, in which the second input should
    instead come from the bits of the instruction. It controls where the
    second ALU input comes from.
    
b)  The shift left is part of the mechanism for handling branch instructions.
    A branch instruction encodes an offset in *words* as its immediate. The
    shift-left converts that to an offset in *bytes* so that the offset can
    be added to a "normal" memory address such as the one coming from PC+4.
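
The word-to-byte conversion can be illustrated with a small Python sketch. The function names and the example addresses are made up for illustration; the sketch assumes the usual 16-bit signed immediate field of a MIPS branch.

```python
def sign_extend16(imm):
    """Interpret a 16-bit immediate field as a signed integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm):
    """Branch target address: (PC + 4) + (word offset shifted left by 2)."""
    byte_offset = sign_extend16(imm) << 2   # words -> bytes
    return (pc + 4 + byte_offset) & 0xFFFFFFFF

if __name__ == "__main__":
    # An offset of 3 words from a branch at 0x00400000.
    print(hex(branch_target(0x00400000, 3)))
```

An immediate of 0xFFFF (that is, -1 words) sends the branch back to its own address, since the offset is applied to PC+4 rather than PC.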
    
c)  RegDest:    0
    Branch:     0
    MemRead:    X
    MemtoReg:   0
    ALUOp:      0 0 (do addition w/o inspecting funct bits)
    MemWrite:   0
    ALUSrc:     1
    RegWrite:   1


Question 5:
----------

2.4x10^9 cycles/second
1.0x10^11 instructions
34% are mem accesses
CPI of 4
200 cycle miss penalty

With NO cache, we have the following cycle "costs":

    instruction cycles = 1x10^11 inst x 4 cycles/inst = 4x10^11 cycles
    inst. miss penalty = 1x10^11 misses x 200 cycles/miss = 200x10^11 cycles
    data miss penalty  = (1x10^11 x 34%) misses x 200 cycles/miss = 68x10^11 cycles
    
That's a total of 272x10^11 or 2.72x10^13 cycles. At a clock rate of 2.4GHz, that's:

    2.72 x 10^13 cycles / 2.4x10^9 cycles/second = 11,333.3 seconds
    
With a "reasonable" cache, the instruction-miss penalty is only 2% of what we
calculated above, and the data-miss penalty is only 5% of the figure above:

    instruction cycles = 4x10^11 cycles
    inst. miss penalty = 2% x 200x10^11 = 4x10^11 cycles
    data miss penalty  = 5% x 68x10^11 = 3.4x10^11 cycles
    
That's a total of 11.4x10^11 cycles, which takes 475 seconds.

    11,333.3 / 475 ---> 23.86x faster with the cache
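
The arithmetic above can be re-checked with a short script. This is just a sketch in Python; the constant and function names are invented here, and "no cache" is modeled as a 100% miss rate for both instruction fetches and data accesses, which is how the calculation above treats it.

```python
CLOCK_HZ = 2.4e9             # cycles per second
INSTRUCTIONS = 1.0e11        # instructions executed
MEM_ACCESS_FRACTION = 0.34   # fraction of instructions that access data memory
BASE_CPI = 4                 # cycles per instruction, ignoring misses
MISS_PENALTY = 200           # cycles per miss

def run_time_seconds(inst_miss_rate, data_miss_rate):
    """Total run time given instruction- and data-miss rates (0.0 to 1.0)."""
    base = INSTRUCTIONS * BASE_CPI
    inst_stalls = INSTRUCTIONS * inst_miss_rate * MISS_PENALTY
    data_stalls = (INSTRUCTIONS * MEM_ACCESS_FRACTION
                   * data_miss_rate * MISS_PENALTY)
    return (base + inst_stalls + data_stalls) / CLOCK_HZ

no_cache = run_time_seconds(1.0, 1.0)      # every access misses
with_cache = run_time_seconds(0.02, 0.05)  # 2% and 5% miss rates
speedup = no_cache / with_cache

if __name__ == "__main__":
    print(round(no_cache, 1), round(with_cache, 1), round(speedup, 2))
```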