Question 1:
----------
a) Using the approach where we OR together a collection of ANDed terms, one
   for each set of inputs that makes the output true, we'd get:

   (!Op && A && !B) || (Op && !A && B) || (Op && A && !B) || (Op && A && B)

b) No, the circuit is not equivalent to the truth table. You can demonstrate
   this by building the truth table for the circuit, but it's sufficient to
   just build the first ROW of the table, since it already disagrees with the
   table from part a:

   0 0 0 --> 1 for the circuit, vs 0 for the table

Question 2:
----------
a) We need to store the three $a registers so that we can restore them after
   the call to print_output, $ra since the call to print_output overwrites
   that as well, and the three $s registers the body of this procedure uses.
   That's a grand total of 7 registers or 28 bytes.

   addi $sp, $sp, -28
   sw $a0, 0($sp)     # Have to store, since we call procedure before use
   sw $a1, 4($sp)     # Have to store, since we call procedure before use
   sw $a2, 8($sp)     # Have to store, since we call procedure before use
   sw $ra, 12($sp)    # jal overwrites $ra, so we need a backup copy
   sw $s0, 16($sp)    # We need to save and restore $s registers we use
   sw $s1, 20($sp)    # We need to save and restore $s registers we use
   sw $s2, 24($sp)    # We need to save and restore $s registers we use

b) We don't have to restore $a or $t registers, but we need to restore the
   three $s registers, and $ra since we didn't do it above. Then pop the
   stack and jr $ra to get back to the caller.

   lw $ra, 12($sp)
   lw $s0, 16($sp)
   lw $s1, 20($sp)
   lw $s2, 24($sp)
   addi $sp, $sp, 28  # "Pop" the stack before we leave
   jr $ra             # jr $ra to get back to caller

c) Since we stored the $a registers above, we just need to reload their
   values after the call to print_output. You could reload $ra here too, or
   at the end before returning.

   jal print_output
   lw $a0, 0($sp)     # print_output might have changed $a registers,
   lw $a1, 4($sp)     # so we need to restore their values.
   lw $a2, 8($sp)     # Need $ra too, but can wait until the end

Question 3:
----------
a) It's the final version of the 32-bit ALU.

b) The 32 Result outputs are connected to both the register file and main
   memory, so that the results can be written to a register or used as a
   memory address for lw and sw.

c) It's connected that way to implement slt. In slt we configure the ALU to
   do subtraction, in which case the leftmost bit (bit 31) will be a 1 if the
   result of slt is true, and 0 if it should be false. That result is piped
   into bit 0 instead, where the output from slt is expected.

Question 4:
----------
a) The multiplexer is there so that the CPU can execute R-type instructions
   like add, in which both inputs to the ALU come from registers, but also
   execute I-type instructions like addi, in which the second input should
   instead come from the bits of the instruction. It controls where the
   second ALU input comes from.

b) The shift left is part of the mechanism for handling branch instructions.
   A branch instruction encodes an offset in *words* as its immediate. The
   shift-left converts that to an offset in *bytes* so that the offset can be
   added to a "normal" memory address such as the one coming from PC+4.

c) RegDest:  0
   Branch:   0
   MemRead:  X
   MemtoReg: 0
   ALUOp:    0 0  (do addition w/o inspecting funct bits)
   MemWrite: 0
   ALUSrc:   1
   RegWrite: 1

Question 5:
----------
   2.4x10^9 cycles/second
   1.0x10^11 instructions
   34% are mem accesses
   CPI of 4
   200 cycle miss penalty

With NO cache, we have the following cycle "costs":

   instruction cycles = 1x10^11 inst x 4 cycles/inst = 4x10^11 cycles
   inst. miss penalty = 1x10^11 misses x 200 cycles/miss = 200x10^11 cycles
   data miss penalty  = (1x10^11 x 34%) misses x 200 cycles/miss = 68x10^11 cycles

That's a total of 272x10^11 or 2.72x10^13 cycles.
At a clock rate of 2.4GHz, that's:

   2.72x10^13 cycles / 2.4x10^9 cycles/second = 11,333.3 seconds

With a "reasonable" cache, the instruction-miss penalty is only 2% of what we
calculated above, and the data-miss penalty is only 5% of the figure above:

   instruction cycles = 4x10^11 cycles
   inst. miss penalty = 2% x 200x10^11 = 4x10^11 cycles
   data miss penalty  = 5% x 68x10^11 = 3.4x10^11 cycles

That's a total of 11.4x10^11 cycles, which takes 475 seconds.

   11,333.3 / 475 ---> 23.86x faster with the cache
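
The Question 5 arithmetic can be double-checked with a short script. This is only a sketch: the constant and function names are made up for illustration, and it assumes the 2% and 5% figures act as miss rates applied to the no-cache penalties, as in the working above.

```python
# Figures from Question 5. All names here are illustrative, not from the exam.
CLOCK_HZ = 2.4e9            # cycles per second
INSTRUCTIONS = 1.0e11       # instructions executed
MEM_ACCESS_FRACTION = 0.34  # fraction of instructions that access data memory
BASE_CPI = 4                # cycles per instruction, ignoring misses
MISS_PENALTY = 200          # cycles per miss


def run_time_seconds(inst_miss_rate, data_miss_rate):
    """Total run time for the given instruction and data miss rates."""
    base_cycles = INSTRUCTIONS * BASE_CPI
    inst_stall = INSTRUCTIONS * inst_miss_rate * MISS_PENALTY
    data_stall = (INSTRUCTIONS * MEM_ACCESS_FRACTION
                  * data_miss_rate * MISS_PENALTY)
    return (base_cycles + inst_stall + data_stall) / CLOCK_HZ


# No cache: every instruction fetch and every data access misses.
no_cache = run_time_seconds(1.0, 1.0)
# "Reasonable" cache: assumed 2% instruction and 5% data miss rates.
with_cache = run_time_seconds(0.02, 0.05)
speedup = no_cache / with_cache

print(f"no cache:   {no_cache:,.1f} s")
print(f"with cache: {with_cache:,.1f} s")
print(f"speedup:    {speedup:.2f}x")
```

Passing the miss rates as parameters makes it easy to see how sensitive the speedup is to the cache's hit rate.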