













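Software-controlled interleave: S pipeline slots shared among N threads, S > N. Picture a wheel with a thread assigned to each slot. If the thread in the current slot is ready to go, it goes; otherwise the slot issues a NOP, i.e., a pipeline bubble.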
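
A minimal sketch of the wheel, assuming a fixed slot-to-thread assignment and a caller-supplied readiness check. The names `run_wheel`, `slots`, and `ready` are hypothetical, and the stall pattern in the usage lines is made up for illustration.

```python
from typing import Callable, List, Optional


def run_wheel(slots: List[Optional[int]],
              ready: Callable[[int, int], bool],
              cycles: int) -> List[str]:
    """Visit one slot per cycle; a slot whose thread is not ready issues a NOP."""
    trace = []
    for cycle in range(cycles):
        thread = slots[cycle % len(slots)]  # the wheel turns one slot per cycle
        if thread is not None and ready(thread, cycle):
            trace.append(f"T{thread}")
        else:
            trace.append("NOP")  # pipeline bubble
    return trace


# S = 4 slots shared by N = 2 threads: thread 0 owns three slots, thread 1 owns one.
slots = [0, 1, 0, 0]
# Assumed readiness rule: thread 1 is stalled during cycles 0-3.
trace = run_wheel(slots, lambda t, c: not (t == 1 and c < 4), cycles=8)
print(trace)
```

Thread 1's slot in cycle 1 is wasted as a bubble even though thread 0 is ready, which is the cost of a fixed interleave.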





Managing interactions between threads.





SGI bought Cray Research (1996). Tera was an independent company (founded 1987), not a spin-off.

1997. Integer sort press release.





- 32 64-bit general-purpose registers (R0-R31)
  - unified integer/floating-point register set
  - R0 hard-wired to zero
- 8 64-bit branch target registers (T0-T7)
  - load branch target address before branch instruction
  - T0 contains address of user exception handler
- 1 64-bit stream status word (SSW)
  - includes 32-bit program counter
  - four condition code registers
  - floating-point rounding mode
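
The per-stream state listed above can be sketched as a small data structure. This is an illustrative model only: the class and method names are hypothetical, and the SSW is kept as one opaque 64-bit word because the notes do not give its field layout.

```python
from dataclasses import dataclass, field
from typing import List

MASK64 = (1 << 64) - 1


@dataclass
class StreamState:
    gpr: List[int] = field(default_factory=lambda: [0] * 32)    # R0-R31, unified int/FP
    target: List[int] = field(default_factory=lambda: [0] * 8)  # T0-T7, branch targets
    ssw: int = 0                                                 # stream status word

    def write_gpr(self, index: int, value: int) -> None:
        if index != 0:  # R0 is hard-wired to zero, so writes to it are dropped
            self.gpr[index] = value & MASK64

    def read_gpr(self, index: int) -> int:
        return self.gpr[index]

    def bits(self) -> int:
        """Architectural bits per stream, counting R0 as a register."""
        return 64 * (len(self.gpr) + len(self.target) + 1)


s = StreamState()
s.write_gpr(0, 123)  # ignored
s.write_gpr(5, 123)
print(s.read_gpr(0), s.read_gpr(5), s.bits())  # 0 123 2624
```

Counting 41 words of 64 bits gives 2624 bits of state per stream, which is what the hardware must replicate for every stream it holds.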



If the memory unit is busy or a sync operation fails, the request is retried.

It just goes around the memory pipeline again.



Tera was not very successful: 2 machines sold.

Bought the Cray Research business from SGI (2000) and renamed itself Cray!

















Illustrates SMT thread issue and execution, and how it differs from the alternatives

SS: only a single thread; hurt by long-latency instructions with many dependent instructions

FGMT: limited by the amount of ILP in each thread, just as on the SS

MP: each processor issues instructions from its own thread

Example: one thread stalls or has little ILP

Performance



no special HW for scheduling instructions from different threads onto FUs

can use the same OOO mechanism as a superscalar for instruction issue:

**RENAMING HW** eliminates false dependences both within a thread (just like a conventional SS) & between threads

MAP thread-specific architectural registers in all threads onto a pool of physical registers; registers are thereafter called by their physical name

instructions are issued when operands are available, without regard to thread

(scheduler does not look at thread IDs)



8\*32 = 256 registers for the architectural state + 96 additional registers for register renaming
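
A minimal sketch of renaming over a shared physical register pool, using the sizes above (8 threads, 32 architectural registers each, 96 extra). The class name `RenameTable`, its methods, and the first-come free-list order are assumptions for illustration; reclaiming the old physical register at retirement is left out.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class RenameTable:
    def __init__(self, threads: int = 8, arch_regs: int = 32, extra: int = 96) -> None:
        committed = threads * arch_regs
        # Each (thread, architectural register) pair starts with its own physical register.
        self.mapping: Dict[Tuple[int, int], int] = {
            (t, r): t * arch_regs + r
            for t in range(threads) for r in range(arch_regs)
        }
        # The extra registers form the free pool shared by all threads.
        self.free: Deque[int] = deque(range(committed, committed + extra))

    def rename(self, thread: int, dest: int, srcs: List[int]) -> Tuple[int, List[int]]:
        # Sources are looked up first, so "r1 = r1 + r2" reads the old r1.
        phys_srcs = [self.mapping[(thread, r)] for r in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        phys_dest = self.free.popleft()
        self.mapping[(thread, dest)] = phys_dest
        return phys_dest, phys_srcs


rt = RenameTable()
d0, s0 = rt.rename(thread=0, dest=1, srcs=[2, 3])  # thread 0: r1 = r2 op r3
d1, s1 = rt.rename(thread=1, dest=1, srcs=[2, 3])  # thread 1: same architectural names
d2, s2 = rt.rename(thread=0, dest=4, srcs=[1])     # thread 0: r4 = f(r1)
print(d0, s0, d1, s1, d2, s2)
```

After renaming, the two threads' `r1` writes land in different physical registers, and thread 0's consumer reads `d0` directly, so a scheduler working on physical names never needs the thread ID.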



Need a fetch unit that can keep up with the simultaneous multithreaded execution engine

fetch from the threads that have the fewest instructions waiting to be executed

these are the threads making the best progress through the machine

40% increase in IPC over round-robin (RR) fetch
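
The fetch choice described above can be sketched as a one-line selection over per-thread counters. The function name, the dictionary of counts, and the tie-break toward the lowest thread ID are assumptions for illustration; real hardware would use per-thread counters and a priority encoder.

```python
from typing import Dict


def pick_fetch_thread(waiting: Dict[int, int]) -> int:
    """Return the thread with the fewest instructions waiting to execute.

    Ties go to the lowest thread ID (an arbitrary choice for this sketch).
    """
    return min(sorted(waiting), key=lambda t: waiting[t])


# Hypothetical counts of not-yet-executed instructions per thread.
waiting = {0: 12, 1: 3, 2: 7, 3: 3}
first = pick_fetch_thread(waiting)
waiting[first] += 8  # assume a block of 8 instructions is fetched for it
second = pick_fetch_thread(waiting)
print(first, second)  # 1 3
```

A round-robin policy would have fetched for thread 0 next regardless of its 12 waiting instructions; the counter-based choice steers fetch bandwidth toward threads that are draining quickly.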



none of the small stuff endangers the critical path

most mechanisms already exist; now duplicated for each thread or implemented to apply only to 1 thread at a time

instructions carry a thread ID for retirement, traps, and queue flush; it is not used for scheduling

HW structure that points to all instructions of each thread; the flush mechanism needs it on a branch misprediction

Fairly straightforward extension to OOO SS; this + n-fold performance boost was responsible for the technology transfer to chip manufacturers
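
A sketch of a per-thread flush on a branch misprediction, assuming each queue entry is tagged with a thread ID and a per-thread sequence number. The `Entry` fields and the function `flush_after` are hypothetical, and a list filter stands in for whatever pointer structure the hardware actually uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    thread: int  # thread ID carried by the instruction
    seq: int     # program order within its thread (assumed tag)
    op: str


def flush_after(queue: List[Entry], thread: int, branch_seq: int) -> List[Entry]:
    """Drop instructions of `thread` younger than the mispredicted branch."""
    return [e for e in queue if not (e.thread == thread and e.seq > branch_seq)]


queue = [
    Entry(0, 1, "br"), Entry(1, 1, "add"), Entry(0, 2, "ld"),
    Entry(1, 2, "mul"), Entry(0, 3, "st"),
]
# Thread 0's branch (seq 1) was mispredicted: squash its younger instructions only.
survivors = flush_after(queue, thread=0, branch_seq=1)
print([(e.thread, e.op) for e in survivors])  # [(0, 'br'), (1, 'add'), (1, 'mul')]
```

Thread 1's instructions stay in the queue untouched, which is why the thread ID is needed for the flush but not for scheduling.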





## Which thread to fetch from next?

- Don't want to clog instruction window with thread with many stalls → try to fetch from thread that has fewest insts in window
- Locks
  - Virtual CPU spinning on lock executes many instructions but gets nowhere → add ISA support to lower priority of thread spinning on lock
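
The two rules above can be combined in a short sketch: prefer normal-priority threads, then pick the one with the fewest instructions in the window. The `low_priority` flag stands in for whatever the ISA hint would set; the names here are hypothetical and the strict two-level priority is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ThreadInfo:
    in_window: int              # instructions currently in the window
    low_priority: bool = False  # set by a (hypothetical) spin-wait hint


def next_fetch(threads: Dict[int, ThreadInfo]) -> int:
    # False sorts before True, so normal-priority threads win first;
    # within a priority level the smallest window occupancy wins.
    return min(sorted(threads),
               key=lambda t: (threads[t].low_priority, threads[t].in_window))


threads = {0: ThreadInfo(10), 1: ThreadInfo(2), 2: ThreadInfo(6)}
before = next_fetch(threads)     # thread 1 looks like it is making good progress
threads[1].low_priority = True   # thread 1 signals that it is spinning on a lock
after = next_fetch(threads)
print(before, after)  # 1 2
```

A spinning thread drains its instructions quickly, so the fewest-in-window rule alone would keep rewarding it; the hint is what lets the fetch unit favor threads doing useful work.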



Load-store buffer in L1 cache doesn't behave like that, and hence 15% slowdown.