



## Write-back Caches & SC

| Step                         | cache-1           | memory                                | cache-2                                | prog T2                                                  |
|------------------------------|-------------------|---------------------------------------|----------------------------------------|----------------------------------------------------------|
| T1 is executed               | X = 1<br>Y = 11   | X = 0<br>Y = 10<br>X' =<br>Y' =       | Y =<br>Y' =<br>X =<br>X' =             | LD Y, R1<br>ST Y', R1<br>LD X, R2<br>ST X', R2           |
| cache-1 writes back Y        | X = 1<br>Y = 11   | X = 0<br>Y = 11<br>X' =<br>Y' =       | Y =<br>Y' =<br>X =<br>X' =             |                                                          |
| T2 executed                  | X = 1<br>Y = 11   | X = 0<br>Y = 11<br>X' =<br>Y' =       | Y = 11<br>Y' = 11<br>X = 0<br>X' = 0   |                                                          |
| cache-1 writes back X        | X = 1<br>Y = 11   | X = 1<br>Y = 11<br>X' =<br>Y' =       | Y = 11<br>Y' = 11<br>X = 0<br>X' = 0   |                                                          |
| cache-2 writes back X' & Y'  | X = 1<br>Y = 11   | X = 1<br>Y = 11<br>X' = 0<br>Y' = 11  | Y =<br>Y' =<br>X =<br>X' =             |                                                          |
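The sequence in the table can be replayed with a few dictionaries. This is only a minimal sketch: the two caches and memory are plain Python dicts with no coherence between them, and T1's stores (`ST X, 1` then `ST Y, 11`) are an assumption inferred from the cache-1 column, since the T1 program is not shown here.

```python
# Memory and two private write-back caches, modeled as plain dicts.
# There is deliberately NO coherence mechanism between them.
memory = {"X": 0, "Y": 10}
cache1 = {}
cache2 = {}

def store(cache, addr, value):
    # Write-back cache: a store updates only the local copy.
    cache[addr] = value

def load(cache, addr):
    # On a miss, fetch whatever memory currently holds.
    if addr not in cache:
        cache[addr] = memory[addr]
    return cache[addr]

def write_back(cache, addr):
    # Copy the cached value out to memory.
    memory[addr] = cache[addr]

def run_t1():
    # Assumed program T1: ST X, 1 ; ST Y, 11
    store(cache1, "X", 1)
    store(cache1, "Y", 11)

def run_t2():
    # Program T2 from the table: LD Y, R1 ; ST Y', R1 ; LD X, R2 ; ST X', R2
    r1 = load(cache2, "Y")
    store(cache2, "Y'", r1)
    r2 = load(cache2, "X")
    store(cache2, "X'", r2)

# Replay the five steps of the table in order.
run_t1()
write_back(cache1, "Y")
run_t2()
write_back(cache1, "X")
write_back(cache2, "X'")
write_back(cache2, "Y'")

print(memory)
```

Under the assumed T1, the final memory state holds Y' = 11 (the new Y) together with X' = 0 (the old X). No sequentially consistent interleaving of T1 and T2 produces that pair, because T1 stores X before Y.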



## Maintaining Sequential Consistency

SC is sufficient for correct producer-consumer and mutual exclusion code (e.g., Dekker).

Multiple copies of a location in various caches can cause SC to break down.

Hardware support is required such that:

- only one processor at a time has write permission for a location
- no processor can load a stale copy of the location after a write

⇒ cache coherence protocols
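The two requirements above can be illustrated with a toy write-invalidate model. This is a sketch under simplifying assumptions, not a description of any particular protocol: each location is a single word, a cached copy is either `"S"` (shared, read-only) or `"M"` (modified, writable), and the names `Bus`, `Cache`, and `writers` are hypothetical.

```python
class Cache:
    def __init__(self):
        # addr -> (state, value); state is "S" (read-only) or "M" (writable)
        self.lines = {}

class Bus:
    """Toy model: every miss and every write goes through this shared bus."""

    def __init__(self, n_caches, memory):
        self.caches = [Cache() for _ in range(n_caches)]
        self.memory = memory

    def read(self, pid, addr):
        cache = self.caches[pid]
        if addr in cache.lines:
            return cache.lines[addr][1]  # hit: the copy is valid by construction
        # Miss: a modified owner, if any, writes back and downgrades to shared.
        for other in self.caches:
            line = other.lines.get(addr)
            if line is not None and line[0] == "M":
                self.memory[addr] = line[1]
                other.lines[addr] = ("S", line[1])
        value = self.memory[addr]
        cache.lines[addr] = ("S", value)
        return value

    def write(self, pid, addr, value):
        # Invalidate every other copy before taking write permission.
        # (Locations are single words here, so the old data can be dropped.)
        for i, other in enumerate(self.caches):
            if i != pid:
                other.lines.pop(addr, None)
        self.caches[pid].lines[addr] = ("M", value)

    def writers(self, addr):
        # Processors that currently hold write permission for addr.
        return [i for i, c in enumerate(self.caches)
                if addr in c.lines and c.lines[addr][0] == "M"]

bus = Bus(2, {"X": 0})
bus.read(1, "X")            # processor 1 caches the old value
bus.write(0, "X", 1)        # processor 0 writes; processor 1's copy is invalidated
print(bus.read(1, "X"))     # processor 1 misses and sees the new value
print(bus.writers("X"))     # at most one writer at any time
```

Invalidating before the write is what enforces both conditions in this model: no second writable copy can exist, and a later load by another processor must miss and fetch the current value.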





Update protocols (write broadcast): the latency between writing a word in one processor and reading it in another is usually smaller in a write-update scheme.

But since bandwidth is usually the more precious resource, most multiprocessors use a write-invalidate scheme.
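The bandwidth versus latency trade-off can be made concrete with a rough count of bus transactions. The sketch below assumes a deliberately simple scenario that is not from the slides: processor P0 writes one word `n_writes` times, then processor P1 reads it once, and both caches start with a valid shared copy. The function name `simulate` is hypothetical.

```python
def simulate(protocol, n_writes):
    """Return (bus_transactions, reader_missed) for the assumed scenario:
    P0 writes one word n_writes times, then P1 reads it once."""
    bus = 0
    reader_valid = True  # P1 starts with a valid copy
    for _ in range(n_writes):
        if protocol == "update":
            bus += 1                 # every write is broadcast to P1's copy
        elif protocol == "invalidate":
            if reader_valid:
                bus += 1             # first write invalidates P1's copy
                reader_valid = False
            # later writes hit in P0's cache and stay off the bus
        else:
            raise ValueError(f"unknown protocol: {protocol}")
    reader_missed = not reader_valid
    if reader_missed:
        bus += 1                     # P1 must fetch the word over the bus
    return bus, reader_missed

print(simulate("update", 10))      # (10, False)
print(simulate("invalidate", 10))  # (2, True)
```

In this scenario the update scheme lets the reader hit locally but pays one bus transaction per write, while the invalidate scheme uses two transactions in total at the cost of a read miss.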







A snoopy cache works by analogy with your snoopy next-door neighbor, who is always watching to see what you're doing and interfering with your life. In the case of the snoopy cache, the caches are all watching the bus for transactions that affect blocks that are in the cache at the moment. The analogy breaks down here: the snoopy cache only does something if your actions actually affect it, while the snoopy neighbor is *always* interested in what you're up to.

| Observed Bus Cycle           | Cache State        | Cache Action |
|------------------------------|--------------------|--------------|
| Read Cycle (Memory → Disk)   | Address not cached |              |
|                              | Cached, unmodified |              |
|                              | Cached, modified   |              |
| Write Cycle (Disk → Memory)  | Address not cached |              |
|                              | Cached, unmodified |              |
|                              | Cached, modified   |              |











What does it mean to merge E, M states?



Interlocks are required when both CPU-L1 and L2-Bus interactions involve the same address.



Cache block layout: state, blk addr, data0, data1, ..., dataN

A cache block contains more than one word.

Cache coherence is done at the block level and not the word level.

Suppose $M_1$ writes word<sub>i</sub> and $M_2$ writes word<sub>k</sub> and both words have the same block address. What can happen?

The block may be invalidated many times unnecessarily because the addresses share a common block.
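A small counting model shows the effect. It is a sketch under stated assumptions: coherence is write-invalidate at block granularity, the block size `BLOCK_WORDS` is an arbitrary illustrative value, and a write by one processor invalidates the block in whichever other processor wrote it last. The function name `count_invalidations` is hypothetical.

```python
BLOCK_WORDS = 4  # assumed block size in words, for illustration only

def count_invalidations(writes, block_words=BLOCK_WORDS):
    """writes: list of (processor, word_address) pairs in program order.
    Returns how many times a block is invalidated in another cache."""
    owner = {}  # block address -> processor that last wrote the block
    invalidations = 0
    for proc, word in writes:
        block = word // block_words
        if block in owner and owner[block] != proc:
            invalidations += 1   # the previous writer's copy is invalidated
        owner[block] = proc
    return invalidations

# M1 writes word 0 and M2 writes word 1: different words, same block.
same_block = [(1, 0), (2, 1)] * 4
# M1 writes word 0 and M2 writes word 4: different words, different blocks.
separate_blocks = [(1, 0), (2, 4)] * 4

print(count_invalidations(same_block))       # 7
print(count_invalidations(separate_blocks))  # 0
```

The two processors never touch the same word in either trace, yet the first trace invalidates the block on every write after the first, purely because the words share a block address.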





A split-transaction bus has a read-request transaction followed by a memory-reply transaction that contains the data.

Split transactions make the bus available for other masters while the memory reads the words of the requested address. It also normally means that the CPU must arbitrate for the bus to request the data, and memory must arbitrate for the bus to return the data. Each transaction must be tagged so that a reply can be matched to its request. Split-transaction buses have higher bandwidth and higher latency.
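The role of the tag can be sketched as follows. This is a minimal illustration, not a model of any real bus: arbitration is omitted, memory answers requests in FIFO order for simplicity, and the class and method names (`SplitTransactionBus`, `read_request`, `memory_reply`) are hypothetical.

```python
from collections import deque

class SplitTransactionBus:
    def __init__(self, memory):
        self.memory = memory
        self.next_tag = 0
        self.pending = {}          # tag -> requester waiting for data
        self.mem_queue = deque()   # requests memory is still working on

    def read_request(self, requester, addr):
        # First bus transaction: send the address with a fresh tag, then
        # release the bus so other masters can use it in the meantime.
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = requester
        self.mem_queue.append((tag, addr))
        return tag

    def memory_reply(self):
        # Second bus transaction: memory returns the data along with the
        # tag, which identifies the requester the reply belongs to.
        tag, addr = self.mem_queue.popleft()
        requester = self.pending.pop(tag)
        return requester, tag, self.memory[addr]

bus = SplitTransactionBus({0x10: 7, 0x20: 9})
bus.read_request("cpu0", 0x10)
bus.read_request("cpu1", 0x20)   # issued before the first reply arrives
print(bus.memory_reply())        # ('cpu0', 0, 7)
print(bus.memory_reply())        # ('cpu1', 1, 9)
```

Two requests are outstanding at once here, which is where the extra bandwidth comes from. Each read now takes two separately arbitrated bus transactions, which is where the extra latency comes from.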





