The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to show the transpose:

Question

The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to show the transpose:

for (i = 0; i < 3; i++) {
for (j = 0; j < 3; j++) {
output[j][i] = input[i][j];
}
}

Assume both the input and output matrices are stored in the row major order (row major order means row index changes fastest). Assume you are executing a 256 × 256 double-precision transpose on a processor with a 16 KB fully associative (so you don’t have to worry about cache conflicts) LRU replacement level 1 data cache with 64-byte blocks. Assume level 1 cache misses or prefetches require 16 cycles, always hit in the level 2 cache, and the level 2 cache can process a request every 2 processor cycles. Assume each iteration of the inner loop above requires 4 cycles if the data is present in the level 1 cache. Assume the cache has a write-allocate fetch-on-write policy for write misses. Unrealistically assume writing back dirty cache blocks requires 0 cycles.

For the simple implementation given above, this execution order would be nonideal for the input matrix. However, applying a loop interchange optimization would create a nonideal order for output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.

a. (5 pts) What block size should be used to completely fill the data cache with one input and output block if the level 1 cache is fully associative 64 KB?

b. (5 pts) What is the minimum associativity required of the level 1 cache for consistent performance independent of both arrays’ position in memory?

c. (10 pts) Assume you are designing a hardware prefetcher for the unblocked matrix transposition code above. The simplest type of hardware prefetcher only prefetches sequential cache blocks after a miss. More complicated "nonunit stride" hardware prefetchers can analyze a miss reference stream, and detect and prefetch nonunit strides. Assume prefetches write directly into the cache and no pollution (overwriting data that needs to be used before the data that is prefetched). For best performance given a nonunit stride prefetcher, in the steady state of the inner loop, how many prefetches need to be outstanding at a given time?

Guest · Answer

what class is it and how much will you pay?...