subject

The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to show the transpose:

for (i = 0; i < 3; i++) {
for (j = 0; j < 3; j++) {
output[j][i] = input[i][j];
}
}

Assume both the input and output matrices are stored in the row major order (row major order means row index changes fastest). Assume you are executing a 256 × 256 double-precision transpose on a processor with a 16 KB fully associative (so you don’t have to worry about cache conflicts) LRU replacement level 1 data cache with 64-byte blocks. Assume level 1 cache misses or prefetches require 16 cycles, always hit in the level 2 cache, and the level 2 cache can process a request every 2 processor cycles. Assume each iteration of the inner loop above requires 4 cycles if the data is present in the level 1 cache. Assume the cache has a write-allocate fetch-on-write policy for write misses. Unrealistically assume writing back dirty cache blocks requires 0 cycles.

For the simple implementation given above, this execution order would be nonideal for the input matrix. However, applying a loop interchange optimization would create a nonideal order for output matrix. Because loop interchange is not sufficient to improve its performance, it must be blocked instead.

a. (5 pts) What block size should be used to completely fill the data cache with one input and output block if the level 1 cache is fully associative 64 KB?

b. (5 pts) What is the minimum associativity required of the level 1 cache for consistent performance independent of both arrays’ position in memory?

c. (10 pts) Assume you are designing a hardware prefetcher for the unblocked matrix transposition code above. The simplest type of hardware prefetcher only prefetches sequential cache blocks after a miss. More complicated "nonunit stride" hardware prefetchers can analyze a miss reference stream, and detect and prefetch nonunit strides. Assume prefetches write directly into the cache and no pollution (overwriting data that needs to be used before the data that is prefetched). For best performance given a nonunit stride prefetcher, in the steady state of the inner loop, how many prefetches need to be outstanding at a given time?

ansver
Answers: 3

Other questions on the subject: Computers and Technology

image
Computers and Technology, 22.06.2019 13:30, morganwendel126
Use the keyword strategy to remember the meaning of the following word. the meaning for the word has been provided. write your keyword and describe the picture you would create in your mind. centurion: a commander in the army of ancient rome. keyword: picture:
Answers: 2
image
Computers and Technology, 23.06.2019 20:00, emmaraeschool
Me ajude por favor , coloquei uma senha e não consigo tira-la no chorme
Answers: 2
image
Computers and Technology, 23.06.2019 20:30, lucywood2024
What is the biggest difference between section breaks and regular page breaks
Answers: 1
image
Computers and Technology, 24.06.2019 14:00, Abrahamolve
When creating a field in a table, you must set the to determine what type of data the field can store. field property data type field type data property
Answers: 1
You know the right answer?
The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to show the tra...

Questions in other subjects:

Konu
Mathematics, 22.01.2021 05:20