
In Problem 5.12, we were able to reduce the CPE for the prefix-sum computation to 3.00, limited by the latency of floating-point addition on this machine. Simple loop unrolling does not improve things. Using a combination of loop unrolling and reassociation, write code for a prefix sum that achieves a CPE less than the latency of floating-point addition on your machine. Doing this requires actually increasing the number of additions performed. For example, our version with two-way unrolling requires three additions per iteration, while our version with four-way unrolling requires five. Our best implementation achieves a CPE of 1.67 on our reference machine.
Determine how the throughput and latency limits of your machine limit the minimum CPE you can achieve for the prefix-sum operation.
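The two-way case described in the question can be sketched as follows. This is only an illustrative sketch, not the textbook's solution: the function name `psum_2x_reassoc` and the use of `float` are assumptions. It performs three additions per two elements, but only one of them sits on the loop-carried dependency chain through `last`.

```c
#include <stddef.h>

/* Hypothetical sketch: prefix sum with two-way unrolling and reassociation.
 * Computes p[i] = a[0] + ... + a[i] for 0 <= i < n. */
void psum_2x_reassoc(const float a[], float p[], long n)
{
    long i;
    float last;

    if (n <= 0)
        return;

    last = a[0];
    p[0] = last;

    /* Process two elements per iteration. */
    for (i = 1; i < n - 1; i += 2) {
        float pair = a[i] + a[i + 1]; /* independent of last: off the critical path */
        p[i] = last + a[i];           /* reads last but does not feed the next iteration */
        last = last + pair;           /* the only addition on the loop-carried chain */
        p[i + 1] = last;
    }

    /* Finish any remaining element. */
    for (; i < n; i++) {
        last += a[i];
        p[i] = last;
    }
}
```

Because the loop-carried chain now holds one addition per two elements, the latency bound on that chain drops to about half the addition latency per element. The extra work sets a throughput floor instead: three additions per two elements, divided by however many floating-point additions your machine can issue per cycle, which you would need to check for your processor. Note that reassociating floating-point additions can change rounding, so results may differ slightly from the sequential version.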

