Mathematics, 30.03.2021 19:40 zafyafimli

Consider an MDP with 3 states, A, B and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function of the MDP; instead, we are given samples of what an agent experiences when it interacts with the environment. We do know, however, that the agent never remains in the same state after taking an action. In this problem, we first estimate the model (the transition function and the reward function) and then use the estimated model to find the optimal actions. To find the optimal actions, model-based RL computes the optimal V or Q value function with respect to the estimated T and R. This can be done with value iteration, policy iteration, or Q-value iteration. Last week you already solved exercises involving value iteration and policy iteration, so this exercise uses Q-value iteration.
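For reference, the Q-value iteration update with respect to an estimated model can be written as below. This is the standard textbook form, not something stated in the question itself; the hat notation for the estimated T and R and the iteration index k are assumptions introduced here, and γ is the discount factor. Iteration typically starts from Q_0(s, a) = 0 for all s and a.

```latex
Q_{k+1}(s, a) = \sum_{s'} \hat{T}(s, a, s') \left[ \hat{R}(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
```

Once the values have converged, the optimal action in state s is the one that maximizes Q(s, a).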
Consider the following samples that the agent encountered.
s   a                  s'   r
A   Clockwise          B     0.0
A   Clockwise          B     0.0
A   Clockwise          B     0.0
A   Clockwise          C   -10.0
A   Clockwise          C   -10.0
B   Clockwise          A    -3.0
B   Clockwise          A    -3.0
B   Clockwise          A    -3.0
B   Clockwise          A    -3.0
B   Clockwise          C     0.0
C   Clockwise          A     0.0
C   Clockwise          A     0.0
C   Clockwise          A     0.0
C   Clockwise          B     6.0
C   Clockwise          B     6.0
A   Counterclockwise   C    -8.0
B   Counterclockwise   A   -10.0
C   Counterclockwise   B    -8.0
A   Counterclockwise   C    -8.0
B   Counterclockwise   A   -10.0
C   Counterclockwise   B    -8.0
C   Counterclockwise   B    -8.0
B   Counterclockwise   A   -10.0
A   Counterclockwise   B     0.0
A   Counterclockwise   B     0.0
B   Counterclockwise   A   -10.0
C   Counterclockwise   A     0.0
B   Counterclockwise   C     0.0
A   Counterclockwise   C    -8.0
C   Counterclockwise   B    -8.0
We start by estimating the transition function, T(s, a, s'), and the reward function, R(s, a, s'), for this MDP. Fill in the missing values in the following table for T(s, a, s') and R(s, a, s').

Discount factor, γ = 0.5

s   a                  s'   T(s, a, s')   R(s, a, s')
A   Clockwise          B    M             N
A   Clockwise          C    O             P
A   Counterclockwise   B    0.400           0.000
A   Counterclockwise   C    0.600          -8.000
B   Clockwise          A    0.800          -3.000
B   Clockwise          C    0.200           0.000
B   Counterclockwise   A    0.800         -10.000
B   Counterclockwise   C    0.200           0.000
C   Clockwise          A    0.600           0.000
C   Clockwise          B    0.400           6.000
C   Counterclockwise   A    0.200           0.000
C   Counterclockwise   B    0.800          -8.000
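One way to obtain the table entries is by counting: estimate T(s, a, s') as the fraction of samples taken from (s, a) that land in s', and R(s, a, s') as the average reward observed on that transition. The sketch below is a minimal illustration, not an official solution. The function name `estimate_model` is hypothetical, and the code uses only the fifteen Counterclockwise samples from the question, so its output can be checked against the rows of the table that are already filled in.

```python
from collections import defaultdict

CCW = "Counterclockwise"

# Counterclockwise samples (s, a, s', r) as listed in the question.
samples = [
    ("A", CCW, "C", -8.0), ("B", CCW, "A", -10.0), ("C", CCW, "B", -8.0),
    ("A", CCW, "C", -8.0), ("B", CCW, "A", -10.0), ("C", CCW, "B", -8.0),
    ("C", CCW, "B", -8.0), ("B", CCW, "A", -10.0), ("A", CCW, "B", 0.0),
    ("A", CCW, "B", 0.0), ("B", CCW, "A", -10.0), ("C", CCW, "A", 0.0),
    ("B", CCW, "C", 0.0), ("A", CCW, "C", -8.0), ("C", CCW, "B", -8.0),
]


def estimate_model(samples):
    """Empirical estimates of T and R from (s, a, s', r) samples."""
    counts = defaultdict(int)         # (s, a, s') -> number of occurrences
    totals = defaultdict(int)         # (s, a)     -> number of samples
    reward_sums = defaultdict(float)  # (s, a, s') -> sum of observed rewards
    for s, a, s_next, r in samples:
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        reward_sums[(s, a, s_next)] += r
    # T: relative frequency of s' among the samples taken from (s, a).
    T = {key: n / totals[key[:2]] for key, n in counts.items()}
    # R: average reward observed on each (s, a, s') transition.
    R = {key: reward_sums[key] / n for key, n in counts.items()}
    return T, R


T, R = estimate_model(samples)
for key in sorted(T):
    print(key, f"T={T[key]:.3f}", f"R={R[key]:.3f}")
```

The same function applies unchanged to the Clockwise samples, which is how the missing entries for state A can be computed.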

