COMP 590-096 Fall 2011
Assignment 4
Due date: Saturday, December 10, 11:59 PM -- No extensions!
Markov Decision Processes and Reinforcement Learning
Consider the following environment, similar to the one in Section 17.1 of the textbook:

As in the textbook, the transition model is as follows: the intended outcome occurs
with probability 0.8, and with probability 0.1 each, the agent moves in one of the two
directions perpendicular to the intended one. If the move would take the agent into a wall,
the agent stays in place. The reward for every non-terminal state is -0.04.
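Concretely, the transition model can be simulated as in the sketch below. This is only an illustration, not starter code: the coordinate convention, grid dimensions, and wall cells are assumptions that you should replace with those of the actual environment in the figure.

    import random

    # Unit vectors for the four moves, assuming (x, y) coordinates with y increasing upward.
    ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
    LEFT_OF = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
    RIGHT_OF = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

    def attempt_move(state, direction, walls, width, height):
        """Deterministic move; hitting a wall cell or the grid boundary leaves the agent in place."""
        dx, dy = ACTIONS[direction]
        x, y = state
        nxt = (x + dx, y + dy)
        if nxt in walls or not (1 <= nxt[0] <= width and 1 <= nxt[1] <= height):
            return state
        return nxt

    def sample_successor(state, action, walls, width, height):
        """Intended direction with probability 0.8; each perpendicular direction with probability 0.1."""
        r = random.random()
        if r < 0.8:
            direction = action
        elif r < 0.9:
            direction = LEFT_OF[action]
        else:
            direction = RIGHT_OF[action]
        return attempt_move(state, direction, walls, width, height)

For the planning part (question 1) you would enumerate the three possible outcomes and their probabilities explicitly rather than sampling; sampling as above is essentially what the black-box environment does in the reinforcement learning part (question 2).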
1. Assuming the known transition model and reward function listed above, find the optimal policy
and the utilities of all the states using value iteration or policy iteration (a value-iteration
sketch appears below, after question 2). Display the optimal policy and the utilities of all the
states, and plot the utility estimates as a function of the number of iterations, as in
Figure 17.5(a) (for value iteration, you should need no more than 50 iterations to reach
convergence). In this question and the next one, use a discount factor of 0.99.
Here are some reference utility values to help you check the correctness of your answer.
2. Now consider the reinforcement learning scenario in which the transition model and
the reward function are unknown to the agent, but the agent can observe the outcome of its
actions and the reward received in any state. (In the implementation, this means that the
successor states and rewards will be given to the agent by black-box functions whose
parameters the agent does not have access to.) Use TD Q-learning, as in Section 21.3,
to learn an action-utility function; a Q-learning sketch appears below, after this question.
Experiment with different parameters for the exploration function f as defined in Section 21.3
and in the slides, and report which choice works best. For the learning rate function, start
with alpha(t) = 60/(59+t), and adjust the constants to get the best behavior.
In the report, plot the utility estimates and their RMS error as a function of
the number of trials (you will probably need several thousand), as in Figure 21.7. RMS stands
for root mean squared error:

    RMS error = sqrt( (1/N) * sum over all states s of (U'(s) - U(s))^2 )

where U'(s) is the estimated utility of state s, U(s) is the "true" utility as determined by
value iteration, and N is the number of states.
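For question 1, here is a minimal value-iteration sketch. The helper names (a list `states`, `actions(s)`, `transitions(s, a)` returning a list of (probability, next_state) pairs, `reward(s)`, and the set `terminals`) are hypothetical, not part of any provided code, so adapt them to your own representation.

    def value_iteration(states, actions, transitions, reward, terminals,
                        gamma=0.99, epsilon=1e-6, max_iters=50):
        """Bellman-update value iteration; returns utilities, a greedy policy,
        and the per-iteration history of utility estimates."""
        U = {s: 0.0 for s in states}
        history = [dict(U)]          # one snapshot per iteration, for the Figure 17.5(a)-style plot
        for _ in range(max_iters):
            U_new = {}
            for s in states:
                if s in terminals:
                    U_new[s] = reward(s)
                else:
                    U_new[s] = reward(s) + gamma * max(
                        sum(p * U[s2] for p, s2 in transitions(s, a)) for a in actions(s))
            delta = max(abs(U_new[s] - U[s]) for s in states)
            U = U_new
            history.append(dict(U))
            if delta < epsilon * (1 - gamma) / gamma:    # standard value-iteration stopping test
                break
        policy = {}
        for s in states:
            if s in terminals:
                policy[s] = None
            else:
                policy[s] = max(actions(s),
                                key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
        return U, policy, history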
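For question 2, the sketch below shows one trial of TD Q-learning with the exploration function f(u, n) = R+ if n < N_e, else u, plus the RMS-error computation from the formula above. The names `next_state` and `reward` stand in for the black-box functions; `actions`, `terminals`, and the default values of R_plus and N_e are assumptions (the exploration parameters are exactly what you are asked to experiment with).

    import math

    def q_learning_trial(next_state, reward, start, actions, terminals, Q, N_sa,
                         gamma=0.99, R_plus=2.0, N_e=5):
        """Run one trial from `start` to a terminal state, updating Q and N_sa in place.
        Q maps (state, action) -> value and N_sa maps (state, action) -> visit count
        (typically collections.defaultdict(float) and defaultdict(int))."""
        def alpha(n):
            return 60.0 / (59.0 + n)            # learning-rate schedule suggested above

        def f(u, n):
            return R_plus if n < N_e else u     # optimistic exploration function from Section 21.3

        s = start
        while s not in terminals:
            a = max(actions(s), key=lambda act: f(Q[(s, act)], N_sa[(s, act)]))
            s2 = next_state(s, a)               # black-box sampled successor
            N_sa[(s, a)] += 1
            if s2 in terminals:
                future = reward(s2)             # a terminal state's utility is its own reward
            else:
                future = max(Q[(s2, act)] for act in actions(s2))
            # TD update: Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha(N_sa[(s, a)]) * (reward(s) + gamma * future - Q[(s, a)])
            s = s2

    def rms_error(Q, U_true, actions, terminals):
        """RMS error between the utilities implied by Q, U'(s) = max_a Q(s, a),
        and the true U(s). For simplicity this averages over non-terminal states only."""
        states = [s for s in U_true if s not in terminals]
        squared = [(max(Q[(s, a)] for a in actions(s)) - U_true[s]) ** 2 for s in states]
        return math.sqrt(sum(squared) / len(squared))

    # Typical driver (hypothetical names; U_true comes from question 1's value iteration):
    #   Q, N_sa = collections.defaultdict(float), collections.defaultdict(int)
    #   errors = []
    #   for trial in range(5000):
    #       q_learning_trial(next_state, reward, start_state, actions, terminals, Q, N_sa)
    #       errors.append(rms_error(Q, U_true, actions, terminals))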
For extra credit
- Design a more complicated maze environment of your own and run the algorithms from
questions 1 and 2 on it. How do the number of states and the complexity of the environment affect
convergence? How complex can you make the environment and still be able to learn the right
policy? It might also be interesting to revisit some of the mazes from Assignment 1.
- Experiment with any advanced methods from Chapters 17 and 21 of the
textbook. For example, instead of Q-learning, try using an explicit state-based representation
and learn the transition function of the environment (a model-learning sketch appears after this list).
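For the second extra-credit item, one possibility (not required by the assignment) is to estimate the transition model from counts of observed outcomes, roughly in the spirit of the adaptive dynamic programming agent in Section 21.2, and then plug the learned model and observed rewards into the value-iteration code from question 1. A minimal sketch with hypothetical names:

    from collections import defaultdict

    class TransitionModelLearner:
        """Maximum-likelihood estimate of P(s' | s, a) from observed transitions."""

        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}

        def observe(self, s, a, s2):
            """Record one observed transition from s under action a to s2."""
            self.counts[(s, a)][s2] += 1

        def probabilities(self, s, a):
            """Return {s': estimated probability}; empty dict if (s, a) has never been tried."""
            outcomes = self.counts[(s, a)]
            total = sum(outcomes.values())
            return {s2: n / total for s2, n in outcomes.items()} if total else {}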
More extra credit
As discussed in class, feel free to turn in any extensions to previous assignments or solutions
to parts of those assignments that you haven't been able to get before. If you have other ideas
for extra credit, please contact me. Turn in any additional extra credit separately (see below),
and in your report, clearly indicate the extent of your extra work beyond anything
you've already turned in for a previous assignment.
Submission Instructions
As usual, you will submit through Blackboard. For assignment 4, turn in your report in PDF format
named lastname_firstname_assignment4.pdf and your source code
in ZIP format, named lastname_firstname_assignment4.zip. Turn in any extra credit (not related
to this assignment) separately in two files named lastname_firstname_extra_credit.pdf and
lastname_firstname_extra_credit.zip.
There will be absolutely no extensions! Late submissions will
receive zero credit.
Academic integrity: As usual, feel free to discuss the assignment with each other in general terms,
and to search the Web for general guidance (not for complete solutions). Coding should be done
individually. If you make substantial use of some code snippets or information from outside sources,
be sure to acknowledge the sources in your report.