COMP 590-096 Fall 2011

Assignment 4

Due date: Saturday, December 10, 11:59 PM -- No extensions!

Markov Decision Processes and Reinforcement Learning

Consider the following environment, similar to the one in Section 17.1 of the textbook:

[Figure: the grid environment for this assignment (not reproduced here)]

As in the textbook, the transition model is as follows: the intended outcome occurs with probability 0.8, and with probability 0.1 each, the agent moves at a right angle to the intended direction (to its left or to its right). If the move would make the agent walk into a wall, the agent stays in the same place. The reward for each non-terminal state is -0.04.
  1. Assuming the transition model and reward function listed above are known, find the optimal policy and the utilities of all the states using value iteration or policy iteration. Display the optimal policy and the utilities of all the states, and plot the utility estimates as a function of the number of iterations, as in Figure 17.5(a) (for value iteration, you should need no more than 50 iterations to reach convergence). In this question and the next one, use a discount factor of 0.99. Here are some reference utility values to help you check the correctness of your answer. (A value-iteration sketch appears after this list.)

  2. Now consider the reinforcement learning scenario in which the transition model and the reward function are unknown to the agent, but the agent can observe the outcome of its actions and the reward received in any state. (In the implementation, this means that the successor states and rewards are given to the agent by black-box functions whose parameters the agent cannot access.) Use TD Q-learning, as in Section 21.3, to learn an action-utility function. Experiment with different parameters for the exploration function f defined in Section 21.3 and in the slides, and report which choice works best. For the learning-rate function, start with alpha(t) = 60/(59+t) and experiment with the constants to get the best behavior. (A Q-learning sketch appears after this list.)

    In the report, plot utility estimates and their RMS error as a function of the number of trials (you will probably need several thousand), as in Figure 21.7. RMS stands for root-mean-square error:

    RMS error = sqrt( (1/N) * sum over all states s of (U'(s) - U(s))^2 )

    where U'(s) is the estimated utility of state s, U(s) is the "true" utility as determined by value iteration, and N is the number of states.
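
A minimal value-iteration sketch for part 1 follows. It is illustrative only: the 4x3 layout, the wall, and the terminal rewards are borrowed from the textbook's Section 17.1 world as a stand-in for the assignment's environment, and the helper names (move, transitions, value_iteration) are arbitrary.

    GAMMA = 0.99
    ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
    WIDTH, HEIGHT = 4, 3                      # assumed layout; adjust to your maze
    WALLS = {(1, 1)}
    TERMINALS = {(3, 2): 1.0, (3, 1): -1.0}   # terminal states and their rewards
    STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT) if (x, y) not in WALLS]

    def move(state, delta):
        """Apply a move; stay put when blocked by a wall or the grid edge."""
        nxt = (state[0] + delta[0], state[1] + delta[1])
        if nxt in WALLS or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
            return state
        return nxt

    def transitions(state, action):
        """Stochastic model: 0.8 intended direction, 0.1 for each right angle."""
        dx, dy = ACTIONS[action]
        return [(0.8, move(state, (dx, dy))),
                (0.1, move(state, (dy, dx))),
                (0.1, move(state, (-dy, -dx)))]

    def reward(state):
        return TERMINALS.get(state, -0.04)

    def value_iteration(eps=1e-6):
        U = {s: 0.0 for s in STATES}
        while True:
            delta, U_new = 0.0, {}
            for s in STATES:
                if s in TERMINALS:
                    U_new[s] = reward(s)
                else:
                    U_new[s] = reward(s) + GAMMA * max(
                        sum(p * U[s2] for p, s2 in transitions(s, a)) for a in ACTIONS)
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta < eps:
                return U

The optimal policy can then be read off by choosing, in each non-terminal state, the action that maximizes the expected utility of its successors; recording U after each sweep gives the data for the Figure 17.5(a)-style plot.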

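A minimal sketch of the TD Q-learning agent for part 2 is below, under assumed interfaces: env_reset() and env_step(state, action) stand in for the black-box functions (env_step is assumed to return the next state, the reward received, and a flag marking a terminal state), and the constants R_PLUS, N_E, and num_trials are placeholders to tune.

    from collections import defaultdict

    GAMMA = 0.99
    ACTIONS = ['N', 'S', 'E', 'W']
    R_PLUS = 2.0    # optimistic reward estimate used by the exploration function
    N_E = 20        # try each (state, action) pair at least this many times

    def alpha(t):
        """Learning-rate schedule suggested in the handout; tune the constants."""
        return 60.0 / (59.0 + t)

    def explore(q_value, n_visits):
        """Exploration function f(u, n): optimistic until tried N_E times."""
        return R_PLUS if n_visits < N_E else q_value

    def q_learning(env_reset, env_step, num_trials=5000):
        Q = defaultdict(float)   # Q[(state, action)]
        N = defaultdict(int)     # visit counts per (state, action)
        for _ in range(num_trials):
            state, done = env_reset(), False
            while not done:
                # Choose the action that maximizes the exploration function.
                action = max(ACTIONS, key=lambda a: explore(Q[(state, a)], N[(state, a)]))
                next_state, r, done = env_step(state, action)
                N[(state, action)] += 1
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
                # TD Q-learning update.
                Q[(state, action)] += alpha(N[(state, action)]) * (
                    r + GAMMA * best_next - Q[(state, action)])
                state = next_state
        return Q

For the plots, record U'(s) = max over a of Q[(s, a)] after each trial and compute the RMS error against the value-iteration utilities.
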
For extra credit

  • Design a more complicated maze environment of your own and run the algorithms from parts 1 and 2 on it. How do the number of states and the complexity of the environment affect convergence? How complex can you make the environment while still being able to learn the right policy? It might also be interesting to revisit some of the mazes from Assignment 1.

  • Experiment with any advanced methods from Chapters 17 and 21 of the textbook. For example, instead of Q-learning, try to use an explicit state-based representation and learn the transition function of the environment.

More extra credit

As discussed in class, feel free to turn in any extensions to previous assignments, or solutions to parts of those assignments that you weren't able to complete before. If you have other ideas for extra credit, please contact me. Turn in any additional extra credit separately (see below), and in your report, clearly indicate the extent of your extra work beyond anything you have already turned in for a previous assignment.

Submission Instructions

As usual, you will submit through Blackboard. For assignment 4, turn in your report in PDF format named lastname_firstname_assignment4.pdf and your source code in ZIP format, named lastname_firstname_assignment4.zip. Turn in any extra credit (not related to this assignment) separately in two files named lastname_firstname_extra_credit.pdf and lastname_firstname_extra_credit.zip.

There will be absolutely no extensions! Late submissions will receive zero credit.

Academic integrity: As usual, feel free to discuss the assignment with each other in general terms, and to search the Web for general guidance (not for complete solutions). Coding should be done individually. If you make substantial use of some code snippets or information from outside sources, be sure to acknowledge the sources in your report.