Created using Colaboratory

This commit is contained in:
udlbook
2024-03-04 09:43:56 -05:00
parent 9b2b30d4cc
commit 5c0fd0057f


@@ -1,20 +1,4 @@
{ {
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyMWjsdr5SDwyzcDftnehlNo",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [ "cells": [
{ {
"cell_type": "markdown", "cell_type": "markdown",
@@ -28,6 +12,9 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "t9vk9Elugvmi"
},
"source": [ "source": [
"# **Notebook 19.3: Monte-Carlo methods**\n", "# **Notebook 19.3: Monte-Carlo methods**\n",
"\n", "\n",
@@ -37,42 +24,49 @@
"\n", "\n",
"Work through the cells below, running each cell in turn. In various places you will see the words \"TO DO\". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.\n", "Work through the cells below, running each cell in turn. In various places you will see the words \"TO DO\". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.\n",
"\n", "\n",
"Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions." "Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions.\n",
], "\n",
"metadata": { "Thanks to [Akshil Patel](https://www.akshilpatel.com) and [Jessica Nicholson](https://jessicanicholson1.github.io) for their help in preparing this notebook."
"id": "t9vk9Elugvmi" ]
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"source": [ "execution_count": null,
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from PIL import Image"
],
"metadata": { "metadata": {
"id": "OLComQyvCIJ7" "id": "OLComQyvCIJ7"
}, },
"execution_count": null, "outputs": [],
"outputs": [] "source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from PIL import Image\n",
"\n",
"from IPython.display import clear_output\n",
"from time import sleep"
]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZsvrUszPLyEG"
},
"outputs": [],
"source": [ "source": [
"# Get local copies of components of images\n", "# Get local copies of components of images\n",
"!wget https://raw.githubusercontent.com/udlbook/udlbook/main/Notebooks/Chap19/Empty.png\n", "!wget https://raw.githubusercontent.com/udlbook/udlbook/main/Notebooks/Chap19/Empty.png\n",
"!wget https://raw.githubusercontent.com/udlbook/udlbook/main/Notebooks/Chap19/Hole.png\n", "!wget https://raw.githubusercontent.com/udlbook/udlbook/main/Notebooks/Chap19/Hole.png\n",
"!wget https://raw.githubusercontent.com/udlbook/udlbook/main/Notebooks/Chap19/Fish.png\n", "!wget https://raw.githubusercontent.com/udlbook/udlbook/main/Notebooks/Chap19/Fish.png\n",
"!wget https://raw.githubusercontent.com/udlbook/udlbook/main/Notebooks/Chap19/Penguin.png" "!wget https://raw.githubusercontent.com/udlbook/udlbook/main/Notebooks/Chap19/Penguin.png"
], ]
"metadata": {
"id": "ZsvrUszPLyEG"
},
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Gq1HfJsHN3SB"
},
"outputs": [],
"source": [ "source": [
"# Ugly class that takes care of drawing pictures like in the book.\n", "# Ugly class that takes care of drawing pictures like in the book.\n",
"# You can totally ignore this code!\n", "# You can totally ignore this code!\n",
@@ -257,205 +251,281 @@
"\n", "\n",
"\n", "\n",
" plt.show()" " plt.show()"
], ]
"metadata": {
"id": "Gq1HfJsHN3SB"
},
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eBQ7lTpJQBSe"
},
"outputs": [],
"source": [ "source": [
"# We're going to work on the problem depicted in figure 19.10a\n", "# We're going to work on the problem depicted in figure 19.10a\n",
"n_rows = 4; n_cols = 4\n", "n_rows = 4; n_cols = 4\n",
"layout = np.zeros(n_rows * n_cols)\n", "layout = np.zeros(n_rows * n_cols)\n",
"reward_structure = np.zeros(n_rows * n_cols)\n", "reward_structure = np.zeros(n_rows * n_cols)\n",
"layout[9] = 1 ; reward_structure[9] = -2\n", "layout[9] = 1 ; reward_structure[9] = -2 # Hole\n",
"layout[10] = 1; reward_structure[10] = -2\n", "layout[10] = 1; reward_structure[10] = -2 # Hole\n",
"layout[14] = 1; reward_structure[14] = -2\n", "layout[14] = 1; reward_structure[14] = -2 # Hole\n",
"layout[15] = 2; reward_structure[15] = 3\n", "layout[15] = 2; reward_structure[15] = 3 # Fish\n",
"initial_state = 0\n", "initial_state = 0\n",
"mdp_drawer = DrawMDP(n_rows, n_cols)\n", "mdp_drawer = DrawMDP(n_rows, n_cols)\n",
"mdp_drawer.draw(layout, state = initial_state, rewards=reward_structure, draw_state_index = True)" "mdp_drawer.draw(layout, state = initial_state, rewards=reward_structure, draw_state_index = True)"
], ]
"metadata": {
"id": "eBQ7lTpJQBSe"
},
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"source": [
"For clarity, the black numbers are the state number and the red numbers are the reward for being in that state. Note that the states are indexed from 0 rather than 1 as in the book to make the code neater."
],
"metadata": { "metadata": {
"id": "6Vku6v_se2IG" "id": "6Vku6v_se2IG"
} },
"source": [
"For clarity, the black numbers are the state number and the red numbers are the reward for being in that state. Note that the states are indexed from 0 rather than 1 as in the book to make the code neater."
]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {
"id": "Fhc6DzZNOjiC"
},
"source": [ "source": [
"Now let's define the state transition function $Pr(s_{t+1}|s_{t},a)$ in full where $a$ is the actions. Here $a=0$ means try to go upward, $a=1$, right, $a=2$ down and $a=3$ right. However, the ice is slippery, so we don't always go the direction we want to.\n", "Now let's define the state transition function $Pr(s_{t+1}|s_{t},a)$ in full where $a$ is the actions. Here $a=0$ means try to go upward, $a=1$, right, $a=2$ down and $a=3$ right. However, the ice is slippery, so we don't always go the direction we want to.\n",
"\n", "\n",
"Note that as for the states, we've indexed the actions from zero (unlike in the book) so they map to the indices of arrays better" "Note that as for the states, we've indexed the actions from zero (unlike in the book) so they map to the indices of arrays better"
], ]
"metadata": {
"id": "Fhc6DzZNOjiC"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "l7rT78BbOgTi"
},
"outputs": [],
"source": [ "source": [
"transition_probabilities_given_action0 = np.array(\\\n", "transition_probabilities_given_action0 = np.array(\\\n",
"[[0.00 , 0.33, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", "[[0.90, 0.05, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.50 , 0.00, 0.33, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.05, 0.85, 0.05, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.33, 0.00, 0.50, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.05, 0.85, 0.05, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.33, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.05, 0.90, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.50 , 0.00, 0.00, 0.00, 0.00, 0.17, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.05, 0.00, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.34, 0.00, 0.00, 0.25, 0.00, 0.17, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.05, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.34, 0.00, 0.00, 0.17, 0.00, 0.25, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.05, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.50, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.17, 0.00, 0.00, 0.75, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.25, 0.00, 0.17, 0.00, 0.00, 0.50, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.05, 0.00, 0.00, 0.85, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.17, 0.00, 0.25, 0.00, 0.00, 0.50, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.05, 0.00, 0.00, 0.85, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.75 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.10, 0.05, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.25, 0.00, 0.25, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.05, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.25, 0.00, 0.25 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.25, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00]])\n",
"])\n", "\n",
"\n", "\n",
"transition_probabilities_given_action1 = np.array(\\\n", "transition_probabilities_given_action1 = np.array(\\\n",
"[[0.00 , 0.25, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", "[[0.10, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.75 , 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.85, 0.05, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.50, 0.00, 0.50, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.85, 0.05, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.33, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.85, 0.90, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.25 , 0.00, 0.00, 0.00, 0.00, 0.17, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.05, 0.00, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.25, 0.00, 0.00, 0.50, 0.00, 0.17, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.05, 0.00, 0.00, 0.85, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.25, 0.00, 0.00, 0.50, 0.00, 0.33, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.05, 0.00, 0.00, 0.85, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.50, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.33, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.85, 0.85, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.17, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.50, 0.00, 0.17, 0.00, 0.00, 0.25, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.85, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.50, 0.00, 0.33, 0.00, 0.00, 0.25, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.85, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.34, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.50 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.85, 0.85, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.10, 0.05, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.75, 0.00, 0.25, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.85, 0.05, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.50, 0.00, 0.50 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.85, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.34, 0.00, 0.00, 0.50, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.85, 0.00]])\n",
"])\n", "\n",
"\n", "\n",
"transition_probabilities_given_action2 = np.array(\\\n", "transition_probabilities_given_action2 = np.array(\\\n",
"[[0.00 , 0.25, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", "[[0.10, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.25 , 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.05, 0.05, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.25, 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.05, 0.05, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.05, 0.10, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.75 , 0.00, 0.00, 0.00, 0.00, 0.17, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.85, 0.00, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.50, 0.00, 0.00, 0.25, 0.00, 0.17, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.85, 0.00, 0.00, 0.05, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.50, 0.00, 0.00, 0.16, 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.85, 0.00, 0.00, 0.05, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.75, 0.00, 0.00, 0.16, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.17, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.25, 0.00, 0.17, 0.00, 0.00, 0.33, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.05, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.16, 0.00, 0.25, 0.00, 0.00, 0.33, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.05, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.16, 0.00, 0.00, 0.00, 0.00, 0.50 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.33, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.90, 0.05, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.50, 0.00, 0.33, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.05, 0.85, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.34, 0.00, 0.50 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.05, 0.85, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.34, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.85, 0.00, 0.00, 0.05, 0.00]])\n",
"])\n",
"\n", "\n",
"transition_probabilities_given_action3 = np.array(\\\n", "transition_probabilities_given_action3 = np.array(\\\n",
"[[0.00 , 0.25, 0.00, 0.00, 0.33, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", "[[0.90, 0.85, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.50 , 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.05, 0.05, 0.85, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.50, 0.00, 0.75, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.05, 0.05, 0.85, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.05, 0.10, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.50 , 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.33, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.05, 0.00, 0.00, 0.00, 0.85, 0.85, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.25, 0.00, 0.00, 0.33, 0.00, 0.50, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.85, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.50, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.85, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.34, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.85, 0.85, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.33, 0.00, 0.50, 0.00, 0.00, 0.25, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.85, 0.00, 0.00, 0.05, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.17, 0.00, 0.50, 0.00, 0.00, 0.25, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00, 0.85, 0.00, 0.00, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.17, 0.00, 0.00, 0.00, 0.00, 0.25 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.34, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.90, 0.85, 0.00, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.50, 0.00, 0.50, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.05, 0.85, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.16, 0.00, 0.00, 0.25, 0.00, 0.75 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.05, 0.00],\n",
" [0.00 , 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.25, 0.00, 0.00, 0.25, 0.00 ],\n", " [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.05, 0.00]])\n",
"])\n", "\n",
"\n",
"\n", "\n",
"# Store all of these in a three dimension array\n", "# Store all of these in a three dimension array\n",
"# Pr(s_{t+1}=2|s_{t}=1, a_{t}=3] is stored at position [2,1,3]\n", "# Pr(s_{t+1}=2|s_{t}=1, a_{t}=3] is stored at position [2,1,3]\n",
"transition_probabilities_given_action = np.concatenate((np.expand_dims(transition_probabilities_given_action0,2),\n", "transition_probabilities_given_action = np.concatenate((np.expand_dims(transition_probabilities_given_action0,2),\n",
" np.expand_dims(transition_probabilities_given_action1,2),\n", " np.expand_dims(transition_probabilities_given_action1,2),\n",
" np.expand_dims(transition_probabilities_given_action2,2),\n", " np.expand_dims(transition_probabilities_given_action2,2),\n",
" np.expand_dims(transition_probabilities_given_action3,2)),axis=2)" " np.expand_dims(transition_probabilities_given_action3,2)),axis=2)\n",
], "\n",
"metadata": { "print('Grid Size:', len(transition_probabilities_given_action[0]))\n",
"id": "l7rT78BbOgTi" "print()\n",
"print('Transition Probabilities for when next state = 2:')\n",
"print(transition_probabilities_given_action[2])\n",
"print()\n",
"print('Transitions Probabilities for when next state = 2 and current state = 1')\n",
"print(transition_probabilities_given_action[2][1])\n",
"print()\n",
"print('Transitions Probabilities for when next state = 2 and current state = 1 and action = 3 (Left):')\n",
"print(transition_probabilities_given_action[2][1][3])"
]
}, },
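Matrices like these are easy to get wrong by hand. A quick sanity check (a sketch using a hypothetical 3-state toy matrix rather than the 16-state grids above) is that every column must be a valid probability distribution over next states, since entry `[s_next, s_curr]` holds $Pr(s_{t+1}=s_{next}|s_t=s_{curr})$:

```python
import numpy as np

# Hypothetical 3-state toy matrix in the same convention as above:
# entry [s_next, s_curr] = Pr(s_{t+1} = s_next | s_t = s_curr)
toy_transitions = np.array([[0.8, 0.1, 0.0],
                            [0.2, 0.8, 0.1],
                            [0.0, 0.1, 0.9]])

# Every column must sum to one: the next state is certain to be *some* state
assert np.allclose(toy_transitions.sum(axis=0), 1.0)

# Sampling one transition from current state 1, mirroring the
# np.random.choice usage in the step function later in the notebook
rng = np.random.default_rng(0)
next_state = rng.choice(3, p=toy_transitions[:, 1])
print(next_state)  # one of 0, 1, 2
```

The same `sum(axis=0)` check can be applied to each of the four 16x16 matrices above before stacking them.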
"execution_count": null, {
"outputs": [] "cell_type": "markdown",
"metadata": {
"id": "BHWjp6Qq4tBF"
},
"source": [
"## Implementation Details\n",
"\n",
"We provide the following methods:\n",
"\n",
"- **`markov_decision_process_step_stochastic`** - this function selects an action based on the stochastic policy for the current state, updates the state based on the transition probabilities associated with the chosen action, and returns the new state, the reward obtained for the new state, the chosen action, and whether the episode terminates.\n",
"\n",
"- **`get_one_episode`** - this function simulates an episode of agent-environment interaction. It returns the states, rewards, and actions seen in that episode, which we can then use to update the agent.\n",
"\n",
"- **`calculate_returns`** - this function calls on the **`calculate_return`** function that computes the discounted sum of rewards from a specific step, in a sequence of rewards.\n",
"\n",
"You have to implement the following methods:\n",
"\n",
"- **`deterministic_policy_to_epsilon_greedy`** - given a deterministic policy, where one action is chosen per state, this function computes the $\\epsilon$-greedy version of that policy, where each of the four actions has some nonzero probability of being selected per state. In each state, the probability of selecting each of the actions should sum to 1.\n",
"\n",
"- **`update_policy_mc`** - this function updates the action-value function using the Monte Carlo method. We use the rollout trajectories collected using `get_one_episode` to calculate the returns. Then update the action values towards the Monte Carlo estimate of the return for each state."
]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "akjrncMF-FkU"
},
"outputs": [],
"source": [ "source": [
"# This takes a single step from an MDP\n", "# This takes a single step from an MDP\n",
"def markov_decision_process_step_stochastic(state, transition_probabilities_given_action, reward_structure, stochastic_policy):\n", "def markov_decision_process_step_stochastic(state, transition_probabilities_given_action, reward_structure, terminal_states, stochastic_policy):\n",
" # Pick action\n", " # Pick action\n",
" action = np.random.choice(a=np.arange(0,4,1),p=stochastic_policy[:,state])\n", " action = np.random.choice(a=np.arange(0,4,1),p=stochastic_policy[:,state])\n",
"\n",
" # Update the state\n", " # Update the state\n",
" new_state = np.random.choice(a=np.arange(0,transition_probabilities_given_action.shape[0]),p = transition_probabilities_given_action[:,state,action])\n", " new_state = np.random.choice(a=np.arange(0,transition_probabilities_given_action.shape[0]),p = transition_probabilities_given_action[:,state,action])\n",
" # Return the reward\n", " # Return the reward\n",
" reward = reward_structure[new_state]\n", " reward = reward_structure[new_state]\n",
" is_terminal = new_state in [terminal_states]\n",
"\n", "\n",
" return new_state, reward, action" " return new_state, reward, action, is_terminal"
], ]
"metadata": {
"id": "akjrncMF-FkU"
},
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "code", "cell_type": "code",
"source": [ "execution_count": null,
"# Run one episode and return actions, rewards, returns\n",
"def get_one_episode(initial_state, transition_probabilities_given_action, reward_structure, stochastic_policy):\n",
"\n",
" max_steps = 1000\n",
" states = np.zeros(max_steps, dtype='uint8') ;\n",
" rewards = np.zeros(max_steps) ;\n",
" actions = np.zeros(max_steps, dtype='uint8') ;\n",
"\n",
" t = 0\n",
" states[t] = initial_state\n",
" # While haven't reached maximum number of steps\n",
" while t< max_steps:\n",
" # Keep stepping through MDP\n",
" states[t+1],rewards[t+1],actions[t] = markov_decision_process_step_stochastic(states[t], transition_probabilities_given_action, reward_structure, stochastic_policy)\n",
" # If we reach te:rminal state then quit\n",
" if states[t]==15:\n",
" break;\n",
" t+=1\n",
"\n",
" states = states[:t+1]\n",
" rewards = rewards[:t+1]\n",
" actions = actions[:t+1]\n",
"\n",
" return states, rewards, actions"
],
"metadata": { "metadata": {
"id": "bFYvF9nAloIA" "id": "bFYvF9nAloIA"
}, },
"execution_count": null, "outputs": [],
"outputs": [] "source": [
"# Run one episode and return actions, rewards, returns\n",
"def get_one_episode(initial_state, transition_probabilities_given_action, reward_structure, terminal_states, stochastic_policy):\n",
"\n",
" states = []\n",
" rewards = []\n",
" actions = []\n",
"\n",
" states.append(initial_state)\n",
" state = initial_state\n",
"\n",
" is_terminal = False\n",
" # While we haven't reached a terminal state\n",
" while not is_terminal:\n",
" # Keep stepping through MDP\n",
" state, reward, action, is_terminal = markov_decision_process_step_stochastic(state,\n",
" transition_probabilities_given_action,\n",
" reward_structure,\n",
" terminal_states,\n",
" stochastic_policy)\n",
" states.append(state)\n",
" rewards.append(reward)\n",
" actions.append(action)\n",
"\n",
" states = np.array(states, dtype=\"uint8\")\n",
" rewards = np.array(rewards)\n",
" actions = np.array(actions, dtype=\"uint8\")\n",
"\n",
" # If the episode was terminated early, we need to compute the return differently using r_{t+1} + gamma*V(s_{t+1})\n",
" return states, rewards, actions"
]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qJhOrIId4tBF"
},
"outputs": [],
"source": [
"def visualize_one_episode(states, actions):\n",
" # Define actions for visualization\n",
" acts = ['up', 'right', 'down', 'left']\n",
"\n",
" # Iterate over the states and actions\n",
" for i in range(len(states)):\n",
"\n",
" if i == 0:\n",
" print('Starting State:', states[i])\n",
"\n",
" elif i == len(states)-1:\n",
" print('Episode Done:', states[i])\n",
"\n",
" else:\n",
" print('State', states[i-1])\n",
" a = actions[i]\n",
" print('Action:', acts[a])\n",
" print('Next State:', states[i])\n",
"\n",
" # Visualize the current state using the MDP drawer\n",
" mdp_drawer.draw(layout, state=states[i], rewards=reward_structure, draw_state_index=True)\n",
" clear_output(True)\n",
"\n",
" # Pause for a short duration to allow observation\n",
" sleep(1.5)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_AKwdtQQHzIK"
},
"outputs": [],
"source": [ "source": [
"# Convert deterministic policy (1x16) to an epsilon greedy stochastic policy (4x16)\n", "# Convert deterministic policy (1x16) to an epsilon greedy stochastic policy (4x16)\n",
"def deterministic_policy_to_epsilon_greedy(policy, epsilon=0.1):\n", "def deterministic_policy_to_epsilon_greedy(policy, epsilon=0.2):\n",
" # TODO -- write this function\n", " # TODO -- write this function\n",
" # You should wind up with a 4x16 matrix, with epsilon/3 in every position except the real policy\n", " # You should wind up with a 4x16 matrix, with epsilon/3 in every position except the real policy\n",
" # The columns should sum to one\n", " # The columns should sum to one\n",
@@ -464,27 +534,27 @@
"\n", "\n",
"\n", "\n",
" return stochastic_policy" " return stochastic_policy"
], ]
"metadata": {
"id": "_AKwdtQQHzIK"
},
"execution_count": null,
"outputs": []
}, },
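One possible shape for the "TO DO" above (a sketch, not necessarily the intended solution; the function name is suffixed so it doesn't clash with the notebook's version): put $\epsilon/3$ in every position, then give the greedy action the remaining mass $1-\epsilon$ so each column sums to one.

```python
import numpy as np

def deterministic_policy_to_epsilon_greedy_sketch(policy, epsilon=0.2):
    # policy: length-16 array of greedy action indices (0..3), one per state
    n_actions = 4
    n_states = len(policy)
    # epsilon/3 in every position...
    stochastic_policy = np.full((n_actions, n_states), epsilon / (n_actions - 1))
    # ...except the greedy action, which takes the remaining mass 1 - epsilon
    stochastic_policy[policy.astype(int), np.arange(n_states)] = 1.0 - epsilon
    return stochastic_policy

sp = deterministic_policy_to_epsilon_greedy_sketch(np.zeros(16), epsilon=0.2)
print(sp[:, 0])        # greedy action 0 gets 0.8; the others get 0.2/3 each
print(sp.sum(axis=0))  # every column sums to 1
```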
{ {
"cell_type": "markdown", "cell_type": "markdown",
"source": [
"Let's try generating an episode"
],
"metadata": { "metadata": {
"id": "OhVXw2Favo-w" "id": "OhVXw2Favo-w"
} },
"source": [
"Let's try generating an episode"
]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5zQ1Oh9Zvnwt"
},
"outputs": [],
"source": [ "source": [
"# Set seed so random numbers always the same\n", "# Set seed so random numbers always the same\n",
"np.random.seed(0)\n", "np.random.seed(6)\n",
"# Print in compact form\n", "# Print in compact form\n",
"np.set_printoptions(precision=3)\n", "np.set_printoptions(precision=3)\n",
"\n", "\n",
@@ -494,32 +564,55 @@
"# Convert deterministic policy to stochastic\n", "# Convert deterministic policy to stochastic\n",
"stochastic_policy = deterministic_policy_to_epsilon_greedy(policy)\n", "stochastic_policy = deterministic_policy_to_epsilon_greedy(policy)\n",
"\n", "\n",
"print(\"Initial policy:\")\n", "print(\"Initial Penguin Policy:\")\n",
"print(policy)\n", "print(policy)\n",
"print()\n",
"print('Stochastic Penguin Policy:')\n",
"print(stochastic_policy)\n",
"print()\n",
"\n", "\n",
"initial_state = 5\n", "initial_state = 5\n",
"states, rewards, actions = get_one_episode(initial_state,transition_probabilities_given_action, reward_structure, stochastic_policy)" "terminal_states=[15]\n",
], "states, rewards, actions = get_one_episode(initial_state,transition_probabilities_given_action, reward_structure, terminal_states, stochastic_policy)\n",
"metadata": { "\n",
"id": "5zQ1Oh9Zvnwt" "print('Initial Penguin Position:')\n",
}, "mdp_drawer.draw(layout, state = initial_state, rewards=reward_structure, draw_state_index = True)\n",
"execution_count": null, "\n",
"outputs": [] "print('Total steps to termination:', len(states))\n",
}, "print('Final Reward:', np.sum(rewards))"
{ ]
"cell_type": "markdown",
"source": [
"We'll need to calculate the returns (discounted cumulative reward) for each state action pair"
],
"metadata": {
"id": "nl6rtNffwhcU"
}
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KJH-UGKk4tBF"
},
"outputs": [],
"source": [
"#this visualizes the complete episode\n",
"visualize_one_episode(states, actions)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nl6rtNffwhcU"
},
"source": [
"We'll need to calculate the returns (discounted cumulative reward) for each state action pair"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FxrItqGPLTq7"
},
"outputs": [],
"source": [ "source": [
"def calculate_returns(rewards, gamma):\n", "def calculate_returns(rewards, gamma):\n",
" returns = np.zeros_like(rewards)\n", " returns = np.zeros(len(rewards))\n",
" for c_return in range(len(returns)):\n", " for c_return in range(len(returns)):\n",
" returns[c_return] = calculate_return(rewards[c_return:], gamma)\n", " returns[c_return] = calculate_return(rewards[c_return:], gamma)\n",
" return returns\n", " return returns\n",
@@ -529,26 +622,26 @@
" for i in range(len(rewards)):\n", " for i in range(len(rewards)):\n",
" return_val += rewards[i] * np.power(gamma, i)\n", " return_val += rewards[i] * np.power(gamma, i)\n",
" return return_val" " return return_val"
], ]
"metadata": {
"id": "FxrItqGPLTq7"
},
"execution_count": null,
"outputs": []
}, },
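As a quick sanity check on the returns computation (the two functions are restated here so the snippet is self-contained), the returns of an episode should satisfy the recursion G_t = r_t + gamma * G_{t+1}:

```python
import numpy as np

def calculate_return(rewards, gamma):
    # Discounted sum: r_0 + gamma*r_1 + gamma^2*r_2 + ...
    return sum(r * gamma**i for i, r in enumerate(rewards))

def calculate_returns(rewards, gamma):
    # Return G_t for every time step t of the episode
    return np.array([calculate_return(rewards[t:], gamma) for t in range(len(rewards))])

rewards = [0.0, 0.0, 1.0]   # a made-up three-step episode with reward only at the end
gamma = 0.9
returns = calculate_returns(rewards, gamma)
print(returns)  # G_0 = 0.81, G_1 = 0.9, G_2 = 1.0
# Recursive consistency: each return equals the immediate reward plus gamma times the next return
assert np.allclose(returns[:-1], np.array(rewards[:-1]) + gamma * returns[1:])
```

Note that this recomputes each suffix sum from scratch, which is O(n^2) in the episode length; computing the returns backwards from the final step would be O(n), but the quadratic version matches the notebook's structure and is fine at this scale.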
{ {
"cell_type": "markdown", "cell_type": "markdown",
"source": [
"This routine does the main work of the Monte Carlo method. We repeatedly rollout episods, calculate the returns. Then we figure out the average return for each state action pair, and choose the next policy as the action that has greatest state action value at each state."
],
"metadata": { "metadata": {
"id": "DX1KfHRhzUOU" "id": "DX1KfHRhzUOU"
} },
"source": [
"This routine does the main work of the on-policy Monte Carlo method. We repeatedly rollout episods, calculate the returns. Then we figure out the average return for each state action pair, and choose the next policy as the action that has greatest state action value at each state."
]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hCghcKlOJXSM"
},
"outputs": [],
"source": [ "source": [
"def update_policy_mc(initial_state, transition_probabilities_given_action, reward_structure, stochastic_policy, gamma, n_rollouts=1):\n", "def update_policy_mc(initial_state, transition_probabilities_given_action, reward_structure, terminal_states, stochastic_policy, gamma, n_rollouts=1):\n",
" # Create two matrices to store total returns for each action/state pair and the\n", " # Create two matrices to store total returns for each action/state pair and the\n",
" # number of times we observed that action/state pair\n", " # number of times we observed that action/state pair\n",
" n_state = transition_probabilities_given_action.shape[0]\n", " n_state = transition_probabilities_given_action.shape[0]\n",
@@ -574,18 +667,18 @@
" state_action_values = state_action_returns_total/( state_action_count+0.00001)\n", " state_action_values = state_action_returns_total/( state_action_count+0.00001)\n",
" policy = np.argmax(state_action_values, axis=0).astype(int)\n", " policy = np.argmax(state_action_values, axis=0).astype(int)\n",
" return policy, state_action_values\n" " return policy, state_action_values\n"
], ]
"metadata": {
"id": "hCghcKlOJXSM"
},
"execution_count": null,
"outputs": []
}, },
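The policy-improvement step at the end of `update_policy_mc` is just an average followed by an argmax over actions. A minimal standalone illustration (the totals and counts below are made-up numbers, not output from the notebook):

```python
import numpy as np

n_actions, n_states = 2, 3
# Hypothetical accumulated returns and visit counts for each (action, state) pair
totals = np.array([[4.0, 0.0, 3.0],
                   [1.0, 6.0, 0.0]])
counts = np.array([[2, 0, 3],
                   [1, 3, 0]])
# Average return per pair; the small constant avoids division by zero
# for pairs that were never visited (their estimated value stays ~0)
values = totals / (counts + 1e-5)
policy = np.argmax(values, axis=0).astype(int)
print(policy)  # the highest-valued action in each state
```

One consequence of the `+1e-5` trick is that unvisited pairs default to a value near zero, so with negative rewards an unvisited action can look spuriously attractive; the epsilon-greedy exploration is what keeps every pair visited often enough for this not to matter.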
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8jWhDlkaKj7Q"
},
"outputs": [],
"source": [ "source": [
"# Set seed so random numbers always the same\n", "# Set seed so random numbers always the same\n",
"np.random.seed(3)\n", "np.random.seed(0)\n",
"# Print in compact form\n", "# Print in compact form\n",
"np.set_printoptions(precision=3)\n", "np.set_printoptions(precision=3)\n",
"\n", "\n",
@@ -597,32 +690,60 @@
"mdp_drawer = DrawMDP(n_rows, n_cols)\n", "mdp_drawer = DrawMDP(n_rows, n_cols)\n",
"mdp_drawer.draw(layout, policy = policy, rewards = reward_structure)\n", "mdp_drawer.draw(layout, policy = policy, rewards = reward_structure)\n",
"\n", "\n",
"\n", "terminal_states = [15]\n",
"n_policy_update = 5\n", "# Track all the policies so we can visualize them later\n",
"all_policies = []\n",
"n_policy_update = 2000\n",
"for c_policy_update in range(n_policy_update):\n", "for c_policy_update in range(n_policy_update):\n",
" # Convert policy to stochastic\n", " # Convert policy to stochastic\n",
" stochastic_policy = deterministic_policy_to_epsilon_greedy(policy)\n", " stochastic_policy = deterministic_policy_to_epsilon_greedy(policy)\n",
" # Update policy by Monte Carlo method\n", " # Update policy by Monte Carlo method\n",
" policy, state_action_values = update_policy_mc(initial_state, transition_probabilities_given_action, reward_structure, stochastic_policy, gamma, n_rollouts=1000)\n", " policy, state_action_values = update_policy_mc(initial_state, transition_probabilities_given_action, reward_structure, terminal_states, stochastic_policy, gamma, n_rollouts=100)\n",
" all_policies.append(policy)\n",
"\n",
" # Print out 10 snapshots of progress\n",
" if (c_policy_update % (n_policy_update//10) == 0) or c_policy_update == n_policy_update - 1:\n",
" print(\"Updated policy\")\n", " print(\"Updated policy\")\n",
" print(policy)\n", " print(policy)\n",
" mdp_drawer = DrawMDP(n_rows, n_cols)\n", " mdp_drawer = DrawMDP(n_rows, n_cols)\n",
" mdp_drawer.draw(layout, policy = policy, rewards = reward_structure, state_action_values=state_action_values)\n" " mdp_drawer.draw(layout, policy = policy, rewards = reward_structure, state_action_values=state_action_values)\n",
], "\n",
"metadata": { "\n"
"id": "8jWhDlkaKj7Q" ]
},
"execution_count": null,
"outputs": []
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"source": [
"You can see that the results are quite noisy, but there is a definite improvement from the initial policy."
],
"metadata": { "metadata": {
"id": "j7Ny47kTEMzH" "id": "j7Ny47kTEMzH"
} },
} "source": [
"You can see a definite improvement to the policy"
] ]
}
],
"metadata": {
"colab": {
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
} }