From 5b845f7d7b3334375584f8339d9f5d3df5973ce8 Mon Sep 17 00:00:00 2001
From: udlbook <110402648+udlbook@users.noreply.github.com>
Date: Wed, 2 Aug 2023 18:15:44 -0400
Subject: [PATCH] Created using Colaboratory
---
.../Chap12/12_4_Decoding_Strategies.ipynb | 648 ++++++++++++++++++
1 file changed, 648 insertions(+)
create mode 100644 Notebooks/Chap12/12_4_Decoding_Strategies.ipynb
diff --git a/Notebooks/Chap12/12_4_Decoding_Strategies.ipynb b/Notebooks/Chap12/12_4_Decoding_Strategies.ipynb
new file mode 100644
index 0000000..c296764
--- /dev/null
+++ b/Notebooks/Chap12/12_4_Decoding_Strategies.ipynb
@@ -0,0 +1,648 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "authorship_tag": "ABX9TyNPrHfkLWjy3NfDHRhGG3IE",
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
 + "<a href=\"https://colab.research.google.com/github/udlbook/udlbook/blob/main/Notebooks/Chap12/12_4_Decoding_Strategies.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# **Notebook 12.4: Decoding strategies**\n",
+ "\n",
 + "This practical investigates decoding strategies for transformer language models.\n",
+ "\n",
+ "Work through the cells below, running each cell in turn. In various places you will see the words \"TO DO\". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.\n",
+ "\n",
+ "Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions."
+ ],
+ "metadata": {
+ "id": "RnIUiieJWu6e"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install transformers"
+ ],
+ "metadata": {
+ "id": "7abjZ9pMVj3k"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed\n",
+ "import torch\n",
+ "import torch.nn.functional as F\n",
+ "import numpy as np"
+ ],
+ "metadata": {
+ "id": "sMOyD0zem2Ef"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Load model and tokenizer\n",
+ "model = GPT2LMHeadModel.from_pretrained('gpt2')\n",
+ "tokenizer = GPT2Tokenizer.from_pretrained('gpt2')"
+ ],
+ "metadata": {
+ "id": "pZgfxbzKWNSR"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Decoding from GPT2\n",
+ "\n",
 + "This tutorial investigates how to use GPT2 (the forerunner of GPT3) to generate text. There are a number of ways to do this that trade off the realism of the text against the amount of variation.\n",
+ "\n",
 + "At every stage, GPT2 takes an input string and returns a probability for each of the possible subsequent tokens. We can choose what to do with these probabilities. We could always *greedily choose* the most likely next token, or we could draw a *sample* randomly according to the probabilities. There are also intermediate strategies, such as *top-k sampling* and *nucleus sampling*, that introduce a controlled amount of randomness.\n",
+ "\n",
 + "We'll also investigate *beam search* -- the idea is that rather than greedily taking the best next token at each stage, we maintain a set of hypotheses (beams) as we add each subsequent token and return the most likely overall hypothesis. This is not necessarily the same result we get from greedily choosing the next token."
+ ],
+ "metadata": {
+ "id": "TfhAGy0TXEvV"
+ }
+ },
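
The beam-search idea described above can be sketched on a toy model. Everything here (the transition matrix, the `beam_search` helper, and its parameters) is a hypothetical stand-in for GPT2's next-token distribution, not part of the notebook:

```python
import numpy as np

# Hypothetical toy model: next-token probabilities depend only on the last token
trans = np.array([[0.1, 0.6, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.3, 0.3, 0.4]])

def beam_search(start_token, n_steps=3, beam_width=2):
    # Each hypothesis (beam) is a (token sequence, log probability) pair
    beams = [([start_token], 0.0)]
    for _ in range(n_steps):
        candidates = []
        for seq, logp in beams:
            # Extend every beam by every possible next token
            for tok, p in enumerate(trans[seq[-1]]):
                candidates.append((seq + [tok], logp + np.log(p)))
        # Keep only the beam_width most likely hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

best_seq, best_logp = beam_search(0)
```

With beam width 2 this toy example returns the sequence [0, 1, 0, 1] (probability 0.144), whereas greedy decoding from the same start token gives [0, 1, 2, 2] (probability 0.12) -- illustrating that beam search can find a more likely overall hypothesis than greedy selection.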
+ {
+ "cell_type": "markdown",
+ "source": [
 + "First, let's investigate the tokens themselves. The code below prints out the vocabulary size and shows 20 random tokens."
+ ],
+ "metadata": {
+ "id": "vsmO9ptzau3_"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "np.random.seed(1)\n",
+ "print(\"Number of tokens in dictionary = %d\"%(tokenizer.vocab_size))\n",
+ "for i in range(20):\n",
+ " index = np.random.randint(tokenizer.vocab_size)\n",
+ " print(\"Token: %d \"%(index)+tokenizer.decode(torch.tensor(index), skip_special_tokens=True))\n"
+ ],
+ "metadata": {
+ "id": "dmmBNS5GY_yk"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Sampling\n",
+ "\n",
+ "Each time we run GPT2 it will take in a set of tokens, and return a probability over each of the possible next tokens. The simplest thing we could do is to just draw a sample from this probability distribution each time."
+ ],
+ "metadata": {
+ "id": "MUM3kLEjbTso"
+ }
+ },
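
The sampling step this cell describes (and that the TODO in `sample_next_token` asks for) can be sketched in isolation. The four-element distribution below is a made-up stand-in for the model's softmax output:

```python
import numpy as np

np.random.seed(0)

# Hypothetical stand-in for the softmax output over a four-token vocabulary
prob_over_tokens = np.array([0.1, 0.6, 0.2, 0.1])

# Draw one token index at random, weighted by the probabilities;
# size=1 yields an array containing a single integer, as the notebook expects
next_token = np.random.choice(len(prob_over_tokens), size=1, p=prob_over_tokens)
```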
+ {
+ "cell_type": "code",
+ "source": [
+ "def sample_next_token(input_tokens, model, tokenizer):\n",
+ " # Run model to get prediction over next output\n",
+ " outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
+ " # Find prediction\n",
+ " prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
+ " # TODO Draw a random token according to the probabilities\n",
 + " # next_token should be an array containing a single integer (as below)\n",
+ " # Use: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html\n",
+ " # Replace this line\n",
+ " next_token = [5000]\n",
+ "\n",
+ "\n",
+ " # Append token to sentence\n",
+ " output_tokens = input_tokens\n",
+ " output_tokens[\"input_ids\"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)\n",
+ " output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)\n",
+ " output_tokens['last_token_prob'] = prob_over_tokens[next_token]\n",
+ "\n",
+ " return output_tokens"
+ ],
+ "metadata": {
+ "id": "TIyNgg0FkJKO"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Expected output:\n",
+ "# \"The best thing about Bath is that they don't even change or shrink anymore.\"\n",
+ "\n",
+ "set_seed(0)\n",
+ "input_txt = \"The best thing about Bath is\"\n",
+ "input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
+ "for i in range(10):\n",
+ " input_tokens = sample_next_token(input_tokens, model, tokenizer)\n",
+ " print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "BHs-IWaz9MNY"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# TODO Modify the code below by changing the number of tokens generated and the initial sentence\n",
+ "# to get a feel for how well this works. Since I didn't reset the seed, it will give a different\n",
+ "# answer every time that you run it.\n",
+ "\n",
+ "# TODO Experiment with changing this line:\n",
+ "input_txt = \"The best thing about Bath is\"\n",
+ "input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
+ "# TODO Experiment with changing this line:\n",
+ "for i in range(10):\n",
+ " input_tokens = sample_next_token(input_tokens, model, tokenizer)\n",
+ " print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
+ ],
+ "metadata": {
+ "id": "yN98_7WqbvIe"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Greedy token selection\n",
+ "\n",
 + "You probably (correctly) got the impression that the text from pure sampling of the probability model can be kind of random. How about if we choose the most likely token at each step?\n"
+ ],
+ "metadata": {
+ "id": "7eHFLCeZcmmg"
+ }
+ },
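
The greedy selection the TODO in `get_best_next_token` asks for reduces to `np.argmax`; a minimal sketch on a made-up distribution:

```python
import numpy as np

# Hypothetical stand-in for the softmax output over a four-token vocabulary
prob_over_tokens = np.array([0.1, 0.6, 0.2, 0.1])

# Greedy selection: take the index of the largest probability,
# wrapped in a list so it can be concatenated onto the token sequence
next_token = [int(np.argmax(prob_over_tokens))]

print(next_token)  # [1]
```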
+ {
+ "cell_type": "code",
+ "source": [
+ "def get_best_next_token(input_tokens, model, tokenizer):\n",
+ " # Run model to get prediction over next output\n",
+ " outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
+ " # Find prediction\n",
+ " prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
+ "\n",
+ " # TODO -- find the token index with the maximum probability\n",
 + " # It should be returned as a list (i.e., put square brackets around it)\n",
+ " # Use https://numpy.org/doc/stable/reference/generated/numpy.argmax.html\n",
+ " # Replace this line\n",
+ " next_token = [5000]\n",
+ "\n",
+ "\n",
+ " # Append token to sentence\n",
+ " output_tokens = input_tokens\n",
+ " output_tokens[\"input_ids\"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)\n",
+ " output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)\n",
+ " output_tokens['last_token_prob'] = prob_over_tokens[next_token]\n",
+ " return output_tokens"
+ ],
+ "metadata": {
+ "id": "OhRzynEjxpZF"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Expected output:\n",
+ "# The best thing about Bath is that it's a place where you can go to\n",
+ "set_seed(0)\n",
+ "input_txt = \"The best thing about Bath is\"\n",
+ "input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
+ "for i in range(10):\n",
+ " input_tokens = get_best_next_token(input_tokens, model, tokenizer)\n",
+ " print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
+ ],
+ "metadata": {
+ "id": "gKB1Mgndj-Hm"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# TODO Modify the code below by changing the number of tokens generated and the initial sentence\n",
+ "# to get a feel for how well this works.\n",
+ "\n",
+ "# TODO Experiment with changing this line:\n",
+ "input_txt = \"The best thing about Bath is\"\n",
+ "input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
+ "# TODO Experiment with changing this line:\n",
+ "for i in range(10):\n",
+ " input_tokens = get_best_next_token(input_tokens, model, tokenizer)\n",
+ " print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
+ ],
+ "metadata": {
+ "id": "L1YHKaYFfC0M"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Top-K sampling\n",
+ "\n",
 + "You probably noticed that the greedy strategy produces quite realistic text, but it's kind of boring: it produces generic answers. Also, if this were a chatbot, we wouldn't necessarily want it to produce the same answer to a question each time.\n",
+ "\n",
+ "Top-K sampling is a compromise strategy that samples randomly from the top K most probable tokens. We could just choose them with a uniform distribution, or (as here) we could sample them according to their original probabilities."
+ ],
+ "metadata": {
+ "id": "1ORFXYX_gBDT"
+ }
+ },
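
The top-k procedure described above (sort, find the k'th largest probability, zero out everything below it, renormalize, sample) can be sketched standalone. The distribution and k value are made up for illustration:

```python
import numpy as np

np.random.seed(1)
k = 2

# Hypothetical stand-in for the softmax output over a four-token vocabulary
prob_over_tokens = np.array([0.1, 0.5, 0.3, 0.1])

# Sort a copy from largest to smallest and read off the k'th largest value
sorted_prob_over_tokens = np.flip(np.sort(prob_over_tokens))
kth_prob_value = sorted_prob_over_tokens[k - 1]

# Zero out every probability below the k'th value, then renormalize
prob_over_tokens[prob_over_tokens < kth_prob_value] = 0
prob_over_tokens = prob_over_tokens / np.sum(prob_over_tokens)

# Sample from the truncated distribution
next_token = np.random.choice(len(prob_over_tokens), size=1, p=prob_over_tokens)
```

Only the two most probable tokens survive the truncation, so the sample is drawn from indices 1 and 2 with probabilities 0.625 and 0.375.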
+ {
+ "cell_type": "code",
+ "source": [
+ "def get_top_k_token(input_tokens, model, tokenizer, k=20):\n",
+ " # Run model to get prediction over next output\n",
+ " outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
+ " # Find prediction\n",
+ " prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
+ "\n",
+ " # Draw a sample from the top K most likely tokens.\n",
+ " # Take copy of the probabilities and sort from largest to smallest (use np.sort)\n",
+ " # TODO -- replace this line\n",
+ " sorted_prob_over_tokens = prob_over_tokens\n",
+ "\n",
+ " # Find the probability at the k'th position\n",
+ " # TODO -- replace this line\n",
+ " kth_prob_value = 0.0\n",
+ "\n",
+ " # Set all probabilities below this value to zero\n",
 + " prob_over_tokens[prob_over_tokens<kth_prob_value] = 0\n",
 + " # Renormalize so the retained probabilities sum to one\n",
 + " prob_over_tokens = prob_over_tokens / np.sum(prob_over_tokens)\n",
 + " # Draw a random token according to the truncated distribution\n",
 + " next_token = np.random.choice(len(prob_over_tokens), 1, p=prob_over_tokens)\n",
 + "\n",
 + " # Append token to sentence\n",
 + " output_tokens = input_tokens\n",
 + " output_tokens[\"input_ids\"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)\n",
 + " output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)\n",
 + " output_tokens['last_token_prob'] = prob_over_tokens[next_token]\n",
 + " return output_tokens"
 + ],
 + "metadata": {},
 + "execution_count": null,
 + "outputs": []
 + },
 + {
 + "cell_type": "markdown",
 + "source": [
 + "# Nucleus sampling\n",
 + "\n",
 + "Nucleus (top-p) sampling chooses from the smallest set of most probable tokens whose cumulative probability exceeds a threshold, so the number of candidate tokens adapts to how peaked the distribution is."
 + ],
 + "metadata": {}
 + },
 + {
 + "cell_type": "code",
 + "source": [
 + "def get_nucleus_sampling_token(input_tokens, model, tokenizer, thresh=0.25):\n",
 + " # Run model to get prediction over next output\n",
 + " outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
 + " # Find prediction\n",
 + " prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
 + "\n",
 + " # Sort the probabilities from largest to smallest\n",
 + " sorted_probs_decreasing = np.flip(np.sort(prob_over_tokens))\n",
 + " # Compute the cumulative sum of the sorted probabilities\n",
 + " cum_probs = np.cumsum(sorted_probs_decreasing)\n",
 + " # Find the first position where the cumulative sum exceeds the threshold\n",
 + " thresh_index = np.argmax(cum_probs>thresh)\n",
 + " print(\"Choosing from %d tokens\"%(thresh_index))\n",
 + " # TODO: Find the probability value to threshold\n",
 + " # Replace this line:\n",
 + " thresh_prob = sorted_probs_decreasing[thresh_index]\n",
 + "\n",
 + "\n",
 + "\n",
 + " # Set any probabilities less than this to zero\n",
 + " prob_over_tokens[prob_over_tokens<thresh_prob] = 0\n",
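
The nucleus-sampling computation being set up in this cell can be sketched standalone. The distribution and the threshold value are made up for illustration:

```python
import numpy as np

thresh = 0.7  # keep the smallest token set whose cumulative probability exceeds this

# Hypothetical stand-in for the softmax output over a four-token vocabulary
prob_over_tokens = np.array([0.05, 0.5, 0.3, 0.15])

# Sort the probabilities from largest to smallest and accumulate them
sorted_probs_decreasing = np.flip(np.sort(prob_over_tokens))
cum_probs = np.cumsum(sorted_probs_decreasing)

# First position where the cumulative probability exceeds the threshold
thresh_index = np.argmax(cum_probs > thresh)
thresh_prob = sorted_probs_decreasing[thresh_index]

# Zero out everything below that probability value and renormalize
prob_over_tokens[prob_over_tokens < thresh_prob] = 0
prob_over_tokens = prob_over_tokens / np.sum(prob_over_tokens)
```

Here the cumulative sums are [0.5, 0.8, 0.95, 1.0], so the nucleus is the two most probable tokens and the renormalized distribution is [0, 0.625, 0.375, 0].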