From 5b845f7d7b3334375584f8339d9f5d3df5973ce8 Mon Sep 17 00:00:00 2001
From: udlbook <110402648+udlbook@users.noreply.github.com>
Date: Wed, 2 Aug 2023 18:15:44 -0400
Subject: [PATCH] Created using Colaboratory
---
.../Chap12/12_4_Decoding_Strategies.ipynb | 648 ++++++++++++++++++
1 file changed, 648 insertions(+)
create mode 100644 Notebooks/Chap12/12_4_Decoding_Strategies.ipynb
diff --git a/Notebooks/Chap12/12_4_Decoding_Strategies.ipynb b/Notebooks/Chap12/12_4_Decoding_Strategies.ipynb
new file mode 100644
index 0000000..c296764
--- /dev/null
+++ b/Notebooks/Chap12/12_4_Decoding_Strategies.ipynb
@@ -0,0 +1,648 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "authorship_tag": "ABX9TyNPrHfkLWjy3NfDHRhGG3IE",
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
 + "<a href=\"https://colab.research.google.com/github/udlbook/udlbook/blob/main/Notebooks/Chap12/12_4_Decoding_Strategies.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# **Notebook 12.4: Decoding strategies**\n",
+ "\n",
 + "This practical investigates decoding strategies for transformer language models.\n",
+ "\n",
+ "Work through the cells below, running each cell in turn. In various places you will see the words \"TO DO\". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.\n",
+ "\n",
+ "Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions."
+ ],
+ "metadata": {
+ "id": "RnIUiieJWu6e"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip install transformers"
+ ],
+ "metadata": {
+ "id": "7abjZ9pMVj3k"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed\n",
+ "import torch\n",
+ "import torch.nn.functional as F\n",
+ "import numpy as np"
+ ],
+ "metadata": {
+ "id": "sMOyD0zem2Ef"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Load model and tokenizer\n",
+ "model = GPT2LMHeadModel.from_pretrained('gpt2')\n",
+ "tokenizer = GPT2Tokenizer.from_pretrained('gpt2')"
+ ],
+ "metadata": {
+ "id": "pZgfxbzKWNSR"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Decoding from GPT2\n",
+ "\n",
 + "This tutorial investigates how to use GPT2 (the forerunner of GPT3) to generate text. There are a number of ways to do this that trade off the realism of the text against the amount of variation.\n",
+ "\n",
 + "At every stage, GPT2 takes an input string and returns a probability for each of the possible subsequent tokens. We can choose what to do with these probabilities. We could always *greedily choose* the most likely next token, or we could draw a *sample* randomly according to the probabilities. There are also intermediate strategies, such as *top-k sampling* and *nucleus sampling*, that introduce a controlled amount of randomness.\n",
+ "\n",
 + "We'll also investigate *beam search* -- the idea is that rather than greedily taking the best next token at each stage, we maintain a set of hypotheses (beams) as we add each subsequent token and return the most likely overall hypothesis. This is not necessarily the same result we get from greedily choosing the next token."
+ ],
+ "metadata": {
+ "id": "TfhAGy0TXEvV"
+ }
+ },
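
The beam-search idea described above can be sketched on a toy model. Everything here (the transition matrix, the `beam_search` helper, and its parameters) is a hypothetical stand-in for GPT2's next-token distribution, not part of the notebook:

```python
import numpy as np

# Hypothetical toy model: next-token probabilities depend only on the last token
trans = np.array([[0.1, 0.6, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.3, 0.3, 0.4]])

def beam_search(start_token, n_steps=3, beam_width=2):
    # Each hypothesis (beam) is a (token sequence, log probability) pair
    beams = [([start_token], 0.0)]
    for _ in range(n_steps):
        candidates = []
        for seq, logp in beams:
            # Extend every beam by every possible next token
            for tok, p in enumerate(trans[seq[-1]]):
                candidates.append((seq + [tok], logp + np.log(p)))
        # Keep only the beam_width most likely hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

best_seq, best_logp = beam_search(0)
```

With beam width 2 this toy example returns the sequence [0, 1, 0, 1] (probability 0.144), whereas greedy decoding from the same start token gives [0, 1, 2, 2] (probability 0.12) -- illustrating that beam search can find a more likely overall hypothesis than greedy selection.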
+ {
+ "cell_type": "markdown",
+ "source": [
 + "First, let's investigate the tokens themselves. The code below prints out the vocabulary size and shows 20 random tokens."
+ ],
+ "metadata": {
+ "id": "vsmO9ptzau3_"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "np.random.seed(1)\n",
+ "print(\"Number of tokens in dictionary = %d\"%(tokenizer.vocab_size))\n",
+ "for i in range(20):\n",
+ " index = np.random.randint(tokenizer.vocab_size)\n",
+ " print(\"Token: %d \"%(index)+tokenizer.decode(torch.tensor(index), skip_special_tokens=True))\n"
+ ],
+ "metadata": {
+ "id": "dmmBNS5GY_yk"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Sampling\n",
+ "\n",
+ "Each time we run GPT2 it will take in a set of tokens, and return a probability over each of the possible next tokens. The simplest thing we could do is to just draw a sample from this probability distribution each time."
+ ],
+ "metadata": {
+ "id": "MUM3kLEjbTso"
+ }
+ },
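
The sampling step this cell describes (and that the TODO in `sample_next_token` asks for) can be sketched in isolation. The four-element distribution below is a made-up stand-in for the model's softmax output:

```python
import numpy as np

np.random.seed(0)

# Hypothetical stand-in for the softmax output over a four-token vocabulary
prob_over_tokens = np.array([0.1, 0.6, 0.2, 0.1])

# Draw one token index at random, weighted by the probabilities;
# size=1 yields an array containing a single integer, as the notebook expects
next_token = np.random.choice(len(prob_over_tokens), size=1, p=prob_over_tokens)
```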
+ {
+ "cell_type": "code",
+ "source": [
+ "def sample_next_token(input_tokens, model, tokenizer):\n",
+ " # Run model to get prediction over next output\n",
+ " outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
+ " # Find prediction\n",
+ " prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
+ " # TODO Draw a random token according to the probabilities\n",
 + " # next_token should be an array containing a single integer (as below)\n",
+ " # Use: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html\n",
+ " # Replace this line\n",
+ " next_token = [5000]\n",
+ "\n",
+ "\n",
+ " # Append token to sentence\n",
+ " output_tokens = input_tokens\n",
+ " output_tokens[\"input_ids\"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)\n",
+ " output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)\n",
+ " output_tokens['last_token_prob'] = prob_over_tokens[next_token]\n",
+ "\n",
+ " return output_tokens"
+ ],
+ "metadata": {
+ "id": "TIyNgg0FkJKO"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Expected output:\n",
+ "# \"The best thing about Bath is that they don't even change or shrink anymore.\"\n",
+ "\n",
+ "set_seed(0)\n",
+ "input_txt = \"The best thing about Bath is\"\n",
+ "input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
+ "for i in range(10):\n",
+ " input_tokens = sample_next_token(input_tokens, model, tokenizer)\n",
+ " print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "BHs-IWaz9MNY"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# TODO Modify the code below by changing the number of tokens generated and the initial sentence\n",
+ "# to get a feel for how well this works. Since I didn't reset the seed, it will give a different\n",
+ "# answer every time that you run it.\n",
+ "\n",
+ "# TODO Experiment with changing this line:\n",
+ "input_txt = \"The best thing about Bath is\"\n",
+ "input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
+ "# TODO Experiment with changing this line:\n",
+ "for i in range(10):\n",
+ " input_tokens = sample_next_token(input_tokens, model, tokenizer)\n",
+ " print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
+ ],
+ "metadata": {
+ "id": "yN98_7WqbvIe"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Greedy token selection\n",
+ "\n",
 + "You probably (correctly) got the impression that the text from pure sampling of the probability model can be kind of random. How about if we choose the most likely token at each step?\n"
+ ],
+ "metadata": {
+ "id": "7eHFLCeZcmmg"
+ }
+ },
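
The greedy selection the TODO in `get_best_next_token` asks for reduces to `np.argmax`; a minimal sketch on a made-up distribution:

```python
import numpy as np

# Hypothetical stand-in for the softmax output over a four-token vocabulary
prob_over_tokens = np.array([0.1, 0.6, 0.2, 0.1])

# Greedy selection: take the index of the largest probability,
# wrapped in a list so it can be concatenated onto the token sequence
next_token = [int(np.argmax(prob_over_tokens))]

print(next_token)  # [1]
```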
+ {
+ "cell_type": "code",
+ "source": [
+ "def get_best_next_token(input_tokens, model, tokenizer):\n",
+ " # Run model to get prediction over next output\n",
+ " outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
+ " # Find prediction\n",
+ " prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
+ "\n",
+ " # TODO -- find the token index with the maximum probability\n",
 + " # It should be returned as a list (i.e., put square brackets around it)\n",
+ " # Use https://numpy.org/doc/stable/reference/generated/numpy.argmax.html\n",
+ " # Replace this line\n",
+ " next_token = [5000]\n",
+ "\n",
+ "\n",
+ " # Append token to sentence\n",
+ " output_tokens = input_tokens\n",
+ " output_tokens[\"input_ids\"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)\n",
+ " output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)\n",
+ " output_tokens['last_token_prob'] = prob_over_tokens[next_token]\n",
+ " return output_tokens"
+ ],
+ "metadata": {
+ "id": "OhRzynEjxpZF"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Expected output:\n",
+ "# The best thing about Bath is that it's a place where you can go to\n",
+ "set_seed(0)\n",
+ "input_txt = \"The best thing about Bath is\"\n",
+ "input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
+ "for i in range(10):\n",
+ " input_tokens = get_best_next_token(input_tokens, model, tokenizer)\n",
+ " print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
+ ],
+ "metadata": {
+ "id": "gKB1Mgndj-Hm"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# TODO Modify the code below by changing the number of tokens generated and the initial sentence\n",
+ "# to get a feel for how well this works.\n",
+ "\n",
+ "# TODO Experiment with changing this line:\n",
+ "input_txt = \"The best thing about Bath is\"\n",
+ "input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
+ "# TODO Experiment with changing this line:\n",
+ "for i in range(10):\n",
+ " input_tokens = get_best_next_token(input_tokens, model, tokenizer)\n",
+ " print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
+ ],
+ "metadata": {
+ "id": "L1YHKaYFfC0M"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Top-K sampling\n",
+ "\n",
 + "You probably noticed that the greedy strategy produces quite realistic text, but it's kind of boring: it produces generic answers. Also, if this were a chatbot, we wouldn't necessarily want it to produce the same answer to a question each time.\n",
+ "\n",
+ "Top-K sampling is a compromise strategy that samples randomly from the top K most probable tokens. We could just choose them with a uniform distribution, or (as here) we could sample them according to their original probabilities."
+ ],
+ "metadata": {
+ "id": "1ORFXYX_gBDT"
+ }
+ },
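
The top-k procedure described above (sort, find the k'th largest probability, zero out everything below it, renormalize, sample) can be sketched standalone. The distribution and k value are made up for illustration:

```python
import numpy as np

np.random.seed(1)
k = 2

# Hypothetical stand-in for the softmax output over a four-token vocabulary
prob_over_tokens = np.array([0.1, 0.5, 0.3, 0.1])

# Sort a copy from largest to smallest and read off the k'th largest value
sorted_prob_over_tokens = np.flip(np.sort(prob_over_tokens))
kth_prob_value = sorted_prob_over_tokens[k - 1]

# Zero out every probability below the k'th value, then renormalize
prob_over_tokens[prob_over_tokens < kth_prob_value] = 0
prob_over_tokens = prob_over_tokens / np.sum(prob_over_tokens)

# Sample from the truncated distribution
next_token = np.random.choice(len(prob_over_tokens), size=1, p=prob_over_tokens)
```

Only the two most probable tokens survive the truncation, so the sample is drawn from indices 1 and 2 with probabilities 0.625 and 0.375.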
+ {
+ "cell_type": "code",
+ "source": [
+ "def get_top_k_token(input_tokens, model, tokenizer, k=20):\n",
+ " # Run model to get prediction over next output\n",
+ " outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
+ " # Find prediction\n",
+ " prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
+ "\n",
+ " # Draw a sample from the top K most likely tokens.\n",
+ " # Take copy of the probabilities and sort from largest to smallest (use np.sort)\n",
+ " # TODO -- replace this line\n",
+ " sorted_prob_over_tokens = prob_over_tokens\n",
+ "\n",
+ " # Find the probability at the k'th position\n",
+ " # TODO -- replace this line\n",
+ " kth_prob_value = 0.0\n",
+ "\n",
+ " # Set all probabilities below this value to zero\n",
 + " prob_over_tokens[prob_over_tokens<kth_prob_value] = 0\n",
 + " # Renormalize so the retained probabilities sum to one\n",
 + " prob_over_tokens = prob_over_tokens / np.sum(prob_over_tokens)\n",
 + " # Draw a random token according to the truncated distribution\n",
 + " next_token = np.random.choice(len(prob_over_tokens), 1, p=prob_over_tokens)\n",
 + "\n",
 + " # Append token to sentence\n",
 + " output_tokens = input_tokens\n",
 + " output_tokens[\"input_ids\"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)\n",
 + " output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)\n",
 + " output_tokens['last_token_prob'] = prob_over_tokens[next_token]\n",
 + " return output_tokens"
 + ],
 + "metadata": {},
 + "execution_count": null,
 + "outputs": []
 + },
 + {
 + "cell_type": "markdown",
 + "source": [
 + "# Nucleus sampling\n",
 + "\n",
 + "Nucleus (top-p) sampling chooses from the smallest set of most probable tokens whose cumulative probability exceeds a threshold, so the number of candidate tokens adapts to how peaked the distribution is."
 + ],
 + "metadata": {}
 + },
 + {
 + "cell_type": "code",
 + "source": [
 + "def get_nucleus_sampling_token(input_tokens, model, tokenizer, thresh=0.25):\n",
 + " # Run model to get prediction over next output\n",
 + " outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
 + " # Find prediction\n",
 + " prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
 + "\n",
 + " # Sort the probabilities from largest to smallest\n",
 + " sorted_probs_decreasing = np.flip(np.sort(prob_over_tokens))\n",
 + " # Compute the cumulative sum of the sorted probabilities\n",
 + " cum_probs = np.cumsum(sorted_probs_decreasing)\n",
 + " # Find the first position where the cumulative sum exceeds the threshold\n",
 + " thresh_index = np.argmax(cum_probs>thresh)\n",
 + " print(\"Choosing from %d tokens\"%(thresh_index))\n",
 + " # TODO: Find the probability value to threshold\n",
 + " # Replace this line:\n",
 + " thresh_prob = sorted_probs_decreasing[thresh_index]\n",
 + "\n",
 + "\n",
 + "\n",
 + " # Set any probabilities less than this to zero\n",
 + " prob_over_tokens[prob_over_tokens<thresh_prob] = 0\n",
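
The nucleus-sampling computation being set up in this cell can be sketched standalone. The distribution and the threshold value are made up for illustration:

```python
import numpy as np

thresh = 0.7  # keep the smallest token set whose cumulative probability exceeds this

# Hypothetical stand-in for the softmax output over a four-token vocabulary
prob_over_tokens = np.array([0.05, 0.5, 0.3, 0.15])

# Sort the probabilities from largest to smallest and accumulate them
sorted_probs_decreasing = np.flip(np.sort(prob_over_tokens))
cum_probs = np.cumsum(sorted_probs_decreasing)

# First position where the cumulative probability exceeds the threshold
thresh_index = np.argmax(cum_probs > thresh)
thresh_prob = sorted_probs_decreasing[thresh_index]

# Zero out everything below that probability value and renormalize
prob_over_tokens[prob_over_tokens < thresh_prob] = 0
prob_over_tokens = prob_over_tokens / np.sum(prob_over_tokens)
```

Here the cumulative sums are [0.5, 0.8, 0.95, 1.0], so the nucleus is the two most probable tokens and the renormalized distribution is [0, 0.625, 0.375, 0].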