{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyPLuYbU3a6Uvb/3X1shd1XV",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Transformers\n",
"\n",
"This practical investigates neural decoding from transformer models. Run the next three cells first: they download the library and the model weights, so they may take a while. Read the next text box while you are waiting."
],
"metadata": {
"id": "RnIUiieJWu6e"
}
},
{
"cell_type": "code",
"source": [
"!pip install transformers"
],
"metadata": {
"id": "7abjZ9pMVj3k"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed\n",
"import torch\n",
"import torch.nn.functional as F\n",
"import numpy as np"
],
"metadata": {
"id": "sMOyD0zem2Ef"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Load model and tokenizer\n",
"model = GPT2LMHeadModel.from_pretrained('gpt2')\n",
"tokenizer = GPT2Tokenizer.from_pretrained('gpt2')"
],
"metadata": {
"id": "pZgfxbzKWNSR"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Decoding from GPT2\n",
"\n",
"This tutorial investigates how to use GPT2 (the forerunner of GPT3) to generate text. There are a number of ways to do this that trade off the realism of the text against the amount of variation.\n",
"\n",
"At every stage, GPT2 takes an input string and returns a probability for each of the possible subsequent tokens. We can choose what to do with these probabilities. We could always *greedily choose* the most likely next token, or we could draw a *sample* randomly according to the probabilities. There are also intermediate strategies, such as *top-k sampling* and *nucleus sampling*, that have some controlled randomness.\n",
"\n",
"We'll also investigate *beam search*. The idea is that rather than greedily taking the next best token at each stage, we maintain a set of hypotheses (beams) as we add each subsequent token and return the most likely overall hypothesis. This is not necessarily the same result we get from greedily choosing the next token."
],
"metadata": {
"id": "TfhAGy0TXEvV"
}
},
{
"cell_type": "markdown",
"source": [
"First, let's investigate the tokens themselves. The code below prints out the vocabulary size and shows 20 random tokens."
],
"metadata": {
"id": "vsmO9ptzau3_"
}
},
{
"cell_type": "code",
"source": [
"np.random.seed(1)\n",
"print(\"Number of tokens in dictionary = %d\"%(tokenizer.vocab_size))\n",
"for i in range(20):\n",
" index = np.random.randint(tokenizer.vocab_size)\n",
" print(\"Token: %d \"%(index)+tokenizer.decode(torch.tensor(index), skip_special_tokens=True))\n"
],
"metadata": {
"id": "dmmBNS5GY_yk"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Sampling\n",
"\n",
"Each time we run GPT2, it takes in a sequence of tokens and returns a probability for each of the possible next tokens. The simplest thing we could do is to draw a sample from this probability distribution each time."
],
"metadata": {
"id": "MUM3kLEjbTso"
}
},
{
"cell_type": "code",
"source": [
"def sample_next_token(input_tokens, model, tokenizer):\n",
" # Run model to get prediction over next output\n",
" outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
" # Find prediction\n",
" prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
" # TODO -- draw a random token according to the probabilities\n",
" # It should be returned as a list (i.e., put square brackets around it)\n",
" # Use: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html\n",
" # Replace this line\n",
" next_token = [5000]\n",
"\n",
" # Append token to sentence\n",
" output_tokens = input_tokens\n",
" output_tokens[\"input_ids\"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)\n",
" output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)\n",
" output_tokens['last_token_prob'] = prob_over_tokens[next_token]\n",
"\n",
" return output_tokens"
],
"metadata": {
"id": "TIyNgg0FkJKO"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Expected output:\n",
"# \"The best thing about Bath is that they don't even change or shrink anymore.\"\n",
"\n",
"set_seed(0)\n",
"input_txt = \"The best thing about Bath is\"\n",
"input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
"for i in range(10):\n",
" input_tokens = sample_next_token(input_tokens, model, tokenizer)\n",
" print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))\n",
"\n"
],
"metadata": {
"id": "BHs-IWaz9MNY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# TODO Modify the code below by changing the number of tokens generated and the initial sentence\n",
"# to get a feel for how well this works. Since the seed is not reset here, it will give a different\n",
"# answer every time that you run it.\n",
"\n",
"# TODO Experiment with changing this line:\n",
"input_txt = \"The best thing about Bath is\"\n",
"input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
"# TODO Experiment with changing this line:\n",
"for i in range(10):\n",
" input_tokens = sample_next_token(input_tokens, model, tokenizer)\n",
" print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
],
"metadata": {
"id": "yN98_7WqbvIe"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Greedy token selection\n",
"\n",
"You probably (correctly) got the impression that text produced by pure sampling from the probability model can be rather random. What if we choose the most likely token at each step instead?\n"
],
"metadata": {
"id": "7eHFLCeZcmmg"
}
},
{
"cell_type": "code",
"source": [
"def get_best_next_token(input_tokens, model, tokenizer):\n",
" # Run model to get prediction over next output\n",
" outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
" # Find prediction\n",
" prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
"\n",
" # TODO -- find the token index with the maximum probability\n",
" # It should be returned as a list (i.e., put square brackets around it)\n",
" # Use https://numpy.org/doc/stable/reference/generated/numpy.argmax.html\n",
" # Replace this line\n",
" next_token = [5000]\n",
"\n",
" # Append token to sentence\n",
" output_tokens = input_tokens\n",
" output_tokens[\"input_ids\"] = torch.cat((output_tokens['input_ids'],torch.tensor([next_token])),dim=1)\n",
" output_tokens['attention_mask'] = torch.cat((output_tokens['attention_mask'],torch.tensor([[1]])),dim=1)\n",
" output_tokens['last_token_prob'] = prob_over_tokens[next_token]\n",
" return output_tokens"
],
"metadata": {
"id": "OhRzynEjxpZF"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Expected output:\n",
"# The best thing about Bath is that it's a place where you can go to\n",
"set_seed(0)\n",
"input_txt = \"The best thing about Bath is\"\n",
"input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
"for i in range(10):\n",
" input_tokens = get_best_next_token(input_tokens, model, tokenizer)\n",
" print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
],
"metadata": {
"id": "gKB1Mgndj-Hm"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# TODO Modify the code below by changing the number of tokens generated and the initial sentence\n",
"# to get a feel for how well this works.\n",
"\n",
"# TODO Experiment with changing this line:\n",
"input_txt = \"The best thing about Bath is\"\n",
"input_tokens = tokenizer(input_txt, return_tensors='pt')\n",
"# TODO Experiment with changing this line:\n",
"for i in range(10):\n",
" input_tokens = get_best_next_token(input_tokens, model, tokenizer)\n",
" print(tokenizer.decode(input_tokens[\"input_ids\"][0], skip_special_tokens=True))"
],
"metadata": {
"id": "L1YHKaYFfC0M"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Top-K sampling\n",
"\n",
"You probably noticed that the greedy strategy produces quite realistic text, but it's kind of boring: it produces generic answers. Also, if this were a chatbot, we wouldn't necessarily want it to produce the same answer to a question each time.\n",
"\n",
"Top-K sampling is a compromise strategy that samples randomly from the top K most probable tokens. We could just choose them with a uniform distribution, or (as here) we could sample them according to their original probabilities."
],
"metadata": {
"id": "1ORFXYX_gBDT"
}
},
{
"cell_type": "code",
"source": [
"def get_top_k_token(input_tokens, model, tokenizer, k=20):\n",
" # Run model to get prediction over next output\n",
" outputs = model(input_ids = input_tokens['input_ids'], attention_mask = input_tokens['attention_mask'])\n",
" # Find prediction\n",
" prob_over_tokens = F.softmax(outputs.logits, dim=-1).detach().numpy()[0,-1]\n",
"\n",
" # Draw a sample from the top K most likely tokens.\n",
" # Take a copy of the probabilities and sort from largest to smallest (use np.sort)\n",
" # TODO -- replace this line\n",
" sorted_prob_over_tokens = prob_over_tokens\n",
"\n",
" # Find the probability at the k'th position\n",
" # TODO -- replace this line\n",
" kth_prob_value = 0.0\n",
"\n",
" # Set all probabilities below this value to zero \n",
" prob_over_tokens[prob_over_tokens<kth_prob_value] = 0\n",
" # ...\n",
" print(\"Choosing from %d tokens\"%(thresh_index))\n",
" # TODO: Find the probability value to threshold\n",
" # Replace this line:\n",
" thresh_prob = sorted_probs_decreasing[thresh_index]\n",
"\n",
" # Set any probabilities less than this to zero \n",
" prob_over_tokens[prob_over_tokens<thresh_prob] = 0\n",