Delete Notebooks/Chap12/12_1_Self_Attention_A.ipynb

This commit is contained in:
udlbook
2023-12-14 17:36:45 +00:00
committed by GitHub
parent 73c3fcc40b
commit 5f8f05a381

@@ -1,497 +0,0 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyOYGccBIZQ0eZeXkhhw1Vup",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/udlbook/udlbook/blob/main/Notebooks/Chap12/12_1_Self_Attention_A.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"# **Notebook 12.1: Self Attention**\n",
"\n",
"This notebook builds a self-attention mechanism from scratch, as discussed in section 12.2 of the book.\n",
"\n",
"Work through the cells below, running each cell in turn. In various places you will see the words \"TO DO\". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.\n",
"\n",
"Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions.\n",
"\n"
],
"metadata": {
"id": "t9vk9Elugvmi"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt"
],
"metadata": {
"id": "OLComQyvCIJ7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The self-attention mechanism takes $N$ inputs $\\mathbf{x}_{n}\\in\\mathbb{R}^{D}$ and returns $N$ outputs $\\mathbf{x}'_{n}\\in \\mathbb{R}^{D}$. \n",
"\n"
],
"metadata": {
"id": "9OJkkoNqCVK2"
}
},
{
"cell_type": "code",
"source": [
"# Set seed so we get the same random numbers\n",
"np.random.seed(3)\n",
"# Number of inputs\n",
"N = 3\n",
"# Number of dimensions of each input\n",
"D = 4\n",
"# Create an empty list\n",
"all_x = []\n",
"# Create elements x_n and append to list\n",
"for n in range(N):\n",
"  all_x.append(np.random.normal(size=(D,1)))\n",
"# Print out the list\n",
"print(all_x)\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "oAygJwLiCSri",
"outputId": "2b82bd1a-9c48-4df3-e4ce-8f7376da60d8"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[array([[ 1.78862847],\n",
" [ 0.43650985],\n",
" [ 0.09649747],\n",
" [-1.8634927 ]]), array([[-0.2773882 ],\n",
" [-0.35475898],\n",
" [-0.08274148],\n",
" [-0.62700068]]), array([[-0.04381817],\n",
" [-0.47721803],\n",
" [-1.31386475],\n",
" [ 0.88462238]])]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"We'll also need the weights and biases for the keys, queries, and values (equations 12.2 and 12.4)"
],
"metadata": {
"id": "W2iHFbtKMaDp"
}
},
{
"cell_type": "code",
"source": [
"# Set seed so we get the same random numbers\n",
"np.random.seed(0)\n",
"\n",
"# Choose random values for the parameters\n",
"omega_q = np.random.normal(size=(D,D))\n",
"omega_k = np.random.normal(size=(D,D))\n",
"omega_v = np.random.normal(size=(D,D))\n",
"beta_q = np.random.normal(size=(D,1))\n",
"beta_k = np.random.normal(size=(D,1))\n",
"beta_v = np.random.normal(size=(D,1))"
],
"metadata": {
"id": "79TSK7oLMobe"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now let's compute the queries, keys, and values for each input"
],
"metadata": {
"id": "VxaKQtP3Ng6R"
}
},
{
"cell_type": "code",
"source": [
"# Make three lists to store queries, keys, and values\n",
"all_queries = []\n",
"all_keys = []\n",
"all_values = []\n",
"# For every input\n",
"for x in all_x:\n",
"  # TODO -- compute the keys, queries and values.\n",
"  # Replace these three lines\n",
"  query = np.ones_like(x)\n",
"  key = np.ones_like(x)\n",
"  value = np.ones_like(x)\n",
"\n",
"  # BEGIN_ANSWER\n",
"  query = beta_q + np.matmul(omega_q, x)\n",
"  key = beta_k + np.matmul(omega_k, x)\n",
"  value = beta_v + np.matmul(omega_v, x)\n",
"\n",
"  # END_ANSWER\n",
"\n",
"  all_queries.append(query)\n",
"  all_keys.append(key)\n",
"  all_values.append(value)"
],
"metadata": {
"id": "TwDK2tfdNmw9"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We'll need a softmax function (equation 12.5) -- here, it will take a list of arbitrary numbers and return a list where the elements are non-negative and sum to one.\n"
],
"metadata": {
"id": "Se7DK6PGPSUk"
}
},
{
"cell_type": "code",
"source": [
"def softmax(items_in):\n",
"\n",
"  # TODO Compute the elements of items_out\n",
"  # Replace this line\n",
"  items_out = items_in.copy()\n",
"\n",
"  # BEGIN_ANSWER\n",
"  items_out = []\n",
"  denom = 0\n",
"  for item in items_in:\n",
"    denom = denom + np.exp(item)\n",
"  for item in items_in:\n",
"    items_out.append(np.exp(item)/denom)\n",
"  # END_ANSWER\n",
"\n",
"  return items_out"
],
"metadata": {
"id": "u93LIcE5PoiM"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now compute the self attention values:"
],
"metadata": {
"id": "8aJVhbKDW7lm"
}
},
{
"cell_type": "code",
"source": [
"# Create empty list for output\n",
"all_x_prime = []\n",
"\n",
"# For each output\n",
"for n in range(N):\n",
"  # Create list for dot products of query n with all keys\n",
"  all_km_qn = []\n",
"  # Compute the dot products\n",
"  for key in all_keys:\n",
"    # TODO -- compute the appropriate dot product\n",
"    # Replace this line\n",
"    dot_product = 1\n",
"\n",
"    # BEGIN_ANSWER\n",
"    dot_product = np.matmul(all_queries[n].transpose(), key)\n",
"    # END_ANSWER\n",
"\n",
"    # Store dot product\n",
"    all_km_qn.append(dot_product)\n",
"\n",
"  # Compute attention weights by applying softmax to the dot products\n",
"  attention = softmax(all_km_qn)\n",
"  # Print result (should be positive and sum to one)\n",
"  print(\"Attentions for output \", n)\n",
"  print(attention)\n",
"\n",
"  # TODO: Compute a weighted sum of all of the values according to the attention\n",
"  # (equation 12.3)\n",
"  # Replace this line\n",
"  x_prime = np.zeros((D,1))\n",
"  # BEGIN_ANSWER\n",
"  for m in range(N):\n",
"    x_prime = x_prime + attention[m] * all_values[m]\n",
"  # END_ANSWER\n",
"\n",
"  all_x_prime.append(x_prime)\n",
"\n",
"\n",
"# Print out true values to check you have it correct\n",
"print(\"x_prime_0_calculated:\", all_x_prime[0].transpose())\n",
"print(\"x_prime_0_true: [[ 0.94744244 -0.24348429 -0.91310441 -0.44522983]]\")\n",
"print(\"x_prime_1_calculated:\", all_x_prime[1].transpose())\n",
"print(\"x_prime_1_true: [[ 1.64201168 -0.08470004 4.02764044 2.18690791]]\")\n",
"print(\"x_prime_2_calculated:\", all_x_prime[2].transpose())\n",
"print(\"x_prime_2_true: [[ 1.61949281 -0.06641533 3.96863308 2.15858316]]\")\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "yimz-5nCW6vQ",
"outputId": "1d287fb3-e3f7-47da-b437-379df5184039"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Attentions for output 0\n",
"[array([[1.24326146e-13]]), array([[0.99828149]]), array([[0.00171851]])]\n",
"Attentions for output 1\n",
"[array([[2.79525306e-12]]), array([[0.00585506]]), array([[0.99414494]])]\n",
"Attentions for output 2\n",
"[array([[0.00505708]]), array([[0.00654776]]), array([[0.98839516]])]\n",
"x_prime_0_calculated: [[ 0.94744244 -0.24348429 -0.91310441 -0.44522983]]\n",
"x_prime_0_true: [[ 0.94744244 -0.24348429 -0.91310441 -0.44522983]]\n",
"x_prime_1_calculated: [[ 1.64201168 -0.08470004 4.02764044 2.18690791]]\n",
"x_prime_1_true: [[ 1.64201168 -0.08470004 4.02764044 2.18690791]]\n",
"x_prime_2_calculated: [[ 1.61949281 -0.06641533 3.96863308 2.15858316]]\n",
"x_prime_2_true: [[ 1.61949281 -0.06641533 3.96863308 2.15858316]]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"Now let's compute the same thing, but using matrix calculations. We'll store the $N$ inputs $\\mathbf{x}_{n}\\in\\mathbb{R}^{D}$ in the columns of a $D\\times N$ matrix and use equations 12.6 and 12.7/8.\n",
"\n",
"Note: The book uses column vectors (for compatibility with the rest of the text), but in the wider literature it is more normal to store the inputs in the rows of a matrix; in this case, the computation is the same, but all the matrices are transposed and the operations proceed in the reverse order."
],
"metadata": {
"id": "PJ2vCQ_7C38K"
}
},
{
"cell_type": "code",
"source": [
"# Define softmax operation that works independently on each column\n",
"def softmax_cols(data_in):\n",
"  # Exponentiate all of the values\n",
"  exp_values = np.exp(data_in)\n",
"  # Sum over columns\n",
"  denom = np.sum(exp_values, axis = 0)\n",
"  # Replicate denominator to N rows\n",
"  denom = np.matmul(np.ones((data_in.shape[0],1)), denom[np.newaxis,:])\n",
"  # Compute softmax\n",
"  softmax = exp_values / denom\n",
"  # return the answer\n",
"  return softmax"
],
"metadata": {
"id": "obaQBdUAMXXv"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Now let's compute self attention in matrix form\n",
"def self_attention(X,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k):\n",
"\n",
"  # TODO -- Write this function\n",
"  # 1. Compute queries, keys, and values\n",
"  # 2. Compute dot products\n",
"  # 3. Apply softmax to calculate attentions\n",
"  # 4. Weight values by attentions\n",
"  # Replace this line\n",
"  X_prime = np.zeros_like(X)\n",
"\n",
"  # BEGIN_ANSWER\n",
"  Q = np.matmul(beta_q, np.ones((1,X.shape[1]))) + np.matmul(omega_q, X)\n",
"  K = np.matmul(beta_k, np.ones((1,X.shape[1]))) + np.matmul(omega_k, X)\n",
"  V = np.matmul(beta_v, np.ones((1,X.shape[1]))) + np.matmul(omega_v, X)\n",
"\n",
"  dot_products = np.matmul(K.transpose(), Q)\n",
"  attention = softmax_cols(dot_products)\n",
"  X_prime = np.matmul(V, attention)\n",
"  # END_ANSWER\n",
"\n",
"\n",
"  return X_prime"
],
"metadata": {
"id": "gb2WvQ3SiH8r"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Copy data into matrix\n",
"X = np.zeros((D, N))\n",
"X[:,0] = np.squeeze(all_x[0])\n",
"X[:,1] = np.squeeze(all_x[1])\n",
"X[:,2] = np.squeeze(all_x[2])\n",
"\n",
"# Run the self attention mechanism\n",
"X_prime = self_attention(X,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k)\n",
"\n",
"# Print out the results\n",
"print(X_prime)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MUOJbgJskUpl",
"outputId": "a3cef470-3f45-4862-81d5-b499603fe7eb"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[[ 0.94744244 1.64201168 1.61949281]\n",
" [-0.24348429 -0.08470004 -0.06641533]\n",
" [-0.91310441 4.02764044 3.96863308]\n",
" [-0.44522983 2.18690791 2.15858316]]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"If you did this correctly, the values should be the same as above.\n",
"\n",
"TODO: \n",
"\n",
"Print out the attention matrix.\n",
"You will see that the values are quite extreme (one is very close to one and the others are very close to zero). Now we'll fix this problem by using scaled dot-product attention."
],
"metadata": {
"id": "as_lRKQFpvz0"
}
},
{
"cell_type": "code",
"source": [
"# Now let's compute scaled dot-product self attention in matrix form\n",
"def scaled_dot_product_self_attention(X,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k):\n",
"\n",
"  # TODO -- Write this function\n",
"  # 1. Compute queries, keys, and values\n",
"  # 2. Compute dot products\n",
"  # 3. Scale the dot products as in equation 12.9\n",
"  # 4. Apply softmax to calculate attentions\n",
"  # 5. Weight values by attentions\n",
"  # Replace this line\n",
"  X_prime = np.zeros_like(X)\n",
"\n",
"  # BEGIN_ANSWER\n",
"  Q = np.matmul(beta_q, np.ones((1,X.shape[1]))) + np.matmul(omega_q, X)\n",
"  K = np.matmul(beta_k, np.ones((1,X.shape[1]))) + np.matmul(omega_k, X)\n",
"  V = np.matmul(beta_v, np.ones((1,X.shape[1]))) + np.matmul(omega_v, X)\n",
"\n",
"  dot_products = np.matmul(K.transpose(), Q)/np.sqrt(X.shape[0])\n",
"  attention = softmax_cols(dot_products)\n",
"  print(attention)\n",
"  X_prime = np.matmul(V, attention)\n",
"  # END_ANSWER\n",
"\n",
"  return X_prime"
],
"metadata": {
"id": "kLU7PUnnqvIh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Run the self attention mechanism\n",
"X_prime = scaled_dot_product_self_attention(X,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k)\n",
"\n",
"# Print out the results\n",
"print(X_prime)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "n18e3XNzmVgL",
"outputId": "7ff5ea07-16c6-43db-c6de-b4971e5bd7f9"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"[[3.38843552e-07 1.55730194e-06 6.20418746e-02]\n",
" [9.60161968e-01 7.12734969e-02 7.05962187e-02]\n",
" [3.98376935e-02 9.28724946e-01 8.67361907e-01]]\n",
"[[ 0.97411966 1.59622051 1.32638014]\n",
" [-0.23738409 -0.09516106 0.13062402]\n",
" [-0.72333202 3.70194096 3.02371664]\n",
" [-0.34413007 2.01339538 1.6902419 ]]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"TODO -- Investigate whether the self-attention mechanism is equivariant with respect to permutation.\n",
"If it is, when we permute the columns of the input matrix $\\mathbf{X}$, the columns of the output matrix $\\mathbf{X}'$ will also be permuted.\n"
],
"metadata": {
"id": "QDEkIrcgrql-"
}
}
]
}
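
The final TODO in the notebook asks whether permuting the columns of the input permutes the columns of the output in the same way. The sketch below is one possible way to check this numerically; it is not part of the original notebook. It assumes the column-vector convention used above, draws its own random parameters (so the numbers will not match the notebook's outputs), relies on NumPy broadcasting instead of replicating the biases explicitly, and subtracts the column maximum inside the softmax for numerical stability. The name `scaled_self_attention` and the permutation `perm` are illustrative choices.

```python
import numpy as np

def softmax_cols(data_in):
    # Subtract each column's maximum before exponentiating; the result is
    # unchanged mathematically, but large dot products no longer overflow.
    shifted = data_in - np.max(data_in, axis=0, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=0, keepdims=True)

def scaled_self_attention(X, omega_v, omega_q, omega_k, beta_v, beta_q, beta_k):
    # Biases of shape (D, 1) broadcast across the N columns of X
    Q = beta_q + omega_q @ X
    K = beta_k + omega_k @ X
    V = beta_v + omega_v @ X
    # Column n holds the attention weights that output n places on each input
    attention = softmax_cols(K.T @ Q / np.sqrt(X.shape[0]))
    return V @ attention

# Illustrative random data with the same sizes as the notebook (D=4, N=3)
rng = np.random.default_rng(0)
D, N = 4, 3
X = rng.normal(size=(D, N))
omega_q, omega_k, omega_v = (rng.normal(size=(D, D)) for _ in range(3))
beta_q, beta_k, beta_v = (rng.normal(size=(D, 1)) for _ in range(3))
params = (omega_v, omega_q, omega_k, beta_v, beta_q, beta_k)

perm = np.array([2, 0, 1])  # an arbitrary reordering of the inputs
X_prime = scaled_self_attention(X, *params)
X_prime_permuted = scaled_self_attention(X[:, perm], *params)

# Equivariance: permuting the inputs permutes the outputs in the same way
is_equivariant = np.allclose(X_prime_permuted, X_prime[:, perm])
print("Permutation equivariant:", is_equivariant)
```

The check works because the queries, keys, and values are computed independently for each column, and the column-wise softmax commutes with reordering both the rows and the columns of the dot-product matrix by the same permutation.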