udlbook/Notebooks/Chap05/5_2_Binary_Cross_Entropy_Loss.ipynb

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/udlbook/udlbook/blob/main/Notebooks/Chap05/5_2_Binary_Cross_Entropy_Loss.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# **Notebook 5.2 Binary Cross-Entropy Loss**\n",
        "\n",
        "This notebook investigates the binary cross-entropy loss.  It follows from applying the formula in section 5.2 to a loss function based on the Bernoulli distribution.\n",
        "\n",
        "Work through the cells below, running each cell in turn. In various places you will see the words \"TO DO\". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.\n",
        "\n",
        "Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions."
      ],
      "metadata": {
        "id": "jSlFkICHwHQF"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PYMZ1x-Pv1ht"
      },
      "outputs": [],
      "source": [
        "# Imports math library\n",
        "import numpy as np\n",
        "# Imports plotting library\n",
        "import matplotlib.pyplot as plt\n",
        "# Import math Library\n",
        "import math"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Define the Rectified Linear Unit (ReLU) function\n",
        "def ReLU(preactivation):\n",
        "  activation = preactivation.clip(0.0)\n",
        "  return activation\n",
        "\n",
        "# Define a shallow neural network\n",
        "def shallow_nn(x, beta_0, omega_0, beta_1, omega_1):\n",
        "    # Make sure that input data is (1 x n_data) array\n",
        "    n_data = x.size\n",
        "    x = np.reshape(x,(1,n_data))\n",
        "\n",
        "    # This runs the network for ALL of the inputs, x at once so we can draw graph\n",
        "    h1 = ReLU(np.matmul(beta_0,np.ones((1,n_data))) + np.matmul(omega_0,x))\n",
        "    model_out = np.matmul(beta_1,np.ones((1,n_data))) + np.matmul(omega_1,h1)\n",
        "    return model_out"
      ],
      "metadata": {
        "id": "Fv7SZR3tv7mV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Get parameters for model -- we can call this function to easily reset them\n",
        "def get_parameters():\n",
        "  # And we'll create a network that approximately fits it\n",
        "  beta_0 = np.zeros((3,1));  # formerly theta_x0\n",
        "  omega_0 = np.zeros((3,1)); # formerly theta_x1\n",
        "  beta_1 = np.zeros((1,1));  # formerly phi_0\n",
        "  omega_1 = np.zeros((1,3)); # formerly phi_x\n",
        "\n",
        "  beta_0[0,0] = 0.3; beta_0[1,0] = -1.0; beta_0[2,0] = -0.5\n",
        "  omega_0[0,0] = -1.0; omega_0[1,0] = 1.8; omega_0[2,0] = 0.65\n",
        "  beta_1[0,0] = 2.6;\n",
        "  omega_1[0,0] = -24.0; omega_1[0,1] = -8.0; omega_1[0,2] = 50.0\n",
        "\n",
        "  return beta_0, omega_0, beta_1, omega_1"
      ],
      "metadata": {
        "id": "pUT9Ain_HRim"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Utility function for plotting data\n",
        "def plot_binary_classification(x_model, out_model, lambda_model, x_data = None, y_data = None, title= None):\n",
        "  # Make sure model data are 1D arrays\n",
        "  x_model = np.squeeze(x_model)\n",
        "  out_model = np.squeeze(out_model)\n",
        "  lambda_model = np.squeeze(lambda_model)\n",
        "\n",
        "  fig, ax = plt.subplots(1,2)\n",
        "  fig.set_size_inches(7.0, 3.5)\n",
        "  fig.tight_layout(pad=3.0)\n",
        "  ax[0].plot(x_model,out_model)\n",
        "  ax[0].set_xlabel('Input, $x$'); ax[0].set_ylabel('Model output')\n",
        "  ax[0].set_xlim([0,1]);ax[0].set_ylim([-4,4])\n",
        "  if title is not None:\n",
        "    ax[0].set_title(title)\n",
        "  ax[1].plot(x_model,lambda_model)\n",
        "  ax[1].set_xlabel('Input, $x$'); ax[1].set_ylabel('$\\lambda$ or Pr(y=1|x)')\n",
        "  ax[1].set_xlim([0,1]);ax[1].set_ylim([-0.05,1.05])\n",
        "  if title is not None:\n",
        "    ax[1].set_title(title)\n",
        "  if x_data is not None:\n",
        "    ax[1].plot(x_data, y_data, 'ko')\n",
        "  plt.show()"
      ],
      "metadata": {
        "id": "NRR67ri_1TzN"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Binary classification\n",
        "\n",
        "In binary classification tasks, the network predicts the probability of the output belonging to class 1.  Since probabilities must lie in [0,1] and the network can output arbitrary values, we map the network through a sigmoid function that ensures the range is valid."
      ],
      "metadata": {
        "id": "PsgLZwsPxauP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Sigmoid function that maps [-infty,infty] to [0,1]\n",
        "def sigmoid(model_out):\n",
        "  # TODO -- implement the logistic sigmoid function from equation 5.18\n",
        "  # Replace this line:\n",
        "  sig_model_out = np.zeros_like(model_out)\n",
        "\n",
        "  return sig_model_out"
      ],
      "metadata": {
        "id": "uFb8h-9IXnIe"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's create some 1D training data\n",
        "x_train = np.array([0.09291784,0.46809093,0.93089486,0.67612654,0.73441752,0.86847339,\\\n",
        "                   0.49873225,0.51083168,0.18343972,0.99380898,0.27840809,0.38028817,\\\n",
        "                   0.12055708,0.56715537,0.92005746,0.77072270,0.85278176,0.05315950,\\\n",
        "                   0.87168699,0.58858043])\n",
        "y_train = np.array([0,1,1,0,0,1,\\\n",
        "                    1,0,0,1,0,1,\\\n",
        "                    0,1,1,0,1,0, \\\n",
        "                    1,1])\n",
        "\n",
        "# Get parameters for the model\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "\n",
        "# Define a range of input values\n",
        "x_model = np.arange(0,1,0.01)\n",
        "# Run the model to get values to plot and plot it.\n",
        "model_out= shallow_nn(x_model, beta_0, omega_0, beta_1, omega_1)\n",
        "lambda_model = sigmoid(model_out)\n",
        "plot_binary_classification(x_model, model_out, lambda_model, x_train, y_train)\n"
      ],
      "metadata": {
        "id": "VWzNOt1swFVd"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The left is model output and the right is the model output after the sigmoid has been applied, so it now lies in the range [0,1] and represents the probability, that y=1.  The black dots show the training data.  We'll compute the likelihood and the negative log likelihood."
      ],
      "metadata": {
        "id": "MvVX6tl9AEXF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Return probability under Bernoulli distribution for observed class y\n",
        "def bernoulli_distribution(y, lambda_param):\n",
        "    # TODO-- write in the equation for the Bernoulli distribution\n",
        "    # Equation 5.17 from the notes (you will need np.power)\n",
        "    # Replace the line below\n",
        "    prob = np.zeros_like(y)\n",
        "\n",
        "    return prob"
      ],
      "metadata": {
        "id": "YaLdRlEX0FkU"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's double check we get the right answer before proceeding\n",
        "print(\"Correct answer = %3.3f, Your answer = %3.3f\"%(0.8,bernoulli_distribution(0,0.2)))\n",
        "print(\"Correct answer = %3.3f, Your answer = %3.3f\"%(0.2,bernoulli_distribution(1,0.2)))"
      ],
      "metadata": {
        "id": "4TSL14dqHHbV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's compute the likelihood using this function"
      ],
      "metadata": {
        "id": "R5z_0dzQMF35"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Return the likelihood of all of the data under the model\n",
        "def compute_likelihood(y_train, lambda_param):\n",
        "  # TODO -- compute the likelihood of the data -- the product of the Bernoulli probabilities for each data point\n",
        "  # Top line of equation 5.3 in the notes\n",
        "  # You will need np.prod() and the bernoulli_distribution function you used above\n",
        "  # Replace the line below\n",
        "  likelihood = 0\n",
        "\n",
        "  return likelihood"
      ],
      "metadata": {
        "id": "zpS7o6liCx7f"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's test this\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "# Use our neural network to predict the Bernoulli parameter lambda\n",
        "model_out = shallow_nn(x_train, beta_0, omega_0, beta_1, omega_1)\n",
        "lambda_train = sigmoid(model_out)\n",
        "# Compute the likelihood\n",
        "likelihood = compute_likelihood(y_train, lambda_train)\n",
        "# Let's double check we get the right answer before proceeding\n",
        "print(\"Correct answer = %9.9f, Your answer = %9.9f\"%(0.000070237,likelihood))"
      ],
      "metadata": {
        "id": "1hQxBLoVNlr2"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "You can see that this gives a very small answer, even for this small 1D dataset, and with the model fitting quite well.  This is because it is the product of several probabilities, which are all quite small themselves.\n",
        "This will get out of hand pretty quickly with real datasets -- the likelihood will get so small that we can't represent it with normal finite-precision math\n",
        "\n",
        "This is why we use negative log likelihood"
      ],
      "metadata": {
        "id": "HzphKgPfOvlk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Return the negative log likelihood of the data under the model\n",
        "def compute_negative_log_likelihood(y_train, lambda_param):\n",
        "  # TODO -- compute the likelihood of the data -- don't use the likelihood function above -- compute the negative sum of the log probabilities\n",
        "  # You will need np.sum(), np.log()\n",
        "  # Replace the line below\n",
        "  nll = 0\n",
        "\n",
        "  return nll"
      ],
      "metadata": {
        "id": "dsT0CWiKBmTV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's test this\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "# Use our neural network to predict the mean of the Gaussian\n",
        "model_out = shallow_nn(x_train, beta_0, omega_0, beta_1, omega_1)\n",
        "# Pass through the sigmoid function\n",
        "lambda_train = sigmoid(model_out)\n",
        "# Compute the log likelihood\n",
        "nll = compute_negative_log_likelihood(y_train, lambda_train)\n",
        "# Let's double check we get the right answer before proceeding\n",
        "print(\"Correct answer = %9.9f, Your answer = %9.9f\"%(9.563639387,nll))"
      ],
      "metadata": {
        "id": "nVxUXg9rQmwI"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's investigate finding the maximum likelihood / minimum negative log likelihood solution.  For simplicity, we'll assume that all the parameters are fixed except one and look at how the likelihood and negative log likelihood change as we manipulate the last parameter.  We'll start with overall y_offset, beta_1 (formerly phi_0)"
      ],
      "metadata": {
        "id": "OgcRojvPWh4V"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Define a range of values for the parameter\n",
        "beta_1_vals = np.arange(-2,6.0,0.1)\n",
        "# Create some arrays to store the likelihoods, negative log likelihoods\n",
        "likelihoods = np.zeros_like(beta_1_vals)\n",
        "nlls = np.zeros_like(beta_1_vals)\n",
        "\n",
        "# Initialise the parameters\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "for count in range(len(beta_1_vals)):\n",
        "  # Set the value for the parameter\n",
        "  beta_1[0,0] = beta_1_vals[count]\n",
        "  # Run the network with new parameters\n",
        "  model_out = shallow_nn(x_train, beta_0, omega_0, beta_1, omega_1)\n",
        "  lambda_train = sigmoid(model_out)\n",
        "  # Compute and store the two values\n",
        "  likelihoods[count] = compute_likelihood(y_train,lambda_train)\n",
        "  nlls[count] = compute_negative_log_likelihood(y_train, lambda_train)\n",
        "  # Draw the model for every 20th parameter setting\n",
        "  if count % 20 == 0:\n",
        "    # Run the model to get values to plot and plot it.\n",
        "    model_out = shallow_nn(x_model, beta_0, omega_0, beta_1, omega_1)\n",
        "    lambda_model = sigmoid(model_out)\n",
        "    plot_binary_classification(x_model, model_out, lambda_model, x_train, y_train, title=\"beta_1[0]=%3.3f\"%(beta_1[0,0]))\n"
      ],
      "metadata": {
        "id": "pFKtDaAeVU4U"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Now let's plot the likelihood and negative log likelihood as a function of the value of the offset beta1\n",
        "fig, ax = plt.subplots()\n",
        "fig.tight_layout(pad=5.0)\n",
        "likelihood_color = 'tab:red'\n",
        "nll_color = 'tab:blue'\n",
        "\n",
        "\n",
        "ax.set_xlabel('beta_1[0]')\n",
        "ax.set_ylabel('likelihood', color = likelihood_color)\n",
        "ax.plot(beta_1_vals, likelihoods, color = likelihood_color)\n",
        "ax.tick_params(axis='y', labelcolor=likelihood_color)\n",
        "\n",
        "ax1 = ax.twinx()\n",
        "ax1.plot(beta_1_vals, nlls, color = nll_color)\n",
        "ax1.set_ylabel('negative log likelihood', color = nll_color)\n",
        "ax1.tick_params(axis='y', labelcolor = nll_color)\n",
        "\n",
        "plt.axvline(x = beta_1_vals[np.argmax(likelihoods)], linestyle='dotted')\n",
        "\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "UHXeTa9MagO6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Hopefully, you can see that the maximum of the likelihood fn is at the same position as the minimum negative log likelihood\n",
        "# Let's check that:\n",
        "print(\"Maximum likelihood = %f, at beta_1=%3.3f\"%( (likelihoods[np.argmax(likelihoods)],beta_1_vals[np.argmax(likelihoods)])))\n",
        "print(\"Minimum negative log likelihood = %f, at beta_1=%3.3f\"%( (nlls[np.argmin(nlls)],beta_1_vals[np.argmin(nlls)])))\n",
        "\n",
        "# Plot the best model\n",
        "beta_1[0,0] = beta_1_vals[np.argmin(nlls)]\n",
        "model_out = shallow_nn(x_model, beta_0, omega_0, beta_1, omega_1)\n",
        "lambda_model = sigmoid(model_out)\n",
        "plot_binary_classification(x_model, model_out, lambda_model, x_train, y_train, title=\"beta_1[0]=%3.3f\"%(beta_1[0,0]))\n"
      ],
      "metadata": {
        "id": "aDEPhddNdN4u"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "They both give the same answer. But you can see from the likelihood above that the likelihood is very small unless the parameters are almost correct.  So in practice, we would work with the negative log likelihood.<br><br>\n",
        "\n",
        "Again, to fit the full neural model we would vary all of the 10 parameters of the network in the $\\boldsymbol\\beta_{0},\\boldsymbol\\Omega_{0},\\boldsymbol\\beta_{1},\\boldsymbol\\Omega_{1}$ until we find the combination that have the maximum likelihood / minimum negative log likelihood.<br><br>\n",
        "\n"
      ],
      "metadata": {
        "id": "771G8N1Vk5A2"
      }
    }
  ]
}