{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "collapsed_sections": [],
      "authorship_tag": "ABX9TyMrfCCH1ll/mV1nl3jPkkXz",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/udlbook/udlbook/blob/main/CM20315_Loss.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Loss functions\n",
        "\n",
        "In this practical, we'll investigate loss functions.  In part 1 (this notebook), we'll investigate univariate regression (where the output data $y$ is continuous.  Our formulation will be based on the normal/Gaussian distribution.\n",
        "\n",
        "We'll compute loss functions for maximum likelihood, minimum negative log likelihood, and least squares and show that they all imply that we should use the same parameter values\n",
        "\n",
        "In part II, we'll investigate binary classification (where the output data is 0 or 1).  This will be based on the Bernouilli distribution\n",
        "\n",
        "In part III we'll investigate multiclass classification (where the output data is 0,1, or, 2).  This will be based on the categorical distribution."
      ],
      "metadata": {
        "id": "jSlFkICHwHQF"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PYMZ1x-Pv1ht"
      },
      "outputs": [],
      "source": [
        "# Imports math library\n",
        "import numpy as np\n",
        "# Imports plotting library\n",
        "import matplotlib.pyplot as plt\n",
        "# Import math Library\n",
        "import math"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Define the Rectified Linear Unit (ReLU) function\n",
        "def ReLU(preactivation):\n",
        "  activation = preactivation.clip(0.0)\n",
        "  return activation\n",
        "\n",
        "# Define a shallow neural network\n",
        "def shallow_nn(x, beta_0, omega_0, beta_1, omaga_1):\n",
        "    # Make sure that input data is (1 x n_data) array\n",
        "    n_data = x.size\n",
        "    x = np.reshape(x,(1,n_data))\n",
        "\n",
        "    # This runs the network for ALL of the inputs, x at once so we can draw graph\n",
        "    h1 = ReLU(np.matmul(beta_0,np.ones((1,n_data))) + np.matmul(omega_0,x))\n",
        "    y = np.matmul(beta_1,np.ones((1,n_data))) + np.matmul(omega_1,h1)\n",
        "    return y"
      ],
      "metadata": {
        "id": "Fv7SZR3tv7mV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Utility function for plotting data\n",
        "def plot_univariate_regression(x_model, y_model, x_data = None, y_data = None, sigma_model = None, title= None):\n",
        "  # Make sure model data are 1D arrays\n",
        "  x_model = np.squeeze(x_model)\n",
        "  y_model = np.squeeze(y_model)\n",
        "\n",
        "  fig, ax = plt.subplots()\n",
        "  ax.plot(x_model,y_model)\n",
        "  if sigma_model is not None:\n",
        "    ax.fill_between(x_model, y_model-2*sigma_model, y_model+2*sigma_model, color='lightgray')\n",
        "  ax.set_xlabel('Input, $x$'); ax.set_ylabel('Output, $y$')\n",
        "  ax.set_xlim([0,1]);ax.set_ylim([-1,1])\n",
        "  ax.set_aspect(0.5)\n",
        "  if title is not None:\n",
        "    ax.set_title(title)\n",
        "  if x_data is not None:\n",
        "    ax.plot(x_data, y_data, 'ko')\n",
        "  plt.show()"
      ],
      "metadata": {
        "id": "NRR67ri_1TzN"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Univariate regression"
      ],
      "metadata": {
        "id": "PsgLZwsPxauP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Get parameters for model -- we can call this function to easily reset them\n",
        "def get_parameters():\n",
        "  # And we'll create a network that approximately fits it\n",
        "  beta_0 = np.zeros((3,1));  # formerly theta_x0\n",
        "  omega_0 = np.zeros((3,1)); # formerly theta_x1\n",
        "  beta_1 = np.zeros((1,1));  # formerly phi_0\n",
        "  omega_1 = np.zeros((1,3)); # formerly phi_x\n",
        "\n",
        "  beta_0[0,0] = 0.3; beta_0[1,0] = -1.0; beta_0[2,0] = -0.5\n",
        "  omega_0[0,0] = -1.0; omega_0[1,0] = 1.8; omega_0[2,0] = 0.65\n",
        "  beta_1[0,0] = 0.1;\n",
        "  omega_1[0,0] = -2.0; omega_1[0,1] = -1.0; omega_1[0,2] = 7.0\n",
        "\n",
        "  return beta_0, omega_0, beta_1, omega_1"
      ],
      "metadata": {
        "id": "pUT9Ain_HRim"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's create some 1D training data\n",
        "x_train = np.array([0.09291784,0.46809093,0.93089486,0.67612654,0.73441752,0.86847339,\\\n",
        "                   0.49873225,0.51083168,0.18343972,0.99380898,0.27840809,0.38028817,\\\n",
        "                   0.12055708,0.56715537,0.92005746,0.77072270,0.85278176,0.05315950,\\\n",
        "                   0.87168699,0.58858043])\n",
        "y_train = np.array([-0.25934537,0.18195445,0.651270150,0.13921448,0.09366691,0.30567674,\\\n",
        "                    0.372291170,0.20716968,-0.08131792,0.51187806,0.16943738,0.3994327,\\\n",
        "                    0.019062570,0.55820410,0.452564960,-0.1183121,0.02957665,-1.24354444, \\\n",
        "                    0.248038840,0.26824970])\n",
        "\n",
        "# Get parameters for the model\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "sigma = 0.2\n",
        "\n",
        "# Define a range of input values\n",
        "x_model = np.arange(0,1,0.01)\n",
        "# Run the model to get values to plot and plot it.\n",
        "y_model = shallow_nn(x_model, beta_0, omega_0, beta_1, omega_1)\n",
        "plot_univariate_regression(x_model, y_model, x_train, y_train, sigma_model = sigma)\n"
      ],
      "metadata": {
        "id": "VWzNOt1swFVd"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The blue line i sthe mean prediction of the model and the gray area represents plus/minus two standardard deviations.  This model fits okay, but could be improved. Let's compute the loss.  We'll compute the  the least squares error, the likelihood, the negative log likelihood."
      ],
      "metadata": {
        "id": "MvVX6tl9AEXF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Return probability under normal distribution for input x\n",
        "def normal_distribution(y, mu, sigma):\n",
        "    # TODO-- write in the equation for the normal distribution \n",
        "    # Equation 5.7 from the notes (you will need np.sqrt() and np.exp(), and math.pi)\n",
        "    # Don't use the numpy version -- that's cheating!\n",
        "    # Replace the line below\n",
        "    prob = np.zeros_like(y)\n",
        "    return prob"
      ],
      "metadata": {
        "id": "YaLdRlEX0FkU"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's double check we get the right answer before proceeding\n",
        "print(\"Correct answer = %3.3f, Your answer = %3.3f\"%(0.119,normal_distribution(1,-1,2.3)))"
      ],
      "metadata": {
        "id": "4TSL14dqHHbV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's plot the Gaussian distribution.\n",
        "y_gauss = np.arange(-5,5,0.1)\n",
        "mu = 0; sigma = 1.0\n",
        "gauss_prob = normal_distribution(y_gauss, mu, sigma)\n",
        "fig, ax = plt.subplots()\n",
        "ax.plot(y_gauss, gauss_prob)\n",
        "ax.set_xlabel('Input, $y$'); ax.set_ylabel('Probability $Pr(y)$')\n",
        "ax.set_xlim([-5,5]);ax.set_ylim([0,1.0])\n",
        "plt.show()\n",
        "\n",
        "# TODO \n",
        "# 1. Predict what will happen if we change to mu=1 and leave sigma=1\n",
        "# Answer:\n",
        "# Now change the code above and see if you were correct.\n",
        "# 2. Predict what will happen if we leave mu = 0 and change sigma to 2.0\n",
        "# Answer:\n",
        "# 3. Predict what will happen if we leave mu = 0 and change sigma to 0.5\n",
        "# Answer:"
      ],
      "metadata": {
        "id": "A2HcmNfUMIlj"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's compute the likelihood using this function"
      ],
      "metadata": {
        "id": "R5z_0dzQMF35"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Return the likelihood of all of the data under the model\n",
        "def compute_likelihood(y_train, mu, sigma):\n",
        "  # TODO -- compute the likelihood of the data -- the product of the normal probabilities for each data point\n",
        "  # Top line of equation 5.3 in the notes\n",
        "  # You will need np.prod() and the normal_distribution function you used above\n",
        "  # Replace the line below\n",
        "  likelihood = 0\n",
        "  return likelihood"
      ],
      "metadata": {
        "id": "zpS7o6liCx7f"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's test this for a homoscedastic (constant sigma) model\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "# Use our neural network to predict the mean of the Gaussian\n",
        "mu_pred = shallow_nn(x_train, beta_0, omega_0, beta_1, omega_1)\n",
        "# Set the standard devation to something reasonable\n",
        "sigma = 0.2\n",
        "# Compute the likelihood\n",
        "likelihood = compute_likelihood(y_train, mu_pred, sigma)\n",
        "# Let's double check we get the right answer before proceeding\n",
        "print(\"Correct answer = %9.9f, Your answer = %9.9f\"%(0.000010624,likelihood))"
      ],
      "metadata": {
        "id": "1hQxBLoVNlr2"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "You can see that this gives a very small answer, even for this small 1D dataset, and with the model fitting quite well.  This is because it is the product of sveral probabilities, which are all quite small themselves.\n",
        "This will get out of hand pretty quickly with real datasets -- the likelihood will get so small that we can't represent it with normal finite-precision math\n",
        "\n",
        "This is why we use negative log likelihood"
      ],
      "metadata": {
        "id": "HzphKgPfOvlk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Return the negative log likelihood of the data under the model\n",
        "def compute_negative_log_likelihood(y_train, mu, sigma):\n",
        "  # TODO -- compute the likelihood of the data -- don't use the likelihood function above -- compute the negative sum of the log probabilities\n",
        "  # Bottom line of equation 5.3 in the notes\n",
        "  # You will need np.sum(), np.log()\n",
        "  # Replace the line below\n",
        "  nll = 0\n",
        "  return nll"
      ],
      "metadata": {
        "id": "dsT0CWiKBmTV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's test this for a homoscedastic (constant sigma) model\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "# Use our neural network to predict the mean of the Gaussian\n",
        "mu_pred = shallow_nn(x_train, beta_0, omega_0, beta_1, omega_1)\n",
        "# Set the standard devation to something reasonable\n",
        "sigma = 0.2\n",
        "# Compute the log likelihood\n",
        "nll = compute_negative_log_likelihood(y_train, mu_pred, sigma)\n",
        "# Let's double check we get the right answer before proceeding\n",
        "print(\"Correct answer = %9.9f, Your answer = %9.9f\"%(11.452419564,nll))"
      ],
      "metadata": {
        "id": "nVxUXg9rQmwI"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "For good measure, let's compute the sum of squares as well"
      ],
      "metadata": {
        "id": "-S8bXApoWVLG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Return the squared distance between the predicted \n",
        "def compute_sum_of_squares(y_train, y_pred):\n",
        "  # TODO -- compute the sum of squared distances between the training data and the model prediction\n",
        "  # Eqn 5.10 in the notes.  Make sure that you understand this, and ask questions if you don't\n",
        "  # Replace the line below\n",
        "  sum_of_squares = 0;\n",
        "  return sum_of_squares"
      ],
      "metadata": {
        "id": "I1pjFdHCF4JZ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's test this again\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "# Use our neural network to predict the mean of the Gaussian\n",
        "y_pred = shallow_nn(x_train, beta_0, omega_0, beta_1, omega_1)\n",
        "# Compute the log likelihood\n",
        "sum_of_squares = compute_sum_of_squares(y_train, y_pred)\n",
        "# Let's double check we get the right answer before proceeding\n",
        "print(\"Correct answer = %9.9f, Your answer = %9.9f\"%(2.020992572,sum_of_squares))"
      ],
      "metadata": {
        "id": "2C40fskIHBx7"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's investigate finding the maximum likelihood / minimum log likelihood / least squares solution.  For simplicity, we'll assume that all the parameters are correct except one and look at how the likelihood, log likelihood, and sum of squares change as we manipulate the last parameter.  We'll start with overall y offset, beta_1 (formerly phi_0)"
      ],
      "metadata": {
        "id": "OgcRojvPWh4V"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Define a range of values for the parameter\n",
        "beta_1_vals = np.arange(0,1.0,0.01)\n",
        "# Create some arrays to store the likelihoods, negative log likehoos and sum of squares\n",
        "likelihoods = np.zeros_like(beta_1_vals)\n",
        "nlls = np.zeros_like(beta_1_vals)\n",
        "sum_squares = np.zeros_like(beta_1_vals)\n",
        "\n",
        "# Initialise the parameters\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "sigma = 0.2\n",
        "for count in range(len(beta_1_vals)):\n",
        "  # Set the value for the parameter\n",
        "  beta_1[0,0] = beta_1_vals[count]\n",
        "  # Run the network with new parameters\n",
        "  mu_pred = y_pred = shallow_nn(x_train, beta_0, omega_0, beta_1, omega_1)\n",
        "  # Compute and store the three values\n",
        "  likelihoods[count] = compute_likelihood(y_train, mu_pred, sigma)\n",
        "  nlls[count] = compute_negative_log_likelihood(y_train, mu_pred, sigma)\n",
        "  sum_squares[count] = compute_sum_of_squares(y_train, y_pred)\n",
        "  # Draw the model for every 20th parameter setting\n",
        "  if count % 20 == 0:\n",
        "    # Run the model to get values to plot and plot it.\n",
        "    y_model = shallow_nn(x_model, beta_0, omega_0, beta_1, omega_1)\n",
        "    plot_univariate_regression(x_model, y_model, x_train, y_train, sigma_model = sigma, title=\"beta1[0]=%3.3f\"%(beta_1[0,0]))\n"
      ],
      "metadata": {
        "id": "pFKtDaAeVU4U"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Now let's plot the likelihood, negative log likelihood, and least squares as a function the value of the offset beta1\n",
        "fig, ax = plt.subplots(1,3)\n",
        "fig.set_size_inches(10.5, 3.5)\n",
        "fig.tight_layout(pad=3.0)\n",
        "ax[0].plot(beta_1_vals, likelihoods); ax[0].set_xlabel('beta_1[0]'); ax[0].set_ylabel('likelihood')\n",
        "ax[1].plot(beta_1_vals, nlls); ax[1].set_xlabel('beta_1[0]'); ax[1].set_ylabel('negative log likelihood')\n",
        "ax[2].plot(beta_1_vals, sum_squares); ax[2].set_xlabel('beta_1[0]'); ax[2].set_ylabel('sum of squares')\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "UHXeTa9MagO6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Hopefully, you can see that the maximum of the likelihood fn is at the same position as the minimum negative log likelihood\n",
        "# and the least squares solutions\n",
        "# Let's check that:\n",
        "print(\"Maximum likelihood = %3.3f, at beta_1=%3.3f\"%( (likelihoods[np.argmax(likelihoods)],beta_1_vals[np.argmax(likelihoods)])))\n",
        "print(\"Minimum negative log likelihood = %3.3f, at beta_1=%3.3f\"%( (nlls[np.argmin(nlls)],beta_1_vals[np.argmin(nlls)])))\n",
        "print(\"Least squares = %3.3f, at beta_1=%3.3f\"%( (sum_squares[np.argmin(sum_squares)],beta_1_vals[np.argmin(sum_squares)])))\n",
        "\n",
        "# Plot the best model\n",
        "beta_1[0,0] = beta_1_vals[np.argmin(sum_squares)]\n",
        "y_model = shallow_nn(x_model, beta_0, omega_0, beta_1, omega_1)\n",
        "plot_univariate_regression(x_model, y_model, x_train, y_train, sigma_model = sigma, title=\"beta1=%3.3f\"%(beta_1[0,0]))"
      ],
      "metadata": {
        "id": "aDEPhddNdN4u"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "They all give the same answer. But you can see from the three plots above that the likelihood is very small unless the parameters are almost correct.  So in practice, we would work with the negative log likelihood or the least squares.<br><br>\n",
        "\n",
        "For fun, let's do the same thing with the standard deviation parameter of our network.  This is not an output of the network (unless we choose to make that the case), but it still affects the likelihood.\n",
        "\n"
      ],
      "metadata": {
        "id": "771G8N1Vk5A2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Define a range of values for the parameter\n",
        "sigma_vals = np.arange(0.1,0.5,0.005)\n",
        "# Create some arrays to store the likelihoods, negative log likehoos and sum of squares\n",
        "likelihoods = np.zeros_like(sigma_vals)\n",
        "nlls = np.zeros_like(sigma_vals)\n",
        "sum_squares = np.zeros_like(sigma_vals)\n",
        "\n",
        "# Initialise the parameters\n",
        "beta_0, omega_0, beta_1, omega_1 = get_parameters()\n",
        "# Might as well set to the best offset\n",
        "beta_1[0,0] = 0.27\n",
        "for count in range(len(sigma_vals)):\n",
        "  # Set the value for the parameter\n",
        "  sigma = sigma_vals[count]\n",
        "  # Run the network with new parameters\n",
        "  mu_pred = y_pred = shallow_nn(x_train, beta_0, omega_0, beta_1, omega_1)\n",
        "  # Compute and store the three values\n",
        "  likelihoods[count] = compute_likelihood(y_train, mu_pred, sigma)\n",
        "  nlls[count] = compute_negative_log_likelihood(y_train, mu_pred, sigma)\n",
        "  sum_squares[count] = compute_sum_of_squares(y_train, y_pred)\n",
        "  # Draw the model for every 20th parameter setting\n",
        "  if count % 20 == 0:\n",
        "    # Run the model to get values to plot and plot it.\n",
        "    y_model = shallow_nn(x_model, beta_0, omega_0, beta_1, omega_1)\n",
        "    plot_univariate_regression(x_model, y_model, x_train, y_train, sigma_model=sigma, title=\"sigma=%3.3f\"%(sigma))"
      ],
      "metadata": {
        "id": "dMNAr0R8gg82"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Now let's plot the likelihood, negative log likelihood, and least squares as a function the value of the offset beta1\n",
        "fig, ax = plt.subplots(1,3)\n",
        "fig.set_size_inches(10.5, 3.5)\n",
        "fig.tight_layout(pad=3.0)\n",
        "ax[0].plot(sigma_vals, likelihoods); ax[0].set_xlabel('$\\sigma$'); ax[0].set_ylabel('likelihood')\n",
        "ax[1].plot(sigma_vals, nlls); ax[1].set_xlabel('$\\sigma$'); ax[1].set_ylabel('negative log likelihood')\n",
        "ax[2].plot(sigma_vals, sum_squares); ax[2].set_xlabel('$\\sigma$'); ax[2].set_ylabel('sum of squares')\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "l9jduyHLDAZC"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Hopefully, you can see that the maximum of the likelihood fn is at the same position as the minimum negative log likelihood\n",
        "# The least squares solution does not depend on sigma, so it's just flat -- no use here.\n",
        "# Let's check that:\n",
        "print(\"Maximum likelihood = %3.3f, at beta_1=%3.3f\"%( (likelihoods[np.argmax(likelihoods)],sigma_vals[np.argmax(likelihoods)])))\n",
        "print(\"Minimum negative log likelihood = %3.3f, at beta_1=%3.3f\"%( (nlls[np.argmin(nlls)],sigma_vals[np.argmin(nlls)])))\n",
        "# Plot the best model\n",
        "sigma= sigma_vals[np.argmin(nlls)]\n",
        "y_model = shallow_nn(x_model, beta_0, omega_0, beta_1, omega_1)\n",
        "plot_univariate_regression(x_model, y_model, x_train, y_train, sigma_model = sigma, title=\"beta1=%3.3f, sigma =%3.3f\"%(beta_1[0,0],sigma))"
      ],
      "metadata": {
        "id": "XH7yER52Dxt5"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Obviously, to fit the full neural model we would vary all of the 10 parameters of the network in the $\\boldsymbol\\beta_{0},\\boldsymbol\\omega_{0},\\boldsymbol\\beta_{1},\\boldsymbol\\omega_{1}$ (and maybe $\\sigma$) until we find the combination that have the maximum likelihood / minimum negative log likelihood / least squares.<br><br>\n",
        "\n",
        "Here we just varied one at a time as it is easier to see what is going on.  This is known as **coordinate descent**.\n"
      ],
      "metadata": {
        "id": "q_KeGNAHEbIt"
      }
    }
  ]
}