{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "authorship_tag": "ABX9TyP0/KodWM9Dtr2x+8MdXXH1", "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "# **Notebook 12.3: Tokenization**\n", "\n", "This notebook builds a set of tokens from a text string, as in figure 12.8 of the book.\n", "\n", "Work through the cells below, running each cell in turn. In various places you will see the words \"TO DO\". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.\n", "\n", "I adapted this code from *SOMEWHERE*. If anyone recognizes it, please let me know, and I will give proper attribution or rewrite it if the license is not permissive.\n", "\n", "Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions.\n", "\n" ], "metadata": { "id": "t9vk9Elugvmi" } }, { "cell_type": "code", "source": [ "import re, collections" ], "metadata": { "id": "3_WkaFO3OfLi" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "text = \"a sailor went to sea sea sea \"+\\\n", "       \"to see what he could see see see \"+\\\n", "       \"but all that he could see see see \"+\\\n", "       \"was the bottom of the deep blue sea sea sea\"" ], "metadata": { "id": "tVZVuauIXmJk" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Tokenize the input sentence. To begin with, the tokens are the individual letters and the whitespace token. So, we represent each word in terms of these tokens, with spaces between the tokens to delineate them.\n", "\n", "The tokenized text is stored in a structure that represents each word as tokens together with the count of how often that word occurs. 
We'll call this the *vocabulary*." ], "metadata": { "id": "fF2RBrouWV5w" } }, { "cell_type": "code", "source": [ "def initialize_vocabulary(text):\n", "    vocab = collections.defaultdict(int)\n", "    words = text.strip().split()\n", "    for word in words:\n", "        vocab[' '.join(list(word)) + ' </w>'] += 1\n", "    return vocab" ], "metadata": { "id": "OfvXkLSARk4_" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "vocab = initialize_vocabulary(text)\n", "print('Vocabulary: {}'.format(vocab))\n", "print('Size of vocabulary: {}'.format(len(vocab)))" ], "metadata": { "id": "aydmNqaoOpSm" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Find all the tokens in the current vocabulary and their frequencies." ], "metadata": { "id": "fJAiCjphWsI9" } }, { "cell_type": "code", "source": [ "def get_tokens_and_frequencies(vocab):\n", "    tokens = collections.defaultdict(int)\n", "    for word, freq in vocab.items():\n", "        word_tokens = word.split()\n", "        for token in word_tokens:\n", "            tokens[token] += freq\n", "    return tokens" ], "metadata": { "id": "qYi6F_K3RYsW" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "tokens = get_tokens_and_frequencies(vocab)\n", "print('Tokens: {}'.format(tokens))\n", "print('Number of tokens: {}'.format(len(tokens)))" ], "metadata": { "id": "Y4LCVGnvXIwp" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Find each pair of adjacent tokens in the vocabulary\n", "and count them. We will subsequently merge the most frequently occurring pair." 
], "metadata": { "id": "_-Rh1mD_Ww3b" } }, { "cell_type": "code", "source": [ "def get_pairs_and_counts(vocab):\n", "    pairs = collections.defaultdict(int)\n", "    for word, freq in vocab.items():\n", "        symbols = word.split()\n", "        for i in range(len(symbols)-1):\n", "            pairs[symbols[i],symbols[i+1]] += freq\n", "    return pairs" ], "metadata": { "id": "OqJTB3UFYubH" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "pairs = get_pairs_and_counts(vocab)\n", "print('Pairs: {}'.format(pairs))\n", "print('Number of distinct pairs: {}'.format(len(pairs)))\n", "\n", "most_frequent_pair = max(pairs, key=pairs.get)\n", "print('Most frequent pair: {}'.format(most_frequent_pair))" ], "metadata": { "id": "d-zm0JBcZSjS" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Merge all instances of the most frequent pair in the vocabulary." ], "metadata": { "id": "pcborzqIXQFS" } }, { "cell_type": "code", "source": [ "def merge_pair_in_vocabulary(pair, vocab_in):\n", "    vocab_out = {}\n", "    bigram = re.escape(' '.join(pair))\n", "    p = re.compile(r'(?