Commit

Update course book

actions-user committed Apr 11, 2024
1 parent 746a43b commit 562dda7
Showing 60 changed files with 1,473 additions and 1,345 deletions.
@@ -8,7 +8,7 @@
"id": "view-in-github"
},
"source": [
"<a href=\"https://colab.research.google.com/github/wangshaonan/course-content-dl/blob/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/student/W3D1_Tutorial2.ipynb\" target=\"_blank\"><img alt=\"Open In Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a>   <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/student/W3D1_Tutorial2.ipynb\" target=\"_blank\"><img alt=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"/></a>"
"<a href=\"https://colab.research.google.com/github/NeuromatchAcademy/course-content-dl/blob/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/student/W3D1_Tutorial2.ipynb\" target=\"_blank\"><img alt=\"Open In Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a>   <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/student/W3D1_Tutorial2.ipynb\" target=\"_blank\"><img alt=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"/></a>"
]
},
{
@@ -23,7 +23,7 @@
"\n",
"**By Neuromatch Academy**\n",
"\n",
"__Content creators:__ Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang\n",
"__Content creators:__ Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang, Alish Dipani\n",
"\n",
"__Content reviewers:__ Shaonan Wang, Weizhe Yuan, Dalia Nasr, Stephen Kiilu, Alish Dipani, Dora Zhiyu Yang, Adrita Das\n",
"\n",
@@ -378,7 +378,7 @@
"\n",
"In classical transformer systems, a core principle is encoding and decoding. We can encode an input sequence as a vector (that implicitly codes what we just read). And we can then take this vector and decode it, e.g., as a new sentence. So a sequence-to-sequence (e.g., sentence translation) system may read a sentence (made out of words embedded in a relevant space) and encode it as an overall vector. It then takes the resulting encoding of the sentence and decodes it into a translated sentence.\n",
"\n",
"In modern transformer systems, such as GPT, all words are used in parallel. In that sense, the transformers generalize the encoding/decoding idea. Examples of this strategy include all the modern large language models (such as GPT)."
"In modern transformer systems, such as GPT, all words are used parallelly. In that sense, the transformers generalize the encoding/decoding idea. Examples of this strategy include all the modern large language models (such as GPT)."
]
},
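A minimal sketch of this encode/decode idea using PyTorch's nn.Transformer (an illustration added for reference, not a cell from the notebook; the shapes and hyperparameters are arbitrary assumptions):

# Sketch: encode a source sequence, decode a target sequence.
import torch
import torch.nn as nn

d_model, nhead = 32, 4
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=2, num_decoder_layers=2)

# Ten source tokens and seven target tokens, batch size 1, already embedded
# into d_model-dimensional vectors (shape: seq_len, batch, d_model).
src = torch.rand(10, 1, d_model)  # the sentence we just read
tgt = torch.rand(7, 1, d_model)   # the translation produced so far
out = model(src, tgt)             # decoder states, one vector per target position
print(out.shape)                  # torch.Size([7, 1, 32])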
{
@@ -601,7 +601,6 @@
},
"outputs": [],
"source": [
"# Try playing with these hyperparameters!\n",
"VOCAB_SIZE = 12_000"
]
},
@@ -677,6 +676,15 @@
"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"**Note:** In practice, it is not necessary to use pre-tokenizers, but we use it for demonstration purposes. For instance, \"2-3\" is not the same as \"23\", so removing punctuation or splitting up digits or punctuation is a bad idea! Moreover, the current tokenizer is powerful enough to deal with punctuation."
]
},
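As a quick illustration of this point, here is a small sketch using the pre_tokenizers module of the Hugging Face tokenizers library (the example string is our own, not from the notebook):

# Sketch: how different pre-tokenizers split the same string.
from tokenizers.pre_tokenizers import Whitespace, Digits

text = "Items 2-3 cost 23 dollars"

# Splits on whitespace and punctuation: "2-3" becomes "2", "-", "3", while "23" stays whole.
print(Whitespace().pre_tokenize_str(text))

# Isolates every digit as its own piece: "23" becomes "2", "3".
print(Digits(individual_digits=True).pre_tokenize_str(text))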
{
"cell_type": "markdown",
"metadata": {
@@ -708,6 +716,26 @@
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Special Tokens\n",
"\n",
"Tokenizers often have special tokens representing certain concepts such as:\n",
"* [PAD]: Added to the end of shorter input sequences to ensure equal input length for the whole batch\n",
"* [START]: Start of the sequence\n",
"* [END]: End of the sequence\n",
"* [UNK]: Unknown characters not present in the vocabulary\n",
"* [BOS]: Beginning of sentence\n",
"* [EOS]: End of sentence\n",
"* [SEP]: Separation between two sentences in a sequence\n",
"* [CLS]: Token used for classification tasks to represent the whole sequence\n",
"* [MASK]: Used in pre-training phase for masked language modeling tasks in models like BERT"
]
},
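A small sketch of how such special tokens look in practice, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint (not necessarily the tokenizer trained in this tutorial):

# Sketch: inspect the special tokens of a pretrained tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.special_tokens_map)
# {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
#  'cls_token': '[CLS]', 'mask_token': '[MASK]'}

# Encoding a sentence pair inserts [CLS] and [SEP] automatically, roughly:
ids = tok("I am hungry.", "Let's eat!")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['[CLS]', 'i', 'am', 'hungry', '.', '[SEP]', 'let', "'", 's', 'eat', '!', '[SEP]']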
{
"cell_type": "markdown",
"metadata": {
@@ -794,50 +822,7 @@
"execution": {}
},
"source": [
"### Think 2.1! Is it a good idea to do pre_tokenizers?"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/solutions/W3D1_Tutorial2_Solution_802b4f3d.py)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Is_it_a_good_idea_to_do_pre_tokenizers_Discussion\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Think 2.2! Tokenizer good practices\n",
"### Think 2.1! Tokenizer good practices\n",
"\n",
"We established that the tokenizer is a better move than the One-Hot-Encoder because it can handle out-of-vocabulary words. But what if we just made a one-hot encoding where the vocabulary is all possible two-character combinations? Would there still be an advantage to the tokenizer?\n",
"\n",
@@ -884,7 +869,7 @@
"execution": {}
},
"source": [
"### Think 2.3: Chinese and English tokenizer\n",
"### Think 2.2: Chinese and English tokenizer\n",
"\n",
"Let's think about a language like Chinese, where words are each composed of a relatively fewer number of characters compared to English (`hungry` is six unicode characters, but `饿` is one unicode character), but there are many more unique Chinese characters than there are letters in the English alphabet.\n",
"\n",
@@ -1487,7 +1472,7 @@
"execution": {}
},
"source": [
"### Coding Exercise 4.1: Implement the code to fine-tune the model\n",
"### Implement the code to fine-tune the model\n",
"\n",
"Here are the big pieces of what we do below:\n",
"\n",
@@ -1538,7 +1523,15 @@
" tokenizer=tokenizer, mlm=False,\n",
")\n",
"\n",
"trainer = ..."
"# Trainer:\n",
"trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=encoded_dataset,\n",
" tokenizer=tokenizer,\n",
" compute_metrics=compute_metrics,\n",
" data_collator=data_collator,\n",
")"
]
},
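The cell above assumes that training_args, encoded_dataset, and compute_metrics were defined in earlier cells of the notebook. For reference, a minimal training_args could look like the sketch below (the values are illustrative assumptions, not the tutorial's settings):

# Sketch: a minimal TrainingArguments object for a short fine-tuning run.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints and logs are written
    num_train_epochs=1,              # keep the demo short
    per_device_train_batch_size=8,   # adjust to fit the available GPU memory
    logging_steps=50,                # how often to report the training loss
)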
{
@@ -1549,18 +1542,19 @@
},
"outputs": [],
"source": [
"trainer = ..."
"# Run the actual training:\n",
"trainer.train()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/solutions/W3D1_Tutorial2_Solution_b453433d.py)\n",
"\n"
"### Coding Exercise 4.1: Implement the code to generate text after fine-tuning.\n",
"\n",
"To generate text, we provide input tokens to the model, let it generate the next token and append it into the input tokens. Now, keep repeating this process until you reach the desired output length."
]
},
{
@@ -1571,17 +1565,64 @@
},
"outputs": [],
"source": [
"# Run the actual training:\n",
"trainer.train()"
"# Number of tokens to generate\n",
"num_tokens = 100\n",
"\n",
"# Move the model to the CPU for inference\n",
"model.to(\"cpu\")\n",
"\n",
"# Print input prompt\n",
"print(f'Input prompt: \\n{input_prompt}')\n",
"\n",
"#################################################\n",
"# Implement a the correct tokens and outputs\n",
"raise NotImplementedError(\"Text Generation\")\n",
"#################################################\n",
"\n",
"# Encode the input prompt\n",
"# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n",
"input_tokens = ...\n",
"\n",
"# Turn off storing gradients\n",
"with torch.no_grad():\n",
" # Keep iterating until num_tokens are generated\n",
" for tkn_idx in tqdm(range(num_tokens)):\n",
" # Forward pass through the model\n",
" # The model expects the tensor to be of Long or Int dtype\n",
" output = ...\n",
" # Get output logits\n",
" logits = output.logits[-1, :]\n",
" # Convert into probabilities\n",
" probs = nn.functional.softmax(logits, dim=-1)\n",
" # Get the index of top token\n",
" top_token = ...\n",
" # Append the token into the input sequence\n",
" input_tokens.append(top_token)\n",
"\n",
"# Decode and print the generated text\n",
"# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n",
"decoded_text = ...\n",
"print(f'Generated text: \\n{decoded_text}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"Finally, we will try our model on the same code snippet to see how it performs after fine-tuning:"
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/solutions/W3D1_Tutorial2_Solution_0f765585.py)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"We can also directly generate text using the generation_pipeline:"
]
},
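For reference, a pipeline-based generation call typically looks something like the sketch below (assuming model, tokenizer, and input_prompt from the cells above; this is an assumed example, not necessarily the notebook's exact cell):

# Sketch: text generation via the transformers pipeline API.
from transformers import pipeline

generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = generation_pipeline(input_prompt, max_new_tokens=100, do_sample=False)
print(outputs[0]["generated_text"])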
{
@@ -1801,9 +1842,7 @@
"source": [
"## Play around with LLMs\n",
"\n",
"1. Try using LLMs' API to do tasks, such as utilizing the GPT-2 API to extend text from a provided context. To achieve this, ensure you have a HuggingFace account and secure an API token.\n",
"\n",
"\n"
"1. Try using LLMs' API to do tasks, such as utilizing the GPT-2 API to extend text from a provided context. To achieve this, ensure you have a HuggingFace account and secure an API token."
]
},
{
@@ -1817,10 +1856,10 @@
"import requests\n",
"\n",
"def query(payload, model_id, api_token):\n",
" headers = {\"Authorization\": f\"Bearer {api_token}\"}\n",
" API_URL = f\"https://api-inference.huggingface.co/models/{model_id}\"\n",
" response = requests.post(API_URL, headers=headers, json=payload)\n",
" return response.json()\n",
" headers = {\"Authorization\": f\"Bearer {api_token}\"}\n",
" API_URL = f\"https://api-inference.huggingface.co/models/{model_id}\"\n",
" response = requests.post(API_URL, headers=headers, json=payload)\n",
" return response.json()\\\n",
"\n",
"model_id = \"gpt2\"\n",
"api_token = \"hf_****\" # get yours at hf.co/settings/tokens\n",
4 changes: 2 additions & 2 deletions projects/ComputerVision/data_augmentation.html
@@ -1763,8 +1763,8 @@ <h2>Cutout<a class="headerlink" href="#cutout" title="Permalink to this heading"
<section id="mixup">
<h2>Mixup<a class="headerlink" href="#mixup" title="Permalink to this heading">#</a></h2>
<p>Mixup is a data augmentation technique that combines pairs of examples via a convex combination of the images and the labels. Given images <span class="math notranslate nohighlight">\(x_i\)</span> and <span class="math notranslate nohighlight">\(x_j\)</span> with labels <span class="math notranslate nohighlight">\(y_i\)</span> and <span class="math notranslate nohighlight">\(y_j\)</span>, respectively, and <span class="math notranslate nohighlight">\(\lambda \in [0, 1]\)</span>, mixup creates a new image <span class="math notranslate nohighlight">\(\hat{x}\)</span> with label <span class="math notranslate nohighlight">\(\hat{y}\)</span> the following way:</p>
<div class="amsmath math notranslate nohighlight" id="equation-44e940bd-c292-4eed-bcf0-b364c176cb30">
<span class="eqno">(128)<a class="headerlink" href="#equation-44e940bd-c292-4eed-bcf0-b364c176cb30" title="Permalink to this equation">#</a></span>\[\begin{align}
<div class="amsmath math notranslate nohighlight" id="equation-a4a0abc1-1bbd-472d-8c3f-025daccdd721">
<span class="eqno">(128)<a class="headerlink" href="#equation-a4a0abc1-1bbd-472d-8c3f-025daccdd721" title="Permalink to this equation">#</a></span>\[\begin{align}
\hat{x} &amp;= \lambda x_i + (1 - \lambda) x_j \\
\hat{y} &amp;= \lambda y_i + (1 - \lambda) y_j
\end{align}\]</div>
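In code, this convex combination is only a couple of lines; here is a minimal sketch in PyTorch, with lambda drawn from a Beta distribution as in the original mixup paper (the alpha value and tensor shapes are illustrative assumptions):

# Sketch: mixup on a single pair of examples.
import torch

def mixup(x_i, y_i, x_j, y_j, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()  # lambda in [0, 1]
    x_hat = lam * x_i + (1 - lam) * x_j   # blend the images
    y_hat = lam * y_i + (1 - lam) * y_j   # blend the (one-hot) labels
    return x_hat, y_hat

x_hat, y_hat = mixup(torch.rand(3, 32, 32), torch.tensor([1., 0.]),
                     torch.rand(3, 32, 32), torch.tensor([0., 1.]))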