Commit

Update course book

actions-user committed Apr 11, 2024
1 parent 746a43b commit 562dda7
Showing 60 changed files with 1,473 additions and 1,345 deletions.
@@ -8,7 +8,7 @@
"id": "view-in-github"
},
"source": [
"<a href=\"https://colab.research.google.com/github/wangshaonan/course-content-dl/blob/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/student/W3D1_Tutorial2.ipynb\" target=\"_blank\"><img alt=\"Open In Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a>   <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/student/W3D1_Tutorial2.ipynb\" target=\"_blank\"><img alt=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"/></a>"
"<a href=\"https://colab.research.google.com/github/NeuromatchAcademy/course-content-dl/blob/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/student/W3D1_Tutorial2.ipynb\" target=\"_blank\"><img alt=\"Open In Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg\"/></a>   <a href=\"https://kaggle.com/kernels/welcome?src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content-dl/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/student/W3D1_Tutorial2.ipynb\" target=\"_blank\"><img alt=\"Open in Kaggle\" src=\"https://kaggle.com/static/images/open-in-kaggle.svg\"/></a>"
]
},
{
@@ -23,7 +23,7 @@
"\n",
"**By Neuromatch Academy**\n",
"\n",
"__Content creators:__ Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang\n",
"__Content creators:__ Lyle Ungar, Jordan Matelsky, Konrad Kording, Shaonan Wang, Alish Dipani\n",
"\n",
"__Content reviewers:__ Shaonan Wang, Weizhe Yuan, Dalia Nasr, Stephen Kiilu, Alish Dipani, Dora Zhiyu Yang, Adrita Das\n",
"\n",
@@ -378,7 +378,7 @@
"\n",
"In classical transformer systems, a core principle is encoding and decoding. We can encode an input sequence as a vector (that implicitly codes what we just read). And we can then take this vector and decode it, e.g., as a new sentence. So a sequence-to-sequence (e.g., sentence translation) system may read a sentence (made out of words embedded in a relevant space) and encode it as an overall vector. It then takes the resulting encoding of the sentence and decodes it into a translated sentence.\n",
"\n",
"In modern transformer systems, such as GPT, all words are used in parallel. In that sense, the transformers generalize the encoding/decoding idea. Examples of this strategy include all the modern large language models (such as GPT)."
"In modern transformer systems, such as GPT, all words are used parallelly. In that sense, the transformers generalize the encoding/decoding idea. Examples of this strategy include all the modern large language models (such as GPT)."
]
},
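A minimal sketch of this encode/decode idea using PyTorch's nn.Transformer (an illustration added for reference, not a cell from the notebook; the shapes and hyperparameters are arbitrary assumptions):

# Sketch: encode a source sequence, decode a target sequence.
import torch
import torch.nn as nn

d_model, nhead = 32, 4
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=2, num_decoder_layers=2)

# Ten source tokens and seven target tokens, batch size 1, already embedded
# into d_model-dimensional vectors (shape: seq_len, batch, d_model).
src = torch.rand(10, 1, d_model)  # the sentence we just read
tgt = torch.rand(7, 1, d_model)   # the translation produced so far
out = model(src, tgt)             # decoder states, one vector per target position
print(out.shape)                  # torch.Size([7, 1, 32])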
{
@@ -601,7 +601,6 @@
},
"outputs": [],
"source": [
"# Try playing with these hyperparameters!\n",
"VOCAB_SIZE = 12_000"
]
},
@@ -677,6 +676,15 @@
"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"**Note:** In practice, it is not necessary to use pre-tokenizers, but we use it for demonstration purposes. For instance, \"2-3\" is not the same as \"23\", so removing punctuation or splitting up digits or punctuation is a bad idea! Moreover, the current tokenizer is powerful enough to deal with punctuation."
]
},
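As a quick illustration of this point, here is a small sketch using the pre_tokenizers module of the Hugging Face tokenizers library (the example string is our own, not from the notebook):

# Sketch: how different pre-tokenizers split the same string.
from tokenizers.pre_tokenizers import Whitespace, Digits

text = "Items 2-3 cost 23 dollars"

# Splits on whitespace and punctuation: "2-3" becomes "2", "-", "3", while "23" stays whole.
print(Whitespace().pre_tokenize_str(text))

# Isolates every digit as its own piece: "23" becomes "2", "3".
print(Digits(individual_digits=True).pre_tokenize_str(text))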
{
"cell_type": "markdown",
"metadata": {
@@ -708,6 +716,26 @@
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Special Tokens\n",
"\n",
"Tokenizers often have special tokens representing certain concepts such as:\n",
"* [PAD]: Added to the end of shorter input sequences to ensure equal input length for the whole batch\n",
"* [START]: Start of the sequence\n",
"* [END]: End of the sequence\n",
"* [UNK]: Unknown characters not present in the vocabulary\n",
"* [BOS]: Beginning of sentence\n",
"* [EOS]: End of sentence\n",
"* [SEP]: Separation between two sentences in a sequence\n",
"* [CLS]: Token used for classification tasks to represent the whole sequence\n",
"* [MASK]: Used in pre-training phase for masked language modeling tasks in models like BERT"
]
},
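A small sketch of how such special tokens look in practice, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint (not necessarily the tokenizer trained in this tutorial):

# Sketch: inspect the special tokens of a pretrained tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.special_tokens_map)
# {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
#  'cls_token': '[CLS]', 'mask_token': '[MASK]'}

# Encoding a sentence pair inserts [CLS] and [SEP] automatically, roughly:
ids = tok("I am hungry.", "Let's eat!")["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# ['[CLS]', 'i', 'am', 'hungry', '.', '[SEP]', 'let', "'", 's', 'eat', '!', '[SEP]']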
{
"cell_type": "markdown",
"metadata": {
@@ -794,50 +822,7 @@
"execution": {}
},
"source": [
"### Think 2.1! Is it a good idea to do pre_tokenizers?"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/solutions/W3D1_Tutorial2_Solution_802b4f3d.py)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Submit your feedback\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"execution": {},
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"# @title Submit your feedback\n",
"content_review(f\"{feedback_prefix}_Is_it_a_good_idea_to_do_pre_tokenizers_Discussion\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"### Think 2.2! Tokenizer good practices\n",
"### Think 2.1! Tokenizer good practices\n",
"\n",
"We established that the tokenizer is a better move than the One-Hot-Encoder because it can handle out-of-vocabulary words. But what if we just made a one-hot encoding where the vocabulary is all possible two-character combinations? Would there still be an advantage to the tokenizer?\n",
"\n",
@@ -884,7 +869,7 @@
"execution": {}
},
"source": [
"### Think 2.3: Chinese and English tokenizer\n",
"### Think 2.2: Chinese and English tokenizer\n",
"\n",
"Let's think about a language like Chinese, where words are each composed of a relatively fewer number of characters compared to English (`hungry` is six unicode characters, but `饿` is one unicode character), but there are many more unique Chinese characters than there are letters in the English alphabet.\n",
"\n",
@@ -1487,7 +1472,7 @@
"execution": {}
},
"source": [
"### Coding Exercise 4.1: Implement the code to fine-tune the model\n",
"### Implement the code to fine-tune the model\n",
"\n",
"Here are the big pieces of what we do below:\n",
"\n",
@@ -1538,7 +1523,15 @@
" tokenizer=tokenizer, mlm=False,\n",
")\n",
"\n",
"trainer = ..."
"# Trainer:\n",
"trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=encoded_dataset,\n",
" tokenizer=tokenizer,\n",
" compute_metrics=compute_metrics,\n",
" data_collator=data_collator,\n",
")"
]
},
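The cell above assumes that training_args, encoded_dataset, and compute_metrics were defined in earlier cells of the notebook. For reference, a minimal training_args could look like the sketch below (the values are illustrative assumptions, not the tutorial's settings):

# Sketch: a minimal TrainingArguments object for a short fine-tuning run.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints and logs are written
    num_train_epochs=1,              # keep the demo short
    per_device_train_batch_size=8,   # adjust to fit the available GPU memory
    logging_steps=50,                # how often to report the training loss
)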
{
@@ -1549,18 +1542,19 @@
},
"outputs": [],
"source": [
"trainer = ..."
"# Run the actual training:\n",
"trainer.train()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/solutions/W3D1_Tutorial2_Solution_b453433d.py)\n",
"\n"
"### Coding Exercise 4.1: Implement the code to generate text after fine-tuning.\n",
"\n",
"To generate text, we provide input tokens to the model, let it generate the next token and append it into the input tokens. Now, keep repeating this process until you reach the desired output length."
]
},
{
@@ -1571,17 +1565,64 @@
},
"outputs": [],
"source": [
"# Run the actual training:\n",
"trainer.train()"
"# Number of tokens to generate\n",
"num_tokens = 100\n",
"\n",
"# Move the model to the CPU for inference\n",
"model.to(\"cpu\")\n",
"\n",
"# Print input prompt\n",
"print(f'Input prompt: \\n{input_prompt}')\n",
"\n",
"#################################################\n",
"# Implement a the correct tokens and outputs\n",
"raise NotImplementedError(\"Text Generation\")\n",
"#################################################\n",
"\n",
"# Encode the input prompt\n",
"# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n",
"input_tokens = ...\n",
"\n",
"# Turn off storing gradients\n",
"with torch.no_grad():\n",
" # Keep iterating until num_tokens are generated\n",
" for tkn_idx in tqdm(range(num_tokens)):\n",
" # Forward pass through the model\n",
" # The model expects the tensor to be of Long or Int dtype\n",
" output = ...\n",
" # Get output logits\n",
" logits = output.logits[-1, :]\n",
" # Convert into probabilities\n",
" probs = nn.functional.softmax(logits, dim=-1)\n",
" # Get the index of top token\n",
" top_token = ...\n",
" # Append the token into the input sequence\n",
" input_tokens.append(top_token)\n",
"\n",
"# Decode and print the generated text\n",
"# https://huggingface.co/docs/transformers/en/main_classes/tokenizer\n",
"decoded_text = ...\n",
"print(f'Generated text: \\n{decoded_text}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"execution": {}
},
"source": [
"Finally, we will try our model on the same code snippet to see how it performs after fine-tuning:"
"[*Click for solution*](https://github.com/NeuromatchAcademy/course-content-dl/tree/main/tutorials/W3D1_TimeSeriesAndNaturalLanguageProcessing/solutions/W3D1_Tutorial2_Solution_0f765585.py)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"We can also directly generate text using the generation_pipeline:"
]
},
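For reference, a pipeline-based generation call typically looks something like the sketch below (assuming model, tokenizer, and input_prompt from the cells above; this is an assumed example, not necessarily the notebook's exact cell):

# Sketch: text generation via the transformers pipeline API.
from transformers import pipeline

generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = generation_pipeline(input_prompt, max_new_tokens=100, do_sample=False)
print(outputs[0]["generated_text"])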
{
@@ -1801,9 +1842,7 @@
"source": [
"## Play around with LLMs\n",
"\n",
"1. Try using LLMs' API to do tasks, such as utilizing the GPT-2 API to extend text from a provided context. To achieve this, ensure you have a HuggingFace account and secure an API token.\n",
"\n",
"\n"
"1. Try using LLMs' API to do tasks, such as utilizing the GPT-2 API to extend text from a provided context. To achieve this, ensure you have a HuggingFace account and secure an API token."
]
},
{
@@ -1817,10 +1856,10 @@
"import requests\n",
"\n",
"def query(payload, model_id, api_token):\n",
" headers = {\"Authorization\": f\"Bearer {api_token}\"}\n",
" API_URL = f\"https://api-inference.huggingface.co/models/{model_id}\"\n",
" response = requests.post(API_URL, headers=headers, json=payload)\n",
" return response.json()\n",
" headers = {\"Authorization\": f\"Bearer {api_token}\"}\n",
" API_URL = f\"https://api-inference.huggingface.co/models/{model_id}\"\n",
" response = requests.post(API_URL, headers=headers, json=payload)\n",
" return response.json()\\\n",
"\n",
"model_id = \"gpt2\"\n",
"api_token = \"hf_****\" # get yours at hf.co/settings/tokens\n",
4 changes: 2 additions & 2 deletions projects/ComputerVision/data_augmentation.html
@@ -1763,8 +1763,8 @@ <h2>Cutout<a class="headerlink" href="#cutout" title="Permalink to this heading"
<section id="mixup">
<h2>Mixup<a class="headerlink" href="#mixup" title="Permalink to this heading">#</a></h2>
<p>Mixup is a data augmentation technique that combines pairs of examples via a convex combination of the images and the labels. Given images <span class="math notranslate nohighlight">\(x_i\)</span> and <span class="math notranslate nohighlight">\(x_j\)</span> with labels <span class="math notranslate nohighlight">\(y_i\)</span> and <span class="math notranslate nohighlight">\(y_j\)</span>, respectively, and <span class="math notranslate nohighlight">\(\lambda \in [0, 1]\)</span>, mixup creates a new image <span class="math notranslate nohighlight">\(\hat{x}\)</span> with label <span class="math notranslate nohighlight">\(\hat{y}\)</span> the following way:</p>
<div class="amsmath math notranslate nohighlight" id="equation-44e940bd-c292-4eed-bcf0-b364c176cb30">
<span class="eqno">(128)<a class="headerlink" href="#equation-44e940bd-c292-4eed-bcf0-b364c176cb30" title="Permalink to this equation">#</a></span>\[\begin{align}
<div class="amsmath math notranslate nohighlight" id="equation-a4a0abc1-1bbd-472d-8c3f-025daccdd721">
<span class="eqno">(128)<a class="headerlink" href="#equation-a4a0abc1-1bbd-472d-8c3f-025daccdd721" title="Permalink to this equation">#</a></span>\[\begin{align}
\hat{x} &amp;= \lambda x_i + (1 - \lambda) x_j \\
\hat{y} &amp;= \lambda y_i + (1 - \lambda) y_j
\end{align}\]</div>
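In code, this convex combination is only a couple of lines; here is a minimal sketch in PyTorch, with lambda drawn from a Beta distribution as in the original mixup paper (the alpha value and tensor shapes are illustrative assumptions):

# Sketch: mixup on a single pair of examples.
import torch

def mixup(x_i, y_i, x_j, y_j, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()  # lambda in [0, 1]
    x_hat = lam * x_i + (1 - lam) * x_j   # blend the images
    y_hat = lam * y_i + (1 - lam) * y_j   # blend the (one-hot) labels
    return x_hat, y_hat

x_hat, y_hat = mixup(torch.rand(3, 32, 32), torch.tensor([1., 0.]),
                     torch.rand(3, 32, 32), torch.tensor([0., 1.]))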