Commit

Merge branch 'main' into overleaf-2023-06-23-1923
veekaybee authored Jun 23, 2023
2 parents 4480d4e + 22de782 commit 16f3fee
Showing 7 changed files with 45 additions and 16 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/main.yaml
@@ -34,7 +34,11 @@ jobs:
run: |
git config --local user.email "[email protected]"
git config --local user.name "Vicki Boykis"
git fetch origin main
git stash
git rebase origin/main # keep all other changes
git stash pop
git add embeddings.pdf
git commit -m "Generated PDF"
git commit -m "Regenerate PDF from ${{ github.sha }}"
git push --force # overwrite old PDF
if: github.event_name != 'pull_request'
10 changes: 10 additions & 0 deletions CITATION.cff
@@ -0,0 +1,10 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Boykis"
given-names: "Vicki"
title: "What are embeddings?"
version: 1.0.1
doi: 10.5281/zenodo.8015029
date-released: 2023-06-08
url: "https://github.com/veekaybee/what_are_embeddings"
15 changes: 15 additions & 0 deletions README.md
@@ -8,6 +8,8 @@
This repository contains the generated LaTeX document, website, and complementary notebook code for
["What are Embeddings".](https://vickiboykis.com/what_are_embeddings/)

[![DOI](https://zenodo.org/badge/644343479.svg)](https://zenodo.org/badge/latestdoi/644343479)

## Abstract

Over the past decade, embeddings --- numerical representations of non-tabular machine learning features used as input to deep learning models --- have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications has exploded, creating approaches specifically tailored to scale has become increasingly important.
@@ -30,3 +32,16 @@ If you have any changes that you'd like to make to the document including clarif
6. Issue that pull request!


## Citing

```bibtex
@software{Boykis_What_are_embeddings_2023,
author = {Boykis, Vicki},
doi = {10.5281/zenodo.8015029},
month = jun,
title = {{What are embeddings?}},
url = {https://github.com/veekaybee/what_are_embeddings},
version = {1.0.1},
year = {2023}
}
```
Binary file modified embeddings.pdf
Binary file not shown.
18 changes: 9 additions & 9 deletions embeddings.tex
@@ -355,7 +355,7 @@ \section{Introduction}
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = """Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.""""
text = """Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly."""

# Tokenize the sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(text)
@@ -447,7 +447,7 @@ \section{Recommendation as a business problem}
The main task of our recommender system at Netflix is to help our members discover content that they will watch and enjoy to maximize their long-term satisfaction. This is a challenging problem for many reasons, including that every person is unique, has a multitude of interests that can vary in different contexts, and needs a recommender system most when they are not sure what they want to watch. Doing this well means that each member gets a unique experience that allows them to get the most out of Netflix. As a monthly subscription service, member satisfaction is tightly coupled to a person’s likelihood to retain with our service, which directly impacts our revenue.
\end{quote}

Knowing this business context, how and why might we use embeddings in machine learning workflows in Flutter to show users flits that are interesting to them personally, knowing that personalized content is more relevant and generally gets higher rates of engagement? \citep{jannach2010recommender} than non-personalized forms of recommendation on online platforms \footnote{For more, see \href{http://www.recommenderbook.net/media/Recommender_Systems_An_Introduction_Chapter08_Case_study.pdf}{this case study} on personalized recommendations as well as \href{https://www.arxiv-vanity.com/papers/1906.03109/}{the intro section of this paper} which covers many personalization use-cases} We need to first understand how web apps work and where embeddings fit into them.
Knowing this business context, and given that personalized content is more relevant and generally gets higher rates of engagement \citep{jannach2010recommender} than non-personalized forms of recommendation on online platforms,\footnote{For more, see \href{http://www.recommenderbook.net/media/Recommender_Systems_An_Introduction_Chapter08_Case_study.pdf}{this case study} on personalized recommendations as well as \href{https://www.arxiv-vanity.com/papers/1906.03109/}{the intro section of this paper} which covers many personalization use-cases.} how and why might we use embeddings in machine learning workflows in Flutter to show users flits that are interesting to them personally? We need to first understand how web apps work and where embeddings fit into them.

\subsection{Building a web app}

@@ -906,7 +906,7 @@ \subsection{Encoding}

\subsubsection{Indicator and one-hot encoding}

indicator encoding is, given $n$ categories (i.e. "US", "UK", and "NZ") encodes the variables into $n-1$ categories, creating a new feature for each category. So, if we have three variables, indicator encoding encodes into two indicator variables. Why would we do this? If the categories are mutually exclusive, as they usually are in point-in-time geolocation estimates, if someone is in the US, we know for sure they're not in the UK and not in NZ, so it reduces computational overhead.
Indicator encoding, given $n$ categories (e.g. "US", "UK", and "NZ"), encodes the variables into $n-1$ categories, creating a new feature for each category. So, if we have three categories, indicator encoding encodes them into two indicator variables. Why would we do this? If the categories are mutually exclusive, as they usually are in point-in-time geolocation estimates, then if someone is in the US we know for sure they're not in the UK and not in NZ, so dropping one category reduces computational overhead.

If we instead use all the variables and they are very closely correlated, there is a chance we'll fall into something known as the \textbf{indicator variable trap}. We can predict one variable from the others, which means we no longer have feature independence. This generally isn't a risk for geolocation, since there are more than two or three possible countries, and if you're not in the US, it's not guaranteed that you're in the UK. So, if we have US = 1, UK = 2, and NZ = 3, and prefer more compact representations, we can use indicator encoding. However, many modern ML approaches don't require linear feature independence and use L1 regularization\footnote{Regularization is a way to prevent our model from \textbf{overfitting}. Overfitting means our model can exactly predict outcomes based on the training data, but it can't learn new inputs that we show it, which means it can't generalize.} to prune feature inputs that don't minimize the error, and as such only use one-hot encoding.
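To make the two encodings concrete, here is a minimal sketch using pandas (my own illustrative example, not code from the book's notebooks); `get_dummies` with `drop_first=True` yields the $n-1$ indicator columns, while the default keeps all $n$ one-hot columns:

```python
import pandas as pd

# Hypothetical point-in-time geolocation feature with three categories
df = pd.DataFrame({"country": ["US", "UK", "NZ", "US"]})

# One-hot encoding: one binary column per category (n columns)
one_hot = pd.get_dummies(df["country"], prefix="country")

# Indicator encoding: drop the first level to keep n-1 columns and
# avoid the indicator variable trap for linear models
indicator = pd.get_dummies(df["country"], prefix="country", drop_first=True)

print(one_hot.columns.tolist())    # ['country_NZ', 'country_UK', 'country_US']
print(indicator.columns.tolist())  # ['country_UK', 'country_US'] -- NZ is the implied baseline
```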

@@ -1497,7 +1497,7 @@ \subsection{Word2Vec}

To get around the limitations of earlier textual approaches and keep up with the growing size of text corpora, in 2013 researchers at Google came up with an elegant solution to this problem using neural networks, called Word2Vec \citep{mikolov2013efficient}.

So far, we've moved from simple heuristics like one-hot encoding, to machine learning approaches like LSA and LDA that look to learn a dataset's modeled features. Previously, like our original one-hot encodings, all the approaches to embedding focused on generating sparse vectors much . A sparse vector gives an indication that two words are related, but not that there is a semantic relationship between them. For example, “The dog chased the cat” and “the cat chased the dog” would have the same distance in the vector space, even though they’re two completely different sentences.
So far, we've moved from simple heuristics like one-hot encoding, to machine learning approaches like LSA and LDA that look to learn a dataset's modeled features. Previously, like our original one-hot encodings, all the approaches to embedding focused on generating sparse vectors that can give an indication that two words are related, but not that there is a semantic relationship between them. For example, “The dog chased the cat” and “the cat chased the dog” would have the same distance in the vector space, even though they’re two completely different sentences.

Word2Vec is a family of models with several implementations, each of which transforms the entire input dataset into vector representations and, more importantly, focuses not only on the inherent labels of individual words, but on the relationships between those representations.
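As a rough sketch of how this looks in practice (using gensim, which is an assumption on my part rather than the book's own code at this point), a skip-gram Word2Vec model can be trained on a toy tokenized corpus and then queried for dense vectors and nearest neighbors:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real model needs far more text to learn useful vectors
sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "dog"],
    ["hold", "fast", "to", "dreams"],
]

# sg=1 selects the skip-gram variant; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["dog"]                          # dense 50-dimensional embedding
neighbors = model.wv.most_similar("dog", topn=3)  # words nearby in the vector space
print(vector.shape, neighbors)
```

Unlike the sparse representations discussed above, these learned vectors place words that appear in similar contexts close together in the vector space.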

@@ -1606,9 +1606,9 @@ \subsection{Word2Vec}

\begin{itemize}
\item \textbf{Tokenization} - transforming a sentence or a word into its component parts by splitting it
\item Removing noise - Including URLs, punctuation, and anything else in the text that is not relevant to the task at hand
\item \textbf{Removing noise} - Including URLs, punctuation, and anything else in the text that is not relevant to the task at hand
\item \textbf{Word segmentation} - Splitting our sentences into individual words
\item Correcting spelling mistakes
\item \textbf{Correcting spelling mistakes}
\end{itemize}
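A minimal sketch of these preprocessing steps in plain Python (the lowercasing and regex choices are my own illustrative assumptions, not the book's pipeline; spelling correction is left as a stub because it typically needs an external library):

```python
import re

def preprocess(text: str) -> list[str]:
    # Removing noise: strip URLs and punctuation, and lowercase the text
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization / word segmentation: split the cleaned sentence into words
    tokens = text.split()
    # Correcting spelling mistakes would happen here (e.g. via an external spell checker)
    return tokens

print(preprocess("Hold fast to dreams, for if dreams die... https://example.com"))
# ['hold', 'fast', 'to', 'dreams', 'for', 'if', 'dreams', 'die']
```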


@@ -2082,8 +2082,8 @@ \subsection{BERT}
\caption{Encoder-only architecture}
\end{figure}
After the explosive success of "Attention is All you Need", a variety of transformer architectures arose, research and implementation in this architecture exploded in deep learning. The next transformer architecture to be considered a significant step forward was \textbf{BERT} released in 2018 by Google.
BERT stands for Bi-Directional Encoder and was released 2018 \citep{devlin2018bert}, based on a paper written by Google as a way to solve common natural language tasks like sentiment analysis, question-answering, and text summarization. BERT is a transformer model , also based on the attention mechanism, but its architecture is such that it only includes the encoder piece. Its most prominent usage is in Google Search, where it's the algorithm powering surfacing relevant search results. In the blog post they released on including BERT in search ranking in 2019, Google specifically discussed adding context to queries as a replacement for keyword-based methods as a reason they did this.\footnote{\href{https://blog.google/products/search/search-language-understanding-bert/}{BERT search announcement}}
After the explosive success of "Attention Is All You Need", a variety of transformer architectures arose, and research and implementation of this architecture exploded in deep learning. The next transformer architecture to be considered a significant step forward was \textbf{BERT}.
BERT stands for Bidirectional Encoder Representations from Transformers and was released in 2018 \citep{devlin2018bert}, based on a paper written by Google as a way to solve common natural language tasks like sentiment analysis, question-answering, and text summarization. BERT is a transformer model, also based on the attention mechanism, but its architecture is such that it only includes the encoder piece. Its most prominent usage is in Google Search, where it's the algorithm powering the surfacing of relevant search results. In the blog post they released on including BERT in search ranking in 2019, Google specifically discussed adding context to queries as a replacement for keyword-based methods as a reason they did this.\footnote{\href{https://blog.google/products/search/search-language-understanding-bert/}{BERT search announcement}}
BERT works as a \textbf{masked language model}. Masking is simply what we did when we implemented Word2Vec by removing words and building our context window. When we created our representations with Word2Vec, we only looked at sliding windows moving forward. The B in BERT is for bi-directional, which means it pays attention to words in both directions through scaled dot-product attention. BERT has 12 transformer layers. It starts by using \textbf{WordPiece}, an algorithm that segments words into subword tokens. To train BERT, the goal is to predict a token given its context.
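To make the masked-language-model objective concrete, here is a small sketch using the Hugging Face transformers library (an assumed setup for illustration; the book's accompanying notebook uses BERT to extract token embeddings rather than to fill masks): we hide one WordPiece token and ask BERT to predict it from the context on both sides.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Mask one token; BERT sees context to the left and to the right of [MASK]
text = "Hold fast to [MASK], for if dreams die, life is a broken-winged bird."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, sequence_length, vocab_size]

# Find the position of the [MASK] token and take the highest-scoring vocabulary id
mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = int(logits[0, mask_position].argmax())
print(tokenizer.decode([predicted_id]))  # BERT's guess for the masked word
```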
@@ -2326,7 +2326,7 @@ \subsubsection*{An aside on training data}
In \textbf{fine-tuning} a model, we perform all the same steps as we do for training from scratch. We have training data, we have a model, and we minimize a loss function. However, there are several differences. When we create our new model, we copy the existing, pre-trained model with the exception of the final output layer, which we initialize from scratch based on our new task. When we train the model, we initialize these parameters at random and only continue to adjust the parameters of the previous layers so that they focus on this task rather than starting to train from scratch. In this way, if we have a model like BERT that's trained to generalize across the whole internet, but our corpus for Flutter is very sensitive to trending topics and needs to be updated on a daily basis, we can refocus the model without having to train a new one with as few as 10k samples instead of our original hundreds of millions \citep{zhang2020revisiting}.
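As a hedged sketch of what this setup might look like in code (assuming Hugging Face transformers and a hypothetical labeled set of flits; none of this is prescribed by the text), we copy the pre-trained BERT weights, attach a freshly initialized classification head, and optionally freeze the encoder so only the new layer is updated at first:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder weights are copied; the classification head starts at random
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Optionally freeze the encoder so only the new output layer is trained at first
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5
)

# One illustrative training step on a single hypothetical labeled example
batch = tokenizer("this flit is about machine learning", return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```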
There are, likewise, BERT embeddings available that we can fine-tune. There are other generalized corpuses available, such as GloVE, Word2Vec, and \href{https://fasttext.cc/docs/en/crawl-vectors.html}{FastText} (also trained with CBOW). We need to make a decision whether to use these, train a model from scratch, or a third option, to query \href{https://platform.openai.com/docs/guides/embeddings/limitations-risks}{embeddings available from an API} as is the case for OpenAI embeddings, although doing so can potentially come at a \href{https://github.com/ray-project/llm-numbers#101----cost-ratio-of-openai-embedding-to-self-hosted-embedding}{higher}, relative to training or fine-tuning our own. Of course, all of this is subject to our particular use-case and is important to evaluate when we start a project.
There are, likewise, BERT embeddings available that we can fine-tune. There are other generalized corpora available, such as GloVe, Word2Vec, and \href{https://fasttext.cc/docs/en/crawl-vectors.html}{FastText} (also trained with CBOW). We need to decide whether to use these, train a model from scratch, or, as a third option, query \href{https://platform.openai.com/docs/guides/embeddings/limitations-risks}{embeddings available from an API}, as is the case for OpenAI embeddings, although doing so can potentially come at a \href{https://github.com/ray-project/llm-numbers#101----cost-ratio-of-openai-embedding-to-self-hosted-embedding}{higher cost} relative to training or fine-tuning our own. Of course, all of this is subject to our particular use-case and is important to evaluate when we start a project.
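For the third option, a query against a hosted embeddings API might look roughly like the following sketch (using the openai Python client; the exact client version and model name here are assumptions for illustration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",  # example model name; pick per the provider's docs
    input="Hold fast to dreams, for if dreams die, life is a broken-winged bird.",
)

embedding = response.data[0].embedding  # a list of floats, e.g. 1,536 dimensions for this model
print(len(embedding))
```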
\subsubsection{Storage and Retrieval}
6 changes: 3 additions & 3 deletions notebooks/fig_24_tf_idf_from_scratch.ipynb
@@ -284,6 +284,7 @@
],
"source": [
"# Simple frequency counts of words per document by initializing a dict\n",
"import pandas as pd"
"dict_a = dict.fromkeys(total_corpus, 0)\n",
"dict_b = dict.fromkeys(total_corpus, 0)\n",
"\n",
@@ -544,7 +545,6 @@
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"import pandas as pd\n",
"\n",
"corpus = [\n",
" \"Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.\",\n",
@@ -555,7 +555,7 @@
"\n",
"vectorizer = TfidfVectorizer()\n",
"vector = vectorizer.fit_transform(corpus)\n",
"dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))\n",
"dict(zip(vectorizer.get_feature_names_out(), vector.toarray()[0]))\n",
"\n",
"tfidf_df = pd.DataFrame(vector.toarray(), index=text_titles, columns=vectorizer.get_feature_names_out())"
]
@@ -752,7 +752,7 @@
"</div>"
],
"text/plain": [
" dreams_langstonhughes quote_william_blake 00_Document Frequency\n",
" quote_langstonhughes quote_william_blake 00_Document Frequency\n",
"bird 0.172503 0.197242 2.0\n",
"broken 0.242447 0.000000 1.0\n",
"cannot 0.242447 0.000000 1.0\n",
6 changes: 3 additions & 3 deletions notebooks/fig_4_bert.ipynb
@@ -279,7 +279,7 @@
}
],
"source": [
"# Mark each of the 22 tokens as belonging to sentence \"1\".\n",
"# Mark each of the 23 tokens as belonging to sentence \"1\".\n",
"segments_ids = [1] * len(tokenized_text)\n",
"\n",
"print(segments_ids)"
@@ -560,10 +560,10 @@
}
],
"source": [
"# Stores the token vectors, with shape [22 x 768]\n",
"# Stores the token vectors, with shape [23 x 768]\n",
"embeddings = []\n",
"\n",
"# `token_embeddings` is a [22 x 12 x 768] tensor.\n",
"# `token_embeddings` is a [23 x 12 x 768] tensor.\n",
"\n",
"# For each token in the sentence...\n",
"for token in token_embeddings:\n",
