Commit

Merge branch 'main' into overleaf-2023-06-23-1923
veekaybee authored Jun 23, 2023
2 parents 4480d4e + 22de782 commit 16f3fee
Showing 7 changed files with 45 additions and 16 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/main.yaml
@@ -34,7 +34,11 @@ jobs:
run: |
git config --local user.email "[email protected]"
git config --local user.name "Vicki Boykis"
git fetch origin main
git stash
git rebase origin/main # keep all other changes
git stash pop
git add embeddings.pdf
git commit -m "Generated PDF"
git commit -m "Regenerate PDF from ${{ github.sha }}"
git push --force # overwrite old PDF
if: github.event_name != 'pull_request'
10 changes: 10 additions & 0 deletions CITATION.cff
@@ -0,0 +1,10 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Boykis"
given-names: "Vicki"
title: "What are embeddings?"
version: 1.0.1
doi: 10.5281/zenodo.8015029
date-released: 2023-06-08
url: "https://github.com/veekaybee/what_are_embeddings"
15 changes: 15 additions & 0 deletions README.md
@@ -8,6 +8,8 @@
This repository contains the generated LaTeX document, website, and complementary notebook code for
["What are Embeddings".](https://vickiboykis.com/what_are_embeddings/)

[![DOI](https://zenodo.org/badge/644343479.svg)](https://zenodo.org/badge/latestdoi/644343479)

## Abstract

Over the past decade, embeddings --- numerical representations of non-tabular machine learning features used as input to deep learning models --- have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications has exploded, creating approaches specifically tailored to scale has become increasingly important.
@@ -30,3 +32,16 @@ If you have any changes that you'd like to make to the document including clarif
6. Issue that pull request!


## Citing

```bibtex
@software{Boykis_What_are_embeddings_2023,
author = {Boykis, Vicki},
doi = {10.5281/zenodo.8015029},
month = jun,
title = {{What are embeddings?}},
url = {https://github.com/veekaybee/what_are_embeddings},
version = {1.0.1},
year = {2023}
}
```
Binary file modified embeddings.pdf
Binary file not shown.
18 changes: 9 additions & 9 deletions embeddings.tex
@@ -355,7 +355,7 @@ \section{Introduction}
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = """Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.""""
text = """Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly."""

# Tokenize the sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(text)
@@ -447,7 +447,7 @@ \section{Recommendation as a business problem}
The main task of our recommender system at Netflix is to help our members discover content that they will watch and enjoy to maximize their long-term satisfaction. This is a challenging problem for many reasons, including that every person is unique, has a multitude of interests that can vary in different contexts, and needs a recommender system most when they are not sure what they want to watch. Doing this well means that each member gets a unique experience that allows them to get the most out of Netflix. As a monthly subscription service, member satisfaction is tightly coupled to a person’s likelihood to retain with our service, which directly impacts our revenue.
\end{quote}

Knowing this business context, how and why might we use embeddings in machine learning workflows in Flutter to show users flits that are interesting to them personally, knowing that personalized content is more relevant and generally gets higher rates of engagement? \citep{jannach2010recommender} than non-personalized forms of recommendation on online platforms \footnote{For more, see \href{http://www.recommenderbook.net/media/Recommender_Systems_An_Introduction_Chapter08_Case_study.pdf}{this case study} on personalized recommendations as well as \href{https://www.arxiv-vanity.com/papers/1906.03109/}{the intro section of this paper} which covers many personalization use-cases} We need to first understand how web apps work and where embeddings fit into them.
Knowing this business context, and given that personalized content is more relevant and generally gets higher rates of engagement \citep{jannach2010recommender} than non-personalized forms of recommendation on online platforms,\footnote{For more, see \href{http://www.recommenderbook.net/media/Recommender_Systems_An_Introduction_Chapter08_Case_study.pdf}{this case study} on personalized recommendations as well as \href{https://www.arxiv-vanity.com/papers/1906.03109/}{the intro section of this paper} which covers many personalization use-cases.} how and why might we use embeddings in machine learning workflows in Flutter to show users flits that are interesting to them personally? We need to first understand how web apps work and where embeddings fit into them.

\subsection{Building a web app}

@@ -906,7 +906,7 @@ \subsection{Encoding}

\subsubsection{Indicator and one-hot encoding}

indicator encoding is, given $n$ categories (i.e. "US", "UK", and "NZ") encodes the variables into $n-1$ categories, creating a new feature for each category. So, if we have three variables, indicator encoding encodes into two indicator variables. Why would we do this? If the categories are mutually exclusive, as they usually are in point-in-time geolocation estimates, if someone is in the US, we know for sure they're not in the UK and not in NZ, so it reduces computational overhead.
Indicator encoding, given $n$ categories (e.g. "US", "UK", and "NZ"), encodes the variables into $n-1$ categories, creating a new feature for each category. So, if we have three categories, indicator encoding encodes them into two indicator variables. Why would we do this? If the categories are mutually exclusive, as they usually are in point-in-time geolocation estimates, then if someone is in the US we know for sure they're not in the UK and not in NZ, so dropping one category reduces computational overhead.

If we instead use all the variables and they are very closely correlated, there is a chance we'll fall into something known as the \textbf{indicator variable trap}. We can predict one variable from the others, which means we no longer have feature independence. This generally isn't a risk for geolocation, since there are more than two or three possible countries, and if you're not in the US, it's not guaranteed that you're in the UK. So, if we have US = 1, UK = 2, and NZ = 3, and prefer more compact representations, we can use indicator encoding. However, many modern ML approaches don't require linear feature independence and use L1 regularization\footnote{Regularization is a way to prevent our model from \textbf{overfitting}. Overfitting means our model can exactly predict outcomes based on the training data, but it can't learn new inputs that we show it, which means it can't generalize.} to prune feature inputs that don't minimize the error, and as such only use one-hot encoding.
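To make the two encodings concrete, here is a minimal sketch using pandas (my own illustrative example, not code from the book's notebooks); `get_dummies` with `drop_first=True` yields the $n-1$ indicator columns, while the default keeps all $n$ one-hot columns:

```python
import pandas as pd

# Hypothetical point-in-time geolocation feature with three categories
df = pd.DataFrame({"country": ["US", "UK", "NZ", "US"]})

# One-hot encoding: one binary column per category (n columns)
one_hot = pd.get_dummies(df["country"], prefix="country")

# Indicator encoding: drop the first level to keep n-1 columns and
# avoid the indicator variable trap for linear models
indicator = pd.get_dummies(df["country"], prefix="country", drop_first=True)

print(one_hot.columns.tolist())    # ['country_NZ', 'country_UK', 'country_US']
print(indicator.columns.tolist())  # ['country_UK', 'country_US'] -- NZ is the implied baseline
```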

@@ -1497,7 +1497,7 @@ \subsection{Word2Vec}

To get around the limitations of earlier textual approaches and keep up with the growing size of text corpora, in 2013 researchers at Google came up with an elegant solution to this problem using neural networks, called Word2Vec \citep{mikolov2013efficient}.

So far, we've moved from simple heuristics like one-hot encoding, to machine learning approaches like LSA and LDA that look to learn a dataset's modeled features. Previously, like our original one-hot encodings, all the approaches to embedding focused on generating sparse vectors much . A sparse vector gives an indication that two words are related, but not that there is a semantic relationship between them. For example, “The dog chased the cat” and “the cat chased the dog” would have the same distance in the vector space, even though they’re two completely different sentences.
So far, we've moved from simple heuristics like one-hot encoding, to machine learning approaches like LSA and LDA that look to learn a dataset's modeled features. Previously, like our original one-hot encodings, all the approaches to embedding focused on generating sparse vectors that can give an indication that two words are related, but not that there is a semantic relationship between them. For example, “The dog chased the cat” and “the cat chased the dog” would have the same distance in the vector space, even though they’re two completely different sentences.

Word2Vec is a family of models with several implementations, each of which transforms the entire input dataset into vector representations and, more importantly, focuses not only on the inherent labels of individual words, but on the relationships between those representations.
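As a rough sketch of how this looks in practice (using gensim, which is an assumption on my part rather than the book's own code at this point), a skip-gram Word2Vec model can be trained on a toy tokenized corpus and then queried for dense vectors and nearest neighbors:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real model needs far more text to learn useful vectors
sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "dog"],
    ["hold", "fast", "to", "dreams"],
]

# sg=1 selects the skip-gram variant; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["dog"]                          # dense 50-dimensional embedding
neighbors = model.wv.most_similar("dog", topn=3)  # words nearby in the vector space
print(vector.shape, neighbors)
```

Unlike the sparse representations discussed above, these learned vectors place words that appear in similar contexts close together in the vector space.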

@@ -1606,9 +1606,9 @@ \subsection{Word2Vec}

\begin{itemize}
\item \textbf{Tokenization} - transforming a sentence or a word into its component parts by splitting it
\item Removing noise - Including URLs, punctuation, and anything else in the text that is not relevant to the task at hand
\item \textbf{Removing noise} - Including URLs, punctuation, and anything else in the text that is not relevant to the task at hand
\item \textbf{Word segmentation} - Splitting our sentences into individual words
\item Correcting spelling mistakes
\item \textbf{Correcting spelling mistakes}
\end{itemize}
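A minimal sketch of these preprocessing steps in plain Python (the lowercasing and regex choices are my own illustrative assumptions, not the book's pipeline; spelling correction is left as a stub because it typically needs an external library):

```python
import re

def preprocess(text: str) -> list[str]:
    # Removing noise: strip URLs and punctuation, and lowercase the text
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization / word segmentation: split the cleaned sentence into words
    tokens = text.split()
    # Correcting spelling mistakes would happen here (e.g. via an external spell checker)
    return tokens

print(preprocess("Hold fast to dreams, for if dreams die... https://example.com"))
# ['hold', 'fast', 'to', 'dreams', 'for', 'if', 'dreams', 'die']
```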


@@ -2082,8 +2082,8 @@ \subsection{BERT}
\caption{Encoder-only architecture}
\end{figure}
After the explosive success of "Attention is All you Need", a variety of transformer architectures arose, research and implementation in this architecture exploded in deep learning. The next transformer architecture to be considered a significant step forward was \textbf{BERT} released in 2018 by Google.
BERT stands for Bi-Directional Encoder and was released 2018 \citep{devlin2018bert}, based on a paper written by Google as a way to solve common natural language tasks like sentiment analysis, question-answering, and text summarization. BERT is a transformer model , also based on the attention mechanism, but its architecture is such that it only includes the encoder piece. Its most prominent usage is in Google Search, where it's the algorithm powering surfacing relevant search results. In the blog post they released on including BERT in search ranking in 2019, Google specifically discussed adding context to queries as a replacement for keyword-based methods as a reason they did this.\footnote{\href{https://blog.google/products/search/search-language-understanding-bert/}{BERT search announcement}}
After the explosive success of "Attention Is All You Need", a variety of transformer architectures arose, and research and implementation of this architecture exploded in deep learning. The next transformer architecture to be considered a significant step forward was \textbf{BERT}.
BERT stands for Bidirectional Encoder Representations from Transformers and was released in 2018 \citep{devlin2018bert}, based on a paper written by Google as a way to solve common natural language tasks like sentiment analysis, question-answering, and text summarization. BERT is a transformer model, also based on the attention mechanism, but its architecture is such that it only includes the encoder piece. Its most prominent usage is in Google Search, where it's the algorithm powering the surfacing of relevant search results. In the blog post they released on including BERT in search ranking in 2019, Google specifically discussed adding context to queries as a replacement for keyword-based methods as a reason they did this.\footnote{\href{https://blog.google/products/search/search-language-understanding-bert/}{BERT search announcement}}
BERT works as a \textbf{masked language model}. Masking is simply what we did when we implemented Word2Vec by removing words and building our context window. When we created our representations with Word2Vec, we only looked at sliding windows moving forward. The B in BERT is for bi-directional, which means it pays attention to words in both directions through scaled dot-product attention. BERT has 12 transformer layers. It starts by using \textbf{WordPiece}, an algorithm that segments words into subword tokens. To train BERT, the goal is to predict a token given its context.
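To make the masked-language-model objective concrete, here is a small sketch using the Hugging Face transformers library (an assumed setup for illustration; the book's accompanying notebook uses BERT to extract token embeddings rather than to fill masks): we hide one WordPiece token and ask BERT to predict it from the context on both sides.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Mask one token; BERT sees context to the left and to the right of [MASK]
text = "Hold fast to [MASK], for if dreams die, life is a broken-winged bird."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, sequence_length, vocab_size]

# Find the position of the [MASK] token and take the highest-scoring vocabulary id
mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = int(logits[0, mask_position].argmax())
print(tokenizer.decode([predicted_id]))  # BERT's guess for the masked word
```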
@@ -2326,7 +2326,7 @@ \subsubsection*{An aside on training data}
In \textbf{fine-tuning} a model, we perform all the same steps as we do for training from scratch. We have training data, we have a model, and we minimize a loss function. However, there are several differences. When we create our new model, we copy the existing, pre-trained model with the exception of the final output layer, which we initialize from scratch based on our new task. When we train the model, we initialize these parameters at random and only continue to adjust the parameters of the previous layers so that they focus on this task rather than starting to train from scratch. In this way, if we have a model like BERT that's trained to generalize across the whole internet, but our corpus for Flutter is very sensitive to trending topics and needs to be updated on a daily basis, we can refocus the model without having to train a new one with as few as 10k samples instead of our original hundreds of millions \citep{zhang2020revisiting}.
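As a hedged sketch of what this setup might look like in code (assuming Hugging Face transformers and a hypothetical labeled set of flits; none of this is prescribed by the text), we copy the pre-trained BERT weights, attach a freshly initialized classification head, and optionally freeze the encoder so only the new layer is updated at first:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder weights are copied; the classification head starts at random
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Optionally freeze the encoder so only the new output layer is trained at first
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5
)

# One illustrative training step on a single hypothetical labeled example
batch = tokenizer("this flit is about machine learning", return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```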
There are, likewise, BERT embeddings available that we can fine-tune. There are other generalized corpuses available, such as GloVE, Word2Vec, and \href{https://fasttext.cc/docs/en/crawl-vectors.html}{FastText} (also trained with CBOW). We need to make a decision whether to use these, train a model from scratch, or a third option, to query \href{https://platform.openai.com/docs/guides/embeddings/limitations-risks}{embeddings available from an API} as is the case for OpenAI embeddings, although doing so can potentially come at a \href{https://github.com/ray-project/llm-numbers#101----cost-ratio-of-openai-embedding-to-self-hosted-embedding}{higher}, relative to training or fine-tuning our own. Of course, all of this is subject to our particular use-case and is important to evaluate when we start a project.
There are, likewise, BERT embeddings available that we can fine-tune. There are other generalized corpora available, such as GloVe, Word2Vec, and \href{https://fasttext.cc/docs/en/crawl-vectors.html}{FastText} (also trained with CBOW). We need to decide whether to use these, train a model from scratch, or, as a third option, query \href{https://platform.openai.com/docs/guides/embeddings/limitations-risks}{embeddings available from an API}, as is the case for OpenAI embeddings, although doing so can potentially come at a \href{https://github.com/ray-project/llm-numbers#101----cost-ratio-of-openai-embedding-to-self-hosted-embedding}{higher cost} relative to training or fine-tuning our own. Of course, all of this is subject to our particular use-case and is important to evaluate when we start a project.
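For the third option, a query against a hosted embeddings API might look roughly like the following sketch (using the openai Python client; the exact client version and model name here are assumptions for illustration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",  # example model name; pick per the provider's docs
    input="Hold fast to dreams, for if dreams die, life is a broken-winged bird.",
)

embedding = response.data[0].embedding  # a list of floats, e.g. 1,536 dimensions for this model
print(len(embedding))
```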
\subsubsection{Storage and Retrieval}
6 changes: 3 additions & 3 deletions notebooks/fig_24_tf_idf_from_scratch.ipynb
@@ -284,6 +284,7 @@
],
"source": [
"# Simple frequency counts of words per document by initializing a dict\n",
"import pandas as pd"
"dict_a = dict.fromkeys(total_corpus, 0)\n",
"dict_b = dict.fromkeys(total_corpus, 0)\n",
"\n",
@@ -544,7 +545,6 @@
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"import pandas as pd\n",
"\n",
"corpus = [\n",
" \"Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.\",\n",
@@ -555,7 +555,7 @@
"\n",
"vectorizer = TfidfVectorizer()\n",
"vector = vectorizer.fit_transform(corpus)\n",
"dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))\n",
"dict(zip(vectorizer.get_feature_names_out(), vector.toarray()[0]))\n",
"\n",
"tfidf_df = pd.DataFrame(vector.toarray(), index=text_titles, columns=vectorizer.get_feature_names_out())"
]
@@ -752,7 +752,7 @@
"</div>"
],
"text/plain": [
" dreams_langstonhughes quote_william_blake 00_Document Frequency\n",
" quote_langstonhughes quote_william_blake 00_Document Frequency\n",
"bird 0.172503 0.197242 2.0\n",
"broken 0.242447 0.000000 1.0\n",
"cannot 0.242447 0.000000 1.0\n",
6 changes: 3 additions & 3 deletions notebooks/fig_4_bert.ipynb
@@ -279,7 +279,7 @@
}
],
"source": [
"# Mark each of the 22 tokens as belonging to sentence \"1\".\n",
"# Mark each of the 23 tokens as belonging to sentence \"1\".\n",
"segments_ids = [1] * len(tokenized_text)\n",
"\n",
"print(segments_ids)"
@@ -560,10 +560,10 @@
}
],
"source": [
"# Stores the token vectors, with shape [22 x 768]\n",
"# Stores the token vectors, with shape [23 x 768]\n",
"embeddings = []\n",
"\n",
"# `token_embeddings` is a [22 x 12 x 768] tensor.\n",
"# `token_embeddings` is a [23 x 12 x 768] tensor.\n",
"\n",
"# For each token in the sentence...\n",
"for token in token_embeddings:\n",
