Contrib/customize embeddings (#574)
This example uses an [OpenAI cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/Customizing_embeddings.ipynb)
as inspiration for a module to customize embeddings.

One can pull this module down, create a dataset that fits the expected schema,
and then customize those embeddings.
skrawcz authored Dec 4, 2023
1 parent 67278b9 commit b9c0e14
Showing 7 changed files with 11,390 additions and 0 deletions.
# Purpose of this module

This module customizes embeddings for text data. It is based on MIT-licensed code from
this [OpenAI cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/Customizing_embeddings.ipynb).

The output is a matrix that you can multiply your embeddings by. The product of this multiplication is a
'custom embedding' that better emphasizes aspects of the text relevant to your use case.
In binary classification use cases, the OpenAI cookbook author reports error rates dropping by as much as 50%.
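
Concretely, applying the customization is just a matrix multiply. A minimal sketch of the idea, with illustrative shapes that are assumptions rather than the module's defaults:

```python
import numpy as np

# Illustrative shapes only (assumed, not the module's defaults).
embedding = np.random.rand(1536)        # an off-the-shelf text embedding
matrix = np.random.rand(1536, 2048)     # the matrix this module learns
custom_embedding = embedding @ matrix   # the "custom embedding"
```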

Why customize embeddings? Embeddings represent text as a vector of numbers, chosen so that
similar texts have similar vectors. For example, the vectors for "cat" and "dog" will be closer together than the
vectors for "cat" and "banana". Embeddings are useful for many NLP tasks, such as sentiment analysis, text classification,
and question answering. However, embeddings are often trained on a large general corpus, such as Wikipedia, and
may not be optimal for a specific task: you may need to reliably distinguish "cat" from "dog", but general-purpose
embeddings may not separate them well enough. This module lets you customize embeddings for such a task,
e.g. by pushing the vectors for "cat" and "dog" further apart.
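
"Closer" here is typically measured with cosine similarity, which this module also computes (see below). A minimal standalone version for reference:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```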

## How might I use this module?
If you pass in `{"source":"snli"}` as configuration to the driver, the module will download the SNLI corpus, unzip it,
and use it as the input whose embeddings are optimized. The sentence embeddings are optimized for the task of predicting
whether a sentence pair is an "entailment" or a "contradiction". Each pair of logically entailed sentences
(i.e., one implies the other) is a positive (label = 1). We generate synthetic negatives by combining
sentences from different pairs, which are presumed not to be logically entailed (label = -1).
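
To illustrate the idea of synthetic negatives, here is a sketch of the technique; it is not the module's `_generate_negative_pairs()` verbatim, and the function signature is an assumption:

```python
import random

import pandas as pd

def generate_negative_pairs(positives: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Combine sentences from *different* positive pairs; presume label = -1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        i, j = rng.sample(range(len(positives)), 2)  # two distinct rows
        rows.append({
            "text_1": positives.iloc[i]["text_1"],
            "text_2": positives.iloc[j]["text_2"],
            "label": -1,
        })
    return pd.DataFrame(rows)
```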

If you pass in `{"source":"local"}` as configuration to the driver, the module will load a local dataset from a path
you provide. The dataset should be a CSV with columns "text_1", "text_2", and "label", where the label is +1 if the
text pair is similar and -1 if it is dissimilar.

Otherwise, if you pass in `{}` as configuration to the driver, the module will require you to pass in a dataframe as
the `processed_local_dataset` input. The dataframe should have columns "text_1", "text_2", and "label", where the
label is +1 if the text pair is similar and -1 if it is dissimilar.
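
For example, a toy dataframe matching that schema:

```python
import pandas as pd

processed_local_dataset = pd.DataFrame({
    "text_1": ["The cat sat on the mat.", "The cat sat on the mat."],
    "text_2": ["A cat is on a mat.", "A rocket launched into orbit."],
    "label": [1, -1],  # +1 = similar, -1 = dissimilar
})
```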

In general, to use this module you'll need to do the following:

1. Create a dataset of text pairs, where each pair is either similar or dissimilar. Mechanically, that means you'll need
to modify/override `processed_local_dataset()` so that you output a dataset of [text_1, text_2, label], where label is +1 if
the pair of texts is similar and -1 if it is dissimilar. See the docstring for `processed_local_dataset()` and the sketch after this list.
2. Modify the "hyperparameters" for a couple of the functions, e.g. `optimize_matrix()`, to suit your needs.
3. Modify how embeddings are generated, e.g. `_get_embedding()`, to suit your needs.
4. Adjust any logic for creating negative pairs, that is, modify `_generate_negative_pairs()`, to suit your needs. For
example, if you have multi-class labels you'll have to modify how you generate negative pairs.
5. Profit!
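
As a sketch of step 1, a replacement `processed_local_dataset()` might look like the following; the parameter name and validation logic are assumptions for illustration:

```python
import pandas as pd

def processed_local_dataset(local_dataset_path: str) -> pd.DataFrame:
    """Load your own dataset of [text_1, text_2, label] pairs.

    The parameter name here is an assumption; match it to your DAG's inputs.
    """
    df = pd.read_csv(local_dataset_path)
    missing = {"text_1", "text_2", "label"} - set(df.columns)
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    return df
```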

To execute the DAG, the recommended outputs to grab are:

```python
outputs = [
    "train_accuracy",
    "test_accuracy",
    "embedded_dataset_histogram",
    "test_accuracy_post_optimization",
    "accuracy_plot",
    "training_and_test_loss_plot",
    "customized_embeddings_dataframe",
    "customized_dataset_histogram",
]
```
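
Putting that together, a driver invocation might look like the following sketch (the module's import name is an assumption about how you've pulled the code down):

```python
from hamilton import driver

import customize_embeddings  # this module; the import name is an assumption

dr = driver.Driver({"source": "snli"}, customize_embeddings)
results = dr.execute(
    [
        "train_accuracy",
        "test_accuracy",
        "embedded_dataset_histogram",
        "test_accuracy_post_optimization",
        "accuracy_plot",
        "training_and_test_loss_plot",
        "customized_embeddings_dataframe",
        "customized_dataset_histogram",
    ],
    # With config {} you would also pass your own dataframe, e.g.:
    # inputs={"processed_local_dataset": my_df},
)
```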

You should be able to read the module top to bottom, which corresponds roughly to the order of execution.

## What type of functionality is in this module?
The module includes functions for getting embeddings, computing cosine similarity, processing datasets, splitting data,
generating negative pairs, optimizing matrices, and plotting results. It also shows how one can use `@config.when`
to swap out how the data is loaded, `@check_output` for schema validation of a dataframe, `@parameterize` for
creating many functions from a single function, and `@inject` to collect the results of the `@parameterize`-created functions.
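
For instance, the `@config.when` data-loading swap follows Hamilton's name-suffix pattern, roughly like this sketch (function and parameter names are illustrative, not the module's):

```python
import pandas as pd
from hamilton.function_modifiers import config

@config.when(source="snli")
def processed_dataset__snli(snli_corpus: pd.DataFrame) -> pd.DataFrame:
    """Resolves to `processed_dataset` when config is {"source": "snli"}."""
    ...

@config.when(source="local")
def processed_dataset__local(local_dataset_path: str) -> pd.DataFrame:
    """Resolves to `processed_dataset` when config is {"source": "local"}."""
    ...
```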

The module uses libraries such as numpy, pandas, plotly, torch, sklearn, and openai.

# Configuration Options
`{"source": "snli"}` will use the Stanford SNLI corpus. Use this to run the code as-is.

`{"source": "local"}` will load a local dataset you provide a path to.

`{}` will require you to pass in a dataframe with name `processed_local_dataset` as an input.

# Limitations
This code is currently set up to work with OpenAI. It could be modified to work with other embedding producers/providers.

The matrix optimization is not set up to be parallelized, though that could be done using Hamilton constructs
like `Parallelizable`.
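
If you did want to parallelize it, Hamilton's `Parallelizable`/`Collect` pattern looks roughly like this sketch; the module does not ship this, and all names here are assumptions:

```python
from hamilton.htypes import Collect, Parallelizable

def candidate_params(hyperparameter_grid: list[dict]) -> Parallelizable[dict]:
    """Fan out: one branch per hyperparameter combination."""
    for params in hyperparameter_grid:
        yield params

def optimized_matrix_candidate(candidate_params: dict) -> dict:
    """Runs once per branch, e.g. one matrix optimization run."""
    ...

def best_matrix(optimized_matrix_candidate: Collect[dict]) -> dict:
    """Fan in: pick the candidate with the best test accuracy."""
    ...
```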