diff --git a/data.dvc b/data.dvc
index e5bcd28..9905113 100644
--- a/data.dvc
+++ b/data.dvc
@@ -1,5 +1,5 @@
outs:
-- md5: 3f85b4e2df7b76c01bcb27989e564e36.dir
- size: 163733640
- nfiles: 6
+- md5: 2ce6297077793c098f42db8660fb0d0e.dir
+ size: 656551290
+ nfiles: 12
path: data
diff --git a/nbs/16_notebook_refactor.ipynb b/nbs/16_notebook_refactor.ipynb
new file mode 100644
index 0000000..b88ef0c
--- /dev/null
+++ b/nbs/16_notebook_refactor.ipynb
@@ -0,0 +1,1031 @@
+{
+ "cells": [
+ {
+ "cell_type": "raw",
+ "metadata": {},
+ "source": [
+ "# Refactor MeaLeon into notebook friendly\n",
+ "---\n",
+ "description: Refactoring Flask App requests into more a notebook friendly iteration\n",
+ "output-file: template.html\n",
+ "title: Notebook refactor\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# | default_exp testing"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# | hide\n",
+ "import dagshub\n",
+ "import dill as pickle\n",
+ "import joblib\n",
+ "import mlflow\n",
+ "from mlflow.models import infer_signature\n",
+ "import nbdev #; nbdev.nbdev_export()\n",
+ "from nbdev.showdoc import *\n",
+ "import pandas as pd\n",
+ "import re\n",
+ "from sklearn.feature_extraction.text import (\n",
+ " CountVectorizer\n",
+ " , TfidfTransformer\n",
+ " , TfidfVectorizer\n",
+ " , \n",
+ ")\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.pipeline import make_pipeline\n",
+ "from src.backend.embedding_creation.apply_stanza import CustomSKLearnAnalyzer\n",
+ "from src.backend.embedding_creation.sklearn_transformer_as_mlflow_model import CustomSKLearnWrapper\n",
+ "import src.backend.raw_data_cleaning.raw_data_preprocessor as rdpp\n",
+ "import stanza\n",
+ "from tqdm import tqdm"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Need to call DAGsHub to keep track of what we're doing"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
Repository initialized!\n",
+ "
\n"
+ ],
+ "text/plain": [
+ "Repository initialized!\n"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "#@markdown Enter the username of your DAGsHub account:\n",
+ "DAGSHUB_USER_NAME = \"AaronWChen\" #@param {type:\"string\"}\n",
+ "\n",
+ "#@markdown Enter the email for your DAGsHub account:\n",
+ "DAGSHUB_EMAIL = \"awc33@cornell.edu\" #@param {type:\"string\"}\n",
+ "\n",
+ "#@markdown Enter the repo name \n",
+ "DAGSHUB_REPO_NAME = \"MeaLeon\"\n",
+ "\n",
+ "#@markdown Enter the name of the branch you are working on \n",
+ "BRANCH = \"init_mealeon_to_notebook_refactor\"\n",
+ "dagshub.init(repo_name=DAGSHUB_REPO_NAME\n",
+ " , repo_owner=DAGSHUB_USER_NAME)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Things I need to do\n",
+ "\n",
+ "1. app.py calls find_similar_dishes, returns a render template\n",
+ "2. find_similar_dishes needs to call the recipe database, the sklearn model, the model-transformed database (ie, TFIDF word matrix), and the query (which needs to be transformed)\n",
+ " 1. Little confused by order; why would i need the original database if i can just call the model/vector-transformed version?\n",
+ " 1. Original database has things like url and ID, which could be needed later\n",
+ " 2. ~~Future vector data can use the same recipe_id unique key, but only have the ingredient vectors. Use unique key to join original...~~\n",
+ " 3. Wait, need cuisine filter to improve search results...so vector database should have cuisine and recipe_id\n",
+ " 4. From that, can call back to original database to get URLs and other metadata\n",
+ " 1. SQLModel query to join\n",
+ " 2. Sklearn model (really any model that transforms the query) needs to be loaded from MLflow\n",
+ " 1. Model will be used to transform query for similarity analysis\n",
+ " 2. MLflow load\n",
+ " 3. Vector database needs to be loaded from currently a json, but should switch to Vespa\n",
+ " 1. Wouldn't this need to be linked to the MLflow Model? DVC + Vespa?\n",
+ " 2. Mlflow or DVC load?\n",
+ " 4. Original recipe database might also be DVC?\n",
+ " 5. \n",
+ "3. original query should be formatted and stored into recipe database (CRUD)\n",
+ "4. this is called to edamam API\n",
+ "5. edamam return is currently model-transformed then cuisine filtered\n",
+ " 1. Swap this order so we don't have to process as much text\n",
+ "6. Vector comparison against filtered data"
+ ]
+ },
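+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a rough illustration of the flow above, here is a minimal sketch of `find_similar_dishes` (the names `loaded_transformer`, `transformed_recipes`, and `metadata_df` are placeholder assumptions, not the app's actual objects):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hedged sketch of the find_similar_dishes flow described in the list above;\n",
+ "# assumes a fitted transformer, the transformed recipe matrix, and a metadata\n",
+ "# dataframe with a 'cuisine_name' column are already in memory.\n",
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
+ "\n",
+ "\n",
+ "def find_similar_dishes_sketch(query_text, cuisine, loaded_transformer, transformed_recipes, metadata_df, top_n=5):\n",
+ "    # filter candidates by cuisine first so we compare against fewer rows\n",
+ "    keep_ids = metadata_df.index[metadata_df['cuisine_name'] == cuisine]\n",
+ "    candidates = transformed_recipes.loc[transformed_recipes.index.intersection(keep_ids)]\n",
+ "\n",
+ "    # transform the (already preprocessed) query into the same vector space\n",
+ "    query_vec = loaded_transformer.transform([query_text])\n",
+ "\n",
+ "    # rank by cosine similarity, then join back to the original database\n",
+ "    # for URLs and other metadata\n",
+ "    scores = cosine_similarity(query_vec, candidates.values)[0]\n",
+ "    top_ids = candidates.index[scores.argsort()[::-1][:top_n]]\n",
+ "    return metadata_df.loc[top_ids]"
+ ]
+ },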
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Data Preparation\n",
+ "\n",
+ "This part can be the DVC import for our data\n",
+ "\n",
+ "Currently, raw/processed data can be imported with json, need to consider how to access data something like SQL and log some snapshot of this data (and its metadata?) with DVC\n",
+ "\n",
+ "- Can i reuse some parts of GitHub Actions?\n",
+ "\n",
+ "- DVC can handle data files fine, but SQL pulls are currently experimentally supported\n",
+ "- using dvc import-db https://dvc.org/doc/command-reference/import-db\n",
+ "\n",
+ "- DVC with generative AI (might be relevant to vector databases): https://youtu.be/aqMXEvWTuVY?si=2lMKrofl9s10BXVx\n",
+ "\n",
+ "#### Let's start with local data files\n",
+ "\n",
+ "Via automated ETL, DVC could log the raw data, perform the text processing if not an embedding, add the pre processed data back to DVC, then start MLflow with embedding conversion "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[?25l\u001b[32m⠋\u001b[0m Checking graph \n",
+ "Adding... \n",
+ "!\u001b[A\n",
+ " 0% Checking cache in '/home/awchen/Repos/Projects/MeaLeon/.dvc/cache'| |0/? [0\u001b[A\n",
+ " \u001b[A\n",
+ "!\u001b[A\n",
+ " 0%| |Checking out ../data/raw/201706-epicur0/? [00:00, ?files/s]\u001b[A\n",
+ " 0%| |Checking out ../data/raw/201706-epicur0/1 [00:00, ?files/s]\u001b[A\n",
+ "100% Adding...|████████████████████████████████████████|1/1 [00:00, 4.23file/s]\u001b[A\n",
+ "\n",
+ "To track the changes with git, run:\n",
+ "\n",
+ "\tgit add ../data.dvc\n",
+ "\n",
+ "To enable auto staging, run:\n",
+ "\n",
+ "\tdvc config core.autostage true\n",
+ "\u001b[0m"
+ ]
+ }
+ ],
+ "source": [
+ "# raw data\n",
+ "\n",
+ "!dvc add \"../data/raw/201706-epicurious-recipes-en.json\"\n",
+ "raw_df = pd.read_json(\"../data/raw/201706-epicurious-recipes-en.json\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# ETL work (currently, data cleaning/prep)\n",
+ "# how the prep works is via dataframe_preprocessor \n",
+ "cleaned_df = rdpp.preprocess_dataframe(raw_df)\n",
+ "cleaned_df.to_parquet(\"../data/processed/cleaned_df.parquet.gzip\", compression=\"gzip\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[?25l\u001b[32m⠋\u001b[0m Checking graph \n",
+ "Adding... \n",
+ "!\u001b[A\n",
+ " 0% Checking cache in '/home/awchen/Repos/Projects/MeaLeon/.dvc/cache'| |0/? [0\u001b[A\n",
+ " \u001b[A\n",
+ "!\u001b[A\n",
+ " 0%| |Transferring 0/? [00:00, ?file/s]\u001b[A\n",
+ " 0%| |Transferring 0/1 [00:00, ?file/s]\u001b[A\n",
+ " \u001b[A\n",
+ "!\u001b[A\n",
+ " 0%| |Checking out ../data/processed/cleaned0/? [00:00, ?files/s]\u001b[A\n",
+ " 0%| |Checking out ../data/processed/cleaned0/1 [00:00, ?files/s]\u001b[A\n",
+ "100% Adding...|████████████████████████████████████████|1/1 [00:00, 17.84file/s]\u001b[A\n",
+ "\n",
+ "To track the changes with git, run:\n",
+ "\n",
+ "\tgit add ../data.dvc\n",
+ "\n",
+ "To enable auto staging, run:\n",
+ "\n",
+ "\tdvc config core.autostage true\n",
+ "\u001b[0m"
+ ]
+ }
+ ],
+ "source": [
+ "# add cleaned dataframe to DVC\n",
+ "!dvc add \"../data/processed/cleaned_df.parquet.gzip\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Need to commit DVC/data changes to git, does that need to be done in this cell?\n",
+ "- based off of the nbdev tools currently (where it essentially runs the whole notebook), this may not be a good idea\n",
+ "- when working out of a notebook for testing, dvc maybe can pull the data, but we should not be doing the actual processing here\n",
+ "\n",
+ "In the future, can/should the data cleaning be done in dbt?\n",
+ "\n",
+ "- no, dbt is more about analytics then data cleaning, it seems\n",
+ "\n",
+ "- if text processing needed regularly, might have to put in Airflow\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Now that we have converted the raw dataframe to a cleaner form with lemmatization (if needed/preferred) we can move on to the embedding transformation. Currently, this is another ETL done with `nlp_processor`, but performed with an MLflow model and this embedding transformed/vectorized data should then added back to DVC.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "In the future, we can take the embeddings and convert them to PyTorch tensors/datasets, which is not something we can do with the original raw text"
+ ]
+ },
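+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For reference, a sketch of the commit step discussed above; per the note, these commands belong in a terminal at the repo root rather than in this notebook:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Not executed here, per the note above; run from a terminal at the repo root:\n",
+ "# git add data.dvc\n",
+ "# git commit -m \"Track updated data with DVC\"\n",
+ "# dvc push"
+ ]
+ },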
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#| export\n",
+ "# this is a custom function to be used with MLflow to get or create experiments (is from the MLflow team)\n",
+ "def get_mlflow_experiment_id(name):\n",
+ " # this function allows us to get the experiment ID from an experiment name\n",
+ " exp = mlflow.get_experiment_by_name(name)\n",
+ " if exp is None:\n",
+ " exp_id = mlflow.create_experiment(name)\n",
+ " return exp_id\n",
+ " return exp.experiment_id"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Starting DEV stage for TFIDF Encoded model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mlflow.set_tracking_uri(f'https://dagshub.com/{DAGSHUB_USER_NAME}/MeaLeon.mlflow')\n",
+ "\n",
+ "# starter idea for making an experiment name, can be the git branch, but need more specificity\n",
+ "experiment_name = f\"{DAGSHUB_EMAIL}/DVC-MLflow-integration-test\"\n",
+ "mlflow_exp_id = get_mlflow_experiment_id(experiment_name)\n",
+ "\n",
+ "# define processed data location and data to be added to DVC\n",
+ "processed_data_base = \"../data/processed\"\n",
+ "transformed_recipes_parquet_path = processed_data_base + \"/transformed_recipes.parquet.gzip\"\n",
+ "combined_df_path = processed_data_base + \"/combined_df.parquet.gzip\"\n",
+ "\n",
+ "\n",
+ "# define model location\n",
+ "model_directory = \"../models/sklearn_model\"\n",
+ "\n",
+ "# Define the required artifacts associated with the saved custom pyfunc\n",
+ "sklearn_model_path = model_directory + \"/python_model.pkl\"\n",
+ "sklearn_transformer_path = model_directory + \"/sklearn_transformer.pkl\"\n",
+ "# transformed_recipes_path = model_directory + \"/transformed_recipes.pkl\"\n",
+ "combined_df_sample_path = model_directory + \"/combined_df_sample.parquet\"\n",
+ "\n",
+ "artifacts = {'sklearn_model': sklearn_model_path,\n",
+ " 'sklearn_transformer': sklearn_transformer_path,\n",
+ " # 'transformed_recipes': transformed_recipes_path,\n",
+ " # 'combined_data': combined_df_path,\n",
+ " 'combined_data_sample': combined_df_sample_path\n",
+ " }\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " 0% Checkout| |0/27 [00:00, ?file/s]\n",
+ "!\u001b[A\n",
+ "Building data objects from ../joblib/2022.08.23 |0.00 [00:00, ?obj/s]\u001b[A\n",
+ " \u001b[A\n",
+ "!\u001b[A\n",
+ "Building data objects from ../data |0.00 [00:00, ?obj/s]\u001b[A\n",
+ "\u001b[33mM\u001b[0m ..\u001b[35m/data/\u001b[0m \u001b[A\n",
+ "\u001b[31mD\u001b[0m data/raw/\u001b[1;36m201706\u001b[0m-epicurious-recipes-en.json\n",
+ "\u001b[31mD\u001b[0m data/processed/cleaned_df.parquet.gzip\n",
+ "2 files deleted and 1 file modified\n",
+ "\u001b[0m"
+ ]
+ }
+ ],
+ "source": [
+ "# Prepare whole dataframe for new processing\n",
+ "!dvc pull"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " dek | \n",
+ " hed | \n",
+ " aggregateRating | \n",
+ " ingredients | \n",
+ " prepSteps | \n",
+ " reviewsCount | \n",
+ " willMakeAgainPct | \n",
+ " ingredients_lemmafied | \n",
+ " cuisine_name | \n",
+ " photo_filename | \n",
+ " photo_credit | \n",
+ " author_name | \n",
+ " date_published | \n",
+ " recipe_url | \n",
+ "
\n",
+ " \n",
+ " id | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 54a2b6b019925f464b373351 | \n",
+ " How does fried chicken achieve No. 1 status? B... | \n",
+ " Pickle-Brined Fried Chicken | \n",
+ " 3.11 | \n",
+ " [1 tablespoons yellow mustard seeds, 1 tablesp... | \n",
+ " [Toast mustard and coriander seeds in a dry me... | \n",
+ " 7 | \n",
+ " 100 | \n",
+ " tablespoon yellow mustard seed brk tablespoon ... | \n",
+ " Missing Cuisine | \n",
+ " 51247610_fried-chicken_1x1.jpg | \n",
+ " Michael Graydon and Nikole Herriott | \n",
+ " Missing Author Name | \n",
+ " 2014-08-19 04:00:00+00:00 | \n",
+ " https://www.epicurious.com/recipes/food/views/... | \n",
+ "
\n",
+ " \n",
+ " 54a408a019925f464b3733bc | \n",
+ " Spinaci all'Ebraica | \n",
+ " Spinach Jewish Style | \n",
+ " 3.22 | \n",
+ " [3 pounds small-leaved bulk spinach, Salt, 1/2... | \n",
+ " [Remove the stems and roots from the spinach. ... | \n",
+ " 5 | \n",
+ " 80 | \n",
+ " pound small leave bulk spinach brk salt brk cu... | \n",
+ " Italian | \n",
+ " EP_12162015_placeholders_rustic.jpg | \n",
+ " Photo by Chelsea Kyle, Prop Styling by Anna St... | \n",
+ " Edda Servi Machlin | \n",
+ " 2008-09-09 04:00:00+00:00 | \n",
+ " https://www.epicurious.com/recipes/food/views/... | \n",
+ "
\n",
+ " \n",
+ " 54a408a26529d92b2c003631 | \n",
+ " This majestic, moist, and richly spiced honey ... | \n",
+ " New Year’s Honey Cake | \n",
+ " 3.62 | \n",
+ " [3 1/2 cups all-purpose flour, 1 tablespoon ba... | \n",
+ " [I like this cake best baked in a 9-inch angel... | \n",
+ " 105 | \n",
+ " 88 | \n",
+ " cup purpose flour brk tablespoon baking powder... | \n",
+ " Kosher | \n",
+ " EP_09022015_honeycake-2.jpg | \n",
+ " Photo by Chelsea Kyle, Food Styling by Anna St... | \n",
+ " Marcy Goldman | \n",
+ " 2008-09-10 04:00:00+00:00 | \n",
+ " https://www.epicurious.com/recipes/food/views/... | \n",
+ "
\n",
+ " \n",
+ " 54a408a66529d92b2c003638 | \n",
+ " The idea for this sandwich came to me when my ... | \n",
+ " The B.L.A.Bagel with Lox and Avocado | \n",
+ " 4.00 | \n",
+ " [1 small ripe avocado, preferably Hass (see No... | \n",
+ " [A short time before serving, mash avocado and... | \n",
+ " 7 | \n",
+ " 100 | \n",
+ " small ripe avocado hass see note brk teaspoon ... | \n",
+ " Kosher | \n",
+ " EP_12162015_placeholders_casual.jpg | \n",
+ " Photo by Chelsea Kyle, Prop Styling by Rhoda B... | \n",
+ " Faye Levy | \n",
+ " 2008-09-08 04:00:00+00:00 | \n",
+ " https://www.epicurious.com/recipes/food/views/... | \n",
+ "
\n",
+ " \n",
+ " 54a408a719925f464b3733cc | \n",
+ " In 1930, Simon Agranat, the chief justice of t... | \n",
+ " Shakshuka a la Doktor Shakshuka | \n",
+ " 2.71 | \n",
+ " [2 pounds fresh tomatoes, unpeeled and cut in ... | \n",
+ " [1. Place the tomatoes, garlic, salt, paprika,... | \n",
+ " 7 | \n",
+ " 83 | \n",
+ " pound fresh tomato unpeeled cut quarter ounce ... | \n",
+ " Kosher | \n",
+ " EP_12162015_placeholders_formal.jpg | \n",
+ " Photo by Chelsea Kyle, Prop Styling by Rhoda B... | \n",
+ " Joan Nathan | \n",
+ " 2008-09-09 04:00:00+00:00 | \n",
+ " https://www.epicurious.com/recipes/food/views/... | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " dek \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 How does fried chicken achieve No. 1 status? B... \n",
+ "54a408a019925f464b3733bc Spinaci all'Ebraica \n",
+ "54a408a26529d92b2c003631 This majestic, moist, and richly spiced honey ... \n",
+ "54a408a66529d92b2c003638 The idea for this sandwich came to me when my ... \n",
+ "54a408a719925f464b3733cc In 1930, Simon Agranat, the chief justice of t... \n",
+ "\n",
+ " hed \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 Pickle-Brined Fried Chicken \n",
+ "54a408a019925f464b3733bc Spinach Jewish Style \n",
+ "54a408a26529d92b2c003631 New Year’s Honey Cake \n",
+ "54a408a66529d92b2c003638 The B.L.A.Bagel with Lox and Avocado \n",
+ "54a408a719925f464b3733cc Shakshuka a la Doktor Shakshuka \n",
+ "\n",
+ " aggregateRating \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 3.11 \n",
+ "54a408a019925f464b3733bc 3.22 \n",
+ "54a408a26529d92b2c003631 3.62 \n",
+ "54a408a66529d92b2c003638 4.00 \n",
+ "54a408a719925f464b3733cc 2.71 \n",
+ "\n",
+ " ingredients \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 [1 tablespoons yellow mustard seeds, 1 tablesp... \n",
+ "54a408a019925f464b3733bc [3 pounds small-leaved bulk spinach, Salt, 1/2... \n",
+ "54a408a26529d92b2c003631 [3 1/2 cups all-purpose flour, 1 tablespoon ba... \n",
+ "54a408a66529d92b2c003638 [1 small ripe avocado, preferably Hass (see No... \n",
+ "54a408a719925f464b3733cc [2 pounds fresh tomatoes, unpeeled and cut in ... \n",
+ "\n",
+ " prepSteps \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 [Toast mustard and coriander seeds in a dry me... \n",
+ "54a408a019925f464b3733bc [Remove the stems and roots from the spinach. ... \n",
+ "54a408a26529d92b2c003631 [I like this cake best baked in a 9-inch angel... \n",
+ "54a408a66529d92b2c003638 [A short time before serving, mash avocado and... \n",
+ "54a408a719925f464b3733cc [1. Place the tomatoes, garlic, salt, paprika,... \n",
+ "\n",
+ " reviewsCount willMakeAgainPct \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 7 100 \n",
+ "54a408a019925f464b3733bc 5 80 \n",
+ "54a408a26529d92b2c003631 105 88 \n",
+ "54a408a66529d92b2c003638 7 100 \n",
+ "54a408a719925f464b3733cc 7 83 \n",
+ "\n",
+ " ingredients_lemmafied \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 tablespoon yellow mustard seed brk tablespoon ... \n",
+ "54a408a019925f464b3733bc pound small leave bulk spinach brk salt brk cu... \n",
+ "54a408a26529d92b2c003631 cup purpose flour brk tablespoon baking powder... \n",
+ "54a408a66529d92b2c003638 small ripe avocado hass see note brk teaspoon ... \n",
+ "54a408a719925f464b3733cc pound fresh tomato unpeeled cut quarter ounce ... \n",
+ "\n",
+ " cuisine_name \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 Missing Cuisine \n",
+ "54a408a019925f464b3733bc Italian \n",
+ "54a408a26529d92b2c003631 Kosher \n",
+ "54a408a66529d92b2c003638 Kosher \n",
+ "54a408a719925f464b3733cc Kosher \n",
+ "\n",
+ " photo_filename \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 51247610_fried-chicken_1x1.jpg \n",
+ "54a408a019925f464b3733bc EP_12162015_placeholders_rustic.jpg \n",
+ "54a408a26529d92b2c003631 EP_09022015_honeycake-2.jpg \n",
+ "54a408a66529d92b2c003638 EP_12162015_placeholders_casual.jpg \n",
+ "54a408a719925f464b3733cc EP_12162015_placeholders_formal.jpg \n",
+ "\n",
+ " photo_credit \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 Michael Graydon and Nikole Herriott \n",
+ "54a408a019925f464b3733bc Photo by Chelsea Kyle, Prop Styling by Anna St... \n",
+ "54a408a26529d92b2c003631 Photo by Chelsea Kyle, Food Styling by Anna St... \n",
+ "54a408a66529d92b2c003638 Photo by Chelsea Kyle, Prop Styling by Rhoda B... \n",
+ "54a408a719925f464b3733cc Photo by Chelsea Kyle, Prop Styling by Rhoda B... \n",
+ "\n",
+ " author_name date_published \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 Missing Author Name 2014-08-19 04:00:00+00:00 \n",
+ "54a408a019925f464b3733bc Edda Servi Machlin 2008-09-09 04:00:00+00:00 \n",
+ "54a408a26529d92b2c003631 Marcy Goldman 2008-09-10 04:00:00+00:00 \n",
+ "54a408a66529d92b2c003638 Faye Levy 2008-09-08 04:00:00+00:00 \n",
+ "54a408a719925f464b3733cc Joan Nathan 2008-09-09 04:00:00+00:00 \n",
+ "\n",
+ " recipe_url \n",
+ "id \n",
+ "54a2b6b019925f464b373351 https://www.epicurious.com/recipes/food/views/... \n",
+ "54a408a019925f464b3733bc https://www.epicurious.com/recipes/food/views/... \n",
+ "54a408a26529d92b2c003631 https://www.epicurious.com/recipes/food/views/... \n",
+ "54a408a66529d92b2c003638 https://www.epicurious.com/recipes/food/views/... \n",
+ "54a408a719925f464b3733cc https://www.epicurious.com/recipes/food/views/... "
+ ]
+ },
+ "execution_count": null,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# this part can be done after a dvc pull\n",
+ "whole_nlp_df = pd.read_parquet(\"../data/processed/cleaned_df.parquet.gzip\")\n",
+ "whole_nlp_df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\n",
+ "--------------------------------------------------------------------------------\n",
+ "sklearn fit transform on ingredients:\n",
+ "\n",
+ "\n",
+ "--------------------------------------------------------------------------------\n",
+ "Input Data: \n",
+ "id\n",
+ "54a2b6b019925f464b373351 tablespoon yellow mustard seed brk tablespoon ...\n",
+ "54a408a019925f464b3733bc pound small leave bulk spinach brk salt brk cu...\n",
+ "54a408a26529d92b2c003631 cup purpose flour brk tablespoon baking powder...\n",
+ "54a408a66529d92b2c003638 small ripe avocado hass see note brk teaspoon ...\n",
+ "54a408a719925f464b3733cc pound fresh tomato unpeeled cut quarter ounce ...\n",
+ " ... \n",
+ "59541a31bff3052847ae2107 tablespoon unsalt butter room temperature brk ...\n",
+ "5954233ad52ca90dc28200e7 tablespoon stick salt butter room temperature ...\n",
+ "595424c2109c972493636f83 tablespoon unsalted butter more greasing pan b...\n",
+ "5956638625dc3d1d829b7166 coarse salt brk lime wedge brk ounce tomato ju...\n",
+ "59566daa25dc3d1d829b7169 bottle millileter sour beer such almanac citra...\n",
+ "Name: ingredients_lemmafied, Length: 34756, dtype: object\n",
+ "\n",
+ "\n",
+ "--------------------------------------------------------------------------------\n",
+ "Input Data Shape: \n",
+ "(34756,)\n",
+ "\n",
+ "\n",
+ "--------------------------------------------------------------------------------\n",
+ "Random 3 Records from Input Data: \n",
+ "id\n",
+ "54a40caa19925f464b374017 boneless muscovy duck breast half pound total ...\n",
+ "55d4e08063b1ba1b5534b198 tablespoon white wine vinegar brk teaspoon sug...\n",
+ "54a43ad16529d92b2c019fc3 cup basmati rice ounce brk cup sweeten flake c...\n",
+ "Name: ingredients_lemmafied, dtype: object\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 34756/34756 [00:03<00:00, 10450.53it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\n",
+ "--------------------------------------------------------------------------------\n",
+ "Transformed Data:\n",
+ " 100g 125g 13x9x2 150g 1pound 1tablespoon \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "54a408a019925f464b3733bc 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "54a408a26529d92b2c003631 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "54a408a66529d92b2c003638 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "54a408a719925f464b3733cc 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "\n",
+ " 1teaspoon 200g 250g 2cup ... árbol divide \\\n",
+ "id ... \n",
+ "54a2b6b019925f464b373351 0.0 0.0 0.0 0.0 ... 0.0 \n",
+ "54a408a019925f464b3733bc 0.0 0.0 0.0 0.0 ... 0.0 \n",
+ "54a408a26529d92b2c003631 0.0 0.0 0.0 0.0 ... 0.0 \n",
+ "54a408a66529d92b2c003638 0.0 0.0 0.0 0.0 ... 0.0 \n",
+ "54a408a719925f464b3733cc 0.0 0.0 0.0 0.0 ... 0.0 \n",
+ "\n",
+ " árbol seed árbol seed remove árbol stem \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 0.0 0.0 0.0 \n",
+ "54a408a019925f464b3733bc 0.0 0.0 0.0 \n",
+ "54a408a26529d92b2c003631 0.0 0.0 0.0 \n",
+ "54a408a66529d92b2c003638 0.0 0.0 0.0 \n",
+ "54a408a719925f464b3733cc 0.0 0.0 0.0 \n",
+ "\n",
+ " árbol teaspoon árbol teaspoon crush \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 0.0 0.0 \n",
+ "54a408a019925f464b3733bc 0.0 0.0 \n",
+ "54a408a26529d92b2c003631 0.0 0.0 \n",
+ "54a408a66529d92b2c003638 0.0 0.0 \n",
+ "54a408a719925f464b3733cc 0.0 0.0 \n",
+ "\n",
+ " árbol teaspoon crush red árbol wipe \\\n",
+ "id \n",
+ "54a2b6b019925f464b373351 0.0 0.0 \n",
+ "54a408a019925f464b3733bc 0.0 0.0 \n",
+ "54a408a26529d92b2c003631 0.0 0.0 \n",
+ "54a408a66529d92b2c003638 0.0 0.0 \n",
+ "54a408a719925f464b3733cc 0.0 0.0 \n",
+ "\n",
+ " árbol wipe clean épice \n",
+ "id \n",
+ "54a2b6b019925f464b373351 0.0 0.0 \n",
+ "54a408a019925f464b3733bc 0.0 0.0 \n",
+ "54a408a26529d92b2c003631 0.0 0.0 \n",
+ "54a408a66529d92b2c003638 0.0 0.0 \n",
+ "54a408a719925f464b3733cc 0.0 0.0 \n",
+ "\n",
+ "[5 rows x 78381 columns]\n",
+ "\n",
+ "\n",
+ "--------------------------------------------------------------------------------\n",
+ "Random Sample of Combined Data:\n",
+ " 100g 125g 13x9x2 150g 1pound 1tablespoon \\\n",
+ "id \n",
+ "54a40caa19925f464b374017 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "54a43ad16529d92b2c019fc3 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "55d4e08063b1ba1b5534b198 0.0 0.0 0.0 0.0 0.0 0.0 \n",
+ "\n",
+ " 1teaspoon 200g 250g 2cup ... árbol seed \\\n",
+ "id ... \n",
+ "54a40caa19925f464b374017 0.0 0.0 0.0 0.0 ... 0.0 \n",
+ "54a43ad16529d92b2c019fc3 0.0 0.0 0.0 0.0 ... 0.0 \n",
+ "55d4e08063b1ba1b5534b198 0.0 0.0 0.0 0.0 ... 0.0 \n",
+ "\n",
+ " árbol seed remove árbol stem árbol teaspoon \\\n",
+ "id \n",
+ "54a40caa19925f464b374017 0.0 0.0 0.0 \n",
+ "54a43ad16529d92b2c019fc3 0.0 0.0 0.0 \n",
+ "55d4e08063b1ba1b5534b198 0.0 0.0 0.0 \n",
+ "\n",
+ " árbol teaspoon crush árbol teaspoon crush red \\\n",
+ "id \n",
+ "54a40caa19925f464b374017 0.0 0.0 \n",
+ "54a43ad16529d92b2c019fc3 0.0 0.0 \n",
+ "55d4e08063b1ba1b5534b198 0.0 0.0 \n",
+ "\n",
+ " árbol wipe árbol wipe clean épice \\\n",
+ "id \n",
+ "54a40caa19925f464b374017 0.0 0.0 0.0 \n",
+ "54a43ad16529d92b2c019fc3 0.0 0.0 0.0 \n",
+ "55d4e08063b1ba1b5534b198 0.0 0.0 0.0 \n",
+ "\n",
+ " ingredients_lemmafied \n",
+ "id \n",
+ "54a40caa19925f464b374017 boneless muscovy duck breast half pound total ... \n",
+ "54a43ad16529d92b2c019fc3 cup basmati rice ounce brk cup sweeten flake c... \n",
+ "55d4e08063b1ba1b5534b198 tablespoon white wine vinegar brk teaspoon sug... \n",
+ "\n",
+ "[3 rows x 78382 columns]\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "413513de77ec40e097f0fe537db730da",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading artifacts: 0%| | 0/1 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f463247aaa654948a7d7170a6d90f997",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading artifacts: 0%| | 0/1 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "bf79aae1fe38472ba528d8b84d1b5f65",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading artifacts: 0%| | 0/1 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "2024/07/29 21:58:31 WARNING mlflow.utils.environment: Encountered an unexpected error while inferring pip requirements (model URI: /tmp/tmpzmn49nj8/model, flavor: python_function), fall back to return ['cloudpickle==2.2.1']. Set logging level to DEBUG to see the full traceback.\n",
+ "/home/awchen/Repos/Projects/MeaLeon/.venv/lib/python3.10/site-packages/_distutils_hack/__init__.py:18: UserWarning: Distutils was imported before Setuptools, but importing Setuptools also replaces the `distutils` module in `sys.modules`. This may lead to undesirable behaviors or errors. To avoid these issues, avoid using distutils directly, ensure that setuptools is installed in the traditional way (e.g. not an editable install), and/or make sure that setuptools is always imported before distutils.\n",
+ " warnings.warn(\n",
+ "/home/awchen/Repos/Projects/MeaLeon/.venv/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.\n",
+ " warnings.warn(\"Setuptools is replacing distutils.\")\n",
+ "2024/07/29 21:59:09 WARNING mlflow.models.model: Logging model metadata to the tracking server has failed. The model artifacts have been logged successfully under mlflow-artifacts:/5abd2670253447e0a4988212aabcf35a/e3cf27f656504b0d9b6d5a8d4ce1abb2/artifacts. Set logging level to DEBUG via `logging.getLogger(\"mlflow\").setLevel(logging.DEBUG)` to see the full traceback.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# load from MLflow\n",
+ "mlflow_client = mlflow.tracking.MlflowClient(\n",
+ " tracking_uri=f'https://dagshub.com/{DAGSHUB_USER_NAME}/MeaLeon.mlflow')\n",
+ "\n",
+ "# cv_params are parameters for the sklearn CountVectorizer or TFIDFVectorizer\n",
+ "sklearn_transformer_params = { \n",
+ " 'analyzer': CustomSKLearnAnalyzer().ngram_maker(\n",
+ " min_ngram_length=1,\n",
+ " max_ngram_length=4,\n",
+ " ),\n",
+ " 'min_df':3,\n",
+ " 'binary':False\n",
+ "}\n",
+ "\n",
+ "# pipeline_params are parameters that will be logged in MLFlow and are a superset of library parameters\n",
+ "pipeline_params = {\n",
+ " 'stanza_model': 'en',\n",
+ " 'sklearn-transformer': 'TFIDF'\n",
+ "}\n",
+ "\n",
+ "# update the pipeline parameters with the library-specific ones so that they show up in MLflow Tracking\n",
+ "pipeline_params.update(sklearn_transformer_params)\n",
+ "\n",
+ "with mlflow.start_run(experiment_id=mlflow_exp_id): \n",
+ " # LOG PARAMETERS\n",
+ " mlflow.log_params(pipeline_params)\n",
+ "\n",
+ " # LOG INPUTS (QUERIES) AND OUTPUTS\n",
+ " # MLflow example uses a list of strings or a list of str->str dicts\n",
+ " # Will be useful in STAGING/Evaluation\n",
+ " \n",
+ " # LOG MODEL\n",
+ " # Instantiate sklearn TFIDFVectorizer\n",
+ " sklearn_transformer = TfidfVectorizer(**sklearn_transformer_params)\n",
+ "\n",
+ " print('\\n')\n",
+ " print('-' * 80)\n",
+ " print('sklearn fit transform on ingredients:')\n",
+ "\n",
+ " model_input = whole_nlp_df['ingredients_lemmafied']\n",
+ "\n",
+ " print('\\n')\n",
+ " print('-' * 80)\n",
+ " print('Input Data: ')\n",
+ " print(model_input)\n",
+ "\n",
+ " print('\\n')\n",
+ " print('-' * 80)\n",
+ " print('Input Data Shape: ')\n",
+ " print(model_input.shape)\n",
+ "\n",
+ " random_sample = model_input.sample(3, random_state=200)\n",
+ "\n",
+ " print('\\n')\n",
+ " print('-' * 80)\n",
+ " print('Random 3 Records from Input Data: ')\n",
+ " print(random_sample)\n",
+ "\n",
+ " # Do fit transform on data\n",
+ " response = sklearn_transformer.fit_transform(tqdm(model_input)) \n",
+ " \n",
+ " transformed_recipe = pd.DataFrame(\n",
+ " response.toarray(),\n",
+ " columns=sklearn_transformer.get_feature_names_out(),\n",
+ " index=model_input.index\n",
+ " )\n",
+ "\n",
+ " signature = infer_signature(model_input=model_input,\n",
+ " model_output=transformed_recipe\n",
+ " )\n",
+ "\n",
+ " print('\\n')\n",
+ " print('-' * 80)\n",
+ " print('Transformed Data:')\n",
+ " print(transformed_recipe.head())\n",
+ " \n",
+ " combined_df = transformed_recipe.join(model_input, how='inner')\n",
+ " combined_df_sample = transformed_recipe.join(random_sample, how='inner')\n",
+ "\n",
+ " print('\\n')\n",
+ " print('-' * 80)\n",
+ " print('Random Sample of Combined Data:')\n",
+ " print(combined_df_sample.head())\n",
+ "\n",
+ " with open(sklearn_transformer_path, \"wb\") as fo:\n",
+ " pickle.dump(sklearn_transformer, fo)\n",
+ "\n",
+ " transformed_recipe.to_parquet(path=transformed_recipes_parquet_path, compression=\"gzip\")\n",
+ "\n",
+ " combined_df.to_parquet(path=combined_df_path, compression=\"gzip\")\n",
+ " \n",
+ " combined_df_sample.to_parquet(path=combined_df_sample_path)\n",
+ "\n",
+ " model_info = mlflow.pyfunc.log_model( \n",
+ " code_path=[\"../src/backend/\"],\n",
+ " python_model=CustomSKLearnWrapper(),\n",
+ " input_example=whole_nlp_df['ingredients_lemmafied'][0],\n",
+ " signature=signature, \n",
+ " artifact_path=\"sklearn_model\",\n",
+ " artifacts=artifacts\n",
+ " ) \n",
+ "\n",
+ " # since this uses a custom Stanza analyzer, we have to use a custom mlflow.Pyfunc.PythonModel"
+ ]
+ },
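+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a sanity check, the logged model can be loaded back through the pyfunc flavor and re-applied to a few rows. This is a sketch: `model_info` comes from the run above, and loading pulls the pickled transformer plus its Stanza dependency."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hedged sketch: round-trip the custom pyfunc model from the tracking server\n",
+ "loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)\n",
+ "preview = loaded_model.predict(whole_nlp_df['ingredients_lemmafied'].head(3))\n",
+ "print(preview.shape)"
+ ]
+ },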
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[?25l\u001b[32m⠋\u001b[0m Checking graph \n",
+ "Adding... \n",
+ "!\u001b[A\n",
+ " 0% Checking cache in '/home/awchen/Repos/Projects/MeaLeon/.dvc/cache'| |0/? [0\u001b[A\n",
+ " \u001b[A\n",
+ "!\u001b[A\n",
+ " 0%| |Transferring 0/? [00:00, ?file/s]\u001b[A\n",
+ " 0%| |Transferring 0/1 [00:00, ?file/s]\u001b[A\n",
+ " \u001b[A\n",
+ "!\u001b[A\n",
+ " 0%| |Checking out ../data/processed/transfo0/? [00:00, ?files/s]\u001b[A\n",
+ " 0%| |Checking out ../data/processed/transfo0/1 [00:00, ?files/s]\u001b[A\n",
+ "100% Adding...|████████████████████████████████████████|1/1 [00:00, 5.53file/s]\u001b[A\n",
+ "\n",
+ "To track the changes with git, run:\n",
+ "\n",
+ "\tgit add ../data.dvc\n",
+ "\n",
+ "To enable auto staging, run:\n",
+ "\n",
+ "\tdvc config core.autostage true\n",
+ "\u001b[0m"
+ ]
+ }
+ ],
+ "source": [
+ "!dvc add \"../data/processed/transformed_recipes.parquet.gzip\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[?25l\u001b[32m⠋\u001b[0m Checking graph \n",
+ "Adding... \n",
+ "!\u001b[A\n",
+ " 0% Checking cache in '/home/awchen/Repos/Projects/MeaLeon/.dvc/cache'| |0/? [0\u001b[A\n",
+ " \u001b[A\n",
+ "!\u001b[A\n",
+ " 0%| |Transferring 0/? [00:00, ?file/s]\u001b[A\n",
+ " 0%| |Transferring 0/1 [00:00, ?file/s]\u001b[A\n",
+ " \u001b[A\n",
+ "!\u001b[A\n",
+ " 0%| |Checking out ../data/processed/combine0/? [00:00, ?files/s]\u001b[A\n",
+ " 0%| |Checking out ../data/processed/combine0/1 [00:00, ?files/s]\u001b[A\n",
+ "100% Adding...|████████████████████████████████████████|1/1 [00:00, 5.37file/s]\u001b[A\n",
+ "\n",
+ "To track the changes with git, run:\n",
+ "\n",
+ "\tgit add ../data.dvc\n",
+ "\n",
+ "To enable auto staging, run:\n",
+ "\n",
+ "\tdvc config core.autostage true\n",
+ "\u001b[0m"
+ ]
+ }
+ ],
+ "source": [
+ "!dvc add \"../data/processed/combined_df.parquet.gzip\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/awchen/Repos/Projects/MeaLeon/.venv/lib/python3.10/site-packages/nbdev/export.py:73: UserWarning: Notebook '/home/awchen/Repos/Projects/MeaLeon/nbs/16_notebook_refactor.ipynb' uses `#|export` without `#|default_exp` cell.\n",
+ "Note nbdev2 no longer supports nbdev1 syntax. Run `nbdev_migrate` to upgrade.\n",
+ "See https://nbdev.fast.ai/getting_started.html for more information.\n",
+ " warn(f\"Notebook '{nbname}' uses `#|export` without `#|default_exp` cell.\\n\"\n"
+ ]
+ }
+ ],
+ "source": [
+ "# | hide\n",
+ "nbdev.nbdev_export()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "python3",
+ "language": "python",
+ "name": "python3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/src/backend/embedding_creation/apply_stanza.py b/src/backend/embedding_creation/apply_stanza.py
new file mode 100644
index 0000000..31a63ef
--- /dev/null
+++ b/src/backend/embedding_creation/apply_stanza.py
@@ -0,0 +1,90 @@
+import re
+import stanza
+
+
+class CustomSKLearnAnalyzer:
+ """
+ This class allows sklearn text transformers to incorporate a Stanza pipeline via a custom analyzer
+ """
+
+ def __init__(self, stanza_lang_str="en"):
+ """
+ Constructor method. Initializes the model with a Stanza library language
+ code. The default is "en" for English; later on, we can consider adding
+ functionality to download the pretrained model/embeddings
+ """
+ self.stanza_lang_str = stanza_lang_str
+
+ def prepare_stanza_pipeline(
+ self,
+ depparse_batch_size=50,
+ depparse_min_length_to_batch_separately=50,
+ verbose=True,
+ use_gpu=True,
+ batch_size=100,
+ ):
+ """
+ Method to simplify construction of a Stanza Pipeline for use in the sklearn custom analyzer
+
+ Args:
+ These follow stanza.Pipeline creation (see the Stanza pipeline docs)
+
+ self.stanza_lang_str:
+ str for pretrained Stanza embeddings to use in the pipeline (from init)
+
+ depparse_batch_size:
+ int for batch size for processing, default is 50
+
+ depparse_min_length_to_batch_separately:
+ int for minimum string length to batch, default is 50
+
+ verbose:
+ boolean for information for readouts during processing, default is True
+
+ use_gpu:
+ boolean for using GPU for Stanza; default is True
+ (set to False on machines without a usable GPU, e.g., the streaming computer)
+
+ batch_size:
+ int for batch sizing, default is 100
+
+ Returns:
+ nlp:
+ stanza pipeline
+ """
+
+ # Perhaps down the road, this should be stored as an MLflow Artifact to be downloaded
+ # Or should this be part of the Container building at start up? If so, how would those get logged? Just as artifacts?
+ stanza.download(self.stanza_lang_str)
+
+ nlp = stanza.Pipeline(
+ self.stanza_lang_str,
+ depparse_batch_size=depparse_batch_size,
+ depparse_min_length_to_batch_separately=depparse_min_length_to_batch_separately,
+ verbose=verbose,
+ use_gpu=use_gpu,
+ batch_size=batch_size,
+ )
+
+ return nlp
+
+ @classmethod
+ def ngram_maker(cls, min_ngram_length: int, max_ngram_length: int):
+ """Return an analyzer callable for sklearn vectorizers: for each
+ ' brk '-separated line of a document, it yields word n-grams of
+ lengths min_ngram_length through max_ngram_length."""
+ def ngrams_per_line(row: str):
+ for ln in row.split(" brk "):
+ at_least_two_english_characters_whole_words = r"(?u)\b\w{2,}\b"
+ terms = re.findall(at_least_two_english_characters_whole_words, ln)
+ for ngram_length in range(min_ngram_length, max_ngram_length + 1):
+
+ # find and yield all ngrams of this length
+ # (a zip(*[terms[i:] for i in range(ngram_length)]) version works the
+ # same but has higher memory usage)
+ for i in range(len(terms) - ngram_length + 1):
+ yield " ".join(terms[i : i + ngram_length])
+
+ return ngrams_per_line
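+
+
+if __name__ == "__main__":
+    # Hedged usage sketch (not part of the app): shows how the custom analyzer
+    # plugs into an sklearn vectorizer; the toy documents are assumptions.
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    analyzer = CustomSKLearnAnalyzer.ngram_maker(
+        min_ngram_length=1, max_ngram_length=2
+    )
+    docs = [
+        "tablespoon yellow mustard seed brk kosher salt",
+        "pound fresh tomato brk clove garlic",
+    ]
+    vectorizer = TfidfVectorizer(analyzer=analyzer)
+    print(vectorizer.fit_transform(docs).shape)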
diff --git a/src/backend/embedding_creation/sklearn_transformer_as_mlflow_model.py b/src/backend/embedding_creation/sklearn_transformer_as_mlflow_model.py
new file mode 100644
index 0000000..59a1089
--- /dev/null
+++ b/src/backend/embedding_creation/sklearn_transformer_as_mlflow_model.py
@@ -0,0 +1,78 @@
+import mlflow
+import pandas as pd
+
+
+class CustomSKLearnWrapper(mlflow.pyfunc.PythonModel):
+ """
+ This class allows sklearn text transformers to be logged in MLflow as a
+ custom PythonModel. It overrides the default load_context and predict methods (as required by MLflow).
+ load_context now loads pickled files representing the model itself (which requires Stanza) and the transformer (which is an sklearn object)
+ """
+
+ # def __init__(self, model):
+ # """
+ # Constructor method. Initializes the model with a Stanza libary language
+ # type. The default is "en" for English
+
+ # model: sklearn.Transformer
+ # The sklearn text Transformer or Pipeline that ends in a
+ # Transformer
+
+ # later can add functionality to include pretrained models needed for Stanza
+
+ # """
+ # self.model = model
+
+ def load_context(self, context):
+ """
+ Method needed to override default load_context. Needs to handle different components of sklearn model
+
+ """
+ import dill as pickle
+ # dill is needed due to generators and classes in the model itself
+
+ with open(context.artifacts["sklearn_model"], "rb") as f:
+ self.model = pickle.load(f)
+
+ with open(context.artifacts["sklearn_transformer"], "rb") as f:
+ self.sklearn_transformer = pickle.load(f)
+
+ def predict(self, context, model_input: pd.Series, params: dict = None):
+ """
+ This method is needed to override the default predict.
+ It needs to function essentially as a wrapper and return the
+ transformed recipes
+
+ Args:
+ context: Any
+ Not used
+
+ model_input: pd.Series
+ The ingredient strings of the query recipe(s) as a pandas Series
+ Need to decide if this is taking in raw text or preprocessed text;
+ leaning towards taking in raw text, doing the preprocessing, and
+ logging the preprocessed text as an artifact
+
+ params: dict, optional
+ Parameters used for the model (optional)
+ Not used currently for sklearn
+
+ Returns:
+ transformed_recipe_df: DataFrame of the recipes after going through
+ the sklearn/Stanza text processing
+ """
+
+ print(model_input)
+ print(model_input.shape)
+ print(model_input.sample(3, random_state=200))
+
+ response = self.sklearn_transformer.transform(model_input.values)
+
+ transformed_recipe = pd.DataFrame(
+ response.toarray(),
+ columns=self.sklearn_transformer.get_feature_names_out(),
+ index=model_input.index,
+ )
+
+ return transformed_recipe
diff --git a/src/backend/raw_data_cleaning/__init__.py b/src/backend/raw_data_cleaning/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/src/backend/raw_data_cleaning/raw_data_preprocessor.py b/src/backend/raw_data_cleaning/raw_data_preprocessor.py
new file mode 100644
index 0000000..ed6da92
--- /dev/null
+++ b/src/backend/raw_data_cleaning/raw_data_preprocessor.py
@@ -0,0 +1,191 @@
+""" This script is intended to take in a pandas dataframe from read_json for the scraped Epicurious data and preprocess the dataframe to prepare it for natural language processing down the line.
+
+TODO:
+1. Does this benefit from refactoring into a Class?
+
+"""
+
+import pandas as pd
+import stanza
+from typing import Dict, Text
+
+# instantiate stanza pipeline
+stanza.download("en")
+nlp = stanza.Pipeline(
+ "en",
+ depparse_batch_size=50,
+ depparse_min_length_to_batch_separately=50,
+ verbose=True,
+ use_gpu=True, # set to true when on cloud/not on streaming computer
+ batch_size=100,
+)
+
+
+def preprocess_dataframe(df: pd.DataFrame) -> pd.DataFrame:
+ """This function takes in a pandas DataFrame from pd.read_json and performs some preprocessing by unpacking the nested dictionaries and creating new columns with the simplified structures. It will then drop the original columns that would no longer be needed.
+
+ Args:
+ pd.DataFrame
+
+ Returns:
+ pd.DataFrame
+ """
+
+ def stanza_filterer(recipe_ingredients: str, stanza_pipeline: stanza.Pipeline) -> str:
+ """This function converts a recipe's joined ingredient string (the ' brk '-separated ingredients) into a string of ingredient lemmas.
+ It is intended to be used via an apply(lambda) until a better way is devised
+
+ Args:
+ recipe_ingredients: str
+
+ Returns:
+ lemmafied: str
+ """
+ lemmafied = " ".join(
+ str(word.lemma)
+ for sent in stanza_pipeline(recipe_ingredients).sentences
+ for word in sent.words
+ if (
+ word.upos not in ["NUM", "DET", "ADV", "CCONJ", "ADP", "SCONJ", "PUNCT"]
+ and word is not None
+ )
+ )
+ return lemmafied
+
+ def ingredient_lemmafier(df: pd.DataFrame, stanza_pipeline: stanza.Pipeline) -> pd.DataFrame:
+ """This function performs some text preprocessing:
+ 1. Converts the raw list of ingredients into a big string with ' brk ' token
+ 2. Normalize Unicode characters (NFKC)
+ 3. Lowercase all characters
+ 4. Fill in nulls with filler
+ 5. Apply the lemmafier function above and store the results in a new column
+ """
+ df["ingredients_lemmafied"] = (
+ df["ingredients"]
+ .str.join(" brk ")
+ .str.normalize("NFKC")
+ .str.lower()
+ .fillna("Missing ingredients")
+ ).apply(lambda x: stanza_filterer(x, stanza_pipeline))
+
+ return df
+
+ def link_maker(recipe_link: str) -> str:
+ """This function takes in the incomplete recipe link from the dataframe and returns the complete one."""
+ full_link = f"https://www.epicurious.com{recipe_link}"
+ return full_link
+
+ def cuisine_renamer(text: str) -> str:
+ """This function converts redundant and/or rare categories into more common
+ ones/umbrella ones.
+
+ In the future, there's a hope that this renaming mechanism will not have
+ under sampled cuisine tags.
+ """
+ if text == "Central American/Caribbean":
+ return "Caribbean"
+ elif text == "Jewish":
+ return "Kosher"
+ elif text == "Eastern European/Russian":
+ return "Eastern European"
+ elif text in ["Spanish/Portuguese", "Greek"]:
+ return "Mediterranean"
+ elif text == "Central/South American":
+ return "Latin American"
+ elif text == "Sushi":
+ return "Japanese"
+ elif text == "Southern Italian":
+ return "Italian"
+ elif text in ["Southern", "Tex-Mex"]:
+ return "American"
+ elif text in ["Southeast Asian", "Korean"]:
+ return "Asian"
+ else:
+ return text
+
+ def null_filler(to_check: Dict[Text, Text], key_target: Text) -> Text:
+ """This function takes in a dictionary that is currently fed in with a lambda function and then performs column specific preprocessing.
+
+ Args:
+ to_check: dict
+ key_target: str
+
+ Returns:
+ str
+ """
+
+ # Only look in the following keys, if the input isn't one of these, it should be recognized as an improper key
+ valid_keys = ["name", "filename", "credit"]
+
+ # This dictionary converts the input keys into substrings that can be used in f-strings to fill in missing values in the record
+ translation_keys = {
+ "name": "Cuisine",
+ "filename": "Photo",
+ "credit": "Photo Credit",
+ }
+
+ if key_target not in valid_keys:
+ # this logic makes sure we are only looking at valid keys
+ # this is not a real try/except
+ return (
+ "Improper key target: can only pick from 'name', 'filename', 'credit'."
+ )
+
+ else:
+ if pd.isna(to_check):
+ # this logic checks to see if the dictionary exists at all. if so, return Missing
+ return f"Missing {translation_keys[key_target]}"
+ else:
+ if key_target == "name" and (to_check["category"] != "cuisine"):
+ # This logic checks for the cuisine, if the cuisine is not there (and instead has 'ingredient', 'type', 'item', 'equipment', 'meal'), mark as missing
+ return f"Missing {translation_keys[key_target]}"
+ else:
+ # Otherwise, there should be no issue with returning
+ return to_check[key_target]
+
+ # separating out the below to execute with a __main__ would be cleaner (a sketch is at the bottom of this file)
+ df = ingredient_lemmafier(df, nlp)
+
+ # Dive into the tag column and extract the cuisine label. Put into new column or fills with "missing data"
+ df["cuisine_name"] = df["tag"].apply(
+ lambda x: null_filler(to_check=x, key_target="name")
+ )
+
+ # This apply uses the cuisine_renamer function above to relabel the cuisines to more general ones
+ df["cuisine_name"] = df["cuisine_name"].apply(cuisine_renamer)
+
+ # this lambda function goes into the photo data column and extracts just the filename from the dictionary
+ df["photo_filename"] = df["photoData"].apply(
+ lambda x: null_filler(to_check=x, key_target="filename")
+ ) # type:ignore
+
+ # This lambda function goes into the photo data column and extracts just the photo credit from the dictionary
+ df["photo_credit"] = df["photoData"].apply(
+ lambda x: null_filler(to_check=x, key_target="credit")
+ ) # type:ignore
+
+ # for the above, maybe they can be refactored to one function where the arguments are a column name, dictionary key name, the substring return
+
+ # this lambda function goes into the author column and extracts the author name or fills with "missing data"
+ df["author_name"] = df["author"].apply(
+ lambda x: x[0]["name"] if x else "Missing Author Name"
+ ) # type:ignore
+
+ # convert the given pubDate column into a new column of datetime objects
+ df["date_published"] = pd.to_datetime(
+ df["pubDate"], infer_datetime_format=True
+ ) # type:ignore
+
+ # prepend the full Epicurious URL base to the given url column via link_maker
+ df["recipe_url"] = df["url"].apply(link_maker) # type:ignore
+
+ # drop some original columns to clean up the dataframe
+ df.drop(
+ labels=["tag", "photoData", "author", "type", "dateCrawled", "pubDate", "url"],
+ axis=1,
+ inplace=True,
+ )
+
+ df.set_index("id", inplace=True)
+
+ return df
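+
+
+if __name__ == "__main__":
+    # Hedged sketch of standalone usage, as suggested by the comment above;
+    # the paths mirror the notebook and are assumptions, not a fixed interface.
+    raw_df = pd.read_json("data/raw/201706-epicurious-recipes-en.json")
+    cleaned_df = preprocess_dataframe(raw_df)
+    cleaned_df.to_parquet(
+        "data/processed/cleaned_df.parquet.gzip", compression="gzip"
+    )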