Merge pull request #2 from guy1992l/demo-v1
Final V1
Showing 22 changed files with 1,049 additions and 19 deletions.
@@ -1,9 +1,7 @@
# TODO: Update with relevant requirements (current llm-demo)
FROM mlrun/ml-models-gpu:1.3.0
RUN pip install -U transformers[deepspeed]
RUN pip install -U datasets
RUN pip install -U accelerate
RUN pip install -U evaluate
RUN pip install -U protobuf==3.20.*
RUN pip install -U mpi4py
RUN conda install -c "nvidia/label/cuda-11.7.1" cuda-nvprof
FROM mlrun/mlrun-gpu
RUN apt-get update -y
RUN apt-get install ffmpeg -y
RUN pip install tqdm torch bitsandbytes transformers accelerate \
    openai-whisper streamlit spacy librosa presidio-anonymizer \
    presidio-analyzer nltk flair
RUN python -m spacy download en_core_web_lg
@@ -1,2 +1,66 @@
# demo-call-center
Demo the use of GenAI to transcribe and analyze audio calls
# <img src="https://uxwing.com/wp-content/themes/uxwing/download/business-professional-services/boy-services-support-icon.png" style="height: 40px"/> MLRun's Call Center Demo

<img src="./images/call-center-readme.png" alt="huggingface-mlrun" style="width: 600px"/>

In this demo we showcase how we used LLMs to turn call center conversation audio files of customers and agents into valuable data in a single workflow orchestrated by MLRun.

MLRun automates the entire workflow, auto-scales resources as needed, and automatically logs and parses values between the workflow's different steps.

By the end of this demo you will see the potential power of LLMs for feature extraction, and how easily it is done using MLRun!
We will use:
* [**OpenAI's Whisper**](https://openai.com/research/whisper) - To transcribe the audio calls into text.
* [**Flair**](https://flairnlp.github.io/) and [**Microsoft's Presidio**](https://microsoft.github.io/presidio/) - To recognize PII so it can be filtered out.
* [**HuggingFace**](https://huggingface.co/) - As the main machine learning framework to get the model and tokenizer for the feature extraction. The demo uses [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) as the LLM to answer questions.
* [**MLRun**](https://www.mlrun.org/) - As the orchestrator to operationalize the workflow (a schematic sketch of how these steps chain together is shown below).
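
The workflow itself lives in `./src/workflow.py`; purely as a hypothetical sketch (step parameters and output names below are illustrative and may differ from the real workflow), chaining these steps in an MLRun/Kubeflow pipeline looks roughly like:

```python
from kfp import dsl

import mlrun


@dsl.pipeline(name="call-center-workflow")
def pipeline(calls_audio_path: str):
    # 1. Transcribe the raw audio calls into text:
    transcription_run = mlrun.run_function(
        "transcribe",
        params={"input_path": calls_audio_path},
        outputs=["transcriptions", "dataset"],
    )

    # 2. Recognize and filter out PII from the transcriptions:
    pii_run = mlrun.run_function(
        "pii-recognizer",
        params={"input_path": transcription_run.outputs["transcriptions"]},
        outputs=["cleaned_texts"],
    )

    # 3. Ask the LLM a fixed set of questions about each call:
    qa_run = mlrun.run_function(
        "question-answering",
        params={"input_path": pii_run.outputs["cleaned_texts"]},
        outputs=["question_answering_df"],
    )

    # 4. Join and clean the collected features into one dataset:
    mlrun.run_function(
        "postprocess",
        handler="postprocess",
        inputs={
            "transcript_dataset": transcription_run.outputs["dataset"],
            "qa_dataset": qa_run.outputs["question_answering_df"],
        },
        outputs=["final_df"],
    )
```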

The demo contains a single [notebook](./notebook.ipynb) that covers the entire demo.

Most of the functions are imported from [MLRun's hub](https://docs.mlrun.org/en/stable/runtimes/load-from-hub.html) - a wide range of functions that can be used for a variety of use cases. You can find all the Python source code under [/src](./src) and links to the used functions from the hub in the notebook.
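
For example, a single hub function can be imported and run on its own; a minimal sketch, assuming a configured MLRun client (the handler and parameter names are illustrative, check the hub function's documentation for the real signature):

```python
import mlrun

# Import the transcription function from MLRun's function hub:
transcribe_fn = mlrun.import_function("hub://transcribe")

# Run it locally over a folder of audio calls:
run = transcribe_fn.run(
    handler="transcribe",
    params={"input_path": "./data/calls", "output_directory": "./data/transcriptions"},
    local=True,
)
print(run.outputs)
```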

Enjoy!

___
<a id="installation"></a>
## Installation

This project can run in different development environments:
* Local computer (using PyCharm, VSCode, Jupyter, etc.)
* Inside GitHub Codespaces
* Other managed Jupyter environments

### Install the code and mlrun client

To get started, fork this repo into your GitHub account and clone it into your development environment.

To install the package dependencies (not required in GitHub Codespaces) use:

    make install-requirements

If you prefer to use Conda, use this instead (to create and configure a conda env):

    make conda-env

> Make sure you open the notebooks and select the `mlrun` conda environment

### Install or connect to MLRun service/cluster

The MLRun service and computation can run locally (minimal setup) or over a remote Kubernetes environment.

If your development environment supports Docker and has enough CPU resources, run:

    make mlrun-docker

> The MLRun UI can be viewed at: http://localhost:8060

If your environment is minimal, run mlrun as a process (no UI):

    [conda activate mlrun &&] make mlrun-api

For MLRun to run properly you should set your client environment. This is not required when using **Codespaces**, the mlrun **conda** environment, or **Iguazio** managed notebooks.

Your environment should include `MLRUN_ENV_FILE=<absolute path to the ./mlrun.env file>` (pointing to the mlrun.env file in this repo); see the [mlrun client setup](https://docs.mlrun.org/en/latest/install/remote.html) instructions for details.
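
Alternatively, the client environment can be set from the env file at the top of a notebook or script; a minimal sketch, assuming the repo root is the current working directory:

```python
import os

import mlrun

# Point the MLRun client at the environment file shipped with this repo:
mlrun.set_env_from_file(os.path.abspath("mlrun.env"))

# The env file typically defines at least MLRUN_DBPATH (the MLRun API address)
# and, for remote clusters, access credentials.
```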

> Note: You can also use a remote MLRun service (over Kubernetes) instead of starting a local mlrun;
> edit the [mlrun.env](./mlrun.env) file and specify its address and credentials.
@@ -0,0 +1,35 @@
kind: project
metadata:
  name: call-center-demo-guyl
  created: '2023-08-27T14:56:53.122000'
spec:
  params:
    source: git://github.com/mlrun/demo-call-center.git#main
    default_image: giladsh28/llm:v3
    gpus: 4
  functions:
  - url: hub://transcribe
    name: transcribe
  - url: hub://pii_recognizer
    name: pii-recognizer
  - url: hub://question_answering
    name: question-answering
  - url: ./src/postprocess.py
    name: postprocess
    kind: job
  workflows:
  - path: ./src/workflow.py
    name: workflow
  artifacts: []
  conda: ''
  source: git://github.com/mlrun/demo-call-center.git#main
  load_source_on_run: true
  desired_state: online
  owner: guyl
  default_image: giladsh28/llm:v3
  build:
    commands: []
    requirements: []
    custom_packagers: []
status:
  state: online
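
The `workflows` entry registers `./src/workflow.py` under the name `workflow`, so a project defined by this spec can be loaded and its pipeline launched directly from the project object. A minimal sketch (the project name, context path, and workflow argument are illustrative):

```python
import mlrun

# Load the project from its git source (mirrors the `source` field above;
# with user_project=True MLRun appends the username, e.g. "-guyl"):
project = mlrun.load_project(
    context="./project",
    url="git://github.com/mlrun/demo-call-center.git#main",
    name="call-center-demo",
    user_project=True,
    clone=True,
)

# Launch the workflow registered under the `workflows` entry:
run_id = project.run(
    name="workflow",
    arguments={"calls_audio_path": "./data/calls"},  # illustrative argument
    watch=True,  # block and print pipeline progress until completion
    dirty=True,  # allow running with uncommitted local changes
)
```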
@@ -0,0 +1,60 @@
import mlrun


def setup(project: mlrun.projects.MlrunProject) -> mlrun.projects.MlrunProject:
    """
    Creating the project for this demo.

    :param project: The project to set up.

    :returns: A fully prepared project for this demo.
    """
    # Set the project git source:
    source = project.get_param("source")
    if source:
        print(f"Project Source: {source}")
        project.set_source(project.get_param("source"), pull_at_runtime=True)

    # Set or build the default image:
    if project.get_param("default_image") is None:
        print("Building image for the demo:")
        assert project.build_image(
            base_image='mlrun/mlrun-gpu',
            commands=[
                "apt-get update -y",
                "apt-get install ffmpeg -y",
                "pip install tqdm torch",
                "pip install bitsandbytes transformers accelerate",
                "pip install openai-whisper",
                "pip install streamlit spacy librosa presidio-anonymizer presidio-analyzer nltk flair",
                "python -m spacy download en_core_web_lg",
            ],
            set_as_default=True,
        )
    else:
        project.set_default_image(project.get_param("default_image"))

    # Set the transcription function:
    transcribe_func = project.set_function("hub://transcribe", name="transcribe")
    transcribe_func.apply(mlrun.auto_mount())
    transcribe_func.save()

    # Set the PII recognition function:
    pii_recognizer_func = project.set_function("hub://pii_recognizer", name="pii-recognizer")

    # Set the question answering function:
    question_answering_func = project.set_function("hub://question_answering", name="question-answering")
    if project.get_param("gpus", 0) > 0:
        print("Using GPUs for question answering.")
        question_answering_func.with_limits(gpus=project.get_param("gpus"))
    question_answering_func.save()

    # Set the postprocessing function:
    postprocess_function = project.set_function("./src/postprocess.py", kind="job", name="postprocess")

    # Set the workflow:
    project.set_workflow("workflow", "./src/workflow.py")

    # Save and return the project:
    project.save()
    return project
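
If this module is the project's `project_setup.py` (MLRun looks for a `setup(project)` hook in that file and calls it automatically when the project is created or loaded), a notebook would typically only need something like the following sketch; the project name and parameter values are illustrative:

```python
import mlrun

# Creating/loading the project invokes setup() above behind the scenes; the
# parameters passed here become available inside it via project.get_param():
project = mlrun.get_or_create_project(
    name="call-center-demo",  # illustrative name
    context="./",
    user_project=True,
    parameters={
        "source": "git://github.com/mlrun/demo-call-center.git#main",
        "default_image": None,  # None triggers the in-code image build above
        "gpus": 0,              # set > 0 to give the question-answering step GPUs
    },
)
```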
@@ -1,10 +1,13 @@
mlrun
tqdm
torch
plotly
gradio
bitsandbytes
transformers
datasets
accelerate
evaluate
einops
xformers
openai-whisper
streamlit
spacy
librosa
presidio-anonymizer
presidio-analyzer
nltk
flair
@@ -0,0 +1,78 @@
import pandas as pd


def _clean_issue(s: str) -> str:
    """
    Clean the issue column from an enumeration prefix and remove {'(', ')', ':', '"'}.

    :param s: The string to clean.

    :returns: The cleaned string.
    """
    if len(s) > 2 and s[1] == ".":
        s = s[2:]
    s = s.translate({ord(c): None for c in '():"'})
    return s


def _extract_is_fixed(s: str) -> str:
    """
    Extract a single word answer from the LLM response (Yes / No).

    :param s: The content to extract the single word answer from.

    :returns: The extracted answer.
    """
    s = s.casefold()
    if "not explicitly" in s:
        return "Unknown"
    if any(sub in s for sub in ["yes", "was fixed"]):
        return "Yes"
    if any(sub in s for sub in ["no", "was not fixed"]):
        return "No"
    return "Unknown"


def _extract_tone(s: str) -> str:
    """
    Extract a single word answer from the LLM response (Positive / Neutral / Negative).

    :param s: The content to extract the single word answer from.

    :returns: The extracted answer.
    """
    s = s.casefold()
    if "positive" in s:
        return "Positive"
    if "negative" in s:
        return "Negative"
    return "Neutral"


def postprocess(
    transcript_dataset: pd.DataFrame,
    qa_dataset: pd.DataFrame,
) -> pd.DataFrame:
    """
    Some custom post processing to apply for getting the complete features dataset.

    :param transcript_dataset: The transcript features collected.
    :param qa_dataset:         The questions and answers features collected.

    :returns: The processed and joined dataframe.
    """
    # Left join:
    qa_dataset.rename(columns={"text_file": "transcription_file"}, inplace=True)
    df = pd.merge(transcript_dataset, qa_dataset, how="left", on="transcription_file")
    df.dropna(inplace=True)

    # Clean content and extract short answers:
    for column, apply_function in [
        ("Issue", _clean_issue),
        ("is_fixed", _extract_is_fixed),
        ("customer_tone", _extract_tone),
        ("agent_tone", _extract_tone),
    ]:
        df[column] = df[column].apply(lambda s: apply_function(s))

    return df
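
As a quick sanity check outside the pipeline, `postprocess` can be exercised on tiny in-memory frames; the rows below are fabricated and only illustrate the expected column names:

```python
import pandas as pd

# Fabricated rows mimicking the outputs of the transcription and
# question-answering steps (illustrative values, not real demo data):
transcripts = pd.DataFrame(
    {"transcription_file": ["call_1.txt"], "audio_file": ["call_1.wav"]}
)
answers = pd.DataFrame(
    {
        "text_file": ["call_1.txt"],
        "Issue": ["1. The customer was overcharged on the last bill"],
        "is_fixed": ["Yes, the issue was fixed during the call."],
        "customer_tone": ["The customer sounded positive overall."],
        "agent_tone": ["The agent kept a calm, neutral tone."],
    }
)

features = postprocess(transcript_dataset=transcripts, qa_dataset=answers)
print(features[["Issue", "is_fixed", "customer_tone", "agent_tone"]])
# Expected: the enumeration prefix stripped from "Issue", is_fixed -> "Yes",
# customer_tone -> "Positive", agent_tone -> "Neutral"
```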