Merge pull request #2 from guy1992l/demo-v1
Final V1
guy1992l authored Aug 27, 2023
2 parents 16d86ed + 68aaa29 commit 6e35395
Showing 22 changed files with 1,049 additions and 19 deletions.
16 changes: 7 additions & 9 deletions Dockerfile
@@ -1,9 +1,7 @@
-# TODO: Update with relevant requirements (current llm-demo)
-FROM mlrun/ml-models-gpu:1.3.0
-RUN pip install -U transformers[deepspeed]
-RUN pip install -U datasets
-RUN pip install -U accelerate
-RUN pip install -U evaluate
-RUN pip install -U protobuf==3.20.*
-RUN pip install -U mpi4py
-RUN conda install -c "nvidia/label/cuda-11.7.1" cuda-nvprof
+FROM mlrun/mlrun-gpu
+RUN apt-get update -y
+RUN apt-get install ffmpeg -y
+RUN pip install tqdm torch bitsandbytes transformers accelerate \
+    openai-whisper streamlit spacy librosa presidio-anonymizer \
+    presidio-analyzer nltk flair
+RUN python -m spacy download en_core_web_lg
68 changes: 66 additions & 2 deletions README.md
@@ -1,2 +1,66 @@
# demo-call-center
Demo the use of GenAI to transcribe and analyze audio calls
# <img src="https://uxwing.com/wp-content/themes/uxwing/download/business-professional-services/boy-services-support-icon.png" style="height: 40px"/> MLRun's Call Center Demo

<img src="./images/call-center-readme.png" alt="huggingface-mlrun" style="width: 600px"/>

In this demo we showcase how we used LLMs to turn call center conversation audio files of customers and agents into valuable data, all in a single workflow orchestrated by MLRun.

MLRun automates the entire workflow, auto-scales resources as needed, and automatically logs and parses values between the different workflow steps.

By the end of this demo you will see the potential power of LLMs for feature extraction, and how easily it can be done using MLRun!

We will use:
* [**OpenAI's Whisper**](https://openai.com/research/whisper) - To transcribe the audio calls into text.
* [**Flair**](https://flairnlp.github.io/) and [**Microsoft's Presidio**](https://microsoft.github.io/presidio/) - To recognize PII (personally identifiable information) so it can be filtered out.
* [**HuggingFace**](https://huggingface.co/) - As the main machine learning framework, providing the model and tokenizer for feature extraction. The demo uses [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) as the LLM to answer questions.
* and [**MLRun**](https://www.mlrun.org/) - As the orchestrator to operationalize the workflow.

The entire demo is covered in a single [notebook](./notebook.ipynb).

Most of the functions are imported from [MLRun's hub](https://docs.mlrun.org/en/stable/runtimes/load-from-hub.html), which offers a wide range of functions for a variety of use cases. You can find all the Python source code under [/src](./src), and the notebook links to the hub functions it uses.
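For example, registering a hub function in a project takes a single call (a minimal sketch; the project name here is illustrative):

```python
import mlrun

# Create or load a project to hold the demo's functions (illustrative name)
project = mlrun.get_or_create_project("call-center-demo", context="./")

# Register the transcription function from MLRun's hub in the project
transcribe_fn = project.set_function("hub://transcribe", name="transcribe")
```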

Enjoy!

___
<a id="installation"></a>
## Installation

This project can run in different development environments:
* Local computer (using PyCharm, VSCode, Jupyter, etc.)
* Inside GitHub Codespaces
* Other managed Jupyter environments

### Install the code and mlrun client

To get started, fork this repo into your GitHub account and clone it into your development environment.

To install the package dependencies (not required in GitHub Codespaces) use:

```
make install-requirements
```

If you prefer to use Conda, use this instead (to create and configure a conda env):

```
make conda-env
```

> Make sure you open the notebooks and select the `mlrun` conda environment

### Install or connect to MLRun service/cluster

The MLRun service and computation can run locally (minimal setup) or over a remote Kubernetes environment.

If your development environment supports Docker and has enough CPU resources, run:

```
make mlrun-docker
```

> The MLRun UI can be viewed at: http://localhost:8060

If your environment is minimal, run mlrun as a process (no UI):

```
[conda activate mlrun &&] make mlrun-api
```

For MLRun to run properly you should set your client environment. This is not required when using **codespaces**, the mlrun **conda** environment, or **Iguazio** managed notebooks.

Your environment should include `MLRUN_ENV_FILE=<absolute path to the ./mlrun.env file>` (pointing to the mlrun `.env` file in this repo); see the [mlrun client setup](https://docs.mlrun.org/en/latest/install/remote.html) instructions for details.

> Note: You can also use a remote MLRun service (over Kubernetes) instead of starting a local mlrun service;
> edit the [mlrun.env](./mlrun.env) file and specify its address and credentials.
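Alternatively, the env file can be loaded from Python code (a minimal sketch, assuming the repo root is the working directory):

```python
import mlrun

# Point the MLRun client at the service configured in this repo's env file
mlrun.set_env_from_file("./mlrun.env")
```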
Binary file added data/ttsmaker-file-2023-7-10-14-39-40.mp3
Binary file added data/ttsmaker-file-2023-7-10-14-42-15.mp3
Binary file added data/ttsmaker-file-2023-7-10-14-44-41.mp3
Binary file added data/ttsmaker-file-2023-7-10-14-47-23.mp3
Binary file added data/ttsmaker-file-2023-7-10-14-49-29.mp3
Binary file added data/ttsmaker-file-2023-7-10-19-27-6.mp3
Binary file added data/ttsmaker-file-2023-7-10-19-29-34.mp3
Binary file added data/ttsmaker-file-2023-7-10-19-32-43.mp3
Binary file added data/ttsmaker-file-2023-7-10-19-38-39.mp3
Binary file added data/ttsmaker-file-2023-7-10-19-41-5.mp3
Binary file added data/ttsmaker-file-2023-7-10-19-52-34.mp3
Binary file added images/call-center-readme.png
Binary file added images/call-center-workflow.png
681 changes: 681 additions & 0 deletions notebook.ipynb

Large diffs are not rendered by default.

35 changes: 35 additions & 0 deletions project.yaml
@@ -0,0 +1,35 @@
kind: project
metadata:
  name: call-center-demo-guyl
  created: '2023-08-27T14:56:53.122000'
spec:
  params:
    source: git://github.com/mlrun/demo-call-center.git#main
    default_image: giladsh28/llm:v3
    gpus: 4
  functions:
  - url: hub://transcribe
    name: transcribe
  - url: hub://pii_recognizer
    name: pii-recognizer
  - url: hub://question_answering
    name: question-answering
  - url: ./src/postprocess.py
    name: postprocess
    kind: job
  workflows:
  - path: ./src/workflow.py
    name: workflow
  artifacts: []
  conda: ''
  source: git://github.com/mlrun/demo-call-center.git#main
  load_source_on_run: true
  desired_state: online
  owner: guyl
  default_image: giladsh28/llm:v3
  build:
    commands: []
    requirements: []
    custom_packagers: []
status:
  state: online
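Once the project is loaded, the `workflow` pipeline registered above can be executed by name (a minimal sketch, assuming a `project` object obtained via `mlrun.get_or_create_project` as shown below):

```python
# Run the "workflow" pipeline registered in project.yaml and wait for completion
run_status = project.run("workflow", watch=True)
```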
60 changes: 60 additions & 0 deletions project_setup.py
@@ -0,0 +1,60 @@
import mlrun


def setup(project: mlrun.projects.MlrunProject) -> mlrun.projects.MlrunProject:
    """
    Creating the project for this demo.

    :param project: The project to set up.

    :returns: A fully prepared project for this demo.
    """
    # Set the project git source:
    source = project.get_param("source")
    if source:
        print(f"Project Source: {source}")
        project.set_source(source, pull_at_runtime=True)

    # Set or build the default image:
    if project.get_param("default_image") is None:
        print("Building image for the demo:")
        assert project.build_image(
            base_image="mlrun/mlrun-gpu",
            commands=[
                "apt-get update -y",
                "apt-get install ffmpeg -y",
                "pip install tqdm torch",
                "pip install bitsandbytes transformers accelerate",
                "pip install openai-whisper",
                "pip install streamlit spacy librosa presidio-anonymizer presidio-analyzer nltk flair",
                "python -m spacy download en_core_web_lg",
            ],
            set_as_default=True,
        )
    else:
        project.set_default_image(project.get_param("default_image"))

    # Set the transcribe function:
    transcribe_func = project.set_function("hub://transcribe", name="transcribe")
    transcribe_func.apply(mlrun.auto_mount())
    transcribe_func.save()

    # Set the PII recognition function:
    pii_recognizer_func = project.set_function("hub://pii_recognizer", name="pii-recognizer")

    # Set the question answering function:
    question_answering_func = project.set_function("hub://question_answering", name="question-answering")
    if project.get_param("gpus") > 0:
        print("Using GPUs for question answering.")
        question_answering_func.with_limits(gpus=project.get_param("gpus"))
    question_answering_func.save()

    # Set the postprocessing function:
    postprocess_function = project.set_function("./src/postprocess.py", kind="job", name="postprocess")

    # Set the workflow:
    project.set_workflow("workflow", "./src/workflow.py")

    # Save and return the project:
    project.save()
    return project
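In recent MLRun versions, loading a project that contains a `project_setup.py` file runs its `setup()` hook automatically, so the demo project can be created along these lines (a minimal sketch; the parameter values are illustrative):

```python
import mlrun

# Loading the project triggers setup() above; values here are illustrative
project = mlrun.get_or_create_project(
    name="call-center-demo",
    context="./",
    parameters={
        "source": "git://github.com/mlrun/demo-call-center.git#main",
        "default_image": None,  # None -> setup() builds the image itself
        "gpus": 0,              # illustrative: CPU-only question answering
    },
)
```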
17 changes: 10 additions & 7 deletions requirements.txt
@@ -1,10 +1,13 @@
mlrun
tqdm
torch
plotly
gradio
bitsandbytes
transformers
datasets
accelerate
evaluate
einops
xformers
openai-whisper
streamlit
spacy
librosa
presidio-anonymizer
presidio-analyzer
nltk
flair
2 changes: 1 addition & 1 deletion setup.py
@@ -15,5 +15,5 @@
    license="MIT",
    long_description=long_description,
    long_description_content_type="text/markdown",
-   python_requires=">=3.7",
+   python_requires=">=3.9",
)
78 changes: 78 additions & 0 deletions src/postprocess.py
@@ -0,0 +1,78 @@
import pandas as pd


def _clean_issue(s: str) -> str:
    """
    Clean the issue column from an enumeration prefix and remove {'(', ')', ':', '"'}.

    :param s: The string to clean.

    :returns: The cleaned string.
    """
    if len(s) > 2 and s[1] == ".":
        s = s[2:]
    s = s.translate({ord(c): None for c in '():"'})
    return s


def _extract_is_fixed(s: str) -> str:
    """
    Extract a single word answer from the LLM response (Yes / No).

    :param s: The content to extract the single word answer from.

    :returns: The extracted answer.
    """
    s = s.casefold()
    if "not explicitly" in s:
        return "Unknown"
    if any(sub in s for sub in ["yes", "was fixed"]):
        return "Yes"
    if any(sub in s for sub in ["no", "was not fixed"]):
        return "No"
    return "Unknown"


def _extract_tone(s: str) -> str:
    """
    Extract a single word answer from the LLM response (Positive / Neutral / Negative).

    :param s: The content to extract the single word answer from.

    :returns: The extracted answer.
    """
    s = s.casefold()
    if "positive" in s:
        return "Positive"
    if "negative" in s:
        return "Negative"
    return "Neutral"


def postprocess(
    transcript_dataset: pd.DataFrame,
    qa_dataset: pd.DataFrame,
) -> pd.DataFrame:
    """
    Custom post-processing to apply for getting the complete features dataset.

    :param transcript_dataset: The transcript features collected.
    :param qa_dataset:         The questions and answers features collected.

    :returns: The processed and joined dataframe.
    """
    # Left join:
    qa_dataset.rename(columns={"text_file": "transcription_file"}, inplace=True)
    df = pd.merge(transcript_dataset, qa_dataset, how="left", on="transcription_file")
    df.dropna(inplace=True)

    # Clean content and extract short answers:
    for column, apply_function in [
        ("Issue", _clean_issue),
        ("is_fixed", _extract_is_fixed),
        ("customer_tone", _extract_tone),
        ("agent_tone", _extract_tone),
    ]:
        df[column] = df[column].apply(apply_function)

    return df
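A quick way to sanity-check `postprocess` with toy data (a minimal sketch; any column names beyond the ones used above are illustrative):

```python
import pandas as pd

transcripts = pd.DataFrame(
    {
        "transcription_file": ["call_0.txt"],
        "audio_file": ["call_0.mp3"],  # illustrative extra feature column
    }
)
answers = pd.DataFrame(
    {
        "text_file": ["call_0.txt"],  # renamed to "transcription_file" inside postprocess
        "Issue": ['1.(Billing dispute):'],
        "is_fixed": ["Yes, the issue was fixed."],
        "customer_tone": ["The customer sounded positive."],
        "agent_tone": ["Neutral overall."],
    }
)

# Expected: Issue -> "Billing dispute", is_fixed -> "Yes",
# customer_tone -> "Positive", agent_tone -> "Neutral"
print(postprocess(transcripts, answers))
```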