Merge pull request #2 from guy1992l/demo-v1
Final V1
Showing 22 changed files with 1,049 additions and 19 deletions.
@@ -1,9 +1,7 @@
# TODO: Update with relevant requirements (current llm-demo)
FROM mlrun/ml-models-gpu:1.3.0
RUN pip install -U transformers[deepspeed]
RUN pip install -U datasets
RUN pip install -U accelerate
RUN pip install -U evaluate
RUN pip install -U protobuf==3.20.*
RUN pip install -U mpi4py
RUN conda install -c "nvidia/label/cuda-11.7.1" cuda-nvprof
FROM mlrun/mlrun-gpu
RUN apt-get update -y
RUN apt-get install ffmpeg -y
RUN pip install tqdm torch bitsandbytes transformers accelerate \
    openai-whisper streamlit spacy librosa presidio-anonymizer \
    presidio-analyzer nltk flair
RUN python -m spacy download en_core_web_lg
@@ -1,2 +1,66 @@
# demo-call-center
Demo the use of GenAI to transcribe and analyze audio calls
# <img src="https://uxwing.com/wp-content/themes/uxwing/download/business-professional-services/boy-services-support-icon.png" style="height: 40px"/> MLRun's Call Center Demo

<img src="./images/call-center-readme.png" alt="huggingface-mlrun" style="width: 600px"/>

In this demo we showcase how we used LLMs to turn call center conversation audio files of customers and agents into valuable data in a single workflow orchestrated by MLRun.

MLRun automates the entire workflow, auto-scales resources as needed, and automatically logs and parses values between the workflow's different steps.

By the end of this demo you will see the potential power of LLMs for feature extraction, and how easily it is done using MLRun!
We will use:
* [**OpenAI's Whisper**](https://openai.com/research/whisper) - To transcribe the audio calls into text.
* [**Flair**](https://flairnlp.github.io/) and [**Microsoft's Presidio**](https://microsoft.github.io/presidio/) - To recognize PII so it can be filtered out.
* [**HuggingFace**](https://huggingface.co/) - As the main machine learning framework to get the model and tokenizer for the feature extraction. The demo uses [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) as the LLM to answer questions.
* [**MLRun**](https://www.mlrun.org/) - As the orchestrator to operationalize the workflow (a schematic sketch of how these steps chain together is shown below).
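
The workflow itself lives in `./src/workflow.py`; purely as a hypothetical sketch (step parameters and output names below are illustrative and may differ from the real workflow), chaining these steps in an MLRun/Kubeflow pipeline looks roughly like:

```python
from kfp import dsl

import mlrun


@dsl.pipeline(name="call-center-workflow")
def pipeline(calls_audio_path: str):
    # 1. Transcribe the raw audio calls into text:
    transcription_run = mlrun.run_function(
        "transcribe",
        params={"input_path": calls_audio_path},
        outputs=["transcriptions", "dataset"],
    )

    # 2. Recognize and filter out PII from the transcriptions:
    pii_run = mlrun.run_function(
        "pii-recognizer",
        params={"input_path": transcription_run.outputs["transcriptions"]},
        outputs=["cleaned_texts"],
    )

    # 3. Ask the LLM a fixed set of questions about each call:
    qa_run = mlrun.run_function(
        "question-answering",
        params={"input_path": pii_run.outputs["cleaned_texts"]},
        outputs=["question_answering_df"],
    )

    # 4. Join and clean the collected features into one dataset:
    mlrun.run_function(
        "postprocess",
        handler="postprocess",
        inputs={
            "transcript_dataset": transcription_run.outputs["dataset"],
            "qa_dataset": qa_run.outputs["question_answering_df"],
        },
        outputs=["final_df"],
    )
```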

The demo contains a single [notebook](./notebook.ipynb) that covers the entire demo.

Most of the functions are imported from [MLRun's hub](https://docs.mlrun.org/en/stable/runtimes/load-from-hub.html) - a wide range of functions that can be used for a variety of use cases. You can find all the Python source code under [/src](./src) and links to the used functions from the hub in the notebook.
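
For example, a single hub function can be imported and run on its own; a minimal sketch, assuming a configured MLRun client (the handler and parameter names are illustrative, check the hub function's documentation for the real signature):

```python
import mlrun

# Import the transcription function from MLRun's function hub:
transcribe_fn = mlrun.import_function("hub://transcribe")

# Run it locally over a folder of audio calls:
run = transcribe_fn.run(
    handler="transcribe",
    params={"input_path": "./data/calls", "output_directory": "./data/transcriptions"},
    local=True,
)
print(run.outputs)
```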

Enjoy!

___
<a id="installation"></a>
## Installation

This project can run in different development environments:
* Local computer (using PyCharm, VSCode, Jupyter, etc.)
* Inside GitHub Codespaces
* Other managed Jupyter environments

### Install the code and mlrun client

To get started, fork this repo into your GitHub account and clone it into your development environment.

To install the package dependencies (not required in GitHub Codespaces) use:

    make install-requirements

If you prefer to use Conda, use this instead (to create and configure a conda env):

    make conda-env

> Make sure you open the notebooks and select the `mlrun` conda environment

### Install or connect to MLRun service/cluster

The MLRun service and computation can run locally (minimal setup) or over a remote Kubernetes environment.

If your development environment supports Docker and has enough CPU resources, run:

    make mlrun-docker

> The MLRun UI can be viewed at: http://localhost:8060

If your environment is minimal, run mlrun as a process (no UI):

    [conda activate mlrun &&] make mlrun-api

For MLRun to run properly you should set your client environment. This is not required when using **Codespaces**, the mlrun **conda** environment, or **Iguazio** managed notebooks.

Your environment should include `MLRUN_ENV_FILE=<absolute path to the ./mlrun.env file>` (pointing to the mlrun.env file in this repo); see the [mlrun client setup](https://docs.mlrun.org/en/latest/install/remote.html) instructions for details.
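
Alternatively, the client environment can be set from the env file at the top of a notebook or script; a minimal sketch, assuming the repo root is the current working directory:

```python
import os

import mlrun

# Point the MLRun client at the environment file shipped with this repo:
mlrun.set_env_from_file(os.path.abspath("mlrun.env"))

# The env file typically defines at least MLRUN_DBPATH (the MLRun API address)
# and, for remote clusters, access credentials.
```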

> Note: You can also use a remote MLRun service (over Kubernetes) instead of starting a local mlrun;
> edit the [mlrun.env](./mlrun.env) file and specify its address and credentials.
@@ -0,0 +1,35 @@
kind: project
metadata:
  name: call-center-demo-guyl
  created: '2023-08-27T14:56:53.122000'
spec:
  params:
    source: git://github.com/mlrun/demo-call-center.git#main
    default_image: giladsh28/llm:v3
    gpus: 4
  functions:
  - url: hub://transcribe
    name: transcribe
  - url: hub://pii_recognizer
    name: pii-recognizer
  - url: hub://question_answering
    name: question-answering
  - url: ./src/postprocess.py
    name: postprocess
    kind: job
  workflows:
  - path: ./src/workflow.py
    name: workflow
  artifacts: []
  conda: ''
  source: git://github.com/mlrun/demo-call-center.git#main
  load_source_on_run: true
  desired_state: online
  owner: guyl
  default_image: giladsh28/llm:v3
  build:
    commands: []
    requirements: []
    custom_packagers: []
status:
  state: online
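
The `workflows` entry registers `./src/workflow.py` under the name `workflow`, so a project defined by this spec can be loaded and its pipeline launched directly from the project object. A minimal sketch (the project name, context path, and workflow argument are illustrative):

```python
import mlrun

# Load the project from its git source (mirrors the `source` field above;
# with user_project=True MLRun appends the username, e.g. "-guyl"):
project = mlrun.load_project(
    context="./project",
    url="git://github.com/mlrun/demo-call-center.git#main",
    name="call-center-demo",
    user_project=True,
    clone=True,
)

# Launch the workflow registered under the `workflows` entry:
run_id = project.run(
    name="workflow",
    arguments={"calls_audio_path": "./data/calls"},  # illustrative argument
    watch=True,  # block and print pipeline progress until completion
    dirty=True,  # allow running with uncommitted local changes
)
```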
@@ -0,0 +1,60 @@
import mlrun


def setup(project: mlrun.projects.MlrunProject) -> mlrun.projects.MlrunProject:
    """
    Creating the project for this demo.

    :param project: The project to set up.

    :returns: A fully prepared project for this demo.
    """
    # Set the project git source:
    source = project.get_param("source")
    if source:
        print(f"Project Source: {source}")
        project.set_source(project.get_param("source"), pull_at_runtime=True)

    # Set or build the default image:
    if project.get_param("default_image") is None:
        print("Building image for the demo:")
        assert project.build_image(
            base_image='mlrun/mlrun-gpu',
            commands=[
                "apt-get update -y",
                "apt-get install ffmpeg -y",
                "pip install tqdm torch",
                "pip install bitsandbytes transformers accelerate",
                "pip install openai-whisper",
                "pip install streamlit spacy librosa presidio-anonymizer presidio-analyzer nltk flair",
                "python -m spacy download en_core_web_lg",
            ],
            set_as_default=True,
        )
    else:
        project.set_default_image(project.get_param("default_image"))

    # Set the transcription function:
    transcribe_func = project.set_function("hub://transcribe", name="transcribe")
    transcribe_func.apply(mlrun.auto_mount())
    transcribe_func.save()

    # Set the PII recognition function:
    pii_recognizer_func = project.set_function("hub://pii_recognizer", name="pii-recognizer")

    # Set the question answering function:
    question_answering_func = project.set_function("hub://question_answering", name="question-answering")
    if project.get_param("gpus", 0) > 0:
        print("Using GPUs for question answering.")
        question_answering_func.with_limits(gpus=project.get_param("gpus"))
    question_answering_func.save()

    # Set the postprocessing function:
    postprocess_function = project.set_function("./src/postprocess.py", kind="job", name="postprocess")

    # Set the workflow:
    project.set_workflow("workflow", "./src/workflow.py")

    # Save and return the project:
    project.save()
    return project
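
If this module is the project's `project_setup.py` (MLRun looks for a `setup(project)` hook in that file and calls it automatically when the project is created or loaded), a notebook would typically only need something like the following sketch; the project name and parameter values are illustrative:

```python
import mlrun

# Creating/loading the project invokes setup() above behind the scenes; the
# parameters passed here become available inside it via project.get_param():
project = mlrun.get_or_create_project(
    name="call-center-demo",  # illustrative name
    context="./",
    user_project=True,
    parameters={
        "source": "git://github.com/mlrun/demo-call-center.git#main",
        "default_image": None,  # None triggers the in-code image build above
        "gpus": 0,              # set > 0 to give the question-answering step GPUs
    },
)
```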
@@ -1,10 +1,13 @@
mlrun
tqdm
torch
plotly
gradio
bitsandbytes
transformers
datasets
accelerate
evaluate
einops
xformers
openai-whisper
streamlit
spacy
librosa
presidio-anonymizer
presidio-analyzer
nltk
flair
@@ -0,0 +1,78 @@
import pandas as pd


def _clean_issue(s: str) -> str:
    """
    Clean the issue column from an enumeration prefix and remove {'(', ')', ':', '"'}.

    :param s: The string to clean.

    :returns: The cleaned string.
    """
    if len(s) > 2 and s[1] == ".":
        s = s[2:]
    s = s.translate({ord(c): None for c in '():"'})
    return s


def _extract_is_fixed(s: str) -> str:
    """
    Extract a single word answer from the LLM response (Yes / No).

    :param s: The content to extract the single word answer from.

    :returns: The extracted answer.
    """
    s = s.casefold()
    if "not explicitly" in s:
        return "Unknown"
    if any(sub in s for sub in ["yes", "was fixed"]):
        return "Yes"
    if any(sub in s for sub in ["no", "was not fixed"]):
        return "No"
    return "Unknown"


def _extract_tone(s: str) -> str:
    """
    Extract a single word answer from the LLM response (Positive / Neutral / Negative).

    :param s: The content to extract the single word answer from.

    :returns: The extracted answer.
    """
    s = s.casefold()
    if "positive" in s:
        return "Positive"
    if "negative" in s:
        return "Negative"
    return "Neutral"


def postprocess(
    transcript_dataset: pd.DataFrame,
    qa_dataset: pd.DataFrame,
) -> pd.DataFrame:
    """
    Some custom post processing to apply for getting the complete features dataset.

    :param transcript_dataset: The transcript features collected.
    :param qa_dataset:         The questions and answers features collected.

    :returns: The processed and joined dataframe.
    """
    # Left join:
    qa_dataset.rename(columns={"text_file": "transcription_file"}, inplace=True)
    df = pd.merge(transcript_dataset, qa_dataset, how="left", on="transcription_file")
    df.dropna(inplace=True)

    # Clean content and extract short answers:
    for column, apply_function in [
        ("Issue", _clean_issue),
        ("is_fixed", _extract_is_fixed),
        ("customer_tone", _extract_tone),
        ("agent_tone", _extract_tone),
    ]:
        df[column] = df[column].apply(lambda s: apply_function(s))

    return df
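
As a quick sanity check outside the pipeline, `postprocess` can be exercised on tiny in-memory frames; the rows below are fabricated and only illustrate the expected column names:

```python
import pandas as pd

# Fabricated rows mimicking the outputs of the transcription and
# question-answering steps (illustrative values, not real demo data):
transcripts = pd.DataFrame(
    {"transcription_file": ["call_1.txt"], "audio_file": ["call_1.wav"]}
)
answers = pd.DataFrame(
    {
        "text_file": ["call_1.txt"],
        "Issue": ["1. The customer was overcharged on the last bill"],
        "is_fixed": ["Yes, the issue was fixed during the call."],
        "customer_tone": ["The customer sounded positive overall."],
        "agent_tone": ["The agent kept a calm, neutral tone."],
    }
)

features = postprocess(transcript_dataset=transcripts, qa_dataset=answers)
print(features[["Issue", "is_fixed", "customer_tone", "agent_tone"]])
# Expected: the enumeration prefix stripped from "Issue", is_fixed -> "Yes",
# customer_tone -> "Positive", agent_tone -> "Neutral"
```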