This repository was archived by the owner on Oct 25, 2024. It is now read-only.

[Optimization] Text-generation support qwen #513

Merged: 43 commits from wangchang/qwen into main, Oct 23, 2023
Commits (43)
f04d0fd
[CPP Graph] Opt qbits dequant (#465)
zhewang1-intc Oct 19, 2023
4adacf1
use INC 2.3.1
VincyZhang Oct 19, 2023
d962f58
use INC 2.3.1 (#500)
VincyZhang Oct 19, 2023
66238a5
[RUNTIME] Enabing streaming llm for Runtime (#501)
zhenwei-intel Oct 19, 2023
ea112e7
Merge branch 'main' of https://github.com/intel/intel-extension-for-t…
VincyZhang Oct 19, 2023
51485c6
Reduce the UT evaluation time (#498)
changwangss Oct 19, 2023
ff4abb8
Merge branch 'main' of https://github.com/intel/intel-extension-for-t…
VincyZhang Oct 19, 2023
9bdc764
Minor fix (#507)
VincyZhang Oct 19, 2023
6bd2b60
support qwen
changwangss Oct 19, 2023
ea720c2
Fix ChatGLM2 model loading issue (#510)
lvliang-intel Oct 19, 2023
02523e9
Update README.md
hshen14 Oct 19, 2023
0cff05a
Remove OneDNN env setint for BF16 inference (#509)
lvliang-intel Oct 20, 2023
1bee379
remove invalid code
changwangss Oct 20, 2023
ea69f9a
support Avx2 (#493)
yuchengliu1 Oct 20, 2023
f7d0d97
add neuralchat ut for audio util (#466)
Liangyx2 Oct 20, 2023
b9155ef
reduce ut time consumption (#499)
xin3he Oct 20, 2023
5f4175a
update python api readme (#504)
zhenwei-intel Oct 20, 2023
a8873ea
Add docker setup session for neuralchat finetuning sample (#496)
louie-tsai Oct 20, 2023
22fe7ad
Update README.md
hshen14 Oct 20, 2023
53b1b61
Update run_generation.py
changwangss Oct 20, 2023
b38241d
Update README.md
hshen14 Oct 20, 2023
1d91245
Update README.md
hshen14 Oct 20, 2023
18d9c57
Update README.md
hshen14 Oct 20, 2023
f98d72a
Update README.md
hshen14 Oct 20, 2023
0f6aee6
Update README.md
hshen14 Oct 20, 2023
a8db98f
Update README.md for fast token issue (#515)
louie-tsai Oct 21, 2023
52717e4
Fix typo in README.md (#516)
eltociear Oct 21, 2023
3cf68ee
Update README.md
hshen14 Oct 21, 2023
7fb944a
Update README.md
hshen14 Oct 21, 2023
7fed478
Update README.md
hshen14 Oct 21, 2023
dc81e4c
Update README.md
hshen14 Oct 21, 2023
dcfbcfd
improve Avx2 (#511)
yuchengliu1 Oct 21, 2023
a615905
Merge branch 'main' of https://github.com/intel/intel-extension-for-t…
VincyZhang Oct 21, 2023
61993cc
Revert "update python api readme (#504)"
VincyZhang Oct 21, 2023
4144197
Merge branch 'main' into wangchang/qwen
VincyZhang Oct 21, 2023
5b01e95
Update README.md
hshen14 Oct 22, 2023
bfb6a25
Update README.md (#519)
ayushrakesh Oct 22, 2023
0e0a9eb
docs: fix typos in question answering of pytorch (#520)
shresthasurav Oct 22, 2023
ec29f2f
fixed typos (#522)
Smoothieewastaken Oct 23, 2023
1357a02
Updated README.md (#517)
alienishi Oct 23, 2023
b3e4b25
Merge branch 'main' into wangchang/qwen
changwangss Oct 23, 2023
572ecbf
Merge branch 'main' into wangchang/qwen
changwangss Oct 23, 2023
2e77b6b
Merge branch 'main' into wangchang/qwen
changwangss Oct 23, 2023
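
Together, the commits above wire Qwen into the weight-only text-generation path (run_generation.py). As a rough sketch only — the model id, quantization settings, and prompt below are illustrative assumptions, not values taken from this PR — loading a Qwen checkpoint through the extension's Transformers-style API looks roughly like this:

```python
# Hypothetical sketch: Qwen text generation with weight-only INT4 quantization.
# The model id and config values are assumptions for illustration; see the
# repository's run_generation.py for the actual flow introduced by this PR.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

model_name = "Qwen/Qwen-7B-Chat"  # assumed Hugging Face model id
quant_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")

# Qwen ships custom modeling code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(
    "Tell me about Intel Xeon Scalable Processors.", return_tensors="pt"
).input_ids

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    trust_remote_code=True,
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.batch_decode(outputs))
```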
19 changes: 10 additions & 9 deletions .github/workflows/script/unitTest/env_setup.sh
@@ -7,15 +7,16 @@ if [ ${inc} != 0 ]; then
fi

echo "Install neural_compressor binary..."
n=0
until [ "$n" -ge 5 ]; do
git clone https://github.com/intel/neural-compressor.git /neural-compressor
cd /neural-compressor
pip install -r requirements.txt
python setup.py install && break
n=$((n + 1))
sleep 5
done
pip install neural-compressor
#n=0
#until [ "$n" -ge 5 ]; do
# git clone https://github.com/intel/neural-compressor.git /neural-compressor
# cd /neural-compressor
# pip install -r requirements.txt
# python setup.py install && break
# n=$((n + 1))
# sleep 5
#done

# Install test requirements
cd /intel-extension-for-transformers/tests
4 changes: 3 additions & 1 deletion .github/workflows/unit-test-neuralchat.yml
@@ -73,6 +73,8 @@ jobs:
podman run -dit --disable-content-trust --privileged --name=${{ env.CONTAINER_NAME }} -v /dev/shm:/dev/shm \
-v ${{ github.workspace }}:/intel-extension-for-transformers \
-v ~/.cache/oneAPI:/cache \
-v /models:/models \
-v /media:/media \
${{ env.REPO_NAME }}:${{ env.REPO_TAG }}

- name: Env build
@@ -143,4 +145,4 @@ jobs:
with:
name: Neural Chat Unit Test
path: ${{ github.workspace }}/log_dir
retention-days: 5
retention-days: 5
47 changes: 24 additions & 23 deletions README.md
@@ -11,17 +11,11 @@ Intel® Extension for Transformers
</div>

## 🚀Latest News
* <b>NeuralChat has been showcased in [Intel Innovation’23 Keynote](https://www.youtube.com/watch?v=RbKRELWP9y8&t=2954s) and [Google Cloud Next'23](https://cloud.google.com/blog/topics/google-cloud-next/welcome-to-google-cloud-next-23) to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.</b>
* <b>NeuralChat supports custom chatbot development and deployment on broad Intel HWs such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out [Notebooks](./intel_extension_for_transformers/neural_chat/docs/full_notebooks.md) and see below sample code. </b>

```python
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```

* <b>LLM runtime extends Hugging Face Transformers API to provide seamless low precision inference for popular LLMs, supporting mainstream low precision data types such as INT8/FP8/INT4/FP4/NF4.</b>
* [2023/10] LLM runtime, an Intel-optimized [GGML](https://github.com/ggerganov/ggml) compatible runtime, demonstrates **up to 15x performance gain in 1st token generation and 1.5x in other token generation** over the default [llama.cpp](https://github.com/ggerganov/llama.cpp).
* [2023/10] LLM runtime now supports LLM inference with **infinite-length inputs up to 4 million tokens**, inspired from [StreamingLLM](https://arxiv.org/abs/2309.17453).
* [2023/09] NeuralChat has been showcased in [**Intel Innovation’23 Keynote**](https://www.youtube.com/watch?v=RbKRELWP9y8&t=2954s) and [Google Cloud Next'23](https://cloud.google.com/blog/topics/google-cloud-next/welcome-to-google-cloud-next-23) to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
* [2023/08] NeuralChat supports **custom chatbot development and deployment within minutes** on broad Intel HWs such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out [Notebooks](./intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).
* [2023/07] LLM runtime extends Hugging Face Transformers API to provide seamless low precision inference for popular LLMs, supporting low precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.

---
<div align="left">
@@ -34,25 +28,31 @@ pip install intel-extension-for-transformers
> For more installation methods, please refer to [Installation Page](./docs/installation.md)

## 🌟Introduction
Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, in particular effective on 4th Intel Xeon Scalable processor Sapphire Rapids (codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html)). The toolkit provides the below key features and examples:

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, in particular, effective on 4th Intel Xeon Scalable processor Sapphire Rapids (codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html)). The toolkit provides the below key features and examples:

* Seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor)


* Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper [Fast Distilbert on CPUs](https://arxiv.org/abs/2211.07715) and [QuaLA-MiniLM: a Quantized Length Adaptive MiniLM](https://arxiv.org/abs/2210.17114), and NeurIPS 2021's paper [Prune Once for All: Sparse Pre-Trained Language Models](https://arxiv.org/abs/2111.05754))


* Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list) and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa)
* Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa)

* [NeuralChat](intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins [Knowledge Retrieval](./intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md), [Speech Interaction](./intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md), [Query Caching](./intel_extension_for_transformers/neural_chat/pipeline/plugins/caching/README.md), [Security Guardrail](./intel_extension_for_transformers/neural_chat/pipeline/plugins/security/README.md).


* [Inference](intel_extension_for_transformers/llm/runtime/graph) of Large Language Model (LLM) in pure C/C++ with weight-only quantization kernels, supporting [GPT-NEOX](intel_extension_for_transformers/llm/runtime/graph/models/gptneox), [LLAMA](intel_extension_for_transformers/llm/runtime/graph/models/llama), [MPT](intel_extension_for_transformers/llm/runtime/graph/models/mpt), [FALCON](intel_extension_for_transformers/llm/runtime/graph/models/falcon), [BLOOM-7B](intel_extension_for_transformers/llm/runtime/graph/models/bloom), [OPT](intel_extension_for_transformers/llm/runtime/graph/models/opt), [ChatGLM2-6B](intel_extension_for_transformers/llm/runtime/graph/models/chatglm), [GPT-J-6B](intel_extension_for_transformers/llm/runtime/graph/models/gptj) and [Dolly-v2-3B](intel_extension_for_transformers/llm/runtime/graph/models/gptneox)


## 🌱Getting Started
Below are the sample code to enable weight-only low precision inference. See more [examples](intel_extension_for_transformers/llm/runtime/graph).
Below is the sample code to enable the chatbot. See more [examples](intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).

### Chatbot
```python
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```

Below is the sample code to enable weight-only INT4/INT8 inference. See more [examples](intel_extension_for_transformers/llm/runtime/graph).

### INT4 Inference
```python
@@ -90,7 +90,7 @@ outputs = tokenizer.batch_decode(gen_tokens)

## 🎯Validated Models
Here is the average accuracy of validated models on Lambada (OpenAI), HellaSwag, Winogrande, PIQA, and WikiText.
The next token latency is based on 32 input tokens and greedy search on Intel's 4th Generation Xeon Scalable Sapphire Rapids processor.
The subsequent token latency is based on 32 input tokens and greedy search on Intel's 4th Generation Xeon Scalable Sapphire Rapids processor.

| Model | FP32 | INT4 Accuracy (Group size 32) | INT4 Accuracy (Group size 128) | Next Token Latency |
|---------------------|:----------------------:|:-----------------------:|:----------------------------:|:------------:|
@@ -136,8 +136,9 @@ Find other models like ChatGLM, ChatGLM2, StarCoder... in [LLM Runtime](./intel_
</tr>
<tr>
<td colspan="2" align="center"><a href="intel_extension_for_transformers/llm/runtime/graph/README.md">LLM Runtime</a></td>
<td colspan="3" align="center"><a href="intel_extension_for_transformers/llm/runtime/graph/core/README.md">Low Precision Kernels</a></td>
<td colspan="3" align="center"><a href="intel_extension_for_transformers/llm/runtime/graph/tensor_parallelism.md">Tensor Parallelism</a></td>
<td colspan="2" align="center"><a href="intel_extension_for_transformers/llm/runtime/graph/README.md#2-run-llm-with-python-api">Streaming LLM</a></td>
<td colspan="2" align="center"><a href="intel_extension_for_transformers/llm/runtime/graph/core/README.md">Low Precision Kernels</a></td>
<td colspan="2" align="center"><a href="intel_extension_for_transformers/llm/runtime/graph/tensor_parallelism.md">Tensor Parallelism</a></td>
</tr>
<tr>
<th colspan="8" align="center">LLM COMPRESSION</th>
@@ -204,10 +205,10 @@ Find other models like ChatGLM, ChatGLM2, StarCoder... in [LLM Runtime](./intel_


## Acknowledgements
* Excellent open-source projects: [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [FastChat](https://github.com/lm-sys/FastChat), [fastRAG](https://github.com/IntelLabs/fastRAG), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [llama.cpp](https://github.com/ggerganov/llama.cpp), [lm-evauation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [peft](https://github.com/huggingface/peft), [trl](https://github.com/huggingface/trl), and many others.
* Excellent open-source projects: [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [FastChat](https://github.com/lm-sys/FastChat), [fastRAG](https://github.com/IntelLabs/fastRAG), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [llama.cpp](https://github.com/ggerganov/llama.cpp), [lm-evauation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [peft](https://github.com/huggingface/peft), [trl](https://github.com/huggingface/trl), [streamingllm](https://github.com/mit-han-lab/streaming-llm) and many others.

* Thanks to all the contributors including [Ikko Eltociear Ashimine](https://github.com/eltociear), [Hardik Kamboj](https://github.com/hardikkamboj), [Sangjune Park](https://github.com/JJukE), [Kevin Ta](https://github.com/kta-intel), [Huiyan Cao](https://github.com/huiyan2021), [Xigui Wang](https://github.com/xiguiw), [Jiafu Zhang](https://github.com/jiafuzha), [Tyler Titsworth](https://github.com/tylertitsworth), [Yi Wang](https://github.com/sywangyi), [Samanway Sadhu](https://github.com/SamanwaySadhu), [Jiqing Feng](https://github.com/jiqing-feng), [Jonathan Mamou](https://github.com/jmamou) and [Niroop Ammbashankar](https://github.com/nammbash).

## 💁Collaborations

Welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach [us](mailto:[email protected]) and look forward to our collaborations on Intel Extension for Transformers!
Welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach [us](mailto:[email protected]), and we look forward to our collaborations on Intel Extension for Transformers!
@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
torchvision
onnx>=1.12
onnxruntime==1.13.1
@@ -3,7 +3,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
torchvision
onnx>=1.12
onnxruntime==1.13.1
@@ -1,2 +1,2 @@
transformers
torch==2.0.1
torch==2.1.0
@@ -70,7 +70,7 @@ Modify the `user.conf` when you run different models:

+ When you run minilm, please also add `--minilm=true` for both performance and accuracy.

+ When you run benchmark on `SPR` machine, please add `--inter_parallel=28 and set --INST_NUM=28` for both perfomance and accuracy.
+ When you run benchmark on `SPR` machine, please add `--inter_parallel=28 and set --INST_NUM=28` for both performance and accuracy.

+ When you run bert large please keep batch size as 4.

@@ -72,7 +72,7 @@ graph.save('./ir')
```

# Benchmark
If you want to run local onnx model inference, we provide with python API and C++ API. To use C++ API, you need to transfer to model ir fisrt.
If you want to run local onnx model inference, we provide with python API and C++ API. To use C++ API, you need to transfer to model ir first.

By setting ``--dynamic_quanzite`` for FP32 model, you could benchmark dynamic quantize int8 model.
## Accuracy
@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.12.1

@@ -256,7 +256,7 @@ NOTES: ** the multiplication and addition operation amount when model inference
</tr>
<tr>
<td>IRQ Balance</td>
<td>Eabled</td>
<td>Enabled</td>
</tr>
<tr>
<td>CPU Model</td>
@@ -1,6 +1,6 @@
transformers
datasets
torchprofile
torch==2.0.1
torch==2.1.0
intel_extension_for_pytorch
accelerate
@@ -28,7 +28,7 @@ python run_qa.py \

## Step 2: Distributed Data Parallel Training

We supporte Distributed Data Parallel training on single node and multi nodes settings for pruning. To use Distributed Data Parallel to speedup training, the bash command needs a small adjustment.
We support Distributed Data Parallel training on single node and multi nodes settings for pruning. To use Distributed Data Parallel to speedup training, the bash command needs a small adjustment.
<br>
*`<MASTER_ADDRESS>`* is the address of the master node, it won't be necessary for single node case,
<br>
@@ -1,5 +1,5 @@
datasets >= 1.8.0
torch==2.0.1
torch==2.1.0
transformers
wandb
accelerate
@@ -1,5 +1,5 @@
accelerate
datasets
transformers
torch==2.0.1
torch==2.1.0
neural-compressor==2.0
@@ -4,6 +4,6 @@ sentencepiece != 0.1.92
rouge-score
nltk
py7zr
torch==2.0.1
torch==2.1.0
transformers
protobuf
@@ -1,4 +1,4 @@
torch==2.0.1
torch==2.1.0
numpy
transformers
datasets
@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -4,7 +4,7 @@ accelerate
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
onnx>=1.12
onnxruntime==1.13.1

@@ -1,4 +1,4 @@
torch==2.0.1
torch==2.1.0
transformers
datasets
allennlp
@@ -6,5 +6,5 @@ sentencepiece
scipy
scikit-learn
protobuf
torch==2.0.1
torch==2.1.0
evaluate
@@ -1,5 +1,5 @@
accelerate
torch==2.0.1
torch==2.1.0
datasets >= 1.1.3
sentencepiece != 0.1.92
transformers
@@ -2,6 +2,6 @@ accelerate
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
transformers
wandb
@@ -2,6 +2,6 @@ accelerate
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
transformers
wandb
@@ -2,6 +2,6 @@ accelerate
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf
torch==2.0.1
torch==2.1.0
transformers
wandb