diff --git a/README.md b/README.md index e992eaf2..e25ac508 100644 --- a/README.md +++ b/README.md @@ -1,212 +1,109 @@ - +![180242727-c8d5d236-2f0f-4394-b2ad-d393150bd177](https://user-images.githubusercontent.com/83510798/180889804-174df307-678f-4704-b82a-7d8eda2921e4.png) -
-Join the community |
-Contribute to the library
-
-How nebullvm works •
-Benchmarks •
-Installation •
-Get started
- - -# Nebullvm - -**`nebullvm` speeds up AI inference by 2-30x in just a few lines of code 🚀** - -- [How nebullvm works](#how-nebullvm-works) -- [Benchmarks](#benchmarks) -- [Installation](#installation) -- Get started -- Pytorch, TensorFlow, Hugging Face and ONNX APIs. - - -## How nebullvm works - -> This open-source library takes your AI model as input and outputs an -> optimized version that runs 2-30 times faster on your hardware. Nebullvm -> tests multiple optimization techniques (deep learning compilers, -> quantization, sparsity, distillation, and more) to identify the optimal way -> to execute your AI model on your specific hardware. - - `nebullvm` can speed up your model 2 to 10 times without loss of performance, or up to 30 times if you specify that you are willing to trade off a self-defined amount of accuracy/precision for a super-low latency and a lighter model. - -The goal of `nebullvm` is to let any developer benefit from the most advanced inference optimization techniques without having to spend countless hours understanding, installing, testing and debugging these powerful technologies. - -Do you want to learn more about how nebullvm optimizes your model? Take a look at the documentation. - - - -### So why nebullvm? - -🚀 **Superfast**. nebullvm speeds up the response time of AI models to enable real-time AI applications with reduced computing cost and low power consumption. - -☘️ **Easy-to-use**. It takes a few lines of code to install the library and optimize your models. - -💻 **Deep learning model agnostic.** `nebullvm` supports all the most popular architectures such as transformers, LSTMs, CNNs and FCNs. - -🔥 **Framework agnostic**. `nebullvm` supports the most widely used frameworks and provides as output an optimized version of your model with the same interface. At present, nebullvm supports PyTorch, TensorFlow, Hugging Face and ONNX models. - -🤖 **Hardware agnostic**. The library now works on most CPUs and GPUs. If you activate the TVM compiler, nebullvm will also support TPUs and other deep learning-specific ASICs. - -✨ **Leveraging the best optimization techniques**. There are many inference optimization techniques such as deep learning compilers, quantization, half-precision or distillation, which are all meant to optimize the way your AI models run on your hardware. It would take developers countless hours to install and test them on every model deployment. `nebullvm` does that for you. - -Do you like the concept? Leave a ⭐ if you enjoy the project and [join the Discord community](https://discord.gg/RbeQMu886J) where we chat about `nebullvm` and AI optimization. And happy acceleration 🚀🚀 - - - -## Benchmarks +# **Nebullvm** -We have tested `nebullvm` on popular AI models and hardware from leading vendors. +`nebullvm` is an open-source tool designed to speed up AI inference in just a few lines of code. `nebullvm` boosts your model to achieve the maximum acceleration that is physically possible on your hardware. -The table below shows the inference speedup provided by `nebullvm`. The speedup is calculated as the response time of the unoptimized model divided by the response time of the accelerated model, as an average over 100 experiments. As an example, if the response time of an unoptimized model was on average 600 milliseconds and after `nebullvm` optimization only 240 milliseconds, the resulting speedup is 2.5x times, meaning 150% faster inference. 
+We are building a new AI inference acceleration product leveraging state-of-the-art open-source optimization tools enabling the optimization of the whole software to hardware stack. If you like the idea, give us a star to support the project ⭐ -A complete overview of the experiment and findings can be found on in the documentation. +![nebullvm](https://user-images.githubusercontent.com/83510798/180957708-edfa8c8f-1818-4270-ac02-781eec8db773.png) -| | **M1 Pro** | **Intel Xeon** | **AMD EPYC** | **Nvidia T4** | -|-------------------------|:------------:|:---------------:|:-------------:|:-------------:| -| **EfficientNetB0** | 23.3x | 3.5x | 2.7x | 1.3x | -| **EfficientNetB2** | 19.6x | 2.8x | 1.5x | 2.7x | -| **EfficientNetB6** | 19.8x | 2.4x | 2.5x | 1.7x | -| **Resnet18** | 1.2x | 1.9x | 1.7x | 7.3x | -| **Resnet152** | 1.3x | 2.1x | 1.5x | 2.5x | -| **SqueezeNet** | 1.9x | 2.7x | 2.0x | 1.3x | -| **Convnext tiny** | 3.2x | 1.3x | 1.8x | 5.0x | -| **Convnext large** | 3.2x | 1.1x | 1.6x | 4.6x | -| **GPT2 - 10 tokens** | 2.8x | 3.2x | 2.8x | 3.8x | -| **GPT2 - 1024 tokens** | - | 1.7x | 1.9x | 1.4x | -| **Bert - 8 tokens** | 6.4x | 2.9x | 4.8x | 4.1x | -| **Bert - 512 tokens** | 1.8x | 1.3x | 1.6x | 3.1x | -| ____________________ | ____________ | ____________ | ____________ | ____________ | - -Overall, the library provides great results, with more than 2x acceleration in most cases and around 20x in a few applications. We can also observe that acceleration varies greatly across different hardware-model couplings, so we suggest you test `nebullvm` on your model and hardware to assess its full potential on your specific use case. +The core `nebullvm` workflow consists of 3 steps: -Besides, across all scenarios, `nebullvm` is very helpful for its ease of use, allowing you to take advantage of inference optimization techniques without having to spend hours studying, testing and debugging these powerful technologies. +- [x] **Select**: input your model in your preferred DL framework and express your preferences regarding: + - Accuracy loss: do you want to trade off a little accuracy for much higher performance? + - Optimization time: stellar accelerations can be time-consuming. Can you wait, or do you need an instant answer? +- [x] **Search**: `nebullvm` automatically tests every combination of optimization techniques across the software-to-hardware stack (sparsity, quantization, compilers, etc.) that is compatible with your needs and local hardware. +- [x] **Serve**: finally, `nebullvm` chooses the best configuration of optimization techniques and returns an accelerated version of your model in the DL framework of your choice (just on steroids 🚀). -## Installation +# API quick view -The installation consists of two steps: -- [`nebullvm` installation](#nebullvm-installation) -- [Installation of deep learning compilers](#installation-of-deep-learning-compilers) +Only a single line of code is needed to get your accelerated model: -### Nebullvm installation +```python +import torch +import torchvision.models as models +from nebullvm.api.functions import optimize_model -There are two ways to install `nebullvm`: -- [Using PyPI](#installation-with-pypi-recommended). 
We suggest installing the library with pip to get the stable version of nebullvm -- [From source code](#installation-from-source-code) to get the latest features +# Load a resnet as example +model = models.resnet50() -#### Installation with PyPI (recommended) +# Provide an input data for the model +input_data = [((torch.randn(1, 3, 256, 256), ), 0)] -The easiest way to install `nebullvm` is by using `pip`, running +# Run nebullvm optimization in one line of code +optimized_model = optimize_model( + model, input_data=input_data, optimization_time="constrained" +) +# Try the optimized model +x = torch.randn(1, 3, 256, 256) +res = optimized_model(x) ``` -pip install nebullvm -``` -#### Installation from source code - -Alternatively, you can install nebullvm from source code by cloning the directory on your local machine -using `git`. - -``` -git clone https://github.com/nebuly-ai/nebullvm.git -``` -Then, enter the repo and install `nebullvm` with `pip`. - -``` -cd nebullvm -pip install . -``` - -### Installation of deep learning compilers - -Follow the instructions below to automatically install all deep learning compilers leveraged by nebullvm (OpenVINO, TensorRT, ONNX Runtime, Apache TVM, etc.). - -To install them, there are thee ways: - -- [Installation at the first optimization run](#installation-at-the-first-optimization-run) -- [Installation before the first optimization run (recommended)](#installation-before-the-first-optimization-run-recommended) -- [Download Docker images with preinstalled compilers](#download-docker-images-with-preinstalled-compilers) - -Note that: -- Apache TVM is not installed with the below instructions. TVM can be installed separately by following this [guide](https://nebuly.gitbook.io/nebuly/nebullvm/installation/install-and-activate-the-apache-tvm-compiler). -- As an alternative to automatic installation of all compilers, they can be selectively installed by following these [instructions](https://nebuly.gitbook.io/nebuly/nebullvm/installation/selective-installation-of-deep-learning-compilers). - -#### Installation at the first optimization run -The automatic installation of the deep learning compilers is activated after you `import nebullvm` and perform your first optimization. You may run into import errors related to the deep learning compiler installation, but you can ignore these errors/warnings. It is also recommended re-starting the python kernel between the auto-installation and the first optimization, otherwise not all compilers will be activated. +For more details, please visit [Installation](https://nebuly.gitbook.io/nebuly/nebullvm/installation) and [Get started](https://nebuly.gitbook.io/nebuly/nebullvm/get-started). -#### Installation before the first optimization run (recommended) +# **How it works** -To avoid any problems, we strongly recommend running the auto-installation -before performing the first optimization by running +We are not here to reinvent the wheel, but to build an all-in-one open-source product to master all the available AI acceleration techniques and deliver the **fastest AI ever.** As a result, `nebullvm` leverages available enterprise-grade open-source optimization tools. If these tools and communities already exist, and are distributed under a permissive license (Apache, MIT, etc), we integrate them and happily contribute to their communities. However, many tools do not exist yet, in which case we implement them and open-source the code so that the community can benefit from it. 
-``` -python -c "import nebullvm" -``` - -You should ignore at this stage any import warning resulting from the previous -command. - -#### Download Docker images with preinstalled compilers -Instead of installing the compilers, which may take a long time, you can simply download the docker container with all compilers preinstalled and start using nebullvm. -To pull the docker image you can simply run - -``` -docker pull nebulydocker/nebullvm:cuda11.2.0-nebullvm0.3.1-allcompilers -``` +### **Product design** -and you can then run and access the docker with +`nebullvm` is shaped around **4 building blocks** and leverages a modular design to foster scalability and integration of new acceleration components across the stack. -``` -docker run -ia nebulydocker/nebullvm:cuda11.2.0-nebullvm0.3.1-allcompilers -``` +- [x] **Converter:** converts the input model from its original framework to the framework backends supported by `nebullvm`, namely PyTorch, TensorFlow, and ONNX. This allows the Compressor and Optimizer modules to apply any optimization technique to the model. +- [x] **Compressor:** applies various compression techniques to the model, such as pruning, knowledge distillation, or quantization-aware training. +- [x] **Optimizer:** converts the compressed models to the intermediate representation (IR) of the supported deep learning compilers. The compilers apply both post-training quantization techniques and graph optimizations, to produce compiled binary files. +- [x] **Inference Learner:** takes the best performing compiled model and converts it to the same interface as the original input model. -After you have compiled the model, you may decide to deploy it to production. Note that some of the components used to optimize the model are also needed to run it, so **you must have the compiler installed in the production docker**. For this reason, we have created several versions of our Docker container in the [Docker Hub](https://hub.docker.com/repository/docker/nebulydocker/nebullvm), each containing only one compiler. Pull the image with the compiler that has optimized your model! +![https://user-images.githubusercontent.com/42771598/180275153-f9e48569-221b-47c7-ab62-ed2ac1c635ca.png](https://user-images.githubusercontent.com/42771598/180275153-f9e48569-221b-47c7-ab62-ed2ac1c635ca.png) +The **compressor** stage leverages the following open-source projects: +- [Intel/neural-compressor](https://github.com/intel/neural-compressor): targeting to provide unified APIs for network compression technologies, such as low precision quantization, sparsity, pruning, knowledge distillation, across different deep learning frameworks to pursue optimal inference performance. +- [SparseML](https://github.com/neuralmagic/sparseml): libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models. -## Get started, APIs and tutorials -`nebullvm` reduces the computation time of deep learning model inference by 2-30 times by testing multiple optimization techniques and identifying the optimal way to execute your AI model on your hardware. 
+The **optimizer stage** leverages the following open-source projects: -`nebullvm` can be deployed in two ways: -- [Option A: 2-10x acceleration, NO performance loss](#option-a-2-10x-acceleration-no-performance-loss) -- [Option B: 2-30x acceleration, supervised performance loss](#option-b-2-30x-acceleration-supervised-performance-loss) +- [Apache TVM](https://github.com/apache/tvm): open deep learning compiler stack for cpu, gpu and specialized accelerators. +- [BladeDISC](https://github.com/alibaba/BladeDISC): end-to-end Dynamic Shape Compiler project for machine learning workloads. +- [DeepSparse](https://github.com/neuralmagic/deepsparse): neural network inference engine that delivers GPU-class performance for sparsified models on CPUs. +- [OpenVINO](https://github.com/openvinotoolkit/openvino): open-source toolkit for optimizing and deploying AI inference. +- [ONNX Runtime](https://github.com/microsoft/onnxruntime): cross-platform, high performance ML inferencing and training accelerator +- [TensorRT](https://github.com/NVIDIA/TensorRT): C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators. +- [TFlite](https://github.com/tensorflow/tflite-micro) and [XLA](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla): open-source libraries to accelerate TensorFlow models. -For a detailed explanation of how nebullvm works and how to use it, refer to the documentation. +# **Documentation** -### Option A: 2-10x acceleration, NO performance loss +- [Installation](https://nebuly.gitbook.io/nebuly/nebullvm/installation) +- [Get started](https://nebuly.gitbook.io/nebuly/nebullvm/get-started) +- [Benchmarks](https://nebuly.gitbook.io/nebuly/nebullvm/benchmarks) +- [Supported features and roadmap](https://nebuly.gitbook.io/nebuly/nebullvm/how-nebullvm-works/supported-features-and-roadmap) -If you choose this option, `nebullvm` will test multiple deep learning compilers (TensorRT, OpenVINO, ONNX Runtime, etc.) and identify the optimal way to compile your model on your hardware, increasing inference speed by 2-10 times without affecting the performance of your model. +# **Community** -As an example, below is code for accelerating a PyTorch model with nebullvm's PyTorch API. +- **[Discord](https://discord.gg/RbeQMu886J)**: best for sharing your projects, hanging out with the community and learning about AI acceleration. +- **[Github issues](https://github.com/nebuly-ai/nebullvm/issues)**: ideal for suggesting new acceleration components, requesting new features, and reporting bugs and improvements. -``` ->>> import torch ->>> import torchvision.models as models ->>> from nebullvm import optimize_torch_model ->>> model = models.efficientnet_b0() ->>> save_dir = "." ->>> bs, input_sizes = 1, [(3, 256, 256)] ->>> optimized_model = optimize_torch_model( -... model, batch_size=bs, input_sizes=input_sizes, save_dir=save_dir -... ) ->>> x = torch.randn(1, 3, 256, 256) ->>> res = optimized_model(x) -``` +We’re developing `nebullvm` together with our community so the best way to get started is to pick a `good-first issue`. Please read our [contribution guidelines](https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions) for a deep dive on how to best contribute to our project! 
-### Option B: 2-30x acceleration, supervised performance loss +Don't forget to leave a star ⭐ to support the project and happy acceleration 🚀 -`Nebullvm` is capable of speeding up inference by much more than 10 times in case you are willing to sacrifice a fraction of your model's performance. If you specify how much performance loss you are willing to sustain, `nebullvm` will push your model's response time to its limits by identifying the best possible blend of state-of-the-art inference optimization techniques, such as deep learning compilers, distillation, quantization, half-precision, sparsity, etc. +# **Status** --How nebullvm works • -Benchmarks • -Installation • -Get started +Installation • +Get started • +Benchmarks
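For completeness, here is a sketch of how the accuracy-loss and optimization-time preferences described in the README above map onto the new `optimize_model` API. The ResNet model, the 2% threshold and the random input tensors are illustrative placeholders; the parameter names and accepted string values come from the API itself.

```python
import torch
import torchvision.models as models
from nebullvm.api.functions import optimize_model

# Any PyTorch model works; ~100 samples are recommended for unconstrained mode
model = models.resnet50()
input_data = [((torch.randn(1, 3, 256, 256),), 0) for _ in range(100)]

# Accept at most a 2% drop in numeric precision and allow the slower,
# compression-based techniques (e.g. pruning, distillation) to be explored
optimized_model = optimize_model(
    model,
    input_data=input_data,
    metric_drop_ths=0.02,
    metric="numeric_precision",
    optimization_time="unconstrained",
)
```

With `optimization_time="constrained"` the same call restricts the search to compilers and post-training precision reduction, as in the quick-view snippet above.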
diff --git a/nebullvm/api/frontend/huggingface.py b/nebullvm/api/frontend/huggingface.py index f12581dc..02bb6524 100644 --- a/nebullvm/api/frontend/huggingface.py +++ b/nebullvm/api/frontend/huggingface.py @@ -1,29 +1,25 @@ -from collections import OrderedDict +import warnings from tempfile import TemporaryDirectory from typing import ( Tuple, Union, List, - Iterable, Dict, - Any, - Type, Callable, - Optional, - Sequence, ) -import numpy as np import torch from nebullvm import optimize_torch_model -from nebullvm.api.frontend.utils import ifnone, QUANTIZATION_METRIC_MAP -from nebullvm.base import DataType, ModelCompiler -from nebullvm.inference_learners.base import ( - PytorchBaseInferenceLearner, - InferenceLearnerWrapper, - LearnerMetadata, +from nebullvm.api.huggingface import ( + _flatten_outputs, + _TransformerWrapper, + _get_output_structure_from_text, + HuggingFaceInferenceLearner, + _HFTextDataset, ) +from nebullvm.api.utils import ifnone, QUANTIZATION_METRIC_MAP +from nebullvm.base import DataType, ModelCompiler from nebullvm.optimizers.extra import HuggingFaceOptimizer try: @@ -35,203 +31,6 @@ PreTrainedTokenizer = None -def _flatten_outputs( - outputs: Union[torch.Tensor, Iterable] -) -> List[torch.Tensor]: - new_outputs = [] - for output in outputs: - if isinstance(output, torch.Tensor): - new_outputs.append(output) - else: - flatten_list = _flatten_outputs(output) - new_outputs.extend(flatten_list) - return new_outputs - - -class _TransformerWrapper(torch.nn.Module): - """Class for wrappering the Transformers and give them an API compatible - with nebullvm. The class takes and input of the forward method positional - arguments and transform them in the input dictionaries needed by - transformers classes. At the end it also flattens their output. - """ - - def __init__( - self, - core_model: torch.nn.Module, - encoded_input: Dict[str, torch.Tensor], - ): - super().__init__() - self.core_model = core_model - self.inputs_types = OrderedDict() - for key, value in encoded_input.items(): - self.inputs_types[key] = value.dtype - - def forward(self, *args: torch.Tensor): - inputs = { - key: value for key, value in zip(self.inputs_types.keys(), args) - } - outputs = self.core_model(**inputs) - return tuple(_flatten_outputs(outputs.values())) - - -def _get_size_recursively( - tensor_tuple: Union[torch.Tensor, Tuple] -) -> List[int]: - if isinstance(tensor_tuple[0], torch.Tensor): - return [len(tensor_tuple)] - else: - inner_size = _get_size_recursively(tensor_tuple[0]) - return [len(tensor_tuple), *inner_size] - - -def _get_output_structure( - text: str, - model: PreTrainedModel, - tokenizer: PreTrainedTokenizer, - tokenizer_args: Dict, -) -> Tuple[OrderedDict, Type]: - """Function needed for saving in a dictionary the output structure of the - transformers model. - """ - encoded_input = tokenizer([text], **tokenizer_args) - output = model(**encoded_input) - structure = OrderedDict() - for key, value in output.items(): - if isinstance(value, torch.Tensor): - structure[key] = None - else: - size = _get_size_recursively(value) - structure[key] = size - return structure, type(output) - - -def _restructure_output( - output: Tuple[torch.Tensor], - structure: OrderedDict, - output_type: Any = None, -): - """Restructure the flatter output using the structure dictionary given as - input. 
- """ - output_dict = {} - idx = 0 - for key, value in structure.items(): - if value is None: - output_dict[key] = output[idx] - idx += 1 - else: - output_dict[key] = ( - np.array( - output[idx : int(np.prod(value)) + idx], # noqa E203 - dtype=object, - ) - .reshape(value) - .tolist() - ) - idx += np.prod(value) - if output_type is not None: - return output_type(**output_dict) - return output_dict - - -class HuggingFaceInferenceLearner(InferenceLearnerWrapper): - """Class wrapping an InferenceLearner model and giving to it the - huggingface interface. - - The class fuse both the InterfaceLearner and HuggingFace interfaces, giving - to the final user a model which can be used whit the prefered API without - the need of adapting the previous code. - - Attributes: - network_parameters (ModelParams): Model parameters of the model. - core_inference_learner (PytorchBaseInferenceLearner): Inference learner - built using the Pytorch interface. - output_structure (Dict): Original output structure of the HuggingFace - model. - input_names (List[str]): List of all the input keys used for the - original HuggingFace model. - output_type (Any, optional): Original output type of the HuggingFace - model. - """ - - def __init__( - self, - core_inference_learner: PytorchBaseInferenceLearner, - output_structure: OrderedDict, - input_names: List[str], - output_type: Any = None, - ): - super().__init__(core_inference_learner) - self.output_structure = output_structure - self.input_names = input_names - self.output_type = output_type - - def _save_wrapper_extra_info(self): - pass - - @staticmethod - def _load_wrapper_extra_info(builder_inputs: Dict) -> Dict: - return builder_inputs - - def run(self, *args, **kwargs) -> Any: - """Run the underlying optimized model for getting a prediction. - - The method has an hybrid interface. It accepts inputs either as - positional or keyword arguments. If only positional arguments are given - the method expects the inputs to be in the canonical - nebullvm interface. If only keyword arguments are given the method - expects them to be in the HuggingFace interface. Mixed representation - is not allowed and will result in an error. - """ - if len(args) > 0 and len(kwargs) > 0: - raise RuntimeError( - "Not allowed usage of the predict method. " - "Either the positional or the keyword arguments must be given." - ) - if len(args) > 0: - return self.core_inference_learner(*args) - inputs = (kwargs.pop(name) for name in self.input_names) - outputs = self.core_inference_learner(*inputs) - return _restructure_output( - outputs, self.output_structure, self.output_type - ) - - def _get_extra_metadata_kwargs(self) -> Dict: - metadata_kwargs = { - "output_structure": self.output_structure, - "output_structure_keys": list(self.output_structure.keys()), - "input_names": self.input_names, - } - if self.output_type is not None: - metadata_kwargs.update( - { - "output_type": self.output_type.__name__, - "output_type_module": self.output_type.__module__, - } - ) - return metadata_kwargs - - @staticmethod - def _convert_metadata_to_inputs(metadata: LearnerMetadata) -> Dict: - # we need to guarantee the preservation of the output structure - # elements order. 
- output_structure = OrderedDict() - for key in metadata["output_structure_keys"]: - output_structure[key] = metadata["output_structure"][key] - - inputs = { - "output_structure": output_structure, - "input_names": metadata["input_names"], - } - if metadata["output_type"] is not None: - exec( - f"from {metadata['output_type_module']} " - f"import {metadata['output_type']}" - ) - inputs["output_type"] = eval(metadata["output_type"]) - return inputs - - def _get_dynamic_axis( text: str, tokenizer: PreTrainedTokenizer, @@ -302,45 +101,6 @@ def _get_extra_optimizer( return [HuggingFaceOptimizer(hugging_face_params={})] -class _HFDataset(Sequence): - def __init__( - self, - input_texts: List, - ys: Optional[List], - keywords: List[str], - batch_size: int, - tokenizer: PreTrainedTokenizer, - tokenizer_args: Dict, - ): - self._input_texts = input_texts - self._ys = ys - self._bs = batch_size - self._keys = keywords - self._tokenizer = tokenizer - if self._tokenizer.pad_token is None: - self._tokenizer.pad_token = self._tokenizer.eos_token - _tokenizer_args = {"truncation": True, "padding": True} - _tokenizer_args.update(tokenizer_args) - self._tokenizer_args = _tokenizer_args - - def __getitem__(self, item: int): - pointer = self._bs * item - if pointer >= len(self): - raise IndexError - mini_batch = self._input_texts[ - pointer : pointer + self._bs # noqa E203 - ] - if self._ys is not None: - mini_batch_y = self._ys[pointer : pointer + self._bs] # noqa E203 - else: - mini_batch_y = None - encoded_inputs = self._tokenizer(mini_batch, **self._tokenizer_args) - return tuple(encoded_inputs[key] for key in self._keys), mini_batch_y - - def __len__(self): - return len(self._input_texts) - - def optimize_huggingface_model( model: PreTrainedModel, tokenizer: PreTrainedTokenizer, @@ -371,7 +131,7 @@ def optimize_huggingface_model( tokenizer (PreTrainedTokenizer): Tokenizer used for building model's inputs. input_texts (List[str]): Texts either from the training set or similar - to the ones contained in the text. If the perf_loss_ths is + to the ones contained in the text. If the metric_drop_ths is passed the input_text will be used for computing the drop in precision and for setting the quantization parameters. If you selected a quantization metric needing the input labels you need to @@ -409,7 +169,7 @@ def optimize_huggingface_model( performed, since no data is given as input. perf_metric (Union[Callable, str], optional): The metric to be used for accepting or refusing a precision-reduction - optimization proposal. If none is given but a `perf_loss_ths` is + optimization proposal. If none is given but a `metric_drop_ths` is received, the `nebullvm.measure.compute_relative_difference` metric will be used as default one. A user-defined metric can be passed as function accepting as inputs two tuples of tensors @@ -417,9 +177,9 @@ def optimize_huggingface_model( original labels. For more information see `nebullvm.measure.compute_relative_difference` and - `nebullvm.measure.compute_accuracy_drop`. `perf_metric` + `nebullvm.measure.compute_accuracy_drop`. `metric` accepts as value also a string containing the metric name. At the - current stage the supported metrics are `"precision"` and + current stage the supported metrics are `"numeric_precision"` and `"accuracy"`. ys: List of target labels. For each input in `input_texts` there should be the corresponding label. Note that this feature is just used for @@ -427,6 +187,11 @@ def optimize_huggingface_model( techniques. 
It will be ignored if these techniques are not activated. """ + warnings.warn( + "Deprecated: The usage of the HuggingFace api is deprecated. " + "`optimize_huggingface_model`will be removed from the next release. " + "Use `optimize_model` instead." + ) if perf_loss_ths is not None and ys is None and perf_metric == "accuracy": raise ValueError( "You cannot select the accuracy as quantization metric without " @@ -436,7 +201,7 @@ def optimize_huggingface_model( perf_metric = QUANTIZATION_METRIC_MAP.get(perf_metric) tokenizer_args = tokenizer_args or {} tokenizer_args.update({"return_tensors": "pt"}) - output_structure, output_type = _get_output_structure( + output_structure, output_type = _get_output_structure_from_text( text=input_texts[0], model=model, tokenizer=tokenizer, @@ -470,7 +235,7 @@ def optimize_huggingface_model( else None, perf_loss_ths=perf_loss_ths, perf_metric=perf_metric, - dataloader=_HFDataset( + dataloader=_HFTextDataset( input_texts, ys, list(wrapper_model.inputs_types.keys()), diff --git a/nebullvm/api/frontend/onnx.py b/nebullvm/api/frontend/onnx.py index 81c2ce42..877aa970 100644 --- a/nebullvm/api/frontend/onnx.py +++ b/nebullvm/api/frontend/onnx.py @@ -1,11 +1,12 @@ import os import shutil +import warnings from tempfile import TemporaryDirectory from typing import List, Tuple, Dict, Optional, Callable, Union import numpy as np -from nebullvm.api.frontend.utils import ( +from nebullvm.api.utils import ( inspect_dynamic_size, ifnone, QUANTIZATION_METRIC_MAP, @@ -58,7 +59,7 @@ def _extract_dynamic_axis( return None -def _extract_info_from_data( +def extract_info_from_np_data( onnx_model: str, data: List[Tuple[Tuple[np.ndarray, ...], np.ndarray]], batch_size: int, @@ -139,7 +140,7 @@ def optimize_onnx_model( performed, since no data is given as input. perf_metric (Union[Callable, str], optional): The metric to be used for accepting or refusing a precision-reduction - optimization proposal. If none is given but a `perf_loss_ths` is + optimization proposal. If none is given but a `metric_drop_ths` is received, the `nebullvm.measure.compute_relative_difference` metric will be used as default one. A user-defined metric can be passed as function accepting as inputs two tuples of tensors @@ -147,9 +148,9 @@ def optimize_onnx_model( original labels. For more information see `nebullvm.measure.compute_relative_difference` and - `nebullvm.measure.compute_accuracy_drop`. `perf_metric` + `nebullvm.measure.compute_accuracy_drop`. `metric` accepts as value also a string containing the metric name. At the - current stage the supported metrics are `"precision"` and + current stage the supported metrics are `"numeric_precision"` and `"accuracy"`. ignore_compilers (List[str], optional): List of DL compilers we want to ignore while running the optimization. Compiler name should be @@ -166,13 +167,18 @@ def optimize_onnx_model( Pytorch interface. Note that as a torch model it takes as input and it gives as output `torch.Tensor`s. """ + warnings.warn( + "Deprecated: The usage of the onnx api is deprecated. " + "`optimize_onnx_model`will be removed from the next release. " + "Use `optimize_model` instead." 
+ ) if data is not None: ( batch_size, input_sizes, input_types, dynamic_axis, - ) = _extract_info_from_data( + ) = extract_info_from_np_data( model_path, data, batch_size, @@ -231,8 +237,8 @@ def optimize_onnx_model( output_library=dl_library, model_params=model_params, input_tfms=input_tfms, - perf_loss_ths=perf_loss_ths, - perf_metric=perf_metric, + metric_drop_ths=perf_loss_ths, + metric=perf_metric, input_data=input_data, ) else: diff --git a/nebullvm/api/frontend/tf.py b/nebullvm/api/frontend/tf.py index b53b53d9..f474bde7 100644 --- a/nebullvm/api/frontend/tf.py +++ b/nebullvm/api/frontend/tf.py @@ -9,7 +9,7 @@ import tensorflow as tf from tqdm import tqdm -from nebullvm.api.frontend.utils import ( +from nebullvm.api.utils import ( ifnone, inspect_dynamic_size, QUANTIZATION_METRIC_MAP, @@ -72,7 +72,7 @@ def _extract_dynamic_axis( return None -def _extract_info_from_data( +def extract_info_from_tf_data( tf_model: tf.Module, dataset: List[Tuple[Tuple[tf.Tensor, ...], Any]], batch_size: int, @@ -153,7 +153,7 @@ def optimize_tf_model( performed, since no data is given as input. perf_metric (Union[Callable, str], optional): The metric to be used for accepting or refusing a precision-reduction - optimization proposal. If none is given but a `perf_loss_ths` is + optimization proposal. If none is given but a `metric_drop_ths` is received, the `nebullvm.measure.compute_relative_difference` metric will be used as default one. A user-defined metric can be passed as function accepting as inputs two tuples of tensors @@ -161,9 +161,9 @@ def optimize_tf_model( original labels. For more information see `nebullvm.measure.compute_relative_difference` and - `nebullvm.measure.compute_accuracy_drop`. `perf_metric` + `nebullvm.measure.compute_accuracy_drop`. `metric` accepts as value also a string containing the metric name. At the - current stage the supported metrics are `"precision"` and + current stage the supported metrics are `"numeric_precision"` and `"accuracy"`. ignore_compilers (List[str]): List of DL compilers we want to ignore while running the optimization. Compiler name should be one @@ -180,13 +180,18 @@ def optimize_tf_model( tensorflow interface. Note that as a torch model it takes as input and it gives as output `tf.Tensor` s. """ + warnings.warn( + "Deprecated: The usage of the tensorflow api is deprecated. " + "`optimize_tf_model`will be removed from the next release. " + "Use `optimize_model` instead." 
+ ) if dataset is not None: ( batch_size, input_sizes, input_types, dynamic_axis, - ) = _extract_info_from_data( + ) = extract_info_from_tf_data( model, dataset, batch_size, input_sizes, input_types, dynamic_axis ) input_data = DataManager(dataset) @@ -271,8 +276,8 @@ def optimize_tf_model( output_library=dl_library, model_params=model_params, input_tfms=input_tfms, - perf_loss_ths=perf_loss_ths, - perf_metric=perf_metric, + metric_drop_ths=perf_loss_ths, + metric=perf_metric, input_data=input_data, ) logger.info("Running comparison between optimized models (3/3).") @@ -341,7 +346,7 @@ def _torch_api_optimization( model=model, output_library=DeepLearningFramework.PYTORCH, model_params=model_params, - perf_loss_ths=quantization_ths + metric_drop_ths=quantization_ths if quantization_type is not None else None, quantization_type=quantization_type, diff --git a/nebullvm/api/frontend/torch.py b/nebullvm/api/frontend/torch.py index c75f7d2f..875c7ed2 100644 --- a/nebullvm/api/frontend/torch.py +++ b/nebullvm/api/frontend/torch.py @@ -10,7 +10,7 @@ from torch.utils.data import DataLoader from tqdm import tqdm -from nebullvm.api.frontend.utils import ( +from nebullvm.api.utils import ( check_inputs, ifnone, inspect_dynamic_size, @@ -35,7 +35,10 @@ ) from nebullvm.inference_learners.base import PytorchBaseInferenceLearner from nebullvm.measure import compute_optimized_running_time -from nebullvm.optimizers import ApacheTVMOptimizer, BaseOptimizer +from nebullvm.optimizers import ( + ApacheTVMOptimizer, + BaseOptimizer, +) from nebullvm.optimizers.multi_compiler import MultiCompilerOptimizer logging.basicConfig( @@ -74,7 +77,7 @@ def _extract_dynamic_axis( return None -def _extract_info_from_data( +def extract_info_from_torch_data( model: torch.nn.Module, dataloader: Union[DataLoader, Sequence], batch_size: int, @@ -170,7 +173,7 @@ def optimize_torch_model( performed, since no data is given as input. perf_metric (Union[Callable, str], optional): The metric to be used for accepting or refusing a precision-reduction - optimization proposal. If none is given but a `perf_loss_ths` is + optimization proposal. If none is given but a `metric_drop_ths` is received, the `nebullvm.measure.compute_relative_difference` metric will be used as default one. A user-defined metric can be passed as function accepting as inputs two tuples of tensors @@ -178,9 +181,9 @@ def optimize_torch_model( original labels. For more information see `nebullvm.measure.compute_relative_difference` and - `nebullvm.measure.compute_accuracy_drop`. `perf_metric` + `nebullvm.measure.compute_accuracy_drop`. `metric` accepts as value also a string containing the metric name. At the - current stage the supported metrics are `"precision"` and + current stage the supported metrics are `"numeric_precision"` and `"accuracy"`. ignore_compilers (List[str], optional): List of DL compilers we want to ignore while running the optimization. Compiler name should be @@ -197,6 +200,11 @@ def optimize_torch_model( Pytorch interface. Note that as a torch model it takes as input and it gives as output `torch.Tensor` s. """ + warnings.warn( + "Deprecated: The usage of the torch api is deprecated. " + "`optimize_torch_model`will be removed from the next release. " + "Use `optimize_model` instead." 
+ ) check_inputs( input_data=dataloader, batch_size=batch_size, input_sizes=input_sizes ) @@ -208,7 +216,7 @@ def optimize_torch_model( input_sizes, input_types, dynamic_axis, - ) = _extract_info_from_data( + ) = extract_info_from_torch_data( model, dataloader, batch_size, @@ -299,13 +307,14 @@ def optimize_torch_model( onnx_path = model_converter.convert( model, model_params, Path(tmp_dir), input_data ) + model_optimized = model_optimizer.optimize( model=str(onnx_path), output_library=dl_library, model_params=model_params, input_tfms=input_tfms, - perf_loss_ths=perf_loss_ths, - perf_metric=perf_metric, + metric_drop_ths=perf_loss_ths, + metric=perf_metric, input_data=input_data, ) else: @@ -330,7 +339,7 @@ def _get_optimizers_supporting_torch_api( use_extra_compilers: bool, ) -> List[Tuple[ModelCompiler, BaseOptimizer]]: optimizers = [ - (ModelCompiler.TORCHVISION, PytorchBackendOptimizer(logger=logger)), + (ModelCompiler.TORCHSCRIPT, PytorchBackendOptimizer(logger=logger)), ] if use_extra_compilers: optimizers.append( @@ -359,7 +368,7 @@ def _torch_api_optimization( candidate_model = optimizer.optimize_from_torch( torch_model=model, model_params=model_params, - perf_loss_ths=quantization_ths + metric_drop_ths=quantization_ths if quantization_type is not None else None, quantization_type=quantization_type, @@ -371,7 +380,7 @@ def _torch_api_optimization( model=model, output_library=DeepLearningFramework.PYTORCH, model_params=model_params, - perf_loss_ths=quantization_ths + metric_drop_ths=quantization_ths if quantization_type is not None else None, quantization_type=quantization_type, diff --git a/nebullvm/api/functions.py b/nebullvm/api/functions.py new file mode 100644 index 00000000..4c032b71 --- /dev/null +++ b/nebullvm/api/functions.py @@ -0,0 +1,279 @@ +import logging +from pathlib import Path +from tempfile import TemporaryDirectory +from typing import Any, Iterable, Sequence, Union, Dict, Callable, List + +import tensorflow as tf +import torch.nn + +from nebullvm.api.frontend.onnx import extract_info_from_np_data +from nebullvm.api.frontend.tf import extract_info_from_tf_data +from nebullvm.api.frontend.torch import extract_info_from_torch_data +from nebullvm.api.huggingface import convert_hf_model, is_dict_type +from nebullvm.api.utils import QUANTIZATION_METRIC_MAP +from nebullvm.base import ( + DeepLearningFramework, + ModelParams, + ModelCompiler, + OptimizationTime, +) +from nebullvm.converters.converters import CrossConverter +from nebullvm.pipelines.steps import build_pipeline_from_model +from nebullvm.transformations.base import MultiStageTransformation +from nebullvm.utils.data import DataManager +from nebullvm.utils.feedback_collector import FEEDBACK_COLLECTOR +from nebullvm.utils.onnx import get_output_sizes_onnx +from nebullvm.utils.tf import get_outputs_sizes_tf +from nebullvm.utils.torch import get_outputs_sizes_torch + + +logging.basicConfig( + format="%(asctime)s %(message)s", datefmt="%d/%m/%Y %I:%M:%S %p" +) +logger = logging.getLogger(__name__) +logger.setLevel(logging.INFO) + + +def _get_dl_framework(model: Any): + if isinstance(model, torch.nn.Module): + return DeepLearningFramework.PYTORCH + elif isinstance(model, tf.Module): + return DeepLearningFramework.TENSORFLOW + elif isinstance(model, str): + return DeepLearningFramework.NUMPY + else: + raise TypeError(f"Model type {type(model)} not supported.") + + +def _check_input_data(input_data: Union[Iterable, Sequence]): + try: + input_data[0] + except: # noqa E722 + return False + else: + return True + + 
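# Illustrative helper (hypothetical, not referenced elsewhere): it shows the
# two input_data layouts that _check_input_data tells apart and how each one
# is wrapped further down in optimize_model. Shapes, sizes and labels are
# arbitrary example values.
def _example_input_layouts():
    sequence_style = [((torch.randn(1, 3, 224, 224),), 0) for _ in range(10)]
    generator_style = (((torch.randn(1, 3, 224, 224),), 0) for _ in range(10))
    # an indexable sequence is wrapped directly, while a plain
    # iterable/generator goes through DataManager.from_iterable
    return (
        DataManager(sequence_style),
        DataManager.from_iterable(generator_style),
    )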
+INFO_EXTRACTION_DICT: Dict[DeepLearningFramework, Callable] = { + DeepLearningFramework.PYTORCH: extract_info_from_torch_data, + DeepLearningFramework.TENSORFLOW: extract_info_from_tf_data, + DeepLearningFramework.NUMPY: extract_info_from_np_data, +} + + +OUTPUT_SIZE_COMPUTATION_DICT: Dict[DeepLearningFramework, Callable] = { + DeepLearningFramework.PYTORCH: get_outputs_sizes_torch, + DeepLearningFramework.TENSORFLOW: get_outputs_sizes_tf, + DeepLearningFramework.NUMPY: get_output_sizes_onnx, +} + + +def _extract_info_from_data( + model: Any, + input_data: DataManager, + dl_framework: DeepLearningFramework, + dynamic_info: Dict, +): + batch_size, input_sizes, input_types, dynamic_info = INFO_EXTRACTION_DICT[ + dl_framework + ]( + model, + input_data, + batch_size=None, + input_sizes=None, + input_types=None, + dynamic_axis=dynamic_info, + ) + model_params = ModelParams( + batch_size=batch_size, + input_infos=[ + {"size": size, "dtype": dtype} + for size, dtype in zip(input_sizes, input_types) + ], + output_sizes=OUTPUT_SIZE_COMPUTATION_DICT[dl_framework]( + model, input_data[0][0] + ), + dynamic_info=dynamic_info, + ) + return model_params + + +def _is_huggingface_data(data_sample: Any) -> bool: + if is_dict_type(data_sample): + return True + elif isinstance(data_sample, str): + return True + elif isinstance(data_sample[0], str): + return True + return False + + +def optimize_model( + model: Any, + input_data: Union[Iterable, Sequence], + metric_drop_ths: float = None, + metric: Union[str, Callable] = None, + optimization_time: str = "constrained", + dynamic_info: Dict = None, + config_file: str = None, + ignore_compilers: List[str] = None, + **kwargs, +): + """Optimize the input model regardless of the framework it was used for + implementing it. The optimized model given as output will share with the + input one the same API, i.e. the optimized model will have the same + interface as the original one. + + Args: + model (Any): The input model. + input_data (Iterable or Sequence): Input data to be used for + optimizing the model. Note that if 'unconstrained' is selected as + `optimization_time`, it would be beneficial to provide at least 100 + data samples in order to use all the techniques supported by + Nebullvm. The data can be given in either as sequence (data can be + accessed by "element", e.g. `data[i]`) or iterable (data needs to + be accessed with loop, e.g. `for x in data`). PyTorch, TensorFlow + and Onnx respectively accept input tensor in `torch.Tensor`, + `tf.Tensor` and `np.ndarray` formats. Note that the each input + sample must be a tuple containing a tuple as first element, the + `inputs`, and the `label` as second element. The `inputs` needs to + be passed as tuple even if a single input is needed by the model + (in this case the `inputs` tuple will contain just an element). + HuggingFace models can take as data samples both dictionaries or + strings. Strings will then be converted in data samples using the + HuggingFace tokenizer which must be given as input when just a + list of string is provided as input_data (tokenizers can be passed + as extra arguments of this function using the keyword `tokenizer`). + metric_drop_ths (float, optional): Maximum reduction in the + selected metric accepted. No model with an higher error will be + accepted, i.e. all optimized model having a larger error respect to + the original one will be discarded, without even considering their + possible speed-up. Default: None, i.e. no drop in metric accepted. 
+ metric (Union[Callable, str], optional): The metric to + be used for accepting or refusing a precision-reduction + optimization proposal. If none is given but a `metric_drop_ths` is + received, the `nebullvm.measure.compute_relative_difference` + metric will be used as default one. A user-defined metric can + be passed as function accepting as inputs two tuples of tensors + (produced by the baseline and the optimized model) and the related + original labels. + For more information see + `nebullvm.measure.compute_relative_difference` and + `nebullvm.measure.compute_accuracy_drop`. `metric` + accepts as value also a string containing the metric name. At the + current stage the supported metrics are `"numeric_precision"` and + `"accuracy"`. Default: `"numeric_precision"` + optimization_time (OptimizationTime, optional): The optimization time + mode. It can be either 'constrained' or 'unconstrained'. For + 'constrained' mode just compilers and precision reduction + techniques are used (no compression). 'Unconstrained' optimization + allows the usage of more time consuming techniques as pruning and + distillation. Note that for using many of the sophisticated + techniques in the 'unconstrained' optimization, a small fine-tuning + of the model will be needed. Thus we highly recommend to give as + input_data at least 100 samples for when selecting 'unconstrained' + optimization. Default: 'constrained'. + dynamic_info (Dict, optional): Dictionary containing info about the + dynamic axis. It should contain as keys both "inputs" and "outputs" + and as values two lists of dictionaries where each dictionary + represents the dynamic axis information for an input/output tensor. + The inner dictionary should have as key an integer, i.e. the + dynamic axis (considering also the batch size) and as value a + string giving a "tag" to it, e.g. "batch_size". Default: None + config_file (str, optional): Configuration file containing the + parameters needed for defining the CompressionStep in the pipeline. + Default: None. + ignore_compilers (List, optional): List containing the compilers to be + ignored during the OptimizerStep. Default: None. + + Returns: + InferenceLearner: Optimized version of the input model having the same + interface, imported by its original framework. For instance a + Pytorch model, when optimized, will return an InferenceLearner + object that can be call exactly as a PyTorch model (either + with `model.forward(input)` and `model(input)`), i.e. it will + take as input and it will return `torch.Tensor`s. 
+ """ + dl_framework = _get_dl_framework(model) + optimization_time = OptimizationTime(optimization_time) + FEEDBACK_COLLECTOR.start_collection(model, framework=dl_framework) + if metric_drop_ths is not None and metric_drop_ths <= 0: + metric_drop_ths = None + if isinstance(metric, str): + metric = QUANTIZATION_METRIC_MAP.get(metric) + needs_conversion_to_hf = False + if _is_huggingface_data(input_data[0]): + ( + model, + input_data, + input_names, + output_structure, + output_type, + ) = convert_hf_model(model, input_data, **kwargs) + needs_conversion_to_hf = True + if _check_input_data(input_data): + input_data = DataManager(input_data) + else: + input_data = DataManager.from_iterable(input_data) + model_params = _extract_info_from_data( + model, + input_data, + dl_framework, + dynamic_info, + ) + converter = CrossConverter() + optimized_models = [] + with TemporaryDirectory() as tmp_dir: + tmp_dir = Path(tmp_dir) + models = converter.convert(model, model_params, tmp_dir, input_data) + + if ignore_compilers is None: + ignore_compilers = [] + else: + ignore_compilers = [ + ModelCompiler(compiler) for compiler in ignore_compilers + ] + for model in models: + input_tfms = MultiStageTransformation([]) + pipeline = build_pipeline_from_model( + model, + optimization_time, + metric_drop_ths, + metric, + config_file, + logger=logger, + ) + output_dict = pipeline.run( + model=model, + input_data=input_data, + metric_drop_ths=metric_drop_ths, + metric=metric, + output_library=dl_framework, + model_params=model_params, + input_tfms=input_tfms, + ignore_compilers=ignore_compilers, + optimization_time=optimization_time, + ) + ignore_compilers = output_dict["ignore_compilers"] + optimized_models.extend(output_dict["optimized_models"]) + + optimized_models.sort(key=lambda x: x[1], reverse=False) + optimal_model = optimized_models[0][0] + FEEDBACK_COLLECTOR.send_feedback() + if optimal_model is None: + logger.warning( + "No optimized model has been created. This is likely due to a " + "bug in Nebullvm. Please open an issue and report in details " + "your use case." 
+ ) + return optimal_model + if needs_conversion_to_hf: + from nebullvm.api.huggingface import HuggingFaceInferenceLearner + + optimal_model = HuggingFaceInferenceLearner( + core_inference_learner=optimal_model, + output_structure=output_structure, + input_names=input_names, + output_type=output_type, + ) + return optimal_model diff --git a/nebullvm/api/huggingface.py b/nebullvm/api/huggingface.py new file mode 100644 index 00000000..4b1685ca --- /dev/null +++ b/nebullvm/api/huggingface.py @@ -0,0 +1,397 @@ +from collections import OrderedDict +from typing import ( + Union, + Iterable, + List, + Dict, + Tuple, + Type, + Any, + Sequence, + Optional, +) + +import numpy as np +import torch + +from nebullvm.inference_learners import ( + InferenceLearnerWrapper, + PytorchBaseInferenceLearner, + LearnerMetadata, +) + +try: + from transformers import ( + PreTrainedModel, + ) + from transformers.tokenization_utils import PreTrainedTokenizer +except ImportError: + # add placeholders for function definition + PreTrainedModel = None + PreTrainedTokenizer = None + + +def _flatten_outputs( + outputs: Union[torch.Tensor, Iterable] +) -> List[torch.Tensor]: + new_outputs = [] + for output in outputs: + if isinstance(output, torch.Tensor): + new_outputs.append(output) + else: + flatten_list = _flatten_outputs(output) + new_outputs.extend(flatten_list) + return new_outputs + + +class _TransformerWrapper(torch.nn.Module): + """Class for wrappering the Transformers and give them an API compatible + with nebullvm. The class takes and input of the forward method positional + arguments and transform them in the input dictionaries needed by + transformers classes. At the end it also flattens their output. + """ + + def __init__( + self, + core_model: torch.nn.Module, + encoded_input: Dict[str, torch.Tensor], + ): + super().__init__() + self.core_model = core_model + self.inputs_types = OrderedDict() + for key, value in encoded_input.items(): + self.inputs_types[key] = value.dtype + + def forward(self, *args: torch.Tensor): + inputs = { + key: value for key, value in zip(self.inputs_types.keys(), args) + } + outputs = self.core_model(**inputs) + return tuple(_flatten_outputs(outputs.values())) + + +def _get_size_recursively( + tensor_tuple: Union[torch.Tensor, Tuple] +) -> List[int]: + if isinstance(tensor_tuple[0], torch.Tensor): + return [len(tensor_tuple)] + else: + inner_size = _get_size_recursively(tensor_tuple[0]) + return [len(tensor_tuple), *inner_size] + + +def _get_output_structure_from_text( + text: str, + model: PreTrainedModel, + tokenizer: PreTrainedTokenizer, + tokenizer_args: Dict, +) -> Tuple[OrderedDict, Type]: + """Function needed for saving in a dictionary the output structure of the + transformers model. + """ + encoded_input = tokenizer([text], **tokenizer_args) + output = model(**encoded_input) + structure = OrderedDict() + for key, value in output.items(): + if isinstance(value, torch.Tensor): + structure[key] = None + else: + size = _get_size_recursively(value) + structure[key] = size + return structure, type(output) + + +def _get_output_structure_from_dict( + input_example: Dict, + model: PreTrainedModel, +) -> Tuple[OrderedDict, Type]: + """Function needed for saving in a dictionary the output structure of the + transformers model. 
+ """ + output = model(**input_example) + structure = OrderedDict() + for key, value in output.items(): + if isinstance(value, torch.Tensor): + structure[key] = None + else: + size = _get_size_recursively(value) + structure[key] = size + return structure, type(output) + + +def _restructure_output( + output: Tuple[torch.Tensor], + structure: OrderedDict, + output_type: Any = None, +): + """Restructure the flatter output using the structure dictionary given as + input. + """ + output_dict = {} + idx = 0 + for key, value in structure.items(): + if value is None: + output_dict[key] = output[idx] + idx += 1 + else: + output_dict[key] = ( + np.array( + output[idx : int(np.prod(value)) + idx], # noqa E203 + dtype=object, + ) + .reshape(value) + .tolist() + ) + idx += np.prod(value) + if output_type is not None: + return output_type(**output_dict) + return output_dict + + +class HuggingFaceInferenceLearner(InferenceLearnerWrapper): + """Class wrapping an InferenceLearner model and giving to it the + huggingface interface. + + The class fuse both the InterfaceLearner and HuggingFace interfaces, giving + to the final user a model which can be used whit the prefered API without + the need of adapting the previous code. + + Attributes: + network_parameters (ModelParams): Model parameters of the model. + core_inference_learner (PytorchBaseInferenceLearner): Inference learner + built using the Pytorch interface. + output_structure (Dict): Original output structure of the HuggingFace + model. + input_names (List[str]): List of all the input keys used for the + original HuggingFace model. + output_type (Any, optional): Original output type of the HuggingFace + model. + """ + + def __init__( + self, + core_inference_learner: PytorchBaseInferenceLearner, + output_structure: OrderedDict, + input_names: List[str], + output_type: Any = None, + ): + super().__init__(core_inference_learner) + self.output_structure = output_structure + self.input_names = input_names + self.output_type = output_type + + def _save_wrapper_extra_info(self): + pass + + @staticmethod + def _load_wrapper_extra_info(builder_inputs: Dict) -> Dict: + return builder_inputs + + def run(self, *args, **kwargs) -> Any: + """Run the underlying optimized model for getting a prediction. + + The method has an hybrid interface. It accepts inputs either as + positional or keyword arguments. If only positional arguments are given + the method expects the inputs to be in the canonical + nebullvm interface. If only keyword arguments are given the method + expects them to be in the HuggingFace interface. Mixed representation + is not allowed and will result in an error. + """ + if len(args) > 0 and len(kwargs) > 0: + raise RuntimeError( + "Not allowed usage of the predict method. " + "Either the positional or the keyword arguments must be given." 
+ ) + if len(args) > 0: + return self.core_inference_learner(*args) + inputs = (kwargs.pop(name) for name in self.input_names) + outputs = self.core_inference_learner(*inputs) + return _restructure_output( + outputs, self.output_structure, self.output_type + ) + + def _get_extra_metadata_kwargs(self) -> Dict: + metadata_kwargs = { + "output_structure": self.output_structure, + "output_structure_keys": list(self.output_structure.keys()), + "input_names": self.input_names, + } + if self.output_type is not None: + metadata_kwargs.update( + { + "output_type": self.output_type.__name__, + "output_type_module": self.output_type.__module__, + } + ) + return metadata_kwargs + + @staticmethod + def _convert_metadata_to_inputs(metadata: LearnerMetadata) -> Dict: + # we need to guarantee the preservation of the output structure + # elements order. + output_structure = OrderedDict() + for key in metadata["output_structure_keys"]: + output_structure[key] = metadata["output_structure"][key] + + inputs = { + "output_structure": output_structure, + "input_names": metadata["input_names"], + } + if metadata["output_type"] is not None: + exec( + f"from {metadata['output_type_module']} " + f"import {metadata['output_type']}" + ) + inputs["output_type"] = eval(metadata["output_type"]) + return inputs + + +class _HFTextDataset(Sequence): + def __init__( + self, + input_texts: List, + ys: Optional[List], + keywords: List[str], + batch_size: int, + tokenizer: PreTrainedTokenizer, + tokenizer_args: Dict, + ): + self._input_texts = input_texts + self._ys = ys + self._bs = batch_size + self._keys = keywords + self._tokenizer = tokenizer + if self._tokenizer.pad_token is None: + self._tokenizer.pad_token = self._tokenizer.eos_token + _tokenizer_args = {"truncation": True, "padding": True} + _tokenizer_args.update(tokenizer_args) + self._tokenizer_args = _tokenizer_args + + def __getitem__(self, item: int): + pointer = self._bs * item + if pointer >= len(self._input_texts): + raise IndexError + mini_batch = self._input_texts[ + pointer : pointer + self._bs # noqa E203 + ] + if self._ys is not None: + mini_batch_y = self._ys[pointer : pointer + self._bs] # noqa E203 + else: + mini_batch_y = None + encoded_inputs = self._tokenizer(mini_batch, **self._tokenizer_args) + return tuple(encoded_inputs[key] for key in self._keys), mini_batch_y + + def __len__(self): + return len(self._input_texts) // self._bs + + +class _HFDictDataset(Sequence): + def __init__( + self, + input_data: List, + ys: Optional[List], + keywords: List[str], + batch_size: int, + ): + self._input_data = input_data + self._ys = ys + self._bs = batch_size + self._keys = keywords + + def __getitem__(self, item: int): + pointer = self._bs * item + if pointer >= len(self._input_data): + raise IndexError + mini_batch = self._input_data[ + pointer : pointer + self._bs # noqa E203 + ] + if self._ys is not None: + mini_batch_y = self._ys[pointer : pointer + self._bs] # noqa E203 + else: + mini_batch_y = None + return ( + tuple( + torch.stack( + [encoded_input[key] for encoded_input in mini_batch] + ) + for key in self._keys + ), + mini_batch_y, + ) + + def __len__(self): + return len(self._input_data) // self._bs + + +def convert_hf_model( + model: PreTrainedModel, + input_data: List, + tokenizer: Optional[PreTrainedTokenizer] = None, + tokenizer_args: Optional[Dict] = None, + batch_size: int = 1, + **kwargs, +): + if is_dict_type(input_data[0]): + # already tokenized data + if "labels" in input_data[0]: + labels = [data.pop("labels") for data in input_data] + 
else: + labels = None + input_example = input_data[0] + output_structure, output_type = _get_output_structure_from_dict( + input_example=input_example, + model=model, + ) + input_data = _HFDictDataset( + input_data=input_data, + ys=labels, + keywords=list(input_example.keys()), + batch_size=batch_size, + ) + + else: + assert tokenizer is not None, ( + "Tokenizer is needed when passing data in string format. Please " + "provide the tokenizer as keyword argument." + ) + if tokenizer_args is None: + tokenizer_args = {} + if not isinstance(input_data[0], str): + ys = [data[1] for data in input_data] + input_data = [data[0] for data in input_data] + else: + ys = None + output_structure, output_type = _get_output_structure_from_text( + text=input_data[0], + model=model, + tokenizer=tokenizer, + tokenizer_args=tokenizer_args, + ) + input_example = tokenizer(input_data) + input_data = _HFTextDataset( + input_texts=input_data, + ys=ys, + keywords=list(input_example.keys()), + batch_size=batch_size, + tokenizer=tokenizer, + tokenizer_args=tokenizer_args, + ) + wrapper_model = _TransformerWrapper( + core_model=model, encoded_input=input_example + ) + return ( + wrapper_model, + input_data, + list(wrapper_model.inputs_types.keys()), + output_structure, + output_type, + ) + + +def is_dict_type(data_sample: Any): + try: + data_sample.items() + except AttributeError: + return False + else: + return True diff --git a/nebullvm/api/frontend/utils.py b/nebullvm/api/utils.py similarity index 95% rename from nebullvm/api/frontend/utils.py rename to nebullvm/api/utils.py index 810a0f41..e2a26d52 100644 --- a/nebullvm/api/frontend/utils.py +++ b/nebullvm/api/utils.py @@ -4,7 +4,7 @@ QUANTIZATION_METRIC_MAP = { "accuracy": compute_accuracy_drop, - "precision": compute_relative_difference, + "numeric_precision": compute_relative_difference, } diff --git a/nebullvm/base.py b/nebullvm/base.py index 3564a805..ae45bb4a 100644 --- a/nebullvm/base.py +++ b/nebullvm/base.py @@ -106,8 +106,10 @@ class ModelCompiler(Enum): OPENVINO = "openvino" APACHE_TVM = "tvm" ONNX_RUNTIME = "onnxruntime" - TORCHVISION = "torchvision" + DEEPSPARSE = "deepsparse" + TORCHSCRIPT = "torchscript" TFLITE = "tflite" + BLADEDISC = "bladedisc" class QuantizationType(Enum): @@ -115,3 +117,8 @@ class QuantizationType(Enum): STATIC = "STATIC" QAT = "QAT" HALF = "HALF" + + +class OptimizationTime(Enum): + CONSTRAINED = "constrained" + UNCONSTRAINED = "unconstrained" diff --git a/nebullvm/compressors/__init__.py b/nebullvm/compressors/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/nebullvm/compressors/base.py b/nebullvm/compressors/base.py new file mode 100644 index 00000000..1e5b0b02 --- /dev/null +++ b/nebullvm/compressors/base.py @@ -0,0 +1,40 @@ +from abc import ABC, abstractmethod +from typing import Any, Optional, Dict, Callable, Tuple + +import yaml + +from nebullvm.utils.data import DataManager + + +class BaseCompressor(ABC): + def __init__(self, config_file: str = None): + self._config = self._read_config(config_file) + + @abstractmethod + def compress( + self, + model: Any, + train_input_data: DataManager, + eval_input_data: DataManager, + metric_drop_ths: float, + metric: Callable, + ) -> Tuple[Any, Optional[float]]: + raise NotImplementedError() + + def _read_config(self, config_file: Optional[str]) -> Dict: + config = self._get_default_config() + if config_file is not None: + with open(config_file, "r") as f: + data = yaml.load(f, Loader=yaml.CLoader) + config.update(data.get(self.config_key, {})) + return config 
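The `BaseCompressor` above merges the YAML section stored under its `config_key` into the defaults returned by `_get_default_config`. As a rough illustration of that contract, a concrete compressor could look like the sketch below; the subclass, config key, and YAML values are hypothetical and not part of this PR.

```python
from typing import Any, Callable, Dict, Optional, Tuple

from nebullvm.compressors.base import BaseCompressor
from nebullvm.utils.data import DataManager


class NoOpCompressor(BaseCompressor):
    """Toy compressor that returns the model unchanged (illustrative only)."""

    def compress(
        self,
        model: Any,
        train_input_data: DataManager,
        eval_input_data: DataManager,
        metric_drop_ths: float,
        metric: Callable,
    ) -> Tuple[Any, Optional[float]]:
        # Nothing is compressed, so the whole metric budget is passed on
        # to the following optimization steps.
        return model, metric_drop_ths

    @staticmethod
    def _get_default_config() -> Dict:
        # Defaults that a user-provided YAML file can override.
        return {"strength": 0.5}

    @property
    def config_key(self) -> str:
        return "noop"


# NoOpCompressor(config_file="compression.yaml") would read the `noop:`
# section of compression.yaml (e.g. `noop: {strength: 0.8}`) and merge it
# into the defaults above.
```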
+ + @staticmethod + @abstractmethod + def _get_default_config() -> Dict: + raise NotImplementedError + + @property + @abstractmethod + def config_key(self) -> str: + raise NotImplementedError() diff --git a/nebullvm/compressors/intel.py b/nebullvm/compressors/intel.py new file mode 100644 index 00000000..0801eea5 --- /dev/null +++ b/nebullvm/compressors/intel.py @@ -0,0 +1,173 @@ +import copy +from abc import ABC, abstractmethod +from pathlib import Path +from tempfile import mkdtemp +from typing import Dict, Any, Callable, Optional, Tuple + +import numpy as np +import tensorflow as tf +import torch.nn +import yaml +from torch.utils.data import DataLoader, Dataset + +from nebullvm.compressors.base import BaseCompressor +from nebullvm.utils.data import DataManager + +try: + from neural_compressor.experimental import Pruning +except ImportError: + pass + + +def _get_model_framework(model: Any) -> str: + if isinstance(model, torch.nn.Module): + return "torch" + elif isinstance(model, tf.Module): + return "tensorflow" + else: + return "numpy" + + +class IntelPruningCompressor(BaseCompressor, ABC): + def __init__(self, config_file: str = None): + super().__init__(config_file) + self._temp_dir = mkdtemp() + + @property + def config_key(self) -> str: + return "intel_pruning" + + @staticmethod + def _get_default_config() -> Dict: + # see https://github.com/intel/neural-compressor/blob/master/neural_compressor/conf/config.py # noqa + # for further details + config = { + "train": { + "optimizer": { + "Adam": { + "learning_rate": 0.001, + "beta_1": 0.9, + "beta_2": 0.999, + "epsilon": 1e-07, + "amsgrad": False, + }, + }, + "criterion": { + "SparseCategoricalCrossentropy": { + "reduction": "mean", + "from_logits": False, + }, + }, + "epoch": 10, + "start_epoch": 0, + "end_epoch": 10, + "execution_mode": "eager", # either eager or graph + "hostfile": None, # str for multinode training support + }, + "approach": { + "weight_compression": { + "initial_sparsity": 0.0, + "target_sparsity": 0.97, + "start_epoch": 0, + "end_epoch": 10, + }, + }, + } + return config + + def _prepare_config(self, model: Any): + pruning_config = copy.deepcopy(self._config) + config = { + "model": { + "name": model.__class__.name, + "framework": _get_model_framework(model), + }, + "device": "cpu", + "tuning": { + "random_seed": 1978, + "tensorboard": False, + "workspace": {"path": self._temp_dir}, + }, + "pruning": pruning_config, + } + path_file = Path(self._temp_dir) / "temp.yaml" + with open(path_file, "w") as f: + yaml.dump(config, f) + return path_file + + def compress( + self, + model: Any, + train_input_data: DataManager, + eval_input_data: DataManager, + metric_drop_ths: float, + metric: Callable, + ) -> Tuple[Any, Optional[float]]: + config_file = self._prepare_config(model) + prune = Pruning(config_file) + prune.model = model + prune.train_dataloader = self._get_dataloader(train_input_data) + compressed_model = prune.fit() + if compressed_model is None: + return compressed_model, None + error = self._compute_error( + model, compressed_model, eval_input_data, metric + ) + if error > metric_drop_ths: + return None, None + perf_loss_ths = metric_drop_ths - error + return compressed_model, perf_loss_ths + + @abstractmethod + def _compute_error( + self, + model: Any, + compressed_model: Any, + eval_input_data: DataManager, + metric: Callable, + ): + raise NotImplementedError + + @staticmethod + @abstractmethod + def _get_dataloader(input_data: DataManager): + raise NotImplementedError + + +class _IPCDataset(Dataset): + def 
__init__(self, input_data: DataManager): + self._input_data = input_data + self._internal_size = input_data[0][0][0].shape[0] + + def __getitem__(self, item): + ptr = item // self._internal_size + return sum(self._input_data[ptr], ()) + + def __len__(self): + last_el_size = self._input_data[-1][0][0].shape[0] + return self._internal_size * (len(self._input_data) - 1) + last_el_size + + +class TorchIntelPruningCompressor(IntelPruningCompressor): + @staticmethod + def _get_dataloader(input_data: DataManager): + bs = input_data[0][0][0].shape[0] + ds = _IPCDataset(input_data) + dl = DataLoader(ds, bs) + return dl + + def _compute_error( + self, + model: torch.nn.Module, + compressed_model: torch.nn.Module, + eval_input_data: DataManager, + metric: Callable, + ): + if len(eval_input_data) == 0: + return np.inf + metric_val = 0 + for inputs, y in eval_input_data: + pred_model = model(*inputs) + pred_compressed_model = compressed_model(*inputs) + metric_val += metric(pred_model, pred_compressed_model, y) + return metric_val / len(eval_input_data) diff --git a/nebullvm/compressors/scripts/neural_magic_training.py b/nebullvm/compressors/scripts/neural_magic_training.py new file mode 100644 index 00000000..ad9d422c --- /dev/null +++ b/nebullvm/compressors/scripts/neural_magic_training.py @@ -0,0 +1,373 @@ +import json +import logging +import os.path +from pathlib import Path +from tempfile import TemporaryDirectory +from typing import Tuple, List, Any, Dict + +import torch +from sparseml.onnx.optim import ModelAnalyzer, pruning_loss_sens_magnitude +from sparseml.pytorch.optim import ( + ScheduledModifierManager, +) +from sparseml.pytorch.sparsification import ( + EpochRangeModifier, + GMPruningModifier, +) +from sparseml.pytorch.utils import ModuleExporter +from sparsify.blueprints.utils import ( + default_epochs_distribution, + PruningModelEvaluator, + default_pruning_settings, +) +from sparsify.schemas import ProjectModelAnalysisSchema +from torch.nn import CrossEntropyLoss, MSELoss +from torch.optim import SGD +from tqdm.auto import tqdm + + +CRITERION_FNS = { + "CrossEntropy": CrossEntropyLoss(), + "MSE": MSELoss(), +} + + +logging.basicConfig( + format="%(asctime)s %(message)s", datefmt="%d/%m/%Y %I:%M:%S %p" +) +logger = logging.getLogger(__name__) +logger.setLevel(logging.INFO) + + +def _export_model_onnx( + model: torch.nn.Module, + save_path: Path, + model_name: str, + input_batch: Tuple, +): + exporter = ModuleExporter(model, output_dir=save_path) + with torch.no_grad(): + example_outputs = model(*input_batch) + exporter.export_onnx( + input_batch, name=model_name, example_outputs=example_outputs + ) + onnx_path = save_path / model_name + + return onnx_path + + +class RecipeBuilder: + def __init__(self, model_path): + self.model_path = model_path + + def _make_analysis(self): + analyzer = ModelAnalyzer(self.model_path) + self.analysis = ProjectModelAnalysisSchema().load(analyzer.dict()) + + def _compute_loss_sensitivity(self): + sensitivities = [] + parameters = [] + for i, node in enumerate(self.analysis["nodes"]): + if node["prunable"]: + sensitivities.append(node["prunable_equation_sensitivity"]) + parameters.append(node["prunable_params"]) + + loss_analysis = pruning_loss_sens_magnitude(self.model_path) + + results_model = loss_analysis.results_model + results = loss_analysis.results + + model = { + "baseline_measurement_key": ( + str(results_model.baseline_measurement_key) + ), + "measurements": { + str(key): val for key, val in results_model.averages.items() + }, + } + ops = [] + + 
for res in results: + ops.append( + { + "id": res.id_, + "name": res.name, + "index": res.index, + "baseline_measurement_key": ( + str(res.baseline_measurement_key) + ), + "measurements": { + str(key): val for key, val in res.averages.items() + }, + } + ) + + pruning = {"model": model, "ops": ops} + loss = {} + loss["baseline"] = {} + loss["pruning"] = pruning + + model = PruningModelEvaluator( + self.analysis, + None, + loss, + ) + model.eval_baseline(default_pruning_settings().sparsity) + model.eval_pruning(default_pruning_settings()) + + self.final_analysis = model.to_dict_values() + + def build_recipe(self, epochs_pruning_window=None, training_epochs=10): + self._make_analysis() + self._compute_loss_sensitivity() + + if epochs_pruning_window is None: + epochs = default_epochs_distribution(training_epochs) + else: + # TODO: set custom parameters + epochs = default_epochs_distribution(training_epochs) + epochs_dict = epochs._asdict() + epochs_dict.update(epochs_pruning_window) + epochs = epochs.__class__(**epochs_dict) + + mods = [ + EpochRangeModifier( + start_epoch=epochs.start_epoch, + end_epoch=epochs.end_epoch, + ) + ] + + node_weight_name_lookup = { + node["id"]: node["weight_name"] + for node in self.analysis["nodes"] + if node["prunable"] + } + + sparsity_to_params = {} + + nodes = self.final_analysis[0] + + for node in nodes: + sparsity = node["sparsity"] + node_id = node["node_id"] + weight_name = node_weight_name_lookup[node_id] + + if sparsity is None: + continue + + if sparsity not in sparsity_to_params: + sparsity_to_params[sparsity] = [] + + sparsity_to_params[sparsity].append(weight_name) + + for sparsity, params in sparsity_to_params.items(): + gm_pruning = GMPruningModifier( + init_sparsity=0.05, + final_sparsity=sparsity, + start_epoch=epochs.pruning_start_epoch, + end_epoch=epochs.pruning_end_epoch, + update_frequency=epochs.pruning_update_frequency, + params=params, + ) + + mods.append(gm_pruning) + + return ScheduledModifierManager(mods) + + +class PruningTrainer: + def __init__(self, model, bs): + self.data_loader = None + self.optimizer = None + self.model = model + self.batch_size = bs + + def _setup_training(self, loss_fn=None, lr=1e-3, momentum=0.9): + self.device = "cuda" if torch.cuda.is_available() else "cpu" + self.model.to(self.device) + if loss_fn is None: + loss_fn = CrossEntropyLoss() + else: + loss_fn = CRITERION_FNS.get(loss_fn, CrossEntropyLoss()) + self.criterion = loss_fn + self.optimizer = SGD(self.model.parameters(), lr=lr, momentum=momentum) + + def _run_model_one_epoch(self, train=False): + + if train: + self.model.train() + data_loader = self.train_data_loader + else: + self.model.eval() + data_loader = self.val_data_loader + + running_loss = 0.0 + + for step, (inputs, labels) in tqdm( + enumerate(data_loader), total=len(data_loader) + ): + inputs = tuple(t.to(self.device) for t in inputs) + if not isinstance(labels, torch.Tensor): + labels = torch.tensor(labels) + if len(labels.shape) == 0: + labels = labels.unsqueeze(0) + labels = labels.to(self.device) + + if train: + self.optimizer.zero_grad() + + outputs = self.model( + *inputs + ) # model returns logits and softmax as a tuple + loss = self.criterion(outputs, labels) + + if train: + loss.backward() + self.optimizer.step() + + running_loss += loss.item() + + loss = running_loss / (len(data_loader) + 1e-5) + return loss + + def train( + self, manager, train_data_loader, val_data_loader, **train_kwargs + ): + self.train_data_loader = train_data_loader + self.val_data_loader = val_data_loader 
+ self._setup_training(**train_kwargs) + self.optimizer = manager.modify( + self.model, + self.optimizer, + steps_per_epoch=len(self.train_data_loader), + ) + self.model.train() + # Run model pruning + epoch = manager.min_epochs + while epoch < manager.max_epochs: + # run training loop + epoch_name = "{}/{}".format(epoch + 1, manager.max_epochs) + logger.info("Running Training Epoch {}".format(epoch_name)) + train_loss = self._run_model_one_epoch(train=True) + logger.info( + ("Training Epoch: {}\nTraining Loss: {}\n").format( + epoch_name, train_loss + ) + ) + + # run validation loop + logger.info("Running Validation Epoch {}".format(epoch_name)) + val_loss = self._run_model_one_epoch() + logger.info( + "Validation Epoch: {}\nVal Loss: {}\n".format( + epoch_name, val_loss + ) + ) + + epoch += 1 + + manager.finalize(self.model) + + return self.model + + +def _load_config(config_file: str): + with open(config_file, "r") as f: + config = json.load(f) + return config + + +def _load_data(data_dir: str): + data_dir = Path(data_dir) + return [torch.load(input_path) for input_path in data_dir.glob("*.pt")] + + +def _load_model(model_file: str): + if os.path.isdir(model_file): + path = Path(model_file) + module_file = path / "module.py" + with open(module_file, "r") as f: + module_str = f.read() + exec(module_str) + model = eval("NebullvmFxModule")() + model.load_state_dict(torch.load(path / "state_dict.pt")) + else: + model = torch.load(model_file) + return model + + +def _train_model( + model: torch.nn.Module, + train_data: List[Tuple[Tuple, Any]], + eval_data: List[Tuple[Tuple, Any]], + epochs_pruning_window: Dict = None, + training_epochs: int = 10, + lr: float = 1e-3, + momentum: float = 0.9, +): + batch_size = train_data[0][0][0].shape[0] + with TemporaryDirectory() as tmp_dir: + onnx_path = _export_model_onnx( + model, Path(tmp_dir), "model.onnx", train_data[0][0] + ) + onnx_path = onnx_path.as_posix() + + recipe = RecipeBuilder(onnx_path) + # TODO: implement custom parameters support + manager = recipe.build_recipe( + epochs_pruning_window=epochs_pruning_window, + training_epochs=training_epochs, + ) + trainer = PruningTrainer(model, batch_size) + pruned_model = trainer.train( + manager, train_data, eval_data, lr=lr, momentum=momentum + ) + return pruned_model + + +def _save_model(model: torch.nn.Module, path: str): + if path.endswith(".pt"): + torch.save(model, path) + else: + torch.save(model.state_dict(), Path(path) / "pruned_state_dict.pt") + + +def main( + model_file: str, + train_data_dir: str, + eval_data_dir: str, + config_file: str, + out_file: str, +): + config = _load_config(config_file) + model = _load_model(model_file) + train_data = _load_data(train_data_dir) + eval_data = _load_data(eval_data_dir) + pruned_model = _train_model(model, train_data, eval_data, **config) + _save_model(pruned_model, out_file) + + +if __name__ == "__main__": + from argparse import ArgumentParser + + parser = ArgumentParser() + parser.add_argument("--model", help="The model to be pruned.") + parser.add_argument( + "--train_dir", + help="The directory contained the pickled training data.", + ) + parser.add_argument( + "--eval_dir", help="The directory contained the pickled test data." + ) + parser.add_argument("--config", help="The config file.") + parser.add_argument( + "--pruned_model", help="Path where storing the pruned model." 
+ ) + args = parser.parse_args() + main( + model_file=args.model, + train_data_dir=args.train_dir, + eval_data_dir=args.eval_dir, + config_file=args.config, + out_file=args.pruned_model, + ) diff --git a/nebullvm/compressors/sparseml.py b/nebullvm/compressors/sparseml.py new file mode 100644 index 00000000..6af39719 --- /dev/null +++ b/nebullvm/compressors/sparseml.py @@ -0,0 +1,153 @@ +import json +from pathlib import Path +from tempfile import TemporaryDirectory +from typing import Any, Callable, Tuple, Optional, Dict + +import numpy as np +import torch +import torch.fx + +from nebullvm.compressors.base import BaseCompressor +from nebullvm.utils.data import DataManager +from nebullvm.utils.venv import run_in_different_venv + +FX_MODULE_NAME = "NebullvmFxModule" + + +def _save_with_torch_fx(model: torch.nn.Module, path: Path): + traced_model = torch.fx.symbolic_trace(model) + traced_model.to_folder(path, FX_MODULE_NAME) + + +def _load_with_torch_fx(path: Path): + module_file = path / "module.py" + with open(module_file, "r") as f: + module_str = f.read() + exec(module_str) + model = eval(FX_MODULE_NAME)() + model.load_state_dict(torch.load(path / "pruned_state_dict.pt")) + return model + + +def _save_model(model: torch.nn.Module, path: Path): + try: + _save_with_torch_fx(model, path) + except Exception: + torch.save(model, path / "model.pt") + return path / "model.pt" + else: + return path + + +def _load_model(path: Path): + if path.is_file(): + return torch.load(path) + else: + return _load_with_torch_fx(path) + + +def _save_dataset(input_data: DataManager, path: Path): + path.mkdir(exist_ok=True) + for i, x in enumerate(input_data): + torch.save(x, path / f"input_{i}.pt") + + +def _save_json(dictionary: Dict, path: Path): + with open(path, "w") as f: + json.dump(dictionary, f) + + +def _write_requirements_file(path: Path): + requirements = "torch<=1.9\ntorchvision<=0.10\nsparseml\nsparsify\ntqdm" + with open(path, "w") as f: + f.write(requirements) + + +class SparseMLCompressor(BaseCompressor): + def compress( + self, + model: torch.nn.Module, + train_input_data: DataManager, + eval_input_data: DataManager, + metric_drop_ths: float, + metric: Callable, + ) -> Tuple[Any, Optional[float]]: + script_path = ( + Path(__file__).parent / "scripts/neural_magic_training.py" + ) + with TemporaryDirectory(dir=".") as tmp_dir: + tmp_dir = Path(tmp_dir) + requirements_file = tmp_dir / "requirements.txt" + model_path = _save_model(model, tmp_dir) + training_data_dir = tmp_dir / "train" + eval_data_dir = tmp_dir / "eval" + config_file = tmp_dir / "config.json" + pruned_model_path = ( + tmp_dir / "pruned_model.pt" + if model_path.is_file() + else tmp_dir + ) + + _write_requirements_file(requirements_file) + _save_dataset(train_input_data, training_data_dir) + _save_dataset(eval_input_data, eval_data_dir) + _save_json(self._config, config_file) + + run_in_different_venv( + str(requirements_file), + str(script_path), + "--model", + f"{model_path}", + "--train_dir", + f"{training_data_dir}", + "--eval_dir", + f"{eval_data_dir}", + "--config", + f"{config_file}", + "--pruned_model", + f"{pruned_model_path}", + ) + + pruned_model = _load_model(pruned_model_path) + + error = self._compute_error( + model, pruned_model, eval_input_data, metric + ) + if error > metric_drop_ths: + return None, None + new_metric_ths = metric_drop_ths - error + + return pruned_model, new_metric_ths + + @staticmethod + @torch.no_grad() + def _compute_error( + model: torch.nn.Module, + pruned_model: torch.nn.Module, + 
eval_input_data: DataManager, + metric: Callable, + ) -> float: + if len(eval_input_data) == 0: + return np.inf + metric_val = 0.0 + model.eval() + pruned_model.eval() + for inputs, y in eval_input_data: + model_pred = model(*inputs) + pruned_pred = pruned_model(*inputs) + metric_val += metric(model_pred, pruned_pred, y) + return metric_val / len(eval_input_data) + + @staticmethod + def _get_default_config() -> Dict: + return { + "training_epochs": 10, + "epochs_pruning_window": {"start_epoch": 0, "end_epoch": 10}, + "loss_fn": "CrossEntropy", + "lr": 1e-3, + "momentum": 0.9, + } + + @property + def config_key(self) -> str: + return "sparseml" diff --git a/nebullvm/config.py b/nebullvm/config.py index fb7420a4..10aa778e 100644 --- a/nebullvm/config.py +++ b/nebullvm/config.py @@ -1,7 +1,7 @@ import os -VERSION = "0.3.2" +VERSION = "0.4.0" LEARNER_METADATA_FILENAME = "metadata.json" NO_COMPILER_INSTALLATION = int(os.getenv("NO_COMPILER_INSTALLATION", "0")) > 0 ONNX_OPSET_VERSION = 13 @@ -29,7 +29,6 @@ ONNX_FILENAMES = {"model_name": "model.onnx"} CUDA_PROVIDERS = [ - "TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider", ] diff --git a/nebullvm/converters/converters.py b/nebullvm/converters/converters.py index 89588a58..71fe19ad 100644 --- a/nebullvm/converters/converters.py +++ b/nebullvm/converters/converters.py @@ -1,6 +1,7 @@ +import copy from abc import abstractmethod, ABC from pathlib import Path -from typing import Any +from typing import Any, List import tensorflow as tf from torch.nn import Module @@ -26,7 +27,13 @@ def __init__(self, model_name: str = None): self.model_name = model_name or "temp" @abstractmethod - def convert(self, model: Any, model_params: ModelParams, save_path: Path): + def convert( + self, + model: Any, + model_params: ModelParams, + save_path: Path, + input_data: DataManager = None, + ): raise NotImplementedError @@ -89,3 +96,36 @@ def convert( f"The ONNX conversion from {type(model)} hasn't " f"been implemented yet!" 
) + + +class CrossConverter(BaseConverter): + ONNX_EXTENSION = ".onnx" + TORCH_EXTENSION = ".pt" + TF_EXTENSION = ".pb" + + def convert( + self, + model: Any, + model_params: ModelParams, + save_path: Path, + input_data: DataManager = None, + ) -> List[Any]: + # TODO: Add cross conversion torch-tf + onnx_path = save_path / f"{self.model_name}{self.ONNX_EXTENSION}" + if isinstance(model, Module): + convert_torch_to_onnx( + torch_model=copy.deepcopy(model), + model_params=model_params, + output_file_path=onnx_path, + input_data=input_data, + ) + return [model, str(onnx_path)] + elif isinstance(model, tf.Module): + convert_tf_to_onnx( + model=copy.deepcopy(model), + output_file_path=onnx_path, + ) + return [model, str(onnx_path)] + + else: + return [model] diff --git a/nebullvm/inference_learners/blade_disc.py b/nebullvm/inference_learners/blade_disc.py new file mode 100644 index 00000000..4fc0f016 --- /dev/null +++ b/nebullvm/inference_learners/blade_disc.py @@ -0,0 +1,27 @@ +from typing import Optional + +from torch.jit import ScriptModule + +from nebullvm.base import ModelParams +from nebullvm.inference_learners.pytorch import ( + PytorchBackendInferenceLearner, +) +from nebullvm.transformations.base import MultiStageTransformation +from nebullvm.utils.data import DataManager + + +class BladeDISCInferenceLearner(PytorchBackendInferenceLearner): + @classmethod + def from_torch_model( + cls, + model: ScriptModule, + network_parameters: ModelParams, + input_tfms: Optional[MultiStageTransformation] = None, + input_data: DataManager = None, + ): + return cls( + torch_model=model, + network_parameters=network_parameters, + input_tfms=input_tfms, + input_data=input_data, + ) diff --git a/nebullvm/inference_learners/deepsparse.py b/nebullvm/inference_learners/deepsparse.py new file mode 100644 index 00000000..4917a879 --- /dev/null +++ b/nebullvm/inference_learners/deepsparse.py @@ -0,0 +1,180 @@ +import os +import shutil +import warnings +from abc import ABC +from pathlib import Path +from typing import Union, List, Generator, Tuple, Dict, Type + +import numpy as np +import torch + +from nebullvm.base import DeepLearningFramework, ModelParams +from nebullvm.config import ONNX_FILENAMES +from nebullvm.inference_learners.base import ( + BaseInferenceLearner, + LearnerMetadata, + PytorchBaseInferenceLearner, +) +from nebullvm.installers.installers import install_deepsparse +from nebullvm.transformations.base import MultiStageTransformation + +try: + from deepsparse import compile_model, cpu +except ImportError: + import platform + + os_ = platform.system() + if os_ != "Darwin": + warnings.warn( + "No deepsparse installation found. Trying to install it..." + ) + install_deepsparse() + from deepsparse import compile_model, cpu + else: + warnings.warn( + "No valid deepsparse installation found. " + "The compiler won't be used in the following." + ) + + +class DeepSparseInferenceLearner(BaseInferenceLearner, ABC): + """Model optimized on CPU using DeepSparse. DeepSparse is an engine + accelerating sparse computations on CPUs. + + Attributes: + network_parameters (ModelParams): The model parameters as batch + size, input and output sizes. + onnx_path (str or Path): Path to the onnx model. + input_names (List[str]): Input names used when the onnx model + was produced. + output_names (List[str]): Output names used when the onnx model + was produced. 
+ """ + + def __init__( + self, + onnx_path: Union[str, Path], + input_names: List[str], + output_names: List[str], + **kwargs, + ): + super().__init__(**kwargs) + self.onnx_path = self._store_file(onnx_path) + + # Compile model + cores_per_socket, _, _ = cpu.cpu_details() + # Define the number of cores to use, by default it will make use of + # all physical cores on the system + num_cores = cores_per_socket + batch_size = kwargs["network_parameters"].batch_size + self.engine = compile_model(onnx_path, batch_size, num_cores) + + self.input_names = input_names + self.output_names = output_names + + def save(self, path: Union[str, Path], **kwargs): + """Save the model. + + Args: + path (Path or str): Path to the directory where the model will + be stored. + kwargs (Dict): Dictionary of key-value pairs that will be saved in + the model metadata file. + """ + metadata = LearnerMetadata.from_model( + self, + input_names=self.input_names, + output_names=self.output_names, + **kwargs, + ) + metadata.save(path) + + shutil.copy( + self.onnx_path, + Path(path) / ONNX_FILENAMES["model_name"], + ) + + @classmethod + def load(cls, path: Union[Path, str], **kwargs): + """Load the model. + + Args: + path (Path or str): Path to the directory where the model is + stored. + kwargs (Dict): Dictionary of additional arguments for consistency + with other Learners. + + Returns: + DeepSparseInferenceLearner: The optimized model. + """ + if len(kwargs) > 0: + warnings.warn( + f"No extra keywords expected for the load method. " + f"Got {kwargs}." + ) + onnx_path = os.path.join(str(path), ONNX_FILENAMES["model_name"]) + metadata = LearnerMetadata.read(path) + input_tfms = metadata.input_tfms + if input_tfms is not None: + input_tfms = MultiStageTransformation.from_dict( + metadata.input_tfms + ) + return cls( + input_tfms=input_tfms, + network_parameters=ModelParams(**metadata.network_parameters), + onnx_path=onnx_path, + input_names=metadata["input_names"], + output_names=metadata["output_names"], + ) + + def _predict_arrays(self, input_arrays: Generator[np.ndarray, None, None]): + inputs = [array for array in input_arrays] + outputs = self.engine(inputs) + return outputs + + +class PytorchDeepSparseInferenceLearner( + DeepSparseInferenceLearner, PytorchBaseInferenceLearner +): + """Model optimized on CPU using DeepSparse. DeepSparse is an engine + accelerating sparse computations on CPUs. + + Attributes: + network_parameters (ModelParams): The model parameters as batch + size, input and output sizes. + onnx_path (str or Path): Path to the onnx model. + input_names (List[str]): Input names used when the onnx model + was produced. + output_names (List[str]): Output names used when the onnx model + was produced. + """ + + def run(self, *input_tensors: torch.Tensor) -> Tuple[torch.Tensor]: + """Predict on the input tensors. + + Note that the input tensors must be on the same batch. If a sequence + of tensors is given when the model is expecting a single input tensor + (with batch size >= 1) an error is raised. + + Args: + input_tensors (Tuple[Tensor]): Input tensors belonging to the same + batch. The tensors are expected having dimensions + (batch_size, dim1, dim2, ...). + + Returns: + Tuple[Tensor]: Output tensors. Note that the output tensors does + not correspond to the prediction on the input tensors with a + 1 to 1 mapping. In fact the output tensors are produced as the + multiple-output of the model given a (multi-) tensor input. 
+ """ + input_arrays = ( + input_tensor.cpu().detach().numpy() + for input_tensor in input_tensors + ) + outputs = self._predict_arrays(input_arrays) + return tuple(torch.from_numpy(output) for output in outputs) + + +DEEPSPARSE_INFERENCE_LEARNERS: Dict[ + DeepLearningFramework, Type[DeepSparseInferenceLearner] +] = {DeepLearningFramework.PYTORCH: PytorchDeepSparseInferenceLearner} diff --git a/nebullvm/installers/install_bladedisc.sh b/nebullvm/installers/install_bladedisc.sh new file mode 100644 index 00000000..56c2a216 --- /dev/null +++ b/nebullvm/installers/install_bladedisc.sh @@ -0,0 +1,31 @@ +#!/bin/bash + +# Set non interactive mode for apt-get +export DEBIAN_FRONTEND=noninteractive + +if [ ! -d "BladeDISC" ] +then + git clone https://github.com/alibaba/BladeDISC.git +fi + +cd BladeDISC && git submodule update --init --recursive + +apt update && sudo apt install bazel-5.1.1 + +if [ $1 == "true" ] +then +cd pytorch_blade && bash ./scripts/build_pytorch_blade.sh +else + if [[ $OSTYPE == "darwin"* ]] + then + export TORCH_BLADE_BUILD_WITH_CUDA_SUPPORT=OFF + export TORCH_BLADE_CI_BUILD_TORCH_VERSION=1.10.0+aarch64 + cd pytorch_blade && bash ./scripts/build_pytorch_blade.sh + else + export TORCH_BLADE_BUILD_WITH_CUDA_SUPPORT=OFF + export TORCH_BLADE_CI_BUILD_TORCH_VERSION=1.8.1+cpu + cd pytorch_blade && bash ./scripts/build_pytorch_blade.sh + fi +fi + +cd ../.. diff --git a/nebullvm/installers/installers.py b/nebullvm/installers/installers.py index b538f8da..d110d805 100644 --- a/nebullvm/installers/installers.py +++ b/nebullvm/installers/installers.py @@ -55,6 +55,16 @@ def install_tvm(working_dir: str = None): ) +def install_bladedisc(): + has_cuda = False + if torch.cuda.is_available(): + has_cuda = True + + path = Path(__file__).parent + installation_file = str(path / "install_bladedisc.sh") + subprocess.Popen(["bash", installation_file, str(has_cuda).lower()]) + + def install_tensor_rt(): """Helper function for installing TensorRT. 
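Putting the DeepSparse learner above to work is mostly a matter of calling `run` with torch tensors and, if needed, round-tripping it through `save`/`load`. A minimal sketch, assuming `learner` is a `PytorchDeepSparseInferenceLearner` already produced by the DeepSparse path; the input shape and target directory are placeholders.

```python
import torch

from nebullvm.inference_learners.deepsparse import (
    PytorchDeepSparseInferenceLearner,
)


def run_and_persist(learner: PytorchDeepSparseInferenceLearner, save_dir: str):
    # The tensors must share the batch size the engine was compiled for;
    # the shape below is purely illustrative.
    x = torch.randn(1, 3, 224, 224)
    outputs = learner.run(x)  # tuple of torch.Tensor

    # `save` copies the ONNX file and metadata into `save_dir` (assumed to
    # be an existing directory); `load` rebuilds the DeepSparse engine.
    learner.save(save_dir)
    restored = PytorchDeepSparseInferenceLearner.load(save_dir)
    return outputs, restored
```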
@@ -106,3 +116,9 @@ def install_onnxruntime(): # install requirements for onnxruntime.transformers cmd = ["pip3", "install", "coloredlogs", "sympy"] subprocess.run(cmd) + + +def install_deepsparse(): + """Helper function for installing DeepSparse.""" + cmd = ["pip3", "install", "deepsparse"] + subprocess.run(cmd) diff --git a/nebullvm/optimizers/__init__.py b/nebullvm/optimizers/__init__.py index 3d67fdbc..f5c6dcee 100644 --- a/nebullvm/optimizers/__init__.py +++ b/nebullvm/optimizers/__init__.py @@ -1,7 +1,20 @@ +from typing import Dict, Type + +from nebullvm.base import ModelCompiler from nebullvm.optimizers.base import BaseOptimizer # noqa F401 +from nebullvm.optimizers.deepsparse import DeepSparseOptimizer # noqa F401 from nebullvm.optimizers.onnx import ONNXOptimizer # noqa F401 from nebullvm.optimizers.openvino import OpenVinoOptimizer # noqa F401 from nebullvm.optimizers.tensor_rt import TensorRTOptimizer # noqa F401 from nebullvm.optimizers.tvm import ApacheTVMOptimizer # noqa F401 __all__ = [k for k in globals().keys() if not k.startswith("_")] + + +COMPILER_TO_OPTIMIZER_MAP: Dict[ModelCompiler, Type[BaseOptimizer]] = { + ModelCompiler.APACHE_TVM: ApacheTVMOptimizer, + ModelCompiler.OPENVINO: OpenVinoOptimizer, + ModelCompiler.TENSOR_RT: TensorRTOptimizer, + ModelCompiler.ONNX_RUNTIME: ONNXOptimizer, + ModelCompiler.DEEPSPARSE: DeepSparseOptimizer, +} diff --git a/nebullvm/optimizers/base.py b/nebullvm/optimizers/base.py index ec69556d..874fbb28 100644 --- a/nebullvm/optimizers/base.py +++ b/nebullvm/optimizers/base.py @@ -22,9 +22,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[BaseInferenceLearner]: raise NotImplementedError diff --git a/nebullvm/optimizers/blade_disc.py b/nebullvm/optimizers/blade_disc.py new file mode 100644 index 00000000..3a107616 --- /dev/null +++ b/nebullvm/optimizers/blade_disc.py @@ -0,0 +1,159 @@ +import warnings + +from collections import Callable +from typing import Optional + +import torch.nn + +from nebullvm.base import DeepLearningFramework, ModelParams, QuantizationType +from nebullvm.config import NO_COMPILER_INSTALLATION +from nebullvm.inference_learners.blade_disc import BladeDISCInferenceLearner +from nebullvm.optimizers import BaseOptimizer +from nebullvm.optimizers.quantization.pytorch import quantize_torch +from nebullvm.optimizers.quantization.utils import ( + check_quantization, + check_precision, +) +from nebullvm.transformations.base import MultiStageTransformation +from nebullvm.utils.data import DataManager +from nebullvm.utils.onnx import convert_to_target_framework +from nebullvm.utils.torch import create_model_inputs_torch, run_torch_model + +try: + import torch_blade +except ImportError: + # TODO: Remove the False flag for allowing BladeDISC to be installed by + # the Auto-Installer. + if False and not NO_COMPILER_INSTALLATION: + warnings.warn( + "No valid BladeDISC installation has been found. " + "Trying to re-install it from source." + ) + from nebullvm.installers.installers import install_bladedisc + + install_bladedisc() + import torch_blade + else: + warnings.warn( + "No BladeDISC library detected. " + "The BladeDISC Inference learner should not be used." 
+ ) + + +class BladeDISCOptimizer(BaseOptimizer): + """Optimizer working directly on the pytorch backend, with no need of a + conversion to ONNX. The model will be finally compiled using torchscript. + For avoiding un-wanted modification to the input model models are copied + before being optimized. + + Attributes: + logger (Logger, optional): Optional logger for logging optimization + information. + """ + + def optimize( + self, + model: torch.nn.Module, + output_library: DeepLearningFramework, + model_params: ModelParams, + input_tfms: MultiStageTransformation = None, + metric_drop_ths: float = None, + quantization_type: QuantizationType = None, + metric: Callable = None, + input_data: DataManager = None, + ) -> Optional[BladeDISCInferenceLearner]: + """Optimize the input model using pytorch built-in techniques. + + Args: + model (torch.nn.Module): The pytorch model. For avoiding un-wanted + modifications to the original model, it will be copied in the + method. + output_library (DeepLearningFramework): Output framework. At the + current stage just PYTORCH is supported. + model_params (ModelParams): Model parameters. + input_tfms (MultiStageTransformation, optional): Transformations + to be performed to the model's input tensors in order to + get the prediction. + metric_drop_ths (float, optional): Threshold for the accepted drop + in terms of precision. Any optimized model with an higher drop + will be ignored. + quantization_type (QuantizationType, optional): The desired + quantization algorithm to be used. + metric (Callable, optional): If given it should + compute the difference between the quantized and the normal + prediction. + input_data (DataManager, optional): User defined data. + + Returns: + BladeDISCInferenceLearner: Model optimized for inference. + """ + self._log( + f"Optimizing with {self.__class__.__name__} and " + f"q_type: {quantization_type}." + ) + assert output_library is DeepLearningFramework.PYTORCH, ( + "Other APIs than the Pytorch one are not supported " + "for the Pytorch Backend yet." 
+ ) + check_quantization(quantization_type, metric_drop_ths) + if metric_drop_ths is not None: + if input_data is None: + input_data_torch = [ + tuple( + create_model_inputs_torch( + model_params.batch_size, model_params.input_infos + ) + ) + ] + else: + input_data_torch, ys = input_data.get_numpy_list( + 300, with_ys=True + ) + input_data_torch = [ + tuple( + convert_to_target_framework(t, output_library) + for t in data_tuple + ) + for data_tuple in input_data_torch + ] + output_data_torch = [ + tuple(run_torch_model(model, list(input_tensors))) + for input_tensors in input_data_torch + ] + model, input_tfms = quantize_torch( + model, quantization_type, input_tfms, input_data_torch + ) + + with torch.no_grad(): + model = torch_blade.optimize( + model, + allow_tracing=True, + model_inputs=tuple((input_data.get_list(1)[0])) + if input_data is not None + else tuple( + create_model_inputs_torch( + model_params.batch_size, model_params.input_infos + ) + ), + ) + + learner = BladeDISCInferenceLearner.from_torch_model( + model, + network_parameters=model_params, + input_tfms=input_tfms, + input_data=list(input_data.get_list(1)[0]) + if input_data is not None + else None, + ) + if metric_drop_ths is not None: + is_valid = check_precision( + learner, + input_data_torch, + output_data_torch, + metric_drop_ths, + metric_func=metric, + ys=ys, + ) + if not is_valid: + return None + return learner diff --git a/nebullvm/optimizers/deepsparse.py b/nebullvm/optimizers/deepsparse.py new file mode 100644 index 00000000..7668b646 --- /dev/null +++ b/nebullvm/optimizers/deepsparse.py @@ -0,0 +1,51 @@ +from pathlib import Path +from tempfile import TemporaryDirectory +from typing import Optional, Callable + +import torch + +from nebullvm.base import ModelParams, DeepLearningFramework, QuantizationType +from nebullvm.converters import ONNXConverter +from nebullvm.inference_learners.deepsparse import ( + DEEPSPARSE_INFERENCE_LEARNERS, + DeepSparseInferenceLearner, +) +from nebullvm.optimizers import BaseOptimizer +from nebullvm.transformations.base import MultiStageTransformation +from nebullvm.utils.data import DataManager +from nebullvm.utils.onnx import ( + get_input_names, + get_output_names, +) + + +class DeepSparseOptimizer(BaseOptimizer): + def optimize( + self, + model: torch.nn.Module, + output_library: DeepLearningFramework, + model_params: ModelParams, + input_tfms: MultiStageTransformation = None, + metric_drop_ths: float = None, + quantization_type: QuantizationType = None, + metric: Callable = None, + input_data: DataManager = None, + ) -> Optional[DeepSparseInferenceLearner]: + if quantization_type is not None: + return + + with TemporaryDirectory() as tmp_dir: + converter = ONNXConverter() + onnx_pruned_path = Path(tmp_dir) / "model_pruned.onnx" + converter.convert( + model, model_params, onnx_pruned_path, input_data + ) + + learner = DEEPSPARSE_INFERENCE_LEARNERS[output_library]( + input_tfms=input_tfms, + network_parameters=model_params, + onnx_path=onnx_pruned_path, + input_names=get_input_names(str(onnx_pruned_path)), + output_names=get_output_names(str(onnx_pruned_path)), + ) + return learner diff --git a/nebullvm/optimizers/extra.py b/nebullvm/optimizers/extra.py index fe74274e..49d20793 100644 --- a/nebullvm/optimizers/extra.py +++ b/nebullvm/optimizers/extra.py @@ -37,14 +37,14 @@ class HuggingFaceOptimizer(BaseOptimizer): def __init__( self, hugging_face_params: Dict, - perf_loss_ths: float = None, - perf_metric: Callable = None, + metric_drop_ths: float = None, + metric: Callable = 
None, logger: Logger = None, ): super(HuggingFaceOptimizer, self).__init__(logger) self.hf_params = hugging_face_params - self.perf_loss_ths = perf_loss_ths - self.perf_metric = perf_metric + self.perf_loss_ths = metric_drop_ths + self.perf_metric = metric self.q_type = QuantizationType.HALF def optimize( @@ -53,9 +53,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[ONNXInferenceLearner]: self._log( @@ -63,7 +63,7 @@ def optimize( f"q_type: {quantization_type}." ) optimized_model = optimizer.optimize_model(model, **self.hf_params) - if perf_loss_ths is not None: + if metric_drop_ths is not None: if quantization_type is not QuantizationType.HALF: return None optimized_model.convert_float_to_float16() @@ -82,7 +82,7 @@ def optimize( if input_data is not None else None, ) - if perf_loss_ths is not None: + if metric_drop_ths is not None: # TODO: Add dataset and metric from user if input_data is None: inputs = [learner.get_inputs_example()] @@ -100,8 +100,8 @@ def optimize( learner, inputs, base_outputs, - perf_loss_ths, - metric_func=perf_metric, + metric_drop_ths, + metric_func=metric, ys=ys, ) if not is_valid: diff --git a/nebullvm/optimizers/multi_compiler.py b/nebullvm/optimizers/multi_compiler.py index 81b2e3a9..957ed8a1 100644 --- a/nebullvm/optimizers/multi_compiler.py +++ b/nebullvm/optimizers/multi_compiler.py @@ -5,9 +5,7 @@ from typing import Dict, Type, Tuple, Callable, List import uuid -import cpuinfo import numpy as np -import torch from tqdm import tqdm from nebullvm.base import ( @@ -21,48 +19,19 @@ from nebullvm.measure import compute_optimized_running_time from nebullvm.optimizers import ( BaseOptimizer, - TensorRTOptimizer, - ApacheTVMOptimizer, - OpenVinoOptimizer, - ONNXOptimizer, + COMPILER_TO_OPTIMIZER_MAP, ) from nebullvm.transformations.base import MultiStageTransformation +from nebullvm.utils.compilers import select_compilers_from_hardware_onnx from nebullvm.utils.data import DataManager from nebullvm.utils.feedback_collector import FEEDBACK_COLLECTOR -COMPILER_TO_OPTIMIZER_MAP: Dict[ModelCompiler, Type[BaseOptimizer]] = { - ModelCompiler.APACHE_TVM: ApacheTVMOptimizer, - ModelCompiler.OPENVINO: OpenVinoOptimizer, - ModelCompiler.TENSOR_RT: TensorRTOptimizer, - ModelCompiler.ONNX_RUNTIME: ONNXOptimizer, -} OPTIMIZER_TO_COMPILER_MAP: Dict[Type[BaseOptimizer], ModelCompiler] = dict( zip(COMPILER_TO_OPTIMIZER_MAP.values(), COMPILER_TO_OPTIMIZER_MAP.keys()) ) -def _tvm_is_available() -> bool: - try: - import tvm # noqa F401 - - return True - except ImportError: - return False - - -def select_compilers_from_hardware(): - compilers = [ModelCompiler.ONNX_RUNTIME] - if _tvm_is_available(): - compilers.append(ModelCompiler.APACHE_TVM) - if torch.cuda.is_available(): - compilers.append(ModelCompiler.TENSOR_RT) - cpu_raw_info = cpuinfo.get_cpu_info()["brand_raw"].lower() - if "intel" in cpu_raw_info: - compilers.append(ModelCompiler.OPENVINO) - return compilers - - def _optimize_with_compiler( compiler: ModelCompiler, logger: Logger, @@ -88,7 +57,7 @@ def _save_info( quantization_string = "_".join( [ str(optimization_params.get(param)) or "" - for param in ["perf_loss_ths", "quantization_type"] + for param in ["metric_drop_ths", "quantization_type"] ] ) if len(quantization_string) > 1: @@ 
-113,7 +82,7 @@ def _optimize_with_optimizer( FEEDBACK_COLLECTOR.store_compiler_result( OPTIMIZER_TO_COMPILER_MAP[type(optimizer)], kwargs["quantization_type"], - kwargs["perf_loss_ths"], + kwargs.get("metric_drop_ths"), latency, ) except Exception as ex: @@ -130,7 +99,7 @@ def _optimize_with_optimizer( FEEDBACK_COLLECTOR.store_compiler_result( OPTIMIZER_TO_COMPILER_MAP[type(optimizer)], kwargs["quantization_type"], - kwargs["perf_loss_ths"], + kwargs.get("metric_drop_ths"), None, ) if debug_file: @@ -169,7 +138,7 @@ def __init__( super().__init__(logger) self.compilers = [ compiler - for compiler in select_compilers_from_hardware() + for compiler in select_compilers_from_hardware_onnx() if compiler not in (ignore_compilers or []) ] self.extra_optimizers = extra_optimizers @@ -183,9 +152,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> BaseInferenceLearner: """Optimize the ONNX model using the available compilers. @@ -198,12 +167,12 @@ def optimize( input_tfms (MultiStageTransformation, optional): Transformations to be performed to the model's input tensors in order to get the prediction. - perf_loss_ths (float, optional): Threshold for the accepted drop + metric_drop_ths (float, optional): Threshold for the accepted drop in terms of precision. Any optimized model with an higher drop will be ignored. quantization_type (QuantizationType, optional): The desired quantization algorithm to be used. - perf_metric (Callable, optional): If given it should + metric (Callable, optional): If given it should compute the difference between the quantized and the normal prediction. input_data (DataManager, optional): User defined data. @@ -211,7 +180,7 @@ def optimize( Returns: BaseInferenceLearner: Model optimized for inference. """ - if perf_loss_ths is not None and quantization_type is None: + if metric_drop_ths is not None and quantization_type is None: quantization_types = [ None, QuantizationType.DYNAMIC, @@ -232,9 +201,11 @@ def optimize( if input_tfms is not None else None, debug_file=self.debug_file, - perf_loss_ths=perf_loss_ths if q_type is not None else None, + metric_drop_ths=metric_drop_ths + if q_type is not None + else None, quantization_type=q_type, - perf_metric=perf_metric, + metric=metric, input_data=input_data, ) for compiler in self.compilers @@ -253,11 +224,11 @@ def optimize( if input_tfms is not None else None, debug_file=self.debug_file, - perf_loss_ths=perf_loss_ths + metric_drop_ths=metric_drop_ths if q_type is not None else None, quantization_type=q_type, - perf_metric=perf_metric, + metric=metric, input_data=input_data, ) for op in self.extra_optimizers @@ -273,9 +244,9 @@ def optimize_on_custom_metric( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, return_all: bool = False, ): @@ -298,12 +269,12 @@ def optimize_on_custom_metric( return_all (bool, optional): Boolean flag. If true the method returns the tuple (compiled_model, score) for each available compiler. Default `False`. 
- perf_loss_ths (float, optional): Threshold for the accepted drop + metric_drop_ths (float, optional): Threshold for the accepted drop in terms of precision. Any optimized model with an higher drop will be ignored. quantization_type (QuantizationType, optional): The desired quantization algorithm to be used. - perf_metric (Callable, optional): If given it should + metric (Callable, optional): If given it should compute the difference between the quantized and the normal prediction. input_data (DataManager, optional): User defined data. @@ -314,7 +285,7 @@ def optimize_on_custom_metric( `return_all` is `False` or all the compiled models and their scores otherwise. """ - if perf_loss_ths is not None and quantization_type is None: + if metric_drop_ths is not None and quantization_type is None: quantization_types = [ None, QuantizationType.DYNAMIC, @@ -336,9 +307,9 @@ def optimize_on_custom_metric( if input_tfms is not None else None, debug_file=self.debug_file, - perf_loss_ths=perf_loss_ths if q_type is not None else None, + perf_loss_ths=metric_drop_ths if q_type is not None else None, quantization_type=q_type, - perf_metric=perf_metric, + perf_metric=metric, input_data=input_data, ) for compiler in self.compilers @@ -356,11 +327,11 @@ def optimize_on_custom_metric( if input_tfms is not None else None, debug_file=self.debug_file, - perf_loss_ths=perf_loss_ths + perf_loss_ths=metric_drop_ths if q_type is not None else None, quantization_type=q_type, - perf_metric=perf_metric, + perf_metric=metric, input_data=input_data, ) for op in self.extra_optimizers diff --git a/nebullvm/optimizers/onnx.py b/nebullvm/optimizers/onnx.py index b38fe775..56346626 100644 --- a/nebullvm/optimizers/onnx.py +++ b/nebullvm/optimizers/onnx.py @@ -31,9 +31,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[ONNXInferenceLearner]: """Build the ONNX runtime learner from the onnx model. @@ -46,12 +46,12 @@ def optimize( input_tfms (MultiStageTransformation, optional): Transformations to be performed to the model's input tensors in order to get the prediction. - perf_loss_ths (float, optional): Threshold for the accepted drop + metric_drop_ths (float, optional): Threshold for the accepted drop in terms of precision. Any optimized model with an higher drop will be ignored. quantization_type (QuantizationType, optional): The desired quantization algorithm to be used. - perf_metric (Callable, optional): If given it should + metric (Callable, optional): If given it should compute the difference between the quantized and the normal prediction. input_data (DataManager, optional): User defined data. @@ -66,8 +66,8 @@ def optimize( f"q_type: {quantization_type}." 
) input_data_onnx, output_data_onnx, ys = [], [], None - check_quantization(quantization_type, perf_loss_ths) - if perf_loss_ths is not None: + check_quantization(quantization_type, metric_drop_ths) + if metric_drop_ths is not None: if input_data is None: input_data_onnx = [ tuple( @@ -98,7 +98,7 @@ def optimize( if input_data is not None else None, ) - if perf_loss_ths is not None: + if metric_drop_ths is not None: inputs = [ tuple( convert_to_target_framework(t, output_library) @@ -110,8 +110,8 @@ def optimize( learner, inputs, output_data_onnx, - perf_loss_ths, - metric_func=perf_metric, + metric_drop_ths, + metric_func=metric, ys=ys, ) if not is_valid: diff --git a/nebullvm/optimizers/openvino.py b/nebullvm/optimizers/openvino.py index ea35e385..8a7dd870 100644 --- a/nebullvm/optimizers/openvino.py +++ b/nebullvm/optimizers/openvino.py @@ -29,9 +29,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[OpenVinoInferenceLearner]: """Optimize the onnx model with OpenVino. @@ -44,12 +44,12 @@ def optimize( input_tfms (MultiStageTransformation, optional): Transformations to be performed to the model's input tensors in order to get the prediction. - perf_loss_ths (float, optional): Threshold for the accepted drop + metric_drop_ths (float, optional): Threshold for the accepted drop in terms of precision. Any optimized model with an higher drop will be ignored. quantization_type (QuantizationType, optional): The desired quantization algorithm to be used. - perf_metric (Callable, optional): If given it should + metric (Callable, optional): If given it should compute the difference between the quantized and the normal prediction. input_data (DataManager, optional): User defined data. 
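Across these optimizer hunks `perf_loss_ths`/`perf_metric` become `metric_drop_ths`/`metric`. Judging from how the compressors and `check_precision` invoke it in this PR, the `metric` callable receives the original prediction, the optimized prediction, and (optionally) the ground truth, and returns a scalar drop that is compared against `metric_drop_ths`. A hand-written metric could be sketched as follows (illustrative only; the library also ships built-in metrics such as the one mapped to `"numeric_precision"`):

```python
import torch


def relative_difference_metric(
    original_pred: torch.Tensor,
    optimized_pred: torch.Tensor,
    y=None,  # ground truth, unused by this particular metric
) -> float:
    # Mean relative deviation between the two predictions. An optimized
    # model is kept only if this value stays below `metric_drop_ths`.
    diff = (original_pred - optimized_pred).abs()
    scale = original_pred.abs() + 1e-8
    return float((diff / scale).mean())
```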
@@ -80,12 +80,12 @@ def optimize( ), ] if ( - perf_loss_ths is not None + metric_drop_ths is not None and quantization_type is QuantizationType.HALF ): cmd = cmd + ["--data_type", "FP16"] elif ( - perf_loss_ths is not None + metric_drop_ths is not None and quantization_type is QuantizationType.DYNAMIC ): return None @@ -95,7 +95,7 @@ def optimize( openvino_model_path = base_path / f"{Path(model).stem}.xml" openvino_model_weights = base_path / f"{Path(model).stem}.bin" if ( - perf_loss_ths is not None + metric_drop_ths is not None and quantization_type is not QuantizationType.HALF ): if input_data is not None and quantization_type: @@ -126,7 +126,7 @@ def optimize( if input_data is not None else None, ) - if perf_loss_ths is not None: + if metric_drop_ths is not None: if input_data is None: inputs = [learner.get_inputs_example()] ys = None @@ -147,8 +147,8 @@ def optimize( learner, inputs, output_data_onnx, - perf_loss_ths, - metric_func=perf_metric, + metric_drop_ths, + metric_func=metric, ys=ys, ) if not is_valid: diff --git a/nebullvm/optimizers/pytorch.py b/nebullvm/optimizers/pytorch.py index 1bc6c369..9e8cb0ad 100644 --- a/nebullvm/optimizers/pytorch.py +++ b/nebullvm/optimizers/pytorch.py @@ -34,9 +34,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[PytorchBackendInferenceLearner]: """Optimize the input model using pytorch built-in techniques. @@ -51,12 +51,12 @@ def optimize( input_tfms (MultiStageTransformation, optional): Transformations to be performed to the model's input tensors in order to get the prediction. - perf_loss_ths (float, optional): Threshold for the accepted drop + metric_drop_ths (float, optional): Threshold for the accepted drop in terms of precision. Any optimized model with an higher drop will be ignored. quantization_type (QuantizationType, optional): The desired quantization algorithm to be used. - perf_metric (Callable, optional): If given it should + metric (Callable, optional): If given it should compute the difference between the quantized and the normal prediction. input_data (DataManager, optional): User defined data. @@ -72,8 +72,8 @@ def optimize( "Other APIs than the Pytorch one are not supported " "for the Pytorch Backend yet." 
) - check_quantization(quantization_type, perf_loss_ths) - if perf_loss_ths is not None: + check_quantization(quantization_type, metric_drop_ths) + if metric_drop_ths is not None: if input_data is None: input_data_torch = [ tuple( @@ -109,13 +109,13 @@ def optimize( if input_data is not None else None, ) - if perf_loss_ths is not None: + if metric_drop_ths is not None: is_valid = check_precision( learner, input_data_torch, output_data_torch, - perf_loss_ths, - metric_func=perf_metric, + metric_drop_ths, + metric_func=metric, ys=ys, ) if not is_valid: diff --git a/nebullvm/optimizers/tensor_rt.py b/nebullvm/optimizers/tensor_rt.py index f73becfb..9198efa5 100644 --- a/nebullvm/optimizers/tensor_rt.py +++ b/nebullvm/optimizers/tensor_rt.py @@ -137,9 +137,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[NvidiaInferenceLearner]: """Optimize the input model with TensorRT. @@ -152,12 +152,12 @@ def optimize( input_tfms (MultiStageTransformation, optional): Transformations to be performed to the model's input tensors in order to get the prediction. - perf_loss_ths (float, optional): Threshold for the accepted drop + metric_drop_ths (float, optional): Threshold for the accepted drop in terms of precision. Any optimized model with an higher drop will be ignored. quantization_type (QuantizationType, optional): The desired quantization algorithm to be used. - perf_metric (Callable, optional): If given it should + metric (Callable, optional): If given it should compute the difference between the quantized and the normal prediction. input_data (DataManager, optional): User defined data. @@ -176,10 +176,10 @@ def optimize( "You are trying to run an optimizer developed for NVidia gpus " "on a machine not connected to any GPU supporting CUDA." ) - check_quantization(quantization_type, perf_loss_ths) + check_quantization(quantization_type, metric_drop_ths) engine_path = Path(model).parent / NVIDIA_FILENAMES["engine"] if ( - perf_loss_ths is not None + metric_drop_ths is not None and quantization_type is QuantizationType.STATIC ): if input_data is None: @@ -193,7 +193,7 @@ def optimize( else: input_data_onnx = input_data.get_numpy_list(300, with_ys=False) elif ( - perf_loss_ths is not None + metric_drop_ths is not None and quantization_type is QuantizationType.DYNAMIC ): return None # Dynamic quantization is not supported on tensorRT @@ -239,8 +239,8 @@ def optimize( learner, inputs, output_data, - perf_loss_ths, - metric_func=perf_metric, + metric_drop_ths, + metric_func=metric, ys=ys, ) if not is_valid: diff --git a/nebullvm/optimizers/tensorflow.py b/nebullvm/optimizers/tensorflow.py index 33d5774b..a73cb5d2 100644 --- a/nebullvm/optimizers/tensorflow.py +++ b/nebullvm/optimizers/tensorflow.py @@ -37,9 +37,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[TensorflowBackendInferenceLearner]: """Optimize the input model using pytorch built-in techniques. 
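For completeness, this is roughly how the renamed keyword arguments line up when one of these backend optimizers is driven directly. It is a sketch under stated assumptions: the model, `ModelParams`, `DataManager`, and metric objects are placeholders supplied by the caller, and this is an internal API rather than the public entry point.

```python
from nebullvm.base import DeepLearningFramework, QuantizationType
from nebullvm.optimizers.pytorch import PytorchBackendOptimizer


def optimize_with_renamed_kwargs(torch_model, model_params, data_manager, metric_fn):
    optimizer = PytorchBackendOptimizer()
    learner = optimizer.optimize(
        torch_model,
        output_library=DeepLearningFramework.PYTORCH,
        model_params=model_params,
        metric_drop_ths=0.01,                      # previously `perf_loss_ths`
        quantization_type=QuantizationType.DYNAMIC,
        metric=metric_fn,                          # previously `perf_metric`
        input_data=data_manager,
    )
    # `optimize` returns None when the quantized model exceeds the allowed
    # metric drop, so the caller has to handle that case.
    return learner
```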
@@ -54,12 +54,12 @@ def optimize( input_tfms (MultiStageTransformation, optional): Transformations to be performed to the model's input tensors in order to get the prediction. - perf_loss_ths (float, optional): Threshold for the accepted drop + metric_drop_ths (float, optional): Threshold for the accepted drop in terms of precision. Any optimized model with an higher drop will be ignored. quantization_type (QuantizationType, optional): The desired quantization algorithm to be used. - perf_metric (Callable, optional): If given it should + metric (Callable, optional): If given it should compute the difference between the quantized and the normal prediction. input_data (DataManager, optional): User defined data. @@ -77,9 +77,9 @@ def optimize( "for the Tensorflow Backend yet." ) - check_quantization(quantization_type, perf_loss_ths) + check_quantization(quantization_type, metric_drop_ths) with TemporaryDirectory() as tmp_dir: - if perf_loss_ths is not None: + if metric_drop_ths is not None: if input_data is None: input_data_tf = [ tuple( @@ -114,7 +114,7 @@ def optimize( ) learner = TF_BACKEND_LEARNERS_DICT[ - "tflite" if perf_loss_ths is not None else "tf" + "tflite" if metric_drop_ths is not None else "tf" ]( model, network_parameters=model_params, @@ -123,13 +123,13 @@ def optimize( if input_data is not None else None, ) - if perf_loss_ths is not None: + if metric_drop_ths is not None: is_valid = check_precision( learner, input_data_tf, output_data_tf, - perf_loss_ths, - metric_func=perf_metric, + metric_drop_ths, + metric_func=metric, ys=ys, ) if not is_valid: diff --git a/nebullvm/optimizers/tvm.py b/nebullvm/optimizers/tvm.py index 6d1c8fef..8ff7b527 100644 --- a/nebullvm/optimizers/tvm.py +++ b/nebullvm/optimizers/tvm.py @@ -52,9 +52,9 @@ def optimize_from_torch( torch_model: torch.nn.Module, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[ApacheTVMInferenceLearner]: self._log( @@ -65,7 +65,7 @@ def optimize_from_torch( mod, params = self._build_tvm_model_from_torch( torch_model, model_params ) - if perf_loss_ths is not None: + if metric_drop_ths is not None: if quantization_type is QuantizationType.HALF: mod = tvm.relay.transform.ToMixedPrecision( mixed_precision_type="float16" @@ -122,8 +122,8 @@ def optimize_from_torch( model, inputs, output_data, - perf_loss_ths, - metric_func=perf_metric, + metric_drop_ths, + metric_func=metric, ys=ys, ) if not is_valid: @@ -136,9 +136,9 @@ def optimize( output_library: DeepLearningFramework, model_params: ModelParams, input_tfms: MultiStageTransformation = None, - perf_loss_ths: float = None, + metric_drop_ths: float = None, quantization_type: QuantizationType = None, - perf_metric: Callable = None, + metric: Callable = None, input_data: DataManager = None, ) -> Optional[ApacheTVMInferenceLearner]: """Optimize the input model with Apache TVM. @@ -151,12 +151,12 @@ def optimize( input_tfms (MultiStageTransformation, optional): Transformations to be performed to the model's input tensors in order to get the prediction. - perf_loss_ths (float, optional): Threshold for the accepted drop + metric_drop_ths (float, optional): Threshold for the accepted drop in terms of precision. Any optimized model with an higher drop will be ignored. 
quantization_type (QuantizationType, optional): The desired quantization algorithm to be used. - perf_metric (Callable, optional): If given it should + metric (Callable, optional): If given it should compute the difference between the quantized and the normal prediction. input_data (DataManager, optional): User defined data. @@ -170,10 +170,10 @@ def optimize( f"Optimizing with {self.__class__.__name__} and " f"q_type: {quantization_type}." ) - check_quantization(quantization_type, perf_loss_ths) + check_quantization(quantization_type, metric_drop_ths) target = self._get_target() mod, params = self._build_tvm_model_from_onnx(model, model_params) - if perf_loss_ths is not None: + if metric_drop_ths is not None: if quantization_type is QuantizationType.HALF: mod = tvm.relay.transform.ToMixedPrecision( mixed_precision_type="float16" @@ -236,8 +236,8 @@ def optimize( model, inputs, output_data, - perf_loss_ths, - metric_func=perf_metric, + metric_drop_ths, + metric_func=metric, ys=ys, ) if not is_valid: diff --git a/nebullvm/pipelines/__init__.py b/nebullvm/pipelines/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/nebullvm/pipelines/steps.py b/nebullvm/pipelines/steps.py new file mode 100644 index 00000000..6ace9253 --- /dev/null +++ b/nebullvm/pipelines/steps.py @@ -0,0 +1,652 @@ +import copy +import logging +from abc import ABC, abstractmethod +from logging import Logger +from typing import Dict, List, Any, Callable, Tuple, Optional + +import cpuinfo +import numpy as np +import tensorflow as tf +import torch.nn +from tqdm import tqdm + +from nebullvm.base import ( + DeepLearningFramework, + ModelParams, + QuantizationType, + ModelCompiler, + OptimizationTime, +) +from nebullvm.compressors.base import BaseCompressor +from nebullvm.compressors.intel import TorchIntelPruningCompressor +from nebullvm.compressors.sparseml import SparseMLCompressor +from nebullvm.inference_learners import ( + BaseInferenceLearner, + PytorchBaseInferenceLearner, +) +from nebullvm.measure import compute_optimized_running_time +from nebullvm.optimizers import ( + BaseOptimizer, + ApacheTVMOptimizer, + COMPILER_TO_OPTIMIZER_MAP, + DeepSparseOptimizer, +) +from nebullvm.optimizers.blade_disc import BladeDISCOptimizer +from nebullvm.optimizers.pytorch import PytorchBackendOptimizer +from nebullvm.optimizers.tensorflow import TensorflowBackendOptimizer +from nebullvm.transformations.base import MultiStageTransformation +from nebullvm.utils.compilers import ( + tvm_is_available, + select_compilers_from_hardware_onnx, + deepsparse_is_available, + bladedisc_is_available, +) +from nebullvm.utils.data import DataManager +from nebullvm.utils.feedback_collector import FEEDBACK_COLLECTOR + + +class Step(ABC): + """Fundamental building block for the Pipeline. + + Attributes: + logger (Logger, optional): Logger defined by the user. + """ + + def __init__(self, logger: Logger = None): + self._logger = logger + + @abstractmethod + def run(self, *args, **kwargs) -> Dict: + """Run the pipeline step.""" + raise NotImplementedError() + + @property + @abstractmethod + def name(self): + raise NotImplementedError() + + def _log_info(self, text: str): + if self._logger is None: + logging.info(text) + else: + self._logger.info(text) + + def _log_warning(self, text: str): + if self._logger is None: + logging.warning(text) + else: + self._logger.warning(text) + + +class CompressorStep(Step, ABC): + """Object managing the Compressor step in the Pipeline. 
This step manages
+    all the defined Compressor objects available, considering the data given
+    by the user.
+
+    Attributes:
+        config_file (str, optional): The configuration file containing the
+            configuration parameters for each Compressor. The config_file is
+            a YAML file whose main keys are the Compressor names and whose
+            values are dictionaries containing the specific parameters for the
+            related Compressor object.
+        logger (Logger, optional): Logger defined by the user.
+    """
+
+    def __init__(self, config_file: str = None, logger: Logger = None):
+        super().__init__(logger)
+        self._config_file = config_file
+
+    def run(
+        self,
+        model: Any = None,
+        input_data: DataManager = None,
+        metric_drop_ths: float = None,
+        metric: Callable = None,
+        **kwargs,
+    ) -> Dict:
+        """Run the CompressorStep.
+
+        Args:
+            model (Any): Model to be compressed.
+            input_data (DataManager): Data to be used for compressing the
+                model.
+            metric_drop_ths: Maximum reduction in the selected metric accepted.
+                No model with a higher error will be accepted. Note that the
+                maximum error is modified and then propagated to the next
+                steps.
+            metric (Callable): Metric to be used for estimating the error
+                due to the compression.
+            kwargs (Dict): Keyword arguments propagated to the next step.
+        """
+        compressor_dict = self._get_compressors()
+        self._log_info(f"Compressions: {tuple(compressor_dict.keys())}")
+        models = {"no_compression": (copy.deepcopy(model), metric_drop_ths)}
+        train_input_data, eval_input_data = input_data.split(0.8)
+        for technique, compressor in tqdm(compressor_dict.items()):
+            try:
+                compressed_model, ths = compressor.compress(
+                    model,
+                    train_input_data,
+                    eval_input_data,
+                    metric_drop_ths,
+                    metric,
+                )
+                models[technique] = (compressed_model, ths)
+            except Exception as ex:
+                self._log_warning(
+                    f"Error during compression {technique}. Got error {ex}. "
+                    f"The compression technique will be skipped. "
+                    f"Please consult the documentation for further info or "
+                    f"open an issue on GitHub to receive assistance."
+                )
+        return {
+            "models": models,
+            "input_data": eval_input_data,
+            "metric": metric,
+            **kwargs,
+        }
+
+    @abstractmethod
+    def _get_compressors(self) -> Dict[str, BaseCompressor]:
+        raise NotImplementedError()
+
+    @property
+    def name(self):
+        return "compression_step"
+
+
+class TorchCompressorStep(CompressorStep):
+    """Object managing the Compressor step in the Pipeline for PyTorch models.
+    This step manages all the defined Compressor objects available, considering
+    the data given by the user.
+
+    At the current state, this step supports pruning with SparseML and (just on
+    intel devices) pruning with the IntelNeuralCompressor.
+
+    Attributes:
+        config_file (str, optional): The configuration file containing the
+            configuration parameters for each Compressor. The config_file is
+            a YAML file whose main keys are the Compressor names and whose
+            values are dictionaries containing the specific parameters for the
+            related Compressor object.
+        logger (Logger, optional): Logger defined by the user.
+ """ + + def _get_compressors(self) -> Dict[str, BaseCompressor]: + compressors = { + "sparseml": SparseMLCompressor(config_file=self._config_file) + } + # TODO: Reactivate the intel-neural-compressor when properly tested + if False and "intel" in cpuinfo.get_cpu_info()["brand_raw"].lower(): + compressors["intel_pruning"] = TorchIntelPruningCompressor( + config_file=self._config_file + ) + return compressors + + +class NoCompressionStep(Step): + """Step to be used when no compression is required. + + Attributes: + logger (Logger, optional): Logger defined by the user. + """ + + def run( + self, model: Any, metric_drop_ths: Optional[float], **kwargs + ) -> Dict: + return { + "models": {"no_compression": (model, metric_drop_ths)}, + **kwargs, + } + + @property + def name(self): + return "no_compression" + + +class OptimizerStep(Step, ABC): + """Object managing the Optimizers in the pipeline step. All available + optimizers are run on the model given as input and a list of tuples + (optimized_model, latency) is given as output. + + Attributes: + logger (Logger, optional): Logger defined by the user. + """ + + def run( + self, + models: Dict[str, Tuple[Any, Optional[float]]] = None, + output_library: DeepLearningFramework = None, + model_params: ModelParams = None, + input_tfms: MultiStageTransformation = None, + metric: Optional[Callable] = None, + input_data: Optional[DataManager] = None, + ignore_compilers: List[ModelCompiler] = None, + optimization_time: OptimizationTime = None, + **kwargs, + ) -> Dict: + """Run the OptimizerStep for all the available compilers. + + Args: + models (Dict): Dictionary of models produced by the CompressorStep. + For each model produced by the previous step, the updated + metric_drop_ths (i.e. the error allowed on the model) is + given together with the model. Keys represent the compression + technique used for obtaining the model. + output_library (DeepLearningFramework): The target framework. + model_params (ModelParams): The model parameters. + input_tfms (MultiStageTransformation, optional): Transformations + to be performed to the model's input tensors in order to + get the prediction. + metric (Callable): Metric to be used for estimating the error + due to the compression. + input_data (DataManager): Input data to be used for optimizing the + model. + ignore_compilers (List): List of compilers to be ignored. + optimization_time (OptimizationTime): The optimization time mode. + It can be either 'constrained' or 'unconstrained'. For + 'unconstrained' optimization all the compilers are re-used on + the different framework interfaces, even if the model has + already been compiled with the same compiler on another + framework interface. + kwargs (Dict): Extra keywords that will be ignored. 
+ """ + + optimizers = self._get_optimizers(ignore_compilers) + self._log_info( + f"Optimizations: " + f"{tuple(compiler.value for compiler in optimizers.keys())}" + ) + optimized_models = [] + + for prev_tech, (model, metric_drop_ths) in tqdm(models.items()): + self._log_info(f"Optimizing output of {prev_tech}") + if model is None: + continue + if metric_drop_ths is not None: + q_types = [ + None, + QuantizationType.DYNAMIC, + QuantizationType.HALF, + ] + if input_data is not None: + q_types.append(QuantizationType.STATIC) + else: + q_types = [None] + for compiler, optimizer in tqdm(optimizers.items()): + for q_type in q_types: + try: + optimized_model = self._run_optimizer( + optimizer, + model, + output_library, + model_params, + input_tfms.copy(), + metric_drop_ths, + q_type, + metric, + input_data, + ) + if optimized_model is not None: + latency = compute_optimized_running_time( + optimized_model + ) + else: + latency = np.inf + optimized_models.append((optimized_model, latency)) + if ( + compiler not in ignore_compilers + and optimization_time + is OptimizationTime.CONSTRAINED + ): + ignore_compilers.append(compiler) + FEEDBACK_COLLECTOR.store_compiler_result( + compiler=compiler, + q_type=q_type, + metric_drop_ths=metric_drop_ths, + latency=latency, + compression=prev_tech, + ) + except Exception as ex: + self._log_warning( + f"Compilation failed with {output_library.value} " + f"interface of {compiler}. Got error {ex}. " + f"If possible the compilation will be re-scheduled" + f" with another interface. Please consult the " + f"documentation for further info or open an issue " + f"on GitHub for receiving assistance." + ) + FEEDBACK_COLLECTOR.store_compiler_result( + compiler=compiler, + q_type=q_type, + metric_drop_ths=metric_drop_ths, + latency=None, + compression=prev_tech, + ) + + return { + "optimized_models": optimized_models, + "ignore_compilers": ignore_compilers, + } + + @property + def name(self): + return "optimizer_step" + + @abstractmethod + def _run_optimizer( + self, + optimizer, + model: Any, + output_library: DeepLearningFramework, + model_params: ModelParams, + input_tfms: MultiStageTransformation = None, + metric_drop_ths: float = None, + quantization_type: QuantizationType = None, + metric: Callable = None, + input_data: DataManager = None, + ) -> BaseInferenceLearner: + raise NotImplementedError() + + @abstractmethod + def _get_optimizers( + self, ignore_compilers: List[ModelCompiler] + ) -> Dict[ModelCompiler, BaseOptimizer]: + raise NotImplementedError() + + +class TorchOptimizerStep(OptimizerStep): + """Object managing the Optimizers in the pipeline step supporting PyTorch + as compiler interface. All available optimizers are run on the model given + as input and a list of tuples (optimized_model, latency) is given as + output. + + Attributes: + logger (Logger, optional): Logger defined by the user. 
+ """ + + def _get_optimizers( + self, ignore_compilers: List[ModelCompiler] + ) -> Dict[ModelCompiler, BaseOptimizer]: + optimizers = { + ModelCompiler.TORCHSCRIPT: PytorchBackendOptimizer( + logger=self._logger + ), + } + if ( + tvm_is_available() + and ModelCompiler.APACHE_TVM not in ignore_compilers + ): + optimizers[ModelCompiler.APACHE_TVM] = ApacheTVMOptimizer( + logger=self._logger + ) + if ( + deepsparse_is_available() + and ModelCompiler.DEEPSPARSE not in ignore_compilers + ): + optimizers[ModelCompiler.DEEPSPARSE] = DeepSparseOptimizer() + if ( + bladedisc_is_available() + and ModelCompiler.BLADEDISC not in ignore_compilers + ): + optimizers[ModelCompiler.BLADEDISC] = BladeDISCOptimizer() + return optimizers + + def _run_optimizer( + self, + optimizer, + model: Any, + output_library: DeepLearningFramework, + model_params: ModelParams, + input_tfms: MultiStageTransformation = None, + metric_drop_ths: float = None, + quantization_type: QuantizationType = None, + metric: Callable = None, + input_data: DataManager = None, + ) -> PytorchBaseInferenceLearner: + if hasattr(optimizer, "optimize_from_torch"): + optimized_model = optimizer.optimize_from_torch( + torch_model=model, + model_params=model_params, + metric_drop_ths=metric_drop_ths + if quantization_type is not None + else None, + metric=metric, + quantization_type=quantization_type, + input_tfms=input_tfms, + input_data=input_data, + ) + else: + optimized_model = optimizer.optimize( + model=model, + output_library=output_library, + model_params=model_params, + metric_drop_ths=metric_drop_ths + if quantization_type is not None + else None, + metric=metric, + quantization_type=quantization_type, + input_tfms=input_tfms, + input_data=input_data, + ) + return optimized_model + + +class TFOptimizerStep(OptimizerStep): + """Object managing the Optimizers in the pipeline step supporting + TensorFlow as compiler interface. All available optimizers are run on + the model given as input and a list of tuples (optimized_model, latency) + is given as output. + + Attributes: + logger (Logger, optional): Logger defined by the user. + """ + + def _get_optimizers( + self, ignore_compilers: List[ModelCompiler] + ) -> Dict[ModelCompiler, BaseOptimizer]: + optimizers = { + ModelCompiler.TFLITE: TensorflowBackendOptimizer( + logger=self._logger + ) + } + return optimizers + + def _run_optimizer( + self, + optimizer, + model: Any, + output_library: DeepLearningFramework, + model_params: ModelParams, + input_tfms: MultiStageTransformation = None, + metric_drop_ths: float = None, + quantization_type: QuantizationType = None, + metric: Callable = None, + input_data: DataManager = None, + ) -> PytorchBaseInferenceLearner: + if hasattr(optimizer, "optimize_from_tf"): + optimized_model = optimizer.optimize_from_tf( + torch_model=model, + model_params=model_params, + metric_drop_ths=metric_drop_ths + if quantization_type is not None + else None, + metric=metric, + quantization_type=quantization_type, + input_tfms=input_tfms, + input_data=input_data, + ) + else: + optimized_model = optimizer.optimize( + model=model, + output_library=output_library, + model_params=model_params, + metric_drop_ths=metric_drop_ths + if quantization_type is not None + else None, + metric=metric, + quantization_type=quantization_type, + input_tfms=input_tfms, + input_data=input_data, + ) + return optimized_model + + +class OnnxOptimizerStep(OptimizerStep): + """Object managing the Optimizers in the pipeline step supporting ONNX + as compiler interface. 
All available optimizers are run on the model given + as input and a list of tuples (optimized_model, latency) is given as + output. + + Attributes: + logger (Logger, optional): Logger defined by the user. + """ + + def _get_optimizers( + self, ignore_compilers: List[ModelCompiler] + ) -> Dict[ModelCompiler, BaseOptimizer]: + compilers = select_compilers_from_hardware_onnx() + optimizers = { + compiler: COMPILER_TO_OPTIMIZER_MAP[compiler](self._logger) + for compiler in compilers + if compiler not in ignore_compilers + } + return optimizers + + def _run_optimizer( + self, + optimizer, + model: Any, + output_library: DeepLearningFramework, + model_params: ModelParams, + input_tfms: MultiStageTransformation = None, + metric_drop_ths: float = None, + quantization_type: QuantizationType = None, + metric: Callable = None, + input_data: DataManager = None, + ) -> PytorchBaseInferenceLearner: + optimized_model = optimizer.optimize( + model=model, + output_library=output_library, + model_params=model_params, + metric_drop_ths=metric_drop_ths + if quantization_type is not None + else None, + metric=metric, + quantization_type=quantization_type, + input_tfms=input_tfms, + input_data=input_data, + ) + return optimized_model + + +class Pipeline(Step): + """Pipeline object. + + A Pipeline is a list of steps executed sequentially, where each step + takes as input the output of the previous one. + + Attributes: + pipeline_name: str, + steps (List): List of Steps composing the pipeline. + logger (Logger): Logger defined by the user. + """ + + def __init__( + self, pipeline_name: str, steps: List[Step], logger: Logger = None + ): + super().__init__(logger) + self._name = pipeline_name + self._steps = steps + + def run(self, **kwargs) -> Dict: + self._log_info(f"Running pipeline: {self.name}") + for step in self._steps: + self._log_info(f"Running step: {step.name}") + kwargs = step.run(**kwargs) + return kwargs + + @property + def name(self): + return self._name + + +def _get_compressor_step( + model: Any, + optimization_time: OptimizationTime, + config_file: Optional[str], + metric_drop_ths: Optional[float], + metric: Optional[Callable], + logger: Optional[Logger], +) -> Step: + if optimization_time is OptimizationTime.CONSTRAINED: + return NoCompressionStep(logger=logger) + if metric_drop_ths is None or metric is None: + return NoCompressionStep(logger=logger) + elif isinstance(model, torch.nn.Module): + return TorchCompressorStep(config_file=config_file, logger=logger) + else: # default is NoCompression + return NoCompressionStep(logger=logger) + + +def _get_optimizer_step( + model: Any, + logger: Optional[Logger], +) -> Step: + if isinstance(model, torch.nn.Module): + return TorchOptimizerStep(logger=logger) + elif isinstance(model, tf.Module): + return TFOptimizerStep(logger=logger) + else: + return OnnxOptimizerStep(logger=logger) + + +def _get_pipeline_name(model: Any): + if isinstance(model, torch.nn.Module): + return "pytorch_pipeline" + elif isinstance(model, tf.Module): + return "tensorflow_pipeline" + else: + return "onnx_pipeline" + + +def build_pipeline_from_model( + model: Any, + optimization_time: OptimizationTime, + metric_drop_ths: Optional[float], + metric: Optional[Callable], + config_file: Optional[str], + logger: Logger = None, +) -> Pipeline: + """Function for building a pipeline from a model and user-defined + parameters + + Args: + model (Any): The input model. + optimization_time (OptimizationTime): The optimization time mode. + It can be either 'constrained' or 'unconstrained'. 
For
+            'constrained' mode just compilers and precision reduction
+            techniques are used (no compression). 'Unconstrained' optimization
+            allows the usage of more time-consuming techniques such as pruning
+            and distillation.
+        metric_drop_ths (float, optional): Maximum reduction in the
+            selected metric accepted. No model with a higher error will be
+            accepted, i.e. all optimized models with a larger error with
+            respect to the original one will be discarded, without even
+            considering their possible speed-up.
+        metric (Callable): Metric to be used for estimating the error
+            due to the optimization techniques.
+        config_file (str, optional): Configuration file containing the
+            parameters needed for defining the CompressionStep in the pipeline.
+        logger (Logger, optional): Logger defined by the user.
+    """
+    compressor_step = _get_compressor_step(
+        model, optimization_time, config_file, metric_drop_ths, metric, logger
+    )
+    optimizer_step = _get_optimizer_step(model, logger)
+    pipeline = Pipeline(
+        pipeline_name=_get_pipeline_name(model),
+        logger=logger,
+        steps=[compressor_step, optimizer_step],
+    )
+    return pipeline
diff --git a/nebullvm/utils/compilers.py b/nebullvm/utils/compilers.py
new file mode 100644
index 00000000..6d0acbbd
--- /dev/null
+++ b/nebullvm/utils/compilers.py
@@ -0,0 +1,43 @@
+import cpuinfo
+import torch
+
+from nebullvm.base import ModelCompiler
+
+
+def tvm_is_available() -> bool:
+    try:
+        import tvm  # noqa F401
+
+        return True
+    except ImportError:
+        return False
+
+
+def bladedisc_is_available() -> bool:
+    try:
+        import torch_blade  # noqa F401
+
+        return True
+    except ImportError:
+        return False
+
+
+def deepsparse_is_available() -> bool:
+    try:
+        import deepsparse  # noqa F401
+    except ImportError:
+        return False
+    else:
+        return True
+
+
+def select_compilers_from_hardware_onnx():
+    compilers = [ModelCompiler.ONNX_RUNTIME]
+    if tvm_is_available():
+        compilers.append(ModelCompiler.APACHE_TVM)
+    if torch.cuda.is_available():
+        compilers.append(ModelCompiler.TENSOR_RT)
+    cpu_raw_info = cpuinfo.get_cpu_info()["brand_raw"].lower()
+    if "intel" in cpu_raw_info:
+        compilers.append(ModelCompiler.OPENVINO)
+    return compilers
diff --git a/nebullvm/utils/data.py b/nebullvm/utils/data.py
index 9339af97..975b31b2 100644
--- a/nebullvm/utils/data.py
+++ b/nebullvm/utils/data.py
@@ -1,3 +1,4 @@
+import warnings
 from typing import Sequence, List, Tuple, Any, Union, Iterable

 import numpy as np
@@ -81,3 +82,20 @@ def get_list(
     @classmethod
     def from_iterable(cls, iterable: Iterable, max_length: int = 500):
         return cls([x for i, x in enumerate(iterable) if i < max_length])
+
+    def split(self, split_pct: float, shuffle: bool = False):
+        if shuffle:
+            idx = np.random.choice(len(self), len(self), replace=False)
+        else:
+            idx = np.arange(len(self))
+
+        n = int(round(len(idx) * split_pct))
+        if n == 0 or n == len(idx):
+            warnings.warn(
+                "Not enough data for splitting the DataManager. "
+                "An empty data-manager will be passed as a result of the split."
+            )
+        return (
+            DataManager([self[i] for i in idx[:n]]),
+            DataManager([self[i] for i in idx[n:]]),
+        )
diff --git a/nebullvm/utils/feedback_collector.py b/nebullvm/utils/feedback_collector.py
index 1d7712b6..d5cab30c 100644
--- a/nebullvm/utils/feedback_collector.py
+++ b/nebullvm/utils/feedback_collector.py
@@ -124,16 +124,19 @@ def store_compiler_result(
         self,
         compiler: ModelCompiler,
         q_type: Optional[QuantizationType],
-        perf_loss_ths: Optional[float],
+        metric_drop_ths: Optional[float],
         latency: Optional[float],
+        compression: str = None,
     ):
         if self._model_id is None:
             return
         q_type_key = (
-            f"{q_type.value}_{perf_loss_ths}"
-            if q_type is not None and perf_loss_ths is not None
+            f"{q_type.value}_{metric_drop_ths}"
+            if q_type is not None and metric_drop_ths is not None
             else "noopt"
         )
+        if compression is not None and len(compression) > 0:
+            q_type_key = compression + "_" + q_type_key
         compiler_dict = self._latency_dict.get(compiler.value, {})
         compiler_dict[q_type_key] = latency if latency else -1.0
         self._latency_dict[compiler.value] = compiler_dict
diff --git a/nebullvm/utils/venv.py b/nebullvm/utils/venv.py
new file mode 100644
index 00000000..13759afb
--- /dev/null
+++ b/nebullvm/utils/venv.py
@@ -0,0 +1,61 @@
+import logging
+import subprocess
+import tempfile
+import venv
+
+
+class EnvBuilder(venv.EnvBuilder):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.context = None
+
+    def post_setup(self, context):
+        self.context = context
+
+
+def run_in_different_venv(
+    requirements_file: str,
+    script_path: str,
+    *args,
+    logger: logging.Logger = None,
+):
+    """Run a python script in a new temporary environment. Arguments for the
+    script must be passed in the function args.
+    It is equivalent to creating and activating a new environment and running
+    > pip install -r $requirements_file
+    > python script_path *args
+
+    Args:
+        requirements_file (str): File (.txt) containing the list of
+            requirements.
+        script_path (str): Path to the script that must be run.
+        args: Arguments of the script.
+        logger (Logger, optional): Logger for the project.
+    """
+    if logger is not None:
+        logger.debug(
+            f"Debug: Running script {script_path} in a new virtual env."
+        )
+    with tempfile.TemporaryDirectory() as target_dir_path:
+        if logger is not None:
+            logger.debug("Debug: Creating virtual environment...")
+        venv_builder = EnvBuilder(with_pip=True)
+        venv_builder.create(str(target_dir_path))
+        venv_context = venv_builder.context
+
+        if logger is not None:
+            logger.debug("Debug: Installing requirements...")
+        pip_install_command = [
+            venv_context.env_exe,
+            "-m",
+            "pip",
+            "install",
+            "-r",
+            requirements_file,
+        ]
+        subprocess.check_call(pip_install_command)
+
+        if logger is not None:
+            logger.debug("Debug: Executing script...")
+        script_command = [venv_context.env_exe, script_path, *args]
+        subprocess.check_call(script_command)
diff --git a/requirements.txt b/requirements.txt
index 189f2fd6..0f095ebf 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,6 +3,7 @@ scipy<=1.5.4
 onnx>=1.10.0
 onnxmltools>=1.11.0
 py-cpuinfo==8.0.0
+PyYAML>=6.0
 psutil>=5.9.0
 requests>=2.28.1
 tensorflow>=2.7.0
diff --git a/resources/README.md b/resources/README.md
deleted file mode 100644
index e9d3cf53..00000000
--- a/resources/README.md
+++ /dev/null
@@ -1,77 +0,0 @@
-# Benchmarks
-
-We have tested nebullvm on popular AI models and hardware from leading vendors.
- -- Hardware: M1 Pro, Intel Xeon, AMD EPYC and NVIDIA T4 -- AI Models: EfficientNet, Resnet, SqueezeNet, GPT2, BERT - -## Reponse time acceleration in milliseconds - -The table below shows the response time in milliseconds of the non-optimized model and the optimized model for the various model-hardware couplings as an average value over 100 experiments. - -Optimized performance is provided in the case where acceleration has not resulted in any performance loss by using deep learning compilers (Option A) or when also other otpimization techniques such as quantization and half precision (Option B) are also applied with perf_loss_ths parameter set to 2. Refer to the nebullvm library readme for more clarification on the two options and the perf_loss_ths parameter. - -|Response time (milliseconds)| <-- | **M1 Pro** | --> | <-- | **Intel Xeon** | --> | <-- | **AMD EPYC** | --> | <-- | **Nvidia T4** | --> | -|:----------------------:|:-----------:|:------------:|:------------:|:-----------:|:---------------:|:------------:|:-----------:|:-------------:|:------------:|:-----------:|:-------------:|:------------:| -| | **Vanilla** | **Option A** | **Option B** | **Vanilla** | **Option A** | **Option B** | **Vanilla** | **Option A** | **Option B** | **Vanilla** | **Option A** | **Option B** | -| **EfficientNetB0** | 214.95 | 24.4 | 9.24 | 36.07 | 12.15 | 10.44 | 86.29 | 38.64 | 31.67 | 12.92 | 9.59 | - | -| **EfficientNetB1** | 278.81 | 33.62 | 13.28 | 50.47 | 17.33 | 16.15 | 96.65 | 59.93 | 41.69 | 17.99 | 14.19 | - | -| **EfficientNetB2** | 284.88 | 36.77 | 14.56 | 50.33 | 19.06 | 17.99 | 97.32 | 65.93 | - | 36.91 | 13.46 | - | -| **EfficientNetB3** | 370.11 | 50.37 | 20.29 | 67.98 | 26.74 | 25.83 | 207.95 | 89.61 | - | 20.26 | 14.33 | - | -| **EfficientNetB4** | 558.86 | 70.99 | 28.03 | 91.43 | 35.89 | 35.08 | 274.93 | 119.17 | - | 24.89 | 17.08 | - | -| **EfficientNetB5** | 704.25 | 99.84 | 41.62 | 125.69 | 53.91 | 51.7 | 481.7 | 188.63 | - | 31.23 | 17.94 | - | -| **EfficientNetB6** | 1124 | 157.38 | 56.67 | 165.15 | 71.99 | 68.74 | 630.95 | 256.65 | - | 35.79 | 21.27 | - | -| **EfficientNetB7** | 1521.71 | 212.12 | 81.83 | 223.15 | 106.86 | 95.85 | 766.61 | 395.57 | - | 45.65 | 23.32 | - | -| **Resnet18** | 18.48 | 15.75 | - | 32.2 | 17.79 | 16.66 | 147.04 | 93.43 | 84.99 | 25.23 | 12.39 | 3.46 | -| **Resnet34** | 42.06 | 34.4 | - | 61.67 | 36.54 | 33.19 | 180.18 | 166.13 | - | 27.41 | 5.36 | 4.79 | -| **Resnet50** | 62.22 | 54.25 | 46.22 | 83.1 | 46.81 | 38.42 | 311.44 | 197.68 | 161.45 | 10.5 | 7.81 | 5.65 | -| **Resnet101** | 118.95 | 92.01 | 86.48 | 152.52 | 82.99 | 71.19 | 545.65 | 364.74 | 358.55 | 20.22 | 12.82 | 9.43 | -| **Resnet152** | 166.89 | 129.81 | 127.31 | 220.78 | 129.86 | 104.05 | 810.95 | 540.86 | - | 32.51 | 17.86 | 12.92 | -| **SqueezeNet** | 15.25 | 7.86 | - | 23.63 | 8.7 | - | 86.78 | 43.49 | - | 3.48 | 2.7 | - | -| **Convnext tiny** | 305.58 | 95.55 | 94.89 | 79.91 | 62.01 | - | 404.75 | 220.91 | - | 38.29 | 9.58 | 7.69 | -| **Convnext small** | 615.25 | 167.78 | 167.43 | 145.05 | 110.69 | - | 735.037 | 544.47 | - | 24.31 | 17.02 | 12.21 | -| **Convnext base** | 815.01 | 240.4 | - | 230.72 | 187.39 | - | 1237.36 | 966.58 | - | 76.53 | 25.79 | 15.44 | -| **Convnext large** | 1266.87 | 394.85 | - | 444.82 | 396.62 | - | 2537.23 | 1868.43 | 1567.97 | 108.12 | 38.41 | 23.67 | -| **GPT2 - 10 tokens** | 29.67 | 10.75 | - | 38.45 | 31.88 | 12.15 | 138.11 | 55.31 | 48.76 | 15.31 | 4.42 | 4.01 | -| **GPT2 - 1024 tokens** | 546.74 | - | - | 1564.67 | 924.58 | - | 9423.16 | 
5076.11 | - | 84.47 | - | 58.63 | -| **Bert - 8 tokens** | 39.39 | 6.2 | - | 31.31 | 14.87 | 10.86 | 164.9 | 38.12 | 34.08 | 10.35 | 3.78 | 2.51 | -| **Bert - 512 tokens** | 489.52 | 276.35 | - | 494.21 | 376.13 | - | 2985.27 | 1847.31 | - | 31.25 | 27.37 | 10.12 | - - - - -## Reponse time acceleration (inference speedup) - -The table below displays the speedup provided by nebullvm, where speedup is defined as the response time of the optimized model over the response time of the non-optimized model. - -The speedup is shown for option A and B. We also present the OpB boost, which refers to the additional acceleration provided by the techniques used only in Option B (quantization and half-precision) over those also used in Option A (deep learning compilers. Refer to the nebullvm library readme for more information about Option A and B. - - -|Infefence speedup| <-- | **M1 Pro** | --> | <-- | **Intel Xeon** | --> | <-- | **AMD EPYC** | --> | <-- | **Nvidia T4** | --> | -|:----------------------:|:------------:|:-----------:|:------------:|:----------------:|:---------------:|:------------:|:----------------:|:-------------:|:------------:|:----------------:|:-------------:|:------------:| -| | **Option A** | **OpB boost** | **Option B** | **DL compilers** | **OpB boost** | **Option B** | **DL compilers** | **OpB boost** | **Option B** | **DL compilers** | **OpB boost** | **Option B** | -| **EfficientNetB0** | 8.8x | 2.6x | 23.3x | 3.0x | 1.2x | 3.5x | 2.2x | 1.2x | 2.7x | 1.3x | - | 1.3x | -| **EfficientNetB1** | 8.3x | 2.5x | 21.0x | 2.9x | 1.1x | 3.1x | 1.6x | 1.4x | 2.3x | 1.3x | - | 1.3x | -| **EfficientNetB2** | 7.7x | 2.5x | 19.6x | 2.6x | 1.1x | 2.8x | 1.5x | - | 1.5x | 2.7x | - | 2.7x | -| **EfficientNetB3** | 7.3x | 2.5x | 18.2x | 2.5x | 1.0x | 2.6x | 2.3x | - | 2.3x | 1.4x | - | 1.4x | -| **EfficientNetB4** | 7.9x | 2.5x | 19.9x | 2.5x | 1.0x | 2.6x | 2.3x | - | 2.3x | 1.5x | - | 1.5x | -| **EfficientNetB5** | 7.1x | 2.4x | 16.9x | 2.3x | 1.0x | 2.4x | 2.6x | - | 2.6x | 1.7x | - | 1.7x | -| **EfficientNetB6** | 7.1x | 2.8x | 19.8x | 2.3x | 1.0x | 2.4x | 2.5x | - | 2.5x | 1.7x | - | 1.7x | -| **EfficientNetB7** | 7.2x | 2.6x | 18.6x | 2.1x | 1.1x | 2.3x | 1.9x | - | 1.9x | 2.0x | - | 2.0x | -| **Resnet18** | 1.2x | - | 1.2x | 1.8x | 1.1x | 1.9x | 1.6x | 1.1x | 1.7x | 2.0x | 3.6x | 7.3x | -| **Resnet34** | 1.2x | - | 1.2x | 1.7x | 1.1x | 1.9x | 1.1x | - | 1.1x | 5.1x | 1.1x | 5.7x | -| **Resnet50** | 1.1x | 1.2x | 1.3x | 1.8x | 1.2x | 2.2x | 1.6x | 1.2x | 1.9x | 1.3x | 1.4x | 1.9x | -| **Resnet101** | 1.3x | 1.1x | 1.4x | 1.8x | 1.2x | 2.1x | 1.5x | 1.0x | 1.5x | 1.6x | 1.4x | 2.1x | -| **Resnet152** | 1.3x | 1.0x | 1.3x | 1.7x | 1.2x | 2.1x | 1.5x | - | 1.5x | 1.8x | 1.4x | 2.5x | -| **SqueezeNet** | 1.9x | - | 1.9x | 2.7x | - | 2.7x | 2.0x | - | 2.0x | 1.3x | - | 1.3x | -| **Convnext tiny** | 3.2x | 1.0x | 3.2x | 1.3x | - | 1.3x | 1.8x | - | 1.8x | 4.0x | 1.2x | 5.0x | -| **Convnext small** | 3.7x | 1.0x | 3.7x | 1.3x | - | 1.3x | 1.4x | - | 1.4x | 1.4x | 1.4x | 2.0x | -| **Convnext base** | 3.4x | - | 3.4x | 1.2x | - | 1.2x | 1.3x | - | 1.3x | 3.0x | 1.7x | 5.0x | -| **Convnext large** | 3.2x | - | 3.2x | 1.1x | - | 1.1x | 1.4x | 1.2x | 1.6x | 2.8x | 1.6x | 4.6x | -| **GPT2 - 10 tokens** | 2.8x | - | 2.8x | 1.2x | 2.6x | 3.2x | 2.5x | 1.1x | 2.8x | 3.5x | 1.1x | 3.8x | -| **GPT2 - 1024 tokens** | - | - | - | 1.7x | - | 1.7x | 1.9x | - | 1.9x | - | 1.6x | 1.4x | -| **Bert - 8 tokens** | 6.4x | - | 6.4x | 2.1x | 1.4x | 2.9x | 4.3x | 1.1x | 4.8x | 2.7x | 1.5x | 4.1x | -| **Bert - 512 
tokens** | 1.8x | - | 1.8x | 1.3x | - | 1.3x | 1.6x | - | 1.6x | 1.1x | 2.7x | 3.1x | - - - diff --git a/setup.py b/setup.py index d47f31c5..d3885618 100644 --- a/setup.py +++ b/setup.py @@ -8,6 +8,7 @@ "onnx>=1.10.0", "onnxmltools>=1.11.0", "py-cpuinfo>=8.0.0", + "PyYAML>=6.0", "psutil>=5.9.0", "requests>=2.28.1", "tensorflow>=2.7.0", @@ -22,7 +23,7 @@ setup( name="nebullvm", - version="0.3.2", + version="0.4.0", packages=find_packages(), install_requires=REQUIREMENTS, long_description=long_description,
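
A note on the new pipeline API introduced in `nebullvm/pipelines/steps.py` above: each `Step.run(**kwargs)` returns a dict that is fed as keyword arguments to the next step, and `Pipeline` simply chains steps in order, exactly as `build_pipeline_from_model` does with a compressor step followed by an optimizer step. The sketch below only illustrates that chaining contract; `Step` and `Pipeline` come from this diff, while `TimerStep` and the keywords passed to `run` are hypothetical examples, not part of the library.

```python
# Minimal sketch of the Step/Pipeline chaining contract from
# nebullvm/pipelines/steps.py. `Step` and `Pipeline` are defined in this diff;
# `TimerStep` and the keyword arguments below are illustrative only.
import time
from typing import Dict

from nebullvm.pipelines.steps import Pipeline, Step


class TimerStep(Step):
    """Hypothetical step: stamps the kwargs and forwards everything else."""

    def run(self, **kwargs) -> Dict:
        # Whatever a step returns becomes the **kwargs of the next step.
        return {**kwargs, "timestamp": time.time()}

    @property
    def name(self):
        return "timer_step"


pipeline = Pipeline(
    pipeline_name="example_pipeline",
    steps=[TimerStep(), TimerStep()],
)
result = pipeline.run(model=None)  # the keys simply flow through the steps
print(result.keys())  # dict_keys(['model', 'timestamp'])
```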
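Similarly, the new `DataManager.split` helper added in `nebullvm/utils/data.py` is what `CompressorStep` relies on to carve the user-provided data into train/eval partitions via `input_data.split(0.8)`. A minimal sketch of its behaviour, assuming the `DataManager` simply wraps a list of samples (the integers below stand in for real model inputs):

```python
# Sketch of DataManager.split (added in nebullvm/utils/data.py).
# The integers are placeholders for real input batches.
from nebullvm.utils.data import DataManager

data = DataManager(list(range(10)))  # 10 dummy samples
train, evaluation = data.split(0.8)  # 80% / 20%, in the original order
print(len(train), len(evaluation))   # 8 2

# Passing shuffle=True draws a random permutation before splitting.
shuffled_train, shuffled_eval = data.split(0.8, shuffle=True)

# With too few samples the split degenerates: a warning is emitted and one
# of the returned DataManagers is empty.
tiny_train, tiny_eval = DataManager([0]).split(0.8)
print(len(tiny_train), len(tiny_eval))  # 1 0
```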