🚀 Fix model export as code #19

Open
wants to merge 1 commit into
base: master
19 changes: 3 additions & 16 deletions apply_model/model_export_as_cpp_code_tutorial.md
@@ -5,25 +5,22 @@ Catboost model could be saved as standalone C++ code. This can ease an integrati

The exported model code contains the complete data for the trained model and the *ApplyCatboostModel()* function, which applies the model to a given dataset. The only current dependency for the code is the [CityHash library](https://github.com/google/cityhash/tree/00b9287e8c1255b5922ef90e304d5287361b2c2a) (NOTE: The exact revision under the link is required).


### Exporting from Catboost application via command line interface

```bash
catboost fit --model-format CPP <other_fit_parameters>
```

By default, the model is saved into the *model.cpp* file. The output name can be changed with the *-m* key. If more than one model format is specified, the *.cpp* extension is added to the name provided after the *-m* key.


### Exporting from Catboost python library interface

```python
model = CatBoost(<train_params>)
model.fit(train_pool)
model.save_model(OUTPUT_CPP_MODEL_PATH, format="CPP")
```


## Models trained with only Float features

If the model was trained using only numerical features (no cat features), then the application function in generated code will have the following interface:
@@ -32,14 +29,12 @@ If the model was trained using only numerical features (no cat features), then t
double ApplyCatboostModel(const std::vector<float>& features);
```


### Parameters

| parameter | description |
|-----------|--------------------------------------------------|
| features | features of a single document to make prediction |


### Return value

Prediction of the model for the document with given features.
@@ -58,7 +53,6 @@ double ApplyCatboostModel(const std::vector<float>& features) {

C++11 support of non-static data member initializers and extended initializer lists


## Models trained with Categorical features

If the model was trained with categorical features present, then the application function in output code will be generated with the following interface:
@@ -67,7 +61,6 @@ If the model was trained with categorical features present, then the application
double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::vector<std::string>& catFeatures);
```


### Parameters

| parameter | description |
@@ -77,7 +70,6 @@ double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::ve

NOTE: Pass float and categorical features separately, each in the same order as they appeared in the training dataset. For example, if the features were f1, f2, f3, f4 and f2 and f4 were categorical, pass floatFeatures = {f1, f3} and catFeatures = {f2, f4}.


### Return value

Prediction of the model for the document with given features.
@@ -92,21 +84,16 @@ double ApplyCatboostModel(const std::vector<float>& floatFeatures, const std::ve
}
```


### Compiler requirements

A C++14 compiler with aggregate member initialization support. Tested compilers: g++ 5 (5.4.1 20160904), clang++ 3.8.


## Current limitations

- MultiClassification models are not supported.
- The ApplyCatboostModel() function is a reference implementation and may be slower than the native CatBoost applicator, especially on large models and when applied to many documents.

- [Text](https://catboost.ai/en/docs/features/text-features) and [Embeddings](https://catboost.ai/en/docs/features/embeddings-features) features are not supported.

## Troubleshooting

Q: Generated model results differ from the native model when categorical features are present
A: Please check that CityHash version 1 is used. See the exact required revision of the [C++ Google CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also a proper CityHash implementation in the [Catboost repository](https://github.com/catboost/catboost/blob/master/util/digest/city.h). Other versions of CityHash may produce a different hash code for the same string.


31 changes: 17 additions & 14 deletions apply_model/model_export_as_python_code_tutorial.md
@@ -5,25 +5,22 @@ Catboost model could be saved as standalone Python code. This can ease an integr

The exported model code contains the complete data for the trained model and the *apply_catboost_model()* function, which applies the model to a given dataset. The only current dependency for the code is the [CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56).


### Exporting from Catboost application via command line interface

```bash
catboost fit --model-format Python <other_fit_parameters>
```

By default, the model is saved into the *model.py* file. The output name can be changed with the *-m* key. If more than one model format is specified, the *.py* extension is added to the name provided after the *-m* key.


### Exporting from Catboost python library interface

```python
model = CatBoost(<train_params>)
model.fit(train_pool)
model.save_model(OUTPUT_PYTHON_MODEL_PATH, format="python")
```
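
The exported file is an ordinary Python module, so it can be imported like any other. The sketch below loads it by path with the standard library. The helper name `load_exported_model` and its `module_name` argument are illustrative assumptions and are not part of CatBoost; only `apply_catboost_model` comes from the generated code.

```python
import importlib.util


def load_exported_model(path, module_name="exported_catboost_model"):
    """Import a generated model file by path and return its apply function.

    `load_exported_model` and `module_name` are hypothetical names used for
    illustration; only `apply_catboost_model` comes from the exported code.
    """
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a Python module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    # Executing the module defines the model data and the apply function.
    spec.loader.exec_module(module)
    return module.apply_catboost_model
```

With a model saved to `model.py`, `load_exported_model("model.py")` would return the function described in the sections that follow.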


## Models trained with only Float features

If the model was trained using only numerical features (no cat features), then the application function in generated code will have the following interface:
@@ -32,19 +29,16 @@ If the model was trained using only numerical features (no cat features), then t
def apply_catboost_model(float_features):
```


### Parameters

| parameter | type | description |
|----------------|----------------------------|--------------------------------------------------|
| float_features | list of int or float values| features of a single document to make prediction |


### Return value

Prediction of the model for the document with given features, equivalent to CatBoost().predict(prediction_type='RawFormulaVal').
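
Because the returned value is the raw formula value, it is not a probability. A minimal sketch of the conversion follows, assuming a binary classifier trained with the `Logloss` objective, for which the probability is the logistic sigmoid of the raw value. The helper `raw_to_probability` is hypothetical and is not emitted by the exporter.

```python
import math


def raw_to_probability(raw_value):
    """Map a raw formula value to a probability with the logistic sigmoid.

    Assumes a binary classifier trained with Logloss; this helper is
    illustrative and is not part of the generated model code.
    """
    # Branch on sign so math.exp never receives a large positive argument.
    if raw_value >= 0:
        return 1.0 / (1.0 + math.exp(-raw_value))
    exp_value = math.exp(raw_value)
    return exp_value / (1.0 + exp_value)
```

For other objectives, such as regression, the raw value may already be the final prediction and no transformation is needed.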


## Models trained with Categorical features

If the model was trained with categorical features present, then the application function in output code will be generated with the following interface:
@@ -53,7 +47,6 @@ If the model was trained with categorical features present, then the application
def apply_catboost_model(float_features, cat_features):
```


### Parameters

| parameter | type | description |
@@ -63,18 +56,28 @@ def apply_catboost_model(float_features, cat_features):

NOTE: Pass float and categorical features separately, each in the same order as they appeared in the training dataset. For example, if the features were f1, f2, f3, f4 and f2 and f4 were categorical, pass float_features=[f1,f3] and cat_features=[f2,f4].
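
The ordering rule in the note can be sketched as a small helper that splits one document into the two lists. The names `split_features` and `cat_feature_indices` are hypothetical; the indices are assumed to be the zero-based positions of the categorical columns in the training dataset.

```python
def split_features(row, cat_feature_indices):
    """Split one document into (float_features, cat_features).

    `split_features` and `cat_feature_indices` are illustrative names, not
    part of the generated code. Relative order is preserved in both lists.
    """
    cat_positions = set(cat_feature_indices)
    float_features = [v for i, v in enumerate(row) if i not in cat_positions]
    cat_features = [v for i, v in enumerate(row) if i in cat_positions]
    return float_features, cat_features
```

For the example in the note, `split_features([f1, f2, f3, f4], [1, 3])` yields `[f1, f3]` and `[f2, f4]`, which can then be passed to `apply_catboost_model`.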


### Return value

Prediction of the model for the document with given features, equivalent to CatBoost().predict(prediction_type='RawFormulaVal').


## Current limitations
- MultiClassification models are not supported.
- The apply_catboost_model() function is a reference implementation and may be slower than the native CatBoost applicator, especially on large models and when applied to many documents.
- [Text](https://catboost.ai/en/docs/features/text-features) and [Embeddings](https://catboost.ai/en/docs/features/embeddings-features) features are not supported.

## Troubleshooting

Q: Generated model results differ from the native model when categorical features are present
A: Please check that CityHash version 1 is used. See the exact required revision of the [Python CityHash library](https://github.com/Amper/cityhash/tree/4f02fe0ba78d4a6d1735950a9c25809b11786a56). There is also a proper CityHash implementation in the [Catboost repository](https://github.com/catboost/catboost/tree/master/library/python/cityhash). Other versions of CityHash may produce a different hash code for the same string. One option is to use the [clickhouse-cityhash](https://pypi.org/project/clickhouse-cityhash/) library:

```python
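from clickhouse_cityhash.cityhash import CityHash64


def calc_cat_feature_hash(value: str) -> int:
    # Keep the low 32 bits of the 64-bit CityHash value.
    value_hash = CityHash64(value.encode('utf-8')) % (2 ** 32)

    # Reinterpret the unsigned result as a signed 32-bit integer.
    if value_hash >= 2 ** 31:
        value_hash -= 2 ** 32

    return value_hash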
```