Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support from_sklearn for trees #689

Merged
merged 1 commit into from
Jun 3, 2024
Merged

feat: support from_sklearn for trees #689

merged 1 commit into from
Jun 3, 2024

Conversation

fd0r
Copy link
Collaborator

@fd0r fd0r commented May 21, 2024

Support from_sklearn for tree based models.

Two options:

  • Quantization from thresholds: the main idea is to consider the thresholds of the nodes of the trees for quantization to not have to use data.
  • Quantization from data: build a quantizer from the data provided by the user and quantize the thresholds based on that.

This also raises the question of non-uniform input quantization. We could quantize the data based on the thresholds thus reducing the number of bits required to log2(max_{feature}(node_{feature})).

That would leak the thresholds used in the model per feature but not the structure of the tree itself while increasing significantly the number of bits required.

We could try to automatically determine the n-bits to use to properly represent all thresholds but this might result in a very high bit-with.

@fd0r fd0r requested a review from a team as a code owner May 21, 2024 14:25
@cla-bot cla-bot bot added the cla-signed label May 21, 2024
@fd0r fd0r force-pushed the trees_from_sklearn branch 5 times, most recently from b88907c to b344aa4 Compare May 22, 2024 08:40
Copy link
Collaborator

@jfrery jfrery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I have a few comments / questions.

pyproject.toml Show resolved Hide resolved
src/concrete/ml/onnx/onnx_impl_utils.py Show resolved Hide resolved
src/concrete/ml/quantization/quantizers.py Outdated Show resolved Hide resolved
src/concrete/ml/sklearn/base.py Show resolved Hide resolved
src/concrete/ml/sklearn/base.py Outdated Show resolved Hide resolved
tests/sklearn/test_sklearn_models.py Outdated Show resolved Hide resolved
docs/advanced_examples/DecisionTreeClassifier.ipynb Outdated Show resolved Hide resolved
docs/advanced_examples/DecisionTreeClassifier.ipynb Outdated Show resolved Hide resolved
docs/advanced_examples/DecisionTreeClassifier.ipynb Outdated Show resolved Hide resolved
docs/advanced_examples/DecisionTreeClassifier.ipynb Outdated Show resolved Hide resolved
@fd0r fd0r force-pushed the trees_from_sklearn branch from b344aa4 to eaa0249 Compare May 22, 2024 09:57
@fd0r fd0r force-pushed the trees_from_sklearn branch 2 times, most recently from 016d323 to f0d2dfd Compare May 22, 2024 14:16
@fd0r fd0r requested a review from jfrery May 23, 2024 07:07
@fd0r fd0r force-pushed the trees_from_sklearn branch from f0d2dfd to 240eb7b Compare May 23, 2024 07:13
@fd0r
Copy link
Collaborator Author

fd0r commented May 23, 2024

New flaky it seems: python -m pytest --pdb --randomly-seed 2865112219 --randomly-dont-reset-seed "tests/sklearn/test_fhe_training.py::test_fit_single_target_class[True-7-30-1.0]"

@RomanBredehoft
Copy link
Collaborator

that's a test I've added recently, I might have an idea of why this happens actually, I'll let you open an issue @fd0r and I'll see if I can reproduce it

@fd0r
Copy link
Collaborator Author

fd0r commented May 23, 2024

@RomanBredehoft , I already have a fix, just adding some tests as we speak

@fd0r
Copy link
Collaborator Author

fd0r commented May 23, 2024

#692

@fd0r
Copy link
Collaborator Author

fd0r commented May 23, 2024

Quick summary of things to do here:

  • Verify the source of the degradation of the model of the notebook. And why the target changed.
  • Move most of the code to tree_to_numpy
  • Validate the move to truncate_bit_pattern
  • (EDIT) Make another notebook with importing tree-based models from scikit-learn, maybe we can regroup it with linear models from scikit-learn too?

anything else @RomanBredehoft @jfrery ?

@jfrery
Copy link
Collaborator

jfrery commented May 23, 2024

Verify the source of the degradation of the model of the notebook.

  • why have the target value changed? They never changed since we started this notebook
  • remove rf xgb from decision tree classifier notebook I suppose.

@fd0r fd0r force-pushed the trees_from_sklearn branch from 240eb7b to 5de50b5 Compare May 23, 2024 13:28
@@ -773,7 +773,7 @@ def quant(self, values: numpy.ndarray) -> numpy.ndarray:

return qvalues.astype(numpy.int64)

def dequant(self, qvalues: numpy.ndarray) -> Union[numpy.ndarray, Tracer]:
def dequant(self, qvalues: numpy.ndarray) -> Union[float, int, numpy.ndarray, Tracer]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can the following create a float, int if the input is a numpy.ndarray ?

values = self.scale * (qvalues - numpy.asarray(self.zero_point, dtype=numpy.float64))

should remain an array no ? looks like something is fishy here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that I'm cheating a bit while using this function and providing a python int instead of a numpy array

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure we would want to keepsuch change if if it's possible tho, I guess this is related to the other similar change @jfrery saw

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

risky change here, dequant should return float

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove int from the result of dequant

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer both personally, I don't see why we should allow dequant to return a float (worst, an int) and not a numpy.array !

else:
for n_bits, cml_tolerance, sklearn_tolerance in [
(max_n_bits, 0.8, 1e-5),
(reasonable_n_bits, 1.8, 1.8),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why these 2 configs ? how were they chose ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly to check that increasing the number of bits we get better results.

They thresholds were chosen by trial and error. (easiest way I find)

# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
if n_bits <= 8:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have constants like N_BITS_THRESHOLD_FOR_CRT_FHE_CIRCUITS in our tests if that's what you had in mind here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that won't really work with trees since we have rounding.

So I don't think it's possible from parameters only to known if the circuit will use the CRT representation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but what is this <= 8 then ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ho yeah I should use reasonable_n_bits here

Copy link
Collaborator

@RomanBredehoft RomanBredehoft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some new comments, but overall I agree with @jfrery's ones

Copy link

⚠️ Known flaky tests have been rerun ⚠️

One or several tests initially failed but were identified as known flaky. tests. Therefore, they have been rerun and passed. See below for more details.

Failed tests details

Known flaky tests that initially failed:

  • tests/torch/test_compile_torch.py::test_compile_torch_or_onnx_conv_networks[False-True-CNN_grouped-relu]

@fd0r fd0r force-pushed the trees_from_sklearn branch from 12f7a92 to 2db3b84 Compare May 24, 2024 11:55
@fd0r fd0r requested a review from RomanBredehoft May 24, 2024 12:35
Copy link
Collaborator

@andrei-stoian-zama andrei-stoian-zama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the comments above, can you please:


# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should use simulate here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ho yeah forgot to activate simulation mode with reasonable number of bits


# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, simulate would be better

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should show nice graphs with decision boundaries like in ClassifierComparison. We shouldn't show n_bits variation of accuracy, just show good configs. Ideally the default configs (dont set n_bits in import_sklearn) should be used and they should work well (except PTQ, where you should set n_bits, we don't have a good default for that)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a section where you import the xgboost sklearn classifier to compare to the CML xgb one?

cls,
sklearn_model: sklearn.base.BaseEstimator,
X: Optional[numpy.ndarray] = None,
n_bits: int = 8,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so 8 bits is ok? I remember it wasn't always the best choice in the graphs and we had discussed 9 bits. What is the accuracy on the regressors ?

jfrery
jfrery previously approved these changes May 27, 2024
Copy link
Collaborator

@jfrery jfrery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks!

RomanBredehoft
RomanBredehoft previously approved these changes May 27, 2024
Copy link
Collaborator

@RomanBredehoft RomanBredehoft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot !!

@fd0r fd0r dismissed stale reviews from RomanBredehoft and jfrery via e73483b May 29, 2024 11:49
@fd0r fd0r force-pushed the trees_from_sklearn branch from e73483b to b3a5b44 Compare May 29, 2024 11:50
@fd0r
Copy link
Collaborator Author

fd0r commented May 29, 2024

The notebooks can still be improved but I think I addressed most comments.

RomanBredehoft
RomanBredehoft previously approved these changes May 29, 2024
Copy link
Collaborator

@andrei-stoian-zama andrei-stoian-zama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few questions!

if model_inputs is None:
# If we have no data we can just randomly generate a dataset
assert isinstance(n_features, int)
calibration_set_size = 100_000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you really need 100 000 sampels here ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, it was to be safe when developing, I can reduce this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to 1000

"""Test `from_sklearn_model` functionnality of tree-based models."""

numpy.random.seed(0)
os.environ["TREES_USE_ROUNDING"] = str(int(use_rounding))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure this works?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it does

@fd0r
Copy link
Collaborator Author

fd0r commented Jun 3, 2024

I got to rebase on main

Support `from_sklearn` for tree based models.

Two options:
- Quantization from thresholds: the main idea is to consider the
  thresholds of the nodes of the trees for quantization to not have
  to use data.
- Quantization from data: build a quantizer from the data provided by
  the user and quantize the thresholds based on that.

This also raises the question of non-uniform input quantization.
We could quantize the data based on the thresholds thus reducing the
number of bits required to log2(max_{feature}(node_{feature})).

That would leak the thresholds used in the model per feature but not the
structure of the tree itself while increasing significantly the number
of bits required.

We could try to automatically determine the n-bits to use to
properly represent all thresholds but this might result in a very high
bit-with.

This commit also changes to comparison so that it uses truncation and
not rounding anymore.
@fd0r fd0r force-pushed the trees_from_sklearn branch from 8b7cd79 to 1f5b10e Compare June 3, 2024 13:07
Copy link

github-actions bot commented Jun 3, 2024

Coverage passed ✅

Coverage details

---------- coverage: platform linux, python 3.8.18-final-0 -----------
Name    Stmts   Miss  Cover   Missing
-------------------------------------
TOTAL    7744      0   100%

59 files skipped due to complete coverage.

@fd0r fd0r merged commit 5ca282b into main Jun 3, 2024
12 checks passed
@fd0r fd0r deleted the trees_from_sklearn branch June 3, 2024 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants