feat: support `from_sklearn` for trees #689

Conversation
Force-pushed from b88907c to b344aa4.
Thanks! I have a few comments / questions.
Force-pushed from 016d323 to f0d2dfd.
New flaky test, it seems:
That's a test I've added recently; I might actually have an idea of why this happens. I'll let you open an issue @fd0r and I'll see if I can reproduce it.
@RomanBredehoft, I already have a fix, just adding some tests as we speak.
Quick summary of things to do here:
Anything else @RomanBredehoft @jfrery?
```diff
@@ -773,7 +773,7 @@ def quant(self, values: numpy.ndarray) -> numpy.ndarray:

         return qvalues.astype(numpy.int64)

-    def dequant(self, qvalues: numpy.ndarray) -> Union[numpy.ndarray, Tracer]:
+    def dequant(self, qvalues: numpy.ndarray) -> Union[float, int, numpy.ndarray, Tracer]:
```
How can the following create a `float` or an `int` if the input is a `numpy.ndarray`?

```python
values = self.scale * (qvalues - numpy.asarray(self.zero_point, dtype=numpy.float64))
```

It should remain an array, no? Looks like something is fishy here.
The thing is that I'm cheating a bit when using this function: I'm providing a Python int instead of a numpy array.
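For illustration, a minimal standalone sketch of why a Python scalar input changes the result type (the `scale` / `zero_point` values here are made up):

```python
import numpy

scale, zero_point = 0.5, 3

# Array input: the result stays a numpy.ndarray
qvalues = numpy.array([4, 8], dtype=numpy.int64)
values = scale * (qvalues - numpy.asarray(zero_point, dtype=numpy.float64))
print(type(values))  # <class 'numpy.ndarray'>

# Plain Python int input: NumPy collapses the 0-d result to a scalar,
# and numpy.float64 is a subclass of Python's float
qvalues = 4
values = scale * (qvalues - numpy.asarray(zero_point, dtype=numpy.float64))
print(type(values))  # <class 'numpy.float64'>
```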
Not sure we would want to keep such a change if it's possible to avoid it, though. I guess this is related to the other similar change @jfrery saw.
risky change here, dequant should return float
I can remove int from the result of dequant
I would prefer both personally; I don't see why we should allow `dequant` to return a `float` (worse, an `int`) and not a `numpy.ndarray`!
```python
else:
    for n_bits, cml_tolerance, sklearn_tolerance in [
        (max_n_bits, 0.8, 1e-5),
        (reasonable_n_bits, 1.8, 1.8),
```
Why these 2 configs? How were they chosen?
Mainly to check that increasing the number of bits gives better results. The thresholds were chosen by trial and error (the easiest way I found).
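For context, a rough sketch of how such tolerances would typically be asserted (hypothetical helper, not the test's actual code):

```python
import numpy

def check_predictions(fhe_preds, cml_preds, sklearn_preds, cml_tol, sklearn_tol):
    # FHE predictions must stay within cml_tol of the clear Concrete ML model...
    numpy.testing.assert_allclose(fhe_preds, cml_preds, atol=cml_tol)
    # ...and within sklearn_tol of the original scikit-learn model
    numpy.testing.assert_allclose(fhe_preds, sklearn_preds, atol=sklearn_tol)
```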
tests/sklearn/test_sklearn_models.py (Outdated)
```python
# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
if n_bits <= 8:
```
We have constants like `N_BITS_THRESHOLD_FOR_CRT_FHE_CIRCUITS` in our tests, if that's what you had in mind here.
Actually that won't really work with trees since we have rounding, so I don't think it's possible to know from parameters only whether the circuit will use the CRT representation.
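A rough illustration of the point, with made-up numbers (the threshold value is an assumption, and `lsbs_to_remove` is only known once the tree has been analyzed):

```python
# Assumed test constant: above this precision, circuits switch to the CRT representation
N_BITS_THRESHOLD_FOR_CRT_FHE_CIRCUITS = 9

n_bits = 10         # configured precision
lsbs_to_remove = 3  # bits dropped by rounding; data/tree dependent

# Looking at n_bits alone suggests CRT, but after rounding the effective
# precision stays below the threshold
effective_n_bits = n_bits - lsbs_to_remove
print(n_bits > N_BITS_THRESHOLD_FOR_CRT_FHE_CIRCUITS)            # True
print(effective_n_bits > N_BITS_THRESHOLD_FOR_CRT_FHE_CIRCUITS)  # False
```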
But what is this `<= 8` then?
Oh yeah, I should use `reasonable_n_bits` here.
Added some new comments, but overall I agree with @jfrery's ones.
In addition to the comments above, can you please:
- improve the ImportingFromScikitLearn notebook to make it work exactly like the ClassifierComparison one
- add a comparison with an imported XGBRegressor in the https://github.com/zama-ai/concrete-ml/blob/main/docs/advanced_examples/XGBRegressor.ipynb notebook
```python
# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
```
I think you should use simulate here
Oh yeah, I forgot to activate simulation mode with a reasonable number of bits.
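That is, something along these lines in the quoted test (a sketch; `concrete_model`, `x` and `reasonable_n_bits` come from the surrounding test code):

```python
# Simulate FHE execution for reasonable bit-widths instead of disabling FHE entirely
mode = "simulate" if n_bits <= reasonable_n_bits else "disable"
predictions = concrete_model.predict(x, fhe=mode)
```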
```python
# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
```
again, simulate would be better
same
Here we should show nice graphs with decision boundaries like in ClassifierComparison. We shouldn't show the accuracy variation across `n_bits`, just good configs. Ideally the default configs (don't set `n_bits` in `import_sklearn`) should be used, and they should work well (except for PTQ, where you should set `n_bits`; we don't have a good default for that).
Can you add a section where you import the XGBoost sklearn classifier to compare it to the Concrete ML XGB one?
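For reference, a hedged sketch of what such a section could look like, based on the `from_sklearn_model` signature visible in this PR (the exact API may differ):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

from concrete.ml.sklearn import XGBClassifier as ConcreteXGBClassifier

# Train a plain XGBoost classifier
x, y = make_classification(n_samples=1000, n_features=10, random_state=0)
sklearn_model = XGBClassifier(n_estimators=50, max_depth=4)
sklearn_model.fit(x, y)

# Import it into Concrete ML and compare predictions in simulation
concrete_model = ConcreteXGBClassifier.from_sklearn_model(sklearn_model, X=x, n_bits=8)
concrete_model.compile(x)
agreement = (concrete_model.predict(x, fhe="simulate") == sklearn_model.predict(x)).mean()
print(f"Prediction agreement: {agreement:.2%}")
```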
src/concrete/ml/sklearn/base.py (Outdated)
```python
cls,
sklearn_model: sklearn.base.BaseEstimator,
X: Optional[numpy.ndarray] = None,
n_bits: int = 8,
```
So 8 bits is OK? I remember it wasn't always the best choice in the graphs, and we had discussed 9 bits. What is the accuracy on the regressors?
Looks good to me. Thanks!
Thanks a lot!!

The notebooks can still be improved, but I think I addressed most comments.
Looks good, just a few questions!
```python
if model_inputs is None:
    # If we have no data we can just randomly generate a dataset
    assert isinstance(n_features, int)
    calibration_set_size = 100_000
```
Do you really need 100,000 samples here?
Probably not, it was to be safe when developing. I can reduce this.
Changed it to 1000
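For reference, a sketch of the fallback being discussed (an illustrative, hypothetical helper, not the PR's exact code):

```python
import numpy

def random_calibration_set(n_features: int, size: int = 1000) -> numpy.ndarray:
    # Without user data, randomly generate a calibration set; 1000 samples
    # is enough to cover the quantization range in practice
    return numpy.random.uniform(low=-1.0, high=1.0, size=(size, n_features))

model_inputs = random_calibration_set(n_features=10)
```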
"""Test `from_sklearn_model` functionnality of tree-based models.""" | ||
|
||
numpy.random.seed(0) | ||
os.environ["TREES_USE_ROUNDING"] = str(int(use_rounding)) |
are you sure this works?
Yes it does. I got to rebase on
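As to why setting the variable inside the test works: if the flag is read at call time rather than cached at import time, the assignment takes effect immediately. A hypothetical helper, for illustration only:

```python
import os

def trees_use_rounding() -> bool:
    # Read the flag lazily, at call time
    return bool(int(os.environ.get("TREES_USE_ROUNDING", "1")))

os.environ["TREES_USE_ROUNDING"] = "0"
print(trees_use_rounding())  # False
```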
Support `from_sklearn` for tree-based models. Two options:
- Quantization from thresholds: the main idea is to quantize using the thresholds of the trees' nodes, so no data is needed.
- Quantization from data: build a quantizer from the data provided by the user and quantize the thresholds based on that.

This also raises the question of non-uniform input quantization. We could quantize the data based on the thresholds, thus reducing the number of bits required to log2(max_feature(node_feature)). That would leak the per-feature thresholds used in the model, but not the structure of the tree itself, while significantly increasing the number of bits required. We could try to automatically determine the n-bits needed to properly represent all thresholds, but this might result in a very high bit-width.

This commit also changes the comparison so that it uses truncation and not rounding anymore.
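To make the first option concrete, an illustrative standalone sketch of building a uniform quantizer from split thresholds alone (not the PR's actual implementation):

```python
import numpy

def quantizer_from_thresholds(thresholds: numpy.ndarray, n_bits: int):
    """Build a uniform quantizer that covers every split threshold, no data needed."""
    t_min, t_max = float(thresholds.min()), float(thresholds.max())
    scale = (t_max - t_min) / (2**n_bits - 1)

    def quant(values: numpy.ndarray) -> numpy.ndarray:
        # Map values onto the integer grid spanned by the thresholds
        q = numpy.round((values - t_min) / scale)
        return numpy.clip(q, 0, 2**n_bits - 1).astype(numpy.int64)

    return quant

# Toy example: quantize inputs against the thresholds of a small tree
thresholds = numpy.array([0.1, 0.5, 2.3, 7.8])
quant = quantizer_from_thresholds(thresholds, n_bits=8)
print(quant(numpy.array([0.0, 1.0, 8.0])))  # [  0  30 255]
```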
Coverage passed ✅