builtins functions are not working when try to save pipeline #371

Closed
aliyilmaz61 opened this issue Feb 23, 2023 · 4 comments

Comments

@aliyilmaz61

aliyilmaz61 commented Feb 23, 2023

Hi,
I want to use DataFrameMapper and ExpressionTransformer for my preprocessing steps. The XGBoost model has 25 input features, but I won't go through the whole process.
I am having trouble only with built-in functions in ExpressionTransformer.
Other features created with ExpressionTransformer can be saved.

Let me put some example:

mapper_1 = DataFrameMapper([(["A","B"],Alias(ExpressionTransformer("(numpy.nan if X[1]== 0 else X[0]/X[1])",dtype = np.float64),"1",prefit=True),{'alias': 'C'})],df_out=True)
                                 
pipeline_1 = PMMLPipeline(
    steps=[
        ('preprocessor_1',mapper_1)
        ])
sklearn2pmml(pipeline_1,"pipeline_1.pmml") 

When I do not use any built-in function, as in mapper_1, the pipeline can be saved as PMML.

mapper_2 = DataFrameMapper([(["A","B"],Alias(ExpressionTransformer("max(X[0], X[1])",dtype = np.float64),"1",prefit=True),{'alias': 'C'})],df_out=True)
                                 
pipeline_2 = PMMLPipeline(
    steps=[
        ('preprocessor_1',mapper_2)
        ])
sklearn2pmml(pipeline_2,"pipeline_2.pmml") 

But when I use a built-in function like max, str, or float, sklearn2pmml gives me an error.

Standard output is empty
Standard error:
Exception in thread "main" java.lang.IllegalArgumentException: Function 'builtins.max' is not supported

Another example:

mapper_3 = DataFrameMapper([(["A","B"],Alias(ExpressionTransformer("(float(str(X[0])[:2])*12 + float(str(X[0])[2:4]) - X[1])",dtype = np.float64),"1",prefit=True),{'alias': 'C'})],df_out=True)
                                 
pipeline_3 = PMMLPipeline(
    steps=[
        ('preprocessor_1',mapper_3)
        ])
sklearn2pmml(pipeline_3,"pipeline_3.pmml") 

Standard output is empty
Standard error:
Exception in thread "main" java.lang.IllegalArgumentException: Function 'builtins.str' is not supported
	at org.jpmml.python.FunctionUtil.encodePythonFunction(FunctionUtil.java:104)

How can I handle this situation?

@vruusmann
Member

You can do aggregation using the special-purpose sklearn2pmml.preprocessing.Aggregator transformer.

For the binary case, you can express builtins.max simply like this: X[0] if X[0] > X[1] else X[1].
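As a quick sanity check outside of sklearn2pmml, the conditional form can be compared against the built-in max in plain Python. This is only a sketch: `binary_max` is a hypothetical helper name, and each tuple stands in for one row of the two input columns.

```python
def binary_max(X):
    """Two-value maximum written as a conditional expression (no builtins.max)."""
    return X[0] if X[0] > X[1] else X[1]

# Compare against the built-in on a few rows, including a tie and negatives.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (-1.5, -2.5)]
results = [binary_max(row) for row in rows]
expected = [max(row) for row in rows]
print(results == expected)
```

Only the body of the conditional would go into the ExpressionTransformer string; the helper exists just to make the comparison easy to run.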

As for value conversion functions such as builtins.str and builtins.float, there's a separate issue about them:
jpmml/jpmml-python#20

Looking at the business logic of your mapper_3 makes me think that such value conversion functions shouldn't be implemented at all. Whatever you're trying to accomplish there, try to think the algorithm through, and get it implemented using pure mathematical operators. Converting numbers to strings, then taking substrings, and then converting back to numbers is neither efficient nor elegant.
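To illustrate the arithmetic route for mapper_3: if X[0] is assumed to be a non-negative four-digit integer whose first two and last two digits are separate fields (an assumption, since the issue does not say what the column encodes), both substrings can be recovered with a floor of a division by 100 and a subtraction. The sketch below checks this in plain Python using `math.floor`. The function names are hypothetical, and whether a given floor function is accepted inside ExpressionTransformer should be checked against the functions the converter supports.

```python
import math

def via_strings(x0, x1):
    """The original approach: number -> string -> substrings -> numbers."""
    return float(str(x0)[:2]) * 12 + float(str(x0)[2:4]) - x1

def via_arithmetic(x0, x1):
    """Same result using only floor, division, multiplication and subtraction."""
    head = math.floor(x0 / 100)   # first two digits of a four-digit integer
    tail = x0 - head * 100        # last two digits
    return head * 12 + tail - x1

# Exhaustive check over every four-digit integer (the assumed input range).
mismatches = [x for x in range(1000, 10000)
              if via_strings(x, 7) != via_arithmetic(x, 7)]
print(len(mismatches))
```

The equivalence only holds under the four-digit assumption; inputs with fewer digits, negative values, or missing values would need separate handling.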

Alternatively, consider using sklearn2pmml.preprocessing.DateTimeFormatter or sklearn2pmml.preprocessing.NumberFormatter transformations.

@aliyilmaz61
Author

Thank you, vruusmann, but I do not think there is a direct solution to the problem I described, or to the different problems that may arise in the future.

Because I used a mapper that takes the maximum of two numbers as an example, you offered a simple conditional expression, but my actual problem was to get the maximum of four numbers. Writing that with if statements is not elegant.
I used numpy.fmax instead of the built-in max, and that solved it.
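For more than two inputs, numpy.fmax calls can be nested pairwise. Here is a minimal sketch that only builds the expression string for n columns, so the nesting does not have to be typed by hand. `nested_fmax_expression` is a hypothetical helper; it does not call sklearn2pmml or NumPy itself.

```python
from functools import reduce

def nested_fmax_expression(n_columns):
    """Build a left-nested numpy.fmax expression over X[0] .. X[n_columns - 1]."""
    if n_columns < 2:
        raise ValueError("need at least two columns")
    terms = ["X[{}]".format(i) for i in range(n_columns)]
    # Fold the terms pairwise: fmax(fmax(fmax(X[0], X[1]), X[2]), X[3]) ...
    return reduce(lambda acc, term: "numpy.fmax({}, {})".format(acc, term), terms)

expr = nested_fmax_expression(4)
print(expr)
```

The resulting string could then be passed as the first argument to ExpressionTransformer, in place of the max(...) expression.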

For the second example, instead of offering alternatives to functions such as round, str, float, and int, you suggested mathematical operations, but I don't think rewriting all of these functions as mathematical operations is an elegant solution.
I think we can find alternatives to the built-in functions among NumPy's mathematical functions, such as numpy.floor and numpy.rint.

@vruusmann
Member

> I think we can find alternative solutions to builtin functions with numpy mathematical functions like numpy.floor, numpy.rint etc.

The list of supported functions is here:
https://github.com/jpmml/jpmml-python/blob/1.1.12/pmml-python/src/main/java/org/jpmml/python/FunctionUtil.java#L90-L114

Indeed, there are close to 50 functions to choose from, mostly Numpy.

Also, you should pay attention to "missing value awareness" of chosen functions. IIRC, the min function is not missing value aware, so it cannot be used when there's a chance of numpy.NaN values floating around.
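The underlying concern can be seen in plain Python, independent of what the converter does: every comparison against NaN is false, so the built-in min returns a result that depends on argument order. The sketch below shows that, plus a hypothetical `nan_aware_min` helper that skips a missing operand. It illustrates the idea only and says nothing about how any particular PMML function treats missing values.

```python
import math

nan = float("nan")

# Built-in min keeps its first argument unless a later one compares smaller,
# and every comparison with NaN is False, so the result is order-dependent.
first_nan = min(nan, 1.0)    # stays NaN
second_nan = min(1.0, nan)   # stays 1.0

def nan_aware_min(a, b):
    """Two-value minimum that ignores a NaN operand (NaN only if both are NaN)."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a < b else b

print(first_nan, second_nan, nan_aware_min(nan, 1.0))
```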

@aliyilmaz61
Author

Thank you for sharing. It will be useful indeed.
