Problems when inputting values for date/datetime fields #16
I am concerned about the issue of a new (jpmml/sklearn2pmml#357) I used Then save pmml again and use JPMML-Evaluator-Python to read the model for prediction Now, instead of prompting the previous error, it prints another error
I probably know what this error means; presumably there is a problem with the string conversion? Could something be wrong with the following code?
def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 and X[0][0:8] < '20221230' else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))
def make_day_id_pipeline():
    return make_pipeline(ExpressionTransformer("X[1][:4] + '-'+ X[1][4:6] + '-' + X[1][6:8]"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))
def make_feature_union():
    return FeatureUnion([
        ("modify_date", make_modify_date_pipeline()),
        ("day_id", make_day_id_pipeline())])
But I need to emphasize that the custom functions above work well in the pipeline: my pipeline is completely correct and it predicts the correct result. It seems I am back to the previous problem: "My pipeline works fine, I just converted the pipeline to a PMML file and it doesn't work!" So I don't know whether this is a problem with sklearn2pmml or with JPMML-Evaluator-Python. Could you please help me look into it? |
Your PMML declares that all 60 input fields are of You have to re-declare the relevant input fields so that implicit value conversion would be possible. Alternatively, you may implement custom conversion using some |
TLDR: You cannot represent (prospective-)
Your input values are something like Do you now see where this |
No, this issue is totally unrelated to that. Your pipeline works in Python, because Python performs very liberal type casts. Your pipeline would not work in any strict and statically typed programming language (such as PMML), because the necessary type casts could possibly add or remove precision pretty much randomly. In other words, this is legal in Python, but not in other languages:
# A float magically becomes a date, WTAF?
day_id = asdate(2.0221031E7)
The SkLearn2PMML package provides so-called domain decorator classes (inside the The following might help:
mapper = DataFrameMapper([
    # THIS: First specify 'modify_date', then specify 'day_id'
    (['modify_date','day_id'], [MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain())]),
    (['modify_date','day_id'], [make_feature_union(), ExpressionTransformer("X[1] - X[0]")])
]) |
In other words, Python is like Microsoft Excel, which auto-converts everything into a date/datetime. |
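To illustrate the point about type casts: the value `2.0221031E7` quoted above is how the number 20221031 can end up being rendered once a field is treated as floating-point. The sketch below uses only the Python standard library (the helper name `reformat_yyyymmdd` is made up for illustration) and applies the same slicing expression used in the pipelines to both renderings.

```python
from datetime import date

def reformat_yyyymmdd(text):
    """Apply the thread's slicing expression: YYYYMMDD -> YYYY-MM-DD."""
    return text[:4] + "-" + text[4:6] + "-" + text[6:8]

as_string = "20221031"     # the value as a genuine string
as_double = "2.0221031E7"  # the same number rendered in scientific notation

good = reformat_yyyymmdd(as_string)
bad = reformat_yyyymmdd(as_double)

print(good)
print(bad)

# The first result parses as an ISO date; the second one does not.
parsed = date.fromisoformat(good)
try:
    date.fromisoformat(bad)
    bad_parses = True
except ValueError:
    bad_parses = False
print(parsed, bad_parses)
```

The slicing only makes sense when the evaluator really receives an eight-character (or longer) digit string, which is why the declared field type matters.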
You should actually combine these two lines into one:
mapper = DataFrameMapper([
    (['modify_date','day_id'], [MultiDomain(ContinuousDomain(dtype = numpy.int64), DateDomain()), make_feature_union(), ExpressionTransformer("X[1] - X[0]")]),
]) |
Thank you very much for your answer. I probably know the reason (although I am not quite clear how to solve it). As an algorithm engineer, I don't pay much attention to these underlying data structure issues. I learned a lot from your reply. I hear a lot about Python's dynamic typing, or how not specifying a type can be a disaster, and I think that might be the case.
I will deal with this as you suggested, it seems that all columns like '20200909' need a type designation? Anyway, I'm going to try it for myself first! |
I modified the code as follows. Unfortunately, even the pipeline doesn't work anymore
here is the error code
|
I tried to insert code in various places, but nothing worked.
The following error is always displayed
|
I think I may have found the problem. In my opinion, I don't know what transformer would convert these two columns to string format, though. But I tried to format these two columns in the pmml file in the same format as the other columns
Now, importing the pmml file for the prediction issues another error!
Could you tell me how to do it? I want to repeat my requirements again. The I just need to calculate their time difference ( Of course, there are some other restrictions for modify_date, such as it cannot be empty and cannot be greater than 20221231, which is why the following if code exists
I really need your help! |
The So, the correct syntax would be like this (one child decorator per column - one for
decorator = MultiDomain([ContinuousDomain(), DateDomain()]) |
We've discussed this situation before - comparing one string with another using comparison operators like
my_date = "20221031"
if my_date < "20221101":
    print("Date is earlier than 1st of November, 2022")
I remember commenting that I would expect to see a type check error being thrown... I can't find my comment, but this is exactly the kind of exception that I was hoping to see. |
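As a side illustration of why comparing date-like strings is fragile even in Python: strings are compared character by character, so the result agrees with calendar order only when both operands share the same fixed-width, zero-padded layout. A minimal sketch using only the standard library follows; the mixed-layout value is invented for contrast and does not come from the original data.

```python
from datetime import datetime

# Same fixed-width YYYYMMDD layout: lexicographic order matches calendar order.
same_layout = "20221031" < "20221101"

# Different layouts: the comparison still returns a value, but it is misleading.
# 5 November 2022 is later than 31 October 2022, yet '-' sorts before '1'.
mixed_layout = "2022-11-05" < "20221031"

# Parsing first and comparing real date values gives the intended answer.
nov_5 = datetime.strptime("2022-11-05", "%Y-%m-%d").date()
oct_31 = datetime.strptime("20221031", "%Y%m%d").date()
parsed_order = nov_5 < oct_31

print(same_layout, mixed_layout, parsed_order)
```

This says nothing about what PMML permits; it only shows that a string comparison is not a date comparison.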
They are both strings that match pattern We can use
string_reformatter = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8]")
However, it is possible that When working with strings, then you can only implement the first part of the above clause (ie. string is empty/not empty). You cannot do the second part, because the comparison operator
modify_date_reformatter = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 else '2022-12-30'")
day_id_reformatter = ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8]")
After reformatting, you can cast them to The final exercise is about sanitizing Doing the final arithmetic:
days_difference = ExpressionTransformer("(X[1] - X[0]) if X[0] <= 365 else (X[1] - 365)")
Can probably be rearranged into:
days_difference = ExpressionTransformer("X[1] - numpy.min(X[0], 365)") |
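The proposed rearrangement can be checked in plain Python. The sketch below uses the built-in `min` on scalars to confirm that the conditional form and the "subtract the smaller of X[0] and 365" form agree; it does not exercise `ExpressionTransformer` itself, and the function names are hypothetical. One caveat worth keeping in mind: in NumPy, the second positional argument of `numpy.min` is `axis`, and the element-wise two-argument minimum is `numpy.minimum`. Which spelling the expression translator accepts is not verified here.

```python
def days_difference_conditional(modify_days, day_id_days):
    # Direct transcription of "(X[1] - X[0]) if X[0] <= 365 else (X[1] - 365)".
    return (day_id_days - modify_days) if modify_days <= 365 else (day_id_days - 365)

def days_difference_min(modify_days, day_id_days):
    # Rearranged form: cap X[0] at 365, then subtract.
    return day_id_days - min(modify_days, 365)

# (X[0], X[1]) pairs on both sides of the 365-day threshold.
samples = [(0, 10), (176, 194), (365, 400), (366, 400), (5000, 364)]

agree = all(
    days_difference_conditional(x0, x1) == days_difference_min(x0, x1)
    for x0, x1 in samples
)
print(agree)
```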
Thank you very much. I think I understand exactly what you mean. I used the code you provided recently and it works very well on part of the dataset, thank you very much. However, it also reports an error when a date lies too far in the future. Let me get straight to the point and model the following data:
def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))
def make_day_id_pipeline():
    return make_pipeline(ExpressionTransformer("X[1][:4] + '-' + X[1][4:6] + '-' + X[1][6:8]"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022))
def make_feature_union():
    return FeatureUnion([
        ("modify_date", make_modify_date_pipeline()),
        ("day_id", make_day_id_pipeline())])
mapper_encode = [(['modify_date','day_id'],[make_feature_union(), ExpressionTransformer("(X[1] - X[0]) if (X[0] <= 365 and X[1]>X[0]) else -1")],{'alias':'modify_days'})]
mapper = DataFrameMapper(mapper_encode, input_df=True,df_out=True)
data_test = pd.DataFrame({
    'modify_date':['20220626223702','20220629204300','20220602000000'],
    'day_id':['20220714','20220715','20220914']
})
Now, with mapper on data_test, it works fine.
However, if you change a day_id to 2999, you will get an error:
data_test_new = pd.DataFrame({
    'modify_date':['20220626223702','20220629204300','20220602000000'],
    'day_id':['20220714','29991231','20221231']
})
mapper.fit_transform(data_test_new)
here is the error code
I definitely know that the error is caused by this 2999, but I don't know how to deal with it. In fact, I can understand the error and I searched the error code and found many solutions, but they are all based on the pandas function. Based on my previous experience, I don't know whether these methods can be supported or not. Since no relevant posts have such problems when using sklearn2pmml, I need your help. I wonder if CastTransformer caused the problem and if CastTransformer has a parameter that can change a value like 2099 to a specified value. |
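For reference, the arithmetic that this pipeline is meant to perform can be emulated with the standard library alone. This is a sketch under the assumption that `DaysSinceYearTransformer(year = 2022)` counts whole days since 2022-01-01; the helper names are hypothetical, and nothing here goes through sklearn2pmml or pandas.

```python
from datetime import date

def days_since_2022(text, default="2022-12-30"):
    # Same reformatting rule as the pipelines: first 8 characters are YYYYMMDD;
    # an empty string falls back to the default date.
    iso = (text[:4] + "-" + text[4:6] + "-" + text[6:8]) if len(text) > 0 else default
    return (date.fromisoformat(iso) - date(2022, 1, 1)).days

def modify_days(modify_date, day_id):
    x0 = days_since_2022(modify_date)
    x1 = days_since_2022(day_id)
    # Same rule as the final ExpressionTransformer in mapper_encode.
    return (x1 - x0) if (x0 <= 365 and x1 > x0) else -1

rows = [
    ("20220626223702", "20220714"),
    ("20220629204300", "20220715"),
    ("20220602000000", "20220914"),
]
result = [modify_days(m, d) for m, d in rows]
print(result)

# The standard-library date type accepts the year 2999 without complaint.
far_future = modify_days("20220629204300", "29991231")
print(far_future)
```

Because the plain `date` type handles 2999-12-31, the emulation suggests that the failure lies in how the value is parsed into a timestamp, not in the day arithmetic.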
This error happens in the Python side, inside the Pandas library. It refuses to accept
Does the Pandas parse succeed when you omit this obviously incorrect value element? Perhaps Pandas also contains some data sanitization code that accepts Perhaps Pandas would try harder if it was given an ISO 8601-like date string like |
Sanitize both your If the Pandas library refuses to parse Write a unit test for all possible combinations that you have tried. Right now you seem to be struggling with code pieces that were working OK before. |
It looks like you can use pandas to convert, because the following code executes correctly:
pd.to_datetime(pd.DataFrame(['20991231'])[0], errors = 'coerce')
--------------
0   2099-12-31
Name: 0, dtype: datetime64[ns]
I think this goes back to the fact that int can't be used. The code below works fine because I used
Unfortunately, an error occurred while converting to pmml, prompting
I am about to collapse. I thought this was a very simple task, but I really have been unable to complete it! |
Actually, my idea is simple. All I need to do is add a condition somewhere in the code below (which should be the original location) to change 29991231 to 20221230. But no matter how I tried, I couldn't succeed. Even when it succeeds, it cannot be converted to PMML. I am in the process of converting my company's algorithm models to PMML, and I almost broke down over this small detail!
def make_modify_date_pipeline():
    return make_pipeline(ExpressionTransformer("X[0][:4] + '-' + X[0][4:6] + '-' + X[0][6:8] if len(X[0]) > 0 else '2022-12-30'"), CastTransformer(dtype = "datetime64[D]"), DaysSinceYearTransformer(year = 2022)) |
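The rule described here (fall back to 2022-12-30 when the value is empty, malformed, or later than 2022-12-31) can be written down in plain Python as follows. This is only a sketch of the intended logic using the standard library; the function name and its parameters are hypothetical, and the code is not expressed in a form that sklearn2pmml could convert to PMML.

```python
from datetime import date

def sanitize_date(text, default=date(2022, 12, 30), upper=date(2022, 12, 31)):
    """Parse the leading YYYYMMDD of `text`; return `default` if it is unusable."""
    if len(text) < 8:
        return default
    try:
        parsed = date(int(text[:4]), int(text[4:6]), int(text[6:8]))
    except ValueError:
        # Non-numeric characters or an impossible calendar date.
        return default
    # Compare as dates, not as strings.
    return parsed if parsed <= upper else default

for raw in ["20220626223702", "29991231", "", "2022abcd"]:
    print(repr(raw), sanitize_date(raw))
```

Sanitizing values in this way before they reach the mapper is one option for keeping out-of-range dates away from the cast step.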
I'm using a very stupid method now Is the use of
Now it's finally working fit_transform no problem! No problem converting pmml! However, when invoked, the following error is still displayed
I clearly have not according to your instructions, to solve the problem, why is it still like this! I'm falling apart! |
This will cause an error, I have upgraded to the latest version
|
I've tried every transformer in sklearn2pmml.decoration. Anyway, I finally found a method that allowed me to convert pmml files successfully and also work with java calls. Just add the following code at the beginning
So that's it
I don't know why it works. I've spent so much time on it that I don't have the energy to figure out why it works. But with the addition of this one piece of code, my system worked. Anyway, I want to thank you! Thank you for developing such a great package! |
PMML operates similarly to Here's my unit test:
# Fails with pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2999-12-31 00:00:00 present at position 0
pandas.to_datetime("29991231", errors = "raise")
# Succeeds, kind of. The result is NaT
pandas.to_datetime("29991231", errors = "coerce") |
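The out-of-bounds error is consistent with how nanosecond-resolution timestamps are stored: a signed 64-bit count of nanoseconds since 1970-01-01 cannot reach the year 2999. The following standard-library sketch derives the approximate upper limit without importing pandas (microsecond precision is used because `datetime` does not store nanoseconds).

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1  # largest signed 64-bit integer, read as nanoseconds
epoch = datetime(1970, 1, 1)

# Convert nanoseconds to whole microseconds, the finest unit timedelta supports.
latest = epoch + timedelta(microseconds=INT64_MAX // 1000)
print(latest)

fits_2099 = datetime(2099, 12, 31) <= latest
fits_2999 = datetime(2999, 12, 31) <= latest
print(fits_2099, fits_2999)
```

This matches the behaviour reported in the thread: `20991231` parses, while `29991231` raises with `errors = "raise"` and becomes `NaT` with `errors = "coerce"`.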
It's impossible to use inline cast functions such as That's a clever "hack", trying to replace The inline cast is blocked because of this: |
Did you see #16 (comment)? I told you that the You're passing two child decorators, without wrapping them into a list. Of course it won't work. |
You could use |
Did you see #16 (comment)? If you format Now, if you format |
Marking as "resolved". The troubled user still doesn't appear to grasp the functional difference between integer and floating-point value spaces (one of them is suitable for emulating dates/datetimes, the other is not), but it's beyond my capacity to provide the necessary education here. I'm sure life will teach him well! |
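One way to see why the numeric type matters for date-like values such as 20221031 (an illustration only, and not necessarily the exact argument intended above): a 32-bit float has a 24-bit significand, so it cannot represent every eight-digit integer exactly, whereas an integer always holds the value unchanged. The helper `to_float32` is a made-up name that round-trips a number through single precision using the standard library.

```python
import struct

def to_float32(value):
    # Pack as an IEEE 754 single-precision float, then unpack again.
    return struct.unpack("f", struct.pack("f", float(value)))[0]

day_id = 20221031  # YYYYMMDD encoded as an integer

as_float32 = to_float32(day_id)
as_float64 = float(day_id)

float32_exact = (as_float32 == day_id)
float64_exact = (as_float64 == day_id)

print(as_float32, float32_exact)
print(as_float64, float64_exact)
```

Even where a floating-point value is numerically exact, its textual rendering (for example `2.0221031E7`) no longer looks like `YYYYMMDD`, so string slicing on it breaks.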
Hello Villu
I'm sorry that I still need your help to troubleshoot a problem predicted by pmml
Last week, I successfully converted my Python model to pmml.
When I used pypmml to call the prediction, I found that the prediction value was inaccurate. Of course, I followed your instructions and installed JPMML-Evaluator-Python
However, when I used JPMML-Evaluator-Python, it didn't work properly and just reported an error
Here is my code, written according to the readme prompt
Here is the error code
I tried to analyze the problem by myself, and it seemed that the data format was wrong
However, none of the columns in my input need a date format, nor is a date format needed anywhere else. Previously, predicting with the pipeline on the same data worked fine (I don't know if you still remember the detailed requirements I mentioned in [sklearn2pmml # 356] (jpmml/sklearn2pmml#356))
I also checked my pmml file and it looks correct as well, none of the 60 features required are date columns
So I can't tell what the problem is.
The only thing I can think of is maybe the problem is not in the input but in the output?
Because I am in order to avoid an error (similar to ), added that one line of code
I don't know whether this is the cause of the problem, in a word, could you help me to make a simple analysis