Refactoring the presentation of pandas_categorical section in model files #1201

Closed
vruusmann opened this issue Jan 14, 2018 · 7 comments

@vruusmann

If a LightGBM model was trained on a pandas.DataFrame that contains categorical columns, then the last section of the model file is a pandas_categorical section.

The problem is twofold: 1) this section is formatted differently from other feature-related sections (e.g. the feature_importance section), and 2) the current representation (a list of lists) is difficult for outside applications to parse.

For example, consider the Auto-MPG dataset.

Current presentation (https://github.com/jpmml/jpmml-lightgbm/blob/master/src/test/resources/lgbm/RegressionAuto.txt#L574):

pandas_categorical:[[3, 4, 5, 6, 8], [70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], [1, 2, 3]]
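For reference, here is a minimal Python sketch of reading the current single-line form. It assumes the text after the `pandas_categorical:` prefix is always a valid JSON array, which holds for both examples in this issue but is not confirmed here as a format guarantee. `parse_pandas_categorical` is a hypothetical helper, not part of LightGBM.

```python
import json

PREFIX = "pandas_categorical:"

def parse_pandas_categorical(line):
    """Return the list of category lists from a pandas_categorical line.

    Assumption: everything after the prefix is a JSON array of arrays.
    """
    if not line.startswith(PREFIX):
        raise ValueError("not a pandas_categorical line")
    return json.loads(line[len(PREFIX):])

line = "pandas_categorical:[[3, 4, 5, 6, 8], [1, 2, 3]]"
categories = parse_pandas_categorical(line)
print(categories)
```

Even when the line parses cleanly, the result carries no feature names, so a consumer has to match list positions to columns by some other means.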

Refactored presentation:

pandas_categorical:
cylinders=3:4:5:6:8
model_year=70:71:72:73:74:75:76:77:78:79:80:81:82
origin=1:2:3
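The refactored form above could be produced from the current list of lists roughly as follows. This is only a sketch of the proposal: `to_multiline` is a hypothetical helper, and it assumes the categorical feature names are available in the same order as the category lists.

```python
def to_multiline(feature_names, categories, sep=":"):
    """Render the proposed multi-line pandas_categorical section.

    Assumption: feature_names[i] is the column that categories[i] belongs to.
    """
    if len(feature_names) != len(categories):
        raise ValueError("need exactly one name per category list")
    lines = ["pandas_categorical:"]
    for name, values in zip(feature_names, categories):
        lines.append(name + "=" + sep.join(str(v) for v in values))
    return "\n".join(lines)

section = to_multiline(
    ["cylinders", "origin"],
    [[3, 4, 5, 6, 8], [1, 2, 3]],
)
print(section)
```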

Another example of a very messy pandas_categorical section (https://github.com/jpmml/jpmml-lightgbm/blob/master/src/test/resources/lgbm/ClassificationAudit.txt#L581):

pandas_categorical:[["Consultant", "PSFederal", "PSLocal", "PSState", "Private", "SelfEmp", "Volunteer"], ["Associate", "Bachelor", "College", "Doctorate", "HSgrad", "Master", "Preschool", "Professional", "Vocational", "Yr10", "Yr11", "Yr12", "Yr1t4", "Yr5t6", "Yr7t8", "Yr9"], ["Absent", "Divorced", "Married", "Married-spouse-absent", "Unmarried", "Widowed"], ["Cleaner", "Clerical", "Executive", "Farming", "Home", "Machinist", "Military", "Professional", "Protective", "Repair", "Sales", "Service", "Support", "Transport"], ["Female", "Male"]]

Would changing the model file data format also necessitate updating the version number?

@guolinke
Collaborator

@StrikerRUS @wxchan Do you have time to implement this feature?

@vruusmann
Author

The two main requirements are:

  1. Split the current single-line representation into a multi-line representation. Each line should follow the format <feature name>=<list of feature values in the original value space>.
  2. Figure out a good list-of-values representation. My comment above suggests separating individual values with the colon character :. This works with numeric values, but not with string values (imagine a case where the original string feature value itself contains a colon).
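To illustrate the second requirement, the sketch below first shows the colon ambiguity and then one possible way around it: keep the `<feature name>=` prefix but JSON-encode the value list. This is an illustration only, not a proposal made in this thread. `encode_line`, `decode_line` and the feature name `feature_x` are made up, and the sketch assumes feature names never contain `=`.

```python
import json

# The ambiguity with a plain ":" separator: two values come back as three.
values = ["a:b", "c"]
joined = ":".join(values)
print(joined.split(":"))  # the original value boundaries are lost

def encode_line(name, values):
    """Write one <feature name>=<values> line with a JSON-encoded list."""
    return name + "=" + json.dumps(values)

def decode_line(line):
    """Split at the first "=" and parse the remainder as JSON.

    Assumption: the feature name itself contains no "=" character.
    """
    name, _, payload = line.partition("=")
    return name, json.loads(payload)

line = encode_line("feature_x", values)  # feature_x is a hypothetical name
print(line)
print(decode_line(line))
```

JSON quoting keeps string values with colons, commas or equals signs intact, and numeric values stay distinguishable from strings, at the cost of a slightly heavier format than a bare separator.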

@vruusmann
Author

This issue is not (time-)critical for me.

It's more about pointing out a minor style issue/inconsistency in the LightGBM model file format.

@StrikerRUS
Collaborator

I've added this issue to the #960 TODO list.

@econkc

econkc commented Apr 10, 2018

I also noticed that when using "predict", we need to specify the categorical features on the prediction data too. Is it possible to convert the categorical features in the prediction data to match the training data?
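One way to picture the conversion being asked about is mapping each prediction-time value to its position in the training category list. The pure-Python sketch below does that; `encode_with_training_categories` is a hypothetical helper, and nothing in this thread states whether or how LightGBM performs such a mapping. The `-1` code for unseen values mirrors what `pandas.Categorical(values, categories=...)` assigns to values outside the given categories.

```python
def encode_with_training_categories(values, categories):
    """Map values to integer codes based on the training category order.

    Values that were not seen during training get the code -1.
    """
    index = {cat: code for code, cat in enumerate(categories)}
    return [index.get(v, -1) for v in values]

# Category list for "cylinders", taken from the Auto-MPG example above.
training_categories = [3, 4, 5, 6, 8]
print(encode_with_training_categories([4, 8, 7], training_categories))
# prints [1, 4, -1]
```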

@StrikerRUS
Collaborator

Closing this issue as we have one consolidated issue for pandas refactoring and this one is included there.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023