Refactoring the presentation of `pandas_categorical` section in model files #1201

vruusmann · 2018-01-14T16:01:24Z

If a LightGBM model was trained on a pandas.DataFrame that contains categorical columns, the last section of the model file is a pandas_categorical section.

There are two problems with this section: 1) it is formatted differently from other feature-related sections (e.g. the feature_importance section), and 2) its current representation (a list of lists) is difficult for outside applications to parse.

For example, consider the Auto-MPG dataset.

Current presentation (https://github.com/jpmml/jpmml-lightgbm/blob/master/src/test/resources/lgbm/RegressionAuto.txt#L574):

pandas_categorical:[[3, 4, 5, 6, 8], [70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], [1, 2, 3]]

Refactored presentation:

pandas_categorical:
cylinders=3:4:5:6:8
model_year=70:71:72:73:74:75:76:77:78:79:80:81:82
origin=1:2:3
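
To make the comparison concrete, here is a minimal Python sketch that reads the current single-line form and prints the refactored form. It assumes the text after the `pandas_categorical:` prefix is valid JSON, which holds for both examples in this issue. The helper names `parse_current_section` and `render_refactored` are hypothetical, and the feature names are passed in by hand because the current section stores only the category values.

```python
import json

PREFIX = "pandas_categorical:"

def parse_current_section(line):
    """Parse the current single-line section into a list of lists.

    Assumption: everything after the prefix is valid JSON.
    """
    if not line.startswith(PREFIX):
        raise ValueError("not a pandas_categorical line")
    return json.loads(line[len(PREFIX):])

def render_refactored(names, categories):
    """Render one '<feature name>=<v1>:<v2>:...' line per categorical feature.

    The names are supplied by the caller; the current section does not store them.
    """
    if len(names) != len(categories):
        raise ValueError("expected one name per categorical feature")
    lines = [PREFIX]
    for name, values in zip(names, categories):
        lines.append(name + "=" + ":".join(str(v) for v in values))
    return "\n".join(lines)

current = (
    "pandas_categorical:[[3, 4, 5, 6, 8], "
    "[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], [1, 2, 3]]"
)
categories = parse_current_section(current)
print(render_refactored(["cylinders", "model_year", "origin"], categories))
```

Note that the conversion only works if the caller already knows which columns were categorical and in what order, which is part of what the refactored form would make explicit.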

Another example of a very messy pandas_categorical section (https://github.com/jpmml/jpmml-lightgbm/blob/master/src/test/resources/lgbm/ClassificationAudit.txt#L581):

pandas_categorical:[["Consultant", "PSFederal", "PSLocal", "PSState", "Private", "SelfEmp", "Volunteer"], ["Associate", "Bachelor", "College", "Doctorate", "HSgrad", "Master", "Preschool", "Professional", "Vocational", "Yr10", "Yr11", "Yr12", "Yr1t4", "Yr5t6", "Yr7t8", "Yr9"], ["Absent", "Divorced", "Married", "Married-spouse-absent", "Unmarried", "Widowed"], ["Cleaner", "Clerical", "Executive", "Farming", "Home", "Machinist", "Military", "Professional", "Protective", "Repair", "Sales", "Service", "Support", "Transport"], ["Female", "Male"]]

Would changing the model file data format also necessitate updating the version number?


guolinke · 2018-01-22T03:15:06Z

@StrikerRUS @wxchan Do you have time to implement this feature?

vruusmann · 2018-01-22T10:04:04Z

The two main requirements are:

1. Split the current single-line representation into a multi-line representation. Each line should follow the format <feature name>=<list of feature values in the original value space>.
2. Figure out a good list-of-values representation. My comment above suggests separating individual values with the colon character (:). This works with numeric values, but not with string values (imagine a case where the original string feature value itself contains a colon).
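
The colon problem in the second requirement can be shown with a short Python sketch. It also tries one possible alternative, keeping the `<feature name>=` prefix but writing the value list as a JSON array. This is only an illustration, not a format that LightGBM defines. All function names and the `shift_start` feature are hypothetical, and the sketch assumes feature names contain no `=` character.

```python
import json

def encode_colon(values):
    """Colon-separated encoding, as suggested in the first comment."""
    return ":".join(str(v) for v in values)

def decode_colon(text):
    """Naive decoding; cannot tell a separator from a colon inside a value."""
    return text.split(":")

def encode_json_line(name, values):
    """Hypothetical alternative: '<feature name>=<JSON array>'."""
    return name + "=" + json.dumps(values)

def decode_json_line(line):
    """Split on the first '=' (assumes the name itself has no '=')."""
    name, _, payload = line.partition("=")
    return name, json.loads(payload)

values = ["08:00", "12:30"]

# The colon form does not round-trip: two values come back as four pieces.
broken = decode_colon(encode_colon(values))
print(broken)

# The JSON form round-trips because the strings are quoted and escaped.
line = encode_json_line("shift_start", values)
print(line)
print(decode_json_line(line))
```

A JSON array per line keeps each feature on its own line while reusing the quoting that the current section already relies on. An escaping scheme for the colon would be another option.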

vruusmann · 2018-01-22T10:07:34Z

This issue is not (time-)critical for me.

It's more about pointing out a minor style issue/inconsistency in the LightGBM model file format.

StrikerRUS · 2018-01-22T18:10:21Z

I've added this issue to the #960 TODO list.

econkc · 2018-04-10T19:27:11Z

I also noticed that when using "predict", we need to specify the categorical features in the prediction data too. Is it possible to convert the categorical features in the prediction data to match the training data?

StrikerRUS · 2019-05-08T12:28:57Z

Closing this issue as we have one consolidated issue for pandas refactoring and this one is included there.

github-actions · 2023-08-23T23:56:44Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

guolinke added the help wanted label Jan 22, 2018

wxchan mentioned this issue May 27, 2018

[python] refine pandas support #960

Closed

6 tasks

StrikerRUS closed this as completed May 8, 2019

vruusmann mentioned this issue Jun 30, 2019

Loading pandas categorical breaks when a name contains square bracket character "]" jpmml/jpmml-lightgbm#24

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
