
[python-package] stop relying on string concatenation / splitting for cv() eval results #6761

Open · jameslamb wants to merge 14 commits into master

Conversation

jameslamb (Collaborator)

Contributes to #6748

There are a few activities where lightgbm (the Python package) needs to inspect the output of one or more evaluation metrics on one or more datasets.

For example:

  • early stopping
  • printing evaluation results
  • recording evaluation results in memory (e.g. in a dictionary) for use after training

For train() and the other APIs built on top of it, lightgbm tracks those results using a list of tuples like this (pseudocode):

[
  ({dataset_name}, {metric_name}, {metric_value}, {is_higher_better}),
  ...
]

cv() does something similar. However, its "metric value" is actually a mean of such values taken over all cross-validation folds. Because multiple values are being aggregated, it appends a 5th item with the standard deviation.

[
  ({dataset_name}, {metric_name}, mean({metric_value}), {is_higher_better}, stddev({metric_value})),
  ...
]

Some code in callback.py needs to know, given a list of such tuples, whether it was produced by cross-validation or by a regular train() run.

To facilitate that while still somewhat preserving the schema for the tuples, the cv() code:

  • concatenates the first and second elements (dataset name and metric name) into one string
  • prepends the string literal "cv_agg" to the tuple

So e.g. ("valid1", "auc", ...) becomes ("cv_agg", "valid1 auc", ...). That happens here:

def _agg_cv_result(
    raw_results: List[List[_LGBM_BoosterEvalMethodResultType]],
) -> List[_LGBM_BoosterEvalMethodResultWithStandardDeviationType]:
    """Aggregate cross-validation results."""
    cvmap: Dict[str, List[float]] = OrderedDict()
    metric_type: Dict[str, bool] = {}
    for one_result in raw_results:
        for one_line in one_result:
            key = f"{one_line[0]} {one_line[1]}"
            metric_type[key] = one_line[3]
            cvmap.setdefault(key, [])
            cvmap[key].append(one_line[2])
    return [("cv_agg", k, float(np.mean(v)), metric_type[k], float(np.std(v))) for k, v in cvmap.items()]

Every place that handles such tuples then has to account for that, including splitting and re-combining that concatenated second element. Like this:

# split is needed for "<dataset type> <metric>" case (e.g. "train l1")
eval_name_splitted = env.evaluation_result_list[i][1].split(" ")
if self.first_metric_only and self.first_metric != eval_name_splitted[-1]:
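
To illustrate what that split is undoing (hypothetical values, not code from the PR):

# Illustrative only: recovering the metric name from a concatenated "dataset metric" key.
eval_name_splitted = "valid1 auc".split(" ")  # -> ["valid1", "auc"]
metric_name = eval_name_splitted[-1]          # -> "auc"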

This PR proposes changes to remove that, so that the cv() and train() tuples follow a similar schema and all of the complexity of splitting and re-combining names can be removed.
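
For reference, here is a hedged sketch of what an aggregated cv() entry would presumably look like under the proposed schema (inferred from this description, not copied from the diff; values are hypothetical):

# Hypothetical values; the point is the shape: dataset and metric names stay separate
# and there is no "cv_agg" marker, so the entry lines up with the train() schema plus a stddev.
cv_entry_proposed = ("valid1", "auc", 0.87, True, 0.01)  # (dataset, metric, mean, is_higher_better, stddev)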

It also standardizes on the names proposed in #6749 (comment).

Notes for Reviewers

This change should be completely backwards-compatible, including with user-provided custom metric functions. The code paths touched here are well covered by tests (as I found out from many failed tests while developing this 😅).

            metric_type[key] = one_line[3]
            cvmap.setdefault(key, [])
            cvmap[key].append(one_line[2])
    return [("cv_agg", k, float(np.mean(v)), metric_type[k], float(np.std(v))) for k, v in cvmap.items()]
jameslamb (Collaborator, Author):

This, removing the "cv_agg" string literal, is the key change... everything else flows from that.

@jameslamb jameslamb changed the title WIP: [python-package] stop relying on string concatenation / splitting for cv() eval results [python-package] stop relying on string concatenation / splitting for cv() eval results Dec 17, 2024
@jameslamb jameslamb marked this pull request as ready for review December 17, 2024 05:56
@StrikerRUS (Collaborator) left a comment:

LGTM! Thanks a lot for the clear refactoring!
Just some minor suggestions.

@@ -71,6 +71,14 @@ class CallbackEnv:
    evaluation_result_list: Optional[_ListOfEvalResultTuples]


def _using_cv(env: CallbackEnv) -> bool:
StrikerRUS (Collaborator):

Suggested change
-def _using_cv(env: CallbackEnv) -> bool:
+def _is_using_cv(env: CallbackEnv) -> bool:

For consistency with other functions like

def _is_train_set(self, ds_name: str, eval_name: str, env: CallbackEnv) -> bool:

def _is_zero(x: float) -> bool:

def _is_numeric(obj: Any) -> bool:

def _is_numpy_1d_array(data: Any) -> bool:

etc.

jameslamb (Collaborator, Author):

Sure, good point. Just pushed 5b25aed updating this to _is_using_cv().

@@ -71,6 +71,14 @@ class CallbackEnv:
    evaluation_result_list: Optional[_ListOfEvalResultTuples]


def _using_cv(env: CallbackEnv) -> bool:
    """Check if model in callback env is a CVBooster."""
    # this string-matching is used instead of isinstance() to avoid a circular import
StrikerRUS (Collaborator):

Instead of string-matching, I think it's possible to use a lazy import, i.e. import CVBooster inside this function.

jameslamb (Collaborator, Author):

I was thinking that it'd be expensive to do that import (since this could end up running on every iteration), but I guess it shouldn't be much overhead because Python caches imports: https://docs.python.org/3/reference/import.html#the-module-cache

I'll try that, thanks for the suggestion!

jameslamb (Collaborator, Author):

Just pushed 5b25aed using a lazy import + isinstance() check.
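
For reference, the lazy-import plus isinstance() pattern being discussed looks roughly like this (a sketch of code living in callback.py, not the exact contents of that commit):

def _is_using_cv(env: CallbackEnv) -> bool:
    """Check if the model in the callback env is a CVBooster."""
    # Importing here, rather than at module top level, avoids the circular import between
    # callback.py and engine.py; Python caches modules, so repeated calls are cheap.
    from .engine import CVBooster

    return isinstance(env.model, CVBooster)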

    # }
    #
    metric_types: Dict[Tuple[str, str], bool] = OrderedDict()
    metric_values: Dict[Tuple[str, str], List[float]] = OrderedDict()
    for one_result in raw_results:
        for one_line in one_result:
StrikerRUS (Collaborator):

This just removes the unnecessary intermediate variable one_line.

Suggested change
-        for one_line in one_result:
+        for dataset_name, metric_name, metric_value, is_higher_better in one_result:

jameslamb (Collaborator, Author):

I suspect that this double loop has been there in lightgbm for a long time to account for the fact that raw_results could be either a list of tuples or a list of lists of tuples?

feval : callable, list of callable, or None, optional (default=None)
    Customized evaluation function.
    Each evaluation function should accept two parameters: preds, eval_data,
    and return (eval_name, eval_result, is_higher_better) or list of such tuples.

I'll double-check that, and add some tests on that pattern if we don't yet have any.

jameslamb (Collaborator, Author):

I realize now that I read your comment too quickly... you're not recommending removing the double-nested loop, just moving the tuple unpacking up into the second for line.

I agree! I just pushed 5b25aed doing that.

In that commit, I also added 2 tests confirming that if a custom metric function returns a list of evaluation tuples, train() and cv() will work as expected. I only added 1 test case for each, in the interest of not expanding the already very thorough test_metrics test, but please let me know if you'd like to see more test cases added there.
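
For context, a custom metric (feval) that returns a list of evaluation tuples, the pattern those new tests exercise, looks roughly like this; the metric names and logic are purely illustrative:

import numpy as np
import lightgbm as lgb

def two_error_metrics(preds: np.ndarray, eval_data: lgb.Dataset):
    """Custom metric returning a list of (eval_name, eval_result, is_higher_better) tuples."""
    y_true = eval_data.get_label()
    mae = float(np.mean(np.abs(y_true - preds)))
    mse = float(np.mean((y_true - preds) ** 2))
    return [("custom_mae", mae, False), ("custom_mse", mse, False)]

# passed via the feval keyword argument of train() or cv(), e.g.:
# lgb.cv(params, train_set, feval=two_error_metrics)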

(A now-outdated review thread on python-package/lightgbm/engine.py was marked as resolved.)
@jameslamb jameslamb requested a review from StrikerRUS December 18, 2024 06:05