diff --git a/docs/algos/performances.md b/docs/algos/performances.md
index 7265bd0d..3e5cd0d9 100644
--- a/docs/algos/performances.md
+++ b/docs/algos/performances.md
@@ -12,7 +12,7 @@ For single-policy algorithms, the metric used will be the scalarized return of t
 ### Multi-policy algorithms
 For multi-policy algorithms, we propose to rely on various metrics to assess the quality of the **discounted** Pareto Fronts (PF) or Convex Coverage Set (CCS). In general, we want to have a metric that is able to assess the convergence of the PF, a metric that is able to assess the diversity of the PF, and a hybrid metric assessing both. The metrics are implemented in `common/performance_indicators`. We propose to use the following metrics:
-* (Diversity) Sparsity: average distance between each consecutive point in the PF. From the PGMORL paper [1]. Keyword: `eval/sparsity`.
+* **[Do not use]** (Diversity) Sparsity: average distance between consecutive points in the PF. From the PGMORL paper [1]. Keyword: `eval/sparsity`. Misleading on its own: a front containing a single point achieves the best possible score.
 * (Diversity) Cardinality: number of points in the PF. Keyword: `eval/cardinality`.
 * (Convergence) IGD: a SOTA metric from Multi-Objective Optimization (MOO) literature. It requires a reference PF that we can compute a posteriori. That is, we do a merge of all the PFs found by the method and compute the IGD with respect to this reference PF. Keyword: `eval/igd`.
 * (Hybrid) Hypervolume: a SOTA metric from MOO and MORL literature. Keyword: `eval/hypervolume`.
diff --git a/morl_baselines/common/evaluation.py b/morl_baselines/common/evaluation.py
index af106c08..8413a708 100644
--- a/morl_baselines/common/evaluation.py
+++ b/morl_baselines/common/evaluation.py
@@ -16,7 +16,6 @@
     hypervolume,
     igd,
     maximum_utility_loss,
-    sparsity,
 )
 from morl_baselines.common.weights import equally_spaced_weights
@@ -156,7 +155,6 @@ def log_all_multi_policy_metrics(
     Logged metrics:
     - hypervolume
-    - sparsity
     - expected utility metric (EUM)
     If a reference front is provided, also logs:
     - Inverted generational distance (IGD)
@@ -172,14 +170,12 @@
     """
     filtered_front = list(filter_pareto_dominated(current_front))
     hv = hypervolume(hv_ref_point, filtered_front)
-    sp = sparsity(filtered_front)
     eum = expected_utility(filtered_front, weights_set=equally_spaced_weights(reward_dim, n_sample_weights))
     card = cardinality(filtered_front)
     wandb.log(
         {
             "eval/hypervolume": hv,
-            "eval/sparsity": sp,
             "eval/eum": eum,
             "eval/cardinality": card,
             "global_step": global_step,
diff --git a/morl_baselines/common/performance_indicators.py b/morl_baselines/common/performance_indicators.py
index 8462dbb3..4fb610f1 100644
--- a/morl_baselines/common/performance_indicators.py
+++ b/morl_baselines/common/performance_indicators.py
@@ -42,6 +42,9 @@ def igd(known_front: List[np.ndarray], current_estimate: List[np.ndarray]) -> fl
 def sparsity(front: List[np.ndarray]) -> float:
     """Sparsity metric from PGMORL.
 
+    (!) This metric considers only the points on the PF found by the algorithm, not the full objective space.
+    It can therefore be misleading (a single-point front achieves the best possible score) and we recommend not relying on it when comparing algorithms.
+
     Basically, the sparsity is the average distance between each point in the front.
 
     Args:
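
To illustrate the caveat documented above, here is a minimal sketch of the sparsity computation, assuming the PGMORL definition (per-objective squared gaps between consecutive sorted points, summed and averaged over |front| - 1). `sparsity_sketch`, `dense_front`, and `single_point` are illustrative names, not part of the library; the real implementation lives in `morl_baselines/common/performance_indicators.py`. The sketch shows why a single-point front scores best:

```python
from typing import List

import numpy as np


def sparsity_sketch(front: List[np.ndarray]) -> float:
    """PGMORL-style sparsity: per-objective squared gaps between
    consecutive sorted points, summed and averaged over |front| - 1."""
    if len(front) < 2:
        # A degenerate front scores 0, the best possible value --
        # this is exactly why the metric is misleading on its own.
        return 0.0
    points = np.asarray(front)
    total = 0.0
    for j in range(points.shape[1]):
        col = np.sort(points[:, j])
        total += float(np.sum((col[1:] - col[:-1]) ** 2))
    return total / (len(front) - 1)


# A well-spread 3-point front vs. a single learned point:
dense_front = [np.array([0.0, 1.0]), np.array([0.5, 0.5]), np.array([1.0, 0.0])]
single_point = [np.array([0.2, 0.2])]
print(sparsity_sketch(dense_front))   # 0.5 -> "worse" sparsity
print(sparsity_sketch(single_point))  # 0.0 -> "best" sparsity, despite a poor front
```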