fix: check for NaNs in emd loss matrix #623

Open
wants to merge 11 commits into
base: master
from
5 changes: 5 additions & 0 deletions ot/lp/__init__.py
@@ -237,6 +237,8 @@ def emd(a, b, M, numItermax=100000, log=False, center_dual=True, numThreads=1, c
.. note:: An error will be raised if the vectors :math:`\mathbf{a}` and :math:`\mathbf{b}` do not sum to the same value.
.. note:: An error will be raised if the loss matrix :math:`\mathbf{M}` contains NaNs.
Uses the algorithm proposed in :ref:`[1] <references-emd>`.
Parameters
@@ -302,6 +304,9 @@ def emd(a, b, M, numItermax=100000, log=False, center_dual=True, numThreads=1, c
ot.optim.cg : General regularized OT
"""

if np.isnan(M).any():
Collaborator

A problem here is that you are using NumPy on arrays that might not be NumPy arrays (see the backend function below). You should do the test later in the function, on the OT loss matrix that has been converted to NumPy, to avoid backend errors.

raise ValueError('The loss matrix should not contain NaN values.')

Author

Failing early here ensures that we do not segfault in the accelerated emd_c call.

I did not look too deeply into the emd_c implementation, but my assumption is that this check is somewhat pessimistic. It may be possible to formulate problems in which a subset of the values in the loss matrix never needs to be accessed (possibly because the graph is disconnected). We could then support NaN values in those cases. @rflamary, what is your opinion on this?

Collaborator

If the graph is disconnected, then the parts that are not used should have an infinite value (which is handled by the C++ solver). I'm OK with not handling NaNs.
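
To illustrate the distinction drawn in this thread, here is a small NumPy sketch. The helper name `has_nan` is hypothetical and not part of POT. It only shows that a NaN check leaves infinite costs untouched, so a loss matrix that uses `inf` to forbid transport between disconnected parts would still pass such a check.

```python
import numpy as np


def has_nan(M):
    """Return True if the loss matrix contains at least one NaN (hypothetical helper)."""
    return bool(np.isnan(np.asarray(M, dtype=float)).any())


# Infinite costs mark pairs that must not be matched; they are not NaN.
M_inf = np.array([[0.0, np.inf],
                  [np.inf, 0.0]])

# A NaN entry carries no usable cost at all.
M_nan = np.array([[0.0, np.nan],
                  [1.0, 0.0]])

print(has_nan(M_inf))  # False
print(has_nan(M_nan))  # True
```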

a, b, M = list_to_array(a, b, M)
nx = get_backend(M, a, b)
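
One way to act on the review comment above is to run the NaN check only after the inputs have been converted to NumPy. The sketch below is an assumption about how that could look, not the code in this PR: `check_loss_matrix`, `_default_to_numpy`, and the `to_numpy` argument are hypothetical stand-ins for whatever conversion the backend provides.

```python
import numpy as np


def _default_to_numpy(M):
    # Hypothetical stand-in for a backend-specific conversion routine.
    return np.asarray(M, dtype=np.float64)


def check_loss_matrix(M, to_numpy=_default_to_numpy):
    """Convert M to a NumPy array, then reject NaN entries."""
    M_np = to_numpy(M)
    # The check runs on the converted array, so np.isnan never sees a
    # backend-specific tensor type.
    if np.isnan(M_np).any():
        raise ValueError('The loss matrix should not contain NaN values.')
    return M_np


# Usage: a clean matrix is returned as a NumPy array, a NaN raises.
clean = check_loss_matrix([[0.0, 1.0], [1.0, 0.0]])
try:
    check_loss_matrix([[np.nan, 1.0], [1.0, 0.0]])
except ValueError as err:
    print(err)
```

As long as such a check is placed ahead of the call into the compiled solver, it still fails early, which is the property the author wants to keep.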

30 changes: 30 additions & 0 deletions test/gromov/test_gw.py
@@ -832,3 +832,33 @@ def test_fgw_barycenter(nx):
    # test correspondence with utils function
    recovered_C = ot.gromov.update_kl_loss(p, lambdas, log['T'], [C1, C2])
    np.testing.assert_allclose(C, recovered_C)


# Related to issue 469
def test_gromov2_nan_in_source_cost():
    # GIVEN a source cost matrix with a NaN value
    source_cost = np.zeros((2, 2))
    target_cost = np.ones((2, 2))
    source_distribution = np.array([0.5, 0.5])
    target_distribution = np.array([0.5, 0.5])

    source_cost[0, 0] = np.nan

    # WHEN we call gromov_wasserstein2 - THEN we expect a ValueError
    with pytest.raises(ValueError, match='The loss matrix should not contain NaN values.'):
        ot.gromov_wasserstein2(source_cost, target_cost, source_distribution, target_distribution)


# Related to issue 469
def test_gromov2_nan_in_target_cost():
    # GIVEN a target cost matrix with a NaN value
    source_cost = np.zeros((2, 2))
    target_cost = np.ones((2, 2))
    source_distribution = np.array([0.5, 0.5])
    target_distribution = np.array([0.5, 0.5])

    target_cost[0, 0] = np.nan

    # WHEN we call gromov_wasserstein2 - THEN we expect a ValueError
    with pytest.raises(ValueError, match='The loss matrix should not contain NaN values.'):
        ot.gromov_wasserstein2(source_cost, target_cost, source_distribution, target_distribution)
5 changes: 5 additions & 0 deletions test/test_ot.py
@@ -245,6 +245,11 @@ def test_emd_empty():
    np.testing.assert_allclose(w, 0)


def test_emd_nan_in_loss_matrix():
    with pytest.raises(ValueError, match='The loss matrix should not contain NaN values.'):
        ot.emd([], [], [np.nan])
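
A side note on the `match` argument used in these tests: `pytest.raises` treats it as a regular expression and applies `re.search`, so the trailing `.` in the message matches any character. The standard-library sketch below shows the difference that `re.escape` makes. Whether the stricter form is wanted here is a judgment call, and the name `MESSAGE` is only illustrative.

```python
import re

MESSAGE = 'The loss matrix should not contain NaN values.'
ALTERED = 'The loss matrix should not contain NaN values!'

# Used as a raw pattern, the final '.' is a wildcard and accepts '!'.
loose = re.search(MESSAGE, ALTERED)

# Escaping the message makes the pattern match the literal text only.
strict = re.search(re.escape(MESSAGE), ALTERED)
exact = re.search(re.escape(MESSAGE), MESSAGE)

print(loose is not None)   # True
print(strict is not None)  # False
print(exact is not None)   # True
```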


def test_emd2_multi():
    n = 500  # nb bins
