CopulaGANSynthesizer more likely to see FitError (Optimization converged to parameters that are outside the range allowed by the distribution.) #2391

Open
npatki opened this issue Feb 25, 2025 · 1 comment
Labels
bug Something isn't working data:single-table Related to tabular datasets

Comments


npatki commented Feb 25, 2025

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.18.0
  • Python version: 3.11.11
  • Operating System: Linux (Google Colab)

Error Description

The CopulaGANSynthesizer is more likely to run into a FitError (Optimization converged to parameters that are outside the range allowed by the distribution.) than other synthesizers. This can particularly happen when using the beta distribution.

Why is this happening? Some synthesizers (GaussianCopula and CopulaGAN) use the scipy library to estimate the shape of each column (aka marginal distribution). Scipy runs optimization algorithms to estimate these parameters, but these algorithms are not infallible; sometimes, just due to how the data is shaped, the algorithm won't work and it will produce a FitError. See scipy docs.
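To see the scipy behavior in isolation, the sketch below calls `scipy.stats.beta.fit` directly on the same values used in the reproduction further down. The helper name `try_beta_fit` is hypothetical and not part of SDV or scipy. Whether the fit raises depends on the scipy version and the starting values, so the sketch reports the outcome instead of assuming one.

```python
import numpy as np
from scipy import stats

# Same 50 values as the reproduction in this issue: 29 zeros followed by 21 ones.
data = np.concatenate([np.zeros(29), np.ones(21)])


def try_beta_fit(values, **fit_kwargs):
    """Hypothetical helper: report whether scipy can fit a beta distribution.

    Returns ('ok', params) on success or ('failed', message) if scipy raises.
    """
    try:
        params = stats.beta.fit(values, **fit_kwargs)
    except Exception as error:
        # Recent scipy versions raise FitError (a RuntimeError subclass) here;
        # catching Exception keeps the sketch independent of the scipy version.
        return 'failed', f'{type(error).__name__}: {error}'
    return 'ok', params


status, detail = try_beta_fit(data)
print(status, detail)
```

If the status is `failed`, the message is the kind of error that surfaces from the synthesizer's `fit` call.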

Unfortunately, the SDV team cannot control how the internals of scipy work. However, we can provide error-checking and fallbacks so that the inability to fit one column won't cause a crash for the entire dataset. In the GaussianCopulaSynthesizer, we do the following:

  1. We apply some heuristics first and pass them as starting values into scipy, in the hope that the algorithm will be more likely to converge. See code.
  2. In the worst case, we fall back to a different distribution that is guaranteed to converge -- aka the normal distribution. See code.

We may not be doing these steps in CopulaGANSynthesizer because it is an experimental synthesizer.
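As a rough illustration of the two steps listed above, here is a minimal sketch of a fit-with-fallback pattern written against scipy directly. It is not the SDV or Copulas implementation. The function name `fit_marginal` and the choice of the observed minimum and range as starting values are assumptions made for this example.

```python
import numpy as np
from scipy import stats


def fit_marginal(values):
    """Hypothetical helper: fit a beta distribution, else fall back to a normal.

    Returns a (distribution_name, params) tuple.
    """
    values = np.asarray(values, dtype=float)

    # Step 1 (assumed heuristic): start the optimizer from the observed range.
    loc_guess = values.min()
    scale_guess = values.max() - values.min()

    try:
        params = stats.beta.fit(values, loc=loc_guess, scale=scale_guess)
        if not np.all(np.isfinite(params)):
            raise ValueError('beta fit returned non-finite parameters')
        return 'beta', params
    except Exception:
        # Step 2: the normal fit is just the sample mean and standard
        # deviation, so it needs no iterative optimizer that could diverge.
        return 'norm', stats.norm.fit(values)


name, params = fit_marginal(np.concatenate([np.zeros(29), np.ones(21)]))
print(name, params)
```

The point of the pattern is that a failure on one column yields a usable, if cruder, marginal instead of aborting the fit for the whole table.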

Steps to reproduce

Replicate this issue using the same data that is described in Copulas #264.

import pandas as pd
import numpy as np

from sdv.single_table import CopulaGANSynthesizer
from sdv.metadata import Metadata


data = pd.DataFrame(data={
    'A': np.concatenate([np.zeros(29), np.ones(21)]) # exact data from Copulas issue #264
})

metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'columns': {
                'A': { 'sdtype': 'numerical'}}}}})

synthesizer = CopulaGANSynthesizer(metadata)
synthesizer.fit(data)

Workarounds

Any one of the following configurations gets rid of this error:

  1. (If applicable) List this data as categorical instead of numerical in the metadata. Numerical data is meant to denote continuous distributions of data. If your data is actually present as discrete values, modeling it as numerical is more likely to cause issues.
metadata.update_column(column_name='A', sdtype='categorical')
  2. Switch from using CopulaGANSynthesizer to any other single table synthesizer -- GaussianCopulaSynthesizer, CTGANSynthesizer, or TVAESynthesizer. I recommend GaussianCopulaSynthesizer, as it is a fast and flexible statistical model that has been shown to achieve good quality synthetic data.
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
  3. Keep the CopulaGANSynthesizer, but switch to a different default distribution. I recommend truncnorm, as it has been shown to achieve comparable quality to beta for many different types of data -- and it is much faster.
synthesizer = CopulaGANSynthesizer(metadata, default_distribution='truncnorm')

npatki commented Feb 26, 2025

Update on this one: After some investigation, we have identified that this is, indeed, happening because CopulaGAN is not hooked up to a fallback option in the case that the requested distribution fails.

We can confirm that for this same data, the distribution also fails for GaussianCopula. But since GaussianCopula has a built-in fallback, it doesn't crash and keeps going with the modeling.

RDT issue #945 will ultimately fix this problem in CopulaGAN. Until then, we can keep this issue open in case anyone else runs into it.
