CopulaGANSynthesizer more likely to see FitError
(Optimization converged to parameters that are outside the range allowed by the distribution.)
#2391
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
The CopulaGANSynthesizer is more likely to run into a `FitError` (Optimization converged to parameters that are outside the range allowed by the distribution.) than other synthesizers. This can particularly happen when using the `beta` distribution.

Why is this happening? Some synthesizers (GaussianCopula and CopulaGAN) use the `scipy` library to estimate the shape of each column (aka marginal distribution). Scipy runs optimization algorithms to estimate these parameters, but these algorithms are not infallible; sometimes, just due to how the data is shaped, the algorithm won't work and it will produce a `FitError`. See scipy docs.

Unfortunately, the SDV team cannot control how the internals of scipy work. However, we can provide error-checking and fallbacks so that the inability to fit one column won't cause a crash for the entire dataset. In the GaussianCopulaSynthesizer, we do the following:
We may not be doing these steps in the CopulaGANSynthesizer because it is an experimental synthesizer.
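To illustrate the general idea of error-checking with a fallback, here is a minimal sketch that uses scipy only. It is not SDV's actual implementation: the helper name `fit_with_fallback` is hypothetical, and falling back to a normal distribution is an assumption chosen for illustration.

```python
import numpy as np
from scipy import stats


def fit_with_fallback(values, primary=stats.beta, fallback=stats.norm):
    """Fit `primary` to `values`; if that fit fails, fit `fallback` instead.

    Hypothetical helper for illustration only. Returns (distribution, params).
    """
    values = np.asarray(values, dtype=float)
    try:
        return primary, primary.fit(values)
    except (RuntimeError, ValueError):
        # In recent scipy releases FitError is a RuntimeError subclass,
        # so a failed optimization lands here instead of crashing the caller.
        return fallback, fallback.fit(values)


# Example: one column of synthetic data drawn from a beta distribution.
rng = np.random.default_rng(0)
column = rng.beta(2.0, 5.0, size=500)
dist, params = fit_with_fallback(column)
print(dist.name, params)
```

With a pattern like this, a column that cannot be fit with the requested distribution still gets usable parameters, so one problematic column does not stop the rest of the dataset from being modeled.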
Steps to reproduce
Replicate this issue using the same data that is described in Copulas #264.
Workarounds
Any one of the following configurations gets rid of this error:
- Use `categorical` instead of `numerical` in the metadata. Numerical data is meant to denote continuous distributions of data. If your data is actually present as discrete values, modeling it as numerical is more likely to cause issues.
- Use `GaussianCopulaSynthesizer`, as it is a fast and flexible statistical model that has been shown to achieve good quality synthetic data.
- Use `truncnorm`, as it has been shown to achieve comparable quality to `beta` for many different types of data, and it is much faster.