Datetimes all become NaT #2361

Open
wilcovanvorstenbosch opened this issue Jan 30, 2025 · 1 comment
Labels: question, under discussion

Comments

@wilcovanvorstenbosch

I'm trying to synthesize a dataset with various columns for dates.
I'm using the Gaussian Copula Synthesizer, and the following transformer for these columns:

OptimizedTimestampEncoder(
    enforce_min_max_values=True,
    missing_value_replacement='random',
    missing_value_generation='from_column',
    datetime_format=None
)

The columns are of type <M8[ns] or datetime64[ns]. They already contain a lot of missing values (NaT).
After synthesis, the sampled data contains ONLY missing values (NaT).

The dates are in the format yyyy-mm-dd. I tried setting datetime_format to '%Y-%m-%d', but to no avail.

What did I do wrong?

@wilcovanvorstenbosch added the new and question labels Jan 30, 2025
@srinify
Contributor

srinify commented Jan 31, 2025

Hi @wilcovanvorstenbosch 👋

  1. Out of curiosity, does the NaT issue disappear if you use the default transformer, without updating to OptimizedTimestampEncoder? I'd be curious to know if the default workflow results in a roughly similar proportion of NaT values in the synthetic data.

  2. I'm also curious to know if you're hoping to generate synthetic data with the same datetime format as your real data, or if you're hoping to generate synthetic data with a different one? In SDV land, the datetime_format value should describe the format in your real data, not the aspirational format for the synthetic data.

  3. Do you also mind sharing your SDV code? This would include how you're instantiating the synthesizer, customizing the synthesizer's behavior using parameters and methods, any constraints you're using, etc.
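For the comparison asked about in point 1, a minimal sketch of how the NaT proportions could be measured with plain pandas. The helper names (`nat_fraction`, `compare_nat_fractions`), the column name `start_date`, and the stand-in data are hypothetical and not part of SDV. In practice, `real` would be the original table and `synthetic` the output of the synthesizer's sampling step.

```python
import pandas as pd


def nat_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values (NaT) in each datetime column."""
    datetime_cols = df.select_dtypes(include="datetime").columns
    return df[datetime_cols].isna().mean()


def compare_nat_fractions(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Put the per-column NaT fractions of real and synthetic data side by side."""
    return pd.DataFrame({
        "real": nat_fraction(real),
        "synthetic": nat_fraction(synthetic),
    })


# Hypothetical stand-in data: half of the real values are missing,
# while every synthetic value is NaT (the symptom described in this issue).
real = pd.DataFrame(
    {"start_date": pd.to_datetime(["2024-01-01", None, "2024-03-15", None])}
)
synthetic = pd.DataFrame(
    {"start_date": pd.to_datetime([None, None, None, None])}
)

report = compare_nat_fractions(real, synthetic)
print(report)
```

Running the same comparison once with the default transformer and once with the customized OptimizedTimestampEncoder would show whether the all-NaT output is tied to the transformer settings.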

@srinify added the under discussion label and removed the new label Jan 31, 2025
@srinify self-assigned this Jan 31, 2025