using Metadata.detect_from_dataframe for complex dataset #2346

Open
jaysara opened this issue Jan 14, 2025 · 2 comments
Labels
question General question about the software under discussion Issue is currently being discussed

Comments


jaysara commented Jan 14, 2025

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version:1.17.3
  • Python version:3.9
  • Operating System: MacOS

Problem description

I am reading a Parquet file into a pandas DataFrame and using Metadata.detect_from_dataframe to detect the metadata. The DataFrame has multiple fields that each contain an array of sub-elements, so this is a denormalized dataset. Will SDV work with this kind of structure, or does my dataset have to be completely flattened (normalized)?
To explain better, here is an example of the schema:

Field1 : String (id)
Field2: String (Category)
Field3 : Array[] of [{Field3, Field4, Field5},{Field3, Field4, Field5},{Field3, Field4, Field5}]
Field 6: String (Category)

What I already tried

I tried using the Metadata API on this complex dataset and got the following error:

import pandas as pd
from sdv.metadata import Metadata

df = pd.read_parquet('myfile.parquet')
metadata = Metadata.detect_from_dataframe(
    data=df,
    table_name='sample')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 metadata = Metadata.detect_from_dataframe(
      2     data=newdf,
      3     table_name='sample')

File ~/Library/Python/3.9/lib/python/site-packages/sdv/metadata/metadata.py:108, in Metadata.detect_from_dataframe(cls, data, table_name)
    105     raise ValueError('The provided data must be a pandas DataFrame object.')
    107 metadata = Metadata()
--> 108 metadata.detect_table_from_dataframe(table_name, data)
    109 return metadata

File ~/Library/Python/3.9/lib/python/site-packages/sdv/metadata/multi_table.py:547, in MultiTableMetadata.detect_table_from_dataframe(self, table_name, data)
    545 self._validate_table_not_detected(table_name)
    546 table = SingleTableMetadata()
--> 547 table._detect_columns(data)
    548 self.tables[table_name] = table
    549 self._log_detected_table(table)
Contributor

npatki commented Jan 15, 2025

Hi @jaysara, nice to meet you. It would be very helpful if you are able to share what a few rows of your data look like, just as an example (you can redact any private info, but it would be helpful to see the format).

Field1 : String (id)
Field2: String (Category)
Field3 : Array[] of [{Field3, Field4, Field5},{Field3, Field4, Field5},{Field3, Field4, Field5}]
Field 6: String (Category)

In the absence of any examples, I am assuming here that each of the Fields you are specifying represent different columns of your data? If so, then your understanding is correct -- SDV will not accept columns whose values contain arrays, dictionaries, etc. The data should be in a flat structure so that each column would contain a simple value such as a string, a number, or a datetime.
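To check which columns fall into that category before running metadata detection, here is a minimal sketch in plain pandas. The helper name `find_nested_columns` and the sample values are hypothetical, and the set of types it treats as "nested" is an assumption for illustration, not a list taken from SDV itself.

```python
import numpy as np
import pandas as pd

# Assumption: these are the container types we want to flag as "nested".
# Arrays read from Parquet often arrive as numpy arrays, so ndarray is included.
NESTED_TYPES = (list, tuple, dict, set, np.ndarray)


def find_nested_columns(df: pd.DataFrame) -> list:
    """Return names of columns holding at least one list-, dict- or array-like value."""
    nested = []
    for name in df.columns:
        column = df[name]
        if column.dtype != object:
            # Numeric, boolean and datetime columns cannot hold nested objects.
            continue
        if column.map(lambda value: isinstance(value, NESTED_TYPES)).any():
            nested.append(name)
    return nested


# Hypothetical sample shaped like the schema in the question.
df = pd.DataFrame({
    'Field1': ['id-000', 'id-001'],
    'Field2': ['Yes', 'No'],
    'Field3': [
        [{'Field4': 1.23, 'Field5': True}],
        [{'Field4': 3.45, 'Field5': False}],
    ],
    'Field6': ['A', 'B'],
})

nested_columns = find_nested_columns(df)
print(nested_columns)
```

Any column reported here would need to be flattened (or dropped) before the DataFrame is in the flat structure described above.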

Are you able to modify your data to be in such a format? Perhaps you can expand out Fields 1-6 so that they are each separate columns?

| Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 |
|---|---|---|---|---|---|
| id-000 | Yes | 2021-02-03 | 1.23 | True | 1 |
| id-001 | No | 2022-03-05 | 3.45 | False | 1 |
| id-002 | Yes | 2020-01-12 | 2.13 | False | 0 |
| ... | ... | ... | ... | ... | ... |
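For illustration, here is a rough sketch of one way to expand an array-of-records column into flat columns with pandas. It assumes each element of the array is a dictionary and that one output row per array element is acceptable (the other columns are repeated for each element). The helper name `flatten_array_column` and the sample values are hypothetical, and the real data may need a different layout.

```python
import pandas as pd


def flatten_array_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Expand a column of arrays of dicts into one row per element, one column per key."""
    # One row per array element; values in the other columns are repeated.
    exploded = df.explode(column, ignore_index=True)
    # Empty or missing arrays explode to NaN; treat those as records with no keys.
    records = [item if isinstance(item, dict) else {} for item in exploded[column]]
    expanded = pd.DataFrame(records, index=exploded.index)
    # Drop the nested column first so a sub-field with the same name can take its place.
    return pd.concat([exploded.drop(columns=[column]), expanded], axis=1)


# Hypothetical sample shaped like the schema in the question.
nested = pd.DataFrame({
    'Field1': ['id-000', 'id-001'],
    'Field2': ['Yes', 'No'],
    'Field3': [
        [
            {'Field3': '2021-02-03', 'Field4': 1.23, 'Field5': True},
            {'Field3': '2022-03-05', 'Field4': 3.45, 'Field5': False},
        ],
        [
            {'Field3': '2020-01-12', 'Field4': 2.13, 'Field5': False},
        ],
    ],
    'Field6': ['A', 'B'],
})

flat = flatten_array_column(nested, 'Field3')
print(flat)
```

The resulting `flat` DataFrame contains only simple values per cell, so it could then be passed to `Metadata.detect_from_dataframe` in the same way as in the original post. Note that this sketch does not handle sub-field names that collide with other top-level columns.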

Hope that helps.


npatki commented Jan 15, 2025

As a side note: I realize our error message wasn't very useful for you. We are actively working on providing better error messages to you. See #2327.
