using Metadata.detect_from_dataframe for complex dataset #2346

jaysara · 2025-01-14T20:34:07Z

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

SDV version: 1.17.3
Python version: 3.9
Operating System: macOS

Problem description

I am reading a Parquet file into a pandas DataFrame and using Metadata.detect_from_dataframe to detect the metadata. The DataFrame has multiple fields that each contain an array of sub-elements, so this is more of a denormalized dataset. Will SDV work with this kind of structure, or does my dataset have to be completely flattened (normalized)?
To explain better, here is an example of the schema:

Field1 : String (id)
Field2: String (Category)
Field3 : Array[] of [{Field3, Field4, Field5},{Field3, Field4, Field5},{Field3, Field4, Field5}]
Field 6: String (Category)

What I already tried

I tried using the Metadata API on this complex dataset and got the following error:

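import pandas as pd
from sdv.metadata import Metadata

# Load the Parquet file, then let SDV auto-detect the metadata
df = pd.read_parquet('myfile.parquet')
metadata = Metadata.detect_from_dataframe(
    data=df,
    table_name='sample')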

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 metadata = Metadata.detect_from_dataframe(
      2     data=newdf,
      3     table_name='sample')

File ~/Library/Python/3.9/lib/python/site-packages/sdv/metadata/metadata.py:108, in Metadata.detect_from_dataframe(cls, data, table_name)
    105     raise ValueError('The provided data must be a pandas DataFrame object.')
    107 metadata = Metadata()
--> 108 metadata.detect_table_from_dataframe(table_name, data)
    109 return metadata

File ~/Library/Python/3.9/lib/python/site-packages/sdv/metadata/multi_table.py:547, in MultiTableMetadata.detect_table_from_dataframe(self, table_name, data)
    545 self._validate_table_not_detected(table_name)
    546 table = SingleTableMetadata()
--> 547 table._detect_columns(data)
    548 self.tables[table_name] = table
    549 self._log_detected_table(table)


npatki · 2025-01-15T20:53:08Z

Hi @jaysara, nice to meet you. It would be very helpful if you are able to share what a few rows of your data look like, just as an example (you can redact any private info, but it would be helpful to see the format).

Field1 : String (id)
Field2: String (Category)
Field3 : Array[] of [{Field3, Field4, Field5},{Field3, Field4, Field5},{Field3, Field4, Field5}]
Field 6: String (Category)

In the absence of any examples, I am assuming that each of the Fields you listed represents a different column of your data. If so, then your understanding is correct: SDV will not accept columns whose values contain arrays, dictionaries, etc. The data should be in a flat structure so that each column contains a simple value such as a string, a number, or a datetime.

Are you able to modify your data to be in such a format? Perhaps you can expand out Fields 1-6 so that they are each separate columns?

Field 1	Field 2	Field 3	Field 4	Field 5	Field 6
id-000	Yes	2021-02-03	1.23	True	1
id-001	No	2022-03-05	3.45	False	1
id-002	Yes	2020-01-12	2.13	False	0
...	...	...	...	...	...
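
For reference, here is a minimal sketch of one way to do that expansion with pandas. The sample values and the helper name `flatten_nested_column` are made up for illustration (they are not part of SDV), and the sketch assumes every row holds at least one sub-element and that each sub-element is a dictionary with the same keys.

```python
import pandas as pd

# Hypothetical sample that mirrors the schema from the question:
# Field3 holds an array of {Field3, Field4, Field5} records per row.
df = pd.DataFrame({
    'Field1': ['id-000', 'id-001'],
    'Field2': ['Yes', 'No'],
    'Field3': [
        [
            {'Field3': '2021-02-03', 'Field4': 1.23, 'Field5': True},
            {'Field3': '2022-03-05', 'Field4': 3.45, 'Field5': False},
        ],
        [
            {'Field3': '2020-01-12', 'Field4': 2.13, 'Field5': False},
        ],
    ],
    'Field6': ['1', '0'],
})


def flatten_nested_column(data, column):
    """Return a flat copy of `data` with one row per element of `column`.

    Hypothetical helper. Assumes each value in `column` is a non-empty
    list (or array) of dictionaries.
    """
    # One row per array element; the other columns are repeated.
    exploded = data.explode(column, ignore_index=True)
    # Turn each dictionary into its own set of columns.
    nested = pd.json_normalize(exploded[column].tolist())
    # Swap the nested column for the expanded columns.
    return pd.concat([exploded.drop(columns=[column]), nested], axis=1)


flat_df = flatten_nested_column(df, 'Field3')
print(flat_df)
```

The result has one simple value per cell, with Field1, Field2 and Field6 repeated for every sub-element. Depending on your data, it may make more sense to keep the sub-elements in a separate child table that refers back to Field1, but that is a modeling decision I can't make without seeing the data.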

Hope that helps.

npatki · 2025-01-15T20:54:38Z

As a side note: I realize our error message wasn't very useful for you. We are actively working on providing better error messages to you. See #2327.
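
In the meantime, a quick check like the sketch below may help you spot which columns are likely to trip up the detection. This is plain pandas, not an SDV feature, and the helper name `find_nested_columns` is hypothetical. It assumes that the problematic values are lists, tuples, sets, dictionaries or NumPy arrays (the last is what array columns often become when read from Parquet).

```python
import numpy as np
import pandas as pd

# Container types that are assumed here to count as "nested" values.
NESTED_TYPES = (list, tuple, set, dict, np.ndarray)


def find_nested_columns(data):
    """Return the names of columns that contain at least one nested value.

    Hypothetical helper, not part of SDV.
    """
    return [
        column
        for column in data.columns
        if data[column].map(lambda value: isinstance(value, NESTED_TYPES)).any()
    ]


# Made-up example data: only 'items' and 'extra' hold nested values.
example = pd.DataFrame({
    'id': ['id-000', 'id-001'],
    'items': [[1, 2], [3]],
    'extra': [{'a': 1}, None],
})
print(find_nested_columns(example))
```

Any column this reports would need to be flattened or dropped before calling Metadata.detect_from_dataframe.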
