Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataRecord improvement: Consider to unify the setup/usage for field_types field_values in DataRecord #105

Open
chjuncn opened this issue Feb 3, 2025 · 2 comments

Comments

@chjuncn
Copy link
Collaborator

chjuncn commented Feb 3, 2025

Issues

datarecord.field_types.keys() and datarocrd.field_values.keys() return different sets and they even don't necessarily overlap with each other, which is confusing. And the usage for field_types and field_values seem to be random in DataRecord now.

field_types are generally from Schema setup
field_values setup is simply by datarecord.new_field=value.

So we have different keys for these 2 set, it's easy to get attribute not found in the code. Consider to make them consistent or just use one set all the time.

Possible Further Improvements

Currently all fields including system fields, user-added fields are in same level of data record, they mix together. How about keep system level fields in first level, and all user-added fields into a customized_fields attributes? E.g.

system_field_1
system_field_2
system_field_3
customized_fields
        user_defined_field_1
        user_defined_field_2
        .....

I think it will be easier for the system to manage fields and less confusing, no conflicting from system and user-defined fields.

@mdr223
Copy link
Collaborator

mdr223 commented Feb 3, 2025

Hi @chjuncn is there a reproducible example of field_types and field_values not sharing the same keys that you can share with me?

Overall I 100% agree that managing two separate dictionaries is ugly and I also would like to make these consistent by just managing one data structure.

Finally, can you expand on your text in Possible Further Improvements? Specifically, I tried to achieve what (I think) you're expressing, which is that all non-data fields are kept at the top-level of the DataRecord. All data fields are then stored in DataRecord.field_values (i.e. I think customized_fields == field_values?)

If my understanding is not complete, please correct me!

Also, one current limitation of the existing approach is that users cannot use fields with names that match the top-level fields in DataRecord (e.g. id, parent_id, schema, etc.). This is especially bad for fields like id which are common and which Gerardo already had a name collision with. I'm open to any suggestions, especially if they can overcome this name collision issue!

(The simplest idea for overcoming name collisions is simply to prefix those internal fields with an underscore and then we could disallow field names with _ -- or just the field names like _id, _parent_id, etc. -- but I'm still not a huge fan of this idea if there's something better.)

@chjuncn
Copy link
Collaborator Author

chjuncn commented Feb 4, 2025

is there a reproducible example of field_types and field_values not sharing the same keys that you can share with me?

  1. The existing code is OK now, as I updated get_field_names() function to use field_values. And I updated datarecord.to_df() to use field_values instead of schema.field_types.
  2. I found this bug in HTMLFileDirectorySource, its schema is WebPage without "filename" attribute, but get_item() sets up filename.
  3. It's easy to reproduce by just defining a NewSource with a NewSchema. Its get_item() sets different attributes from NewSchema definition.

I just feel it's too easy to get wrong by using DataRecord in this way.

If my understanding is not complete, please correct me!

I was wrong. I was confused by the field_values and system protected_values when I created this issue, that's why I think it's confusing :P

Also, one current limitation of the existing approach is that users cannot use fields with names that match the top-level fields in DataRecord (e.g. id, parent_id, schema, etc.). This is especially bad for fields like id which are common and which Gerardo already had a name collision with. I'm open to any suggestions, especially if they can overcome this name collision issue!

I got this issue as well.

I like the simplest solution to solve the collisions. In addition I propose another solution to solve the first issue I mentioned above:

  1. Rename field_values to populated_fields: list[dict], which is clear to the system which fields are populated, which are just field definitions (from schema)
  2. remove field_types form dr, instead, all need of field types/desc should be read from self.schema.
  3. rename get_field_names --> get_populated_fields or so ..

In this way, we separate what we set for data format, and what we have for data.

  1. self.schema is how the users want the dr to have including types/desc;
  2. populated_fields is what has been populated;

Overall, resolving name collision has higher priority, we can rename system attributes with '_'. I think my proposal can make the system has clearer logic and easy to use in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants