Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

24 common csv parser #35

Merged
merged 10 commits into from
May 2, 2024
Merged

24 common csv parser #35

merged 10 commits into from
May 2, 2024

Conversation

sgoldenCS
Copy link
Contributor

This pull request should close #24 and close #33. The implementation of CSVParser is essentially the PandasParser with the file_format field forced to 'csv'.

Personally, I'm not too fond of this solution as it will require a different parser for every file format and will result in a large amount of code duplication. However, I don't see another way to address #24 and #33 since it's apparent the current solution is not acceptable....

Moved the read_functions dictionary inside of the read_data_to_pandas function to avoid creation of a global variable.
Allows for easier future modifications if necessary.
@sgoldenCS sgoldenCS linked an issue Apr 22, 2024 that may be closed by this pull request
1 task
@sgoldenCS sgoldenCS requested a review from schr476 April 22, 2024 17:17
@dlersch
Copy link
Contributor

dlersch commented Apr 23, 2024

@sgoldenCS , this looks good. thank you. I have two suggestions (that I am happy to discuss):

  1. Rename the utest_pandas_parser_v0.py to: utest_csv_parser.py just so that it is consistent with everything else in the repo and is consistent with the parser name
  2. Remove the pandas_parser_v0.py You implemented a nice utility that does everything that the pandas_parser did. If people need a json_parser for example, they can just use your util and write their own parser, following your example

@sgoldenCS
Copy link
Contributor Author

@dlersch I'm happy to rename the unit test. It's just a name anyway.

As for removing the pandas_parser, I'm hesitant to do that. While the utility I've implemented does the same work, I really question the idea of having multiple essentially duplicate parsers. I know @schr476 believes that it's too complicated, but if the goal of the modules is to reduce unnecessary work, I don't see how writing 150 lines of duplicate code for every file format makes any sense. The code is almost the same size as the full "pandas_parser"! It also fragments the code base and introduces more opportunities for bugs.

If we want a registered version for each file type, that could be solved by properly addressing #32 and using the pandas_parser. If we're unhappy with the "pandas_parser" name, we can agree on a more generic name like "base_parser_to_pandas". Ultimately, no good rationale for making these changes has been offered besides complaints about naming...

@dlersch
Copy link
Contributor

dlersch commented Apr 23, 2024

@sgoldenCS I do see your points. If you could just rename the unit-test, that would be great and I will accept your pull-request. As for a more generic name for the pandas_parser -> How about dataframe_parser ? The object returned is technically a dataframe . This is just a suggestion. I have no attachments to the name

@schr476
Copy link
Contributor

schr476 commented Apr 23, 2024

@dlersch I'm happy to rename the unit test. It's just a name anyway.

As for removing the pandas_parser, I'm hesitant to do that. While the utility I've implemented does the same work, I really question the idea of having multiple essentially duplicate parsers. I know @schr476 believes that it's too complicated, but if the goal of the modules is to reduce unnecessary work, I don't see how writing 150 lines of duplicate code for every file format makes any sense. The code is almost the same size as the full "pandas_parser"! It also fragments the code base and introduces more opportunities for bugs.

If we want a registered version for each file type, that could be solved by properly addressing #32 and using the pandas_parser. If we're unhappy with the "pandas_parser" name, we can agree on a more generic name like "base_parser_to_pandas". Ultimately, no good rationale for making these changes has been offered besides complaints about naming...

I think we agreeed that the goal of the parser is to read a particular data format.
We can make the default output format to be Pandas and converters to another format as needed.

@sgoldenCS
Copy link
Contributor Author

I don't remember if we agreed to that, but I don't think any agreement invalidates my points. We can still have a registered version of the parser for each data format without having to rewrite all of the code each time we need a new file format.

@schr476
Copy link
Contributor

schr476 commented Apr 23, 2024

@sgoldenCS , If we conform to one output format (Pandas) then we can write converters to satisfy other needed.
Therefore you write a parser per data type and converters as needed (which I would assume is relatively small)

@sgoldenCS
Copy link
Contributor Author

I'm not sure I follow. I haven't mentioned output data types at all, so why are you talking about output converters?

I feel this needs to be an actual conversation, not just GitHub comments. I don't believe having these disagreements on a (semi-)public platform is fair to me or anyone else who receives notifications for it.

@schr476
Copy link
Contributor

schr476 commented Apr 23, 2024

I'm not sure I follow. I haven't mentioned output data types at all, so why are you talking about output converters?

I feel this needs to be an actual conversation, not just GitHub comments. I don't believe having these disagreements on a (semi-)public platform is fair to me or anyone else who receives notifications for it.

Pandas is the output type for the CSVParser.
For simplicity, I propose we use Pandas DFs as a common output format and move forward.

@sgoldenCS
Copy link
Contributor Author

I've removed the registration of the PandasParser and updated the name of the code file. I also removed all references to Pandas in the file that weren't required.

I will be preparing a small presentation to explain how I believe the "parser_to_dataframe.py" implementation will allow for much faster development while still maintaining good versioning practices.

@schr476 and I had a discussion this morning and I like the idea of defining a Parser Plugin to formalize the process of creating read_functions for the parser_to_dataframe.py module. If we agree on this approach, we can make an issue/branch and implement it.

The registry now uses the "parser_to_dataframe" as a base class with file_format given as a registry kwarg.

The unit test has been updated to fix naming issues and output logging info to stdout to verify modules are working appropriately with the registry.
@sgoldenCS
Copy link
Contributor Author

After my conversation with @schr476 yesterday, I have updated a few things.

  • The unit test has had naming issues fixed
  • The unit test now prints all logging messages (debug and above) to stdout
  • The parser_to_dataframe now accepts a "registry_config"
  • Documentation has been added in the comments of parser_to_dataframe to explain the configuration priorities
    • Highest priority is anything in the "config" that should be used by the user
    • Next is anything defined in the "registry_config", which should be reserved for the registry to pass kwargs to the module
    • Any remaining unspecified arguments are set by a default config defined in the module
  • The registry now registers the CSVParser using Parser2DataFrame with kwargs={'registry_config': {'file_format': 'csv'}}

The unit test for the CSVParser still works correctly and completes all tests, so none of these changes should break anything.

A few things to note:

  1. The csv_parser_v0.py code has not been deleted but is no longer registered. Since the new registered CSVParser implements all of the same functionality, we could delete it if people agree.
  2. The parser_to_dataframe.py code could use the new utility functions instead of the implementations in the module (they're identical, but using utilities could be nice)
  3. Using the registry_config works for now, but we should agree on an approach (see Registry Kwargs Updating Issue #32)

sgoldenCS added 2 commits May 2, 2024 09:09
Also fixes a few documentation errors and the unittest logging (now -v turns on extra logging from the module)
The functionality of the CSV parser is being handled by the Parser2Dataframe now and is no longer used as the entrypoint to the registered CSVParser. Keeping this code will only make the repository more confusing.
@sgoldenCS
Copy link
Contributor Author

We can complete this pull request and its associated issues. Any further changes can be made through a future issue/pull if we need to make them, but this code should serve as a good starting point.

I've deleted the csv_parser_v0.py file to avoid confusion as it is not registered and its functionality is fully implemented in Parser2DataFrame. The parser_utilities.py file in the utils folder is not being used, but the functions defined there could still be useful to someone in the future, so I'll leave them there for now.

@sgoldenCS sgoldenCS merged commit c63e630 into main May 2, 2024
sgoldenCS added a commit that referenced this pull request May 23, 2024
commit 60441a1
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 23 11:59:21 2024 -0400

    Upload of basic Tensorflow model

    Still needs a unit test, but the model works well for relatively standard models. Subclassed models may need specialized modules.

commit b009362
Author: Steven Goldenberg <[email protected]>
Date:   Fri May 10 11:46:50 2024 -0400

    Squashed commit of the following:

    commit 7ff070a
    Author: Steven Goldenberg <[email protected]>
    Date:   Thu May 9 17:07:04 2024 -0400

        Format pandas_standard_scaler using black

        Unittests still work and changes seem to mostly be single quotes to double and removing spaces.

    commit dac7a31
    Author: Steven Goldenberg <[email protected]>
    Date:   Thu May 9 16:17:17 2024 -0400

        Make unittest for config IO

    commit 447a444
    Author: Steven Goldenberg <[email protected]>
    Date:   Thu May 9 15:35:09 2024 -0400

        Fix more documentation in pandas_standard_scaler

    commit 1a46420
    Author: Steven Goldenberg <[email protected]>
    Date:   Thu May 9 14:35:00 2024 -0400

        Update pandas_standard_scaler.py

        Adding lots of additional documentation. Some functions aren't fully documented, but this is a very good start.

    commit de25bd6
    Author: Steven Goldenberg <[email protected]>
    Date:   Thu May 9 13:21:29 2024 -0400

        Rename IO functions for yaml configurations

        [save/load]_config --> [save/load]_yaml_config to avoid confusion if we want another version for JSON or something else.

    commit a001284
    Author: Steven Goldenberg <[email protected]>
    Date:   Wed May 8 14:52:09 2024 -0400

        Simplify save/load config

        Saving and loading configurations are now done by utility functions in the utils folder. This simplifies the modules and allows for unit testing on the config I/O functions outside of any module.
        Utility functions try to properly handle FileNotFound and FileExists errors.

    commit 83f019f
    Merge: a3f758f c63e630
    Author: Steven Goldenberg <[email protected]>
    Date:   Thu May 2 09:25:19 2024 -0400

        Merge branch 'main' into 36-make-pandasstandardscaler-for-hugs

    commit a3f758f
    Author: Steven Goldenberg <[email protected]>
    Date:   Mon Apr 29 15:12:56 2024 -0400

        Fix reverse() and add unit test

    commit 79566d2
    Author: Steven Goldenberg <[email protected]>
    Date:   Mon Apr 29 14:53:35 2024 -0400

        Full implementation of scaler with unittests

        The implementation avoids using scikit-learn entirely for a number of reasons including no option for axis changes and no option for changing the epsilon value when dealing with small variances.

    commit ca21fcf
    Author: Steven Goldenberg <[email protected]>
    Date:   Fri Apr 26 17:35:18 2024 -0400

        Started implementation of PandasStandardScaler

        Using scikit-learn's StandardScaler as a base. Saving and loading is a bit tricky as the internal state of the scikit-learn implementation isn't easily saved. It looks like you can save it's internal __dict__ and then set the attributes on load, but I'm not sure if this is robust...

    commit 574d0ac
    Author: Steven Goldenberg <[email protected]>
    Date:   Fri Apr 26 16:17:59 2024 -0400

        Fix tab bug in pandas_standard_scaler.py

    commit 676d640
    Author: Steven Goldenberg <[email protected]>
    Date:   Fri Apr 26 16:07:31 2024 -0400

        Create pandas_standard_scaler.py

        Inital upload of a standard scaler that supports pandas dataframes as input. Maybe it should be renamed to DataFrame scaler to match the parser_to_dataframe.py file...

commit c63e630
Merge: eea09a5 ff117c0
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 2 09:22:45 2024 -0400

    Merge pull request #35 from JeffersonLab/24-common-csv-parser

    24 common csv parser

commit ff117c0
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 2 09:18:31 2024 -0400

    Delete csv_parser_v0.py

    The functionality of the CSV parser is being handled by the Parser2Dataframe now and is no longer used as the entrypoint to the registered CSVParser. Keeping this code will only make the repository more confusing.

commit e11864f
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 2 09:09:36 2024 -0400

    Save_Load Unit Tests

    Also fixes a few documentation errors and the unittest logging (now -v turns on extra logging from the module)
@sgoldenCS sgoldenCS deleted the 24-common-csv-parser branch November 4, 2024 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change PandasParser to CSVParser Common CSV parser
4 participants