Squashed commit of the following:

commit 60441a1
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 23 11:59:21 2024 -0400

    Upload of basic Tensorflow model

    Still needs a unit test, but the model works well for relatively
    standard models. Subclassed models may need specialized modules.

commit b009362
Author: Steven Goldenberg <[email protected]>
Date:   Fri May 10 11:46:50 2024 -0400

    Squashed commit of the following:

commit 7ff070a
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 9 17:07:04 2024 -0400

    Format pandas_standard_scaler using black

    Unittests still work and changes seem to mostly be single quotes to
    double and removing spaces.

commit dac7a31
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 9 16:17:17 2024 -0400

    Make unittest for config IO

commit 447a444
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 9 15:35:09 2024 -0400

    Fix more documentation in pandas_standard_scaler

commit 1a46420
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 9 14:35:00 2024 -0400

    Update pandas_standard_scaler.py

    Adding lots of additional documentation. Some functions aren't fully
    documented, but this is a very good start.

commit de25bd6
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 9 13:21:29 2024 -0400

    Rename IO functions for yaml configurations

    [save/load]_config --> [save/load]_yaml_config to avoid confusion if
    we want another version for JSON or something else.

commit a001284
Author: Steven Goldenberg <[email protected]>
Date:   Wed May 8 14:52:09 2024 -0400

    Simplify save/load config

    Saving and loading configurations are now done by utility functions
    in the utils folder. This simplifies the modules and allows for unit
    testing on the config I/O functions outside of any module. Utility
    functions try to properly handle FileNotFound and FileExists errors.

commit 83f019f
Merge: a3f758f c63e630
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 2 09:25:19 2024 -0400

    Merge branch 'main' into 36-make-pandasstandardscaler-for-hugs

commit a3f758f
Author: Steven Goldenberg <[email protected]>
Date:   Mon Apr 29 15:12:56 2024 -0400

    Fix reverse() and add unit test

commit 79566d2
Author: Steven Goldenberg <[email protected]>
Date:   Mon Apr 29 14:53:35 2024 -0400

    Full implementation of scaler with unittests

    The implementation avoids using scikit-learn entirely for a number of
    reasons including no option for axis changes and no option for
    changing the epsilon value when dealing with small variances.

commit ca21fcf
Author: Steven Goldenberg <[email protected]>
Date:   Fri Apr 26 17:35:18 2024 -0400

    Started implementation of PandasStandardScaler

    Using scikit-learn's StandardScaler as a base. Saving and loading is
    a bit tricky as the internal state of the scikit-learn implementation
    isn't easily saved. It looks like you can save its internal __dict__
    and then set the attributes on load, but I'm not sure if this is
    robust...

commit 574d0ac
Author: Steven Goldenberg <[email protected]>
Date:   Fri Apr 26 16:17:59 2024 -0400

    Fix tab bug in pandas_standard_scaler.py

commit 676d640
Author: Steven Goldenberg <[email protected]>
Date:   Fri Apr 26 16:07:31 2024 -0400

    Create pandas_standard_scaler.py

    Initial upload of a standard scaler that supports pandas dataframes
    as input. Maybe it should be renamed to DataFrame scaler to match the
    parser_to_dataframe.py file...

commit c63e630
Merge: eea09a5 ff117c0
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 2 09:22:45 2024 -0400

    Merge pull request #35 from JeffersonLab/24-common-csv-parser

    24 common csv parser

commit ff117c0
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 2 09:18:31 2024 -0400

    Delete csv_parser_v0.py

    The functionality of the CSV parser is being handled by the
    Parser2Dataframe now and is no longer used as the entrypoint to the
    registered CSVParser. Keeping this code will only make the repository
    more confusing.

commit e11864f
Author: Steven Goldenberg <[email protected]>
Date:   Thu May 2 09:09:36 2024 -0400

    Save_Load Unit Tests

    Also fixes a few documentation errors and the unittest logging (now
    -v turns on extra logging from the module)
JeffersonLab · May 23, 2024 · a851c5d
1 parent e9bd796
Showing 12 changed files with 673 additions and 143 deletions.
diff --git a/jlab_datascience_toolkit/cfgs/base_model_cfg.yaml b/jlab_datascience_toolkit/cfgs/base_model_cfg.yaml
@@ -0,0 +1,31 @@
+model_config:
+  class_name: Sequential
+  config:
+    dtype: float32
+    layers:
+    - class_name: Dense
+      config:
+        activation: relu
+        units: 128
+      module: keras.layers
+      registered_name: null
+    - class_name: Dense
+      config:
+        units: 1
+      module: keras.layers
+      registered_name: null
+    name: basic_model
+    trainable: true
+  module: keras
+  registered_name: null
+compile_config:
+  loss:
+    class_name: BinaryCrossentropy
+    config:
+      from_logits: true
+    module: keras.losses
+    registered_name: null
+  metrics:
+  - accuracy
+  optimizer: adam
+
diff --git a/jlab_datascience_toolkit/data_parser/csv_parser_v0.py b/jlab_datascience_toolkit/data_parser/csv_parser_v0.py
diff --git a/jlab_datascience_toolkit/data_parser/parser_to_dataframe.py b/jlab_datascience_toolkit/data_parser/parser_to_dataframe.py
@@ -1,4 +1,5 @@
 from jlab_datascience_toolkit.core.jdst_data_parser import JDSTDataParser
+from jlab_datascience_toolkit.utils.io import save_yaml_config, load_yaml_config
 from pathlib import Path
 import pandas as pd
 import logging
@@ -30,8 +31,9 @@ class Parser2DataFrame(JDSTDataParser):
             Format of files to parse. Currently supports csv, feather, json
             and pickle. Defaults to csv
         `read_kwargs: dict = {}`
-            Arguments to be passed 
+            Arguments to be passed to the read function determined by `file_format`
         `concat_kwargs: dict = {}`
+            Arguments to be passed to pd.concat()
 
     Attributes
     ----------
@@ -118,11 +120,7 @@ def load(self, path: str):
             path (str): Path to folder containing module files.
         """
         base_path = Path(path)
-        with open(base_path.joinpath('config.yaml'), 'r') as f:
-            loaded_config = yaml.safe_load(f)
-
-        self.config.update(loaded_config)
-        self.setup()
+        self.load_config(base_path)
 
     def save(self, path: str):
         """Save the entire module state to a folder at `path`
@@ -132,8 +130,7 @@ def save(self, path: str):
         """
         save_dir = Path(path)
         os.makedirs(save_dir)
-        with open(save_dir.joinpath('config.yaml'), 'w') as f:
-            yaml.safe_dump(self.config, f)
+        self.save_config(save_dir)
 
     def load_data(self) -> pd.DataFrame:
         """ Loads all files listed in `config['filepaths']` 
@@ -165,13 +162,19 @@ def load_data(self) -> pd.DataFrame:
 
         return output
 
-    def load_config(self, path: str):
-        parser_log.debug('Calling load()...')
-        return self.load(path)
+    def load_config(self, path: Path | str):
+        self.config.update(load_yaml_config(path))
+        self.setup()
+
+    def save_config(self, path: Path | str, overwrite=False):
+        """ Saves this module's configuration to the file specified by path
+            If path is a directory, we save the configuration as config.yaml
 
-    def save_config(self, path: str):
-        parser_log.debug('Calling save()...')
-        return self.save(path)
+        Args:
+            path (Path | str): Location for saved configuration. Either a filename or directory is 
+                acceptable.
+        """
+        save_yaml_config(self.config, path, overwrite)
 
     def save_data(self):
         return super().save_data()
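
The new `load_config` and `save_config` methods delegate to `load_yaml_config` and `save_yaml_config`, whose bodies do not appear in this part of the diff. The block below is only a rough sketch of the behavior described in the commit message and docstring (a directory path resolves to `config.yaml`, an existing file is protected unless `overwrite` is set, and a missing file raises `FileNotFoundError`). The `_sketch` function names and the `_resolve` helper are hypothetical and are not the toolkit's implementation; PyYAML is assumed to be installed.

```python
import tempfile
from pathlib import Path

import yaml  # PyYAML, assumed to be available


def _resolve(path, default_name="config.yaml"):
    """Hypothetical helper: a directory means '<directory>/config.yaml'."""
    path = Path(path)
    return path / default_name if path.is_dir() else path


def save_yaml_config_sketch(config, path, overwrite=False):
    """Write `config` as YAML, refusing to clobber a file unless asked to."""
    target = _resolve(path)
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} exists; pass overwrite=True to replace it")
    with open(target, "w") as f:
        yaml.safe_dump(config, f)
    return target


def load_yaml_config_sketch(path):
    """Read a YAML configuration, raising a clear error if it is missing."""
    target = _resolve(path)
    if not target.exists():
        raise FileNotFoundError(f"No configuration found at {target}")
    with open(target, "r") as f:
        return yaml.safe_load(f)


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved_to = save_yaml_config_sketch({"file_format": "csv", "read_kwargs": {}}, tmp)
    round_trip = load_yaml_config_sketch(tmp)
    try:
        save_yaml_config_sketch({}, tmp)  # same target, overwrite left at False
        second_save_blocked = False
    except FileExistsError:
        second_save_blocked = True
```

Keeping the path handling in one helper means both directions agree on where a directory-style path points, which is what makes the I/O testable outside of any module.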
diff --git a/jlab_datascience_toolkit/data_prep/__init__.py b/jlab_datascience_toolkit/data_prep/__init__.py
@@ -7,3 +7,7 @@
 
 from jlab_datascience_toolkit.data_prep.numpy_minmax_scaler import NumpyMinMaxScaler
 
+register(
+    id = "PandasStandardScaler_v0",
+    entry_point="jlab_datascience_toolkit.data_prep.pandas_standard_scaler:PandasStandardScaler"
+)
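
The commit message says the scaler avoids scikit-learn in order to support an axis option and a configurable epsilon for small variances. The registered `PandasStandardScaler` source is not shown in this part of the diff, so the following is only a minimal sketch of that idea. The function names, the default `epsilon`, and the choice to replace near-zero standard deviations with 1 are all assumptions, not the module's actual behavior.

```python
import pandas as pd


def standard_scale(df, axis=0, epsilon=1e-7):
    """Scale to zero mean and unit variance along `axis` (0 = per column)."""
    mean = df.mean(axis=axis)
    std = df.std(axis=axis, ddof=0)
    # Assumed convention: a spread at or below epsilon is treated as constant,
    # so we divide by 1 instead of a near-zero number.
    std = std.where(std > epsilon, 1.0)
    # Statistics computed along one axis are broadcast along the other.
    other_axis = 1 - axis
    scaled = df.sub(mean, axis=other_axis).div(std, axis=other_axis)
    return scaled, mean, std


def reverse_scale(scaled, mean, std, axis=0):
    """Undo standard_scale using the stored mean and std."""
    other_axis = 1 - axis
    return scaled.mul(std, axis=other_axis).add(mean, axis=other_axis)


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled, mean, std = standard_scale(df)
restored = reverse_scale(scaled, mean, std)
```

Column `b` is constant, so its standard deviation falls below `epsilon` and the column is only centered. Because the mean and standard deviation are plain pandas Series, they could be written out and reloaded directly, which sidesteps the saved-state concern raised in the commit message about scikit-learn's internal `__dict__`.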