Releases: Open-Dataplatform/utils-databricks
v0.7.3: Version 0.7.3 (#32)
Minor hotfix in the _process_xml_data function.
XML Parsing and Flattening
Overview
This pull request introduces significant enhancements to the existing data pipeline by extending its functionality to process XML files. The changes ensure seamless integration of XML data alongside the already supported formats, maintaining flexibility and robustness in the data processing workflow.
Key Updates
1. XML Parsing and Flattening:
- Implemented methods for reading, parsing, and flattening XML files using the xml format in Spark.
- Added recursive handling of nested structures and attributes to produce a clean, normalized schema (see the flattening sketch after this list).
2. Schema Management:
- Enhanced support for dynamic schema extraction from XSD (XML Schema Definition) files.
- Automated validation of XML files against XSD to ensure data integrity.
3. Data Quality Checks:
- Extended existing quality checks to validate XML data (sketched after this list), including:
- Null value checks for critical columns.
- Range validation for numerical fields.
- Consistency checks for related fields (e.g., start_date < end_date).
4. Merge Functionality:
- Updated the execute_merge function to support XML-based datasets, ensuring proper handling of updates, inserts, and deletes (see the merge sketch after this list).
- Simplified the merge implementation while maintaining detailed logging and statistics reporting (e.g., number of rows deleted, updated, or inserted).
5. Logging and Debugging:
- Improved logging to provide better insights into XML file processing, schema validation, and data transformations.
- Added support for SQL query formatting and structured log blocks for enhanced readability.
6. Documentation:
- Updated README and inline comments to reflect new XML support.
- Added examples and usage guidelines for working with XML files in the data pipeline.
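The recursive flattening in item 1 can be pictured with a short PySpark sketch. This is illustrative only: flatten_df, the rowTag value, and the input path are hypothetical stand-ins, not the package's actual API.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten_df(df: DataFrame, sep: str = "_") -> DataFrame:
    # Repeat until the schema contains no struct or array columns.
    while True:
        nested = [f for f in df.schema.fields
                  if isinstance(f.dataType, (StructType, ArrayType))]
        if not nested:
            return df
        field = nested[0]
        if isinstance(field.dataType, ArrayType):
            # One output row per array element (explode_outer keeps empty arrays as null).
            df = df.withColumn(field.name, F.explode_outer(F.col(field.name)))
        else:
            # Promote struct members (including XML attributes, which spark-xml
            # stores as fields prefixed with "_") to prefixed top-level columns.
            kept = [F.col(c) for c in df.columns if c != field.name]
            promoted = [F.col(f"{field.name}.{sub.name}").alias(f"{field.name}{sep}{sub.name}")
                        for sub in field.dataType.fields]
            df = df.select(*kept, *promoted)

# Reading with the spark-xml data source ("xml" format); rowTag is dataset-specific.
# spark-xml also supports XSD row validation via the rowValidationXSDPath option.
raw = (spark.read.format("xml")
       .option("rowTag", "record")
       .load("/mnt/landing/example/*.xml"))
flat = flatten_df(raw)
```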
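The quality checks in item 3 follow a fail-fast pattern. A minimal sketch, reusing flat from the block above and assuming made-up column names and a simplified signature (the package's perform_data_quality_checks may differ):

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def perform_xml_quality_checks(df: DataFrame, key_columns, range_rules, date_pairs):
    """Raise on the first violated rule (names and signature are illustrative)."""
    for col in key_columns:  # null checks for critical columns
        n = df.filter(F.col(col).isNull()).count()
        if n:
            raise ValueError(f"{n} null values in critical column '{col}'")
    for col, (lo, hi) in range_rules.items():  # range validation for numerical fields
        n = df.filter((F.col(col) < lo) | (F.col(col) > hi)).count()
        if n:
            raise ValueError(f"{n} values in '{col}' outside [{lo}, {hi}]")
    for start, end in date_pairs:  # consistency checks for related fields
        n = df.filter(F.col(start) >= F.col(end)).count()
        if n:
            raise ValueError(f"{n} rows violate {start} < {end}")

perform_xml_quality_checks(
    flat,
    key_columns=["record_id"],
    range_rules={"value": (0, 1_000_000)},
    date_pairs=[("start_date", "end_date")],
)
```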
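For item 4, the general technique is a Delta Lake merge keyed on key_columns, with row-level statistics read from the table history afterwards. A sketch of that technique, not the package's execute_merge implementation; table and view names are placeholders:

```python
from delta.tables import DeltaTable

def merge_view_into_table(spark, target_table: str, source_view: str, key_columns: list):
    """Upsert a temp view into a Delta table on the given keys (illustrative)."""
    condition = " AND ".join(f"t.{k} = s.{k}" for k in key_columns)
    (DeltaTable.forName(spark, target_table).alias("t")
        .merge(spark.table(source_view).alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        # Deletes of rows absent from the source would use
        # whenNotMatchedBySourceDelete() on recent Delta releases.
        .execute())
    # Delta records merge statistics in the table history's operationMetrics map.
    metrics = (DeltaTable.forName(spark, target_table)
               .history(1).select("operationMetrics").first()[0])
    print(f"inserted={metrics.get('numTargetRowsInserted')}, "
          f"updated={metrics.get('numTargetRowsUpdated')}, "
          f"deleted={metrics.get('numTargetRowsDeleted')}")
```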
Impact
This enhancement significantly improves the versatility of the data pipeline by enabling it to process XML files, a common data format in many industries. These changes not only broaden the scope of the pipeline but also ensure high data quality and consistent processing across all supported formats.
v0.6.9: Merge pull request #29 from xazms/version-0.6.9
Updated description of the latest changes
- Added a new function: Introduced parse_key_columns to standardize the handling of key_columns. This function ensures consistent processing by splitting comma-separated strings into lists or validating pre-existing lists, reducing potential input errors (a sketch follows this list).
- Refactored relevant methods: Updated _check_for_duplicates, perform_data_quality_checks, and any other methods utilizing key_columns to use the parse_key_columns function. This enhances code maintainability and ensures uniform logic for handling key_columns across all operations.
- Enhanced clarity and robustness: Improved input validation and error handling related to key_columns usage, ensuring seamless integration in both SQL and DataFrame operations.
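A minimal sketch of the behavior described above; the actual implementation in the package may differ:

```python
def parse_key_columns(key_columns):
    """Normalize key_columns to a list of column names.

    Accepts either a comma-separated string ("a, b") or a pre-built list;
    anything else is rejected.
    """
    if isinstance(key_columns, str):
        cols = [c.strip() for c in key_columns.split(",") if c.strip()]
    elif isinstance(key_columns, list):
        cols = [str(c).strip() for c in key_columns]
    else:
        raise TypeError("key_columns must be a string or a list of strings")
    if not cols:
        raise ValueError("key_columns must name at least one column")
    return cols

parse_key_columns("meter_id, reading_time")  # -> ['meter_id', 'reading_time']
parse_key_columns(["meter_id"])              # -> ['meter_id']
```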
Wildcard support
Validator Class:
- Enhanced file filtering using wildcards for better accuracy (sketched below).
- Improved error handling to gracefully terminate execution when no matching files are found, with clear logging and exit mechanisms.
DataFrameTransformer Class:
- Refined file matching logic to handle wildcard patterns consistently.
- Added robust handling and logging for cases where files are missing, ensuring transparent processing and clear error messages.
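The wildcard matching used by both classes can be illustrated with Python's fnmatch; filter_files, the file names, and the exit mechanism are illustrative stand-ins (the Validator may exit via dbutils.notebook.exit instead):

```python
import fnmatch
import sys

def filter_files(file_names, pattern):
    """Return files matching a wildcard pattern; terminate cleanly if none match."""
    matches = fnmatch.filter(file_names, pattern)
    if not matches:
        print(f"No files matching '{pattern}' were found; terminating.")
        sys.exit(1)
    return matches

files = ["sales_2024.xlsx", "sales_2023.xlsx", "readme.txt"]
filter_files(files, "sales_*.xlsx")  # -> ['sales_2024.xlsx', 'sales_2023.xlsx']
```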
XLSX support
v0.6.7: Version 0.6.7 (#27)
This release squashes a long run of incremental commits. The meaningful changes:
- reader.py: keep the whole folder path instead of removing its first part.
- Created helper.py to hold shared helper functions; created config.py and validation.py.
- Added process_and_flatten_json along with reader and writer integration; SparkSession is now passed in as a parameter instead of being initialized inside the class.
- get_json_depth logs the JSON depth once when a helper is provided.
- New quality module with perform_quality_check, including key_columns and view_name parameters; the Quality class was later updated to remove duplicates.
- setup_pipeline is now called automatically during notebook initialization.
- table_management returns temp_view_name; the merge function was altered to use the correct view, newly merged data is displayed, and no error is raised if the table doesn't exist.
- get_mount_point moved to the connector; file_type switched to file instead of schema, and dataframe excludes schema validation.
- Removed rename_and_cast_columns; merge queries are now printed for debugging.
- Updated logger, config, setup.cfg, and the standardization_template notebook; feedback_column now defaults to None.
Allow non-JSON/XML file types
Enable processing of file types other than JSON and XML.
Support for multiple key columns
Minor bug fixes; key_columns may now be supplied as a list.
v0.6.4: Pretty debug info
Pretty-printed debug information and formatted SQL in log output.
v0.6.3
Small bug fixes.