Releases: Open-Dataplatform/utils-databricks
v0.7.3: Version 0.7.3 (#32)
Minor hotfix in the _process_xml_data function.
XML Parsing and Flattening
Overview
This pull request introduces significant enhancements to the existing data pipeline by extending its functionality to process XML files. The changes ensure seamless integration of XML data alongside the already supported formats, maintaining flexibility and robustness in the data processing workflow.
Key Updates
1. XML Parsing and Flattening:
- Implemented methods for reading, parsing, and flattening XML files using the xml format in Spark.
- Added recursive handling of nested structures and attributes to produce a clean, normalized schema (see the flattening sketch after this list).
2. Schema Management:
- Enhanced support for dynamic schema extraction from XSD (XML Schema Definition) files.
- Automated validation of XML files against XSD to ensure data integrity.
3. Data Quality Checks:
- Extended existing quality checks to validate XML data (sketched after this list), including:
- Null value checks for critical columns.
- Range validation for numerical fields.
- Consistency checks for related fields (e.g., start_date < end_date).
4. Merge Functionality:
- Updated the execute_merge function to support XML-based datasets, ensuring proper handling of updates, inserts, and deletes (see the merge sketch after this list).
- Simplified the merge implementation while maintaining detailed logging and statistics reporting (e.g., number of rows deleted, updated, or inserted).
5. Logging and Debugging:
- Improved logging to provide better insights into XML file processing, schema validation, and data transformations.
- Added support for SQL query formatting and structured log blocks for enhanced readability.
6. Documentation:
- Updated README and inline comments to reflect new XML support.
- Added examples and usage guidelines for working with XML files in the data pipeline.
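The recursive flattening in item 1 can be pictured with a short PySpark sketch. This is illustrative only: flatten_df, the rowTag value, and the input path are hypothetical stand-ins, not the package's actual API.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten_df(df: DataFrame, sep: str = "_") -> DataFrame:
    # Repeat until the schema contains no struct or array columns.
    while True:
        nested = [f for f in df.schema.fields
                  if isinstance(f.dataType, (StructType, ArrayType))]
        if not nested:
            return df
        field = nested[0]
        if isinstance(field.dataType, ArrayType):
            # One output row per array element (explode_outer keeps empty arrays as null).
            df = df.withColumn(field.name, F.explode_outer(F.col(field.name)))
        else:
            # Promote struct members (including XML attributes, which spark-xml
            # stores as fields prefixed with "_") to prefixed top-level columns.
            kept = [F.col(c) for c in df.columns if c != field.name]
            promoted = [F.col(f"{field.name}.{sub.name}").alias(f"{field.name}{sep}{sub.name}")
                        for sub in field.dataType.fields]
            df = df.select(*kept, *promoted)

# Reading with the spark-xml data source ("xml" format); rowTag is dataset-specific.
# spark-xml also supports XSD row validation via the rowValidationXSDPath option.
raw = (spark.read.format("xml")
       .option("rowTag", "record")
       .load("/mnt/landing/example/*.xml"))
flat = flatten_df(raw)
```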
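The quality checks in item 3 follow a fail-fast pattern. A minimal sketch, reusing flat from the block above and assuming made-up column names and a simplified signature (the package's perform_data_quality_checks may differ):

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def perform_xml_quality_checks(df: DataFrame, key_columns, range_rules, date_pairs):
    """Raise on the first violated rule (names and signature are illustrative)."""
    for col in key_columns:  # null checks for critical columns
        n = df.filter(F.col(col).isNull()).count()
        if n:
            raise ValueError(f"{n} null values in critical column '{col}'")
    for col, (lo, hi) in range_rules.items():  # range validation for numerical fields
        n = df.filter((F.col(col) < lo) | (F.col(col) > hi)).count()
        if n:
            raise ValueError(f"{n} values in '{col}' outside [{lo}, {hi}]")
    for start, end in date_pairs:  # consistency checks for related fields
        n = df.filter(F.col(start) >= F.col(end)).count()
        if n:
            raise ValueError(f"{n} rows violate {start} < {end}")

perform_xml_quality_checks(
    flat,
    key_columns=["record_id"],
    range_rules={"value": (0, 1_000_000)},
    date_pairs=[("start_date", "end_date")],
)
```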
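For item 4, the general technique is a Delta Lake merge keyed on key_columns, with row-level statistics read from the table history afterwards. A sketch of that technique, not the package's execute_merge implementation; table and view names are placeholders:

```python
from delta.tables import DeltaTable

def merge_view_into_table(spark, target_table: str, source_view: str, key_columns: list):
    """Upsert a temp view into a Delta table on the given keys (illustrative)."""
    condition = " AND ".join(f"t.{k} = s.{k}" for k in key_columns)
    (DeltaTable.forName(spark, target_table).alias("t")
        .merge(spark.table(source_view).alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        # Deletes of rows absent from the source would use
        # whenNotMatchedBySourceDelete() on recent Delta releases.
        .execute())
    # Delta records merge statistics in the table history's operationMetrics map.
    metrics = (DeltaTable.forName(spark, target_table)
               .history(1).select("operationMetrics").first()[0])
    print(f"inserted={metrics.get('numTargetRowsInserted')}, "
          f"updated={metrics.get('numTargetRowsUpdated')}, "
          f"deleted={metrics.get('numTargetRowsDeleted')}")
```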
Impact
This enhancement significantly improves the versatility of the data pipeline by enabling it to process XML files, a common data format in many industries. These changes not only broaden the scope of the pipeline but also ensure high data quality and consistent processing across all supported formats.
v0.6.9: Merge pull request #29 from xazms/version-0.6.9
Updated description of the latest changes
- Added a new function: Introduced parse_key_columns to standardize the handling of key_columns. This function ensures consistent processing by splitting comma-separated strings into lists or validating pre-existing lists, reducing potential input errors (a sketch follows this list).
- Refactored relevant methods: Updated _check_for_duplicates, perform_data_quality_checks, and any other methods utilizing key_columns to use the parse_key_columns function. This enhances code maintainability and ensures uniform logic for handling key_columns across all operations.
- Enhanced clarity and robustness: Improved input validation and error handling related to key_columns usage, ensuring seamless integration in both SQL and DataFrame operations.
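A minimal sketch of the behavior described above; the actual implementation in the package may differ:

```python
def parse_key_columns(key_columns):
    """Normalize key_columns to a list of column names.

    Accepts either a comma-separated string ("a, b") or a pre-built list;
    anything else is rejected.
    """
    if isinstance(key_columns, str):
        cols = [c.strip() for c in key_columns.split(",") if c.strip()]
    elif isinstance(key_columns, list):
        cols = [str(c).strip() for c in key_columns]
    else:
        raise TypeError("key_columns must be a string or a list of strings")
    if not cols:
        raise ValueError("key_columns must name at least one column")
    return cols

parse_key_columns("meter_id, reading_time")  # -> ['meter_id', 'reading_time']
parse_key_columns(["meter_id"])              # -> ['meter_id']
```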
Wildcard support
Validator Class:
- Enhanced file filtering using wildcards for better accuracy (sketched below).
- Improved error handling to gracefully terminate execution when no matching files are found, with clear logging and exit mechanisms.
DataFrameTransformer Class:
- Refined file matching logic to handle wildcard patterns consistently.
- Added robust handling and logging for cases where files are missing, ensuring transparent processing and clear error messages.
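The wildcard matching used by both classes can be illustrated with Python's fnmatch; filter_files, the file names, and the exit mechanism are illustrative stand-ins (the Validator may exit via dbutils.notebook.exit instead):

```python
import fnmatch
import sys

def filter_files(file_names, pattern):
    """Return files matching a wildcard pattern; terminate cleanly if none match."""
    matches = fnmatch.filter(file_names, pattern)
    if not matches:
        print(f"No files matching '{pattern}' were found; terminating.")
        sys.exit(1)
    return matches

files = ["sales_2024.xlsx", "sales_2023.xlsx", "readme.txt"]
filter_files(files, "sales_*.xlsx")  # -> ['sales_2024.xlsx', 'sales_2023.xlsx']
```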
XLSX support
v0.6.7: Version 0.6.7 (#27)
This release squashes a long run of incremental commits. The meaningful changes:
- reader.py: keep the whole folder path instead of removing its first part.
- Created helper.py to hold shared helper functions; created config.py and validation.py.
- Added process_and_flatten_json along with reader and writer integration; SparkSession is now passed in as a parameter instead of being initialized inside the class.
- get_json_depth logs the JSON depth once when a helper is provided.
- New quality module with perform_quality_check, including key_columns and view_name parameters; the Quality class was later updated to remove duplicates.
- setup_pipeline is now called automatically during notebook initialization.
- table_management returns temp_view_name; the merge function was altered to use the correct view, newly merged data is displayed, and no error is raised if the table doesn't exist.
- get_mount_point moved to the connector; file_type switched to file instead of schema, and dataframe excludes schema validation.
- Removed rename_and_cast_columns; merge queries are now printed for debugging.
- Updated logger, config, setup.cfg, and the standardization_template notebook; feedback_column now defaults to None.
Allow non-JSON/XML file types
Enable processing of file types other than JSON and XML.
Support for multiple key columns
Minor bug fixes; key_columns may now be supplied as a list.
v0.6.4: Pretty debug info
Pretty-printed debug information and formatted SQL in log output.
v0.6.3
Small bug fixes.