Detect tables that are not present in the mapping file #2205

aminmovahed-db · 2024-07-18T04:24:00Z

Changes

Linked issues

Resolves #1221

Functionality

added relevant user documentation
added new CLI command
modified existing command: databricks labs ucx ...
added a new workflow
modified existing workflow: table-migration
added a new table
modified existing table: ...

Tests

manually tested
added unit tests
added integration tests
verified on staging environment (screenshot attached)

src/databricks/labs/ucx/hive_metastore/mapping.py

JCZuurmond

@aminmovahed-db : As discussed, verify this is the right place for the not mapped tables or if it should be part of the table-migration instead as mapping.csv contains all tables crawled during the assessment and before the table-migration user might edited the mapping.csv

JCZuurmond

As discussed or additionally:

Reuse the is_migrated that uses the index
Update the dashboard to show the not migrated tables
Create an integration test checking that the log statements for non-migrated tables are there
Create the unit test as mentioned in your todo
Create a unit test for not_migrated_refresh

src/databricks/labs/ucx/hive_metastore/table_migrate.py

src/databricks/labs/ucx/hive_metastore/workflows.py

github-actions · 2024-07-30T01:33:29Z

✅ 40/40 passed, 5 flaky, 3 skipped, 50m7s total

Flaky tests:

🤪 test_migrate_managed_tables_with_acl (1m6.708s)
🤪 test_table_migration_job_refreshes_migration_status[hiveserde-migrate-external-tables-ctas] (29.024s)
🤪 test_migrate_view (1m56.559s)
🤪 test_migration_job_ext_hms[regular] (2m55.387s)
🤪 test_table_migration_job_refreshes_migration_status[hiveserde-migrate-external-hiveserde-tables-in-place-experimental] (2m9.077s)

_{Running from acceptance #5221}

JCZuurmond

@aminmovahed-db : How is it going? I think you are getting there!

Do not forget to upload a screenshot of the dashboard once you are finished

src/databricks/labs/ucx/hive_metastore/table_migrate.py

src/databricks/labs/ucx/hive_metastore/workflows.py

tests/integration/hive_metastore/test_workflows.py

src/databricks/labs/ucx/queries/assessment/main/40_4_remaining_tables.sql

Introduces #1221

This reverts commit 61282187343f8fe5ef81593c0acea67395f2bbe4.

Introduces #1221

tests/integration/hive_metastore/test_workflows.py

The approach has been significantly changed.

JCZuurmond

💯 LGTM! Please ask @nfx for a review after you fixed the CI

nfx

lgtm

@Retried

* Added `validate-table-locations` command for checking overlapping tables across workspaces ([#2341](#2341)). A new command, `validate-table-locations`, has been added to check for overlapping table locations across workspaces before migrating tables. This command is intended to ensure that tables can be migrated across workspaces without issues. The new command is part of the table migration workflows and uses a `LocationTrie` data structure to efficiently search for overlapping table locations. If any overlaps are found, the command logs a warning message and adds the conflicting tables to a list of all conflicts. This list is returned at the end of the command. The `validate-table-locations` command is intended to be run before migrating tables to ensure that the tables can be migrated without conflicts. The command includes a `workspace-ids` flag, which allows users to specify a list of workspace IDs to include in the validation. If this flag is not provided, the command will include all workspaces present in the account. This new command resolves issue [#673](#673). The `validate_table_locations` method is added to the `AccountAggregate` class and the `ExternalLocations` class has been updated to use the new `LocationTrie` class. The import section has also been updated to include new modules such as `LocationTrie` and `Table` from `databricks.labs.ucx.hive_metastore.locations` and `databricks.labs.ucx.hive_metastore.tables` respectively. Additionally, test cases have been added to ensure the correct functioning of the `LocationTrie` class. * Added references to hive_metastore catalog in all table references an… ([#2419](#2419)). In this release, we have updated various methods and functions across multiple files to include explicit references to the `hive_metastore` catalog in table references. This change aims to improve the accuracy and consistency of table references in the codebase, enhancing reliability and maintainability. Affected files include `azure.py`, `init_scripts.py`, `pipelines.py`, and others in the `databricks/labs/ucx/assessment` module, as well as test files in the `tests/unit/assessment` and `tests/unit/azure` directories. The `_try_fetch` method has been updated to include the catalog name in table references in all instances, ensuring the correct catalog is referenced in all queries. Additionally, various test functions in affected files have been updated to reference the `hive_metastore` catalog in SQL queries. This update is part of the resolution of issue [#2207](#2207) and promotes robust handling of catalog, schema, and table naming scenarios in hive metastore migration status management. * Added support for skipping views when migrating tables and views ([#2343](#2343)). In this release, we've added support for skipping both tables and views during the migration process in the `databricks labs ucx` command, addressing issue [#1937](#1937). The `skip` command has been enhanced to support skipping views, and new functions `skip_table_or_view` and `load_one` have been introduced to the `Table` class. Appropriate error handling and tests, including unit tests and integration tests, have been implemented to ensure the functionality works as expected. With these changes, users can now skip views during migration and have more flexibility when working with tables in the Unity Catalog. * Avoid false positives when linting for pyspark patterns ([#2381](#2381)). This release includes enhancements to the PySpark linter aimed at reducing false positives during linting. The linter has been updated to check the originating module when detecting PySpark calls, ensuring that warnings are triggered only for relevant nodes from the pyspark or dbutils modules. Specifically, the `ReturnValueMatcher` and `DirectFilesystemAccessMatcher` classes have been modified to include this new check. These changes improve the overall accuracy of the PySpark linter, ensuring that only pertinent warnings are surfaced during linting. Additionally, the commit includes updated unit tests to verify the correct behavior of the modified linter. Specific improvements have been made to avoid false positives when detecting the `listTables` function in the PySpark catalog, ensuring that the warning is only triggered for the actual PySpark `listTables` method call. * Bug: Generate custom warning when doing table size check and encountering DELTA_INVALID_FORMAT exception ([#2426](#2426)). A modification has been implemented in the `_safe_get_table_size` method within the `table_size.py` file of the `hive_metastore` package. This change addresses an issue ([#1913](#1913)) concerning the occurrence of a `DELTA_INVALID_FORMAT` exception while determining the size of a Delta table. Instead of raising an error, the exception is now converted into a warning, and the function proceeds to process the rest of the table. A corresponding warning message has been added to inform users about the issue and suggest checking the table structure. No new methods have been introduced, and existing functionalities have been updated to handle this specific exception more gracefully. The changes have been thoroughly tested with unit tests for the table size check when encountering a `DELTA_INVALID_FORMAT` error, employing a mock backend and a mock Spark session to simulate the error conversion. This change does not affect user documentation, CLI commands, workflows, or tables, and is solely intended for software engineers adopting the project. * Clean up left over uber principal resources for Azure ([#2370](#2370)). This commit includes modifications to the Azure access module of the UCX project to clean up resources if the creation of the uber principal fails midway. It addresses issues [#2360](#2360) (Azure part) and [#2363](#2363), and modifies the command `databricks labs ucx create-uber-principal` to include this functionality. The changes include adding new methods and modifying existing ones for working with Azure resources, such as `StorageAccount`, `AccessConnector`, and `AzureRoleAssignment`. Additionally, new unit and integration tests have been added and manually tested to ensure that the changes work as intended. The commit also includes new fixtures for testing storage accounts and access connectors, and a test case for getting, applying, and deleting storage permissions. The `azure_api_client` function has been updated to handle different input argument lengths and methods such as "get", "put", and "post". A new managed identity, "appIduser1", has been added to the Azure mappings file, and the corresponding role assignments have been updated. The changes include error handling mechanisms for certain scenarios that may arise during the creation of the uber service principal. * Crawlers: Use `TRUNCATE TABLE` instead of `DELETE FROM` when resetting crawler tables ([#2392](#2392)). In this release, the `.reset()` method for crawlers has been updated to use `TRUNCATE TABLE` instead of `DELETE FROM` when clearing out crawler tables, resulting in more efficient and idiomatic code. This change affects the existing `migrate-data-reconciliation` workflow and is accompanied by updated unit and integration tests to ensure correct functionality. The `reset()` method now accepts a table name argument, which is passed to the newly introduced `escape_sql_identifier()` utility function from the `databricks.labs.ucx.framework.utils` module for added safety. The migration status is now refreshed using the `TRUNCATE TABLE` command, which removes all records from the table, providing improved performance compared to the previous implementation. The `SHOW DATABASES` and `TRUNCATE TABLE` queries are validated in the `refresh_migration_status` workflow test, which now checks if the `TRUNCATE TABLE` query is used instead of `DELETE FROM` when resetting crawler tables. * Detect tables that are not present in the mapping file ([#2205](#2205)). In this release, we have introduced a new method `get_remaining_tables()` that returns a list of tables in the Hive metastore that have not been processed by the migration tool. This method performs a full refresh of the index and checks each table in the Hive metastore against the index to determine if it has been migrated. We have also added a new private method `_is_migrated()` to check if a given table has already been migrated. Additionally, we have replaced the `refresh_migration_status` method with `update_migration_status` in several workflows to present a more accurate representation of the migration process in the dashboard. A new SQL script, 04_1_remaining_hms_tables.sql, has been added to list the remaining tables in Hive Metastore which are not present in the mapping file. We have also added a new test for the table migration job that verifies that tables not present in the mapping file are detected and reported. A new test function `test_refresh_migration_status_published_remained_tables` has been added to ensure that the migration process correctly handles the case where tables have been published to the target metadata store but still remain in the source metadata store. These changes are intended to improve the functionality of the migration tool for Hive metastore tables and resolve issue [#1221](#1221). * Fixed ConcurrentDeleteReadException in migrate-view task during table migration ([#2282](#2282)). In this release, we have implemented a fix for the ConcurrentDeleteReadException that occurred during the `migrate-view` command's migration task. The solution involved moving the refreshing of migration status between batches from within the batches. Along with the fix, we added a new method `index()` to the `TableMigrate` class, which checks if a table has been migrated or not. This method is utilized in the `_view_can_be_migrated` method to ensure that all dependencies of a view have been migrated before migrating the view. The `index_full_refresh()` method, which earlier performed this check, has been modified to refresh the index between batches instead of within batches. It is worth noting that the changes made have been manually tested, but no unit tests, integration tests, or verification on staging environments have been added. The target audience for this release is software engineers who adopt this project. No new documentation, commands, workflows, or tables have been added or modified in this release. * Fixed documentation typos: `create-missing-pricipals` -> `create-missing-principals` ([#2357](#2357)). This pull request resolves typographical errors in the `create-missing-principals` command documentation, correcting the mistaken usage of `create-missing-pricipals` throughout the project documentation. The changes encompass the command description, the UCX command section, and the manual process documentation for AWS storage credentials. The `create-missing-principals` command, utilized by Cloud Admins to create and configure new AWS roles for Unity Catalog access, remains functionally unaltered. * Fixed linting for Spark Python workflow tasks ([#2349](#2349)). This commit updates the linter to support PySpark tasks in workflows by modifying the existing `experimental-workflow-linter` to correctly handle these tasks. Previously, the linter assumed Python files were Jupyter notebooks, but PySpark tasks are top-level Python files run as `__main__`. This change introduces a new `ImportFileResolver` class to resolve imports for PySpark tasks, and updates to the `DependencyResolver` class to properly handle them. Additionally, unit and integration tests have been updated and added to ensure the correct behavior of the linter. The `DependencyResolver` constructor now accepts an additional `import_resolver` argument in some instances. This commit resolves issue [#2213](#2213) and improves the accuracy and versatility of the linter for different types of Python files. * Fixed missing `security_policy` when updating SQL warehouse config ([#2409](#2409)). In this release, we have added new methods `GetWorkspaceWarehouseConfigResponseSecurityPolicy` and `SetWorkspaceWarehouseConfigRequestSecurityPolicy` to improve handling of SQL warehouse config security policy. We have introduced a new variable `security_policy` to store the security policy value, which is used when updating the SQL warehouse configuration, ensuring that the required security policy is set and fixing the `InvalidParameterValue: Endpoint security policy is required and must be one of NONE, DATA_ACCESS_CONTROL, PASSTHROUGH` error. Additionally, when the `enable_serverless_compute` error occurs, the new SQL warehouse data access config is printed in the log, allowing users to manually configure the uber principal in the UI. We have also updated the `create_uber_principal` method to set the security policy correctly and added parameterized tests to test the setting of the warehouse configuration security policy. The `test_create_global_spn` method has been updated to include the `security_policy` parameter in the `create_global_spn` method call, and new test cases have been added to verify the warehouse config's security policy is correctly updated. These enhancements will help make the system more robust and user-friendly. * Fixed raise logs `ResourceDoesNotExists` when iterating the log paths ([#2382](#2382)). In this commit, we have improved the handling of the `ResourceDoesNotExist` exception when iterating through log paths in the open-source library. Previously, the exception was not being properly raised or handled, resulting in unreliable code behavior. To address this, we have added unit tests in the `test_install.py` file that accurately reflect the actual behavior when iterating log paths. We have also modified the test to raise the `ResourceDoesNotExist` exception when the result is iterated over, rather than when the method is called. Additionally, we have introduced the `ResourceDoesNotExistIter` class to make it easier to simulate the error during testing. These changes ensure that the code can gracefully handle cases where the specified log path does not exist, improving the overall reliability and robustness of the library. Co-authored by Andrew Snare. * Generate custom error during installation due to external metastore connectivity issues ([#2425](#2425)). In this release, we have added a new custom error `OperationFailed` to the `InstallUcxError` enumeration in the `databricks/labs/ucx/install.py` file. This change is accompanied by an exception handler that checks if the error message contains a specific string related to AWS credentials, indicating an issue with external metastore connectivity. If this condition is met, a new `OperationFailed` error is raised with a custom message, providing instructions for resolving the external metastore connectivity issue and re-running the UCX installation. This enhancement aims to provide a more user-friendly error message for users encountering issues during UCX installation due to external metastore connectivity. The functionality of the `_create_database` method has been modified to include this new error handling mechanism, with no alterations made to the existing functionality. Although the changes have not been tested using unit tests, integration tests, or staging environments, they have been manually tested on a workspace with incorrect external metastore connectivity. * Improve logging when waiting for workflows to complete ([#2364](#2364)). This pull request enhances the logging and error handling of the databricks labs ucx project, specifically when executing workflows. It corrects a bug where `skip_job_wait` was not handled correctly, and now allows for a specified timeout period instead of the default 20 minutes. Job logs are replicated into the local logfile if a workflow times out. The existing `databricks labs ucx ensure-assessment-run` and `databricks labs ucx migrate-tables` commands have been updated, and unit and integration tests have been added. These improvements provide more detailed and informative logs, and ensure the quality of the code through added tests. The behavior of the `wait_get_run_job_terminated_or_skipped` method has been modified in the tests, and the `assign_metastore` function has also been updated, but the specific changes are not specified in the commit message. * Lint dependencies consistently ([#2400](#2400)). In this release, the `files.py` module in the `databricks/labs/ucx/source_code/linters` package has been updated to ensure consistent linting of jobs. Previously, the linting sequence was not consistent and did not provide inherited context when linting jobs. These issues have been resolved by modifying the `_lint_one` function to build an inherited tree when linting files or notebooks, and by adding a `FileLinter` object to determine which file/notebook linter to use for linting. The `_lint_task` function has also been updated to yield a `LocatedAdvice` object instead of a tuple, and takes an additional argument, `linted_paths`, which is a set of paths that have been previously linted. Additionally, the `_lint_notebook` and `_lint_file` functions have been removed as their functionality is now encompassed by the `_lint_one` function. These changes ensure consistent linting of jobs and inherited context. Unit tests were run and passed. Co-authored by Eric Vergnaud. * Make Lakeview names unique ([#2354](#2354)). A change has been implemented to guarantee the uniqueness of dataset names in Lakeview dashboards, addressing issue [#2345](#2345) where non-unique names caused system errors. This change includes renaming the `count` dataset to `fourtytwo` in the `datasets` field and updating the `name` field for a widget query from `count` to 'counter'. These internal adjustments streamline the Lakeview system's functionality while ensuring consistent and unique dataset naming. * Optimisation: when detecting if a file is a notebook only read the start instead of the whole file ([#2390](#2390)). In this release, we have optimized the detection of notebook files during linting in a non-Workspace path in our open-source library. This has been achieved by modifying the `is_a_notebook` function in the `base.py` file to improve efficiency. The function now checks the start of the file instead of loading the entire file, which is more resource-intensive. If the content of the file is not available, it will attempt to read the file header when opening the file, instead of returning False if there's an error. The `magic_header` is used to determine if the file is a notebook or not. If the content is available, the function will check if it starts with the `magic_header` to determine if a file is a notebook or not, without having to read the entire file, which can be time-consuming and resource-intensive. This change will improve the linting performance and reduce resource usage, making the library more efficient and user-friendly for software engineers. * Retry dashboard install on DeadlineExceeded ([#2379](#2379)). In this release, we have added the DeadlineExceeded exception to the @Retried decorator in the _create_dashboard method, which is used to create a lakeview dashboard from SQL queries in a folder. This modification to the existing functionality is intended to improve the reliability of dashboard installation by retrying the operation up to four minutes in case of a DeadlineExceeded error. This change resolves issues [#2376](#2376), [#2377](#2377), and [#2389](#2389), which were related to dashboard installation timeouts. Software engineers will benefit from this update as it will ensure successful installation of dashboards, even in scenarios where timeouts were previously encountered. * Updated databricks-labs-lsql requirement from <0.8,>=0.5 to >=0.5,<0.9 ([#2416](#2416)). In this update, we have modified the version requirement for the `databricks-labs-lsql` library to allow version 0.8, which includes bug fixes and improvements in dashboard creation and deployment. We have changed the version constraint from '<0.8,>=0.5' to '>=0.5,<0.9' to accommodate the latest version while preventing future major version upgrades. This change enhances the overall design of the system and simplifies the code for managing dashboard deployment. Additionally, we have introduced a new test that verifies the `deploy_dashboard` method is no longer being used, utilizing the `deprecated_call` function from pytest to ensure that calling the method raises a deprecation warning. This ensures that the system remains maintainable and up-to-date with the latest version of the `databricks-labs-lsql` library. * Updated ext hms detection to include more conf attributes ([#2414](#2414)). In this enhancement, the code for installing UCX has been updated to include additional configuration attributes for detecting external Hive Metastore (HMS). The previous implementation failed to recognize specific attributes like "spark.hadoop.hive.metastore.uris". This update introduces a new list of recognized prefixes for external HMS spark attributes, namely "spark_conf.spark.sql.hive.metastore", "spark_conf.spark.hadoop.hive.metastore", "spark_conf.spark.hadoop.javax.jdo.option", and "spark_conf.spark.databricks.hive.metastore". This change enables the extraction of a broader set of configuration attributes during installation, thereby improving the overall detection and configuration of external HMS attributes. * Updated sqlglot requirement from <25.11,>=25.5.0 to >=25.5.0,<25.12 ([#2415](#2415)). In this pull request, we have updated the `sqlglot` dependency to a new version range, `>=25.5.0,<25.12`, to allow for the latest version of `sqlglot` to be used while ensuring that the version does not exceed 25.12. This change includes several breaking changes, new features, and bug fixes. New features include support for ALTER VIEW AS SELECT, UNLOAD, and other SQL commands, as well as improvements in performance. Bug fixes include resolutions for issues related to OUTER/CROSS APPLY parsing, GENERATE_TIMESTAMP_ARRAY, and other features. Additionally, there are changes that improve the handling of various SQL dialects, including BigQuery, DuckDB, Snowflake, Oracle, and ClickHouse. The changelog and commits also reveal fixes for bugs and improvements in the library since the last version used in the project. By updating the `sqlglot` dependency to the new version range, the project can leverage the newly added features, bug fixes, and performance improvements while ensuring that the version does not exceed 25.12. * Updated sqlglot requirement from <25.9,>=25.5.0 to >=25.5.0,<25.11 ([#2403](#2403)). In this release, we have updated the required version range for the `sqlglot` library from '[25.5.0, 25.9)' to '[25.5.0, 25.11)'. This update allows us to utilize newer versions of `sqlglot` while maintaining compatibility with previous versions. The `sqlglot` team's latest release, version 25.10.0, includes several breaking changes, new features, bug fixes, and refactors. Notable changes include support for STREAMING tables in Databricks, transpilation of Snowflake's CONVERT_TIMEZONE function for DuckDB, support for GENERATE_TIMESTAMP_ARRAY in BigQuery, and the ability to parse RENAME TABLE as a Command in Teradata. Due to these updates, we strongly advise conducting thorough testing before deploying to production, as these changes may impact the functionality of the project. * Use `load_table` instead of deprecated `is_view` in failing integration test `test_mapping_skips_tables_databases` ([#2412](#2412)). In this release, the `is_view` parameter of the `skip_table_or_view` method in the `test_mapping_skips_tables_databases` integration test has been deprecated and replaced with a more flexible `load_table` parameter. This change allows for greater control and customization when specifying how a table should be loaded during the test. The `load_table` parameter is a callable that returns a `Table` object, which contains information about the table's schema, name, object type, and table format. This improvement removes the use of a deprecated parameter, enhancing the maintainability of the test code. Dependency updates: * Updated sqlglot requirement from <25.9,>=25.5.0 to >=25.5.0,<25.11 ([#2403](#2403)). * Updated databricks-labs-lsql requirement from <0.8,>=0.5 to >=0.5,<0.9 ([#2416](#2416)). * Updated sqlglot requirement from <25.11,>=25.5.0 to >=25.5.0,<25.12 ([#2415](#2415)).

@Retried

* Added `validate-table-locations` command for checking overlapping tables across workspaces ([#2341](#2341)). A new command, `validate-table-locations`, has been added to check for overlapping table locations across workspaces before migrating tables. This command is intended to ensure that tables can be migrated across workspaces without issues. The new command is part of the table migration workflows and uses a `LocationTrie` data structure to efficiently search for overlapping table locations. If any overlaps are found, the command logs a warning message and adds the conflicting tables to a list of all conflicts. This list is returned at the end of the command. The `validate-table-locations` command is intended to be run before migrating tables to ensure that the tables can be migrated without conflicts. The command includes a `workspace-ids` flag, which allows users to specify a list of workspace IDs to include in the validation. If this flag is not provided, the command will include all workspaces present in the account. This new command resolves issue [#673](#673). The `validate_table_locations` method is added to the `AccountAggregate` class and the `ExternalLocations` class has been updated to use the new `LocationTrie` class. The import section has also been updated to include new modules such as `LocationTrie` and `Table` from `databricks.labs.ucx.hive_metastore.locations` and `databricks.labs.ucx.hive_metastore.tables` respectively. Additionally, test cases have been added to ensure the correct functioning of the `LocationTrie` class. * Added references to hive_metastore catalog in all table references an… ([#2419](#2419)). In this release, we have updated various methods and functions across multiple files to include explicit references to the `hive_metastore` catalog in table references. This change aims to improve the accuracy and consistency of table references in the codebase, enhancing reliability and maintainability. Affected files include `azure.py`, `init_scripts.py`, `pipelines.py`, and others in the `databricks/labs/ucx/assessment` module, as well as test files in the `tests/unit/assessment` and `tests/unit/azure` directories. The `_try_fetch` method has been updated to include the catalog name in table references in all instances, ensuring the correct catalog is referenced in all queries. Additionally, various test functions in affected files have been updated to reference the `hive_metastore` catalog in SQL queries. This update is part of the resolution of issue [#2207](#2207) and promotes robust handling of catalog, schema, and table naming scenarios in hive metastore migration status management. * Added support for skipping views when migrating tables and views ([#2343](#2343)). In this release, we've added support for skipping both tables and views during the migration process in the `databricks labs ucx` command, addressing issue [#1937](#1937). The `skip` command has been enhanced to support skipping views, and new functions `skip_table_or_view` and `load_one` have been introduced to the `Table` class. Appropriate error handling and tests, including unit tests and integration tests, have been implemented to ensure the functionality works as expected. With these changes, users can now skip views during migration and have more flexibility when working with tables in the Unity Catalog. * Avoid false positives when linting for pyspark patterns ([#2381](#2381)). This release includes enhancements to the PySpark linter aimed at reducing false positives during linting. The linter has been updated to check the originating module when detecting PySpark calls, ensuring that warnings are triggered only for relevant nodes from the pyspark or dbutils modules. Specifically, the `ReturnValueMatcher` and `DirectFilesystemAccessMatcher` classes have been modified to include this new check. These changes improve the overall accuracy of the PySpark linter, ensuring that only pertinent warnings are surfaced during linting. Additionally, the commit includes updated unit tests to verify the correct behavior of the modified linter. Specific improvements have been made to avoid false positives when detecting the `listTables` function in the PySpark catalog, ensuring that the warning is only triggered for the actual PySpark `listTables` method call. * Bug: Generate custom warning when doing table size check and encountering DELTA_INVALID_FORMAT exception ([#2426](#2426)). A modification has been implemented in the `_safe_get_table_size` method within the `table_size.py` file of the `hive_metastore` package. This change addresses an issue ([#1913](#1913)) concerning the occurrence of a `DELTA_INVALID_FORMAT` exception while determining the size of a Delta table. Instead of raising an error, the exception is now converted into a warning, and the function proceeds to process the rest of the table. A corresponding warning message has been added to inform users about the issue and suggest checking the table structure. No new methods have been introduced, and existing functionalities have been updated to handle this specific exception more gracefully. The changes have been thoroughly tested with unit tests for the table size check when encountering a `DELTA_INVALID_FORMAT` error, employing a mock backend and a mock Spark session to simulate the error conversion. This change does not affect user documentation, CLI commands, workflows, or tables, and is solely intended for software engineers adopting the project. * Clean up left over uber principal resources for Azure ([#2370](#2370)). This commit includes modifications to the Azure access module of the UCX project to clean up resources if the creation of the uber principal fails midway. It addresses issues [#2360](#2360) (Azure part) and [#2363](#2363), and modifies the command `databricks labs ucx create-uber-principal` to include this functionality. The changes include adding new methods and modifying existing ones for working with Azure resources, such as `StorageAccount`, `AccessConnector`, and `AzureRoleAssignment`. Additionally, new unit and integration tests have been added and manually tested to ensure that the changes work as intended. The commit also includes new fixtures for testing storage accounts and access connectors, and a test case for getting, applying, and deleting storage permissions. The `azure_api_client` function has been updated to handle different input argument lengths and methods such as "get", "put", and "post". A new managed identity, "appIduser1", has been added to the Azure mappings file, and the corresponding role assignments have been updated. The changes include error handling mechanisms for certain scenarios that may arise during the creation of the uber service principal. * Crawlers: Use `TRUNCATE TABLE` instead of `DELETE FROM` when resetting crawler tables ([#2392](#2392)). In this release, the `.reset()` method for crawlers has been updated to use `TRUNCATE TABLE` instead of `DELETE FROM` when clearing out crawler tables, resulting in more efficient and idiomatic code. This change affects the existing `migrate-data-reconciliation` workflow and is accompanied by updated unit and integration tests to ensure correct functionality. The `reset()` method now accepts a table name argument, which is passed to the newly introduced `escape_sql_identifier()` utility function from the `databricks.labs.ucx.framework.utils` module for added safety. The migration status is now refreshed using the `TRUNCATE TABLE` command, which removes all records from the table, providing improved performance compared to the previous implementation. The `SHOW DATABASES` and `TRUNCATE TABLE` queries are validated in the `refresh_migration_status` workflow test, which now checks if the `TRUNCATE TABLE` query is used instead of `DELETE FROM` when resetting crawler tables. * Detect tables that are not present in the mapping file ([#2205](#2205)). In this release, we have introduced a new method `get_remaining_tables()` that returns a list of tables in the Hive metastore that have not been processed by the migration tool. This method performs a full refresh of the index and checks each table in the Hive metastore against the index to determine if it has been migrated. We have also added a new private method `_is_migrated()` to check if a given table has already been migrated. Additionally, we have replaced the `refresh_migration_status` method with `update_migration_status` in several workflows to present a more accurate representation of the migration process in the dashboard. A new SQL script, 04_1_remaining_hms_tables.sql, has been added to list the remaining tables in Hive Metastore which are not present in the mapping file. We have also added a new test for the table migration job that verifies that tables not present in the mapping file are detected and reported. A new test function `test_refresh_migration_status_published_remained_tables` has been added to ensure that the migration process correctly handles the case where tables have been published to the target metadata store but still remain in the source metadata store. These changes are intended to improve the functionality of the migration tool for Hive metastore tables and resolve issue [#1221](#1221). * Fixed ConcurrentDeleteReadException in migrate-view task during table migration ([#2282](#2282)). In this release, we have implemented a fix for the ConcurrentDeleteReadException that occurred during the `migrate-view` command's migration task. The solution involved moving the refreshing of migration status between batches from within the batches. Along with the fix, we added a new method `index()` to the `TableMigrate` class, which checks if a table has been migrated or not. This method is utilized in the `_view_can_be_migrated` method to ensure that all dependencies of a view have been migrated before migrating the view. The `index_full_refresh()` method, which earlier performed this check, has been modified to refresh the index between batches instead of within batches. It is worth noting that the changes made have been manually tested, but no unit tests, integration tests, or verification on staging environments have been added. The target audience for this release is software engineers who adopt this project. No new documentation, commands, workflows, or tables have been added or modified in this release. * Fixed documentation typos: `create-missing-pricipals` -> `create-missing-principals` ([#2357](#2357)). This pull request resolves typographical errors in the `create-missing-principals` command documentation, correcting the mistaken usage of `create-missing-pricipals` throughout the project documentation. The changes encompass the command description, the UCX command section, and the manual process documentation for AWS storage credentials. The `create-missing-principals` command, utilized by Cloud Admins to create and configure new AWS roles for Unity Catalog access, remains functionally unaltered. * Fixed linting for Spark Python workflow tasks ([#2349](#2349)). This commit updates the linter to support PySpark tasks in workflows by modifying the existing `experimental-workflow-linter` to correctly handle these tasks. Previously, the linter assumed Python files were Jupyter notebooks, but PySpark tasks are top-level Python files run as `__main__`. This change introduces a new `ImportFileResolver` class to resolve imports for PySpark tasks, and updates to the `DependencyResolver` class to properly handle them. Additionally, unit and integration tests have been updated and added to ensure the correct behavior of the linter. The `DependencyResolver` constructor now accepts an additional `import_resolver` argument in some instances. This commit resolves issue [#2213](#2213) and improves the accuracy and versatility of the linter for different types of Python files. * Fixed missing `security_policy` when updating SQL warehouse config ([#2409](#2409)). In this release, we have added new methods `GetWorkspaceWarehouseConfigResponseSecurityPolicy` and `SetWorkspaceWarehouseConfigRequestSecurityPolicy` to improve handling of SQL warehouse config security policy. We have introduced a new variable `security_policy` to store the security policy value, which is used when updating the SQL warehouse configuration, ensuring that the required security policy is set and fixing the `InvalidParameterValue: Endpoint security policy is required and must be one of NONE, DATA_ACCESS_CONTROL, PASSTHROUGH` error. Additionally, when the `enable_serverless_compute` error occurs, the new SQL warehouse data access config is printed in the log, allowing users to manually configure the uber principal in the UI. We have also updated the `create_uber_principal` method to set the security policy correctly and added parameterized tests to test the setting of the warehouse configuration security policy. The `test_create_global_spn` method has been updated to include the `security_policy` parameter in the `create_global_spn` method call, and new test cases have been added to verify the warehouse config's security policy is correctly updated. These enhancements will help make the system more robust and user-friendly. * Fixed raise logs `ResourceDoesNotExists` when iterating the log paths ([#2382](#2382)). In this commit, we have improved the handling of the `ResourceDoesNotExist` exception when iterating through log paths in the open-source library. Previously, the exception was not being properly raised or handled, resulting in unreliable code behavior. To address this, we have added unit tests in the `test_install.py` file that accurately reflect the actual behavior when iterating log paths. We have also modified the test to raise the `ResourceDoesNotExist` exception when the result is iterated over, rather than when the method is called. Additionally, we have introduced the `ResourceDoesNotExistIter` class to make it easier to simulate the error during testing. These changes ensure that the code can gracefully handle cases where the specified log path does not exist, improving the overall reliability and robustness of the library. Co-authored by Andrew Snare. * Generate custom error during installation due to external metastore connectivity issues ([#2425](#2425)). In this release, we have added a new custom error `OperationFailed` to the `InstallUcxError` enumeration in the `databricks/labs/ucx/install.py` file. This change is accompanied by an exception handler that checks if the error message contains a specific string related to AWS credentials, indicating an issue with external metastore connectivity. If this condition is met, a new `OperationFailed` error is raised with a custom message, providing instructions for resolving the external metastore connectivity issue and re-running the UCX installation. This enhancement aims to provide a more user-friendly error message for users encountering issues during UCX installation due to external metastore connectivity. The functionality of the `_create_database` method has been modified to include this new error handling mechanism, with no alterations made to the existing functionality. Although the changes have not been tested using unit tests, integration tests, or staging environments, they have been manually tested on a workspace with incorrect external metastore connectivity. * Improve logging when waiting for workflows to complete ([#2364](#2364)). This pull request enhances the logging and error handling of the databricks labs ucx project, specifically when executing workflows. It corrects a bug where `skip_job_wait` was not handled correctly, and now allows for a specified timeout period instead of the default 20 minutes. Job logs are replicated into the local logfile if a workflow times out. The existing `databricks labs ucx ensure-assessment-run` and `databricks labs ucx migrate-tables` commands have been updated, and unit and integration tests have been added. These improvements provide more detailed and informative logs, and ensure the quality of the code through added tests. The behavior of the `wait_get_run_job_terminated_or_skipped` method has been modified in the tests, and the `assign_metastore` function has also been updated, but the specific changes are not specified in the commit message. * Lint dependencies consistently ([#2400](#2400)). In this release, the `files.py` module in the `databricks/labs/ucx/source_code/linters` package has been updated to ensure consistent linting of jobs. Previously, the linting sequence was not consistent and did not provide inherited context when linting jobs. These issues have been resolved by modifying the `_lint_one` function to build an inherited tree when linting files or notebooks, and by adding a `FileLinter` object to determine which file/notebook linter to use for linting. The `_lint_task` function has also been updated to yield a `LocatedAdvice` object instead of a tuple, and takes an additional argument, `linted_paths`, which is a set of paths that have been previously linted. Additionally, the `_lint_notebook` and `_lint_file` functions have been removed as their functionality is now encompassed by the `_lint_one` function. These changes ensure consistent linting of jobs and inherited context. Unit tests were run and passed. Co-authored by Eric Vergnaud. * Make Lakeview names unique ([#2354](#2354)). A change has been implemented to guarantee the uniqueness of dataset names in Lakeview dashboards, addressing issue [#2345](#2345) where non-unique names caused system errors. This change includes renaming the `count` dataset to `fourtytwo` in the `datasets` field and updating the `name` field for a widget query from `count` to 'counter'. These internal adjustments streamline the Lakeview system's functionality while ensuring consistent and unique dataset naming. * Optimisation: when detecting if a file is a notebook only read the start instead of the whole file ([#2390](#2390)). In this release, we have optimized the detection of notebook files during linting in a non-Workspace path in our open-source library. This has been achieved by modifying the `is_a_notebook` function in the `base.py` file to improve efficiency. The function now checks the start of the file instead of loading the entire file, which is more resource-intensive. If the content of the file is not available, it will attempt to read the file header when opening the file, instead of returning False if there's an error. The `magic_header` is used to determine if the file is a notebook or not. If the content is available, the function will check if it starts with the `magic_header` to determine if a file is a notebook or not, without having to read the entire file, which can be time-consuming and resource-intensive. This change will improve the linting performance and reduce resource usage, making the library more efficient and user-friendly for software engineers. * Retry dashboard install on DeadlineExceeded ([#2379](#2379)). In this release, we have added the DeadlineExceeded exception to the @Retried decorator in the _create_dashboard method, which is used to create a lakeview dashboard from SQL queries in a folder. This modification to the existing functionality is intended to improve the reliability of dashboard installation by retrying the operation up to four minutes in case of a DeadlineExceeded error. This change resolves issues [#2376](#2376), [#2377](#2377), and [#2389](#2389), which were related to dashboard installation timeouts. Software engineers will benefit from this update as it will ensure successful installation of dashboards, even in scenarios where timeouts were previously encountered. * Updated databricks-labs-lsql requirement from <0.8,>=0.5 to >=0.5,<0.9 ([#2416](#2416)). In this update, we have modified the version requirement for the `databricks-labs-lsql` library to allow version 0.8, which includes bug fixes and improvements in dashboard creation and deployment. We have changed the version constraint from '<0.8,>=0.5' to '>=0.5,<0.9' to accommodate the latest version while preventing future major version upgrades. This change enhances the overall design of the system and simplifies the code for managing dashboard deployment. Additionally, we have introduced a new test that verifies the `deploy_dashboard` method is no longer being used, utilizing the `deprecated_call` function from pytest to ensure that calling the method raises a deprecation warning. This ensures that the system remains maintainable and up-to-date with the latest version of the `databricks-labs-lsql` library. * Updated ext hms detection to include more conf attributes ([#2414](#2414)). In this enhancement, the code for installing UCX has been updated to include additional configuration attributes for detecting external Hive Metastore (HMS). The previous implementation failed to recognize specific attributes like "spark.hadoop.hive.metastore.uris". This update introduces a new list of recognized prefixes for external HMS spark attributes, namely "spark_conf.spark.sql.hive.metastore", "spark_conf.spark.hadoop.hive.metastore", "spark_conf.spark.hadoop.javax.jdo.option", and "spark_conf.spark.databricks.hive.metastore". This change enables the extraction of a broader set of configuration attributes during installation, thereby improving the overall detection and configuration of external HMS attributes. * Updated sqlglot requirement from <25.11,>=25.5.0 to >=25.5.0,<25.12 ([#2415](#2415)). In this pull request, we have updated the `sqlglot` dependency to a new version range, `>=25.5.0,<25.12`, to allow for the latest version of `sqlglot` to be used while ensuring that the version does not exceed 25.12. This change includes several breaking changes, new features, and bug fixes. New features include support for ALTER VIEW AS SELECT, UNLOAD, and other SQL commands, as well as improvements in performance. Bug fixes include resolutions for issues related to OUTER/CROSS APPLY parsing, GENERATE_TIMESTAMP_ARRAY, and other features. Additionally, there are changes that improve the handling of various SQL dialects, including BigQuery, DuckDB, Snowflake, Oracle, and ClickHouse. The changelog and commits also reveal fixes for bugs and improvements in the library since the last version used in the project. By updating the `sqlglot` dependency to the new version range, the project can leverage the newly added features, bug fixes, and performance improvements while ensuring that the version does not exceed 25.12. * Updated sqlglot requirement from <25.9,>=25.5.0 to >=25.5.0,<25.11 ([#2403](#2403)). In this release, we have updated the required version range for the `sqlglot` library from '[25.5.0, 25.9)' to '[25.5.0, 25.11)'. This update allows us to utilize newer versions of `sqlglot` while maintaining compatibility with previous versions. The `sqlglot` team's latest release, version 25.10.0, includes several breaking changes, new features, bug fixes, and refactors. Notable changes include support for STREAMING tables in Databricks, transpilation of Snowflake's CONVERT_TIMEZONE function for DuckDB, support for GENERATE_TIMESTAMP_ARRAY in BigQuery, and the ability to parse RENAME TABLE as a Command in Teradata. Due to these updates, we strongly advise conducting thorough testing before deploying to production, as these changes may impact the functionality of the project. * Use `load_table` instead of deprecated `is_view` in failing integration test `test_mapping_skips_tables_databases` ([#2412](#2412)). In this release, the `is_view` parameter of the `skip_table_or_view` method in the `test_mapping_skips_tables_databases` integration test has been deprecated and replaced with a more flexible `load_table` parameter. This change allows for greater control and customization when specifying how a table should be loaded during the test. The `load_table` parameter is a callable that returns a `Table` object, which contains information about the table's schema, name, object type, and table format. This improvement removes the use of a deprecated parameter, enhancing the maintainability of the test code. Dependency updates: * Updated sqlglot requirement from <25.9,>=25.5.0 to >=25.5.0,<25.11 ([#2403](#2403)). * Updated databricks-labs-lsql requirement from <0.8,>=0.5 to >=0.5,<0.9 ([#2416](#2416)). * Updated sqlglot requirement from <25.11,>=25.5.0 to >=25.5.0,<25.12 ([#2415](#2415)).

aminmovahed-db self-assigned this Jul 18, 2024

aminmovahed-db force-pushed the feature/issue_#1221 branch 3 times, most recently from 50375c7 to d0116d9 Compare July 18, 2024 06:16

nfx previously requested changes Jul 18, 2024

View reviewed changes

src/databricks/labs/ucx/hive_metastore/mapping.py Outdated Show resolved Hide resolved

src/databricks/labs/ucx/hive_metastore/mapping.py Outdated Show resolved Hide resolved

src/databricks/labs/ucx/hive_metastore/mapping.py Outdated Show resolved Hide resolved

JCZuurmond self-requested a review July 19, 2024 09:50

JCZuurmond requested changes Jul 19, 2024

View reviewed changes

aminmovahed-db force-pushed the feature/issue_#1221 branch from d0116d9 to 8da47a8 Compare July 25, 2024 02:02

JCZuurmond approved these changes Jul 25, 2024

View reviewed changes

src/databricks/labs/ucx/hive_metastore/table_migrate.py Show resolved Hide resolved

JCZuurmond reviewed Jul 25, 2024

View reviewed changes

src/databricks/labs/ucx/hive_metastore/table_migrate.py Outdated Show resolved Hide resolved

aminmovahed-db force-pushed the feature/issue_#1221 branch from 8da47a8 to bb83c5c Compare July 28, 2024 08:59

aminmovahed-db commented Jul 28, 2024

View reviewed changes

src/databricks/labs/ucx/hive_metastore/workflows.py Outdated Show resolved Hide resolved

aminmovahed-db force-pushed the feature/issue_#1221 branch 3 times, most recently from c2482d3 to 55e05ee Compare July 28, 2024 09:13

nfx marked this pull request as ready for review July 30, 2024 00:56

nfx requested review from a team and gcwang-db July 30, 2024 00:56

nfx had a problem deploying to account-admin July 30, 2024 00:56 — with GitHub Actions Failure

JCZuurmond requested changes Jul 30, 2024

View reviewed changes

aminmovahed-db added 5 commits August 5, 2024 11:39

Detect tables that are not present in the mapping file

3b541e5

Introduces #1221

Revert "Detect tables that are not present in the mapping file"

6d063da

This reverts commit 61282187343f8fe5ef81593c0acea67395f2bbe4.

Detect tables that are not present in the mapping file

ed1dec5

Introduces #1221

Updating table_migrate.py as per the comments

dc3d086

Incorporating the review comments

9a5ba53

aminmovahed-db force-pushed the feature/issue_#1221 branch from 55e05ee to 9a5ba53 Compare August 5, 2024 03:38

aminmovahed-db temporarily deployed to account-admin August 5, 2024 03:38 — with GitHub Actions Inactive

aminmovahed-db commented Aug 5, 2024

View reviewed changes

tests/integration/hive_metastore/test_workflows.py Outdated Show resolved Hide resolved

Adding the latest changes - yet not final

ca06041

aminmovahed-db temporarily deployed to account-admin August 9, 2024 02:14 — with GitHub Actions Inactive

Renaming the test names

e797bac

aminmovahed-db had a problem deploying to account-admin August 9, 2024 04:52 — with GitHub Actions Error

Formatting fixed

38615df

aminmovahed-db had a problem deploying to account-admin August 9, 2024 05:06 — with GitHub Actions Error

Formatting fixed

376375a

aminmovahed-db temporarily deployed to account-admin August 9, 2024 05:12 — with GitHub Actions Inactive

aminmovahed-db requested a review from JCZuurmond August 9, 2024 05:34

Merge branch 'main' into feature/issue_#1221

74faac2

aminmovahed-db had a problem deploying to account-admin August 14, 2024 07:13 — with GitHub Actions Error

Fix a bug

17a3aa8

aminmovahed-db had a problem deploying to account-admin August 14, 2024 07:24 — with GitHub Actions Failure

JCZuurmond approved these changes Aug 14, 2024

View reviewed changes

Merge branch 'main' into feature/issue_#1221

9c67c3d

aminmovahed-db had a problem deploying to account-admin August 15, 2024 05:49 — with GitHub Actions Failure

Modify the integration test due to a function change

9da5fce

aminmovahed-db temporarily deployed to account-admin August 15, 2024 07:00 — with GitHub Actions Inactive

Formatting fix

a1c4c24

aminmovahed-db temporarily deployed to account-admin August 15, 2024 09:03 — with GitHub Actions Inactive

aminmovahed-db requested a review from nfx August 15, 2024 09:40

nfx approved these changes Aug 15, 2024

View reviewed changes

nfx merged commit 56db4ce into main Aug 15, 2024
6 checks passed

nfx deleted the feature/issue_#1221 branch August 15, 2024 11:12

nfx mentioned this pull request Aug 15, 2024

Release v0.33.0 #2427

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect tables that are not present in the mapping file #2205

Detect tables that are not present in the mapping file #2205

aminmovahed-db commented Jul 18, 2024 •

edited

Loading

JCZuurmond left a comment

JCZuurmond left a comment •

edited

Loading

github-actions bot commented Jul 30, 2024 •

edited

Loading

JCZuurmond left a comment

JCZuurmond left a comment

nfx left a comment

Detect tables that are not present in the mapping file #2205

Detect tables that are not present in the mapping file #2205

Conversation

aminmovahed-db commented Jul 18, 2024 • edited Loading

Changes

Linked issues

Functionality

Tests

JCZuurmond left a comment

Choose a reason for hiding this comment

JCZuurmond left a comment • edited Loading

Choose a reason for hiding this comment

github-actions bot commented Jul 30, 2024 • edited Loading

JCZuurmond left a comment

Choose a reason for hiding this comment

JCZuurmond left a comment

Choose a reason for hiding this comment

nfx left a comment

Choose a reason for hiding this comment

aminmovahed-db commented Jul 18, 2024 •

edited

Loading

JCZuurmond left a comment •

edited

Loading

github-actions bot commented Jul 30, 2024 •

edited

Loading