Release v0.36.0 #2635

nfx · 2024-09-13T17:12:09Z

Added upload and download cli commands to upload and download a file to/from a collection of workspaces (#2508). In this release, the UCX command-line interface, invoked as databricks labs ucx, has been updated with new upload and download commands. The upload command allows users to upload a file to a single workspace or a collection of workspaces, while the download command enables users to download a CSV file from a single workspace or a collection of workspaces. This avoids repeating the same transfer once per workspace. Both commands display a warning or information message upon completion, and the schema of a CSV file is checked before it is uploaded. This feature includes new methods for uploading and downloading files for multiple workspaces, as well as new unit and integration tests. Users can refer to the contributing instructions to help improve the project.
Added ability to run create-table-mapping command as collection (#2602). This PR introduces the capability to run the create-table-mapping command as a collection in the databricks labs ucx CLI, providing increased flexibility and automation for workflows. A new optional boolean flag, run-as-collection, has been added to the create-table-mapping command, allowing users to indicate if they want to run it as a collection with a default value of False. The updated create_table_mapping function now accepts additional arguments, enabling efficient creation of table mappings for multiple workspaces. Users are encouraged to test this feature in various scenarios and provide feedback for further improvements.
Added comment on the source tables to capture that they have been deprecated (#2548). A new method, _sql_add_migrated_comment(self, table: Table, target_table_key: str), has been added to the table_migrate.py file to mark deprecated source tables with a comment indicating their deprecated status and directing users to the new table. This method is currently being used in three existing methods within the same file to add comments to deprecated tables as part of the migration process. In addition, a new SQL query has been added to set a comment on the source table hive_metastore.db1_src.managed_dbfs, indicating that it is deprecated and directing users to the new table ucx_default.db1_dst.managed_dbfs. A unit test has also been updated to ensure that the migration process correctly adds the deprecation comment to the source table. This change is part of a larger effort to deprecate and migrate data from old tables to new tables and provides guidance for users to migrate to the new table.
Added documentation for PrincipalACl migration and delete-missing-principal cmd (#2552). In this open-source library release, the UCX project has added a new command delete-missing-principals, applicable only for AWS, to delete IAM roles created by UCX. This command lists all IAM roles generated by the principal-prefix-access command and allows for the selection of multiple roles to delete. It checks if the selected roles are mapped to any storage credentials and seeks confirmation before deleting the role and its associated inline policy. Additionally, updates have been made to the create-uber-principal and migrate-locations commands to apply location ACLs from existing clusters and grant necessary permissions to users. The create-catalogs-schemas command has been updated to apply catalog and schema ACLs from existing clusters for both Azure and AWS. The migrate-tables command has also been updated to apply table and view ACLs from existing clusters for both Azure and AWS. The documentation of commands that require admin privileges in the UCX project has also been updated.
Added linting for spark.sql(...) calls (#2558). This commit extends linting of spark.sql(...) calls to improve code quality and consistency. Previously, the SparkSqlPyLinter linter only checked these calls for table migration, and did not apply other SQL linters such as the DirectFsAccess linters. This has been rectified by incorporating the additional SQL linters for spark.sql(...) calls. The commit also introduces an abstract base class called Fixer, which enforces the inclusion of a name property for all derived classes. Additionally, minor improvements and changes have been made to the codebase. The commit resolves issue #2551, and updates test_functional.py to test spark-sql-directfs.py, verifying that spark.sql(...) calls are linted as expected.
Document: clarify that the assessment job is not intended to be re-run (#2560). In this release, we have updated the documentation of the assessment job in UCX to address confusion around re-running it. The assessment job should only be executed once during the initial setup of UCX and should not be re-run to refresh the inventory or findings. If a re-assessment is necessary, UCX will need to be reinstalled first. This change aligns the documentation with the actual functionality of the assessment job and does not affect the daily job that updates parts of the inventory. The assessment workflow is designed to detect incompatible entities and provide information for the migration process. It can be executed in parallel or sequentially, and its output is stored in Delta tables for further analysis and decision-making through the assessment report.
Enabled migrate-credentials command to run as collection (#2532). In this pull request, the migrate-credentials command in the UCX project's CLI has been updated with a new optional flag, run_as_collection, which allows the command to operate on multiple workspaces as a collection. This change introduces the get_contexts function and modifies the delete_missing_principals function to support the new functionality. The migrate-credentials command's behavior for Azure and AWS has been updated to accept an additional acc_client argument in its tests. Comprehensive tests and documentation have been added to ensure the reliability and robustness of the new functionality. It is recommended to review the attached testing evidence and ensure the new functionality works as intended without introducing any unintended side effects.
Escape column names in target tables of the table migration (#2563). In this release, the escape_sql_identifier function in the utils.py file has been enhanced with a new maxsplit parameter, providing more control over the maximum number of splits performed on the input string. This addresses issue #2544 and is part of the existing workflow "-migration-ones". The "tables.py" file in the "databricks/labs/ucx/hive_metastore" directory has been updated to escape column names in target tables, preventing SQL injection attacks. Additionally, a new ColumnInfo class and several utility functions have been added to the fixtures.py file in the databricks.labs.ucx project for generating SQL schemas and column casting. The integration tests for migrating Hive Metastore tables have been updated with new tests to handle column names that require escaping. Lastly, the test_manager.py file in the tests/unit/workspace_access directory has been refactored by removing the mock_backend fixture and adding the test_inventory_permission_manager_init method to test the initialization of the PermissionManager class. These changes improve security, functionality, and test coverage for software engineers utilizing these libraries in their projects.
Explain why metastore is checked to exist in group migration workflow in docstring (#2614). In the updated workflows.py file, the docstring for the verify_metastore_attached method has been revised to explain why the workflow checks that a metastore is attached to the workspace. The reason is that account level groups are only available when a metastore is attached, and the group migration workflow depends on those groups. The method itself remains the same: it verifies that a metastore is attached to the workspace and causes the workflow to fail if none is found.
Fixed infinite recursion when visiting a dependency graph (#2562). This change addresses an issue of infinite recursion that can occur when visiting a dependency graph, particularly when many files in a package import the package itself. The visit method has been modified to only visit each parent/child pair once, preventing the recursion that can occur in such cases. The dependencies property has been added to the DependencyGraph class, and the DependencyGraphVisitor class has been introduced to handle visiting nodes and tracking visited pairs. These modifications improve the robustness of the library by preventing infinite recursion during dependency resolution. The change includes added unit tests to ensure correct behavior and addresses a blocker for a previous pull request. The functionality of the code remains unchanged.
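The parent/child bookkeeping described in the entry above can be illustrated with a minimal sketch. The Graph class and its methods below are hypothetical stand-ins written for this example; they are not the UCX DependencyGraph or DependencyGraphVisitor classes, whose actual interfaces are not shown in these notes.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph; hypothetical, not the UCX DependencyGraph."""

    def __init__(self) -> None:
        self._children: dict[str, list[str]] = defaultdict(list)

    def add_edge(self, parent: str, child: str) -> None:
        self._children[parent].append(child)

    def visit(self, root: str) -> list[tuple[str, str]]:
        """Return every (parent, child) pair reachable from root, each once."""
        visited: set[tuple[str, str]] = set()
        order: list[tuple[str, str]] = []

        def walk(parent: str) -> None:
            for child in self._children[parent]:
                pair = (parent, child)
                if pair in visited:
                    continue  # this pair was already walked, so the cycle stops here
                visited.add(pair)
                order.append(pair)
                walk(child)

        walk(root)
        return order


# A package whose modules import the package itself forms cycles.
graph = Graph()
graph.add_edge("pkg", "pkg.a")
graph.add_edge("pkg", "pkg.b")
graph.add_edge("pkg.a", "pkg")
graph.add_edge("pkg.b", "pkg")
edges = graph.visit("pkg")
```

Without the visited set, the walk from pkg to pkg.a and back to pkg would never terminate; with it, the traversal ends after each of the four pairs has been seen once.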
Fixed migrate acls CLI command (#2617). In this release, the migrate acls command in the ucx project's CLI has been fixed. The changes remove the ACL type parameters from the command, simplifying its usage and eliminating the need for explicit type specifications. The legacy_table_acl and principal parameters have been removed from the migrate_acls function, while the hms_fed parameter remains unchanged and retains its default value if not explicitly provided. These modifications streamline the ACL migration process in the ucx CLI, making it easier for users to manage access control lists.
Fixes pip install statement in debug notebook (#2545). In this release, we have addressed an issue in the debug notebook where the pip install statement for wheel was incorrectly surrounded by square brackets, causing the notebook run to fail. We have removed the superfluous square brackets and modified the remote_wheels list to be joined as a string before being passed to the DEBUG_NOTEBOOK format. It is important to note that this change solely affects the debug notebook and does not involve any alterations to user documentation, CLI commands, workflows, or tables. Furthermore, no new methods have been added, and existing functionality remains unchanged. The change has been manually tested for accuracy, but it does not include any unit tests, integration tests, or staging environment verification.
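The bracket problem described in the entry above is easy to reproduce in isolation. This is a minimal sketch: the template string and wheel paths are hypothetical placeholders, not the actual DEBUG_NOTEBOOK contents.

```python
# Hypothetical wheel paths, used only for illustration.
remote_wheels = ["/Workspace/ucx/wheels/a.whl", "/Workspace/ucx/wheels/b.whl"]

# Hypothetical stand-in for the notebook template.
TEMPLATE = "%pip install {wheels}"

# Formatting the list directly embeds its repr, brackets and quotes included.
broken = TEMPLATE.format(wheels=remote_wheels)

# Joining the list first yields a plain space-separated argument string.
fixed = TEMPLATE.format(wheels=" ".join(remote_wheels))
```

Here broken contains the literal list syntax, which pip cannot interpret as a set of paths, while fixed contains only the two paths separated by a space.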
More escaping of SQL identifiers (#2530). This commit includes updates to SQL identifier escaping, addressing a missed SQL statement in one of the crawlers and adding support for less-known Spark/Databricks corner cases where backticks in names of identifiers need to be doubled when quoting. The escape_sql_identifier function has been modified to consider this new case, and the changes affect the existing migrate-data-reconciliation workflow. Additionally, the TableIdentifier class has been updated to properly escape identifiers, handling the backticks-in-names scenario. These improvements ensure better handling of SQL identifiers, improving the overall functionality of the codebase. Unit tests have been updated to reflect these changes.
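The backtick-doubling rule mentioned in the entry above can be sketched as follows. The helper names are hypothetical and written only for this example; this is not the escape_sql_identifier implementation, which handles more cases than shown here.

```python
def quote_identifier(name: str) -> str:
    """Quote one identifier part, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def quote_full_name(*parts: str) -> str:
    """Quote each part of a multi-part name and join the parts with dots."""
    return ".".join(quote_identifier(part) for part in parts)


plain = quote_identifier("managed_dbfs")
tricky = quote_full_name("hive_metastore", "db1_src", "my`table")
```

The sketch takes the name parts as separate arguments, so it does not have to guess whether a dot inside a string is a separator or part of a name.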
Retry deploy workflow on InternalError (#2525). In the 'workflows.py' file, the _deploy_workflow function has been updated to include a retry mechanism using the @retried decorator, which handles InternalError exceptions during workflow creation. This enhancement aims to improve the resilience of deploying workflows by automatically retrying in case of internal errors, thereby addressing issue #2522. This change is part of our ongoing efforts to ensure a robust and fault-tolerant deployment process. The retry mechanism is configured with a timeout of 2 minutes to prevent extended waiting in case of persistent issues, thus enhancing overall system efficiency and reliability.
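The retry-with-timeout pattern described in the entry above can be sketched with the standard library alone. The retry_on decorator and the InternalError class below are hypothetical re-implementations for illustration; they are not the @retried decorator or the exception type used by UCX.

```python
import functools
import time
from datetime import timedelta


class InternalError(Exception):
    """Hypothetical stand-in for the platform's internal-error exception."""


def retry_on(errors: tuple[type[Exception], ...], timeout: timedelta, delay: float = 0.0):
    """Retry the wrapped function on the given errors until the timeout elapses."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout.total_seconds()
            while True:
                try:
                    return func(*args, **kwargs)
                except errors:
                    if time.monotonic() >= deadline:
                        raise  # give up once the time budget is spent
                    time.sleep(delay)

        return wrapper

    return decorator


attempts: list[int] = []


@retry_on((InternalError,), timeout=timedelta(minutes=2))
def deploy_workflow() -> str:
    """Fail twice with a transient error, then succeed."""
    attempts.append(1)
    if len(attempts) < 3:
        raise InternalError("transient failure")
    return "deployed"


result = deploy_workflow()
```

Bounding the retries by elapsed time rather than by attempt count keeps the worst-case wait predictable, which matches the 2-minute limit mentioned above.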
Updated databricks-labs-lsql requirement from <0.10,>=0.5 to >=0.5,<0.11 (#2580). In this release, the requirement for the databricks-labs-lsql package has been widened to accept versions greater than or equal to 0.5 and less than 0.11. Previously, the package version was constrained to be greater than or equal to 0.5 and less than 0.10. This update allows users to install the 0.10.x releases of the package, which may include new features and bug fixes. For more detailed information on the changes included in this update, please refer to the changelog and release notes provided in the commit message.
Updated sqlglot requirement from <25.20,>=25.5.0 to >=25.5.0,<25.21 (#2549). In this pull request, the sqlglot requirement in the pyproject.toml file is updated from >=25.5.0,<25.20 to >=25.5.0,<25.21. This change allows installation of the sqlglot 25.20.x releases while keeping the version below 25.21. The update was made in response to a pull request from Dependabot, which identified a new version of sqlglot. The PR includes details of the sqlglot changelog and commits. The sqlglot package is a SQL parser and transpiler that this project uses as a dependency; the newly allowed release may include bug fixes, new features, or performance improvements.

Dependency updates:

Updated sqlglot requirement from <25.20,>=25.5.0 to >=25.5.0,<25.21 (#2549).
Updated databricks-labs-lsql requirement from <0.10,>=0.5 to >=0.5,<0.11 (#2580).
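
The effect of moving each upper bound can be checked with a simplified range test. This is a rough sketch that compares dotted numeric versions as integer tuples; it ignores pre-release and other suffixes, so it is not a substitute for a full version-specifier parser, and the helper names are hypothetical.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '25.20.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version < upper, the shape of both requirements above."""
    return parse(lower) <= parse(version) < parse(upper)


# databricks-labs-lsql: 0.10.0 is rejected by the old range and accepted by the new one.
lsql_old = in_range("0.10.0", "0.5", "0.10")
lsql_new = in_range("0.10.0", "0.5", "0.11")

# sqlglot: 25.20.0 is rejected by the old range and accepted by the new one.
sqlglot_old = in_range("25.20.0", "25.5.0", "25.20")
sqlglot_new = in_range("25.20.0", "25.5.0", "25.21")
```

In both cases the lower bound is unchanged and only the exclusive upper bound moves up by one minor version.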


nfx requested review from a team and mohanab-db September 13, 2024 17:12

nfx had a problem deploying to account-admin September 13, 2024 17:12 — with GitHub Actions Failure

nfx merged commit 91a0af1 into main Sep 13, 2024
5 of 6 checks passed

nfx deleted the prepare/0.36.0 branch September 13, 2024 17:12
