Document: clarify that the `assessment` job is not intended to be re-run #2560
Conversation
The `assessment` job is not intended to be re-run. In particular, it's not supposed to be run multiple times to update/refresh the inventory.
✅ 2/2 passed, 9s total (running from acceptance #5634)
A bit of rewording
README.md (outdated)
@@ -380,6 +378,8 @@ for before starting the migration process.

After UCX assessment workflow is executed, the assessment dashboard will be populated with findings and common recommendations. See [this guide](docs/assessment.md) for more details.

The UCX assessment workflow is intended to only be run once; re-running it is not supported. If the inventory and findings for a workspace need to be updated then UCX should first be [uninstalled](#uninstall-ucx) and reinstalled.
Suggested change:
- The UCX assessment workflow is intended to only be run once; re-running it is not supported. If the inventory and findings for a workspace need to be updated then UCX should first be [uninstalled](#uninstall-ucx) and reinstalled.
+ The UCX assessment workflow is intended to only be run once; re-running it is not supported. If the inventory and findings for a workspace need to be updated, then UCX should be reinstalled by first [uninstalling](#uninstall-ucx) and then [installing](#install-ucx).
@JCZuurmond: You're right, it was a little clumsy. Using your suggestion as inspiration I've rephrased it slightly and removed the passive voice. Does it seem better now?
lgtm
Logged an issue to enforce this: #2598
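The reinstall-then-reassess flow the thread settles on could look like this from the command line. This is a hedged sketch, not an authoritative recipe: the `databricks labs install`/`uninstall` subcommands come from the Databricks Labs CLI, and the exact UCX invocations should be checked against the README's install/uninstall sections.

```shell
# Sketch: re-running the assessment means reinstalling UCX first,
# since re-running the assessment job directly is not supported.
databricks labs uninstall ucx               # removes workflows, dashboards, and the inventory
databricks labs install ucx                 # reinstall; re-runs the configuration prompts
databricks labs ucx ensure-assessment-run   # trigger a fresh assessment
```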
* Added `upload` and `download` CLI commands to `upload` and `download` a file to/from a collection of workspaces ([#2508](#2508)). In this release, the Databricks Labs UCX command-line interface has been updated with new `upload` and `download` commands. The `upload` command allows users to upload a file to a single workspace or a collection of workspaces, while the `download` command enables users to download a CSV file from a single workspace or a collection of workspaces. This enhances the efficiency of uploading or downloading the same file to multiple workspaces. Both commands display a warning or information message upon completion, and ensure the file schema is correct before uploading CSV files. This feature includes new methods for uploading and downloading files for multiple workspaces, as well as new unit and integration tests. Users can refer to the contributing instructions to help improve the project.
* Added ability to run `create-table-mapping` command as collection ([#2602](#2602)). This PR introduces the capability to run the `create-table-mapping` command as a collection in the `databricks labs ucx` CLI, providing increased flexibility and automation for workflows. A new optional boolean flag, `run-as-collection`, has been added to the `create-table-mapping` command, allowing users to indicate whether they want to run it as a collection; it defaults to False. The updated `create_table_mapping` function now accepts additional arguments, enabling efficient creation of table mappings for multiple workspaces. Users are encouraged to test this feature in various scenarios and provide feedback for further improvements.
* Added comment on the source tables to capture that they have been deprecated ([#2548](#2548)). A new method, `_sql_add_migrated_comment(self, table: Table, target_table_key: str)`, has been added to the `table_migrate.py` file to mark deprecated source tables with a comment indicating their deprecated status and directing users to the new table. This method is currently being used in three existing methods within the same file to add comments to deprecated tables as part of the migration process. In addition, a new SQL query has been added to set a comment on the source table `hive_metastore.db1_src.managed_dbfs`, indicating that it is deprecated and directing users to the new table `ucx_default.db1_dst.managed_dbfs`. A unit test has also been updated to ensure that the migration process correctly adds the deprecation comment to the source table. This change is part of a larger effort to deprecate and migrate data from old tables to new tables and provides guidance for users to migrate to the new table.
* Added documentation for PrincipalACL migration and delete-missing-principal cmd ([#2552](#2552)). In this open-source library release, the UCX project has added a new command `delete-missing-principals`, applicable only for AWS, to delete IAM roles created by UCX. This command lists all IAM roles generated by the `principal-prefix-access` command and allows for the selection of multiple roles to delete. It checks if the selected roles are mapped to any storage credentials and seeks confirmation before deleting the role and its associated inline policy. Additionally, updates have been made to the `create-uber-principal` and `migrate-locations` commands to apply location ACLs from existing clusters and grant necessary permissions to users. The `create-catalogs-schemas` command has been updated to apply catalog and schema ACLs from existing clusters for both Azure and AWS. The `migrate-tables` command has also been updated to apply table and view ACLs from existing clusters for both Azure and AWS. The documentation of commands that require admin privileges in the UCX project has also been updated.
* Added linting for `spark.sql(...)` calls ([#2558](#2558)). This commit introduces linting for `spark.sql(...)` calls to enhance code quality and consistency by addressing issue [#2558](#2558). The previous `SparkSqlPyLinter` only applied the table-migration linter, not other SQL linters such as the DirectFsAccess linters. This has been rectified by incorporating additional SQL linters for `spark.sql(...)` calls, improving the overall linting functionality of the system. The commit also introduces an abstract base class called `Fixer`, which enforces the inclusion of a `name` property for all derived classes. Additionally, minor improvements and changes have been made to the codebase. The commit resolves issue [#2551](#2551), and updates the testing process in `test_functional.py` to test `spark-sql-directfs.py`, ensuring the proper functioning of the linted `spark.sql(...)` calls.
* Document: clarify that the `assessment` job is not intended to be re-run ([#2560](#2560)). In this release, we have clarified the documented behavior of the `assessment` job for Databricks Labs UCX to address confusion around its re-run functionality. The `assessment` job should only be executed once during the initial setup of UCX and should not be re-run to refresh the inventory or findings. If a re-assessment is necessary, UCX will need to be reinstalled first. This change aligns the documentation with the actual functionality of the `assessment` job and will not affect the daily job that updates parts of the inventory. The `assessment` workflow is designed to detect incompatible entities and provide information for the migration process. It can be executed in parallel or sequentially, and its output is stored in Delta tables for further analysis and decision-making through the assessment report.
* Enabled `migrate-credentials` command to run as collection ([#2532](#2532)). In this pull request, the `migrate-credentials` command in the UCX project's CLI has been updated with a new optional flag, `run_as_collection`, which allows the command to operate on multiple workspaces as a collection. This change introduces the `get_contexts` function and modifies the `delete_missing_principals` function to support the new functionality. The `migrate-credentials` command's behavior for Azure and AWS has been updated to accept an additional `acc_client` argument in its tests. Comprehensive tests and documentation have been added to ensure the reliability and robustness of the new functionality. It is recommended to review the attached testing evidence and ensure the new functionality works as intended without introducing any unintended side effects.
* Escape column names in target tables of the table migration ([#2563](#2563)). In this release, the `escape_sql_identifier` function in the `utils.py` file has been enhanced with a new `maxsplit` parameter, providing more control over the maximum number of splits performed on the input string. This addresses issue [#2544](#2544) and is part of the existing workflow "-migration-ones". The `tables.py` file in the `databricks/labs/ucx/hive_metastore` directory has been updated to escape column names in target tables, preventing SQL injection attacks. Additionally, a new `ColumnInfo` class and several utility functions have been added to the `fixtures.py` file in the `databricks.labs.ucx` project for generating SQL schemas and column casting. The integration tests for migrating Hive Metastore tables have been updated with new tests to handle column names that require escaping. Lastly, the `test_manager.py` file in the `tests/unit/workspace_access` directory has been refactored by removing the `mock_backend` fixture and adding the `test_inventory_permission_manager_init` method to test the initialization of the `PermissionManager` class. These changes improve security, functionality, and test coverage for software engineers utilizing these libraries in their projects.
* Explain why the metastore is checked to exist in the group migration workflow docstring ([#2614](#2614)). In the updated `workflows.py` file, the docstring for the `verify_metastore_attached` method has been revised to explain the necessity of checking if a metastore is attached to the workspace. The reason for this check is that account-level groups are only available when a metastore is attached, which is crucial for the group migration workflow to function properly. The method itself remains the same, only verifying the presence of a metastore attached to the workspace and causing the workflow to fail if no metastore is found. This modification enhances the clarity of the metastore check's importance in the context of the group migration workflow.
* Fixed infinite recursion when visiting a dependency graph ([#2562](#2562)). This change addresses an issue of infinite recursion that can occur when visiting a dependency graph, particularly when many files in a package import the package itself. The `visit` method has been modified to only visit each parent/child pair once, preventing the recursion that can occur in such cases. The `dependencies` property has been added to the `DependencyGraph` class, and the `DependencyGraphVisitor` class has been introduced to handle visiting nodes and tracking visited pairs. These modifications improve the robustness of the library by preventing infinite recursion during dependency resolution. The change includes added unit tests to ensure correct behavior and addresses a blocker for a previous pull request. The functionality of the code remains unchanged.
* Fixed migrate acls CLI command ([#2617](#2617)). In this release, the `migrate acls` command in the ucx project's CLI has been updated to address issue [#2617](#2617). The changes include the removal of ACL type parameters from the `migrate acls` command, simplifying its usage and eliminating the need for explicit type specifications. The `legacy_table_acl` and `principal` parameters have been removed from the `migrate_acls` function, while the `hms_fed` parameter remains unchanged and retains its default value if not explicitly provided. These modifications streamline the ACL migration process in the ucx CLI, making it easier for users to manage access control lists.
* Fixes pip install statement in debug notebook ([#2545](#2545)). In this release, we have addressed an issue in the debug notebook where the pip install statement for wheel was incorrectly surrounded by square brackets, causing the notebook run to fail. We have removed the superfluous square brackets and modified the `remote_wheels` list to be joined as a string before being passed to the `DEBUG_NOTEBOOK` format. It is important to note that this change solely affects the debug notebook and does not involve any alterations to user documentation, CLI commands, workflows, or tables. Furthermore, no new methods have been added, and existing functionality remains unchanged. The change has been manually tested for accuracy, but it does not include any unit tests, integration tests, or staging environment verification.
* More escaping of SQL identifiers ([#2530](#2530)). This commit includes updates to SQL identifier escaping, addressing a missed SQL statement in one of the crawlers and adding support for lesser-known Spark/Databricks corner cases where backticks in names of identifiers need to be doubled when quoting. The `escape_sql_identifier` function has been modified to consider this new case, and the changes affect the existing `migrate-data-reconciliation` workflow. Additionally, the `TableIdentifier` class has been updated to properly escape identifiers, handling the backticks-in-names scenario. These improvements ensure better handling of SQL identifiers, improving the overall functionality of the codebase. Unit tests have been updated to reflect these changes.
* Retry deploy workflow on `InternalError` ([#2525](#2525)). In the `workflows.py` file, the `_deploy_workflow` function has been updated to include a retry mechanism using the `@retried` decorator, which handles `InternalError` exceptions during workflow creation. This enhancement aims to improve the resilience of deploying workflows by automatically retrying in case of internal errors, thereby addressing issue [#2522](#2522). This change is part of our ongoing efforts to ensure a robust and fault-tolerant deployment process. The retry mechanism is configured with a timeout of 2 minutes to prevent extended waiting in case of persistent issues, thus enhancing overall system efficiency and reliability.
* Updated databricks-labs-lsql requirement from <0.10,>=0.5 to >=0.5,<0.11 ([#2580](#2580)). In this release, we have updated the requirement for the databricks-labs-lsql package, raising the upper bound from 0.10 to 0.11 while keeping the lower bound at 0.5. This update will allow users to utilize the latest version of the package, which includes new features and bug fixes. For more detailed information on the changes included in this update, please refer to the changelog and release notes provided in the commit message.
* Updated sqlglot requirement from <25.20,>=25.5.0 to >=25.5.0,<25.21 ([#2549](#2549)). In this pull request, we are updating the sqlglot requirement in the pyproject.toml file from a range of >=25.5.0,<25.20 to >=25.5.0,<25.21. This change allows for the installation of the latest version of sqlglot, while ensuring that the version does not exceed 25.21. The update was made in response to a pull request from Dependabot, which identified a new version of sqlglot. The PR includes details of the sqlglot changelog and commits, but as reviewers, we can focus on the specific change made to our project. The sqlglot package is a SQL parser and transpiler that we use as a dependency in this project. This update will ensure that our project is using the latest version of this package, which may include bug fixes, new features, or improvements in performance.

Dependency updates:
* Updated sqlglot requirement from <25.20,>=25.5.0 to >=25.5.0,<25.21 ([#2549](#2549)).
* Updated databricks-labs-lsql requirement from <0.10,>=0.5 to >=0.5,<0.11 ([#2580](#2580)).
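The `Fixer` abstract base class mentioned in the linting entry above enforces that every derived fixer exposes a `name` property. A minimal sketch of that pattern (the `SparkSqlFixer` subclass here is hypothetical, not UCX's actual class):

```python
from abc import ABC, abstractmethod


class Fixer(ABC):
    """Abstract base: subclasses must implement the `name` property."""

    @property
    @abstractmethod
    def name(self) -> str: ...


class SparkSqlFixer(Fixer):  # hypothetical subclass for illustration
    @property
    def name(self) -> str:
        return "spark-sql"


fixer = SparkSqlFixer()
print(fixer.name)
```

Attempting to instantiate `Fixer()` directly raises a `TypeError`, which is how the base class enforces the contract at construction time.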
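The visited-pair bookkeeping described in the infinite-recursion fix can be sketched as follows. This is a simplified model using plain strings for nodes; UCX's real `DependencyGraph` operates on source files, but the termination idea is the same: each (parent, child) edge is followed at most once.

```python
class DependencyGraphVisitor:
    """Sketch: visit each (parent, child) pair at most once so cycles terminate."""

    def __init__(self, graph: dict[str, list[str]]):
        self._graph = graph
        self._visited_pairs: set[tuple[str, str]] = set()
        self.order: list[str] = []  # nodes in visit order

    def visit(self, node: str) -> None:
        self.order.append(node)
        for child in self._graph.get(node, []):
            pair = (node, child)
            if pair in self._visited_pairs:
                continue  # this edge was already followed; skip to avoid recursion
            self._visited_pairs.add(pair)
            self.visit(child)


# A package whose modules import the package itself -- a cycle that
# previously caused unbounded recursion:
graph = {"pkg": ["pkg.a", "pkg.b"], "pkg.a": ["pkg"], "pkg.b": ["pkg"]}
visitor = DependencyGraphVisitor(graph)
visitor.visit("pkg")  # terminates despite the cycle
print(visitor.order)
```

Without the `_visited_pairs` set, `visit("pkg")` would recurse forever through `pkg.a -> pkg -> pkg.a -> ...`.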
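The backtick-doubling rule from the SQL-identifier escaping entries can be illustrated with a small sketch. This is not UCX's actual `escape_sql_identifier` (which also takes a `maxsplit` parameter); it only shows the quoting rule: wrap each dotted part in backticks and double any backtick inside a name.

```python
def escape_sql_identifier(name: str) -> str:
    # Quote each dotted part; a backtick inside a name must be doubled
    # when quoting -- the corner case addressed in #2530.
    return ".".join("`" + part.replace("`", "``") + "`" for part in name.split("."))


print(escape_sql_identifier("hive_metastore.db1_src.managed_dbfs"))
# -> `hive_metastore`.`db1_src`.`managed_dbfs`
```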
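The retry-on-`InternalError` behavior from the deploy-workflow entry can be modeled with a small decorator sketch. The real `@retried` decorator comes from the Databricks SDK; this stand-alone version (with a hypothetical `deploy_workflow` function) only illustrates the deadline-bounded retry loop.

```python
import time


class InternalError(Exception):
    """Stand-in for the SDK's InternalError."""


def retried(on: tuple[type[Exception], ...], timeout: float):
    """Retry the wrapped function on the given exceptions until a deadline."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            while True:
                try:
                    return fn(*args, **kwargs)
                except on:
                    if time.monotonic() >= deadline:
                        raise  # deadline reached: surface the last error
                    time.sleep(0.01)  # brief pause before retrying
        return wrapper
    return decorator


attempts = {"n": 0}


@retried(on=(InternalError,), timeout=5.0)
def deploy_workflow() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise InternalError("transient backend failure")
    return "created"


result = deploy_workflow()  # succeeds on the third attempt
print(result)
```

A bounded timeout (2 minutes in the actual change) keeps the deploy step from waiting indefinitely when the error is persistent rather than transient.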
Changes

Following on from some discussions yesterday, this PR updates the project document to clarify that the `assessment` job is only intended to be run once and should not be re-run to refresh the inventory and/or findings. The intent is that UCX should be reinstalled first if the assessment needs to be re-run.

Linked issues

As part of #2074 a new (daily) job will be introduced that refreshes parts of the inventory; however, the `assessment` job will remain as-is. (These documentation changes are intended to reflect the status quo.)

Functionality