Add MigrationSequencer for jobs #3008

Merged
merged 106 commits into from
Nov 1, 2024

Conversation

Contributor

@ericvergnaud ericvergnaud commented Oct 17, 2024

Changes

Add a MigrationSequencer class to sequence the migration steps for jobs.

The PR includes the following resources in its sequence:

  • Jobs
  • Job tasks
  • Job task dependencies
  • Job clusters
  • Cluster

Other elements of the sequence will be added later.

Linked issues

Progresses #1415
Supersedes #2980

Tests

  • added unit tests
  • added integration tests

github-actions bot commented Oct 17, 2024

✅ 71/71 passed, 1 flaky, 4 skipped, 54m3s total

Flaky tests:

  • 🤪 test_running_real_assessment_job_ext_hms (14m14.862s)

Running from acceptance #7176

src/databricks/labs/ucx/sequencing/sequencing.py (outdated, resolved)
src/databricks/labs/ucx/sequencing/sequencing.py (outdated, resolved)

def generate_steps(self) -> Iterable[MigrationStep]:
    # algo adapted from Kahn topological sort. The main differences is that
    # we want the same step number for all nodes with same dependency depth
Collaborator

this requirement is not strictly necessary. also, having a queue here is a more common form and more maintainable by others.

Contributor Author

it is strictly necessary.
We want to find migratable assets for a given owner, i.e. step number = 1 for owner X.
In the original algo, only the first step across all owners has step number 1. As a consequence, you can't know whether item 3 is 3 because it depends on items 1 or 2, or just because of an uncontrolled side effect of the sorting algorithm.
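
To make the requirement concrete, here is a minimal sketch of a level-by-level variant of Kahn's algorithm in which every node at the same dependency depth receives the same step number. It is not the code from this PR: the function name `assign_step_numbers` and the string node identifiers are assumptions made for illustration, and it raises on cycles rather than handling them.

```python
from collections import defaultdict


def assign_step_numbers(dependencies: dict[str, set[str]]) -> dict[str, int]:
    """Assign step numbers with a level-by-level variant of Kahn's algorithm.

    `dependencies` maps each node to the nodes that must be migrated before it.
    Nodes without dependencies get step 1; any other node gets one more than
    the highest step number among its dependencies.
    """
    # Collect every node, including those that only appear as a dependency.
    nodes = set(dependencies)
    for required in dependencies.values():
        nodes |= required
    # Number of not-yet-processed dependencies per node.
    remaining = {node: len(dependencies.get(node, ())) for node in nodes}
    # Reverse index: which nodes are waiting on a given node.
    dependents: dict[str, set[str]] = defaultdict(set)
    for node, required in dependencies.items():
        for requirement in required:
            dependents[requirement].add(node)

    steps: dict[str, int] = {}
    current = sorted(node for node, count in remaining.items() if count == 0)
    step_number = 1
    while current:
        next_level: list[str] = []
        for node in current:
            steps[node] = step_number
            for dependent in dependents[node]:
                remaining[dependent] -= 1
                if remaining[dependent] == 0:
                    next_level.append(dependent)
        # Sorting keeps the result independent of set iteration order.
        current = sorted(next_level)
        step_number += 1
    if len(steps) != len(nodes):
        raise ValueError("dependencies contain a cycle")
    return steps
```

With `{"job-a": {"task-a"}, "task-a": {"cluster-x"}, "job-b": {"task-b"}, "task-b": set()}` as input, `cluster-x` and `task-b` both get step 1, `task-a` and `job-b` get step 2, and `job-a` gets step 3, so a step number of 1 always means "nothing to wait for".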

Collaborator

step number is still relative for all owners. we can compute it on-read in a widget via ROW_NUMBER() window function e.g. SELECT *, ROW_NUMBER() OVER (PARTITION BY owner ORDER BY global_sequence_number) AS owner_step_number. Still, the step number is not globally deterministic and can differ depending on the job task list order. that order is not deterministic on the API side.

i don't want to tradeoff clarity here.

TLDR: algo implementation here has to be more straightforward for maintainability reasons and the sequence number we store in a database has to be global (e.g. global_sequence_number), because we can compute owner_sequence_number on read via window function.

please iterate on this quicker.
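
The on-read computation suggested above can be tried out with a small, self-contained sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for the real warehouse (window functions require SQLite 3.25 or newer); the table name `migration_steps`, the `object_id` column, and the sample rows are assumptions made for illustration.

```python
import sqlite3


def owner_step_numbers(steps: list[tuple[str, str, int]]) -> list[tuple[str, str, int, int]]:
    """Derive a per-owner step number from a global sequence number on read."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE migration_steps ("
            "object_id TEXT, owner TEXT, global_sequence_number INTEGER)"
        )
        connection.executemany("INSERT INTO migration_steps VALUES (?, ?, ?)", steps)
        # Same window function as in the comment above, with explicit columns.
        query = """
            SELECT object_id, owner, global_sequence_number,
                   ROW_NUMBER() OVER (
                       PARTITION BY owner ORDER BY global_sequence_number
                   ) AS owner_step_number
            FROM migration_steps
            ORDER BY owner, owner_step_number
        """
        return connection.execute(query).fetchall()
    finally:
        connection.close()


rows = owner_step_numbers(
    [
        ("cluster-1", "alice", 1),
        ("task-1", "bob", 2),
        ("job-1", "bob", 3),
        ("job-2", "alice", 4),
    ]
)
for row in rows:
    print(row)
```

In this sample, `task-1` receives `owner_step_number` 1 for `bob` simply because it is his lowest global sequence number. The query by itself does not say whether `task-1` depends on an object owned by someone else, which is the concern raised in the reply that follows.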

Contributor Author

I believe the above query would return inaccurate results, i.e. step 1 for an owner would seem migratable when it might not be.
And tbh I don't think original Kahn is more maintainable since we have to change it anyway because the inputs do not form a DAG (see #3009). Maybe it's a misuse in the first place.

src/databricks/labs/ucx/sequencing/sequencing.py (outdated, resolved)
src/databricks/labs/ucx/sequencing/sequencing.py (outdated, resolved)
tests/unit/sequencing/test_sequencing.py (outdated, resolved)
def test_sequence_steps_from_job_task_with_cluster(ws, simple_dependency_resolver, mock_path_lookup) -> None:
    """Sequence a job with a task referencing a cluster.

    Sequence:  # TODO: @JCZuurmond: Would expect cluster first.
Contributor

@nfx and @ericvergnaud : What order do we expect?

If I would go about it, I would:

  1. Cluster
  2. Task
  3. Job

Not sure though, depends a bit on the changes the cluster requires
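
The proposed order falls out of a plain topological sort if one assumes that a referenced object must be migrated before the object that references it. The sketch below uses the standard library's `graphlib` (Python 3.9+) with placeholder node names; it illustrates that assumption only and is not the sequencer from this PR.

```python
from graphlib import TopologicalSorter

# Each key maps to the nodes it references. Under the assumption above,
# those referenced nodes are predecessors that must be migrated first.
references = {
    "job": {"task"},
    "task": {"cluster"},
    "cluster": set(),
}

# static_order() yields predecessors before the nodes that depend on them.
order = list(TopologicalSorter(references).static_order())
print(order)
```

For this three-node chain the only valid order is cluster, then task, then job. Whether that is the right order in practice depends, as noted above, on what the cluster migration actually changes.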

Contributor

@JCZuurmond JCZuurmond left a comment

In separate PR:

  • Release gradually to the assessment workflow, maybe use a feature flag from the configuration which is disabled by default
  • Explain above in the migration sequencing dashboard
  • Add to migration process as well

@JCZuurmond
Contributor

Extending test coverage, moving to in progress for now

@JCZuurmond JCZuurmond marked this pull request as draft October 29, 2024 16:05
task_node = self._nodes.get(("TASK", task_id), None)
if task_node:
    return task_node
job_node = self.register_workflow_job(job)
Contributor

@ericvergnaud : Why do we go from the task to the job instead of the other way around?

Contributor

@JCZuurmond JCZuurmond left a comment

@nfx: I highlighted the most important changes in the comments below

self._nodes: dict[MigrationNodeKey, MigrationNode] = {}

# Outgoing references contains edges in the graph pointing from a node to a set of nodes that the node
# references. These references follow the API references, e.g. a job contains tasks in the
Contributor

This made it easier for me to decide what should reference what. Now, the code follows the references that API objects have

"""Sequence a job with a task that references a non-existing cluster.

Sequence:
1. Cluster # TODO: Do we still expect this reference?
Contributor

@JCZuurmond JCZuurmond Oct 31, 2024

What do we expect here?

The test is an unhappy path where the cluster cannot be found. We could still add the cluster to the sequence as we have the reference to it (the ID). However, we cannot continue the sequence from the cluster as it is not found. (Not implemented yet but a cluster could reference a policy or init script that should be added to the sequence.)

Note that adding an unresolved node breaks with the flow that we are used to.

# `jobs.Job.settings.tasks`, thus a job has an outgoing reference to each of those tasks.
self._outgoing_references: dict[MigrationNodeKey, set[MigrationNode]] = defaultdict(set)

def register_job(self, job: jobs.Job) -> MaybeMigrationNode:
Contributor

@nfx : Please comment on the signature. Do you think this is the right one? Another option would be to add an external-facing register_jobs() that loops over all jobs it can find and then calls this method (which we can make protected).

Or we could rename this to a JobMigrationSequencer with a register and generate_steps public method. Later we can add a DashboardMigrationSequencer and abstract away a parent Sequencer class enforcing this API

@ericvergnaud : You added the dependency graph here. Why did you do that? It was not used (yet)

Contributor

Thought about it a bit more, maybe we need to make this method protected access as well and introduce a build_dependency_graph instead. I am thinking that it makes sense to get this closer to the other linter code.

Open questions I have:

  • Should this method expect a job or job_id? The latter is similar to the cluster: get the job given the id, if not found return problem, otherwise parse the job like the code below.
  • Should we distinguish between "resolve" and "register", where we do both in the same method now? If so, should we do an early return of the resolve problems, if any?
  • Should the job_migration_node get the problems from the tasks and clusters? I think it makes more sense to isolate the problems to the node it directly applies to and surface all the problems when building the dependency graph.

Collaborator

register_jobs() is preferred.

> Should this method expect a job or job_id? The latter is similar to the cluster: get the job given the id, if not found return problem, otherwise parse the job like the code below.

jobs.Job - we get tasks from it.

> Should we distinguish between "resolve" and "register", where we do both in the same method now? If so, should we do an early return of the resolve problems, if any?

imho, we need just register()

> Should the job_migration_node get the problems from the tasks and clusters? I think it makes more sense to isolate the problems to the node it directly applies to and surface all the problems when building the dependency graph.

tasks and job cluster problems have to rollup to a job. less objects for users to think of. though, if you think that having separate nodes for tasks in the graph is easier to implement - go ahead with ("TASK", f"{job.job_id}/{task.task_key}")
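
A rough sketch of how these answers could fit together: a public `register_jobs()` that loops over jobs, task nodes keyed as `("TASK", f"{job.job_id}/{task.task_key}")`, and cluster problems rolled up to the job. The `Job` and `Task` dataclasses are simplified, hypothetical stand-ins for the SDK objects, and the `JobSequencer` class, its attributes, and the problem message format are assumptions for illustration rather than the merged implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

NodeKey = tuple[str, str]


@dataclass
class Task:
    """Hypothetical, simplified stand-in for a job task."""

    task_key: str
    existing_cluster_id: Optional[str] = None


@dataclass
class Job:
    """Hypothetical, simplified stand-in for `jobs.Job`."""

    job_id: int
    tasks: list[Task] = field(default_factory=list)


class JobSequencer:
    """Illustrative only: registers jobs, their tasks and referenced clusters."""

    def __init__(self, known_cluster_ids: set[str]) -> None:
        self._known_cluster_ids = known_cluster_ids
        # Edges point from a node to the nodes it references.
        self.outgoing_references: dict[NodeKey, set[NodeKey]] = defaultdict(set)
        # Problems are collected per job so users have fewer objects to track.
        self.problems: dict[NodeKey, list[str]] = defaultdict(list)

    def register_jobs(self, *jobs: Job) -> list[NodeKey]:
        """Public entry point: register every job and return the job keys."""
        return [self._register_job(job) for job in jobs]

    def _register_job(self, job: Job) -> NodeKey:
        job_key: NodeKey = ("JOB", str(job.job_id))
        for task in job.tasks:
            task_key: NodeKey = ("TASK", f"{job.job_id}/{task.task_key}")
            self.outgoing_references[job_key].add(task_key)
            if task.existing_cluster_id is None:
                continue
            cluster_key: NodeKey = ("CLUSTER", task.existing_cluster_id)
            self.outgoing_references[task_key].add(cluster_key)
            if task.existing_cluster_id not in self._known_cluster_ids:
                # Roll the task-level problem up to the owning job.
                self.problems[job_key].append(
                    f"cluster-not-found: {task.existing_cluster_id}"
                )
        return job_key


sequencer = JobSequencer(known_cluster_ids={"c1"})
job_keys = sequencer.register_jobs(
    Job(1, [Task("ingest", "c1"), Task("report", "missing")])
)
```

Keeping tasks as separate nodes preserves the job, task, cluster reference chain in the graph, while the rolled-up problem list means a user only has to look at the job to see what blocks it.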

@JCZuurmond JCZuurmond marked this pull request as ready for review October 31, 2024 10:37
@JCZuurmond JCZuurmond changed the title from "Implement migration sequencing (phase 1)" to "Add MigrationSequencer for jobs" Oct 31, 2024
@JCZuurmond JCZuurmond self-assigned this Oct 31, 2024
@JCZuurmond JCZuurmond added the feat/migration-progress label (Issues related to the migration progress workflow) Oct 31, 2024
Contributor

@JCZuurmond JCZuurmond left a comment

#3008 (comment)

  • register_jobs -> also reflected in the return type `list[MaybeMigrationNode]`
  • kept the tasks as separate nodes

Collaborator

@nfx nfx left a comment

lgtm

@nfx nfx merged commit b97a8d5 into main Nov 1, 2024
6 of 7 checks passed
@nfx nfx deleted the migration-sequencing-phase-1 branch November 1, 2024 16:59
nfx added a commit that referenced this pull request Nov 8, 2024
* Added `MigrationSequencer` for jobs ([#3008](#3008)). In this commit, a `MigrationSequencer` class has been added to manage the migration sequence for various resources including jobs, job tasks, job task dependencies, job clusters, and clusters. The class builds a graph of dependencies and analyzes it to generate the migration sequence, which is returned as an iterable of `MigrationStep` objects. These objects contain information about the object type, ID, name, owner, required step IDs, and step number. The commit also includes new unit and integration tests to ensure the functionality is working correctly. The migration sequence is used in tests for assessing the sequencing feature, and it handles tasks that reference existing or non-existing clusters or job clusters, and new cluster definitions. This change is linked to issue [#1415](#1415) and supersedes issue [#2980](#2980). Additionally, the commit removes some unnecessary imports and fixtures from a test file.
* Added `phik` to known list ([#3198](#3198)). In this release, we have added `phik` to the known list in the provided JSON file. This change addresses part of issue [#1931](#1931), as outlined in the linked issues. The `phik` key has been added with an empty list as its value, consistent with the structure of other keys in the JSON file. It is important to note that no existing functionality has been altered and no new methods have been introduced in this commit. The scope of the change is confined to updating the known list in the JSON file by adding the `phik` key.
* Added `pmdarima` to known list ([#3199](#3199)). In this release, `pmdarima` has been added to our known list of libraries. `pmdarima` is an open-source Python library for time series analysis, best known for its `auto_arima` model selection, and is particularly useful for fitting ARIMA models and testing for seasonality. This change partly resolves issue [#1931](#1931).
* Added `preshed` to known list ([#3220](#3220)). The `preshed` library has been added to our known list of libraries. Developed using Cython, `preshed` provides hash tables for keys that have already been hashed. With the inclusion of two modules, `preshed` and `preshed.about`, this addition partially resolves issue [#1931](#1931).
* Added `py-cpuinfo` to known list ([#3221](#3221)). In this release, we have added support for the `py-cpuinfo` library to our project, enabling the use of the `cpuinfo` functionality that it provides. With this addition, developers can now access detailed information about the CPU, such as the number of cores, current frequency, and vendor, which can be useful for performance tuning and optimization. This change partially resolves issue [#1931](#1931) and does not affect any existing functionality or add new methods to the codebase. We believe that this improvement will enhance the capabilities of our project and enable more efficient use of CPU resources.
* Cater for empty python cells ([#3212](#3212)). In this release, we have resolved an issue where certain notebook cells in the dependency builder were causing crashes. Specifically, empty or comment-only cells were identified as the source of the problem. To address this, we have implemented a check to account for these cases, ensuring that an empty tree is stored in the `_python_trees` dictionary if the input cell does not produce a valid tree. This change helps prevent crashes in the dependency builder caused by empty or comment-only cells. Furthermore, we have added a test to verify the fix on a failed repository. If a cell does not produce a tree, the `_load_children_from_tree` method will not be executed for that cell, skipping the loading of any children trees. This enhancement improves the overall stability and reliability of the library by preventing crashes caused by invalid input.
* Create `TODO` issues every nightly run ([#3196](#3196)). A commit has been made to update the `acceptance` repository version in the `acceptance.yml` GitHub workflow from `acceptance/v0.4.0` to `acceptance/v0.4.2`, which affects the integration tests. The `Run nightly tests` step in the GitHub repository's workflow has also been updated to use a newer version of the `databrickslabs/sandbox/acceptance` action, from `v0.3.1` to `v0.4.2`. Software engineers should verify that the new version of the `acceptance` repository contains all necessary updates and fixes, and that the integration tests continue to function as expected. Additionally, testing the updated action is important to ensure that the nightly tests run successfully with up-to-date code and can catch potential issues.
* Fixed Integration test failure of migration_tables ([#3108](#3108)). This release includes a fix for two integration tests (`test_migrate_managed_table_to_external_table_without_conversion` and `test_migrate_managed_table_to_external_table_with_clone`) related to Hive Metastore table migration, addressing issues [#3054](#3054) and [#3055](#3055). Previously skipped due to underlying problems, these tests have now been unskipped, enhancing the migration feature's test coverage. No changes have been made to the existing functionality, as the focus is solely on including the previously skipped tests in the testing suite. The changes involve removing `@pytest.mark.skip` markers from the test functions, ensuring they run and provide a more comprehensive test coverage for the Hive Metastore migration feature. In addition, this release includes an update to DirectFsAccess integration tests, addressing issues related to the removal of DFSA collectors and ensuring proper handling of different file types, with no modifications made to other parts of the codebase.
* Replace MockInstallation with MockPathLookup for testing fixtures ([#3215](#3215)). In this release, we have updated the testing fixtures in our unit tests by replacing the MockInstallation class with MockPathLookup. Specifically, we have modified the _load_sources function to use MockPathLookup instead of MockInstallation for loading sources. This change not only enhances the testing capabilities of the module but also introduces a new logger, logger, for more precise logging within the module. Additionally, we have updated the _load_sources function calls in the test_notebook.py file to pass the file path directly instead of a SourceContainer object. This modification allows for more flexible and straightforward testing of file-related functionality, thereby fixing issue [#3115](#3115).
* Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30 ([#3224](#3224)). The open-source library `sqlglot` has been updated to version 25.29.0 with this release, incorporating several breaking changes, new features, and bug fixes. The breaking changes include transpiling `ANY` to `EXISTS`, supporting the `MEDIAN()` function, wrapping values in `NOT value IS ...`, and parsing information schema views into a single identifier. New features include support for the `JSONB_EXISTS` function in PostgreSQL, transpiling `ANY` to `EXISTS` in Spark, transpiling Snowflake's `TIMESTAMP()` function, and adding support for hexadecimal literals in Teradata. Bug fixes include handling a Move edge case in the semantic differ, adding a `NULL` filter on `ARRAY_AGG` only for columns, improving parsing of `WITH FILL ... INTERPOLATE` in Clickhouse, generating `LOG(...)` for `exp.Ln` in TSQL, and optionally parsing a Stream expression. The full changelog can be found in the pull request, which also includes a list of the commits included in this release.
* Use acceptance/v0.4.0 ([#3192](#3192)). A change has been made to the GitHub Actions workflow file for acceptance tests, updating the version of the `databrickslabs/sandbox/acceptance` runner to `acceptance/v0.4.0` and granting write permissions for the `issues` field in the `permissions` section. These updates will allow for the use of the latest version of the acceptance tests and provide the necessary permissions to interact with issues. A `TODO` comment has been added to indicate that the new version of the acceptance tests needs to be updated elsewhere in the codebase. This change will ensure that the acceptance tests are up-to-date and functioning properly.
* Warn about errors instead to avoid job task failure ([#3219](#3219)). In this change, the `refresh_report` method in `jobs.py` has been updated to log warnings instead of raising errors when certain problems are encountered during its execution. Previously, if there were any errors during the linting process, a `ManyError` exception was raised, causing the job task to fail. Now, errors are logged as warnings, allowing the job task to continue running successfully. This resolves issue [#3214](#3214) and ensures that the job task will not fail due to linting errors, allowing users to be aware of any issues that occurred during the linting process while still completing the job task successfully. The updated method checks for errors during the linting process, adds them to a list, and constructs a string of error messages if there are any. This string of error messages is then logged as a warning using the `logger.warning` function, allowing the method to continue executing and the job task to complete successfully.
* [DOC] Add dashboard section ([#3222](#3222)). In this release, we have added a new dashboard section to the project documentation, which provides visualizations of UCX's outcomes to help users better understand and manage their UCX environment. The new section includes a table listing the available dashboards, including the Azure service principals dashboard. This dashboard displays information about Azure service principals discovered by UCX in configurations from various sources such as clusters, cluster policies, job clusters, pipelines, and warehouses. Each dashboard has text widgets that offer detailed information about the contents and are designed to help users understand UCX's results and progress in a more visual and interactive way. The Azure service principals dashboard specifically offers users valuable insights into their Azure service principals within the UCX environment.
* [DOC] README.md rewrite ([#3211](#3211)). The Databricks Labs UCX package offers a suite of tools for migrating data objects from the Hive metastore to Unity Catalog (UC), encompassing a comprehensive table migration process. This process consists of table mapping, data access setup, creating new UC resources, and migrating Hive metastore data objects. Table mapping is achieved using a table mapping file that defaults to mapping all tables/views to UC tables while preserving the original schema and names, but can be customized as needed. Data access setup involves creating and modifying cloud principals and credentials for UC data. New UC resources are created without affecting existing Hive metastore resources, and users can choose from various strategies for migrating tables based on their format and location. Additionally, the package provides installation resources, including a README notebook, a DEBUG notebook, debug logs, and installation configuration, as well as utility commands for viewing and repairing workflows. The migration process also includes an assessment workflow, group migration workflow, data reconciliation, and code migration commands.
* [chore] Added tests to verify linter not being stuck in the infinite loop ([#3225](#3225)). In this release, we have added new functional tests to ensure that the linter does not get stuck in an infinite loop, addressing a bug that was fixed in version 0.46.0 related to the default format change from Parquet to Delta in Databricks Runtime 8.0 and a SQL parse error. These tests involve creating data frames, writing them to tables, and reading from those tables, using PySpark's SQL functions and a system information schema table to demonstrate the corrected behavior. The tests also include SQL queries that select columns from a system information schema table with a specified limit, using a withColumn() method to add a new column to a data frame based on a condition. These new tests provide assurance that the linter will not get stuck in an infinite loop and that SQL queries with table parameters are supported.
* [internal] Temporarily disable integration tests due to ES-1302145 ([#3226](#3226)). In this release, the integration tests for moving tables, views, and aliasing tables have been temporarily disabled due to issue ES-1302145. The `test_move_tables`, `test_move_views`, and `test_alias_tables` functions were previously decorated with `@retried` to handle potential `NotFound` exceptions and had a timeout of 2 minutes, but are now marked with `@pytest.mark.skip("ES-1302145")`. Once the issue is resolved, the `@pytest.mark.skip` decorator should be removed to re-enable the tests. The remaining code in the file, including the `test_move_tables_no_from_schema`, `test_move_tables_no_to_schema`, and `test_move_views_no_from_schema` functions, is unchanged and still functional.
* use a path instance for MISSING_SOURCE_PATH and add test ([#3217](#3217)). In this release, the handling of MISSING_SOURCE_PATH has been improved by replacing the string representation with a Path instance using Pathlib, which simplifies checks for missing source paths and enables the addition of a new test for the DependencyProblem class. This test verifies the behavior of the newly introduced method, is_path_missing(), in the DependencyProblem class for determining if a given problem is caused by a missing path. Co-authored by Eric Vergnaud, these changes not only improve the handling and testing of missing paths but also contribute to enhancing the source code analysis functionality of the databricks/labs/ucx project.

Dependency updates:

 * Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30 ([#3224](#3224)).
@nfx nfx mentioned this pull request Nov 8, 2024
nfx added a commit that referenced this pull request Nov 8, 2024
* Added `MigrationSequencer` for jobs
([#3008](#3008)). In this
commit, a `MigrationSequencer` class has been added to manage the
migration sequence for various resources including jobs, job tasks, job
task dependencies, job clusters, and clusters. The class builds a graph
of dependencies and analyzes it to generate the migration sequence,
which is returned as an iterable of `MigrationStep` objects. These
objects contain information about the object type, ID, name, owner,
required step IDs, and step number. The commit also includes new unit
and integration tests to ensure the functionality is working correctly.
The migration sequence is used in tests for assessing the sequencing
feature, and it handles tasks that reference existing or non-existing
clusters or job clusters, and new cluster definitions. This change is
linked to issue
[#1415](#1415) and
supersedes issue
[#2980](#2980).
Additionally, the commit removes some unnecessary imports and fixtures
from a test file.
* Added `phik` to known list
([#3198](#3198)). In this
release, we have added `phik` to the known list in the provided JSON
file. This change addresses part of issue
[#1931](#1931), as outlined
in the linked issues. The `phik` key has been added with an empty list
as its value, consistent with the structure of other keys in the JSON
file. It is important to note that no existing functionality has been
altered and no new methods have been introduced in this commit. The
scope of the change is confined to updating the known list in the JSON
file by adding the `phik` key.
* Added `pmdarima` to known list
([#3199](#3199)). In this
release, we are excited to announce the addition of support for the
`pmdarima` library, an open-source Python library for automatic seasonal
decomposition of time series. With this commit, we have added `pmdarima`
to our known list of libraries, providing our users with access to its
various methods and functions for data preprocessing, model selection,
and visualization. The library is particularly useful for fitting ARIMA
models and testing for seasonality. By integrating `pmdarima`, users can
now perform time series analysis and forecasting with greater ease and
efficiency. This change partly resolves issue
[#1931](#1931) and
underscores our commitment to providing our users with access to the
latest and most innovative open-source libraries available.
* Added `preshed` to known list
([#3220](#3220)). A new
library, "preshed," has been added to our project's supported libraries,
enhancing compatibility and enabling efficient utilization of its
capabilities. Developed using Cython, `preshed` is a Python interface to
Intel(R) MKL's sparse BLAS, sparse solvers, and sparse linear algebra
routines. With the inclusion of two modules, `preshed` and
"preshed.about," this addition partially resolves issue
[#1931](#1931), improving
the project's overall performance and reliability in sparse linear
algebra tasks. Software engineers can now leverage the `preshed`
library's features and optimized routines for their projects, reducing
development time and increasing efficiency.
* Added `py-cpuinfo` to known list
([#3221](#3221)). In this
release, we have added support for the `py-cpuinfo` library to our
project, enabling the use of the `cpuinfo` functionality that it
provides. With this addition, developers can now access detailed
information about the CPU, such as the number of cores, current
frequency, and vendor, which can be useful for performance tuning and
optimization. This change partially resolves issue
[#1931](#1931) and does not
affect any existing functionality or add new methods to the codebase. We
believe that this improvement will enhance the capabilities of our
project and enable more efficient use of CPU resources.
* Cater for empty python cells
([#3212](#3212)). In this
release, we have resolved an issue where certain notebook cells in the
dependency builder were causing crashes. Specifically, empty or
comment-only cells were identified as the source of the problem. To
address this, we have implemented a check to account for these cases,
ensuring that an empty tree is stored in the `_python_trees` dictionary
if the input cell does not produce a valid tree. This change helps
prevent crashes in the dependency builder caused by empty or
comment-only cells. Furthermore, we have added a test to verify the fix
on a failed repository. If a cell does not produce a tree, the
`_load_children_from_tree` method will not be executed for that cell,
skipping the loading of any children trees. This enhancement improves
the overall stability and reliability of the library by preventing
crashes caused by invalid input.
* Create `TODO` issues every nightly run
([#3196](#3196)). A commit
has been made to update the `acceptance` repository version in the
`acceptance.yml` GitHub workflow from `acceptance/v0.4.0` to
`acceptance/v0.4.2`, which affects the integration tests. The `Run
nightly tests` step in the GitHub repository's workflow has also been
updated to use a newer version of the
`databrickslabs/sandbox/acceptance` action, from `v0.3.1` to `v0.4.2`.
Software engineers should verify that the new version of the
`acceptance` repository contains all necessary updates and fixes, and
that the integration tests continue to function as expected.
Additionally, testing the updated action is important to ensure that the
nightly tests run successfully with up-to-date code and can catch
potential issues.
* Fixed Integration test failure of migration_tables
([#3108](#3108)). This
release includes a fix for two integration tests
(`test_migrate_managed_table_to_external_table_without_conversion` and
`test_migrate_managed_table_to_external_table_with_clone`) related to
Hive Metastore table migration, addressing issues
[#3054](#3054) and
[#3055](#3055). Previously
skipped due to underlying problems, these tests have now been unskipped,
enhancing the migration feature's test coverage. No changes have been
made to the existing functionality, as the focus is solely on including
the previously skipped tests in the testing suite. The changes involve
removing `@pytest.mark.skip` markers from the test functions, so that
they run and provide more comprehensive test coverage for the Hive
Metastore migration feature. In addition, this release includes an
update to DirectFsAccess integration tests, addressing issues related to
the removal of DFSA collectors and ensuring proper handling of different
file types, with no modifications made to other parts of the codebase.
* Replace MockInstallation with MockPathLookup for testing fixtures
([#3215](#3215)). In this
release, we have updated the testing fixtures in our unit tests by
replacing the `MockInstallation` class with `MockPathLookup`. Specifically,
we have modified the `_load_sources` function to use `MockPathLookup`
instead of `MockInstallation` for loading sources. This change also
introduces a new logger named `logger` for more precise logging within
the module. Additionally, we have updated the `_load_sources` function
calls in the `test_notebook.py` file to pass the file path directly
instead of a `SourceContainer` object. This modification allows for more
flexible and straightforward testing of file-related functionality,
thereby fixing issue [#3115](#3115).
* Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30
([#3224](#3224)). The
open-source library `sqlglot` has been updated to version 25.29.0 with
this release, incorporating several breaking changes, new features, and
bug fixes. The breaking changes include transpiling `ANY` to `EXISTS`,
supporting the `MEDIAN()` function, wrapping values in `NOT value IS
...`, and parsing information schema views into a single identifier. New
features include support for the `JSONB_EXISTS` function in PostgreSQL,
transpiling `ANY` to `EXISTS` in Spark, transpiling Snowflake's
`TIMESTAMP()` function, and adding support for hexadecimal literals in
Teradata. Bug fixes include handling a Move edge case in the semantic
differ, adding a `NULL` filter on `ARRAY_AGG` only for columns,
improving parsing of `WITH FILL ... INTERPOLATE` in ClickHouse,
generating `LOG(...)` for `exp.Ln` in T-SQL, and optionally parsing a
Stream expression. The full changelog can be found in the pull request,
which also includes a list of the commits included in this release.
* Use acceptance/v0.4.0
([#3192](#3192)). A change
has been made to the GitHub Actions workflow file for acceptance tests,
updating the version of the `databrickslabs/sandbox/acceptance` runner
to `acceptance/v0.4.0` and granting write permissions for the `issues`
field in the `permissions` section. These updates will allow for the use
of the latest version of the acceptance tests and provide the necessary
permissions to interact with issues. A `TODO` comment has been added to
indicate that the new version of the acceptance tests needs to be
updated elsewhere in the codebase. This change will ensure that the
acceptance tests are up-to-date and functioning properly.
* Warn about errors instead to avoid job task failure
([#3219](#3219)). In this
change, the `refresh_report` method in `jobs.py` has been updated to log
warnings instead of raising errors when problems are encountered during
linting. Previously, any linting error raised a `ManyError` exception,
causing the job task to fail. Now the method collects the errors
encountered during linting, joins their messages into a single string,
and logs that string with `logger.warning`, so the job task completes
while users can still see what went wrong. This resolves issue
[#3214](#3214).
* [DOC] Add dashboard section
([#3222](#3222)). In this
release, we have added a new dashboard section to the project
documentation, which provides visualizations of UCX's outcomes to help
users better understand and manage their UCX environment. The new
section includes a table listing the available dashboards, including the
Azure service principals dashboard. This dashboard displays information
about Azure service principals discovered by UCX in configurations from
various sources such as clusters, cluster policies, job clusters,
pipelines, and warehouses. Each dashboard has text widgets that offer
detailed information about the contents and are designed to help users
understand UCX's results and progress in a more visual and interactive
way. The Azure service principals dashboard specifically offers users
valuable insights into their Azure service principals within the UCX
environment.
* [DOC] README.md rewrite
([#3211](#3211)). The
Databricks Labs UCX package offers a suite of tools for migrating data
objects from the Hive metastore to Unity Catalog (UC), encompassing a
comprehensive table migration process. This process consists of table
mapping, data access setup, creating new UC resources, and migrating
Hive metastore data objects. Table mapping is achieved using a table
mapping file that defaults to mapping all tables/views to UC tables
while preserving the original schema and names, but can be customized as
needed. Data access setup involves creating and modifying cloud
principals and credentials for UC data. New UC resources are created
without affecting existing Hive metastore resources, and users can
choose from various strategies for migrating tables based on their
format and location. Additionally, the package provides installation
resources, including a README notebook, a DEBUG notebook, debug logs,
and installation configuration, as well as utility commands for viewing
and repairing workflows. The migration process also includes an
assessment workflow, group migration workflow, data reconciliation, and
code migration commands.
* [chore] Added tests to verify linter not being stuck in the infinite
loop ([#3225](#3225)). In
this release, we have added new functional tests to ensure that the
linter does not get stuck in an infinite loop, addressing a bug that was
fixed in version 0.46.0 related to the default format change from
Parquet to Delta in Databricks Runtime 8.0 and a SQL parse error. These
tests involve creating data frames, writing them to tables, and reading
from those tables, using PySpark's SQL functions and a system
information schema table to demonstrate the corrected behavior. The
tests also include SQL queries that select columns from a system
information schema table with a specified limit, using a `withColumn()`
method to add a new column to a data frame based on a condition. These
new tests provide assurance that the linter will not get stuck in an
infinite loop and that SQL queries with table parameters are supported.
* [internal] Temporarily disable integration tests due to ES-1302145
([#3226](#3226)). In this
release, the integration tests for moving tables, views, and aliasing
tables have been temporarily disabled due to issue ES-1302145. The
`test_move_tables`, `test_move_views`, and `test_alias_tables` functions
were previously decorated with `@retried` to handle potential `NotFound`
exceptions and had a timeout of 2 minutes, but are now marked with
`@pytest.mark.skip("ES-1302145")`. Once the issue is resolved, the
`@pytest.mark.skip` decorator should be removed to re-enable the tests.
The remaining code in the file, including the
`test_move_tables_no_from_schema`, `test_move_tables_no_to_schema`, and
`test_move_views_no_from_schema` functions, is unchanged and still
functional.
* use a path instance for MISSING_SOURCE_PATH and add test
([#3217](#3217)). In this
release, the handling of `MISSING_SOURCE_PATH` has been improved by
replacing its string representation with a `Path` instance from
`pathlib`, which simplifies checks for missing source paths and enables
the addition of a new test for the `DependencyProblem` class. This test
verifies the behavior of the newly introduced `is_path_missing()` method
of `DependencyProblem`, which determines whether a given problem is
caused by a missing path. Co-authored by Eric Vergnaud, these changes
improve the handling and testing of missing paths and contribute to the
source code analysis functionality of the databricks/labs/ucx project.
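
As a rough illustration of the `MISSING_SOURCE_PATH` entry above, the sketch below shows how a `Path` sentinel can back an `is_path_missing()` check. Only the names `MISSING_SOURCE_PATH`, `DependencyProblem` and `is_path_missing()` come from the notes; the sentinel's value, the fields of this stand-in class and the sample problem code are assumptions made for the example, not the project's actual definitions.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumed sentinel value; the real constant's value is not given in the notes.
MISSING_SOURCE_PATH = Path("<MISSING_SOURCE_PATH>")


@dataclass(frozen=True)
class DependencyProblem:
    """Simplified stand-in for the real class (fields are assumptions)."""

    code: str
    message: str
    source_path: Path = MISSING_SOURCE_PATH

    def is_path_missing(self) -> bool:
        # Path-to-Path comparison replaces ad-hoc string comparisons.
        return self.source_path == MISSING_SOURCE_PATH


# A problem created without a known source path falls back to the sentinel.
unresolved = DependencyProblem("import-not-found", "Could not locate import: foo")
# A problem tied to a concrete file carries that file's path instead.
resolved = DependencyProblem(
    "import-not-found", "Could not locate import: foo", Path("notebooks/etl.py")
)
```

Keeping the sentinel as a `Path` means callers that already pass `Path` objects around can compare against it directly, without converting to and from strings.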

Dependency updates:

* Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30
([#3224](#3224)).
Labels
feat/migration-progress Issues related to the migration progress workflow