ClusterManager - skip tests under Java server #130

Closed
28 commits
5d0307d
Atomic message handlers sample
drewhoskins Jun 19, 2024
bca534a
Remove resize jobs to reduce code size
drewhoskins Jun 19, 2024
8b0a6ed
Misc polish
drewhoskins Jun 19, 2024
fb7b32f
Add test
drewhoskins Jun 19, 2024
42d1f12
Format code
drewhoskins Jun 19, 2024
c96f06d
Continue as new
drewhoskins Jun 20, 2024
6944099
Formatting
drewhoskins Jun 20, 2024
ec1fb89
Feedback, readme, restructure files and directories
drewhoskins Jun 22, 2024
dd58c64
Format
drewhoskins Jun 22, 2024
37e56ed
More feedback. Add test-continue-as-new flag.
drewhoskins Jun 24, 2024
a1506b1
Feedback; throw ApplicationFailures from update handlers
drewhoskins Jun 24, 2024
2cad3dd
Formatting
drewhoskins Jun 24, 2024
d5db7d7
__init__.py
drewhoskins Jun 24, 2024
f39841c
Fix lint issues
drewhoskins Jun 24, 2024
344d694
Dan Feedback
drewhoskins Jun 25, 2024
fc74a69
More typehints
drewhoskins Jun 25, 2024
0b84c25
s/atomic/safe/
drewhoskins Jun 25, 2024
c8e9075
Fix and demo idempotency
drewhoskins Jun 26, 2024
4fc6dac
Compatibility with 3.8
drewhoskins Jun 26, 2024
3ba8882
More feedback
drewhoskins Jun 27, 2024
f47369e
Re-add tests
drewhoskins Jun 27, 2024
5dc6185
Fix flaky test
drewhoskins Jun 27, 2024
5b45b21
Improve update and tests
drewhoskins-temporal Jul 8, 2024
ce4d384
Ruff linting
drewhoskins-temporal Jul 8, 2024
52429bd
Use consistent verbs, improve health check
drewhoskins-temporal Jul 8, 2024
74867f1
poe format
drewhoskins-temporal Jul 8, 2024
c6bdd12
Minor sample improvements
drewhoskins-temporal Jul 8, 2024
62f24a2
Skip update tests under Java test server
dandavison Jul 22, 2024
1 change: 1 addition & 0 deletions README.md
@@ -66,6 +66,7 @@ Some examples require extra dependencies. See each sample's directory for specif
* [polling](polling) - Recommended implementation of an activity that needs to periodically poll an external resource while waiting for its successful completion.
* [prometheus](prometheus) - Configure Prometheus metrics on clients/workers.
* [pydantic_converter](pydantic_converter) - Data converter for using Pydantic models.
* [safe_message_handlers](updates_and_signals/safe_message_handlers/) - Safely handling updates and signals.
* [schedules](schedules) - Demonstrates a Workflow Execution that occurs according to a schedule.
* [sentry](sentry) - Report errors to Sentry.
* [worker_specific_task_queues](worker_specific_task_queues) - Use unique task queues to ensure activities run on specific workers.
9 changes: 5 additions & 4 deletions polling/frequent/README.md
@@ -8,10 +8,11 @@ To run, first see [README.md](../../README.md) for prerequisites.

Then, run the following from this directory to run the sample:

```bash
poetry run python run_worker.py
poetry run python run_frequent.py
```
poetry run python run_worker.py

Then, in another terminal, run the following to execute the workflow:

poetry run python run_frequent.py

The Workflow will continue to poll the service and heartbeat on every iteration until it succeeds.

10 changes: 6 additions & 4 deletions polling/infrequent/README.md
@@ -13,10 +13,12 @@ To run, first see [README.md](../../README.md) for prerequisites.

Then, run the following from this directory to run the sample:

```bash
poetry run python run_worker.py
poetry run python run_infrequent.py
```
poetry run python run_worker.py

Then, in another terminal, run the following to execute the workflow:

poetry run python run_infrequent.py


Since the test service simulates being _down_ for four polling attempts and then returns _OK_ on the fifth poll attempt, the Workflow will perform four Activity retries with a 60-second poll interval, and then return the service result on the successful fifth attempt.
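
The retry sequence described above can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeService` and `poll_until_ok` are hypothetical names that do not appear in the sample, the interval is set to zero so the sketch finishes immediately, and in the sample itself the repeated attempts are Activity retries rather than a hand-written loop.

```python
import time


class FakeService:
    """Hypothetical stand-in: fails four times, then returns "OK"."""

    def __init__(self, failures_before_ok: int = 4) -> None:
        self.failures_before_ok = failures_before_ok
        self.attempts = 0

    def get_result(self) -> str:
        self.attempts += 1
        if self.attempts <= self.failures_before_ok:
            raise RuntimeError("service is down")
        return "OK"


def poll_until_ok(service: FakeService, interval_seconds: float) -> str:
    """Call the service until it succeeds, sleeping between attempts."""
    while True:
        try:
            return service.get_result()
        except RuntimeError:
            time.sleep(interval_seconds)


service = FakeService()
# The README's scenario uses a 60-second interval; 0 keeps the sketch fast.
result = poll_until_ok(service, interval_seconds=0)
print(result, service.attempts)
```

The fifth call is the first one that succeeds, which matches the four retries followed by one successful attempt described in the paragraph.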

10 changes: 6 additions & 4 deletions polling/periodic_sequence/README.md
@@ -8,10 +8,12 @@ To run, first see [README.md](../../README.md) for prerequisites.

Then, run the following from this directory to run the sample:

```bash
poetry run python run_worker.py
poetry run python run_periodic.py
```
poetry run python run_worker.py

Then, in another terminal, run the following to execute the workflow:

poetry run python run_periodic.py


This will start a Workflow and Child Workflow to periodically poll an Activity.
The Parent Workflow is not aware of the Child Workflow calling Continue-As-New; it is notified only when the Child Workflow completes (or fails).
155 changes: 155 additions & 0 deletions tests/updates_and_signals/safe_message_handlers/workflow_test.py
@@ -0,0 +1,155 @@
import asyncio
import uuid

import pytest
from temporalio.client import Client, WorkflowUpdateFailedError
from temporalio.exceptions import ApplicationError
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

from updates_and_signals.safe_message_handlers.activities import (
assign_nodes_to_job,
find_bad_nodes,
unassign_nodes_for_job,
)
from updates_and_signals.safe_message_handlers.workflow import (
ClusterManagerAssignNodesToJobInput,
ClusterManagerDeleteJobInput,
ClusterManagerInput,
ClusterManagerWorkflow,
)


async def test_safe_message_handlers(client: Client, env: WorkflowEnvironment):
    if env.supports_time_skipping:
        pytest.skip(
            "Java test server: https://github.com/temporalio/sdk-java/issues/1903"
        )
    task_queue = f"tq-{uuid.uuid4()}"
    async with Worker(
        client,
        task_queue=task_queue,
        workflows=[ClusterManagerWorkflow],
        activities=[assign_nodes_to_job, unassign_nodes_for_job, find_bad_nodes],
    ):
        cluster_manager_handle = await client.start_workflow(
            ClusterManagerWorkflow.run,
            ClusterManagerInput(),
            id=f"ClusterManagerWorkflow-{uuid.uuid4()}",
            task_queue=task_queue,
        )
        await cluster_manager_handle.signal(ClusterManagerWorkflow.start_cluster)

        allocation_updates = []
        for i in range(6):
            allocation_updates.append(
                cluster_manager_handle.execute_update(
                    ClusterManagerWorkflow.assign_nodes_to_job,
                    ClusterManagerAssignNodesToJobInput(
                        total_num_nodes=2, job_name=f"task-{i}"
                    ),
                )
            )
        results = await asyncio.gather(*allocation_updates)
        for result in results:
            assert len(result.nodes_assigned) == 2

        await asyncio.sleep(1)

        deletion_updates = []
        for i in range(6):
            deletion_updates.append(
                cluster_manager_handle.execute_update(
                    ClusterManagerWorkflow.delete_job,
                    ClusterManagerDeleteJobInput(job_name=f"task-{i}"),
                )
            )
        await asyncio.gather(*deletion_updates)

        await cluster_manager_handle.signal(ClusterManagerWorkflow.shutdown_cluster)

        result = await cluster_manager_handle.result()
        assert result.num_currently_assigned_nodes == 0


async def test_update_idempotency(client: Client, env: WorkflowEnvironment):
    if env.supports_time_skipping:
        pytest.skip(
            "Java test server: https://github.com/temporalio/sdk-java/issues/1903"
        )
    task_queue = f"tq-{uuid.uuid4()}"
    async with Worker(
        client,
        task_queue=task_queue,
        workflows=[ClusterManagerWorkflow],
        activities=[assign_nodes_to_job, unassign_nodes_for_job, find_bad_nodes],
    ):
        cluster_manager_handle = await client.start_workflow(
            ClusterManagerWorkflow.run,
            ClusterManagerInput(),
            id=f"ClusterManagerWorkflow-{uuid.uuid4()}",
            task_queue=task_queue,
        )

        await cluster_manager_handle.signal(ClusterManagerWorkflow.start_cluster)

        result_1 = await cluster_manager_handle.execute_update(
            ClusterManagerWorkflow.assign_nodes_to_job,
            ClusterManagerAssignNodesToJobInput(
                total_num_nodes=5, job_name="jobby-job"
            ),
        )
        # Call the update a second time to check that the operation is idempotent.
        result_2 = await cluster_manager_handle.execute_update(
            ClusterManagerWorkflow.assign_nodes_to_job,
            ClusterManagerAssignNodesToJobInput(
                total_num_nodes=5, job_name="jobby-job"
            ),
        )
        # The second call should not assign more nodes. It may return fewer if the
        # health check finds bad nodes in between the two updates.
        assert result_1.nodes_assigned >= result_2.nodes_assigned


async def test_update_failure(client: Client, env: WorkflowEnvironment):
    if env.supports_time_skipping:
        pytest.skip(
            "Java test server: https://github.com/temporalio/sdk-java/issues/1903"
        )
    task_queue = f"tq-{uuid.uuid4()}"
    async with Worker(
        client,
        task_queue=task_queue,
        workflows=[ClusterManagerWorkflow],
        activities=[assign_nodes_to_job, unassign_nodes_for_job, find_bad_nodes],
    ):
        cluster_manager_handle = await client.start_workflow(
            ClusterManagerWorkflow.run,
            ClusterManagerInput(),
            id=f"ClusterManagerWorkflow-{uuid.uuid4()}",
            task_queue=task_queue,
        )

        await cluster_manager_handle.signal(ClusterManagerWorkflow.start_cluster)

        await cluster_manager_handle.execute_update(
            ClusterManagerWorkflow.assign_nodes_to_job,
            ClusterManagerAssignNodesToJobInput(
                total_num_nodes=24, job_name="big-task"
            ),
        )
        try:
            # Try to assign too many nodes
            await cluster_manager_handle.execute_update(
                ClusterManagerWorkflow.assign_nodes_to_job,
                ClusterManagerAssignNodesToJobInput(
                    total_num_nodes=3, job_name="little-task"
                ),
            )
        except WorkflowUpdateFailedError as e:
            assert isinstance(e.cause, ApplicationError)
            assert e.cause.message == "Cannot assign 3 nodes; have only 1 available"
        finally:
            await cluster_manager_handle.signal(ClusterManagerWorkflow.shutdown_cluster)
            result = await cluster_manager_handle.result()
            assert result.num_currently_assigned_nodes + result.num_bad_nodes == 24
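
One observation on `test_update_failure`: a `try`/`except` block on its own also passes when no exception is raised at all. A stricter shape can be sketched as follows. This is an illustration only: `assign` and `check_assignment_fails` are hypothetical, and a plain `ValueError` stands in for `WorkflowUpdateFailedError`. In a pytest suite, `pytest.raises` serves the same purpose.

```python
def assign(requested: int, available: int) -> int:
    """Hypothetical stand-in for the update call that is expected to fail."""
    if requested > available:
        raise ValueError(
            f"Cannot assign {requested} nodes; have only {available} available"
        )
    return requested


def check_assignment_fails(requested: int, available: int) -> str:
    """Return the error message, or fail loudly if no error was raised."""
    try:
        assign(requested, available)
    except ValueError as err:
        return str(err)
    else:
        # Reached only when assign() did not raise.
        raise AssertionError("expected the assignment to be rejected")


message = check_assignment_fails(requested=3, available=1)
print(message)
```

The `else` branch is what turns a silent pass into a failure when the call unexpectedly succeeds.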
Empty file added updates_and_signals/__init__.py
Empty file.
22 changes: 22 additions & 0 deletions updates_and_signals/safe_message_handlers/README.md
@@ -0,0 +1,22 @@
# Safe message handlers

This sample demonstrates important techniques for handling signals and updates, collectively known as messages. In particular, it shows how message handlers can interleave with one another, or remain unfinished when the workflow completes, and how to manage both situations.

* Here, using workflow.wait_condition, signal and update handlers will only operate when the workflow is within a certain state--between cluster_started and cluster_shutdown.
* You can pass an initializer signal to start_workflow so that it is handled before anything other than the workflow's constructor. This pattern is known as "signal-with-start."
* Message handlers can block and their actions can be interleaved with one another and with the main workflow. This can easily cause bugs, so we use a lock to protect shared state from interleaved access.
* Message handlers should also finish before the workflow run completes. One option is to use a lock.
* An "Entity" workflow, i.e. a long-lived workflow, periodically "continues as new". It must do this to prevent its history from growing too large, and it passes its state to the next workflow. You can check `workflow.info().is_continue_as_new_suggested()` to see when it's time. Just make sure message handlers have finished before doing so.
* Message handlers can be made idempotent. See update `ClusterManagerWorkflow.assign_nodes_to_job`.
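
The interleaving hazard and the lock that guards against it can be sketched with plain `asyncio`, without any Temporal API. Everything here is hypothetical (the `Cluster` class, its fields, and the node names) and is not the workflow from this sample; `asyncio.sleep(0)` stands in for an awaited activity call, which is the point at which another handler could run.

```python
import asyncio
from typing import Dict, List


class Cluster:
    """Hypothetical shared state that two handlers both modify."""

    def __init__(self, num_nodes: int) -> None:
        self.free_nodes: List[str] = [str(i) for i in range(num_nodes)]
        self.assignments: Dict[str, List[str]] = {}
        self.lock = asyncio.Lock()

    async def assign(self, job_name: str, count: int) -> List[str]:
        # Without the lock, two handlers could both read the same free nodes
        # before either one records its assignment.
        async with self.lock:
            chosen = self.free_nodes[:count]
            await asyncio.sleep(0)  # yield, as an awaited activity call would
            self.free_nodes = self.free_nodes[count:]
            self.assignments[job_name] = chosen
            return chosen


async def main() -> Dict[str, List[str]]:
    cluster = Cluster(num_nodes=4)
    # Two "handlers" run concurrently and compete for the same nodes.
    await asyncio.gather(cluster.assign("a", 2), cluster.assign("b", 2))
    return cluster.assignments


assignments = asyncio.run(main())
print(assignments)
```

If the `async with self.lock` line is removed, both calls select the same two nodes before either updates `free_nodes`, which is the kind of bug the list above warns about.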

To run, first see [README.md](../../README.md) for prerequisites.

Then, run the following from this directory to run the worker:

    poetry run python worker.py

Then, in another terminal, run the following to execute the workflow:

    poetry run python starter.py

This will start a worker to run your workflow and activities, then start a ClusterManagerWorkflow and put it through its paces.
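
The idempotency point in the list above can be sketched in plain Python. `assign_nodes` and its list and dict state are hypothetical and are not taken from the workflow in this sample. The idea is only that a repeated request for the same job returns the earlier assignment instead of consuming more nodes.

```python
from typing import Dict, List


def assign_nodes(
    free_nodes: List[str],
    assignments: Dict[str, List[str]],
    job_name: str,
    count: int,
) -> List[str]:
    """Assign `count` free nodes to `job_name`, at most once per job."""
    if job_name in assignments:
        # Repeated request: return the earlier result and change nothing.
        return assignments[job_name]
    if count > len(free_nodes):
        raise ValueError(
            f"Cannot assign {count} nodes; have only {len(free_nodes)} available"
        )
    chosen = free_nodes[:count]
    del free_nodes[:count]
    assignments[job_name] = chosen
    return chosen


free = [str(i) for i in range(10)]
jobs: Dict[str, List[str]] = {}
first = assign_nodes(free, jobs, "jobby-job", 5)
second = assign_nodes(free, jobs, "jobby-job", 5)
print(first == second, len(free))
```

Checking for an existing assignment before touching the free list is what makes a retried or duplicated request safe.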
Empty file.
45 changes: 45 additions & 0 deletions updates_and_signals/safe_message_handlers/activities.py
@@ -0,0 +1,45 @@
import asyncio
from dataclasses import dataclass
from typing import List, Set

from temporalio import activity


@dataclass
class AssignNodesToJobInput:
    nodes: List[str]
    job_name: str


@activity.defn
async def assign_nodes_to_job(input: AssignNodesToJobInput) -> None:
    print(f"Assigning nodes {input.nodes} to job {input.job_name}")
    await asyncio.sleep(0.1)


@dataclass
class UnassignNodesForJobInput:
    nodes: List[str]
    job_name: str


@activity.defn
async def unassign_nodes_for_job(input: UnassignNodesForJobInput) -> None:
    print(f"Deallocating nodes {input.nodes} from job {input.job_name}")
    await asyncio.sleep(0.1)


@dataclass
class FindBadNodesInput:
    nodes_to_check: Set[str]


@activity.defn
async def find_bad_nodes(input: FindBadNodesInput) -> Set[str]:
    await asyncio.sleep(0.1)
    bad_nodes = set([n for n in input.nodes_to_check if int(n) % 5 == 0])
    if bad_nodes:
        print(f"Found bad nodes: {bad_nodes}")
    else:
        print("No new bad nodes found.")
    return bad_nodes
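
As a worked example of the rule in `find_bad_nodes`, the following sketch applies the same `int(n) % 5 == 0` test outside of any activity. The helper name `bad_nodes_among` is hypothetical, and the node names "0" through "24" are an assumption that the test's expected counts suggest but this diff does not show.

```python
from typing import Set


def bad_nodes_among(nodes_to_check: Set[str]) -> Set[str]:
    """Same rule as the activity: a node is bad if its number is a multiple of 5."""
    return {n for n in nodes_to_check if int(n) % 5 == 0}


# Assumption: node names are the strings "0" through "24".
all_nodes = {str(i) for i in range(25)}
bad = bad_nodes_among(all_nodes)
print(sorted(bad, key=int))
```

Note that node "0" counts as bad under this rule, because 0 is a multiple of 5.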
84 changes: 84 additions & 0 deletions updates_and_signals/safe_message_handlers/starter.py
@@ -0,0 +1,84 @@
import argparse
import asyncio
import logging
import uuid
from typing import Optional

from temporalio import common
from temporalio.client import Client, WorkflowHandle

from updates_and_signals.safe_message_handlers.workflow import (
ClusterManagerAssignNodesToJobInput,
ClusterManagerDeleteJobInput,
ClusterManagerInput,
ClusterManagerWorkflow,
)


async def do_cluster_lifecycle(wf: WorkflowHandle, delay_seconds: Optional[int] = None):

    await wf.signal(ClusterManagerWorkflow.start_cluster)

    print("Assigning jobs to nodes...")
    allocation_updates = []
    for i in range(6):
        allocation_updates.append(
            wf.execute_update(
                ClusterManagerWorkflow.assign_nodes_to_job,
                ClusterManagerAssignNodesToJobInput(
                    total_num_nodes=2, job_name=f"task-{i}"
                ),
            )
        )
    await asyncio.gather(*allocation_updates)

    # Only announce and perform the sleep when a delay was requested.
    if delay_seconds:
        print(f"Sleeping for {delay_seconds} second(s)")
        await asyncio.sleep(delay_seconds)

    print("Deleting jobs...")
    deletion_updates = []
    for i in range(6):
        deletion_updates.append(
            wf.execute_update(
                ClusterManagerWorkflow.delete_job,
                ClusterManagerDeleteJobInput(job_name=f"task-{i}"),
            )
        )
    await asyncio.gather(*deletion_updates)

    await wf.signal(ClusterManagerWorkflow.shutdown_cluster)


async def main(should_test_continue_as_new: bool):
    # Connect to Temporal
    client = await Client.connect("localhost:7233")

    print("Starting cluster")
    cluster_manager_handle = await client.start_workflow(
        ClusterManagerWorkflow.run,
        ClusterManagerInput(test_continue_as_new=should_test_continue_as_new),
        id=f"ClusterManagerWorkflow-{uuid.uuid4()}",
        task_queue="safe-message-handlers-task-queue",
        id_reuse_policy=common.WorkflowIDReusePolicy.TERMINATE_IF_RUNNING,
    )
    delay_seconds = 10 if should_test_continue_as_new else 1
    await do_cluster_lifecycle(cluster_manager_handle, delay_seconds=delay_seconds)
    result = await cluster_manager_handle.result()
    print(
        f"Cluster shut down successfully."
        f" It had {result.num_currently_assigned_nodes} nodes assigned at the end."
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    parser = argparse.ArgumentParser(description="Atomic message handlers")
    parser.add_argument(
        "--test-continue-as-new",
        help="Make the ClusterManagerWorkflow continue as new before shutting down",
        action="store_true",
        default=False,
    )
    args = parser.parse_args()
    asyncio.run(main(args.test_continue_as_new))
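
The starter's fan-out pattern (build a list of awaitables in a loop, then await them together) can be sketched with plain `asyncio`. `fake_update` is a hypothetical stand-in for one `execute_update` call and does not contact a Temporal server.

```python
import asyncio
from typing import List


async def fake_update(job_name: str) -> str:
    """Hypothetical stand-in for one `execute_update` call."""
    await asyncio.sleep(0.01)
    return f"{job_name}: done"


async def main() -> List[str]:
    # Build the coroutines first, then await them together.
    pending = [fake_update(f"task-{i}") for i in range(6)]
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*pending)


results = asyncio.run(main())
print(results)
```

Because `asyncio.gather` preserves argument order, each result lines up with the job that produced it, even though the calls run concurrently.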