Feature request: add `started`, ... to job metadata #542

soxofaan · 2024-07-23T11:04:53Z

job metadata at GET /jobs/{job_id} currently lists these timestamps:

created (required): Date and time of creation
updated (optional): Date and time of the last status change

This is a feature request to add

started (optional): date and time when the job was started (POST /jobs/{job_id}/results)
stopped (optional): date and time when job stopped running (because of reaching status finished/error/canceled)

Context: we are handling some larger openEO use cases where a significant number of jobs has to be managed. We noticed that the "created" timestamp is not always very informative, while a "started" timestamp would be more relevant, for example because the jobs are created in bulk in advance but started over a longer period, possibly hours or days after creation.

soxofaan · 2024-07-23T11:08:06Z

FYI: I'm willing to create a PR for this (should be pretty straightforward I guess). Unless there are objections to the idea in general

soxofaan · 2024-07-23T11:12:29Z

cc @HansVRP

HansVRP · 2024-07-23T13:27:17Z

sounds excellent. Is 'started' then called when running?

soxofaan · 2024-07-24T10:29:51Z

After discussing this some more, it might be more useful and scalable to not add top-level timestamps, but a "timeline" construct to keep track of various lifetime events of a batch job, e.g. (added comments are for illustration)

  "timeline": [
    ["created", "2017-01-01T09:32:12Z"],
    ["started", "2017-01-05T12:34:56Z"],   # user started job 4 days after creation
    ["queued", "2017-01-05T12:35:01Z"],  # reached status "queued" 5s later
    ["running", "2017-01-05T12:39:10Z"],  # reached status "running" after 4 minutes
  ],

Note that I did not define the timeline here as a mapping object, but as an array/list of tuples: it has an explicit order, and it supports repeating an event if that is necessary (e.g. restarting a job).
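
As a rough illustration of how a client could consume such a list, here is a minimal Python sketch. The "timeline" field is only the proposal above, not part of the current openEO API, and the helper names `parse_ts` and `durations` are made up for this example.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    # RFC 3339 timestamp with trailing "Z"; replace it so that
    # datetime.fromisoformat() also accepts it on Python < 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def durations(timeline):
    """Return (from_event, to_event, timedelta) for each consecutive pair of events."""
    parsed = [(event, parse_ts(ts)) for event, ts in timeline]
    return [(a, b, tb - ta) for (a, ta), (b, tb) in zip(parsed, parsed[1:])]


# Same (hypothetical) events as in the example above.
timeline = [
    ["created", "2017-01-01T09:32:12Z"],
    ["started", "2017-01-05T12:34:56Z"],
    ["queued", "2017-01-05T12:35:01Z"],
    ["running", "2017-01-05T12:39:10Z"],
]

steps = durations(timeline)
for before, after, delta in steps:
    print(f"{before} -> {after}: {delta}")
```

Because the list is ordered and may repeat events, the pairwise differences stay meaningful even if a job is restarted, which would be harder to express with a plain mapping.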

m-mohr · 2024-09-12T16:39:28Z

This sounds like a simplified version of the logs to me, so I'm a bit sceptical.
You can already express that in a human-readable way in the logs through the log timestamps and corresponding messages.

soxofaan · 2024-09-13T07:51:13Z

my proposal at #542 (comment) is a lot more primitive than logs. It's just a list of event-timestamp pairs (events could be predefined enum). It's small data, so can be easily included directly in job metadata, no need for extra endpoint like logs.

But it doesn't have to be that listing, the initial question is about how to include the actual start and stop time of jobs (in addition to create time and "last status change" time)

m-mohr · 2024-10-12T11:34:55Z

What's the use case for having start and stop time? Or is it actually the effective runtime (stop - start) that you want to get? Usually updated should be the stop time (after execution has finished), although that may differ if you make changes to the metadata of the job afterwards.

soxofaan · 2024-10-14T10:02:04Z

From our end, there are multiple use cases:

Execution benchmarking and profiling in the context of algorithm hosting (e.g. APEx and related use cases). Here you want to build insights/stats on how long jobs are queued before running, how long they run until failure or success, ...
Large-scale client-side batch job management. E.g. as a user I want to run hundreds/thousands of jobs, but at most a handful in parallel. And to manage my resources/credits I want to be able to kill runaway jobs.

One could get this info from actively polling the job status and checking status transitions, but if you want decent time resolution you would be forced to spam the back-end with status polling requests. However, the back-end probably has full, exact view on the lifecycle of a batch job anyway, so it feels like a waste to try to guess all this from the client side.

Usually updated should be the stop time

The problem with updated is that it only reflects the time of the last status change, so if you didn't poll in time, you might have missed the info you're after. Put differently, it forces the user to spam the backend with status requests if they want more precise insights.
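
To make the polling limitation concrete, here is a small Python sketch of what a client can reconstruct from status polling alone. The samples are invented, no real back-end is contacted, and `observed_transitions` is a hypothetical helper, not part of any client library.

```python
from datetime import datetime, timedelta


def observed_transitions(samples):
    """Derive status transitions from polling samples.

    samples: list of (poll_time, status), ordered by time.
    Returns a list of (new_status, earliest, latest): the client only knows
    that the transition happened somewhere between those two poll times.
    """
    result = []
    for (t_prev, s_prev), (t_cur, s_cur) in zip(samples, samples[1:]):
        if s_cur != s_prev:
            result.append((s_cur, t_prev, t_cur))
    return result


# Invented samples: the client polls every 5 minutes.
t0 = datetime(2017, 1, 5, 12, 35, 0)
poll = timedelta(minutes=5)
samples = [
    (t0, "queued"),
    (t0 + poll, "running"),
    (t0 + 2 * poll, "running"),
    (t0 + 3 * poll, "finished"),
]

transitions = observed_transitions(samples)
for status, earliest, latest in transitions:
    print(f"{status}: between {earliest} and {latest} (uncertainty {latest - earliest})")
```

Each transition is only known up to one poll interval, and a state that starts and ends between two polls is never seen at all. Shrinking that uncertainty means polling more often, which is exactly the load on the back-end this request tries to avoid.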

m-mohr · 2024-10-14T12:30:38Z

Just trying to understand things better right now, to get to a good solution...

First use case: Does this need to be exposed publicly though? It seems that this can be done internally.

Second use case: That's what budget was meant for, but it's specified in the currency of the backend, not in time (unless the currency is time). Isn't the actual number of consumed resources (as reported in usage - is that "live"?) more meaningful here? The plain time doesn't necessarily have any relation to the credits.

soxofaan · 2024-10-16T15:19:30Z

First use case: Does this need to be exposed publicly though? It seems that this can be done internally.

We'd prefer to decouple the benchmarking system from the particular backends-under-test here, and use standardized metadata/reporting instead of having to invent, implement and maintain some reporting backchannel for each possible backend-under-test.

Isn't the actual number of consumed resources (as reported in usage - is that "live"?) more meaningful here?

Credit/cost consumption is indeed important to users, but so is time consumption.
Both are relevant, and they are relevant in different contexts: credit consumption is for the long-term, big-picture view ("how much will my application cost each month?"), while time consumption is important now ("it feels like my jobs are slow at the moment").

The plain time doesn't necessarily have any relation to the credits.

Indeed that's the point of this feature request: not to replace credits/budget, but to add insights about the timing of the job

m-mohr · 2024-11-15T23:09:16Z

What would a reasonable set of properties be?

created (exists) -> time when POST /jobs was executed (the only property that never changes)
queued -> time when POST /jobs/:id/results was executed
started -> time when the job switched from queued to running
updated (exists) -> last update (includes when the job errored/finished/was cancelled unless updated afterwards)
(ended -> job errored/finished/was cancelled - not sure about this property, it's usually the same as updated)
expires -> Whenever the results expire

created and updated can't be set to null, but the others must be nullable, right?

So ended - started => runtime.
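
A minimal Python sketch of how a client might derive queue time and runtime from such properties, treating the nullable ones defensively. The property names `queued`, `started` and `ended` are only the candidates listed above, not existing API fields, and the timestamps and the helpers `parse_ts` and `elapsed` are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    # Nullable (or missing) properties stay None instead of raising.
    if value is None:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def elapsed(job: dict, start_key: str, end_key: str) -> Optional[timedelta]:
    """Return job[end_key] - job[start_key], or None if either is missing or null."""
    start = parse_ts(job.get(start_key))
    end = parse_ts(job.get(end_key))
    if start is None or end is None:
        return None
    return end - start


# Hypothetical metadata of a job that is still running: "ended" is null.
running_job = {
    "created": "2017-01-01T09:32:12Z",
    "queued": "2017-01-05T12:34:56Z",
    "started": "2017-01-05T12:39:10Z",
    "ended": None,
}
# The same job after it finished one hour later.
finished_job = dict(running_job, ended="2017-01-05T13:39:10Z")

queue_time = elapsed(finished_job, "queued", "started")
runtime = elapsed(finished_job, "started", "ended")
runtime_while_running = elapsed(running_job, "started", "ended")
print(queue_time, runtime, runtime_while_running)
```

Returning None rather than a guess keeps "not known yet" distinct from a zero duration, which matters if the values feed into benchmarking statistics.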

soxofaan mentioned this issue Jul 23, 2024

Include cancelling long running jobs Open-EO/openeo-python-client#596

Closed
