Upstream contributions from Union.ai #5769

Merged
merged 26 commits into from
Oct 22, 2024

Commits on Sep 30, 2024

  1. Overlap create execution blob store reads/writes

    This change modifies launch paths stemming from `launchExecutionAndPrepareModel` to overlap blob store write and read calls, which dominate end-to-end latency (as seen in the traces below).
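
    The overlap described above can be sketched with plain goroutines. This is a minimal illustration, not the code from this commit: the `overlap` helper and the two closures are hypothetical stand-ins for the real blob store write and read calls.

    ```go
    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    // overlap runs every operation in its own goroutine, waits for all of
    // them, and returns the joined errors (nil when every operation succeeded).
    func overlap(ops ...func() error) error {
        errs := make([]error, len(ops))
        var wg sync.WaitGroup
        for i, op := range ops {
            wg.Add(1)
            go func(i int, op func() error) {
                defer wg.Done()
                errs[i] = op() // each goroutine writes only its own slot
            }(i, op)
        }
        wg.Wait()
        return errors.Join(errs...)
    }

    func main() {
        var inputs string
        err := overlap(
            func() error { return nil },                       // hypothetical blob store write
            func() error { inputs = "inputs.pb"; return nil }, // hypothetical blob store read
        )
        fmt.Println(inputs, err)
    }
    ```

    With this shape the total wait is roughly the slowest call rather than the sum of all calls, which is where the latency saving would come from.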
    
    Signed-off-by: Andrew Dye <[email protected]>
    andrewwdye committed Sep 30, 2024
    8437865
  2. Overlap FutureFileReader blob store writes/reads

    This change updates `FutureFileReader.Cache` and `FutureFileReader.RetrieveCache` to use overlapped writes and reads, respectively, to reduce end-to-end latency. The read path runs on each iteration of the propeller `Handle` loop for dynamic nodes.
    
    Signed-off-by: Andrew Dye <[email protected]>
    andrewwdye committed Sep 30, 2024
    9a874fb
  3. Fix async notifications tests

    I didn't track down why the assumptions changed here or why these tests broke; this fixes them with more explicit checks.
    
    Signed-off-by: Andrew Dye <[email protected]>
    andrewwdye committed Sep 30, 2024
    ea56a96
  4. Overlap fetching input and output data

    This change updates `GetExecutionData`, `GetNodeExecutionData`, and `GetTaskExecutionData` to use overlapped reads when fetching input and output data.
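
    One way to picture overlapped reads for inputs and outputs is to start both fetches before waiting on either. The sketch below is an assumption about the general shape only; `fetchAsync`, `fetchInputsAndOutputs`, and the string payloads are hypothetical and do not mirror the actual flyteadmin signatures.

    ```go
    package main

    import (
        "context"
        "fmt"
    )

    // result carries the payload or error of one read.
    type result struct {
        data string
        err  error
    }

    // fetchAsync starts read in a goroutine and returns a buffered channel
    // that receives exactly one result.
    func fetchAsync(ctx context.Context, read func(context.Context) (string, error)) <-chan result {
        ch := make(chan result, 1)
        go func() {
            data, err := read(ctx)
            ch <- result{data: data, err: err}
        }()
        return ch
    }

    // fetchInputsAndOutputs issues both reads before waiting on either one.
    func fetchInputsAndOutputs(ctx context.Context, readInputs, readOutputs func(context.Context) (string, error)) (string, string, error) {
        inCh := fetchAsync(ctx, readInputs)
        outCh := fetchAsync(ctx, readOutputs)
        in, out := <-inCh, <-outCh
        if in.err != nil {
            return "", "", in.err
        }
        if out.err != nil {
            return "", "", out.err
        }
        return in.data, out.data, nil
    }

    func main() {
        in, out, err := fetchInputsAndOutputs(context.Background(),
            func(context.Context) (string, error) { return "inputs", nil },
            func(context.Context) (string, error) { return "outputs", nil },
        )
        fmt.Println(in, out, err)
    }
    ```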
    
    Signed-off-by: Andrew Dye <[email protected]>
    andrewwdye committed Sep 30, 2024
    3e1f249
  5. Add configuration for launchplan cache resync duration

    Currently, the launchplan cache resync duration uses the `DownstreamEval` duration configuration, which is also used for the sync period on the k8s client. This means that configuring a more aggressive launchplan cache resync also incurs the overhead of syncing all k8s resources (e.g., Pods from `PodPlugin`). Adding a separate configuration value lets us update the two independently.
    
    Signed-off-by: Andrew Dye <[email protected]>
    hamersaw authored and andrewwdye committed Sep 30, 2024
    6f2b2fe
  6. Enqueue owner on launchplan terminal state

    This PR enqueues the owner workflow for evaluation when the launchplan auto refresh cache detects a launchplan in a terminal state.
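
    The behavior can be sketched as a cache sync step that calls back into the workflow queue once an item is terminal. All types and names below (`phase`, `cachedExecution`, `syncItem`) are hypothetical and only illustrate the idea; they are not taken from the propeller code.

    ```go
    package main

    import "fmt"

    // phase is a hypothetical stand-in for a launch plan execution phase.
    type phase int

    const (
        running phase = iota
        succeeded
        failed
        aborted
    )

    // isTerminal reports whether no further phase transitions are expected.
    func (p phase) isTerminal() bool {
        return p == succeeded || p == failed || p == aborted
    }

    // cachedExecution is a hypothetical cache entry that remembers which
    // workflow launched it.
    type cachedExecution struct {
        ownerWorkflowID string
        phase           phase
    }

    // syncItem mimics one cache refresh: when the refreshed item is terminal,
    // the owner workflow is enqueued so that it is evaluated promptly.
    func syncItem(item cachedExecution, enqueueOwner func(workflowID string)) {
        if item.phase.isTerminal() {
            enqueueOwner(item.ownerWorkflowID)
        }
    }

    func main() {
        enqueue := func(id string) { fmt.Println("enqueue", id) }
        syncItem(cachedExecution{"wf-a", running}, enqueue)   // nothing is enqueued
        syncItem(cachedExecution{"wf-b", succeeded}, enqueue) // prints "enqueue wf-b"
    }
    ```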
    
    Signed-off-by: Andrew Dye <[email protected]>
    hamersaw authored and andrewwdye committed Sep 30, 2024
    85ac321
  7. Add client-go metrics

    Register a few metric callbacks with the client-go metrics interface so that we can monitor request latencies and rate limiting of kubeclient.
    
    ```
    ❯ curl http://localhost:10254/metrics | rg k8s_client
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.005"} 84
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.01"} 87
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.025"} 89
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.05"} 99
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.1"} 114
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.25"} 117
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.5"} 117
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="1"} 117
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="2.5"} 117
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="5"} 117
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="10"} 117
    k8s_client_rate_limiter_latency_bucket{verb="GET",le="+Inf"} 117
    k8s_client_rate_limiter_latency_sum{verb="GET"} 1.9358371670000003
    k8s_client_rate_limiter_latency_count{verb="GET"} 117
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.005"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.01"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.025"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.05"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.1"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.25"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.5"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="1"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="2.5"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="5"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="10"} 6
    k8s_client_rate_limiter_latency_bucket{verb="POST",le="+Inf"} 6
    k8s_client_rate_limiter_latency_sum{verb="POST"} 1.0542e-05
    k8s_client_rate_limiter_latency_count{verb="POST"} 6
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.005"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.01"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.025"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.05"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.1"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.25"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.5"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="1"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="2.5"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="5"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="10"} 1
    k8s_client_rate_limiter_latency_bucket{verb="PUT",le="+Inf"} 1
    k8s_client_rate_limiter_latency_sum{verb="PUT"} 5e-07
    k8s_client_rate_limiter_latency_count{verb="PUT"} 1
    k8s_client_request_latency_bucket{verb="GET",le="0.005"} 84
    k8s_client_request_latency_bucket{verb="GET",le="0.01"} 86
    k8s_client_request_latency_bucket{verb="GET",le="0.025"} 89
    k8s_client_request_latency_bucket{verb="GET",le="0.05"} 99
    k8s_client_request_latency_bucket{verb="GET",le="0.1"} 112
    k8s_client_request_latency_bucket{verb="GET",le="0.25"} 117
    k8s_client_request_latency_bucket{verb="GET",le="0.5"} 117
    k8s_client_request_latency_bucket{verb="GET",le="1"} 117
    k8s_client_request_latency_bucket{verb="GET",le="2.5"} 117
    k8s_client_request_latency_bucket{verb="GET",le="5"} 117
    k8s_client_request_latency_bucket{verb="GET",le="10"} 117
    k8s_client_request_latency_bucket{verb="GET",le="+Inf"} 117
    k8s_client_request_latency_sum{verb="GET"} 2.1254330859999997
    k8s_client_request_latency_count{verb="GET"} 117
    k8s_client_request_latency_bucket{verb="POST",le="0.005"} 5
    k8s_client_request_latency_bucket{verb="POST",le="0.01"} 5
    k8s_client_request_latency_bucket{verb="POST",le="0.025"} 5
    k8s_client_request_latency_bucket{verb="POST",le="0.05"} 6
    k8s_client_request_latency_bucket{verb="POST",le="0.1"} 6
    k8s_client_request_latency_bucket{verb="POST",le="0.25"} 6
    k8s_client_request_latency_bucket{verb="POST",le="0.5"} 6
    k8s_client_request_latency_bucket{verb="POST",le="1"} 6
    k8s_client_request_latency_bucket{verb="POST",le="2.5"} 6
    k8s_client_request_latency_bucket{verb="POST",le="5"} 6
    k8s_client_request_latency_bucket{verb="POST",le="10"} 6
    k8s_client_request_latency_bucket{verb="POST",le="+Inf"} 6
    k8s_client_request_latency_sum{verb="POST"} 0.048558582
    k8s_client_request_latency_count{verb="POST"} 6
    k8s_client_request_latency_bucket{verb="PUT",le="0.005"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="0.01"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="0.025"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="0.05"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="0.1"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="0.25"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="0.5"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="1"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="2.5"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="5"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="10"} 1
    k8s_client_request_latency_bucket{verb="PUT",le="+Inf"} 1
    k8s_client_request_latency_sum{verb="PUT"} 0.002381375
    k8s_client_request_latency_count{verb="PUT"} 1
    k8s_client_request_total{code="200",method="GET"} 120
    k8s_client_request_total{code="200",method="PUT"} 1
    k8s_client_request_total{code="409",method="POST"} 6
    ```
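
    As a reading aid for the sample output, the mean of a histogram is its `_sum` divided by its `_count`. The helper below is a hypothetical illustration; the numbers are copied from the GET series above.

    ```go
    package main

    import "fmt"

    // meanSeconds derives the average observation from a histogram's
    // _sum and _count series.
    func meanSeconds(sum float64, count uint64) float64 {
        if count == 0 {
            return 0
        }
        return sum / float64(count)
    }

    func main() {
        // Values copied from the GET series in the sample output above.
        fmt.Printf("request latency mean:      %.4fs\n", meanSeconds(2.1254330859999997, 117))
        fmt.Printf("rate limiter latency mean: %.4fs\n", meanSeconds(1.9358371670000003, 117))
    }
    ```

    For this sample the means come out at roughly 18 ms per GET request and roughly 17 ms of rate limiter wait per GET.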
    
    Signed-off-by: Andrew Dye <[email protected]>
    andrewwdye committed Sep 30, 2024
    a72c538
  8. Histogram Bucket Options

    Add an abstraction for passing custom-defined buckets to histogram vectors.
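
    One common Go shape for such an abstraction is the functional options pattern. The sketch below assumes that pattern; `HistogramOption`, `WithBuckets`, and `resolveBuckets` are hypothetical names, and the defaults simply reuse the bucket bounds visible in the client-go metrics output earlier in this list.

    ```go
    package main

    import "fmt"

    // defaultBuckets is an illustrative default; the real default may differ.
    var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

    // histogramOptions collects settings for a new histogram vector.
    type histogramOptions struct {
        buckets []float64
    }

    // HistogramOption mutates histogramOptions (hypothetical name).
    type HistogramOption func(*histogramOptions)

    // WithBuckets overrides the default bucket upper bounds.
    func WithBuckets(buckets []float64) HistogramOption {
        return func(o *histogramOptions) { o.buckets = buckets }
    }

    // resolveBuckets applies the options on top of the defaults.
    func resolveBuckets(opts ...HistogramOption) []float64 {
        o := histogramOptions{buckets: defaultBuckets}
        for _, opt := range opts {
            opt(&o)
        }
        return o.buckets
    }

    func main() {
        fmt.Println(len(resolveBuckets()))
        fmt.Println(resolveBuckets(WithBuckets([]float64{0.1, 1, 10})))
    }
    ```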
    
    Signed-off-by: Andrew Dye <[email protected]>
    squiishyy authored and andrewwdye committed Sep 30, 2024
    39b249f
  9. Add org to CreateUploadLocation

    Signed-off-by: Andrew Dye <[email protected]>
    katrogan authored and andrewwdye committed Sep 30, 2024
    4752c1e
  10. Add config for grpc MaxMessageSizeBytes

    We need to make the grpc max recv message size in propeller's admin client configurable to match the server-side configuration we support in admin.
    
    Signed-off-by: Andrew Dye <[email protected]>
    andrewwdye committed Sep 30, 2024
    4996328
  11. Move storage cache settings to correct location

    Signed-off-by: Andrew Dye <[email protected]>
    mbarrien authored and andrewwdye committed Sep 30, 2024
    0b935a6
  12. added lock to memstore make threadsafe
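
    The commit message gives no detail, so the following is only a generic sketch of guarding an in-memory store with a `sync.RWMutex`. The `memStore` type and its methods are hypothetical, not the actual flytestdlib implementation.

    ```go
    package main

    import (
        "fmt"
        "sync"
    )

    // memStore is a hypothetical in-memory blob store keyed by reference.
    type memStore struct {
        mu   sync.RWMutex
        data map[string][]byte
    }

    func newMemStore() *memStore {
        return &memStore{data: make(map[string][]byte)}
    }

    // Write stores a copy of raw under ref while holding the write lock.
    func (s *memStore) Write(ref string, raw []byte) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.data[ref] = append([]byte(nil), raw...)
    }

    // Read returns the bytes stored under ref, if any, under the read lock.
    func (s *memStore) Read(ref string) ([]byte, bool) {
        s.mu.RLock()
        defer s.mu.RUnlock()
        raw, ok := s.data[ref]
        return raw, ok
    }

    func main() {
        store := newMemStore()
        var wg sync.WaitGroup
        for i := 0; i < 8; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                store.Write(fmt.Sprintf("ref-%d", i), []byte{byte(i)})
            }(i)
        }
        wg.Wait()
        _, ok := store.Read("ref-3")
        fmt.Println(ok)
    }
    ```

    Without the lock, concurrent writes to a Go map are a data race and can crash the process, which is the usual motivation for a change like this.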

    Signed-off-by: Andrew Dye <[email protected]>
    hamersaw authored and andrewwdye committed Sep 30, 2024
    69f47dd
  13. Add read replica host config and connection

    - Add a new field to the postgres db config struct, `readReplicaHost`.
    - Add a new endpoint in the `database` package that establishes a connection to a database without creating the database if it doesn't exist.
    
    Signed-off-by: Andrew Dye <[email protected]>
    squiishyy authored and andrewwdye committed Sep 30, 2024
    071c137
  14. Fix type assertion when an event is missed while connection to apiserver was severed
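
    The commit does not show the fix. As an assumption about this class of bug: when a watch misses a delete event, client-go informers can hand the delete handler a `cache.DeletedFinalStateUnknown` tombstone instead of the object itself, so an unchecked type assertion panics. The sketch below uses local stand-in types (`pod`, `deletedFinalStateUnknown`) rather than the real client-go types.

    ```go
    package main

    import "fmt"

    // pod is a hypothetical stand-in for the watched resource type.
    type pod struct{ name string }

    // deletedFinalStateUnknown mirrors the shape of client-go's
    // cache.DeletedFinalStateUnknown, which wraps the last known object when
    // the delete event itself was missed.
    type deletedFinalStateUnknown struct {
        Key string
        Obj interface{}
    }

    // podFromDeleteEvent unwraps a tombstone if needed and uses checked type
    // assertions, so an unexpected type yields false instead of a panic.
    func podFromDeleteEvent(obj interface{}) (*pod, bool) {
        if tombstone, ok := obj.(deletedFinalStateUnknown); ok {
            obj = tombstone.Obj
        }
        p, ok := obj.(*pod)
        return p, ok
    }

    func main() {
        direct, _ := podFromDeleteEvent(&pod{name: "a"})
        wrapped, _ := podFromDeleteEvent(deletedFinalStateUnknown{Key: "ns/b", Obj: &pod{name: "b"}})
        _, ok := podFromDeleteEvent("not a pod")
        fmt.Println(direct.name, wrapped.name, ok) // prints "a b false"
    }
    ```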
    
    Signed-off-by: Andrew Dye <[email protected]>
    EngHabu authored and andrewwdye committed Sep 30, 2024
    05dfd9d
  15. Log and monitor failures to validate access tokens

    Signed-off-by: Andrew Dye <[email protected]>
    katrogan authored and andrewwdye committed Sep 30, 2024
    7ce2ca8
  16. Dask dashboard should have a separate log config

    Signed-off-by: Andrew Dye <[email protected]>
    EngHabu authored and andrewwdye committed Sep 30, 2024
    c68d2db
  17. adjust Dask LogName to (Dask Runner Logs)

    Signed-off-by: Andrew Dye <[email protected]>
    fiedlerNr9 authored and andrewwdye committed Sep 30, 2024
    d00a159
  18. Fix k3d local setup prefix

    I was trying to use `setup_local_dev.sh`, and it wasn't working out of the box. It looks like the script expects a `k3d-` prefix for the kubecontext.
    
    Tested by running `setup_local_dev.sh`.
    
    Signed-off-by: Andrew Dye <[email protected]>
    andrewwdye committed Sep 30, 2024
    6db5458
  19. Override ArrayNode log links with map plugin

    This PR adds a configuration option to override ArrayNode log links with those defined in the map plugin. The map plugin contains its own configuration for log links, which may differ from those defined on the PodPlugin. Because ArrayNode executes subNodes as regular tasks (i.e., using the PodPlugin), it otherwise uses the default PodPlugin log templates.
    
    Signed-off-by: Andrew Dye <[email protected]>
    hamersaw authored and andrewwdye committed Sep 30, 2024
    0221c54
  20. Add histogram stopwatch to stow storage

    This change
    * Adds a new `HistogramStopWatch` to promutils. This [allows for aggregating latencies](https://prometheus.io/docs/practices/histograms/#quantiles) across pods and computing quantiles at query time
    * Adds `HistogramStopWatch` latency metrics for stow so that we can reason about storage latencies in aggregate. Existing latency metrics remain.
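
    To illustrate why a histogram-backed stopwatch aggregates across pods, here is a small standard-library sketch. The `histogram` type below is hypothetical and is not the promutils `HistogramStopWatch`; it only shows that cumulative bucket counts from different pods can be summed. It is not safe for concurrent use.

    ```go
    package main

    import (
        "fmt"
        "time"
    )

    // histogram counts observations per upper bound, in seconds. Counts are
    // cumulative: counts[i] covers every value <= bounds[i].
    type histogram struct {
        bounds []float64
        counts []uint64
        total  uint64
    }

    func newHistogram(bounds ...float64) *histogram {
        return &histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
    }

    // Observe records one latency.
    func (h *histogram) Observe(d time.Duration) {
        h.total++
        for i, b := range h.bounds {
            if d.Seconds() <= b {
                h.counts[i]++
            }
        }
    }

    // Start returns a stop function that observes the elapsed time when called.
    func (h *histogram) Start() (stop func()) {
        begin := time.Now()
        return func() { h.Observe(time.Since(begin)) }
    }

    // merge sums two histograms that share the same bounds, which is what
    // makes cross-pod aggregation possible.
    func merge(a, b *histogram) *histogram {
        out := newHistogram(a.bounds...)
        out.total = a.total + b.total
        for i := range out.counts {
            out.counts[i] = a.counts[i] + b.counts[i]
        }
        return out
    }

    func main() {
        podA, podB := newHistogram(0.01, 0.1, 1), newHistogram(0.01, 0.1, 1)
        podA.Observe(5 * time.Millisecond)
        podB.Observe(50 * time.Millisecond)
        stop := podB.Start()
        stop() // records the (near-zero) elapsed time
        all := merge(podA, podB)
        fmt.Println(all.counts, all.total)
    }
    ```

    Precomputed per-pod quantiles cannot be combined this way, which is why the linked Prometheus guidance favors histograms when aggregation is needed.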
    
    - [x] Added unittests
    
    Signed-off-by: Andrew Dye <[email protected]>
    andrewwdye committed Sep 30, 2024
    c763345
  21. Fix metrics scale division in timer

    * Fix metrics scale division in timer
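
    The commit does not describe the bug. One common pitfall when scaling a `time.Duration` for a metric is dividing two integer durations, which truncates the fraction; the sketch below assumes that pitfall purely for illustration, and both function names are hypothetical.

    ```go
    package main

    import (
        "fmt"
        "time"
    )

    // truncated divides two integer durations, so the fractional part is lost.
    func truncated(d, scale time.Duration) float64 {
        return float64(d / scale)
    }

    // scaled converts to floating point before dividing and keeps the fraction.
    func scaled(d, scale time.Duration) float64 {
        return float64(d) / float64(scale)
    }

    func main() {
        d := 1500 * time.Microsecond
        fmt.Println(truncated(d, time.Millisecond)) // 1
        fmt.Println(scaled(d, time.Millisecond))    // 1.5
    }
    ```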
    
    Signed-off-by: Iaroslav Ciupin <[email protected]>
    
    Signed-off-by: Andrew Dye <[email protected]>
    iaroslav-ciupin authored and andrewwdye committed Sep 30, 2024
    b699642
  22. CreateDownloadLink: Head before signing
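
    The title suggests checking that the object exists before producing a signed URL. A minimal sketch of that ordering follows; the `blobStore` interface, `fakeStore`, and `createDownloadLink` are hypothetical and do not reflect the actual data proxy code.

    ```go
    package main

    import (
        "errors"
        "fmt"
    )

    // errNotFound is returned when the object is missing from the store.
    var errNotFound = errors.New("object not found")

    // blobStore is a hypothetical minimal interface.
    type blobStore interface {
        Head(ref string) (exists bool, err error)
        Sign(ref string) (url string, err error)
    }

    // createDownloadLink checks existence first, so that a link is never
    // signed for a missing object.
    func createDownloadLink(store blobStore, ref string) (string, error) {
        exists, err := store.Head(ref)
        if err != nil {
            return "", err
        }
        if !exists {
            return "", fmt.Errorf("%w: %s", errNotFound, ref)
        }
        return store.Sign(ref)
    }

    // fakeStore is an in-memory implementation used for the demo.
    type fakeStore map[string]bool

    func (f fakeStore) Head(ref string) (bool, error) { return f[ref], nil }
    func (f fakeStore) Sign(ref string) (string, error) {
        return "https://example.invalid/" + ref + "?sig=demo", nil
    }

    func main() {
        store := fakeStore{"a/b.txt": true}
        fmt.Println(createDownloadLink(store, "a/b.txt"))
        fmt.Println(createDownloadLink(store, "missing"))
    }
    ```

    Failing early this way lets the caller see a clear not-found error instead of receiving a link that only fails at download time.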

    Signed-off-by: Andrew Dye <[email protected]>
    iaroslav-ciupin authored and andrewwdye committed Sep 30, 2024
    04c4a04
  23. Unexpectedly deleted pod metrics

    * Count when we see unexpectedly terminated pods
    
    Signed-off-by: Andrew Dye <[email protected]>
    iaroslav-ciupin authored and andrewwdye committed Sep 30, 2024
    Configuration menu
    Copy the full SHA
    765ce2e View commit details
    Browse the repository at this point in the history
  24. Don't send inputURI for start-node

    * Send an empty `inputUri` for `start-node` in the node execution event sent to flyteadmin, so that `GetNodeExecutionData` no longer attempts to download a non-existent `inputUri`, as it did before this change.
    * Add a DB migration to clear `input_uri` for start nodes in the existing `node_executions` table.
    
    Signed-off-by: Andrew Dye <[email protected]>
    iaroslav-ciupin authored and andrewwdye committed Sep 30, 2024
    1846764
  25. Fix cluster pool assignment validation

    Signed-off-by: Andrew Dye <[email protected]>
    iaroslav-ciupin authored and andrewwdye committed Sep 30, 2024
    895344d

Commits on Oct 22, 2024

  1. Merge remote-tracking branch 'origin' into union/upstream

    Signed-off-by: Eduardo Apolinario <[email protected]>
    eapolinario committed Oct 22, 2024
    12de9a6