
Native multi-arch ClusterBuildStrategy #1634

Merged

Conversation

aleskandro
Contributor

@aleskandro aleskandro commented Jun 26, 2024

Changes

This PR proposes a sample multi-arch native ClusterBuildStrategy. The ClusterBuildStrategy runs as a main orchestrator pod that creates one "agent" job for each architecture requested by the Build. The agent jobs are responsible for building the container image and coordinate with the orchestrator pod via messages written to FIFO pipes through the kubectl client.

The agent jobs pass through the following coordination barriers:

  1. Wait for the assets to be uploaded (source code, pull secrets, etc.)
  2. Wait for the image download to start
  3. Wait for the image download to finish

The orchestrator pod:

  1. Creates the agent jobs
  2. Uploads the assets and signals the completion of the upload through a write to the FIFO pipe in each agent job
  3. Starts the download of the OCI archive
  4. Informs the agent jobs that the download is complete (so that they can finish)
  5. Creates a manifest list and pushes it to the final registry
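
A minimal sketch of the FIFO handshake described above, assuming illustrative pipe paths and pod names (not the exact ones used by the strategy):

# Agent side (inside the agent job's container): block until the orchestrator signals.
mkfifo /tmp/assets-uploaded
cat /tmp/assets-uploaded > /dev/null    # blocks here until the pipe is written to

# Orchestrator side: upload the assets, then unblock the agent via kubectl exec.
kubectl cp ./source-context "${agent_pod}:/workspace/source"            # $agent_pod is a placeholder
kubectl exec "${agent_pod}" -- sh -c 'echo done > /tmp/assets-uploaded'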

The service account that runs the strategy must be able to create, list, get, and watch Jobs and Pods. It also needs the create verb on the pods/exec resource. In OKD/OpenShift, the service account must additionally be allowed to create pods that require the privileged SecurityContextConstraints.
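
For illustration only, the permissions could be granted with a Role similar to the following (the name is a placeholder; bind it to the strategy's service account with a matching RoleBinding):

kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: multiarch-native-builder   # placeholder name
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "get", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "list", "get", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
EOF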

Relates to #1119

/kind feature

Submitter Checklist

See the contributor guide
for details on coding conventions, github and prow interactions, and the code review process.

Release Notes

Add a sample build strategy that orchestrates multi-arch container image builds. The service account executing this build strategy must be able to manage Jobs and Pods, and to exec into Pods.

/cc @adambkaplan

@pull-request-size pull-request-size bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 26, 2024
@openshift-ci openshift-ci bot requested a review from adambkaplan June 26, 2024 10:59
@openshift-ci openshift-ci bot added the release-note Label for when a PR has specified a release note label Jun 26, 2024
Member

@adambkaplan adambkaplan left a comment


/approve

@aleskandro thanks for the PR! I took the liberty of updating the PR description to make the release notes linter happy and to use the word "agent" to promote inclusive language.

🤔 I'll need to check if our test suite automatically runs a smoke test for each sample build strategy, or if that is something we need to add to the e2e test code directly.

Contributor

openshift-ci bot commented Jun 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2024
@SaschaSchwarze0
Member

🤔 I'll need to check if our test suite automatically runs a smoke test for each sample build strategy, or if that is something we need to add to the e2e test code directly.

It needs to be added explicitly. Though, what would we test in GitHub Actions? There is no way to create a KinD cluster with nodes of different architectures ...

@aleskandro
Contributor Author

aleskandro commented Jun 27, 2024

🤔 I'll need to check if our test suite automatically runs a smoke test for each sample build strategy, or if that is something we need to add to the e2e test code directly.

It needs to be added explicitly. Though, what would we test in GitHub Actions? There is no way to create a KinD cluster with nodes of different architectures ...

I think for testing purposes it is good enough to build a single-arch image into a manifest list, i.e., having only one element in the architectures list of the Build object. Granted, some for loops will execute only once, but that should be OK.

What do you think?
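
For illustration, a single-arch test Build could look roughly like the following; the strategy name and the architectures parameter name are assumptions based on this discussion, not the final API:

kubectl apply -f - <<'EOF'
apiVersion: shipwright.io/v1beta1
kind: Build
metadata:
  name: multiarch-single-arch-test       # placeholder name
spec:
  source:
    git:
      url: https://github.com/shipwright-io/sample-go   # placeholder source repository
    contextDir: docker-build
  strategy:
    name: multiarch-native-buildah       # assumed strategy name
    kind: ClusterBuildStrategy
  paramValues:
    - name: architectures                # assumed parameter name
      values:
        - value: amd64                   # a single entry exercises the loops exactly once
  output:
    image: registry.example.com/example/multiarch-test:latest   # placeholder output image
EOF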

kube_version=$(curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  "https:${KUBERNETES_PORT#tcp:}/version" | sed -n 's/^.*gitVersion.*"v\(.*\)+.*$/\1/p')
arch=$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/')
curl -L -o /usr/local/bin/kubectl \
  "https://storage.googleapis.com/kubernetes-release/release/v${kube_version}/bin/linux/${arch}/kubectl"
Contributor Author


Just adding a note here, as this will be executed every time a new build is spawned and reaches out to an external network. I'm not sure if we can have an image with kubectl (and how we should determine the right version to use).

Member


Per the discussion we had live - agreed that this step can be avoided if a single image with all the needed binaries is published.

@aleskandro aleskandro force-pushed the multiarch-native-builds branch from ac2ea3e to e674b83 Compare July 8, 2024 12:04
@aleskandro aleskandro force-pushed the multiarch-native-builds branch 5 times, most recently from c021cd6 to 5373a51 Compare July 19, 2024 23:45
@aleskandro aleskandro requested a review from adambkaplan July 23, 2024 19:25
kube_version=$(curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  "https:${KUBERNETES_PORT#tcp:}/version" | sed -n 's/^.*gitVersion.*"v\(.*\)+.*$/\1/p')
arch=$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/')
curl -L -o /usr/local/bin/kubectl \
  "https://storage.googleapis.com/kubernetes-release/release/v${kube_version}/bin/linux/${arch}/kubectl"
Member


Per the discussion we had live - agreed that this step can be avoided if a single image with all the needed binaries is published.

@aleskandro aleskandro force-pushed the multiarch-native-builds branch 7 times, most recently from b135d1d to ee72f5e Compare July 24, 2024 21:45
@@ -0,0 +1,14 @@
# NOTE: This is needed specifically in OKD/OpenShift environments.
Contributor Author


/cc @adambkaplan

Thanks @dorzel

@openshift-ci openshift-ci bot requested a review from adambkaplan July 24, 2024 21:56
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 25, 2024
@aleskandro aleskandro force-pushed the multiarch-native-builds branch from ee72f5e to 433160e Compare July 25, 2024 11:09
@aleskandro
Contributor Author

/unhold

(added by mistake)

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 25, 2024
@aleskandro
Contributor Author

This failure on e2e (v1.27.11, v0.59.2) seems unrelated:

   Copying blob sha256:df2c3b2eb7cc63351bb32f26457bbe0402af8082548f26975f0c329bc7841881
  Copying blob sha256:efe636eac583776a8a114d50fef15bc65b648f3d2bb53326cf1f21cc5ef2b3ae
  Copying blob sha256:fe17849545bb51455d3f7c8773ded2dbb1d6668a85bd00564573a4b88afd36f6
  Error: creating build container: copying system image from manifest list: writing blob: adding layer with blob "sha256:fe17849545bb51455d3f7c8773ded2dbb1d6668a85bd00564573a4b88afd36f6": creating read-only layer with ID "fdd09facb36803af0795aedc41a5c13fdb1c835d7ff211ddfedac7813bd2dd80": no space left on device

This other failure on e2e (v1.27.11, v0.53.5) seems unrelated too, as the same test passed with v1alpha1.

• [FAILED] [95.119 seconds]
Using One-Off Builds Embed BuildSpec in BuildRun [It] should build an image using Buildpacks and a Git source
/home/runner/work/build/build/test/e2e/v1beta1/e2e_one_off_builds_test.go:64

  Timeline >>
  2024-07-25T14:06:34Z 1 Still waiting for build run 'onoff-fjrst' to succeed.
  [FAILED] in [It] - /home/runner/work/build/build/test/e2e/v1beta1/validators_test.go:134 @ 07/25/24 14:07:09.039
  2024-07-25T14:07:09Z 1 Print failed BuildRun's log
  [PANICKED] in [AfterEach] - /opt/hostedtoolcache/go/1.21.12/x64/src/runtime/panic.go:261 @ 07/25/24 14:07:09.041
  << Timeline

  [FAILED] BuildRun status doesn't move to Succeeded
  Expected
      <v1.ConditionStatus>: False
  not to equal
      <v1.ConditionStatus>: False
  In [It] at: /home/runner/work/build/build/test/e2e/v1beta1/validators_test.go:134 @ 07/25/24 14:07:09.039

  Full Stack Trace
    github.com/shipwright-io/build/test/e2e/v1beta1_test.validateBuildRunToSucceed.func1()
    	/home/runner/work/build/build/test/e2e/v1beta1/validators_test.go:134 +0x322
    reflect.Value.call({0x1a327a0?, 0xc000688390?, 0xc0007f5a20?}, {0x1d9a3e0, 0x4}, {0x3060d60, 0x0, 0xc0007f5ab2?})
    	/opt/hostedtoolcache/go/1.21.12/x64/src/reflect/value.go:596 +0xce7
    reflect.Value.Call({0x1a327a0?, 0xc000688390?, 0x0?}, {0x3060d60?, 0x1?, 0x3?})
    	/opt/hostedtoolcache/go/1.21.12/x64/src/reflect/value.go:380 +0xb9
    github.com/onsi/gomega/internal.(*AsyncAssertion).buildActualPoller.func3()
    	/home/runner/work/build/build/vendor/github.com/onsi/gomega/internal/async_assertion.go:325 +0x11f
    github.com/onsi/gomega/internal.(*AsyncAssertion).match(0xc0001f6d90, {0x2090c18?, 0xc0004e2b80}, 0x1, {0xc0004e2b90, 0x1, 0x1})
    	/home/runner/work/build/build/vendor/github.com/onsi/gomega/internal/async_assertion.go:540 +0x82c
    github.com/onsi/gomega/internal.(*AsyncAssertion).Should(0xc0001f6d90, {0x2090c18, 0xc0004e2b80}, {0xc0004e2b90, 0x1, 0x1})
    	/home/runner/work/build/build/vendor/github.com/onsi/gomega/internal/async_assertion.go:145 +0x86
    github.com/shipwright-io/build/test/e2e/v1beta1_test.validateBuildRunToSucceed(0xc0001a04d0, 0xc0002d8f00)
    	/home/runner/work/build/build/test/e2e/v1beta1/validators_test.go:144 +0x362
    github.com/shipwright-io/build/test/e2e/v1beta1_test.glob..func4.2.2()
    	/home/runner/work/build/build/test/e2e/v1beta1/e2e_one_off_builds_test.go:80 +0x450

  There were additional failures detected.  To view them in detail run ginkgo -vv

@adambkaplan
Member

@SaschaSchwarze0 do you know if GitHub Actions runs the matrixed tests on the same machine? That could explain the out-of-disk issue.

@SaschaSchwarze0
Member

@SaschaSchwarze0 do you know if GitHub Actions runs the matrixed tests on the same machine? That could explain the out-of-disk issue.

I would assume not, given that a common use case for the matrix is to vary the OS itself.

Disk space had not been an issue since we started to use https://github.com/shipwright-io/build/blob/v0.13.0/.github/workflows/ci.yml#L135-L144. I might not have looked carefully enough, but did we have it somewhere else outside of this PR? Running concurrent jobs, letting them run until the result is streamed over (and therefore duplicated) to the BuildRun pod, and not deleting them (I don't see a kubectl delete for them, so we rely on the ttlSecondsAfterFinished of one day) can certainly be disk-intensive.

@aleskandro
Contributor Author

@SaschaSchwarze0 do you know if GitHub Actions runs the matrixed tests on the same machine? That could explain the out-of-disk issue.

I would assume not, given that a common use case for the matrix is to vary the OS itself.

Disk space had not been an issue since we started to use https://github.com/shipwright-io/build/blob/v0.13.0/.github/workflows/ci.yml#L135-L144. I might not have looked carefully enough, but did we have it somewhere else outside of this PR? Running concurrent jobs, letting them run until the result is streamed over (and therefore duplicated) to the BuildRun pod, and not deleting them (I don't see a kubectl delete for them, so we rely on the ttlSecondsAfterFinished of one day) can certainly be disk-intensive.

Builds and BuildRuns should be deleted after each case, according to

AfterEach(func() {
    if CurrentSpecReport().Failed() {
        printTestFailureDebugInfo(testBuild, testBuild.Namespace, testID)
    } else if buildRun != nil {
        validateServiceAccountDeletion(buildRun, testBuild.Namespace)
    }
    if buildRun != nil {
        testBuild.DeleteBR(buildRun.Name)
        buildRun = nil
    }
    if build != nil {
        testBuild.DeleteBuild(build.Name)
        build = nil
    }
})

I'm running it a few times locally to see if I find anything that could be related to my changes... I can't re-run it here without your help, though.

@SaschaSchwarze0
Member

Builds and BuildRuns should be deleted after each case

Indeed they are. I missed the ownerReference in the job you are creating.
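
For context, a rough sketch of a job carrying such an ownerReference (all values are placeholders; the actual manifest generated by the strategy may differ). With the owner set to the orchestrating build pod, the Job and its pods are garbage-collected together with it, and ttlSecondsAfterFinished acts only as a fallback:

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-arm64                      # placeholder name
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: ${ORCHESTRATOR_POD_NAME}     # placeholder: the orchestrating build pod's name
      uid: ${ORCHESTRATOR_POD_UID}       # placeholder: the orchestrating build pod's UID
spec:
  ttlSecondsAfterFinished: 86400         # one day, as mentioned above
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: build
          image: quay.io/example/builder:latest   # placeholder image
EOF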

"https:${KUBERNETES_PORT#tcp:}/version" | sed -n 's/^.*gitVersion.*"v\(.*\)\+".*$/\1/p')
arch=$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/')
curl -L -o /usr/local/bin/kubectl \
"https://storage.googleapis.com/kubernetes-release/release/v${kube_version}/bin/linux/${arch}/kubectl"
Contributor


I ran into this failing when the Kubernetes version was reported by the server as v1.28.9+416ecaf. It seems Google wants only v1.28.9.

Contributor Author


@dorzel good point, I just submitted a change that should cover that case too.
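
One way to handle both a plain v1.28.9 and a suffixed v1.28.9+416ecaf is to capture only the semantic version core; this is a sketch of the idea, not necessarily the exact change that was pushed:

# Extract "1.28.9" from gitVersion values like "v1.28.9" or "v1.28.9+416ecaf".
kube_version=$(curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  "https:${KUBERNETES_PORT#tcp:}/version" \
  | sed -n 's/^.*"gitVersion": *"v\([0-9.]*\).*$/\1/p')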

Contributor Author


This commit adds a sample multi-arch native ClusterBuildStrategy.
The ClusterBuildStrategy runs as a main orchestrator pod. It creates one agent job for each architecture requested by the Build. The agent jobs are responsible for building the container image and coordinate with the orchestrator pod via messages written to FIFO pipes through the kubectl client.

The agent jobs pass through the following coordination barriers:
1. Wait for the assets to be uploaded (source code, pull secrets, etc.)
2. Wait for the image download to start
3. Wait for the image download to finish

The orchestrator pod:
1. Creates the agent jobs
2. Uploads the assets and signals the completion of the upload through a write to the FIFO pipe in each agent job
3. Starts the download of the OCI archive
4. Informs the agent jobs that the download is complete
5. Creates a manifest list and pushes it to the final registry.

The service account that runs the strategy must be able to list, get, and watch Jobs and Pods. It also needs the create verb for the pods/exec resource and for Jobs.
In OKD/OpenShift, the service account must also be able to create pods requiring the privileged SecurityContextConstraints.
… and E2E cases

When objects are not cluster-scoped, like RoleBindings or local build strategies, and the TEST_NAMESPACE variable is set, the objects are applied to that namespace; otherwise they are applied to the current context's namespace.
Some test cases also implemented an additional creation step for the same build strategies that are defined in the same folder and recursively applied by the install-strategies target. This commit deletes those function calls to rely on the Makefile target for creating the non-cluster-scoped build strategies too.
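
As an illustration of the described namespace handling (paths and variable handling here are a sketch, not the exact Makefile/script change):

# Apply namespaced objects to TEST_NAMESPACE when set, otherwise to the current context's namespace.
namespace="${TEST_NAMESPACE:-$(kubectl config view --minify --output 'jsonpath={..namespace}')}"
kubectl apply -n "${namespace:-default}" -f path/to/namespaced/buildstrategies/   # placeholder path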
@aleskandro aleskandro force-pushed the multiarch-native-builds branch from 433160e to be5660b Compare July 30, 2024 17:06
Member

@SaschaSchwarze0 SaschaSchwarze0 left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 4, 2024
@SaschaSchwarze0 SaschaSchwarze0 added this to the release-v0.14.0 milestone Aug 4, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 78a3c7b into shipwright-io:main Aug 4, 2024
16 checks passed
@SaschaSchwarze0 SaschaSchwarze0 added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 11, 2024