The `dogfooding` cluster runs the instance of Tekton that is used for all the CI/CD needs of Tekton itself.

- Configuration for the CI is in `tekton`
- The cluster has two node pools
Secrets which have been applied to the `dogfooding` cluster but are not committed here are:

- GitHub personal access tokens:
  - In the `default` namespace:
    - `bot-token-github` used for syncing label configuration and org configuration
    - `github-token` used to create a draft release
  - In the `tekton-ci` namespace:
    - `bot-token-github` used for custom interceptors and CI jobs
    - `ci-webhook` contains the secret used to verify pull request webhook requests for plumbing CI
  - In the `mario` namespace:
    - `mario-github-secret` contains the secret used to verify that comment webhook requests to the mario service are coming from GitHub
    - `mario-github-token` used for updating PRs
  - In the `bastion-z` namespace:
    - `s390x-k8s-ssh` used for SSH access to the s390x remote machine
  - In the `bastion-p` namespace:
    - `ppc64le-cluster` headless service & endpoint to resolve the remote machine address
    - `ppc64le-k8s-ssh` used for SSH access to the ppc64le remote machine
- In the `default` namespace:
  - GCP secrets:
    - `nightly-account` is used by nightly releases to push releases to the nightly bucket. It's a token for the service account [email protected].
    - `release-secret` is used by Tekton Pipeline to push pipeline artifacts to a GCS bucket. It's also used to push images built by the cron trigger (or Mario) to the image registry on GCP.
  - K8s service account secrets. These secrets are used in pipeline resources of type cluster, to enable Tekton pipelines to deploy to target clusters with specific service accounts:
    - `dogfooding-tekton-cd-token`
    - `dogfooding-tekton-cleaner-token`
    - `dogfooding-tektonci-default-token`
    - `robocat-tekton-deployer-token`
    - `robocat-tektoncd-cadmin-token`
  - K8s configuration secrets. These secrets are used in Tekton CD services to deploy resources to a cluster using the embedded k8s client configuration:

    ```
    $ kubectl get secret -l app=tekton.cd
    NAME                                    TYPE         DATA   AGE
    tektoncd-dogfooding                     kubeconfig   1      18s
    tektoncd-dogfooding-tekton-cd           kubeconfig   1      18s
    tektoncd-dogfooding-tekton-ci-default   kubeconfig   1      15s
    tektoncd-dogfooding-tektoncd-cleaner    kubeconfig   1      15s
    tektoncd-dogfooding-tektonci-default    kubeconfig   1      11s
    tektoncd-prow-cluster-config-bot        kubeconfig   1      13s
    tektoncd-prow-github-admin-default      kubeconfig   1      11s
    tektoncd-robocat-cadmin                 kubeconfig   1      9s
    tektoncd-robocat-tekton-deployer        kubeconfig   1      8s
    ```
- Netlify API Token, in the `dns-manager` namespace, named `netlify-credentials`
- Lots of other secrets, hopefully we can add more documentation on them here as we go.
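Since these secrets are not committed to the repo, they are applied to the cluster out of band. As a hedged sketch (not the exact procedure used), a GitHub token secret such as `bot-token-github` in the `tekton-ci` namespace could be recreated with a command like the one below; the data key `bot-token` and the `$GITHUB_TOKEN` variable are assumptions that would need to match what the consuming tasks and interceptors expect:

```
# Sketch only: the key name and the token variable are placeholders.
kubectl create secret generic bot-token-github \
  --namespace=tekton-ci \
  --from-literal=bot-token="${GITHUB_TOKEN}"
```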
Ingress resources use GCP Load Balancers for HTTPS termination and offload to Kubernetes services in the `dogfooding` cluster.
SSL certificates are generated automatically using a `ClusterIssuer` managed by cert-manager.
- To install cert-manager follow this guide
- To deploy the `ClusterIssuer`:

  ```
  kubectl apply -f https://github.com/tektoncd/plumbing/blob/main/tekton/certificates/clusterissuer.yaml
  ```
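The actual `ClusterIssuer` definition is the one maintained in the plumbing repo at the URL above. For orientation only, an ACME `ClusterIssuer` named `letsencrypt-prod` (the name referenced by the ingress annotations below) generally has roughly this shape; the contact email and private key secret name are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Let's Encrypt production endpoint
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com               # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # illustrative secret name
    solvers:
    - http01:
        ingress: {}
```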
The DNS names are automatically provisioned through annotations on the ingresses themselves.
To see the IP of an ingress in the cluster:

```
kubectl get ingress <ingress-name>
```
A full example of an ingress with HTTPS certificate and DNS name provisioning:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    acme.cert-manager.io/http01-edit-in-place: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
    dns.gardener.cloud/dnsnames: 'dashboard.dogfooding.tekton.dev'
    dns.gardener.cloud/ttl: "3600"
  name: ing
  namespace: tekton-pipelines
spec:
  tls:
  - secretName: dashboard-dogfooding-tekton-dev-tls
    hosts:
    - dashboard.dogfooding.tekton.dev
  rules:
  - host: dashboard.dogfooding.tekton.dev
    http:
      paths:
      - backend:
          service:
            name: tekton-dashboard
            port:
              number: 9097
        path: /*
        pathType: ImplementationSpecific
```
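Once the ingress is applied, cert-manager should issue the certificate and store it in the secret referenced under `spec.tls`. A quick way to check, using the names from the example above:

```
# List cert-manager Certificate resources and check the issued TLS secret
kubectl get certificates -n tekton-pipelines
kubectl get secret dashboard-dogfooding-tekton-dev-tls -n tekton-pipelines
```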
The `dogfooding` cluster is comprised of two node pools. One is used for workloads that operate with
Workload Identity, a feature of GKE which maps Kubernetes Service Accounts to Google Cloud IAM Service
Accounts. The other is used for workloads that don't use Workload Identity and rely instead on mechanisms
like mounted Secrets, or that run unauthenticated.
Choosing the correct pool for a workload should only depend on whether it utilizes the
Workload Identity feature or not.

- `default-pool` is used for most workloads. It doesn't have the GKE Metadata Server enabled and therefore doesn't support workloads running with Workload Identity.
- `workload-id` has the GKE Metadata Server enabled and is used for workloads operating with Workload Identity. The only workload that currently requires Workload Identity is "pipelinerun-logs", which shows Stackdriver log entries for PipelineRuns.
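As an illustration of how a workload ends up on the right pool, a Pod spec can target a specific GKE node pool through the `cloud.google.com/gke-nodepool` node label. The Pod below is a hypothetical sketch, not one of the actual dogfooding manifests:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload-id-example          # hypothetical name
  namespace: default
spec:
  serviceAccountName: default        # in practice, a KSA bound to a GCP IAM service account
  nodeSelector:
    # Land on the pool that runs the GKE Metadata Server,
    # so Workload Identity is available to the Pod.
    cloud.google.com/gke-nodepool: workload-id
  containers:
  - name: app
    image: gcr.io/google-containers/pause   # placeholder image
```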
Manifests for various resources are deployed to the dogfooding clusters from different repositories.
For the plumbing repo, manifests are applied nightly through two cronjobs.
Manifests from other repos (pipeline, dashboard and triggers) are applied manually for now.
Service account definitions are stored in git and are applied as part of CD, except for Cluster Roles and related bindings, as they would require giving too broad access to the CD service account.
Tekton services are deployed using the `deploy-release.sh` script, which submits a Kubernetes `Job`
to the `robocat` cluster to trigger a deployment on the `dogfooding` cluster. The `Job` triggers an
event listener on the `robocat` cluster, which in turn triggers a Tekton task that downloads a release
from the release bucket, optionally applies overlays, and deploys the result to the `dogfooding`
cluster using a dedicated service account.
DNS records for the `tekton.dev` domain are hosted by Netlify. Gardener's External DNS Manager
is installed in the `dogfooding` cluster in the `dns-manager` namespace, and it watches for `DNSEntries`
and annotated ingresses and services in all namespaces.
DNS Manager is installed from the v0.11.4 tag using helm as follows:
```
# From a cloned https://github.com/gardener/external-dns-management
helm install dns-manager charts/external-dns-management \
  --namespace=dns-manager \
  --set configuration.disableNamespaceRestriction=true \
  --set configuration.identifier=tekton-dogfooding-default \
  --set vpa.enabled=false \
  --set createCRDs=true \
  --set resources.requests.memory=256Mi \
  --set resources.limits.memory=512Mi \
  --set 'custom.volumes:' \
  --set 'custom.volumeMounts:'
```
The DNS Provider for Netlify is installed through the following resource:
```yaml
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSProvider
metadata:
  name: netlify
  namespace: dns-manager
spec:
  type: netlify-dns
  secretRef:
    name: netlify-credentials
  domains:
    include:
    - tekton.dev
```
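Besides annotated ingresses and services, standalone records can be declared with a `DNSEntry` resource that the DNS Manager reconciles. The entry below is an illustration only; the name, hostname and target are placeholders:

```yaml
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
  name: example-entry                        # hypothetical name
  namespace: default
spec:
  dnsName: example.dogfooding.tekton.dev     # placeholder hostname
  ttl: 3600
  targets:
  - 203.0.113.10                             # placeholder load balancer IP
```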
Tekton Pipelines is configured in the `dogfooding` cluster to generate `CloudEvents`, which are sent every time a `TaskRun` or `PipelineRun` is executed.
`CloudEvents` are sent by Tekton Pipelines to an event broker. `Trigger` resources can be defined to pick up events from the broker and have them delivered to consumers.
The broker installed is based on Knative Eventing running on top of a Kafka backend. Knative Eventing is installed following the official guide from the Knative project:
```
# Install the CRDs
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.0.0/eventing-crds.yaml

# Install the core components
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.0.0/eventing-core.yaml

# Verify the installation
kubectl get pods -n knative-eventing
```
The Kafka backend is installed, as recommended in the Knative guide, using the strimzi operator:
```
# Create the namespace
kubectl create namespace kafka

# Install in the kafka namespace
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

# Apply the `Kafka` Cluster CR file
kubectl apply -f https://strimzi.io/examples/latest/kafka/kafka-persistent-single.yaml -n kafka

# Verify the installation
kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s -n kafka
```
A Knative Channel is installed next:
```
# Install the Kafka "Consolidated" Channel
kubectl apply -f https://storage.googleapis.com/knative-nightly/eventing-kafka/latest/channel-consolidated.yaml

# Edit the "config-kafka" config-map in the "knative-eventing" namespace
# Replace "REPLACE_WITH_CLUSTER_URL" with my-cluster-kafka-bootstrap.kafka:9092/
kubectl edit cm/config-kafka -n knative-eventing
```
Install the Knative Kafka Broker following the official guide:
```
# Kafka Controller
kubectl apply -f https://github.com/knative-sandbox/eventing-kafka-broker/releases/download/knative-v1.0.0/eventing-kafka-controller.yaml

# Kafka Broker data plane
kubectl apply -f https://github.com/knative-sandbox/eventing-kafka-broker/releases/download/knative-v1.0.0/eventing-kafka-broker.yaml
```
Create a broker resource:
```yaml
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  namespace: default
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config
    namespace: knative-eventing
  delivery:
    retry: 0
```
The `retry: 0` part means that event delivery won't be retried on failure.
This is required because Tekton Triggers replies to CloudEvents with a JSON body
but no CloudEvents headers, which is interpreted by the message dispatcher as
a failure - see the feature proposal
on Triggers for more details.
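The `Broker` above points at the `kafka-broker-config` ConfigMap in `knative-eventing`. A minimal sketch of that ConfigMap, assuming the single-node Strimzi cluster installed earlier (partition and replication values here are illustrative, not the cluster's actual settings):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-broker-config
  namespace: knative-eventing
data:
  # Point the broker data plane at the Strimzi bootstrap service
  bootstrap.servers: "my-cluster-kafka-bootstrap.kafka:9092"
  # Illustrative values; a single-node cluster only supports replication factor 1
  default.topic.partitions: "10"
  default.topic.replication.factor: "1"
```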
The Kafka UI allows viewing and searching for events stored by Kafka. Events are retained by Kafka for some time (but not indefinitely), which helps when debugging event based integrations. The Kafka UI allows managing channels and creating new events, so it is not publicly accessible. To access the Kafka UI, port-forward the service port:
```
# Set up port forwarding
export POD_NAME=$(kubectl get pods --namespace kafka -l "app.kubernetes.io/name=kafka-ui,app.kubernetes.io/instance=kafka-ui" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace kafka port-forward $POD_NAME 8080:8080

# Point the browser to http://localhost:8080
```
The Kafka UI is installed via a helm chart as recommended in the Kubernetes installation guide.

```
helm install kafka-ui kafka-ui/kafka-ui \
  --namespace kafka \
  --set envs.config.KAFKA_CLUSTERS_0_NAME=my-cluster \
  --set envs.config.KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=my-cluster-kafka-bootstrap:9092 \
  --set envs.config.KAFKA_CLUSTERS_0_ZOOKEEPER=my-cluster-zookeeper-nodes:2181
```
Tekton Pipelines is the only `CloudEvents` producer in the cluster. It's configured to send all events to the broker:
```yaml
data:
  default-cloud-events-sink: http://kafka-broker-ingress.knative-eventing.svc.cluster.local/default/default
```
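This key lives in the `config-defaults` ConfigMap of Tekton Pipelines (in the `tekton-pipelines` namespace); one way to set it, sketched here, is a merge patch:

```
kubectl patch configmap config-defaults -n tekton-pipelines --type merge \
  -p '{"data":{"default-cloud-events-sink":"http://kafka-broker-ingress.knative-eventing.svc.cluster.local/default/default"}}'
```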
`CloudEvents` are consumed from the broker via a Knative Eventing CRD called `Trigger`.
The `dogfooding` cluster is set up so that all `TaskRun` start, running and finish events are forwarded from the
broker to the `tekton-events` event listener in the `default` namespace.
This initial filtering of events helps reduce the load on the event listener.
The following `Triggers` are defined in the cluster:
```yaml
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: taskrun-start-events-to-tekton-events-el
  namespace: default
spec:
  broker: default
  filter:
    attributes:
      type: dev.tekton.event.taskrun.started.v1
  subscriber:
    uri: http://el-tekton-events.default.svc.cluster.local:8080
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: taskrun-running-events-to-tekton-events-el
  namespace: default
spec:
  broker: default
  filter:
    attributes:
      type: dev.tekton.event.taskrun.running.v1
  subscriber:
    uri: http://el-tekton-events.default.svc.cluster.local:8080
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: taskrun-successful-events-to-tekton-events-el
  namespace: default
spec:
  broker: default
  filter:
    attributes:
      type: dev.tekton.event.taskrun.successful.v1
  subscriber:
    uri: http://el-tekton-events.default.svc.cluster.local:8080
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: taskrun-failed-events-to-tekton-events-el
  namespace: default
spec:
  broker: default
  filter:
    attributes:
      type: dev.tekton.event.taskrun.failed.v1
  subscriber:
    uri: http://el-tekton-events.default.svc.cluster.local:8080
```
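To check that the triggers are ready and bound to the broker:

```
# Use the fully qualified resource name to avoid ambiguity with Tekton Triggers resources
kubectl get triggers.eventing.knative.dev -n default
```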
Occasionally, the Kafka cluster may stop working. Connecting via the Kafka UI shows the cluster as down,
and the `el-tekton-events` deployment logs don't get any new entries.
The Kafka cluster logs show an error related to TLS certificates.
The solution in this case is to kill all `Pods` in the `kafka` namespace and wait for things to start working again.
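For example, assuming everything in the namespace can be safely restarted:

```
# Delete all Pods in the kafka namespace; the controllers managed by the
# Strimzi operator will recreate them.
kubectl delete pods --all -n kafka
```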