Skip to content

Commit

Permalink
Allow committer builds to use scripts/ci, dev and actions from the PR
Browse files Browse the repository at this point in the history
In our process, we generally do not let the scripts in the "build
images" workflow to use `scripts/ci`, `dev` and `action` scripts to come
from the PR. This is a security feature that prevent Pull Requests from
forks to run code on a worker that can potentially access sensitive
information - such as GITHUB_TOKEN with write access to Github
Registry.

This, however, causes troubles, because in order to test any changes
in those scripts affecting building image, you have to close your
PR from the fork and push one directly to Apache repository (there
in-line build workflows are used from "Test" workflow and those
PRs are safe to run, because only committers can push directly to the
`apache/airflow` repository branches.

This PR changes default behaviour for committer PRs. Rather than
do the same as "regular" PRs, those PRs will not use scripts from
the target branch, instead they will use scripts from the incoming
PRs of the committers. This is equally safe as running PRs from
the `apache/airflow` branch - because we have a reviewed list
of committers in our code and "selective checks" job that
checks it is run always in the context and with the code of
the "target" branch, which means that you cannot manipulate the
list of actors.

The Girhub actor is retrieved from pull requests github
context (event/pull_request/user/login) so it is impossible to
spoof it by the incoming PR.

As part of this PR - list of available selective checks and
documentation of PR labels and selective checks (wrongly named
as "static checks") were reviewed and updated.

While impolementing this, we also realised that we can simplify
branch information retrieval. The code that we had in workflow
was written a long time ago, when the target branch was always
"main" - so we had to check-out the target commit to be able to
retrieve branch_defaults.py and get the branches from there. However
it's already for quite some time that "pull request workflow"
uses "base_ref" as the base commit, which means that in `main` it
is `main` and in `v2-8-test` it is `v2-8-stable`, which means that
we already have the correct `AIRFLOW_BRANCH` and the
`DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH` without having to check
the incoming commit. Which means that we do not need to override
scripts in the build-info step, we only need to check it out
temporarily to fetch the incoming PR and it's parent to see
what files are changed in the incoming PR.
  • Loading branch information
potiuk committed Jan 28, 2024
1 parent 777dfef commit 6d44d34
Show file tree
Hide file tree
Showing 12 changed files with 294 additions and 159 deletions.
144 changes: 81 additions & 63 deletions .github/workflows/build-images.yml
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@ jobs:
default-constraints-branch: ${{ steps.selective-checks.outputs.default-constraints-branch }}
runs-on: ${{steps.selective-checks.outputs.runs-on}}
is-self-hosted-runner: ${{ steps.selective-checks.outputs.is-self-hosted-runner }}
is-committer-build: ${{ steps.selective-checks.outputs.is-committer-build }}
is-airflow-runner: ${{ steps.selective-checks.outputs.is-airflow-runner }}
is-amd-runner: ${{ steps.selective-checks.outputs.is-amd-runner }}
is-arm-runner: ${{ steps.selective-checks.outputs.is-arm-runner }}
Expand Down Expand Up @@ -106,70 +107,39 @@ jobs:
}
}' --jq '.data.node.labels.nodes[]' | jq --slurp -c '[.[].name]' >> ${GITHUB_OUTPUT}
if: github.event_name == 'pull_request_target'
# Retrieve it to be able to determine which files has changed in the incoming commit of the PR
# we checkout the target commit and its parent to be able to compare them
- name: Cleanup repo
run: docker run -v "${GITHUB_WORKSPACE}:/workspace" -u 0:0 bash -c "rm -rf /workspace/*"
- uses: actions/checkout@v4
with:
ref: ${{ env.TARGET_COMMIT_SHA }}
persist-credentials: false
fetch-depth: 2
- name: "Setup python"
uses: actions/setup-python@v4
with:
python-version: 3.8
- name: "Retrieve defaults from branch_defaults.py"
# We cannot "execute" the branch_defaults.py python code here because that would be
# a security problem (we cannot run any code that comes from the sources coming from the PR.
# Therefore, we extract the branches via embedded Python code
# we need to do it before next step replaces checked-out breeze and scripts code coming from
# the PR, because the PR defaults have to be retrieved here.
id: defaults
run: |
python - <<EOF >> ${GITHUB_ENV}
from pathlib import Path
import re
import sys
DEFAULTS_CONTENT = Path('dev/breeze/src/airflow_breeze/branch_defaults.py').read_text()
BRANCH_PATTERN = r'^AIRFLOW_BRANCH = "(.*)"$'
CONSTRAINTS_BRANCH_PATTERN = r'^DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH = "(.*)"$'
branch = re.search(BRANCH_PATTERN, DEFAULTS_CONTENT, re.MULTILINE).group(1)
constraints_branch = re.search(CONSTRAINTS_BRANCH_PATTERN, DEFAULTS_CONTENT, re.MULTILINE).group(1)
output = f"""
DEFAULT_BRANCH={branch}
DEFAULT_CONSTRAINTS_BRANCH={constraints_branch}
""".strip()
print(output)
# Stdout is redirected to GITHUB_ENV but we also print it to stderr to see it in ci log
print(output, file=sys.stderr)
EOF
- name: Checkout target branch to 'target-airflow' folder to use ci/scripts and breeze from there.
####################################################################################################
# WE ONLY DO THAT CHECKOUT ABOVE TO RETRIEVE THE TARGET COMMIT AND IT'S PARENT. DO NOT RUN ANY CODE
# RIGHT AFTER THAT AS WE ARE GOING TO RESTORE THE TARGET BRANCH CODE IN THE NEXT STEP.
####################################################################################################
- name: Checkout target branch to use ci/scripts and breeze from there.
uses: actions/checkout@v4
with:
path: "target-airflow"
ref: ${{ github.base_ref }}
persist-credentials: false
- name: >
Override "scripts/ci", "dev" and ".github/actions" with the target branch
so that the PR does not override it
# We should not override those scripts which become part of the image as they will not be
# changed in the image built - we should only override those that are executed to build
# the image.
shell: bash
run: |
rm -rfv "scripts/ci"
mv -v "target-airflow/scripts/ci" "scripts"
rm -rfv "dev"
mv -v "target-airflow/dev" "."
rm -rfv ".github/actions"
mv -v "target-airflow/.github/actions" ".github"
####################################################################################################
# HERE EVERYTHING IS PERFECTLY SAFE TO RUN. AT THIS POINT WE HAVE THE TARGET BRANCH CHECKED OUT
# AND WE CAN RUN ANY CODE FROM IT. WE CAN RUN BREEZE COMMANDS, WE CAN RUN SCRIPTS, WE CAN RUN
# COMPOSITE ACTIONS. WE CAN RUN ANYTHING THAT IS IN THE TARGET BRANCH AND THERE IS NO RISK THAT
# CODE WILL BE RUN FROM THE PR.
####################################################################################################
- name: "Setup python"
uses: actions/setup-python@v4
with:
python-version: 3.8
- name: "Install Breeze"
uses: ./.github/actions/breeze
####################################################################################################
# WE RUN SELECTIVE CHECKS HERE USING THE TARGET COMMIT AND ITS PARENT TO BE ABLE TO COMPARE THEM
# AND SEE WHAT HAS CHANGED IN THE PR. THE CODE IS STILL RUN FROM THE TARGET BRANCH, SO IT IS SAFE
# TO RUN IT, WE ONLY PASS TARGET_COMMIT_SHA SO THAT SELECTIVE CHECKS CAN SEE WHAT'S COMING IN THE PR
####################################################################################################
- name: Selective checks
id: selective-checks
env:
Expand Down Expand Up @@ -210,6 +180,15 @@ jobs:
with:
ref: ${{ needs.build-info.outputs.target-commit-sha }}
persist-credentials: false
####################################################################################################
# BE VERY CAREFUL HERE! THIS LINE AND THE END OF THE WARNING. HERE WE CHECK OUT THE TARGET
# COMMIT AND ITS PARENT TO BE ABLE TO COMPARE THEM BUT ALSO TO BE ABLE TO BUILD THE IMAGE FROM
# THE INCOMING PR, RATHER THAN FROM TARGET BRANCH. THIS IS A SECURITY RISK, BECAUSE THE PR
# CAN CONTAIN ANY CODE AND WE EXECUTE IT HERE. THEREFORE, WE NEED TO BE VERY CAREFUL WHAT WE
# DO HERE. WE SHOULD NOT EXECUTE ANY CODE THAT COMES FROM THE PR. WE SHOULD NOT RUN ANY BREEZE
# COMMAND NOR SCRIPTS NOR COMPOSITE ACTIONS. WE SHOULD ONLY RUN CODE THAT IS EMBEDDED DIRECTLY IN
# THIS WORKFLOW - BECAUSE THIS IS THE ONLY CODE THAT WE CAN TRUST.
####################################################################################################
- name: Checkout target branch to 'target-airflow' folder to use ci/scripts and breeze from there.
uses: actions/checkout@v4
with:
Expand All @@ -218,18 +197,26 @@ jobs:
persist-credentials: false
- name: >
Override "scripts/ci", "dev" and ".github/actions" with the target branch
so that the PR does not override it
# We should not override those scripts which become part of the image as they will not be
# changed in the image built - we should only override those that are executed to build
# the image.
so that the PR does not override them
shell: bash
run: |
echo
echo -e "\033[33m Override scripts and dev with target branch ones for non-committer builds!\033[0m"
echo
rm -rfv "scripts/ci"
mv -v "target-airflow/scripts/ci" "scripts"
rm -rfv "dev"
mv -v "target-airflow/dev" "."
rm -rfv ".github/actions"
mv -v "target-airflow/.github/actions" ".github"
if: needs.build-info.outputs.is-committer-build != 'true'
####################################################################################################
# HERE IT'S A BIT SAFER. THE `dev`, `scripts/ci` AND `.github/actions` ARE NOW COMING FROM THE
# BASE_REF - WHICH IS THE TARGET BRANCH OF THE PR. WE CAN TRUST THAT THOSE SCRIPTS ARE SAVE TO RUN.
# ALL THE REST OF THE CODE COMES FROM THE PR, AND FOR EXAMPLE THE CODE IN THE `Dockerfile.ci` CAN
# BE RUN SAFELY AS PART OF DOCKER BUILD. BECAUSE IT RUNS INSIDE THE DOCKER CONTAINER AND IT IS
# ISOLATED FROM THE RUNNER.
####################################################################################################
- name: >
Build CI Images ${{needs.build-info.outputs.all-python-versions-list-as-string}}:${{env.IMAGE_TAG}}
uses: ./.github/actions/build-ci-images
Expand Down Expand Up @@ -268,6 +255,15 @@ jobs:
with:
ref: ${{ needs.build-info.outputs.target-commit-sha }}
persist-credentials: false
####################################################################################################
# BE VERY CAREFUL HERE! THIS LINE AND THE END OF THE WARNING. HERE WE CHECK OUT THE TARGET
# COMMIT AND ITS PARENT TO BE ABLE TO COMPARE THEM BUT ALSO TO BE ABLE TO BUILD THE IMAGE FROM
# THE INCOMING PR, RATHER THAN FROM TARGET BRANCH. THIS IS A SECURITY RISK, BECAUSE THE PR
# CAN CONTAIN ANY CODE AND WE EXECUTE IT HERE. THEREFORE, WE NEED TO BE VERY CAREFUL WHAT WE
# DO HERE. WE SHOULD NOT EXECUTE ANY CODE THAT COMES FROM THE PR. WE SHOULD NOT RUN ANY BREEZE
# COMMAND NOR SCRIPTS NOR COMPOSITE ACTIONS. WE SHOULD ONLY RUN CODE THAT IS EMBEDDED DIRECTLY IN
# THIS WORKFLOW - BECAUSE THIS IS THE ONLY CODE THAT WE CAN TRUST.
####################################################################################################
- name: Checkout target branch to 'target-airflow' folder to use ci/scripts and breeze from there.
uses: actions/checkout@v4
with:
Expand All @@ -276,18 +272,26 @@ jobs:
persist-credentials: false
- name: >
Override "scripts/ci", "dev" and ".github/actions" with the target branch
so that the PR does not override it
# We should not override those scripts which become part of the image as they will not be
# changed in the image built - we should only override those that are executed to build
# the image.
so that the PR does not override them
shell: bash
run: |
echo
echo -e "\033[33m Override scripts and dev with target branch ones for non-committer builds!\033[0m"
echo
rm -rfv "scripts/ci"
mv -v "target-airflow/scripts/ci" "scripts"
rm -rfv "dev"
mv -v "target-airflow/dev" "."
rm -rfv ".github/actions"
mv -v "target-airflow/.github/actions" ".github"
if: needs.build-info.outputs.is-committer-build != 'true'
####################################################################################################
# HERE IT'S A BIT SAFER. THE `dev`, `scripts/ci` AND `.github/actions` ARE NOW COMING FROM THE
# BASE_REF - WHICH IS THE TARGET BRANCH OF THE PR. WE CAN TRUST THAT THOSE SCRIPTS ARE SAVE TO RUN.
# ALL THE REST OF THE CODE COMES FROM THE PR, AND FOR EXAMPLE THE CODE IN THE `Dockerfile.ci` CAN
# BE RUN SAFELY AS PART OF DOCKER BUILD. BECAUSE IT RUNS INSIDE THE DOCKER CONTAINER AND IT IS
# ISOLATED FROM THE RUNNER.
####################################################################################################
- name: "Install Breeze"
uses: ./.github/actions/breeze
- name: >
Expand Down Expand Up @@ -330,6 +334,15 @@ jobs:
with:
ref: ${{ needs.build-info.outputs.target-commit-sha }}
persist-credentials: false
####################################################################################################
# BE VERY CAREFUL HERE! THIS LINE AND THE END OF THE WARNING. HERE WE CHECK OUT THE TARGET
# COMMIT AND ITS PARENT TO BE ABLE TO COMPARE THEM BUT ALSO TO BE ABLE TO BUILD THE IMAGE FROM
# THE INCOMING PR, RATHER THAN FROM TARGET BRANCH. THIS IS A SECURITY RISK, BECAUSE THE PR
# CAN CONTAIN ANY CODE AND WE EXECUTE IT HERE. THEREFORE, WE NEED TO BE VERY CAREFUL WHAT WE
# DO HERE. WE SHOULD NOT EXECUTE ANY CODE THAT COMES FROM THE PR. WE SHOULD NOT RUN ANY BREEZE
# COMMAND NOR SCRIPTS NOR COMPOSITE ACTIONS. WE SHOULD ONLY RUN CODE THAT IS EMBEDDED DIRECTLY IN
# THIS WORKFLOW - BECAUSE THIS IS THE ONLY CODE THAT WE CAN TRUST.
####################################################################################################
- name: Checkout target branch to 'target-airflow' folder to use ci/scripts and breeze from there.
uses: actions/checkout@v4
with:
Expand All @@ -338,10 +351,7 @@ jobs:
persist-credentials: false
- name: >
Override "scripts/ci", "dev" and ".github/actions" with the target branch
so that the PR does not override it
# We should not override those scripts which become part of the image as they will not be
# changed in the image built - we should only override those that are executed to build
# the image.
so that the PR does not override them
shell: bash
run: |
rm -rfv "scripts/ci"
Expand All @@ -350,6 +360,14 @@ jobs:
mv -v "target-airflow/dev" "."
rm -rfv ".github/actions"
mv -v "target-airflow/.github/actions" ".github"
if: needs.build-info.outputs.is-committer-build != 'true'
####################################################################################################
# HERE IT'S A BIT SAFER. THE `dev`, `scripts/ci` AND `.github/actions` ARE NOW COMING FROM THE
# BASE_REF - WHICH IS THE TARGET BRANCH OF THE PR. WE CAN TRUST THAT THOSE SCRIPTS ARE SAVE TO RUN.
# ALL THE REST OF THE CODE COMES FROM THE PR, AND FOR EXAMPLE THE CODE IN THE `Dockerfile.ci` CAN
# BE RUN SAFELY AS PART OF DOCKER BUILD. BECAUSE IT RUNS INSIDE THE DOCKER CONTAINER AND IT IS
# ISOLATED FROM THE RUNNER.
####################################################################################################
- name: "Start ARM instance"
run: ./scripts/ci/images/ci_start_arm_instance_and_connect_to_docker.sh
- name: "Install Breeze"
Expand Down
28 changes: 0 additions & 28 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -146,34 +146,6 @@ jobs:
persist-credentials: false
- name: "Install Breeze"
uses: ./.github/actions/breeze
- name: "Retrieve defaults from branch_defaults.py"
id: defaults
# We could retrieve it differently here - by just importing the variables and
# printing them from python code, however we want to have the same code as used in
# the build-images.yml (there we cannot import python code coming from the PR - we need to
# treat the python code as text and extract the variables from there.
run: |
python - <<EOF >> ${GITHUB_ENV}
from pathlib import Path
import re
import sys
DEFAULTS_CONTENT = Path('dev/breeze/src/airflow_breeze/branch_defaults.py').read_text()
BRANCH_PATTERN = r'^AIRFLOW_BRANCH = "(.*)"$'
CONSTRAINTS_BRANCH_PATTERN = r'^DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH = "(.*)"$'
branch = re.search(BRANCH_PATTERN, DEFAULTS_CONTENT, re.MULTILINE).group(1)
constraints_branch = re.search(CONSTRAINTS_BRANCH_PATTERN, DEFAULTS_CONTENT, re.MULTILINE).group(1)
output = f"""
DEFAULT_BRANCH={branch}
DEFAULT_CONSTRAINTS_BRANCH={constraints_branch}
""".strip()
print(output)
# Stdout is redirected to GITHUB_ENV but we also print it to stderr to see it in ci log
print(output, file=sys.stderr)
EOF
- name: "Get information about the Workflow"
id: source-run-info
run: breeze ci get-workflow-info 2>> ${GITHUB_OUTPUT}
Expand Down
2 changes: 1 addition & 1 deletion dev/breeze/doc/08_ci_tasks.rst
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ The selective-check command will produce the set of ``name=value`` pairs of outp
from the context of the commit/PR to be merged via stderr output.

More details about the algorithm used to pick the right tests and the available outputs can be
found in `Selective Checks <dev/breeze/SELECTIVE_CHECKS.md>`_.
found in `Selective Checks <ci/04_selective_checks.md>`_.

These are all available flags of ``selective-check`` command:

Expand Down
Loading

0 comments on commit 6d44d34

Please sign in to comment.