Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve portability of reproducible tarballs by replacing external tar command with tarfile module #4660

Open
wants to merge 17 commits into
base: 5.0.x
Choose a base branch
from

Conversation

lexming
Copy link
Contributor

@lexming lexming commented Sep 27, 2024

fixes #4657

  • use more portable --date argument for touch
  • catch failed commands inside the pipeline
  • move generation of command to make reproducible archives intro its own method
  • replace harcoded pattern in tests of reproducible archives command for call to filetools.reproducible_archive_cmd
  • make new implementation of reproducible_archive_cmd using the tarfile module
  • added new filetools.make_archive() method and related unit test
  • change handling of filename argument in filetools.get_source_tarball_from_git() to expect filenames without extension
    • archive extension is appended as needed by make_archive()
    • known archive extensions are automatically removed from filename if it contains one
  • add required argument to filetools.find_extensions()
  • checksums of sources from git repos are only verified with Python 3.9+

@boegel boegel added this to the 5.0 milestone Oct 2, 2024
@boegel boegel changed the title use more portable --date argument for touch command used in reproducible tarballs use more portable --date argument for touch command used in reproducible tarballs Oct 2, 2024
test/framework/filetools.py Outdated Show resolved Hide resolved
@lexming lexming changed the title use more portable --date argument for touch command used in reproducible tarballs improve portability of reproducible tarballs by replacing external tar command with tarfile module Oct 9, 2024
@lexming
Copy link
Contributor Author

lexming commented Nov 6, 2024

@boegel This one is ready. As discussed, archives will be made with tarfile instead of the tar command and checksums of sources from git repos will only be checked with Python 3.9+.

@boegel boegel changed the title improve portability of reproducible tarballs by replacing external tar command with tarfile module improve portability of reproducible tarballs by replacing external tar command with tarfile module Nov 13, 2024

# cleanup (repo_name dir does not exist in dry run mode)
remove(tmpdir)

return archive_path


def make_archive(dir_path, archive_name=None, archive_dir=None, reproducible=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this to make_tar_xz, since we're creating a very specific archive (a tarball), compressed with a hardcoded compression algorithms (XZ).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe even make_reproducible_tar_xz

if isinstance(filename, dict):
filename = filename['filename']
chksum_input = filename['filename']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lexming Should we be more careful here, and also use filename.get('filename', None) to avoid crashing if the filename key is not set?

# checksums of tarballs made by EB of git repos cannot be reliably checked prior to Python 3.9
if chksum_input_git is not None:
self.log.deprecated(
"Reproducible tarballs of git repos are only supported in Python 3.9+. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Reproducible tarballs of git repos are only supported in Python 3.9+. "
"Reproducible tarballs of Git repos are only possible when using Python 3.9+ to run EasyBuild. "

raise EasyBuildError("git_config currently only supports filename ending in .tar.gz")
file_ext = find_extension(filename, required=False)
if file_ext:
print_warning(f"Ignoring extension of filename '{filename}' set in git_config parameter")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this going to require cleaning up a whole bunch of easyconfigs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than printing a warning, we should enforce the use of .tar.xz as extension here (to "ease" the transition, in some way)

Comment on lines +2798 to +2799
a reproducible archive. Other formats like .gz are not reproducible due to
arbitrary strings and timestamps getting added into their metadata.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is new to me, on what source is this based, do we have a reference we can point to?

# reset file permissions by applying go+u,go-w
user_mode = tarinfo.mode & stat.S_IRWXU
group_mode = (user_mode >> 3) & ~stat.S_IWGRP # user mode without write
other_mode = group_mode >> 3 # user mode without write
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
other_mode = group_mode >> 3 # user mode without write
other_mode = group_mode >> 3 # same as group mode

tarinfo.mode = (tarinfo.mode & ~0o77) | group_mode | other_mode
# reset ownership numeric UID/GID 0
tarinfo.uid = tarinfo.gid = 0
tarinfo.uname = tarinfo.gname = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, why does empty user/group name make sense, what does this imply?

return None

if sys.version_info[0] >= 3 and sys.version_info[1] < 9:
# ignore any checksum for given filename due to changes in python/cpython#90021
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# ignore any checksum for given filename due to changes in python/cpython#90021
# ignore any checksum for given filename due to changes in https://github.com/python/cpython/issues/90021

return archive_path

# TODO: replace with TarFile.add(recursive=True) when support for Python 3.6 drops
# since Python v3.7 tarfile automatically orders the list of files added to the archive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a reference for this?

archive.add(filepath, arcname=file_name, recursive=False, filter=archive_filter)
_log.debug("File/folder added to archive '%s': %s", archive_filename, filepath)

_log.info("Archive '%s' created successfully", archive_filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_log.info("Archive '%s' created successfully", archive_filename)
_log.info(f"Archive '{archive_path}' created successfully")

@boegel
Copy link
Member

boegel commented Nov 13, 2024

@lexming We need to make sure that git_config entries that still use .tar.gz as extension don't leading to clone+tar+xz every time, we also need to check whether the source filename with .tar.xz already exists?

raise EasyBuildError("git_config currently only supports filename ending in .tar.gz")
file_ext = find_extension(filename, required=False)
if file_ext:
print_warning(f"Ignoring extension of filename '{filename}' set in git_config parameter")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than printing a warning, we should enforce the use of .tar.xz as extension here (to "ease" the transition, in some way)

@boegel
Copy link
Member

boegel commented Nov 28, 2024

TODO here:

  • change so function that creates tarballs for git_config sources honors the provided file extension;
  • only .tar.xz can be produced reproducibly (and only using Python >= 3.9);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Blockers
Development

Successfully merging this pull request may close these issues.

2 participants