Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove top-level directory from extracted files if it has only single child #401

Closed
3 tasks
dkrako opened this issue May 11, 2023 · 1 comment · Fixed by #579
Closed
3 tasks

Remove top-level directory from extracted files if it has only single child #401

dkrako opened this issue May 11, 2023 · 1 comment · Fixed by #579
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers
Milestone

Comments

@dkrako
Copy link
Contributor

dkrako commented May 11, 2023

Description of the problem

Some datasets have a top-level directory in the archive that has just a single node in it.
This is unnecessary and creates much longer filepaths than needed.

Description of a solution

Instead it would be nice to remove top-level directories by default if they only have a single node.

Example:
after extracting the archive archive.zip into the directory archive we get the following filestructure:

archive/mydata/
archive/mydata/1.csv
archive/mydata/2.csv
archive/mydata/3.csv
archive/mydata/4.csv
archive/mydata/5.csv

This feature would then instead create the following structure:

archive/
archive/1.csv
archive/2.csv
archive/3.csv
archive/4.csv
archive/5.csv

But in case of

archive/mydata/
archive/mydata/1.csv
archive/mydata/2.csv
archive/mydata/3.csv
archive/mydata/4.csv
archive/mydata/5.csv
archive/README.md

The structure would have to be left untouched.

Minimum acceptance criteria

  • remove top-level directories by default in Dataset.extract() and utils.archives.extract()
  • add an optional parameter to these methods (for example remove_top_level: bool = True)
  • tests in tests/utils/archives_test.py
@dkrako dkrako added the enhancement New feature or request label May 11, 2023
@dkrako dkrako added the good first issue Good for newcomers label Sep 1, 2023
@dkrako dkrako added this to the Sprint 17 milestone Sep 1, 2023
@dkrako
Copy link
Contributor Author

dkrako commented Sep 7, 2023

I have just checked how utils.archives.extract() is currently tested in tests/utils/archives_test.py.

I think it would be best to just adjust fixture_archive such that it always creates one top-level directory with a single child.

The current tests just check if the function executes without errors but doesn't check for the extracted files:

@pytest.mark.parametrize(
'recursive',
[
pytest.param(False, id='recursive_false'),
pytest.param(True, id='recursive_true'),
],
)
@pytest.mark.parametrize(
'remove_finished',
[
pytest.param(False, id='remove_finished_false'),
pytest.param(True, id='remove_finished_true'),
],
)
def test_extract_archive_destination_path_None(recursive, remove_finished, archive):
extract_archive(
source_path=archive,
destination_path=None,
recursive=recursive,
remove_finished=remove_finished,
)

If you have the capacity, it would be nice to check the extracted file structure for the existing cases too.
It should be sufficient to just check if the respective directory has the expected set of files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
2 participants