-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use pooch library to manage test files #14
Comments
We've added a |
I looked into this a little bit this morning. Some initial thoughts.
My thought was to
I've written a script to create a registry of the files so from this point things should be fairly simple |
In orix' case this would indeed simplify the links in the registry. I used pooch the first time in kikuchipy, where files in the registry come from different sources (the package itself, the kikuchipy-data repo and various Zenodo repos), and therefore didn't have any base URL to point to. I then just fitted kikuchipy's use of pooch to orix, hence the use of no base URL. |
If we intend to make the RoscettaSciIO data module or parts of it part of the public API, the docs should take some inspiration from the scipy.datasets module. |
The SciPy datasets module has a |
I don't think that making the data public is terribly helpful. Most of the data is empty or very sparse/small. If we wanted something like a data module it's probably better to have that in downstream packages.
That is a good idea. I can add that to the list of things to add. |
It might also be worth thinking about how new data is added for new tests and how that works with the CI.
|
I'd go with two separate PRs in the beginning, and then later on see if there is a benefit to automate parts of the process. It might also be that adding a dataset function is not as straight-forward as a 1-1 mapping to a particular file in the data repo, which might not be as easy to automate. |
So for the CI then you would have something like:
It's not the best process but I think it would work. It's going to end up being a little funny regarding merging. For example if the PR for the rosettasciio-data isn't merged first than the CI will fail once the rosettasciio PR is merged. |
Yes, this is the important part. :) Is it necessary to have a separate repository? If not, what is the advantage? Is it to simplify avoid having many version of the cache when developing rosettasciio? Can the workflow be simplify as:
I don't know how pooch work but from what I read this is what I would expect. |
So I think that calling something like the fetch method in path_dict = {path: _fetch(path) for path in tia_files]
# loading the file in some test
def test_hs_load():
hs.load(path_dict["64x64_TEM_images_acquire.emi"] |
So @hakonanes can probably answer better than I can but I'll give it a shot.
The separate repository is just to have a repository of data somewhere. We could also host the data someplace like Zenodo or any other number of places. We can also link to data inside of rosettasciio: Something like:
Will set up a link to 'msa_files/example2.msa' to .cache/rosettasciio/example2.msa` Pooch It will go fetch the data from wherever as long as we give it a URL. I think the intention is to decrease the size of Rosettasciio, correct? Which requires that we move the data, Maybe we should think about if reducing the size of the test module makes sense. The test datasets shouldn't be downloaded except if you are doing development so I don't know if there is too much reason not to include the test data.
Yea something like that would probably work. More than anything we probably need to have a defined workflow for how things should work. |
Well could we keep the data in the same repository, but exclude it from releases? If we move all data files to one folder |
This seems like a good way to handle things. Just so I understand this, the export-ignore only refers to releases etc. If I was to clone the repo I would still get the test files? In that case is there much of a reason to have pooch fetch the data? Maybe if we want to make the .data attribute public but I kind of feel like no one really wants to see any of the test data. It would be far more useful to make a curated repo of good datasets and have those available for people to play with. |
@ericpre, yes, Pooch is quite straight-forward to use. One caveat I encountered with kikuchipy was that I wanted to update a file already registered with Pooch. In cases like this it is important that files are served from permanent locations, like a permanent link from GitHub (pointing to a commit, as specified here) or a version of a Zenodo repo. One can then just update the file and point the registry to the new permanent link or Zenodo repo version.
An alternative to git attributes is to specify files to include/exclude in a MANIFEST.in file or similar. |
To be more precise, I think that the use case we are interested in is:
Would it be possible to have such a function to access data: def get_file_path(file_path):
# `file_path` is the path to the file relative to test directory
if os.path.exists(path):
return file_path
else:
# download the file to the cache and get the path
path = ...
return path It may need some tweaks to play well with the pooch registry but if this is possible, then it means that when adding test file in a PR, we will only need updating the pooch registry and not create an additional PR. In development install from local git repository, the file will also be available through git and not download anything and for released version, the file will be downloaded and accessed from the cache. |
Your example snippet is exactly what we do in kikuchipy. We can bypass Pooch, but still have to check whether the hash is correct. |
Thanks @hakonanes for the information. In the case of rosettasciio (test data file), do we need to check the hash? Git will manage the file and actually, in development scenario, this is rather the opposite, we may want to change the file and run the test suite against the new file. CI will use the local file after checking out the repository using git and pooch should only be used in the conda-forge feedstock or to check user installation, etc. |
The hash check is mainly a security measure, which I guess can be bypassed in a test where the file origin can be verified otherwise. |
Just so I get this right the idea is to:
Kind of wondering if there is a good use case for 2 or if we are reaching there. I like the concept of remote data storage and it is necessary if you are trying to store things over 100 mbs. What I don't love is having multiple sources all over because that seems hard to track. A couple of things to consider.
I'm more for 3 or 4. It might be worth seeing if zipping the data files makes sense. Most of the files are 0 so we could probably reduce their size by a factor of 5 at least. Pooch just seems to add lots of complications. It is great for things like adding data for api access but for testing data it seems like it falls flat |
There will only one source, the only difference is how to access it: either local (development scenario) or through pooch when it is available in the package (distribution scenario)
There should be anything to done for this point and there is nothing special regarding the life cycle of the test data file in the git repository. Maybe to be more clear, there is an example with edax test files, which is done manually and pooch will standardise it so that it would be easier to extend its usage. The main point is to reduce the size of the wheels and source dist and not being limited by then.
Yes, 3 and 4. :)
We had a look at this and this is done for some test file (for example FEI emd) but it isn't convenient in development scenario... The problem is that it is quite easy to have files which are in the 50-200KB range and it adds up...
Yes, this is a very valid question. I expect that the initial effort to set it up will be useful to also add a registry for dataset. In my opinion, we need to be able the files installed in different environment and at the same distribute wheels or conda packages which are 30 MB or is not good. |
Describe the functionality you would like to see.
Use pooch to manage test files and avoid packaging it, while still being able to run the test suite from distribution.
Describe the context
hyperspy/hyperspy#2816
#11 (comment)
Additional information
@hakonanes has experience with using it in kikuchipy.
The text was updated successfully, but these errors were encountered: