-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SCHEMATIC-183] Use paths from file view for manifest generation #1529
base: develop
Are you sure you want to change the base?
Conversation
schematic/store/synapse.py
Outdated
synapse_id=datasetId, syn=self.syn, download_file=False | ||
) | ||
# Identify all folders nested under the dataset folder | ||
folders = synapseutils.walk(self.syn, datasetId, includeTypes=["folder"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note: Analysis like this can be done outside of the code. I took a look at one of the NF projects (JHU biobank) and I didn't realize projects have such deep nesting so to be extreme
- 62 folders within the folder will take around 11 seconds for both solutions (this is an actual folder count)
- 9000 folders however... I didn't finish the walk solution but it took ~20 seconds for the entityview solution. (This is hypothetical, I'm not sure what the upper bound is - we can find out in snowflake but my power is out) I can show you the query to find out this upper bound if youd like.
One other thought if you had the fileview, is using the path and do something like this ..
select * from .... where path like 'myproject/myfolder/path/%'
That should get you all the files under a folder
@BryanFauble I've made some changes and addressed the comments, if you want to review it again it's ready now. The tests that failed previously are passing locally now so there should only be minimal changes if any going forward |
@BryanFauble have you seen an error like the ones from the latest run of the integration tests action?
|
Yeah I am already working to resolve this. It's probably not the root cause of the test failure, but it's obfuscating what the real reason of the test failures are. I will re-run the GH action when I get this resolved. |
Okay, thanks for the heads up! @BryanFauble |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests/test_api.py::TestSynapseStorage::test_get_dataset_files
This test fails locally. I took a look and here's the issue:
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/flask/app.py", line 1517, in full_dispatch_request
rv = self.dispatch_request()
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/flask/app.py", line 1503, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/decorators/decorator.py", line 68, in wrapper
response = function(request)
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/security/security_handler_factory.py", line 388, in wrapper
return function(request)
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
response = function(request)
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/decorators/validation.py", line 399, in wrapper
return function(request)
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/decorators/parameter.py", line 120, in wrapper
return function(**kwargs)
File "/Users/tyu/sage/schematic/schematic_api/api/routes.py", line 533, in get_files_storage_dataset
file_lst = store.getFilesInStorageDataset(
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/Users/tyu/sage/schematic/schematic/store/synapse.py", line 710, in getFilesInStorageDataset
dataset_path = f"'{self.storageFileviewTable.loc[self.storageFileviewTable['id']==datasetId,'path'][0]}/%'"
File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/pandas/core/series.py", line 1118, in __getitem__
return self._values[key]
IndexError: index 0 is out of bounds for axis 0 with size 0
Co-authored-by: Thomas Yu <[email protected]>
Okay with that last comment I think that's everything provided the tests pass. |
Quality Gate passedIssues Measures |
I took a deeper look at the |
Resolves SCHEMATIC-183 and switches to using the
path
column from synapse file views for file paths when generating manifests and getting files in a dataset.Tests are provided in a couple of formats per conversations with @thomasyu888