Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SCHEMATIC-183] Use paths from file view for manifest generation #1529

Open
wants to merge 92 commits into
base: develop
Choose a base branch
from

Conversation

GiaJordan
Copy link
Contributor

@GiaJordan GiaJordan commented Oct 30, 2024

Resolves SCHEMATIC-183 and switches to using the path column from synapse file views for file paths when generating manifests and getting files in a dataset.

Tests are provided in a couple of formats per conversations with @thomasyu888

@GiaJordan GiaJordan marked this pull request as draft October 30, 2024 20:35
@GiaJordan GiaJordan changed the title Fds 2293 file paths for manifest gen Use paths from file view for manifest generation Oct 30, 2024
synapse_id=datasetId, syn=self.syn, download_file=False
)
# Identify all folders nested under the dataset folder
folders = synapseutils.walk(self.syn, datasetId, includeTypes=["folder"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: Analysis like this can be done outside of the code. I took a look at one of the NF projects (JHU biobank) and I didn't realize projects have such deep nesting so to be extreme

  • 62 folders within the folder will take around 11 seconds for both solutions (this is an actual folder count)
  • 9000 folders however... I didn't finish the walk solution but it took ~20 seconds for the entityview solution. (This is hypothetical, I'm not sure what the upper bound is - we can find out in snowflake but my power is out) I can show you the query to find out this upper bound if youd like.

One other thought if you had the fileview, is using the path and do something like this ..

select * from .... where path like 'myproject/myfolder/path/%'

That should get you all the files under a folder

tests/integration/test_commands.py Show resolved Hide resolved
@GiaJordan
Copy link
Contributor Author

@BryanFauble I've made some changes and addressed the comments, if you want to review it again it's ready now. The tests that failed previously are passing locally now so there should only be minimal changes if any going forward

@GiaJordan
Copy link
Contributor Author

GiaJordan commented Nov 21, 2024

@BryanFauble have you seen an error like the ones from the latest run of the integration tests action?

--- Logging error ---
Traceback (most recent call last):
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 199, in _new_conn
    sock = connection.create_connection(
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/socket.py", line 967, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
    raise new_e
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 693, in connect
    self.sock = sock = self._new_conn()
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 206, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f1ccf6fc8e0>: Failed to resolve 'dev.sagedpe.org' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='dev.sagedpe.org', port=[44](https://github.com/Sage-Bionetworks/schematic/actions/runs/11942387272/job/33289323317?pr=1529#step:11:45)3): Max retries exceeded with url: /telemetry/v1/traces (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f1ccf6fc8e0>: Failed to resolve 'dev.sagedpe.org' ([Errno -2] Name or service not known)"))

During handling of the above exception, another exception occurred:

@BryanFauble
Copy link
Collaborator

@BryanFauble have you seen an error like the ones from the latest run of the integration tests action?

--- Logging error ---
Traceback (most recent call last):
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 199, in _new_conn
    sock = connection.create_connection(
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/socket.py", line 967, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
    raise new_e
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 693, in connect
    self.sock = sock = self._new_conn()
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 206, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f1ccf6fc8e0>: Failed to resolve 'dev.sagedpe.org' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "/home/runner/work/schematic/schematic/.venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='dev.sagedpe.org', port=[44](https://github.com/Sage-Bionetworks/schematic/actions/runs/11942387272/job/33289323317?pr=1529#step:11:45)3): Max retries exceeded with url: /telemetry/v1/traces (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f1ccf6fc8e0>: Failed to resolve 'dev.sagedpe.org' ([Errno -2] Name or service not known)"))

During handling of the above exception, another exception occurred:

Yeah I am already working to resolve this.

It's probably not the root cause of the test failure, but it's obfuscating what the real reason of the test failures are. I will re-run the GH action when I get this resolved.

@GiaJordan
Copy link
Contributor Author

Yeah I am already working to resolve this.

It's probably not the root cause of the test failure, but it's obfuscating what the real reason of the test failures are. I will re-run the GH action when I get this resolved.

Okay, thanks for the heads up! @BryanFauble

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/test_api.py::TestSynapseStorage::test_get_dataset_files

This test fails locally. I took a look and here's the issue:

  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/flask/app.py", line 1517, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/flask/app.py", line 1503, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/decorators/decorator.py", line 68, in wrapper
    response = function(request)
  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/security/security_handler_factory.py", line 388, in wrapper
    return function(request)
  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
    response = function(request)
  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/decorators/validation.py", line 399, in wrapper
    return function(request)
  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/connexion/decorators/parameter.py", line 120, in wrapper
    return function(**kwargs)
  File "/Users/tyu/sage/schematic/schematic_api/api/routes.py", line 533, in get_files_storage_dataset
    file_lst = store.getFilesInStorageDataset(
  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/Users/tyu/sage/schematic/schematic/store/synapse.py", line 710, in getFilesInStorageDataset
    dataset_path = f"'{self.storageFileviewTable.loc[self.storageFileviewTable['id']==datasetId,'path'][0]}/%'"
  File "/Users/tyu/anaconda3/envs/schematic/lib/python3.10/site-packages/pandas/core/series.py", line 1118, in __getitem__
    return self._values[key]
IndexError: index 0 is out of bounds for axis 0 with size 0

@GiaJordan
Copy link
Contributor Author

Okay with that last comment I think that's everything provided the tests pass.
@BryanFauble and @thomasyu888 I've re-requested your review

Copy link

sonarcloud bot commented Nov 23, 2024

@thomasyu888
Copy link
Member

thomasyu888 commented Nov 23, 2024

I took a deeper look at the test_store.py, please take a look: #1554. re: failing tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants