[SCHEMATIC-183] Use paths from file view for manifest generation #1529

Status: Open. Wants to merge 92 commits into base: develop.

Commits (92) — changes shown from 91 commits
8a80ce3  add test for clause method (GiaJordan, Oct 29, 2024)
a2de0c4  add method to process dataset id into query clause (GiaJordan, Oct 29, 2024)
45193a1  use new method for validation (GiaJordan, Oct 29, 2024)
80b6bda  update clause method (GiaJordan, Oct 29, 2024)
d810576  update file based manifest gen test (GiaJordan, Oct 29, 2024)
b1e60af  consolidate filebased manifest gen tests (GiaJordan, Oct 29, 2024)
33796b4  update test layout (GiaJordan, Oct 29, 2024)
62d8b28  use fileview for file paths (GiaJordan, Oct 30, 2024)
8b012d3  add functionality to just return filename (GiaJordan, Oct 31, 2024)
8e670ad  make syn id regex a util function (GiaJordan, Oct 31, 2024)
c90ebf0  add non-api integration test for detFilesInStorageDataset (GiaJordan, Oct 31, 2024)
f9b9bcc  fix typo and mismatched ids (GiaJordan, Oct 31, 2024)
be4f6b6  add case for nested data structure (GiaJordan, Oct 31, 2024)
b39afe2  get nested files as well (GiaJordan, Nov 1, 2024)
566e741  add return type annotation (GiaJordan, Nov 1, 2024)
86fb15d  add docstring (GiaJordan, Nov 1, 2024)
2fa1ad6  add str prefix (GiaJordan, Nov 1, 2024)
cd6713b  revert param name (GiaJordan, Nov 4, 2024)
9438811  change datasetid clause method (GiaJordan, Nov 4, 2024)
e076770  get files in doubly+ nested files (GiaJordan, Nov 5, 2024)
19188ea  add comments (GiaJordan, Nov 5, 2024)
c46a100  add test cases for filtering results (GiaJordan, Nov 5, 2024)
ccead76  add test case for filtered results (GiaJordan, Nov 7, 2024)
26e5539  Update README.md (jaymedina, Oct 21, 2024)
2f835bd  Update README.md (jaymedina, Oct 21, 2024)
f228245  Update README.md (jaymedina, Oct 22, 2024)
ae11b85  updated data model type rules to include error param (andrewelamb, Oct 25, 2024)
2daacb9  fix validate type attribute to use msg level param (andrewelamb, Oct 25, 2024)
2982d8e  added error handling (andrewelamb, Oct 25, 2024)
a1e0783  run black (andrewelamb, Oct 25, 2024)
450fbdf  Update CODEOWNERS (thomasyu888, Nov 1, 2024)
b16bf55  Update scan_repo.yml (thomasyu888, Nov 1, 2024)
8d50e1a  Update .github/CODEOWNERS (thomasyu888, Nov 1, 2024)
1336fc6  Update .github/workflows/scan_repo.yml (thomasyu888, Nov 1, 2024)
c61f39c  Attach additional telemetry data to OTEL traces (#1519) (BryanFauble, Nov 1, 2024)
ce4d642  feat: added tracing for cross manifest validation and file name valid… (linglp, Nov 1, 2024)
31f3f1d  Updating contribution doc to expect squash and merge (#1534) (BryanFauble, Nov 5, 2024)
256403c  [FDS-2491] Integration tests for Schematic API Test plan (#1512) (BryanFauble, Nov 5, 2024)
856fef6  [FDS-2500] Add Integration Tests for: Manifest Validation (#1516) (jaymedina, Nov 6, 2024)
d6fc9ad  [FDS-2449] Lock `sphinx` version and update `poetry.lock` (#1530) (jaymedina, Nov 7, 2024)
08008ae  filter based on filenames if given (GiaJordan, Nov 8, 2024)
d0aa01d  change manifest exclusion method (GiaJordan, Nov 8, 2024)
38cedd5  Update file annotation store process to require filename be present i… (BryanFauble, Nov 7, 2024)
22f0bba  Revert "Update file annotation store process to require filename be p… (BryanFauble, Nov 7, 2024)
4580e06  Don't attempt to annotate the table (BryanFauble, Nov 7, 2024)
ce5c349  Updates for integration test failures (#1537) (BryanFauble, Nov 12, 2024)
d661b9a  add test for bug case (GiaJordan, Nov 11, 2024)
951a061  update test for table tidyness (GiaJordan, Nov 11, 2024)
89fb9a8  remove unused import (GiaJordan, Nov 11, 2024)
0c9e773  remove etag column if already present when building temp file view (GiaJordan, Nov 11, 2024)
ab4ece7  catch all exceptions to switch to sequential mode (GiaJordan, Nov 12, 2024)
65acb33  update test for updated data (GiaJordan, Nov 12, 2024)
2e6d51f  Revert "update test for updated data" (GiaJordan, Nov 12, 2024)
9a00288  Revert "catch all exceptions to switch to sequential mode" (GiaJordan, Nov 12, 2024)
2170974  catch ValueErrors as well (GiaJordan, Nov 12, 2024)
65fb55d  [FDS-2525] Authenticated export of telemetry data (#1527) (BryanFauble, Nov 13, 2024)
5e891ef  update mocking for unit tests (GiaJordan, Nov 13, 2024)
4d6fd09  Merge branch 'develop' into fds-2293-file-paths-for-manifest-gen (thomasyu888, Nov 14, 2024)
5f5cc43  update test assertions for format (GiaJordan, Nov 14, 2024)
3f90ef0  update tests assertions (GiaJordan, Nov 14, 2024)
f4ad7a1  add mocked integration test for getting dataset files (GiaJordan, Nov 15, 2024)
ee43ad6  use dataset clause method (GiaJordan, Nov 15, 2024)
5a4cd90  Revert "add mocked integration test for getting dataset files" (GiaJordan, Nov 15, 2024)
59c5a69  add mocked test for get files in dataset (GiaJordan, Nov 15, 2024)
d2eee35  clean comments (GiaJordan, Nov 15, 2024)
0eddd47  remove comment (GiaJordan, Nov 15, 2024)
5fcbeb1  add test ids (GiaJordan, Nov 15, 2024)
a1b0c90  remove unneeded param (GiaJordan, Nov 15, 2024)
c43debc  add ids (GiaJordan, Nov 15, 2024)
89a2b9f  use syn store fixture (GiaJordan, Nov 15, 2024)
9cdad89  change variables (GiaJordan, Nov 15, 2024)
51983bc  change to global var (GiaJordan, Nov 18, 2024)
84011b0  change case (GiaJordan, Nov 18, 2024)
3471e18  change case (GiaJordan, Nov 18, 2024)
82082be  update test for dataset clause (GiaJordan, Nov 19, 2024)
c542241  update use of dataset clause method (GiaJordan, Nov 19, 2024)
8a314a7  comments (GiaJordan, Nov 19, 2024)
bc9f479  change test name case (GiaJordan, Nov 19, 2024)
79f5bda  update descriptions (GiaJordan, Nov 19, 2024)
5d0599b  undo development change (GiaJordan, Nov 20, 2024)
c1cd79a  add comment (GiaJordan, Nov 20, 2024)
feb82e8  remove dev work (GiaJordan, Nov 20, 2024)
c1587c6  remove temp test marks (GiaJordan, Nov 20, 2024)
2769d6a  update test for new expected order (GiaJordan, Nov 20, 2024)
dd9d6a9  change method for gathering files in a dataset (GiaJordan, Nov 20, 2024)
800a4cd  update mock test (GiaJordan, Nov 20, 2024)
95cfeed  update other mocked test (GiaJordan, Nov 20, 2024)
9701078  change method for building dataset path (GiaJordan, Nov 22, 2024)
7cd0ce0  wrap path in quotes (GiaJordan, Nov 22, 2024)
65dcc35  update quotes for dataset path (GiaJordan, Nov 22, 2024)
f9d037d  reformat and add exception (GiaJordan, Nov 22, 2024)
e07f973  add unit test (GiaJordan, Nov 22, 2024)
Files changed
2 changes: 2 additions & 0 deletions schematic/manifest/generator.py
@@ -1904,6 +1904,8 @@ def get_manifest(
# TODO: avoid explicitly exposing Synapse store functionality
# just instantiate a Store class and let it decide at runtime/config
# the store type
# TODO: determine which parts of fileview are necessary for `get` operations
# and pass query parameters at object instantiation to avoid having to re-query
if access_token:
# for getting an existing manifest on AWS
store = SynapseStorage(access_token=access_token)
4 changes: 3 additions & 1 deletion schematic/models/validate_attribute.py
@@ -2119,7 +2119,9 @@ def filename_validation(

where_clauses = []

dataset_clause = f"parentId='{dataset_scope}'"
dataset_clause = SynapseStorage.build_clause_from_dataset_id(
dataset_id=dataset_scope
)
where_clauses.append(dataset_clause)

self._login(
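For orientation, here is a minimal sketch (not part of the diff) of the clause strings the new `SynapseStorage.build_clause_from_dataset_id` helper produces, based on its signature and docstring in the schematic/store/synapse.py changes below; the Synapse IDs are made up for illustration.

```python
# Illustrative only; the syn IDs below are hypothetical.
from schematic.store.synapse import SynapseStorage

# Files stored directly under the dataset folder: scope by the dataset ID alone
assert (
    SynapseStorage.build_clause_from_dataset_id(dataset_id="syn11111111")
    == "parentId='syn11111111'"
)

# Nested datasets: scope by the dataset folder and all of its subfolders
assert SynapseStorage.build_clause_from_dataset_id(
    dataset_folder_list=["syn11111111", "syn22222222"]
) == "parentId IN ('syn11111111', 'syn22222222')"

# No IDs given: empty string, so the file view query is not scoped
assert SynapseStorage.build_clause_from_dataset_id() == ""
```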
158 changes: 65 additions & 93 deletions schematic/store/synapse.py
@@ -416,6 +416,30 @@ def query_fileview(
else:
raise AccessCredentialsError(self.storageFileview)

@staticmethod
def build_clause_from_dataset_id(
dataset_id: Optional[str] = None, dataset_folder_list: Optional[list] = None
) -> str:
"""
Method to build a where clause for a Synapse FileView query based on a dataset ID that can be used before an object is initialized.
Args:
dataset_id: Synapse ID of a dataset that should be used to limit the query
dataset_folder_list: List of Synapse IDs of a dataset and all its subfolders that should be used to limit the query
Returns:
clause for the query or an empty string if no dataset ID is provided
"""
# Calling this method without specifying synIDs will complete but will not scope the view
if (not dataset_id) and (not dataset_folder_list):
return ""

# This will be used to gather files under a dataset recursively with a fileview query instead of walking
if dataset_folder_list:
search_folders = ", ".join(f"'{synId}'" for synId in dataset_folder_list)
return f"parentId IN ({search_folders})"

# `dataset_id` should be provided when all files are stored directly under the dataset folder
return f"parentId='{dataset_id}'"

def _build_query(
self, columns: Optional[list] = None, where_clauses: Optional[list] = None
):
Expand Down Expand Up @@ -666,7 +690,7 @@ def getStorageDatasetsInProject(self, projectId: str) -> list[tuple[str, str]]:
def getFilesInStorageDataset(
self, datasetId: str, fileNames: List = None, fullpath: bool = True
) -> List[Tuple[str, str]]:
"""Gets all files in a given dataset folder.
"""Gets all files (excluding manifest files) in a given dataset folder.

Args:
datasetId: synapse ID of a storage dataset.
@@ -680,105 +704,53 @@ def getFilesInStorageDataset(
Raises:
ValueError: Dataset ID not found.
"""
# select all files within a given storage dataset folder (top level folder in
# a Synapse storage project or folder marked with contentType = 'dataset')
walked_path = synapseutils.walk(
self.syn, datasetId, includeTypes=["folder", "file"]
)
file_list = []

current_entity_location = self.synapse_entity_tracker.get(
synapse_id=datasetId, syn=self.syn, download_file=False
)
# Get path to dataset folder by using children to avoid cases where the dataset is the scope of the view
child_path = self.storageFileviewTable.loc[
self.storageFileviewTable["parentId"] == datasetId, "path"
]
if child_path.empty:
raise LookupError(
f"Dataset {datasetId} could not be found in fileview {self.storageFileview}."
)
child_path = child_path.iloc[0]

def walk_back_to_project(
current_location: Entity, location_prefix: str, skip_entry: bool
) -> str:
"""
Recursively walk back up the project structure to get the paths of the
names of each of the directories where we started the walk function.

Args:
current_location (Entity): The current entity location in the project structure.
location_prefix (str): The prefix to prepend to the path.
skip_entry (bool): Whether to skip the current entry in the path. When
this is True it means we are looking at our starting point. If our
starting point is the project itself we can go ahead and return
back the project as the prefix.

Returns:
str: The path of the names of each of the directories up to the project root.
"""
if (
skip_entry
and "concreteType" in current_location
and current_location["concreteType"] == PROJECT_ENTITY
):
return f"{current_location.name}/{location_prefix}"
# Get the dataset path by eliminating the child's portion of the path to account for nested datasets
parent = child_path.split("/")[:-1]
parent = "/".join(parent)

updated_prefix = (
location_prefix
if skip_entry
else f"{current_location.name}/{location_prefix}"
)
if (
"concreteType" in current_location
and current_location["concreteType"] == PROJECT_ENTITY
):
return updated_prefix
current_location = self.synapse_entity_tracker.get(
synapse_id=current_location["parentId"],
syn=self.syn,
download_file=False,
)
return walk_back_to_project(
current_location=current_location,
location_prefix=updated_prefix,
skip_entry=False,
)
# Format dataset path to be used in table query
dataset_path = f"'{parent}/%'"

prefix = walk_back_to_project(
current_location=current_entity_location,
location_prefix="",
skip_entry=True,
)
# When querying, only include file entities so that folders and subdirectories are excluded
where_clauses = [f"path like {dataset_path}", "type='file'"]

project_id = self.getDatasetProject(datasetId)
project = self.synapse_entity_tracker.get(
synapse_id=project_id, syn=self.syn, download_file=False
)
project_name = project.name
file_list = []
# Requery the fileview to specifically get the files in the given dataset
self.query_fileview(columns=["id", "path"], where_clauses=where_clauses)

# iterate over all results
for dirpath, _, path_filenames in walked_path:
# iterate over all files in a folder
for path_filename in path_filenames:
if ("manifest" not in path_filename[0] and not fileNames) or (
fileNames and path_filename[0] in fileNames
):
# don't add manifest to list of files unless it is specified in the
# list of specified fileNames; return all found files
# except the manifest if no fileNames have been specified
# TODO: refactor for clarity/maintainability

if fullpath:
# append directory path to filename
if dirpath[0].startswith(f"{project_name}/"):
path_without_project_prefix = (
dirpath[0] + "/"
).removeprefix(f"{project_name}/")
path_filename = (
prefix + path_without_project_prefix + path_filename[0],
path_filename[1],
)
else:
path_filename = (
prefix + dirpath[0] + "/" + path_filename[0],
path_filename[1],
)
# Exclude manifest files
non_manifest_files = self.storageFileviewTable.loc[
Inline review thread on this line:

thomasyu888 (Member): Would it be worthwhile documenting somewhere that if people upload a file with "manifest" in the name it will be included here?

GiaJordan (Contributor, Author): @thomasyu888 I think we could note that higher up in the docstring, instead of under one of the parameters, and in the description of the storage/dataset/files endpoint. As an aside, do we know of any users who don't use the default basename and change it in their configs? I guess we could also just check for both.

thomasyu888 (Member): @GiaJordan Good point. For now, I searched "basename" in data_curator_config (https://github.com/search?q=repo%3ASage-Bionetworks%2Fdata_curator_config%20basename&type=code) and nothing was returned, but we'll probably want to watch for that eventually and pull from the basename config instead of using "manifest", because people can technically change that. Can you help create a ticket for that to track?

GiaJordan (Contributor, Author): Yes, I can. It could also technically be changed locally in config.yml, but I'm not aware of any projects that currently change that parameter.

~self.storageFileviewTable["path"].str.contains("synapse_storage_manifest"),
:,
]

# Remove all files that are not in the list of fileNames
if fileNames:
filename_regex = "|".join(fileNames)

matching_files = non_manifest_files["path"].str.contains(
filename_regex, case=False, regex=True
)

non_manifest_files = non_manifest_files.loc[matching_files, :]

# Truncate path if necessary
if not fullpath:
non_manifest_files.path = non_manifest_files.path.apply(os.path.basename)

# add file name file id tuple, rearranged so that id is first and name follows
file_list.append(path_filename[::-1])
# Return list of files as expected by other methods
file_list = list(non_manifest_files.itertuples(index=False, name=None))

return file_list

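For orientation, here is a self-contained sketch (not part of the diff) of the post-query filtering the refactored `getFilesInStorageDataset` applies, replayed on a toy pandas DataFrame; the column names (`id`, `path`) and filter logic mirror the diff, while the Synapse IDs, paths, and file names are made up.

```python
import os

import pandas as pd

# Toy stand-in for the scoped file view returned by query_fileview(columns=["id", "path"], ...)
fileview = pd.DataFrame(
    {
        "id": ["syn101", "syn102", "syn103"],
        "path": [
            "my_project/dataset_a/sample1.fastq",
            "my_project/dataset_a/nested/sample2.fastq",
            "my_project/dataset_a/synapse_storage_manifest.csv",
        ],
    }
)

# Exclude manifest files, mirroring ~str.contains("synapse_storage_manifest")
non_manifest_files = fileview.loc[
    ~fileview["path"].str.contains("synapse_storage_manifest"), :
]

# Keep only requested file names; the diff joins them into a single regex
fileNames = ["sample1.fastq"]
if fileNames:
    filename_regex = "|".join(fileNames)
    matching_files = non_manifest_files["path"].str.contains(
        filename_regex, case=False, regex=True
    )
    non_manifest_files = non_manifest_files.loc[matching_files, :]

# Truncate to basenames when fullpath=False
fullpath = False
if not fullpath:
    non_manifest_files = non_manifest_files.assign(
        path=non_manifest_files["path"].apply(os.path.basename)
    )

# (id, path) tuples, as returned to callers
file_list = list(non_manifest_files.itertuples(index=False, name=None))
print(file_list)  # [('syn101', 'sample1.fastq')]
```

One consequence of this approach: because the requested file names are joined into a single regular expression, names containing regex metacharacters are interpreted as patterns rather than literal strings.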
9 changes: 5 additions & 4 deletions schematic/utils/cli_utils.py
@@ -4,10 +4,11 @@
# pylint: disable=anomalous-backslash-in-string

import logging

from typing import Any, Mapping, Sequence, Union, Optional
from functools import reduce
import re
from functools import reduce
from typing import Any, Mapping, Optional, Sequence, Union

from schematic.utils.general import SYN_ID_REGEX

logger = logging.getLogger(__name__)

@@ -69,7 +70,7 @@ def parse_syn_ids(
if not syn_ids:
return None

project_regex = re.compile("(syn\d+\,?)+")
project_regex = re.compile(SYN_ID_REGEX)
valid = project_regex.fullmatch(syn_ids)

if not valid:
2 changes: 2 additions & 0 deletions schematic/utils/general.py
@@ -24,6 +24,8 @@

T = TypeVar("T")

SYN_ID_REGEX = r"(syn\d+\,?)+"


def find_duplicates(_list: list[T]) -> set[T]:
"""Find duplicate items in a list"""
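For orientation, here is a minimal sketch (not part of the diff) of how `parse_syn_ids` in the cli_utils.py changes above consumes the shared constant; the input strings are made up.

```python
import re

SYN_ID_REGEX = r"(syn\d+\,?)+"  # as defined in schematic/utils/general.py

project_regex = re.compile(SYN_ID_REGEX)

assert project_regex.fullmatch("syn123") is not None
assert project_regex.fullmatch("syn123,syn456") is not None  # comma-separated list
assert project_regex.fullmatch("syn123, syn456") is None     # spaces are rejected
assert project_regex.fullmatch("123456") is None             # missing the "syn" prefix
```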
4 changes: 2 additions & 2 deletions schematic_api/api/openapi/api.yaml
@@ -692,8 +692,8 @@ paths:
- Synapse Storage
/storage/dataset/files:
get:
summary: Get all files in a given dataset folder
description: Get all files in a given dataset folder
summary: Get all files (excluding manifest files) in a given dataset folder
description: Get all files (excluding manifest files) in a given dataset folder
operationId: schematic_api.api.routes.get_files_storage_dataset
security:
- access_token: []
24 changes: 12 additions & 12 deletions tests/integration/test_commands.py
@@ -4,14 +4,14 @@
import uuid
from io import BytesIO

import numpy as np
import pandas as pd
import pytest
import requests
from openpyxl import load_workbook
from click.testing import CliRunner
import pandas as pd
import numpy as np
from openpyxl import load_workbook

from schematic.configuration.configuration import Configuration, CONFIG
from schematic.configuration.configuration import CONFIG, Configuration
from schematic.manifest.commands import manifest
from schematic.models.commands import model
from tests.conftest import ConfigurationForTesting
@@ -665,18 +665,18 @@ def test_generate_bulk_rna_google_sheet_manifest(
# Reset config to it's default values
CONFIG.load_config("config_example.yml")

assert result.output.split("\n")[7] == (
assert result.output.split("\n")[8] == (
"Find the manifest template using this Google Sheet URL:"
)
assert result.output.split("\n")[8].startswith(
assert result.output.split("\n")[9].startswith(
"https://docs.google.com/spreadsheets/d/"
)
assert result.output.split("\n")[9] == (
assert result.output.split("\n")[10] == (
"Find the manifest template using this CSV file path: "
"./CLI_gs_bulk_rna.csv"
)

google_sheet_url = result.output.split("\n")[8]
google_sheet_url = result.output.split("\n")[9]

# Download the Google Sheets content as an Excel file and load into openpyxl
export_url = f"{google_sheet_url}/export?format=xlsx"
@@ -908,18 +908,18 @@ def test_generate_bulk_rna_google_sheet_manifest_with_annotations(
os.remove("tests/data/example.BulkRNA-seqAssay.schema.json")
os.remove("./CLI_gs_bulk_rna_annos.csv")

assert result.output.split("\n")[10] == (
assert result.output.split("\n")[11] == (
"Find the manifest template using this Google Sheet URL:"
)
assert result.output.split("\n")[11].startswith(
assert result.output.split("\n")[12].startswith(
"https://docs.google.com/spreadsheets/d/"
)
assert result.output.split("\n")[12] == (
assert result.output.split("\n")[13] == (
"Find the manifest template using this CSV file path: "
"./CLI_gs_bulk_rna_annos.csv"
)

google_sheet_url = result.output.split("\n")[11]
google_sheet_url = result.output.split("\n")[12]

# Download the Google Sheets content as an Excel file and load into openpyxl
export_url = f"{google_sheet_url}/export?format=xlsx"