Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Update ingest-fastq, allow to match samples against assay table to determine collections names (#198) #203

Conversation

Nicolai-vKuegelgen
Copy link
Contributor

@Nicolai-vKuegelgen Nicolai-vKuegelgen commented Nov 8, 2023

This was initially only intended to solve #198 , but I ended up adding some additional changes / enhancements to other aspects of cubi-tk sodar ingest-fastq and other (cubi-tk snappy itransfer functions).

Added Features:

  • Added the --match-column flag that enables comparing the contents of the "(??)" match group from the src-regex against any specified column of the assay table of a given project (automatically determined from destination) to determine the (new) 'collection_name' parameter in the destination pattern based on the last material column of the assay (or alternatively another column given via --collection-column). If this is not used the "(??)" match group is assumed to equal the irods collection names (this mirrors previous behaviour).
  • The cubi-tk snappy itransfer can now take a LZ_path as destination (this was intended based on descriptions in the code but never implemented)
  • cubi-tk sodar ingest-fastq now fully inherits the possibility of using an LZ_path, LZ_UUID or Sodar porject UUID as a destination from the cubi-tk snappy itransfer functions
  • cubi-tk sodar ingest-fastq similarly inherits the possibility to automatically validate & move landing zones

Changed Features:

  • Renamed the -mflag from --remote-dir-mapping to --sample-collection-mapping and slightly reduced its scope. Previously this allowed applying a regex to the complete remote irods path (and could in-fact make the path non-viable), which allowed adapting the extracted sample/collection name but also renamed the remote files. Now the regex is applied only to extracted "(??)" match group that determines the collection name (either directly or by matching the final modified values to an assay table column).
  • Changed the default_src_regex to omit the sample barcode (S[0-9]+) from the sample name (this should fix improve ingest-fastq #4 )
  • Changed the default destination pattern to include "raw_data" ({collection_name}/raw_data/{date}/{filename} instead of {sample}/{date}/{filename})

Removed Features:

  • removed --add-suffix flag since it's only use was adding a string to the end (i.e. after file ending) of the remote filename. This seemed like unwanted behavior to me and the remote filepaths can otherwise be controlled by --remote-dir-pattern. Renaming of files inside the remote patterns also does not seem to serve any useful purpose.
  • removed unused --basepath flag

Copy link
Contributor

github-actions bot commented Nov 8, 2023

Please format your code with black: make black.

Copy link

codecov bot commented Nov 8, 2023

Codecov Report

Attention: 99 lines in your changes are missing coverage. Please review.

Comparison is base (9026388) 77.84% compared to head (78ae06a) 77.02%.

❗ Current head 78ae06a differs from pull request most recent head a0cf1d7. Consider uploading reports for the commit a0cf1d7 to get more accurate results

Files Patch % Lines
cubi_tk/sodar/ingest_fastq.py 33.96% 70 Missing ⚠️
cubi_tk/snappy/itransfer_common.py 3.33% 29 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #203      +/-   ##
==========================================
- Coverage   77.84%   77.02%   -0.82%     
==========================================
  Files          98       98              
  Lines        7330     7434     +104     
==========================================
+ Hits         5706     5726      +20     
- Misses       1624     1708      +84     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bobermayer
Copy link
Contributor

I can't run this:

cubi-tk sodar ingest-fastq -h
...
File "/fast/work/users/obermayb_c/cubi-tk/cubi_tk/snappy/snappy_workflows.py", line 23
    if m := re.search(r"wf\s*=\s*(\w+)\(", f.read()):
          ^
SyntaxError: invalid syntax

Copy link
Contributor

@sellth sellth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only looked at these parts so far, submitted some comments. As I've never used this function and it's quite complex, someone else should also review.

cubi_tk/sodar/ingest_fastq.py Show resolved Hide resolved
cubi_tk/sodar/ingest_fastq.py Outdated Show resolved Hide resolved
cubi_tk/sodar/ingest_fastq.py Outdated Show resolved Hide resolved
cubi_tk/sodar/ingest_fastq.py Outdated Show resolved Hide resolved
cubi_tk/sodar/ingest_fastq.py Outdated Show resolved Hide resolved
cubi_tk/sodar/ingest_fastq.py Outdated Show resolved Hide resolved
Comment on lines +247 to +248
and in_column.lower()
in re.sub("(Parameter Value|Comment|Characteristics)", "", head).lower()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This re.sub does not remove the square brackets of these column headers. Intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of, I want to remove these part of the column names since they are "invisible" and if people have a column named i.e. 'Comment' then checking for that string as part of the name will make problems. The brackets should not have that issue and removing both them requires a much more complicated regex.

cubi_tk/sodar/ingest_fastq.py Outdated Show resolved Hide resolved
@sellth
Copy link
Contributor

sellth commented Nov 9, 2023

I can't run this:

cubi-tk sodar ingest-fastq -h
...
File "/fast/work/users/obermayb_c/cubi-tk/cubi_tk/snappy/snappy_workflows.py", line 23
    if m := re.search(r"wf\s*=\s*(\w+)\(", f.read()):
          ^
SyntaxError: invalid syntax

@bobermayer This operator was introduced in Python 3.8, which is at the moment the oldest supported version.

Copy link
Contributor

@bobermayer bobermayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me and works as intended

@Nicolai-vKuegelgen Nicolai-vKuegelgen merged commit 8b3662d into main Nov 27, 2023
6 checks passed
@Nicolai-vKuegelgen Nicolai-vKuegelgen deleted the 198-allow-ingest-fastq-to-match-filenames-to-a-specified-column-of-the-assay-table branch November 27, 2023 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow ingest-fastq to match filenames to a specified column of the assay table improve ingest-fastq
3 participants