Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve chromsizes File Validation to Catch Formatting Errors Early #458

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

ShigrafS
Copy link

Improve chromsizes File Validation to Catch Formatting Errors Early (#209 )

Original Issue: #142

Previously, improperly formatted chromsizes files (e.g., files with spaces instead of tabs or hidden characters) could be silently parsed into a DataFrame, leading to NaN values in the "length" column. This resulted in downstream crashes, such as the ValueError: cannot convert float NaN to integer when attempting to compute bins.

This update improves the validation in the read_chromsizes function by immediately converting the "length" column to numeric values and checking for NaNs. If any NaN values are encountered, a clear ValueError is raised, informing the user to ensure the file is properly formatted as a tab-delimited file with exactly two columns: sequence name and integer length. This proactive validation helps users catch formatting issues earlier in the pipeline, preventing cryptic error messages later.

Error Before Fix:

  • Cryptic error when chromsizes file is not properly formatted.

  • Example error message when misformatted chromsizes file is used:

    ValueError: cannot convert float NaN to integer
    
  • Example of command causing the error:

    cooler cload pairix --nproc 9 --assembly gal5 gal5Allele.chrom.sizes:1000 MNP-DT40-1-3-3-R1-T1__gal5.nodups.pairs.gz MNP-DT40-1-3-3-R1-T1__gal5.1000.cool
    

Cause:

  • When hidden characters or formatting issues were present in the .chrom.sizes file, such as spaces instead of tabs, the file was misinterpreted, leading to NaN values being parsed.
  • This issue was overly permissive, allowing incorrect files to pass unnoticed. For instance, a file that misinterpreted a chrom name (allele1) as a valid sequence with NaN as its length would cause problems downstream.

Solution:

  • Immediate validation of the chromsizes file by converting the "length" column to numeric values and checking for NaNs right away.
  • If any invalid data is found, a clear error is raised to guide the user to correct the issue.

This ensures that errors are caught early, avoiding confusing issues later in the pipeline and improving overall robustness.

@ShigrafS ShigrafS changed the title Improve chromsizes file validation to catch formatting errors early (… Improve chromsizes File Validation to Catch Formatting Errors Early Feb 26, 2025
@nvictus
Copy link
Member

nvictus commented Feb 26, 2025

Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of toy.chrom.sizes.

@ShigrafS
Copy link
Author

@nvictus Sure, I'll do that and let you know.

@ShigrafS
Copy link
Author

ShigrafS commented Mar 1, 2025

@nvictus
I have added the unit test and made some minor tweaks as well.
Kindly look into it.

Whether to enable verbose logging for diagnostics.
"""
# Check if the input is a file-like object (StringIO or file path) and inspect the first line for delimiters
if isinstance(filepath_or, (str, io.StringIO)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple issues:

  • StringIO is not the only kind of file-like object that can be given to pandas, so this is too restrictive.
  • The str case does not account for URLs which can also be given to pandas.

Copy link
Member

@nvictus nvictus Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps look into using pd.read_csv(filepath_or, sep="\t", nrows=1, header=None) as a way to test correctness.

Comment on lines 245 to 247
# Read the chromosome size file into a DataFrame
if verbose:
print(f"Reading chromsizes file: {filepath_or}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this.

name_patterns: tuple[str, ...] = (r"^chr[0-9]+$", r"^chr[XY]$", r"^chrM$"),
all_names: bool = False,
verbose: bool = False, # Optional parameter to enable verbose output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is little use for a "verbose" mode if a function isn't doing much and isn't running for long periods of time.

Comment on lines -215 to +217
Parse a ``<db>.chrom.sizes`` or ``<db>.chromInfo.txt`` file from the UCSC
database, where ``db`` is a genome assembly name.
Parse a `<db>.chrom.sizes` or `<db>.chromInfo.txt` file from the UCSC
database, where `db` is a genome assembly name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RestructuredText uses double backticks instead of markdown's single backticks, so these were not typos.


"""
if isinstance(filepath_or, str) and filepath_or.endswith(".gz"):
kwargs.setdefault("compression", "gzip")
chromtable = pd.read_csv(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today, pandas.read_csv has an option on_bad_lines, which is set to "error" by default. I don't think this existed back when the original issue was raised. Have you checked that this already handles some of the cases that used to fail before?

@ShigrafS
Copy link
Author

ShigrafS commented Mar 5, 2025

@nvictus I've made all the required changes.
Kindly look into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants