Improve chromsizes File Validation to Catch Formatting Errors Early #458

ShigrafS · 2025-02-26T10:29:12Z

Improve chromsizes File Validation to Catch Formatting Errors Early (#209)

Original Issue: #142

Previously, improperly formatted chromsizes files (e.g., files with spaces instead of tabs or hidden characters) could be silently parsed into a DataFrame, leading to NaN values in the "length" column. This resulted in downstream crashes, such as the ValueError: cannot convert float NaN to integer when attempting to compute bins.

This update improves the validation in the read_chromsizes function by immediately converting the "length" column to numeric values and checking for NaNs. If any NaN values are encountered, a clear ValueError is raised, informing the user to ensure the file is properly formatted as a tab-delimited file with exactly two columns: sequence name and integer length. This proactive validation helps users catch formatting issues earlier in the pipeline, preventing cryptic error messages later.

Error Before Fix:

A cryptic error is raised when the chromsizes file is not properly formatted.
Example error message when a misformatted chromsizes file is used:
```
ValueError: cannot convert float NaN to integer
```

Example of command causing the error:

cooler cload pairix --nproc 9 --assembly gal5 gal5Allele.chrom.sizes:1000 MNP-DT40-1-3-3-R1-T1__gal5.nodups.pairs.gz MNP-DT40-1-3-3-R1-T1__gal5.1000.cool

Cause:

When the .chrom.sizes file contained hidden characters or formatting problems, such as spaces instead of tabs, it was misparsed and NaN values ended up in the "length" column.
The parser was overly permissive and let malformed files pass unnoticed. For instance, a chrom name (allele1) could be accepted as a valid sequence with NaN as its length, which caused problems downstream.
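
The failure mode can be reproduced outside cooler with a few lines of pandas. This is a minimal sketch, not the code path cooler actually takes: the file contents are made up, and the `int(...)` call merely stands in for whatever downstream step needs integer lengths.

```python
import io

import pandas as pd

# Made-up chromsizes contents that use spaces where tabs are expected.
bad_file = io.StringIO("chr1 1000\nchr2 2000\n")

# Split on tabs, each line is a single field, so "length" is padded with NaN.
chromtable = pd.read_csv(bad_file, sep="\t", header=None, names=["name", "length"])
print(chromtable)

# A later step that needs an integer length then fails with the cryptic error.
message = ""
try:
    int(chromtable["length"].iloc[0])
except ValueError as exc:
    message = str(exc)
    print(message)
```

Nothing in the read itself complains; the problem only surfaces once a NaN length reaches integer arithmetic.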

Solution:

Immediate validation of the chromsizes file by converting the "length" column to numeric values and checking for NaNs right away.
If any invalid data is found, a clear error is raised to guide the user to correct the issue.

This ensures that errors are caught early, avoiding confusing issues later in the pipeline and improving overall robustness.
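
As an illustration of the check described above, here is a minimal sketch. The helper name `validate_chromsizes` and the exact wording of the message are assumptions made for this example and are not taken from the diff.

```python
import io

import pandas as pd


def validate_chromsizes(chromtable: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: coerce "length" to numeric and reject NaNs."""
    chromtable = chromtable.copy()
    # Anything that is not a number becomes NaN instead of raising here.
    chromtable["length"] = pd.to_numeric(chromtable["length"], errors="coerce")
    if chromtable["length"].isna().any():
        raise ValueError(
            "Chromsizes file has missing or invalid length values. "
            "Ensure it is tab-delimited with exactly two columns: "
            "sequence name and integer length."
        )
    return chromtable


good = pd.read_csv(
    io.StringIO("chr1\t1000\nchr2\t2000\n"),
    sep="\t",
    header=None,
    names=["name", "length"],
)
bad = pd.read_csv(
    io.StringIO("chr1 1000\nchr2 2000\n"),
    sep="\t",
    header=None,
    names=["name", "length"],
)

print(validate_chromsizes(good))
try:
    validate_chromsizes(bad)
except ValueError as exc:
    print(f"Rejected: {exc}")
```

Raising at read time keeps the message next to the actual cause (the file format) rather than at the point where bins are computed.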

nvictus · 2025-02-26T19:50:50Z

Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of toy.chrom.sizes.
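
A test along these lines might look like the sketch below. It is self-contained for illustration only: `read_chromsizes_strict` is a hypothetical stand-in for the validated reader, and the broken file is written to a temporary directory rather than derived from toy.chrom.sizes. In the project's pytest suite, `pytest.raises(ValueError)` would be the more idiomatic way to express the expectation.

```python
import os
import tempfile

import pandas as pd


def read_chromsizes_strict(path):
    # Hypothetical stand-in for the validated reader, not cooler's actual code.
    table = pd.read_csv(path, sep="\t", header=None, names=["name", "length"])
    if table["length"].isna().any():
        raise ValueError("chromsizes file must be tab-delimited: name<TAB>length")
    return table


def test_read_chromsizes_bad_input():
    with tempfile.TemporaryDirectory() as tmpdir:
        # Broken chromsizes file: spaces instead of tabs.
        path = os.path.join(tmpdir, "broken.chrom.sizes")
        with open(path, "w") as f:
            f.write("chr1 1000\nchr2 2000\n")
        try:
            read_chromsizes_strict(path)
        except ValueError:
            return  # expected outcome
        raise AssertionError("ValueError was not raised for malformed input")
```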

ShigrafS · 2025-02-27T09:58:51Z

@nvictus Sure, I'll do that and let you know.

ShigrafS · 2025-03-01T07:39:34Z

@nvictus
I have added the unit test and made some minor tweaks as well.
Kindly look into it.

nvictus · 2025-03-04T01:27:23Z

src/cooler/util.py

+        Whether to enable verbose logging for diagnostics.
+    """
+    # Check if the input is a file-like object (StringIO or file path) and inspect the first line for delimiters
+    if isinstance(filepath_or, (str, io.StringIO)):


A couple issues:

StringIO is not the only kind of file-like object that can be given to pandas, so this is too restrictive.

The str case does not account for URLs which can also be given to pandas.

Perhaps look into using pd.read_csv(filepath_or, sep="\t", nrows=1, header=None) as a way to test correctness.
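
A rough sketch of that idea follows. The helper name `first_row_field_count` is made up for this example, and the rewind step is an assumption about how a file-like input would be reused afterwards; it would not help with non-seekable streams.

```python
import io

import pandas as pd


def first_row_field_count(filepath_or) -> int:
    """Hypothetical helper: count tab-separated fields on the first row."""
    first = pd.read_csv(filepath_or, sep="\t", nrows=1, header=None)
    # A file-like object is consumed by the read above, so rewind it if possible.
    if hasattr(filepath_or, "seek"):
        filepath_or.seek(0)
    return first.shape[1]


spaced = io.StringIO("chr1 1000\nchr2 2000\n")
tabbed = io.StringIO("chr1\t1000\nchr2\t2000\n")

print(first_row_field_count(spaced))  # 1: the whole line is a single field
print(first_row_field_count(tabbed))  # 2: name and length
```

Because the probe goes through `pd.read_csv` itself, it accepts whatever pandas accepts (paths, URLs, file-like objects) instead of special-casing `str` and `StringIO`.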

nvictus · 2025-03-04T01:49:29Z

src/cooler/util.py

+    # Read the chromosome size file into a DataFrame
+    if verbose:
+        print(f"Reading chromsizes file: {filepath_or}")


Remove this.

nvictus · 2025-03-04T01:52:34Z

src/cooler/util.py

    name_patterns: tuple[str, ...] = (r"^chr[0-9]+$", r"^chr[XY]$", r"^chrM$"),
    all_names: bool = False,
+    verbose: bool = False,  # Optional parameter to enable verbose output


There is little use for a "verbose" mode if a function isn't doing much and isn't running for long periods of time.

nvictus · 2025-03-04T02:00:01Z

src/cooler/util.py

-    Parse a ``<db>.chrom.sizes`` or ``<db>.chromInfo.txt`` file from the UCSC
-    database, where ``db`` is a genome assembly name.
+    Parse a `<db>.chrom.sizes` or `<db>.chromInfo.txt` file from the UCSC
+    database, where `db` is a genome assembly name.


reStructuredText uses double backticks instead of markdown's single backticks, so these were not typos.

nvictus · 2025-03-04T02:02:27Z

src/cooler/util.py


-    """
-    if isinstance(filepath_or, str) and filepath_or.endswith(".gz"):
-        kwargs.setdefault("compression", "gzip")
    chromtable = pd.read_csv(


Today, pandas.read_csv has an option on_bad_lines, which is set to "error" by default. I don't think this existed back when the original issue was raised. Have you checked that this already handles some of the cases that used to fail before?
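
One quick way to probe this, as a sketch with made-up input: the pandas documentation describes a "bad line" as one with too many fields, whereas a space-delimited chromsizes file has too few fields per line when split on tabs. Assuming pandas 1.3 or newer (where `on_bad_lines` is available), the read below completes without raising.

```python
import io

import pandas as pd

# Space-delimited lines: one field per line when split on tabs.
text = "chr1 1000\nchr2 2000\n"

table = pd.read_csv(
    io.StringIO(text),
    sep="\t",
    header=None,
    names=["name", "length"],
    on_bad_lines="error",  # the default, spelled out for clarity
)

# No exception was raised; the missing field is padded with NaN instead.
n_missing = int(table["length"].isna().sum())
print(n_missing)
```

So for this particular kind of malformed file, an explicit NaN check still appears to be needed on top of `on_bad_lines`.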

ShigrafS · 2025-03-05T07:39:15Z

@nvictus I've made all the required changes.
Kindly look into it.

Improve chromsizes file validation to catch formatting errors early (o…

f093a77

…pen2c#209)

ShigrafS changed the title ~~Improve chromsizes file validation to catch formatting errors early (…~~ Improve chromsizes File Validation to Catch Formatting Errors Early Feb 26, 2025

ShigrafS and others added 3 commits March 1, 2025 07:06

Added a test function for the chromsize check introduced in src/coole…

c8616a8

…r/util.py

Fixed pytest error in test_chromsize_check.py and made minor tweaks t…

a7bc883

…o rea-chromsize in util.py

[pre-commit.ci] auto fixes from pre-commit.com hooks

55986ad

for more information, see https://pre-commit.ci

nvictus reviewed Mar 4, 2025

nvictus added 2 commits March 3, 2025 20:46

Move tests to test_util and fix line lengths

34bb070

Remove carriage returns and fix line lengths

db29975

ShigrafS and others added 3 commits March 4, 2025 13:58

Update src/cooler/util.py

77cd6a1

Co-authored-by: Nezar Abdennur <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

7ea498d

for more information, see https://pre-commit.ci

Removed verbose and added pandas built in on_bad_lines in def chromsi…

ffa8363

…zes.
