-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve chromsizes File Validation to Catch Formatting Errors Early #458
base: master
Are you sure you want to change the base?
Conversation
Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of |
@nvictus Sure, I'll do that and let you know. |
…o rea-chromsize in util.py
for more information, see https://pre-commit.ci
@nvictus |
src/cooler/util.py
Outdated
Whether to enable verbose logging for diagnostics. | ||
""" | ||
# Check if the input is a file-like object (StringIO or file path) and inspect the first line for delimiters | ||
if isinstance(filepath_or, (str, io.StringIO)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple issues:
- StringIO is not the only kind of file-like object that can be given to pandas, so this is too restrictive.
- The
str
case does not account for URLs which can also be given to pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps look into using pd.read_csv(filepath_or, sep="\t", nrows=1, header=None)
as a way to test correctness.
src/cooler/util.py
Outdated
# Read the chromosome size file into a DataFrame | ||
if verbose: | ||
print(f"Reading chromsizes file: {filepath_or}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this.
src/cooler/util.py
Outdated
name_patterns: tuple[str, ...] = (r"^chr[0-9]+$", r"^chr[XY]$", r"^chrM$"), | ||
all_names: bool = False, | ||
verbose: bool = False, # Optional parameter to enable verbose output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is little use for a "verbose" mode if a function isn't doing much and isn't running for long periods of time.
Parse a ``<db>.chrom.sizes`` or ``<db>.chromInfo.txt`` file from the UCSC | ||
database, where ``db`` is a genome assembly name. | ||
Parse a `<db>.chrom.sizes` or `<db>.chromInfo.txt` file from the UCSC | ||
database, where `db` is a genome assembly name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RestructuredText uses double backticks instead of markdown's single backticks, so these were not typos.
|
||
""" | ||
if isinstance(filepath_or, str) and filepath_or.endswith(".gz"): | ||
kwargs.setdefault("compression", "gzip") | ||
chromtable = pd.read_csv( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today, pandas.read_csv
has an option on_bad_lines
, which is set to "error" by default. I don't think this existed back when the original issue was raised. Have you checked that this already handles some of the cases that used to fail before?
Co-authored-by: Nezar Abdennur <[email protected]>
for more information, see https://pre-commit.ci
@nvictus I've made all the required changes. |
Improve chromsizes File Validation to Catch Formatting Errors Early (#209 )
Original Issue: #142
Previously, improperly formatted chromsizes files (e.g., files with spaces instead of tabs or hidden characters) could be silently parsed into a DataFrame, leading to
NaN
values in the "length" column. This resulted in downstream crashes, such as theValueError: cannot convert float NaN to integer
when attempting to compute bins.This update improves the validation in the
read_chromsizes
function by immediately converting the "length" column to numeric values and checking forNaN
s. If anyNaN
values are encountered, a clearValueError
is raised, informing the user to ensure the file is properly formatted as a tab-delimited file with exactly two columns: sequence name and integer length. This proactive validation helps users catch formatting issues earlier in the pipeline, preventing cryptic error messages later.Error Before Fix:
Cryptic error when chromsizes file is not properly formatted.
Example error message when misformatted chromsizes file is used:
Example of command causing the error:
Cause:
.chrom.sizes
file, such as spaces instead of tabs, the file was misinterpreted, leading toNaN
values being parsed.allele1
) as a valid sequence withNaN
as its length would cause problems downstream.Solution:
chromsizes
file by converting the "length" column to numeric values and checking forNaN
s right away.This ensures that errors are caught early, avoiding confusing issues later in the pipeline and improving overall robustness.