Outlier detection analysis - empty output #11

Open
gspirito opened this issue Apr 28, 2022 · 5 comments
Assignees
Labels
question Further information is requested stale

Comments

@gspirito

Hi, I performed the preprocessing and postprocessing analyses on a cohort of cram files, obtaining two folders, 'motifs' and 'samples', whose contents appear to be correct. The postprocessing also did not give any errors or warnings.

However, running this command:

python outliers.py -i /<path>/postprocessing/motifs/ --bootstrapCI -pc 95 -is -m /<path>/manifest.txt

I get a one-line output:

Motif Threshold Outlier samples Group counts Status

With no errors or warnings.

I also tried giving the 'samples' and 'postprocessing' directories as input, but I still did not get a result.

Which directory should I use as input?

Thanks.

@lfearnley
Contributor

Hi @gspirito!

I'm assuming that you've got a directory structure that looks like:

/<path>/postprocessing/motifs/
|---3mers/
|---4mers/
|---5mers/

I think combining the manifest and input arguments may be having some unintended consequences. Could you please try python outliers.py -i /<path>/postprocessing/ --bootstrapCI -pc 95 -is and let me know how you go?

@lfearnley lfearnley self-assigned this May 2, 2022
@lfearnley lfearnley added the question Further information is requested label May 2, 2022
@lfearnley lfearnley added the stale label May 9, 2022
@gspirito
Author

gspirito commented May 9, 2022

Hi, thanks for the response, the directory structure looks like that.

I tried re-running with python outliers.py -i /<path>/postprocessing/ --bootstrapCI -pc 95 -is
and got this error:

Motif	Threshold	Outlier samples	Group counts	Status
Traceback (most recent call last):
  File "/homenfs/gspirito/software/superSTR-main/Python/outliers.py", line 208, in <module>
    df = pd.read_csv(file)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 2061, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 155 fields in line 19, saw 156

@lfearnley
Contributor

Hi @gspirito! I'm sorry about the delay; I missed the notification.

What's happening here is that, for some reason, one row of the data being read has an extra field that the parser doesn't expect (156 fields where it expected 155).

Is there any chance that you'd be able to email me the failing CSV ([email protected])? If not, I'm happy to try and debug this with you here, but it might be a little tedious - we'd need to start by checking that none of your samples/identifiers have commas in them, etc.
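
If you'd like to check locally first, here is a minimal sketch (not part of superSTR) that reports which lines of a CSV have a different number of fields from the header row. The function name find_ragged_rows and the comma delimiter are assumptions made for illustration; it uses only Python's standard csv module.

```python
import csv


def find_ragged_rows(path, delimiter=","):
    """Return (expected, ragged), where expected is the header's field count
    and ragged lists (line_number, field_count) for every row that differs.

    Hypothetical helper for illustration; not part of superSTR.
    """
    ragged = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            # Empty file: nothing to compare against.
            return 0, ragged
        expected = len(header)
        for row in reader:
            if len(row) != expected:
                # reader.line_num is the 1-based line number just read.
                ragged.append((reader.line_num, len(row)))
    return expected, ragged
```

Calling find_ragged_rows on the file named in the traceback should point at the offending line; looking at that row's identifiers and at which column is extra is then a reasonable next step.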

@gspirito
Author

Hi, thanks for the answer, I sent you the folders via Google Drive.

@lfearnley
Contributor

Right! Sorry about this delay.

The samples in the files you sent through have varying read lengths - the samples with prefix ASD have 151nt reads and the samples with prefix HG and NA are 150nt.

We didn't implement code to handle this automatically because the specifics of how you handle this can impact your outlier calls - in a strict sense the samples with 151nt read length may be outliers relative to those with 150nt because their read lengths differ, not because of their repeat content.

The way we handled this in the superSTR manuscript was to use read trimming (via Trimmomatic - http://www.usadellab.org/cms/?page=trimmomatic) prior to processing of sequencing data. You could potentially also assign all 151nt-long expansions to the 150nt bin (effectively setting a max repeat length); however, you'll want to proceed with a little caution here - I can send you a script to do this if that's of use.
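
To illustrate the binning idea only, here is a rough sketch. It assumes the per-sample data has already been loaded into a plain dict mapping repeat length (in nt) to a count; that layout and the function name clamp_length_bins are hypothetical and do not reflect the actual superSTR file format.

```python
from collections import Counter


def clamp_length_bins(counts, max_length=150):
    """Fold every bin longer than max_length into the max_length bin.

    counts: dict mapping repeat length (nt) to a count. This layout is an
    assumption for illustration, not the superSTR output format.
    """
    clamped = Counter()
    for length, count in counts.items():
        # Lengths above the cap are added to the cap's bin.
        clamped[min(length, max_length)] += count
    return dict(clamped)


# Example: the 151nt bin is merged into the 150nt bin.
print(clamp_length_bins({149: 4, 150: 2, 151: 3}))  # {149: 4, 150: 5}
```

Note that this caps the maximum repeat length for every sample, which is why some caution is needed when interpreting the resulting outlier calls.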
