Outlier detection analysis - empty output #11

gspirito · 2022-04-28T09:05:33Z

Hi, I ran the preprocessing and postprocessing analyses on a cohort of CRAM files and obtained two folders, 'motifs' and 'samples', whose contents seem to be correct. The postprocessing step did not give any errors or warnings.

However, running this command:

python outliers.py -i /<path>/postprocessing/motifs/ --bootstrapCI -pc 95 -is -m /<path>/manifest.txt

I get a one-line output:

Motif Threshold Outlier samples Group counts Status

With no errors or warnings.

I also tried to give the 'samples' or 'postprocessing' directories as input, however I still do not get a result.

Which directory should I use as input?

Thanks.


lfearnley · 2022-04-29T10:16:52Z

Hi @gspirito!

I'm assuming that you've got a directory structure that looks like:

/<path>/postprocessing/motifs/
|---3mers/
|---4mers/
|---5mers/

I think the combination of the manifest and input arguments might be having some unintended consequences. Could you please try:

python outliers.py -i /<path>/postprocessing/ --bootstrapCI -pc 95 -is

and let me know how you go?

gspirito · 2022-05-09T12:14:00Z

Hi, thanks for the response, the directory structure looks like that.

I tried re-running with:

python outliers.py -i /<path>/postprocessing/ --bootstrapCI -pc 95 -is

and I got this error:

Motif	Threshold	Outlier samples	Group counts	Status
Traceback (most recent call last):
  File "/homenfs/gspirito/software/superSTR-main/Python/outliers.py", line 208, in <module>
    df = pd.read_csv(file)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py", line 2061, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 155 fields in line 19, saw 156

lfearnley · 2022-05-19T23:37:32Z

Hi @gspirito! I'm sorry about the delay; I missed the notification.

What's happening here is that for some reason there's an extra field in the output data that isn't expected.

Is there any chance you could email me the failing CSV ([email protected])? If not, I'm happy to debug this with you here, but it might be a little tedious: we'd need to start by checking that none of your sample names or identifiers contain commas, and so on.
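
For anyone who wants to check a file locally first, here is a minimal sketch (not part of superSTR) that lists the rows whose field count differs from the header, which is the condition behind the "Expected 155 fields in line 19, saw 156" message. The function name `find_ragged_rows` and the comma delimiter are assumptions made for illustration.

```python
import csv


def find_ragged_rows(csv_path, delimiter=","):
    """Return (expected_fields, [(line_number, field_count), ...]).

    expected_fields is the number of fields in the header row; the list
    holds every non-blank row whose field count differs from it.
    Hypothetical helper for illustration, not part of superSTR.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            # Empty file: nothing to compare against.
            return 0, []
        expected = len(header)
        ragged = [
            (reader.line_num, len(row))
            for row in reader
            if row and len(row) != expected  # skip blank lines
        ]
    return expected, ragged
```

Running this over each CSV in the input directory should point to the file and line that pandas rejects. An extra field can come from a stray comma in an identifier or from one sample contributing an additional column.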

gspirito · 2022-05-26T14:13:34Z

Hi, thanks for the answer, I sent you the folders via Google Drive.

lfearnley · 2022-06-06T23:26:44Z

Right! Sorry about this delay.

The samples in the files you sent through have varying read lengths - the samples with prefix ASD have 151nt reads and the samples with prefix HG and NA are 150nt.

We didn't implement code to handle this automatically because the specifics of how you handle this can impact your outlier calls - in a strict sense the samples with 151nt read length may be outliers relative to those with 150nt because their read lengths differ, not because of their repeat content.

The way we handled this in the superSTR manuscript was to use read trimming (via trimmomatic - http://www.usadellab.org/cms/?page=trimmomatic) prior to processing of sequencing data. You could potentially also assign all 151nt long expansions to the 150nt bin (effectively setting a max repeat length), however you'll want to proceed with a little caution here - I can send you a script to do this if that's of use.
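
To illustrate the binning idea only (this is not the script mentioned above, and it does not read superSTR's actual output format), here is a rough sketch that folds every count above a maximum length into the bin for that maximum. The dictionary layout mapping repeat length to count and the name `cap_length_bins` are assumptions.

```python
def cap_length_bins(counts_by_length, max_length=150):
    """Fold counts for lengths above max_length into the max_length bin.

    counts_by_length: dict mapping repeat length (nt) to a count.
    This layout is a hypothetical stand-in, not superSTR's file format.
    """
    capped = {}
    for length, count in counts_by_length.items():
        # Any length longer than the cap is treated as exactly the cap.
        target = min(length, max_length)
        capped[target] = capped.get(target, 0) + count
    return capped


# Example: the 151nt bin is merged into the 150nt bin.
print(cap_length_bins({149: 2, 150: 3, 151: 4}))  # {149: 2, 150: 7}
```

Samples sequenced at 151nt would then have the same set of length bins as the 150nt samples. The caution above still applies, since the merged bin mixes two read-length classes.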

lfearnley self-assigned this May 2, 2022

lfearnley added the question Further information is requested label May 2, 2022

lfearnley added the stale label May 9, 2022
