Transition to SQLite backend - use transcript-specific databases #192

Open

wants to merge 100 commits into base: new_data_store

Commits (100)
78e923b
DataStore: small fix in SQL query, add to-do comment
hendrikweisser Mar 9, 2021
e5d7621
Eventalign_collapse: add flag for writing to DB (or TSV file), fix er…
hendrikweisser Mar 9, 2021
51a87e4
Merge branch 'new_data_store' of https://github.com/tleonardi/nanocom…
hendrikweisser Mar 9, 2021
d7edde9
better log message in 'DataStore.__init__'
hendrikweisser Mar 10, 2021
ba2da54
Merge branch 'new_data_store' of https://github.com/tleonardi/nanocom…
hendrikweisser Mar 10, 2021
afe005c
coding style (added spaces for readability)
hendrikweisser Mar 11, 2021
7f60da1
store read-level kmer stats in database (needed for whitelisting)
hendrikweisser Mar 11, 2021
da7eb4b
Whitelist: read data from SQLite, filter reads during query
hendrikweisser Mar 12, 2021
2fefe7c
add function to check validity of sample dictionary to 'common.py'
hendrikweisser Mar 16, 2021
8f52c1c
new class 'DatabaseWrapper' for reusable DB interaction code; adapt '…
hendrikweisser Mar 16, 2021
6121376
Whitelist: fix filtering condition for sample subsets
hendrikweisser Mar 17, 2021
989af13
consolidate database code in 'DataStore', remove 'DatabaseWrapper' (n…
hendrikweisser Mar 17, 2021
6dbbdb2
DataStore: small fixes, move 'DBCreateMode' (enum) to top level
hendrikweisser Mar 23, 2021
031a995
Eventalign_collapse: small fix in 'DataStore' call
hendrikweisser Mar 23, 2021
9cc0806
Whitelist: updated 'DataStore' call
hendrikweisser Mar 23, 2021
a1cb349
SampComp: get data from SQLite DB; some refactoring
hendrikweisser Mar 23, 2021
fd0d446
TxComp: update to changes in 'SampComp'; some refactoring
hendrikweisser Mar 23, 2021
69563e0
refactor DataStore, create child classes 'DS_EventAlign' and 'DS_Samp…
hendrikweisser Mar 29, 2021
b1ad899
use 'DataStore_EventAlign' in 'Eventalign_collapse' and 'Whitelist'
hendrikweisser Mar 29, 2021
b8de4e9
TxComp: simplify 'txCompare' results data structure (remove 'lowCov' …
hendrikweisser Mar 29, 2021
7fae6d8
SampComp: small logging fix and update of 'txCompare' results
hendrikweisser Mar 29, 2021
9e96e8a
DataStore: add functions to store sample information and whitelisted …
hendrikweisser Mar 31, 2021
63429d5
SampComp: store whitelisted reads in SQLite DB
hendrikweisser Mar 31, 2021
edf0081
TxComp: use 'ST' as abbrev. for (Student) t-test
hendrikweisser Apr 6, 2021
536ab64
DataStore: fix DB init (tables creation), rework DB schema for SampCo…
hendrikweisser Apr 6, 2021
24881a4
SampComp: remove unused parameters, write output to SQLite
hendrikweisser Apr 6, 2021
fa0ec95
TxComp: coding style - added whitespace
hendrikweisser Jul 16, 2021
112e50d
DataStore: split 'gmm_results' SQL table into two; improve error logg…
hendrikweisser Jul 16, 2021
6d00f2e
add new class 'PostProcess' for data export etc. (work in progress)
hendrikweisser Jul 16, 2021
d8324c7
SampComp/TxComp/DataStore: limit to one univariate and one GMM-based …
hendrikweisser Jul 21, 2021
a7e9269
DataStore: improve definition of SQL tables
hendrikweisser Jul 21, 2021
5b35d30
TxComp: cosmetic changes
hendrikweisser Jul 21, 2021
833cc2e
DataStore/SampComp: add DB columns for adj. p-values, adapt DB schema…
hendrikweisser Jul 21, 2021
2a35003
TxComp: combine collection of functions into class 'TxComp'
hendrikweisser Jul 21, 2021
3ff8038
DataStore: small bug fixes (add 'self' for method calls)
hendrikweisser Jul 21, 2021
6270cdc
SampComp: use new 'TxComp' class, simplify parameter handling
hendrikweisser Jul 21, 2021
bd3499b
TxComp/DataStore: bug fixes (use of 'sequence_context')
hendrikweisser Jul 21, 2021
39c5009
SampComp: add multiple testing correction, remove 'shelve' export
hendrikweisser Jul 23, 2021
c6faa5e
PostProcess: implement 'save_report' for SQLite data
hendrikweisser Jul 23, 2021
bc3e8ba
PostProcess: remove 'save_shift_stats' (now included in 'save_report')
hendrikweisser Jul 23, 2021
479b443
Eventalign_collapse: remove TSV output option, simplify parameters
hendrikweisser Aug 12, 2021
543653d
SampComp: remove irrelevant data from output queue tuple (thanks Tomm…
hendrikweisser Aug 12, 2021
1721746
common: update function to build dict. with sample information
hendrikweisser Aug 16, 2021
fc3f0bb
DataStore: add to-do comment
hendrikweisser Aug 16, 2021
3219601
main: update command line options
hendrikweisser Aug 16, 2021
e7ef11e
main: fix PostProcess (TSV export) usage
hendrikweisser Aug 17, 2021
f6d82ee
main: update CLI documentation (minimal examples), make report genera…
hendrikweisser Aug 17, 2021
75193d7
Eventalign_collapse: small optimization (input file reading)
hendrikweisser Aug 25, 2021
d967101
SuperParser: add spaces (coding style)
hendrikweisser Aug 25, 2021
5980637
Whitelist: add to-do comment
hendrikweisser Aug 26, 2021
314bf1f
DataStore: reduce file size of 'eventalign_collapse' output DB (by ab…
hendrikweisser Aug 26, 2021
7571d2e
redesign: use one SQLite DB per transcript
hendrikweisser Sep 13, 2021
924c9d5
remove unused function 'check_sample_dict'
hendrikweisser Sep 13, 2021
33f65be
SampComp: implement FDR p-value adjustment with 'n_tests' parameter
hendrikweisser Sep 13, 2021
0843354
refactor (new DataStore function 'add_or_reset_column'), fix column n…
hendrikweisser Sep 15, 2021
6c684ef
DataStore_master: store input file paths in 'samples' DB
hendrikweisser Sep 17, 2021
8a3e50c
Whitelist: use 'DataStore.add_or_reset_column'
hendrikweisser Sep 17, 2021
d52792f
common: allow multiple totals in function 'get_hash_bin'
hendrikweisser Sep 17, 2021
8a3c636
SampComp: perform p-value adjustment; some clean-up
hendrikweisser Sep 17, 2021
ebef54e
Eventalign_collapse: implement parallel reading of multiple input fil…
hendrikweisser Sep 17, 2021
99a2e4e
main: update CLI for 'eventalign_collapse'
hendrikweisser Sep 17, 2021
f18b5e9
common: small simplification (logging to file)
hendrikweisser Sep 23, 2021
a8c0eab
DataStore: store transcript status in master DB; check for existence …
hendrikweisser Sep 23, 2021
39b58aa
SampComp: for every transcript, write processing status to master DB
hendrikweisser Sep 23, 2021
bbdd94c
main: update CLI for SampComp
hendrikweisser Sep 23, 2021
9dedba6
DataStore: store processing parameters in master DB
hendrikweisser Sep 23, 2021
190aed3
SampComp: enqueue transcripts in main thread, check seq. length in wo…
hendrikweisser Sep 23, 2021
4dc4b1a
Eventalign_collapse: fix use of master DB lock
hendrikweisser Sep 23, 2021
3730079
SampComp: store parameters in master DB
hendrikweisser Sep 23, 2021
6743376
Eventalign_collapse/DataStore: fix errors in 'eventalign_collapse' step
hendrikweisser Sep 24, 2021
1219c55
DataStore: fix UPSERT statement for storing parameters
hendrikweisser Sep 24, 2021
b57e575
EventAlign_collapse: add to-do comment
hendrikweisser Sep 24, 2021
9b506f7
SampComp: small fix
hendrikweisser Sep 24, 2021
4a5aed9
DataStore: add to-do comment
hendrikweisser Sep 29, 2021
1f00cc1
PostProcess: update (rewrite) 'save_report' function
hendrikweisser Sep 29, 2021
38a9c8b
main: update CLI
hendrikweisser Oct 1, 2021
b2bca8a
DataStore/SampComp: add functions to reset databases to pre-SampComp …
hendrikweisser Oct 21, 2021
cf9a1e0
SampComp/DataStore: option to store fitted GMMs in transcript databases
hendrikweisser Oct 29, 2021
ea6e720
TxComp: don't scale intensities/dwell times before GMM fit
hendrikweisser Nov 1, 2021
56dff1c
SampComp/Whitelist: perform downsampling of reads per-sample (now as …
hendrikweisser Nov 3, 2021
23fbe23
DataStore: calculate standard-scaled kmer intensities/dwell times and…
hendrikweisser Nov 4, 2021
3665ba3
Eventalign_collapse: calculate mean and std. dev. of kmer intensities…
hendrikweisser Nov 4, 2021
0672b0e
DataStore: store read summary stats for data scaling; remove scaling …
hendrikweisser Nov 4, 2021
e1f6083
TxComp: use 'dwelltime' consistently
hendrikweisser Nov 4, 2021
6613d75
SampComp: scale intensity/dwell time data when reading from DB; make …
hendrikweisser Nov 4, 2021
cebf5c7
Eventalign_collapse: add option to filter transcripts by accession
hendrikweisser Nov 7, 2021
8ed34cc
SampComp/Eventalign_collapse: small fixes
hendrikweisser Nov 7, 2021
f49ae81
store log10-transformed dwell times in transcript databases
hendrikweisser Nov 8, 2021
1897a82
TxComp: return all fitted GMMs, not just best
hendrikweisser Dec 11, 2021
b93b15a
SampComp/DataStore: add option to store components of all fitted GMMs…
hendrikweisser Dec 11, 2021
9e464b2
TxComp/DataStore: store more information about fitted GMMs
hendrikweisser Dec 13, 2021
03af595
SampComp/DataStore: use per-sequence mean/std.dev. for intensity scal…
hendrikweisser Dec 16, 2021
6f35500
SampComp: stop pre-scaling intensities; TxComp: scale data for each t…
hendrikweisser Jan 5, 2022
4833de6
allow running 'eventalign_collapse' on a single input (via file or st…
hendrikweisser Mar 3, 2022
7d0b502
DataStore: fix constraint failure when 'eventalign_collapse' is used …
hendrikweisser Mar 3, 2022
3c24c16
Eventalign_collapse: improved log message
hendrikweisser Mar 7, 2022
d7253d7
main: fix error in argument parsing for 'eventalign_collapse'
hendrikweisser Mar 7, 2022
7d78f4d
TxComp: for 2-GMMs, ensure component 1 has lower intensity than compo…
hendrikweisser Mar 11, 2022
7945dcf
main: small fixes, doc improvements
hendrikweisser Mar 17, 2022
2f6758f
Eventalign_collapse: fix std. dev. calculations for reads with only o…
hendrikweisser Jun 28, 2022
737 changes: 575 additions & 162 deletions nanocompore/DataStore.py

Large diffs are not rendered by default.
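Note: since the DataStore.py diff is not rendered above, the sketch below shows how the database layout introduced by this PR could be inspected directly with sqlite3. It is a minimal, hypothetical example: the table and column names (transcripts, kmer_stats, id/name/subdir) are taken from the queries in PostProcess.py further down, the file layout (a master nanocompore.db plus one <transcript>.db per transcript in numbered subdirectories) is assumed from the same code, and results_dir is a placeholder.

import os.path
import sqlite3

results_dir = "nanocompore_results"  # hypothetical output directory of the pipeline

# open the master database and list the transcript-specific databases it points to
master = sqlite3.connect(os.path.join(results_dir, "nanocompore.db"))
master.row_factory = sqlite3.Row
for tx in master.execute("SELECT id, name, subdir FROM transcripts"):
    tx_db_path = os.path.join(results_dir, str(tx["subdir"]), tx["name"] + ".db")
    tx_db = sqlite3.connect(tx_db_path)
    # count k-mer positions with per-kmer statistics in this transcript's database
    n_kmers = tx_db.execute("SELECT COUNT(*) FROM kmer_stats").fetchone()[0]
    print(tx["name"], n_kmers)
    tx_db.close()
master.close()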

549 changes: 272 additions & 277 deletions nanocompore/Eventalign_collapse.py

Large diffs are not rendered by default.

240 changes: 240 additions & 0 deletions nanocompore/PostProcess.py
@@ -0,0 +1,240 @@
# -*- coding: utf-8 -*-

#~~~~~~~~~~~~~~IMPORTS~~~~~~~~~~~~~~#
# Std lib
import os.path
import sys
from collections import namedtuple
from math import log

# Third party
import numpy as np
from loguru import logger
from bedparse import bedline  # for BED output in 'save_to_bed'

# Local package
from nanocompore.common import *
from nanocompore.DataStore import DataStore_master, DataStore_transcript

#~~~~~~~~~~~~~~MAIN CLASS~~~~~~~~~~~~~~#
class PostProcess(object):
    """Helper class for post-processing `SampComp` results"""

    def __init__(self, input_dir:str, master_db:str="nanocompore.db", bed_path:str=None):
        self._input_dir = input_dir
        self._master_db_path = os.path.join(input_dir, master_db)
        self._bed_path = bed_path

    def save_all(self, outpath_prefix=None, pvalue_thr=0.01):
        """
        Save all text reports, including genomic coordinates if a BED file was provided
        * outpath_prefix
            outpath + prefix to use as a basename for output files.
            If not given, the same prefix as the master database is used.
        * pvalue_thr
            p-value threshold for reporting significant sites in BED files
        """
        if not outpath_prefix:
            outpath_prefix = self._master_db_path.replace("SampComp.db", "")
        logger.debug("Save reports to {}".format(outpath_prefix))

        # Save reports
        logger.debug("\tSaving extended tabular report")
        self.save_report(output_fn = outpath_prefix + "nanocompore_results.tsv")
        logger.debug("\tSaving shift results")
        ## TODO: 'save_shift_stats' was merged into 'save_report' and 'self._metadata' is never set in this class - update this method
        self.save_shift_stats(output_fn = outpath_prefix + "nanocompore_shift_stats.tsv")

        # Save bed and bedgraph files for each method used
        if self._bed_path:
            logger.debug("\tSaving significant genomic coordinates in Bed and Bedgraph format")
            for m in self._metadata["pvalue_tests"]:
                self.save_to_bed(
                    output_fn = outpath_prefix + "sig_sites_{}_thr_{}.bed".format(m, pvalue_thr),
                    bedgraph=False, pvalue_field=m, pvalue_thr=pvalue_thr, span=5, title="Nanocompore Significant Sites")
                self.save_to_bed(
                    output_fn = outpath_prefix + "sig_sites_{}_thr_{}.bedgraph".format(m, pvalue_thr),
                    bedgraph=True, pvalue_field=m, pvalue_thr=pvalue_thr, title="Nanocompore Significant Sites")


    def save_to_bed(self, output_fn=None, bedgraph=False, pvalue_field=None, pvalue_thr=0.01, span=5, convert=None, assembly=None, title=None):
        """
        Save the positions of significant sites in genome space in BED6 or BEDGRAPH format.
        The resulting file can be used in a genome browser to visualise significant genomic locations.
        This option is only available if `PostProcess` is initialised with a BED file containing genome annotations.
        * output_fn
            Path to file where to write the data
        * bedgraph
            save file in bedgraph format instead of bed
        * pvalue_field
            specifies which column to use as BED score (field 5, as -log10)
        * pvalue_thr
            only report positions with pvalue <= thr
        * span
            The size of each BED feature.
            If span=5 (default), features correspond to kmers.
            If span=1, features correspond to the first base of each kmer.
        * convert
            one of 'ensembl_to_ucsc' or 'ucsc_to_ensembl'. Convert chromosome names between Ensembl and UCSC conventions
        * assembly
            required if convert is used. One of "hg38" or "mm10"
        """
        if self._bed_path is None:
            raise NanocomporeError("In order to generate a BED file PostProcess needs to be initialised with a transcriptome BED")
        if span < 1:
            raise NanocomporeError("span has to be >= 1")
        if span != 5 and bedgraph:
            raise NanocomporeError("Span is ignored when generating bedGraph files")
        if "results" not in self.__dict__:
            raise NanocomporeError("It looks like there is no results slot in PostProcess")
        if pvalue_field not in self.results:
            raise NanocomporeError("The field '%s' is not in the results" % pvalue_field)
        if convert not in [None, "ensembl_to_ucsc", "ucsc_to_ensembl"]:
            raise NanocomporeError("Convert value not valid")
        if convert is not None and assembly is None:
            raise NanocomporeError("The assembly argument is required in order to do the conversion. Choose one of 'hg38' or 'mm10'")

        with open(output_fn, "w") as bed_file:
            if title is not None:
                if not bedgraph:
                    bed_file.write('track type=bed name="%s" description="%s"\n' % (title, pvalue_field))
                else:
                    bed_file.write('track type=bedGraph name="%s" description="%s"\n' % (title, pvalue_field))

            Record = namedtuple('Record', ['chr', 'genomicPos', 'ref_id', 'strand', 'ref_kmer', pvalue_field])
            threshold = -log(pvalue_thr, 10)
            for record in self.results[list(Record._fields)].itertuples(index=False, name="Record"):
                pvalue = getattr(record, pvalue_field)
                if np.isnan(pvalue):
                    pvalue = 0
                elif pvalue < sys.float_info.min:
                    pvalue = -log(sys.float_info.min, 10)
                else:
                    pvalue = -log(pvalue, 10)
                if not bedgraph and pvalue < threshold:
                    continue
                if bedgraph:
                    if record.strand == "+":
                        start_pos = record.genomicPos + 2
                    else:
                        start_pos = record.genomicPos - 2
                    end_pos = start_pos + 1
                else:
                    if record.strand == "+":
                        start_pos = record.genomicPos
                    else:
                        start_pos = record.genomicPos - span + 1
                    end_pos = start_pos + span
                line = bedline([record.chr, start_pos, end_pos, "%s_%s" % (record.ref_id, record.ref_kmer),
                                pvalue, record.strand])
                if convert == "ensembl_to_ucsc":
                    line = line.translateChr(assembly=assembly, target="ucsc", patches=True)
                elif convert == "ucsc_to_ensembl":
                    line = line.translateChr(assembly=assembly, target="ens", patches=True)
                if bedgraph:
                    bed_file.write("%s\t%s\t%s\t%s\n" % (line.chr, line.start, line.end, line.score))
                else:
                    bed_file.write("%s\t%s\t%s\t%s\t%s\t%s\n" % (line.chr, line.start, line.end,
                                                                 line.name, line.score, line.strand))


    def save_report(self, output_fn:str=None, significance_thresholds={"adj_gmm_pvalue": 0.01}, details=False):
        """
        Save a tabulated text dump of the database containing all the statistical results for all the positions
        * output_fn
            Path to file where to write the data. If None, data is written to standard output.
        """
        ## TODO: can this be done in a "with ..." clause?
        if output_fn is None:
            fp = sys.stdout
        elif isinstance(output_fn, str):
            try:
                fp = open(output_fn, "w")
            except:
                raise NanocomporeError("Error opening output file %s" % output_fn)
        else:
            raise NanocomporeError("output_fn needs to be a string or None")

        where = []
        order = []
        for k, v in significance_thresholds.items():
            where.append(f"{k} <= {v}")
            order.append(k)
        if details or not order:
            order = ["transcriptid", "kmer_pos"]
        # TODO: depending on 'details', results will be ordered by p-value or by transcript/pos. - problem?
        sql = "SELECT * FROM test_results LEFT JOIN (SELECT id, name, subdir FROM transcripts) ON transcriptid = id"
        if where:
            sql += " WHERE " + " OR ".join(where)
        sql += " ORDER BY " + ", ".join(order)
        exclude_cols = ["transcriptid", "id", "name", "subdir"]
        with DataStore_master(self._master_db_path) as master:
            if not details:  # write output from master DB directly to TSV
                master.cursor.execute(sql)
                row = master.cursor.fetchone()
                if row:
                    self.write_tsv_header(fp, row)
                    self.write_tsv_row(fp, row)
                    for row in master.cursor:
                        self.write_tsv_row(fp, row)
            else:
                # include GMM stats?
                master.cursor.execute("SELECT value FROM parameters WHERE step = 'SC' AND name = 'fit_gmm'")
                row = master.cursor.fetchone()
                with_gmm = (row is not None) and (row[0] == "True")
                master.cursor.execute(sql)
                row = master.cursor.fetchone()
                first_row = True  # do we need to write the header?
                while row:
                    current_tx = row["name"]
                    db_path = os.path.join(self._input_dir, str(row["subdir"]), current_tx + ".db")
                    with DataStore_transcript(db_path, current_tx, row["transcriptid"]) as db:
                        while row and (row["name"] == current_tx):  # still on the same transcript
                            # get data from transcript DB:
                            kmer_pos = row["kmer_pos"]
                            seq_query = "SELECT sequenceid FROM kmers WHERE position = ? LIMIT 1"
                            seq_row = db.cursor.execute(seq_query, (kmer_pos, )).fetchone()
                            seq = list(master.sequence_mapping.keys())[seq_row[0]]
                            kmer_query = "SELECT * FROM kmer_stats WHERE kmer_pos = ?"
                            kmer_row = db.cursor.execute(kmer_query, (kmer_pos, )).fetchone()
                            if with_gmm:
                                gmm_query = "SELECT cluster_counts FROM gmm_stats WHERE kmer_pos = ?"
                                gmm_row = db.cursor.execute(gmm_query, (kmer_pos, )).fetchone()
                                if not gmm_row:  # may be 'None' if best GMM only had one component
                                    gmm_row = [None]
                            else:
                                gmm_row = None
                            if first_row:
                                self.write_tsv_header(fp, row, seq, kmer_row, gmm_row)
                                first_row = False
                            self.write_tsv_row(fp, row, seq, kmer_row, gmm_row)
                            row = master.cursor.fetchone()


    @staticmethod
    def write_tsv_header(fp, master_row, seq=None, kmer_row=None, gmm_row=None):
        fp.write("transcript\tkmer_pos")
        if seq:  # squeeze kmer sequence in at this point
            fp.write("\tkmer_seq")
        for k in master_row.keys():
            if "pvalue" in k:
                fp.write("\t" + k)
        if kmer_row:
            for k in kmer_row.keys()[1:]:  # skip 'kmer_pos'
                if "pvalue" not in k:  # skip p-value columns (already written from master DB)
                    fp.write("\t" + k)
        if gmm_row:
            fp.write("\tcluster_counts")
        fp.write("\n")


    @staticmethod
    def write_tsv_row(fp, master_row, seq=None, kmer_row=None, gmm_row=None):
        fp.write(master_row["name"] + "\t" + str(master_row["kmer_pos"]))
        if seq:  # squeeze kmer sequence in at this point
            fp.write("\t" + seq)
        for k in master_row.keys():
            if "pvalue" in k:
                fp.write("\t" + str(master_row[k]))
        if kmer_row:
            for k in kmer_row.keys()[1:]:  # skip 'kmer_pos'
                if "pvalue" not in k:  # skip p-value columns (already written from master DB)
                    fp.write("\t" + str(kmer_row[k]))
        if gmm_row:
            fp.write("\t" + str(gmm_row[0]))
        fp.write("\n")