Interactive genomic annotation at scale.
Most available interactive genome browsers show features aligned along a single chromosome at a time. There are some situations where genomic features need to be visualised in two dimensions, with the same or with different genomes as the X and Y axes. Examples shown here include sequence similarity dotplots between a pair of haplotypes and between reference genomes from two closely related species, and scatter plots of unbinned individual HiC contact pairs within and between two haplotypes that show fine scale detail.
Browse 1D charts and 2D heatmap plots, that scale themselves when zoomed from the entire genome down to individual points and back.
Designed for Galaxy interactive tools. Works on a laptop. Built on the Holoviz ecosystem.
Help wanted. PR and suggestions welcomed.
This is new work in progress.
Development started in late October 2024. A draft framework description and specification is here.
This proof of concept shows how millions of features can be generated along a couple of chromosomes and plotted using rasterize and datashader, with each tap converted into contig and offset. It runs in an IPython notebook, or if the dependencies are available, can be served from this repository's root, as:
panel serve scripts/holoSeq_random.py --show
Edit the default 10000 xmax value to get a sense of scale capacity - 10M is not a problem. There is very little code needed for plotting. Most of the code is needed to create some sample contigs of fixed length into some arbitrary ordering along the axes. Their lengths are cumulated and the resulting array represents axis offsets to the starting nucleotide of each contig, used for calculating the x and y coordinates for randomly generated points so the are correctly located in an internally consistent 2D space.
This is proposed as a generalisable model for a linear display of genomic features located on a set of independent contigs in holoSeq displays.
# see https://github.com/holoviz-topics/holoSeq
# Needs dependencies - uncomment and run the next line to install them in a notebook
# ! pip install datashader dask[dataframe] holoviews[recommended] pandas matplotlib bokeh
# or in a python venv, use
# pip -r requirements.txt
from bisect import bisect_left
from collections import OrderedDict
import itertools
import numpy as np
import holoviews as hv
import panel as pn
from holoviews.operation.datashader import (
rasterize,
dynspread,
)
hv.extension("bokeh")
pn.extension()
def showTap(x, y):
# populate a string widget with tap coordinates translated into contig and offset
if np.isnan(x) or np.isnan(y):
s = "Click on plot for coordinates as contig:offset"
else:
i = bisect_left(hstarts, x)
chrx = hnames[i - 1]
offsx = x - hstarts[i - 1]
i = bisect_left(hstarts, y)
chry = hnames[i - 1]
offsy = y - hstarts[i - 1]
s = "X axis %s:%d Y axis %s:%d" % (chrx, offsx, chry, offsy)
str_pane = pn.pane.Str(
s,
styles={"font-size": "10pt", "color": "darkblue", "text-align": "center"},
width=width,
)
return str_pane
xwidth = 3000000
xmax = 10000
width = 1000
rng = np.random.default_rng(1) # all plots will be identical !
hlen = [xwidth/2, xwidth/4, xwidth/8, xwidth/8]
hstarts = list(itertools.accumulate(hlen))
hstarts.insert(0,0)
hnames = ['chr%d' % i for i in range(1,5)]
xcoords = np.array([rng.uniform(0,xwidth) for i in range(xmax)])
yoffset = rng.normal(0,100000,xmax)
ycoords = np.array(xcoords + yoffset)
title = "%d random points" % xmax
ticks = [(hstarts[i], hnames[i]) for i in range(4)]
points = hv.Points((xcoords,ycoords))
stream = hv.streams.Tap(source=points, x=np.nan, y=np.nan)
showloc = pn.bind(showTap, x=stream.param.x, y=stream.param.y)
pnc = pn.Column(
showloc,
pn.pane.HoloViews(
dynspread(
rasterize(points)
.relabel("%s" % title)
.opts(
cmap="inferno",
cnorm="log",
colorbar=True,
width=width,
height=width,
xticks=ticks,
yticks=ticks,
xrotation=45,
fontsize={"xticks": 7, "yticks": 7},
tools=["tap"],
scalebar=True,
scalebar_range="x",
scalebar_location="top_left",
scalebar_unit=("bp"),
)
)
),
)
pnc.servable(title=title, )
This is proposed as a generalisable model for a linear display of genomic features located on a set of independent contigs in holoSeq displays.
Operating with very large inputs, an important element of the holoSeq model is an intermediate compressed pre-computed plotting format, including metadata. The intermediate file can be much smaller than text input formats, and will always be correctly rendered by the plotting program.
Precomputing the plot coordinates for pairs from a 60GB input PAF file will take a couple of hours. Best done once, because the compressed precomputed plot can be widely shared for viewing. A few hundred million pairs from that PAF file will take 10-12 minutes for panel to load and display. Galaxy's interactive tools make it very straightforward to deploy holoSeq or other holoViz applications to give users interactive access to visualisations of very large data.
Requires both Python and the python-venv module installed. These steps ensure that the virtual environment directory can be made and deleted without any other changes to your system.
- Clone this repository in a convenient and disposable directory:
git clone https://github.com/holoviz-topics/holoSeq
- Change directory to that new holoSeq subdirectory, prepare the python virtual environment needed, and start the demonstration
cd holoSeq
python -m venv venv
. venv/bin/activate
pip install -r requirements.txt
panel serve scripts/holoseq_display.py --show --args --inFile data/mUroPar1H1H2.paf_cisH1_hseq.gz --size 1000
Expect to see output shown below, and a web browser window should pop open. Takes 10-20 seconds to read the 3.4M pairs and to show the interactive visualisation:
panel serve scripts/holoseq_display.py --args --inFile data/mUroPar1_cis1.hseq.gz --size 1000
2024-10-29 17:06:33,645 Starting Bokeh server version 3.6.0 (running on Tornado 6.4.1)
2024-10-29 17:06:33,646 User authentication hooks NOT provided (default user enabled)
2024-10-29 17:06:33,649 Bokeh app running at: http://localhost:5006/holoseq_display
2024-10-29 17:06:33,649 Starting Bokeh server with process id: 388673
--show
option opens a browser window and immediately starts processing the script.--size
option determines both x and y plot window size.--inFile
must be the path of a prepared hseq compressed coordinate data file, such as the supplied demonstration HiC heatmap.- Delay will vary with processing power and the number of coordinates
When the plot appears, the view is controlled by the usual Bokeh tools in the toolbar on the left of the plot.
- Mouse click anywhere on the plot to see the coordinates.
- Zoom with the mouse scroll wheel
- Pan by grabbing with the left mouse button.
- Only pairs involving H1 contigs (H1 cis) are used in the demonstration.
Briefly, the framework creates the minimum data required to create a plot. A genome lengths file is required, and the named contigs can be reordered by name or length. The axes are defined by the ordering. The lengths are cumulated to give an offset to the first nucleotide of each contig, so the track can be read and feature locations converted into the plot coordinate system, and stored as a compressed intermediate file. The display application reads these pre-computed plot coordinate files, with enough metadata about the reference sequence to add tic marks to the axes and to back-calculate the stream of user tap coordinates. A converter for PAF to compressed hseq format for input is available and was used to generate the demonstration. Bigwig is working and other common genomic annotation formats, such as gff and vcf will follow.
Multiple input files will produce a stack of plots that work independently:
panel serve scripts/holoseq_display.py --show --args --inFile \
data/mUroPar1.paf_cis1.hseq.gz small.paf_cis1.hseq.gz --size 1000
Data size on disk using the hseq format provides more than an order of magnitude shrinkage, from about 300MB of input PAF to 23MB of hseq coordinate format data.
The original PAF is 1.2GB and contains about 14M rows. About 3.6M pairs of points had both contigs on the mUroPar1 paternal haplotype - so about 1/4 of all rows. The sample used in the demonstration is a 23M gzip containing all the information needed to plot these 3.6M pairs.
The demonstration holoSeq plot hseq format coordinate data was prepared from a PAF file. That was
output from a Galaxy VGP workflow, mapping the arima HiC reads against the assembled haplotypes,
running Bellerophon to remove chimeric Arima reads and merging into a bam containing all HiC pairs.
These were converted to a sam file with header using samtools view
, then processed into a PAF
using this AWK script:
#!/bin/awk -f
{
if ($1=="@SQ") { chromlen[substr($2,4)] = substr($3,4) }
else
if (substr($1,0,1) == "@") {foo = 1}
else
{
if ($2 AND 16) { sign1="-"};
if (! ($2 AND 16)) { sign1="+"};
if ($2 AND 32) { sign2="-"};
if (! ($2 AND 32)) { sign2="+"};
if ($7 == "=")
{ printf ("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $3, chromlen[$3], $4, $4+length($10), sign1, $3, chromlen[$3], $8, $8 + length($10), length($10), length($10), 255)}
else
{ printf ("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", $3, chromlen[$3], $4, $4+length($10), sign1, $7, chromlen[$7], $8, $8 + length($10), length($10), length($10), 255)}
}
}
This repository includes a python script conversion utility for PAF inputs,
scripts/holoSeq_prepare_paf.py
, that works with the awk PAF output and converts it into a
compressed coordinate file. The compressed demonstration plotting data were prepared using:
python scripts/holoSeq_prepare_paf.py \
--inFile data/mUroPar1.paf \
--title "VGP Arctic Ground Squirrel arima HiC contact matrix, paternal haplotype"
for Mashmap paf inputs, use:
python scripts/holoSeq_prepare_paf.py \
--inFile data/hg002_2k99.paf \
--title "hg002 Mashmap" \
--hap_indicator None \
--contig_sort length
This step produces outputs containing subsets of contact point pairs.
These can be viewed like the supplied local demonstration example using panel
as described above
The images are highly interactive in a Jupyter notebook but these static screenshots give some idea of what they can do.
- ChrY
Zooming in to the H1 Y chromosome shows
There is an obvious yellowish diagonal line, where the majority of contact pairs have their two ends close together along the Y chromosome. There is a large, dense and chaotic rectangular pattern at one end, and some dense linear features at an angle to the main diagonal. There is a variable, thin scattering of off-diagonal dots, representing the less common pairs with their two ends much further apart on the Y chromosome.
Perhaps there is some biology there because it seems to correspond to the very large region of non B DNA repeats in the Human T2T HG002 Y chromosome, described in https://www.nature.com/articles/s41586-023-06457-y and seen in this figure from the paper.
- Some validation of point coordinates
Here are the first few pairs on chr1 for H1 and H2 - they are about 20k from the start of the assembly - presumably no HiC pairs in the actual telomeres.
and the H1-H2 pairs
The first records in the PAF for a pair nearest offset 0 are at 19087/19310 and 19310/19087 so expect to see the very first points in SUPER1 of H1 somewhere near those coordinates.
SUPER_1H1 284260672 19087 19238 - SUPER_1H1 284260672 19310 19461 151 151 255
SUPER_1H1 284260672 19310 19461 - SUPER_1H1 284260672 19087 19238 151 151 255
Zooming right in to the start of SUPER1 on H1 shows that the single point shown above resolves into a pair of points at the expected coordinates, confirming that the image may be showing what's in the data...
- Interesting features
Below are the points in H1 and H2 in the same region of the trans plot where I noticed an odd looking break in the main diagonal as an example of what's possible in the notebook.
Original proof of concept application: hapsHiCpafHoloview.ipynb is a plotter for paired HiC contact pointss.ss, input as a PAF.
This uses HiC data from the Arctic Ground Squirrel mUroPar1. It shows the three images and about 14 million points.
The matrix is symmetrical so only one half is needed but it is visually more impressive as shown
The notebook makes 3 interactive plots showing H1/H1, H2/H2 "cis" pairs, and H1/H2 "trans" pairs. These work well with 14 million points. Occasional pauses before rescaling and refresh, but can pan and wheel zoom down from all ponts to single pairs easily.
Click anywhere to see the coordinates or drag, and use the mouse wheel to zoom. When run locally with 4.2.5, it is astounding. 14 million points. The old version does not do datashader properly so goes all blocky as you zoom.
There's ~250 lines of code to make the axes - must be ordinal for datashader to work so have to map the contigs end to end before can assign x/y grid coordinates to the pairs read from the paf input file.
That code already contains a PAF converter, and the generic IPython notebook visualiser will be built using the existing visualisation components, adding 1D converters and tracks, with grouping by common reference sequence in the header and vertical stack layout.
Contigs and sizes have to be inferred from the data so it's 2 passes. The actual holoviews/panel code is ~20 lines once the data are in a grid. Holoviews is dynamite.
Clone the repo and cd
into it.
git clone https://github.com/amaloney/holoSeq
cd holoSeq
Create a virtual environment using your favorite method, e.g. conda
, venv
, poetry
, pixi
etc. We will use conda
as an example. No conda
environment file is supplied within the repo,
however, we can generate one using
pyproject2conda
. See pyproject2conda
on how
to install it in your system.
pyproject2conda yaml --file pyproject.toml \
--no-header \
--name holoseq-dev \
--channel conda-forge \
--python-include infer \
--extra dev --extra notebooks --extra test \
--output environment.yaml
conda create env --file environment.yaml
conda activate holoseq-dev
Next install holoSeq
into the virtual environment, and install the pre-commit hooks. If you would
like to contribute to work with Jupyter notebooks, be sure to install notebooks
in the pip
command.
pip install --editable .
# For installing notebooks development as well, use the following command instead of the one above.
#pip install --editable .[notebooks]
pre-commit install
Run the following command to begin an interactive data exploration app in your browser.
panel serve --session-token-expiration 86400 scripts/squirrel.py \
--show --args --inFile data/mUroPar1H1H2.paf_cisH1_hseq.gz --size 1000