- Introduction
- Tools
- Installation
- Usage
- Use
u4falign
for manipulating alignment results - Difference between
read-SAM
andfrag-SAM
Falign
is a sequence alignment toolkit for long noisy chromosome conformation capture (3C) reads, such as Pore-C.
Falign
is written in C and C++ programming language.
Falign
relies on the HTSLIB
library. We have dowanloaded one copy of that and save in ./htslib/htslib-1.19.1.tar.bz2
to make it convenient for off-line installation.
Two tools are released together in this toolkit.
falign
. The alignment tool.dip3d
. A utility used internally byDip3D
, a diploid human 3D genome construction algorithm.
$ git clone https://github.com/xiaochuanle/Falign.git
$ cd Falign
$ ./htslib/install-htslib.sh
In Step 2 above, we install HTSLIB
in directory ./htslib/htslib
:
$ ls ./htslib/htslib
bin include lib share
We have to tell Falign
where to link HTSLIB
. To do this, we export the location of this directory to an environment variable name LIBHTS
:
$ export LIBHTS=$(pwd)/htslib/htslib
$ echo $LIBHTS
/share/home/chuanlex/chenying/data/Falign/htslib/htslib
Before installing Falign
, make sure you have executed Step 3 above. Otherwise the linker will complain that it cannot find -lhts
.
$ cd src/ && make && cd ..
Falign
is installed in the directory Linux-amd64/bin/
:
$ ls Linux-amd64/bin/
dip3d falign
The directory ara-sample-data/
provides sample reference file reference.fa.gz
and sample reads file reads.fq.gz
.
./Linux-amd64/bin/falign build-repeat ara-sample-data/reference.fa.gz ara-sample-data/reference.fa.gz.repeat.bed
./Linux-amd64/bin/falign -repeat_bed ara-sample-data/reference.fa.gz.repeat.bed -num_threads 4 ara-sample-data/reference.fa.gz ara-sample-data/reads.fq.gz > map.paf
The mapping results are output to the file map.paf
in PAF format.
Falign
supports different output formats (specified by the -outfmt
option):
- PAF
- SAM
- BAM
- FRAG-SAM
- FRAG-BAM
In each output result (note that every alignment has for offsets: read start, read end, reference start, reference end), falign
adds the following additional fields:
qS:i:
the nearest restiction enzyme site to the read start positionqE:i:
the nearest restiction enzyme site to the read end positionvS:i:
the nearest restriction enzyme site to the reference start positionvE:i:
the nearest restriction enzyme site to the reference end positionpi:f:
percentage of identity of the alignmentSA:Z:
a homologous map of the fragment
By convention, in SAM
mapping results, a read may contain multiple mapping results. For saving space, the read sequence is usually only presented in the first mapping result of this read. In the second to last mapping results of this read, the sequence field
is filled by a start *
. falign
outputs SAM
mapping results in this manner. And we call SAM
mapping results output in this manner read-SAM
. Since a Pore-C read always contains multiple fragments, a Pore-C read usually contains many mapping results:
0cd79600-51cf-4255-a6f7-0e9660721e85 16 2 1794202 1 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 16 2 1791136 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 0 2 1733063 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 16 4 12727850 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85 0 2 1713207 60 ...
In the example above, the read 0cd79600-51cf-4255-a6f7-0e9660721e85
contains five alignment results.
Some tools such as whatshap taking BAM
format as input will complain if there exist too many duplicate read names. In this case we suggest transfer read-SAM
to frag-SAM
. In frag-SAM
every fragment is treadted as an individual read. We can use the sam2frag-sam
command in the u4falign
to transfer read-SAM
to frag-SAM
:
$ u4falign sam2frag-sam map.sam > frag-map.sam
After transformation, we have
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:000:0000000000:0001794201 16 2 1794202 1 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:001:0000000000:0001791135 16 2 1791136 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:002:0000000000:0001733062 0 2 1733063 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:003:0000000001:0012727849 16 4 12727850 60 ...
0cd79600-51cf-4255-a6f7-0e9660721e85_0000000001:004:0000000000:0001713206 0 2 1713207 60 ...
In frag-SAM
, the sequence field
in every alignment results is represented by the fragment sequence (not the whose read sequence). Note that we add a suffix to every read name to avoid duplicate read names in the SAM
file. The meanings of the fields in the suffix string is
read-id:fragment-id:reference-sequence-id:reference-mapping-position
- Chuan-Le Xiao, [email protected]
- Ying Chen, [email protected]
GPLv3