This repository has been archived by the owner on Oct 17, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 27
Lift symmetric CpG methcounts from hg19 to hg38
Benjamin Decato edited this page Mar 17, 2016
·
2 revisions
cpgmap=~jqu/cmb-01/cpgmaps/human/hg19_cpg.over_hg38_cpg.map
~jqu/panfs/tools/git/jqujqu_python_scripts/netmap.py -n $cpgmap -o Sample_hg38.meth.unsorted Sample_hg19.meth
LC_ALL=C sort -k1,1 -k2,2g Sample_hg38.meth.unsorted -o Sample_hg38.meth && rm Sample_hg38.meth.unsorted
A $cpgmap
file looks like this
chr1 895356 895357 + chr1 959976 959977 +
chr1 895358 895359 + chr1 959978 959979 +
chr1 895365 895366 + chr1 959985 959986 +
The first 4 columns are the cytosine locations in the old assembly, and the last 4 columns are corresponding cytosine locations in the new assembly.
The python script takes both formats for input methcounts
file. The output will be in the new methcounts
format.
Here goes how the cpg map was generated.
- Get CpG (2bp) locations in the old assembly and in the new assembly. We need FASTA files for both old and new assemblies. You can also construct
TMP_OLD_CPG.BED
andTMP_NEW_CPG.BED
frommethcounts
files if available.
old_assembly=~/panfs/cache/human/hg19_chroms
new_assembly=~/panfs/cache/human/hg38_chroms
~jqu/cmb-01/tools/py_scripts/fa_motif.py -d ${old_assembly} -m CG -o TMP_OLD_CPG.BED
~jqu/cmb-01/tools/py_scripts/fa_motif.py -d ${new_assembly} -m CG -o TMP_NEW_CPG.BED
- Name old CpG sites with their old location
awk '{OFS="\t"; print $1, $2, $3, $1":"$2":"$3":"$4":"$6, 0, $6}' < TMP_OLD_CPG.BED > TMP_OLD_CPG_NAMED.BED
- Download proper chain file from UCSC Genome Browser. Use
liftOver
to map CpG sites to new assembly and sort by new locations.
chain=~/panfs/cache/liftOverchains/hg19ToHg38.over.chain.gz
liftOver TMP_OLD_CPG_NAMED.BED $chain TMP_OLD_CPG_NAMED_LIFTED.BED UNMAPPED
LC_ALL=C sort -k1,1 -k2,2g -k3,3g TMP_OLD_CPG_NAMED_LIFTED.BED -o TMP_OLD_CPG_NAMED_LIFTED.BED.SORTED
- Match lifted position with true CpG location with
bedtools
. Only keep mapping locations without indels, and output single base-pair location for cytosines in CpG dinucleotides on the "+" strand.
bedtools closest -a TMP_OLD_CPG_NAMED_LIFTED.BED.SORTED -b TMP_NEW_CPG.BED | \
awk '{OFS="\t";
if ($8==$2 && $9==$3) {split($4, a, ":"); print a[1],a[2],a[2]+1,"+", $7,$8,$8+1,"+"}
}' > TMP_CPG_MAP
- Keep one entry from duplicated sites and remove temporary files
export LC_ALL=C;
sort -k5,5 -k6,6g -u TMP_CPG_MAP | sort -k1,1 -k2,2g -u /dev/stdin > old_cpg.over_new_cpg.map
rm TMP* UNMAPPED
cpgmap=old_cpg.over_new_cpg.map
Your CpG map file should be named with the old and new assemblies.