This tool maps aliases, previous or withdrawn symbols to current approved HGNC symbols. Checks for any entries that are not gene symbols and fixes capitalizations. Ensembl and RefSeq annotation files sometimes use previous or alias symbols. This tool also checks Ensembl and NCBI annotation files for different genome versions and shows which gene symbol is used. This way a more appropriate gene set can be used to avoid the false negatives in the variant discovery process.
Clone this repository to install.
git clone https://github.com/barslmn/bioscripts
To get the most up to date annotation files run the ./get-data.sh
. This process takes some time as it downloads various gene symbol annotation files from NCBI and Ensembl and HGNC. Check install section at cross-symbol-checker.org to see what is being downloaded.
Installing with docker.
docker pull barslmn/cross-symbol-checker
Main operation is performed at cross-symbol-checker.sh
this file takes a single symbol as its input and returns the status of all of the approved, alias, previous or withdrawn symbols in all available annotation sources.
Another script is check-geneset.sh
which takes one or more gene and target assembly as its arguments. This script is a wrapper around cross-symbol-checker and parses its output.
Example use: Check single gene for all annotation sources:
cross-symbol-checker.sh SHFM6
Check multiple genes for given annotation source:
check-geneset.sh SHFM6 ADA2 -a GRCh37 -s RefSeq
Or check against a custom target:
check-geneset.sh SHFM6 ADA2 -a GRCh37 -s RefSeq -t my.target.bed.gz
Example use with docker: Check single gene for all annotation sources:
docker run barslmn:cross-symbol-checker /opt/cross-symbol-checker/cross-symbol-checker.sh SHFM6
Check multiple genes for given annotation source:
docker run barslmn:cross-symbol-checker /opt/cross-symbol-checker/check-geneset.sh SHFM6 ADA2 -a GRCh37 -s RefSeq
./cross-symbol-checker.sh --help
Usage: ./cross-symbol-checker.sh symbol This script checks given symbol against every possible assembly -c --no-cross-check Don't check annotation sources. Just check alternative gene symbols and exit. -h --help Display this help message and exit. -V Print current version and exit Functionality of the script can be further altered with environment variables. These are mainly used by check-geneset.sh. CSC_SOURCES Limit which annotation sources to be used. CSC_ASSEMBLIES Limit which assemblies sources to be used. CSC_VERSIONS Limit which versions sources to be used. CSC_TARGETS Custom target file. CSC_LOGLVL Set log level. Default is INFO.
./check-geneset.sh --help
Usage: ./check-geneset.sh symbol1 symbol2 -a T2T -s RefSeq -v latest -a --assembly Default assemblies are GRCh37, GRCh38 and T2T. You can use multiple assemblies by quoting them together like -a "GRCh37 GRCh38" -h --help Display this help message and exit. -o --only-target By default check-geneset.sh will run for every assembly. Use this option to check only against given target file. -s --source Default assemblies are GRCh37, GRCh38 and T2T. You can use multiple assemblies by quoting them together like -a "GRCh37 GRCh38" -t --target Custom bed file. If file name has the format: source.assembly.version.bed, columns in the output table will be filled accordingly. Custom file should look like this: chrom start end symbol chr1 1266694 1270686 TAS1R3 chr1 1270656 1284730 DVL1 chr1 1288069 1297157 MXRA8 -v --version There is only one version for all of the assemblies which is latest. You can install older assemblies and specify them with this parameter. -V Print current version and exit