Commit

Update RegionKey documentation
nicolaasuni committed Nov 27, 2018
1 parent b7041d4 commit 612b3fa
Showing 15 changed files with 65 additions and 26 deletions.
53 changes: 46 additions & 7 deletions README.md
@@ -33,6 +33,7 @@ Nicola Asuni. [VariantKey - A Reversible Numerical Representation of Human Genet
* [VariantKey Properties](#vkproperties)
* [VariantKey Input values](#vkinput)
* **[RegionKey](#regionkey)**
* [RegionKey Properties](#rkproperties)
* [Encoding String IDs](#esid)
* [Binary file formats for lookup tables](#binaryfiles)
* [C Library](#clib)
@@ -57,7 +58,7 @@ The [VariantKey Format](#vkformat) doesn't represent universal codes, it only en

This software library can be used to generate and reverse [VariantKey](#vkformat)s and [RegionKey](#regionkey)s.


----------

<a name="quickstart"></a>
## Quick Start
@@ -98,6 +99,8 @@ cd c
make test
```

----------

<a name="hgvdefinition"></a>
## Human Genetic Variant Definition

@@ -421,11 +424,12 @@ Normalized variant | 19 | 29238771 | C | G
* **`ALT`** - *alternate non-reference allele* :
String containing a sequence of [nucleotide letters](https://en.wikipedia.org/wiki/Nucleic_acid_notation).

----------

<a name="regionkey"></a>
## RegionKey

*RegionKey* encodes a human genetic region (defined as the set of *chromosome*, *start position*, *end position* and *strand direction*) in a 64 bit unsigned integer number.
*RegionKey* encodes a human genomic region (defined as the set of *chromosome*, *start position*, *end position* and *strand direction*) in a 64 bit unsigned integer number.

RegionKey allows a region to be represented as a single entity, and provides properties analogous to those listed in [VariantKey Properties](#vkproperties).

@@ -442,7 +446,24 @@ The RegionKey is composed of 4 sections arranged in 64 bit:
STRAND
```

* **`CHROM`** : 5 bit to represent the chromosome.

Example of RegionKey encoding:

```
                  | CHROM | STARTPOS                     | ENDPOS                       | STRAND |
------------------+-------+------------------------------+------------------------------+--------+
Raw region        | chr19 | 29238771                     | 29239026                     | +1     |
Normalized region | 19    | 29238771                     | 29239026                     | +1     |
------------------+-------+------------------------------+------------------------------+--------+
RegionKey bin     | 10011 | 0001101111100010010111110011 | 0001101111100010011011110010 | 01 0   |
------------------+-------+------------------------------+------------------------------+--------+
RegionKey hex     | 98DF12F98DF13792                                                             |
RegionKey dec     | 11015544076609075090                                                         |
------------------+------------------------------------------------------------------------------+
```
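
The example encoding can be reproduced with plain bit shifts. The following is a minimal Python sketch, not the library's API: the function name `encode_regionkey_sketch`, the shift constants and the `STRAND_CODE` table are illustrative assumptions derived from the section widths (5 + 28 + 28 + 2 bits) and from the example row, where the last bit is left at 0.

```python
# Illustrative sketch only: names and constants are assumptions, not the library API.
CHROM_SHIFT = 59     # CHROM occupies the 5 most significant bits
STARTPOS_SHIFT = 31  # STARTPOS occupies the next 28 bits
ENDPOS_SHIFT = 3     # ENDPOS occupies the following 28 bits
STRAND_SHIFT = 1     # STRAND occupies 2 bits; the final bit stays 0

# Strand direction (-1, 0, +1) mapped to its 2 bit code.
STRAND_CODE = {-1: 2, 0: 0, 1: 1}


def encode_regionkey_sketch(chrom, startpos, endpos, strand):
    """Pack numeric region components into a 64 bit unsigned integer."""
    if not 0 <= chrom <= 25:
        raise ValueError("chrom must be a numeric code between 0 and 25")
    if not 0 <= startpos < (1 << 28) or not 0 <= endpos < (1 << 28):
        raise ValueError("positions must fit in 28 bits")
    return (
        (chrom << CHROM_SHIFT)
        | (startpos << STARTPOS_SHIFT)
        | (endpos << ENDPOS_SHIFT)
        | (STRAND_CODE[strand] << STRAND_SHIFT)
    )


key = encode_regionkey_sketch(19, 29238771, 29239026, +1)
print(format(key, "016X"))  # 98DF12F98DF13792
print(key)                  # 11015544076609075090
```

The sketch takes the already normalized numeric chromosome code (19 rather than `chr19`); chromosome string normalization is left out.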

* **`CHROM`** : 5 bit to represent the chromosome.
An identifier from the reference genome. It only has 26 valid values: autosomes from 1 to 22, the sex chromosomes X=23 and Y=24, mitochondria MT=25 and a symbol NA=0 to indicate an invalid value.

```
0 4
@@ -458,7 +479,8 @@ The RegionKey is composed of 4 sections arranged in 64 bit:

The chromosome is encoded as in VariantKey.

* **`STARTPOS`** : 28 bit for the region START position.
* **`STARTPOS`** : 28 bit for the region START position.
The region start position in the chromosome, with the first base having position 0. The largest expected value is just under 250 million, which corresponds to the last base pair in Chromosome 1.

```
0 5 32 63
@@ -474,7 +496,8 @@ The RegionKey is composed of 4 sections arranged in 64 bit:

This section is encoded as in VariantKey POS.

* **`ENDPOS`** : 28 bit for the region END position.
* **`ENDPOS`** : 28 bit for the region END position.
The region end position in the chromosome. The end position is equivalent to (STARTPOS + REGION_LENGTH), such that the base having position ENDPOS is not included in the region.

```
0 33 60 63
@@ -489,7 +512,8 @@ The RegionKey is composed of 4 sections arranged in 64 bit:
```
The end position is equivalent to (STARTPOS + REGION_LENGTH).

* **`STRAND`** : 2 bit to encode the strand direction.
* **`STRAND`** : 2 bit to encode the strand direction.
(optional) The direction of the DNA strand. This is useful when encoding genic regions.

```
0 61 62
@@ -503,7 +527,7 @@ The RegionKey is composed of 4 sections arranged in 64 bit:

```
-1 : 2 dec = "10" bin = reverse (minus) strand direction
0 : 0 dec = "00" bin = unknown strand direction
0 : 0 dec = "00" bin = unknown or not applicable strand direction
+1 : 1 dec = "01" bin = forward (plus) strand direction
```
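
The four sections can be recovered from a key by shifting and masking. The sketch below is a hedged illustration in Python, not the library's own decoder: `decode_regionkey_sketch` and `STRAND_FROM_CODE` are hypothetical names, and rejecting the undefined strand code 3 is an assumption, since the table above only defines codes 0, 1 and 2.

```python
# Illustrative sketch only: names are assumptions, not the library API.
STRAND_FROM_CODE = {0: 0, 1: +1, 2: -1}  # 2 bit code -> strand direction


def decode_regionkey_sketch(key):
    """Split a 64 bit RegionKey into (chrom, startpos, endpos, strand)."""
    chrom = (key >> 59) & 0x1F           # 5 bits
    startpos = (key >> 31) & 0x0FFFFFFF  # 28 bits
    endpos = (key >> 3) & 0x0FFFFFFF     # 28 bits
    strand_code = (key >> 1) & 0x3       # 2 bits
    if strand_code not in STRAND_FROM_CODE:
        raise ValueError("strand code 3 is not defined")
    return chrom, startpos, endpos, STRAND_FROM_CODE[strand_code]


chrom, startpos, endpos, strand = decode_regionkey_sketch(0x98DF12F98DF13792)
print(chrom, startpos, endpos, strand)  # 19 29238771 29239026 1

# ENDPOS is exclusive, so the region length is a plain difference.
print(endpos - startpos)  # 255
```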

@@ -512,6 +536,18 @@ The RegionKey is composed of 4 sections arranged in 64 bit:
This software library provides several functions to operate with *RegionKey* and interact with *VariantKey*.


<a name="rkproperties"></a>
### RegionKey Properties

* It is compatible with VariantKey.
* It can be encoded and decoded on-the-fly.
* Sorting by RegionKey is equivalent to sorting by CHROM and STARTPOS.
* The 64 bit RegionKey can be exported as a single 16 character hexadecimal string.
* Sorting the hexadecimal representation of RegionKey in alphabetical order is equivalent to sorting the RegionKey numerically.
* RegionKey can be used as a main database key to index data by "region". This simplifies common searching, merging and filtering operations.
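
The two sorting properties can be checked with a short Python sketch. It is an illustration under assumptions, not library code: `regionkey_sketch` is a hypothetical helper that packs the numeric chromosome code, positions and 2 bit strand code with the shifts implied by the 5 + 28 + 28 + 2 bit layout. Because of that layout, ties on CHROM and STARTPOS are broken by ENDPOS and then by the strand code.

```python
# Illustrative sketch only: the helper name and shifts are assumptions.
def regionkey_sketch(chrom, startpos, endpos, strand_code):
    """Pack numeric components; strand_code is the 2 bit value (0, 1 or 2)."""
    return (chrom << 59) | (startpos << 31) | (endpos << 3) | (strand_code << 1)


regions = [
    (19, 29238771, 29239026, 1),
    (1, 1000, 2000, 0),
    (19, 500, 900, 2),
    (23, 100, 200, 1),
    (1, 1000, 1500, 1),
]
keys = [regionkey_sketch(*r) for r in regions]

# Numeric key order matches the order of the component tuples.
by_key = [r for _, r in sorted(zip(keys, regions))]
print(by_key == sorted(regions))  # True

# Fixed-width (16 character) hex strings sort the same way as the integers.
hex_keys = [format(k, "016x") for k in keys]
print(sorted(hex_keys) == [format(k, "016x") for k in sorted(keys)])  # True
```

The hexadecimal comparison only holds when every string is zero-padded to the same 16 character width and uses a consistent letter case.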

----------

<a name="esid"></a>
## Encoding String IDs

@@ -523,6 +559,7 @@ This library contains extra functions to encode some string IDs to 64 bit unsign

* The `hash_string_id` function creates a 64 bit unsigned integer hash of the input string.

----------

<a name="binaryfiles"></a>
## Binary files for lookup tables
@@ -565,6 +602,8 @@ https://sourceforge.net/projects/variantkey/files/
1800c351f61f65d3 A AAGAAAGAAAG
```

----------

<a name="clib"></a>
## C Library

2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
5.4.0
5.4.1
2 changes: 1 addition & 1 deletion c/doc/Doxyfile
@@ -32,7 +32,7 @@ PROJECT_NAME = "VariantKey"
# This could be handy for archiving the generated documentation or
# if some version control system is used.

PROJECT_NUMBER = 5.4.0
PROJECT_NUMBER = 5.4.1

# Using the PROJECT_BRIEF tag one can provide an optional one line description
# for a project that appears at the top of each page and should give viewer
2 changes: 1 addition & 1 deletion c/resources/debian/control
@@ -14,6 +14,6 @@ Depends: ${shlibs:Depends}, ${misc:Depends}
Description: Numerical Encoding for Human Genetic Variants.
Provides C header-only files for:
VariantKey, a reversible numerical encoding schema for human genetic variants.
RegionKey, a reversible numerical encoding schema for human genetic regions.
RegionKey, a reversible numerical encoding schema for human genomic regions.
ESID, a reversible numerical encoding schema for genetic string identifiers.

2 changes: 1 addition & 1 deletion c/resources/rpm/rpm.spec
@@ -21,7 +21,7 @@ Provides: %{gh_project} = %{version}
%description
Provides C header-only files for:
VariantKey, a reversible numerical encoding schema for human genetic variants.
RegionKey, a reversible numerical encoding schema for human genetic regions.
RegionKey, a reversible numerical encoding schema for human genomic regions.
ESID, a reversible numerical encoding schema for genetic string identifiers.

%build
2 changes: 1 addition & 1 deletion c/src/variantkey/regionkey.h
@@ -34,7 +34,7 @@
* @file regionkey.h
* @brief RegionKey main functions.
*
* The functions provided here allows to generate and process a 64 bit Unsigned Integer Keys for Human Genetic Regions.
* The functions provided here allow generating and processing 64 bit Unsigned Integer Keys for Human Genomic Regions.
* The RegionKey is sortable for chromosome and start position, and it is also fully reversible.
*/

2 changes: 1 addition & 1 deletion conda/c.src/meta.yaml
@@ -1,6 +1,6 @@
package:
name: variantkey-src
version: 5.4.0
version: 5.4.1

source:
path: ../..
2 changes: 1 addition & 1 deletion conda/c.vk/meta.yaml
@@ -1,6 +1,6 @@
package:
name: variantkey-vk
version: 5.4.0
version: 5.4.1

source:
path: ../..
8 changes: 4 additions & 4 deletions conda/python-class/meta.yaml
@@ -1,6 +1,6 @@
package:
name: pyvariantkey
version: 5.4.0
version: 5.4.1

source:
path: ../..
@@ -14,11 +14,11 @@ requirements:
- setuptools
- numpy >=1.15.0
build:
- variantkey >=5.4.0
- variantkey >=5.4.1
- numpy >=1.15.0
run:
- python
- variantkey >=5.4.0
- variantkey >=5.4.1
- numpy >=1.15.0

test:
@@ -30,7 +30,7 @@ test:
- pytest-cov
- pytest-benchmark
- pycodestyle
- variantkey >=5.4.0
- variantkey >=5.4.1
- numpy >=1.15.0
imports:
- pyvariantkey
2 changes: 1 addition & 1 deletion conda/python/meta.yaml
@@ -1,6 +1,6 @@
package:
name: variantkey
version: 5.4.0
version: 5.4.1

source:
path: ../..
2 changes: 1 addition & 1 deletion conda/r/meta.yaml
@@ -1,6 +1,6 @@
package:
name: r-variantkey
version: 5.4.0
version: 5.4.1

source:
path: ../..
4 changes: 2 additions & 2 deletions go/src/variantkey.go
@@ -611,15 +611,15 @@ func (mf TMMFile) NormalizedVariantKey(chrom string, pos uint32, posindex uint8,

// --- REGIONKEY ---

// TRegionKey contains a representation of a genetic region key
// TRegionKey contains a representation of a genomic region key
type TRegionKey struct {
Chrom uint8 `json:"chrom"`
StartPos uint32 `json:"startpos"`
EndPos uint32 `json:"endpos"`
Strand uint8 `json:"strand"`
}

// TRegionKeyRev contains a genetic region components
// TRegionKeyRev contains the components of a genomic region
type TRegionKeyRev struct {
Chrom string `json:"chrom"`
StartPos uint32 `json:"startpos"`
4 changes: 2 additions & 2 deletions python-class/setup.py
@@ -33,7 +33,7 @@ def run(self):

setup(
name='pyvariantkey',
version='5.4.0.1',
version='5.4.1.1',
keywords=('variantkey variant key genetic genomics'),
description="VariantKey Python wrapper class",
long_description=read('../README.md'),
@@ -51,7 +51,7 @@ def run(self):
],
install_requires=[
'numpy>=1.15.0',
'variantkey>=5.4.0.1',
'variantkey>=5.4.1.1',
],
extras_require={
'test': [
2 changes: 1 addition & 1 deletion python/setup.py
@@ -30,7 +30,7 @@ def run(self):

setup(
name='variantkey',
version='5.4.0.1',
version='5.4.1.1',
keywords=('variantkey variant key genetic genomics'),
description="VariantKey Bindings for Python",
long_description=read('../README.md'),
2 changes: 1 addition & 1 deletion r/variantkey/DESCRIPTION
@@ -1,6 +1,6 @@
Package: variantkey
Title: Genetic VariantKey
Version: 5.4.0.1
Version: 5.4.1.1
Authors@R: person("Nicola", "Asuni", email = "[email protected]", role = c("aut", "cre"))
Description: Tools to generate and process 64 bit Unsigned Integer Keys for Human Genetic Variants.
The VariantKey is sortable for chromosome and position,