The core functionality of CIViCutils lies in its matching framework, which methodically compares and associates the evidence records retrieved from CIViC with the molecular alterations found in the input file. After querying CIViC for all input genes and collecting their available variant records and associated clinical data, CIViCutils attempts to link the retrieved evidence to the input aberrations automatically by expressing both in a common nomenclature. The approach used by the package to generate this standardized format depends on the type of molecular alteration being queried, as summarized below.
For SNVs and InDels, the matching of CIViC data is based on the variant descriptions in the record names of the database, as well as on their HGVS expressions whenever these have been reported by curators.
Based on the coding DNA (c.) and protein (p.) HGVS annotations supplied in the input, a specific set of rules is applied which leverages the naming scheme of variant records in CIViC to translate HGVS expressions from both sources into a standardized format. Translation is needed because variant names in the knowledgebase usually follow a very specific HGVS-like convention, in which protein alterations are described without the "p." prefix and with one-letter amino acid codes (e.g. "V600E"). Moreover, not all CIViC records have an HGVS expression available, and those that do often deviate from the guidelines recommended by the Human Genome Variation Society, e.g. by using one-letter amino acid codes or different representations of nonsense variants (such as "*" or "Ter"), frameshifts (e.g. "T157FS" or "p.F76Lfs*56") and silent mutations (e.g. "p.Pro61=" or "p.Pro61Pro").
The expectation is that, once translated into a common nomenclature, HGVS expressions for the same variant become equivalent and can hence be matched between input file and database. Rules are applied only at the level of p.HGVS expressions, as this is the most commonly used naming scheme for records in the database. In this context, CIViC variant "V600E" would be translated into HGVS expression "p.Val600Glu", so that the latter can be matched when provided as an input aberration. On the other hand, c.HGVS expressions are not affected by these rules and remain unchanged, as they are more complex and non-protein-coding variants (i.e. those without a p.HGVS annotation available) are scarce in CIViC. Thus, queried c.HGVS expressions can only be matched when they are found to be exactly identical in the database. Perfect matches between CIViC variant records and input SNVs/InDels (i.e. identical nomenclature found for both) are classified as tier 1.
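The translation of a simple one-letter substitution name into three-letter p.HGVS notation can be sketched as follows. This is a minimal illustration rather than the package's actual implementation: the function name `to_three_letter_hgvs` and the regular expression are hypothetical, and only plain substitutions (including nonsense variants written with "*") are handled, so frameshifts such as "T157FS" are deliberately left untranslated.

```python
import re

# Standard one-letter to three-letter amino acid codes; "*" denotes a stop codon.
AA_THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}

# Hypothetical pattern: optional "p." prefix, reference residue,
# protein position, alternate residue (e.g. "V600E" or "p.V600E").
SIMPLE_SUBSTITUTION = re.compile(r"^(?:p\.)?([A-Z*])(\d+)([A-Z*])$")


def to_three_letter_hgvs(name):
    """Translate a one-letter substitution name into three-letter p.HGVS.

    Returns None when the name is not a simple substitution.
    """
    match = SIMPLE_SUBSTITUTION.match(name.strip())
    if match is None:
        return None
    ref, position, alt = match.groups()
    if ref not in AA_THREE_LETTER or alt not in AA_THREE_LETTER:
        return None
    return f"p.{AA_THREE_LETTER[ref]}{position}{AA_THREE_LETTER[alt]}"


print(to_three_letter_hgvs("V600E"))
```

Once both the record name and the input annotation are in the same form, a tier 1 match reduces to a string comparison.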
Furthermore, positional matches between input and CIViC variants, where the affected protein position is the same but the amino acid change is different (e.g. "p.Pro61Arg" and "p.Pro61Lys"), are also considered by CIViCutils and receive a tier 2 classification. The package also attempts to retrieve so-called "bucket" variants in the knowledgebase, which comprise collections of genomic alterations sharing a particular feature or category. One example of such general variants are those defining groups of mutations located at the same amino acid position of a protein, e.g. "G12" or "V600", found to be relevant in the context of precision oncology. Records of this kind are prioritized for a tier 2 match over other positional hits which may have been found in CIViC for the same input variant. On the other hand, other general variants are known to be commonly used as record names in the database, e.g. "MUTATION" (which indicates an unspecified set of variants in a given gene or genomic region). CIViCutils treats records of this kind as potential non-exact matches (tier 1b) of the input variant whenever they are retrieved by the framework. In addition, if impact information is optionally provided for the variant, it is also used to find potential tier 1b hits in CIViC, e.g. "3' UTR MUTATION", "5' UTR MUTATION", "TRUNCATING MUTATION" or "FRAMESHIFT MUTATION". In the same manner, if exon or intron information is supplied for the variant (which can only happen when the corresponding impact is also available), it is leveraged as well during the tier 1b matching of CIViC records, e.g. "INTRON [N] MUTATION" (when the impact is intronic), "EXON [N] MUTATION" (when the impact is exonic), or "EXON [N] FRAMESHIFT" (when the impact is a frameshift), where [N] corresponds to the exon or intron number affected by the given variant.
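The tier logic described for SNVs/InDels can be illustrated with a short sketch. Everything named here is an assumption made for illustration: the function names, the impact labels ("intronic", "exonic", "frameshift"), the tier strings, and the order in which the tiers are checked are not taken from the package. Record expressions are assumed to be already translated into three-letter p.HGVS form, so that a bucket record such as "P61" appears as "p.Pro61".

```python
import re

# Start of a three-letter p.HGVS expression: reference residue and position.
P_HGVS_POSITION = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)")


def general_record_names(impact=None, exon=None, intron=None):
    """List candidate tier 1b record names for one input variant.

    The impact labels are placeholders, not the package's vocabulary.
    """
    names = ["MUTATION"]
    if impact == "frameshift":
        names.append("FRAMESHIFT MUTATION")
    if impact == "intronic" and intron is not None:
        names.append(f"INTRON {intron} MUTATION")
    if impact == "exonic" and exon is not None:
        names.append(f"EXON {exon} MUTATION")
    if impact == "frameshift" and exon is not None:
        names.append(f"EXON {exon} FRAMESHIFT")
    return names


def classify_record(input_p_hgvs, record_p_hgvs, record_name, general_names):
    """Return 'tier_1', 'tier_1b', 'tier_2' or None for one CIViC record."""
    if record_p_hgvs is not None and record_p_hgvs == input_p_hgvs:
        return "tier_1"  # identical nomenclature
    if record_name in general_names:
        return "tier_1b"  # general, non-exact record name
    query = P_HGVS_POSITION.match(input_p_hgvs)
    record = P_HGVS_POSITION.match(record_p_hgvs or "")
    if query and record and query.groups() == record.groups():
        return "tier_2"  # same residue and position, different change
    return None


def prioritize_buckets(tier_2_hgvs):
    """Keep only bucket records (position without a change) when any exist."""
    buckets = [h for h in tier_2_hgvs if P_HGVS_POSITION.fullmatch(h)]
    return buckets or list(tier_2_hgvs)


names = general_record_names(impact="frameshift", exon=5)
print(classify_record("p.Pro61Arg", "p.Pro61Lys", "P61K", names))
```

Checking the exact match first and the positional match last is a design choice of this sketch; it simply ensures that each record receives the most specific tier it qualifies for.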
As opposed to SNVs/InDels, CNVs do not have HGVS annotations available in their CIViC records. For this reason, the retrieval of CIViC evidence for CNVs depends exclusively on the descriptions of their variant records in the database. This simplifies the matching procedure greatly, as only a very small number of terms are known to be used for naming CIViC records of this type of genomic alteration, just as only a few CNV terms are possible in the input file.
In this manner, input CNV types "AMPLIFICATION", "AMP", "GAIN", "DUPLICATION" and "DUP" are all matched by CIViCutils with CIViC variant record name "AMPLIFICATION", while terms "DELETION", "DEL" and "LOSS" are linked to both CIViC record names "DELETION" and "LOSS" (tier 1 in both cases). In addition, all the previously listed input terms can also be matched with record name "COPY NUMBER VARIATION" in the database (tier 1). On the other hand, CIViCutils further considers matches of the input CNVs to special cases of CIViC records which describe large genomic alterations affecting complete exons, such as "EXON 1-2 DELETION", "EXON 5 SKIPPING MUTATION", "3' EXON DELETION" or "5' EXON DELETION". Associations of this type are regarded as positional matches (tier 2) by the package. Note that records can only be matched when they refer to the same type of copy number alteration as the input CNV, e.g. "EXON 1-2 DELETION" could match "DELETION" but not "AMPLIFICATION", while "EXON 1-2 COPY NUMBER ALTERATION" could match both.
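The CNV rules above amount to a small lookup, sketched below. The function names and tier strings are hypothetical, and the exon-level check is a simplified heuristic based only on the record name ending: it covers names such as "EXON 1-2 DELETION" or "3' EXON DELETION", but makes no attempt to decide how a record like "EXON 5 SKIPPING MUTATION" should be handled.

```python
# Input CNV terms grouped by direction, as listed in the text.
AMPLIFICATION_TERMS = {"AMPLIFICATION", "AMP", "GAIN", "DUPLICATION", "DUP"}
DELETION_TERMS = {"DELETION", "DEL", "LOSS"}


def cnv_tier_1_names(input_cnv):
    """Return the CIViC record names that are tier 1 matches for an input CNV."""
    term = input_cnv.strip().upper()
    if term in AMPLIFICATION_TERMS:
        return ["AMPLIFICATION", "COPY NUMBER VARIATION"]
    if term in DELETION_TERMS:
        return ["DELETION", "LOSS", "COPY NUMBER VARIATION"]
    raise ValueError(f"Unsupported CNV type: {input_cnv!r}")


def match_cnv(input_cnv, record_name):
    """Return 'tier_1', 'tier_2' or None for one CIViC record name."""
    names = cnv_tier_1_names(input_cnv)
    record = record_name.strip().upper()
    if record in names:
        return "tier_1"
    # Exon-level records count as positional matches, but only when they
    # end in a term compatible with the direction of the input CNV.
    compatible = names + ["COPY NUMBER ALTERATION"]
    if "EXON" in record and any(record.endswith(n) for n in compatible):
        return "tier_2"
    return None


print(match_cnv("DEL", "EXON 1-2 DELETION"))
```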
As with CNVs, expression records in CIViC do not have HGVS annotations available, and thus the matching of evidence is also based exclusively on a reduced set of known variant record names, commonly used to designate this kind of molecular alteration in the database.
In this case, log fold-change values resulting from differential gene expression analyses must be provided for each queried gene. Genes with log fold-changes above zero (i.e. logFC > 0) are matched to CIViC variant name "OVEREXPRESSION", while log fold-changes below zero (i.e. logFC < 0) are linked with database record "UNDEREXPRESSION" (tier 1 in both cases). In addition, record name "EXPRESSION" can be matched as well in both cases (tier 1), as this term applies to both over- and under-expression. On the other hand, log fold-change values of exactly zero (i.e. logFC = 0) are not allowed by CIViCutils and will trigger an error, as this would imply that the corresponding input gene is in fact not differentially expressed. Moreover, the package also attempts to match additional CIViC records which describe special cases of expression events, e.g. those associated with expression changes at the exon level, such as "EXON 1-2 EXPRESSION", "EXON 5 OVEREXPRESSION" or "EXON 5 UNDEREXPRESSION". Matched records of this type are classified as tier 1 for simplicity. Once more, records can only be matched when the type of expression change in their name agrees with the input logFC, e.g. "EXON 1-2 EXPRESSION" will match both under- and over-expression, while "EXON 5 UNDEREXPRESSION" will only match under-expression.
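The expression rules can be summarized in a brief sketch. The function names are hypothetical and the exon-level check is an assumption about how such record names could be recognized (a name starting with "EXON" whose last word is a compatible expression term); the only behavior taken from the text is the mapping from the sign of logFC to record names and the error on a value of zero.

```python
def expression_tier_1_names(log_fc):
    """Return the CIViC record names matching the sign of a log fold-change."""
    if log_fc == 0:
        # A logFC of exactly zero means no differential expression.
        raise ValueError("logFC must be non-zero for a differentially expressed gene")
    direction = "OVEREXPRESSION" if log_fc > 0 else "UNDEREXPRESSION"
    return [direction, "EXPRESSION"]


def matches_expression(log_fc, record_name):
    """Return True when a CIViC record name is a tier 1 match for log_fc."""
    names = expression_tier_1_names(log_fc)
    record = record_name.strip().upper()
    if record in names:
        return True
    # Exon-level records, e.g. "EXON 5 UNDEREXPRESSION", are matched when
    # their final term agrees with the direction of the input logFC.
    return record.startswith("EXON ") and record.split()[-1] in names


print(matches_expression(-1.8, "EXON 5 UNDEREXPRESSION"))
```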