GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space.
The attribute list must begin with the two mandatory attributes:
- gene_id: A globally unique identifier for the genomic source of the sequence.
- transcript_id: A globally unique identifier for the predicted transcript.
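The ordering rule above can be checked mechanically. The following is a minimal sketch, not part of the workshop material: `check_mandatory` is a hypothetical helper name, and the two records are illustrative lines built from the workshop identifiers with the attribute list shortened.

```shell
# Hypothetical helper: reads GTF lines on stdin and prints the line number of
# every record whose ninth field does not start with gene_id, then transcript_id.
check_mandatory() {
  awk -F'\t' '$9 !~ /^gene_id "[^"]*"; transcript_id "[^"]*";/ { print NR }'
}

# Two illustrative records: the first follows the rule, the second swaps the order.
good=$(printf 'chr6\tENSEMBL\tCDS\t41488217\t41488591\t.\t+\t2\tgene_id "ENSMUSG00000076490"; transcript_id "ENSMUST00000103291";')
bad=$(printf 'chr6\tENSEMBL\tCDS\t41488217\t41488591\t.\t+\t2\ttranscript_id "ENSMUST00000103291"; gene_id "ENSMUSG00000076490";')

bad_lines=$(printf '%s\n%s\n' "$good" "$bad" | check_mandatory)
echo "records breaking the rule: ${bad_lines:-none}"
```

On a real file the same function could be fed from standard input, for example `check_mandatory < ~ngs00/refs/mm65.long.ok.gtf | head`.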
Check the ninth GTF field of the workshop's sample annotation:
cut -f9 ~ngs00/refs/mm65.long.ok.gtf | head -1
gene_id "ENSMUSG00000076490"; transcript_id "ENSMUST00000103291"; exon_number "1"; gene_name "Trbc1"; gene_type "IG_C_gene"; transcript_name "Trbc1-201"; protein_id "ENSMUSP00000100099"; transcript_type "IG_C_gene";
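Because each attribute is a type/value pair ending in a semi-colon, the ninth field can be split into one pair per line with standard tools. This is a sketch under the assumption that values contain no semi-colons; `split_attrs` is a hypothetical helper name, and the sample string is the ninth field shown above.

```shell
# Ninth field copied from the workshop annotation shown above.
attrs='gene_id "ENSMUSG00000076490"; transcript_id "ENSMUST00000103291"; exon_number "1"; gene_name "Trbc1"; gene_type "IG_C_gene"; transcript_name "Trbc1-201"; protein_id "ENSMUSP00000100099"; transcript_type "IG_C_gene";'

# Hypothetical helper: print one "type<TAB>value" pair per line.
split_attrs() {
  printf '%s\n' "$1" | tr ';' '\n' | awk '
    { sub(/^ +/, "") }            # trim the space left by the "; " separator
    NF >= 2 {
      key = $1
      val = $0
      sub(/^[^ ]+ +/, "", val)    # drop the attribute type
      gsub(/"/, "", val)          # drop the surrounding quotes
      print key "\t" val
    }'
}

pairs=$(split_attrs "$attrs")
printf '%s\n' "$pairs"

# Pull out a single value by its type.
gene_id=$(printf '%s\n' "$pairs" | awk -F'\t' '$1 == "gene_id" { print $2 }')
echo "gene_id is $gene_id"
```

Splitting on the semi-colon first keeps the awk part simple; an attribute value that itself contained a semi-colon would need a real parser instead.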
Look at the first line:
head -1 ~ngs00/refs/mm65.long.ok.gtf | tr '\t' '\n'
chr6 (1)
ENSEMBL (2)
CDS (3)
41488217 (4)
41488591 (5)
. (6)
+ (7)
2 (8)
gene_id "ENSMUSG00000076490"; transcript_id "ENSMUST00000103291"; exon_number "1"; gene_name "Trbc1"; gene_type "IG_C_gene"; transcript_name "Trbc1-201"; protein_id "ENSMUSP00000100099"; transcript_type "IG_C_gene"; (9)
- seqname: The name of the sequence. Must be a chromosome or scaffold.
- source: The program that generated this feature.
- feature: The name of this type of feature. Some examples of standard feature types are CDS, start_codon, stop_codon, and exon.
- start: The starting position of the feature in the sequence (1-based).
- end: The ending position of the feature (inclusive).
- score: A score between 0 and 1000.
- strand: Valid entries include +, -, or . (for don’t know/don’t care).
- frame: If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be a single dot (.).
- group: All lines with the same group are linked together into a single item. Note the gene_id and transcript_id mandatory attributes.
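Since start is 1-based and end is inclusive, the length of a feature is end - start + 1. The sketch below works this out for the first line of the workshop annotation, rebuilt here with tabs and with the attribute list shortened to the two mandatory attributes; the variable names are only illustrative.

```shell
# First record of the workshop annotation, tab-separated, attributes shortened.
line=$(printf 'chr6\tENSEMBL\tCDS\t41488217\t41488591\t.\t+\t2\tgene_id "ENSMUSG00000076490"; transcript_id "ENSMUST00000103291";')

# A well-formed GTF line has nine tab-separated fields.
n_fields=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

# start ($4) and end ($5) are 1-based and inclusive: length = end - start + 1.
feature_length=$(printf '%s\n' "$line" | awk -F'\t' '{ print $5 - $4 + 1 }')

echo "fields=$n_fields length=$feature_length"   # prints: fields=9 length=375
```

Setting the field separator to a tab with `-F'\t'` matters here, because the ninth field contains spaces and would otherwise be split into several awk fields.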
The third field in the GFF specification represents the feature type for the line. In addition to the standard features mentioned above, there can be intron, gene and transcript.
Look for the first line of each of these feature types in the workshop's sample annotation:
awk '$3=="intron"' ~ngs00/refs/mm65.long.ok.gtf | head -1
chr17 ENSEMBL intron 46893352 46893968 . - . gene_id "ENSMUSG00000036858"; transcript_id "ENSMUST00000041012"; exon_number "3"; gene_name "Ptcra"; gene_type "IG_C_gene"; transcript_name "Ptcra-201"; transcript_type "IG_C_gene";
awk '$3=="gene"' ~ngs00/refs/mm65.long.ok.gtf | head -1
chr4 ENSEMBL gene 116791824 116798271 . - . gene_id "ENSMUSG00000073771"; transcript_id "ENSMUSG00000073771"; gene_type "protein_coding"; gene_status "NULL"; gene_name "Btbd19"; transcript_type "protein_coding"; transcript_status "NULL"; transcript_name "Btbd19";
awk '$3=="transcript"' ~ngs00/refs/mm65.long.ok.gtf | head -1
chr13 ENSEMBL transcript 19305972 19308487 . + . gene_id "ENSMUSG00000076749"; transcript_id "ENSMUST00000103558"; exon_number "1"; gene_name "Gm17004"; gene_type "IG_C_gene"; transcript_name "Gm17004-201"; transcript_type "IG_C_gene";
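To see which feature types a file contains, rather than querying them one at a time, the third field can be tallied in a single pass. This is a sketch on a small stand-in file built from the four records shown in this section, with attribute lists shortened to the two mandatory attributes; the temporary file and variable names are assumptions made for illustration.

```shell
# Small stand-in for the workshop file (one record per feature type seen above).
gtf=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr6 ENSEMBL CDS 41488217 41488591 . + 2 'gene_id "ENSMUSG00000076490"; transcript_id "ENSMUST00000103291";' \
  chr17 ENSEMBL intron 46893352 46893968 . - . 'gene_id "ENSMUSG00000036858"; transcript_id "ENSMUST00000041012";' \
  chr4 ENSEMBL gene 116791824 116798271 . - . 'gene_id "ENSMUSG00000073771"; transcript_id "ENSMUSG00000073771";' \
  chr13 ENSEMBL transcript 19305972 19308487 . + . 'gene_id "ENSMUSG00000076749"; transcript_id "ENSMUST00000103558";' \
  > "$gtf"

# Count records per feature type (third field), sorted by type name.
counts=$(awk -F'\t' '{ n[$3]++ } END { for (f in n) print f "\t" n[f] }' "$gtf" | sort)
printf '%s\n' "$counts"

rm -f "$gtf"
```

Replacing `"$gtf"` with `~ngs00/refs/mm65.long.ok.gtf` should give the tally for the workshop annotation itself.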