Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two tibble gff3 files appear empty when loading in. #194

Open
tania-k opened this issue Jul 3, 2024 · 3 comments
Open

Two tibble gff3 files appear empty when loading in. #194

tania-k opened this issue Jul 3, 2024 · 3 comments

Comments

@tania-k
Copy link

tania-k commented Jul 3, 2024

Hello gggenomes folks.

I am currently attempting to run your program on my dataset that are subsections of:

  1. A Nucleotide Fasta file- a portion of my chromosome JAEVHH010000002.1 from length 6094000-6107000 bp.
  2. A GFF3 file - Containing a portion of my GFF3 file ranging from 6094000-6107000 bp.
  3. A Transposable element annotation GFF3 file - Containing a portion of my original file ranging from 6094000-6107000 bp.

Looks like:

  1. FASTA (head)

JAEVHH010000002.1
GCATCTTGATTGAGGACAGCTCTCTTAGGCAAATCTGACAGTTTGGGGATGTTGATTTGGTTAAGTGTCGCCATAGTGGA
ACTTCTGATGGGATGCAGTTATAGGCAGTGTAAATAATGTGAAGATGACTGATCTTTCTTTAGATAGTTCAGGGATGAGT
TCTGCAACAGCAATGAACAGGCATAGACAAGAAGAAAAGCAAATACAATAAGAGATGAACAGCAGTTCACTGTGTTCCTG
TTATATGAGTCAAGAGATCTGACTGACATTAGTTGCTGCTGTAACAACACAACAGTGAATGTCAGTATCTTATACCTCAA
GAAACATCCCAACAAAGACTAATAAGAACTAGATCATGGCTGACCACTGCCAAACTCTATGAACTTAACACATGGAAAAC
ACTTCAACTTATAAATAGCAAATGTAAATGCCTAGAGTGCATGGTGGTCTGAAGTGTGGAACAGGCGTTTCTTCTTCTAC
TGCAGCTAAATAAACCATCTGGTAGTAGCAATTGAAGTTTTCTtcccttctttttcttttagaCCCCGATAACGCTCCGT
GTATATGTATGTAACAGTGATTTTGAGGGGCATCATTCCCGTTGGAAATATTGACGTAGAATTGCATATCAGAGAATGAG
TAGCTTCCACGTAGCAATAATTAACAGAGAGAATAAGAAACATCAGACTTAGATAGTGAACTACCTAGAAACTGCCTATA

  1. GFF3 (head)

##gff-version 3
JAEVHH010000002.1 Genbank gene 6094524 6097279 . - . ID=gene-I7I48_07445;Name=I7I48_07445;Note=similar to ucsf_hc.01_1.G217B.02409;gbkey=Gene;gene_biotype=protein_coding;locus_tag=I7I48_07445
JAEVHH010000002.1 Genbank mRNA 6094524 6097279 . - . ID=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;Parent=gene-I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase
JAEVHH010000002.1 Genbank exon 6096870 6097279 . - . ID=exon-gnl|WGS:JAEVHH|mrna.I7I48_07445-1;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase
JAEVHH010000002.1 Genbank exon 6096553 6096792 . - . ID=exon-gnl|WGS:JAEVHH|mrna.I7I48_07445-2;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase
JAEVHH010000002.1 Genbank exon 6095606 6096476 . - . ID=exon-gnl|WGS:JAEVHH|mrna.I7I48_07445-3;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase
JAEVHH010000002.1 Genbank exon 6094524 6095543 . - . ID=exon-gnl|WGS:JAEVHH|mrna.I7I48_07445-4;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase
JAEVHH010000002.1 Genbank CDS 6096870 6096893 . - 0 ID=cds-KAG5298099.1;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;Dbxref=NCBI_GP:KAG5298099.1;Name=KAG5298099.1;gbkey=CDS;locus_tag=I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase;protein_id=KAG5298099.1

  1. TE GFF3 (head)

##gff-version 3
JAEVHH010000002.1 EDTA Copia_LTR_retrotransposon 6106870 6106957 521 + . ID=TE_homo_3261;Name=TE_00000452_LTR;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.92;Method=homology
JAEVHH010000002.1 EDTA Copia_LTR_retrotransposon 6106958 6107257 1063 + . ID=TE_homo_3262;Name=TE_00000131_INT;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.861;Method=homology
JAEVHH010000002.1 EDTA LTR_retrotransposon 6107268 6107505 743 + . ID=TE_homo_3263;Name=TE_00000906_LTR;Classification=LTR/unknown;Sequence_ontology=SO:0000186;Identity=0.836;Method=homology
JAEVHH010000002.1 EDTA Gypsy_LTR_retrotransposon 6107921 6108480 3528 + . ID=TE_homo_3265;Name=TE_00000327_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.881;Method=homology
JAEVHH010000002.1 EDTA Gypsy_LTR_retrotransposon 6108475 6108728 1604 + . ID=TE_homo_3266;Name=TE_00000327_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.888;Method=homology
JAEVHH010000002.1 EDTA Copia_LTR_retrotransposon 6108752 6109500 4635 + . ID=TE_homo_3268;Name=TE_00000131_INT;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.823;Method=homology
JAEVHH010000002.1 EDTA Copia_LTR_retrotransposon 6109501 6109604 668 + . ID=TE_homo_3269;Name=TE_00000452_LTR;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.929;Method=homology

Called as:

  1. seq
  2. gff3
  3. repeat_edta

When running:

z <- gggenomes(genes=gff) + geom_gene() + geom_seq() + geom_feat(aes(color=type), data=feats(genes)) + geom_bin_label() + geom_seq_label()

I receive
plot_zoom_png
It has all my sequences on separate lines, but adding in any of the other two features, seq or repeat_edta throws errors.

z <- gggenomes(genes=gff, seqs=seq) + geom_gene() + geom_seq() + geom_feat(aes(color=type), data=feats(genes)) + geom_bin_label() + geom_seq_label() z
Only saw type=NAin genes and will treat everything astype="CDS".

z <- gggenomes(gff,repeat_edta) + 
  geom_gene() + 
  geom_seq() + 
  geom_feat(aes(color=type), data=feats(genes)) + 
  geom_bin_label() + 
  geom_seq_label()

Error in require_vars(): ! Required column(s) missing: • length Run rlang::last_trace() to see where the error occurred.

My first error adding all three files gives me:
`> p <- gggenomes(seqs = seq, genes = gff, feats = repeat_edta) + geom_seq() +

  • geom_gene(aes(fill = geom_id)) + geom_feat(linewidth = 5, position = "identity", data = feats(repeat_edta))

p
Only saw type=NA in genes and will treat everything as type="CDS".
Error in geom_feat():
! Problem while computing layer data.
ℹ Error occurred in the 4th layer.
Caused by error in switch():
! EXPR must be a length 1 vector
Run rlang::last_trace() to see where the error occurred.`

I am unsure how to progress.

I am running gggenomes on RStudio run through Linux/HPCC.
I've updated ggplots2 to 3.5.0 and other dependencies along with restarting my R session.

Thank you for your time.

@thackl
Copy link
Owner

thackl commented Jul 3, 2024

Dear Tania,

thanks for reaching out.

The problem with your first error (genes + seqs) is that gggenomes only reads one thing from the fasta file with the sequence: the length, in your case 13000bp. But the coordinates of you genes are then 6107000. So when gggenomes tries to plot that, it cannot find any genes that fit on a 13000bp contig. If you want to zoom in on parts of sequences, you either not provide the sequence at all and gggenomes will just zoom in on the range of genes you provide, or you can explicitly set the length together with start end end for the sequence (which I did in the example below)

In the second case, you are reading your TE gff as seq= (second argument of gggenomes). What should work is z <- gggenomes(gff, feats=repeat_edta) + ....

This works for me (for the parts of the gffs that you provided)

# explicitly specify range of chromosome to plot
s0 <- tibble(
  seq_id = "JAEVHH010000002.1",
  length = 6107000,
  start = 6094000,
  end = 6107000
)

# read genes
g0 <- read_feats("genes.gff")
# read TE
t0 <- read_feats("te.gff")

gggenomes(g0, s0, t0) +
  geom_gene() + 
  geom_seq() + 
  geom_feat(aes(color=type), data=feats(genes)) + 
  geom_bin_label() + 
  geom_seq_label() +
  geom_gene() +
  geom_feat()

image

Hope that helps!

@thackl
Copy link
Owner

thackl commented Jul 3, 2024 via email

@tania-k
Copy link
Author

tania-k commented Jul 3, 2024

Hi Thomas,
I figured it out, required a bit of adjusting but here is my code in entirety (if folks fall into the same mis-step like me)

`s0 <- tibble(
seq_id = "H_ohiense_G217B_JAEVHH010000002.1",
length = 16000,
start = 6094000,
end = 6110000
)

#have gene gff3 first, then sequence, then other gff3 file(repeat content) and pull from feats instead of genes.
p <- gggenomes(gff, s0, repeat_edta) +
geom_gene(aes(fill = name), data=feats(genes)) +
geom_feat(linewidth = 5, position = "identity", aes(color= type), data=feats(feats)) +
scale_color_manual(values = c("#D55E00", "#CC79A7", "#56B4E9")) +
geom_seq() +
geom_bin_label() +
guides(color = guide_legend(title = "Repetitive elements"),fill = guide_legend(title = "GeneID"))
p
`

I realized I needed to broaden my visual to capture the TEs as they were past the gene coordinates in my s0 variables.
Attached is my result.
AGS1_G217B

Thanks again for all the help!!
Tania

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants