SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
[Gene Prediction] AUGUSTUS program syintel87 Bioinformatics 1 02-16-2014 09:15 AM
augustus gene finder condomitti Bioinformatics 2 11-23-2013 06:24 AM
survery: what assembler, gene prediction and gene annotation softwares do you use? Gorbenzer Metagenomics 6 04-22-2013 02:25 AM
cuffdiff does not output all the CDS in cds.FPKM.tracking file xiangq Bioinformatics 20 04-26-2012 12:39 PM
How to know the CDS positions of each gene ahli1981 Bioinformatics 3 12-19-2011 01:29 PM

Reply
 
Thread Tools
Old 03-20-2015, 05:44 PM   #1
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28
Default Augustus gene prediction CDS problem

I am using Augustus to predict genes. It works well most of the time. However I am dealing with a problem of translation of the extracted codingsequences. I use getAnnoFasta.pl to extract the codingseq and the aminoacids also. If I am not wrong, the codingSeqs should translate exactly the aminoacids sequences. This is not the case in several instances. When I checked my coding seq and translate them many results in other protein totally different form the file containing the aa sequences. Is this a common problem?. Is this best way to extract sequences?. I tried to use the GFF and loaded in artemis to edit error and I noticed that some features lost the frame. Maybe I am missing something.

Thanks

joxcargator73 is offline   Reply With Quote
Old 03-23-2015, 02:14 AM   #2
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Could you provide (a) the command you used, (b) the version number of your Augustus installation and (c) a reproducible example where the translation fails? I have used Augustus some times but have never experienced anything like that.
WhatsOEver is offline   Reply With Quote
Old 03-23-2015, 04:50 PM   #3
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28
Default

I am using augustus 3.0.3.
It happens with the default prediction ( partial: allow prediction of incomplete genes at the sequence boundaries (default))

I checked the gff file and only the incomplete genes have this problem.

I used this command
./augustus --species=magnaporthe_grisea --codingseq=1 --protein=1 --exonnames=1 G_citricarpa_assembly_Scf4.fasta > Gcitri.gff

Then

perl getAnnoFasta.pl --seqfile=G_citricarpa_assembly_Scf4.fasta Gcitri.gff

to obtain CodingSeq, and aa



For example in my file: Gcitri.gff
I have an "incomplete call"

# ----- prediction on sequence number 2169 (length = 1527, name = NODE_2169_length_1527_cov_14.2353_ID_4337) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 2169 on both strands
# start gene g9811
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS gene 1 870 0.3 + . g9811
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS transcript 1 870 0.3 + . g9811.t1
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS internal 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS terminal 442 870 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 1 216 0.58 + . transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 390 441 0.75 + . transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 442 867 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS stop_codon 868 870 . + 0 transcript_id "g9811.t1"; gene_id "g9811";
# coding sequence = [ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
# cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
# aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
# ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
# tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
# aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
# tcacgagcctcctcgcaggataggctggactga]
# protein sequence = [NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFV
FEVPDCHARLPVASRASSQDRLD]
# end gene g9811


The file Gtcitri.aa obtained using getAnnoFasta.pl gives the same aa sequence
NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH
APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFVFEVPDCHARLPVASRASSQDRLD
This is good

the file Gcitri.CodingSeq obtained using getAnnoFasta.pl gives the whole "coding sequence"
ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
# cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
# aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
# ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
# tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
# aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
# tcacgagcctcctcgcaggataggctggactga
!!!

If you translate that in frame 1
You obtain this

IICSASSSCCTDALTATPALAAPQSPSWAARCGRRFPTMRGRPWRAP*ATFPRSSQD*AK
SSPFPRRTIRRCCWSSRTTRPASSSRRARATSATRGLRHSPRTCLDRRQEGEEVLSAGMP
SGCGFSGTRTTRPTSR*MAPGS*VISAATFSQGS*G*RMEKMALGRTTSRCLEDDSCLKC
QTAMPDCQSHHEPPRRIGWT

Agains The gff file says that the first CDS starts at 217 in frame 2 , I believe that this is the right frame for the Codseq. It seems that I am missing something when I use getAnnoFasta.pl.

Of course the start codon feature is missed too. When I loaded the gff file in artemis I also have features of the incomplete genes out of frame.


Thanks

Last edited by joxcargator73; 03-23-2015 at 04:55 PM.
joxcargator73 is offline   Reply With Quote
Old 03-24-2015, 07:03 AM   #4
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Quote:
Originally Posted by joxcargator73 View Post
I am using augustus 3.0.3.
It happens with the default prediction ( partial: allow prediction of incomplete genes at the sequence boundaries (default))

I checked the gff file and only the incomplete genes have this problem.
ok, the prediction is fine then. I'll explain why:

Quote:
Originally Posted by joxcargator73 View Post
I used this command
./augustus --species=magnaporthe_grisea --codingseq=1 --protein=1 --exonnames=1 G_citricarpa_assembly_Scf4.fasta > Gcitri.gff

Then

perl getAnnoFasta.pl --seqfile=G_citricarpa_assembly_Scf4.fasta Gcitri.gff

to obtain CodingSeq, and aa
Just a comment: I assume that there are no closer related species than Magnaporthe (eg something from the Dothideomycetes) available? In that case it is important to keep in mind that some of your gene predictions will probably be different when using alternative gene models. But for an initial working set everything is fine and you don't have to bother.

Quote:
Originally Posted by joxcargator73 View Post
For example in my file: Gcitri.gff
I have an "incomplete call"

# ----- prediction on sequence number 2169 (length = 1527, name = NODE_2169_length_1527_cov_14.2353_ID_4337) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 2169 on both strands
# start gene g9811
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS gene 1 870 0.3 + . g9811
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS transcript 1 870 0.3 + . g9811.t1
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS internal 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS terminal 442 870 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 1 216 0.58 + . transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 390 441 0.75 + . transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 442 867 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS stop_codon 868 870 . + 0 transcript_id "g9811.t1"; gene_id "g9811";
# coding sequence = [ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
# cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
# aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
# ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
# tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
# aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
# tcacgagcctcctcgcaggataggctggactga]
# protein sequence = [NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFV
FEVPDCHARLPVASRASSQDRLD]
# end gene g9811


The file Gtcitri.aa obtained using getAnnoFasta.pl gives the same aa sequence
NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH
APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFVFEVPDCHARLPVASRASSQDRLD
This is good

the file Gcitri.CodingSeq obtained using getAnnoFasta.pl gives the whole "coding sequence"
ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
# cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
# aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
# ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
# tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
# aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
# tcacgagcctcctcgcaggataggctggactga
!!!

If you translate that in frame 1
You obtain this

IICSASSSCCTDALTATPALAAPQSPSWAARCGRRFPTMRGRPWRAP*ATFPRSSQD*AK
SSPFPRRTIRRCCWSSRTTRPASSSRRARATSATRGLRHSPRTCLDRRQEGEEVLSAGMP
SGCGFSGTRTTRPTSR*MAPGS*VISAATFSQGS*G*RMEKMALGRTTSRCLEDDSCLKC
QTAMPDCQSHHEPPRRIGWT

Agains The gff file says that the first CDS starts at 217 in frame 2 , I believe that this is the right frame for the Codseq. It seems that I am missing something when I use getAnnoFasta.pl.

Of course the start codon feature is missed too. When I loaded the gff file in artemis I also have features of the incomplete genes out of frame.

Thanks
1) What you get is what you see. The perl script does nothing more than extracting everything between "# coding sequence = [" and its respective closing bracket (Same thing for protein sequence).

2) Augustus uses 0 to indicate 1st frame, 1 for 2nd and 2 for 3rd and if you translate the coding sequence in all frames, you'll also see that it is the 3rd frame translation.

3) I highlighted the most important part. You have a partial gene which was detected at the start of a contig. Obivously, for such a prediction no start "ATG" can be found (as you have noticed). But your prediction starts with an intron and to see a "frame-shift" is absolutely fine in that case. Keep in might that you cannot necessarily translate every exon independently in 1st frame!

4) It doesn't make sense to translate in a frame that contains premature stop codons as this only indicates that the prediction is somehow wrong. For your example, there seems to be evidence for a transcript of that certain length and there is 1 frame which completely covers this region with only 1 stop at the end. This frame is shown.

I hope this helps
WhatsOEver is offline   Reply With Quote
Old 03-26-2015, 04:58 PM   #5
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28
Default

Thanks, for your help.
The problem that I still have , it is how to deal with the codingSeqs. If some of them (mostly the incomplete calls) do not start in frame and I need the nucleotide sequence to run a codon usage analysis , I will get a something else but not what I expect. Moreover if I compare with the aminoacid tarlatan that Ausguts shows. Should I avoid using truncate calls. Are the complete gene calls start in frame, are these safe to just take their "coding seqs" and translate them?

Thanks for your help
joxcargator73 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:45 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO