Seqanswers Leaderboard Ad

**WhatsOEver** · 03-23-2015, 01:14 AM

Could you provide (a) the command you used, (b) the version number of your Augustus installation and (c) a reproducible example where the translation fails? I have used Augustus some times but have never experienced anything like that.

**joxcargator73** · 03-23-2015, 03:50 PM

I am using augustus 3.0.3.
It happens with the default prediction ( partial: allow prediction of incomplete genes at the sequence boundaries (default))

I checked the gff file and only the incomplete genes have this problem.

I used this command
./augustus --species=magnaporthe_grisea --codingseq=1 --protein=1 --exonnames=1 G_citricarpa_assembly_Scf4.fasta > Gcitri.gff

Then

perl getAnnoFasta.pl --seqfile=G_citricarpa_assembly_Scf4.fasta Gcitri.gff

to obtain CodingSeq, and aa

For example in my file: Gcitri.gff
I have an "incomplete call"

# ----- prediction on sequence number 2169 (length = 1527, name = NODE_2169_length_1527_cov_14.2353_ID_4337) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 2169 on both strands
# start gene g9811
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS gene 1 870 0.3 + . g9811
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS transcript 1 870 0.3 + . g9811.t1
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS internal 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS terminal 442 870 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 1 216 0.58 + . transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 390 441 0.75 + . transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 442 867 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS stop_codon 868 870 . + 0 transcript_id "g9811.t1"; gene_id "g9811";
# coding sequence = [ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
# cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
# aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
# ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
# tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
# aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
# tcacgagcctcctcgcaggataggctggactga]
# protein sequence = [NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFV
FEVPDCHARLPVASRASSQDRLD]
# end gene g9811

The file Gtcitri.aa obtained using getAnnoFasta.pl gives the same aa sequence
NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH
APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFVFEVPDCHARLPVASRASSQDRLD
This is good

the file Gcitri.CodingSeq obtained using getAnnoFasta.pl gives the whole "coding sequence"
ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
# cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
# aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
# ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
# tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
# aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
# tcacgagcctcctcgcaggataggctggactga
!!!

If you translate that in frame 1
You obtain this

IICSASSSCCTDALTATPALAAPQSPSWAARCGRRFPTMRGRPWRAP*ATFPRSSQD*AK
SSPFPRRTIRRCCWSSRTTRPASSSRRARATSATRGLRHSPRTCLDRRQEGEEVLSAGMP
SGCGFSGTRTTRPTSR*MAPGS*VISAATFSQGS*G*RMEKMALGRTTSRCLEDDSCLKC
QTAMPDCQSHHEPPRRIGWT

Agains The gff file says that the first CDS starts at 217 in frame 2 , I believe that this is the right frame for the Codseq. It seems that I am missing something when I use getAnnoFasta.pl.

Of course the start codon feature is missed too. When I loaded the gff file in artemis I also have features of the incomplete genes out of frame.

Thanks

**WhatsOEver** · 03-24-2015, 06:03 AM

Originally posted by joxcargator73 View Post

I am using augustus 3.0.3.
It happens with the default prediction ( partial: allow prediction of incomplete genes at the sequence boundaries (default))

I checked the gff file and only the incomplete genes have this problem.

ok, the prediction is fine then. I'll explain why:

Originally posted by joxcargator73 View Post

I used this command
./augustus --species=magnaporthe_grisea --codingseq=1 --protein=1 --exonnames=1 G_citricarpa_assembly_Scf4.fasta > Gcitri.gff

Then

perl getAnnoFasta.pl --seqfile=G_citricarpa_assembly_Scf4.fasta Gcitri.gff

to obtain CodingSeq, and aa

Just a comment: I assume that there are no closer related species than Magnaporthe (eg something from the Dothideomycetes) available? In that case it is important to keep in mind that some of your gene predictions will probably be different when using alternative gene models. But for an initial working set everything is fine and you don't have to bother.

Originally posted by joxcargator73 View Post

For example in my file: Gcitri.gff
I have an "incomplete call"

# ----- prediction on sequence number 2169 (length = 1527, name = NODE_2169_length_1527_cov_14.2353_ID_4337) -----
#
# Constraints/Hints:
# (none)
# Predicted genes for sequence number 2169 on both strands
# start gene g9811
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS gene 1 870 0.3 + . g9811
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS transcript 1 870 0.3 + . g9811.t1
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS internal 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS terminal 442 870 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 1 216 0.58 + . transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS intron 390 441 0.75 + . transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 217 389 0.55 + 2 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS CDS 442 867 0.44 + 0 transcript_id "g9811.t1"; gene_id "g9811";
NODE_2169_length_1527_cov_14.2353_ID_4337 AUGUSTUS stop_codon 868 870 . + 0 transcript_id "g9811.t1"; gene_id "g9811";
# coding sequence = [ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
# cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
# aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
# ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
# tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
# aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
# tcacgagcctcctcgcaggataggctggactga]
# protein sequence = [NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFV
FEVPDCHARLPVASRASSQDRLD]
# end gene g9811

The file Gtcitri.aa obtained using getAnnoFasta.pl gives the same aa sequence
NLQRKLVLLHRRAHRDPSARGSSITVLGCTLWTAIPDDARQAVAGAVSDFSKIQPGLSEVESVSEEDDSALLLVVTHH
APCVQLTSSPRHVGNPWTAAFATDLLGSAAGGGGGAECWDAVRVWVFGHTHYPTNFKVNGTRVVSNQRGYVLPGLVRVKDGEDGSGKNHQSMSRRRFVFEVPDCHARLPVASRASSQDRLD
This is good

the file Gcitri.CodingSeq obtained using getAnnoFasta.pl gives the whole "coding sequence"
ataatctgcagcgcaagctcgtcctgctgcaccgacgcgctcaccgcgaccccagcgctcgcggctcctcaatcaccgt
# cttgggctgcacgctgtggacggcgattcccgacgatgcgcggcaggccgtggcgggcgccgtgagcgacttttccaagatccagccaggactgagcg
# aagtcgagtccgtttccgaggaggacgattcggcgctgctgttggtcgtcacgcaccacgcgccctgcgtccagctcacgtcgagcccgcgccacgtc
# ggcaacccgtggactgcggcattcgccacggacctgcttggatcggcggcaggagggggaggaggtgctgagtgctgggatgccgtcagggtgtgggt
# tttcgggcacacgcactacccgaccaacttcaaggtgaatggcaccagggtcgtaagtaatcagcgcggctacgttctcccagggctcgtgagggtga
# aggatggagaagatggctctgggaagaaccaccagtcgatgtctcgaagacgattcgtgtttgaagtgccagactgccatgccagactgccagtcgca
# tcacgagcctcctcgcaggataggctggactga
!!!

If you translate that in frame 1
You obtain this

IICSASSSCCTDALTATPALAAPQSPSWAARCGRRFPTMRGRPWRAP*ATFPRSSQD*AK
SSPFPRRTIRRCCWSSRTTRPASSSRRARATSATRGLRHSPRTCLDRRQEGEEVLSAGMP
SGCGFSGTRTTRPTSR*MAPGS*VISAATFSQGS*G*RMEKMALGRTTSRCLEDDSCLKC
QTAMPDCQSHHEPPRRIGWT

Agains The gff file says that the first CDS starts at 217 in frame 2 , I believe that this is the right frame for the Codseq. It seems that I am missing something when I use getAnnoFasta.pl.

Of course the start codon feature is missed too. When I loaded the gff file in artemis I also have features of the incomplete genes out of frame.

Thanks

1) What you get is what you see. The perl script does nothing more than extracting everything between "# coding sequence = [" and its respective closing bracket (Same thing for protein sequence).

2) Augustus uses 0 to indicate 1st frame, 1 for 2nd and 2 for 3rd and if you translate the coding sequence in all frames, you'll also see that it is the 3rd frame translation.

3) I highlighted the most important part. You have a partial gene which was detected at the start of a contig. Obivously, for such a prediction no start "ATG" can be found (as you have noticed). But your prediction starts with an intron and to see a "frame-shift" is absolutely fine in that case. Keep in might that you cannot necessarily translate every exon independently in 1st frame!

4) It doesn't make sense to translate in a frame that contains premature stop codons as this only indicates that the prediction is somehow wrong. For your example, there seems to be evidence for a transcript of that certain length and there is 1 frame which completely covers this region with only 1 stop at the end. This frame is shown.

I hope this helps

**joxcargator73** · 03-26-2015, 03:58 PM

Thanks, for your help.
The problem that I still have , it is how to deal with the codingSeqs. If some of them (mostly the incomplete calls) do not start in frame and I need the nucleotide sequence to run a codon usage analysis , I will get a something else but not what I expect. Moreover if I compare with the aminoacid tarlatan that Ausguts shows. Should I avoid using truncate calls. Are the complete gene calls start in frame, are these safe to just take their "coding seqs" and translate them?

Thanks for your help

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 21 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Augustus gene prediction CDS problem

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News