Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GTF file error: "gtf_to_fasta returned an error"

    I am a newcomer to the RNAseq field. I have been trying to get (differential) expression values from my experiments (rat and mouse). To achieve this I want to run tophat and cufflinks with annotated transcripts/genes (i.e. -G option in top hat). this requires a GTF or GFF file.

    I obtained my rat GTF file from:


    I already figured out that the first column of this file does not contain the correct names:
    ['1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '3' '4' '5'
    '6' '7' '8' '9' 'AABR06109308.1' 'AABR06109382.1' 'AABR06109386.1'
    'AABR06109393.1' 'AABR06109522.1' 'AABR06109730.1' 'AABR06109804.1'
    'AABR06109842.1' 'AABR06109980.1' 'AABR06109985.1' 'AABR06110113.1'
    'AABR06110137.1' 'AABR06110207.1' 'AABR06110239.1' 'AABR06110493.1'
    'AABR06110520.1' 'AABR06110524.1' 'AABR06110528.1' 'AABR06110658.1'
    'AABR06110665.1' 'AABR06110667.1' 'AABR06110668.1' 'AABR06110762.1'
    'AABR06110835.1' 'AABR06110904.1' 'AABR06110917.1' 'AABR06110957.1'
    'AABR06110964.1' 'AABR06111084.1' 'AABR06111092.1' 'AABR06111127.1'
    'AABR06111133.1' 'AABR06111236.1' 'AABR06111258.1' 'AABR06111264.1'
    'AABR06111284.1' 'AABR06111288.1' 'AABR06111321.1' 'AABR06111326.1'
    'AABR06111415.1' 'AABR06111426.1' 'AABR06111443.1' 'AABR06111498.1'
    'AABR06111564.1' 'AABR06111611.1' 'AABR06111687.1' 'AABR06111698.1'
    'AABR06111722.1' 'AABR06111841.1' 'AABR06111842.1' 'AABR06112158.1'
    'AABR06112227.1' 'AABR06112323.1' 'JH620271.1' 'JH620298.1' 'JH620359.1'
    'JH620370.1' 'JH620402.1' 'JH620421.1' 'JH620447.1' 'JH620470.1'
    'JH620473.1' 'JH620476.1' 'JH620489.1' 'JH620497.1' 'JH620498.1'
    'JH620499.1' 'JH620501.1' 'JH620511.1' 'JH620517.1' 'JH620530.1'
    'JH620566.1' 'JH620632.1' 'MT' 'X']

    they should be:
    xxx@xxx:~/RNAseq/Rattus_norvegicus/UCSC/rn4/Sequence/Bowtie2Index$ bowtie2-inspect --names genome
    chr10
    chr11
    chr12
    chr13
    chr14
    chr15
    chr16
    chr17
    chr18
    chr19
    chr1
    chr20
    chr2
    chr3
    chr4
    chr5
    chr6
    chr7
    chr8
    chr9
    chrM
    chrX

    therefore I wrote a python script that replaced the names with the correct ones. The new GTF file first column indices are, which match the ones in my index file:
    ['chr1' 'chr10' 'chr11' 'chr12' 'chr13' 'chr14' 'chr15' 'chr16' 'chr17'
    'chr18' 'chr19' 'chr2' 'chr20' 'chr3' 'chr4' 'chr5' 'chr6' 'chr7' 'chr8'
    'chr9' 'chrM' 'chrX']

    However, when I run the following command:

    xxx@xxx:~/RNAanalysis/data/rat RNAseq$ tophat -p 20 -G ../../genome.gtf -o D09LDACXX_101811-19-07_CAGATC_L003_R1 ../../RatIndex/Bowtie2Index/genome D09LDACXX_101811-19-07_CAGATC_L003_R1.fastq

    I get an error:
    [2013-06-24 11:14:41] Beginning TopHat run (v2.0.8b)
    -----------------------------------------------
    [2013-06-24 11:14:41] Checking for Bowtie
    Bowtie version: 2.1.0.0
    [2013-06-24 11:14:41] Checking for Samtools
    Samtools version: 0.1.18.0
    [2013-06-24 11:14:41] Checking for Bowtie index files
    [2013-06-24 11:14:41] Checking for reference FASTA file
    [2013-06-24 11:14:41] Generating SAM header for ../../RatIndex/Bowtie2Index/genome
    format: fastq
    quality scale: phred33 (default)
    [2013-06-24 11:14:47] Reading known junctions from GTF file
    [2013-06-24 11:14:50] Preparing reads
    left reads: min. length=51, max. length=51, 15745244 kept reads (5954 discarded)
    [2013-06-24 11:18:53] Creating transcriptome data files..
    [FAILED]
    Error: gtf_to_fasta returned an error.


    the first few rows of my GTF file look like this:
    chr1 Ensembl exon 312155 312360 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 386110 386392 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "2"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 391729 392532 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "3"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 393385 393609 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "4"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 394818 394941 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "5"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 400266 401149 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "6"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 312155 312361 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 312365 312376 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "2"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 312378 312383 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "3"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 386141 386392 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "4"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 391729 392532 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "5"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 393385 393606 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "6"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 394818 394940 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "7"; gene_name "Vom2r3"; gene_biotype "protei$
    chr1 Ensembl exon 400259 401149 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000046586"; exon_number "8"; gene_name "Vom2r3"; gene_biotype "protei$

    (the $-sign at the end is just because the complete row didn't fit in my shell, an example of a complete row:
    chr1 Ensembl exon 312155 312360 . + . gene_id "ENSRNOG00000046319"; transcript_id "ENSRNOT00000073861"; exon_number "1"; gene_name "Vom2r3"; gene_biotype "protein_coding"; transcript_name "Vom2r3-202"; exon_id "ENSRNOE00000494965"; old_row[1] "protein_coding";
    )
    compared to the original GTF file I also replaced the second column with the word "Ensembl" ,, where before this said "protein_coding". I only kept the exons, so I removed start and stop codons and CDS.

    Does anybody have a clue why TopHat is still failing to run?

    thanks in advance!
    Last edited by rubbertjes; 06-24-2013, 01:37 AM.

  • #2
    Your life would be easier if you used the Ensembl annotation with the Ensembl sequence or the UCSC annotation with the UCSC annotation and didn't try to mix them.

    Comment


    • #3
      Also, you can download pre-made indices that will work from iGenomes.

      Comment


      • #4
        Originally posted by dpryan View Post
        Also, you can download pre-made indices that will work from iGenomes.
        This is indeed where I obtained my indices from.

        Comment


        • #5
          Originally posted by dpryan View Post
          Your life would be easier if you used the Ensembl annotation with the Ensembl sequence or the UCSC annotation with the UCSC annotation and didn't try to mix them.
          I am downloading the Ensembl data from http://cufflinks.cbcb.umd.edu/igenomes.html as we speak, so we will see how that goes (ETA 15h :S), but it is still strange it doesn't function with my current implementation...
          Thanks!
          Last edited by rubbertjes; 06-24-2013, 03:22 AM.

          Comment


          • #6
            Hi rubbertjes,

            I am also facing same problem

            I am doing alignment of my RNA-Seq data from Arabidopsis with Tophat


            I gave the following command

            tophat -G genes.gtf -p 5 -o controlone genome
            C2AFAACXX_Hr-02_13s004770-1-1_ram_lane313s004770_sequence.fastq,C2AFAACXX_Hr-02_13s004770-2-1_ram_lane513s004770_sequence.fastq

            Genes.gtf is gene annotation file, while bowtie2 index files are starting with 'genome' name.

            These both the files were downloaded from this link




            and the out put looks like this

            [2013-08-27 15:23:42] Beginning TopHat run (v2.0.9)
            -----------------------------------------------
            [2013-08-27 15:23:42] Checking for Bowtie
            Bowtie version: 2.1.0.0
            [2013-08-27 15:23:42] Checking for Samtools
            Samtools version: 0.1.19.0
            [2013-08-27 15:23:42] Checking for Bowtie index files (genome)..
            [2013-08-27 15:23:42] Checking for reference FASTA file
            [2013-08-27 15:23:42] Generating SAM header for genome
            format: fastq
            quality scale: phred33 (default)
            [2013-08-27 15:23:42] Reading known junctions from GTF file
            [2013-08-27 15:23:45] Preparing reads
            left reads: min. length=52, max. length=52, 58033296 kept reads
            (3282 discarded)
            [2013-08-27 15:36:32] Building transcriptome data files..
            [FAILED]
            Error: gtf_to_fasta returned an error.

            Could you solve this probem? and what was the mistake.

            thank you so much..

            Comment


            • #7
              You might have a look in the run log to see if you get a more informative error. Aside from that, ensure that gtf_to_fasta is in your PATH and executable. You can also just directly execute it yourself using the command in the run log.

              Comment


              • #8
                Hi dpryan,

                in the run log I see following details

                /g/software/linux/pack/tophat-2.0.9/bin/gtf_to_fasta --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir controlon/ --max-multihits 20 --max-seg-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --read-mismatches 2 --read-gap-length 2 --read-edit-dist 2 --read-realign-edit-dist 3 --max-insertion-length 3 --max-deletion-length 3 -z gzip -p5 --gtf-annotations Arabidopsis_thaliana_NCBI_TAIR10/Arabidopsis_thaliana/NCBI/TAIR10/Annotation/Archives/archive-2013-03-06-09-50-01/Genes/genome.gtf --gtf-juncs controlon/tmp/genome.juncs --no-closure-search --no-coverage-search --no-microexon-search Arabidopsis_thaliana_NCBI_TAIR10/Arabidopsis_thaliana/NCBI/TAIR10/Annotation/Archives/archive-2013-03-06-09-50-01/Genes/genome.gtf Arabidopsis_thaliana_NCBI_TAIR10/Arabidopsis_thaliana/NCBI/TAIR10/Sequence/Bowtie2Index/genome.fa controlon/tmp/genome.fa > controlon/logs/g2f.out

                but can't understand that there is any error message

                Comment


                • #9
                  I copy-pasted the last lines of run log into the command line and output was like this

                  /g/software/linux/pack/tophat-2.0.9/bin/gtf_to_fasta: /lib64/libz.so.1: no version information available (required by /g/software/linux/pack/tophat-2.0.9/bin/gtf_to_fasta)
                  Error (GFaSeqGet): not a fasta header?

                  Comment


                  • #10
                    The libz warning can be ignored. I've not seen that particular error, but the source code suggests that the fasta file doesn't start with a ">Something" line. Is this the case?

                    Comment


                    • #11
                      I opened the fasta file in the terminal and looks like this

                      cat genome.fa
                      XSym
                      0029
                      7ec8fcfa270da919afc1f0afd221e788
                      ../WholeGenomeFasta/genome.fa

                      I am not sure this is the right, althought I got this fasta file with other index files

                      thanks for your posts..

                      Comment


                      • #12
                        Oh my, something definitely went amiss there. You're going to want to either redownload that or recreate it from you bowtie indexes.

                        Comment


                        • #13
                          I already downloaded index files twice, and both the .fa files look same. I created my own index files but in those files I didn't find.fa files, although there were other .bt2 files.

                          Now I am going to recrteate new index files..

                          Comment


                          • #14
                            Where are you getting the fasta files and indices? Are these from iGenomes?

                            Comment


                            • #15
                              this time also I don't see any .fa file after indexing, I just see 6 .bt2 files

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X