Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat transcriptome indexing

    Hi,

    I'm trying to use tophat on some data and am having an issue. I want to make a transcriptome mapping file like this:

    Code:
     tophat -G CMCP6_adjusted.gff3 --transcriptome-index=CMCP6_adjusted CMCP6_adjusted
    The output from this command, through many variations, looks like this:
    Code:
    [2016-03-03 12:58:49] Building transcriptome files with TopHat v2.1.1
    -----------------------------------------------
    [2016-03-03 12:58:49] Checking for Bowtie
                      Bowtie version:        2.1.0.0
    [2016-03-03 12:58:49] Checking for Bowtie index files (genome)..
    [2016-03-03 12:58:49] Checking for reference FASTA file
    [2016-03-03 12:58:49] Building transcriptome data files CMCP6_adjusted/CMCP6_adjusted
            [FAILED]
     Error: gtf_to_fasta returned an error.
    So I've read the manual for tophat and many similar issues on these forums and others. I'm guessing there's an issue with the chromosome names. So I started working on that. Here are the steps I've taken and what I've tried to do to address them so far.

    Build the bowtie2 index:
    Code:
    bowtie2-build CMCP6.fa CMCP6
    Here is the head of the fasta file I'm using:
    Code:
    >gi|326423644|ref|NC_004459.3| Vibrio vulnificus CMCP6 chromosome I, complete sequence
    TTACTTTAAATAATAAAAATCGTATAACGTTCGGGTAATTTCTGCA...
    I have both gff3 and a gtf I made from it available. Here are the first couple of lines from each in case it's important.
    GFF3:
    Code:
    NC_004459       GenBank chromosome      1       3281866 .       +       .       ID=NC_004459;Name=NC_004459;Alias=I;Dbxref=taxon:216895;Note=Vibrio vulnificus CMCP6 chromosome I%2C complete sequence.;chromosome=I;comment1=REFSEQ INFORMATION: The reference sequence was derived from AE016795. On Mar 18%2C 2011 this sequence version replaced gi:117956319. Annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (released 2013). Information about the Pipeline can be found here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/ ##Genome-Annotation-Data-START## Annotation Provider :: NCBI Annotation Date :: 08/18/2015 01:03:26 Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline Annotation Method :: Best-placed reference protein set%3B GeneMarkS+ Annotation Software revision :: 3.0 Features Annotated :: Gene%3B CDS%3B rRNA%3B tRNA%3B ncRNA%3B repeat_region Genes :: 4%2C578 CDS 4%2C374 Pseudo Genes :: 64 rRNAs :: 10%2C 9%2C 9 (5S%2C 16S%2C 23S) complete rRNAs :: 10%2C 9%2C 9 (5S%2C 16S%2C 23S) tRNAs :: 111 ncRNA :: 1 Frameshifted Genes :: 43 Frameshifted Genes On Monomer Runs :: 8 Frameshifted Genes Not On Monomer Runs :: 1 ##Genome-Annotation-Data-END## COMPLETENESS: full length. ;date=18-AUG-2015;mol_type=genomic DNA;organism=Vibrio vulnificus CMCP6;strain=CMCP6
    NC_004459       GenBank gene    1       1002    .       -       .       ID=VV1_RS00005;Name=VV1_RS00005;locus_tag=VV1_RS00005;old_locus_t
    GTF:
    Code:
    NC_004459       GenBank chromosome      1       3281866 .       +       .       ID "NC_004459"; Name "NC_004459"; Alias "I"; Dbxref "taxon:216895"; Note "Vibrio vulnificus CMCP6 chromosome I, complete sequence."; chromosome "I"; comment1 "REFSEQ INFORMATION: The reference sequence was derived from AE016795. On Mar 18, 2011 this sequence version replaced gi:117956319. Annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (released 2013). Information about the Pipeline can be found here: http://www.ncbi.nlm.nih.gov/genome/annotation_prok/ ##Genome-Annotation-Data-START## Annotation Provider :: NCBI Annotation Date :: 08/18/2015 01:03:26 Annotation Pipeline :: NCBI Prokaryotic Genome Annotation Pipeline Annotation Method :: Best-placed reference protein set; GeneMarkS+ Annotation Software revision :: 3.0 Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA; repeat_region Genes :: 4,578 CDS 4,374 Pseudo Genes :: 64 rRNAs :: 10, 9, 9 (5S, 16S, 23S) complete rRNAs :: 10, 9, 9 (5S, 16S, 23S) tRNAs :: 111 ncRNA :: 1 Frameshifted Genes :: 43 Frameshifted Genes On Monomer Runs :: 8 Frameshifted Genes Not On Monomer Runs :: 1 ##Genome-Annotation-Data-END## COMPLETENESS: full length."; date "18-AUG-2015"; mol_type "genomic DNA"; organism "Vibrio vulnificus CMCP6"; strain "CMCP6";
    NC_004459       GenBank gene    1       1002    .       -       .       ID "VV1_RS00005"; Name "VV1_RS00005"; locus_tag "VV1_RS00005"; old_locus_tag "VV1_0001";
    So according to the tophat manual, when you use the --transcriptome-index, "...please make sure that the 1st column in the annotation file uses the exact same chromosome/contig names (case sensitive) as shown by the bowtie-inspect command..."

    Here is the result of that command:
    Code:
    bowtie2-inspect --names CMCP6
    gi|326423644|ref|NC_004459.3| Vibrio vulnificus CMCP6 chromosome I, complete sequence
    Using some command line tools, here's what I have in the gtf file for chromosome:
    Code:
    cut -f1 CMCP6.gtf | sort -u
    ##date 2016-02-26
    ##gff-version 2
    NC_004459
    So I see the fasta file has the NC_004459.3 and the gtf file has just NC_004459. So I change the fasta file to NC_004459, rebuild the index, and try again.
    Code:
    bowtie2-inspect --names CMCP6
    gi|326423644|ref|NC_004459| Vibrio vulnificus CMCP6 chromosome I, complete sequence
    Now, I try to build the transcriptome file:
    Code:
    tophat -G CMCP6.gtf --transcriptome-index=transcriptome/CMCP6 CMCP6
    
    [2016-03-03 13:21:48] Building transcriptome files with TopHat v2.1.1
    -----------------------------------------------
    [2016-03-03 13:21:48] Checking for Bowtie
                      Bowtie version:        2.1.0.0
    [2016-03-03 13:21:48] Checking for Bowtie index files (genome)..
    [2016-03-03 13:21:48] Checking for reference FASTA file
    [2016-03-03 13:21:48] Building transcriptome data files transcriptome/CMCP6
    [2016-03-03 13:21:48] Building Bowtie index from CMCP6.fa
            [FAILED]
    Error: Couldn't build bowtie index with err = 1
    So no change there, something still doesn't match but I don't know what. The tophat manual said the names should match exactly, so I've tried several variations in both files for the chromosome name, but no matter what I do I end up with the same thing.

    The g2f.err file shows the following:
    Code:
    Warning: invalid GTF record, transcript_id not found:
    NC_004459       GenBank mRNA    1       1002    .       -       .       ID "VV1_RS00005.t01"; Name "VV1_RS00005.t01"; Parent "VV1_RS00005";
    Warning: invalid GTF record, transcript_id not found:
    NC_004459       GenBank CDS     1       1002    .       -       0       ID "VV1_RS00005.p01"; Name "VV1_RS00005.p01"; Dbxref "GI:499390662"; Note "Der$
    Warning: invalid GTF record, transcript_id not found:
    NC_004459       GenBank exon    1       1002    .       -       .        Parent "VV1_RS00005.t01";
    This makes me think there is some id matching issue going on here, but I'm unsure how to correct it. A folder called 'transcriptome' is also produced by my command, and it contains a file, CMCP6.fa, that is empty, if that says anything about the point at which tophat is failing.

    I've read many threads here and elsewhere, as well as the manual for tophat, but am still unsure how to correct this. Also, in other places people often suggest getting different versions of my files, but I don't think that's going to be an option for me for this bacteria, and the next one I have to do the same thing to. I would appreciate any help anyone can offer on this. Thanks very much in advance!

  • #2
    Why not simplify the headers for the fasta file so only the accession number remains: NC_004459.3?

    Since this is bacterial RNAseq data you don't need to worry about splicing. You could just map using an aligner of your choice (BBMap would be easiest) and then go to featurecounts/htseq-count for counting.

    Comment


    • #3
      Code:
      Why not simplify the headers for the fasta file so only the accession number remains: NC_004459.3?
      I've actually tried this. It doesn't change anything as it turns out.


      Code:
       You could just map using an aligner of your choice (BBMap would be easiest) and then go to featurecounts/htseq-count for counting.
      I'm actually trying to reproduce some methodology from a paper and so I wanted to use the same methods. I was planning to do htseq-count/DESeq if I can't get this working as a last resort though. I may end up doing this though.

      Comment


      • #4
        Originally posted by aprice67 View Post
        I've actually tried this. It doesn't change anything as it turns out.
        Did you re-do the entire pipeline since you would need to re-index the fasta, remap etc.

        Comment


        • #5
          Code:
          Did you re-do the entire pipeline since you would need to re-index the fasta, remap etc.
          I did. I double checked with 'bowtie2-inspect --names CMCP6' too to confirm.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          51 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          67 views
          0 likes
          Last Post seqadmin  
          Working...
          X