  • Splice site prediction with SOLiD RNA-seq data

    Hi all

    We are having problems predicting splice sites from our SOLiD RNA-seq data. We have a draft genome (125 Mb, a eukaryote) assembled from 454 data and are now trying to map our SOLiD reads to this genome to predict splice sites. The idea is to use these predicted splice sites as intron hints for the gene finder Augustus so that it produces correct gene models.

    We are currently trying Bowtie/Tophat, but get weird results. For example, when working with a subset of our reads we find some splice sites, but these are no longer found when we add more data. Also, we have earlier tried Corona Light together with Splitseek, and Bowtie/Tophat does not find sites that were found with Corona Light/Splitseek. On the other hand, Corona Light/Splitseek is time-consuming and awkward to run and often reports splice sites that are a few bp off, so that is not an ideal choice either.

    This cannot be an uncommon situation, so what are the rest of you doing in these situations? No closely related genomes have been sequenced.

  • #2
    Another reasonable choice might be hmmSplicer, at least for comparison. I've had what look to be reasonable results from it in the past. I take it you're working in sequence space, not colour space?



    • #3
      Originally posted by colindaven
      Another reasonable choice might be hmmSplicer, at least for comparison. I've had what look to be reasonable results from it in the past. I take it you're working in sequence space, not colour space?

      Thanks for the reply. No, we are working in color space. Reads converted to sequence space are too easily corrupted: a single error in a color-space read changes every base decoded after it. However, if you or anyone else has had good success with converting to sequence space, I would love to hear about it. The general recommendation seems to be to map in color space.
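To illustrate the concern about converting color-space reads, here is a minimal sketch of the standard SOLiD two-base encoding (A=0, C=1, G=2, T=3, color = XOR of adjacent bases). The function names, the example sequence, and the choice of primer base are made up for illustration; this is not code from any of the tools discussed in the thread.

```python
# Sketch of SOLiD color-space encoding/decoding (assumed 2-bit codes: A=0, C=1, G=2, T=3).
BASES = "ACGT"

def encode(seq, primer="T"):
    """Encode a base sequence as a color-space read: primer base + one color per base."""
    prev = primer
    colors = []
    for base in seq:
        colors.append(str(BASES.index(prev) ^ BASES.index(base)))
        prev = base
    return primer + "".join(colors)

def decode(csread):
    """Naively translate a color-space read back to bases, one color at a time."""
    prev = csread[0]
    out = []
    for color in csread[1:]:
        prev = BASES[BASES.index(prev) ^ int(color)]
        out.append(prev)
    return "".join(out)

true_seq = "ACGTACGT"          # hypothetical example sequence
cs = encode(true_seq)

# Simulate one sequencing error: change the third color call only.
bad = cs[:3] + str((int(cs[3]) + 1) % 4) + cs[4:]
decoded_bad = decode(bad)

# Every base from the erroneous color onward is decoded incorrectly.
mismatches = sum(a != b for a, b in zip(true_seq, decoded_bad))
print(cs, bad, decoded_bad, mismatches)
```

With naive decoding, one wrong color shifts all downstream bases, which is why mappers that align in color space (and only translate after alignment) are usually preferred.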



      • #4
        Originally posted by Hobbe
        Hi all

        We are having problems predicting splice sites from our SOLiD RNA-seq data. We have a draft genome (125 Mb, a eukaryote) assembled from 454 data and are now trying to map our SOLiD reads to this genome to predict splice sites. The idea is to use these predicted splice sites as intron hints for the gene finder Augustus so that it produces correct gene models.
        Augustus can cope with "hints" created by mapping Illumina reads (converted to fasta) with splice-agnostic blat. So as long as you have some gene models for training, unspliced mappings should work, I hope.

        Originally posted by Hobbe
        We are currently trying Bowtie/Tophat, but get weird results. For example, when working with a subset of our reads we find some splice sites, but these are no longer found when we add more data. Also, we have earlier tried Corona Light together with Splitseek, and Bowtie/Tophat does not find sites that were found with Corona Light/Splitseek. On the other hand, Corona Light/Splitseek is time-consuming and awkward to run and often reports splice sites that are a few bp off, so that is not an ideal choice either.

        This cannot be an uncommon situation, so what are the rest of you doing in these situations? No closely related genomes have been sequenced.
        I got strange results comparing TopHat with Bowtie when mapping SOLiD reads without a GFF gene-model guide (draft+ mammalian genome): Bowtie in colorspace mapped _more_ reads than TopHat. I used the latest versions (TopHat 1.3.1 and Bowtie 0.12.7).



        • #5
          Originally posted by darked89
          Augustus can cope with "hints" created by mapping Illumina reads (converted to fasta) with splice-agnostic blat. So as long as you have some gene models for training, unspliced mappings should work, I hope.

          Blat is the preferred program to use for spliced mapping (see the Augustus Rnaseq instructions). You really need those intron hints to get correct gene models. Blat doesn't work on Solid data though.

          Most important in our case was to have Augustus trained on the actual organism. We did this using our 454 cDNA data, and with this training the number of correctly found genes in our small set of 14 known test genes increased from 6 to 9 (compared to using the training files for distantly related organisms that came with Augustus). Adding intron hints, we are now up to 11 out of 14 genes, but this is only with a small part of our SOLiD RNA-seq data, and we are now working on adding more hints. The only solution we have just now is using the old Corona Light pipeline together with Splitseek by Adam Ameur. Slow, but seems to work.
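As a rough sketch of how spliced alignments could be turned into intron hints, the snippet below converts one junction record into a GFF-style hint line. It assumes the junctions come in a TopHat-like `junctions.bed` layout (BED12, with the two flanking blocks in the `blockSizes` column and the supporting read count in the score column) and that the hints use Augustus' usual nine-column layout with `mult`, `pri` and `src` attributes. The function name, the source label `b2h`, and the priority value are placeholders; check the Augustus documentation and your `extrinsic.cfg` before relying on any of this.

```python
def junction_to_hint(bed_line, source="b2h", priority=4):
    """Convert one BED12 junction line into an Augustus-style intron hint (GFF, 1-based)."""
    f = bed_line.rstrip("\n").split("\t")
    chrom, start, end = f[0], int(f[1]), int(f[2])
    support, strand = f[4], f[5]
    block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]

    # BED coordinates are 0-based, half-open; the intron lies between the two blocks.
    intron_start = start + block_sizes[0] + 1   # first intron base, 1-based
    intron_end = end - block_sizes[-1]          # last intron base, 1-based

    # Assumed attribute layout: multiplicity, priority, and hint source "E".
    attrs = f"mult={support};pri={priority};src=E"
    return "\t".join([chrom, source, "intron", str(intron_start),
                      str(intron_end), "0", strand, ".", attrs])

# Hypothetical junction supported by 7 reads on a made-up scaffold.
example = "scaffold_1\t100\t400\tJUNC00000001\t7\t+\t100\t400\t255,0,0\t2\t30,25\t0,275"
hint = junction_to_hint(example)
print(hint)
```

Junctions supported by very few reads are probably worth filtering before they are passed on as hints, especially given the off-by-a-few-bp sites mentioned above.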

          IMO, there is still a great need for a good spliced mapper for Solid data.



          • #6
            Originally posted by Hobbe
            Blat is the preferred program to use for spliced mapping (see the Augustus Rnaseq instructions). You really need those intron hints to get correct gene models. Blat doesn't work on Solid data though.
            Same for FASTQ format. Maybe there is something to be gained from color-to-FASTA conversion followed by mapping with blat.

            Originally posted by Hobbe
            Most important in our case was to have Augustus trained on the actual organism. We did this using our 454 cDNA data, and with this training the number of correctly found genes in our small set of 14 known test genes increased from 6 to 9 (compared to using the training files for distantly related organisms that came with Augustus). Adding intron hints, we are now up to 11 out of 14 genes, but this is only with a small part of our SOLiD RNA-seq data, and we are now working on adding more hints.
            You may also try CEGMA (http://korflab.ucdavis.edu/Datasets/cegma/) to produce yet another training or testing set. At times there is no way out except starting semi-manual annotation, again be it for the training or testing sets. Blastp your Augustus predictions: genes with high conservation/100% coverage in other species are likely to be real.
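The blastp check could be scripted roughly as follows. This is only a sketch: it assumes blastp was run with a custom tabular format whose columns are `qseqid sseqid pident length qstart qend qlen evalue`, and the thresholds and function name are arbitrary placeholders. It also uses query coverage only, which is a simplification of "100% coverage".

```python
def well_supported(blast_lines, min_cov=0.9, min_ident=50.0, max_evalue=1e-10):
    """Return {query id: best query coverage} for predictions with a strong protein hit.

    Assumed columns (tab-separated): qseqid sseqid pident length qstart qend qlen evalue
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, pident, _length, qstart, qend, qlen, evalue = \
            line.rstrip("\n").split("\t")
        # Fraction of the predicted protein covered by this alignment.
        cov = (int(qend) - int(qstart) + 1) / int(qlen)
        if (float(pident) >= min_ident and float(evalue) <= max_evalue
                and cov >= min_cov):
            best[qseqid] = max(best.get(qseqid, 0.0), cov)
    return best

# Made-up hits: g1.t1 is covered end to end, g2.t1 only over a quarter of its length.
hits = [
    "g1.t1\tsp|X\t85.0\t300\t1\t300\t300\t1e-150",
    "g2.t1\tsp|Y\t90.0\t100\t50\t149\t400\t1e-40",
]
supported = well_supported(hits)
print(supported)
```

Predictions that pass such a filter could then serve as candidates for the training or testing set, after a manual look.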

            Originally posted by Hobbe
            The only solution we have just now is using the old Corona Light pipeline together with Splitseek by Adam Ameur. Slow, but seems to work.
            Is that the setup currently recommended by the Splitseek author? The Splitseek 1.3.4 manual recommends the Whole Transcriptome Pipeline.

            Originally posted by Hobbe
            IMO, there is still a great need for a good spliced mapper for Solid data.
            Indeed. I have found some other software (X-MATE), but it requires junction libraries and uses yet another pipeline (http://solidsoftwaretools.com/gf/project/mapreads/).
            See:



            • #7
              Hi,

              Just a few words about SplitSeek from the author. It only works with the split read mapper from the AB Whole Transcriptome Pipeline, and always has. I'm aware it is awkward, but unfortunately there are currently no good alternatives.

              The good news is that AB WTP actually works fine once you get it to run. I even managed to run some 75bp reads from the SOLiD5500 through WTP and SplitSeek (using 25bp anchors in the mapping) so it might be an option also in the future.

              /Adam
