
  • #16
    I'm not sure I follow.

    Here is my prep_reads.info file for one of my samples. This one has 396,555 read pairs:
    left_min_read_len=251
    left_max_read_len=251
    left_reads_in =396555
    left_reads_out=396555
    right_min_read_len=251
    right_max_read_len=251
    right_reads_in =396555
    right_reads_out=396555

    Now here is a prep_reads.info file for a sample from another study that actually produces near perfect alignment:

    left_min_read_len=20
    left_max_read_len=50
    left_reads_in =33601862
    left_reads_out=33599795
    right_min_read_len=20
    right_max_read_len=50
    right_reads_in =33601862
    right_reads_out=33599859


    The only difference seems to be the read length?
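    A minimal sketch for comparing two prep_reads.info files side by side, assuming the simple key=value layout quoted above. The function names and the two embedded strings are only for illustration; they are not part of Tophat.

```python
def parse_prep_reads_info(text):
    """Parse 'key=value' lines (layout as quoted above) into a dict of ints."""
    info = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        key, value = line.split("=", 1)
        # Some keys are written with a space before '=', so strip both sides.
        info[key.strip()] = int(value.strip())
    return info


def summarize(info):
    """Per mate: read-length range and the fraction of reads kept by prep_reads."""
    summary = {}
    for side in ("left", "right"):
        reads_in = info[f"{side}_reads_in"]
        reads_out = info[f"{side}_reads_out"]
        summary[side] = {
            "min_len": info[f"{side}_min_read_len"],
            "max_len": info[f"{side}_max_read_len"],
            "kept_fraction": reads_out / reads_in if reads_in else 0.0,
        }
    return summary


# The two files quoted in this post, copied verbatim.
LONG_READS = """\
left_min_read_len=251
left_max_read_len=251
left_reads_in =396555
left_reads_out=396555
right_min_read_len=251
right_max_read_len=251
right_reads_in =396555
right_reads_out=396555
"""

SHORT_READS = """\
left_min_read_len=20
left_max_read_len=50
left_reads_in =33601862
left_reads_out=33599795
right_min_read_len=20
right_max_read_len=50
right_reads_in =33601862
right_reads_out=33599859
"""

for label, text in (("251 bp sample", LONG_READS), ("20-50 bp sample", SHORT_READS)):
    print(label, summarize(parse_prep_reads_info(text)))
```

    Both samples keep essentially all of their reads at this stage, so the preprocessing step itself does not explain the gap; what stands out is the read length (251 vs. 20-50) and the much smaller read count.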

    Comment


    • #17
      Can I ask for some context here? Reads of what? Derived from mRNA, total RNA, DNA? Is it a 250bp PE MiSeq run? What is your genome and is it eukaryotic? Does it have known annotated genes and did you use the annotation track in your tophat run?

      Comment


      • #18
        The reads are of mRNA sequenced from single cells of primates. I'm not sure how it was run, my knowledge starts off at the point of raw reads given to me. The genome is the Mmul_1 build by Ensembl, of the rhesus monkey. It has annotated genes and I used a reference GTF in tophat. I also assembled a transcriptome index and tried that. None of this made a difference.

        Comment


        • #19
          When you used the reference GTF, did you also set the option to only map to annotated genes (-T/--transcriptome-only)?
          I'm a bit confused as to why you are expecting Tophat and Bowtie to behave identically. Tophat is splice-aware, bowtie is not. I don't understand why you would ever use Bowtie (or any other non-splice-aware mapper) to map RNA-derived sequence to a genome.

          Comment


          • #20
            Also:
            I'm not sure how it was run, my knowledge starts off at the point of raw reads given to me
            While I understand that this happens occasionally and is sometimes not under your control, it is a terrible situation. If you can get any more information on the context of the run (Platform? Run type? Library prep kit used? Size selection? Method of size selection?), then you should.

            Comment


            • #21
              I used bowtie2 (since tophat uses bowtie2 for alignment) as a test to see what was going on. I shouldn't be getting 0.4% alignment with tophat and 46% alignment with bowtie2. That's astronomically different. I am told that GSNAP produces roughly 46% alignment for the same data. So tophat is off...
              Last edited by Studentlost; 10-30-2014, 11:49 AM.

              Comment


              • #22
                I still don't understand why you want to use a non-splice-aware mapper at all, since every mRNA-derived read that spans a splice junction will fail to map properly. Why use Bowtie or SNAP at all? It makes no sense to compare mapping rates for mRNA onto a genome between Bowtie and Tophat in a eukaryotic system.

                Neither SNAP nor Bowtie (nor BWA) is splice-aware, so yes, they should all produce more or less similar alignment rates.

                Tophat is splice-aware. Your data is mRNA-derived. So, you should actually expect Tophat to map at a higher rate than SNAP/Bowtie/BWA since it would correctly map reads that span splice junctions.

                One situation in which Tophat may have a low mapping rate is if you're telling it to only map to known genes (as defined in your GFF/GTF file) and there aren't that many known genes annotated for the monkey you're using (or in the GFF/GTF file you have).

                Could you post the code you used?

                Comment


                • #23
                  I tried many variations. They all gave me the same results. I've run it with and without a library type (first strand and unstranded), with and without a reference transcriptome index, with and without a GTF file, and I always get low values. I also tried it with coverage search, and with setting -r 150.
                  $tophat -p12 --no-coverage-search -o $tophat_dir $reference_genome $refined_reads/R1.atqt.fq $refined_reads/R2.atqt.fq

                  Comment


                  • #24
                    I've run it with and without a library type

                    and with setting -r 150
                    You should know both the library type and the size of the fragments that were selected and sequenced. This should be info that the sequencing provider gives you, and if they don't, they are doing you a disservice & you probably need to hassle them or check any documentation they've sent.

                    with and without a reference transcriptome index
                    This would make no difference except to the speed of the run. When given a GTF, Tophat needs a Bowtie index for the transcriptome; if a prebuilt one isn't supplied it will build one, which slows the run down but won't change the results.

                    Otherwise I'm at a bit of a loss to help. The situation you describe doesn't make any sense to me, so I think I'm missing something.

                    Comment


                    • #25
                      If I knew the size of the fragments selected and sequenced, how would I implement that into my tophat code? Could you give me an example?

                      Comment


                      • #26
                        I still think it would be useful to try running tophat with only a single end of the reads. however, this is sounding more and more like something Tophat may not be able to overcome. then you have to ask whether your goal is to use that data or to get Tophat to work. it sounds like GSNAP is a usable option for you. I also recommend STAR. tools don't always work every time on all data...sometimes they just glitch out and you have to change the pipeline and use a different tool.

                        ideally the way to solve this issue is to provide us a sample set of your reads so that someone can try running the alignment themselves and possibly figure out what is going on.
                        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                        Salk Institute for Biological Studies, La Jolla, CA, USA */

                        Comment


                        • #27
                          Originally posted by Studentlost View Post
                          If I knew the size of the fragments selected and sequenced, how would I implement that into my tophat code? Could you give me an example?
                          You can use the -r option to set the expected inner distance between the mates (the fragment size minus both read lengths), and there is also an option to set the standard deviation of that distance. Size selection is done on a Covaris or simply on an agarose gel, and it's not precise, so not all fragments are exactly the defined size, hence the ability to set a range.
                          Of course there's always the possibility that whoever got this data didn't bother to get this info from the sequencing provider. You could always try setting a gigantic --mate-std-dev (so that paired reads map even if they are not that close to the expected inner distance) and then look at the resulting SAM file to see how far apart the reads are typically mapping. There are various tools to do this.
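                          As one way of "looking at the SAM file", here is a minimal sketch that reads the observed template length (TLEN, column 9 of a SAM record) for pairs where both mates mapped to the same reference. It assumes plain-text SAM input (e.g. produced with samtools view); the function name and the three demo records are made up for illustration.

```python
import statistics


def observed_template_lengths(sam_lines, max_length=None):
    """Collect TLEN once per pair, for pairs with both mates mapped to one reference."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        flag = int(fields[1])
        rnext = fields[6]
        tlen = int(fields[8])
        # 0x1 = paired, 0x4 = read unmapped, 0x8 = mate unmapped,
        # 0x100 = secondary alignment, 0x800 = supplementary alignment
        if not flag & 0x1 or flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        if rnext != "=":
            continue  # mate is on a different reference sequence
        if tlen <= 0:
            continue  # keep only the leftmost mate so each pair counts once
        if max_length is not None and tlen > max_length:
            continue  # optionally drop pairs that span long introns
        lengths.append(tlen)
    return lengths


# Made-up records purely to show the input format (tab-separated SAM fields).
seq, qual = "A" * 50, "I" * 50
demo = [
    "@SQ\tSN:chr1\tLN:1000",
    f"r1\t99\tchr1\t100\t50\t50M\t=\t350\t300\t{seq}\t{qual}",
    f"r1\t147\tchr1\t350\t50\t50M\t=\t100\t-300\t{seq}\t{qual}",
    f"r2\t99\tchr1\t200\t50\t50M\t=\t470\t320\t{seq}\t{qual}",
    f"r2\t147\tchr1\t470\t50\t50M\t=\t200\t-320\t{seq}\t{qual}",
    f"r3\t77\t*\t0\t0\t*\t*\t0\t0\t{seq}\t{qual}",
]

lengths = observed_template_lengths(demo)
print("pairs used:", len(lengths), "median TLEN:", statistics.median(lengths))
```

                          For RNA mapped to a genome, TLEN includes any introns between the mates, so the median (or a capped max_length) is more informative than the mean. Subtracting both read lengths from the typical TLEN gives a rough inner distance.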

                          From the manual:


                          Code:
                          -r/--mate-inner-dist <int>    This is the expected (mean) inner distance between mate pairs. For example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. The default is 50bp.
                          --mate-std-dev <int>    The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
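                          The arithmetic in the manual excerpt can be sketched as below. This is only an illustration: the helper names are made up, the fragment size is assumed to exclude adapters, and the 400 bp fragment size is a hypothetical value, not something known about this data set.

```python
def mate_inner_distance(fragment_size, read_length):
    """Inner distance = fragment size minus the two read lengths."""
    return fragment_size - 2 * read_length


def tophat_insert_options(fragment_size, read_length, std_dev):
    """Build the -r / --mate-std-dev arguments as a list of strings."""
    inner = mate_inner_distance(fragment_size, read_length)
    return ["-r", str(inner), "--mate-std-dev", str(std_dev)]


# The manual's own example: 300 bp fragments, 2 x 50 bp reads -> -r 200.
print(tophat_insert_options(300, 50, 20))

# Hypothetical: 400 bp fragments with 2 x 251 bp reads. The result is negative,
# meaning the two mates would overlap in the middle of the fragment.
print(mate_inner_distance(400, 251))

# Conversely, -r 150 with 251 bp reads implies fragments of 150 + 2 * 251 bp.
print(150 + 2 * 251)
```

                          With 251 bp reads, a positive inner distance only makes sense if the fragments are longer than 502 bp, which is worth checking against whatever the sequencing provider reports.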
                          But I do want to reiterate that a side-by-side comparison of tophat and bowtie simply does not make sense for mapping mRNA onto a genome. It's apples and oranges, they are doing quite different things.

                          tools don't always work every time on all data...sometimes they just glitch out and you have to change the pipeline and use a different tool.
                          Respectfully disagree with this. Nothing "glitches out"; something specific has happened, and you need to understand what, or you may miss something important about your data and end up analysing things in a totally inappropriate way.

                          Comment


                          • #28
                            the comparison to bowtie2 makes sense if you understand the comparison. anytime you map RNA to a genome with a splice aware aligner versus a DNA aligner the splice aware one should produce higher mapping rates. the fact that he tried that and got a higher mapping rate with the DNA mapper excludes the possibility that there is something specifically wrong with the reads and isolates the issue to Tophat. he also has additional knowledge that GSNAP can align the PE reads just fine (or at least as well as bowtie2 could).

                            the comparison to bowtie2 makes even more sense when you realize the steps Tophat takes in alignment. step one (without a transcriptome reference) is unspliced alignment to the genome with bowtie2, so AT MINIMUM you'd expect Tophat to replicate that mapping rate. from step two on, it is all about fragmenting reads into 25 base pieces, mapping them to find potential splice sites, and then refining down to final alignments.
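                            to picture the "25 base pieces" step, here is a simplified sketch. it is not Tophat's code; it only shows how many pieces reads of different lengths would yield if cut every 25 bases, and it does not model how Tophat treats a short leftover piece.

```python
def split_into_segments(read, segment_length=25):
    """Cut a read into consecutive pieces of at most segment_length bases."""
    return [read[i:i + segment_length] for i in range(0, len(read), segment_length)]


short_read = "A" * 50    # stand-in for a 50 base read
long_read = "A" * 251    # stand-in for a 251 base read

print(len(split_into_segments(short_read)))        # pieces from a 50 base read
print(len(split_into_segments(long_read)))         # pieces from a 251 base read
print(len(split_into_segments(long_read)[-1]))     # length of the leftover piece
```

                            a 50 base read gives two pieces, while a 251 base read gives ten full pieces plus a single leftover base, so the segment-based steps have far more to piece together per read in the 251 base data.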

                            I do have one question for OP. how did you measure the alignment percentage from Tophat? any chance you could post the output of 'samtools flagstat' for both the accepted_hits.bam from Tophat and the bowtie2 aligned bam file?
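                            to make the question about measurement concrete, here is a minimal sketch of one way to compute a mapping percentage from SAM text. it assumes the denominator is the number of reads that went into the aligner (for Tophat, the *_reads_out values in prep_reads.info), since a file that contains only aligned records cannot tell you how many reads failed to align. the function names and demo records are made up.

```python
def count_primary_mapped(sam_lines):
    """Count records that are mapped and neither secondary nor supplementary."""
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        flag = int(fields[1])
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & (0x4 | 0x100 | 0x800):
            continue
        mapped += 1
    return mapped


def mapping_rate(mapped_reads, input_reads):
    """Percentage of input reads that have a primary alignment."""
    return 100.0 * mapped_reads / input_reads if input_reads else 0.0


# Made-up records: two mapped mates, one secondary alignment, one unmapped read.
seq, qual = "A" * 50, "I" * 50
demo = [
    "@SQ\tSN:chr1\tLN:1000",
    f"r1\t99\tchr1\t100\t50\t50M\t=\t350\t300\t{seq}\t{qual}",
    f"r1\t147\tchr1\t350\t50\t50M\t=\t100\t-300\t{seq}\t{qual}",
    f"r1\t355\tchr1\t600\t0\t50M\t=\t850\t300\t{seq}\t{qual}",
    f"r2\t77\t*\t0\t0\t*\t*\t0\t0\t{seq}\t{qual}",
]

mapped = count_primary_mapped(demo)
print(mapped, mapping_rate(mapped, input_reads=4))
```

                            if an aligner reports multi-mapped reads without marking the extra alignments as secondary, this count will be inflated, so the two tools should be compared with the same counting rule and the same denominator.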
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • #29
                              Originally posted by sdriscoll View Post
                              anytime you map RNA to a genome with a splice aware aligner versus a DNA aligner the splice aware one should produce higher mapping rates.
                              Agree totally, and I made this point upthread. This is why I thought it might be an issue of Tophat being restricted to mapping to known transcripts in the GTF, but apparently that's not the case. As you say, I struggle to see any situation in which Tophat would give a lower alignment rate than Bowtie on mRNA data mapped to a genome. It's odd.

                              he also has additional knowledge that GSNAP can align the PE reads just fine (or at least as well as bowtie2 could).
                              Not really. According to my reading of the OP, (s)he has second-hand knowledge that SNAP (not GSNAP) aligns OK.

                              OP: can you run GSNAP (or another splice-aware aligner) on this data yourself? Posting samtools flagstat outputs is also a very good idea as sdriscoll suggested.

                              Another question for OP: Did you use Bowtie or Bowtie2 for the 46% mapping? Not that it should make that much difference, but Tophat defaults to using Bowtie2 unless you tell it not to.

                              Comment


                              • #30
                                the funny part in all of this is that Tophat bothered me enough that I stopped using it a couple of years ago, so my best solution is to use something else, but OP seems pretty committed to it. I checked back in the thread and OP mentioned gsnap in post #12, so maybe I missed whether that was edited at some point.
                                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                                Salk Institute for Biological Studies, La Jolla, CA, USA */

                                Comment
