  • Bowtie2 detecting human transcripts that STAR misses

    Hi everyone,

    I'm having the problem described in the title, and it's not making any sense to me. On my RNA-Seq dataset I run STAR, then look at the leftover transcripts, usually by BLASTing some of them. Often they are still mostly human (they get aligned to hg20 using Bowtie2). I can't understand this at all: STAR, being a spliced aligner, should be aligning far more than Bowtie2 does. I was thinking it could indicate human DNA contamination, but even then, shouldn't STAR still align contiguous sequences? Here are two such reads that weren't aligned by STAR but are by Bowtie2 (they're not paired-end, so these are two different reads). I'd hate to stop using STAR; I love that speed.

    TATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGT

    ACCTTCTAGTGGTGTTTACTTGAGACCTTTTGTCATTTAATGTGTGCTGAATAAATGCCAGCACCCCTGAGTAGAAAGCAATCATGTACCTGCAGATGGTC

    Hopefully someone can point me in the right direction!
    Thanks!

  • #2
    Did you mean you look at the leftover reads (as opposed to transcripts)?

    Also, what's the quality like on those reads, and what do the Bowtie alignments look like?



    • #3
      Yeah, the leftover reads are what I meant. The quality varies a bit; there are some bad ones in there, but plenty of good ones too. The quality on all of these reads should be enough to allow an accurate alignment.

      The alignments look fine. As I said in the previous post, I BLASTed a lot of these reads first, and they were hitting human sequences, so that's when I decided to run Bowtie2. So I think the Bowtie2 alignments are accurate, or relatively so anyway. I just don't understand why STAR didn't detect these.



      • #4
        Well, I dunno what Bowtie2 is doing, but that first sequence you posted above has a 100% hit to various bacterial sequences, and no hits to human using megablast, so I'd be rather glad STAR ain't aligning it. The 2nd seems to hit some random stretch of the hg not associated with any gene, and it looks chimeric; it needs BLAST against nr, since megablast vs hg finds no hits.

        I wouldn't worry about them. What % of your reads fall in this category?

        Any chance your username comes from Arrested Development?



        • #5
          Oh sorry my bad, that first sequence must be from some other source.

          Well, that's the problem: in some files it's as high as 50%. I've had problems with contamination in this dataset before, though, so I wouldn't be surprised if there was more.



          • #6
            And yeah, my username comes from Arrested Development: Bob Loblaw's Law Blog.

            You know, come to think of it, I have seen something like this in RNA-Seq datasets before, even published ones, where one sequences the transcriptome of human or mouse or whatever, but not all of it aligns back to the reference database (in my experience, sometimes as much as 10 or 15% doesn't). I was never really able to find an answer as to why that was; I always just figured it was chimeric reads and stuff. Perhaps that is the case and Bowtie2 is able to align them where STAR is not... or maybe I'm grasping at straws here.



            • #7
              Perhaps STAR has trouble with reads containing sequencing errors. Do the alignments in bowtie2 but not STAR contain lots of mismatches and/or clipping?



              • #8
                I normally get about a 10% miss rate with mapping, but I finished a bunch of STAR runs this morning and found a miss rate of 25%.

                If I find anything in it I'll get back to you; otherwise, 'fraid I've got nothing.



                • #9
                  If you want a higher mapping rate... you might give BBMap a try. It's splice-aware and substantially more sensitive than Tophat.



                  • #10
                    hi @bob-loblaw,

                    As @mikep pointed out, the second sequence maps chimerically. You would need to enable chimeric output with --chimSegmentMin 20, and then STAR will output it into Chimeric.out.sam:

                    1 0 chr10 110358273 3 61M40S * 0 0 ACCTTCTAGTGGTGTTTACTTGAGACCTTTTGTCATTTAATGTGTGCTGAATAAATGCCAGCACCCCTGAGTAGAAAGCAATCATGTACCTGCAGATGGTC * NH:i:2 HI:i:1 AS:i:62 NM:i:0 MD:Z:61
                    1 272 chr10 110358218 3 40M61S * 0 0 GACCATCTGCAGGTACATGATTGCTTTCTACTCAGGGGTGCTGGCATTTATTCAGCACACATTAAATGACAAAAGGTCTCAAGTAAACACCACTAGAAGGT * NH:i:2 HI:i:2 AS:i:43 NM:i:0 MD:Z:40
                    I believe this is the same as the BLAST alignment. This is a strange chimeric sequence, with two pieces mapping in the same locus on the opposite strands.
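
For anyone who wants to check the "opposite strands" point themselves, here is a rough Python sketch (not part of STAR; the function names `reverse_complement` and `decode_flag` are just made up for illustration). It decodes the SAM FLAG field using the bit values from the SAM specification and reverse-complements a sequence, so you can confirm that the SEQ in the second record (FLAG 272) is the reverse complement of the read in the first record (FLAG 0).

```python
# Illustrative helpers for reading the two SAM records above.
# Bit meanings follow the SAM specification:
#   0x10  = read aligned to the reverse strand
#   0x100 = secondary alignment
#   0x800 = supplementary alignment

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def decode_flag(flag):
    """Decode the strand/secondary/supplementary bits of a SAM FLAG."""
    return {
        "reverse_strand": bool(flag & 0x10),
        "secondary": bool(flag & 0x100),
        "supplementary": bool(flag & 0x800),
    }
```

FLAG 272 is 256 + 16, i.e. a secondary alignment on the reverse strand, which is why SAM stores its SEQ reverse-complemented relative to the original read.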

                    You can also allow the output of the longer segment into the Aligned.out.sam file by reducing the minimum mapped score/length requirements, e.g. --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0.5:
                    1 0 chr10 110358273 255 63M38S * 0 0 ACCTTCTAGTGGTGTTTACTTGAGACCTTTTGTCATTTAATGTGTGCTGAATAAATGCCAGCACCCCTGAGTAGAAAGCAATCATGTACCTGCAGATGGTC * NH:i:1 HI:i:1 AS:i:62 NM:i:0 MD:Z:63
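
To see why the longer segment is filtered out by default, here is a small Python sketch of the arithmetic (an approximation, not STAR's actual code; `cigar_lengths` and `passes_match_filter` are hypothetical names). It assumes the STAR default of 0.66 for --outFilterMatchNminOverLread and treats the M/=/X operations in the CIGAR as matched bases, which is only accurate when there are no mismatches (NM:i:0, as in the records above).

```python
import re

# One CIGAR operation: a count followed by an operation letter.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Return (aligned_bases, read_length) for a SAM CIGAR string."""
    aligned = 0
    read_len = 0
    for count, op in _CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":      # bases aligned to the reference
            aligned += n
        if op in "MIS=X":    # operations that consume read bases
            read_len += n
    return aligned, read_len


def passes_match_filter(cigar, min_over_lread=0.66):
    """Approximate the --outFilterMatchNminOverLread check from the CIGAR."""
    aligned, read_len = cigar_lengths(cigar)
    return aligned / read_len >= min_over_lread
```

For the 101-base read here, 61M40S gives 61/101, about 0.60, which is below 0.66, so `passes_match_filter("61M40S")` is False. With the threshold lowered to 0.5, `passes_match_filter("63M38S", 0.5)` is True.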

                    The low mapping rate may be caused by various factors. The Log.final.out file can give you some hints about mapped length, error rate, multi-mappers etc. (if you post it I can have a look at it). You can try reducing the --outFilterMatchNminOverLread value to check whether only small portions of the reads can be mapped. The most typical reasons for low mappability are
                    (i) rRNA. Normally these appear as multimappers; make sure that you include unplaced scaffolds in the genome, since one of them contains very highly expressed rRNA loci.
                    (ii) poor sequencing quality of the read ends (then reducing --outFilterMatchNminOverLread will help)
                    (iii) contamination
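
If you want to compare Log.final.out across many samples, a minimal Python sketch like the following may help. It assumes the usual layout of that file, where each statistic is written as "name | value" on its own line; `parse_star_log` and `percent` are hypothetical helper names, and the exact row labels should be checked against your own log.

```python
def parse_star_log(text):
    """Parse the text of a STAR Log.final.out into a {name: value} dict.

    Assumes each statistic is on its own line as 'name | value';
    section headers and blank lines (no '|') are skipped.
    """
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '25.00%' to the float 25.0."""
    return float(stats[key].rstrip("%"))
```

For example, after `stats = parse_star_log(open("Log.final.out").read())`, comparing `percent(stats, "% of reads unmapped: too short")` with `percent(stats, "% of reads mapped to multiple loci")` gives a first hint as to whether the loss comes from partially mappable reads (reasons ii and iii) or from multimappers such as rRNA (reason i).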

                    Hopefully, that strange chimeric sequence is not representative of the reads that cannot be mapped - if so, it would mean some strange library making artifact.

                    Cheers
                    Alex
