  • Overlapping paired end - tophat

    Hi,

    I have a paired-end (2x75) Illumina data set in which the two reads of a pair may overlap. The selected fragment size was 240 bp; after subtracting the adapter/primer sequences, about 120 bp of insert is left, which gives about 30 bp of overlap between the read ends.

    My questions are:

    1) Is this going to affect TopHat alignment? How should the -m option be specified?

    2) When counting coverage, my intuition is that the overlapping bases might be counted twice, even though they appear in the library only once. Is there any way to get around this?

    3) Is this going to affect Cufflinks transcript assembly and quantitation?

    Thanks for your help!

  • #2
    I don't know how TopHat reacts to it but I can already tell you that Bowtie won't like it, and hence Tophat will fail, too.

    I'm currently working with a similar data set and noted that Bowtie fails to find an alignment for an overlapping paired read (and so does Eland). I ended up aligning the two ends separately and then stitching things together manually.

    Of course, this is not an ideal solution.

    Simon



    • #3
      Originally posted by Simon Anders
      I don't know how TopHat reacts to it but I can already tell you that Bowtie won't like it, and hence Tophat will fail, too.

      I'm currently working with a similar data set and noted that Bowtie fails to find an alignment for an overlapping paired read (and so does Eland). I ended up aligning the two ends separately and then stitching things together manually.

      Of course, this is not an ideal solution.

      Simon
      How did you stitch them?
      With samtools merge?
      http://kevin-gattaca.blogspot.com/



      • #4
        Originally posted by wenhuang
        Hi,

        I have a paired-end (2x75) Illumina data set in which the two reads of a pair may overlap. The selected fragment size was 240 bp; after subtracting the adapter/primer sequences, about 120 bp of insert is left, which gives about 30 bp of overlap between the read ends.

        Thanks for your help!
        Why not convert your paired-end data into single-end data?
        Since there is a 30 bp overlap, the two reads of each pair should assemble into a single read quite nicely.

        You would end up with 120 bp SE data.
        http://kevin-gattaca.blogspot.com/



        • #5
          My alignment did not seem to have much of a problem. Here is a sample of the first few alignments. It appears to me that the two reads were processed separately, but I am not sure about that.

          HWUSI-EAS787_0001:5:70:1610:809#AAATAG 99 chr1 5312 255 81M = 5366 0 GCGAGGAAAGAAATGCACTAAGTAAAAAACTTAGTCATTTTTTAAAGAGAATTAAAATGAAGTCCAATTCCTTTGAGTTAC HGHHIHHHGHHHGGGHHHHHHHHIHHHGHFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHEHHFHEHGHHG NM:i:0
          HWUSI-EAS787_0001:5:70:1610:809#AAATAG 147 chr1 5366 255 81M = 5312 0 AAATGAAGTCCAATTCCTTTGAGTTACAAATTTACAATCACTACTCAGTAATTAAAACTATTCAGTTATAGTGAACTGATT IHFHHIHBGHHHHHGHHFEHHHHHHHHHHHHHHHHHHHHEHHGHHHHHHHHHHHHGGHHHHHHHHHHIHHHHHHGHHHHHH NM:i:0


          HWUSI-EAS787_0001:5:30:1504:1763#TTGTCG 163 chr1 5822 255 81M = 5860 0 CCAGAGCCCACAGCTTACTTTTGGTGGTACCCATCCTAAGGGTCTGGGCAAACATATAACGATAAATGTCCATCATTATAA HHGHHGGFHHHHHHHHHEHHHHHHHHHHHEHHGHDEGHHHHHBBBGGG7FHH2HEHBHH0FHEFHC+?6><CC-CEDDBA@ NM:i:0
          HWUSI-EAS787_0001:5:30:1504:1763#TTGTCG 83 chr1 5860 255 81M = 5822 0 AGGGTCTGGGCAAACATATAACGATAAATGTCCATCATTATAATATCACACAGAGTAGTTTCACTGCCCTGAAACTCTTTT G@CBFHE?G=HHGIHHHHGHGHBHGHHHEGHDHHGHHFFHHHHHHHHHHGHHGHGFHCHHGHHHHFHHHHHHHHHHHHHHH NM:i:0
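One way to check whether the mates in records like these overlap is to compare their reference spans. The following is a minimal sketch, assuming simple ungapped CIGAR strings such as the `81M` shown above; for spliced or clipped alignments the span is only an upper bound on the shared bases. The helper names `ref_span` and `mate_overlap` are made up for illustration and are not part of TopHat or samtools.

```python
import re

def ref_span(pos, cigar):
    """Return the 1-based inclusive reference interval covered by an alignment.

    Only the M, D, N, = and X operations consume reference bases.
    """
    length = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
                 if op in "MDN=X")
    return pos, pos + length - 1

def mate_overlap(pos1, cigar1, pos2, cigar2):
    """Number of reference positions shared by the two mate spans (0 if none)."""
    s1, e1 = ref_span(pos1, cigar1)
    s2, e2 = ref_span(pos2, cigar2)
    return max(0, min(e1, e2) - max(s1, s2) + 1)

# POS and CIGAR values copied from the two pairs in the post.
print(mate_overlap(5312, "81M", 5366, "81M"))  # 27
print(mate_overlap(5822, "81M", 5860, "81M"))  # 43
```

So both example pairs do overlap on the reference, by 27 and 43 bases respectively.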



          Originally posted by Simon Anders
          I don't know how TopHat reacts to it but I can already tell you that Bowtie won't like it, and hence Tophat will fail, too.

          I'm currently working with a similar data set and noted that Bowtie fails to find an alignment for an overlapping paired read (and so does Eland). I ended up aligning the two ends separately and then stitching things together manually.

          Of course, this is not an ideal solution.

          Simon



          • #6
            I think this is a decent solution. Many of my reads suffer from bad quality at the end, though. Can you recommend a tool that might do this job? Thanks!

            Originally posted by KevinLam
            Why not convert your paired-end data into single-end data?
            Since there is a 30 bp overlap, the two reads of each pair should assemble into a single read quite nicely.

            You would end up with 120 bp SE data.



            • #7
              Originally posted by wenhuang
              I think this is a decent solution. Many of my reads suffer from bad quality at the end, though. Can you recommend a tool that might do this job? Thanks!
              The only tool I know of that can do this is phrap, but I am not sure how long it would take on this many reads.
              http://kevin-gattaca.blogspot.com/



              • #8
                Originally posted by wenhuang
                Hi,

                I have a paired-end (2x75) Illumina data set in which the two reads of a pair may overlap. The selected fragment size was 240 bp; after subtracting the adapter/primer sequences, about 120 bp of insert is left, which gives about 30 bp of overlap between the read ends.

                My questions are:

                1) Is this going to affect TopHat alignment? How should the -m option be specified?

                2) When counting coverage, my intuition is that the overlapping bases might be counted twice, even though they appear in the library only once. Is there any way to get around this?

                3) Is this going to affect Cufflinks transcript assembly and quantitation?

                Thanks for your help!
                As of TopHat 1.0.13, you should be able to specify a negative inner distance of -30. TopHat does map the reads independently, and has a different algorithm from Bowtie for handling the ends. The coverage.wig file displays depth of read coverage, not depth of physical coverage, so those bases will be double counted, as you suggest. However, Cufflinks operates at the fragment level, not the read level, and so should do the right thing here.
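To make the difference between read-level and fragment-level counting concrete, here is a rough sketch. It is not how TopHat writes coverage.wig or how Cufflinks counts fragments internally; it only illustrates the double counting. The coordinates are made up (1-based, inclusive) to mimic a 2x75 pair with a 30 bp overlap, and the function names are hypothetical.

```python
from collections import Counter

def read_depth(pairs):
    """Per-base depth when each mate is counted separately."""
    depth = Counter()
    for mates in pairs:
        for start, end in mates:
            for pos in range(start, end + 1):
                depth[pos] += 1
    return depth

def fragment_depth(pairs):
    """Per-base depth when each pair contributes at most 1 per position."""
    depth = Counter()
    for mates in pairs:
        covered = set()
        for start, end in mates:
            covered.update(range(start, end + 1))
        for pos in covered:
            depth[pos] += 1
    return depth

# One made-up pair: mate 1 at 101-175, mate 2 at 146-220 (30 bp overlap).
pairs = [((101, 175), (146, 220))]
rd = read_depth(pairs)
fd = fragment_depth(pairs)
print(rd[160], fd[160])                    # 2 1
print(sum(rd.values()), sum(fd.values()))  # 150 120
```

Position 160 lies in the overlap, so read-level counting reports a depth of 2 there while fragment-level counting reports 1.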



                • #9
                  Here are more details about Wen's run which was 2x75.

                  The minimum fragment size, including flanking adapters, is 150 bp. Thus, fragments with the smallest insert could be diagrammed like this, with 32 bases of overlapping cDNA:


                  [adapter:59][cDNA 32][adapter:59]
                  o~~~~~~~~~~~> (with 43bp of adapter)
                  <~~~~~~~~~~~~o


                  I am assuming, however, that reads this short would fail to map because of the high proportion of adapter-derived sequence embedded in the reads.


                  These considerations lead me to the following questions:


                  1) Does a negative inner distance of, for example, -30 reflect an expected mean of 30 bp of overlap, or does it specify a maximum amount of overlap?

                  After all, most of Wen's reads don't overlap, and the overlap could be as high as a full 75 bp for a 193 bp fragment. If I were to calculate the actual mean inner distance, treating overlaps as negative distances, the overall mean might well turn out to be positive.

                  2) If we were to trim the adapters this would invariably lead to a distribution of read lengths rather than a uniform 75 bases. Can Bowtie and TopHat deal with unequal read lengths or is this likely to be a problem?
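The arithmetic behind question 1 can be sketched as follows, taking inner distance as insert length minus twice the read length. The read length (75) and adapter length (59 bp on each side) come from this thread; the list of fragment sizes is invented purely to show that a library containing overlapping pairs can still have a positive mean, and `inner_distance` is a hypothetical helper name.

```python
READ_LEN = 75           # 2x75 run, as stated in the thread
ADAPTER_TOTAL = 2 * 59  # two 59 bp adapters, as in the diagram

def inner_distance(fragment_size, read_len=READ_LEN, adapter_total=ADAPTER_TOTAL):
    """Inner distance between mates for an adapter-inclusive fragment size.

    Negative values mean the mates overlap by that many bases.
    """
    insert = fragment_size - adapter_total
    return insert - 2 * read_len

# Made-up fragment sizes (adapters included), not measured from any library.
sizes = [200, 240, 300, 350, 400]
distances = [inner_distance(s) for s in sizes]
print(distances)                        # [-68, -28, 32, 82, 132]
print(sum(distances) / len(distances))  # 30.0
```

In this toy set two of the five fragments give overlapping mates, yet the mean inner distance is +30. A 193 bp fragment gives -75, that is, complete overlap of the two 75 bp reads. For inserts shorter than one read length the reads run into adapter, so this simple formula no longer describes what actually aligns.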



                  • #10
                    Here is how the diagram from my previous posting should look (with dots replacing whitespace). Sorry for the confusion.

                    [adapter:59][cDNA 32][adapter:59]
                    .............................o~~~~~~~~~~~> (with 43bp of adapter)
                    ...........<~~~~~~~~~~~~o



                    • #11
                      Originally posted by Simon Anders
                      I don't know how TopHat reacts to it but I can already tell you that Bowtie won't like it, and hence Tophat will fail, too.

                      I'm currently working with a similar data set and noted that Bowtie fails to find an alignment for an overlapping paired read (and so does Eland). I ended up aligning the two ends separately and then stitching things together manually.

                      Of course, this is not an ideal solution.

                      Simon
                      In my case, bowtie 0.12.3 (and also BWA) seems to work well for overlapping paired-end reads. I have 2x59 reads, and I found that the ISIZE of many records is less than 118 while the FLAG field indicates they are properly mapped.



                      • #12
                        Originally posted by Simon Anders
                        I don't know how TopHat reacts to it but I can already tell you that Bowtie won't like it, and hence Tophat will fail, too.

                        I'm currently working with a similar data set and noted that Bowtie fails to find an alignment for an overlapping paired read (and so does Eland). I ended up aligning the two ends separately and then stitching things together manually.

                        Of course, this is not an ideal solution.

                        Simon
                        TopHat and Bowtie use completely different procedures to handle paired ends, and their policies are not the same. TopHat maps the left and right reads independently, and recent versions should have no trouble with paired end libraries with negative inner distances and overlapping reads. With TopHat 1.0.13 and Cufflinks 0.8.0, I have processed an RNA-Seq library size selected to 100bp and sequenced with 2x76bp GAII. The mean inner distance in this case is negative, and the TopHat/Cufflinks stack produced fine results.

                        To answer a previous question - TopHat will not handle reads of different lengths gracefully, so if you make "virtual" long reads from overlapping mates, make sure to trim the products down to a uniform length.
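Trimming merged "virtual" reads down to a uniform length could look something like the sketch below. It assumes records are already in memory as (name, sequence, quality) tuples, trims from the 3' end, and drops anything shorter than the target; those choices and the name `trim_to_uniform` are illustrative assumptions, not something TopHat prescribes.

```python
def trim_to_uniform(records, length):
    """Yield (name, seq, qual) records cut to exactly `length` bases.

    Records shorter than `length` are dropped rather than padded.
    """
    for name, seq, qual in records:
        if len(seq) >= length:
            yield name, seq[:length], qual[:length]

# Made-up records for illustration.
records = [
    ("read_a", "ACGTACGTAC", "IIIIIIIIII"),
    ("read_b", "ACGTAC", "IIIIII"),
    ("read_c", "ACG", "III"),
]
for rec in trim_to_uniform(records, 6):
    print(rec)
```

Choosing the target length is a trade-off: a longer target keeps more sequence per read but discards more of the shorter merged products.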



                        • #13
                          Another possible solution

                          I had to edit this post. I wrote a program that assembles overlapping paired ends from Illumina. It used to be public, but now it's private because I want to write a paper on it.

                          If you want a copy, you can e-mail me and I'll send it to you.

                          I tested it on 1.5 million reads that overlap by ~25 bp, and it assembled about 78% into larger contigs, which can then be de novo assembled. In the overlapping region, it chooses the nucleotide with the best quality score (if there is a discrepancy). If there is a discrepancy and the quality scores are the same, it chooses the appropriate ambiguous nucleotide.
                          Last edited by ACTGangster; 07-24-2010, 05:26 PM. Reason: makebettered
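For readers who want to see the idea, here is a minimal sketch of the consensus rule described in the post (higher quality wins, IUPAC ambiguity code on a tie). It is not the poster's program. In particular, the overlap search (longest suffix/prefix match within a mismatch budget), the `min_overlap` and `max_mismatch_frac` parameters, and all function names are assumptions made for illustration. It also only handles inserts at least as long as one read.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(seq1, qual1, seq2, qual2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge mate 1 with mate 2 (both as sequenced) into a single read.

    Returns (sequence, quality), or None when no acceptable overlap exists.
    Quality characters are compared directly, so any ASCII offset works.
    """
    s2, q2 = revcomp(seq2), qual2[::-1]
    # Assumed search strategy: try the longest candidate overlap first.
    for ov in range(min(len(seq1), len(s2)), min_overlap - 1, -1):
        cut = len(seq1) - ov
        mismatches = sum(a != b for a, b in zip(seq1[cut:], s2[:ov]))
        if mismatches <= max_mismatch_frac * ov:
            break
    else:
        return None

    seq, qual = list(seq1[:cut]), list(qual1[:cut])
    for a, qa, b, qb in zip(seq1[cut:], qual1[cut:], s2[:ov], q2[:ov]):
        if a == b:
            seq.append(a)
        elif qa != qb:
            seq.append(a if qa > qb else b)   # higher quality wins
        else:
            seq.append(IUPAC.get(frozenset((a, b)), "N"))  # tie: ambiguity code
        qual.append(max(qa, qb))
    seq.extend(s2[ov:])
    qual.extend(q2[ov:])
    return "".join(seq), "".join(qual)

fragment = "ATGGCGTACCTTAGGACTCAGTTCGAACGT"  # made-up 30 bp insert
mate1 = fragment[:20]
mate2 = revcomp(fragment[10:])
merged, _ = merge_pair(mate1, "I" * 20, mate2, "I" * 20)
print(merged == fragment)  # True
```

Trying the longest overlap first is a simplification; on repetitive sequence it could pick a spurious overlap, so a real tool would likely score all candidate overlaps.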



                          • #14
                            I uploaded a python script I wrote for this to SVAR:
                            --
                            Jeremy Leipzig
                            Bioinformatics Programmer
                            --



                            • #15
                              stitch

                              I open-sourced my Stitch program as I do not plan on writing a paper on it specifically.



                              It runs on as many cores as you have. I did 20 million reads in 40 minutes on a 16-core Mac Pro.

