  • Merger/overlapper for fully contained fragment

    I am trying to find a tool that merges/overlaps PE reads when the fragment is fully contained within the reads, without having to know the adapters ahead of time. PANDA, FLASH, and similar programs will merge PE reads into a single read; however, they are geared towards cases where the fragment is longer than each read, so that the two reads only partly overlap. E.g.

    Code:
    R1:     ---------->
    R2:          <----------
    Frag:   ----------------
    However, I am thinking of this situation:
    Code:
    R1:          ------------->
    R2:       <--------------
    Frag:        ------------
    Both PANDA and FLASH can remove adapters before making the merge; however, if the adapter remnant is short (say 4 bases), I am not confident that the programs will be able to do so. Perhaps a better program would be one that matches the first bases of R1 to the region close to the end of R2, and vice versa, and then outputs only the merged read where both R1 and R2 match. In other words, a merge where the adapter does not need to be known a priori.
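The matching idea described above can be sketched in Python. This is only an illustration under stated assumptions, not how PANDA, FLASH, or SeqPrep work: the function names `revcomp` and `merge_contained`, the `min_frag` and `max_mismatch_rate` parameters, and the ungapped comparison are all hypothetical choices. It assumes that R1 starts at the first base of the fragment and that R2 starts at the last base (on the opposite strand), so that the first F bases of R1 equal the last F bases of the reverse complement of R2, where F is the fragment length.

```python
# Complement table for DNA; N stays N.
_COMP = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]


def merge_contained(r1, r2, min_frag=20, max_mismatch_rate=0.05):
    """Return the fragment if it is fully contained in both reads, else None.

    No adapter sequence is needed: the start of R1 is compared with the
    end of reverse-complemented R2 for every candidate fragment length,
    longest first. Comparison is ungapped (no indels are tolerated).
    """
    r2_rc = revcomp(r2)
    for frag_len in range(min(len(r1), len(r2)), min_frag - 1, -1):
        head = r1[:frag_len]                       # first frag_len bases of R1
        tail = r2_rc[len(r2_rc) - frag_len:]       # last frag_len bases of revcomp(R2)
        mismatches = sum(a != b for a, b in zip(head, tail))
        if mismatches <= max_mismatch_rate * frag_len:
            return head                            # bases past frag_len are adapter
    return None
```

Anything in R1 beyond the returned length is treated as adapter and dropped, however short it is. Low-complexity or repetitive fragments could match at the wrong length, so a real tool would also need quality-aware scoring.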

    Hope that this makes sense. Any suggestions? Thanks.

  • #2
    As Phillip SanMiguel said to me in private email and which may clarify my post:


    So the reads may all have a few (1-5) bases of adapter at their 3' ends. A better way to trim them would be to compare R1 and R2 -- the first base of each should point out the last base of the other. If PANDA had a setting to remove single-stranded sequence from pair merges, that would be good.



    • #3
      A pair-wise aligner (one that can export a consensus, followed by an appropriate trim) should work, right?
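As a rough illustration of the "export a consensus" step, here is a minimal Python sketch. It assumes the alignment and trim have already been done, so both inputs cover exactly the same bases: `seq_b` stands for the reverse complement of R2 with its quality list reversed to match. The function name `consensus` and the scoring rule (keep the higher-quality base, and on disagreement report the difference of the two Phred scores) are assumptions made for the example, not the behaviour of any particular aligner or merger.

```python
def consensus(seq_a, qual_a, seq_b, qual_b):
    """Merge two already-aligned, equal-length reads base by base.

    seq_a/seq_b are strings; qual_a/qual_b are lists of integer Phred scores.
    Returns (consensus_sequence, consensus_qualities).
    """
    if not (len(seq_a) == len(qual_a) == len(seq_b) == len(qual_b)):
        raise ValueError("sequences and qualities must have equal length")
    bases, quals = [], []
    for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b):
        if a == b:
            bases.append(a)
            quals.append(max(qa, qb))   # agreement: keep the higher score
        elif qa >= qb:
            bases.append(a)
            quals.append(qa - qb)       # disagreement: winner's score is reduced
        else:
            bases.append(b)
            quals.append(qb - qa)
    return "".join(bases), quals
```

For example, `consensus("ACGT", [30, 30, 10, 30], "ACTT", [30, 20, 35, 30])` picks `T` at the third position because R2 has the higher quality there.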



      • #4
        Have you tried SeqPrep?

        I know I've tried it on Nextera data and by giving it the Nextera adapter sequence it was able to spit out reads with 100% overlap but whose length was < 250bp, which would fit what you're talking about. What I can't say is how it would handle the "adapter" sequences that might hang off the ends if you don't provide it with any sequence to look for.



        • #5
          @GenoMax: Your idea should work, but doing it for an entire MiSeq run sounds like a long processing time. I was hoping for a quicker, one-stop solution.

          @McNelson.phd: No, I haven't tried SeqPrep but from my reading of it -- and your description -- it sounds like it would act the same as Panda and Flash: not good for when there is no prior knowledge of the adapter. I will install it though and give it a spin.

          Real data coming off the sequencer later today!



          • #6
            I use SeqPrep for exactly that purpose, although I do an extra careful adapter stripping before and after merging to clean up the errors. It did an ok job without the extra step, but I wanted the reads as error-free as possible. I can reliably find alleles at the 0.03% range by doing that.

            I look at the length of the merged reads and trim back if they are a size range where partial adapters would have been present. But your approach would work too, I think.
            Last edited by SNPsaurus; 08-27-2013, 09:10 AM.
            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



            • #7
              Rick,

              It sounds like you do not want to trim (adapters) before the merge, is that a requirement?
              Last edited by GenoMax; 08-27-2013, 09:13 AM.



              • #8
                This group published along these lines:
                Background: High throughput sequencing is beginning to make a transformative impact in the area of viral evolution. Deep sequencing has the potential to reveal the mutant spectrum within a viral sample at high resolution, thus enabling the close examination of viral mutational dynamics both within- and between-hosts. The challenge however, is to accurately model the errors in the sequencing data and differentiate real viral mutations, particularly those that exist at low frequencies, from sequencing errors.
                Results: We demonstrate that overlapping read pairs (ORP) -- generated by combining short fragment sequencing libraries and longer sequencing reads -- significantly reduce sequencing error rates and improve rare variant detection accuracy. Using this sequencing protocol and an error model optimized for variant detection, we are able to capture a large number of genetic mutations present within a viral population at ultra-low frequency levels (<0.05%).
                Conclusions: Our rare variant detection strategies have important implications beyond viral evolution and can be applied to any basic and clinical research area that requires the identification of rare mutations.


                They align the raw reads and analyze that rather than merging. There is a second paper that came out more recently as well, but I can't dredge it up. My lab should have our version out soon, too. Gary Schroth at Illumina said he was pushing long ago to have this the standard output of the Illumina machines as a way to get separation on error rate with other platforms, so it is funny that years later there is a sudden wave of labs all independently coming up with the idea.



                • #9
                  Longer read lengths have finally made the idea practical.



                  • #10
                    It's probably too late right now if your run is already going, but the new version of Reporter incorporates a read "Stitching" feature that might do exactly what you want. You'll have to manually add the flag to your sample sheet and reprocess your data if you want to try it. Check the full guide on Reporter for the actual flag and the options associated with it.



                    • #11
                      Originally posted by GenoMax
                      Rick,

                      It sounds like you do not want to trim (adapters) before the merge, is that a requirement?
                      Not a requirement per se. It is what I will probably end up doing, especially since we know the adapters. However, Phillip and I were wondering if there is an adapter-knowledge-free method.

                      Indeed, the longer lengths are making for interesting possibilities.



                      • #12
                        Originally posted by mcnelson.phd
                        ... but the new version of Reporter incorporates a read "Stitching" feature that might do exactly what you want.
                        Ah yes, that is an interesting option. It is hard to say from scanning the docs whether it would be better than Panda/Flash/SeqPrep, but since Reporter can be run off-machine I might give it a try. Thanks for the tip.



                        • #13
                          As a followup, it turns out that the samples in question did not (for the most part) look like the 2nd example I gave -- i.e., with the desired fragment fully contained in R1 and R2, with R1 starting inside R2 and vice versa. Instead, most of the reads looked like the 1st example, so we could use the normal Panda/Flash methodology on them.

                          It might still be interesting to develop an 'adapter-knowledge-free' stitching/merging program. But that is a task for another day.



                          • #14
                            Originally posted by westerman
                            As a followup, it turns out that the samples in question did not (for the most part) look like the 2nd example I gave -- i.e., with the desired fragment fully contained in R1 and R2 with R1 starting inside R2 and vice-versa. Instead most of the reads looked like the 1st example thus we could use normal Panda/Flash methodology on them.

                            It might still be interesting to develop an 'adapter-knowledge-free' stitching/merging program. But that is a task for another day.
                            I'm curious about the 'adapter-knowledge-free' constraint on your problem. If the premise of instance #2 in your original post is that these are sequencing reads in which (read length) > (fragment length) (i.e. they contain adapter sequence at the 3' end), how would you not know what the adapter sequence is? The adapters/sequencing primers for all major kits are pretty much known, are they not?

                            If you have a priori knowledge of the adapter sequences, then Trimmomatic, using its Palindrome trimming mode, handles cases like #2, but not in exactly the way you asked about. It makes no attempt to "merge" the two reads. It simply clips the adapter from read 1 and discards read 2 entirely, as it contains no additional data beyond that which is contained in read 1.
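For readers who want to try the palindrome mode mentioned above, the following Python sketch only builds the Trimmomatic command line; it does not run it. The helper name `trimmomatic_pe_command`, the jar path, the file names, and the default thresholds are placeholders chosen for illustration. The `ILLUMINACLIP` fields follow the order given in the Trimmomatic manual: adapter FASTA, seed mismatches, palindrome clip threshold, simple clip threshold, minimum adapter length, keepBothReads. The minimum adapter length is set to 1 here on the assumption that very short adapter remnants (1-5 bases) should be clipped; the manual's default is 8.

```python
def trimmomatic_pe_command(r1, r2, adapters_fasta, out_prefix,
                           jar="trimmomatic.jar",      # placeholder path
                           seed_mismatches=2,
                           palindrome_threshold=30,
                           simple_threshold=10,
                           min_adapter_length=1,       # assumption: catch 1-5 base remnants
                           keep_both_reads=False):
    """Build (but do not run) a paired-end Trimmomatic call using ILLUMINACLIP."""
    clip = "ILLUMINACLIP:{}:{}:{}:{}:{}:{}".format(
        adapters_fasta, seed_mismatches, palindrome_threshold,
        simple_threshold, min_adapter_length, str(keep_both_reads).lower())
    return ["java", "-jar", jar, "PE", r1, r2,
            out_prefix + "_1P.fq", out_prefix + "_1U.fq",   # paired / unpaired, read 1
            out_prefix + "_2P.fq", out_prefix + "_2U.fq",   # paired / unpaired, read 2
            clip]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. With `keep_both_reads=False`, the reverse read of a fully overlapping pair is dropped, which matches the behaviour described in this post.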



                            • #15
                              @kmcarr: I will concede that the constraint is mostly, if not entirely, theoretical since the adapter sequence should be known -- certainly it will be by us service providers, and this information should be passed on to our customers. An 'adapter-knowledge-free' program would only be useful in extremely rare cases or as part of a thought experiment.

                              I had not considered Trimmomatic's Palindrome mode since I never use that part of Trimmomatic. Thanks for the tip.
