Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Velvet Assembly for PE RAD Sequences

    Dear All,

    I am currently working with a RAD sequencing data set, sequenced on a PE-150bp Hiseq and I am now wanting to assemble my data into wonderful and long contigs (all cleaned and de-multiplexed and ready to go!).

    The standard that appears to be used with RAD sequencing assembly is velvet, so I want to start with that but I am happy to hear about other people's experiences or recommendations for other software?

    I also have a question regarding simple Illumina paired-end sequencing, because I am afraid I am a bit confused.


    "Read 1 is generated using the READ 1 Sequencing primer. The read 1 product is removed and the Index Sequencing Primer is annealed to the same strand (...). If a paired-end read is required, the original strand is used to regenerate the complementary strand. Then, the original strand is removed and the complementary stand acts as a template for application read 2."

    So basically, on the Illumina machine, a complementary strand to Read 1 is created but not in that sense read or sequenced. And then this complementary and non-read strand acts as a template for read 2 (as the sequencing reaction always occurs 5' to 3'), so basically Read 1 and Read 2 are NOT reverse complementary to one another? Is this correct? There seem to be some posts and blogs that state that the forward and reverse reads are reverse-complementary to each other in orientation!


    Many thanks in advance!!

    Sarah
    Last edited by Corydoras; 06-05-2014, 05:29 AM.

  • #2
    For an explanation of the way the Illumina technology works, see the University of Texas at Austin's Wiki page




    and the Bentley et al 2008 Nature paper, including the Supplementary Information


    I didn't think you were supposed to get long contigs from RAD-Seq, I thought you were just supposed to get the regions adjacent to the restriction enzyme sites.

    Comment


    • #3
      Hi Mastal,

      Thanks for the quick reply, I will have a look to make sure my understanding is correct .

      Well, 'long' is relative I guess . Due to the random shearing step in RAD (and if my coverage is good enough), I am hoping I should be able to assemble contigs of hopefully up to 1000 bases. This would to me class as relatively long for my first PE de novo sequencing run, but I should have clarified. In some trial assemblies I do seem to be getting contigs above 500 bases in length so I think I am off to a good start .

      Would you recommend velvet? Or is there something else you would recommend I try?

      Thanks again!

      Comment


      • #4
        Corydoras, you might think about using Stacks, which handles the RAD data and also generates contigs from the paired-end reads if available. There is a tutorial here:


        The one think Stacks doesn't do, I think, is use the first read in generating the contigs. In our original PE paper (I'm Eric J.) I collected the paired-end reads, then also passed in a subset of the first reads (reverse complemented, to answer your question above). If the shearing was done right, the paired end reads of the shorter fragments would overlap the first reads, extending the contig. I also generated a contig using a long kmer, so that any short repeats (ATATATATATATATATATATATA, for example) wouldn't break the contig. Then I re-assembled with a shorter kmer, but passed in the contig as a long read. The short kmer allowed short overlaps to merge in. Some of that could be done post-Stacks if you are getting shorter contigs than desired, but Stacks will help with the initial managing of the reads and filtering.

        mastal, RAD-PE was something that took advantage of the unique aspect of RAD (compared to other GBS methods). Since one end was sheared, the second read sampled the region around the cut site. We collected RAD reads that shared a first read sequence, then assembled the paired-end reads of that locus. This generated a contig >500 bp around each locus. We even had a mate-pair version that did 5 kb contigs! The other useful aspect of shearing is that you don't have to worry about consistent size selection--every marker is present at all sorts of sizes, whereas methods that use restriction enzymes to cut both ends of a fragment have fixed marker lengths, so any variation in library size means thousands of loci go missing.
        Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

        Comment


        • #5
          Hi Eric!

          Thanks for your reply and first hand advice, it is really useful and much appreciated! My samples com from different species rather than different populations but I think Stacks may be more suitable for population studies?

          I am afraid I am a bit confused, could you clarify this for me? I thought I just got my head around the fact that Illumina technology creates a reverse complementary strand to Read 1 that is not sequenced but used as the template for the reverse read, so I would not have to do any reverse complementing between my forward and reverse?

          Comment


          • #6
            @Corydoras
            Read1 and Read2 are read from the ends of opposite strands. You don't have to do anything to reverse complement them, aligners and assemblers that deal with Illumina data account for this.

            The reads of a pair will only be reverse complements of each other when the
            fragments being sequenced are shorter than the read length, so that the stretch of sequence seen in Read1 and Read2 overlaps.

            @SNPsaurus
            Thanks for explaining about RAD-PE.

            Comment


            • #7
              Thanks mastal for clarifying!


              However, Illumina itself says this in their leaflet on multiplexed sequencing processes: "Application Read 1 is generated using read 1 sequencing primer. The read 1 product is removed and the Index Sequencing Primer is annealed to the same strand to produce the 6bp index read. If a paired-end read is required, the original template strand is used to regenerate the complementary strand. Then, the original strand is removed and the complementary strand acts as a template for the application read 2."

              In the nature paper you sent me it says:
              The DNA in each cluster is linearized by cleavage within one adaptor sequence (gap marked by an asterisk) and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1; the sequencing product is dotted). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesized (shown dotted), and the opposite strand is then cleaved (gap marked by an asterisk) to provide the template for the second read (read 2).

              Both of these to me indicate that actually there would never be a need to reverse-complement anything? Unless this is a new/or old thing and depend on the Illumina machine? I am probably being massively slow and am totally misunderstanding this, but I have asked a few people now and none of them can make sense of this :/
              Last edited by Corydoras; 06-06-2014, 04:47 AM.

              Comment


              • #8
                Could you explain more why reading the second strand would not produce the reverse complement?

                Code:
                The single-stranded template:
                P1 - CAT - P2
                
                read 1 primer binds:
                p1 
                P1 - CAT - P2
                
                read 1 sequenced, producing sequence GTA:
                
                p1 - GTA
                P1 - CAT - P2
                
                Read 1 removed:
                
                P1 - CAT - P2
                
                Second strand generated:
                
                p1 - GTA - p2
                P1 - CAT - P2
                
                First strand removed:
                
                p1 - GTA - p2
                
                Second Primer annealed
                
                p1 - GTA - p2
                           P2
                
                And second read sequenced, producing TAC
                
                p1 - GTA - p2
                     CAT - P2  
                
                first read is GTA
                second is TAC
                Now, as mastal says, most aligners will automatically try both orientations during assembly. I reverse complement the first reads when doing RAD-PE because I set velvet to expect everything in the same orientation, since the second reads all are, and I assumed this help reduce the kmers involved.

                Stacks could still help you... you could still load samples individually or in species groups and construct contigs from each. Do you want to then genotype across the species using the contigs? Stacks may limit the number of SNPs allowed, which would be a problem, but it could still get you to the contig step. You could then just align reads, and do samtools mpileup and parse that for the genotypes.
                Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                Comment


                • #9
                  Thanks very much for this and sorry to have been such a pain! This cleared things up for me enormously.

                  Yes, Stacks may indeed still be an option to assemble contigs for me, but I assume I could also optimize velvet myself for contigs? Would there be any specific reason why I would be better off using Stacks, considering I may run into issues with the max number of SNPs allowed?

                  Comment


                  • #10
                    Oh and also, I am assuming it would be best to combine all samples in one species to get better coverage in the assembly? Or is it common practice to assemble contigs for one individual at a time?

                    Comment


                    • #11
                      Sure, if you want to get into it with velvet, a little personal attention will always be better. And yes, I'd combine samples to get good coverage across the region. Too high coverage can cause issues, depending on the kmer size and error rate. Maybe pass the combined paired-end reads through a kmer normalization script first?
                      Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                      Comment


                      • #12
                        Yes, I have read up on the fact that too high a coverage can cause issues and am keeping an eye out for that. I have not come across any 'kmer normalization scripts before', though I understand that the kmer paramter is one of the most critical to optimize. Currently running velvet with varying kmers and was going to optimize according to the kmer coverage distribution and am also trying the VelvetOptimiser. Is this what you mean?

                        Thanks for all your help! It is much appreciated.

                        Comment


                        • #13
                          Funny thing with methods--you have to use the right keywords or you'll never find it! I was thinking of http://ivory.idyll.org/blog/diginorm-paper-posted.html "digital normalization". It keeps low-coverage reads and tosses the tops of the peaks of high-coverage reads, smoothing out the depth.

                          Are you assembling one locus at a time or just everything at once? You will get longer contigs if done one locus at a time, but you can assemble all the data at once and still get some benefits from the reduced complexity. I'd suggest doing (either one locus at a time or all together) picking a kmer, calculating the optimal read depth for that kmer, sending the second reads through the digital normalization to winnow down to that optimal level, send those reads to velvet, and repeat for different kmers. I'd also try the long kmer pre-assembly, and passing the contig along with the reads again that I talked about before.

                          I haven't tried "diginorm" in this situation, so am just blithely urging you to do it based on no experience! But I do know that I capped my reads at a certain level because going higher produced shorter contigs, and diginorm seems like a more rigorous approach to doing that.
                          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                          Comment


                          • #14
                            Thanks so much for your pointers, needless to say I would have not stumbled across any of them myself! (This is quite obviously my first NGS experience).

                            I am working with all at once, and naively didn't even realize doing one locus at a time was an option?
                            I assume there is no standard 'optimal read depth that fits all data sets' but is something I would have to get a feel for varying parameters and checking outputs?

                            Just to make sure, with first reads you mean forward reads, second reads are reverse reads? You only normalize those second/reverse reads? I had originally been worried about these second/reverse reads because their coverage would be a lot less uniform and lower than the first/forward reads (due to the nature of RAD) and I was not sure how velvet can handle that. Wouldn't that mean I would also have to adjust the forward read coverage to that same level, to make the coverage as even as possible? If I understand correctly, that will aid the assembly enormously. This normalization, if I read it right, could also aid in removing errors and preventing misassemblies?

                            Apologies again if these all seem like bizarre and stupid questions, this is the first time I am doing this.

                            Comment


                            • #15
                              In the original paper, the paired-end reads were first organized by the RAD read (the first/forward read off the cut site). Then the sheared-side reads (second/reverse reads) were sent the assembler, one locus at a time (or more precisely, one group of reads sharing a RAD read). The advantage of this "local assembly" was that you could assemble a transposable element that was all over the genome because those reads were unique in the context of that single locus. If you assemble all the reads at once then any read with a kmer found elsewhere will break the assembly of that contig. However, it is vastly easier to do it all at once so certainly try it and see if you are happy with the results.

                              If you are doing it all at once, I would pass the forward and reverse reads through the diginorm step (this is even more important since the forward RAD reads will be many-fold higher coverage than any particular reverse read, as you point out).

                              No need for apologies. Enjoy your first foray into all this!
                              Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X