SEQanswers

Old 06-05-2014, 05:17 AM   #1
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default Velvet Assembly for PE RAD Sequences

Dear All,

I am currently working with a RAD sequencing data set, sequenced on a PE 150 bp HiSeq run, and I now want to assemble my data into wonderful, long contigs (all cleaned, de-multiplexed and ready to go!).

Velvet appears to be the standard for RAD sequencing assembly, so I want to start with that, but I am happy to hear about other people's experiences or recommendations for other software?

I also have a question regarding simple Illumina paired-end sequencing, because I am afraid I am a bit confused.


"Read 1 is generated using the READ 1 Sequencing primer. The read 1 product is removed and the Index Sequencing Primer is annealed to the same strand (...). If a paired-end read is required, the original strand is used to regenerate the complementary strand. Then, the original strand is removed and the complementary strand acts as a template for application read 2."

So basically, on the Illumina machine, a strand complementary to the Read 1 template is created but not itself read or sequenced. Then this complementary, unread strand acts as the template for Read 2 (as the sequencing reaction always runs 5' to 3'), so basically Read 1 and Read 2 are NOT reverse complements of one another? Is this correct? There seem to be some posts and blogs that state that the forward and reverse reads are reverse-complementary to each other in orientation!


Many thanks in advance!!

Sarah

Last edited by Corydoras; 06-05-2014 at 05:29 AM.
Old 06-05-2014, 06:19 AM   #2
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

For an explanation of the way the Illumina technology works, see the University of Texas at Austin's Wiki page

https://wikis.utexas.edu/display/GSA...+-+all+flavors


and the Bentley et al 2008 Nature paper, including the Supplementary Information
http://www.nature.com/nature/journal...ture07517.html

I didn't think you were supposed to get long contigs from RAD-Seq; I thought you were just supposed to get the regions adjacent to the restriction enzyme sites.
Old 06-05-2014, 06:39 AM   #3
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Hi Mastal,

Thanks for the quick reply, I will have a look to make sure my understanding is correct.

Well, 'long' is relative, I guess. Due to the random shearing step in RAD (and if my coverage is good enough), I am hoping to be able to assemble contigs of up to 1000 bases. To me that would class as relatively long for my first PE de novo sequencing run, but I should have clarified. In some trial assemblies I do seem to be getting contigs above 500 bases in length, so I think I am off to a good start.

Would you recommend velvet? Or is there something else you would recommend I try?

Thanks again!
Old 06-05-2014, 12:50 PM   #4
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

Corydoras, you might think about using Stacks, which handles the RAD data and also generates contigs from the paired-end reads if available. There is a tutorial here:
http://creskolab.uoregon.edu/stacks/pe_tut.php

The one thing Stacks doesn't do, I think, is use the first read in generating the contigs. In our original PE paper (I'm Eric J.) I collected the paired-end reads, then also passed in a subset of the first reads (reverse complemented, to answer your question above). If the shearing was done right, the paired-end reads of the shorter fragments would overlap the first reads, extending the contig. I also generated a contig using a long kmer, so that any short repeats (ATATATATATATATATATATATA, for example) wouldn't break the contig. Then I re-assembled with a shorter kmer, but passed in the contig as a long read. The short kmer allowed short overlaps to merge in. Some of that could be done post-Stacks if you are getting shorter contigs than desired, but Stacks will help with the initial managing of the reads and filtering.

mastal, RAD-PE was something that took advantage of the unique aspect of RAD (compared to other GBS methods). Since one end was sheared, the second read sampled the region around the cut site. We collected RAD reads that shared a first read sequence, then assembled the paired-end reads of that locus. This generated a contig >500 bp around each locus. We even had a mate-pair version that did 5 kb contigs! The other useful aspect of shearing is that you don't have to worry about consistent size selection--every marker is present at all sorts of sizes, whereas methods that use restriction enzymes to cut both ends of a fragment have fixed marker lengths, so any variation in library size means thousands of loci go missing.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 06-06-2014, 01:13 AM   #5
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Hi Eric!

Thanks for your reply and first-hand advice, it is really useful and much appreciated! My samples come from different species rather than different populations, but I think Stacks may be more suitable for population studies?

I am afraid I am a bit confused; could you clarify this for me? I thought I had just got my head around the fact that Illumina technology creates a reverse-complementary strand to Read 1 that is not sequenced but used as the template for the reverse read, so I would not have to do any reverse complementing between my forward and reverse reads?
Old 06-06-2014, 01:34 AM   #6
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

@Corydoras
Read1 and Read2 are read from the ends of opposite strands. You don't have to do anything to reverse complement them; aligners and assemblers that deal with Illumina data account for this.

The reads of a pair will only be reverse complements of each other when the
fragments being sequenced are shorter than the read length, so that the stretch of sequence seen in Read1 and Read2 overlaps.

@SNPsaurus
Thanks for explaining about RAD-PE.
Old 06-06-2014, 01:50 AM   #7
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Thanks mastal for clarifying!


However, Illumina itself says this in their leaflet on multiplexed sequencing processes: "Application Read 1 is generated using read 1 sequencing primer. The read 1 product is removed and the Index Sequencing Primer is annealed to the same strand to produce the 6bp index read. If a paired-end read is required, the original template strand is used to regenerate the complementary strand. Then, the original strand is removed and the complementary strand acts as a template for the application read 2."

In the Nature paper you sent me, it says:
The DNA in each cluster is linearized by cleavage within one adaptor sequence (gap marked by an asterisk) and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1; the sequencing product is dotted). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesized (shown dotted), and the opposite strand is then cleaved (gap marked by an asterisk) to provide the template for the second read (read 2).

Both of these indicate to me that there would never actually be a need to reverse-complement anything? Unless this is a newer or older thing that depends on the Illumina machine? I am probably being massively slow and totally misunderstanding this, but I have asked a few people now and none of them can make sense of this :/

Last edited by Corydoras; 06-06-2014 at 04:47 AM.
Old 06-06-2014, 09:28 AM   #8
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

Could you explain more why reading the second strand would not produce the reverse complement?

Code:
The single-stranded template:
P1 - CAT - P2

read 1 primer binds:
p1 
P1 - CAT - P2

read 1 sequenced, producing sequence GTA:

p1 - GTA
P1 - CAT - P2

Read 1 removed:

P1 - CAT - P2

Second strand generated:

p1 - GTA - p2
P1 - CAT - P2

First strand removed:

p1 - GTA - p2

Second Primer annealed

p1 - GTA - p2
           P2

And second read sequenced, producing TAC

p1 - GTA - p2
     CAT - P2  

first read is GTA
second is TAC
Now, as mastal says, most aligners and assemblers will automatically try both orientations. I reverse complement the first reads when doing RAD-PE because I set velvet to expect everything in the same orientation, since the second reads all are, and I assumed this helps reduce the kmers involved.
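The toy example above can be checked in a few lines of Python (just an illustrative sketch, not part of any pipeline):

```python
# Translation table for complementing DNA bases.
COMP = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMP)[::-1]

read1 = "GTA"  # sequence produced by read 1 in the diagram
read2 = "TAC"  # sequence produced by read 2

# Read 2 comes off the opposite strand, so over the region the two
# reads share, read 2 is the reverse complement of read 1.  (In this
# toy case the 3 bp insert is fully covered by both reads; with real
# insert sizes the reads only overlap when the fragment is short.)
assert read2 == revcomp(read1)
```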

Stacks could still help you... you could still load samples individually or in species groups and construct contigs from each. Do you want to then genotype across the species using the contigs? Stacks may limit the number of SNPs allowed, which would be a problem, but it could still get you to the contig step. You could then just align reads, and do samtools mpileup and parse that for the genotypes.
Old 06-09-2014, 01:41 AM   #9
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Thanks very much for this and sorry to have been such a pain! This cleared things up for me enormously.

Yes, Stacks may indeed still be an option to assemble contigs for me, but I assume I could also optimize velvet myself for contigs? Would there be any specific reason why I would be better off using Stacks, considering I may run into issues with the max number of SNPs allowed?
Old 06-09-2014, 04:32 AM   #10
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Oh and also, I am assuming it would be best to combine all samples in one species to get better coverage in the assembly? Or is it common practice to assemble contigs for one individual at a time?
Old 06-09-2014, 10:42 AM   #11
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

Sure, if you want to get into it with velvet, a little personal attention will always be better. And yes, I'd combine samples to get good coverage across the region. Too high coverage can cause issues, depending on the kmer size and error rate. Maybe pass the combined paired-end reads through a kmer normalization script first?
Old 06-10-2014, 01:59 AM   #12
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Yes, I have read up on the fact that too high a coverage can cause issues, and am keeping an eye out for that. I have not come across any 'kmer normalization scripts' before, though I understand that the kmer parameter is one of the most critical to optimize. I am currently running velvet with varying kmers and was going to optimize according to the kmer coverage distribution, and am also trying VelvetOptimiser. Is this what you mean?

Thanks for all your help! It is much appreciated.
Old 06-10-2014, 08:02 AM   #13
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

Funny thing with methods--you have to use the right keywords or you'll never find it! I was thinking of http://ivory.idyll.org/blog/diginorm-paper-posted.html "digital normalization". It keeps low-coverage reads and tosses the tops of the peaks of high-coverage reads, smoothing out the depth.
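For intuition, the core idea of digital normalization can be sketched in a few lines of Python (a toy version; the real implementation in the khmer package streams FASTQ and uses a counting Bloom filter, none of which is shown here):

```python
from collections import Counter
from statistics import median

def digital_normalize(reads, k=4, cutoff=20):
    """Toy digital normalization: keep a read only while the median
    count of its k-mers, among reads kept so far, is below `cutoff`.
    High-coverage loci get thinned; low-coverage reads are kept."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if kmers and median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept

# 50 identical high-coverage reads get capped near the cutoff,
# while a unique low-coverage read sails through.
reads = ["ACGTACGT"] * 50 + ["TTTTTTTT"]
print(len(digital_normalize(reads)))  # 21: 20 copies kept + the unique read
```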

Are you assembling one locus at a time or just everything at once? You will get longer contigs if it is done one locus at a time, but you can assemble all the data at once and still get some benefit from the reduced complexity. Either way, I'd suggest picking a kmer, calculating the optimal read depth for that kmer, sending the second reads through digital normalization to winnow them down to that optimal level, sending those reads to velvet, and repeating for different kmers. I'd also try the long-kmer pre-assembly, passing the resulting contig along with the reads again, as I described before.
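Comparing runs across kmers means recomputing N50 each time; if it helps, a minimal N50 calculation looks like this (a generic sketch, not tied to velvet's own stats output):

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L together
    contain at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

print(n50([8, 8, 4, 3, 3, 2, 2, 2]))  # 8: the two 8s cover 16 of 32 bases
```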

I haven't tried "diginorm" in this situation, so am just blithely urging you to do it based on no experience! But I do know that I capped my reads at a certain level because going higher produced shorter contigs, and diginorm seems like a more rigorous approach to doing that.
Old 06-10-2014, 08:34 AM   #14
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Thanks so much for your pointers; needless to say, I would not have stumbled across any of them myself! (This is quite obviously my first NGS experience.)

I am working with everything at once, and naively didn't even realize doing one locus at a time was an option!
I assume there is no standard 'optimal read depth' that fits all data sets, but it is something I would have to get a feel for by varying parameters and checking outputs?

Just to make sure: by first reads you mean forward reads, and second reads are reverse reads? You only normalize those second/reverse reads? I had originally been worried about these second/reverse reads because their coverage would be a lot less uniform and lower than that of the first/forward reads (due to the nature of RAD), and I was not sure how velvet would handle that. Wouldn't that mean I would also have to adjust the forward read coverage to that same level, to make the coverage as even as possible? If I understand correctly, that would aid the assembly enormously. This normalization, if I read it right, could also help remove errors and prevent misassemblies?

Apologies again if these all seem like bizarre and stupid questions, this is the first time I am doing this.
Old 06-10-2014, 09:05 AM   #15
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

In the original paper, the paired-end reads were first organized by the RAD read (the first/forward read off the cut site). Then the sheared-side reads (second/reverse reads) were sent to the assembler one locus at a time (or, more precisely, one group of reads sharing a RAD read at a time). The advantage of this "local assembly" was that you could assemble a transposable element that occurs all over the genome, because those reads were unique in the context of that single locus. If you assemble all the reads at once, then any read with a kmer found elsewhere will break the assembly of that contig. However, it is vastly easier to do it all at once, so certainly try it and see if you are happy with the results.
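In case it clarifies the bookkeeping, the grouping step described above could be sketched like this (hypothetical data layout; real reads would come from paired FASTQ files):

```python
from collections import defaultdict

def group_by_rad_read(read_pairs):
    """Group sheared-side (read 2) sequences by the RAD-side (read 1)
    sequence they are paired with, so each locus can be assembled
    as its own small, local job."""
    loci = defaultdict(list)
    for rad_read, sheared_read in read_pairs:
        loci[rad_read].append(sheared_read)
    return loci

# Made-up example: two pairs share a RAD read, one locus stands alone.
pairs = [("AAGCTT", "TTGACC"), ("AAGCTT", "GACCTA"), ("GGATCC", "CCATGG")]
loci = group_by_rad_read(pairs)
print(len(loci["AAGCTT"]))  # 2 sheared-side reads at the first locus
```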

If you are doing it all at once, I would pass the forward and reverse reads through the diginorm step (this is even more important since the forward RAD reads will be many-fold higher coverage than any particular reverse read, as you point out).

No need for apologies. Enjoy your first foray into all this!
Old 06-10-2014, 09:09 AM   #16
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Thank you!!
Old 06-30-2014, 01:59 AM   #17
Corydoras
Member
 
Location: Norwich

Join Date: Jan 2014
Posts: 20
Default

Dear RAD/Assembly community,

I was hoping I could pick your brains again regarding my assembly.
I normalized coverage and together with using velvetoptimiser I am quite happy with the results I am getting! For instance, I assembled the reverse reads only, which gave me ~33,000 contigs (this is roughly what I am expecting based on preliminary data) with an N50 of 400. Kmer coverage distribution looks good, too.
However, I now wanted to feed these contigs in as 'long reads' alongside my raw data in velvet, to resolve repeats better and connect the forward and reverse reads properly. Weirdly, my N50 goes down to about 380? I have visualized the assembly in Tablet, and also mapped some of my raw data back (using BWA) to check whether forward and reverse reads were actually assembled correctly; for the few contigs I surveyed by eye this seems to be the case.
I am not sure whether I need to be worried about the decreasing N50. It seems a bit counter-intuitive, as my contigs surely should increase in size quite a bit when assembling forward and reverse reads together, as opposed to just the reverse reads?
I was also wondering whether there is a good way of quality-filtering the contigs from my final assembly. For instance, I only want to keep contigs for future analysis to which both forward and reverse reads map.
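One simple post-processing sketch of that filter (assuming the forward and reverse reads have already been mapped back with BWA and each alignment reduced to a (contig, mate) pair; the data here are made up):

```python
def contigs_with_both_mates(alignments):
    """alignments: iterable of (contig, mate) pairs, mate being 1 for a
    mapped forward read and 2 for a mapped reverse read.  Returns the
    contigs hit by at least one read of each mate."""
    seen = {}
    for contig, mate in alignments:
        seen.setdefault(contig, set()).add(mate)
    return {c for c, mates in seen.items() if mates >= {1, 2}}

# contig_1 has both mates mapped; contig_2 has only forward reads.
hits = [("contig_1", 1), ("contig_1", 2), ("contig_2", 1)]
print(contigs_with_both_mates(hits))  # {'contig_1'}
```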

Many thanks in advance for any pointers!!