SEQanswers

Old 03-28-2012, 06:14 AM   #1
lkral
Member
 
Location: Carrollton, GA

Join Date: May 2011
Posts: 27
de novo assembly of PE reads

I am about to have DNA sequenced on a HiSeq and I expect about a 30 fold coverage of a 1x10^9 bp genome with 100 bp PE reads. I am unsure of the size of the fragments I should use to get the best likely assembly from these PE reads. I am aware that the best results would be obtained by having PE reads from several libraries of varying sizes but I can only afford to sequence one library at this time. Currently I would hope to obtain contigs that at least average 2,000 to 10,000 bp so single genes would likely be within a contig. The most likely problem in assembling a contig that spans a gene would be STRs in introns. I was thinking that a 1000 bp library should span across most such STRs. Any suggestions would be appreciated.
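
For reference, the sequencing volume implied by these numbers can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch: the function name `reads_for_coverage` is made up for illustration, and it assumes every sequenced base is usable, which a real run will not quite achieve.

```python
def reads_for_coverage(genome_size_bp, coverage, read_length_bp):
    """Return (total_bases, reads, pairs) needed for a target fold coverage.

    Hypothetical helper; assumes all sequenced bases are usable.
    """
    total_bases = genome_size_bp * coverage
    # Integer ceiling division, so partial reads round up.
    reads = -(-total_bases // read_length_bp)
    pairs = -(-reads // 2)
    return total_bases, reads, pairs


# Numbers from the post: 1x10^9 bp genome, 30-fold coverage, 100 bp reads.
total_bases, reads, pairs = reads_for_coverage(1_000_000_000, 30, 100)
print(f"{total_bases:,} bases, {reads:,} reads, {pairs:,} read pairs")
```

For the figures above this comes to 30 Gbp of sequence, i.e. 300 million reads or 150 million read pairs.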
Old 03-29-2012, 03:02 PM   #2
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

From the SGA paper: SGA can assemble 35X human reads into 10kbp contigs with reads from a single library with an average ~400bp insert size. Don't go for >500bp insert size. If I am right, the throughput and the quality of Illumina sequencing degrade significantly beyond that.
Old 03-29-2012, 03:06 PM   #3
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356

I think the drop off in yield comes around 800bp fragment size (which depending on how you calculate insert size would agree with Heng's suggestion of 500bp + paired-end 150bp reads).
Old 03-29-2012, 03:14 PM   #4
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

By insert size, I always mean external size (i.e. the largest possible calculation). It is interesting that Illumina can now do 800bp (without a price?). I did not know this.
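
The two conventions being discussed differ by the combined length of both reads. A minimal sketch, assuming the 500bp figure in the previous post is read as the inner distance between the two mates (that reading is an assumption, and the function names are hypothetical):

```python
def outer_size(inner_distance_bp, read_length_bp):
    """External (outer) size: the inner gap plus both reads."""
    return inner_distance_bp + 2 * read_length_bp


def inner_distance(outer_size_bp, read_length_bp):
    """Unsequenced gap between mates; a negative value means the mates overlap."""
    return outer_size_bp - 2 * read_length_bp


# 500 bp inner distance with 2x150 bp reads, as in the previous post.
print(outer_size(500, 150))
# A ~400 bp external size with 2x100 bp reads.
print(inner_distance(400, 100))
```

Under this reading, 500bp inner distance with 2x150bp reads gives an 800bp external size, and a 400bp external size with 2x100bp reads leaves 200bp unsequenced between the mates. Adapter length is not included in either number.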
Old 03-29-2012, 03:21 PM   #5
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356

We routinely do 500-600 base fragments and it works well. I think I read on another thread that 800 bases is where performance falls off a cliff, not tested that high ourselves.
Old 03-30-2012, 05:27 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177

Quote:
Originally Posted by nickloman
We routinely do 500-600 base fragments and it works well. I think I read on another thread that 800 bases is where performance falls off a cliff, not tested that high ourselves.
I have one example from ~4 months ago running 3 libraries prepared from the same DNA sample with varying insert sizes. The estimated insert sizes (from BioAnalyzer) were 405, 540 and 795 bp. (Those are the insert sizes, not including adapters.) These libraries were run on a HiSeq 2000, one library per lane, 2x100bp PE. I observed no significant difference in overall read quality among these three libraries, but the quality for all lanes on this flow cell was somewhat lower than typical, particularly at the end of read 2.
Old 03-30-2012, 05:42 AM   #7
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

Thanks for the info. I need to update my outdated knowledge.
Old 03-30-2012, 06:11 AM   #8
lkral
Member
 
Location: Carrollton, GA

Join Date: May 2011
Posts: 27

Thank you all. I'll play it safe and go with a 500 to 600 bp library.
Old 03-30-2012, 11:38 PM   #9
mjp
Member
 
Location: USA

Join Date: Mar 2011
Posts: 25

Quote:
Originally Posted by lkral
I am about to have DNA sequenced on a HiSeq and I expect about a 30 fold coverage of a 1x10^9 bp genome with 100 bp PE reads. I am unsure of the size of the fragments I should use to get the best likely assembly from these PE reads. I am aware that the best results would be obtained by having PE reads from several libraries of varying sizes but I can only afford to sequence one library at this time. Currently I would hope to obtain contigs that at least average 2,000 to 10,000 bp so single genes would likely be within a contig. The most likely problem in assembling a contig that spans a gene would be STRs in introns. I was thinking that a 1000 bp library should span across most such STRs. Any suggestions would be appreciated.
Depending on the complexity of your organism (ploidy, heterozygosity level, etc.), you will most likely not be able to avoid sequencing large-insert libraries (5kb, 10kb, 20kb) to get a reasonable assembly. Numerous papers on de novo assembly (generally of relatively large genomes) consistently use large-insert libraries to obtain a good assembly.

It's hard to say whether you should see contigs of the size you mentioned, since again it all depends on the complexity of your organism.
Old 03-31-2012, 11:02 AM   #10
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693

My view is that libraries with large insert sizes mainly help scaffolding, but do not help contigs much. For example, SGA assembles reads with ~400bp insert into 10kb contigs. Allpaths-LG assembles reads from a variety of insert sizes into ~20kb contigs. The contig N50 is not that different, especially given that Allpaths-LG uses three times as much data, which costs much more. The scaffold N50 of Allpaths-LG is far better.
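
For anyone unfamiliar with the N50 statistic used in this comparison: it is the length of the contig (or scaffold) at which the running total, taken from longest to shortest, first reaches half of the total assembled length. A minimal Python sketch of that usual definition follows; the function name `n50` and the toy lengths are made up for illustration and are not taken from either assembler.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


# Toy example with made-up lengths (total 32, so half is 16).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

In the toy example the two longest contigs (8 + 8) already cover half of the 32 total bases, so the N50 is 8. Joining contigs into scaffolds changes the list of lengths, which is why contig N50 and scaffold N50 can differ so much for the same assembly.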
Old 03-31-2012, 11:58 AM   #11
lkral
Member
 
Location: Carrollton, GA

Join Date: May 2011
Posts: 27

The current project is phase I, where all I need to do is obtain contigs that are large enough to contain a gene or part of a gene. If contigs are smaller than a gene, I can align them to orthologs from other fish species to assemble those genes. In phase II, in about a year or so, I hope to build longer scaffolds by aligning to long Oxford Nanopore-generated sequences (I trust these nanopores will work as advertised).