SEQanswers

Old 11-27-2013, 08:19 AM   #1
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 78
Assembly with a single library always fails?

I'm trying to assemble a ~70 Mb genome from a single Illumina PE library (100 nt reads, 160M reads per end, insert size = 300 bp).
I've tried Velvet, ABySS, and SOAPdenovo with k=75. The largest N50 is 2 kb, with at least 77k contigs.

My questions are:

1. Does one generally need at least two libraries (one with a shorter insert and one with a longer insert) to get a good assembly?

2. When I have only one short-read library like this, how can I get the best assembly out of it? (A rough question, I know. What I really want are tips on parameter tuning, or on preprocessing.)

3. SOAPdenovo always gives me the worst results, with a very small N50, yet publications suggest it is a good assembler for large genomes. I don't know if I did something wrong.

4. The reported coverage is about 50×, with the genome size estimated to be ~70 Mb. But if I calculate the coverage from my reads, it would be 160M × 2 × 100 bp / 70 Mb ≈ 457×. How can it drop to 50×? Well, FastQC did show that there is a high duplication level in the raw reads (>70%). Could this be the reason?

Here are my commands and results from each assembler:

1. Velvet (N50 = 2 kb, #contigs = 77k, reported coverage = 50×):
Code:
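# velveth builds the k=75 hash from the interleaved paired-end fastq;
# velvetg then assembles with automatic coverage cutoff and expected
# coverage, declaring a 300 bp insert size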
velveth Sample_name 75 -shortPaired -fastq Sample_R1R2_rmdp_trimmed_SOAPec.fastq
velvetg Sample_name -cov_cutoff auto -ins_length 300 -exp_cov auto
2. ABySS (contig N50 = 2 kb, #contigs = 460k; scaffold N50 = 4 kb, #scaffolds = 400k):
Code:
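# single-library run: k=75, j=8 threads, n=10 read pairs needed to make a scaffold join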
abyss-pe n=10 name=Sample_k75 k=75 j=8 in='Sample_R1_rmdp_trimmed.fastq.cor.pair_1.fq Sample_R2_rmdp_trimmed.fastq.cor.pair_2.fq'
3. SOAPdenovo (contig N50 = 170 bp; scaffold N50 = 300 bp, #scaffolds = 47k):
Code:
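# run the whole pipeline (pregraph, contig, map, scaff) at k=75 with 8 threads;
# -R resolves repeats using reads, -F fills gaps in scaffolds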
SOAPdenovo-127mer all -s SOAPdenovo.config -K 75 -R -f -p 8 -F -V -o Sample_k75
Before assembly, I did the following preprocessing:
1. fastuniq to remove duplicates
2. fastq-mcf to remove adapter sequences and trim low-quality ends
3. SOAPec to error-correct the reads:
Code:
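# count 27-mer frequencies across the read set, then correct reads whose
# k-mers fall below the frequency cutoff (likely sequencing errors)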
KmerFreq_HA -k 27 -f 1 -t 10 -L 101 -l fastqlistforSOAPec.lst -p Sample_k27
Corrector_HA -k 27 -l 2 -e 1 -w 1 -q 30 -r 45 -t 10 -j 1 -Q 33 -o 1 Sample_k27.freq.gz fastqlistforSOAPec.lst
I know assembly is quite case-dependent and these are open-ended questions, but any suggestions would be highly appreciated. Thanks!

Last edited by metheuse; 11-27-2013 at 08:26 AM.
Old 11-27-2013, 11:49 PM   #2
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415

A long-insert library is definitely going to help your assembly. As for the library you have, picking a single k-mer size and fixing the library insert size may be suboptimal. You could try running preqc from SGA (see https://github.com/jts/sga/wiki/preqc); it can tell you a lot about your data. You may have gotten a library with a different insert size than you thought, for example. Also, try VelvetOptimiser, which ships with velvet, to get optimal settings for that program. I don't know enough about the other programs to be able to help you.
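For reference, a minimal preqc run looks roughly like this (a sketch only: file names are placeholders and the exact flags are on the wiki page above; the VelvetOptimiser line likewise just scans a range of k values):
Code:
# SGA preqc: quality-filter the pairs, build the FM-index, run preqc, make the report
sga preprocess --pe-mode 1 Sample_R1.fastq Sample_R2.fastq > Sample.pp.fq
sga index -a ropebwt --no-reverse -t 8 Sample.pp.fq
sga preqc -t 8 Sample.pp.fq > Sample.preqc
sga-preqc-report.py Sample.preqc

# VelvetOptimiser: try k=55..89 and keep the best-scoring assembly
VelvetOptimiser.pl -s 55 -e 89 -f '-shortPaired -fastq Sample_R1R2_rmdp_trimmed_SOAPec.fastq'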
Old 11-28-2013, 02:44 AM   #3
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Have you tried Ray? This looks like the sort of thing it was designed for.

I also find the coverage strange. Your effective coverage should be much higher with a paired-end sequencing run. Perhaps there are so many duplicated errors that they confuse the assembly process, and it's calculating an average read separation of 50 bp rather than 300 bp. Have you tried it with a random subsample of your reads (say 20-40M)?
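Subsampling could be done along these lines (a sketch; I'm assuming seqtk here, and the identical -s seed is what keeps R1/R2 in sync):
Code:
# randomly draw ~20M read pairs; same seed on both ends preserves pairing
seqtk sample -s100 Sample_R1.fastq 20000000 > sub_R1.fastq
seqtk sample -s100 Sample_R2.fastq 20000000 > sub_R2.fastq

# a basic Ray run on the subsample (MPI, 8 processes; k is illustrative)
mpiexec -n 8 Ray -k 31 -p sub_R1.fastq sub_R2.fastq -o ray_sub_out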
Old 11-28-2013, 04:26 AM   #4
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

Quote:
Originally Posted by metheuse

4. The reported coverage is about 50×, with the genome size estimated to be ~70 Mb. But if I calculate the coverage from my reads, it would be 160M × 2 × 100 bp / 70 Mb ≈ 457×. How can it drop to 50×? Well, FastQC did show that there is a high duplication level in the raw reads (>70%). Could this be the reason?
velvet calculates coverage as k-mer coverage; the formula is in the velvet manual, so that would account for some of the difference between what you calculated and what velvet reports.

I find that the coverage velvet reports seems to be calculated from the number of reads it actually uses in the assembly, rather than the total number of reads.

The last line in the Log file after running velvetg gives the number of reads used in the assembly.
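For the numbers in this thread, the conversion alone explains a lot (a back-of-envelope sketch using the manual's formula, with L = 100 and k = 75):
Code:
Ck = C * (L - k + 1) / L          # k-mer coverage from read coverage
   = 457 * (100 - 75 + 1) / 100   # C ~ 457x raw read coverage
   ~ 119x
Factor in the >70% duplicates removed before assembly and the ~15% of reads velvet discards, and the expected k-mer coverage comes down to the same order as the reported 50×.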
Old 11-28-2013, 10:35 AM   #5
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 78

Quote:
Originally Posted by flxlex
A long-insert library is definitely going to help your assembly. As for the library you have, picking a single k-mer size and fixing the library insert size may be suboptimal. You could try running preqc from SGA (see https://github.com/jts/sga/wiki/preqc); it can tell you a lot about your data. You may have gotten a library with a different insert size than you thought, for example. Also, try VelvetOptimiser, which ships with velvet, to get optimal settings for that program. I don't know enough about the other programs to be able to help you.
Thanks a lot! I'll take a look at SGA and VelvetOptimiser.
Old 11-28-2013, 10:40 AM   #6
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 78

Quote:
Originally Posted by gringer
Have you tried Ray? This looks like the sort of thing it was designed for.

I also find the coverage strange. Your effective coverage should be much higher with a paired-end sequencing run. Perhaps there are so many duplicated errors that they confuse the assembly process, and it's calculating an average read separation of 50 bp rather than 300 bp. Have you tried it with a random subsample of your reads (say 20-40M)?
Thanks! I'll take a look at Ray.
I didn't try random subsamples. Would that help?
Old 11-28-2013, 10:41 AM   #7
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 78

Quote:
Originally Posted by mastal
velvet calculates coverage as k-mer coverage; the formula is in the velvet manual, so that would account for some of the difference between what you calculated and what velvet reports.

I find that the coverage velvet reports seems to be calculated from the number of reads it actually uses in the assembly, rather than the total number of reads.

The last line in the Log file after running velvetg gives the number of reads used in the assembly.
Thanks! It used 85% of the input reads, so that still can't explain the coverage difference.
Old 11-28-2013, 12:10 PM   #8
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Quote:
Originally Posted by metheuse
I didn't try random subsamples. Would that help?
It will reduce the number of random errors that appear multiple times. Alternatively, you could use SGA's k-mer-based error correction, but I'm not sure SGA will work so well with 100 bp reads.
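If you try it, the SGA correction step is roughly this (a sketch; file names are placeholders, -k 41 is just an illustrative value, and the SGA examples show current usage):
Code:
sga preprocess --pe-mode 1 Sample_R1.fastq Sample_R2.fastq > Sample.pp.fq
sga index -a ropebwt --no-reverse -t 8 Sample.pp.fq
sga correct -k 41 --learn -t 8 -o Sample.ec.fq Sample.pp.fq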
Old 12-02-2013, 01:51 PM   #9
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 78

Thanks a lot for all your replies above. I'm currently trying VelvetOptimiser and SGA on the data.

Besides, I have another question:
Do I have to choose the same k-mer size for error correction and assembly? I was using SOAPec, which allows a k-mer size of at most 27, but I want to use k=75 for the assembly. Does this difference matter?
Thanks.
Old 12-02-2013, 03:38 PM   #10
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Quote:
Do I have to choose the same k-mer size for error correction and assembly?
No, they are separate, independent steps, so the k-mer sizes don't need to match.
Old 12-13-2013, 07:15 AM   #11
ctseto
Member
 
Location: SE MN

Join Date: Oct 2013
Posts: 44

I've had pretty good results using SPAdes. Comparing my Velvet runs against SPAdes on the same input files is night and day.

An additional long-insert library might also help, especially if your organism is known to have a highly repetitive genome.
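A basic invocation would be something like this (a sketch; file names are placeholders, SPAdes picks several k values itself so there is no need to fix k=75, and -m caps memory in GB):
Code:
spades.py --careful -t 8 -m 64 \
    -1 Sample_R1_rmdp_trimmed.fastq -2 Sample_R2_rmdp_trimmed.fastq \
    -o spades_out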