SEQanswers > Sequencing Technologies/Companies > Illumina/Solexa
Old 02-14-2014, 02:16 PM   #1
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28
Assembly MiSeq

I am trying to assemble an oomycete genome (~40 Mb). I have around 20 million 250 bp PE reads. I tried Velvet first, but I got a very fragmented assembly. I tried different k-mers (up to 125), but the contigs were still very small. I know that MIRA could be an alternative for MiSeq. I tried to input MIRA with joined reads (using FLASH), but only 18% of the reads overlap. My sequencing facility said that the library insert is about 300 bp. That is weird, considering the small fraction of overlapping reads.
Any comments?
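As a sanity check on that number: with 2x250 bp reads, a pair can only overlap when the fragment spanned by the pair is shorter than 500 bp. A minimal sketch of the arithmetic (assuming "insert" means the full fragment length):

```python
def expected_overlap(read_len, insert_size):
    """Overlap in bp between the two reads of a pair; a negative
    value means the reads do not reach each other."""
    return 2 * read_len - insert_size

# A 300 bp insert with 2x250 bp reads should overlap by ~200 bp,
# so nearly all pairs ought to merge:
print(expected_overlap(250, 300))   # 200

# Pairs stop overlapping once the insert exceeds 2 x 250 = 500 bp:
print(expected_overlap(250, 600))   # -100
```

An 18% merge rate therefore suggests most fragments are well above 500 bp, i.e. the quoted "300 bp insert" probably does not describe the full fragment.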
Old 02-14-2014, 02:55 PM   #2
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

You can use Velvet to get an estimate of the insert length. I think it calculates it from the reads it does manage to assemble. There is a script for this in Velvet's contrib folder.
Old 02-14-2014, 03:01 PM   #3
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

FLASH joins paired ends; it is used to extend contigs of targeted amplicons.
Just because the two reads of a pair do not overlap each other does not mean that sequences among different read pairs do not overlap. They most certainly will, and whole-genome sequencing usually has a known insert size, which guides the assembly software in building contigs. You shouldn't use FLASH on whole-genome data.

There are many NGS assemblers out there, BOWTIE 2 being one good example.

Have a play around with the paired-end data, specifying your insert size.
Old 02-14-2014, 04:23 PM   #4
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

Bowtie2 isn't an assembler, it's a mapper. You can "sort of" assemble with a mapper by generating consensus sequences based on a reference assembly, but I'm guessing this oomycete assembly needs to be de novo.

FWIW, an insert size of 300bp may mean an average fragment size of 800bp, depending on how your sequencing facility defines their numbers.

For long reads I'd recommend a non-kmer approach for assembly (i.e. not Velvet), because splitting reads up into kmers loses the benefit of 250bp sequencing (or longer, if the pairs overlap -- a fragment size of less than 500bp).
Old 02-14-2014, 04:40 PM   #5
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

Sorry, my mistake. Indeed, Bowtie requires a reference to map to. I have performed de novo assemblies with MIRA and Newbler with pretty good success.
Old 02-17-2014, 08:47 AM   #6
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317

Quote:
Originally Posted by joxcargator73 View Post
[...] I know that MIRA could be an alternative for MiSeq. I tried to input MIRA with joined reads (using FLASH) but only 18% of the reads overlap.
If you use MIRA, you don't want to use reads joined by another program -- just give MIRA the raw files and let it take care of the details. You would use a manifest file something like:

--------------------

project = oomycete
parameters = COMMON_SETTINGS \
-GENERAL:number_of_threads=20 \
-NW:cmrnl=no \
SOLEXA_SETTINGS \
-CL:pec
job = genome,denovo,accurate
readgroup = miseq500cycle
data = xxx_R1_001.fastq xxx_R2_001.fastq
technology = solexa
template_size = 50 1000 autorefine
segment_placement = ---> <---

------------------
You just run mira the normal way. Put the stuff between the "----" in a file called something like "manifest.conf" and then run mira with:

mira manifest.conf

Possibly you would want to background the process and/or capture output to a log file.
Always best to check your manifest for errors with:

mira -m manifest.conf

I am presuming you have a computer with 20 or more cores; if not, you may want to adjust that "number_of_threads".

Anyway, I think the issue here would be the amount of time it takes to get an assembly. MIRA is very thorough, so you are probably looking at over a week, even on a powerful computer.

If that is a problem, you could get a result much more quickly with abyss-pe. Just:

abyss-pe k=61 name=Oomycete in='xxx_R1_001.fastq xxx_R2_001.fastq'

--
Phillip
Old 02-28-2014, 02:53 PM   #7
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28

Quote:
Originally Posted by pmiguel View Post
If you use MIRA, you don't want to use reads joined by another program -- just give MIRA the raw files and let it take care of the details. [...]
Thanks. I ran a small set of reads (~8 million PE) with MIRA and got a very fragmented assembly again, N50 = 1.8 kb. It took two days. Then I ran all 20 million reads (PE); after 5 days I got better results, N50 = 3.5 kb. However, it is still fragmented. I learned that high genome heterozygosity can create problems in the assembly. In fact, we believe the genome size is around 40 Mb, but the total consensus that MIRA outputs is around 92 Mb. I am doing some research on assemblers for diploid genomes with high levels of heterozygosity. I found a couple of tools that could help; one of these is Hapsembler. Suggestions are welcome. Thanks again.

Last edited by joxcargator73; 02-28-2014 at 02:56 PM.
Old 02-28-2014, 02:57 PM   #8
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838

It is frequently the case that error correction can improve assembly. Have a go at using BayesHammer (i.e. just the 'error correction' part of SPAdes), or Quake to correct reads before assembly.
Old 02-28-2014, 03:03 PM   #9
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28

Quote:
Originally Posted by gringer View Post
It is frequently the case that error correction can improve assembly. Have a go at using BayesHammer (i.e. just the 'error correction' part of SPAdes), or Quake to correct reads before assembly.
Thanks, I was entertaining that idea too. I will give it a try.
Old 02-28-2014, 04:23 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by joxcargator73 View Post
Thanks, I was entertaining that idea too. I will try that too.
In my testing with Velvet and SoapDenovo, error correction has a minor positive effect and normalization has a greater positive effect (particularly on haploid genomes; less effective on diploids). If you're willing to try, I have a couple of tools that might help, available here. One is BBNorm, which does error correction and depth normalization. Velvet tends to produce the best assemblies at roughly 40x read coverage. To normalize to a target 40x kmer depth (at K=31), you would:

bbnorm.sh in=read1.fq in2=read2.fq out=norm1.fq out2=norm2.fq target=40

You can turn on error-correction with the flag "ecc=t"

Or you can run error-correction without normalization with the script "ecc.sh". They both call the same program. It's very fast.
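To pick a normalization target, it helps to know the starting depth. A rough sketch of the arithmetic (assuming the ~20 million 250 bp read pairs and ~40 Mb genome quoted in the first post; halve the result if the 20 million counts individual reads rather than pairs):

```python
def fold_coverage(n_pairs, read_len, genome_size):
    """Approximate average fold coverage for paired-end data."""
    return n_pairs * 2 * read_len / genome_size

# ~20 million 250 bp pairs over a ~40 Mb genome:
print(fold_coverage(20_000_000, 250, 40_000_000))  # 250.0
```

Either way, the library is far above the ~40x that Velvet tends to like, which is why normalizing down can help.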

I've had mixed results with read overlap merging. It is very useful for quickly plotting a library's insert size distribution or trimming adapters, but it seems to reduce the quality of Soap's assemblies (not sure about Velvet). On the other hand, Ray seems to give better assemblies with merged reads. But if you do perform merging, my merger is much faster and substantially more accurate (in the sense of fewer false positives) than FLASH, with a similar merging rate.

bbmerge.sh in=read1.fq in2=read2.fq out=merged.fq outu=unmerged1.fq outu2=unmerged2.fq hist=hist.txt


This command will also produce an insert-size histogram in hist.txt.

You can also try quality-trimming, if your 250bp library has a strong quality dropoff toward the read ends, and adapter removal if many inserts are shorter than 250bp. I have a tool for that, too --

bbduk.sh in=read1.fq in2=read2.fq out=trimmed1.fq out2=trimmed2.fq k=25 ktrim=r hdist=1 qtrim=rl trimq=10 ref=adapters.fa.gz minlen=100

This will trim adapter sequence from the right end of the read (where it occurs if the insert size drops below read length) by matching kmers from the attached "adapters.fa.gz" (which contains TruSeq universal adapter sequences). It requires a 25-mer match allowing 1 mismatch (hdist=1). You can set the number of mismatches higher to account for errors, but it requires exponentially more memory. Then, the read will be quality-trimmed to Q10 on the right and left sides, and if either read is shorter than 100bp, the pair will be discarded. You can of course enable and disable all of these things. Note that read merging generally works best if you first remove adapters but don't quality-trim, then merge, then quality-trim the unmerged pairs, as bbmerge will internally account for base quality.

Good luck!

P.S. You can get help information on usage by running the shellscripts without any arguments.
Attached Files
File Type: gz adapters.fa.gz (369 Bytes)

Last edited by Brian Bushnell; 03-05-2014 at 07:47 PM.
Old 03-03-2014, 11:02 AM   #11
akorobeynikov
Member
 
Location: Saint Petersburg, Russia

Join Date: Sep 2013
Posts: 25

You can definitely give SPAdes a try. The instructions for assembling 2x250 reads can be found in the manual (http://spades.bioinf.spbau.ru/releas...al.html#sec3.4), though everything described there is just the default behavior as of SPAdes 3.0.

Feel free to contact SPAdes support in case of any problems.
Old 03-05-2014, 07:14 PM   #12
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28

Thanks for your help. I am trying now different options. I will post the results soon.
Old 03-28-2014, 08:44 AM   #13
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28

This is an update of the results using SPAdes.
I first used SPAdes with the following parameters:

spades.py -k 21,33,55,77,99,127 --careful -1 forward -2 reverse -o spades_output

The output improved a lot in comparison to what I had with other assemblers (Velvet and MIRA):

N50: 11.9 kb
Number of contigs: 18,063
Total bases: 49 Mb

Then I tried DipSPAdes with two simple parameters:

dipspades.py -k 21,33,55,77,99,127 --careful -1 reads_left.fastq -2 reads_right.fastq --expect-rearrangements -o output_dir

N50: 12.7 kb
Number of contigs: 5,186
Total bases: 33 Mb

I will continue trying other parameters, but I have a question regarding the difference in total bases between SPAdes and DipSPAdes. I know that the latter was designed for diploid polymorphic organisms, which is the case for my bug. I wonder whether SPAdes is producing extra contigs that are artifacts, or whether DipSPAdes is missing some contigs during the assembly. We know that the total genome of our bug is between 30 and 50 Mb. Both assemblies look promising, but we have to confirm the real size of the genome by PFGE.
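For anyone comparing these numbers: N50 is the length L such that contigs of length >= L contain at least half of the assembled bases. A minimal sketch:

```python
def n50(contig_lengths):
    """Smallest contig length L such that contigs >= L hold at
    least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

print(n50([10, 8, 6, 4, 2]))  # 8  (10 + 8 = 18, which is >= 30 / 2)
```

Note that N50 alone says nothing about correctness: misassemblies inflate it, so it is best read alongside total size and contig count as above.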

On the other hand, BBMap helped a lot to double-check my reads. It is a very useful tool. Thanks.
Old 03-31-2014, 03:54 AM   #14
akorobeynikov
Member
 
Location: Saint Petersburg, Russia

Join Date: Sep 2013
Posts: 25

I would suggest reading the DipSPAdes manual (http://spades.bioinf.spbau.ru/releas...es_manual.html) in order to understand the differences between SPAdes and DipSPAdes output.

In short: since you have a highly heterozygous diploid genome, a conventional assembler (like SPAdes) will produce a set of so-called 'double' contigs, that is, each contig will come from one of the two haplomes. So the total length of the assembly will be pretty close to double the expected genome size (if the k-mer is large enough and the divergence between haplomes is more than 0.5-1%).

DipSPAdes uses haplocontigs to produce consensus contigs, which contain the consensus of both haplomes. Therefore, the total assembly size will be close to the real genome size. Even more, DipSPAdes will also use the differences between the haplocontigs to resolve additional repeats.

Hope this helps.
Old 03-31-2014, 08:01 AM   #15
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28

Thanks for your answer.
Yes, we are aware of the differences between SPAdes and DipSPAdes; I just wanted to double-check and share the results.
I am thinking of doing a couple of PacBio runs (long reads) to improve the assembly. I think DipSPAdes can take PacBio reads, even if they are not corrected.
Your comments helped a lot.
Old 04-01-2014, 12:51 AM   #16
akorobeynikov
Member
 
Location: Saint Petersburg, Russia

Join Date: Sep 2013
Posts: 25

Yes, DipSPAdes can use PacBio reads. Currently it uses them only for haplocontig construction, so this may be a bit suboptimal.

Also, I would suggest disabling the --careful option for the initial assembly.
Old 04-02-2014, 10:56 AM   #17
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28
Meaning of consensus_contigs.fasta

I am confused about the consensus_contigs.fasta file. I thought it contained a consensus of both haplotypes, but from what I understood from the manual, it contains the contigs constructed from the paired consensus contigs. Or maybe I am missing something?
If I wanted to estimate the size of the genome, which file or combination of files would be best to use?
Furthermore, which file(s) summarize the genome of each haplotype (haplotype_assembly.out?), and which file(s) would be best as input for further gene prediction and annotation?

Thanks
Old 04-02-2014, 01:07 PM   #18
akorobeynikov
Member
 
Location: Saint Petersburg, Russia

Join Date: Sep 2013
Posts: 25

Quote:
Originally Posted by joxcargator73 View Post
I am confused about the consensus_contigs.fasta file. I thought it contained a consensus of both haplotypes, but from what I understood from the manual, it contains the contigs constructed from the paired consensus contigs. Or maybe I am missing something?
I looked over the manual and found nothing that indicates so. The manual clearly states: "consensus_contigs.fasta - file in FASTA format with a set of constructed consensus contigs".

Quote:
Originally Posted by joxcargator73 View Post
If I wanted to estimate the size of the genome, which file or combination of files would be best to use?
It depends on whether you're interested in the size of the diploid or the haploid genome.

Quote:
Originally Posted by joxcargator73 View Post
Furthermore, which file(s) summarize the genome of each haplotype (haplotype_assembly.out?)
The manual states that haplotype_assembly.out contains the information about which contig from the set of haplocontigs belongs to which haplome. So yes, you can use it.

Quote:
Originally Posted by joxcargator73 View Post
and which file(s) would be best as input for further gene prediction and annotation?
Again, it depends on the result you want to obtain, but for gene finding you would probably want to use the haplocontigs.
Old 04-15-2014, 03:42 PM   #19
joxcargator73
Member
 
Location: Gainesville

Join Date: Dec 2012
Posts: 28
PacBio question

The manual says:
"PacBio CLR reads are used for hybrid assemblies (e.g. with Illumina or IonTorrent). There is no need to pre-correct PacBio CLR reads. You just need to have filtered subreads in FASTQ/FASTA format. Provide these filtered subreads using --pacbio option. SPAdes will use PacBio CLR reads for gap closure and repeat resolution."

What is the real meaning of "filtered subreads"? Is it just the set of reads that passed the machine's quality control, or is there an additional cutoff based on quality and length?

Thanks
Old 04-15-2014, 03:45 PM   #20
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Filtered subreads are raw reads broken into pieces at the adapters. A raw (polymerase) read can contain the same sequence multiple times: forward, adapter, reverse, adapter, and so on.
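A toy illustration of the idea (hypothetical sequences; real subread extraction tolerates sequencing errors rather than requiring exact adapter matches):

```python
def split_subreads(polymerase_read, adapter):
    """Cut a raw PacBio polymerase read at exact adapter matches,
    returning the sequence pieces in between (the subreads)."""
    return [piece for piece in polymerase_read.split(adapter) if piece]

raw = "ACGTAC" + "TTTT" + "GTACGT"  # forward pass, adapter, reverse pass
print(split_subreads(raw, "TTTT"))  # ['ACGTAC', 'GTACGT']
```

Each piece is one pass over the same insert, which is why the raw read can contain the insert sequence several times while the subreads do not.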