Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| EBARDenovo - A new RNA-seq do novo assembler for paired-end Illumina data | htchu.taiwan | RNA Sequencing | 2 | 06-10-2013 01:13 AM |
| EBARDenovo - A new RNA-seq do novo assembler for paired-end Illumina data | htchu.taiwan | Illumina/Solexa | 9 | 04-16-2013 12:08 AM |
| paired-end read length for de novo assembly | Seqasaurus | Illumina/Solexa | 4 | 10-19-2011 04:32 AM |
| Which assembler for de-novo Illumina transcriptome assembly with relatively few reads | kmkocot | Bioinformatics | 1 | 05-17-2011 04:13 AM |
| PubMed: Local De Novo Assembly of RAD Paired-End Contigs Using Short Sequencing Reads | Newsbot! | Literature Watch | 0 | 05-06-2011 12:40 AM |

#1
Member | Location: Malaga | Join Date: Feb 2010 | Posts: 14

Hi all,

I have Illumina HiSeq paired-end reads and I'm looking for a de novo assembly strategy. The initial total was 240 million reads across the two files (56 GB combined); after the cleaning step that came down to 81 million reads (21 GB). I'm trying to assemble the data with abyss-pe, which works well on small paired-end files, but on my data the assembly never finishes and produces no results, even with 70 GB of RAM. I tested a small k-mer (30) and a large one (64), and also raised the minimum coverage to 40. No result. Does anyone know a strategy or pipeline for assembling such a large amount of Illumina paired-end data?

Thank you very much
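A run along those lines might look like the sketch below (file names are placeholders; abyss-pe's k sets the k-mer size, c the minimum mean k-mer coverage, and j the thread count):

```bash
# Minimal single-node ABySS run on cleaned paired-end reads.
# k = k-mer size, c = minimum mean k-mer unitig coverage, j = threads.
abyss-pe name=algae j=8 k=64 c=40 \
    in='cleaned_R1.fastq cleaned_R2.fastq'
```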

#2
Junior Member | Location: Pittsburgh | Join Date: Mar 2012 | Posts: 9

You may want to consider a CD-HIT run to reduce the data further by removing duplicate reads. I suggest getting access to a cluster and using Celera Assembler; remember, however, that a lot of your contigs will end up in the degenerates folder.
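A minimal sketch of that deduplication pass, assuming the cd-hit-dup tool from the CD-HIT auxiliary tools (check cd-hit-dup -h for the exact option names in your build):

```bash
# Drop exact duplicate read pairs: -i/-i2 take the two mate files,
# -o/-o2 receive the deduplicated outputs.
cd-hit-dup -i cleaned_R1.fastq -i2 cleaned_R2.fastq \
           -o dedup_R1.fastq -o2 dedup_R2.fastq
```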

#3
Member | Location: Raleigh, NC | Join Date: Nov 2008 | Posts: 51

What kind of organism are you sequencing? That, of course, affects the strategy.

#4
Member | Location: Malaga | Join Date: Feb 2010 | Posts: 14

Thanks for your answer.

Indeed, in each file reads are duplicated thousands of times, but we can't collapse these duplicates because they aren't shared between the two files: a read in the first file may have 1,000 exact copies while its mate in the second file does not. I forgot to mention the kind of data: RNA-seq (transcriptome assembly). I think Celera Assembler isn't suitable for this, because it's a genome assembler.

#5
Member | Location: Malaga | Join Date: Feb 2010 | Posts: 14

It's a microalga.

#6
Member | Location: St. Louis, MO | Join Date: Aug 2011 | Posts: 53

Don't filter out the duplicated reads; overall expression levels are important information for transcriptome assemblers. We currently use the Trinity package. SOAPtrans is pretty fast and memory-efficient, but I haven't had a chance to assess its correctness.
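A typical Trinity command line for paired-end FASTQ might look like the sketch below (flag names have changed across releases; --JM, which sizes the jellyfish k-mer hash, is from releases of this era, and newer builds use Trinity --max_memory instead):

```bash
# Trinity on cleaned paired-end FASTQ; file names are placeholders.
Trinity.pl --seqType fq \
    --left cleaned_R1.fastq --right cleaned_R2.fastq \
    --JM 50G --CPU 8 --output trinity_out
```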

#7
Member | Location: Malaga | Join Date: Feb 2010 | Posts: 14

Right: after the cleaning step we removed reads without a mate and put them in a separate file, so that the reads in the two paired files stay in the same order.

I agree about the importance of expression levels in transcriptome assembly; the idea was a partial reduction of the duplicates to shrink the immense amount of data, but that turned out not to be possible with these paired reads. If we use Trinity, how much RAM would we need to dedicate to this assembly?

#8
Member | Location: St. Louis, MO | Join Date: Aug 2011 | Posts: 53

[Quoted reply; the text of this post was not preserved.]

#9
Research Engineer | Location: NICTA VRL, Melbourne, Australia | Join Date: Jun 2011 | Posts: 12

Gossamer might be worth a look here. Full disclosure: I'm one of the developers.
__________________
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});

#10
Senior Member | Location: Berlin | Join Date: Jul 2011 | Posts: 156

I'd second the suggestion to try Trinity on that dataset. You could reduce the dataset with diginorm if necessary, though 81 million reads (pairs?) sounds reasonable to tackle on a ~64 GB server. Generally, though, memory consumption depends more on the complexity of the transcriptome than on the raw number of reads.

What was wrong with the 159 million reads that you dropped? rRNA, adapters, or just bad quality?
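A digital-normalization pass with khmer (the package that implements diginorm) might look like this sketch; the -N/-x table-sizing options are from older khmer releases, and newer ones take a single -M memory cap instead:

```bash
# Interleave the pairs, then down-sample to a median k-mer coverage
# of 20 while keeping read pairs together (-p).
interleave-reads.py cleaned_R1.fastq cleaned_R2.fastq -o interleaved.fastq
normalize-by-median.py -p -k 20 -C 20 -N 4 -x 2e9 interleaved.fastq
# the surviving reads are written to interleaved.fastq.keep
```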

#11
Member | Location: Malaga | Join Date: Feb 2010 | Posts: 14

Hi,

Thank you very much for your answers. I just read about Gossamer; the paper describes it as good for genomic data. Can it also be used for transcriptomic reads?

#12
Member | Location: Malaga | Join Date: Feb 2010 | Posts: 14

Hi arvid,

After the cleaning I get 3 files: two files with the paired reads, and one file containing the reads whose mate was removed; the sum of reads across the three files is 81 million.

For the cleaning we used SeqTrimNext, which removes adapters, contaminants, bad-quality reads and low-complexity reads.

#13
Senior Member | Location: Berlin | Join Date: Jul 2011 | Posts: 156

[Quoted reply; the text of this post was not preserved.]

#14
Research Engineer | Location: NICTA VRL, Melbourne, Australia | Join Date: Jun 2011 | Posts: 12

Hicham,

The place where most genome assemblers do significantly worse than transcriptome assemblers is in pair threading and scaffolding, where it's useful to assume that there is such a thing as "N-times coverage". (That assumption is incorrect for RNA-seq, because of differing expression levels.)

One thing that you could try is to use Gossamer as a pre-pass for Trinity. The input to Trinity is the output of a k-mer counter (Trinity's driver script uses Meryl by default). It would be fairly straightforward to use Gossamer as the k-mer counter: run its graph build and cleanup passes to bring the data down to a manageable size, then use dump-graph to report the k-mer counts. You'd need to do a little scripting to convert that into Meryl format.

Having said that, we are actively working on the problem of resource-efficient transcriptome assembly. Nothing to announce yet, but watch this space.
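That scripting could be a one-liner along these lines; the goss arguments, the assumption that dump-graph emits one "k-mer TAB count" pair per line, and the ">count / kmer" output records (mirroring what jellyfish dump produces) are all guesses to verify against your installed versions:

```bash
# Hypothetical glue: reshape a "KMER<TAB>COUNT" dump into FASTA-like
# ">count" / "kmer" records for Trinity's k-mer reader.
goss dump-graph -G cleaned_graph |
    awk '{ print ">" $2; print $1 }' > kmers_for_trinity.fa
```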
__________________
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});

#15
Rick Westerman | Location: Purdue University, Indiana, USA | Join Date: Jun 2008 | Posts: 1,104

Trinity has 'jellyfish' as a k-mer counter. It is likely that in the next release jellyfish will become the default and meryl will be removed, since jellyfish is so much faster. So if you are using Trinity, make sure that you specify jellyfish.
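For reference, the counting step Trinity drives can also be run standalone; a typical invocation is sketched below (jellyfish 2 syntax; jellyfish 1 may split the output across files and need a jellyfish merge step):

```bash
# Count canonical 25-mers (-C) with an initial hash of 1G entries,
# then dump them as FASTA-like ">count" / "kmer" records.
jellyfish count -m 25 -s 1G -t 8 -C -o mer_counts.jf \
    cleaned_R1.fastq cleaned_R2.fastq
jellyfish dump mer_counts.jf > mer_counts.fa
```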

#16
Member | Location: New York | Join Date: Dec 2009 | Posts: 17

I have used ABySS and then Trans-ABySS for a de novo transcriptome assembly with ~70 million reads on a machine with 20 GB of RAM. From what I've read, ABySS is less RAM-intensive than Trinity or SOAPdenovo, and the memory-intensive phase for ABySS and similar assemblers is loading the hash table, which depends on the k-mer size rather than the number of reads. For these reasons I don't think memory is your issue. The large number of duplicate reads on one side of a pair but not the other does sound like a possible problem; ABySS has issues when coverage is too high, for example. The ABySS support group is probably a good place to turn: https://groups.google.com/forum/?fro...um/abyss-users
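A multi-k ABySS run merged with Trans-ABySS might look like the sketch below (the transabyss-merge wrapper is from newer Trans-ABySS releases; older pipelines structure this step differently):

```bash
# Assemble at several k values, then merge the contig sets.
for k in 32 44 56 64; do
    abyss-pe name=k$k k=$k in='cleaned_R1.fastq cleaned_R2.fastq'
done
transabyss-merge --mink 32 --maxk 64 --out merged.fa \
    k32-contigs.fa k44-contigs.fa k56-contigs.fa k64-contigs.fa
```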

#17
Senior Member | Location: Halifax, Nova Scotia | Join Date: Mar 2009 | Posts: 381

In my experience, MIRA performs very well on de novo transcriptome assemblies.

#18
Member | Location: Scotland | Join Date: Feb 2014 | Posts: 27

Dear All,

I am using Trinity for transcriptome assembly and have a few queries:

1) I have two conditions (control and treated) and each condition has 4 replicates. If I merge these .fq files together, how would the assembly generated from the merged .fq file be better than an assembly generated from a single replicate?

2) Do I need to remove duplicates from the individual FASTQ files before merging, or after merging them together?

3) I saw there is a script "fasta_remove_duplicates" in the Trinity folder. Is there any chance that the in silico normalization in Trinity takes care of these duplicate reads?

I would appreciate any explanations.
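On the merging step, one approach is to concatenate the replicates per mate and let Trinity's bundled normalization reduce the redundancy; a sketch, assuming the insilico_read_normalization.pl utility shipped with Trinity releases of this era (the replicate file names are placeholders):

```bash
# Merge replicates per mate, then normalize to ~30x maximum k-mer
# coverage while keeping pairs together.
cat control_rep{1..4}_R1.fq treated_rep{1..4}_R1.fq > all_R1.fq
cat control_rep{1..4}_R2.fq treated_rep{1..4}_R2.fq > all_R2.fq
$TRINITY_HOME/util/insilico_read_normalization.pl \
    --seqType fq --JM 20G --max_cov 30 \
    --left all_R1.fq --right all_R2.fq --pairs_together
```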