SEQanswers
Old 06-04-2012, 06:39 AM   #1
hicham
Member
 
Location: Malaga

Join Date: Feb 2010
Posts: 14
De novo assembly for Illumina HiSeq paired-end reads

Hi all,

I have Illumina HiSeq paired-end reads and I'm looking for a strategy for de novo assembly of these data.

The initial total number of reads in the two files was 240 million (combined file size: 56 GB). After the cleaning step, the total number of reads was reduced to 81 million (21 GB).

I'm trying to assemble these data with abyss-pe. The software works well on small paired-end files, but when I run it on my data, even with 70 GB of RAM, the assembly doesn't finish and gives no results. I tested it with a small k-mer (30) and a large k-mer (64), and also raised the minimum coverage to 40. No result.

Does anyone know a strategy or pipeline for assembling such a large amount of Illumina paired-end data?

Thank you very much
Old 06-04-2012, 06:51 AM   #2
joshuapk
Junior Member
 
Location: Pittsburgh

Join Date: Mar 2012
Posts: 9

You may want to consider a CD-HIT run to lower the complexity by removing duplicate reads. I suggest getting access to a cluster and using Celera Assembler; however, remember that a lot of your contigs will end up in the degenerates folder.
Old 06-04-2012, 07:02 AM   #3
Mark
Member
 
Location: Raleigh, NC

Join Date: Nov 2008
Posts: 47

What kind of organism are you sequencing? This, of course, affects the strategy.
Old 06-04-2012, 07:12 AM   #4
hicham
Member
 
Location: Malaga

Join Date: Feb 2010
Posts: 14

Thanks for your answer.
Indeed, in each file reads are duplicated thousands of times, but we can't reduce these duplicates because they aren't shared between the two files. For example, a read from the first file may have 1,000 exact copies, but the corresponding mates in the second file are not copies of each other.
I forgot to mention the kind of data: RNA-Seq (transcriptome assembly).
I think Celera Assembler isn't suitable for this assembly because it is designed for genome assembly.
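To illustrate, here is a rough sketch of how the duplication can be tallied per file (assuming plain 4-line FASTQ records; the file names are made up):
Code:
from collections import Counter

def duplicate_counts(fastq_path):
    """Count how many times each exact sequence occurs in a FASTQ file."""
    counts = Counter()
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # the sequence line of each 4-line record
                counts[line.strip()] += 1
    return counts

r1 = duplicate_counts("reads_1.fastq")
r2 = duplicate_counts("reads_2.fastq")
print("most duplicated in file 1:", r1.most_common(3))
print("most duplicated in file 2:", r2.most_common(3))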
Old 06-04-2012, 07:14 AM   #5
hicham
Member
 
Location: Malaga

Join Date: Feb 2010
Posts: 14

It is a microalga.
Old 06-04-2012, 07:54 AM   #6
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53

Quote:
Originally Posted by hicham
we can't reduce these duplicates because they aren't shared between the two files. For example, a read from the first file may have 1,000 exact copies, but the corresponding mates in the second file are not copies of each other.
Be careful here. Most assemblers do not look at header information to establish pairs. Rather, the 1st read in file a is paired with the 1st read in file b. If you remove any read, be sure you also remove its pair in the other file.
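For example, a minimal sketch of filtering two FASTQ files in sync so the pairs stay together (assuming plain 4-line records in matching order; the file names and the keep() rule are just placeholders):
Code:
def records(path):
    """Yield 4-line FASTQ records from a file."""
    with open(path) as fh:
        while True:
            rec = [fh.readline() for _ in range(4)]
            if not rec[0]:
                return
            yield rec

def keep(rec1, rec2):
    # Placeholder criterion: drop pairs with ambiguous bases.
    return "N" not in rec1[1] and "N" not in rec2[1]

with open("out_1.fastq", "w") as o1, open("out_2.fastq", "w") as o2:
    for rec1, rec2 in zip(records("reads_1.fastq"), records("reads_2.fastq")):
        if keep(rec1, rec2):  # keep or drop both mates together
            o1.writelines(rec1)
            o2.writelines(rec2)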


Quote:
Originally Posted by hicham
I forgot to mention the kind of data: RNA-Seq (transcriptome assembly).
Do not filter out the repeats. Expression levels are important information for transcriptome assemblers. We currently use the Trinity package. SOAPtrans is pretty fast and memory-efficient, but I haven't had a chance to assess its correctness.
Old 06-04-2012, 08:15 AM   #7
hicham
Member
 
Location: Malaga

Join Date: Feb 2010
Posts: 14

Right: after the cleaning step we removed the reads without a pair and put them in a separate file, so that the reads in the two paired files stay in the same order.
I agree on the importance of expression levels in transcriptome assembly; the idea was to make a relative reduction of the duplicates to cut down the immense amount of data, but that turned out not to be possible with paired reads.
If we use Trinity, how much RAM would we need to dedicate to this assembly?
Old 06-04-2012, 08:46 AM   #8
ians
Member
 
Location: St. Louis, MO

Join Date: Aug 2011
Posts: 53

Quote:
Originally Posted by hicham
the idea was to make a relative reduction of the duplicates to cut down the immense amount of data, but that turned out not to be possible with paired reads.
It is a good idea (computationally) to reduce the sequence data to only as much sampling as you really need. Which organism did you sequence?

Quote:
Originally Posted by hicham
If we use Trinity, how much RAM would we need to dedicate to this assembly?
The authors say,
Quote:
Ideally, you will have access to a large-memory server, ideally having ~1G of RAM per 1M reads to be assembled (but often, much less memory may be required).
I don't have any numbers to share for paired-end data, but we recently ran 160 M reads (1x100 bp) through Trinity, with memory peaking at 18 GB.
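Applied to the 81 M cleaned reads in this thread, that rule of thumb gives a back-of-the-envelope upper bound:
Code:
reads = 81_000_000           # reads surviving the cleaning step
est_gb = reads / 1_000_000   # ~1 GB of RAM per 1 M reads (upper bound)
print(f"rule-of-thumb peak RAM: ~{est_gb:.0f} GB")  # ~81 GB, often much less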
Old 06-04-2012, 10:02 PM   #9
Pseudonym
Research Engineer
 
Location: NICTA VRL, Melbourne, Australia

Join Date: Jun 2011
Posts: 12

Quote:
Originally Posted by hicham
Does anyone know a strategy or pipeline for assembling such a large amount of Illumina paired-end data?
You might like to try Gossamer. It was designed with memory efficiency in mind, so it can do the same job as other assemblers using smaller machines. (Or, alternatively, it can handle more data than other assemblers on the same machine.)

Full disclosure: I'm one of the developers.
__________________
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Old 06-05-2012, 01:20 AM   #10
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156

I'd second the suggestion to try Trinity on that dataset. You could reduce the dataset with diginorm if necessary, though 81 million reads (pairs?) sounds reasonable to tackle on a ~64 GB server - generally, memory consumption depends more on the transcriptome's complexity than on the actual number of reads.
What was wrong with the 159 million reads that you dropped? rRNA, adapters, or just bad quality?
Old 06-05-2012, 01:21 AM   #11
hicham
Member
 
Location: Malaga

Join Date: Feb 2010
Posts: 14

Hi,
Thank you very much for your answers.
I just read about Gossamer. In the paper it is described as good for genomic data.
Is it also valid for transcriptomic reads?
Old 06-05-2012, 01:36 AM   #12
hicham
Member
 
Location: Malaga

Join Date: Feb 2010
Posts: 14

Hi arvid,
After the cleaning I get three files: two files for the pairs and one file containing the reads without a pair; the total number of reads in the three files is 81 million.
For the cleaning we used SeqTrimNext, which removes adapters, contaminants, bad-quality reads, and low-complexity reads.
Old 06-05-2012, 01:52 AM   #13
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156

Quote:
Originally Posted by hicham
Hi arvid,
After the cleaning I get three files: two files for the pairs and one file containing the reads without a pair; the total number of reads in the three files is 81 million.
For the cleaning we used SeqTrimNext, which removes adapters, contaminants, bad-quality reads, and low-complexity reads.
For Trinity, you'd want to combine that into one file; it should be able to recognize the pairs on its own (this might have changed recently, though, as a paired-end mapping step was introduced that might need a different input - check the documentation and examples). Otherwise I'd just use the standard parameters, except setting the k-mer method to "jellyfish" and setting the maximum memory for Jellyfish and the number of CPUs to use. I wouldn't expect problems with 81 million reads on a server with 70+ GB of RAM (as indicated by your initial post), though expect the software to run overnight or even longer.
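As a rough sketch of the combining step (the file names are made up, and whether your Trinity release really wants one combined file should be checked against its documentation), tagging the mates with /1 and /2 so the pairs stay recognizable might look like:
Code:
def append_fastq(src, dst, mate_suffix=""):
    """Append a FASTQ file to dst, optionally tagging headers with /1 or /2.
    Naive sketch: assumes 4-line records and simple headers without comments."""
    with open(src) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0 and mate_suffix:  # header line of each record
                line = line.rstrip("\n") + mate_suffix + "\n"
            dst.write(line)

with open("combined.fastq", "w") as out:
    append_fastq("reads_1.fastq", out, "/1")
    append_fastq("reads_2.fastq", out, "/2")
    append_fastq("unpaired.fastq", out)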
Old 06-05-2012, 04:34 PM   #14
Pseudonym
Research Engineer
 
Location: NICTA VRL, Melbourne, Australia

Join Date: Jun 2011
Posts: 12

Hicham,

Quote:
Originally Posted by hicham
I just read about Gossamer. In the paper it is described as good for genomic data.
Is it also valid for transcriptomic reads?
About as well as ABySS-PE. Which is to say, not anywhere near as well as an actual transcriptome assembler like Trans-ABySS, Trinity or Oases.

The place where most genome assemblers do significantly worse than transcriptome assemblers is in pair threading and scaffolding, where it's useful to make the assumption that there is such a thing as "N times coverage". (This assumption is incorrect in RNA-Seq, because of differing expression levels.)

One thing that you could try is to use Gossamer as a pre-pass for Trinity. The input to Trinity is the output of a k-mer counter (Trinity's driver script uses Meryl by default). It would be fairly straightforward to use Gossamer as the k-mer counter by running its graph build and cleanup passes to bring the data down to a manageable size, then using dump-graph to report the k-mer counts. You'd need to do a little scripting to convert that into Meryl's format.
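The exact dump-graph output would need checking, but if it can emit tab-separated k-mer/count pairs, the conversion to Meryl's fasta-like text dump (">count" followed by the k-mer, as I understand that format) could be as small as:
Code:
import sys

# Read "kmer<TAB>count" lines on stdin; write Meryl-style text records.
for line in sys.stdin:
    kmer, count = line.split()
    print(f">{count}")
    print(kmer)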

Having said that... we are actively working on the problem of resource-efficient transcriptome assembly. Nothing to announce yet, but watch this space.
__________________
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Old 06-06-2012, 12:45 PM   #15
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104

Trinity has 'jellyfish' as a k-mer counter. It is likely that in the next release Jellyfish will become the default and Meryl will be removed, since Jellyfish is so much faster. So if you are using Trinity, make sure that you specify jellyfish.
Old 07-16-2012, 08:23 AM   #16
amango
Member
 
Location: New York

Join Date: Dec 2009
Posts: 17

I have used ABySS and then Trans-ABySS for a de novo transcriptome assembly with ~70 million reads on a machine with 20 GB of RAM. From what I've read, ABySS is less RAM-intensive than Trinity and SOAPdenovo, and the memory-intensive phase for ABySS and other assemblers is loading the hash table, which depends on the k-mer size, not the number of reads. For these reasons I don't think memory is your problem. The large number of duplicate reads on one side of a pair but not the other does sound like a possible issue - ABySS has trouble when coverage is too high, for example. The ABySS support group is probably a good place to turn: https://groups.google.com/forum/?fro...um/abyss-users
Old 07-16-2012, 08:51 AM   #17
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

In my experience, MIRA performs very well on de novo transcriptome assemblies.
Old 02-12-2014, 09:58 AM   #18
reema
Member
 
Location: Scotland

Join Date: Feb 2014
Posts: 27
Trinity duplicate removal

Dear All,

I am using Trinity for transcriptome assembly. I have a few queries:

1) I have two conditions (control and treated), and each condition has 4 replicates. If I merge these .fq files together, how would the assembly generated from the merged .fq file be better than an assembly generated from a single sample (using only one replicate)?

2) Do I need to remove duplicates from the individual FASTQ files before merging, or after merging them together?

3) I saw there is a script "fasta_remove_duplicates" in the Trinity folder. Is there any chance that the in silico normalization in Trinity takes care of these duplicate reads?

I would appreciate any explanations.