  • De novo assembly for Illumina HiSeq paired-end reads

    Hi all,

    I have Illumina HiSeq paired-end reads and I'm looking for a strategy for de novo assembly.

    The initial total number of reads in the two files was 240 million (combined file size: 56 GB). After the cleaning step, the total number of reads was reduced to 81 million (21 GB).

    I'm trying to assemble this data with abyss-pe. The software works well on small paired-end files, but when I run it on my data, even with 70 GB of RAM the assembly doesn't finish and produces no results. I tested it with a small k-mer (30) and a large k-mer (64), and also raised the minimum coverage to 40. No result.

    Does anyone know a strategy or pipeline for assembling such a large amount of Illumina paired-end data?

    Thank you very much

  • #2
    You may want to consider a CD-HIT run to lower complexity by removing duplicate reads. I suggest getting access to a cluster and using the Celera Assembler; however, remember that a lot of your contigs will end up in the degenerates folder.



    • #3
      What kind of organism are you sequencing? This, of course, affects the strategy.



      • #4
        Thanks for your answer.
        Indeed, in each file reads are duplicated thousands of times, but we can't remove these duplicates because they aren't shared between the two files. For example, a read from the first file may have 1000 exact copies, but its corresponding mate does not have the same copies.
        I forgot to mention the kind of data: RNA-Seq (transcriptome assembly).
        I think the Celera Assembler isn't suitable here because it's meant for genome assembly.



        • #5
          It is a micro-alga.



          • #6
            Originally posted by hicham
            we can't remove these duplicates because they aren't shared between the two files. For example, a read from the first file may have 1000 exact copies, but its corresponding mate does not have the same copies.
            Be careful here. Most assemblers do not look at header information to establish pairs; rather, the first read in file A is paired with the first read in file B. If you remove any read, be sure to also remove its mate in the other file.
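
            Purely as an illustration (not something from this thread): a minimal Python sketch of filtering two position-synchronized FASTQ files so that a read and its mate are always kept or dropped together. The file names and the length cutoff are placeholders.

            def fastq_records(handle):
                # Yield one 4-line FASTQ record at a time as a list of lines.
                while True:
                    rec = [handle.readline().rstrip("\n") for _ in range(4)]
                    if not rec[0]:
                        return
                    yield rec

            def keep(rec):
                # Placeholder filter; substitute your real criterion (quality, etc.).
                header, seq, plus, qual = rec
                return len(seq) >= 50  # hypothetical minimum-length cutoff

            with open("reads_1.fq") as r1, open("reads_2.fq") as r2, \
                 open("kept_1.fq", "w") as o1, open("kept_2.fq", "w") as o2:
                for rec1, rec2 in zip(fastq_records(r1), fastq_records(r2)):
                    if keep(rec1) and keep(rec2):      # both mates must pass together
                        o1.write("\n".join(rec1) + "\n")
                        o2.write("\n".join(rec2) + "\n")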


            Originally posted by hicham
            I forgot to mention the kind of data: RNA-Seq (transcriptome assembly).
            Do not parse out the repeats. Overall expression levels are important to transcriptome assemblers. We currently use the Trinity package. SOAPtrans is pretty fast and memory-efficient, but I haven't had a chance to assess its correctness.



            • #7
              Right, after the cleaning step we removed reads without a mate and put them in a separate file, precisely to keep the reads in the two files in order.
              I do believe in the importance of expression levels for transcriptome assembly; the idea was to partially reduce the duplicates to shrink the immense amount of data, but that wasn't possible in this case of paired reads.
              If we use Trinity, how much RAM would we need to dedicate to this assembly?



              • #8
                Originally posted by hicham
                the idea was to partially reduce the duplicates to shrink the immense amount of data, but that wasn't possible in this case of paired reads.
                It is a good idea (computationally) to reduce the sequence data to only as much sampling as you really need. Which organism did you sequence?

                Originally posted by hicham
                If we use Trinity, how much RAM would we need to dedicate to this assembly?
                The authors say,
                Ideally, you will have access to a large-memory server, ideally having ~1G of RAM per 1M reads to be assembled (but often, much less memory may be required).
                I don't have any numbers to share for paired-end data, but recently we ran 160 M reads (1x100 bp) through Trinity with memory peaking at 18 GB.
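
                As a purely illustrative back-of-the-envelope check of those two data points (the rule of thumb is only a ceiling, and linearly scaling the single-end run is a naive assumption):

                reads_millions = 81                  # reads surviving cleaning
                ceiling_gb = reads_millions * 1.0    # rule of thumb: ~1 GB per 1 M reads
                scaled_gb = reads_millions * 18 / 160  # naive scaling of 160 M -> 18 GB
                print(f"ceiling ~{ceiling_gb:.0f} GB; scaled estimate ~{scaled_gb:.0f} GB")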



                • #9
                  Originally posted by hicham
                  Does anyone know a strategy or pipeline for assembling such a large amount of Illumina paired-end data?
                  You might like to try Gossamer. It was designed with memory efficiency in mind, so it can do the same job as other assemblers using smaller machines. (Or, alternatively, it can handle more data than other assemblers on the same machine.)

                  Full disclosure: I'm one of the developers.



                  • #10
                    I'd second the suggestion to try Trinity on that dataset. You could reduce your dataset with diginorm if necessary (see the toy sketch below), though 81 million reads (pairs?) sounds reasonable to tackle on a ~64 GB server; in general, memory consumption depends more on the transcriptome's complexity than on the actual number of reads.
                    What was wrong with the 159 million reads that you dropped: rRNA, adapters, or just bad quality?
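
                    A toy illustration of the diginorm idea: discard a read when the median abundance of its k-mers, counted over the reads kept so far, already exceeds a coverage cutoff. The real khmer implementation uses a memory-bounded sketch instead of an exact table, and K and CUTOFF below are placeholder values.

                    from collections import Counter
                    from statistics import median

                    K, CUTOFF = 20, 20   # k-mer size and coverage cutoff (placeholders)
                    counts = Counter()

                    def keep_read(seq):
                        # Keep a read only if its region is not yet sampled deeply enough.
                        kmers = [seq[i:i + K] for i in range(len(seq) - K + 1)]
                        if not kmers:
                            return False
                        if median(counts[k] for k in kmers) >= CUTOFF:
                            return False           # median abundance already at cutoff
                        counts.update(kmers)       # only kept reads contribute counts
                        return True

                    print(keep_read("ACGT" * 15))  # True: nothing has been counted yet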



                    • #11
                      Hi,
                      Thank you very much for your answers.
                      I just read about Gossamer. In the paper it is described as good for genomic data.
                      Can it also handle transcriptomic reads?



                      • #12
                        Hi arvid,
                        After the cleaning I get three files: two files for the pairs and one file containing reads without a mate; the total number of reads in the three files is 81 million.
                        For the cleaning we used SeqTrimNext, which removes adapters, contaminants, bad-quality reads, and low-complexity reads.



                        • #13
                          Originally posted by hicham
                          Hi arvid,
                          After the cleaning I get three files: two files for the pairs and one file containing reads without a mate; the total number of reads in the three files is 81 million.
                          For the cleaning we used SeqTrimNext, which removes adapters, contaminants, bad-quality reads, and low-complexity reads.
                          For Trinity, you'd want to combine the pairs into one file; it should be able to recognize the pairs on its own (this may have changed recently, though, as a paired-end mapping step was introduced that might need different input, so check the documentation and examples). A sketch of that combining step follows below. Otherwise I'd just use the standard parameters, except setting the k-mer method to "jellyfish" and specifying the maximum memory for Jellyfish and the number of CPUs to use. I wouldn't expect problems with 81 million reads on a server with 70+ GB of RAM (as indicated in your initial post), though expect the software to run overnight or even longer.
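
                          A minimal sketch of that combining step, assuming the two mate files are position-synchronized (file names are placeholders; verify against your Trinity version's documentation what input it actually expects):

                          def records(handle):
                              # Read one 4-line FASTQ record at a time.
                              while True:
                                  rec = [handle.readline() for _ in range(4)]
                                  if not rec[0]:
                                      return
                                  yield rec

                          with open("kept_1.fq") as r1, open("kept_2.fq") as r2, \
                               open("both.fq", "w") as out:
                              for rec1, rec2 in zip(records(r1), records(r2)):
                                  out.writelines(rec1)   # mate /1
                                  out.writelines(rec2)   # its mate /2 immediately after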



                          • #14
                            Hicham,

                            Originally posted by hicham
                            I just read about Gossamer. In the paper it is described as good for genomic data.
                            Can it also handle transcriptomic reads?
                            About as well as ABySS-PE, which is to say, nowhere near as well as an actual transcriptome assembler like Trans-ABySS, Trinity, or Oases.

                            The place where most genome assemblers do significantly worse than transcriptome assemblers is in pair threading and scaffolding, where it's useful to make the assumption that there is such a thing as "N times coverage". (This assumption is incorrect in RNA-Seq, because of differing expression levels.)

                            One thing that you could try is to use Gossamer as a pre-pass for Trinity. The input to Trinity is the output of a k-mer counter (Trinity's driver script uses Meryl by default). It would be fairly straightforward to use Gossamer as the k-mer counter by running its graph build and cleanup passes to bring the data down to a manageable size, then using dump-graph to report the k-mer counts. You'd need to do a little scripting to convert that into Meryl format; a rough sketch follows below.
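
                            For example, that "little scripting" could look roughly like this, assuming dump-graph emits one "KMER<TAB>COUNT" pair per line (check the actual output format) and that the target is Meryl's text-dump style of a ">count" line followed by the k-mer; all file names are placeholders:

                            with open("gossamer_kmers.txt") as src, \
                                 open("meryl_style_kmers.txt", "w") as dst:
                                for line in src:
                                    kmer, count = line.split()        # assumed two-column rows
                                    dst.write(f">{count}\n{kmer}\n")  # Meryl-like text dump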

                            Having said that... we are actively working on the problem of resource-efficient transcriptome assembly. Nothing to announce yet, but watch this space.



                            • #15
                              Trinity has 'jellyfish' as a k-mer counter. It is likely that in the next release Jellyfish will become the default and that Meryl will be removed, since Jellyfish is so much faster. So, if you are using Trinity, make sure you specify jellyfish.

