  • De novo assembly with long reads and a big genome

    Hi all!

    First, I have to say that I am a newbie in genome assembly, so please be kind.
    I am working on a project with Illumina Long Read data (formerly Moleculo) and an expected genome size of 1.2 Gb. I have read around a bit, but there seems to be no "easy solution" for this: all the recommendations/assemblers I found are aimed at smaller, bacterial genomes. I have made some trials with MIRA, but I think it would take months to finish on my current infrastructure. I also have paired-end HiSeq data, and I thought it might be possible to assemble these first with ABySS or SOAPdenovo and then use the long reads in some way to fill gaps or something like that, but I have no clue how to do it.

    I am open to any suggestions or recommendations!

    Thanks in advance!

  • #2
    You either need a computer with lots of memory (at least 128GB, and possibly closer to 512GB), or a huge cluster of computers with a large amount of combined memory (if you're using Ray). If you don't have that, don't bother with a de-novo assembly of a 1.2Gbp genome -- it's not going to work.

    There are a few programs that can do hybrid assemblies, which is what you would want with this. You mentioned MIRA, but another example is SPAdes, which allows you to make small contigs with short-read data (i.e. your paired-end data), use that to correct the long read data, then use the long read data to scaffold the short read contigs.
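
    As a rough sketch of what such a hybrid run looks like (file names are placeholders, and I'm not sure which long-read option is the right one for Moleculo reads, so check the SPAdes manual first):

    Code:
    spades.py \
      -1 hiseq_R1.fastq -2 hiseq_R2.fastq \
      --pacbio moleculo_reads.fastq \
      -o spades_hybrid -t 16 -m 500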



    • #3
      Besides what gringer says, other options include commercial software such as DNASTAR or the CLC suite, which may do the job with less RAM. At a cost, of course...
      So it may be worth writing down a pros-and-cons list covering things such as buying software, buying hardware, paying for an external service, or finding a collaboration with a bioinformatics group.

      Software-wise, apart from MIRA and SPAdes, you can also check Velvet and Trinity.

      HTH

      Dave



      • #4
        Trinity is just for transcriptome data. That said, if you do want to assemble transcriptomes, it's pretty good at keeping resource use low.



        • #5
          ABySS should be able to do a de-novo assembly with the paired-end reads and then use the Long Reads as scaffolding information. It also might be possible to toss in the Long Reads as single-end reads during the assembly step. Not having used the Long Reads before (although I would really like to get my hands on some), I am not sure; there is a rough sketch of the scaffolding invocation below.

          I do know that MIRA will choke on a 1.2 Gb genome. Don't even try it.
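
          For what it's worth, the scaffolding run would look roughly like this with abyss-pe (library and file names are made up; check the ABySS documentation for your version):

          Code:
          abyss-pe k=64 j=16 name=asm \
            lib='pe1' pe1='hiseq_R1.fastq hiseq_R2.fastq' \
            long='moleculo' moleculo='moleculo_reads.fa'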



          • #6
            Use Celera: http://wgs-assembler.sourceforge.net/
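
            Very roughly (only a sketch; the exact fastqToCA options depend on your CA version, so check its usage output), the workflow is to convert the reads to a .frg file and then run runCA:

            Code:
            # pick the -technology value that matches Moleculo reads in your CA version
            fastqToCA -libraryname moleculo -technology illumina-long \
              -reads moleculo_reads.fastq > moleculo.frg
            runCA -d ca_run -p asm moleculo.frg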



            • #7
              First, thanks to all for the answers. Great forum!

              The computer infrastructure should not be a problem, as we have access to a machine with 1.5 TB of RAM and around 40 CPU cores/threads.

              I will try the two approaches that westerman suggested: I will give ABySS a try using the long reads as single-end input, and also use them as scaffolding information after an assembly with the paired-end reads. I will let you know how it goes.

              About the other assemblers... do you think SPAdes will be able to handle a genome of this size?

              Any recommendations for Celera?

              Thanks!



              • #8
                How's the assembly?

                I'm also a newbie with Moleculo data, and I'm now working with that kind of data and the Celera assembler.

                The Celera assembler is taking a lot of time.

                So I wonder whether ABySS is good for assembling Moleculo reads.


                Originally posted by senna View Post
                First, thanks to all for the answers. Great forum!

                The computer infrastructure should not be a problem, as we have access to a machine with 1.5 TB of RAM and around 40 CPU cores/threads.

                I will try the two approaches that westerman suggested: I will give ABySS a try using the long reads as single-end input, and also use them as scaffolding information after an assembly with the paired-end reads. I will let you know how it goes.

                About the other assemblers... do you think SPAdes will be able to handle a genome of this size?

                Any recommendations for Celera?

                Thanks!



                • #9
                  Originally posted by senna View Post
                  First, thanks to all for the answers. Great forum!

                  The computer infrastructure should not be a problem, as we have access to a machine with 1.5 TB of RAM and around 40 CPU cores/threads.

                  About the other assemblers... do you think SPAdes will be able to handle a genome of this size?

                  Thanks!
                  So, SPAdes is a GREAT assembler and can make use of long reads during assembly, but it is NOT currently designed for Gb-sized genomes. I currently have a genome assembly of similar size running on a node with 768 GB of RAM and 32 fast cores. The assembly consists of three sets of PE data and one set of MP data, totaling around 200 GB of FASTQ files. It is currently using 516 GB of memory and all 32 cores, and has been running for over 1000 hours (about 42 days). Additionally, BayesHammer (the read error correction tool) currently has a limit of 2^32 k-mers in the input, so I am running with the "--only-assembler" option.

                  That said, again... if you are doing something in the 100s-of-Mb range, DEFINITELY give SPAdes a try. Plus, the latest version (3.5.0) also has support for nanopore long reads.
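
                  For reference, the kind of invocation I mean (error correction skipped, nanopore reads added) looks roughly like this; file names and resource limits are placeholders:

                  Code:
                  spades.py --only-assembler \
                    --pe1-1 pe_R1.fastq --pe1-2 pe_R2.fastq \
                    --mp1-1 mp_R1.fastq --mp1-2 mp_R2.fastq \
                    --nanopore nanopore_reads.fastq \
                    -o spades_out -t 32 -m 700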



                  • #10
                    Hi there!

                    In the end, the most successful approach for us was to use ABySS with the PE short reads and then use PBJelly (http://sourceforge.net/projects/pb-jelly/) to fill gaps in the scaffolds with the Moleculo reads.

                    The trials with Celera were not good due to our low coverage of Moleculo reads, and the ABySS runs that included the long reads ended in errors during the assembly.
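
                    In case it helps others: as far as I understand it, PBJelly is driven by a Protocol.xml that lists the scaffold FASTA and the Moleculo read files, and is then run stage by stage, roughly like this (check the PBSuite documentation for the exact setup):

                    Code:
                    Jelly.py setup Protocol.xml
                    Jelly.py mapping Protocol.xml
                    Jelly.py support Protocol.xml
                    Jelly.py extraction Protocol.xml
                    Jelly.py assembly Protocol.xml
                    Jelly.py output Protocol.xml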



                    • #11
                      Thank you for the information.

                      I've been trying an assembly with the Celera assembler, but it's not going well.

                      Originally posted by senna View Post
                      Hi there!

                      In the end, the most successful approach for us was to use ABySS with the PE short reads and then use PBJelly (http://sourceforge.net/projects/pb-jelly/) to fill gaps in the scaffolds with the Moleculo reads.

                      The trials with Celera were not good due to our low coverage of Moleculo reads, and the ABySS runs that included the long reads ended in errors during the assembly.



                      • #12
                        Throw in some ONT MinION data and use LINKS:

                        http://biorxiv.org/content/early/2015/03/13/016519
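
                        Roughly (file names are placeholders; see the LINKS readme for the real parameters), LINKS takes the assembly plus a file-of-filenames pointing at the long reads:

                        Code:
                        echo minion_reads.fa > longreads.fof
                        LINKS -f assembly_scaffolds.fa -s longreads.fof -b links_out -d 4000 -k 15 -t 2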

