  • How to assemble Solexa data sets with two different read lengths?

    I have two Solexa data sets with read lengths of 35 and 75 bp, respectively. Their insert sizes also differ. How should I assemble them?

  • #2
    If you use CLC genome workbench, the software can manage this problem. But you should specify the insert length to prevent incorrect alignment.

    • #3
      Maybe there is a free or open-source assembler suited to this task. I tried AllPaths, but it ended with a fatal error. I would like to know whether any other tool can do the same job!

      Originally posted by Chien-Yuan Chen:
      If you use CLC genome workbench, the software can manage this problem. But you should specify the insert length to prevent incorrect alignment.

      • #4
        Have you tried Maq map merge?

        http://maq.sourceforge.net/maq-man.shtml#mapmerge

        I am guessing you could make a map for the 35 and 75bp reads separately, then merge them. Or maybe try samtools merge? Align with BWA or other favorite aligner, then merge the sam/bam files?

        http://samtools.sourceforge.net/samtools.shtml
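
For the merge step, a minimal Python sketch could build the `samtools merge` call like this. The file names are hypothetical, each input BAM is assumed to be sorted already, and the helper `samtools_merge_command` is only an illustration, not part of samtools. It applies to mapping against a reference, not to de novo assembly.

```python
import shlex


def samtools_merge_command(output_bam, input_bams):
    """Build the argument list for `samtools merge <out.bam> <in1.bam> <in2.bam> ...`."""
    if len(input_bams) < 2:
        raise ValueError("need at least two BAM files to merge")
    return ["samtools", "merge", output_bam, *input_bams]


# Hypothetical file names: one sorted BAM per read length.
cmd = samtools_merge_command(
    "merged.bam", ["reads35.sorted.bam", "reads75.sorted.bam"]
)
print(shlex.join(cmd))
# To actually run it (requires samtools on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```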

        • #5
          I am trying to assemble de novo. I think I would like to assemble them separately with Velvet or Edena, then assemble the contigs with CAP3 or Phrap?
          Originally posted by caddymob:
          Have you tried Maq map merge?

          http://maq.sourceforge.net/maq-man.shtml#mapmerge

          I am guessing you could make a map for the 35 and 75bp reads separately, then merge them. Or maybe try samtools merge? Align with BWA or other favorite aligner, then merge the sam/bam files?

          http://samtools.sourceforge.net/samtools.shtml

          • #6
            In an ideal world you'd have an assembler that just understands short-read data, mixed libraries with varying insert sizes, etc., and just gives you the optimal answer. Some of the tools make a fair stab at this (e.g. velvet), but the system resources required can be HUGE.

            Therefore a more pragmatic approach used by many is to start with some sort of basic "read extension", where you lose track of the individual fragments but build up contig consensus sequences by following overlapping k-mers as long as there are no branch points, much like ssake, fuzzypaths, etc.

            From here you can then either take these contigs as-is or throw them into another assembly tool more appropriate for longer sequences to attempt to resolve further.

            Finally, map your individual reads (both 75 and 35) back to your consensus sequences again to get a true assembly rather than just consensus sequences.

            You could even iterate: find reads that uniquely overlap contig ends, use them to edit and extend the "reference", and remap the reads that failed to map previously. This technique also works in more "usual" cases where the reference doesn't precisely match the organism you're mapping against it. Not pretty though.
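
As a rough illustration of the "read extension" idea above, here is a toy Python sketch. It is not how ssake or any other named tool is implemented. It ignores reverse complements, sequencing errors and coverage, and the function names and example reads are made up for the illustration. It walks from a seed (k-1)-mer through overlapping k-mers and stops at the first branch point or dead end.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_graph(reads, k):
    """Return successor and predecessor maps between (k-1)-mers seen in the reads."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            left, right = kmer[:-1], kmer[1:]
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred


def extend_right(seed, succ, pred):
    """Extend a seed (k-1)-mer to the right until a branch point or dead end."""
    contig = seed
    node = seed
    seen = {seed}
    while len(succ[node]) == 1:
        (nxt,) = succ[node]
        # Stop where two paths merge, or where we would loop back on ourselves.
        if len(pred[nxt]) != 1 or nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Made-up overlapping reads from a 10 bp toy sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
succ, pred = build_graph(reads, k=4)
print(extend_right("ATG", succ, pred))  # ATGGCGTGCA
```

If a read such as "GGCGAA" is added, the node "GCG" gets two successors and the extension stops at "ATGGCG". That is the point at which a real assembler would need pairing information or longer reads to resolve the branch.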

            • #7
              Originally posted by anyone1985:
              I have two Solexa data sets with read lengths of 35 and 75 bp, respectively. Their insert sizes also differ. How should I assemble them?
              You could play guinea pig and try MIRA (2.9.45): in theory, it should work. You can give the assembler all the necessary ancillary information (sequencing technology, insert size, quality clips, etc.) on a per-read basis using an XML file in the TRACEINFO format standardized by the NCBI.

              MIRA will know how to treat Solexa data and handle many things almost automatically (like clipping) and even know of sequencing technology dependent errors (like the "GGC" problem in Solexa data).

              However, I would try this only for organisms of bacterial size and on a machine with lots and lots of memory.

              And you might want to try assembling the 75mers first: if you have an average coverage of >= 30x with the 75mers and the insert size of the 75mer library is larger than that of the 36mer library, the 36mers probably won't improve the assembly.

              PS: Disclaimer: I wrote MIRA and might not be objective
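
The rule of thumb above (75mers alone at >= 30x, with the larger insert size) can be checked with a few lines of Python. This is only a sketch: the read count, genome size and insert sizes in the example are hypothetical, and both function names are made up here.

```python
def average_coverage(n_reads, read_length, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def long_reads_probably_suffice(cov_long, ins_long, ins_short, min_cov=30.0):
    """True if the long-read library alone meets the >= 30x / larger-insert rule of thumb."""
    return cov_long >= min_cov and ins_long > ins_short


# Hypothetical example: 2 million 75mers on a 5 Mbp bacterial genome.
cov_75 = average_coverage(2_000_000, 75, 5_000_000)
print(cov_75)  # 30.0
print(long_reads_probably_suffice(cov_75, ins_long=500, ins_short=200))  # True
```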

              • #8
                I'd have to say that velvet is still your best bet for de novo assembly. It can accept different read lengths with no problem, and you can feed it 2 different sets of paired reads, with 2 different insert sizes, "out of the box". However, you can also make a trivial change to the source code and recompile so that it accepts more than 2 sets of insert lengths.

                Also note that when you tell velvet the insert length (" -ins_length 280 "), you need to use the entire length of the fragment, so in this case if you told it 280, that would correspond to two 40bp reads with a 200bp "insert".

                Consult the velvet-users list for details on these two issues.
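
To make the insert length arithmetic concrete, here is a small Python sketch. `-ins_length` is the option quoted above. The `-ins_length2` option for a second paired library is how I recall the Velvet manual naming it, so treat it as an assumption and check `velvetg`'s own help. The directory name and the second library's numbers are hypothetical.

```python
def fragment_length(read_length, gap):
    """Full fragment length: both reads plus the unsequenced gap between them."""
    return 2 * read_length + gap


def velvetg_args(directory, ins_lengths):
    """Build a velvetg argument list with one insert length per paired library.

    Assumes the options are named -ins_length, -ins_length2, ... (verify against
    your Velvet version before use).
    """
    args = ["velvetg", directory]
    for i, length in enumerate(ins_lengths, start=1):
        flag = "-ins_length" if i == 1 else f"-ins_length{i}"
        args += [flag, str(length)]
    return args


# Two 40 bp reads around a 200 bp gap, as in the post above.
lib1 = fragment_length(40, 200)
# Hypothetical second library: two 75 bp reads around a 400 bp gap.
lib2 = fragment_length(75, 400)
print(lib1, lib2)  # 280 550
print(" ".join(velvetg_args("assembly_dir", [lib1, lib2])))
```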

                • #9
                  oh, and note that I'm not countering BaCh's suggestion! I've been wanting to try MIRA for a while, and velvet won't incorporate 454 reads well, like MIRA can ...

                  • #10
                    Are there any de novo assembly tools that can assemble reads iteratively instead of eating up a whole lot of RAM?

                    My limit is less than 60 GB of RAM for a 1 Gbp+ genome, to be assembled de novo from 20x Solexa coverage.
                    --
                    bioinfosm

                    • #11
                      Thank you for the suggestion, jnfass. After reading the Velvet manual, I also found that it can handle libraries with different insert lengths.

                      • #12
                        Originally posted by bioinfosm:
                        Are there any de novo assembly tools that can assemble reads iteratively instead of eating up a whole lot of RAM?

                        My limit is less than 60 GB of RAM for a 1 Gbp+ genome, to be assembled de novo from 20x Solexa coverage.
                        Uh ... I missed that post. No, no program I know of.

                        But just to be sure I understood you right: you have ~550 million 36mers that you want to assemble de-novo? That's (in terms of reads) almost 15-20 times more reads than the Human Genome Project or Celera had ... and they had *large* computing clusters to tackle the problems.

                        Even memory optimised programs with very simple assembly logic would need to keep lots of data in memory to be even decently efficient ... and you would still be in for *a lot* of disk reads/writes which would probably mean it'd literally take ages to get the thing assembled.

                        Correct me if I'm wrong or if you found some program which performs such a wonder ... but I don't think this is possible with 60 GB of RAM.

                        Regards,
                        B.
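
The ~550 million figure above follows from genome size times coverage divided by read length. A quick Python check, using only the numbers from the thread (1 Gbp genome, 20x coverage, 36 bp reads); the function name is made up:

```python
def reads_needed(genome_size, coverage, read_length):
    """Number of reads needed to reach a given average coverage (rounded up)."""
    total_bases = genome_size * coverage
    return (total_bases + read_length - 1) // read_length


n = reads_needed(1_000_000_000, 20, 36)
print(f"{n:,} reads")  # 555,555,556 reads
```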

                        • #13
                          Well, parallel algorithms like ABySS could possibly work if you have enough machines in a cluster. It's far cheaper and easier to get lots of small machines than a few truly humongous ones. However, I've no idea what the upper limit is on an ABySS assembly.

                          However, the iterative approach sounds more sensible. I'm not sure of any official programs that do a decent job of this yet, although lots of people have manually done similar things by successive rounds of mapping to close genomes, shredding of close genomic data, etc.

                          James

                          • #14
                            I am new to this as well and I am trying to set up an RNA-Seq pipeline for my lab. I've run into an issue though. I'm confused about why one would run Velvet and then run Phrap on the resultant contigs. Why not just go to Phrap straight away? Any help would be appreciated.

                            Cheers,
                            Addison

                            • #15
                              Originally posted by cloughlab:
                              I am new to this as well and I am trying to set up an RNA-Seq pipeline for my lab. I've run into an issue though. I'm confused about why one would run Velvet and then run Phrap on the resultant contigs. Why not just go to Phrap straight away? Any help would be appreciated.

                              Cheers,
                              Addison
                              Phrap is slow and not optimized for large NGS datasets.
