
  • #46
    So, I'll attack the mapping task first. The contigs.fa file listed as ref= below was generated in Velvet using a reverse-complemented, quality-trimmed version of the LMP data. Will that invalidate the mapping overall? I could generate a draft assembly without much trouble, either omitting the LMP data altogether or attempting to use it "raw".

    bbmap.sh -Xmx40g in1=Rattle_Snake__RS_212M__L3_GGCTAC_L003_R1_001.fastq in2=Rattle_Snake__RS_212M__L3_GGCTAC_L003_R2_001.fastq ref=contigs.fa nodisk out=null ihist=ihist_MP.txt mhist=mhist_MP.txt qhist=qhist_MP.txt bhist=bhist_MP.txt

    Any other flags needed/desired here?



    • #47
      I recommend "rcs=f" (requirecorrectstrand=false) for long mate pair libraries; otherwise pairs that don't map in the normal fragment orientation will be considered improper pairs.

      Also, I think you'll probably need to generate a sam file and feed it to something to determine what percentage of pairs map in each possible orientation; BBMap does not output that, unfortunately. And I don't have a program that does it, either, but it's important to find out for LMP libraries.
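A minimal sketch of such a tally, assuming the mapping was rerun with SAM output (for example `out=mapped.sam` in place of `out=null`; the file name is hypothetical). It reads only the standard SAM FLAG, RNAME, POS, RNEXT and PNEXT columns of read-1 records and sorts each pair into "FR" (innie), "RF" (outie), "tandem" (same strand) or "different_contig". Orientation is judged from leftmost mapping positions only, so heavily overlapping pairs may be misclassified; treat the numbers as a rough guide rather than a definitive measurement.

```python
from collections import Counter

# SAM FLAG bits used below (from the SAM specification)
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
REVERSE, MATE_REVERSE, FIRST = 0x10, 0x20, 0x40
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def usable_read1(flag):
    """True for primary read-1 records whose mate is also mapped."""
    if not (flag & PAIRED and flag & FIRST):
        return False
    return not flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY)


def classify_pair(flag, pos, pnext):
    """Classify a same-contig pair as 'FR', 'RF' or 'tandem'."""
    read_rev = bool(flag & REVERSE)
    mate_rev = bool(flag & MATE_REVERSE)
    if read_rev == mate_rev:
        return "tandem"
    # Leftmost position of the forward-strand read vs the reverse-strand read.
    fwd_pos, rev_pos = (pnext, pos) if read_rev else (pos, pnext)
    return "FR" if fwd_pos <= rev_pos else "RF"


def count_orientations(sam_lines):
    """Tally pair orientations from an iterable of SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if not usable_read1(flag):
            continue
        if fields[6] not in ("=", fields[2]):
            counts["different_contig"] += 1
            continue
        counts[classify_pair(flag, int(fields[3]), int(fields[7]))] += 1
    return counts


def percentages(counts):
    """Convert a Counter of categories into percentages of all counted pairs."""
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()} if total else {}
```

Usage would be along the lines of `with open("mapped.sam") as fh: print(percentages(count_orientations(fh)))`. With a fragmented draft assembly as the reference, a large "different_contig" share is to be expected for a 6 kbp library, so the FR/RF ratio among same-contig pairs is probably the more informative figure.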



      • #48
        No, I have not "given up"...far from it :-P

        Other duties including conferences, office relocation, etc have dipped into my rattlesnake time.

        Update: We are having some additional LMP / jumping libraries created by our sequence supplier...we already have some at 6kbp, these will be at 8kbp I understand. With this volume of Illumina data (both PE's and LMP's) I plan to at least take a crack at using ALLPATHS-LG as it seems to really like these types of projects if you give it enough data of both sorts.

        As per my last chat with Brian B about direction sense of LMP data, I have requested that the sequencer lab provide VERY specific data about preparation and processing of the samples and data. I don't have an ETA yet for the new libraries, but they're "in the queue" at the lab :-D



        • #49
          So, just received my additional Illumina LMP data. Total aggregate for sequencer data is now 288GB.

          /Snake_Orig> ls -lh
          total 288G

          HiSeq LMP Reads:
          -rw-rw-r-- 1 pummill mc3l8ep 21G 2014-07-21 14:02 Rattle_Snake__RS_212M__L3_GGCTAC_L003_R1_001.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 21G 2014-07-21 14:10 Rattle_Snake__RS_212M__L3_GGCTAC_L003_R2_001.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 40G 2014-07-21 14:01 C_horridus_a_5608_ACTTGA_L005_R1_001.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 40G 2014-07-21 14:07 C_horridus_a_5608_ACTTGA_L005_R2_001.fastq

          HiSeq PE Reads:
          -rw-rw-r-- 1 pummill mc3l8ep 19G 2014-07-21 14:10 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 19G 2014-07-21 14:13 3_Snake_4117_TSDR27_ATTCCT_L003_R2_001.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 27G 2014-07-21 14:18 s_1_1_sequence.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 27G 2014-07-21 14:28 s_1_2_sequence.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 24G 2014-07-21 14:02 s_2_1_sequence.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 24G 2014-07-21 14:13 s_2_2_sequence.fastq

          MiSeq Reads:
          -rw-rw-r-- 1 pummill mc3l8ep 16G 2014-07-21 14:10 MIKE_S1_L001_R1_001.fastq
          -rw-rw-r-- 1 pummill mc3l8ep 16G 2014-07-21 14:22 MIKE_S1_L001_R2_001.fastq

          Have transferred all of the data to the large memory machine (SGI) and am looking thru the ALLPATHS-LG information in an attempt to "get it right" the first time around. It'll be interesting to see if we get a substantially better assembly using an extra 80G of LMP reads plus the ALLPATHS assembler.



          • #50
            Sorry for the prolonged delay for anyone following this thread. Finally got the time to eval the files to determine if everything was quality encoded the same way...and they're not, which I believe means that I'll have to be a bit more careful when I run ALLPATHS. 4 sets of the files are designated as Sanger / Illumina 1.9 by FastQC while the two remaining sets are shown to be Illumina 1.5.

            Illumina 1.5 = phred 64?
            Illumina 1.9 = phred 33?
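For reference, FastQC's "Illumina 1.5" label corresponds to Phred+64 quality encoding and "Sanger / Illumina 1.9" to Phred+33. The sketch below shows one common way to double-check this directly from a FASTQ file by looking at the range of ASCII codes in the quality lines. It assumes plain four-line FASTQ records; the thresholds (59 and 74) are a heuristic rather than a guarantee, and the function names are made up for illustration.

```python
from itertools import islice


def quality_lines(fastq_handle, max_reads=100000):
    """Yield the quality string of the first max_reads four-line FASTQ records."""
    for i, line in enumerate(islice(fastq_handle, 4 * max_reads)):
        if i % 4 == 3:  # 4th line of each record holds the qualities
            yield line.rstrip("\n")


def guess_phred_offset(qualities):
    """Return 33, 64, or None (ambiguous) from an iterable of quality strings."""
    lo, hi = None, None
    for q in qualities:
        if not q:
            continue
        codes = q.encode("ascii")
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    if lo is None:
        raise ValueError("no quality strings seen")
    if lo < 59:
        # Characters below ';' cannot occur in the +64 encodings.
        return 33
    if hi > 74:
        # Characters above 'J' are unusual for Phred+33 Illumina data.
        return 64
    return None  # range fits both encodings; inspect more reads
```

Usage would be something like `with open("s_1_1_sequence.fastq") as fh: print(guess_phred_offset(quality_lines(fh)))`, repeated for one file from each of the six sets so that any Phred+64 files can be flagged or converted before they are mixed with the others.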



            • #51
              For finding the optimal k-mer size, I would suggest running SGA preqc; it gives better k-mer suggestions than KmerGenie. Also, I would like you to try assembling with SGA and Minia, if you have time.



              • #52
                Originally posted by bioman1
                For finding the optimal k-mer size, I would suggest running SGA preqc; it gives better k-mer suggestions than KmerGenie. Also, I would like you to try assembling with SGA and Minia, if you have time.
                I do hope, as I have time (falling behind again at the moment), to try a number of assemblers "just to see". That being said, it doesn't look like SGA or Minia explicitly use LMP or "jumping" libraries, so I expect one would have to either modify those reads or take a quality hit, possibly by merging them into single reads or using them as PEs. Actually, a quick look at Minia seemed to indicate that it only uses single reads...or at least, you just list files one after the other in the input file without specifying what type of reads each one contains?

                With the amount of data I have (see attached summary), I'm sure I will have to be pretty careful about specifying details to the assembler in order to achieve good results (and a good many other exercises, I expect!).
                Attached Files



                • #53
                  Just wanted to mention here that for assembling large diploid genomes there's also Meraculous2. It gives you feedback about the polymorphic composition right after counting the k-mers so you can pause and decide whether or not the data you have is sufficient or needs more coverage, normalization, cleanup, etc.

                  I am one of the authors so I can give you a more in-depth run-down if you're interested.



                  • #54
                    Originally posted by GeneGolts
                    Just wanted to mention here that for assembling large diploid genomes there's also Meraculous2. It gives you feedback about the polymorphic composition right after counting the k-mers so you can pause and decide whether or not the data you have is sufficient or needs more coverage, normalization, cleanup, etc.

                    I am one of the authors so I can give you a more in-depth run-down if you're interested.
                    You sure about that?

                    Is this the assembler?

                    README

                    This Docker container is part of the nucleotide.es genome assembler comparison project (currently internal to Joint Genome Institute). It installs and executes the Meraculous2 genome assembler on a specified Illumina shotgun dataset that must meet the following characteristics:

                    genome size: appx. 0.5-1 mb
                    read length: 100-200 bp
                    library type: paired-end, 300 bp

                    The results is the file final.scaffolds.fa containing all scaffolds over 1kb in size.

                    For more questions please contact:

                    Michael Barton [email protected]
                    Eugene Goltsman [email protected]



                    • #55
                      Hmmm, that appears to be incorrect; I know Meraculous is intended for (and has been run on) much larger organisms, including human.



                      • #56
                        That docker container is a specific application, i.e., it's configured for a side by side comparison vs other assemblers using a bunch of microbial datasets. As Brian said, Meraculous is in fact optimized for large genomes. You can get it here: http://sourceforge.net/projects/meraculous20/



                        • #57
                          Thanks for the recent posts and ideas, all! Currently, I am experimenting with SSPACE to improve scaffolding of some of my existing assemblies. Not a lot of progress yet, but I am optimistic so far.

                          I do hope to get back and run the data thru another assembler or two just for comparison and completeness. I'm actually running it in SPAdes right now even though it is relatively untested with large genomes. Had to run with --only-assembler as one of the tools in the normal pipeline (BayesHammer) is 32-bit only and was giving "size greater than 2^32 -1" errors. The assembly is making steady progress on a 32-core, 768GB node. It has been running 247 hours and is consuming 512GB of memory so far...
                          Last edited by jpummil; 12-10-2014, 07:49 AM.



                          • #58
                            Originally posted by GeneGolts
                            That docker container is a specific application, i.e., it's configured for a side by side comparison vs other assemblers using a bunch of microbial datasets. As Brian said, Meraculous is in fact optimized for large genomes. You can get it here: http://sourceforge.net/projects/meraculous20/
                            Thanks for clarifying that. I am having real trouble installing the software:

                            Code:
                            @BioPower3-IBM ~/programs/meraculous-2.0.4 $ export Boost_INCLUDE_DIRS=/home/adrian/programs/boost_1_55_0/boost/
                            -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 15:36:40
                            @BioPower3-IBM ~/programs/meraculous-2.0.4 $ ./install.sh install_dir
                            -- Testing the environment..
                            -- Perl found: /home/adrian/perl5/perlbrew/perls/5.18.2t/bin/perl
                            CMake Warning at CMakeLists.txt:65 (message):
                              Gnuplot was not found!
                            
                            
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:481 ] _boost_TEST_VERSIONS = 1.56.0;1.56;1.55.0;1.55;1.54.0;1.54;1.53.0;1.53;1.52.0;1.52;1.51.0;1.51;1.50.0;1.50
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:483 ] Boost_USE_MULTITHREADED = OFF
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:485 ] Boost_USE_STATIC_LIBS = ON
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:487 ] Boost_USE_STATIC_RUNTIME = ON
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:489 ] Boost_ADDITIONAL_VERSIONS = 
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:491 ] Boost_NO_SYSTEM_PATHS = 
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:543 ] Declared as CMake or Environmental Variables:
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:545 ]   BOOST_ROOT = 
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:547 ]   BOOST_INCLUDEDIR = 
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:549 ]   BOOST_LIBRARYDIR = 
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:551 ] _boost_TEST_VERSIONS = 1.56.0;1.56;1.55.0;1.55;1.54.0;1.54;1.53.0;1.53;1.52.0;1.52;1.51.0;1.51;1.50.0;1.50
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:620 ] Include debugging info:
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:622 ]   _boost_INCLUDE_SEARCH_DIRS = /home/adrian/programs/boost_1_55_0/boost;/home/adrian/programs/boost_1_55_0/boost/include;/home/adrian/programs/boost_1_55_0/boost;PATHS;C:/boost/include;C:/boost;/sw/local/include
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:624 ]   _boost_PATH_SUFFIXES = boost-1_56_0;boost_1_56_0;boost/boost-1_56_0;boost/boost_1_56_0;boost-1_56;boost_1_56;boost/boost-1_56;boost/boost_1_56;boost-1_55_0;boost_1_55_0;boost/boost-1_55_0;boost/boost_1_55_0;boost-1_55;boost_1_55;boost/boost-1_55;boost/boost_1_55;boost-1_54_0;boost_1_54_0;boost/boost-1_54_0;boost/boost_1_54_0;boost-1_54;boost_1_54;boost/boost-1_54;boost/boost_1_54;boost-1_53_0;boost_1_53_0;boost/boost-1_53_0;boost/boost_1_53_0;boost-1_53;boost_1_53;boost/boost-1_53;boost/boost_1_53;boost-1_52_0;boost_1_52_0;boost/boost-1_52_0;boost/boost_1_52_0;boost-1_52;boost_1_52;boost/boost-1_52;boost/boost_1_52;boost-1_51_0;boost_1_51_0;boost/boost-1_51_0;boost/boost_1_51_0;boost-1_51;boost_1_51;boost/boost-1_51;boost/boost_1_51;boost-1_50_0;boost_1_50_0;boost/boost-1_50_0;boost/boost_1_50_0;boost-1_50;boost_1_50;boost/boost-1_50;boost/boost_1_50
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:744 ] guessed _boost_COMPILER = -gcc
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:754 ] _boost_MULTITHREADED = 
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:797 ] _boost_RELEASE_ABI_TAG = -s
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:799 ] _boost_DEBUG_ABI_TAG = -sd
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:847 ] _boost_LIBRARY_SEARCH_DIRS = /home/adrian/programs/boost_1_55_0/boost/lib;/home/adrian/programs/boost_1_55_0/boost/stage/lib;Boost_INCLUDE_DIR-NOTFOUND/lib;Boost_INCLUDE_DIR-NOTFOUND/../lib;Boost_INCLUDE_DIR-NOTFOUND/stage/lib;PATHS;C:/boost/lib;C:/boost;/sw/local/lib
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:957 ] Searching for THREAD_LIBRARY_RELEASE: boost_thread-gcc-s-;boost_thread-gcc-s;boost_thread-s-;boost_thread-s;boost_thread
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:993 ] Searching for THREAD_LIBRARY_DEBUG: boost_thread-gcc-sd-;boost_thread-gcc-sd;boost_thread-sd-;boost_thread-sd;boost_thread;boost_thread
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:957 ] Searching for SYSTEM_LIBRARY_RELEASE: boost_system-gcc-s-;boost_system-gcc-s;boost_system-s-;boost_system-s;boost_system
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:993 ] Searching for SYSTEM_LIBRARY_DEBUG: boost_system-gcc-sd-;boost_system-gcc-sd;boost_system-sd-;boost_system-sd;boost_system;boost_system
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:957 ] Searching for FILESYSTEM_LIBRARY_RELEASE: boost_filesystem-gcc-s-;boost_filesystem-gcc-s;boost_filesystem-s-;boost_filesystem-s;boost_filesystem
                            -- [ /usr/share/cmake/Modules/FindBoost.cmake:993 ] Searching for FILESYSTEM_LIBRARY_DEBUG: boost_filesystem-gcc-sd-;boost_filesystem-gcc-sd;boost_filesystem-sd-;boost_filesystem-sd;boost_filesystem;boost_filesystem
                            CMake Error at /usr/share/cmake/Modules/FindBoost.cmake:1138 (message):
                              Unable to find the requested Boost libraries.
                            
                              Unable to find the Boost header files.  Please set BOOST_ROOT to the root
                              directory containing Boost or BOOST_INCLUDEDIR to the directory containing
                              Boost's headers.
                            Call Stack (most recent call first):
                              src/c/CMakeLists.txt:44 (find_package)
                            
                            
                            -- Boost libs: 
                            -- Boost include paths: Boost_INCLUDE_DIR-NOTFOUND
                            CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
                            Please set them or make sure they are set and tested correctly in the CMake files:
                            Boost_INCLUDE_DIR (ADVANCED)
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                               used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
                            
                            -- Configuring incomplete, errors occurred!
                            See also "/home/adrian/programs/meraculous-2.0.4/build/CMakeFiles/CMakeOutput.log".
                            See also "/home/adrian/programs/meraculous-2.0.4/build/CMakeFiles/CMakeError.log".
                            make: *** No targets specified and no makefile found.  Stop.
                            make: *** No rule to make target `install'.  Stop.
                            I have also tried:
                            Code:
                            export BOOST_ROOT=/home/adrian/programs/boost_1_55_0/boost/
                            export BOOST_INCLUDEDIR=/home/adrian/programs/boost_1_55_0/boost/
                            and some other combinations.
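One plausible cause, judging from the `_boost_INCLUDE_SEARCH_DIRS` line in the log, is that the exported paths end in `boost/`. CMake's FindBoost looks for the `boost/` header directory underneath `BOOST_ROOT` (or `BOOST_INCLUDEDIR`), so in an unpacked Boost source tree the value would normally be the parent directory, here `/home/adrian/programs/boost_1_55_0`. The log also shows static `thread`, `system` and `filesystem` libraries being searched for in `<root>/lib` and `<root>/stage/lib`; those only exist once Boost has been built, since the source tarball ships headers only. The sketch below checks both conditions for a given directory. It is a hedged diagnostic aid, not part of Meraculous or CMake, and the function names are made up for illustration.

```python
from pathlib import Path


def find_boost_root(candidate):
    """Return the directory that contains boost/version.hpp, or None.

    Tries the candidate itself and then its parent, which covers the case
    where the path was given with a trailing 'boost/' component.
    """
    path = Path(candidate)
    for directory in (path, path.parent):
        if (directory / "boost" / "version.hpp").is_file():
            return directory
    return None


def missing_static_libs(root, components=("thread", "system", "filesystem")):
    """List components with no libboost_<name>*.a under <root>/lib or <root>/stage/lib."""
    lib_dirs = [Path(root) / "lib", Path(root) / "stage" / "lib"]
    missing = []
    for name in components:
        found = any(
            any(d.glob(f"libboost_{name}*.a")) for d in lib_dirs if d.is_dir()
        )
        if not found:
            missing.append(name)
    return missing
```

For example, `find_boost_root("/home/adrian/programs/boost_1_55_0/boost/")` would be expected to return the `boost_1_55_0` directory if the headers are where a standard source tree puts them; that value is then the candidate for `BOOST_ROOT`. If `missing_static_libs` on that root reports any components, the Boost libraries probably still need to be built before rerunning `install.sh`.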



                            • #59
                              Originally posted by jpummil
                              Thanks for the recent posts and ideas, all! Currently, I am experimenting with SSPACE to improve scaffolding of some of my existing assemblies. Not a lot of progress yet, but I am optimistic so far.

                              I do hope to get back and run the data thru another assembler or two just for comparison and completeness. I'm actually running it in SPAdes right now even though it is relatively untested with large genomes. Had to run with --only-assembler as one of the tools in the normal pipeline (BayesHammer) is 32-bit only and was giving "size greater than 2^32 -1" errors. The assembly is making steady progress on a 32-core, 768GB node. It has been running 247 hours and is consuming 512GB of memory so far...
                              Project is still very much alive for those who check in from time to time. Right now I am trying to use SSPACE to improve existing assemblies, but am having issues of some sort. For the thread on THAT saga, see here: http://seqanswers.com/forums/showthread.php?t=8350
