
**jpummil** · 06-02-2014, 10:47 AM

So, I'll attack the mapping task first. The contigs.fa file listed as ref= below was generated in Velvet USING a reverse comped and quality trimmed version of the LMP data...will that invalidate the mapping overall? I could generate a draft assembly with not much trouble either omitting the LMP data altogether, or by attempting to use it "raw".

bbmap.sh -Xmx40g in1=Rattle_Snake__RS_212M__L3_GGCTAC_L003_R1_001.fastq in2=Rattle_Snake__RS_212M__L3_GGCTAC_L003_R2_001.fastq ref=contigs.fa nodisk out=null ihist=ihist_MP.txt mhist=mhist_MP.txt qhist=qhist_MP.txt bhist=bhist_MP.txt

Any other flags needed/desired here?

**Brian Bushnell** · 06-02-2014, 10:55 AM

I recommend "rcs=f" (requirecorrectstrand=false) for long mate pair libraries; otherwise pairs that don't map in the normal fragment orientation will be considered improper pairs.

Also, I think you'll probably need to generate a sam file and feed it to something to determine what percentage of pairs map in each possible orientation; BBMap does not output that, unfortunately. And I don't have a program that does it, either, but it's important to find out for LMP libraries.
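As a rough illustration of the orientation tally described above, here is a minimal Python sketch that reads SAM lines and counts how many pairs fall into each orientation. It is not part of BBMap or any existing tool. The function names and category labels are made up for this example, and it assumes standard SAM flag bits and compares only leftmost mapping positions, which should be adequate for long-insert pairs but can misclassify overlapping reads.

```python
from collections import Counter


def classify_pair(flag, pos, pnext):
    """Classify a pair from read 1's SAM flag and leftmost positions (hypothetical labels)."""
    read_rev = bool(flag & 0x10)   # this read maps to the reverse strand
    mate_rev = bool(flag & 0x20)   # its mate maps to the reverse strand
    if read_rev == mate_rev:
        return "same_strand"
    # Identify which position belongs to the forward-strand read.
    fwd_pos, rev_pos = (pnext, pos) if read_rev else (pos, pnext)
    # Forward read to the left of the reverse read: reads point toward each other.
    return "innie_FR" if fwd_pos <= rev_pos else "outie_RF"


def orientation_counts(sam_lines):
    """Count pair orientations, looking at each pair once via its first read."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        # Keep only paired, first-in-pair, primary records.
        if not flag & 0x1 or not flag & 0x40 or flag & 0x900:
            continue
        if flag & 0x4 or flag & 0x8:
            counts["unmapped_or_half_mapped"] += 1
            continue
        if fields[6] != "=" and fields[6] != fields[2]:
            counts["different_contigs"] += 1
            continue
        counts[classify_pair(flag, int(fields[3]), int(fields[7]))] += 1
    return counts


def orientation_percentages(counts):
    """Convert counts to percentages of all pairs seen."""
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()} if total else {}
```

To try it, one would write the alignments to a SAM file instead of out=null, open that file in Python, and pass the file handle to orientation_counts. For an LMP library, a large "outie_RF" share versus "innie_FR" gives a first hint of the library's native orientation and of how much paired-end contamination is present.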

**jpummil** · 06-18-2014, 08:47 AM

No, I have not "given up"...far from it :-P

Other duties including conferences, office relocation, etc have dipped into my rattlesnake time.

Update: We are having some additional LMP / jumping libraries created by our sequence supplier...we already have some at 6kbp, these will be at 8kbp I understand. With this volume of Illumina data (both PE's and LMP's) I plan to at least take a crack at using ALLPATHS-LG as it seems to really like these types of projects if you give it enough data of both sorts.

As per my last chat with Brian B about direction sense of LMP data, I have requested that the sequencer lab provide VERY specific data about preparation and processing of the samples and data. I don't have an ETA yet for the new libraries, but they're "in the queue" at the lab :-D

**jpummil** · 07-22-2014, 10:08 AM

So, just received my additional Illumina LMP data. Total aggregate for sequencer data is now 288GB.

/Snake_Orig> ls -lh
total 288G

HiSeq LMP Reads:
-rw-rw-r-- 1 pummill mc3l8ep 21G 2014-07-21 14:02 Rattle_Snake__RS_212M__L3_GGCTAC_L003_R1_001.fastq
-rw-rw-r-- 1 pummill mc3l8ep 21G 2014-07-21 14:10 Rattle_Snake__RS_212M__L3_GGCTAC_L003_R2_001.fastq
-rw-rw-r-- 1 pummill mc3l8ep 40G 2014-07-21 14:01 C_horridus_a_5608_ACTTGA_L005_R1_001.fastq
-rw-rw-r-- 1 pummill mc3l8ep 40G 2014-07-21 14:07 C_horridus_a_5608_ACTTGA_L005_R2_001.fastq

HiSeq PE Reads:
-rw-rw-r-- 1 pummill mc3l8ep 19G 2014-07-21 14:10 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
-rw-rw-r-- 1 pummill mc3l8ep 19G 2014-07-21 14:13 3_Snake_4117_TSDR27_ATTCCT_L003_R2_001.fastq
-rw-rw-r-- 1 pummill mc3l8ep 27G 2014-07-21 14:18 s_1_1_sequence.fastq
-rw-rw-r-- 1 pummill mc3l8ep 27G 2014-07-21 14:28 s_1_2_sequence.fastq
-rw-rw-r-- 1 pummill mc3l8ep 24G 2014-07-21 14:02 s_2_1_sequence.fastq
-rw-rw-r-- 1 pummill mc3l8ep 24G 2014-07-21 14:13 s_2_2_sequence.fastq

MiSeq Reads:
-rw-rw-r-- 1 pummill mc3l8ep 16G 2014-07-21 14:10 MIKE_S1_L001_R1_001.fastq
-rw-rw-r-- 1 pummill mc3l8ep 16G 2014-07-21 14:22 MIKE_S1_L001_R2_001.fastq

Have transferred all of the data to the large memory machine (SGI) and am looking thru the ALLPATHS-LG information in an attempt to "get it right" the first time around. It'll be interesting to see if we get a substantially better assembly using an extra 80G of LMP reads plus the ALLPATHS assembler.

**jpummil** · 08-08-2014, 11:19 AM

Sorry for the prolonged delay for anyone following this thread. I finally got time to evaluate the files to determine whether they were all quality-encoded the same way...and they're not, which I believe means I'll have to be a bit more careful when I run ALLPATHS. FastQC designates four sets of the files as Sanger / Illumina 1.9, while the two remaining sets are shown as Illumina 1.5.

Illumina 1.5 = phred 64?
Illumina 1.9 = phred 33?
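For what it's worth, those two guesses match the usual convention: Illumina 1.5 files use an ASCII offset of 64 and Illumina 1.8+ (reported by FastQC as "Sanger / Illumina 1.9") use an offset of 33. A quick way to double-check a file independently of FastQC is to look at the range of quality characters. The following is a minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines). The function name and the max_records cutoff are invented for this example, and the thresholds are a heuristic rather than a guarantee.

```python
def guess_phred_offset(fastq_lines, max_records=100000):
    """Guess the quality offset (33 or 64) from the observed quality characters.

    Returns None when no quality data is seen or the range is ambiguous.
    """
    lo, hi = 255, 0
    for i, line in enumerate(fastq_lines):
        if i // 4 >= max_records:
            break  # sampled enough records
        if i % 4 != 3:
            continue  # only the 4th line of each record holds qualities
        quals = line.rstrip("\r\n")
        if not quals:
            continue
        codes = [ord(ch) for ch in quals]
        lo = min(lo, min(codes))
        hi = max(hi, max(codes))
    if hi == 0:
        return None  # no quality lines found
    if lo < 59:
        return 33    # characters this low cannot occur with an offset of 64
    if lo >= 64:
        return 64    # consistent with Illumina 1.3-1.7 style encoding
    return None      # 59-63 only: possibly old Solexa scores, not decided here
```

Passing an open file handle for each FASTQ file should return 33 for the Illumina 1.9 sets and 64 for the Illumina 1.5 sets if FastQC's call is right. A result of 64 on a small sample is weaker evidence than a result of 33, since a phred 33 file with uniformly high qualities would look the same.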

**bioman1** · 08-23-2014, 08:49 PM

Regarding finding the optimal k-mer, I would suggest running SGA preqc; it gives better k-mer suggestions than KmerGenie. Also, I would like you to try assembling with SGA and minia, if you have time.

**jpummil** · 08-24-2014, 06:40 AM

I do hope, as I have time (falling behind again at the moment), to try a number of assemblers "just to see". That said, it doesn't look like SGA or minia explicitly use LMP or "jumping" libraries, so I expect one would have to either modify those reads or possibly take a quality hit by merging them into single reads or using them as PE's. Actually, a quick look at minia seemed to indicate that it only uses single reads...or at least, you just list files one after the other in the input file without specifying what types of reads you are actually using?

With the amount of data I have (see attached summary), I'm sure I will have to be pretty careful about specifying details to the assembler in order to achieve good results (and a good many other exercises, I expect!).

Attached Files

Crotalus_Data_Info.pdf (37.3 KB, 39 views)

**GeneGolts** · 12-08-2014, 04:43 PM

Just wanted to mention here that for assembling large diploid genomes there's also Meraculous2. It gives you feedback about the polymorphic composition right after counting the k-mers so you can pause and decide whether or not the data you have is sufficient or needs more coverage, normalization, cleanup, etc.

I am one of the authors so I can give you a more in-depth run-down if you're interested.

**AdrianP** · 12-08-2014, 05:08 PM

You sure about that?

Is this the assembler?

Docker

https://registry.hub.docker.com/u/egoltsman/meraculous2-docker/

README

This Docker container is part of the nucleotide.es genome assembler comparison project (currently internal to Joint Genome Institute). It installs and executes the Meraculous2 genome assembler on a specified Illumina shotgun dataset that must meet the following characteristics:

genome size: appx. 0.5-1 mb
read length: 100-200 bp
library type: paired-end, 300 bp

The results is the file final.scaffolds.fa containing all scaffolds over 1kb in size.

For more questions please contact:

Michael Barton [email protected]
Eugene Goltsman [email protected]

**Brian Bushnell** · 12-08-2014, 05:22 PM

Hmmm, that appears to be incorrect; I know Meraculous is intended for (and has been run on) much larger organisms, including human.

**GeneGolts** · 12-09-2014, 12:59 PM

That docker container is a specific application, i.e., it's configured for a side by side comparison vs other assemblers using a bunch of microbial datasets. As Brian said, Meraculous is in fact optimized for large genomes. You can get it here: http://sourceforge.net/projects/meraculous20/

**jpummil** · 12-09-2014, 01:10 PM

Thanks for the recent posts and ideas, all! Currently, I am experimenting with SSPACE to improve scaffolding of some of my existing assemblies. Not a lot of progress yet, but I am optimistic so far.

I do hope to get back and run the data thru another assembler or two just for comparison and completeness. I'm actually running it in SPAdes right now even though it is relatively un-tested with large genomes. Had to run with --only-assembler as one of the tools in the normal pipeline (BayesHammer) is 32 bit only and was giving " size greater than 2^32 -1 " errors. Assembly making steady progress on a 32 core, 768GB node. Been running 247 hours and is consuming 512GB of memory so far...

**AdrianP** · 12-10-2014, 05:52 PM

Originally posted by GeneGolts:

That docker container is a specific application, i.e., it's configured for a side by side comparison vs other assemblers using a bunch of microbial datasets. As Brian said, Meraculous is in fact optimized for large genomes. You can get it here: http://sourceforge.net/projects/meraculous20/

Thanks for clarifying that. I am having real trouble installing the software:

Code:

@BioPower3-IBM ~/programs/meraculous-2.0.4 $ export Boost_INCLUDE_DIRS=/home/adrian/programs/boost_1_55_0/boost/
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 15:36:40
@BioPower3-IBM ~/programs/meraculous-2.0.4 $ ./install.sh install_dir
-- Testing the environment..
-- Perl found: /home/adrian/perl5/perlbrew/perls/5.18.2t/bin/perl
CMake Warning at CMakeLists.txt:65 (message):
  Gnuplot was not found!


-- [ /usr/share/cmake/Modules/FindBoost.cmake:481 ] _boost_TEST_VERSIONS = 1.56.0;1.56;1.55.0;1.55;1.54.0;1.54;1.53.0;1.53;1.52.0;1.52;1.51.0;1.51;1.50.0;1.50
-- [ /usr/share/cmake/Modules/FindBoost.cmake:483 ] Boost_USE_MULTITHREADED = OFF
-- [ /usr/share/cmake/Modules/FindBoost.cmake:485 ] Boost_USE_STATIC_LIBS = ON
-- [ /usr/share/cmake/Modules/FindBoost.cmake:487 ] Boost_USE_STATIC_RUNTIME = ON
-- [ /usr/share/cmake/Modules/FindBoost.cmake:489 ] Boost_ADDITIONAL_VERSIONS = 
-- [ /usr/share/cmake/Modules/FindBoost.cmake:491 ] Boost_NO_SYSTEM_PATHS = 
-- [ /usr/share/cmake/Modules/FindBoost.cmake:543 ] Declared as CMake or Environmental Variables:
-- [ /usr/share/cmake/Modules/FindBoost.cmake:545 ]   BOOST_ROOT = 
-- [ /usr/share/cmake/Modules/FindBoost.cmake:547 ]   BOOST_INCLUDEDIR = 
-- [ /usr/share/cmake/Modules/FindBoost.cmake:549 ]   BOOST_LIBRARYDIR = 
-- [ /usr/share/cmake/Modules/FindBoost.cmake:551 ] _boost_TEST_VERSIONS = 1.56.0;1.56;1.55.0;1.55;1.54.0;1.54;1.53.0;1.53;1.52.0;1.52;1.51.0;1.51;1.50.0;1.50
-- [ /usr/share/cmake/Modules/FindBoost.cmake:620 ] Include debugging info:
-- [ /usr/share/cmake/Modules/FindBoost.cmake:622 ]   _boost_INCLUDE_SEARCH_DIRS = /home/adrian/programs/boost_1_55_0/boost;/home/adrian/programs/boost_1_55_0/boost/include;/home/adrian/programs/boost_1_55_0/boost;PATHS;C:/boost/include;C:/boost;/sw/local/include
-- [ /usr/share/cmake/Modules/FindBoost.cmake:624 ]   _boost_PATH_SUFFIXES = boost-1_56_0;boost_1_56_0;boost/boost-1_56_0;boost/boost_1_56_0;boost-1_56;boost_1_56;boost/boost-1_56;boost/boost_1_56;boost-1_55_0;boost_1_55_0;boost/boost-1_55_0;boost/boost_1_55_0;boost-1_55;boost_1_55;boost/boost-1_55;boost/boost_1_55;boost-1_54_0;boost_1_54_0;boost/boost-1_54_0;boost/boost_1_54_0;boost-1_54;boost_1_54;boost/boost-1_54;boost/boost_1_54;boost-1_53_0;boost_1_53_0;boost/boost-1_53_0;boost/boost_1_53_0;boost-1_53;boost_1_53;boost/boost-1_53;boost/boost_1_53;boost-1_52_0;boost_1_52_0;boost/boost-1_52_0;boost/boost_1_52_0;boost-1_52;boost_1_52;boost/boost-1_52;boost/boost_1_52;boost-1_51_0;boost_1_51_0;boost/boost-1_51_0;boost/boost_1_51_0;boost-1_51;boost_1_51;boost/boost-1_51;boost/boost_1_51;boost-1_50_0;boost_1_50_0;boost/boost-1_50_0;boost/boost_1_50_0;boost-1_50;boost_1_50;boost/boost-1_50;boost/boost_1_50
-- [ /usr/share/cmake/Modules/FindBoost.cmake:744 ] guessed _boost_COMPILER = -gcc
-- [ /usr/share/cmake/Modules/FindBoost.cmake:754 ] _boost_MULTITHREADED = 
-- [ /usr/share/cmake/Modules/FindBoost.cmake:797 ] _boost_RELEASE_ABI_TAG = -s
-- [ /usr/share/cmake/Modules/FindBoost.cmake:799 ] _boost_DEBUG_ABI_TAG = -sd
-- [ /usr/share/cmake/Modules/FindBoost.cmake:847 ] _boost_LIBRARY_SEARCH_DIRS = /home/adrian/programs/boost_1_55_0/boost/lib;/home/adrian/programs/boost_1_55_0/boost/stage/lib;Boost_INCLUDE_DIR-NOTFOUND/lib;Boost_INCLUDE_DIR-NOTFOUND/../lib;Boost_INCLUDE_DIR-NOTFOUND/stage/lib;PATHS;C:/boost/lib;C:/boost;/sw/local/lib
-- [ /usr/share/cmake/Modules/FindBoost.cmake:957 ] Searching for THREAD_LIBRARY_RELEASE: boost_thread-gcc-s-;boost_thread-gcc-s;boost_thread-s-;boost_thread-s;boost_thread
-- [ /usr/share/cmake/Modules/FindBoost.cmake:993 ] Searching for THREAD_LIBRARY_DEBUG: boost_thread-gcc-sd-;boost_thread-gcc-sd;boost_thread-sd-;boost_thread-sd;boost_thread;boost_thread
-- [ /usr/share/cmake/Modules/FindBoost.cmake:957 ] Searching for SYSTEM_LIBRARY_RELEASE: boost_system-gcc-s-;boost_system-gcc-s;boost_system-s-;boost_system-s;boost_system
-- [ /usr/share/cmake/Modules/FindBoost.cmake:993 ] Searching for SYSTEM_LIBRARY_DEBUG: boost_system-gcc-sd-;boost_system-gcc-sd;boost_system-sd-;boost_system-sd;boost_system;boost_system
-- [ /usr/share/cmake/Modules/FindBoost.cmake:957 ] Searching for FILESYSTEM_LIBRARY_RELEASE: boost_filesystem-gcc-s-;boost_filesystem-gcc-s;boost_filesystem-s-;boost_filesystem-s;boost_filesystem
-- [ /usr/share/cmake/Modules/FindBoost.cmake:993 ] Searching for FILESYSTEM_LIBRARY_DEBUG: boost_filesystem-gcc-sd-;boost_filesystem-gcc-sd;boost_filesystem-sd-;boost_filesystem-sd;boost_filesystem;boost_filesystem
CMake Error at /usr/share/cmake/Modules/FindBoost.cmake:1138 (message):
  Unable to find the requested Boost libraries.

  Unable to find the Boost header files.  Please set BOOST_ROOT to the root
  directory containing Boost or BOOST_INCLUDEDIR to the directory containing
  Boost's headers.
Call Stack (most recent call first):
  src/c/CMakeLists.txt:44 (find_package)


-- Boost libs: 
-- Boost include paths: Boost_INCLUDE_DIR-NOTFOUND
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
Boost_INCLUDE_DIR (ADVANCED)
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c
   used as include directory in directory /home/adrian/programs/meraculous-2.0.4/src/c

-- Configuring incomplete, errors occurred!
See also "/home/adrian/programs/meraculous-2.0.4/build/CMakeFiles/CMakeOutput.log".
See also "/home/adrian/programs/meraculous-2.0.4/build/CMakeFiles/CMakeError.log".
make: *** No targets specified and no makefile found.  Stop.
make: *** No rule to make target `install'.  Stop.

I have also tried:

Code:

export BOOST_ROOT=/home/adrian/programs/boost_1_55_0/boost/
export BOOST_INCLUDEDIR=/home/adrian/programs/boost_1_55_0/boost/

and some other combinations.
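One possible reading of the log above, offered as a guess rather than a confirmed diagnosis: the search directories it prints end in boost_1_55_0/boost, so CMake may be looking for the headers one level too deep, and with Boost_USE_STATIC_LIBS = ON it also wants static thread, system, and filesystem libraries under lib or stage/lib. The small Python sketch below checks a candidate directory for both before re-running the installer. The function name is made up for this example, and the file layout it tests (boost/config.hpp for headers, libboost_<name>*.a for libraries) is an assumption based on a typical Unix Boost build.

```python
from pathlib import Path

# Components the FindBoost log above reports searching for.
REQUIRED_LIBS = ("thread", "system", "filesystem")


def check_boost_root(root):
    """Report whether a candidate BOOST_ROOT holds headers and static libraries."""
    root = Path(root)
    # Headers are expected at <root>/boost/config.hpp or <root>/include/boost/config.hpp.
    headers = (root / "boost" / "config.hpp").is_file() or (
        root / "include" / "boost" / "config.hpp"
    ).is_file()
    report = {"headers": headers}
    # Library directories taken from _boost_LIBRARY_SEARCH_DIRS in the log.
    lib_dirs = [d for d in (root / "lib", root / "stage" / "lib") if d.is_dir()]
    for name in REQUIRED_LIBS:
        # Static libraries are assumed to be named libboost_<name>*.a.
        report[name] = any(any(d.glob(f"libboost_{name}*.a")) for d in lib_dirs)
    return report


if __name__ == "__main__":
    import sys

    for candidate in sys.argv[1:]:
        print(candidate, check_boost_root(candidate))
```

Running it on both /home/adrian/programs/boost_1_55_0 and /home/adrian/programs/boost_1_55_0/boost would show which of the two, if either, looks like a usable root. If the headers are found but the libraries are not, the Boost libraries may simply not have been built yet.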

**jpummil** · 01-30-2015, 10:05 AM

The project is still very much alive for those who check in from time to time. Right now I am trying to use SSPACE to improve existing assemblies, but I am having issues of some sort. For the thread on THAT saga, see here: http://seqanswers.com/forums/showthread.php?t=8350
