SEQanswers

Old 07-23-2012, 05:11 PM   #41
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

We have a parallelized mergesort; the source is on GitHub (see my post a couple of posts up). We would love feedback and suggestions.
Old 07-23-2012, 05:48 PM   #42
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by adaptivegenome View Post
We have a parallelized mergesort; the source is on GitHub (see my post a couple of posts up). We would love feedback and suggestions.
Can you plug the "pbgzf.c" source code into bamtools? I have been playing around with bamtools and I quite like it. The bottleneck in reading/writing BAM files is the compression/decompression.
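For readers following along, the core idea behind pbgzf can be sketched in a few lines (an illustrative Python sketch, not the actual C implementation): BGZF is a series of independently-deflated blocks, so blocks can be handed to worker threads and written back in their original order.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # BGZF blocks are capped at 64 KB

def compress_blocks_parallel(data: bytes, workers: int = 4) -> list[bytes]:
    """Compress fixed-size blocks on a thread pool, preserving order."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # zlib releases the GIL while deflating, so threads overlap usefully;
        # map() returns results in input order, so a writer can stream them.
        return list(pool.map(gzip.compress, blocks))
```

Each returned block decompresses independently, which is what makes this style of parallelism compatible with BGZF-structured files.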
Old 07-23-2012, 05:54 PM   #43
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Quote:
Originally Posted by nilshomer View Post
Can you plug the "pbgzf.c" source code into bamtools? I have been playing around with bamtools and I quite like it. The bottleneck in reading/writing BAM files is the compression/decompression.
You are correct. We built a multithreaded version of a combined merge and sort on top of bamtools, and after a lot of work we got a 5X speedup over the serial implementation. By comparison, novosort offers a 10X speedup; the gap is probably because we are still stuck with bamtools' serial I/O.

We are now replacing the serial I/O with parallel I/O, but it's taken a bit of time to do. We should have something soon.

One other thing: we also (optionally) include MarkDuplicates as part of mergesort, which speeds things up as well.
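For reference, the serial heart of such a merge over position-sorted runs looks something like this (a generic Python stand-in, not OpenGE's C++):

```python
import heapq

def merge_runs(*runs):
    """Merge iterables of (position, record) pairs, each already sorted
    by position, into one sorted stream without loading runs into memory."""
    return heapq.merge(*runs, key=lambda item: item[0])
```

Duplicate marking can then ride along in the same pass over the merged stream, which is why folding it into mergesort saves a separate read of the file.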
Old 07-23-2012, 05:56 PM   #44
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

Quote:
Originally Posted by adaptivegenome View Post
You are correct. We built a multithreaded version of a combined merge and sort on top of bamtools, and after a lot of work we got a 5X speedup over the serial implementation. By comparison, novosort offers a 10X speedup; the gap is probably because we are still stuck with bamtools' serial I/O.

We are now replacing the serial I/O with parallel I/O, but it's taken a bit of time to do. We should have something soon.

One other thing: we also (optionally) include MarkDuplicates as part of mergesort, which speeds things up as well.
Could you use the implementation that I made to do parallel I/O?
Old 07-23-2012, 06:03 PM   #45
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Oh I see what you are saying. Yes, let me check it out and see if I can figure it out!
Old 07-23-2012, 07:31 PM   #46
nilshomer
Nils Homer
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285

See my branch: https://github.com/nh13/samtools

bam.h has a define that substitutes the equivalent "pbgzf" functions for the "bgzf" functions.
Old 07-24-2012, 01:20 PM   #47
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

Quick comment on Nils' parallelized samtools (not adaptivegenome's C++ version).

The __sync_fetch_and_add function is a gcc-specific builtin. My actual "hard iron" machines run a very old kernel, and the library with the "atomic builtins" doesn't work with older kernels (I think). It's unlikely to work with non-gcc compilers. It's just this one function; can the code be made to work without it?

I can compile and statically link on my Ubuntu box, but the kernel on the "hard iron" is too old.

I'm checking out the adaptivegenome stuff, but it's tough slogging ... download and compile git, download and compile CMake ... pray there are compatible libraries ... more later.
Old 07-24-2012, 01:30 PM   #48
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Quote:
Originally Posted by Richard Finney View Post
Quick comment on Nils' parallelized samtools (not adaptivegenome's C++ version).

The __sync_fetch_and_add function is a gcc-specific builtin. My actual "hard iron" machines run a very old kernel, and the library with the "atomic builtins" doesn't work with older kernels (I think). It's unlikely to work with non-gcc compilers. It's just this one function; can the code be made to work without it?

I can compile and statically link on my Ubuntu box, but the kernel on the "hard iron" is too old.

I'm checking out the adaptivegenome stuff, but it's tough slogging ... download and compile git, download and compile CMake ... pray there are compatible libraries ... more later.
If you have trouble let me know!
Old 07-24-2012, 02:24 PM   #49
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

Does openge+bamtools require Boost 1.45.0 exactly, or will 1.45.0 or newer work?

The cmake stuff is a real pain. Dumbing it down to a makefile might be good.

Chasing down messages like this is not fun on a bun ...
Code:
-bash-3.00$ ../../cmake-2.8.7/bin/cmake ..
CMake Error: Error in cmake code at
/h1/finneyr/cmake-2.8.7/Modules/FindBoost.cmake:1:
Parse error. Expected "(", got identifier with text "version".
CMake Error at openge/CMakeLists.txt:5 (find_package):
find_package Error reading CMake code from
"/h1/finneyr/cmake-2.8.7/Modules/FindBoost.cmake".

-- Configuring incomplete, errors occurred!

[ edit ... okay ... it looks like I overwrote FindBoost.cmake somehow, maybe by dot-sourcing it because I thought it was a bash script, or by passing it as an argument to cmake ... so ... downloading a new cmake ... ]

We need the requirements and build instructions in the README.md file. Pointing to the PDF is nice, but I'm on a command line.

Last edited by Richard Finney; 07-24-2012 at 02:42 PM.
Old 07-24-2012, 02:57 PM   #50
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Quote:
Originally Posted by Richard Finney View Post
Does openge+bamtools require Boost 1.45.0 exactly, or will 1.45.0 or newer work?

The cmake stuff is a real pain. Dumbing it down to a makefile might be good.

Chasing down messages like this is not fun on a bun ...
Code:
-bash-3.00$ ../../cmake-2.8.7/bin/cmake ..
CMake Error: Error in cmake code at
/h1/finneyr/cmake-2.8.7/Modules/FindBoost.cmake:1:
Parse error. Expected "(", got identifier with text "version".
CMake Error at openge/CMakeLists.txt:5 (find_package):
find_package Error reading CMake code from
"/h1/finneyr/cmake-2.8.7/Modules/FindBoost.cmake".

-- Configuring incomplete, errors occurred!

[ edit ... okay ... it looks like I overwrote FindBoost.cmake somehow, maybe by dot-sourcing it because I thought it was a bash script, or by passing it as an argument to cmake ... so ... downloading a new cmake ... ]

We need the requirements and build instructions in the README.md file. Pointing to the PDF is nice, but I'm on a command line.
So I think you did end up losing that file. Hopefully it is working now? I agree with you that we should have a non-PDF explanation. The /doc directory actually contains the LaTeX source files, which are sort of readable at the command line, but we will go ahead and make a plain-text version as well.

Sorry for the trouble! Please let me know if you run into any other issues; I certainly don't want compiling to be a pain. What OS are you using?
Old 07-24-2012, 04:46 PM   #51
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

Can't seem to coerce cmake into liking my setup ...
Boost is in /h1/finneyr/boost_1_49_0/boost/

Code:
# environment has been set ...
-bash-3.00$ set | grep -i boost
BOOST_INCLUDEDIR=/h1/finneyr/boost_1_49_0/boost
BOOST_ROOT=/h1/finneyr/boost_1_49_0/boost

# to force it to start from the beginning without caching (this is silly but necessary)
-bash-3.00$ rm CMakeCache.txt ; rm -r CMakeFiles

-bash-3.00$ ../../cmake-2.8.8/bin/cmake .. 2>&1 | tail -20
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
CMake Error at /h1/finneyr/cmake-2.8.8/Modules/FindBoost.cmake:1207 (message):
Unable to find the requested Boost libraries.

Unable to find the Boost header files. Please set BOOST_ROOT to the root
directory containing Boost or BOOST_INCLUDEDIR to the directory containing
Boost's headers.
Call Stack (most recent call first):
openge/CMakeLists.txt:5 (find_package)

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
Boost_INCLUDE_DIR (ADVANCED)
used as include directory in directory /h1/finneyr/adaptivegenome-openge-e48ecc0/openge/src

-- Configuring incomplete, errors occurred!

Using
export BOOST_ROOT=/h1/finneyr/boost_1_49_0/
doesn't work either.

Last edited by Richard Finney; 07-24-2012 at 04:56 PM.
Old 07-24-2012, 07:10 PM   #52
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Richard,

If you have Boost libraries installed in non-standard locations, you can add those locations to CMAKE_PREFIX_PATH so that CMake is able to find them. Setting CMAKE_PREFIX_PATH to /h1/finneyr/boost_1_49_0/boost/ before running CMake should fix your problem.

The BOOST_INCLUDEDIR and BOOST_ROOT that you are setting aren't plain environment variables, but variables inside the CMake script. The FindBoost.cmake script mentioned in the thread comes with CMake; it is used to locate Boost by searching a set of directories, and it sets BOOST_INCLUDEDIR and BOOST_ROOT when Boost is found.
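Concretely, using the path from the session above (a sketch; the Boost path is from Richard's setup, adjust as needed):

```shell
# Tell CMake where to look before its built-in search paths.
export CMAKE_PREFIX_PATH=/h1/finneyr/boost_1_49_0/boost

# Then clear the cached results of the failed configure and re-run, e.g.:
#   rm -f CMakeCache.txt && rm -rf CMakeFiles && cmake ..
```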

More about CMAKE_PREFIX_PATH: http://www.cmake.org/Wiki/CMake_Usef...ment_Variables

Let me know how it goes!

Quote:
Originally Posted by Richard Finney View Post
Can't seem to coerce cmake into liking my setup ...
Boost is in /h1/finneyr/boost_1_49_0/boost/

Code:
# environment has been set ...
-bash-3.00$ set | grep -i boost
BOOST_INCLUDEDIR=/h1/finneyr/boost_1_49_0/boost
BOOST_ROOT=/h1/finneyr/boost_1_49_0/boost

# to force it to start from the beginning without caching (this is silly but necessary)
-bash-3.00$ rm CMakeCache.txt ; rm -r CMakeFiles

-bash-3.00$ ../../cmake-2.8.8/bin/cmake .. 2>&1 | tail -20
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
CMake Error at /h1/finneyr/cmake-2.8.8/Modules/FindBoost.cmake:1207 (message):
Unable to find the requested Boost libraries.

Unable to find the Boost header files. Please set BOOST_ROOT to the root
directory containing Boost or BOOST_INCLUDEDIR to the directory containing
Boost's headers.
Call Stack (most recent call first):
openge/CMakeLists.txt:5 (find_package)

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
Boost_INCLUDE_DIR (ADVANCED)
used as include directory in directory /h1/finneyr/adaptivegenome-openge-e48ecc0/openge/src

-- Configuring incomplete, errors occurred!

Using
export BOOST_ROOT=/h1/finneyr/boost_1_49_0/
doesn't work either.
Old 07-25-2012, 05:02 AM   #53
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

Ack!
Found the problem.
New cmake is not compatible with boost 1.49
I downloaded boost 1.50 and the "stage/lib" directory is there and cmake finds it with this ...
Code:
rm -r CMake*
../../cmake-2.8.8/bin/cmake  -D BOOST_LIBRARYDIR=/h1/finneyr/boost_1_50_0/stage/lib -D BOOST_ROOT=/h1/finneyr/boost_1_50_0/ -D CMAKE_PREFIX_PATH=/h1/finneyr/boost_1_50_0/stage/lib -D LIBRARY_SEARCH_DIRS=/h1/finneyr/boost_1_50_0/stage/lib ..
But my compiler is too old; the "atomic builtins" it's missing are, for example, __sync_synchronize() and __sync_lock_test_and_set().

rrrggghhh.

Time to download and compile gcc ... and probably a later glibc ...
Old 07-25-2012, 07:49 AM   #54
leecbaker
Junior Member
 
Location: .

Join Date: May 2012
Posts: 8

Quote:
Originally Posted by Richard Finney View Post
New cmake is not compatible with boost 1.49
Hello Richard,

Primary developer of OpenGE here. My primary development machine uses CMake 2.8.8 (the latest) with Boost 1.49, and I haven't run into any problems with this combination. Can you elaborate on the issues you are having? If there is an incompatibility with a particular version of Boost, we would like to get it fixed.

-Lee
Old 07-25-2012, 08:10 AM   #55
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

I probably downloaded Boost 1.49 a long time ago, and it was probably not built successfully, so it had no stage/lib directory.
Old 07-25-2012, 01:02 PM   #56
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Well, keep us posted if you have more trouble, and of course I am excited to hear your thoughts once you try it out. It looks like we should be able to fit some of Nils' work into the code; we will update shortly on that.
Old 07-26-2012, 12:31 AM   #57
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146

Quote:
Originally Posted by nilshomer View Post
Can you plug the "pbgzf.c" source code into bamtools? I have been playing around with bamtools and I quite like it. The bottleneck in reading/writing BAM files is the compression/decompression.
This can probably be sped up by reordering and recoding the data before passing it into gzip et al. Yes, it's no longer strictly BAM, but a content-aware compressor could do this and reverse it, so it looks like you're passing in uncompressed BAM and getting uncompressed BAM back while in reality it's shuffling data around.

More specifically, by reordering data per block (e.g. separating it into all names, all seqs, all quals, all CIGARs, all positions (delta-encoded), etc.) we can either substantially improve compression ratios or, on the flip side, use lighter-weight compression tools (lzop, lz4, snappy, etc.) to get the same compression ratio as existing BAM with much faster and more symmetric times.

There's something to be said for making this configurable. In a pipeline environment we'd probably want fast reading and writing, while in an archival environment we could recode the files for maximum compression. Although BAM really isn't the best tool for the archival job: think SAM -> something -> SAM, where "something" is a role BAM isn't well suited to, IMO. (E.g. compare it to CRAM or Goby.)
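A toy illustration of the per-block reordering idea (my own sketch, not jkbonfield's code): transposing SAM fields into per-column streams before compressing puts self-similar data together, which general-purpose compressors exploit.

```python
import gzip

def compress_rows(records: list[bytes]) -> bytes:
    # Baseline: compress records in row order, roughly as BAM does.
    return gzip.compress(b"\n".join(records))

def compress_columns(records: list[bytes]) -> bytes:
    # Reordered: all names together, all positions together, etc.
    columns = zip(*(r.split(b"\t") for r in records))
    return gzip.compress(b"\n".join(b"\t".join(col) for col in columns))
```

Both transforms are losslessly reversible (transpose back after decompressing); delta-encoding the position column, as suggested above, would shrink it further.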
Old 07-26-2012, 05:42 AM   #58
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Interesting idea. In our code the file formats are abstracted from the functions, so you can feed, for example, SAM or BAM to our mergesort. We are going to support CRAM as soon as they update to 0.9; right now I don't think they have a stable spec.

But I think the idea of reordering the data to make it easier to compress is a good one. Like Nils, we find this compression step to be a major bottleneck.


Quote:
Originally Posted by jkbonfield View Post
This can probably be sped up by reordering and recoding the data before passing it into gzip et al. Yes, it's no longer strictly BAM, but a content-aware compressor could do this and reverse it, so it looks like you're passing in uncompressed BAM and getting uncompressed BAM back while in reality it's shuffling data around.

More specifically, by reordering data per block (e.g. separating it into all names, all seqs, all quals, all CIGARs, all positions (delta-encoded), etc.) we can either substantially improve compression ratios or, on the flip side, use lighter-weight compression tools (lzop, lz4, snappy, etc.) to get the same compression ratio as existing BAM with much faster and more symmetric times.

There's something to be said for making this configurable. In a pipeline environment we'd probably want fast reading and writing, while in an archival environment we could recode the files for maximum compression. Although BAM really isn't the best tool for the archival job: think SAM -> something -> SAM, where "something" is a role BAM isn't well suited to, IMO. (E.g. compare it to CRAM or Goby.)
Old 07-26-2012, 06:29 AM   #59
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 146

So, a quick example using my sam_comp (v1, not the IMO poor v2) program.
Note this is NOT a general-purpose SAM compression program. It lacks many useful features and cannot handle auxiliary fields at all. So don't use it unless you want to contribute the missing features (and IMO it's better to all pull together on one of the existing alternatives, like CRAM or Goby, than to create yet another).

Code:
jkb@deskpro102485[/tmp] time samtools view -S -b a > b; time samtools view -h b > c; ls -l a b c
[samopen] SAM header is present: 7 sequences.
cmp a c

real    0m11.269s
user    0m10.413s
sys     0m0.352s

real    0m2.796s
user    0m2.120s
sys     0m0.324s
-rw-r--r-- 1 jkb team117 257837672 Jul 26 15:14 a
-rw-r--r-- 1 jkb team117  49795614 Jul 26 15:17 b
-rw-r--r-- 1 jkb team117 257837672 Jul 26 15:17 c


Here "a" is a SAM file (minus aux fields, so just 11 columns) of 1 million reads from a C. elegans mapping (SRR065390_1). sam_comp can't handle headers (I told you it wasn't general-purpose!), but we can fix that quickly:

Code:
jkb@deskpro102485[/tmp] egrep -v '^@' a > A
jkb@deskpro102485[/tmp] time sam_comp < A > b; time sam_comp -d < b > c; ls -l A b c

real    0m9.805s
user    0m9.517s
sys     0m0.260s

real    0m11.865s
user    0m11.213s
sys     0m0.488s
-rw-r--r-- 1 jkb team117 257837308 Jul 26 15:16 A
-rw-r--r-- 1 jkb team117  25568326 Jul 26 15:16 b
-rw-r--r-- 1 jkb team117 257837308 Jul 26 15:16 c


All things considered, it's not useful for your purpose: it's slightly faster than gzip at compression and much slower at decompression, but the file is half the size for the same data (ignoring the minor header).

However, it demonstrates how much extra room there is for efficient compression (even without references). Equally, if we're happy with the level of compression BAM already gives us, it shows we ought to be able to speed things up substantially by using faster but less size-optimal algorithms.

(I haven't proved that to be true, though, except for FASTQ compression, where it is easy to beat gzip on pretty much every metric simultaneously.)

I don't really have time to experiment with lighter-weight SAM compressors to see what's really possible, so it's a bit hand-wavy!
Old 07-26-2012, 07:47 AM   #60
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

I got openge and bamtools compiled on an old kernel.
I had to build a more recent compiler (4.6.3), which basically takes all day navigating the land mines. The wounds will heal.

Then I set PATH to the new compiler's bin directory to run the new gcc.

I had to add this line to about 8 files ...
Code:
#define PR_SET_NAME 15
... because it wasn't in the current machine's include files.

I then set the library path (export LD_LIBRARY_PATH ...) to the new gcc's lib directories (plural? I set both lib dirs) to run executables generated with the new compiler.

So I will check it out and let you know.

Tags
sam, samtools



Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.