Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #46
    See my branch: https://github.com/nh13/samtools

    The bam.h has a define that substitutes the "bgzf" functions for the equivalent "pbgzf" functions.

    Comment


    • #47
      Quick comment on Nils' parallelized samtools (not adaptivegenomics C++ version).

      The __sync_fetch_and_add function is a gcc specific function. My actual "hard iron" machines run a very old kernel and the library with the "atomic builtins" doesn't work with older kernels. (??? i think). It's unlikely it will work with non gcc compilers. It's just this one function. Can the code be made to work without this function ?

      I can compile and static link on me coolness Ubuntu box, but the kernel on the "hard iron" is too old.

      I'm checking out the adaptivegenomics stuff but it's tough slogging ... download/compile GIT, download compile CMAKE .... pray there's compatible libraries .. more later.

      Comment


      • #48
        Originally posted by Richard Finney View Post
        Quick comment on Nils' parallelized samtools (not adaptivegenomics C++ version).

        The __sync_fetch_and_add function is a gcc specific function. My actual "hard iron" machines run a very old kernel and the library with the "atomic builtins" doesn't work with older kernels. (??? i think). It's unlikely it will work with non gcc compilers. It's just this one function. Can the code be made to work without this function ?

        I can compile and static link on me coolness Ubuntu box, but the kernel on the "hard iron" is too old.

        I'm checking out the adaptivegenomics stuff but it's tough slogging ... download/compile GIT, download compile CMAKE .... pray there's compatible libraries .. more later.
        If you have trouble let me know!

        Comment


        • #49
          Does openge+bamtools require Boost 1.45.0 exactly or same or more recent than Boost version 1.45.0

          The cmake stuff is a real pain. Dumbing it down to a makefile might be good.

          Chasing down messages like this is not fun on a bun ...
          -bash-3.00$ ../../cmake-2.8.7/bin/cmake ..
          CMake Error: Error in cmake code at
          /h1/finneyr/cmake-2.8.7/Modules/FindBoost.cmake:1:
          Parse error. Expected "(", got identifier with text "version".
          CMake Error at openge/CMakeLists.txt:5 (find_package):
          find_package Error reading CMake code from
          "/h1/finneyr/cmake-2.8.7/Modules/FindBoost.cmake".

          -- Configuring incomplete, errors occurred!

          [ edit ... okay... looks like I overwrote Findboost.cmake somehow ... maybe by dot.sourcing it cuz I thought it was a bash script OR by putting it as argument to cmake ... so ... downloading a new cmake ... ]

          Need requirements and instructions on how to make in the README.md file. Pointing to the PDF is nice, but I'm on a command line.
          Last edited by Richard Finney; 07-24-2012, 02:42 PM.

          Comment


          • #50
            Originally posted by Richard Finney View Post
            Does openge+bamtools require Boost 1.45.0 exactly or same or more recent than Boost version 1.45.0

            The cmake stuff is a real pain. Dumbing it down to a makefile might be good.

            Chasing down messages like this is not fun on a bun ...
            -bash-3.00$ ../../cmake-2.8.7/bin/cmake ..
            CMake Error: Error in cmake code at
            /h1/finneyr/cmake-2.8.7/Modules/FindBoost.cmake:1:
            Parse error. Expected "(", got identifier with text "version".
            CMake Error at openge/CMakeLists.txt:5 (find_package):
            find_package Error reading CMake code from
            "/h1/finneyr/cmake-2.8.7/Modules/FindBoost.cmake".

            -- Configuring incomplete, errors occurred!

            [ edit ... okay... looks like I overwrote Findboost.cmake somehow ... maybe by dot.sourcing it cuz I thought it was a bash script OR by putting it as argument to cmake ... so ... downloading a new cmake ... ]

            Need requirements and instructions on how to make in the README.md file. Pointing to the PDF is nice, but I'm on a command line.
            So I think you did end up losing that file. Hopefully it is working now? I agree with you, we should have a non-PDF explanation. There actually in the /doc directory is the LaTeX source files that are sort of readable at the command line but we will go ahead and make a standard TEXT version as well.

            Sorry for the trouble! Please let me know if you have any other issues, I certainly don't want compiling to be a pain. What OS are you using?

            Comment


            • #51
              Can't seem to coerece cmake into liking my setup ...
              Boost is in /h1/finneyr/boost_1_49_0/boost/



              #environment has been set ...
              -bash-3.00$ set | grep -i boost
              BOOST_INCLUDEDIR=/h1/finneyr/boost_1_49_0/boost
              BOOST_ROOT=/h1/finneyr/boost_1_49_0/boost

              # to force it to start from the begining without caching (this is silly but necessary)
              -bash-3.00$ rm CMakeCache.txt ; rm -r CMakeFiles

              -bash-3.00$ ../../cmake-2.8.8/bin/cmake .. 2>&1 | tail -20
              -- Check for working CXX compiler: /usr/bin/c++
              -- Check for working CXX compiler: /usr/bin/c++ -- works
              -- Detecting CXX compiler ABI info
              -- Detecting CXX compiler ABI info - done
              CMake Error at /h1/finneyr/cmake-2.8.8/Modules/FindBoost.cmake:1207 (message):
              Unable to find the requested Boost libraries.

              Unable to find the Boost header files. Please set BOOST_ROOT to the root
              directory containing Boost or BOOST_INCLUDEDIR to the directory containing
              Boost's headers.
              Call Stack (most recent call first):
              openge/CMakeLists.txt:5 (find_package)


              CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
              Please set them or make sure they are set and tested correctly in the CMake files:
              Boost_INCLUDE_DIR (ADVANCED)
              used as include directory in directory /h1/finneyr/adaptivegenome-openge-e48ecc0/openge/src

              -- Configuring incomplete, errors occurred!


              _____

              export BOOST_ROOT=/h1/finneyr/boost_1_49_0/
              doesn't work either.
              Last edited by Richard Finney; 07-24-2012, 04:56 PM.

              Comment


              • #52
                Richard,

                If you have boost libraries installed in non-standard locations, you add these locations to CMAKE_PREFIX_PATH so that CMake will be able to find them. Setting CMAKE_PREFIX_PATH to /h1/finneyr/boost_1_49_0/boost/ before running CMake should fix your problem.

                The BOOST_INCLUDE_DIR and BOOST_ROOT that you are setting aren't environment variables, but variables inside the CMake script. The FindBoost.cmake script that is mentioned in the thread comes with CMake, and is used locate boost by searching a bunch of directories, and then sets BOOST_INCLUDEDIR and BOOST_ROOT when it is found.

                More about CMAKE_PREFIX_PATH: http://www.cmake.org/Wiki/CMake_Usef...ment_Variables

                Let me know how it goes!

                Originally posted by Richard Finney View Post
                Can't seem to coerece cmake into liking my setup ...
                Boost is in /h1/finneyr/boost_1_49_0/boost/



                #environment has been set ...
                -bash-3.00$ set | grep -i boost
                BOOST_INCLUDEDIR=/h1/finneyr/boost_1_49_0/boost
                BOOST_ROOT=/h1/finneyr/boost_1_49_0/boost

                # to force it to start from the begining without caching (this is silly but necessary)
                -bash-3.00$ rm CMakeCache.txt ; rm -r CMakeFiles

                -bash-3.00$ ../../cmake-2.8.8/bin/cmake .. 2>&1 | tail -20
                -- Check for working CXX compiler: /usr/bin/c++
                -- Check for working CXX compiler: /usr/bin/c++ -- works
                -- Detecting CXX compiler ABI info
                -- Detecting CXX compiler ABI info - done
                CMake Error at /h1/finneyr/cmake-2.8.8/Modules/FindBoost.cmake:1207 (message):
                Unable to find the requested Boost libraries.

                Unable to find the Boost header files. Please set BOOST_ROOT to the root
                directory containing Boost or BOOST_INCLUDEDIR to the directory containing
                Boost's headers.
                Call Stack (most recent call first):
                openge/CMakeLists.txt:5 (find_package)


                CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
                Please set them or make sure they are set and tested correctly in the CMake files:
                Boost_INCLUDE_DIR (ADVANCED)
                used as include directory in directory /h1/finneyr/adaptivegenome-openge-e48ecc0/openge/src

                -- Configuring incomplete, errors occurred!


                _____

                export BOOST_ROOT=/h1/finneyr/boost_1_49_0/
                doesn't work either.

                Comment


                • #53
                  Ack!
                  Found the problem.
                  New cmake is not compatible with boost 1.49
                  I downloaded boost 1.50 and the "stage/lib" directory is there and cmake finds it with this ...
                  Code:
                  rm -r CMake*
                  ../../cmake-2.8.8/bin/cmake  -D BOOST_LIBRARYDIR=/h1/finneyr/boost_1_50_0/stage/lib -D BOOST_ROOT=/h1/finneyr/boost_1_50_0/ -D CMAKE_PREFIX_PATH=/h1/finneyr/boost_1_50_0/stage/lib -D LIBRARY_SEARCH_DIRS=/h1/finneyr/boost_1_50_0/stage/lib ..
                  But my compiler is too old. The "atomic builtins" are there, example s__sync_synchronize() and __sync_lock_test_and_set().

                  rrrggghhh.

                  Time to download and compile the gcc ... and probably a later glib ...

                  Comment


                  • #54
                    Originally posted by Richard Finney View Post
                    New cmake is not compatible with boost 1.49
                    Hello Richard,

                    Primary developer of OpenGE here. My primary development machine uses CMake 2.8.8 (the latest) with Boost 1.49, and I haven't run into any problems with this version yet. Can you elaborate on the issues that you are having? If there is an incompatibility with a particular version of boost, we would like to get it fixed.

                    -Lee

                    Comment


                    • #55
                      I probably downloaded boost 1.49 a long time a ago and it's was probably not successfully built, so had no stage/lib directory.

                      Comment


                      • #56
                        Well keep us posted if you have more trouble and of course I am excited about your thoughts once you try it out. It looks like we should be able to fit some of Nils stuff into the code... will update shortly on that.

                        Comment


                        • #57
                          Originally posted by nilshomer View Post
                          Can you plugin the "pbgzf.c" source code into bamtools? I have been playing around with bamtools and I quite like it. The bottleneck in reading/writing BAM files is the compression/decompression.
                          This can probably be sped up by reordering and recoding data before passing into gzip et al. Yes it's no longer strictly BAM, but a content-aware compressor could do this and reverse it so it looks like you're passing in uncompressed BAM and getting uncompressed BAM back, but in reality it's shuffling data around.

                          Or more specifically, by reordering data per block (eg separate into all-names, all-seqs, all-qual, all-cigars, all-pos (delta), etc) we can either substantially improve compression ratios or as a flip side use lighter weight compression tools (lzop, lz4, snappy, etc) to get the same compression ratio as existing BAM but with much faster and more symmetric times.

                          There's something to be said for making this configurable. In a pipeline environment we'd probably want fast reading and writing and for an archive environment we can recode the files for maximum compression. Although, BAM really isn't the best for the archival job. SAM -> something -> SAM, where something is not well suited by BAM IMO. (Eg compare it to CRAM or GOBY.)

                          Comment


                          • #58
                            Interesting idea. In our code the file formats are actually abstracted from the functions. So you can feed, for example SAM or BAM to our mergesort. We are going to support CRAM as soon as they update to 0.9, right now I don't think they have stable spec.

                            But I think the idea of reordering the data to make it easier to compress is a good one. Like Nils, we find this compression step to be a major bottleneck.


                            Originally posted by jkbonfield View Post
                            This can probably be sped up by reordering and recoding data before passing into gzip et al. Yes it's no longer strictly BAM, but a content-aware compressor could do this and reverse it so it looks like you're passing in uncompressed BAM and getting uncompressed BAM back, but in reality it's shuffling data around.

                            Or more specifically, by reordering data per block (eg separate into all-names, all-seqs, all-qual, all-cigars, all-pos (delta), etc) we can either substantially improve compression ratios or as a flip side use lighter weight compression tools (lzop, lz4, snappy, etc) to get the same compression ratio as existing BAM but with much faster and more symmetric times.

                            There's something to be said for making this configurable. In a pipeline environment we'd probably want fast reading and writing and for an archive environment we can recode the files for maximum compression. Although, BAM really isn't the best for the archival job. SAM -> something -> SAM, where something is not well suited by BAM IMO. (Eg compare it to CRAM or GOBY.)

                            Comment


                            • #59
                              So a quick example using my sam_comp (v1, not the poor IMO v2) program.
                              Note this is NOT a general purpose sam compression program. It lacks many useful features and cannot even handle auxillary fields at all. So don't use it unless you want to contribute to the missing features (and IMO it's better to all pull together on one of the existing alternatives like CRAM or Goby than create yet another).

                              jkb@deskpro102485[/tmp] time samtools view -S -b a > b;time samtools view -h b > c;ls -l a b c
                              [samopen] SAM header is present: 7 sequences.
                              cmp a c

                              real 0m11.269s

                              user 0m10.413s
                              sys 0m0.352s

                              real 0m2.796s

                              user 0m2.120s
                              sys 0m0.324s
                              -rw-r--r-- 1 jkb team117 257837672 Jul 26 15:14 a
                              -rw-r--r-- 1 jkb team117 49795614 Jul 26 15:17 b
                              -rw-r--r-- 1 jkb team117 257837672 Jul 26 15:17 c


                              Here "a" is a sam file (minus aux fields, so just 11 columns) of 1 million reads from a C.Elegans map. (SRR065390_1). Sam_comp can't handle headers (I told you it wasn't general purpose!), but we can fix that quick:

                              jkb@deskpro102485[/tmp] egrep -v '^@' a > A
                              jkb@deskpro102485[/tmp] time sam_comp < A > b;time sam_comp -d < b > c;ls -l A b c

                              real 0m9.805s

                              user 0m9.517s
                              sys 0m0.260s

                              real 0m11.865s

                              user 0m11.213s
                              sys 0m0.488s
                              -rw-r--r-- 1 jkb team117 257837308 Jul 26 15:16 A
                              -rw-r--r-- 1 jkb team117 25568326 Jul 26 15:16 b
                              -rw-r--r-- 1 jkb team117 257837308 Jul 26 15:16 c


                              All things considered, it's not useful for your purpose. It's slightly faster than gzip at compression and much slower at decompression, but the file size is half the size for the same data (ignoring the minor header).

                              However it demonstrates how much extra room there is for efficient compression (even without references), but equally so if we're happy with the level of compression BAM already gives us then conversely it shows we ought to be able to speed it up substantially by using faster but less size-optimal algorithms.

                              (I haven't proved that to be true though except on fastq compression where it is easy to beat gzip on pretty much every metric simultaneously.)

                              I don't really have time though to experiment with lighter weight SAM compressors to see what's really possible so it's a bit hand-wavey!

                              Comment


                              • #60
                                I got opene and bamtools compiled on an old kernel.
                                I had to build a more recent compiler (4.6.3) which basically takes all day navigating the land mines. The wounds will heal.

                                Then set PATH to the new compiler's bin directory to run new gcc.

                                I had to add this line in about 8 files ...
                                #define PR_SET_NAME 15
                                because it wasn't in current machines include files.

                                I then set library directory (export LD_LIBRARY_PATH ... ) to new gcc's lib directories (plural? I set them to both lib dirs) to run executables generated using the new compiler.

                                So I will check it and let you know.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 08:47 AM
                                0 responses
                                12 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                60 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                59 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                54 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X