  • samtools sorting problem

    Hi everyone,

    I encountered a strange problem with samtools sorting. A 37.6 GB sam file was generated by Stampy after mapping Illumina reads to hg19. It was converted to an 11.3 GB bam file with samtools. Then I tried to sort it, and this is where I hit the problem. I first used the default settings (1 thread, 768MB/thread):
    Code:
    samtools sort input.sam out_sorted.bam
    After a while samtools started spitting out chunks named out_sorted.bam000X.bam, each new one started after the previous reached 130-160 MB; at the 21st chunk I killed it.
    Then I increased memory:
    Code:
    samtools sort -m 10G input.sam out_sorted.bam
    This time samtools spit out only 6 chunks of 1.5-2.5 GB, then started pouring binary gibberish to stdout, and eventually hung.
    When I tried to run multithreaded sorting
    Code:
    samtools sort -@ 8 -m 3G input.sam out_sorted.bam
    the behavior was the same except chunks were spit in multiples of 8, with -@ 16 in multiples of 16, but eventually all ended up with binary gibberish to stdout.
    I am using version 0.1.19-44428cd; the 2x4 cpu box has 96GB of memory, RHEL5.8.
    Can anyone advise what is going on and why?

  • #2
    A 37 gig sam file?

    Why haven't you converted it to .bam?

    I didn't think samtools would even work on a .sam file.

    Also note that if your command line worked, it would make an output file called output.bam.bam

    Why did you kill the sort function? It is supposed to make all those intermediate files, then it merges them.



    • #3
      Originally posted by swbarnes2 View Post
      A 37 gig sam file?

      Why haven't you converted it to .bam?

      I didn't think samtools would even work on a .sam file.

      Also note that if your command line worked, it would make an output file called output.bam.bam

      Why did you kill the sort function? It is supposed to make all those intermediate files, then it merges them.
      This is what Stampy produced, not my choice. Then I converted it with samtools view, which does take sam files with the -S option:
      Code:
      samtools view -bS -@ 16 infile.sam -o outfile.bam
      If you read my previous post again, I got not output.bam.bam but output.bam000X.bam, where X was incremented for each new chunk. I killed it because it did not look right; it could have produced hundreds of intermediate files. Indeed, with more allocated memory the file chunks became larger and fewer, yet every time I waited, instead of a combined file I got binary gibberish on stdout. The chunk size seemed to be roughly 1/5 of the allocated memory (150-160MB at the default 768MB, 2-2.5GB at 10GB, and 11GB at 48GB). Finally I used just the one default thread but gave it 72 GB: no intermediate file outfile_sorted.bam0000.bam was formed, yet after a while the same binary gibberish was spilled to stdout.
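
The chunk counts reported in this thread can be sanity-checked with rough arithmetic. The sketch below assumes the temporary files together add up to roughly the size of the 11.3 GB BAM (an approximation, not something the samtools documentation states) and uses the chunk sizes reported by the poster. `estimate_chunks` is a hypothetical helper name.

```shell
#!/bin/sh
# Rough estimate of how many temporary files a sort will write.
# Assumption (not from the samtools docs): the chunks together are about
# as large as the input BAM.

estimate_chunks() {
    # $1 = input BAM size in MB, $2 = observed chunk size in MB
    # Integer ceiling division: (a + b - 1) / b
    echo $(( ($1 + $2 - 1) / $2 ))
}

bam_mb=11300   # 11.3 GB BAM from the thread, in decimal MB

echo "default memory, ~150 MB chunks: $(estimate_chunks "$bam_mb" 150) files"
echo "-m 10G, ~2000 MB chunks:        $(estimate_chunks "$bam_mb" 2000) files"
```

Under these assumptions the default run would stop at around 76 files, and the 10 GB run at about 6, which agrees with the six chunks reported in the first post.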



      • #4
        What about simply trying

        Code:
        samtools sort input.bam out_sorted



        • #5
          Originally posted by syfo View Post
          What about simply trying

          Code:
          samtools sort input.bam out_sorted
          Well, the only difference is that instead of hundreds of out_sorted.bam000X.bam I get hundreds of 150MB out_sorted000X.bam files and then binary gibberish if I wait long enough. Not an enticing reason to try...



          • #6
            The temporary files are normal. And I guess the binary gibberish could actually be your sorted output bam file.
            I usually use bamtools sort, but have you tried
            Code:
            samtools sort input.bam > output.bam



            • #7
              Originally posted by yaximik View Post
              Well, the only difference is that instead of hundreds of out_sorted.bam000X.bam I get hundreds of 150MB out_sorted000X.bam files and then binary gibberish if I wait long enough. Not an enticing reason to try...
              How do you know? Have you tried to use it on a proper bam file?
              According to the manual:

              Code:
              Usage:   samtools sort [options] <in.bam> <out.prefix>
              I do not get the point of keeping huge sam files.



              • #8
                Originally posted by yaximik View Post
                This is what Stampy produced, not my choice. Then I converted it with samtools view, which does take sam files with the -S option:
                Code:
                samtools view -bS -@ 16 infile.sam -o outfile.bam
                If you read my previous post again, I got not output.bam.bam but output.bam000X.bam, where X was incremented for each new chunk.
                And this is totally in line with how samtools sort works.

                I killed it because it did not look right, it could have produced hundreds of intermediate files.
                Probably not hundreds, but it is supposed to produce a lot of them. And then it deletes them after it merges them together.

                And again, why do you still have a .sam file?

                bwa outputs a .sam file too, but I don't leave them lying around. I make them into .bams as soon as possible, piping the program that makes the .sam file directly into samtools view, where possible, and I don't keep the .sams once I have .bams.

                And I don't see any documentation for samtools sort that says it takes a .sam file as input. I'm using samtools 0.1.18, and it doesn't.
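
The habit described in this post, piping the mapper straight into samtools view so no large .sam file is left on disk, might look like the following sketch. `stream_to_bam` and the `SAM_TO_BAM` override are hypothetical names introduced here so the pipeline can be tried without samtools installed; the mapper command in the usage comment is only a placeholder.

```shell
#!/bin/sh
# Sketch: turn SAM text arriving on stdin into a BAM file, skipping the
# intermediate .sam on disk.
# SAM_TO_BAM is a hypothetical override (not a samtools feature); by default
# it runs "samtools view -bS -", which reads SAM from stdin and writes BAM
# to stdout.
: "${SAM_TO_BAM:=samtools view -bS -}"

stream_to_bam() {
    # $1 = path of the BAM file to write; SAM is read from stdin
    # $SAM_TO_BAM is left unquoted on purpose so it splits into command + args
    $SAM_TO_BAM > "$1"
}

# Typical use (placeholder mapper command and file names):
#   bwa mem ref.fa reads.fq | stream_to_bam aln.bam
```

The resulting BAM is still unsorted, so it would be passed to samtools sort afterwards.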



                • #9
                  [SOLVED] samtools sorting problem

                  Originally posted by swbarnes2 View Post
                  And this is totally in line with how samtools sort works.



                  Probably not hundreds, but it is supposed to produce a lot of them. And then it deletes them after it merges them together.

                  And again, why do you still have a .sam file?

                  bwa outputs a .sam file too, but I don't leave them lying around. I make them into .bams as soon as possible, piping the program that makes the .sam file directly into samtools view, where possible, and I don't keep the .sams once I have .bams.

                  And I don't see any documentation for samtools sort that says it takes a .sam file as input. I'm using samtools 0.1.18, and it doesn't.
                  Samtools 0.1.19 does take a sam file, and I did convert it to bam once I got the output from Stampy, which produces sam by default.

                  I got an answer from the samtools-help mailing list. Darn, that was so simple! In samtools sort, the option -o means output to stdout, so I was getting what I asked for. I confused this with other programs, in which the option -o outfile does exactly the opposite. syfo was absolutely correct, and I apologize for brushing off the correct advice.
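
For readers hitting the same thing, the two command-line conventions can be put side by side. The sketch below only prints the commands and does not run samtools. `print_sort_cmd` is a hypothetical helper; the legacy form follows the 0.1.19 usage line quoted earlier in the thread, while the 1.x form comes from the current samtools manual rather than from this thread, so check the usage text of your own installation.

```shell
#!/bin/sh
# Print (not run) a samtools sort command in one of three styles.

print_sort_cmd() {
    # $1 = style (legacy | legacy-stdout | modern)
    # $2 = input BAM, $3 = output name without the .bam suffix
    case "$1" in
        legacy)
            # 0.1.x: the second argument is a prefix; samtools appends ".bam"
            echo "samtools sort $2 $3" ;;
        legacy-stdout)
            # 0.1.x: -o is a flag meaning "final output to stdout", so redirect it
            echo "samtools sort -o $2 $3 > $3.bam" ;;
        modern)
            # 1.x: -o takes the output file name as its argument
            echo "samtools sort -o $3.bam $2" ;;
        *)
            echo "unknown style: $1" >&2
            return 1 ;;
    esac
}

print_sort_cmd legacy input.bam out_sorted
print_sort_cmd legacy-stdout input.bam out_sorted
print_sort_cmd modern input.bam out_sorted
```

Running a 0.1.x sort with -o and no redirection sends the sorted BAM to the terminal, which is the "binary gibberish" described in this thread.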



                  • #10
                    Originally posted by yaximik View Post
                    Samtools 0.1.19 does take a sam file, and I did convert it to bam once I got the output from Stampy, which produces sam by default.

                    I got an answer from the samtools-help mailing list. Darn, that was so simple! In samtools sort, the option -o means output to stdout, so I was getting what I asked for. I confused this with other programs, in which the option -o outfile does exactly the opposite. syfo was absolutely correct, and I apologize for brushing off the correct advice.
                    No problem. Thanks for the update, that could help others.
