
  • SAM to FASTQ converter - Picard

    Hi,
    I am having problems converting a large SAM file to FASTQ using Picard; the error messages relate to memory, heap space, or garbage collection.

    the command I am currently using is:
    java -Xmx2g -XX:-UseGCOverheadLimit -jar picard-tools-1.26/SamToFastq.jar INPUT=Apollo102b.sam FASTQ=102b_1.fq SECOND_END_FASTQ=102b_2.fq INCLUDE_NON_PF_READS=True VALIDATION_STRINGENCY=SILENT &

    What other flag could I set to increase the memory available to Java? Or maybe there is even a better idea how to translate SAM to FASTQ ...

  • #2
    Originally posted by sdm
    Or maybe there is even a better idea how to translate SAM to FASTQ ...
    Several suggestions in this thread, which was originally about BAM to FASTQ, also apply to SAM to FASTQ: http://seqanswers.com/forums/showthread.php?t=6164



    • #3
      samtofastq out of memory problem

      Hi all,

      I have the same issue: on small SAM files (730MB), samtofastq of Picard tools does a wonderful job:

      Code:
      [user]$ java -Xmx40g -jar /opt/picardtools/SamToFastq.jar I=RECORDS_IN_RAM=5000000
      [Thu Aug 04 08:55:46 CEST 2011] net.sf.picard.sam.SamToFastq INPUT=erx000019.sam FASTQ=/default_1.fastq SECOND_END_FASTQ=default_2.fastq MAX_RECORDS_IN_RAM=5000000    OUTPUT_PER_RG=false RE_REVERSE=true INCLUDE_NON_PF_READS=false READ1_TRIM=0 READ2_TRIM=0 TMP_DIR=/tmp VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false
      [Thu Aug 04 08:57:30 CEST 2011] net.sf.picard.sam.SamToFastq done.
      Runtime.totalMemory()=1179189248
      [user]$
      -rw-rw-r-- 1 user users 322M Aug  4 08:57 default_1.fastq
      -rw-rw-r-- 1 user users 322M Aug  4 08:57 default_2.fastq
      But on large SAM files (33G), samtofastq does not seem to work:
      Code:
      Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
      	at java.lang.String.substring(String.java:1951)
      	at net.sf.samtools.util.StringUtil.split(StringUtil.java:74)
      	at net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:307)
      	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:272)
      	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:244)
      	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:629)
      	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:607)
      	at net.sf.picard.sam.SamToFastq.doWork(SamToFastq.java:121)
      	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:157)
      	at net.sf.picard.sam.SamToFastq.main(SamToFastq.java:112)
      Does anyone have the same problem? Better still, has anyone fixed it?

      Thanks,
      Joachim
      www.bits.vib.be



      • #4
        samtofastq out of memory problem

        Hi all,

        I found some settings that work better. I adjusted the Java Virtual Machine settings as follows to run on our machine (24 CPUs, 96GB RAM):

        java -Xmx40g -XX:-UseGCOverheadLimit -XX:-UseParallelGC -jar /opt/picardtools/SamToFastq.jar I=erx000016.sam F=default_1.fastq F2=default_2.fastq MAX_RECORDS_IN_RAM=5000000

        Slowly but surely the FASTQ file is being filled (300MB now)... Let's hope it completes...

        Joachim
        Last edited by joachim.jacob; 08-04-2011, 12:34 AM. Reason: reporting possible solution
        www.bits.vib.be



        • #5
          samtofastq out of memory problem persists

          Hi all,

          No joy...

          My FASTQ file has now reached 304MB, but I now get the following error:

          Code:
          Runtime.totalMemory()=41518039040
          Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
          	at java.io.BufferedReader.readLine(BufferedReader.java:348)
          	at java.io.BufferedReader.readLine(BufferedReader.java:379)
          	at net.sf.samtools.util.BufferedLineReader.readLine(BufferedLineReader.java:65)
          	at net.sf.samtools.util.AsciiLineReader.readLine(AsciiLineReader.java:75)
          	at net.sf.samtools.SAMTextReader.advanceLine(SAMTextReader.java:203)
          	at net.sf.samtools.SAMTextReader.access$300(SAMTextReader.java:40)
          	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:274)
          	at net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:244)
          	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:629)
          	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:607)
          	at net.sf.picard.sam.SamToFastq.doWork(SamToFastq.java:121)
          	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:157)
          	at net.sf.picard.sam.SamToFastq.main(SamToFastq.java:112)
          Last edited by joachim.jacob; 08-04-2011, 01:38 AM.
          www.bits.vib.be



          • #6
            Hi sdm, Joachim,

            Most Picard tools are designed to run with a 2GB JVM heap, so using -Xmx40g (IMO) wouldn't make a difference. IMHO what you should check is the "TMP_DIR=file" parameter. On the cluster I work on, the default temp directory it chose sometimes ran out of space. It's worth a try.

            best.
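One quick way to rule out a full temp directory before re-running is to check how much free space it has. The sketch below uses only the Python standard library; the helper name `free_gib` is made up for illustration, and the directory it checks is Python's idea of the default temp directory, which is an assumption and may differ from the one Picard actually uses.

```python
import shutil
import tempfile

def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / 2**30

if __name__ == "__main__":
    # Assumption: checking the system default temp directory.
    # Replace it with whatever directory you pass as TMP_DIR.
    tmp_dir = tempfile.gettempdir()
    print(f"{tmp_dir}: {free_gib(tmp_dir):.1f} GiB free")
```

If the number is small compared with the size of the input file, pointing TMP_DIR at a larger volume is worth trying before changing any JVM flags.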



            • #7
              Hi,

              these flags have worked for me after some trial and error:
              java -Xmx3g -XX:-UseGCOverheadLimit -jar SamToFastq.jar

              Not sure if it works in every context.



              • #8
                Changing TMP_DIR does not work

                Thanks all for your suggestions. Unfortunately, changing TMP_DIR to a bigger location does not work.

                The FASTQ file hangs at 304MB and I get a Java heap space error.

                @sdm: thanks for your reply. My -Xmx is already set to 55g (it used to be 2g).

                I seem to have had the most success by changing MAX_RECORDS_IN_RAM to 5000000. I will try a little further and keep you posted!
                www.bits.vib.be



                • #9
                  Joachim,
                  Since it seems to work on small files for you (and the ones I worked on are around 8-12GB...), it seems to me that it has more to do with the code. Check this link for the explanation in the accepted answer.
                  I get this error message as I execute my JUnit tests: java.lang.OutOfMemoryError: GC overhead limit exceeded I know what an OutOfMemoryError is, but what does GC overhead limit mean? How can I solve


                  Best.



                  • #10
                    Originally posted by joachim.jacob
                    Thanks all for your suggestions. Unfortunately, changing TMP_DIR to a bigger location does not work.

                    The FASTQ file hangs at 304MB and I get a Java heap space error.

                    @sdm: thanks for your reply. My -Xmx is already set to 55g (it used to be 2g).

                    I seem to have had the most success by changing MAX_RECORDS_IN_RAM to 5000000. I will try a little further and keep you posted!

                    Hi Joachim, did you solve this problem? I added MAX_RECORDS_IN_RAM=5000000; the FASTQ files get larger, but I still get the Java heap space error at the end. Do you have any other suggestions? Thank you.



                    • #11
                      Have you had any luck solving this issue? I am having the same problem with various BAM files of ~7-10GB in size.

                      I have had success using BAM files of similar size generated in-house and from collaborators. So I was surprised when these same parameters no longer seem to work. I am starting to wonder if there is something unique to how these most recent BAM files were processed. I have gone to the picard commands page and didn't see any particular processing steps that were required. Any suggestions?



                      • #12
                        Picard's BAM to FASTQ conversion is slow and takes a lot of memory.
                        If you have lots of time and memory, that is not a problem.

                        If you don't have a lot of time or memory, try my solution presented in this thread:
                        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                        Warning: you'll have to download the AVL library and compile it yourself using a C compiler (gcc or other). A modicum of experience in compiling and editing source files is required.

                        It was developed to run on low-memory Beowulf nodes and not take all day.



                        • #13
                          kudos to you, my friend! works like a charm!



                          • #14
                            Time to advertise our HTSeq library, which lets you do such tasks in two lines. And it certainly won't use any noticeable amount of memory.

                            Try this:

                            Code:
                            import sys, HTSeq
                            
                            # Stream the SAM file one alignment at a time and write each read as FASTQ
                            for a in HTSeq.SAM_Reader( "myfile.sam" ):
                               a.read.write_to_fastq_file( sys.stdout )
                            The following "more advanced" version makes sure that each read is written only once even if the SAM file contains multiple alignments for it. This requires the SAM file to be sorted by read name (with 'samtools sort -n'), so that multiple alignments of the same read are on adjacent lines.

                            Code:
                            import sys, HTSeq
                            
                            # Each bundle is a list of adjacent alignments that share a read name
                            for a in HTSeq.bundle_multiple_alignments( HTSeq.SAM_Reader( "myfile.sam" ) ):
                               a[0].read.write_to_fastq_file( sys.stdout )
                            (The code is untested, so sorry in advance for any typos.)
                            Last edited by Simon Anders; 03-19-2013, 10:01 AM.
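For readers without HTSeq installed, the same streaming idea can be sketched with the standard library alone. This is an illustrative sketch, not HTSeq's or Picard's implementation: the function name `sam_to_fastq` is made up, it assumes a well-formed, tab-separated SAM text file, it skips secondary and supplementary alignments as well as records without stored sequence or qualities, and it appends /1 or /2 to mate names. Only one line is held in memory at a time.

```python
# Translation table used to reverse-complement reads aligned to the reverse strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_to_fastq(sam_lines, out):
    """Write one FASTQ record per primary SAM record, processing line by line."""
    for line in sam_lines:
        if line.startswith("@"):       # SAM header lines start with '@'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:           # not a complete alignment record
            continue
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & 0x900:               # secondary (0x100) or supplementary (0x800)
            continue
        if seq == "*" or qual == "*":  # sequence or qualities not stored
            continue
        if flag & 0x10:                # reverse strand: restore the original read
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        if flag & 0x40:                # first mate of a pair
            name += "/1"
        elif flag & 0x80:              # second mate of a pair
            name += "/2"
        out.write(f"@{name}\n{seq}\n+\n{qual}\n")
```

To use it, open the SAM file and pass the file object together with an output stream, for example `sam_to_fastq(open("myfile.sam"), sys.stdout)` after `import sys`. Iterating over a file object reads it lazily, so memory use should stay flat regardless of file size.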



                            • #15
                              A better approach to group two ends together is "htscmd bamshuf" from htslib. It is much faster than name sorting.
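Whichever tool does the grouping, once all records of a read are adjacent the two ends can be split without holding more than one read name in memory. The sketch below only illustrates that idea; it is not what htscmd or Picard do internally. The function name `split_mates` and the `(name, flag, seq, qual)` tuple layout are assumptions made for the example, and reads missing one mate are silently dropped.

```python
from itertools import groupby

def split_mates(records):
    """Yield (name, mate1, mate2) from records already grouped by read name.

    `records` is an iterable of (name, flag, seq, qual) tuples in which all
    records sharing a name are adjacent. Each mate is a (seq, qual) tuple.
    """
    for name, group in groupby(records, key=lambda rec: rec[0]):
        mate1 = mate2 = None
        for _, flag, seq, qual in group:
            if flag & 0x900:   # skip secondary and supplementary alignments
                continue
            if flag & 0x40:    # first in pair
                mate1 = (seq, qual)
            elif flag & 0x80:  # second in pair
                mate2 = (seq, qual)
        if mate1 and mate2:    # emit only complete pairs
            yield name, mate1, mate2
```

Each yielded pair can then be written to the first-end and second-end FASTQ files in the same order, which keeps the two files in sync.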
