  • carolW
    Senior Member
    • Apr 2013
    • 103

    fastq2bam

    Hi,
    Which tool is better for converting FASTQ to BAM: Picard, samtools, or another that you would suggest? It seems that Picard has different converters depending on which technology generated the FASTQ. Does it matter if the wrong one is applied, for example fastq-solexa when the FASTQ was not generated by Solexa?

    Cheers,

    Carol
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    Generally FASTQ to BAM means aligning reads to a reference.

    And yes, it is important that the FASTQ encoding is correctly set for this. Using the old (and no longer used) Solexa/Illumina FASTQ encoding rather than the (now standard) Sanger FASTQ encoding would result in wrong read quality scores in the BAM file.
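As a rough illustration of checking the encoding before conversion, here is a minimal Python sketch that guesses the ASCII offset from a sample of quality strings. The function name `guess_quality_offset` is hypothetical, and the rule is only a heuristic: characters below ASCII 59 cannot occur in the old offset-64 encodings, but a file containing only high qualities remains ambiguous.

```python
def guess_quality_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) used by FASTQ quality strings.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in the
    old offset-64 encodings, so seeing one implies Sanger (offset 33).
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        return 33  # Sanger / Illumina 1.8+
    # Probably Solexa or Illumina 1.3-1.7, but a Sanger file with only
    # high qualities would also end up here.
    return 64


print(guess_quality_offset(["IIII#III", "5:<=IIII"]))  # 33
print(guess_quality_offset(["hhhhdhhh", "ffffBfff"]))  # 64
```

Sampling a few thousand reads rather than the whole file is usually enough for this kind of check.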


    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      @Peter: I think @carolW is referring to FastqToSam from Picard tools which stores reads in unaligned BAM format.


      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Good point. But yes, if you do mean storing unaligned reads from FASTQ files as SAM/BAM files, the same applies to checking the quality score encoding.
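To make the encoding issue concrete, the sketch below re-encodes a Phred+64 quality string (Illumina 1.3 to 1.7) as Sanger Phred+33 by shifting each character. The function name `phred64_to_phred33` is made up for illustration, and the sketch assumes Phred-scaled scores; true Solexa scores use a different scale and need more than an offset shift.

```python
def phred64_to_phred33(quality):
    """Re-encode an Illumina 1.3-1.7 (Phred+64) quality string as Sanger (Phred+33)."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # numeric Phred score under the old offset
        if score < 0:
            raise ValueError(f"{ch!r} is not valid in a Phred+64 string")
        converted.append(chr(score + 33))  # same score, Sanger offset
    return "".join(converted)


print(phred64_to_phred33("hhfB"))  # IIG#
```

Writing unconverted Phred+64 strings into a BAM, which expects Phred+33, would overstate every score by 31.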


        • carolW
          Senior Member
          • Apr 2013
          • 103

          #5
          As a matter of fact, I want to convert BAM to FASTQ, as FASTQ takes less space, and yes, the BAMs are unaligned. In parallel, I wanted a tool that does the reverse conversion, to find out whether the FASTQ files contain all the necessary information from the original BAM files. Would it be enough to compare the size of the BAM converted from FASTQ with the size of the original BAM to determine whether the FASTQ is equivalent to the original BAM?

          And what would be the best tool? Picard or something else?


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            Use BamHash to compare the data: https://github.com/DecodeGenetics/BamHash

            Raw file sizes are not a good indicator.
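The idea of comparing content rather than size can be sketched in Python as an order-independent fingerprint over read name, sequence, and quality. This is not BamHash's implementation; the function names and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib


def parse_fastq(lines):
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def read_fingerprint(records):
    """Order-independent fingerprint of (name, sequence, quality) records."""
    total = 0
    for name, seq, qual in records:
        digest = hashlib.sha256(f"{name}\t{seq}\t{qual}".encode()).digest()
        # Summing per-read hashes makes the result independent of read order.
        total = (total + int.from_bytes(digest[:8], "big")) % 2**64
    return total


a = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "IIHH"]
b = ["@r2", "TTGA", "+", "IIHH", "@r1", "ACGT", "+", "IIII"]  # same reads, reordered
print(read_fingerprint(parse_fastq(a)) == read_fingerprint(parse_fastq(b)))  # True
```

Two files with the same reads give the same fingerprint regardless of compression level or read order, which is exactly what a file size cannot tell you.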


            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              Originally posted by carolW
              As a matter of fact, I want to convert BAM to FASTQ, as FASTQ takes less space, and yes, the BAMs are unaligned. In parallel, I wanted a tool that does the reverse conversion, to find out whether the FASTQ files contain all the necessary information from the original BAM files. Would it be enough to compare the size of the BAM converted from FASTQ with the size of the original BAM to determine whether the FASTQ is equivalent to the original BAM?

              And what would be the best tool? Picard or something else?
              I would simply gzip-compress to a high level (such as 8, using pigz) if you want to save space, or use pbzip2 for even higher compression. SAM and BAM are poor formats for unaligned reads, as it is much more difficult to determine how the read pairing is organized compared to FASTQ, which is the universal standard for raw sequence data. Storing data in anything other than the universal standards (FASTQ, FASTA, and gzip) gives you a small increase in compression for a huge increase in the probability that you made a very bad choice.

              Edit - SRA is a great example of why this is a bad idea. It causes problems for everyone who uses it.
              And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?
              Last edited by Brian Bushnell; 01-28-2016, 07:09 PM.


              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Brian, see https://github.com/DecodeGenetics/BamHash as mentioned earlier, the authors describe it as a tool to: "Hash BAM and FASTQ files to verify data integrity... The result can be compared to verify that the pair of FASTQ files contain the same read information as the aligned BAM file."


                • carolW
                  Senior Member
                  • Apr 2013
                  • 103

                  #9
                  Originally posted by Brian Bushnell
                  I would simply gzip-compress to a high level (such as 8, using pigz) if you want to save space, or use pbzip2 for even higher compression. SAM and BAM are poor formats for unaligned reads, as it is much more difficult to determine how the read pairing is organized compared to FASTQ, which is the universal standard for raw sequence data. Storing data in anything other than the universal standards (FASTQ, FASTA, and gzip) gives you a small increase in compression for a huge increase in the probability that you made a very bad choice.

                  Edit - SRA is a great example of why this is a bad idea. It causes problems for everyone who uses it.
                  And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?

                  How do pigz and pbzip2 compare with EBI's CRAM? Does pbzip2 compress even more than CRAM?

                  The original data are BAM. BAM can very well be used for alignment, but I convert to FASTQ for alignment. Moreover, as FASTQ is text, I thought it could be compressed significantly more than BAM.

                  I tried to compress a 40G BAM with pbzip2 using the -9 option and didn't gain anything, as the bz2 file was still 40G at the end. This might be due to the fact that the BAM file is a collection of smaller BAM files in one BAM file.
                  Last edited by carolW; 01-29-2016, 07:16 AM.


                  • Brian Bushnell
                    Super Moderator
                    • Jan 2014
                    • 2709

                    #10
                    Compressing a compressed file will usually not give any benefit; you have to compress the raw data. In fact, compressing a compressed file will often result in slightly larger output.
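This effect is easy to reproduce. The Python sketch below uses synthetic random DNA as a stand-in for raw read data (an assumption; real reads compress differently) and compresses it once, then compresses the result again with zlib.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic "raw" sequence data: 200,000 random bases.
raw = "".join(random.choice("ACGT") for _ in range(200_000)).encode()

once = zlib.compress(raw, 9)    # compressing the raw data: large saving
twice = zlib.compress(once, 9)  # compressing the compressed data: no real saving

print(len(raw), len(once), len(twice))
```

The first pass shrinks the data to a fraction of its size, while the second pass has almost nothing left to remove. BAM is already block-compressed internally, which is why running bzip2 over a BAM file gains very little.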

                    For unaligned reads, bam compression is not much better than gzipped fastq. I don't have any numbers but I would expect gzipped fastq to be a few percent bigger than bam, and bzip2 to be a few percent smaller (on the order of 5-10%, I'd imagine), and cram to be even smaller. For mapped sorted reads, though, bam and cram become substantially more efficient.

                    Incidentally, I wrote a program called "Clumpify" that can rearrange sequence data (fastq, fasta, sam, whatever) files to compress smaller by putting overlapping reads near each other. It's in the BBMap package. If you want to maximally compress the data, and it is not aligned, you can run that prior to putting the files in whatever format you decide on.


                    • dpryan
                      Devon Ryan
                      • Jul 2011
                      • 3478

                      #11
                      Originally posted by Brian Bushnell
                      And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?
                      BamHash can do that. It was originally made to compare FASTQ and BAM files, but one could just as easily compare multiple BAM files.

                      Edit: I should have scrolled down! Peter already mentioned it!


                      • carolW
                        Senior Member
                        • Apr 2013
                        • 103

                        #12
                        If Picard converts BAM to FASTQ and FASTQ to BAM, is there any way to get back the original BAM through these two conversions? If so, which parameters should be used, and if not, why? What would differ between the two BAMs?


                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #13
                          It would depend on whether the initial BAM file contained only unaligned reads and nothing else. Conversion to fastq is otherwise a lossy process.
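To show why the conversion loses information, here is a minimal Python sketch of turning one SAM text record into a FASTQ record. The function name `sam_record_to_fastq` and the example record are hypothetical, and this is not how Picard implements it. Only the read name, sequence, and quality survive; the flag, mapping fields, and optional tags such as a read group are dropped.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_record_to_fastq(sam_line):
    """Convert one tab-separated SAM record to a 4-line FASTQ record."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x10:
        # Reverse-strand alignments store the reverse complement;
        # restore the orientation the sequencer produced.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    # Everything else (position, CIGAR, tags) has no place in FASTQ.
    return f"@{name}\n{seq}\n+\n{qual}\n"


unaligned = "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIHH\tRG:Z:lane1"
print(sam_record_to_fastq(unaligned), end="")
```

The `RG:Z:lane1` tag in the example has nowhere to go in the FASTQ output, so a BAM rebuilt from that FASTQ cannot contain it unless it is supplied again by hand.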


                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            Can you tell us again what exactly you are trying to do?

                            Are you asking if bam_start would be identical to bam_new in this example? (bam_start --> Picard bam2fastq --> Fastq --> Picard fastq2bam --> bam_new)

                            You can use bamhash on the two files and let us know what you find.


                            • carolW
                              Senior Member
                              • Apr 2013
                              • 103

                              #15
                              Yes, I am asking whether bam_start will be the same as bam_new. Does the file size not matter?
