  • fastq2bam

    Hi,
    Which tool is better for converting FASTQ to BAM: Picard, samtools, or something else you would suggest? It seems that Picard has different converters depending on the technology that generated the FASTQ. Does it matter if I apply a converter that does not match the technology, for example fastq-solexa when the FASTQ was not generated by Solexa?

    Cheers,

    Carol

  • #2
    Generally FASTQ to BAM means aligning reads to a reference.

    And yes, it is important that the FASTQ encoding is correctly set for this. Using the old (and no longer used) Solexa/Illumina FASTQ encoding rather than the (now standard) Sanger FASTQ encoding would result in wrong read quality scores in the BAM file.
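To illustrate why the encoding matters, here is a minimal Python sketch. It assumes only the commonly documented ASCII offsets (33 for Sanger, 64 for the old Solexa/Illumina variants); the function names are made up for this example and are not part of Picard or samtools. The guess is a heuristic: a file with only high-quality characters is ambiguous.

```python
def guess_quality_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen."""
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        # Characters below ';' (ASCII 59) cannot occur in the +64 encodings.
        return 33
    # Only a guess: a Sanger file with uniformly high qualities looks the same.
    return 64


def decode_quality(quality, offset):
    """Convert a quality string to integer scores for the given offset."""
    return [ord(ch) - offset for ch in quality]


sanger_like = ["IIII#5", "FFFFFF"]
print(guess_quality_offset(sanger_like))
# The same character decoded with the wrong offset is off by 31.
print(decode_quality("h", 64), decode_quality("h", 33))
```

Decoding with the wrong offset shifts every score by 31, which is the kind of wrong quality value that would end up in the BAM.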


    • #3
      @Peter: I think @carolW is referring to FastqToSam from Picard tools which stores reads in unaligned BAM format.


      • #4
        Good point. But yes, if you do mean storing unaligned reads from FASTQ files as SAM/BAM files, the same applies to checking the quality score encoding.


        • #5
          As a matter of fact, I want to convert BAM to FASTQ, since FASTQ takes less space, and yes, the BAMs are unaligned. In parallel, I wanted a tool that does the reverse conversion, to find out whether the FASTQ files contain all the necessary information from the original BAM files. Would it be enough to compare the size of the BAM converted from FASTQ with the size of the original BAM to determine whether the FASTQ is equivalent to the original BAM?

          And what would be the best tool? Picard or any other tool?


          • #6
            Use BamHash to compare the data: https://github.com/DecodeGenetics/BamHash

            Raw file sizes are not a good indicator.
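The idea of comparing read content instead of file size can be sketched in Python as below. This is an illustration of the general approach only, not BamHash's code: `read_fingerprint` and the tuple layout are assumptions made for the example. Each read is hashed and the hashes are summed, so the result does not depend on read order or on how the file was compressed.

```python
import hashlib


def read_fingerprint(records):
    """Order-independent fingerprint of (name, sequence, quality) records.

    Returns (sum of 64-bit per-read hashes modulo 2**64, number of reads).
    """
    total = 0
    count = 0
    for name, seq, qual in records:
        digest = hashlib.md5(f"{name}\t{seq}\t{qual}".encode()).digest()
        total = (total + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return total, count


from_bam = [("r1", "ACGT", "IIII"), ("r2", "TTGA", "FFFF")]
from_fastq = [("r2", "TTGA", "FFFF"), ("r1", "ACGT", "IIII")]  # same reads, other order
print(read_fingerprint(from_bam) == read_fingerprint(from_fastq))
```

Two files holding the same reads give the same fingerprint even when their sizes differ, for example because of different compression levels or headers.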


            • #7
              Originally posted by carolW
              As a matter of fact, I want to convert BAM to FASTQ, since FASTQ takes less space, and yes, the BAMs are unaligned. In parallel, I wanted a tool that does the reverse conversion, to find out whether the FASTQ files contain all the necessary information from the original BAM files. Would it be enough to compare the size of the BAM converted from FASTQ with the size of the original BAM to determine whether the FASTQ is equivalent to the original BAM?

              And what would be the best tool? Picard or any other tool?
              I would simply gzip-compress to a high level (such as 8, using pigz) if you want to save space, or use pbzip2 for even higher compression. SAM and BAM are poor formats for unaligned reads, as it is much more difficult to determine how the read pairing is organized, compared to FASTQ, which is the universal standard for raw sequence data. Storing data in anything other than the universal standards (FASTQ, FASTA, and gzip) gives you a small increase in compression for a huge increase in the probability that you made a very bad choice.

              Edit - SRA is a great example of why this is a bad idea. It causes problems for everyone who uses it.
              And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?
              Last edited by Brian Bushnell; 01-28-2016, 07:09 PM.


              • #8
                Brian, see https://github.com/DecodeGenetics/BamHash as mentioned earlier, the authors describe it as a tool to: "Hash BAM and FASTQ files to verify data integrity... The result can be compared to verify that the pair of FASTQ files contain the same read information as the aligned BAM file."


                • #9
                  Originally posted by Brian Bushnell
                  I would simply gzip-compress to a high level (such as 8, using pigz) if you want to save space, or use pbzip2 for even higher compression. SAM and BAM are poor formats for unaligned reads, as it is much more difficult to determine how the read pairing is organized, compared to FASTQ, which is the universal standard for raw sequence data. Storing data in anything other than the universal standards (FASTQ, FASTA, and gzip) gives you a small increase in compression for a huge increase in the probability that you made a very bad choice.

                  Edit - SRA is a great example of why this is a bad idea. It causes problems for everyone who uses it.
                  And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?

                  And how do pigz and pbzip2 compare with EBI's CRAM? Does pbzip2 compress even better than CRAM?

                  The original data are BAM. BAM can very well be used for alignment, but I convert to FASTQ for alignment. Moreover, since FASTQ is text, I thought it could be compressed to a significantly smaller size than BAM.

                  I tried to compress a 40G BAM with pbzip2 using the -9 option and gained nothing: the bz2 file was still 40G at the end. This might be due to the fact that the BAM file is a collection of smaller BAM files in one file.
                  Last edited by carolW; 01-29-2016, 07:16 AM.


                  • #10
                    Compressing a compressed file will usually not give any benefit; you have to compress the raw data. In fact, compressing a compressed file will often result in slightly larger output.
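A quick way to see this effect, sketched in Python with the standard `zlib` module. The FASTQ-like text is randomly generated for the example (not real sequencing data), and zlib stands in for the gzip/BGZF compression used in practice.

```python
import random
import zlib


def make_fastq_like_text(n_reads=2000, read_len=100, seed=0):
    """Build FASTQ-shaped text with random bases and qualities."""
    rng = random.Random(seed)
    chunks = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHI") for _ in range(read_len))
        chunks.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(chunks).encode()


raw = make_fastq_like_text()
once = zlib.compress(raw, 9)    # compressing the raw text helps a lot
twice = zlib.compress(once, 9)  # compressing the compressed bytes does not
print(len(raw), len(once), len(twice))
```

The second pass has almost no redundancy left to exploit, which matches the 40G BAM staying at 40G after pbzip2.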

                    For unaligned reads, bam compression is not much better than gzipped fastq. I don't have any numbers but I would expect gzipped fastq to be a few percent bigger than bam, and bzip2 to be a few percent smaller (on the order of 5-10%, I'd imagine), and cram to be even smaller. For mapped sorted reads, though, bam and cram become substantially more efficient.

                    Incidentally, I wrote a program called "Clumpify" that can rearrange sequence data (fastq, fasta, sam, whatever) files to compress smaller by putting overlapping reads near each other. It's in the BBMap package. If you want to maximally compress the data, and it is not aligned, you can run that prior to putting the files in whatever format you decide on.
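The effect of putting overlapping reads near each other can be shown with a toy Python sketch. This is not Clumpify's algorithm: grouping reads by their lexicographically smallest k-mer, the value of `K`, and the simulated error-free reads from a random genome are all assumptions chosen for illustration.

```python
import random
import zlib

K = 16  # k-mer length used as the grouping key (arbitrary choice for this sketch)


def smallest_kmer(read, k=K):
    """Return the lexicographically smallest k-mer in the read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def compressed_size(reads):
    """Size in bytes of the reads, one per line, after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode(), 9))


rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(200_000))

# Simulated reads sampled at random positions, so they arrive in random order.
reads = []
for _ in range(10_000):
    start = rng.randrange(len(genome) - 100)
    reads.append(genome[start:start + 100])

# Overlapping reads often share their smallest k-mer, so sorting on it
# places them next to each other without needing a reference or alignment.
clumped = sorted(reads, key=smallest_kmer)

shuffled_size = compressed_size(reads)
clumped_size = compressed_size(clumped)
print(shuffled_size, clumped_size)
```

The reordered file holds exactly the same reads; only their order changes, which lets a general-purpose compressor find the overlaps inside its limited search window.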


                    • #11
                      Originally posted by Brian Bushnell
                      And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?
                      BamHash can do that. It was originally made to compare FASTQ and BAM files, but one could just as easily compare multiple BAM files.

                      Edit: I should have scrolled down! Peter already mentioned it!


                      • #12
                        If Picard converts BAM to FASTQ and FASTQ back to BAM, is there any way to recover the original BAM through these two conversions? If so, which parameters should I use, and if not, why? What would differ between the two BAMs?


                        • #13
                          It would depend on whether the initial BAM file contained only unaligned reads and nothing else. Conversion to fastq is otherwise a lossy process.
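Why the round trip can lose information is easiest to see with a toy model in Python. The `SamRecord` class and both converters below are hypothetical simplifications written for this example; they are not Picard's implementation and ignore pairing, headers, and strand handling. A four-line FASTQ record has room only for name, sequence, and quality, so anything else in the BAM record has nowhere to go.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamRecord:
    """Simplified stand-in for one BAM/SAM record."""
    name: str
    flag: int
    seq: str
    qual: str
    tags: tuple = ()  # e.g. (("RG", "lane1"),)


def to_fastq(rec):
    """Write the record as four FASTQ lines; flag and tags are dropped."""
    return f"@{rec.name}\n{rec.seq}\n+\n{rec.qual}\n"


def from_fastq(text):
    """Rebuild an unaligned record (flag 4 = unmapped) from four FASTQ lines."""
    name, seq, _, qual = text.rstrip("\n").split("\n")
    return SamRecord(name=name[1:], flag=4, seq=seq, qual=qual)


plain = SamRecord("r1", 4, "ACGT", "IIII")
tagged = SamRecord("r2", 4, "TTGA", "FFFF", (("RG", "lane1"),))

print(from_fastq(to_fastq(plain)) == plain)    # only name/seq/qual: survives
print(from_fastq(to_fastq(tagged)) == tagged)  # read-group tag is lost
```

So for a BAM that holds only unaligned reads with no extra tags, the read content can survive; alignment fields and auxiliary tags cannot be reconstructed from the FASTQ alone.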


                          • #14
                            Can you tell us again what exactly you are trying to do?

                            Are you asking if bam_start would be identical to bam_new in this example? (bam_start --> Picard bam2fastq --> Fastq --> Picard fastq2bam --> bam_new)

                            You can run BamHash on the two files and let us know what you find.


                            • #15
                              Yes: will bam_start be the same as bam_new? And does the file size not matter?
