SEQanswers

Old 01-28-2016, 03:58 AM   #1
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103
Default fastq2bam

Hi,
Which tool is better for converting FASTQ to BAM: Picard, samtools, or anything else you would suggest? It seems that Picard has different converters depending on which technology generated the FASTQ. Does it matter if I apply a converter that does not match that technology, for example fastq-solexa when the FASTQ was not generated by Solexa?

Cheers,

Carol
Old 01-28-2016, 07:51 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

Generally FASTQ to BAM means aligning reads to a reference.

And yes, it is important that the FASTQ encoding is correctly set for this. Using the old (and no longer used) Solexa/Illumina FASTQ encoding rather than the (now standard) Sanger FASTQ encoding would result in wrong read quality scores in the BAM file.
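As a rough illustration of the encoding check, here is a minimal Python sketch that guesses the quality offset from the lowest quality character in a file. It relies on the standard ASCII ranges (Sanger starts at 33, Solexa+64 at 59, Illumina 1.3+ Phred+64 at 64). The function name `guess_quality_offset` is made up for this example, and it assumes plain 4-line FASTQ records with no line wrapping. A result of 64 is only a guess, since a Sanger file containing nothing but high qualities would look the same.

```python
def guess_quality_offset(fastq_lines):
    """Guess the ASCII offset (33 or 64) used for the quality strings.

    Returns None when no quality line was seen.
    Hypothetical helper: assumes 4-line FASTQ records (no wrapped lines).
    """
    lowest = None
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # the 4th line of each record holds the qualities
            continue
        for ch in line.rstrip("\n"):
            code = ord(ch)
            if lowest is None or code < lowest:
                lowest = code
    if lowest is None:
        return None
    # Characters below ';' (59) cannot occur in any +64 encoding,
    # so the file must be Sanger / Phred+33.
    if lowest < 59:
        return 33
    # Otherwise +64 (Solexa or Illumina 1.3+) is likely, but not proven.
    return 64


if __name__ == "__main__":
    example = ["@r1\n", "ACGT\n", "+\n", "!I5#\n"]
    print(guess_quality_offset(example))
```

In practice you would pass an open file handle instead of a list and stop after the first few thousand reads.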
Old 01-28-2016, 07:57 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795
Default

@Peter: I think @carolW is referring to FastqToSam from Picard tools which stores reads in unaligned BAM format.
Old 01-28-2016, 08:10 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

Good point. But yes, if you do mean storing unaligned reads from FASTQ files as SAM/BAM files, the same applies to checking the quality score encoding.
Old 01-28-2016, 08:47 AM   #5
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103
Default

As a matter of fact, I want to convert BAM to FASTQ, as FASTQ takes less space, and yes, the BAMs are unaligned. In parallel, I wanted a tool for the reverse conversion, to find out whether the FASTQ files contain all the necessary information from the original BAM files. Would it be enough to compare the size of the BAM converted from FASTQ with the size of the original BAM to determine whether the FASTQ is equivalent to the original BAM?

And what would be the best tool? Picard or any other tool?
Old 01-28-2016, 09:30 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795
Default

Use BamHash to compare the data: https://github.com/DecodeGenetics/BamHash

Raw file sizes are not a good indicator.
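To show why content should be compared rather than size, here is a minimal Python sketch of an order-independent fingerprint over read name, sequence, and quality. It is not BamHash's actual algorithm, only an illustration of the idea; `fastq_records` and `fingerprint` are names made up for this example, and 4-line FASTQ records are assumed.

```python
import hashlib


def fastq_records(lines):
    """Yield (name, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = next(it)
        next(it)  # the '+' separator line is ignored
        qual = next(it)
        name = header.rstrip("\n")[1:].split()[0]
        yield name, seq.rstrip("\n"), qual.rstrip("\n")


def fingerprint(records):
    """Return (read_count, checksum); the checksum ignores read order.

    Each read is hashed on its own and the hashes are summed modulo 2**64,
    so two files holding the same reads in a different order agree.
    """
    total = 0
    count = 0
    for name, seq, qual in records:
        digest = hashlib.md5(f"{name}\t{seq}\t{qual}".encode()).digest()
        total = (total + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, total


if __name__ == "__main__":
    a = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "TTGA\n", "+\n", "FFFF\n"]
    b = a[4:] + a[:4]  # same reads, different order
    print(fingerprint(fastq_records(a)) == fingerprint(fastq_records(b)))
```

The same `fingerprint` function could be fed the name, sequence, and quality of each record from an unaligned BAM, using whatever BAM reader you have available.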
Old 01-28-2016, 05:57 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by carolW
As a matter of fact, I want to convert BAM to FASTQ, as FASTQ takes less space, and yes, the BAMs are unaligned. In parallel, I wanted a tool for the reverse conversion, to find out whether the FASTQ files contain all the necessary information from the original BAM files. Would it be enough to compare the size of the BAM converted from FASTQ with the size of the original BAM to determine whether the FASTQ is equivalent to the original BAM?

And what would be the best tool? Picard or any other tool?
I would simply gzip-compress to a high level (such as 8, using pigz) if you want to save space. Or pbzip2 for even higher compression. SAM and BAM are poor formats for unaligned reads, as it is much more difficult to determine how the read pairing is organized, compared to FASTQ, which is the universal standard for raw sequence data. Storing data in anything other than the universal standards (FASTQ, FASTA, and gzip) gives you a small increase in compression for a huge increase in the probability that you made a very bad choice.

Edit - SRA is a great example of why this is a bad idea. It causes problems for everyone who uses it.
And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?

Last edited by Brian Bushnell; 01-28-2016 at 06:09 PM.
Old 01-29-2016, 01:13 AM   #8
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

Brian, see https://github.com/DecodeGenetics/BamHash as mentioned earlier, the authors describe it as a tool to: "Hash BAM and FASTQ files to verify data integrity... The result can be compared to verify that the pair of FASTQ files contain the same read information as the aligned BAM file."
Old 01-29-2016, 05:23 AM   #9
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103
Default

Quote:
Originally Posted by Brian Bushnell
I would simply gzip-compress to a high level (such as 8, using pigz) if you want to save space. Or pbzip2 for even higher compression. SAM and BAM are poor formats for unaligned reads, as it is much more difficult to determine how the read pairing is organized, compared to FASTQ, which is the universal standard for raw sequence data. Storing data in anything other than the universal standards (FASTQ, FASTA, and gzip) gives you a small increase in compression for a huge increase in the probability that you made a very bad choice.

Edit - SRA is a great example of why this is a bad idea. It causes problems for everyone who uses it.
And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?

And how do pigz and pbzip2 compare with EBI's CRAM? Does pbzip2 compress even better than CRAM?

The original data are BAM. BAM can very well be used for alignment, but I convert to FASTQ for alignment. Moreover, as FASTQ is text, I thought it could be compressed significantly better than BAM.

I tried to compress a 40G BAM with pbzip2 using the -9 option and didn't gain anything, as the bz2 file was still 40G at the end. This might be due to the fact that the BAM file is a collection of smaller BAM files merged into one.

Last edited by carolW; 01-29-2016 at 06:16 AM.
Old 01-29-2016, 08:42 PM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Compressing a compressed file will usually not give any benefit; you have to compress the raw data. In fact, compressing a compressed file will often result in slightly larger output.
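This effect is easy to reproduce on toy data. The Python sketch below builds synthetic FASTQ-like text (random bases and qualities, so not representative of real reads), compresses it once with zlib, then compresses the result again with bz2. The helper `fake_fastq` and all sizes are assumptions made up for this example.

```python
import bz2
import random
import zlib


def fake_fastq(n_reads, read_len=100, seed=0):
    """Build synthetic FASTQ text as bytes (random data, for illustration only)."""
    rng = random.Random(seed)
    out = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFF:,#") for _ in range(read_len))
        out.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(out).encode("ascii")


raw = fake_fastq(2000)
once = zlib.compress(raw, 9)   # first pass over the raw text
twice = bz2.compress(once, 9)  # second pass over already-compressed bytes

print("raw bytes:       ", len(raw))
print("zlib:            ", len(once))
print("zlib, then bz2:  ", len(twice))
```

The first pass should shrink the text considerably, while the second pass should bring little or no gain. BAM is already compressed internally, which is consistent with the 40G BAM staying at about 40G after pbzip2.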

For unaligned reads, bam compression is not much better than gzipped fastq. I don't have any numbers but I would expect gzipped fastq to be a few percent bigger than bam, and bzip2 to be a few percent smaller (on the order of 5-10%, I'd imagine), and cram to be even smaller. For mapped sorted reads, though, bam and cram become substantially more efficient.

Incidentally, I wrote a program called "Clumpify" that can rearrange sequence data (fastq, fasta, sam, whatever) files to compress smaller by putting overlapping reads near each other. It's in the BBMap package. If you want to maximally compress the data, and it is not aligned, you can run that prior to putting the files in whatever format you decide on.
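The general idea of reordering reads so that similar ones sit next to each other can be sketched in a few lines of Python. This is not Clumpify's algorithm, only a toy stand-in: it sorts reads by their lexicographically smallest k-mer, so reads that share that k-mer end up adjacent. The names `min_kmer` and `cluster_reads` and the default `k` are made up for this example, and reverse complements are ignored.

```python
def min_kmer(seq, k=16):
    """Return the lexicographically smallest k-mer of a read.

    Reads no longer than k are returned unchanged.
    """
    if len(seq) <= k:
        return seq
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def cluster_reads(records, k=16):
    """Sort (name, seq, qual) records by smallest k-mer, then by name.

    Reads sharing their smallest k-mer become neighbours, which tends to
    give a general-purpose compressor longer repeated stretches to exploit.
    """
    return sorted(records, key=lambda rec: (min_kmer(rec[1], k), rec[0]))


if __name__ == "__main__":
    reads = [
        ("r1", "TTTTAAAATTTT", "IIIIIIIIIIII"),
        ("r2", "GGGGCCCC", "IIIIIIII"),
        ("r3", "TTAAAATTTTGG", "IIIIIIIIIIII"),
    ]
    for name, seq, _ in cluster_reads(reads, k=4):
        print(name, seq)
```

Whether such a reordering actually shrinks a given file depends on coverage and on the compressor, so it is worth measuring on your own data.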
Old 01-29-2016, 11:51 PM   #11
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479
Default

Quote:
Originally Posted by Brian Bushnell
And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?
bamHash can do that. It was originally made to compare fastq and BAM files, but one could just as easily compare multiple BAM files.

Edit: I should have scrolled down! Peter already mentioned it!
Old 02-10-2016, 03:37 AM   #12
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103
Default

If Picard converts BAM to FASTQ and FASTQ to BAM, is there any way to get back the original BAM through these two conversions? If so, which parameters should I use, and if not, why? What would differ between the two BAMs?
Old 02-10-2016, 03:42 AM   #13
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479
Default

It would depend on whether the initial BAM file contained only unaligned reads and nothing else. Conversion to fastq is otherwise a lossy process.
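To make the lossy part concrete, the Python sketch below turns one SAM text line into a FASTQ record and reports which of the mandatory SAM columns have no place in FASTQ. The column names follow the SAM specification; the function `sam_to_fastq` is made up for this example and is not how Picard or samtools implement the conversion. It also ignores pairing flags and hard clipping.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq(sam_line):
    """Return (fastq_record, dropped_fields, optional_tags) for one SAM line."""
    cols = sam_line.rstrip("\n").split("\t")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    tags = cols[11:]  # optional tags such as NM:i:0 are not kept in FASTQ
    seq, qual = rec["SEQ"], rec["QUAL"]
    if int(rec["FLAG"]) & 0x10:
        # Reverse-strand alignments store the reverse complement;
        # restore the orientation in which the read was sequenced.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    fastq = f"@{rec['QNAME']}\n{seq}\n+\n{qual}\n"
    dropped = {key: rec[key] for key in
               ("RNAME", "POS", "MAPQ", "CIGAR", "RNEXT", "PNEXT", "TLEN")}
    return fastq, dropped, tags


if __name__ == "__main__":
    line = "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIH#\tNM:i:0\n"
    fastq, dropped, tags = sam_to_fastq(line)
    print(fastq, dropped, tags)
```

For an unaligned record the dropped columns only hold placeholders (`*` and `0`), which is why a BAM containing nothing but unaligned reads can, in principle, survive the round trip.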
Old 02-10-2016, 03:45 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795
Default

Can you tell us again what exactly you are trying to do?

Are you asking if bam_start would be identical to bam_new in this example? (bam_start --> Picard bam2fastq --> Fastq --> Picard fastq2bam --> bam_new)

You can use bamhash on the two files and let us know what you find.
Old 02-10-2016, 03:59 AM   #15
carolW
Senior Member
 
Location: US

Join Date: Apr 2013
Posts: 103
Default

Yes, I am asking whether bam_start will be the same as bam_new. Does the file size not matter?
Old 02-10-2016, 04:09 AM   #16
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795
Default

As @Devon said, if the file only contained unaligned reads then the two files should contain the same information (I am not sure if reads would retain the same order).

As we have discussed before the size of the file is a bad parameter for comparison. Bamhash would be your best bet.

Tags
bam, converter, fastq, picard, samtools

All times are GMT -8.