SEQanswers

Old 06-18-2012, 02:59 PM   #1
manducasexta
Member
 
Location: San Francisco

Join Date: Mar 2009
Posts: 12
Default GATK re aligner step doubles file size?

Has anyone else seen the GATK IndelRealigner double file size? It happened on a set of 12 genomes I'm processing. After the next step the file sizes returned to their normal range. A quick look at the first reads in the files showed no difference; same reads in the same order with the same amount of metadata, a little of it changed. I'm indexing them now so I can sample a few other regions.

Thanks!
Old 06-18-2012, 04:43 PM   #2
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437
Default

Sounds odd. The only reason I can think of is that GATK LR (local realignment) is returning the BAM unsorted for some reason. A sorted BAM can be smaller than an unsorted BAM because it is easier to compress...
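One way to test the unsorted-output theory is to look at the SAM text of the realigned file. What follows is only a minimal sketch in plain Python, not part of GATK or samtools: the function names are made up for illustration, and it assumes the BAM has already been converted to SAM text with its header. It reads the sort order declared in the @HD line and then checks independently that the records really are in coordinate order, since the header can be wrong.

```python
def header_sort_order(sam_text):
    """Return the SO value declared on the @HD header line, or None if absent."""
    for line in sam_text.splitlines():
        if not line.startswith("@"):
            break  # header lines come first; stop at the first record
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def is_coordinate_sorted(sam_text):
    """Check that mapped records are grouped by reference and ordered by position.

    Simplification (an assumption of this sketch): the order of references
    is not compared against the @SQ lines, and unmapped reads are skipped.
    """
    seen_refs = set()
    last_ref, last_pos = None, 0
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        ref, pos = fields[2], int(fields[3])
        if ref == "*":
            continue  # unmapped read, no coordinate to compare
        if ref != last_ref:
            if ref in seen_refs:
                return False  # came back to a reference seen earlier
            seen_refs.add(ref)
            last_ref, last_pos = ref, pos
        elif pos < last_pos:
            return False  # position went backwards within a reference
        else:
            last_pos = pos
    return True
```

The input text can be produced with `samtools view -h file.bam`. If the header claims coordinate order but the second check fails, sorting is a plausible cause of the size difference.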
Old 07-03-2012, 02:57 AM   #3
dawe
Senior Member
 
Location: 45°30'25.22"N / 9°15'53.00"E

Join Date: Apr 2009
Posts: 258
Default

Is that the IndelRealigner? I found the same problem with TableRecalibration, but there it happens because GATK retains the old quality scores for each read. Solved with the "--doNotWriteOriginalQuals" option.
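To see how much the retained scores cost, one can measure the OQ optional field, which is where GATK keeps the original qualities in each record. This is a rough sketch under stated assumptions: it works on uncompressed SAM text lines, the function names are hypothetical, and the byte count it reports says nothing exact about the compressed BAM size.

```python
def strip_oq(sam_line):
    """Return the SAM alignment line with any OQ:Z: optional field removed."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 fields are mandatory; optional TAG:TYPE:VALUE fields follow.
    kept = fields[:11] + [f for f in fields[11:] if not f.startswith("OQ:Z:")]
    return "\t".join(kept)


def oq_overhead(sam_line):
    """Characters the OQ field adds to this line, including its leading tab."""
    return len(sam_line.rstrip("\n")) - len(strip_oq(sam_line))
```

Since OQ is as long as the read itself, the overhead per record is roughly the read length plus six characters, which fits with a file growing noticeably after recalibration.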
Old 07-03-2012, 10:53 AM   #4
swNGS
Member
 
Location: SW UK

Join Date: Nov 2011
Posts: 83
Default

I had noticed that too... Is there any reason why anyone would want to retain the original quality information (aside from an OCD-esque obsession with not discarding anything)?
Old 07-04-2012, 12:34 PM   #5
manducasexta
Member
 
Location: San Francisco

Join Date: Mar 2009
Posts: 12
Default

Hi Dawe --
While the quality scores are retained in the later files, those files are back down to a reasonable size. The following image shows the progression of file sizes in obsessive detail. There are 12 points per analysis step because I'm running 12 samples in parallel:



I'm wondering if I could have caused the problem by something atypical I did in preparing the interval file for the realignment: I needed to add read groups to the alignments, so I ran parallel jobs creating the interval file and adding read groups. Then I used the interval file to realign the reads in the file with the newly added read groups (one per file). I reasoned that read groups aren't relevant to realignment when there is one per file, but maybe I tripped over some unexpected consequences.
Old 07-04-2012, 12:41 PM   #6
manducasexta
Member
 
Location: San Francisco

Join Date: Mar 2009
Posts: 12
Default

>Is there any reason why anyone would want to retain the original quality information

swNGS: it's useful if you plan to keep the BAMs as your sole archive for a sequencing project. If you need to realign in the future (e.g. to a new version of the genome, or to a new genome entirely because you've been aligning to a related organism since yours isn't sequenced yet), you can recreate the FASTQ from the original quality scores. That way the quality scores carry no artifacts from recalibration errors based on the first genome.

But obsessive data retention may play a part too.
Old 07-04-2012, 02:09 PM   #7
swNGS
Member
 
Location: SW UK

Join Date: Nov 2011
Posts: 83
Default

Manducasexta: hmmm you have a point there!
We were having a discussion in the lab recently about what data to keep or discard. I'm all for keeping the minimum required, and hadn't considered that you could regenerate the FASTQ from the BAM.
What would be the path to achieve this? If it works, I could theoretically discard the original FASTQs...
Old 07-05-2012, 09:50 AM   #8
manducasexta
Member
 
Location: San Francisco

Join Date: Mar 2009
Posts: 12
Default

swNGS: I haven't done it, so I don't have a method immediately at hand. But having verified that the information is present in the SAM file, I'm confident that a FASTQ containing the original quality scores could be generated from SAM using Perl or (probably) some other tool for parsing SAM format. When using an aligner that accepts BAM input, it would be sufficient to replace the recalibrated quality scores with the original scores in a copy of the file.
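As a rough illustration of both ideas, here is a minimal Python sketch rather than a tested pipeline. It assumes the original scores are stored in the OQ tag in the same orientation as the QUAL field, that reads were not hard-clipped, and that the "/1" and "/2" name suffixes are acceptable for paired reads. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_quals(fields):
    """Return the OQ tag value if present, otherwise the current QUAL field."""
    for f in fields[11:]:
        if f.startswith("OQ:Z:"):
            return f[5:]
    return fields[10]


def sam_record_to_fastq(sam_line):
    """Convert one SAM alignment line to a FASTQ record using original quals.

    Returns None for secondary/supplementary alignments so that each read
    is emitted only once.
    """
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq = fields[0], int(fields[1]), fields[9]
    if flag & 0x900:  # 0x100 secondary, 0x800 supplementary
        return None
    qual = original_quals(fields)
    if flag & 0x10:
        # Reverse-strand reads are stored reverse-complemented in SAM,
        # so undo that to recover the read as sequenced.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & 0x40:
        name += "/1"  # first read of a pair
    elif flag & 0x80:
        name += "/2"  # second read of a pair
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual)


def restore_original_quals(sam_line):
    """Put the OQ scores back into QUAL and drop the OQ tag (for BAM input)."""
    fields = sam_line.rstrip("\n").split("\t")
    fields[10] = original_quals(fields)
    kept = fields[:11] + [f for f in fields[11:] if not f.startswith("OQ:Z:")]
    return "\t".join(kept)
```

Before deleting any original FASTQ files, it would be prudent to regenerate a few with something like this and compare them against the originals, ignoring read order.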
All times are GMT -8. The time now is 01:07 AM.

