SEQanswers

10-04-2012, 02:34 PM   #1
wzhangvv
Member
 
Location: IL

Join Date: Oct 2012
Posts: 10
Help! GATK realignment speed

Hi All,
I am trying to call SNPs with GATK. This is my first time doing this, and I put together the workflow below. Everything went well until I got stuck at RealignerTargetCreator, which looks like it will take about 15 days per sample! I don't know whether that is a normal speed for a 300 MB reference and a 1 GB BAM file. Could anybody help me figure it out? I have 80 samples, so I obviously don't have enough time to run this step at that rate.
Thanks for your time~!

My workflow (up to RealignerTargetCreator):
I used sga for de novo assembly and used its output file, contigs.fa, as the reference.

bwa index -P contigs.fa -a bwtsw contigs.fa

bwa aln -t 4 contigs.fa R1.fq > R1.sai

bwa aln -t 4 contigs.fa R2.fq > R2.sai

bwa sampe contigs.fa R1.sai R2.sai R1.fq R2.fq > A.sam

samtools view -bST contigs.fa -o A_noRG.bam A.sam

java -Xmx20g -XX:PermSize=10g -XX:MaxPermSize=10g -jar /usr/share/picard/lib/AddOrReplaceReadGroups.jar INPUT=A_noRG.bam OUTPUT=A_std.bam SORT_ORDER=coordinate RGID=lib1_A RGLB=AA RGPL=illumina RGSM=lib1_A RGPU=none VALIDATION_STRINGENCY=LENIENT

java -Xmx20g -XX:PermSize=10g -XX:MaxPermSize=10g -jar /usr/share/picard/lib/MarkDuplicates.jar INPUT=A_std.bam OUTPUT=A_std_noduplicates.bam METRICS_FILE=A_std.duplicate_matrics REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT

java -Xmx20g -XX:PermSize=10g -XX:MaxPermSize=10g -jar /usr/share/picard/lib/BuildBamIndex.jar INPUT=A_std_noduplicates.bam VALIDATION_STRINGENCY=LENIENT

java -Xmx20g -XX:PermSize=10g -XX:MaxPermSize=10g -jar /usr/share/GenomeAnalysisTK-2.1-10-gdbc86ec/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 8 -I A_std_noduplicates.bam -R contigs.fa -o A_forIndelAligner.intervals
10-04-2012, 03:58 PM   #2
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

Try OpenGE for realignment:

www.github.com/adaptivegenome/OpenGE
10-04-2012, 08:23 PM   #3
wzhangvv
Member
 
Location: IL

Join Date: Oct 2012
Posts: 10

Hi,

Thanks for your reply!

I just read the OpenGE manual. It can't help me, because its localrealign step requires an intervals file that has to be generated by GATK RealignerTargetCreator, which is the very slow step I mentioned before.

Do you think my RealignerTargetCreator speed is normal (~10+ days per sample)? If it is, I will have to change my strategy.




Quote:
Originally Posted by adaptivegenome
Try OpenGE for realignment:

www.github.com/adaptivegenome/OpenGE
10-05-2012, 02:38 AM   #4
qtrinh
Member
 
Location: Canada

Join Date: May 2008
Posts: 20

Hi,
What about running RealignerTargetCreator in parallel on each of the chromosomes? This should speed things up for you.

Q
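
For an assembly that has contigs rather than chromosomes, the same idea could be applied to batches of contigs. The following is a minimal sketch, not a tested pipeline. It assumes a contigs.fa.fai index made with samtools faidx; the function names, the batch size, the -Xmx4g setting and the batch_*/targets_* file names are made up for illustration. -L is GATK's option for restricting a run to a list of intervals. The sketch only prints the commands, so they can be inspected first or handed to a job scheduler.

```shell
#!/bin/sh
# Sketch only: batch the contigs of a reference and print one
# RealignerTargetCreator command per batch.
# Hypothetical names (not from the thread): make_batches, print_commands,
# the batch_NNNN.intervals / targets_NNNN.intervals files, -Xmx4g.

# make_batches FAI BATCH_SIZE OUTDIR
# FAI is a samtools faidx index (contig name in column 1).
# Writes OUTDIR/batch_NNNN.intervals with one contig name per line.
make_batches() {
    mkdir -p "$3"
    awk -v size="$2" -v dir="$3" '
        {
            file = sprintf("%s/batch_%04d.intervals", dir, int((NR - 1) / size))
            print $1 > file
        }
        NR % size == 0 { close(file) }
    ' "$1"
}

# print_commands OUTDIR REF BAM
# Prints (does not run) one GATK command per batch file.
print_commands() {
    for f in "$1"/batch_*.intervals; do
        base=$(basename "$f")
        echo "java -Xmx4g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator" \
             "-R $2 -I $3 -L $f -o $1/targets_${base#batch_}"
    done
}

# Example (file names taken from post #1, batch size is a guess):
#   make_batches contigs.fa.fai 50000 batches
#   print_commands batches contigs.fa A_std_noduplicates.bam > jobs.txt
```

The per-batch targets_*.intervals files would then need to be concatenated in batch order into a single intervals file for the realignment step. Batches follow the order of the .fai file, so the combined file should stay in reference order, but that is worth checking before relying on it.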
10-05-2012, 06:51 AM   #5
wzhangvv
Member
 
Location: IL

Join Date: Oct 2012
Posts: 10

Hi,
Thanks for your reply!
I only did a de novo assembly and got millions of contigs without any chromosome information... The frog species I work on doesn't have an assembled genome...
I tried to build more scaffolds, but that was difficult with my RAD data.

Quote:
Originally Posted by qtrinh
Hi,
What about running RealignerTargetCreator in parallel on each of the chromosomes? This should speed things up for you.

Q
10-05-2012, 08:41 AM   #6
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437

I'm surprised that RealignerTargetCreator is the limiting step. It typically runs quite fast.
10-05-2012, 10:39 AM   #7
wzhangvv
Member
 
Location: IL

Join Date: Oct 2012
Posts: 10

Hi,
Do you think there is anything wrong with my workflow? Can you give me some advice? I really don't know how to debug or improve it, because all the previous steps finished smoothly.
I really appreciate your reply!

Quote:
Originally Posted by adaptivegenome
I'm surprised that RealignerTargetCreator is the limiting step. It typically runs quite fast.
10-14-2012, 06:05 AM   #8
Zaag
Senior Member
 
Location: Amsterdam

Join Date: Nov 2009
Posts: 112

Use targeted regions so that it doesn't walk over the entire genome (unless it's whole-genome data).

Last edited by Zaag; 10-14-2012 at 06:10 AM.
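
When the target regions aren't known in advance, one possible way to get a rough list is to keep only the contigs that actually have reads mapped to them. This is a sketch under assumptions, not a recommendation from anyone in the thread: it assumes the BAM is coordinate-sorted and indexed (as in the workflow in post #1) and that the installed samtools provides idxstats, which prints contig name, length, mapped reads and unmapped reads per line. The function name covered_contigs, the read threshold and the output file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: list contigs that have at least MIN_READS mapped reads.
# Hypothetical names (not from the thread): covered_contigs,
# A_covered.intervals, and the threshold of 1 read.

# covered_contigs MIN_READS
# Reads `samtools idxstats` output on stdin
# (columns: name, length, mapped reads, unmapped reads)
# and prints the names of contigs with >= MIN_READS mapped reads.
# The "*" line (reads with no contig) is skipped.
covered_contigs() {
    awk -F '\t' -v min="$1" '$1 != "*" && $3 >= min { print $1 }'
}

# Example (file names taken from post #1):
#   samtools idxstats A_std_noduplicates.bam | covered_contigs 1 > A_covered.intervals
#   ... -T RealignerTargetCreator -L A_covered.intervals ...
```

Whether this helps depends on how many contigs have no reads at all. With RAD data it seems plausible that many do, but that has to be checked on the actual BAM file.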
10-14-2012, 12:23 PM   #9
wzhangvv
Member
 
Location: IL

Join Date: Oct 2012
Posts: 10

Hi,
This is whole-genome data and I don't know the exact target regions, so I can't start from a known VCF file.
Thanks for your reply!



Quote:
Originally Posted by Zaag
Use targeted regions so that it doesn't walk over the entire genome (unless it's whole-genome data).

Tags
gatk, realignertargetcreator, snp calling, speed
