SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Mpileup/BCFtools pipeline not picking up indels (suggestions please) cam.jack Bioinformatics 7 05-17-2013 01:05 PM
dwgsim to simulate Illumina reads gene coder Bioinformatics 1 07-07-2011 09:51 AM
Samtools pipeline produces many indels Hkins552 Bioinformatics 2 06-17-2011 04:30 AM
Simulate ABI reads wesb SOLiD 0 12-04-2008 09:10 AM
indels using single end short reads! bioinfosm Bioinformatics 10 08-01-2008 12:57 PM

Reply
 
Thread Tools
Old 12-09-2011, 11:37 AM   #1
pmaugeri
Member
 
Location: Barcelona

Join Date: Sep 2011
Posts: 11
Default Pipeline to simulate short reads and call indels

Hi

http://biotechies.blogspot.com/2011/...ct-indels.html is a kind of howto for doing the following pipeline:

1. simulate short reads (wgsim)

2. align short reads
2.1. using BWA
2.2. using Bowtie2

3. call indels
3.1. using GATK UnifiedGenotyper
3.2. using Dindel

I initially started to write it for my own usage to remember command calls but it may help others since I've spent a lot of time discovering software parameters and pipelines.

Please help me to correct and complete the post as much as possible. Any suggestion is very welcome too.

Regards.
__________________
@pascal_biotech
pmaugeri is offline   Reply With Quote
Old 12-09-2011, 11:55 AM   #2
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

So the bwa/samtools command woud be:

Quote:
bwa sampe -r [email protected]\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS - > out.bam
swbarnes2 is offline   Reply With Quote
Old 12-09-2011, 12:06 PM   #3
pmaugeri
Member
 
Location: Barcelona

Join Date: Sep 2011
Posts: 11
Default

wow :-O
Thanks for taking the time to read it and for your very interesting comments. I will modify the post accordingly.

I believe that wgsim is limited compared to maq simulator and I plan to add a section along wgsim to simulate reads using MAQ. What do you thing about that?
__________________
@pascal_biotech
pmaugeri is offline   Reply With Quote
Old 12-09-2011, 01:20 PM   #4
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Personally, It's a lot more fun to learn with real data, like off of SRA, but it might be hard to find a data set that's suitable for practice.
swbarnes2 is offline   Reply With Quote
Old 12-09-2011, 01:32 PM   #5
pmaugeri
Member
 
Location: Barcelona

Join Date: Sep 2011
Posts: 11
Default

I confirm your suspicion: it is not that easy to find a data sets suitable for practicing. Simulated data sets is a good complementary source of data.

Of course you have data sets from 1000g project and co. but I also like very much the flexibility of simulated data sets (eg. indels are well known, you can have as many SVs as you want etc.).
__________________
@pascal_biotech
pmaugeri is offline   Reply With Quote
Old 12-10-2011, 02:19 PM   #6
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 533
Default

Quote:
Originally Posted by swbarnes2 View Post
Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

So the bwa/samtools command woud be:
One comment I'll add is that if you intend to sort the bam file with samtools sort after generating it, you can add the -u option to form an uncompressed bam file. After sorting with samtools it will be compressed. This will cut down on the time of generating the bam file.
Heisman is offline   Reply With Quote
Old 12-12-2011, 12:21 AM   #7
pmaugeri
Member
 
Location: Barcelona

Join Date: Sep 2011
Posts: 11
Default

Thanks Heisman for your comment.

If I have understood correctly you propose to generate the .bam file in an uncompressed format:

bwa sampe -f out.sam -r [email protected]\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS -u - > out.bam

because the command

samtools sort out.bam out_sorted

will compress it anyway? So it will save time on computing the sort?
__________________
@pascal_biotech
pmaugeri is offline   Reply With Quote
Old 12-12-2011, 03:48 AM   #8
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 533
Default

Yes, correct. Very simple to test too if you have a stopwatch.
Heisman is offline   Reply With Quote
Old 12-12-2011, 04:06 AM   #9
pmaugeri
Member
 
Location: Barcelona

Join Date: Sep 2011
Posts: 11
Default

Thank you Sir! I added your comment into the post.
__________________
@pascal_biotech
pmaugeri is offline   Reply With Quote
Old 12-14-2011, 08:05 AM   #10
vyellapa
Member
 
Location: phoenix

Join Date: Oct 2011
Posts: 59
Default Can dindel be run on Exome data?

Im trying to use dindel on exome human data but am getting about 100 indels/exome. I would like to know if there is something wrong with my pipeline (setting parameter values incorrectly; using re-calibrated, de-duped bams as opposed to using the raw bams, etc..) or if dindel cannot be used on exome data at all.

Any information is helpful.
vyellapa is offline   Reply With Quote
Old 12-14-2011, 08:50 AM   #11
Alex Renwick
Member
 
Location: Houston, Texas

Join Date: Jul 2011
Posts: 44
Default

Is there a reason you leave out the realignment step (RealignerTargetCreator and IndelRealigner) when using UnifiedGenotyper? The step is recommended in Broad's "best practices" (http://www.broadinstitute.org/gsa/wi...th_the_GATK_v3), and Dindel does the equivalent using the result of getCIGARindels.
Alex Renwick is offline   Reply With Quote
Old 12-15-2011, 10:00 AM   #12
vyellapa
Member
 
Location: phoenix

Join Date: Oct 2011
Posts: 59
Default

I have done the realignment step too. The SNPs using GATK pipeline look ok but indels do not. I have the intermediate files and am wondering what indel caller to use?
vyellapa is offline   Reply With Quote
Old 12-16-2011, 01:10 AM   #13
pmaugeri
Member
 
Location: Barcelona

Join Date: Sep 2011
Posts: 11
Default

@Alex Renwick: I am not sure whether it helps to do realignment around known indels here (I guess you refer to the steps described in http://www.broadinstitute.org/gsa/wi..._around_indels) as this pipeline starts from simulating reads and simulating indels. My understanding is that indels are scattered randomly so there are probably not "known" sites and not referenced in dbSNP or 1000genomes databases.

Please let me know if I misunderstood it and if local realignment is necessary.
__________________
@pascal_biotech
pmaugeri is offline   Reply With Quote
Old 12-16-2011, 07:24 AM   #14
Alex Renwick
Member
 
Location: Houston, Texas

Join Date: Jul 2011
Posts: 44
Default

Quote:
Originally Posted by pmaugeri View Post
@Alex Renwick: I am not sure whether it helps to do realignment around known indels here ... as this pipeline starts from simulating reads and simulating indels...
The Broad documentation does seem to focus on the issue of handling known sites, but you can use the realigner to deal with novel indels.

RealignerTargetCreator scans the bam file for likely indel regions (probably looking at the CIGAR string) and then IndelRealigner performs a local alignment in those areas. The local alignment takes all reads into account at once, which requires more computation and finds a better result than the usual practice of instead of handling each read independently.

You can use wgsim_eval.pl to compare the correctness of the bam file before and after realignment, and you'll see read placement improve.
Alex Renwick is offline   Reply With Quote
Old 12-22-2011, 01:35 AM   #15
pmaugeri
Member
 
Location: Barcelona

Join Date: Sep 2011
Posts: 11
Default

@Alex,

I just ended another round of indels calling against several sets of simulated data. In brief I simulated sets of reads for chromosome 20 at four different coverages (5x, 11x, 22x, 44x).

Then I run GATK UnifiedGenotyper with or without local realignment. I found no difference between the number of indels calls with or without realignment?!

So is it worth doing local realignment steps prior to call for indels ON simulated data?

Or maybe I just did something wrong in the realignement steps?!

Here are my commands:

java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T RealignerTargetCreator -o output.intervals --known 1000G_biallelic.indels.b37.vcf

java -jar GenomeAnalysisTK.jar -I out_sorted.bam -R human_g1k_v37_chr20.fasta -T IndelRealigner -targetIntervals output.intervals -o realigned_out_sorted.bam -known 1000G_biallelic.indels.b37.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4

java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T UnifiedGenotyper -I realigned_out_sorted.bam -o gatk_indels.vcf -glm INDEL

Regards.
__________________
@pascal_biotech
pmaugeri is offline   Reply With Quote
Old 02-29-2012, 04:25 PM   #16
xlzhang
Junior Member
 
Location: beijing

Join Date: Nov 2011
Posts: 6
Default

Quote:
Originally Posted by pmaugeri View Post
@Alex,

I just ended another round of indels calling against several sets of simulated data. In brief I simulated sets of reads for chromosome 20 at four different coverages (5x, 11x, 22x, 44x).

Then I run GATK UnifiedGenotyper with or without local realignment. I found no difference between the number of indels calls with or without realignment?!

So is it worth doing local realignment steps prior to call for indels ON simulated data?

Or maybe I just did something wrong in the realignement steps?!

Here are my commands:

java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T RealignerTargetCreator -o output.intervals --known 1000G_biallelic.indels.b37.vcf

java -jar GenomeAnalysisTK.jar -I out_sorted.bam -R human_g1k_v37_chr20.fasta -T IndelRealigner -targetIntervals output.intervals -o realigned_out_sorted.bam -known 1000G_biallelic.indels.b37.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4

java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T UnifiedGenotyper -I realigned_out_sorted.bam -o gatk_indels.vcf -glm INDEL

Regards.
Hi, pmaugeri

I always don't understand the meaning of -LOD in gatk IndelRealigner. Do you know that?

Appreciate your help!

Regards.
xlzhang is offline   Reply With Quote
Old 03-01-2012, 02:09 PM   #17
adaptivegenome
Super Moderator
 
Location: US

Join Date: Nov 2009
Posts: 437
Default

LOD is used to set threshold for realignment. GATK tries to avoid realigning reads unless the impact is likely to be significant.
adaptivegenome is offline   Reply With Quote
Old 03-01-2012, 03:45 PM   #18
xlzhang
Junior Member
 
Location: beijing

Join Date: Nov 2011
Posts: 6
Default

Quote:
Originally Posted by genericforms View Post
LOD is used to set threshold for realignment. GATK tries to avoid realigning reads unless the impact is likely to be significant.
Thank you!
xlzhang is offline   Reply With Quote
Reply

Tags
bowtie, bwa, dindel, gatk, wgsim

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:37 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO