Unconfigured Ad

**swbarnes2** · 12-09-2011, 12:55 PM

Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

So the bwa/samtools command woud be:

bwa sampe -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS - > out.bam

**pmaugeri** · 12-09-2011, 01:06 PM

wow :-O
Thanks for taking the time to read it and for your very interesting comments. I will modify the post accordingly.

I believe that wgsim is limited compared to maq simulator and I plan to add a section along wgsim to simulate reads using MAQ. What do you thing about that?

**swbarnes2** · 12-09-2011, 02:20 PM

Personally, It's a lot more fun to learn with real data, like off of SRA, but it might be hard to find a data set that's suitable for practice.

**pmaugeri** · 12-09-2011, 02:32 PM

I confirm your suspicion: it is not that easy to find a data sets suitable for practicing. Simulated data sets is a good complementary source of data.

Of course you have data sets from 1000g project and co. but I also like very much the flexibility of simulated data sets (eg. indels are well known, you can have as many SVs as you want etc.).

**Heisman** · 12-10-2011, 03:19 PM

Originally posted by swbarnes2 View Post

Some of those commands are not as useful for real life problems, so maybe you want to add alternate commands, for once you really start working.

For starters, you may want to note that the default bwa index won't work on a genome > 2 Gb. You need to add "-a bwtsw" to the command. So I think chr 20 will index alright, but the whole genome will not. And in real life, most people have access to computers with a few processors, so they'd add a "-t 2" or whatever to the bwa aln command.

In real life, .sam files are just too big. If you pipe the command that makes the .sam file straight into samtools view, you won't make a .sam file, just the much smaller .bam file. You can use samtools view to convert part of the .bam back into a .sam if you want to look at the raw alignments. And I'd add -h to the samtools view command, to make sure the headers are carried on. You need them for some commands to work.

So the bwa/samtools command woud be:

One comment I'll add is that if you intend to sort the bam file with samtools sort after generating it, you can add the -u option to form an uncompressed bam file. After sorting with samtools it will be compressed. This will cut down on the time of generating the bam file.

**pmaugeri** · 12-12-2011, 01:21 AM

Thanks Heisman for your comment.

If I have understood correctly you propose to generate the .bam file in an uncompressed format:

bwa sampe -f out.sam -r '@RG\tID:foo\tSM:bar' human_g1k_v37_chr20.fasta out1.sai out2.sai out.read1b.fq out.read2b.fq | samtools view -hbS -u - > out.bam

because the command

samtools sort out.bam out_sorted

will compress it anyway? So it will save time on computing the sort?

**Heisman** · 12-12-2011, 04:48 AM

Yes, correct. Very simple to test too if you have a stopwatch.

**pmaugeri** · 12-12-2011, 05:06 AM

Thank you Sir! I added your comment into the post.

**vyellapa** · 12-14-2011, 09:05 AM

Can dindel be run on Exome data?

Im trying to use dindel on exome human data but am getting about 100 indels/exome. I would like to know if there is something wrong with my pipeline (setting parameter values incorrectly; using re-calibrated, de-duped bams as opposed to using the raw bams, etc..) or if dindel cannot be used on exome data at all.

Any information is helpful.

**Alex Renwick** · 12-14-2011, 09:50 AM

Is there a reason you leave out the realignment step (RealignerTargetCreator and IndelRealigner) when using UnifiedGenotyper? The step is recommended in Broad's "best practices" (http://www.broadinstitute.org/gsa/wi...th_the_GATK_v3), and Dindel does the equivalent using the result of getCIGARindels.

**vyellapa** · 12-15-2011, 11:00 AM

I have done the realignment step too. The SNPs using GATK pipeline look ok but indels do not. I have the intermediate files and am wondering what indel caller to use?

**pmaugeri** · 12-16-2011, 02:10 AM

@Alex Renwick: I am not sure whether it helps to do realignment around known indels here (I guess you refer to the steps described in http://www.broadinstitute.org/gsa/wi..._around_indels) as this pipeline starts from simulating reads and simulating indels. My understanding is that indels are scattered randomly so there are probably not "known" sites and not referenced in dbSNP or 1000genomes databases.

Please let me know if I misunderstood it and if local realignment is necessary.

**Alex Renwick** · 12-16-2011, 08:24 AM

Originally posted by pmaugeri View Post

@Alex Renwick: I am not sure whether it helps to do realignment around known indels here ... as this pipeline starts from simulating reads and simulating indels...

The Broad documentation does seem to focus on the issue of handling known sites, but you can use the realigner to deal with novel indels.

RealignerTargetCreator scans the bam file for likely indel regions (probably looking at the CIGAR string) and then IndelRealigner performs a local alignment in those areas. The local alignment takes all reads into account at once, which requires more computation and finds a better result than the usual practice of instead of handling each read independently.

You can use wgsim_eval.pl to compare the correctness of the bam file before and after realignment, and you'll see read placement improve.

**pmaugeri** · 12-22-2011, 02:35 AM

@Alex,

I just ended another round of indels calling against several sets of simulated data. In brief I simulated sets of reads for chromosome 20 at four different coverages (5x, 11x, 22x, 44x).

Then I run GATK UnifiedGenotyper with or without local realignment. I found no difference between the number of indels calls with or without realignment?!

So is it worth doing local realignment steps prior to call for indels ON simulated data?

Or maybe I just did something wrong in the realignement steps?!

Here are my commands:

java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T RealignerTargetCreator -o output.intervals --known 1000G_biallelic.indels.b37.vcf

java -jar GenomeAnalysisTK.jar -I out_sorted.bam -R human_g1k_v37_chr20.fasta -T IndelRealigner -targetIntervals output.intervals -o realigned_out_sorted.bam -known 1000G_biallelic.indels.b37.vcf --consensusDeterminationModel KNOWNS_ONLY -LOD 0.4

java -jar GenomeAnalysisTK.jar -R human_g1k_v37_chr20.fasta -T UnifiedGenotyper -I realigned_out_sorted.bam -o gatk_indels.vcf -glm INDEL

Regards.

Topics	Statistics	Last Post
UC San Diego Bioengineers Map Gene Function in Human Stem Cells by SEQadmin2 Started by SEQadmin2, 07-13-2026, 10:26 AM	0 responses 23 views 0 reactions	Last Post by SEQadmin2 07-13-2026, 10:26 AM
New Analysis Splits Leukemia Into 16 Epigenomic Subgroups by SEQadmin2 Started by SEQadmin2, 07-09-2026, 10:04 AM	0 responses 33 views 0 reactions	Last Post by SEQadmin2 07-09-2026, 10:04 AM
Genome-Wide CRISPR Screen Uncovers Unlikely Psoriasis Target by SEQadmin2 Started by SEQadmin2, 07-08-2026, 10:08 AM	0 responses 20 views 0 reactions	Last Post by SEQadmin2 07-08-2026, 10:08 AM
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, 07-07-2026, 11:05 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 07-07-2026, 11:05 AM

Unconfigured Ad

Pipeline to simulate short reads and call indels

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News