Seqanswers Leaderboard Ad

**NGSfan** · 03-12-2010, 06:26 AM

*bump*

I am also interested in this problem. It seems that indel FPs are easier to come across than in the SNP.

Perhaps some benchmarking is needed to determine parameters? against dbSNP for example?

**Melissa** · 03-16-2010, 02:20 AM

In my case, I'm looking for SNPs. I get a maximum score of 228 for column 5 & 6. I compared the SNPs score from Bowtie and BWA alignment. BWA gives me lower mapping and SNPs quality. I think it's pretty much depends on your data and also the aligner you use.

Hope this helps,
Melissa

**NGSfan** · 04-19-2010, 08:43 AM

Originally posted by Melissa View Post

In my case, I'm looking for SNPs. I get a maximum score of 228 for column 5 & 6. I compared the SNPs score from Bowtie and BWA alignment. BWA gives me lower mapping and SNPs quality. I think it's pretty much depends on your data and also the aligner you use.

Hope this helps,
Melissa

SNP calling is a lot easier than indels it appears.

Interesting to know that you go lower mapping and quality scores with BWA.

I wouldn't use Bowtie for SNP calling - I would restrict it to RNA-seq.

You might want to look into BFAST.

**nilshomer** · 04-19-2010, 09:18 AM

Originally posted by NGSfan View Post

SNP calling is a lot easier than indels it appears.

Interesting to know that you go lower mapping and quality scores with BWA.

I wouldn't use Bowtie for SNP calling - I would restrict it to RNA-seq.

You might want to look into BFAST.

Indels are hard to align so all reads agree, since each read is aligned independently. Efforts have been made in GATK to create an local re-aligner; I have my own (now not so stealth) effort. These both help SNP calling, as mis-aligned reads with indels leads to false SNPs, as well as to form a better consensus around indels.

Regarding the SNP quality, I usually see upwards of 255 for some SNPs, but indels are off the chart and can reach >2000. Looking at the examples above, say

Code:

 chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0

you see that it has SNP quality 125, but only one read supporting it, with the indel being called heterozygous. Nevertheless, SNP quality I have found is a great "knob" (plotted the ROC curve) even though it may not be perfectly calibrated to PHRED log-likelihood for indels.

**NGSfan** · 04-20-2010, 12:59 AM

Originally posted by nilshomer View Post

Indels are hard to align so all reads agree, since each read is aligned independently. Efforts have been made in GATK to create an local re-aligner; I have my own (now not so stealth) effort. These both help SNP calling, as mis-aligned reads with indels leads to false SNPs, as well as to form a better consensus around indels.

Regarding the SNP quality, I usually see upwards of 255 for some SNPs, but indels are off the chart and can reach >2000. Looking at the examples above, say

Code:

 chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0

you see that it has SNP quality 125, but only one read supporting it, with the indel being called heterozygous. Nevertheless, SNP quality I have found is a great "knob" (plotted the ROC curve) even though it may not be perfectly calibrated to PHRED log-likelihood for indels.

Yes, that is exactly the problem (pairwise alignments vs multiple sequence alignment). I have been using the GATK with great success, so I am looking forward to your own approach. It really helps clean up the indels and reduces the false SNPs surrounding them.

I was also thinking about using ROC curves to figure out the optimal settings for SNP and indel detection. What are you using as your benchmark? Are you generating your own reference sequence with known SNPs/indels?

Best,
German

**nilshomer** · 04-20-2010, 01:34 AM

Originally posted by NGSfan View Post

Yes, that is exactly the problem (pairwise alignments vs multiple sequence alignment). I have been using the GATK with great success, so I am looking forward to your own approach. It really helps clean up the indels and reduces the false SNPs surrounding them.

I was also thinking about using ROC curves to figure out the optimal settings for SNP and indel detection. What are you using as your benchmark? Are you generating your own reference sequence with known SNPs/indels?

Best,
German

ROC curves are the only way, I agree. For simulated data, I simulate 30x diploid coverage of hg18 using a program I wrote called "dwgsim" found at http://dnaa.sf.net (settings omitted). After alignment, I assess the quality of mapping using an accompanying program called "dwgsim_eval". I then use "samtools pileup" to variant call both SNPs and indels, filtering the resulting pileup with the samtools "samtool.pl varFilter" script. I then finally use "dwgsim_pileup_eval.pl" (in dnaa) to evaluate the pileup to get the data to plot TPR vs. FPR. This works for both Illumina and SOLiD data as well as various mapping and re-alignment tools, with the latter plugged in after the initial alignment. For real-world data, it gets a lot more complicated (allele frequency balance, dbsnp concordance, micro array concordance, ts/tv ratios, etc.).

**SH1** · 04-27-2010, 09:46 AM

Originally posted by cur View Post

examples of lines in final_aln_ivcf.pileup
chr_1 427504 * +T/+T 9 498 57 22 +T * 9 13 0 0 0
chr_1 441613 * */-C 22 22 53 37 * -C 34 3 0 0 0
chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0
chr_1 456726 * */-T 27 27 52 38 * -T 36 1 1 1 0
chr_1 470390 * -C/-C 111 1021 44 28 -C * 24 4 0 0

column headers in final_aln_ivcf.pileup according to http://samtools.sourceforge.net/samtools.shtml with option -c are as follows:
chromosome name, coordinate, a star, the genotype, consensus quality, SNP quality, RMS mapping quality, # covering reads, the first allele, the second allele, # reads supporting the first allele, # reads supporting the second allele and # reads containing indels different from the top two alleles.

Hello -
I am confused about the pileup output. Am I misreading something or do you have 14-15 output columns but only 13 column headers? The first row, when broken down:
chromosome name: chr_1
coordinate: 427504
a star: *
the genotype: +T/+T
consensus quality: 9
SNP quality: 498
RMS mapping quality: 57
# covering reads: 22
the first allele: +T
the second allele: *
# reads supporting the first allele: 9
# reads supporting the second allele: 13
# reads containing indels different from the top two alleles: 0

EXTRA COLUMNS: 0 0 <- Do these mean anything??? Thank you for your help - I'm new to NGS analysis.

**Kasycas** · 11-15-2010, 05:20 AM

Hi SH1,

I am also interested in your question. Have you found out?!

Thanks,

Kasycas

**kweber2** · 01-12-2011, 05:06 PM

Same question about extra two columns in consensus indel output

Does anyone know what those last two columns mean? I've been searching for hours... Thanks!
-Kris

**oiiio** · 03-24-2011, 03:56 PM

The last two columns represent the number of reads supporting further alleles. So if there was a 3rd allele present in the reads, then the FIRST of the last two columns would be greater than 0.

**Turnerac0987** · 02-08-2012, 02:04 PM

I'm pretty new to analyzing NextGen data and this thread has been really helpful with figuring out my pileup output file. However, while I know what every column means, the results in my columns are sometimes extremely strange or at the very least not corresponding to what you guys are getting.

Here are some examples:

1.) In the "A star" column, are you supposed to always get a * ? I have a mixture of *,A,G,C, and T's.

2.) In the "First allele" and "Second allele" columns, I have a bunch of really weird results. Anything from a repeat of a letter (ggggg) (CCcc$c.,,ccCC), to totally random letters and symbols (IBIIIIIIIIEIHHHIHIIII@III;!). But there are also normal calls in there too, although they are the minority for those 2 columns.

3.) All the other columns are normal

Unfortunately, I don't know the details of how the analysis was done to get these pileup files. I was just e-mailed the results and I'm not sure what to make of this.
Do any of you have any insight on why I'm getting results like this?

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 47 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

identification of indels with samtools interpretation of SNP quality

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News