Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • identification of indels with samtools interpretation of SNP quality

    Hi,
    I am trying to identify indels and am having trouble understanding the values in column 5 and 6 of my pileup output (see example of lines in final_aln_ivcf.pileup). As I understand column 5 in final_aln_ivcf.pileup should be the Phred-scaled consensus quality and column 6 the SNP quality (i.e. the Phred-scaled probability of the consensus being identical to the reference) see http://samtools.sourceforge.net/samtools.shtml.

    However what confuses me is that the values in column 5 and 6 are much higher than I would expect for Phred-scaled probabiliies (see below example of lines in final_aln_ivcf.pileup).
    It looks as though in this pileup, a reliable call for an indel has got an SNP quality of >1000 (see chr1 470390), SNP quality of around 500 seems to indicate half the sequences are like the reference and half show the indel (see chr1 427504) while SNP quality scores below 50 seem to suggest almost all sequences are identical to reference with few sequences showing presence of an indel (see chr1 453369, 456726).

    Now I am wondering whether/or not I have made a mistake somewhere in the analysis, whether/or not my interpretation of this result is right and how to explain these strange very high values for Phred-based probability of consensus and SNP quality....Can anybody help?

    Quick summary how the analysis was conducted to get the pileup below:
    I initially used bwa to align paired-end solexa data to a reference:
    bwa index -a bwtsw reference.fasta
    bwa aln reference.fasta 1.fastq > aln.sai1
    bwa aln reference.fasta 2.fastq > aln.sai2
    bwa sampe -a 250 reference.fasta aln.sai1 aln.sai2 1.fastq 2.fastq > aln.sam

    then converted the .sam file to .bam file, followed by sorting and indexing using samtools:
    samtools view -bt reference.fasta aln.sam -o aln.bam
    samtools sort aln.bam aln_sorted.bam
    samtools index aln_sorted.bam

    Then I used samtools pileup to extract the indels:
    samtools pileup -i -vcf reference.fasta aln_sorted.bam > aln_ivcf.pileup
    samtools.pl varFilter aln_ivcf.pileup | awk '($3=="*" && $6>=20 && $7>=20 && $8>=10)'> final_aln_ivcf.pileup

    examples of lines in final_aln_ivcf.pileup
    chr_1 427504 * +T/+T 9 498 57 22 +T * 9 13 0 0 0
    chr_1 441613 * */-C 22 22 53 37 * -C 34 3 0 0 0
    chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0
    chr_1 456726 * */-T 27 27 52 38 * -T 36 1 1 1 0
    chr_1 470390 * -C/-C 111 1021 44 28 -C * 24 4 0 0

    column headers in final_aln_ivcf.pileup according to http://samtools.sourceforge.net/samtools.shtml with option -c are as follows:
    chromosome name, coordinate, a star, the genotype, consensus quality, SNP quality, RMS mapping quality, # covering reads, the first allele, the second allele, # reads supporting the first allele, # reads supporting the second allele and # reads containing indels different from the top two alleles.

  • #2
    *bump*

    I am also interested in this problem. It seems that indel FPs are easier to come across than in the SNP.

    Perhaps some benchmarking is needed to determine parameters? against dbSNP for example?

    Comment


    • #3
      In my case, I'm looking for SNPs. I get a maximum score of 228 for column 5 & 6. I compared the SNPs score from Bowtie and BWA alignment. BWA gives me lower mapping and SNPs quality. I think it's pretty much depends on your data and also the aligner you use.

      Hope this helps,
      Melissa

      Comment


      • #4
        Originally posted by Melissa View Post
        In my case, I'm looking for SNPs. I get a maximum score of 228 for column 5 & 6. I compared the SNPs score from Bowtie and BWA alignment. BWA gives me lower mapping and SNPs quality. I think it's pretty much depends on your data and also the aligner you use.

        Hope this helps,
        Melissa
        SNP calling is a lot easier than indels it appears.

        Interesting to know that you go lower mapping and quality scores with BWA.

        I wouldn't use Bowtie for SNP calling - I would restrict it to RNA-seq.

        You might want to look into BFAST.

        Comment


        • #5
          Originally posted by NGSfan View Post
          SNP calling is a lot easier than indels it appears.

          Interesting to know that you go lower mapping and quality scores with BWA.

          I wouldn't use Bowtie for SNP calling - I would restrict it to RNA-seq.

          You might want to look into BFAST.
          Indels are hard to align so all reads agree, since each read is aligned independently. Efforts have been made in GATK to create an local re-aligner; I have my own (now not so stealth) effort. These both help SNP calling, as mis-aligned reads with indels leads to false SNPs, as well as to form a better consensus around indels.

          Regarding the SNP quality, I usually see upwards of 255 for some SNPs, but indels are off the chart and can reach >2000. Looking at the examples above, say
          Code:
           chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0
          you see that it has SNP quality 125, but only one read supporting it, with the indel being called heterozygous. Nevertheless, SNP quality I have found is a great "knob" (plotted the ROC curve) even though it may not be perfectly calibrated to PHRED log-likelihood for indels.

          Comment


          • #6
            Originally posted by nilshomer View Post
            Indels are hard to align so all reads agree, since each read is aligned independently. Efforts have been made in GATK to create an local re-aligner; I have my own (now not so stealth) effort. These both help SNP calling, as mis-aligned reads with indels leads to false SNPs, as well as to form a better consensus around indels.

            Regarding the SNP quality, I usually see upwards of 255 for some SNPs, but indels are off the chart and can reach >2000. Looking at the examples above, say
            Code:
             chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0
            you see that it has SNP quality 125, but only one read supporting it, with the indel being called heterozygous. Nevertheless, SNP quality I have found is a great "knob" (plotted the ROC curve) even though it may not be perfectly calibrated to PHRED log-likelihood for indels.

            Yes, that is exactly the problem (pairwise alignments vs multiple sequence alignment). I have been using the GATK with great success, so I am looking forward to your own approach. It really helps clean up the indels and reduces the false SNPs surrounding them.

            I was also thinking about using ROC curves to figure out the optimal settings for SNP and indel detection. What are you using as your benchmark? Are you generating your own reference sequence with known SNPs/indels?

            Best,
            German

            Comment


            • #7
              Originally posted by NGSfan View Post
              Yes, that is exactly the problem (pairwise alignments vs multiple sequence alignment). I have been using the GATK with great success, so I am looking forward to your own approach. It really helps clean up the indels and reduces the false SNPs surrounding them.

              I was also thinking about using ROC curves to figure out the optimal settings for SNP and indel detection. What are you using as your benchmark? Are you generating your own reference sequence with known SNPs/indels?

              Best,
              German
              ROC curves are the only way, I agree. For simulated data, I simulate 30x diploid coverage of hg18 using a program I wrote called "dwgsim" found at http://dnaa.sf.net (settings omitted). After alignment, I assess the quality of mapping using an accompanying program called "dwgsim_eval". I then use "samtools pileup" to variant call both SNPs and indels, filtering the resulting pileup with the samtools "samtool.pl varFilter" script. I then finally use "dwgsim_pileup_eval.pl" (in dnaa) to evaluate the pileup to get the data to plot TPR vs. FPR. This works for both Illumina and SOLiD data as well as various mapping and re-alignment tools, with the latter plugged in after the initial alignment. For real-world data, it gets a lot more complicated (allele frequency balance, dbsnp concordance, micro array concordance, ts/tv ratios, etc.).

              Comment


              • #8
                Originally posted by cur View Post
                examples of lines in final_aln_ivcf.pileup
                chr_1 427504 * +T/+T 9 498 57 22 +T * 9 13 0 0 0
                chr_1 441613 * */-C 22 22 53 37 * -C 34 3 0 0 0
                chr_1 453369 * */-A 125 125 48 22 * -A 20 1 1 1 0
                chr_1 456726 * */-T 27 27 52 38 * -T 36 1 1 1 0
                chr_1 470390 * -C/-C 111 1021 44 28 -C * 24 4 0 0

                column headers in final_aln_ivcf.pileup according to http://samtools.sourceforge.net/samtools.shtml with option -c are as follows:
                chromosome name, coordinate, a star, the genotype, consensus quality, SNP quality, RMS mapping quality, # covering reads, the first allele, the second allele, # reads supporting the first allele, # reads supporting the second allele and # reads containing indels different from the top two alleles.
                Hello -
                I am confused about the pileup output. Am I misreading something or do you have 14-15 output columns but only 13 column headers? The first row, when broken down:
                chromosome name: chr_1
                coordinate: 427504
                a star: *
                the genotype: +T/+T
                consensus quality: 9
                SNP quality: 498
                RMS mapping quality: 57
                # covering reads: 22
                the first allele: +T
                the second allele: *
                # reads supporting the first allele: 9
                # reads supporting the second allele: 13
                # reads containing indels different from the top two alleles: 0

                EXTRA COLUMNS: 0 0 <- Do these mean anything??? Thank you for your help - I'm new to NGS analysis.

                Comment


                • #9
                  Hi SH1,

                  I am also interested in your question. Have you found out?!

                  Thanks,

                  Kasycas

                  Comment


                  • #10
                    Same question about extra two columns in consensus indel output

                    Does anyone know what those last two columns mean? I've been searching for hours... Thanks!
                    -Kris

                    Comment


                    • #11
                      The last two columns represent the number of reads supporting further alleles. So if there was a 3rd allele present in the reads, then the FIRST of the last two columns would be greater than 0.

                      Comment


                      • #12
                        I'm pretty new to analyzing NextGen data and this thread has been really helpful with figuring out my pileup output file. However, while I know what every column means, the results in my columns are sometimes extremely strange or at the very least not corresponding to what you guys are getting.

                        Here are some examples:

                        1.) In the "A star" column, are you supposed to always get a * ? I have a mixture of *,A,G,C, and T's.

                        2.) In the "First allele" and "Second allele" columns, I have a bunch of really weird results. Anything from a repeat of a letter (ggggg) (CCcc$c.,,ccCC), to totally random letters and symbols (IBIIIIIIIIEIHHHIHIIII@III;!). But there are also normal calls in there too, although they are the minority for those 2 columns.

                        3.) All the other columns are normal

                        Unfortunately, I don't know the details of how the analysis was done to get these pileup files. I was just e-mailed the results and I'm not sure what to make of this.
                        Do any of you have any insight on why I'm getting results like this?

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        18 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        22 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        16 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        47 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X