  • SAMtools pileup format - reference base question

    Hi everybody,

    I have a question concerning the SAMtools pileup format. I performed SNP (and indel) calling for 75 bp PE Illumina data against hg19 by following the protocol explained here.

    I noticed in the filtered output the following SNP call:

    chr3 57627500 t C 241 241 60 71 C$c$c$c$ccccccccccccccCccccccCcccccccccccccccccccccccccccccccccccccccccccc^]c BCCCCCCCCCCCCCCCCCBCCCCCCBCCCCC@CCCCCCCCCC?C=CCCCCCCCC>CC?CBCCCBCC@CCC@

    Why is the reference base (t) written in lower case? I read that in some of MAQ's tools (e.g. cns2fq) "bases in lower case are essentially repeats or do not have sufficient coverage; bases in upper case indicate regions where SNPs can be reliably called."
    I doubt that this applies here, because the coverage seems fine (71), the SNP appears on both strands, the alignments are reliable (RMS MQ = 60), and, according to UCSC, the position where the SNP is called has quite good mappability.

    Additionally, indel lines have more than 13 columns. Does anybody know what the additional 14th and 15th columns mean?

    Any hint/help will be greatly appreciated!
    Best regards

  • #2
    As per the samtools manual page:

    At this column [reference], a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, ‘ACGTN’ for a mismatch on the forward strand and ‘acgtn’ for a mismatch on the reverse strand.
    I believe that since all of your reads are a 'C' either forward (uppercase) or reverse strand (lowercase) then the reference is upper/lowercase depending on the predominance of the reads; i.e., since most of the reads are reverse then your reference is 'reverse'.

    I do not believe that MAQ has anything to do with the SAM format.



    • #3
      In your reference file, t is in lowercase.
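Lowercase bases in a soft-masked reference can be located with a few lines of code. The following is a minimal sketch, assuming the sequence is already loaded as a plain string; the function name `soft_masked_intervals` is hypothetical and not part of any tool mentioned in this thread.

```python
def soft_masked_intervals(sequence):
    """Return 0-based, half-open (start, end) runs of lowercase bases."""
    intervals = []
    start = None
    for i, base in enumerate(sequence):
        if base.islower():
            # Open a new run if we are not already inside one.
            if start is None:
                start = i
        elif start is not None:
            # An uppercase base closes the current lowercase run.
            intervals.append((start, i))
            start = None
    if start is not None:
        # The sequence ended while still inside a lowercase run.
        intervals.append((start, len(sequence)))
    return intervals


# Toy sequence (made up for illustration): a lowercase repeat between
# uppercase flanks.
print(soft_masked_intervals("ACGTcaaaacaaaaGG"))
```

A pileup position whose reference base is lowercase would fall inside one of the reported intervals.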



      • #4
        @westerman: Sorry I have to correct you: The column you refer to is not the reference but the reads column.
        @mfischer: Your reference base is "t" because of UCSC's so-called soft-masking, which writes bases in lower case wherever there is a repeat. The UCSC browser tells me that chr3:57627500 lies inside a simple repeat, (CAAAA)n. At the same time, that position is a C/T SNP. So everything is OK: you have a homozygous SNP allele (C) that is supported by reads from both strands, though mostly from the reverse strand (c).
        As to the 14th and 15th columns - do you mean the 11th to 13th? The samtools FAQ says that these are indel-specific, see
        http://sourceforge.net/apps/mediawik..._pileup_output.
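The strand reading of the read-bases column can be checked by tallying its symbols. This is a rough sketch based on the manual text quoted in post #2 (dot/comma for matches, upper/lower case for mismatches); the function name `tally_read_bases` is hypothetical. It assumes that `^` is followed by exactly one mapping-quality character (ASCII minus 33, so `]` encodes 60), that `$` marks a read end, and that `+N...`/`-N...` describe indels that should be skipped.

```python
import re
from collections import Counter


def tally_read_bases(read_bases):
    """Count forward/reverse matches and mismatches in a read-bases string."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            i += 2  # skip '^' and the mapping-quality character after it
            continue
        if ch in "+-":
            m = re.match(r"\d+", read_bases[i + 1:])
            if m:
                # skip the sign, the length digits, and the indel sequence
                i += 1 + len(m.group()) + int(m.group())
                continue
        if ch == ".":
            counts["forward_match"] += 1
        elif ch == ",":
            counts["reverse_match"] += 1
        elif ch in "ACGTN":
            counts["forward_mismatch"] += 1
        elif ch in "acgtn":
            counts["reverse_mismatch"] += 1
        # '$', '*', and anything unrecognised are not counted
        i += 1
    return counts


# Read-bases column (9th field) of the SNP line from the first post.
bases = "C$c$c$c$ccccccccccccccCccccccCcccccccccccccccccccccccccccccccccccccccccccc^]c"
print(tally_read_bases(bases))
```

Here the reverse-strand mismatch count (lowercase `c`) should dominate, in line with the explanation above; the total can be compared against the coverage column.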



        • #5
          Thanks for the replies. I totally forgot that UCSC repeat-masks the reference.

          @epigen: I expected to see 13 columns in the indel rows, as described in the link you sent, but I actually got 15 columns for every indel. An example would be:

          chr3 44826315 * */+T 221 221 60 33 * +T 24 7 2 2 0

          It seems like others have experienced that as well, see http://seqanswers.com/forums/showthread.php?t=4234 post #8.
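The column count can be made explicit by splitting the indel line and separating the 13 documented fields from whatever follows. This is only a sketch: the field labels in `DOCUMENTED_INDEL_COLUMNS` are my own hypothetical names, based on my recollection of the 13-column indel description in the samtools FAQ, and should be treated as assumptions. The meaning of the trailing values is not answered anywhere in this thread.

```python
# Hypothetical labels for the 13 documented columns of an indel line
# (assumed from the old samtools FAQ; not taken from this thread).
DOCUMENTED_INDEL_COLUMNS = [
    "chrom", "pos", "ref", "genotype", "consensus_qual", "snp_qual",
    "rms_mapq", "coverage", "allele1", "allele2",
    "reads_allele1", "reads_allele2", "reads_other_indels",
]


def split_indel_line(line):
    """Split a pileup indel line into documented fields and trailing extras."""
    fields = line.split()
    named = dict(zip(DOCUMENTED_INDEL_COLUMNS, fields))
    extra = fields[len(DOCUMENTED_INDEL_COLUMNS):]
    return named, extra


# The indel line quoted above.
named, extra = split_indel_line(
    "chr3 44826315 * */+T 221 221 60 33 * +T 24 7 2 2 0"
)
print(len(named) + len(extra), "columns; undocumented trailing values:", extra)
```

Keeping the extras in a separate list avoids guessing at their meaning while still letting downstream code use the first 13 fields.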



          • #6
            additional columns in samtools pileup output for indels

            Now that you mention it, I looked at the indel lines of my data - I ignored them before because I'm only interested in SNPs ATM - and also saw the two additional columns. Heng Li must have changed the output format since writing the samtools FAQs. (Also, the manual page entry for pileup is not up to date, the parameters have changed.) How do we bug him to answer/update since he already commented on this thread, but only answered your first question?



            • #7
              Originally posted by epigen
              Now that you mention it, I looked at the indel lines of my data - I ignored them before because I'm only interested in SNPs ATM - and also saw the two additional columns. Heng Li must have changed the output format since writing the samtools FAQs. (Also, the manual page entry for pileup is not up to date, the parameters have changed.) How do we bug him to answer/update since he already commented on this thread, but only answered your first question?
              His seqanswers handle is lh3.



              • #8
                I wrote an email to the samtools mailing list.

                Hi everybody,

                according to the SAM FAQ page the pileup format has 13 columns for indel
                lines (when the pileup is called with -c). I noticed in my pileup files
                that all indel rows have 15 columns. Does anybody know what columns 14
                and 15 are?

                Thanks in advance
                Cheers
                Maybe this helps



                • #9
                  Just wondering if anyone has discovered what the extra columns are? I can't find any information on them in the samtools documentation.



                  • #10
                    So far, I haven't received any answers to that question. But I have to admit that I didn't dig deeper into the issue.



                    • #11
                      Originally posted by mard
                      Just wondering if anyone has discovered what the extra columns are? I can't find any information on them in the samtools documentation.



                      • #12
                        Originally posted by nilshomer
                        Thanks for the link but I can only see explanations for 13 out of the 15 columns there.
                        This issue has also been reported in this thread:
                        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
