Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unique Reads from BAM/SAM

    Hi all,

    I'm using published data that includes all reads from an experiment and am not familiar with the SAM/BAM output. I'm having trouble interpreting the bitwise flag and MAPQ values to extract uniquely mapping reads and understand appropriate cut-offs.

    The SAM FAQ says to use samtools view -q 1 in order to extract "unique" reads; to which value does q refer? MAPQ?

    I want to replicate the number of acceptable reads in the publications using their criteria: "The best alignment was kept only in cases in which the second-best alignment had at least three more mismatches." What sort of q value (MAPQ) would this be?

    In all, I just want to be confident that I can filter out all unique reads.

    Thanks for any help!

    Further info: the dataset is single end bisulfite treated and only has 10 bitwise flags:
    0
    4
    16
    512
    516
    528
    1024
    1040
    1536
    1552
    Last edited by Gen2007; 01-13-2011, 10:31 AM.

  • #2
    Hope this will help:

    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

    Comment


    • #3
      The only flags you want are 0 and 16. Everything higher than 1023 is a duplicate. Everything higher than 511 failed a platform quality check. The 4's are unmapped.

      view -f 0x010 is what you want, I think.

      Comment


      • #4
        Thanks for the suggestions. Depending on how I filter I got different results (out of 24.5 million reads). Do you guys have any idea of what is contributing to the disparities?

        samtools view -q 1 Colon_tumor_FFPE_lane2.bam | wc -l : 8,481,938 reads
        view -f 0x010 Colon_tumor_FFPE_lane2.bam | wc -l : 9,597,123 reads
        awk '$2~/^(0|16)$/' Colon_tumor_FFPE_lane2.sam | wc -l : 1,237,334

        Any ideas?

        Thanks a lot!

        Comment


        • #5
          I have the same question. After searching the forum I found many people have similar questions, and people have different understandings of how "unique/multiple" mappings were defined in SAM format. From the SAM specification, the flag 0x100 indicates whether the alignment is primary or not. But this is different from "unique", because multiple mappings will still have one primary mapping and other non-primary mappings, so just counting primary mappings doesn't give you the answer. I wonder if SAM format could add an aux field for this basic information, which could be extracted easily? If I'm missing something, please correct me.

          Comment


          • #6
            Hello,
            Originally posted by Gen2007 View Post
            The SAM FAQ says to use samtools view -q 1 in order to extract "unique" reads; to which value does q refer? MAPQ?
            ...Well, the FAQ on samtools shows an example of how to filter out reads with MAPQ below a certain threshold. In that example they put 1 which means 'filter out nearly nothing!'. As they say, more stringent thresholds would be appropriate, say 15 (p-value ~0.05 that a read is incorrectly mapped) or 20 (p-value ~0.01).
            Also note that 'uniqueness' has a rather vague definition, better to think in terms of reliability (i.e. MAPQ).

            I want to replicate the number of acceptable reads in the publications using their criteria: "The best alignment was kept only in cases in which the second-best alignment had at least three more mismatches." What sort of q value (MAPQ) would this be?
            I can't tell exactly, but I would say that -q 15 or -q 20 should be stringent enough while avoiding to throw away too many genuine reads.
            The filtering may become trickier if you want to replicate *exactly* the published results.

            Just my 2p...
            Dario

            Comment


            • #7
              Thanks a lot for the reply Dario. It seemed to me that -q 1 was not removing very much, and I was not very confident in using that value as a proper filter. Let me try your suggestions. Thanks again!

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                Yesterday, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              57 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              53 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              45 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              55 views
              0 likes
              Last Post seqadmin  
              Working...
              X