  • remove reads from BAM whose mate has already been filtered

    Hi,

    I removed duplicates from a paired-end BAM using Picard MarkDuplicates. In some cases only one read of a pair was retained, so my BAM no longer has an even number of reads. I am not sure exactly why; perhaps the retained read was unmapped. No other filtering was done.

    For some downstream methods (e.g., bedtools pairtobed) I need to have a BAM where both reads are present for each fragment and no "singletons" of this type are present.

    Is there an available method to remove such singleton reads?

    If not, I was thinking to sort on readname, cook up something to identify singletons, dump names of singletons to file, remove reads using Picard FilterSamReads. Other ideas?

  • #2
    You can use bam flags to do this filtering.

    Here is a webpage with some good information on BAM flags:



    INTERPRETING THE BAM FLAGS


    The second column in a SAM/BAM file is the flag column. The flags may seem confusing at first, but the encoding allows many details about a read to be stored in a single number. The trick is to convert the decimal flag into binary, and then use the table to interpret each bit, where 1 = true and 0 = false.

    Here are some common BAM flags:

    163: 10100011 in binary
    147: 10010011 in binary
    99: 1100011 in binary
    83: 1010011 in binary

    Interpretation of 10100011 (reading the binary from right to left, i.e. starting at the least significant bit):

    1 the read is paired in sequencing, no matter whether it is mapped in a pair
    1 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)
    0 the query sequence itself is unmapped
    0 the mate is unmapped
    0 strand of the query (0 for forward; 1 for reverse strand)
    1 strand of the mate
    0 the read is the first read in a pair
    1 the read is the second read in a pair
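The decoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `decode_flag` and the short label strings are made up for this example, and only the eight bits listed in the table above are covered.

```python
# Bit values follow the table above, lowest bit first.
# The label strings are illustrative names, not an official vocabulary.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


for flag in (163, 147, 99, 83):
    # format(flag, "08b") pads the binary form to eight digits
    print(flag, format(flag, "08b"), decode_flag(flag))
```

For 163 this reports paired, proper_pair, mate_reverse_strand and second_in_pair, which matches the bit-by-bit reading in the list above.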



    • #3
      Hi vivek,

      BAM flags won't work for this. The information about the paired read does not tell you anything about whether the read is still in the file. It only contains information about its mapping properties.



      • #4
        If not, I was thinking to sort on readname, cook up something to identify singletons, dump names of singletons to file, remove reads using Picard FilterSamReads. Other ideas?
        I think that's what you'll have to do.

        Maybe you can go back and confirm that MarkDuplicates was treating your reads as paired end, and not single end? Maybe that was the problem.

        Or, try filtering your original file to keep only reads where both ends mapped, then run MarkDuplicates. Maybe that's why MarkDuplicates didn't mark both reads.
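The "cook up something to identify singletons" step quoted above could look roughly like the sketch below. It is not a tested pipeline, and it makes several assumptions: the input is SAM text (for example the output of `samtools view -h`) that is already sorted by read name, secondary and supplementary alignments (flag bits 0x100 and 0x800) should not count toward a pair, and the function name `singleton_names` is hypothetical.

```python
from itertools import groupby

# Secondary (0x100) and supplementary (0x800) records are skipped so that
# only primary alignments are counted per read name (an assumption).
NON_PRIMARY = 0x100 | 0x800


def singleton_names(sam_lines):
    """Yield read names that occur only once in name-sorted SAM text."""
    records = (
        line.rstrip("\n").split("\t")
        for line in sam_lines
        if line.strip() and not line.startswith("@")  # skip blanks and header
    )
    primary = (f for f in records if not int(f[1]) & NON_PRIMARY)
    # groupby only merges adjacent records, hence the name-sorted requirement
    for name, group in groupby(primary, key=lambda f: f[0]):
        if sum(1 for _ in group) < 2:
            yield name


demo = [
    "@HD\tVN:1.0\tSO:queryname",
    "r1\t99\tchr1\t100",
    "r1\t147\tchr1\t300",
    "r2\t73\tchr1\t500",   # mate is missing from the file
    "r3\t83\tchr2\t900",
    "r3\t163\tchr2\t700",
]
print(list(singleton_names(demo)))  # ['r2']
```

Because the records are streamed group by group, memory use stays small even for large files. The resulting names could be written one per line to a file and handed to Picard FilterSamReads, as the original post suggests.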



        • #5
          Hi swbarnes2,

          Here is my output in the MarkDuplicates metrics file:

          ## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
          LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
          CR503-1 4628209 124378530 7792909 3928406 30212247 83840 0.253973 213024317

          It certainly looks like MarkDuplicates detected paired ends. The UNPAIRED_READS_EXAMINED and UNPAIRED_READ_DUPLICATES are the classes in question. I had always interpreted these as cases where one read mapped and the other didn't. In any event, my guess is that the UNPAIRED_READ_DUPLICATES are cases where a read whose mate was unmapped was removed because it mapped to the exact same coordinates as another read.

          If this looks unusual I would appreciate feedback, but my guess is that the expected behavior is that MarkDuplicates will leave some orphan unmapped reads when REMOVE_DUPLICATES=true.
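As a sanity check on how the columns above relate, the reported PERCENT_DUPLICATION can be reproduced from the other counts if each pair is counted as two reads. The variable names below are just copies of the column headers, and the formula is my reading of how Picard derives this field rather than something stated in the metrics file.

```python
# Values copied from the metrics line above
unpaired_reads_examined = 4628209
read_pairs_examined = 124378530
unpaired_read_duplicates = 3928406
read_pair_duplicates = 30212247

# Each pair contributes two reads to both numerator and denominator
duplicate_reads = unpaired_read_duplicates + 2 * read_pair_duplicates
examined_reads = unpaired_reads_examined + 2 * read_pairs_examined
percent_duplication = duplicate_reads / examined_reads

print(round(percent_duplication, 6))

# Unpaired reads that survive duplicate removal
print(unpaired_reads_examined - unpaired_read_duplicates)  # 699803
```

The first value agrees with the 0.253973 in the file. The second shows that about 700,000 reads without a mapped mate remain after removal.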
