  • Samtools rmdup in a stream

    I'm trying the following strategy to remove duplicates from small subsets of interest:

    samtools view -b test.bam chr1:7660315-7660315 | samtools rmdup - - | samtools view -

    ...but it doesn't seem to remove anything. Am I doing something wrong syntactically, or do I perhaps have the kind of data that rmdup doesn't work on? How do I tell if I have FR orientation, or if ISIZE is correctly set? (I'm new to using BAMs, and I don't create them myself.)

    Here's a sample line.

    938_1080_925 16 chr1 590592 0 30M * 0 0 TCATGTCAACTGCAAACAGAAACAATTTAA :BIIIICIIIIIIIIIHIIIIIIIIIIIII RG:Z:0 CS:Z:T003003011002211001312101211312 CQ:Z:?0>59;7=/;=><54<5<39?/@/5>3A):

    (And yes, according to other posts, "I should really use Picard for this". But that has weird Java problems on my system that I don't have the time or permissions to fix.)

    Thanks!

  • #2
    For samtools, ISIZE must be set correctly; otherwise rmdup won't work. If you have single-end reads, use rmdup -s. To use Picard, you must have MRNM and MPOS set correctly for paired-end reads (so far as I know). Picard is more theoretically correct about duplicates. You may see tiny differences between samtools and Picard results, but in practice they do not matter much.

    • #3
      Thanks, but I don't know if I have ISIZE set correctly or if I have single-end reads. Those are my actual questions.

      • #4
        You need to read the SAM format specification:
        Look it up on Google.
        Column number 9 is the inferred insert size (ISIZE). If all of those in your file are 0, then you have single-end reads (or no reads mapped to the reference).
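
To make the column-9 check concrete, here is a minimal Python sketch (not part of samtools) that splits one SAM line into its 11 mandatory fields and reads column 9. The field names follow the current SAM specification, where column 9 is called TLEN (ISIZE in older versions) and columns 7 and 8 are RNEXT and PNEXT (MRNM and MPOS in older versions). The function name `parse_sam_line` is hypothetical, and the sample record is the one from the first post, rebuilt with tab separators and with the CS and CQ tags dropped.

```python
# Names of the 11 mandatory SAM columns, in order (current spec naming).
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into its mandatory fields.

    Hypothetical helper for illustration; optional tags after column 11
    are ignored.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated columns")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    # These columns are integers in the SAM format.
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    return record


# Sample record from the first post (assumption: tabs restored, CS/CQ tags omitted).
sample = "\t".join([
    "938_1080_925", "16", "chr1", "590592", "0", "30M",
    "*", "0", "0",
    "TCATGTCAACTGCAAACAGAAACAATTTAA",
    ":BIIIICIIIIIIIIIHIIIIIIIIIIIII",
    "RG:Z:0",
])

rec = parse_sam_line(sample)
print(rec["TLEN"], rec["RNEXT"], rec["PNEXT"])  # prints: 0 * 0
```

A zero in column 9 for a single record only shows that no insert size was recorded for that read. Scanning every record in the region of interest, and checking the FLAG column as well, gives a fuller picture.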

        • #5
          Originally posted by Pepe
          You need to read the SAM format specification:
          Look it up on Google.
          Column number 9 is the inferred insert size (ISIZE). If all of those in your file are 0, then you have single-end reads (or no reads mapped to the reference).
          I've read the spec backwards and forwards; it's a matter of understanding the details. I don't see anywhere that it says "if you have all zeroes, that means single ends, so use rmdup -s" -- so, thanks.

          • #6
            To tell if reads are paired in sequencing or mapped in pairs, one should look at the FLAG (2nd) column.
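
As a rough illustration of reading the FLAG column, the sketch below decodes a FLAG value into the bit meanings defined in the SAM specification. The bit values come from the spec; the label strings and the function names `decode_flag` and `is_paired` are hypothetical, chosen only for this example.

```python
# FLAG bits as defined in the SAM specification; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def is_paired(flag):
    """True if the read was paired in sequencing (bit 0x1 is set)."""
    return bool(flag & 0x1)


print(decode_flag(16))  # ['reverse_strand']
print(is_paired(16))    # False
print(decode_flag(99))  # ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```

The sample line in the first post has FLAG 16, so bit 0x1 is not set and that read is recorded as unpaired. One record does not characterize a whole file, but if every read in the region decodes the same way, the data are being treated as single-end, which is the case where post #2 suggests rmdup -s.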

            • #7
              I know about the flags field, but I don't know what those things mean in detail. I've Googled for resources about pairing, but I don't think I'm getting the whole picture. I just don't know enough about sequencing technology yet, I guess.

              Is the combination of pipes I'm trying reasonable, if I get the right options set up? That is, am I correct that I want a -b on my first view command, so that I'm sending BAM format to rmdup? And should those stdin/stdout dashes work, the way I have them? Thanks.
