Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unexpected results using samtools view -f/-F

    I'm trying to extract the set of reads from a bam file where neither the read itself, nor its mate aligned. To accomplish this, first I tried using the REQUIRE argument (-f) with the flag 12:

    read unmapped
    mate unmapped

    When I ran it, the lines it was returning appeared to be aligned even though the bit flags seemed to imply that they were not (i.e. the bit flag did include read unmapped & mate unmapped). Here are a couple example lines returned from:

    samtools view -f12 <bam>

    ILLUMINA-1898B0:32:638L8AAXX:4:45:2010:18660 77 TRISPI_Contig6402 551 25 80M = 551 0 ACCGTAGGCCGCTACCGTAACCATGAACGCAGGTACAGATGCTCAGGAGTCCGGGAGTGACCAGACGAATTTCTAACAGA HHGHHHHHDHHGHHHHHHHHBHHHG<GEGGGGD<GGGEGDGDGGGFHHH@HDDHHBG3DBDDBDB@D>3>BBBB:CEBDE RG:Z:ALL_3_MERGED.1 XT:A:U NM:i:4 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:15A24A16C7G14

    ILLUMINA-1898B0:32:638L8AAXX:3:89:8485:16433 93 TRISPI_Contig2320 1477 0 80M = 1477 0 TCCGTCAATGTCGACGATGATATACAGTTCCGCTGCGCAGAACCCGCCCGGATACCAAGTTCCACGACTGGACATGCATC CEDD<FFBEF<HIGHHDBFHEFFBIFHFBHIIIHBGGGGGIAIIGIIIIIHIIFGIIGIIHIIIIIIIIIIIIIIIIIII RG:Z:ALL_3_MERGED.1 XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:33 XM:i:2 XO:i:0 XG:i:0 MD:Z:23C24A31

    ILLUMINA-1898B0:32:638L8AAXX:4:30:1852:14147 157 TRISPI_Contig2320 1477 0 80M = 1477 0 TCCGTCAATGTCGACGATGATATACAGTTCCGCTGCGCAGAACCCGCCCGGATACCAAGTTCCACGACTGGACATGCATC B3DB@====8>>DBDEE<EE8ECBE@@@CD@I@IHFG@GG8CDAGGDGGGDGED>IIGGIGGG<GDDGDGII?GIIIGEB RG:Z:ALL_3_MERGED.1 XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:33 XM:i:2 XO:i:0 XG:i:0 MD:Z:23C24A31


    In every case I am seeing a reference being listed (as opposed to the '*' I expect for a query with no hit), and CIGAR strings that seem to also confirm some kind of mapping being done. While in many of the lines output, the mapping quality is 0, I am seeing some cases where the mapping quality is fairly good (such as the '25' in the first example above). But actually I didn't think mapping score would affect how samtools parses the bit flag.

    Going by the bit flags in the produced output, it does appear that samtools view correctly filtered the data. But all of these alignments that were output seem to be at odds with their own bit flags, clearly being alignments even though the flag says they are not.

    The bam file I'm working with was produced in 2012 from the bwa aligner using aln + sampe. Are current bit flag conventions incompatible with whatever may be in the older bam files? Can anyone explain whats going on?


    And finally, in a related question, when I use the samtools view filter/require (-F/-f) arguments, when I use bit flags for multiple states, such as my read unmapped & mate unmapped above, does samtools combine them using AND or OR? I kind of expected the require argument to use AND, and the filter argument to use OR. But I haven't read anywhere exactly how it works.

  • #2
    The quick thing to check...see if these read hang off the edge of your contigs. It is a known thing that bwa does if you have a read that hangs off the edge to give the read the unmapped 4 flag and to give it a mapping position as well.

    Comment


    • #3
      Thanks for the suggestion. I just checked, and the examples I posted are suspicious cases with regards to mapping near an edge. For the last 2 with the reference TRISPI_Contig2320, that reference piece is 1477 bp long, and the mapping position is listed as 1477. But the bit flag is implying that these are reverse mappings (93 & 157), so wouldn't that mean that they start on that last position and map inwards? For the other example (ref: TRISPI_Contig6402), the reference is 626 bases long, and the mapping starts at 551. My queries are all 80mers, so in that case it must fall off the edge by 5bp. I'm surprised that even aligned given that I used bwa aln + sampe, and set the edit distance to 4, but that certainly could be an issue with mapping off the edge of a reference confusing bwa.

      But in both of those cases the CIGAR string reads 80M, is the CIGAR string incorrect when the query touches the edges?

      Assuming that this is an issue with my reads touching or spilling over the edges of references, is there some way to get around the problem? Should I just ignore the weird bits and trust that the bit flags have them correctly labeled?

      Comment


      • #4
        After some more reading, I am starting to wonder if the only bit flag I can really trust is 0x4 (read unmapped). I can't trust the presence or absence of a '*' in the REF column of the bam file since some aligners (bwa?) will propagate the hit from one end with mapping over into the REF column of its mate if that mate did not have a hit.

        Basically I want to find a way to use bit flags to extract these types of slices from bam files:

        1) all unmapped reads

        2) all mapped reads (uniquely, i.e. only reporting the primary/non-supplemental cases for a read aligning)

        3) all unmapped pairs

        4) all mapped pairs (uniquely, i.e. only reporting the primary/non-supplemental cases for a pair aligning)

        5) all reads whose mate is unmapped (uniquely, i.e. only reporting the primary/non-supplemental cases for a read aligning)

        6) all reads who are unmapped themselves, but their mates do map


        Can those things be reliably done using samtools -f/-F filtering on the bit flags? And is that operation consistent across different aligners?

        There seem to be so many exceptions that its difficult to trust any kind of filtering done using any bit flag other than 0x4.

        Comment


        • #5
          To quote the SAM specification:

          Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.
          So yeah, look at the flags and ignore the remainder. Yes, bwa is known to leave some alignment information in unmapped reads (at least some of the time). So, to answer your other questions:

          1) samtools view -f 0x4
          2) samtools view -F 0x804 (likely with -q 1 or -q 2, depending on the aligner, since many can output just one read of a multimapper)
          3) samtools view -f 0xD
          4) samtools view -f 0x1 -F 0x80C (This doesn't require "proper pairing" and again, you might want to use a MAPQ threshold to better define "unique").
          5) samtools view -F 0x808 (again, use a MAPQ threshold too)
          6) samtools view -f 0x4 -F 0x808

          Aside from the 0x800 bit, it's best to filter by MAPQ rather than define something as "unique" according to the flags, which is unreliable.

          Comment


          • #6
            Thank you very much! That is very helpful.

            Comment


            • #7
              Originally posted by jmartin View Post
              Thanks for the suggestion. I just checked, and the examples I posted are suspicious cases with regards to mapping near an edge. For the last 2 with the reference TRISPI_Contig2320, that reference piece is 1477 bp long, and the mapping position is listed as 1477. But the bit flag is implying that these are reverse mappings (93 & 157), so wouldn't that mean that they start on that last position and map inwards?
              1477 is the leftmost coordinate of the read. It's pretty much mapping to the next reference sequence in the file, not contig2320, and I bet if you check the following sequence in your ref, you'll see that.

              For the other example (ref: TRISPI_Contig6402), the reference is 626 bases long, and the mapping starts at 551. My queries are all 80mers, so in that case it must fall off the edge by 5bp. I'm surprised that even aligned given that I used bwa aln + sampe, and set the edit distance to 4, but that certainly could be an issue with mapping off the edge of a reference confusing bwa.
              bwa concatenates all your reference sequences into one long reference. So if enough of those last 5 bases of the read happened to match the first 5 bases of the following sequence, it would align.

              Assuming that this is an issue with my reads touching or spilling over the edges of references, is there some way to get around the problem? Should I just ignore the weird bits and trust that the bit flags have them correctly labeled?
              As mentioned already, assume that when the 4 flag is on, nothing else means anything.

              If you padded you contigs with N's, that would really reduce the number of reads that hang over.

              Comment


              • #8
                New post to an old thread...also having issues with -f-F

                Hi all,

                I'm trying to help out a colleague and finding that the solution is less straight forward that I thought it would be.

                In short, he has a bam file which is aligned to a reference genome but not filtered in anyway. He would now like to subsample that file to take a random 1M reads that have a minimum length cutoff of 50bp, are either merged or are unmerged R1s.

                My though had been to filter the file using samtools -m 50 and -f flags of interest, then do a shuf and write out to a new bam file, but I'm finding that -f -F don't behave as I thought they should.

                For example, on the test file I'm using, I know that these flags a present:
                0
                113
                117
                129
                133
                137
                141
                145
                147
                153
                16
                161
                163
                177
                181
                4
                589
                65
                653
                69
                73
                77
                81
                83
                89
                97
                99

                but I can't seem to use -f for all the flags of interest, I get a seemingly random subset. Originally I thought it was caused by the presence of the 0 flag, so I tried instead using -F to exclude the flags I wasn't interested in, but it then returned only reads with either 0, 16 or 4 as the flag.

                Any suggestions would be greatly appreciated.

                Cheers,
                A.

                Comment


                • #9
                  Originally posted by dpryan View Post
                  To quote the SAM specification:



                  So yeah, look at the flags and ignore the remainder. Yes, bwa is known to leave some alignment information in unmapped reads (at least some of the time). So, to answer your other questions:

                  1) samtools view -f 0x4
                  2) samtools view -F 0x804 (likely with -q 1 or -q 2, depending on the aligner, since many can output just one read of a multimapper)
                  3) samtools view -f 0xD
                  4) samtools view -f 0x1 -F 0x80C (This doesn't require "proper pairing" and again, you might want to use a MAPQ threshold to better define "unique").
                  5) samtools view -F 0x808 (again, use a MAPQ threshold too)
                  6) samtools view -f 0x4 -F 0x808

                  Aside from the 0x800 bit, it's best to filter by MAPQ rather than define something as "unique" according to the flags, which is unreliable.
                  As you mentioned "samtools view -f 0x4" equal all unmapped reads
                  what is the different between -f 4 and -f 12. Also with -f 8?

                  Comment


                  • #10
                    -f 4 returns unmapped entries
                    -f 12 returns unmapped entries whose mate is also unmapped
                    -f 8 returns entries whose mate is unmapped

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    30 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    32 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    28 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    53 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X