Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding reads where the mate is unmapped

    Hi folks,

    Apologies in advance for the rambling post!

    Out of the paired end reads in the 1000 genomes data, I am trying to locate those reads where one end of the pair is unmapped. I understand that I need to be using the flag field for this. Originally I thought that the following would pull out what I needed:

    ./samtools view -f 8 ftp://ftp.1000genomes.ebi.ac.uk/vol1...e.20101123.bam > flagtest.txt

    This resulted in a file that's around 50GB! Having a skim through this file, I noticed that the flag field varied considerably. I guess that by using -f 8, I asked samtools to pull out all those reads that have an unmapped mate, and that all the other bits can vary as they like?

    Using http://picard.sourceforge.net/explain-flags.html to decipher the SAM flags, I noticed that some of the flags (for example, 75) translates to:

    the read is paired in sequencing
    the read is mapped in a proper pair
    the mate is unmapped
    the read is the first read in a pair

    I think I'm missing something here. How can a read be "mapped in a proper pair" and have "the mate is unmapped"? I thought the 2 would be mutually exclusive?

    Am i right in thinking that I need to work out what the flag score should be using the various options and search just for those reads that satisfy that criteria? For example:

    the read is paired in sequencing
    the mate is unmapped
    strand of the query or strand of the query reverse
    strand of the mate or strand of the mate reverse
    the read is the first read in a pair or the read is the second read in a pair

    So this would be 8 searches in total? Or is there a more sophisticated way to do this?

    Apologies if this has already been asked - I have been searching through the threads (which are most informative), but without success.

    All help gratefully received!

    A

    PS I also noticed that there seemed to be a lot of reads where the ISIZE column is 0 - is this to be expected because the mate is unmapped?

  • #2
    I don't know about the reads with 2 and 8 both flagged, but you shouldn't need to do eight searches. You can specify multiple flags that you are filtering out, and the flags that you want to include, in one command line. For instance, samtools view -f 8 -F 4 should get all the reads that are mapped whose mates aren't mapped. You don't care about the read being the first or second read, or about its direction, so you don't filter for those flags at all.

    Comment


    • #3
      Many thanks for your response swbarnes2. I should have picked up on -F being the opposite of -f! I've been playing around with what you suggested and a few more questions if I may:

      1) Using your suggestion of -f 8 -F 4 certainly cuts down on the number of reads. Looking at the other criteria I can cut down on, I would also like to exclude those reads that fail platform/vendor quality checks and those reads that are either a PCR or an optical duplicate. Am I right in thinking that I should use -f 8 -F 1540? (4 + 512 + 1024 = 1540)

      2) In the results that I get using -f 8 -F 4, POS and MPOS are always the same and ISIZE is always 0. I take it this means that POS and MPOS represent the position of the read that did map, and that MPOS is not actually the location of the read that didn't map (as it shouldn't have a position!). So these results are the unmapped reads only?

      3) As an aside, I am also interested in those reads who's mates mapped unexpectedly far away (say 1kb or greater downstream/upstream). Is there anyway of filtering by MPOS or ISIZE? Or is there a better way of doing this?

      Apologies if these questions have already been asked - I've not found anything in the forums or the manual, but I may be missing something...

      Many thanks,

      A

      Comment


      • #4
        Originally posted by Ash View Post
        Many thanks for your response swbarnes2. I should have picked up on -F being the opposite of -f! I've been playing around with what you suggested and a few more questions if I may:

        1) Using your suggestion of -f 8 -F 4 certainly cuts down on the number of reads. Looking at the other criteria I can cut down on, I would also like to exclude those reads that fail platform/vendor quality checks and those reads that are either a PCR or an optical duplicate. Am I right in thinking that I should use -f 8 -F 1540? (4 + 512 + 1024 = 1540)
        Sure, if your bam file actually has those categories flagged.

        2) In the results that I get using -f 8 -F 4, POS and MPOS are always the same and ISIZE is always 0. I take it this means that POS and MPOS represent the position of the read that did map, and that MPOS is not actually the location of the read that didn't map (as it shouldn't have a position!). So these results are the unmapped reads only?
        Sam specs call for the unmapped mate of a mapped read to have the mapping coordinates of the mapped mate. It's so the two will sort togther. So yes, what you are seeing is totally normal and expected. So the only way to know the read is unmapped is to look at the flag, and if it has the 4 flag, it's unmapped, no matter what anything else in the line might imply.

        3) As an aside, I am also interested in those reads who's mates mapped unexpectedly far away (say 1kb or greater downstream/upstream). Is there anyway of filtering by MPOS or ISIZE? Or is there a better way of doing this?
        You could try -F 14, which should only get reads that both mapped, but are not a "proper pair". I'm not quite sure what the definition of a proper pair is, and it's probably going to depend on the software that is making the .bam, but reads that span a much larger than average or expected length should qualify.

        You could also convert to sam, and use awk, or something like that. I don't know how to do that on a .bam.

        Comment


        • #5
          Aaaah... now I see! So everything has to be searched from the flag field, and using -f and -F should give you every possible combination you need (with there being some redundancy in the system).

          With regards to the unmapped mates, the file sizes I'm getting out vary from 164Mb to 44GB - why is there such a difference? Occasionally I get a truncated file msg (usually the connection to the ftp site has been interrupted), but more usually I don't get an error message. Is this kind of variation between these 1000 genomes normal, or is something going wrong?

          With regards to those reads that are not in a proper pair, again there's a lot of variation in the file sizes (1GB to 10GB) - is this also normal? Is there any way of specifying certain positions in the genomes that I'm interested in (I'm thinking this would cut down on the file sizes and the time its taking..)?

          Many thanks,

          A

          Comment


          • #6
            samtools view allows for a region to be specified. So does samtools mpileup. BEDTools can give you the intersect between a .bed file and a .bam file

            Comment


            • #7
              Thanks swbarnes2. I was hoping there was a way of specifying more than one position at a time, but I guess I can get round that with a perl script.

              All the best,

              A

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 03-27-2024, 06:37 PM
              0 responses
              12 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-27-2024, 06:07 PM
              0 responses
              11 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              53 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              69 views
              0 likes
              Last Post seqadmin  
              Working...
              X