Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding reads where the mate is unmapped

    Hi folks,

    Apologies in advance for the rambling post!

    Out of the paired end reads in the 1000 genomes data, I am trying to locate those reads where one end of the pair is unmapped. I understand that I need to be using the flag field for this. Originally I thought that the following would pull out what I needed:

    ./samtools view -f 8 ftp://ftp.1000genomes.ebi.ac.uk/vol1...e.20101123.bam > flagtest.txt

    This resulted in a file that's around 50GB! Having a skim through this file, I noticed that the flag field varied considerably. I guess that by using -f 8, I asked samtools to pull out all those reads that have an unmapped mate, and that all the other bits can vary as they like?

    Using http://picard.sourceforge.net/explain-flags.html to decipher the SAM flags, I noticed that some of the flags (for example, 75) translates to:

    the read is paired in sequencing
    the read is mapped in a proper pair
    the mate is unmapped
    the read is the first read in a pair

    I think I'm missing something here. How can a read be "mapped in a proper pair" and have "the mate is unmapped"? I thought the 2 would be mutually exclusive?

    Am i right in thinking that I need to work out what the flag score should be using the various options and search just for those reads that satisfy that criteria? For example:

    the read is paired in sequencing
    the mate is unmapped
    strand of the query or strand of the query reverse
    strand of the mate or strand of the mate reverse
    the read is the first read in a pair or the read is the second read in a pair

    So this would be 8 searches in total? Or is there a more sophisticated way to do this?

    Apologies if this has already been asked - I have been searching through the threads (which are most informative), but without success.

    All help gratefully received!

    A

    PS I also noticed that there seemed to be a lot of reads where the ISIZE column is 0 - is this to be expected because the mate is unmapped?

  • #2
    I don't know about the reads with 2 and 8 both flagged, but you shouldn't need to do eight searches. You can specify multiple flags that you are filtering out, and the flags that you want to include, in one command line. For instance, samtools view -f 8 -F 4 should get all the reads that are mapped whose mates aren't mapped. You don't care about the read being the first or second read, or about its direction, so you don't filter for those flags at all.

    Comment


    • #3
      Many thanks for your response swbarnes2. I should have picked up on -F being the opposite of -f! I've been playing around with what you suggested and a few more questions if I may:

      1) Using your suggestion of -f 8 -F 4 certainly cuts down on the number of reads. Looking at the other criteria I can cut down on, I would also like to exclude those reads that fail platform/vendor quality checks and those reads that are either a PCR or an optical duplicate. Am I right in thinking that I should use -f 8 -F 1540? (4 + 512 + 1024 = 1540)

      2) In the results that I get using -f 8 -F 4, POS and MPOS are always the same and ISIZE is always 0. I take it this means that POS and MPOS represent the position of the read that did map, and that MPOS is not actually the location of the read that didn't map (as it shouldn't have a position!). So these results are the unmapped reads only?

      3) As an aside, I am also interested in those reads who's mates mapped unexpectedly far away (say 1kb or greater downstream/upstream). Is there anyway of filtering by MPOS or ISIZE? Or is there a better way of doing this?

      Apologies if these questions have already been asked - I've not found anything in the forums or the manual, but I may be missing something...

      Many thanks,

      A

      Comment


      • #4
        Originally posted by Ash View Post
        Many thanks for your response swbarnes2. I should have picked up on -F being the opposite of -f! I've been playing around with what you suggested and a few more questions if I may:

        1) Using your suggestion of -f 8 -F 4 certainly cuts down on the number of reads. Looking at the other criteria I can cut down on, I would also like to exclude those reads that fail platform/vendor quality checks and those reads that are either a PCR or an optical duplicate. Am I right in thinking that I should use -f 8 -F 1540? (4 + 512 + 1024 = 1540)
        Sure, if your bam file actually has those categories flagged.

        2) In the results that I get using -f 8 -F 4, POS and MPOS are always the same and ISIZE is always 0. I take it this means that POS and MPOS represent the position of the read that did map, and that MPOS is not actually the location of the read that didn't map (as it shouldn't have a position!). So these results are the unmapped reads only?
        Sam specs call for the unmapped mate of a mapped read to have the mapping coordinates of the mapped mate. It's so the two will sort togther. So yes, what you are seeing is totally normal and expected. So the only way to know the read is unmapped is to look at the flag, and if it has the 4 flag, it's unmapped, no matter what anything else in the line might imply.

        3) As an aside, I am also interested in those reads who's mates mapped unexpectedly far away (say 1kb or greater downstream/upstream). Is there anyway of filtering by MPOS or ISIZE? Or is there a better way of doing this?
        You could try -F 14, which should only get reads that both mapped, but are not a "proper pair". I'm not quite sure what the definition of a proper pair is, and it's probably going to depend on the software that is making the .bam, but reads that span a much larger than average or expected length should qualify.

        You could also convert to sam, and use awk, or something like that. I don't know how to do that on a .bam.

        Comment


        • #5
          Aaaah... now I see! So everything has to be searched from the flag field, and using -f and -F should give you every possible combination you need (with there being some redundancy in the system).

          With regards to the unmapped mates, the file sizes I'm getting out vary from 164Mb to 44GB - why is there such a difference? Occasionally I get a truncated file msg (usually the connection to the ftp site has been interrupted), but more usually I don't get an error message. Is this kind of variation between these 1000 genomes normal, or is something going wrong?

          With regards to those reads that are not in a proper pair, again there's a lot of variation in the file sizes (1GB to 10GB) - is this also normal? Is there any way of specifying certain positions in the genomes that I'm interested in (I'm thinking this would cut down on the file sizes and the time its taking..)?

          Many thanks,

          A

          Comment


          • #6
            samtools view allows for a region to be specified. So does samtools mpileup. BEDTools can give you the intersect between a .bed file and a .bam file

            Comment


            • #7
              Thanks swbarnes2. I was hoping there was a way of specifying more than one position at a time, but I guess I can get round that with a perl script.

              All the best,

              A

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                04-22-2024, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 11:49 AM
              0 responses
              15 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-24-2024, 08:47 AM
              0 responses
              16 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              61 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              60 views
              0 likes
              Last Post seqadmin  
              Working...
              X