Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do you deal with reads "unmapped to NM_0012345"?

    I understand fully what it means when a read is mapped at coordinate X to ref sequence NM_0012345 with a certain quality score etc.

    I also understand fully if a read is unmapped (sam flag 4 is set). In most cases with these the Refname field of a SAM file is "*". Makes sense since there is no reference sequence.

    I recently ran BWA on a set of unpaired reads and came upon lines in the SAM file like this:

    R001778658_15_1 4 NM_008409.2 1536 18 100M ...
    R002786437_15_1 4 NM_008409.2 1536 25 100M ...
    R002993094_15_1 4 NM_008409.2 1535 23 100M ...

    I have truncated each line (removing seqs etc) for clarity. With a SAM flag of 4 (meaning the read is unmapped) it makes no sense to have a reference name like NM_008409. So I investigated what this means.

    The answer is that each of these alignments hangs off the 3' end of the ref sequence by a little bit: NM_008409 is 1632 long and these alignments begin at 1535 or 1536. Hence the reads (each 100 NTs long) stretch beyond the 3 prime end of the reference by 3 or 4 bases. Other than that the alignment is perfect (yes, I checked).

    Since this is RNASeq, I would certainly want to count these reads toward this gene. So, based on the evidence thus far, I am inclined to ignore the SAM flag and consider the reference name for counting the tags. But I am not sure that this is the only type of situation where SAM flag 4 is set and a reference name is mentioned.

    How do you deal with nominally "unmapped" reads which are nevertheless assigned to a particular reference sequence? Do you ignore the SAM flag or the ref name?
    Kamalakar Gulukota,
    Director,
    Center for Bioinformatics and Computational Biology
    NorthShore University Health System, [email protected]

  • #2
    Did you align to the whole genome or transcripts?

    Comment


    • #3
      This was to transcripts.
      Kamalakar Gulukota,
      Director,
      Center for Bioinformatics and Computational Biology
      NorthShore University Health System, [email protected]

      Comment


      • #4
        I guess I'm not really that surprised then that you see this. Do the unmapped portion of the reads match the genome? If not, then you have some sort of contamination at the end of your reads. If they do, then that tells me that the reference transcript is wrong in where it makes the cutoff for the 3' UTR.

        Do you have a reference genome? If so why map to transcripts over the genome. I think you can miss a lot by mapping to transcripts because you are making the implicit assumption that all transcripts and isoforms are known.

        Comment


        • #5
          The other reason one can get both a mapping coordinate and the unmapped flag is that sam specs call for unmapped reads to be given the mapping coordinates of their mapped mate. This is so the two reads will sort together. But maybe you only have single end data, in which case, you won't see this.

          To distinguish these, maybe you could pad your transcripts with n's so that reads won't cross between adjacent reference transcripts, and therefore wouldn't have the unmapped flag set. You could also try other aligners, as they might not set the unmapped flag in those circumstances. I know that bwa will, but other aligners might behave differently.

          Comment


          • #6
            Chadn737 - For the pipeline I am building, mapping to transcripts and to the genome and then combining information from both is important. (I will post more details on it after the pipeline is in a decent shape). I am also not surprised at the mapping per se. Rather my surprise was that BWA would report it as unmapped and yet provide the reference ID.

            swbarnes2 - Padding with N's could "fix" this. I'll try it. Other aligners: I have tried bowtie2 and have had some trouble. I am privately in touch with Ben Langmead in trying to resolve that issue. It is clear though that SAM format, a standard though it is, does have these odd behaviors based on what software created the file.

            Thank you both for your responses. Very helpful in driving me to my final pipeline. I have come to the conclusion that I should go with the Ref Name rather than the SAM flag (though perhaps with some additional checking if the flag is 4).

            Gulu
            Kamalakar Gulukota,
            Director,
            Center for Bioinformatics and Computational Biology
            NorthShore University Health System, [email protected]

            Comment


            • #7
              Originally posted by kgulukota View Post
              It is clear though that SAM format, a standard though it is, does have these odd behaviors based on what software created the file.
              ...
              Thank you both for your responses. Very helpful in driving me to my final pipeline. I have come to the conclusion that I should go with the Ref Name rather than the SAM flag (though perhaps with some additional checking if the flag is 4).
              That would be wrong - the spec is very clear that the FLAG determines if a read is mapped or not, and if unmapped other fields are to be regarded as undefined (although there is usually an expected value).

              Comment


              • #8
                Have a look at the FAQ for bwa: http://bio-bwa.sourceforge.net/

                "I see a read stands out the end of a chromosome and is flagged as unmapped (flag 0x4). What is happening here?

                "Internally BWA concatenates all reference sequences into one long sequence. A read may be mapped to the junction of two adjacent reference sequences. In this case, BWA will flag the read as unmapped, but you will see position, CIGAR and all the tags. A better solution would be to choose an alternative position or trim the alignment out of the end, but this is quite complicated in programming and is not implemented at the moment."


                If the read happens to align to both the end of one chromosome and the beginning of the next, this is sufficiently odd so that one should simply consider it unmapped, so I agree here with maubp in #7.

                Comment


                • #9
                  Yes, that BWA "feature" is a classic example. It would be nice if they fixed it to clear the reference name, mapping position, etc in this case - as well as the flag say it was unmapped.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  23 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  24 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  21 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  52 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X