Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • mapping quality score in tophat sam file

    Hello all,


    I have been trying to work with the sam output file that I received from tophat. My input file was an RNA-sequencing single end 36 bp run using Illumina GA2 with C. elegans data.

    A sample of the sam file from tophat I am looking at:

    HWUSI-EAS1795_0102:5:116:16624:12048#0 256 I 2821 0 16M * 0 0 GCTGGGTCTGAACTCT IIIIIIIIIIBGGGGI NM:i:2 NH:i:14 CC:Z:= CP:i:13533166 HI:i:0
    HWUSI-EAS1795_0102:5:103:16153:10159#0 256 I 2916 0 17M * 0 0 GTCAATTCTGCGACATG HHHHHHHHHHHHEHHHH NM:i:0 NH:i:6 CC:Z:= CP:i:13533261 HI:i:0
    HWUSI-EAS1795_0102:5:111:9819:1338#0 0 I 2916 0 17M * 0 0 GTCAATTCTGCGACATG IIIIIIIIIIIIIIIII NM:i:0 NH:i:6 CC:Z:= CP:i:13533261 HI:i:0
    HWUSI-EAS1795_0102:5:24:5025:14685#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC IIIIIIGIIHIIIIIHIHI NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
    HWUSI-EAS1795_0102:5:70:18860:13966#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC GGGGHHGHHGGDGGGGFGG NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
    HWUSI-EAS1795_0102:5:85:1866:10941#0 272 I 2946 0 19M * 0 0 CGAAAGTAACCCGAATACC GIIIIIIIIIIIIIIGIII NM:i:0 NH:i:6 CC:Z:= CP:i:13533291 HI:i:0
    HWUSI-EAS1795_0102:5:86:8520:16181#0 256 I 2962 0 21M * 0 0 ACCGACAAAAGAAGAGGAACG HHHHHHHHHHDFDDFGFGGFE NM:i:0 NH:i:7 CC:Z:= CP:i:13533307 HI:i:0
    HWUSI-EAS1795_0102:5:5:11293:13799#0 256 I 2970 0 30M * 0 0 AAGAAGAGGAACGCCAACTTTGGATAGACG HHFHHHHHHHEGGGGHHHHHHHGHHHHHGH NM:i:0 NH:i:7 CC:Z:= CP:i:13533315 HI:i:0
    HWUSI-EAS1795_0102:5:120:17838:10446#0 256 I 2975 0 36M * 0 0 GAGGAACGCCAACTTTGGATAGACGCTCTAGGGGCT IHIIBEIIIIIIIIHIIG@IEHEEIIIFIBGIIIEH NM:i:0 NH:i:7 CC:Z:= CP:i:13533320 HI:i:0
    HWUSI-EAS1795_0102:5:4:7318:17553#0 256 I 2990 0 36M * 0 0 TGGATAGACGCTCTAGGGGCTGATTTTGGTCGGAAA GHHHHGEHHEHHHHHHGHHGEG>GEGGGE@EDDFEE NM:i:0 NH:i:7 CC:Z:= CP:i:13533335 HI:i:0

    The fifth column seems to report the mapping quality score. The only scores I am coming across in my sam file is 0, 1 or 255. This seems off to me since I have a match in the 6th column... am I interpreting this incorrectly?

    I did an alignment with novoalign with the exact same input file and here is a sample of the sam file I am looking at:

    HWUSI-EAS1795_0102:5:120:17501:21319#0/1 4 * 0 0 * * 0 0 TGGCGA IIIIII PG:Z:novoalign ZS:Z:QC
    HWUSI-EAS1795_0102:5:120:17879:21320#0/1 4 * 0 0 * * 0 0 TGAGATC HGHHHHH PG:Z:novoalign ZS:Z:QC
    HWUSI-EAS1795_0102:5:120:2546:21319#0/1 4 * 0 0 * * 0 0 TGAAGATGCGACATACCCGCAGCTAGAC IIIIIIIIIIIIIIIIIIHIIGIHIGGG PG:Z:novoalign ZS:Z:NM
    HWUSI-EAS1795_0102:5:120:17450:21318#0/1 0 chrI 15064402 77 26M * 0 0 AACTGGGCCTCCAGTTGGTACGTCTG HHHHHHDHHHHHHEHHHHHHHHHHHH PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:26
    HWUSI-EAS1795_0102:5:120:18252:21315#0/1 4 * 0 0 * * 0 0 CCCCGG IIIIHI PG:Z:novoalign ZS:Z:QC
    HWUSI-EAS1795_0102:5:120:17255:21323#0/1 4 * 0 0 * * 0 0 CGGCGGAGGTCGCGGGTTCG GGEGGEGEG@GGGDGHDHH@ PG:Z:novoalign ZS:Z:NM
    HWUSI-EAS1795_0102:5:120:19739:21305#0/1 4 * 0 0 * * 0 0 GTGAGCGGTCGAGACCTGGATTGACAAG IIHIHIHHHIGGIIIIIIDIIGGHIIDG PG:Z:novoalign ZS:Z:NM
    HWUSI-EAS1795_0102:5:120:7089:21322#0/1 4 * 0 0 * * 0 0 TGAGCTAAACCGTACTAATGATCGAGGGCTTACCAT IIIIIIIIIIIIIIIIIHIIEIIIIIIIIIIIIIIG PG:Z:novoalign ZS:Z:NM
    HWUSI-EAS1795_0102:5:120:18560:21325#0/1 0 chrI 15066796 48 21M * 0 0 ATAGAATAATGTAGGTAAGGG FFFFE<GGGGGG?GGGDGGGG PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:21
    HWUSI-EAS1795_0102:5:120:18280:21316#0/1 0 chrI 15065492 43 20M * 0 0 GTCTTGAAACACGGATTGCG GGGGGGDGGGHGHGGHHHDH PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:20
    HWUSI-EAS1795_0102:5:120:17770:21316#0/1 0 chrI 15063143 3 36M * 0 0 TATGGTTGCAAAGCTGAAACTTAAAGAAATTGACGG BB?BBCFEFEGG@BDF8FEFGGG@GDGED;F8FEC2 PG:Z:novoalign AS:i:0 UQ:i:0 NM:i:0 MD:Z:36 ZS:Z:R ZN:i:2

    In this case, looking at column number 5, I clearly have scores that are ranging from 3 to 77... depending on the quality of my alignment...

    My concern is that I am using a program to toss out poor quality alignments which is using a value of 30 or greater for mapping quality as a filter. Clearly, none of my tophat alignments are passing that filter except for the ones with a 255 (unknown) score.
    Am I missing something here? If the tophat mapping quality scores are indeed correct - What can I do to filter out poor quality alignments from my tophat output?

    Thanks for your help!

  • #2
    Originally posted by jjpurwar View Post
    I have been trying to work with the sam output file that I received from tophat. My input file was an RNA-sequencing single end 36 bp run using Illumina GA2 with C. elegans data.
    Whenever the is something funny going on with quality scores, the first thing I try to check is what kind of FASTQ encoding did you start with (Sanger, old Solexa, or Illumina). And did you tell tophat this, or just trust the default was appropriate?

    Comment


    • #3
      Hi, so I did specify the --solexa1.3-quals as my option. I dont take for granted the default option.
      I did check with tophat regarding this matter - turns out that they dont take mapping qualities into account. This is worrisome - since you might want to throw out poor quality alignments. But I guess thats the trade off since tophat is a really fast aligner....

      Comment


      • #4
        Hi, so I did specify the --solexa1.3-quals as my option. I dont take for granted the default option.
        I did check with tophat regarding this matter - turns out that they dont take mapping qualities into account. This is worrisome - since you might want to throw out poor quality alignments. But I guess thats the trade off since tophat is a really fast aligner....

        Comment


        • #5
          Tophat may not look at the qualities, but it does write them out in the SAM/BAM file. This is important because if Tophat misunderstood the input FASTQ encoding, they will be wrong in the SAM/BAM.

          It would be worth double checking a couple of reads - compare the input quality (which you say used the Solexa/Illumina 1.3 style, meaning PHRED with offset 64) and the SAM/BAM quality (which should be Sanger style, meaning PHRED with offset 33).

          [Given Tophat doesn't look at the read qualities, this is a separate issue to your original query about the mapping qualities]

          Comment


          • #6
            I believe the read quality is the column after the sequence. For example, the quality for the read "GCTGGGTCTGAACTCT" is "IIIIIIIIIIBGGGGI" where
            "B" represents a Phred quality score of 33 (Assuming Sanger scoring)
            "G" represents a Phred quality score of 38
            "I" represents a Phred quality score of 40
            To convert the letter to the Phred, just take the ASCII representation and subtract 33.

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM
            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 06:37 PM
            0 responses
            12 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Yesterday, 06:07 PM
            0 responses
            10 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-22-2024, 10:03 AM
            0 responses
            52 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-21-2024, 07:32 AM
            0 responses
            68 views
            0 likes
            Last Post seqadmin  
            Working...
            X