Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • htseq-count warning messages: can they be ignored?

    Dear all,

    I'm new to RNA-seq analysis. I'm Trying out htseq-count to get the raw counts for downstream analysis by EdgeR or DESeq. I am getting some warning messages from HTSeq and I am not sure if I can ignore it, or what I should do about it.

    While running htseq-count:
    Code:
    python -m HTSeq.scripts.count names_sorted.sam  genes.gtf
    I got thousands of warning messages like these:

    Warning: Read HWI-ST845:13032020JBACXX:8:1101:6938:66256 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

    Warning: Read HWI-ST845:13032020JBACXX:8:1101:7019:18774 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

    After looking into those reads that produce warning messages and from other posts on this forum, I think it is because some reads miss their mate.
    For example:
    Code:
    grep "HWI-ST845:130320:D20JBACXX:8:1101:6938:66256" names_sorted.sam
    HWI-ST845:13032020JBACXX:8:1101:6938:66256 129 chr1 24615683 2 50M chr2 32387601 0 AGGGGGTTCGATTCCTTCCTTTCTTATTTTACTTTTACATAGGTTGGTTC @@CFFFDFHHAFHIGJIJIGHJIJJIJGIJJGHIIJIHDHGHIFHIJEHG MD:Z:50 NH:i:3 HI:i:2 NM:i:0 SM:i:2 XQ:i:40 X2:i:40

    HWI-ST845:13032020JBACXX:8:1101:6938:66256 129 chr2 22588080 2 50M = 32387601 0 AGGGGGTTCGATTCCTTCCTTTCTTATTTTACTTTTACATAGGTTGGTTC @@CFFFDFHHAFHIGJIJIGHJIJJIJGIJJGHIIJIHDHGHIFHIJEHG MD:Z:50 NH:i:3 HI:i:1 NM:i:0 SM:i:2 XQ:i:40 X2:i:40

    HWI-ST845:13032020JBACXX:8:1101:6938:66256 65 chr2 32387601 40 50M = 22588080 0 CTGCAGGGGGACAGTGAGCAGAGATGGGGCAGGGATCAAGTTCTGAGTTG CCCFFFFFHHHHHJGHGIIJJIIIIIJJJFJIJJGHJJJJCGIJJJJGIJ MD:Z:50 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0

    HWI-ST845:13032020JBACXX:8:1101:6938:66256 145 chrM 6846 2 50M chr2 32387601 0 GAACCAACCTATGTAAAAGTAAAATAAGAAAGGAAGGAATCGAACCCCCT GHEJIHFIHGHDHIJIIHGJJIGJIJJIJHGIJIJGIHFAHHFDFFFC@@ MD:Z:50 NH:i:3 HI:i:3 NM:i:0 SM:i:2 XQ:i:40 X2:i:40

    My guess is that these are what HTSeq complained about. However, is this something I need to worry about? Why do I get those unpaired missing mates?Do I simply disregard those warning messages?

    --
    Some additional information about the alignment I am looking at:

    Code:
    samtools flagstat mybam.bam
    80760824 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    79363333 + 0 mapped (98.27%:nan%)
    80760824 + 0 paired in sequencing
    40378231 + 0 read1
    40382593 + 0 read2
    77654682 + 0 properly paired (96.15%:nan%)
    78100006 + 0 with itself and mate mapped
    1263327 + 0 singletons (1.56%:nan%)
    354942 + 0 with mate mapped to a different chr
    66138 + 0 with mate mapped to a different chr (mapQ>=5)

    Last few lines of the results from HTseq-count:
    a 0
    l7Rn6 168
    no_feature 15754100
    ambiguous 109560
    too_low_aQual 0
    not_aligned 387345
    alignment_not_unique 11975435

  • #2
    Those warnings occur for the reason you wrote. It's a bit odd, though, for a mapped mate to not be in the SAM file (you'll run into this frequently if only one read of a pair map and the unmapped mate is omitted from the resulting alignment file, as is done by tophat).

    Comment


    • #3
      Many thanks for the reply dpryan!! I really appreciate it.

      You said that it's odd for a mapped mate to not be in the sam file - do you think it is possible that they just don't have a mapped mate? It looks like except for the 2nd and 3rd reads, the rest I posted in the above example were mapped to different chromosome.
      On a side note, would you suggest doing some post processing on the alignment file ? (some of the reads that htseq-count gave warning messages were aligned to many places and have low MAPQ)

      Thanks again!


      Originally posted by dpryan View Post
      Those warnings occur for the reason you wrote. It's a bit odd, though, for a mapped mate to not be in the SAM file (you'll run into this frequently if only one read of a pair map and the unmapped mate is omitted from the resulting alignment file, as is done by tophat).

      Comment


      • #4
        In this case, the aligner is just saving space in the output file. It looks like one read of a pair maps uniquely, but the other does not (and does not seem to map to a known splice form, if you're using an aligner that maps first to the transcriptome). In these cases, the reads should be considered multi-mapping anyway, so they should get ignored. You can ensure that this is correct by making a test SAM file with just the header and and something like:

        Code:
        HWI-ST845:13032020JBACXX:8:1101:6938:66256 129 chr1 24615683 2 50M chr2 32387601 0 AGGGGGTTCGATTCCTTCCTTTCTTATTTTACTTTTACATAGGTTGGTTC @@CFFFDFHHAFHIGJIJIGHJIJJIJGIJJGHIIJIHDHGHIFHIJEHG MD:Z:50 NH:i:3 HI:i:2 NM:i:0 SM:i:2 XQ:i:40 X2:i:40
        HWI-ST845:13032020JBACXX:8:1101:6938:66256 65 chr2 32387601 40 50M chr1 24615683 0 CTGCAGGGGGACAGTGAGCAGAGATGGGGCAGGGATCAAGTTCTGAGTTG CCCFFFFFHHHHHJGHGIIJJIIIIIJJJFJIJJGHJJJJCGIJJJJGIJ MD:Z:50 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
        HWI-ST845:13032020JBACXX:8:1101:6938:66256 129 chr2 22588080 2 50M = 32387601 0 AGGGGGTTCGATTCCTTCCTTTCTTATTTTACTTTTACATAGGTTGGTTC @@CFFFDFHHAFHIGJIJIGHJIJJIJGIJJGHIIJIHDHGHIFHIJEHG MD:Z:50 NH:i:3 HI:i:1 NM:i:0 SM:i:2 XQ:i:40 X2:i:40
        HWI-ST845:13032020JBACXX:8:1101:6938:66256 65 chr2 32387601 40 50M = 22588080 0 CTGCAGGGGGACAGTGAGCAGAGATGGGGCAGGGATCAAGTTCTGAGTTG CCCFFFFFHHHHHJGHGIIJJIIIIIJJJFJIJJGHJJJJCGIJJJJGIJ MD:Z:50 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
        HWI-ST845:13032020JBACXX:8:1101:6938:66256 145 chrM 6846 2 50M chr2 32387601 0 GAACCAACCTATGTAAAAGTAAAATAAGAAAGGAAGGAATCGAACCCCCT GHEJIHFIHGHDHIJIIHGJJIGJIJJIJHGIJIJGIHFAHHFDFFFC@@ MD:Z:50 NH:i:3 HI:i:3 NM:i:0 SM:i:2 XQ:i:40 X2:i:40
        HWI-ST845:13032020JBACXX:8:1101:6938:66256 65 chr2 32387601 40 50M chrM 32387601 0 CTGCAGGGGGACAGTGAGCAGAGATGGGGCAGGGATCAAGTTCTGAGTTG CCCFFFFFHHHHHJGHGIIJJIIIIIJJJFJIJJGHJJJJCGIJJJJGIJ MD:Z:50 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
        You'll need to fix the flag field, which I didn't adjust. If you use the resulting file with htseq-count, the correct output would be no counts (assuming that any of those align to a region that you're counting).

        Comment


        • #5
          Thank you dpryan for the suggestion. I"ll use a test sam file to confirm the result.

          Originally posted by dpryan View Post
          In this case, the aligner is just saving space in the output file. It looks like one read of a pair maps uniquely, but the other does not (and does not seem to map to a known splice form, if you're using an aligner that maps first to the transcriptome). In these cases, the reads should be considered multi-mapping anyway, so they should get ignored. You can ensure that this is correct by making a test SAM file with just the header and and something like:

          Code:
          HWI-ST845:13032020JBACXX:8:1101:6938:66256 129 chr1 24615683 2 50M chr2 32387601 0 AGGGGGTTCGATTCCTTCCTTTCTTATTTTACTTTTACATAGGTTGGTTC @@CFFFDFHHAFHIGJIJIGHJIJJIJGIJJGHIIJIHDHGHIFHIJEHG MD:Z:50 NH:i:3 HI:i:2 NM:i:0 SM:i:2 XQ:i:40 X2:i:40
          HWI-ST845:13032020JBACXX:8:1101:6938:66256 65 chr2 32387601 40 50M chr1 24615683 0 CTGCAGGGGGACAGTGAGCAGAGATGGGGCAGGGATCAAGTTCTGAGTTG CCCFFFFFHHHHHJGHGIIJJIIIIIJJJFJIJJGHJJJJCGIJJJJGIJ MD:Z:50 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
          HWI-ST845:13032020JBACXX:8:1101:6938:66256 129 chr2 22588080 2 50M = 32387601 0 AGGGGGTTCGATTCCTTCCTTTCTTATTTTACTTTTACATAGGTTGGTTC @@CFFFDFHHAFHIGJIJIGHJIJJIJGIJJGHIIJIHDHGHIFHIJEHG MD:Z:50 NH:i:3 HI:i:1 NM:i:0 SM:i:2 XQ:i:40 X2:i:40
          HWI-ST845:13032020JBACXX:8:1101:6938:66256 65 chr2 32387601 40 50M = 22588080 0 CTGCAGGGGGACAGTGAGCAGAGATGGGGCAGGGATCAAGTTCTGAGTTG CCCFFFFFHHHHHJGHGIIJJIIIIIJJJFJIJJGHJJJJCGIJJJJGIJ MD:Z:50 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
          HWI-ST845:13032020JBACXX:8:1101:6938:66256 145 chrM 6846 2 50M chr2 32387601 0 GAACCAACCTATGTAAAAGTAAAATAAGAAAGGAAGGAATCGAACCCCCT GHEJIHFIHGHDHIJIIHGJJIGJIJJIJHGIJIJGIHFAHHFDFFFC@@ MD:Z:50 NH:i:3 HI:i:3 NM:i:0 SM:i:2 XQ:i:40 X2:i:40
          HWI-ST845:13032020JBACXX:8:1101:6938:66256 65 chr2 32387601 40 50M chrM 32387601 0 CTGCAGGGGGACAGTGAGCAGAGATGGGGCAGGGATCAAGTTCTGAGTTG CCCFFFFFHHHHHJGHGIIJJIIIIIJJJFJIJJGHJJJJCGIJJJJGIJ MD:Z:50 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
          You'll need to fix the flag field, which I didn't adjust. If you use the resulting file with htseq-count, the correct output would be no counts (assuming that any of those align to a region that you're counting).

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Current Approaches to Protein Sequencing
            by seqadmin


            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
            04-04-2024, 04:25 PM
          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 04-11-2024, 12:08 PM
          0 responses
          31 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 10:19 PM
          0 responses
          32 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-10-2024, 09:21 AM
          0 responses
          28 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 04-04-2024, 09:00 AM
          0 responses
          53 views
          0 likes
          Last Post seqadmin  
          Working...
          X