  • Why HTseq warning of unfound mate pairs?

    Dear all
    I am using the htseq-count tool to summarize gene counts from BAM files generated by tophat (v 2.03), which is based on bowtie2. I have used this pipeline (based on bowtie1) several times with human RNA-Seq data and it has produced good results.

    In the most recent project, we are working with the E. coli K12 genome and 100 bp paired-end reads.

    I ran htseq-count on the accepted_hits.bam files generated by tophat, but it gave me only warnings of the form "xxx claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)". I then sorted the BAM files with samtools before this step, but still had no luck: thousands of the same warnings appeared and I got no reads in the output gene_counts.txt file.

    I checked the SAM file (first 10 lines, converted from the sorted BAM file) and they looked like this:

    HWI-ST984:1021021ACXX:2:1210:8261:88919 99 chr 1 255 4M14I82M = 57 156 AGTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACC @@BFFFDFHHHHHJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJIJIIJJJJJJJHFFFBBCEEEEEEDDDDDDDDDDDDDDDDDDDDC AS:i:-57 XN:i:0 XM:i:2 XO:i:1 XG:i:14 NM:i:16 MD:Z:2C0T82 YT:Z:UU NH:i:1
    HWI-ST984:1021021ACXX:2:1308:13660:65155 99 chr 2 255 6M9I85M = 117 215 TATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGT CCCFFFFFHGHHHJJJJJJJJJJJJJJJJJJJJJJJJJIIIIJIJJHIGIFJJJIJGHIJHHHH?CEFEFFEECD>@BCDDDCDDDDDD@CDDDDDBDD9 AS:i:-42 XN:i:0 XM:i:2 XO:i:1 XG:i:9 NM:i:11 MD:Z:0G0C89 YT:Z:UU NH:i:1
    HWI-ST984:1021021ACXX:2:2108:14990:23666 99 chr 10 255 100M = 167 257 TTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTT CCCFFFFFHHHHHJJIIIIJJJJIIIJIJIJIIJHIJIJJJJIJJIJEHIJIJJJJJIHHHHHFFCDFFEEECEEDDDDDDDDBDDACCCDDDDDDCDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
    HWI-ST984:1021021ACXX:2:1214:16246:55224 89 chr 10 255 100M * 0 0 TTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTT A@C>>CCC9?A<BC>::EECACC=>DDDDECCBB@@EFGGHFC===<FC?893F@B9B>EBDBDB9C9EFB3F?1JIEIGGIIGHEGHDHDFFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
    HWI-ST984:1021021ACXX:2:1108:7813:47825 99 chr 22 255 100M = 113 191 CGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGT CCCFFFFFHHGHHJJJJHJHIJIJJJJJJJJJJJJHJIIIGIJJIJJJJJJJJJJJJJJJHIJJHHHHHFDDDCC>CCEEDDDDEDDDFDDDDDDDDDCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
    HWI-ST984:1021021ACXX:2:1105:8881:46986 163 chr 23 255 100M = 137 214 GGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTC CCCFFFFFHHHHHJEIHGHGIGHGHIJIJJJIIIIIHIIJIIJIJJJJJJIIJHJJIJI@GIJJJIHHHBDFD>AEEEEDDDDEDDDEDDCCDDDDDDCD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
    I then checked the alignment stats with samtools flagstat and found that 82.25% of reads are properly paired.

    So what is wrong with my BAM file? The majority of reads in it are definitely proper mate pairs. Why can't they be sorted so that mate pairs are assigned to adjacent lines for htseq-count to read?
    I used the samtools sort command to do the sorting. Any better ideas?

    I'm pretty new to this field, so pardon me if similar questions have been asked before.

  • #2
    Have you sorted by position or by name? You have to use "samtools sort -n" to sort by read name, so that the lines describing mates appear next to each other.
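
A minimal sketch of why the sort order matters, using made-up read names and positions rather than real alignments. Python's built-in `sorted` stands in for the two samtools sort modes here; the exact name ordering that samtools uses may differ, but the effect on mate adjacency is the same. The helper `mates_adjacent` is a hypothetical name introduced for this illustration.

```python
# Toy records as (read_name, position); all values are invented for illustration.
records = [
    ("readA", 100),
    ("readB", 150),
    ("readA", 300),
    ("readB", 420),
]


def mates_adjacent(recs):
    """Return True if every read name occupies one contiguous run of records."""
    seen = set()
    previous = None
    for name, _pos in recs:
        if name != previous:
            if name in seen:
                # The name reappeared after a different name: mates are split up.
                return False
            seen.add(name)
        previous = name
    return True


by_position = sorted(records, key=lambda r: r[1])  # stand-in for a coordinate sort
by_name = sorted(records, key=lambda r: r[0])      # stand-in for a sort by read name

print("position-sorted, mates adjacent:", mates_adjacent(by_position))
print("name-sorted, mates adjacent:", mates_adjacent(by_name))
```

In the position-sorted list the two records of each read are separated by records of the other read, which is the situation a tool expecting adjacent mates cannot handle.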

    • #3
      Thanks Simon.
      I forgot to add -n in the samtools sorting step and thus messed up the order of the SAM reads.
      I re-ran the program today and this time htseq-count works fine with by-name sorting.

      • #4
        Hi,

        I get the same warnings: "Warning: Read xxxx claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)"

        My sam file looks like this:

        HWI-EAS261_0019_FC:1:1:1144:8868#0 99 11 128587218 255 76M = 128587361 219 GAAAAGCACACGCATGATGGTTTTGCTATCGTGTGACATTTATTTCATACTTGCTACCTGTAAGAAATTCCTTGAA IIIIIIHHIIIIEIIIIGIIIIIIIIIIIIIIIGIII<IIGIIIIHGFIIHIIIIIIIFIHIIIHH?IIGHIHIIG NM:i:0 NH:i:1
        HWI-EAS261_0019_FC:1:1:1144:8868#0 147 11 128587361 255 76M = 128587218 -219 AGGGATGCTGTTTCTAAGGCATGTAGGTGCTGAGGGTCTACCCCAAAGGGTAGTTTGGGACTGCAGGGCAGGCAGG DCIFIIIIIIIDBIBIIG@FHGIIIII@HIIIIIIHIIHGIIIIHIIIGIDIIIIHIIIIIIIIIIIHIIIIGIII NM:i:1 NH:i:1
        HWI-EAS261_0019_FC:1:1:1145:1981#0 99 22 21959147 255 76M = 21959234 163 GAGAAGTTCAGATGAGTTTGGCCAAGTTCCCTGGGTGGTGAGAGGCCTGGCCTGCCTCATGTAGTAACAGAACTGC HHHHHHHHHHHFGHHGGGGEGG<GGDEGGGDGGGGGGDDGGEGGGGDGGEDFBGGGGGGBGGGGA<BAEF@GFGEE NM:i:0 NH:i:1
        HWI-EAS261_0019_FC:1:1:1145:1981#0 147 22 21959234 255 76M = 21959147 -163 CCTTCCTCTTTTTGGAAGAAAAAAGAGGCAGGATCTCACTGTCTTGTCCAGGCTGGAAGGCAGTGGCGTGATCATG =F<AE@7IF@EHHGHGIIIIIGGG@G8GGD<GEDGGGBGE>IHGIGIEIIIIGGGIDIFIIGIIGIHHHIIIIIHI NM:i:0 NH:i:1
        HWI-EAS261_0019_FC:1:1:1145:8828#0 99 10 6054667 255 76M = 6054796 5361 TGCCACTGCCCCGTGTCCTGTGATGTGACTTCAGAGCTTCCAAAACGCAGGCAAGCACAACGGATGTCTCCTGGGC DFHHEHHHHHHHGHHHHHHBGEBB:GGGGGGDBGB4DGGGHHHHHHHHGHBHFBHG:G@42FF,,DBBDB+>DGGA NM:i:0 NH:i:1
        HWI-EAS261_0019_FC:1:1:1145:8828#0 147 10 6054796 255 64M5156N12M = 6054667 -5361 CCCTGCTTCTTACCAAGAAATTCTTGTTCTTTTGGTTTTCTAGATTGTTCTTCTACTCTTCCTCTGTCTCCGCTGC CBE3EGDDHBGI>DDIG@BIEBEBHHDD>EG@DDDAGGDFBIBDDBDDDED>DDDGAGGDGG@GGEDGDHIHIDII NM:i:1 XS:A:- NH:i:1

        I sorted the BAM file from tophat using: samtools sort -n
        and then converted BAM to SAM using: samtools view .bam >.sam

        In my SAM file I can see that lines with the same name are next to each other, so why does htseq-count still give me this warning?

        Many thanks

        • #5
          To: xy6699
          Your sam file looks properly sorted (at least from the section you posted here). The warnings may come from other unpaired reads. Did you check your alignment stats? What is the percentage of aligned reads that are properly paired?
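
As a rough sketch of what "percentage properly paired" means at the record level, the function below reads the FLAG column of tab-separated SAM lines and counts how many paired reads also have the proper-pair bit set. It is only an approximation of what a stats tool reports (the exact denominator such tools use may differ), and the function name and toy records are assumptions made for this example.

```python
def properly_paired_fraction(sam_lines):
    """Fraction of paired, primary records whose proper-pair bit (0x2) is set."""
    paired = 0
    proper = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary alignments
        if flag & 0x1:  # read is paired in sequencing
            paired += 1
            if flag & 0x2:  # aligner marked the pair as properly aligned
                proper += 1
    return proper / paired if paired else 0.0


# Toy records with only the first two SAM columns filled in (invented names).
toy = [
    "@HD\tVN:1.0",
    "\t".join(["r1", "99"]),
    "\t".join(["r1", "147"]),
    "\t".join(["r2", "89"]),  # paired, but mate unmapped, so not a proper pair
]

print(properly_paired_fraction(toy))
```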

          • #6
            The warning is not about improperly paired mates but about missing mates. Take the read ID from one of the warnings, grep for it in the SAM file, and check whether it really appears an even number of times, in adjacent lines.
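
The check described above can be sketched in a few lines, assuming a tab-separated SAM file whose first column is the read ID. The function names and the toy records are hypothetical. Note that an even count in adjacent lines is a necessary condition only; it does not by itself prove that both mates are present.

```python
def occurrences(sam_lines, read_id):
    """Return the 0-based record indices (headers excluded) where read_id appears."""
    indices = []
    record_number = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header lines are not alignment records
        if line.split("\t", 1)[0] == read_id:
            indices.append(record_number)
        record_number += 1
    return indices


def even_and_adjacent(indices):
    """True if the read appears an even, non-zero number of times on consecutive lines."""
    even = len(indices) > 0 and len(indices) % 2 == 0
    adjacent = all(b - a == 1 for a, b in zip(indices, indices[1:]))
    return even and adjacent


# Toy records with invented read names; only the first two columns are filled in.
toy = [
    "@HD\tVN:1.0",
    "r1\t99",
    "r1\t147",
    "r2\t163",
    "r3\t89",
    "r2\t83",
]

for read_id in ("r1", "r2", "r3"):
    hits = occurrences(toy, read_id)
    print(read_id, hits, even_and_adjacent(hits))
```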

            • #7
              Hi,

              Thanks a lot for the reply.

              I looked at the warning reads carefully and found that they have very low mapping quality and actually the adjacent mate reads have the same sequence, so they are not really "mate" pairs.

              Take one warning for example:

              Warning: Read HWI-EAS261_0019_FC:1:1:2912:15323#0 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

              and check the read "HWI-EAS261_0019_FC:1:1:2912:15323#0" in my sam file:

              HWI-EAS261_0019_FC:1:1:2912:15323#0 163 12 57869932 3 18M197N58M = 57870226 505 CCGGCTACCCGCTGGTCCCCAGCCTGCGGAGGGCGCTGTCGGCGGTGGCTCTCGGTAGAACACCAGGCTGTTACCC IIIIIIIHIIIIIIIFHIIIIEGIG<GGGBHIIDEEIIDGADGD+)@C??AAA8ABBDBDEB@EEBC8>C<>@8@? NM:i:1 XS:A:- NH:i:2 CC:Z:= CP:i:57869932 HI:i:0
              HWI-EAS261_0019_FC:1:1:2912:15323#0 419 12 57869932 3 18M197N58M = 57870226 699 CCGGCTACCCGCTGGTCCCCAGCCTGCGGAGGGCGCTGTCGGCGGTGGCTCTCGGTAGAACACCAGGCTGTTACCC IIIIIIIHIIIIIIIFHIIIIEGIG<GGGBHIIDEEIIDGADGD+)@C??AAA8ABBDBDEB@EEBC8>C<>@8@? NM:i:1 XS:A:- NH:i:2 HI:i:1

              I think I can just discard these reads...

              Many thanks,
              Xin
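
Decoding the FLAG values of the two records (163 and 419) helps explain what is going on. The sketch below uses the bit meanings from the SAM format; the short label strings are my own naming, not official identifiers. Both flags have the second-in-pair bit set and 419 additionally has the secondary-alignment bit, which suggests the two lines are two reported alignments of the same read (consistent with NH:i:2) rather than a first/second mate pair.

```python
# SAM FLAG bits; the label strings are informal names chosen for this example.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


for flag in (163, 419):
    print(flag, decode_flag(flag))
```

Neither record carries the first-in-pair bit, so no line with this read ID represents the other mate.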

              • #8
                Originally posted by xy6699
                ... and actually the adjacent mate reads have the same sequence, so they are not really "mate" pairs.
                Exactly. You may now wonder where in your pipeline the mates got lost (the other mate, with its own sequence, must be somewhere). Maybe you filtered them out in some previous step.

                • #9
                  The program continues to run even after spitting out these warnings.
                  Does anyone know whether it skips the affected reads? Thanks.
