SEQanswers

Old 06-26-2012, 07:17 PM   #1
slowsmile
Member
 
Location: long island

Join Date: May 2011
Posts: 22
Default Why HTseq warning of unfound mate pairs?

Dear all
I am using the htseq-count tool to summarize gene counts from bam files generated by tophat (v 2.03), which is based on bowtie2. I've used this pipeline (based on bowtie1) several times with human RNA-Seq data and it has produced good results.

In the most recent project, we are working with the E. coli K12 genome and 100 bp paired-end reads.

I ran htseq-count on the accepted_hits.bam files generated by tophat, but it gave me nothing but warnings of the form "xxx claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)". I then sorted the bam files with samtools before this step, but still had no luck: thousands of the same warnings came out and I got no reads in the output gene_counts.txt file.

I checked the sam file (first lines, converted from the sorted bam file) and they looked like this:

Quote:
HWI-ST984:1021021ACXX:2:1210:8261:88919 99 chr 1 255 4M14I82M = 57 156 AGTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACC @@BFFFDFHHHHHJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJIJIIJJJJJJJHFFFBBCEEEEEEDDDDDDDDDDDDDDDDDDDDC AS:i:-57 XN:i:0 XM:i:2 XO:i:1 XG:i:14 NM:i:16 MD:Z:2C0T82 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:1308:13660:65155 99 chr 2 255 6M9I85M = 117 215 TATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGT CCCFFFFFHGHHHJJJJJJJJJJJJJJJJJJJJJJJJJIIIIJIJJHIGIFJJJIJGHIJHHHH?CEFEFFEECD>@[email protected] AS:i:-42 XN:i:0 XM:i:2 XO:i:1 XG:i:9 NM:i:11 MD:Z:0G0C89 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:2108:14990:23666 99 chr 10 255 100M = 167 257 TTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTT CCCFFFFFHHHHHJJIIIIJJJJIIIJIJIJIIJHIJIJJJJIJJIJEHIJIJJJJJIHHHHHFFCDFFEEECEEDDDDDDDDBDDACCCDDDDDDCDDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:1214:16246:55224 89 chr 10 255 100M * 0 0 TTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTT [email protected]>>CCC9?A<BC>::EECACC=>[email protected]@EFGGHFC===<[email protected]>EBDBDB9C9EFB3F?1JIEIGGIIGHEGHDHDFFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:1108:7813:47825 99 chr 22 255 100M = 113 191 CGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGT CCCFFFFFHHGHHJJJJHJHIJIJJJJJJJJJJJJHJIIIGIJJIJJJJJJJJJJJJJJJHIJJHHHHHFDDDCC>CCEEDDDDEDDDFDDDDDDDDDCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
HWI-ST984:1021021ACXX:2:1105:8881:46986 163 chr 23 255 100M = 137 214 GGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTC CCCFFFFFHH[email protected]>AEEEEDDDDEDDDEDDCCDDDDDDCD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1
I then checked the alignment stats with samtools flagstat and found that 82.25% of the reads are properly paired.

So what is wrong with my bam file? The majority of the reads in it are definitely proper mate pairs. Why can't they be sorted so that mates are placed on adjacent lines for htseq-count to read?
I used the samtools sort command to do the sorting. Any better ideas?

I'm pretty new to this field, so pardon me if similar questions have been asked before.
Old 06-26-2012, 11:52 PM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default

Have you sorted by position or by name? You have to use "samtools sort -n" to sort by read name, so that the lines describing mates appear next to each other.
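
Editor's note: a minimal Python sketch, not part of htseq-count or samtools, for checking whether a SAM file is grouped by read name the way "samtools sort -n" leaves it. The function name find_split_names and the toy records are hypothetical; the only assumptions are that header lines start with "@" and that the read name is the first tab-separated field.

```python
import io

def find_split_names(sam_lines):
    """Return read names whose records are NOT all on adjacent lines."""
    finished = set()   # names whose run of adjacent lines has already ended
    split = []         # names that show up again after their run ended
    previous = None
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue   # skip blank lines and SAM header lines
        name = line.split("\t", 1)[0]
        if name != previous:
            if name in finished and name not in split:
                split.append(name)
            if previous is not None:
                finished.add(previous)
            previous = name
    return split

# Toy records (only the first four SAM columns, for illustration).
position_sorted = "r1\t99\tchr\t1\nr2\t99\tchr\t2\nr1\t147\tchr\t57\nr2\t147\tchr\t117\n"
name_sorted = "r1\t99\tchr\t1\nr1\t147\tchr\t57\nr2\t99\tchr\t2\nr2\t147\tchr\t117\n"

print(find_split_names(io.StringIO(position_sorted)))  # mates separated by other reads
print(find_split_names(io.StringIO(name_sorted)))      # mates adjacent
```

The sketch keeps every read name in memory, so for a large file it is better run on a sample of the first few hundred thousand lines.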
Old 06-27-2012, 11:31 AM   #3
slowsmile
Member
 
Location: long island

Join Date: May 2011
Posts: 22
Default

Thanks Simon.
I forgot to add -n in the samtools sorting step and thus messed up the order of the SAM reads.
I re-ran the program today, and this time htseq-count works fine with by-name sorting.
Old 06-29-2012, 08:17 AM   #4
xy6699
Member
 
Location: Cambridge

Join Date: Oct 2011
Posts: 12
Default

Hi,

I get the same warning: "Warning: Read xxxx claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)"

My sam file looks like this:

HWI-EAS261_0019_FC:1:1:1144:8868#0 99 11 128587218 255 76M = 128587361 219 GAAAAGCACACGCATGATGGTTTTGCTATCGTGTGACATTTATTTCATACTTGCTACCTGTAAGAAATTCCTTGAA IIIIIIHHIIIIEIIIIGIIIIIIIIIIIIIIIGIII<IIGIIIIHGFIIHIIIIIIIFIHIIIHH?IIGHIHIIG NM:i:0 NH:i:1
HWI-EAS261_0019_FC:1:1:1144:8868#0 147 11 128587361 255 76M = 128587218 -219 AGGGATGCTGTTTCTAAGGCATGTAGGTGCTGAGGGTCTACCCCAAAGGGTAGTTTGGGACTGCAGGGCAGGCAGG [email protected]@HIIIIIIHIIHGIIIIHIIIGIDIIIIHIIIIIIIIIIIHIIIIGIII NM:i:1 NH:i:1
HWI-EAS261_0019_FC:1:1:1145:1981#0 99 22 21959147 255 76M = 21959234 163 GAGAAGTTCAGATGAGTTTGGCCAAGTTCCCTGGGTGGTGAGAGGCCTGGCCTGCCTCATGTAGTAACAGAACTGC HHHHHHHHHHHFGHHGGGGEGG<GGDEGGGDGGGGGGDDGGEGGGGDGGEDFBGGGGGGBGGGGA<[email protected] NM:i:0 NH:i:1
HWI-EAS261_0019_FC:1:1:1145:1981#0 147 22 21959234 255 76M = 21959147 -163 CCTTCCTCTTTTTGGAAGAAAAAAGAGGCAGGATCTCACTGTCTTGTCCAGGCTGGAAGGCAGTGGCGTGATCATG =F<[email protected]@[email protected]<GEDGGGBGE>IHGIGIEIIIIGGGIDIFIIGIIGIHHHIIIIIHI NM:i:0 NH:i:1
HWI-EAS261_0019_FC:1:1:1145:8828#0 99 10 6054667 255 76M = 6054796 5361 TGCCACTGCCCCGTGTCCTGTGATGTGACTTCAGAGCTTCCAAAACGCAGGCAAGCACAACGGATGTCTCCTGGGC DFHHEHHHHHHHGHHHHHHBGEBB:GGGGGGDBGB4DGGGHHHHHHHHGHBHFBHG:[email protected],,DBBDB+>DGGA NM:i:0 NH:i:1
HWI-EAS261_0019_FC:1:1:1145:8828#0 147 10 6054796 255 64M5156N12M = 6054667 -5361 CCCTGCTTCTTACCAAGAAATTCTTGTTCTTTTGGTTTTCTAGATTGTTCTTCTACTCTTCCTCTGTCTCCGCTGC CBE3EGDDHBGI>[email protected]>[email protected]>[email protected] NM:i:1 XS:A:- NH:i:1

I sorted the bam file from tophat using: samtools sort -n
and then converted bam to sam using: samtools view .bam >.sam

I can see that in my sam file the lines with the same name are next to each other, so why does htseq-count still give me this warning?

Many thanks
Old 06-29-2012, 10:26 AM   #5
slowsmile
Member
 
Location: long island

Join Date: May 2011
Posts: 22
Default

To: xy6699
Your sam file looks properly sorted (at least from the section you posted here). The warnings may come from other unpaired reads. Did you check your alignment stats? What is the percentage of aligned reads that are properly paired?
Old 06-29-2012, 10:32 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default

The warning is not about improperly paired mates but about missing mates. Take the read ID from one of the warnings, grep for it in the SAM file, and check whether it really appears an even number of times, in adjacent lines.
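
Editor's note: the grep check described above can also be sketched in Python. This is only an illustration under stated assumptions: locate_read and even_and_adjacent are hypothetical helper names, the input is SAM text with tab-separated fields, and the toy records are invented apart from the flag values.

```python
def locate_read(sam_lines, read_id):
    """Return (line_number, flag) for every alignment record of read_id."""
    hits = []
    for number, line in enumerate(sam_lines, start=1):
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1 and fields[0] == read_id:
            hits.append((number, int(fields[1])))
    return hits

def even_and_adjacent(hits):
    """True if the read occurs an even number of times on consecutive lines."""
    numbers = [number for number, _ in hits]
    consecutive = all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
    return bool(hits) and len(hits) % 2 == 0 and consecutive

# Toy SAM text: read "a" has its two records separated, read "b" has them adjacent.
toy = [
    "@HD\tVN:1.0",
    "a\t99\tchr\t1",
    "b\t99\tchr\t5",
    "b\t147\tchr\t60",
    "a\t147\tchr\t90",
]

for read_id in ("a", "b"):
    hits = locate_read(toy, read_id)
    print(read_id, hits, even_and_adjacent(hits))
```

Passing an open file object instead of the toy list works the same way, since the function only iterates over lines.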
Old 07-02-2012, 05:27 AM   #7
xy6699
Member
 
Location: Cambridge

Join Date: Oct 2011
Posts: 12
Default

Hi,

Thanks a lot for the reply.

I looked at the warning reads carefully and found that they have very low mapping quality and actually the adjacent mate reads have the same sequence, so they are not really "mate" pairs.

Take one warning for example:

Warning: Read HWI-EAS261_0019_FC:1:1:2912:15323#0 claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)

and check the read "HWI-EAS261_0019_FC:1:1:2912:15323#0" in my sam file:

HWI-EAS261_0019_FC:1:1:2912:15323#0 163 12 57869932 3 18M197N58M = 57870226 505 CCGGCTACCCGCTGGTCCCCAGCCTGCGGAGGGCGCTGTCGGCGGTGGCTCTCGGTAGAACACCAGGCTGTTACCC IIIIIIIHIIIIIIIFHIIIIEGIG<GGGBHIIDEEIIDGADGD+)@[email protected]>C<>@[email protected]? NM:i:1 XS:A:- NH:i:2 CC:Z:= CP:i:57869932 HI:i:0
HWI-EAS261_0019_FC:1:1:2912:15323#0 419 12 57869932 3 18M197N58M = 57870226 699 CCGGCTACCCGCTGGTCCCCAGCCTGCGGAGGGCGCTGTCGGCGGTGGCTCTCGGTAGAACACCAGGCTGTTACCC IIIIIIIHIIIIIIIFHIIIIEGIG<GGGBHIIDEEIIDGADGD+)@[email protected]>C<>@[email protected]? NM:i:1 XS:A:- NH:i:2 HI:i:1

I think I can just discard these reads...

Many thanks,
Xin
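
Editor's note: the two records quoted in this post carry the flags 163 and 419. A minimal Python sketch for decoding them, assuming the standard bit meanings from the SAM specification (the short labels and the function name decode_flag are this note's own, not an official API):

```python
# Bit values follow the SAM specification; the labels are shorthand.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]

for flag in (163, 419):
    print(flag, decode_flag(flag))
```

Both flags have the second_in_pair bit set and neither has first_in_pair, and 419 additionally marks a secondary alignment. So the two adjacent lines are two alignments of the same mate (consistent with NH:i:2), and no record for the first mate appears among them.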
Old 07-09-2012, 02:16 PM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 992
Default

Quote:
Originally Posted by xy6699
... and actually the adjacent mate reads have the same sequence, so they are not really "mate" pairs.
Exactly. You may now wonder where in your pipeline the mates got lost (the other mate, with its own sequence, must be somewhere). Maybe you filtered them out in some previous step.
Old 11-15-2017, 04:53 PM   #9
Madza Farias Virgens
Junior Member
 
Location: Los Angeles

Join Date: Oct 2016
Posts: 2
Default

The program continues to run even after spitting out these warnings.
Does anyone know if it skips the troubled reads? Thanks

Tags
htseq-count, mate pairs, paired ends, sorting

All times are GMT -8.