nr23 02-07-2013

Same species mapping problem with Tophat

I'm working with Illumina PE (100bp) RNA-Seq reads from Xenopus laevis. I've previously had a lot of trouble (very low % of reads mapped, and almost 0% of reads 'properly paired') mapping with bowtie/tophat to the X.tropicalis genome. Assuming this was due to mismatches, and being unable to get around the limit of N-3 mismatches per segment in bowtie, I switched to STAMPY, which allows multiple mismatches, and achieved very good results.

Recently the X.laevis genome has been released. I've tried re-mapping my reads using tophat/bowtie, but still get the same results (<20% of reads mapping and a very low fraction 'properly paired'). This is really confusing: I would expect the occasional mismatch due to allelic differences, but should still see almost all of my reads mapping.

In addition, on inspecting the bowtie log files, I can see that ~72% of both left and right reads map. The trouble seems to be with the way tophat interprets the alignments produced by bowtie: tophat seems to keep only a very small fraction of the reads (~6M of ~90M) in accepted_hits.bam, and samtools flagstat reports 100% mapped for these reads.

I'll paste some of the stats I'm seeing below:

Log file from bowtie run (X.laevis reads vs X.laevis genome):

logs> more bowtie.left_kept_reads.fixmap.log
# reads processed: 31151246
# reads with at least one reported alignment: 22576653 (72.47%)
# reads that failed to align: 8249899 (26.48%)
# reads with alignments suppressed due to -m: 324694 (1.04%)

logs> more bowtie.right_kept_reads.fixmap.log
# reads processed: 33478582
# reads with at least one reported alignment: 24249964 (72.43%)
# reads that failed to align: 8880054 (26.52%)
# reads with alignments suppressed due to -m: 348564 (1.04%)
Reported 30987873 alignments to 1 output stream(s)
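
To make the comparison with the tophat output concrete, here is a minimal Python sketch that totals the bowtie counts from the two logs above. The helper name `parse_bowtie_log` and the pasted log strings are purely illustrative assumptions; this is not part of bowtie or tophat, just a way to add up the numbers already shown.

```python
import re

# Counts copied verbatim from the two bowtie logs quoted above.
LEFT_LOG = """\
# reads processed: 31151246
# reads with at least one reported alignment: 22576653 (72.47%)
# reads that failed to align: 8249899 (26.48%)
# reads with alignments suppressed due to -m: 324694 (1.04%)
"""
RIGHT_LOG = """\
# reads processed: 33478582
# reads with at least one reported alignment: 24249964 (72.43%)
# reads that failed to align: 8880054 (26.52%)
# reads with alignments suppressed due to -m: 348564 (1.04%)
"""

ALIGNED = "reads with at least one reported alignment"
PROCESSED = "reads processed"


def parse_bowtie_log(text):
    """Hypothetical helper: map each '# label: count' line to an integer count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"#\s*(.+?):\s*(\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


left = parse_bowtie_log(LEFT_LOG)
right = parse_bowtie_log(RIGHT_LOG)

# Sum left and right mates to get the overall bowtie alignment rate.
aligned_total = left[ALIGNED] + right[ALIGNED]
processed_total = left[PROCESSED] + right[PROCESSED]
aligned_pct = 100.0 * aligned_total / processed_total

print(f"aligned by bowtie: {aligned_total} of {processed_total} ({aligned_pct:.2f}%)")
```

Summed this way, bowtie reports an alignment for roughly 46.8M of the 64.6M reads it processed, which is the figure to hold against the 6401438 records that end up in accepted_hits.bam.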

Samtools flagstat on same tophat run:

> samtools flagstat accepted_hits.bam
6401438 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
6401438 + 0 mapped (100.00%:-nan%)
6401438 + 0 paired in sequencing
2050216 + 0 read1
4351222 + 0 read2
10784 + 0 properly paired (0.17%:-nan%)
205892 + 0 with itself and mate mapped
6195546 + 0 singletons (96.78%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
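
The same kind of check can be run on the flagstat output. The sketch below (again with hypothetical helper names, `parse_flagstat` and `count_for`, and the flagstat text pasted in as a string) just recomputes the properly-paired and singleton fractions from the counts above.

```python
# Output of `samtools flagstat accepted_hits.bam`, copied from above.
FLAGSTAT = """\
6401438 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
6401438 + 0 mapped (100.00%:-nan%)
6401438 + 0 paired in sequencing
2050216 + 0 read1
4351222 + 0 read2
10784 + 0 properly paired (0.17%:-nan%)
205892 + 0 with itself and mate mapped
6195546 + 0 singletons (96.78%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
"""


def parse_flagstat(text):
    """Hypothetical helper: return (label, qc_passed_count) for each line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line looks like: "<passed> + <failed> <label...>"
        passed, _plus, _failed, label = line.split(" ", 3)
        rows.append((label, int(passed)))
    return rows


def count_for(rows, prefix):
    """Return the count of the first row whose label starts with `prefix`."""
    for label, passed in rows:
        if label.startswith(prefix):
            return passed
    raise KeyError(prefix)


rows = parse_flagstat(FLAGSTAT)
total = count_for(rows, "in total")
properly_paired = count_for(rows, "properly paired")
both_mapped = count_for(rows, "with itself and mate mapped")
singletons = count_for(rows, "singletons")

print(f"properly paired: {100.0 * properly_paired / total:.2f}% of {total}")
print(f"singletons:      {100.0 * singletons / total:.2f}% of {total}")
print(f"both mates mapped + singletons == total: {both_mapped + singletons == total}")
```

The counts are internally consistent (reads with both mates mapped plus singletons add up to the total), so the BAM itself looks well-formed; nearly all of the records it does contain are singletons.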

I'm really stumped by this. My STAMPY results are great (~85% of reads mapped, ~70% of reads properly paired), and eyeballing the results in IGV confirms that reads stack up nicely across expressed regions and contain very few mismatches.

Any help would be tremendously appreciated!

Many thanks and all the best,


