For the first time, I am dealing with 300bp data generated on Illumina's MiSeq. It's paired-end data, but I'm currently treating it as unpaired for simplicity's sake. I've dealt with 75bp and 150bp data from similar experiments before without much difficulty. I prepared the FASTQ files as I usually do for TopHat and ran it as usual, but only around 25% of reads align for read 1; for read 2, which has lower quality scores, only around 10% align.
Incidentally, I checked and the problem is indeed with the alignment (thus Bowtie2), not "tophat" per se.
I checked the unmapped.bam file and sorted the reads by frequency. All of the top sequences gave very good alignments when I threw them into a standard nucleotide BLAST (human), so these are not junk that shouldn't be expected to align. The rejected sequences all looked long to me, so I compared the length distribution of reads in the unmapped vs. the accepted_hits files; sure enough, 60% of unmapped reads were in the 250-300bp range, while only 10% of accepted hits were in that range. So, clearly tophat is having issues with long reads.
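For anyone who wants to repeat the length comparison, here is a minimal sketch of one way to do it, not necessarily the exact procedure I used. It assumes the BAM records have been converted to SAM text first (e.g. with samtools view); the function names and the 50bp bin size are just illustrative choices.

```python
from collections import Counter


def length_histogram(sam_lines, bin_size=50):
    """Count reads per length bin from SAM-format text lines.

    Bins are inclusive ranges such as (251, 300) for bin_size=50.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a complete SAM record
        seq = fields[9]  # column 10 of a SAM record is the read sequence
        if not seq or seq == "*":
            continue  # sequence not stored for this record
        low = ((len(seq) - 1) // bin_size) * bin_size + 1
        counts[(low, low + bin_size - 1)] += 1
    return counts


def fraction_in_range(counts, low, high):
    """Fraction of all counted reads whose bin lies within [low, high]."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    inside = sum(c for (a, b), c in counts.items() if a >= low and b <= high)
    return inside / total
```

To use it, pipe the output of samtools view for unmapped.bam (and separately for accepted_hits.bam) into a script that calls length_histogram(sys.stdin), then compare fraction_in_range(counts, 251, 300) between the two files.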
I found that the default --read-edit-dist is set to 2, which seems a little silly, as you'd need it higher or lower depending on read length. In any event, I tried bumping this up to 4, but this only got me an additional 0.1% of reads in the accepted_hits.bam.
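To put rough numbers on why a fixed edit distance looks odd across read lengths, here is a small arithmetic sketch. The helper names are made up for illustration, and the choice of 75bp with 2 edits as the reference point is only an assumption; this is not a recommendation for what TopHat actually handles well.

```python
import math


def allowed_error_rate(max_edits, read_length):
    """Fraction of bases that may differ under a fixed edit-distance cap."""
    return max_edits / read_length


def scaled_edit_dist(read_length, ref_length=75, ref_edits=2):
    """Edit-distance cap that keeps the same per-base tolerance as the
    reference case (assumed here: 2 edits on a 75bp read)."""
    return math.ceil(ref_edits * read_length / ref_length)


for n in (75, 150, 300):
    rate = allowed_error_rate(2, n)
    print(f"{n}bp: cap of 2 allows {rate:.2%} differences; "
          f"proportional cap would be {scaled_edit_dist(n)}")
```

By this arithmetic, a cap of 2 on a 300bp read tolerates about a quarter of the per-base differences it does on a 75bp read, and even the value of 4 that I tried is only half of the proportional figure.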
In case anyone is interested, here is the tophat command I ran (through a perl script) to do the alignment:
"~/software/tophat/tophat -p 32 -o $outfolder --read-edit-dist 4 ~/software/hg19/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome $folder/cutadapted-full/$filename"
So, any suggestions for options/flags that can help Bowtie2/TopHat properly find alignments for long reads?