I'm using TopHat2 to align my RNA-seq reads. Many of the junctions it reports have extremely long introns and span multiple genes. I've set the maximum intron length to slightly more than the largest intron in my GFF file, but the behavior persists. I've attached a few IGV screenshots showing the problem. These very long "intron" alignments occur consistently throughout my dataset. In many cases, however, the junctions match my gene models perfectly (also shown in an IGV screenshot), which suggests that part of the pipeline is working.
Has anyone else observed this behavior?
EDIT:
A discussion with a lab mate solved the problem. I'm adding the solution here in case anyone runs into the same issue and searches for it. The GFF file I was using was incorrectly formatted, and TopHat was interpreting each exon as a separate transcript. Lesson learned: make sure your GFF is correctly formatted in the first place, and if you're concerned, double-check the reconstructed transcriptome file.
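For anyone who wants a quick sanity check before aligning, here is a minimal sketch of one way to spot this kind of problem. It assumes a GFF3-style file where exons point to their transcript through a `Parent` attribute in column 9; I'm not claiming this is how TopHat itself reads the file, and the function names are just ones I made up for illustration. If nearly every parent has exactly one exon, or exons have no `Parent` at all, the annotation is probably being read as one transcript per exon.

```python
from collections import Counter


def parse_attributes(field):
    """Parse a GFF3 column-9 string like 'ID=e1;Parent=t1' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def exons_per_parent(lines):
    """Count exon features per Parent ID.

    Returns (counts, orphans): a Counter of exons per parent, and the
    number of exon lines that have no Parent attribute at all.
    Assumes tab-separated, 9-column GFF3 lines (an assumption about the
    input, not something TopHat requires).
    """
    counts = Counter()
    orphans = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # only look at well-formed exon rows
        parent = parse_attributes(cols[8]).get("Parent")
        if parent is None:
            orphans += 1
        else:
            # GFF3 allows several comma-separated parents per feature
            for p in parent.split(","):
                counts[p] += 1
    return counts, orphans


def summarize(counts, orphans):
    """Return a short human-readable summary of the exon/parent counts."""
    single = sum(1 for n in counts.values() if n == 1)
    return (
        f"{len(counts)} parents, {single} with exactly one exon, "
        f"{orphans} exons without a Parent"
    )
```

To use it, pass an open file handle, e.g. `counts, orphans = exons_per_parent(open("annotation.gff"))` (the file name is a placeholder), then `print(summarize(counts, orphans))`. Single-exon transcripts are legitimate, so a few of them are fine; it's a genome-wide excess that should make you suspicious.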