I am using Tophat for aligning unpaired RNAseq reads (36 bases each) and I had a question about junction calling in Tophat. I tried running Tophat with both -j (a known junctions file from Ensembl) and -G (known gene model annotations file from Ensembl) and also ran Cufflinks on the results.
In both of the runs, if I compare the results with the actual known transcripts from Ensembl, it seems like I am missing many known junctions.
To Illustrate, say if Ensembl has a Transcript with say 5 exons. Cufflinks annotates the same region as containing 3 transcripts (Exon 1-2 as Transcript 1, Exon 3-4 as Transcript 2 and Exon 5 as Transcript 3).
If I have understood the paper correctly, this is because even though Tophat has the coordinates of the known junctions it didn't find enough IUM reads spanning those particular junctions (under the given parameters) to allow it to call it junction? Is that correct..?
Under that assumption, I tried running Tophat with --min-anchor-length = 5 (reduced from default 8) and --min-isoform-fraction 0.1 (reduced from default 0.1). But I didn't get any improvement in finding more junctions. (The junctions.bed file has the exact same number of lines)
Does anyone have any suggestions on what else I can try to improve junction calling?
Also, given the Genomic coordinates of a splice junction, is there a way I can extract, from the Tophat output, the no of IUM (Initially Unmapped reads) that Tophat mapped to span that particular junction?
thanks,
Avinash
In both of the runs, if I compare the results with the actual known transcripts from Ensembl, it seems like I am missing many known junctions.
To Illustrate, say if Ensembl has a Transcript with say 5 exons. Cufflinks annotates the same region as containing 3 transcripts (Exon 1-2 as Transcript 1, Exon 3-4 as Transcript 2 and Exon 5 as Transcript 3).
If I have understood the paper correctly, this is because even though Tophat has the coordinates of the known junctions it didn't find enough IUM reads spanning those particular junctions (under the given parameters) to allow it to call it junction? Is that correct..?
Under that assumption, I tried running Tophat with --min-anchor-length = 5 (reduced from default 8) and --min-isoform-fraction 0.1 (reduced from default 0.1). But I didn't get any improvement in finding more junctions. (The junctions.bed file has the exact same number of lines)
Does anyone have any suggestions on what else I can try to improve junction calling?
Also, given the Genomic coordinates of a splice junction, is there a way I can extract, from the Tophat output, the no of IUM (Initially Unmapped reads) that Tophat mapped to span that particular junction?
thanks,
Avinash
Comment