Dear all,
I have a question regarding junction discovery in TopHat (version 2.0.6). I have mapped an unstranded RNA-seq library (single-end 107 bp reads) to the human genome and provided gene models as a guide (Ensembl GTF, option --transcriptome-index).
It is not explicitly documented how the strand of the discovered junctions (file junctions.bed) is assigned, but I assumed that:
- if the junction is in the annotation GTF file, the orientation of the junction follows the strand of the transcripts at this location.
- if the junction is novel, the orientation may be inferred from the donor/acceptor sites. For example, if the intron nucleotides adjacent to the potential exon parts are GT-AG, the junction should be on the + strand, but if they are the reverse complement, CT-AC, the junction is probably on the - strand. Do you think this is what is done?
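To make the second assumption concrete, here is a minimal Python sketch of the motif rule described above. It is only my guess at a heuristic, not a description of what TopHat actually does; the function name strand_from_motif is hypothetical, and only the canonical GT-AG pair is handled (other motifs return ".").

```python
def strand_from_motif(intron_seq):
    """Guess the strand of a junction from the first and last two bases
    of the intron, read on the reference (+) strand.

    Returns "+" for GT...AG, "-" for CT...AC (its reverse complement),
    and "." when the motif is not one of these two.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        return "."
    motif = (seq[:2], seq[-2:])
    if motif == ("GT", "AG"):
        return "+"
    if motif == ("CT", "AC"):
        return "-"
    return "."
```

The input is assumed to be the intron sequence extracted from the reference genome between the two flanking blocks.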
I now have some doubts, since I observed two junctions with exactly the same intron coordinates reported on the + and the - strand in the same sample. How is that possible?
Here is an excerpt from the junctions.bed file (note that the BED start and end coordinates are the limits of the flanking regions, which are defined by the span of the reads overlapping the junction):
chr19 3612320 3613105 JUNC00070694 2 + 3612320 3613105 255,0,0 2 91,50 0,735
chr19 3612320 3613161 JUNC00070695 9 - 3612320 3613161 255,0,0 2 91,106 0,735
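The two records differ in their BED end coordinate, so the intron itself has to be recovered from the block fields to see that it is the same. A minimal Python sketch, assuming the standard 12-column BED layout with exactly two blocks per junction; the function name junction_intron is only illustrative:

```python
def junction_intron(bed_line):
    """Return (chrom, intron_start, intron_end, strand, read_count)
    for one junctions.bed record with two blocks.

    Coordinates are 0-based, half-open, as in BED.
    """
    fields = bed_line.split()
    chrom = fields[0]
    chrom_start = int(fields[1])
    read_count = int(fields[4])   # score column: reads supporting the junction
    strand = fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    intron_start = chrom_start + block_sizes[0]   # first base after the left block
    intron_end = chrom_start + block_starts[1]    # first base of the right block
    return chrom, intron_start, intron_end, strand, read_count


records = [
    "chr19 3612320 3613105 JUNC00070694 2 + 3612320 3613105 255,0,0 2 91,50 0,735",
    "chr19 3612320 3613161 JUNC00070695 9 - 3612320 3613161 255,0,0 2 91,106 0,735",
]
for rec in records:
    print(junction_intron(rec))
```

Both records give the same intron, chr19 3612411 to 3613055 (0-based, half-open), with 2 reads on the + strand and 9 on the - strand.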
Apparently there are 2 Ensembl gene models at this location, on both strands: http://useast.ensembl.org/Homo_sapie...NST00000447295
However, I don't understand how TopHat manages to attribute reads to either the + or the - strand. Given that the library is unstranded, it should not be possible to assign any of these 11 reads to its strand of origin. Am I missing something here? Has anyone encountered a similar situation?
Thanks for your help
Julien