Hello folks,
in the Cufflinks paper it is stated that TopHat splits reads of 75 bp or longer:
"TopHat version 1.0.7 and later splits a read 75 bp or longer in three or more segments of approximately equal size (25 bp) and maps them independently."
But I thought that longer reads are better, because they are "more unique" and therefore more likely to align to only a single position.
Why is it useful for TopHat to split longer reads? I could imagine that it helps to find splice sites in some way, but I thought that was done in the other steps.
Also, does TopHat keep track of which segments came from the same read, so that it can infer a splice junction if they do not align contiguously (but in the right order and separated by roughly an intron length)?
What happens when the segments are aligned all over the genome with no coherence? Then the result would be biased. I think this is more likely to happen with reads of only 25 bp.
I didn't find any information on this; maybe someone could help. The original TopHat paper does not mention this feature.
Thanks in advance,
Oliver
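To make the question concrete, here is a minimal Python sketch of what I imagine by "splitting" and "coherence". It is not TopHat's actual implementation: the function names `split_read` and `infer_junctions`, the `max_intron` value, and the simplification that a junction falls exactly between two segments on the forward strand are all my own assumptions for illustration.

```python
def split_read(read, min_len=75, target=25):
    """Split a read of at least min_len bases into three or more
    segments of roughly `target` bases. Returns (offset, sequence) pairs.
    Shorter reads are returned as a single segment."""
    if len(read) < min_len:
        return [(0, read)]
    n = max(3, len(read) // target)
    base, extra = divmod(len(read), n)
    segments, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread the remainder
        segments.append((start, read[start:start + size]))
        start += size
    return segments


def infer_junctions(segments, hits, max_intron=500000):
    """hits[i] is the (chrom, pos) of segment i, or None if unmapped.
    Returns a list of (chrom, gap_start, gap_end) for every gap between
    consecutive segments, or None if the segments are not coherent
    (unmapped, different chromosomes, wrong order, or gap too large).
    max_intron is an assumed illustrative limit."""
    junctions = []
    for i in range(len(segments) - 1):
        hit_a, hit_b = hits[i], hits[i + 1]
        if hit_a is None or hit_b is None or hit_a[0] != hit_b[0]:
            return None
        expected = hit_a[1] + len(segments[i][1])  # contiguous position
        gap = hit_b[1] - expected
        if gap < 0 or gap > max_intron:
            return None
        if gap > 0:
            junctions.append((hit_a[0], expected, hit_b[1]))
    return junctions


if __name__ == "__main__":
    segs = split_read("A" * 75)
    # Third segment maps 1000 bp further downstream than expected.
    hits = [("chr1", 100), ("chr1", 125), ("chr1", 1150)]
    print(infer_junctions(segs, hits))
```

In this picture, a read whose segments land on different chromosomes or in the wrong order would simply be rejected, which is the case I am worried about.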