Hi
I've got a data set with SOLiD RNA-Seq data that was aligned with SOLiD's whole transcriptome analysis pipeline (WTP 1.2.1). This software produces a GFF file that represents each read with one line, or, if the read straddles a splice junction, with two lines (which are usually not next to each other).
I have trouble understanding how the spliced reads are represented.
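To make the one-line versus two-line layout concrete, here is a minimal Python sketch that groups GFF lines by bead ID and checks whether two lines of a pair point at each other. It assumes the bead ID is the `bd` attribute and the partner start is the `jp` attribute, as in the records quoted in this post; the helper names (`parse_gff_line`, `group_by_bead`, `jp_matches`) are made up for illustration and are not part of WTP.

```python
from collections import defaultdict

def parse_gff_line(line):
    """Split one GFF line into a few fixed columns plus a dict of attributes."""
    # Split on any whitespace; real GFF files are tab-separated.
    cols = line.split(None, 8)
    attrs = dict(
        item.split("=", 1) for item in cols[8].strip().split(";") if item
    )
    return {"seqid": cols[0], "start": int(cols[3]), "end": int(cols[4]),
            "strand": cols[6], "attrs": attrs}

def group_by_bead(lines):
    """Collect all records that share the same bead ID (assumed: 'bd')."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip() and not line.startswith("#"):
            rec = parse_gff_line(line)
            groups[rec["attrs"]["bd"]].append(rec)
    return groups

def jp_matches(recs):
    """True if two records name each other's start position in 'jp'."""
    a, b = recs
    return (a["attrs"].get("jp") == str(b["start"])
            and b["attrs"].get("jp") == str(a["start"]))

# The three GFF records quoted in this post.
lines = [
    "chr2L wtp read 75079 75113 30 + . bd=1445_1152_746_F3;rs=16;mm=0;"
    "g=T12003120303000210002013210322101003330110312122223;i=1;",
    "chr2L wtp read 108217 108226 45 + . bd=1636_459_310_F3;rs=1;mm=0;"
    "g=T32012102331321332201132130130000113020000230013032;i=1;jp=108588;jt=k;",
    "chr2L wtp read 108588 108622 45 + . bd=1636_459_310_F3;rs=1;mm=0;"
    "g=T32012102331321332201132130130000113020000230013032;i=1;jp=108217;jt=k;",
]

groups = group_by_bead(lines)
for bead, recs in groups.items():
    kind = "spliced" if len(recs) == 2 else "unspliced"
    print(bead, kind, [(r["start"], r["end"]) for r in recs])
```

Grouping by bead ID rather than by file order matters here because the two lines of a spliced read are usually not adjacent in the file.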
Here is a normal read:
Code:
chr2L wtp read 75079 75113 30 + . bd=1445_1152_746_F3;rs=16;mm=0;g=T12003120303000210002013210322101003330110312122223;i=1;
There are 0 mismatches (mm=0) and 16 bases skipped (rs=16). If I convert the read to sequence space and extract the part at the indicated coordinates from my reference FASTA, this aligns nicely:
Code:
TGAAATGAATTAAAAGTTTTCCATCAATCTGGTTTATAACAATGACTCTCG  [read]
                TTTTCCATCAATCTGGTTTATAACAATGACTCTCG  [reference, 2L:75079-75113]
----------------                                     [ <-- 16 skipped]
Now for a spliced read. This bead ID here appears twice:
Code:
chr2L wtp read 108217 108226 45 + . bd=1636_459_310_F3;rs=1;mm=0;g=T32012102331321332201132130130000113020000230013032;i=1;jp=108588;jt=k;
chr2L wtp read 108588 108622 45 + . bd=1636_459_310_F3;rs=1;mm=0;g=T32012102331321332201132130130000113020000230013032;i=1;jp=108217;jt=k;
The two lines refer to each other's start positions via the 'jp' attribute. However, if I extract the indicated positions, there is no match:
Code:
TAGGTCAAGCGTAGTATCTTGTAGTAACGGGGGTGCCTTTTTCGGGTAATC  [read]
CTCAGAATCA  [reference, 2L:108217-108226]
CTCCACCAACAATTTAGCCGACCGGAACTCGGGTT  [reference, 2L:108588-108622]
I can't find these reference parts anywhere in the read.
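For reference, the comparison described above can be sketched in a few lines of Python. This is only a minimal sketch, not WTP code: it decodes the colour-space string naively from the primer base, keeps the primer base in the output (as in the listings in this post), and treats `rs` as an offset into that decoded string, which is an assumption inferred from the normal read rather than something taken from the WTP documentation. The function name `decode_colourspace` is made up for illustration.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3, so that colour = base XOR next base

def decode_colourspace(g):
    """Decode a SOLiD read such as 'T1200...' into bases, primer base included.

    Assumes every colour is one of 0-3 (no '.' for missing calls).
    """
    code = BASES.index(g[0])
    seq = [g[0]]
    for colour in g[1:]:
        code ^= int(colour)  # each colour encodes the step to the next base
        seq.append(BASES[code])
    return "".join(seq)

# Normal read: drop the 16 skipped positions, then compare with the reference.
normal = decode_colourspace("T12003120303000210002013210322101003330110312122223")
ref_normal = "TTTTCCATCAATCTGGTTTATAACAATGACTCTCG"  # 2L:75079-75113
print(normal[16:] == ref_normal)

# Spliced read: look for each reference segment anywhere in the decoded read.
spliced = decode_colourspace("T32012102331321332201132130130000113020000230013032")
ref_parts = ("CTCAGAATCA",                           # 2L:108217-108226
             "CTCCACCAACAATTTAGCCGACCGGAACTCGGGTT")  # 2L:108588-108622
for part in ref_parts:
    print(part, part in spliced)
```

Because each base depends on all colours before it, a single wrong colour changes every base after it, so this naive decoding is only meaningful for reads without mismatches.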
I tried many different reads, and always, the non-spliced ones agree with the reference (unless there are mismatches, causing the colour space decoding to lose sync) and the spliced ones don't. Do I have to do something different if I decode colour space for a spliced read? Do I misunderstand the WTP output format? Or is something going severely wrong here?
Thanks for any hints
Simon