08-19-2010, 09:29 AM   #1
Simon Anders
SOLiD WTP alignment file: representation of spliced reads

Hi

I've got a data set with SOLiD RNA-Seq data that was aligned with SOLiD's whole transcriptome analysis pipeline (WTP 1.2.1). This software produces a GFF file that represents each read with one line, or, if the read straddles a splice junction, with two lines (which are usually not next to each other).

I have trouble understanding how the spliced reads are represented.

Here is a normal read:
Code:
chr2L   wtp     read    75079   75113   30      +       .       bd=1445_1152_746_F3;rs=16;mm=0;g=T12003120303000210002013210322101003330110312122223;i=1;
There are 0 mismatches (mm=0) and 16 bases skipped (rs=16). If I convert the read to sequence space and extract the part at the indicated coordinates from my reference FASTA, this aligns nicely:

Code:
TGAAATGAATTAAAAGTTTTCCATCAATCTGGTTTATAACAATGACTCTCG  [read]
                TTTTCCATCAATCTGGTTTATAACAATGACTCTCG  [reference, 2L:75079-75113]
----------------  [ <-- 16 skipped]
Now for a spliced read. This bead ID here appears twice:

Code:
chr2L   wtp     read    108217  108226  45      +       .       bd=1636_459_310_F3;rs=1;mm=0;g=T32012102331321332201132130130000113020000230013032;i=1;jp=108588;jt=k;
chr2L   wtp     read    108588  108622  45      +       .       bd=1636_459_310_F3;rs=1;mm=0;g=T32012102331321332201132130130000113020000230013032;i=1;jp=108217;jt=k;
The two lines refer to each other's start positions via the 'jp' attribute. However, if I extract the indicated positions, there is no match:

Code:
TAGGTCAAGCGTAGTATCTTGTAGTAACGGGGGTGCCTTTTTCGGGTAATC   [read]
 CTCAGAATCA                                           [reference, 2L:108217-108226]
           CTCCACCAACAATTTAGCCGACCGGAACTCGGGTT        [reference, 2L:108588-108622]
I can't find these reference parts anywhere in the read.
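To make the pairing explicit, here is a rough sketch of how the two lines of a spliced read can be matched up by bead ID and their 'jp' cross-references checked. The helper names (`parse_attributes`, `parse_line`, `group_by_bead`) are my own, and the parsing assumes whitespace-separated columns with no spaces inside the attribute column, as in the excerpt above.

```python
from collections import defaultdict

def parse_attributes(field):
    """Turn 'bd=...;rs=1;jp=108588;' into a dict of strings."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if item)

def parse_line(line):
    """Parse one WTP GFF line into a small record (coordinates are 1-based, inclusive)."""
    cols = line.split()
    return {
        "chrom": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "attrs": parse_attributes(cols[8]),
    }

def group_by_bead(records):
    """Collect all lines that share the same bead ID (the 'bd' attribute)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["attrs"]["bd"]].append(rec)
    return groups

gff_lines = [
    "chr2L   wtp     read    108217  108226  45      +       .       bd=1636_459_310_F3;rs=1;mm=0;g=T32012102331321332201132130130000113020000230013032;i=1;jp=108588;jt=k;",
    "chr2L   wtp     read    108588  108622  45      +       .       bd=1636_459_310_F3;rs=1;mm=0;g=T32012102331321332201132130130000113020000230013032;i=1;jp=108217;jt=k;",
]

groups = group_by_bead(parse_line(line) for line in gff_lines)
first, second = sorted(groups["1636_459_310_F3"], key=lambda r: r["start"])

# Each half should name the other half's start position in 'jp'.
mutual = (int(first["attrs"]["jp"]) == second["start"]
          and int(second["attrs"]["jp"]) == first["start"])
len_first = first["end"] - first["start"] + 1
len_second = second["end"] - second["start"] + 1
print(mutual, len_first, len_second)
```

For the two lines above, the cross-references are consistent, and the two parts cover 10 and 35 reference bases respectively.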

I tried many different reads: the non-spliced ones always agree with the reference (unless there are mismatches, causing the colour space decoding to lose sync), and the spliced ones never do. Do I have to do something different if I decode colour space for a spliced read? Do I misunderstand the WTP output format? Or is something going severely wrong here?

Thanks for any hints

Simon

Last edited by Simon Anders; 08-23-2010 at 02:55 AM. Reason: corrected GFF excerpt