SEQanswers

Old 12-08-2008, 03:25 AM   #1
dan
wiki wiki
 
Location: Cambridge, England

Join Date: Jul 2008
Posts: 265
Celera Assembler (WGS) - splice site file?

Hi,

I want to use the Celera Assembler (WGS) in my assembly pipeline in order to compare the results to Phred / Phrap. I read that to vector / quality trim my reads, I should use Lucy, but on this point I am confused.

What is the "sequence of the vector splice site"?


I am reading this: http://www.cbcb.umd.edu/research/CeleraAssembler.shtml

"Each vector file [one per vector] must be accompanied by a splice site file containing the sequence within the vector that is adjacent to the splice sites used in the project. In case your project uses an adapter it should be included in the splice file. ... The vector file must contain a single FASTA-formatted sequence representing the entire sequencing vector. The splice file contains 4 FASTA records corresponding to approximately 200 bp flanking either side of the splice site, presented in both the forward and reverse-complemented orientation."


Unfortunately I don't understand what this means. Specifically, what is the splice site file, and how do I identify the splice sites? Does this typically refer to the sequencing vector or to the cloning vector (BAC)?

The project uses the pSMART-HCKan (AF532107) sequencing vector from the Lucigen CLONESMART Blunt Cloning Kit ... does that mean anything to anyone?

Should I just use the 200 bp either side of the primer sites?


Sorry for the potentially very dumb question!

Dan.
__________________
Homepage: Dan Bolser
MetaBase the database of biological databases.
Old 09-25-2009, 02:17 AM   #2
dan
wiki wiki
 
Location: Cambridge, England

Join Date: Jul 2008
Posts: 265

Since I at least have something working for this question, I thought I'd update the thread. No clear answers exactly, but I got something that seemed to work (hopefully useful for someone) ...

Some of what I eventually worked out on this topic is described here:

http://sourceforge.net/mailarchive/m...mail.gmail.com



And here is some info from an email exchange with Sven Klages (user 'sven').

> What is the "sequence of the vector splice site"?

The flanking bases of the cloning site, e.g. pUC19/SmaI:
Figure
======



----f2------------------------->
----f1------------------------->
|========================= GGG/CCC =========================|
<-------------------------r1----
<-------------------------r2----


f1 = for.begin
f2 = for.end
r1 = rev.begin
r2 = rev.end

OVERLAPS f1/f2 and/or r1/r2 ~ 50bp

So your splice site file could look like this (sequences
shortened, [...]):

>pUC19.for.begin
attcgccattcaggctgcgcaactgttgggaagggcgatcggtgcgggcctcttcgctat
[...]
>pUC19.for.end
tttcccagtcacgacgttgtaaaacgacggccagtgaattcgagctcggtaCCCGGGgat
[...]
>pUC19.rev.begin
gggcagtgagcgcaacgcaattaatgtgagttagctcactcattaggcaccccaggcttt
[...]
>pUC19.rev.end
aggaaacagctatgaccatgattacgccaagcttgcatgcctgcaggtcgactctagagg
[...]

"man lucy" will tell you more (after compiling).
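Before running lucy it can help to sanity-check the splice file. The following is a minimal Python sketch, not part of lucy: it parses FASTA text and checks for four records whose names end in for.begin, for.end, rev.begin and rev.end, as in the pUC19 example above. Whether lucy itself requires these names is an assumption; the function names and the toy sequences are made up for illustration.

```python
# Sketch: sanity-check a lucy-style splice file.
# ASSUMPTION: record names end in the four suffixes used in the pUC19
# example from this thread; lucy's own requirements are not confirmed here.
EXPECTED_SUFFIXES = ("for.begin", "for.end", "rev.begin", "rev.end")


def parse_fasta(text):
    """Parse FASTA text into an ordered list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_splice_records(records):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    if len(records) != 4:
        problems.append(f"expected 4 records, found {len(records)}")
    for suffix in EXPECTED_SUFFIXES:
        if not any(name.endswith(suffix) for name, _ in records):
            problems.append(f"no record name ends with '{suffix}'")
    for name, seq in records:
        bad = set(seq.upper()) - set("ACGTN")
        if bad:
            problems.append(f"{name}: unexpected characters {sorted(bad)}")
    return problems


# Toy input: made-up short sequences, NOT real vector flanks.
toy_splice = """>toy.for.begin
ACGTACGTAC
>toy.for.end
GTACCCGGGA
>toy.rev.begin
TCCCGGGTAC
>toy.rev.end
GTACGTACGT
"""
toy_records = parse_fasta(toy_splice)
print(check_splice_records(toy_records) or "no problems found")
```

The same parse_fasta helper could be pointed at the vector file to confirm it holds a single record, which is what the documentation quoted earlier asks for.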



But I still didn't understand! Sven continued...

roughly, you take the 5' flanking sequence,
CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT

and the 3' flanking sequence,
GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG

and join it to form

>pSMART-HCAmp.for.begin
CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT
GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG
>pSMART-HCAmp.for.end
CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT
GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG

Which is pretty much the same for 'begin' and 'end' ..
This is not what is proposed, but it should work.

You should "reverse complement" if you need reverse clipping
as well.

>pSMART-HCAmp.rev.begin
[sequence]
>pSMART-HCAmp.rev.end
[sequence]
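The shortcut above (join the two flanks, then reverse complement for the reverse records) can be sketched in Python. The two flank strings are copied from this post. Filling the rev records with the reverse complement of the same joined sequence is one reading of Sven's advice, not something the thread spells out, and the helper names are hypothetical.

```python
# Sketch: build the four splice records from the two flanks quoted above.
# ASSUMPTION: rev.begin / rev.end are simply the reverse complement of the
# joined forward sequence; the thread leaves them as "[sequence]".
FIVE_PRIME = "CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT"
THREE_PRIME = (
    "GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG"
)

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def to_fasta(name, seq, width=60):
    """Format one FASTA record with lines wrapped at `width` characters."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


joined = FIVE_PRIME + THREE_PRIME
joined_rc = reverse_complement(joined)

prefix = "pSMART-HCAmp"  # record-name prefix as used in the post above
splice_text = "".join([
    to_fasta(prefix + ".for.begin", joined),
    to_fasta(prefix + ".for.end", joined),
    to_fasta(prefix + ".rev.begin", joined_rc),
    to_fasta(prefix + ".rev.end", joined_rc),
])
print(splice_text)
```

Writing splice_text to a file gives something to try with lucy; whether clipping then works should be checked with lucy's debug output, as suggested below.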

lucy is pretty "tolerant" ...

Just use 'lucy' with the flag '-debug FILENAME' to see if clipping
was successful.


If you're expecting any adaptors they should be included in
the sequence as they are read by sequencing,

Vector-Adaptor-(INSERT)-Adaptor-Vector



So I said...

Thanks Sven, it's all clear now. Just to make sure I understand though,
the GenBank sequence for this pSMART vector (pSMART-HCKan, AF532107.1)
just 'happens' to start with:

GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAGTCAA


and just 'happens' to end with:

TGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT


but actually, I need some detailed knowledge of where on the vector
sequence the sequence 'insert site' (or splice site) is before I can
create what you did above?
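The observation above can be checked mechanically. The Python sketch below uses only strings quoted in this thread: it confirms that the quoted end of the GenBank record is a suffix of Sven's 5' flank and that the quoted start of the record begins with his 3' flank. That suggests, for this record, that the insert site sits where the circular sequence wraps around. The helper name and the toy record (with N filler standing in for the rest of the vector) are made up for illustration.

```python
# Sketch: compare the GenBank record ends quoted above with Sven's flanks.
FIVE_PRIME = "CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT"
THREE_PRIME = (
    "GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG"
)
GENBANK_START = (
    "GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAGTCAA"
)
GENBANK_END = "TGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT"


def junction_across_origin(record, left, right):
    """Last `left` bases followed by the first `right` bases of a circular record."""
    if left > len(record) or right > len(record):
        raise ValueError("window is longer than the record")
    return record[len(record) - left:] + record[:right]


# The quoted record ends line up with the two flanks.
print(FIVE_PRIME.endswith(GENBANK_END))
print(GENBANK_START.startswith(THREE_PRIME))

# Toy record: quoted start + N filler + quoted end (the filler is NOT real
# vector sequence, it only stands in for the part not quoted in the thread).
toy_record = GENBANK_START + "N" * 20 + GENBANK_END
junction = junction_across_origin(
    toy_record, len(GENBANK_END), len(THREE_PRIME)
)
print(junction in FIVE_PRIME + THREE_PRIME)
```

For a vector whose cloning site is not at the record origin, the insert position has to come from the vector map or the supplier's documentation; nothing in the sequence alone marks it.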



And Sven said...

Yes, you should know about the insert location.
But that's easy, isn't it?

If you have the whole sequence you should design the splice file as
mentioned.


----f2------------------------->
----f1------------------------->
|========================= INSERT =========================|

<-------------------------r1----
<-------------------------r2----


f1 = for.begin
f2 = for.end
r1 = rev.begin
r2 = rev.end

OVERLAPS f1/f2 and/or r1/r2 ~ 50bp, individual length of f1,f2,r1,r2 ~150bp.
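The length and overlap figures in the last line can be turned into a small check. This Python sketch cuts a stretch of sequence into two windows of about 150 bp that share about 50 bp, then measures the overlap between them. Reading the diagram as "two tiled windows per strand" is an assumption, since the forum layout of the figure is ambiguous. The function names are hypothetical and the input is random toy sequence.

```python
# Sketch: two ~150 bp windows overlapping by ~50 bp, plus an overlap check.
# ASSUMPTION: f1/f2 (and r1/r2) are two tiled windows over the same stretch
# of vector; this is one reading of the diagram, not a confirmed layout.
import random


def two_overlapping_windows(region, length=150, overlap=50):
    """Return two windows of `length` bases that share `overlap` bases."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    step = length - overlap
    if len(region) < step + length:
        raise ValueError("region too short for two windows")
    return region[:length], region[step:step + length]


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


# Toy region: random bases from a fixed seed, NOT real vector sequence.
rng = random.Random(0)
toy_region = "".join(rng.choice("ACGT") for _ in range(250))

first, second = two_overlapping_windows(toy_region)
print(len(first), len(second), suffix_prefix_overlap(first, second))
```

suffix_prefix_overlap can also be run on the records of an existing splice file to see how much its begin and end sequences actually share.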
Old 09-28-2009, 01:35 AM   #3
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 620

Keep in mind that the diagrams only make sense when viewed in a fixed-width (non-proportional) font.

btw, it's not really clear to me what is unclear to you ... ;-)

Sven

Last edited by sklages; 09-28-2009 at 01:52 AM. Reason: .. rethinking ..
Old 09-28-2009, 02:44 AM   #4
dan
wiki wiki
 
Location: Cambridge, England

Join Date: Jul 2008
Posts: 265

It's unclear to me how, given an arbitrary vector sequence, one generates the associated .splice file.

Given the position of the splice site, I guess it's straightforward.

Could you demo some simple script for doing this?
Old 09-28-2009, 02:56 AM   #5
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 620

Script in terms of "perl script"? I never do this automatically ..

You need to know your 5' vector/adaptor sequences, the restriction enzyme (RE) sites if applicable, and the 3' vector/adaptor/whatever sequences ... and then create a multi-FASTA file as mentioned before.

Code:
                                  ----f2------------------------->
                      ----f1------------------------->
|======================[]=====================|
                                  <-------------------------r1----
                     <-------------------------r2----
I am afraid I am missing something?

cheers,
Sven
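Since a script was asked for earlier in the thread, here is a minimal Python sketch of one way to go from a vector sequence plus a known insert position to four splice records. It follows the documentation quoted in the first post (about 200 bp on either side of the site, forward and reverse complemented) and treats the vector as circular. Everything about the mapping is an assumption: which flank goes under which record name should be checked against "man lucy", and the function names, the 0-based site convention and the toy vector are made up for illustration.

```python
# Sketch: derive four splice records from a vector sequence and insert site.
# ASSUMPTIONS (not confirmed by the thread or by lucy's documentation):
#   - `site` is the 0-based index of the first base AFTER the insertion point
#   - the vector is circular, so flanks may wrap around the record origin
#   - for.begin = upstream flank, for.end = downstream flank,
#     rev.begin = reverse complement of downstream,
#     rev.end = reverse complement of upstream
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def circular_slice(seq, start, length):
    """Take `length` bases starting at `start`, wrapping around the ends."""
    n = len(seq)
    return "".join(seq[(start + i) % n] for i in range(length))


def splice_records(vector, site, flank=200, prefix="vector"):
    """Return [(name, sequence), ...] for the four splice records."""
    if not vector:
        raise ValueError("empty vector sequence")
    if flank > len(vector):
        raise ValueError("flank is longer than the vector")
    upstream = circular_slice(vector, site - flank, flank)
    downstream = circular_slice(vector, site, flank)
    return [
        (prefix + ".for.begin", upstream),
        (prefix + ".for.end", downstream),
        (prefix + ".rev.begin", reverse_complement(downstream)),
        (prefix + ".rev.end", reverse_complement(upstream)),
    ]


def to_fasta(records, width=60):
    """Format records as multi-FASTA text with wrapped sequence lines."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy vector (16 bp, made up) with the insert site in the middle.
toy_vector = "AAAACCCCGGGGTTTT"
toy = splice_records(toy_vector, site=8, flank=4, prefix="toy")
print(to_fasta(toy))
```

With a real vector, site would come from the vector map, and site=0 covers the case where the insert position coincides with the start of the sequence record, since the upstream flank then wraps to the end of the record.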
All times are GMT -8.