
  • Celera Assembler (WGS) - splice site file?

    Hi,

    I want to use the Celera Assembler (WGS) in my assembly pipeline in order to compare the results to Phred / Phrap. I read that to vector / quality trim my reads, I should use Lucy, but on this point I am confused.

    What is the "sequence of the vector splice site"?


    I am reading this: http://www.cbcb.umd.edu/research/CeleraAssembler.shtml

    "Each vector file [one per vector] must be accompanied by a splice site file containing the sequence within the vector that is adjacent to the splice sites used in the project. In case your project uses an adapter it should be included in the splice file. ... The vector file must contain a single FASTA-formatted sequence representing the entire sequencing vector. The splice file contains 4 FASTA records corresponding to approximately 200 bp flanking either side of the splice site, presented in both the forward and reverse-complemented orientation."


    Unfortunately I don't understand what this means. Specifically, what is the splice site file, and how do I identify the splice sites? Will this typically refer to the sequencing vector or the cloning vector (BAC)?

    The project uses the pSMART-HCKan (AF532107) sequencing vector from the Lucigen CLONESMART Blunt Cloning Kit ... does that mean anything to anyone?

    Should I just use the 200 bp either side of the primer sites?


    Sorry for the potentially very dumb question!

    Dan.
    Homepage: Dan Bolser
    MetaBase the database of biological databases.

  • #2
    Since I at least have something working for this question, I thought I'd update the thread. No clear answers exactly, but I got something that seemed to work (hopefully useful for someone) ...

    Some of what I eventually worked out on this topic is described here:





    And here is some info from an email exchange with Sven Klages (user 'sven').

    > What is the "sequence of the vector splice site"?

    The flanking bases of the cloning site, e.g. pUC19/SmaI:
    Figure
    ======



    ----f2------------------------->
    ----f1------------------------->
    |========================= GGG/CCC =========================|
    <-------------------------r1----
    <-------------------------r2----


    f1 = for.begin
    f2 = for.end
    r1 = rev.begin
    r2 = rev.end

    OVERLAPS f1/f2 and/or r1/r2 ~ 50bp

    So your splice site file could look like this (sequences
    shortened, [...]):

    >pUC19.for.begin
    attcgccattcaggctgcgcaactgttgggaagggcgatcggtgcgggcctcttcgctat
    [...]
    >pUC19.for.end
    tttcccagtcacgacgttgtaaaacgacggccagtgaattcgagctcggtaCCCGGGgat
    [...]
    >pUC19.rev.begin
    gggcagtgagcgcaacgcaattaatgtgagttagctcactcattaggcaccccaggcttt
    [...]
    >pUC19.rev.end
    aggaaacagctatgaccatgattacgccaagcttgcatgcctgcaggtcgactctagagg
    [...]

    "man lucy" will tell you more (after compiling).



    But I still didn't understand! Sven continued...

    roughly, you take the 5' flanking sequence,
    CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT

    and the 3' flanking sequence,
    GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG

    and join it to form

    >pSMART-HCAmp.for.begin
    CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT
    GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG
    >pSMART-HCAmp.for.end
    CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT
    GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG

    Which is pretty much the same for 'begin' and 'end' ..
    This is not what is proposed, but it should work.

    You should "reverse complement" if you need reverse clipping
    as well.

    >pSMART-HCAmp.rev.begin
    [sequence]
    >pSMART-HCAmp.rev.end
    [sequence]
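A minimal Python sketch of the reverse-complement step described above, assuming plain A/C/G/T/N sequences. The function and variable names are illustrative and not part of lucy; the two flanks are the ones quoted earlier in this post, and using the same joined sequence for 'begin' and 'end' follows Sven's simplified suggestion rather than the documented layout.

```python
# Illustrative sketch only: builds forward and reverse-complemented records
# from the 5' and 3' flanks quoted in the thread. All names are hypothetical.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


FLANK_5 = "CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT"
FLANK_3 = "GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG"

# Vector sequence read straight across the (empty) insert site.
joined = FLANK_5 + FLANK_3

records = [
    ("pSMART-HCAmp.for.begin", joined),
    ("pSMART-HCAmp.for.end", joined),
    ("pSMART-HCAmp.rev.begin", reverse_complement(joined)),
    ("pSMART-HCAmp.rev.end", reverse_complement(joined)),
]

# Print the four records in FASTA format.
for name, seq in records:
    print(">" + name)
    print(seq)
```

Whether lucy actually clips with such a file is best checked with the '-debug FILENAME' flag mentioned below.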

    lucy is pretty "tolerant" ...

    Just use 'lucy' with the flag '-debug FILENAME' to see if clipping
    was successful.


    If you're expecting any adaptors they should be included in
    the sequence as they are read by sequencing,

    Vector-Adaptor-(INSERT)-Adaptor-Vector
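One way to read "as they are read by sequencing" is sketched below in Python. It assumes all four pieces are written on the top strand, 5' to 3', in the order Vector-Adaptor-(INSERT)-Adaptor-Vector; the helper name `flanks_in_read_order` and the short sequences in the usage lines are made up for illustration and are not real vector or adaptor sequences.

```python
# Illustrative sketch: what a forward read and a reverse read would see
# before reaching the insert. All names and example sequences are hypothetical.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def flanks_in_read_order(vector_5, adaptor_5, adaptor_3, vector_3):
    """Return (forward, reverse) flanks in the orientation a read sees them.

    All arguments are top-strand sequences laid out as
    vector_5 - adaptor_5 - (INSERT) - adaptor_3 - vector_3.
    """
    # A forward read runs through the 5' vector, then the 5' adaptor.
    forward = vector_5 + adaptor_5
    # A reverse read starts in the 3' vector on the opposite strand, so it
    # sees the reverse complement of (adaptor_3 + vector_3).
    reverse = reverse_complement(adaptor_3 + vector_3)
    return forward, reverse


# Placeholder sequences, not real vector or adaptor sequences.
fwd, rev = flanks_in_read_order("ACGT", "GGA", "TCC", "TTTT")
print(fwd)
print(rev)
```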



    So I said...

    Thanks Sven, it's all clear now. Just to make sure I understand though,
    the GenBank sequence for this pSMART vector (pSMART-HCKan, AF532107.1)
    just 'happens' to start with:

    GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAGTCAA


    and just 'happens' to end with:

    TGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT


    but actually, I need some detailed knowledge of where on the vector
    sequence the sequence 'insert site' (or splice site) is before I can
    create what you did above?



    And Sven said...

    Yes, you should know about the insert location.
    But that's easy, isn't it?

    If you have the whole sequence you should design the splice file as
    mentioned.


    ----f2------------------------->
    ----f1------------------------->
    |========================= INSERT =========================|

    <-------------------------r1----
    <-------------------------r2----


    f1 = for.begin
    f2 = for.end
    r1 = rev.begin
    r2 = rev.end

    OVERLAPS f1/f2 and/or r1/r2 ~ 50bp, individual length of f1,f2,r1,r2 ~150bp.
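The layout above can be sketched as a small Python script, under several assumptions that are not confirmed by the thread or by the lucy documentation: the vector is circular, the insert position `site` is known (0-based index of the first base downstream of the insert site), the forward windows lie upstream of the site with f2 ending exactly at it, and the reverse windows are the mirror image downstream, reverse-complemented. The names `circular_slice`, `splice_records` and `to_fasta` are hypothetical helpers.

```python
# Illustrative sketch: derive four splice-site records from a circular vector
# sequence and a known insert position. The window placement is one reading
# of the f1/f2/r1/r2 diagram, not a documented lucy requirement.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def circular_slice(seq, start, length):
    """Return `length` bases from `start`, wrapping around the sequence ends."""
    n = len(seq)
    return "".join(seq[(start + i) % n] for i in range(length))


def splice_records(name, vector, site, length=150, overlap=50):
    """Build (header, sequence) pairs for for.begin/for.end/rev.begin/rev.end."""
    step = length - overlap  # offset between the two overlapping windows
    # Forward windows: f2 ends at the site, f1 sits `step` bases further upstream.
    f2 = circular_slice(vector, site - length, length)
    f1 = circular_slice(vector, site - length - step, length)
    # Reverse windows: mirror image downstream of the site, reverse-complemented.
    r2 = reverse_complement(circular_slice(vector, site, length))
    r1 = reverse_complement(circular_slice(vector, site + step, length))
    return [
        (name + ".for.begin", f1),
        (name + ".for.end", f2),
        (name + ".rev.begin", r1),
        (name + ".rev.end", r2),
    ]


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy example with tiny windows so the result is easy to check by eye.
toy = "AAAACCCCGGGGTTTT"
print(to_fasta(splice_records("toy", toy, site=8, length=4, overlap=2)))
```

If a GenBank record happens to start right at the insert site, as the pSMART observation above suggests, `site=0` would make the forward windows wrap around from the end of the sequence. That is an inference from this thread, so it is worth confirming against the vector map before use.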
    Homepage: Dan Bolser
    MetaBase the database of biological databases.



    • #3
      keep in mind that you should use a non-proportional font (fixed) so that it makes sense.

      btw, it's not really clear to me what is unclear to you ... ;-)

      Sven
      Last edited by sklages; 09-28-2009, 01:52 AM. Reason: .. rethinking ..



      • #4
        It's unclear to me how, given an arbitrary vector sequence, one generates the associated .splice file.

        Given the position of the splice site, I guess it's straightforward.

        Could you demo some simple script for doing this?
        Homepage: Dan Bolser
        MetaBase the database of biological databases.



        • #5
          Script in terms of "perl script"? I never do this automatically ..

          You need to know your 5' vector/adaptor sequences, re sites if applicable and the 3' vector/adaptor/whatever sequences ... and then create a multi fasta file as mentioned before.

          Code:
                                            ----f2------------------------->
                                ----f1------------------------->
          |======================[]=====================|
                                            <-------------------------r1----
                               <-------------------------r2----
          I am afraid I am missing something?

          cheers,
          Sven
