  • Large K-mer Velvet

    Hi Folks,
    I am using Velvet to assemble a number of genes where the reads are 75 bp long. An issue I am having is that some of these genes are the result of duplications, where the parent and duplicate gene are very similar. Am I right in thinking that a high k-mer length will reduce the chances of an assembly error (smaller k-mers being merged into one contig despite coming from reads generated from different duplicates)? I realize sequencing errors may be unavoidable; hopefully good coverage will help avoid these. If longer k-mers are better for duplicates, would it be better to generate longer reads?

  • #2
    Originally posted by NGS_user View Post
    Hi Folks,
    I am using Velvet to assemble a number of genes where the reads are of 75bp length.
    Are the reads paired?

    Originally posted by NGS_user View Post

    An issue I am having is that some of these genes are the result of duplications, where the parent and duplicate gene are very similar. Am I right in thinking that a high k-mer length will reduce the chances of an assembly error (smaller k-mers being merged into one contig despite coming from reads generated from different duplicates)?
    This will certainly account for some of the differences between assemblers that use bubble merging or bubble popping approaches, such as Velvet or ABySS.

    In general, increasing the k-mer length increases the uniqueness of k-mers in the resulting graph.
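To make the uniqueness point concrete, here is a minimal bash sketch that counts how many k-mers of one sequence are absent from a near-identical copy. The two 30 bp sequences and the helper names `kmers` and `unique_to_first` are made up for illustration; this is not how Velvet, ABySS, or Ray work internally.

```shell
#!/usr/bin/env bash
# Toy illustration only: sequences and function names are hypothetical.

kmers() {  # print every k-mer of sequence $1 for k=$2, one per line
  local seq=$1 k=$2 i
  for ((i = 0; i + k <= ${#seq}; i++)); do
    printf '%s\n' "${seq:i:k}"
  done
}

unique_to_first() {  # number of distinct k-mers of $1 that never occur in $2
  local a=$1 b=$2 k=$3
  comm -23 <(kmers "$a" "$k" | sort -u) <(kmers "$b" "$k" | sort -u) \
    | wc -l | tr -d ' '
}

parent=ACGTACGGTCAAGCTTAGGCATCGATTGCA
duplicate=ACGTACGGTCAAGCTAAGGCATCGATTGCA   # differs from parent at one base

for k in 5 9 13; do
  echo "k=$k k-mers found only in parent: $(unique_to_first "$parent" "$duplicate" "$k")"
done
```

A single differing base is covered by up to k different k-mers, so a larger k gives each copy more k-mers that the other copy lacks, provided the difference is not too close to the end of the sequence.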

    Two things limit how large the k-mer length can be. The first is obviously the read length, since k cannot exceed it. The second is the error rate, since a longer k-mer is more likely to contain a sequencing error.
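Both limits can be put into rough numbers. The sketch below assumes independent per-base errors, which is a simplification, and uses 0.5% as an example error rate rather than a measured one. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope sketch; the error model (independent per-base
# errors) and the 0.5% rate are assumptions, not measurements.

# Number of k-mers in one read of length L: L - k + 1 (zero if k > L).
kmers_per_read() {
  local L=$1 k=$2
  (( k > L )) && { echo 0; return; }
  echo $(( L - k + 1 ))
}

# Probability that a k-mer contains no error at per-base rate e: (1 - e)^k.
error_free_fraction() {
  awk -v e="$1" -v k="$2" 'BEGIN { printf "%.3f\n", (1 - e) ^ k }'
}

for k in 21 31 51 70; do
  echo "k=$k kmers_per_75bp_read=$(kmers_per_read 75 "$k")" \
       "error_free=$(error_free_fraction 0.005 "$k")"
done
```

As k approaches the read length, each read contributes fewer k-mers and a smaller share of them is error-free, so the effective k-mer coverage drops on both counts.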


    Originally posted by NGS_user View Post
    I realize sequencing errors may be unavoidable, hopefully good coverage will help avoid these.
    If sequencing errors occur randomly, they won't pile up at the same positions and can therefore be weeded out to some extent. Different assemblers do this in different ways.
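One generic way to see this is to count how often each k-mer occurs across the reads: k-mers created by a random error tend to appear only once, while true k-mers are seen in several reads. The sketch below uses four made-up reads and a hypothetical threshold of 2. It is a toy illustration and not the method any particular assembler uses.

```shell
#!/usr/bin/env bash
# Toy illustration only: reads, threshold and function names are hypothetical.

kmer_counts() {  # reads on stdin, one per line; prints "<count> <kmer>"
  local k=$1 r i
  while IFS= read -r r; do
    for ((i = 0; i + k <= ${#r}; i++)); do
      printf '%s\n' "${r:i:k}"
    done
  done | sort | uniq -c | awk '{ print $1, $2 }'
}

solid_kmers() {  # print k-mers seen at least $2 times, for k=$1
  kmer_counts "$1" | awk -v min="$2" '$1 >= min { print $2 }'
}

reads=(
  ACGTACGGTCAAGC
  ACGTACGGTCAAGC
  ACGTACGGTCAAGC
  ACGTACGCTCAAGC   # same region with one substitution error
)

echo "distinct 5-mers:   $(printf '%s\n' "${reads[@]}" | kmer_counts 5 | wc -l | tr -d ' ')"
echo "seen at least 2x:  $(printf '%s\n' "${reads[@]}" | solid_kmers 5 2 | wc -l | tr -d ' ')"
```

The k-mers that fall below the threshold here are exactly those overlapping the erroneous base. With real data, low-coverage regions also produce rare k-mers, so a fixed cutoff is only a rough filter.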

    For example, in Ray (see http://denovoassembler.sf.net; I am the author), these errors are simply avoided during assembly, but they are not removed from the graph.


    Originally posted by NGS_user View Post
    If longer k-mers are better for duplicates would it be better to generate longer reads?
    Longer reads are always better, provided the throughput scales as well.

    Longer reads are one of the goals that Pacific Biosciences aims to achieve.


    Maybe you can try Ray on your dataset. Ray does not merge similar paths during the assembly process, so that might help.


    seb
    Last edited by seb567; 05-31-2011, 09:29 AM. Reason: fixed link



    • #3
      The reads are single-end, but if I generate new data I could have paired-end reads of either 100 or 150 bp (GAII). I am just concerned that the high error rate will affect my assemblies, as I am not assembling a genome but rather a family of mammalian genes.



      • #4
        Originally posted by NGS_user View Post
        The reads are single-end, but if I generate new data I could have paired-end reads of either 100 or 150 bp (GAII). I am just concerned that the high error rate will affect my assemblies, as I am not assembling a genome but rather a family of mammalian genes.
        Perhaps you could first perform simulations on those genes (if they are known) or on closely related or similar genes.

        You can do that with Ray right away.

        First, you need these packages (available in all GNU/Linux distros):

        make
        g++
        open-mpi
        git (to get the development version of Ray)
        boost (to compile the read simulator shipped with Ray)


        What follows is the workflow you could use.

        Install Ray and VirtualNextGenSequencer

        Code:
        git clone git@github.com:sebhtml/ray.git
        cd ray
        make PREFIX=build MAXKMERLENGTH=128 VIRTUAL_SEQUENCER=y
        make install

        Sequence your genes in silico


        Code:
        N=600000 #number of pairs of reads
        readLength=75
        errorRate=0.005 # 0.5%
        ref=~/nuccore/genes.fasta
        mean=400 # average insert size
        sd=40 # standard deviation
        
        ./build/VirtualNextGenSequencer $ref $errorRate \
        $mean $sd $N $readLength L1_1.fasta L1_2.fasta

        Build an assembly

        Code:
        mpirun -np 64 ./build/Ray -k 70 -p L1_1.fasta L1_2.fasta \
         -o GeneBuild
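
Because the simulated reads come from a known reference, one way to explore the k-mer question is to repeat the assembly at several k values and compare the results against the reference genes. The loop below is only a dry-run sketch: it prints the commands instead of running them and reuses only the Ray options shown above (-k, -p, -o). The list of k values, the output directory names, and the function name `ray_command` are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Dry run: print one Ray command per k value instead of executing it.
# The k values and GeneBuild-k<k> directory names are illustrative assumptions.

ray_command() {  # $1 = k-mer length
  printf 'mpirun -np 64 ./build/Ray -k %s -p L1_1.fasta L1_2.fasta -o GeneBuild-k%s\n' "$1" "$1"
}

for k in 31 41 51 61 71; do
  ray_command "$k"
done
```

To actually run the assemblies, the printed lines could be piped to `sh` once the simulated read files exist. All the k values stay below the 75 bp read length and the MAXKMERLENGTH=128 used at build time.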
