
  • Contig length, k-mer coverage, and differential expression

    I'm working with some data where I have a read count and k-mer coverage (Ck) for a set of contigs and scaffolds across different conditions. I've recently heard and read a few very confusing explanations of k-mer coverage, so I would appreciate some clarification. From what I gather, Ck is directly related to base coverage. But can the size of a contig be determined if I know the Ck value, read length, and read count for that specific contig? Or would this calculation not work for a de novo transcriptome, where read coverage varies greatly between contigs and scaffolds?

    For example, here are my numbers for contig A:

    Read length = 75 b
    Read count = 185,600 reads
    Ck = 63
    hash length = 31

    When I plug all this into Ck = C*(rL-k+1)/rL, where C is the base coverage (read length * read count / contig length (cL)) and rL is the read length, I get a value for cL of about 127 kb. However, when I go back to the raw data and look at that contig's sequence, I find it to be only 0.823 kb. I'm not sure how the total reads for the run figure into this, but I have ~40 million reads for this condition.
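The arithmetic in this post can be retraced with a minimal sketch. It assumes the formula exactly as quoted (Ck = C*(rL-k+1)/rL) and that all 185,600 reads belong to contig A; the function and variable names are made up for illustration. Under those assumptions the implied length comes out at roughly 133 kb, the same order of magnitude as the ~127 kb quoted above and still far from the observed 823 b.

```python
def base_coverage_from_kmer_coverage(ck, read_len, k):
    """Invert Ck = C * (rL - k + 1) / rL to recover base coverage C."""
    return ck * read_len / (read_len - k + 1)


def implied_contig_length(read_count, read_len, base_cov):
    """Solve C = rL * reads / cL for the contig length cL."""
    return read_count * read_len / base_cov


# Values quoted in the post for contig A
read_len = 75          # bases per read
read_count = 185_600   # reads assigned to the contig
ck = 63                # k-mer coverage
k = 31                 # hash length
observed_len = 823     # contig length seen in the raw data, in bases

c = base_coverage_from_kmer_coverage(ck, read_len, k)
implied_len = implied_contig_length(read_count, read_len, c)

# Base coverage that WOULD be needed if every read really fell on 823 bases
c_if_all_reads_map = read_count * read_len / observed_len

print(f"base coverage implied by Ck:        {c:.1f}x")
print(f"contig length implied by the reads: {implied_len:,.0f} b")
print(f"coverage needed for an 823 b contig: {c_if_all_reads_map:,.0f}x")
```

Read the other way round, 185,600 reads of 75 b on an 823 b contig would need a base coverage in the tens of thousands rather than about 105x. That gap suggests the read count and the Ck value are not describing the same set of reads, although the sketch cannot say which of the two is off.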

    Because C depends on the read count, my best guess is that contigs and scaffolds that have relatively high or low expression over the mean will have Ck values unrepresentative of the contig length. But I feel clueless, and my partner appears to be only acting as if he knows. I have a feeling I'm misunderstanding something completely obvious.

    Any help on this matter would be greatly appreciated.

  • #2
    Hi, all
    I am new to de novo genome assembly. I have FASTQ sequence data that I have to assemble using Velvet. I ran the VelvetOptimiser script with hash lengths from 27 to 41, and it predicted 37 to be the best. The output file contigs.fa contains 260 contigs, whereas the log file reports 283 nodes. Where have the rest gone? Is the length given in contigs.fa in k-mers? How do I calculate its actual nucleotide length in bp? How do I tell whether the assembly is good or bad? The final stats given after the script ran:
    Final graph has 283 nodes and n50 of 347, max 2336, total 68614, using 19064/50000 reads
    Why is the number of used reads so low?
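For the length question, a short sketch under two stated assumptions: that the n50 and max values in the log line are counted in k-mers (as the Velvet manual describes for node lengths), and that the final assembly used the hash length of 37 picked by the optimiser. A run of n overlapping k-mers spans n + k - 1 bases, which is the only conversion used here. The helper name is hypothetical.

```python
def kmers_to_bp(length_in_kmers, k):
    """A stretch of n consecutive k-mers covers n + k - 1 nucleotides."""
    return length_in_kmers + k - 1


k = 37                               # assumed: hash length chosen by the optimiser
n50_kmers, max_kmers = 347, 2336     # from the log line quoted in the post
used_reads, total_reads = 19_064, 50_000

n50_bp = kmers_to_bp(n50_kmers, k)
max_bp = kmers_to_bp(max_kmers, k)
used_fraction = used_reads / total_reads

print(f"N50: {n50_kmers} k-mers -> {n50_bp} bp")
print(f"max: {max_kmers} k-mers -> {max_bp} bp")
print(f"reads used: {used_fraction:.1%}")
```

The same per-node conversion would apply to each length in a contigs.fa header if those are also in k-mers; the total cannot be converted with a single addition, because k - 1 has to be added once per node.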



    • #3
      Contig length, k-mer coverage, and differential expression

      I'm pretty sure Velvet has a cutoff value for the length of the contigs listed in the contigs.fa file, although I don't remember off the top of my head what that is. So the missing contigs are probably the very short ones.

      The formula for calculating k-mer coverage from base coverage is given in the Velvet manual.



      As to whether the assembly is good, have a look at the Nature Methods article entitled "De novo genome assembly: what every biologist should know".



      • #4
        What is the twin node as specified in Velvet? It says it is the reverse series of reverse complement k-mers. How are contigs actually generated in a paired-end assembly with Velvet? Can someone show this using an example?
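The "reverse series of reverse complement k-mers" wording can be illustrated with a toy sketch. This is not Velvet's code: the sequence, the k value, and all function names are invented for the example. It only shows that taking a node's k-mers in reverse order and reverse-complementing each one gives the k-mers of the opposite strand, which is what a twin node is described as representing.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def twin_kmers(seq, k):
    """Reverse the k-mer series, reverse-complementing each k-mer."""
    return [reverse_complement(km) for km in reversed(kmers(seq, k))]


node_seq = "ATGGCGT"   # toy sequence, chosen arbitrarily
k = 3                  # toy k, far smaller than a real hash length

forward = kmers(node_seq, k)
twin = twin_kmers(node_seq, k)

print("node k-mers:", forward)
print("twin k-mers:", twin)
# The twin series is exactly the k-mer series of the opposite strand
print(twin == kmers(reverse_complement(node_seq), k))
```

Because the two series always describe the same stretch of double-stranded DNA, a read from either strand lands on the same node/twin pair. How paired-end information is then used to join contigs is not covered by this sketch.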
