Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Genome size estimation for paired end data

    Hi,

    I am working with Illumina paired-end data with read length (l) 101 to generate de novo assembly. I am trying to predict genome size as per given formula, M = N * (L-K+1)/L.
    The M value I found using Jellyfish which is around 3591. But I am confused since I have paired end file with 80284125 reads in each file, i.e. 80284125*2. When I have calculated genome size using only 80284125*101, I got expected genome size around 1.78MB.

    But I am not sure whether genome size should be calculated how I have calculated above or using (80284125*101)*2, since data is paired end.

    Also, it would be of great help if you tell how to get error k-mer distribution value.

    Regards,
    Mandar

  • #2
    If you download the BBMap package and run kmercountexact, it will tell you the size of the single-copy genomic peak, which is roughly the genome size for bacteria.

    kmercountexact.sh in1=read1.fastq in2=read2.fq khist=khist.txt peaks=peaks.txt

    You can plot the khist.txt if you want to look at it visually to see how many error kmers there are. "peaks.txt" will give you the location and size of the largest peaks; the number indicating the volume of the first peak should be approximately the genome size, though you can make it more accurate by factoring in the repeat peaks.

    You can do some of this with Jellyfish, too, but I don't know the specifics.

    Comment


    • #3
      Thanks Brian. It, BBMAP, did work well, since it gave volume in first row which is very much near to my specie's nearby species' genomic size unlike jellyfish. Only thing is I had to subsample my very huge data, as with large dataset it threw exceptions.

      Only one question in this aspect which you have mentioned in your message " though you can make it more accurate by factoring in the repeat peaks".. If you could clarify this, it would be of great help.

      Comment


      • #4
        If you plot the kmer frequency histogram of a single haploid organism, you will typically see a peak at 0 (the error peak), then the next peak is typically the most prominent and corresponds to the size of the genome that is unique.

        Then there will typically be more, smaller peaks at higher kmer depths. Let's say the first peak is at 40, and contains 3 million kmers. That implies that you have roughly 40x coverage, and the genome contains 3Mbp of unique sequence. There will probably be another peak at 80, which is a lot smaller. This is the peak from 2-copy repeat regions of the genome. If it contains 100,000 kmers, then the expected genome size would actually be increased by 200,000bp - because each of those kmers occurs twice. The three-copy peak would be at 120, and if it contained 100,000 kmers, it would contribute 300,000bp to the expected genome size.

        So - if your first peak P is at depth D, with volume V, then let's label subsequent peaks at depth N*D as PN with volume VN - e.g. P2 is the peak at 2*D, and has a volume of V2. The total genome size would be:

        V1*1+V2*2+V3*3...+VN*N.

        In each case, you can get the multiplier N by round(DN/D1) where DN is the depth (center) of the current peak and D1 is the depth of the first peak.

        But, my program does not have a very sophisticated peak-calling algorithm, so it won't find all of the peaks - it will probably find the first 2, and then maybe some more. You'll have to do that manually from looking at the graph of the kmer histogram, or use a better peak-caller. For most small organisms, though, the vast majority of the genome is single-copy so the higher peaks can be ignored if you just want a rough estimate.
        Last edited by Brian Bushnell; 11-06-2015, 10:44 AM.

        Comment


        • #5
          Thanks alot Brian once again for your comprensive explanation. Further peaks, after first error and second actual, there are no apparent peaks. So, I assume whaterver first row volume is there, is my genome size.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Strategies for Sequencing Challenging Samples
            by seqadmin


            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
            03-22-2024, 06:39 AM
          • seqadmin
            Techniques and Challenges in Conservation Genomics
            by seqadmin



            The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

            Avian Conservation
            Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
            03-08-2024, 10:41 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Yesterday, 06:37 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Yesterday, 06:07 PM
          0 responses
          10 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-22-2024, 10:03 AM
          0 responses
          51 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 03-21-2024, 07:32 AM
          0 responses
          67 views
          0 likes
          Last Post seqadmin  
          Working...
          X