Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • mandar.bobade60
    Member
    • Jun 2013
    • 14

    Genome size estimation for paired end data

    Hi,

    I am working with Illumina paired-end data with read length (l) 101 to generate de novo assembly. I am trying to predict genome size as per given formula, M = N * (L-K+1)/L.
    The M value I found using Jellyfish which is around 3591. But I am confused since I have paired end file with 80284125 reads in each file, i.e. 80284125*2. When I have calculated genome size using only 80284125*101, I got expected genome size around 1.78MB.

    But I am not sure whether genome size should be calculated how I have calculated above or using (80284125*101)*2, since data is paired end.

    Also, it would be of great help if you tell how to get error k-mer distribution value.

    Regards,
    Mandar
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    If you download the BBMap package and run kmercountexact, it will tell you the size of the single-copy genomic peak, which is roughly the genome size for bacteria.

    kmercountexact.sh in1=read1.fastq in2=read2.fq khist=khist.txt peaks=peaks.txt

    You can plot the khist.txt if you want to look at it visually to see how many error kmers there are. "peaks.txt" will give you the location and size of the largest peaks; the number indicating the volume of the first peak should be approximately the genome size, though you can make it more accurate by factoring in the repeat peaks.

    You can do some of this with Jellyfish, too, but I don't know the specifics.

    Comment

    • mandar.bobade60
      Member
      • Jun 2013
      • 14

      #3
      Thanks Brian. It, BBMAP, did work well, since it gave volume in first row which is very much near to my specie's nearby species' genomic size unlike jellyfish. Only thing is I had to subsample my very huge data, as with large dataset it threw exceptions.

      Only one question in this aspect which you have mentioned in your message " though you can make it more accurate by factoring in the repeat peaks".. If you could clarify this, it would be of great help.

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        If you plot the kmer frequency histogram of a single haploid organism, you will typically see a peak at 0 (the error peak), then the next peak is typically the most prominent and corresponds to the size of the genome that is unique.

        Then there will typically be more, smaller peaks at higher kmer depths. Let's say the first peak is at 40, and contains 3 million kmers. That implies that you have roughly 40x coverage, and the genome contains 3Mbp of unique sequence. There will probably be another peak at 80, which is a lot smaller. This is the peak from 2-copy repeat regions of the genome. If it contains 100,000 kmers, then the expected genome size would actually be increased by 200,000bp - because each of those kmers occurs twice. The three-copy peak would be at 120, and if it contained 100,000 kmers, it would contribute 300,000bp to the expected genome size.

        So - if your first peak P is at depth D, with volume V, then let's label subsequent peaks at depth N*D as PN with volume VN - e.g. P2 is the peak at 2*D, and has a volume of V2. The total genome size would be:

        V1*1+V2*2+V3*3...+VN*N.

        In each case, you can get the multiplier N by round(DN/D1) where DN is the depth (center) of the current peak and D1 is the depth of the first peak.

        But, my program does not have a very sophisticated peak-calling algorithm, so it won't find all of the peaks - it will probably find the first 2, and then maybe some more. You'll have to do that manually from looking at the graph of the kmer histogram, or use a better peak-caller. For most small organisms, though, the vast majority of the genome is single-copy so the higher peaks can be ignored if you just want a rough estimate.
        Last edited by Brian Bushnell; 11-06-2015, 10:44 AM.

        Comment

        • mandar.bobade60
          Member
          • Jun 2013
          • 14

          #5
          Thanks alot Brian once again for your comprensive explanation. Further peaks, after first error and second actual, there are no apparent peaks. So, I assume whaterver first row volume is there, is my genome size.

          Comment

          Latest Articles

          Collapse

          • SEQadmin2
            Nine Things a Sample Prep Scientist Thinks About Before Sequencing
            by SEQadmin2


            I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

            Here are nine questions we think about, in roughly the order they matter, before...
            06-18-2026, 07:11 AM
          • SEQadmin2
            From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
            by SEQadmin2


            Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


            The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
            ...
            06-02-2026, 10:05 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by SEQadmin2, Today, 05:37 AM
          0 responses
          5 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-26-2026, 11:10 AM
          0 responses
          16 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-17-2026, 06:09 AM
          0 responses
          50 views
          0 reactions
          Last Post SEQadmin2  
          Started by SEQadmin2, 06-09-2026, 11:58 AM
          0 responses
          109 views
          0 reactions
          Last Post SEQadmin2  
          Working...