SEQanswers

Go Back   SEQanswers > General



Similar Threads
Thread Thread Starter Forum Replies Last Post
The insert-size in paired-end data louis7781x Illumina/Solexa 15 02-06-2017 09:42 AM
Calculating coverage based on full insert size from paired-end data rnaeye Bioinformatics 3 11-09-2015 10:49 AM
Weird and unexpected insert size distribution from ChIPseq Paired-end data jajclement Bioinformatics 0 05-13-2014 06:41 AM
Genome size estimation moinul De novo discovery 9 04-04-2014 04:22 AM
Average insert size from paired end data Petrichor Bioinformatics 4 04-29-2013 02:13 AM

Reply
 
Thread Tools
Old 11-18-2014, 04:32 AM   #1
mandar.bobade60
Member
 
Location: India

Join Date: Jun 2013
Posts: 14
Unhappy Genome size estimation for paired end data

Hi,

I am working with Illumina paired-end data with read length (l) 101 to generate de novo assembly. I am trying to predict genome size as per given formula, M = N * (L-K+1)/L.
The M value I found using Jellyfish which is around 3591. But I am confused since I have paired end file with 80284125 reads in each file, i.e. 80284125*2. When I have calculated genome size using only 80284125*101, I got expected genome size around 1.78MB.

But I am not sure whether genome size should be calculated how I have calculated above or using (80284125*101)*2, since data is paired end.

Also, it would be of great help if you tell how to get error k-mer distribution value.

Regards,
Mandar
mandar.bobade60 is offline   Reply With Quote
Old 11-18-2014, 10:54 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

If you download the BBMap package and run kmercountexact, it will tell you the size of the single-copy genomic peak, which is roughly the genome size for bacteria.

kmercountexact.sh in1=read1.fastq in2=read2.fq khist=khist.txt peaks=peaks.txt

You can plot the khist.txt if you want to look at it visually to see how many error kmers there are. "peaks.txt" will give you the location and size of the largest peaks; the number indicating the volume of the first peak should be approximately the genome size, though you can make it more accurate by factoring in the repeat peaks.

You can do some of this with Jellyfish, too, but I don't know the specifics.
Brian Bushnell is offline   Reply With Quote
Old 11-18-2014, 11:24 PM   #3
mandar.bobade60
Member
 
Location: India

Join Date: Jun 2013
Posts: 14
Default

Thanks Brian. It, BBMAP, did work well, since it gave volume in first row which is very much near to my specie's nearby species' genomic size unlike jellyfish. Only thing is I had to subsample my very huge data, as with large dataset it threw exceptions.

Only one question in this aspect which you have mentioned in your message " though you can make it more accurate by factoring in the repeat peaks".. If you could clarify this, it would be of great help.
mandar.bobade60 is offline   Reply With Quote
Old 11-19-2014, 09:27 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

If you plot the kmer frequency histogram of a single haploid organism, you will typically see a peak at 0 (the error peak), then the next peak is typically the most prominent and corresponds to the size of the genome that is unique.

Then there will typically be more, smaller peaks at higher kmer depths. Let's say the first peak is at 40, and contains 3 million kmers. That implies that you have roughly 40x coverage, and the genome contains 3Mbp of unique sequence. There will probably be another peak at 80, which is a lot smaller. This is the peak from 2-copy repeat regions of the genome. If it contains 100,000 kmers, then the expected genome size would actually be increased by 200,000bp - because each of those kmers occurs twice. The three-copy peak would be at 120, and if it contained 100,000 kmers, it would contribute 300,000bp to the expected genome size.

So - if your first peak P is at depth D, with volume V, then let's label subsequent peaks at depth N*D as PN with volume VN - e.g. P2 is the peak at 2*D, and has a volume of V2. The total genome size would be:

V1*1+V2*2+V3*3...+VN*N.

In each case, you can get the multiplier N by round(DN/D1) where DN is the depth (center) of the current peak and D1 is the depth of the first peak.

But, my program does not have a very sophisticated peak-calling algorithm, so it won't find all of the peaks - it will probably find the first 2, and then maybe some more. You'll have to do that manually from looking at the graph of the kmer histogram, or use a better peak-caller. For most small organisms, though, the vast majority of the genome is single-copy so the higher peaks can be ignored if you just want a rough estimate.

Last edited by Brian Bushnell; 11-06-2015 at 10:44 AM.
Brian Bushnell is offline   Reply With Quote
Old 11-19-2014, 08:56 PM   #5
mandar.bobade60
Member
 
Location: India

Join Date: Jun 2013
Posts: 14
Default

Thanks alot Brian once again for your comprensive explanation. Further peaks, after first error and second actual, there are no apparent peaks. So, I assume whaterver first row volume is there, is my genome size.
mandar.bobade60 is offline   Reply With Quote
Reply

Tags
genome size estimation

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:58 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO