SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
MiSeq gDNA reads still fail "Kmer content" and "per base seq content" after trimming" ysnapus Illumina/Solexa 4 11-12-2014 07:25 AM
What "high duplication rate" means Kleido General 5 01-31-2014 06:13 AM
How to extract the "overall alignment rate" from bowtie andaleef Illumina/Solexa 0 07-26-2012 02:38 PM
what is the new bwa sampe "prior of chimeric rate" option? jstjohn Bioinformatics 1 07-20-2011 10:40 AM
"Systems biology and administration" & "Genome generation: no engineering allowed" seb567 Bioinformatics 0 05-25-2010 12:19 PM

Reply
 
Thread Tools
Old 02-19-2015, 03:16 AM   #1
carlos
Junior Member
 
Location: Spain

Join Date: Feb 2008
Posts: 7
Default Heterozygous rate from "de novo" sequencing

Hi,

I've just tried to calculate the heterozygous rate of my seq data, but no luck using gce or even preqc from sga using kmer information. Any idea about how to directly calculate heterozygous rate by SNPs?

Bests,

Carlos
carlos is offline   Reply With Quote
Old 02-19-2015, 09:55 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can do that with BBTools like this:

kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt

You can examine the histogram manually, or use the "peaks" file which tells you the number of unique kmers in each peak on the histogram. For a diploid, the first peak will be the het peak, the second will be the homozygous peak, and the rest will be repeat peaks. The peak caller is not perfect, though, so particularly with noisy data I would only rely on it for the first two peaks, and try to quantify the higher-order peaks manually if you need to (which you generally don't).

I think AllPaths-LG will report het rate, also.
Brian Bushnell is offline   Reply With Quote
Old 07-08-2015, 05:05 AM   #3
cvargasc
Junior Member
 
Location: Valencia, Spain

Join Date: Aug 2011
Posts: 5
Default

Dear Brian,

Thank you for developing such an useful an versatile tool. I was wondering how to interpretate both files to gain insight into the heterozygosity. My peaks file was generated using information from a highly heterozygous diploid organism, however there is a third peak with a greater volume than any other, why could this be happening?

Thanks a lot!

#start center stop max volume
10 14 18 4633863 36855138
18 29 36 4913455 85947645
36 59 103 12178207 347426111
103 105 106 324995 972461
106 114 228 331479 18667384
228 229 230 56194 111376
230 231 234 55293 217949
234 235 99931 54052 12688071
cvargasc is offline   Reply With Quote
Old 07-08-2015, 10:00 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

There are several possible explanations for this, such as organelles, contamination, a symbiont, or mozaicism. But in this case, noting that the peaks are centered at 14, 29, and 59, it very much looks like the organism is tetraploid, or was tetraploid in the past and is now diploid but each pair of chromosomes has a very similar additional pair. How confident are you that it's diploid?

Organelles is unlikely because they generally have a small, very high copy peak. You can check for contamination/symbionts by plotting a gc-content histogram of the raw reads (reformat.sh in=reads.fq gchist=gchist.txt gcbins=auto) or assembling, mapping, and plotting gc versus coverage (this can be generated by BBMap with the covstats output) or plotting a PCA of tetramer frequencies of the contigs.

The latest version of BBTools (35.14) does smoothing of the kmer histogram prior to calling peaks, which greatly increases the accuracy, particularly for high-copy-count peaks.
Brian Bushnell is offline   Reply With Quote
Old 07-08-2015, 10:45 AM   #5
cvargasc
Junior Member
 
Location: Valencia, Spain

Join Date: Aug 2011
Posts: 5
Default

Thank you for your prompt reply! A karyotype was performed and it was determined that the organism is diploid, however the possibility of a recent duplication is not difficult to imagine.
Regarding symbionts or organelles, we would only expect the mitochondria, we are quite confident that this sample is symbiont free.
Could you please elaborate into why you think this sample is tetraploid? Also, how could I quantify heterozygosity using the output?
Thanks!
cvargasc is offline   Reply With Quote
Old 07-08-2015, 11:01 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Typically, for a polyploid organism, the peak with the largest volume (or the major peak with the greatest depth) represents the homozygous portion of the genome. For a diploid organism with 60x coverage, you would expect two peaks - one at 60x (the homozygous peak) and one at 30x (the heterozygous peak). As long as the het rate is not super high, the homozygous peak will be bigger; though with an extremely high het rate, the heterozygous peak could be bigger. But, instead of homozygous and heterozygous peaks, let's call them the 1-copy and 2-copy peaks.

With a tetraploid, you would expect 4 peaks - a 1-copy, 2-copy, 3-copy, and 4-copy. The 4-copy will generally be the largest. If you have 60x coverage (of the haploid genome), then the peaks will be at 15, 30, 45, and 60. Presumably, if you look at the kmer histogram, this dataset should have a 3-copy peak at depth 45, but it was probably merged into the 60-depth peak.

So, generally, if you ever have the biggest peak at depth N, and have smaller peaks at N/4, 2N/4, and ideally 3N/4, you are probably looking at a tetraploid.

Oh... as for calculating the het rate, you can estimate that the genome size is 1*(volume of 1-copy peak)+2*(volume of 2-copy peak), etc. The latest version of BBMap will do this automatically and report the genome size in the header of the peaks file. That's the number of base pairs in each cell of the organism; if you want the haploid genome size, then for a tetraploid, you'd divide that number by 4. So:

(1*36855138+2*85947645+4*347426111)/4=399,613,718 - roughly 400Mbp.

To get the het rate, if you assume that all the differences between ploidy is just SNPs (a useful simplifying assumption), then since KmerCountExact uses 31-mers, each 1-copy SNP would be expected to produce 31 kmers in the 1-copy peak. Therefore, the number of single-copy SNPs would be the volume of the 1-copy peak divided by 31, which is 1188875. The number of 2-copy SNPs would be 2-copy peak over 31, or 2772504. So the total number of SNP locations is probably around 3,961,379. Therefore the het rate is something like 1/100.

Last edited by Brian Bushnell; 07-08-2015 at 11:29 AM.
Brian Bushnell is offline   Reply With Quote
Old 07-09-2015, 06:52 AM   #7
cvargasc
Junior Member
 
Location: Valencia, Spain

Join Date: Aug 2011
Posts: 5
Default

Thanks a lot for your explanations, it is all clear now! And once more thank you for developing BBTools.
cvargasc is offline   Reply With Quote
Reply

Tags
de novo genome, heterozygosity, illumina, kmer, rate

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:05 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO