SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Genome size estimation using BBMAP syintel87 Bioinformatics 9 07-12-2017 10:43 AM
Kmer genome size estimation so much lower than the actual GAFA De novo discovery 3 01-22-2017 12:29 AM
Genome Size Estimation from PacBio Raw Reads jpummil Bioinformatics 7 10-10-2016 06:00 AM
Genome size estimation-jellyfish bioman1 Bioinformatics 3 08-18-2014 12:14 PM
Genome size estimation moinul De novo discovery 9 04-04-2014 04:22 AM

Reply
 
Thread Tools
Old 06-23-2017, 05:36 AM   #1
svitlana
Member
 
Location: Brussels

Join Date: Jun 2017
Posts: 11
Default Genome size estimation with BBTools: what is the right k value?

Hello,

I am trying to assemble the genome of Gonioctena intermedia. Its size was estimated to +/- 1.6 Gb with flow cytometry and we think it should be highly polymorphic.

I am now trying to estimate these parameters from the data we received (Illumina HiSeq 2500) with BBTools by executing the following command:

Code:
./kmercountexact.sh in1=./R1.fastq in2=./R2.fastq k=${K_PARAM} khist=./${K_PARAM}.khist peaks=./${K_PARAM}.peaks
with K_PARAM values going from 17 to 31.

First question: I remark that the estimation of the genome size, as well as the estimation of the het rate, varies depending on the value of k. When k increases, the genome size increases and the het rate decreases (see the attached image). So my question is: which is the right k value to estimate the genome size?

Second question: for different values of k, the ploidy varies between 1 and 2 (see the attached image). I don't understand why it happens. Also, the main peak is estimated to +/- 63 (depends again on k) and for all the executions I see the following pattern (for X = 63): first peak at X/2, the second at X and then at X*2 and X*3. Is that just a coincidence, or can I conclude something from this observation? I suppose that X corresponds to the homozygous peak and X/2 to the heterozygous peak, but what is the interpretation of X*2 and X*3?

Third question: we expect the genome to be highly polymorphic and I think that the problem related to ploidy sometimes estimated to 1 could be explained by the highly polymorphic content of the genome. But the program indicates that the het rate is relatively low. Isn't that strange?

I also attach graphs that I obtained for k=17 and k=31, and the output of BBTools for k=29 and k=31 below:

Code:
#k      29
#unique_kmers   7181618041
#main_peak      61
#genome_size    2535937314
#haploid_genome_size    2535937314
#fold_coverage  31
#haploid_fold_coverage  31
#ploidy 1
#percent_repeat 91.476
#start  center  stop    max     volume
14      31      41      13493009        216159759
41      61      102     36090010        856406178
102     123     167     1151778 47402871
167     181     453     287488  30543532
1136    1139    1225    6345    520083
1225    1227    1332    5562    539154
1332    1338    1397    4677    289466
1397    1399    1445    4244    198202
1445    1446    1450    4028    19938
1450    1452    1475    4003    98402
1475    1476    1493    3934    69066
1493    1498    3745    3922    3396667
Code:
#k      31
#unique_kmers   7495157208
#main_peak      61
#genome_size    2619186233
#haploid_genome_size    1309593116
#fold_coverage  30
#haploid_fold_coverage  61
#ploidy 2
#het_rate       0.00574
#percent_repeat 23.602
#start  center  stop    max     volume
14      30      41      14545750        233034146
41      61      101     37606843        883987911
101     121     166     1142508 46978440
166     179     448     285056  29563222
1067    1069    1095    6768    183168
1095    1101    1125    6301    187407
1125    1127    1156    6163    184603
1156    1160    1247    5811    495762
1247    1248    1276    5184    143577
1276    1277    1343    4874    307763
1343    1344    1401    4568    244932
1401    1405    1513    4012    424762
1513    1520    3799    3623    3191287
Thank you in advance for the help!
Attached Images
File Type: bmp Different k values.bmp (143.6 KB, 12 views)
Attached Files
File Type: pdf 17.khist.pdf (12.5 KB, 12 views)
File Type: pdf 31.khist.pdf (12.5 KB, 7 views)
svitlana is offline   Reply With Quote
Old 06-27-2017, 12:22 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

The peak-calling and ploidy estimation in KmerCountExact are not very sophisticated, but I certainly expect them to do a better job than that! To my eye this is a very obvious diploid; not sure why the ploidy sometimes is estimated at 1; I'll have to look into that. The X*2 and X*3 peaks are 2-copy and 3-copy repeats, which are very pronounced; looks like the organism is fairly repetitive - they are normally much smaller.

It does look to me from the picture that the het rate is pretty low, though (assuming that X=63 is indeed the haploid genomic peak).

So, as for your questions...

1) None of them are exactly right, but certainly all the kmer lengths that predict ploidy of 1 are wrong! I'd use the estimates from K=31. The model for calculating genome size and het rate assumes all SNPs are at least K bases from each other. As you increase K (or increase the het rate) that starts to become less correct, which is why you can't use a super-long K usefully for these calculations, but K=31 is still pretty short. Also, as you increase K, semi-repetitive sequences are better resolved as being unique (which is why the estimated genome size increases with K), so there's a tradeoff between high and low K.

2) Answered.

3) The het rate is based on the assumption that all heterozygous events are SNPs or very short indels. If the polymorphism is more along the lines of rearrangements, the estimate would be pretty incorrect...
Brian Bushnell is offline   Reply With Quote
Old 07-03-2017, 04:27 AM   #3
svitlana
Member
 
Location: Brussels

Join Date: Jun 2017
Posts: 11
Default

Thank you Brian for this complete and very useful answer! So, according to these estimations, the main complexity in assembly of this genome will be its repetitive content rather than its degree of polymorphism. It's a very useful information.
svitlana is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:31 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO