I'm working with some data where I have a read count and k-mer coverage (Ck) for a set of contigs and scaffolds across different conditions. I've recently heard and read a few very confusing explanations of k-mer coverage, so I would appreciate some clarification. From what I gather, Ck is directly related to base coverage. But can the length of a contig be determined if I know the Ck value, read length, and read count for that specific contig? Or would this calculation not work for a de novo transcriptome, where read coverage varies greatly between contigs and scaffolds?
For example, here are my numbers for contig A:
Read length = 75 b
Read count = 185,600 reads
Ck = 63
hash length = 31
When I plug all this into Ck = C*(rL-k+1)/rL, where C = base coverage (read length * read count / contig length (cL)), rL = read length, and k = hash length, I get a value for cL of about 127 kb. However, when I go back to the raw data and look at that contig's sequence, I find it to be only 0.823 kb. I'm not sure how the total reads for the run figure into this, but I have ~40 million reads for this condition.
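To make the arithmetic easy to check, here is a minimal Python sketch of the calculation above. The function and variable names are my own, made up for illustration, and it assumes the formula is applied exactly as written, with every counted read lying entirely within the contig. Run on the numbers for contig A, it gives C = 105 and an implied length of roughly 132.6 kb, which is in the same range as my ~127 kb figure and nowhere near the observed 0.823 kb.

```python
def base_coverage_from_kmer_coverage(ck, read_length, k):
    """Invert Ck = C * (rL - k + 1) / rL to recover base coverage C."""
    return ck * read_length / (read_length - k + 1)


def implied_contig_length(ck, read_length, k, read_count):
    """Solve C = rL * reads / cL for cL, using C derived from Ck.

    Assumption: all counted reads map fully inside this one contig.
    """
    c = base_coverage_from_kmer_coverage(ck, read_length, k)
    return read_length * read_count / c


# Numbers for contig A from the post
read_length = 75      # bases
read_count = 185_600  # reads assigned to the contig
ck = 63               # k-mer coverage
k = 31                # hash length

c = base_coverage_from_kmer_coverage(ck, read_length, k)
cl = implied_contig_length(ck, read_length, k, read_count)

print(f"Base coverage C: {c:.1f}")
print(f"Implied contig length: {cl / 1000:.1f} kb")
```

Going the other way, 185,600 reads of 75 b on a 0.823 kb contig would correspond to a base coverage of about 16,900x, far above the 105x implied by Ck = 63, so the read count and the Ck value do not appear to describe the same quantity under this formula.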
Because C depends on the read count, my best guess is that contigs and scaffolds that have relatively high or low expression over the mean will have Ck values unrepresentative of the contig length. But I feel clueless, and my partner appears to be only acting as if he knows. I have a feeling I'm misunderstanding something completely obvious.
Any help on this matter would be greatly appreciated.