SEQanswers

Old 12-15-2011, 10:51 PM   #1
subuhikhan
Junior Member
 
Location: Auckland

Join Date: Dec 2011
Posts: 4
Kmer content

Hello,

I recently got back my Illumina RNA sequencing dataset and used FastQC to check its quality. What is Kmer content, and what is its significance?

Thank you
Subuhi
Old 12-27-2011, 09:06 AM   #2
cllorens
Member
 
Location: Valencia

Join Date: Nov 2011
Posts: 44

Hello

A k-mer is a motif (a short word) of length k in a genomic or sequenced sequence; the k-mers of interest are typically those observed more than once. The order of a k-mer is its word size, k.

Examples for k = 2, 3, and 4:

For repeats:

acacacacacac... (repeated "AC" dinucleotides)

gacgacgacgacgac (repeated "GAC" trinucleotides)

For spaced occurrences:

tttccGAGGaaggcgtagcgacgacGAGGaagcctca (two occurrences of the "GAGG" tetranucleotide)


The content is the number of times the k-mer occurs in the sequence, and the distribution describes how enriched a genomic sequence is for a particular k-mer.
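
As a rough illustration of "content" in this sense, here is a minimal Python sketch that counts every overlapping k-mer in a sequence with a sliding window. The function name count_kmers is made up for this example and does not come from FastQC or any other tool; FastQC's Kmer Content module reports more than raw counts, so treat this only as a toy.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping word of length k in seq (case-insensitive)."""
    seq = seq.upper()
    # Slide a window of width k along the sequence, one base at a time.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# The example sequences from this post.
print(count_kmers("acacacacacac", 2)["AC"])     # 6
print(count_kmers("gacgacgacgacgac", 3)["GAC"])  # 5
print(count_kmers("tttccGAGGaaggcgtagcgacgacGAGGaagcctca", 4)["GAGG"])  # 2
```

Note that the window overlaps, so "acacacacacac" also contains the 2-mer "CA" five times; every position in a sequence contributes a k-mer, whether or not it is part of a repeat.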

Because you can search for k-mers of any size (the concept extends to larger words), the applications are diverse: searching for and masking repeats and mobile elements, preprocessing FASTQ files, de novo assembly, and so on.

This is a very short explanation covering only the basics, but it may help you look up papers on software and pipelines that use k-mers to search for repeats and mobile elements de novo, as well as papers and manuals for de novo assembly software.

Best
Carlos
Old 03-01-2012, 06:26 PM   #3
xlzhang
Junior Member
 
Location: beijing

Join Date: Nov 2011
Posts: 6

Hi, Carlos

I have used SOAPdenovo, where the minimum contig length equals the k-mer value. I also used Cortex, and its result file contains the following string: lst_kmer:ATATTTTCTTACATGTTCCAAGGGT. I would like a deeper understanding of k-mers.

I am a beginner. Thanks for your help.
Old 03-04-2012, 12:34 PM   #4
Zam
Member
 
Location: Oxford

Join Date: Apr 2010
Posts: 51

Hi there

1. Kmers are just words (chunks of sequence) of length k.
2. The current version of Cortex contains some unnecessary stuff in the output.
This text
lst_kmer:ATATTTTCTTACATGTTCCAAGGGT

just tells you the last kmer in the contig. "lst" stands for last.
fst_kmer is the first kmer. It was once useful, but is not any more, and I have just removed it from Cortex - when I make the next release, it will be gone.

Sorry about this. I've been meaning to remove it for a while; it just confuses new users.
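
To make "first kmer" and "last kmer" concrete, here is a minimal Python sketch that slices them off a contig string. It is not Cortex code: the function name and the contig are hypothetical, and k = 25 is only inferred from the length of the lst_kmer string quoted above.

```python
def first_and_last_kmer(contig, k):
    """Return the first and last words of length k in a contig."""
    if len(contig) < k:
        raise ValueError("contig is shorter than k")
    return contig[:k], contig[-k:]

# Hypothetical contig made up for illustration; only its last 25 bases
# are taken from the lst_kmer value quoted in this thread.
contig = "GGCATTACG" + "ATATTTTCTTACATGTTCCAAGGGT"
fst, lst = first_and_last_kmer(contig, 25)
print(fst)  # GGCATTACGATATTTTCTTACATGT
print(lst)  # ATATTTTCTTACATGTTCCAAGGGT
```

A contig built from k-mers can never be shorter than k, which is why a contig of exactly k bases has the same first and last k-mer.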
Old 03-04-2012, 12:51 PM   #5
cllorens
Member
 
Location: Valencia

Join Date: Nov 2011
Posts: 44

Hi Zhang

In addition to Zam's comments (as Zam says, k-mers are words of a particular size, and you find them repeated in a genome with a frequency that depends on their size), here are some references on different k-mer topics in case you want to dig deeper.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2336801
http://www.ncbi.nlm.nih.gov/pubmed/19935826
http://www.nature.com/nbt/journal/v2.../nbt.2023.html
http://www.ncbi.nlm.nih.gov/pubmed/18976482

Hope you enjoy them
Carlos
Old 03-04-2012, 12:59 PM   #6
cllorens
Member
 
Location: Valencia

Join Date: Nov 2011
Posts: 44

Here is another interesting reference I forgot to attach in the post above.
http://genomebiology.com/2009/10/10/R108
Old 03-04-2012, 07:06 PM   #7
xlzhang
Junior Member
 
Location: beijing

Join Date: Nov 2011
Posts: 6

Thanks, Zam

So, what is the meaning of "fst_r:GT fst_f:G" and "lst_r:A lst_f:AT"? I thought "r" stood for reverse and "f" stood for forward. Am I right?

If I want to get a consensus assembly from a set of reads that may span a structural variant (SV), should I use Cortex_con or Cortex_var? I don't understand the fundamental difference between the two.

And if I run with different k-mer sizes, how do I judge which result is better: by "length" or by "average_coverage"?

Thank you for your answer!

Last edited by xlzhang; 03-04-2012 at 07:55 PM.
Old 03-04-2012, 07:13 PM   #8
xlzhang
Junior Member
 
Location: beijing

Join Date: Nov 2011
Posts: 6

Thanks, Carlos.
Old 03-05-2012, 12:40 AM   #9
Zam
Member
 
Location: Oxford

Join Date: Apr 2010
Posts: 51

Hi xlzhang

"fst_r:GT fst_f:G" and "lst_r:A lst_f:AT"

This describes the edges going in/out of the contig at the first/last nodes.
The first node has G and T edges going out in the reverse complement direction, and a G edge going out forwards. The last node has A and T edges going out forwards and an A edge in the reverse direction. For most uses, though, I don't think you need to pay attention to this.
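
For readers who want to see what "edges going out forwards and in the reverse complement direction" can look like, here is a toy Python sketch of a de Bruijn graph node. It is not Cortex's implementation: every function name and the two input sequences are hypothetical, the labelling convention (the base appended when walking on from the k-mer or from its reverse complement) is an assumption made for this sketch, and edges are inferred simply from which k-mers are present rather than from observed (k+1)-mers.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    # A k-mer and its reverse complement are stored as a single node.
    return min(kmer, reverse_complement(kmer))

def build_nodes(seqs, k):
    # Collect the canonical form of every k-mer in the input sequences.
    return {canonical(s[i:i + k]) for s in seqs for i in range(len(s) - k + 1)}

def out_edges(kmer, nodes):
    # Bases b such that dropping the first base and appending b gives a known node.
    return "".join(b for b in "ACGT" if canonical(kmer[1:] + b) in nodes)

def edges(kmer, nodes):
    # Forward edges leave the k-mer as written; reverse edges leave its reverse complement.
    return out_edges(kmer, nodes), out_edges(reverse_complement(kmer), nodes)

# Made-up toy input with k = 3.
nodes = build_nodes(["CAAGG", "CAAGT"], 3)
forward, reverse = edges("CTT", nodes)
print("f:" + forward, "r:" + reverse)  # f:G r:GT
```

In this toy graph the node CTT (reverse complement AAG) can be extended by G going forwards and by G or T going out from its reverse complement, which has the same shape as the "fst_r:GT fst_f:G" annotation discussed here.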

As for cortex_con versus cortex_var - the fundamental difference is one of goal. Con is for making a consensus/haploid assembly of a single whole genome - it deals with one sample. Var is for assembling polymorphism, in one or many samples. If you have a set of reads which you know are precisely the reads for an alternate haplotype/SV, then you have effectively reduced your problem to a haploid one, and I would try cortex_con (or any standard assembler of your choice; it depends a bit on the size of your region). If you have a set of reads from a structurally variant region, from a sample which might be heterozygous, I would try cortex_var. There is a Cortex_var Google group where you could post more detailed questions if you like.

best wishes

Zam
Old 03-05-2012, 01:05 AM   #10
xlzhang
Junior Member
 
Location: beijing

Join Date: Nov 2011
Posts: 6

You've given me a lot of help. Thank you.