SEQanswers

05-24-2012, 01:29 PM   #1
sigma
Junior Member
Location: Rennes
Join Date: May 2012
Posts: 8
Unique K-mers & coverage depth

Hi friends,

I'm working with simulated reads based on Illumina paired-end data, and I'm trying to find the link between coverage depth and the number of unique k-mers.

My hypothesis: with a HIGH coverage depth we should have FEWER unique k-mers, because most nucleotides are covered.
However, for 20-mers I see the complete opposite: at high coverage depth I get a high uniqueness ratio, whereas my hypothesis holds for 80-mers.

I have been looking for a paper about this but cannot find one.
Do you have any ideas about my hypothesis or these results?


Thanks in advance,

05-24-2012, 02:55 PM   #2
jimmybee
Senior Member
Location: Adelaide, Australia
Join Date: Sep 2010
Posts: 119

This would be completely dependent on the type of organism you're working on, wouldn't it?

05-24-2012, 02:58 PM   #3
sigma
Junior Member
Location: Rennes
Join Date: May 2012
Posts: 8

I don't think so...

K-mers and coverage depth do not depend on the type of organism.
N.B.: I am working on maize.

05-24-2012, 03:00 PM   #4
jimmybee
Senior Member
Location: Adelaide, Australia
Join Date: Sep 2010
Posts: 119

That's not what I meant. Never mind.

05-24-2012, 11:16 PM   #5
arvid
Senior Member
Location: Berlin
Join Date: Jul 2011
Posts: 156

How do you define uniqueness ratio?

My hypothesis would be that sequencing errors causing unique k-mers will continue to add to the number of unique k-mers for much longer than your actual underlying sequence will (once you've saturated your sequence). If you continue with extreme sequencing depths, all possible errors will have been seen, so this trend will eventually flatten out.

You should probably have a look at how your read simulator generates sequencing errors, at high depths, and check that these are similar to real datasets - I would grab a big public dataset and do the comparison (not sure if there are good maize sets, however). I'd be interested to see whether the errors are as random in real datasets (at extremely high depths, where the underlying libraries are saturated) as with typical simulators.
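
To make the hypothesis above concrete, here is a rough Python sketch of such a comparison. It is not the simulator used in this thread: the genome is random, reads are sampled uniformly, and errors are independent substitutions at a fixed rate. The function names, the 2 kb genome, the 50 bp read length and the 1% error rate are all illustrative assumptions.

```python
import random
from collections import Counter

def simulate_reads(genome, read_len, depth, error_rate, rng):
    """Sample reads uniformly; apply independent substitution errors (toy model)."""
    n_reads = depth * len(genome) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with one of the three other nucleotides.
                bases[i] = rng.choice([x for x in "ACGT" if x != base])
        reads.append("".join(bases))
    return reads

def count_unique_kmers(reads, k):
    """Number of k-mers that occur exactly once across all reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return sum(1 for c in counts.values() if c == 1)

rng = random.Random(0)  # fixed seed so runs are repeatable
genome = "".join(rng.choice("ACGT") for _ in range(2000))  # hypothetical toy genome

results = {}
for depth in (5, 20, 80):
    clean = simulate_reads(genome, 50, depth, 0.0, rng)
    noisy = simulate_reads(genome, 50, depth, 0.01, rng)
    results[depth] = {
        "clean": count_unique_kmers(clean, 20),
        "noisy": count_unique_kmers(noisy, 20),
    }
    print(depth, results[depth])
```

In this toy setting the error-free unique count should shrink as depth grows, while the count with errors keeps rising. Whether real data behave the same way depends on how closely the error model matches the sequencer.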

05-25-2012, 12:58 AM   #6
sigma
Junior Member
Location: Rennes
Join Date: May 2012
Posts: 8

Hi arvid,

uniqueness ratio = unique k-mers / total distinct k-mers
And yes, unique k-mers result from sequencing errors.

BUT when I simulate the data, I simulate reads WITHOUT errors (some parameters let me simulate a perfect sequence), so with a high depth I should find more identical k-mers (it depends on the sequence).
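
For reference, the ratio defined above can be computed along these lines. This is a minimal sketch, not the script used for the results in this thread; it assumes reads are plain strings and counts forward-strand k-mers only (no reverse-complement merging). The function names are made up for illustration.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in an iterable of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def uniqueness_ratio(reads, k):
    """unique k-mers (seen exactly once) / total distinct k-mers."""
    counts = kmer_counts(reads, k)
    if not counts:
        return 0.0  # no k-mers at all, e.g. every read is shorter than k
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts)

# Toy example: two overlapping error-free reads.
# 3-mers: ACG x2, GTA x2, TAC x2, CGT x1, CGG x1 -> 2 unique out of 5 distinct.
reads = ["ACGTAC", "GTACGG"]
print(uniqueness_ratio(reads, 3))
```

Note that a k-mer can be unique here simply because it was sampled once, so the ratio depends on depth and read length as well as on the sequence.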

05-25-2012, 01:57 AM   #7
cedance
Senior Member
Location: Germany
Join Date: Feb 2011
Posts: 108

Based on what distribution do you sample k-mers? Gamma?
Also, could you attach a plot of your k-mer frequency distribution/histogram?

05-25-2012, 02:58 AM   #8
arvid
Senior Member
Location: Berlin
Join Date: Jul 2011
Posts: 156

Quote:
Originally Posted by sigma
Hi arvid,

uniqueness ratio = unique k-mers / total distinct k-mers
And yes, unique k-mers result from sequencing errors.

BUT when I simulate the data, I simulate reads WITHOUT errors (some parameters let me simulate a perfect sequence), so with a high depth I should find more identical k-mers (it depends on the sequence).

If you are simulating reads, why without errors? If you are developing an algorithm or heuristic that should deal with real reads, I don't see the point... because they will behave quite differently.

05-25-2012, 03:00 AM   #9
cedance
Senior Member
Location: Germany
Join Date: Feb 2011
Posts: 108

@arvid, unless the aim itself is to test the validity of the claim that "unique k-mers are solely attributed to sequencing errors"?

05-25-2012, 03:22 AM   #10
arvid
Senior Member
Location: Berlin
Join Date: Jul 2011
Posts: 156

Quote:
Originally Posted by cedance
@arvid, Unless the hypothesis itself is to test the validity of the claim that "unique k-mers are solely attributed to sequencing errors"?

Right, but then why use a read simulator? To me it sounds more feasible to dissect the problems into units that can be more easily solved separately:

1. check k-mer distributions in the genome of interest - by simply counting all the k-mers
2. check other biases introduced by the sequencing pipeline, not just substitution or indel errors, that lead to skews in these k-mer distributions

With the current approach, problem 1 gets mixed with problem 2. I'm also not sure to what extent a read simulator is feasible for problem 2. Maybe if sigma can elaborate on the actual question he is trying to solve, a better way could be found...
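
Step 1 above, counting all k-mers directly in the reference, could look something like the following Python sketch. It assumes the genome fits in memory as a single string and ignores reverse complements. For a genome the size of maize a dedicated k-mer counter would be more practical, and the function name here is hypothetical.

```python
from collections import Counter

def genome_kmer_spectrum(genome, k):
    """Return (k-mer counts, multiplicity histogram) for one sequence.

    The histogram maps a multiplicity m to the number of distinct
    k-mers that occur exactly m times in the genome.
    """
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    spectrum = Counter(counts.values())
    return counts, spectrum

# Toy repetitive sequence, used only to show the effect of k.
genome = "ACGTACGTAC"
for k in (4, 8):
    counts, spectrum = genome_kmer_spectrum(genome, k)
    unique_fraction = spectrum[1] / len(counts)
    print(k, dict(spectrum), unique_fraction)
```

Running this on the genome alone, with no reads involved, shows how much of the k-mer uniqueness at a given k comes from the sequence itself rather than from sampling or errors.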

All times are GMT -8.