
**mchaisso** · 02-09-2009, 02:19 PM

research topics for nextgen sequencing

The field of RNA-seq data analysis is still fairly young. I think many people intend to reuse statistics developed for microarrays to detect differential expression. Some data were presented in recent Nature Methods and Genome Research papers, and I believe some are posted at the Short Read Archive. Furthermore, ABI is pretty open about sharing data from their research labs.

If you are open to combinatorial problems and not just statistics, there are some problems related to de novo fragment assembly that we could discuss.

cheers,
-mark

**nullabee** · 02-10-2009, 08:18 AM

Hello Mark,

Thanks for the input. For me, ABI is not an option (UGent has only recently acquired the FLX and the GA). Some of my colleagues who have a history in microarrays are indeed hoping to extend their findings to NGS, but I prefer not to swim in the same lanes.

I know by now (I am 'only the statistician') that the first runs on our FLX have been amplicon sequencing runs, and some de novo runs (bacteria) are soon to come, so for now I'm trying to get a picture of how these are processed nowadays. I'm hoping that once I understand the current mechanisms, I will at least be able to find some articles describing their statistics (e.g., I have not been able to find a single article giving a proper explanation for the 'habit' of using a coverage of 20, though everybody seems to agree that this works).

But yes, I am also interested in combinatorics, and always open to suggestions.

Nick.

**mchaisso** · 02-10-2009, 06:58 PM

NextGen Coverage vs. Lander Waterman Statistics

The fact that people use at least 20X coverage points to some of the difficulties in getting accurate statistics for sequencing. Say the FLX sequencer were only producing 100-base reads, and coverage is 20X. In de novo assembly, most (all) assemblers have a minimum overlap length (either stated explicitly or as a k-value in a de Bruijn graph). With k=25, a quarter of each 100-base read is needed just for the overlap, so the effective coverage is 25% less, i.e. 15X. Still, at 15X coverage the Lander-Waterman statistics predict a single contig, yet when mapping reads back to the genome there are usually still gaps. This is worse with Illumina GAI sequencers, where I have found that it takes 80X coverage with 35-base reads to begin to overcome sample bias and get rid of gaps in the assembly.
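To make the arithmetic above concrete, here is a minimal Python sketch of the Lander-Waterman estimate. The function names and the 5 Mb genome size are assumptions chosen for illustration (the post gives no genome size). The model assumes reads are placed uniformly at random, which is exactly the assumption that amplification and sample bias break in practice.

```python
import math

def effective_coverage(coverage, read_len, min_overlap):
    """Coverage left after discounting the minimum overlap (or k) from each read."""
    return coverage * (1.0 - min_overlap / read_len)

def expected_contigs(genome_size, coverage, read_len, min_overlap):
    """Lander-Waterman expected number of islands: N * exp(-c * sigma).

    N is the number of reads, c the raw coverage, and
    sigma = 1 - min_overlap / read_len.
    """
    n_reads = coverage * genome_size / read_len
    return n_reads * math.exp(-effective_coverage(coverage, read_len, min_overlap))

# Hypothetical 5 Mb bacterial genome, 100-base reads, 20X, minimum overlap 25.
print(effective_coverage(20, 100, 25))
print(expected_contigs(5_000_000, 20, 100, 25))
```

Under these assumptions the expected number of islands comes out below one, which is the sense in which the idealized model predicts a single contig. The gaps seen in real assemblies reflect non-uniform sampling, not a flaw in the formula.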

I'm not saying this is an open field for research -- rather something to steer clear of. 20X coverage seems to compensate for amplification bias in 454 sequencing, which is difficult to model. In Illumina sequencing projects, this will probably be overcome by adding scaffolding methods to assemblers. Judging from some of the results I've seen, I imagine the latest release of Velvet has this, and I'm working on making it a standard part of EULER.

A more reasonable avenue for statistical development, at least in de novo assembly, is repeat coverage. All short-read assemblers resolve repeats by using the ends of mate-pairs that span the repeat.

So, if a genome has:

ABCDpqrEFGHIJKpqrLMNpqrSTUV

and reads are 3 characters long, then the mate-pairs BCD---EFG, IJK---LMN, and LMN---STU are required to resolve whether the genome is
ABCDpqrEFGHIJKpqrLMNpqrSTUV
versus
ABCDpqrLMNpqrEFGHIJKpqrSTUV
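The toy example can be checked mechanically. The sketch below is only an illustration: the helper names `supports` and `consistent` are made up here, and it assumes each mate-pair has a fixed gap of 3 characters (the length of the repeat pqr) between its two reads.

```python
def supports(genome, left, right, gap=3):
    """True if `left` occurs in `genome` with `right` starting exactly `gap` characters after it."""
    start = genome.find(left)
    while start != -1:
        if genome.startswith(right, start + len(left) + gap):
            return True
        start = genome.find(left, start + 1)
    return False

def consistent(genome, mate_pairs, gap=3):
    """True if every mate-pair is satisfied by this candidate arrangement."""
    return all(supports(genome, left, right, gap) for left, right in mate_pairs)

mate_pairs = [("BCD", "EFG"), ("IJK", "LMN"), ("LMN", "STU")]
candidates = {
    "original": "ABCDpqrEFGHIJKpqrLMNpqrSTUV",
    "alternative": "ABCDpqrLMNpqrEFGHIJKpqrSTUV",
}
for name, genome in candidates.items():
    print(name, consistent(genome, mate_pairs))
```

Only the first arrangement satisfies all three mate-pairs. In the second, BCD is followed by LMN rather than EFG, so the pair BCD---EFG already rules it out.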

So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
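One way to get a feel for this question is a small Monte Carlo simulation next to a back-of-the-envelope approximation. The sketch below is not a solution to the problem. It rests on simplifying assumptions that are not in the post: clones start uniformly at random on a linear genome, the N reads form N/2 mate-pairs, repeat copies are dropped at random positions away from the genome ends (overlaps between copies are ignored), and a mate-pair "spans" a copy when one read lies entirely before it and the other entirely after it. All function names are hypothetical.

```python
import bisect
import random

def span_window(L, r, l):
    """Number of clone start positions that span one repeat copy (0 if the clone is too short)."""
    return max(0, L - r - 2 * l + 1)

def approx_prob_all_spanned(G, r, m, L, l, N):
    """Approximation that treats the m repeat copies as independent."""
    n_clones = N // 2
    p = span_window(L, r, l) / (G - L + 1)   # one clone spans one given copy
    miss = (1.0 - p) ** n_clones             # no clone spans that copy
    return (1.0 - miss) ** m

def simulate_prob_all_spanned(G, r, m, L, l, N, trials=2000, seed=0):
    """Monte Carlo estimate under the same uniform-placement assumptions."""
    if span_window(L, r, l) == 0:
        return 0.0
    rng = random.Random(seed)
    n_clones = N // 2
    successes = 0
    for _ in range(trials):
        starts = sorted(rng.randrange(G - L + 1) for _ in range(n_clones))
        all_spanned = True
        for _ in range(m):
            s = rng.randrange(L, G - L - r + 1)   # repeat copy occupies [s, s + r)
            lo, hi = s + r + l - L, s - l         # clone starts that span this copy
            i = bisect.bisect_left(starts, lo)
            if i == len(starts) or starts[i] > hi:
                all_spanned = False
                break
        successes += all_spanned
    return successes / trials

# Hypothetical toy parameters, chosen only to keep the run fast.
params = dict(G=10_000, r=50, m=3, L=200, l=30, N=400)
print(approx_prob_all_spanned(**params))
print(simulate_prob_all_spanned(**params))
```

The two numbers should agree closely for parameters like these. Real data would diverge from both, because clone lengths vary and sampling is biased, which is where the statistical work would come in.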

**mandova** · 04-01-2010, 05:29 PM

as of today

Originally posted by mchaisso:

So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?

solved as of today?

