SEQanswers

Old 02-05-2009, 08:12 AM   #1
nullabee
Junior Member
 
Location: Ghent

Join Date: Feb 2009
Posts: 3
Where to start

Hi All.

I'm a mathematician, hoping to do a PhD on the data analysis (statistics) of NGS data (Roche / Illumina) at Ghent University.
Unfortunately, the topic has not yet been specified, so it is not clear to me what kind of data I will be presented with (ChIP, de novo, ...).

This also means that, to this day, I have no data to work on, nor a clear view of what will be expected.
I've simply been reading up on NGS and statistics (finding strangely few articles linking them). On top of that, I am quite new to biotechnology, so it is not easy to find a focus.

So here's my question: I would like to prepare myself somewhat for when the 'real' questions come (I expect these within the next few months), so I'd like to try out some data analysis. Do any of you have pointers on:
* which type of analysis would be a good starter?
* where could I find sample data (ideally with a matching article on how somebody else analysed it)?
* what are the statistical challenges brought on by NGS (as opposed to classical sequencing), apart from sheer volume?
* which 'general' statistical subjects would be a good read (books/subjects welcome), e.g.: would bootstrap do me any good (and why)?

Thanks in advance for any suggestions!
Old 02-09-2009, 02:19 PM   #2
mchaisso
Member
 
Location: Seattle, WA

Join Date: Apr 2008
Posts: 84
research topics for nextgen sequencing

The field of RNA-seq data analysis is still fairly young... I think many people intend to use some of the statistics developed for microarrays to detect differential expression. Some data were presented in recent Nature Methods and Genome Research papers, and I believe some are posted at the Short Read Archive. Furthermore, ABI is pretty open about sharing data from their research labs.

If you are open to combinatorial problems and not just statistics, there are some problems related to de novo fragment assembly that we could discuss.

cheers,
-mark
Old 02-10-2009, 08:18 AM   #3
nullabee
Junior Member
 
Location: Ghent

Join Date: Feb 2009
Posts: 3

Hello Mark,

Thanks for the input. For me, ABI is not an option (UGent has only recently acquired the FLX and the GA). Some of my colleagues who have a history in microarrays are indeed hoping to extend their findings to NGS, but I prefer not to swim in the same lanes.

I know by now (I am 'only the statistician') that the first runs on our FLX have been amplicon sequencing runs, and some de novo runs (bacteria) are soon to come, so for now I'm trying to get a picture of how these are processed nowadays. I'm hoping that once I understand the current mechanisms, I will at least be able to find some articles describing their statistics (e.g., I have not been able to find a single article giving a proper explanation for the 'habit' of using a coverage of 20, though everybody seems to agree that this works).

But yes, I am also interested in combinatorics, and always open to suggestions.

Nick.
Old 02-10-2009, 06:58 PM   #4
mchaisso
Member
 
Location: Seattle, WA

Join Date: Apr 2008
Posts: 84
NextGen Coverage vs. Lander-Waterman Statistics

The fact that people use at least 20X coverage points to some of the difficulties in doing accurate statistics for sequencing. Say the FLX sequencer were producing only 100-base reads, and coverage is 20X. In de novo assembly, most (if not all) assemblers have a minimum overlap length (either stated explicitly or as the k-value of a de Bruijn graph). With, say, k=25, a quarter of each 100-base read is taken up by the required overlap, so the effective coverage is 25% less, i.e. 15X. Still, at 15X coverage, the Lander-Waterman statistics predict that there will be one contig, yet when mapping reads back to the genome there are still usually gaps. This is worse with Illumina GAI sequencers, where I have found that it takes 80X coverage with 35-base reads to finally begin to overcome sample bias and get rid of gaps in the assembly.
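
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of the original post: it uses the standard Lander-Waterman expectation for the number of contigs, N * exp(-c * sigma), where c is the coverage and sigma = 1 - (minimum overlap / read length). The 5 Mb genome size is an assumed, bacteria-sized value chosen for the example, and the function name is made up.

```python
import math

def expected_contigs(genome_size, read_length, coverage, min_overlap):
    """Lander-Waterman expected number of contigs (islands).

    Assumes reads start uniformly at random along the genome, which is
    exactly the assumption that amplification and sample bias violate.
    """
    n_reads = coverage * genome_size / read_length
    # Fraction of a read that is NOT consumed by the required overlap.
    sigma = 1.0 - min_overlap / read_length
    return n_reads * math.exp(-coverage * sigma)

# Assumed example values: 5 Mb genome, 100-base reads, 20X, k = 25.
genome_size = 5_000_000
effective_coverage = 20 * (1.0 - 25 / 100)   # 15X, as in the text
print("effective coverage:", effective_coverage)
print("expected contigs, 100-base reads at 20X:",
      expected_contigs(genome_size, 100, 20, 25))
print("expected contigs, 35-base reads at 80X:",
      expected_contigs(genome_size, 35, 80, 25))
```

Under the uniform model the expected number of contigs drops below one in both settings, so the theory predicts a gap-free assembly; the gaps seen in practice come from the non-uniform sampling that the model ignores.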

I'm not saying this is an open field for research; rather, it is something to steer clear of. 20X coverage seems to compensate for amplification bias in 454 sequencing, which is difficult to model. In Illumina sequencing projects, this will probably be overcome by adding scaffolding methods to assemblers. I imagine the latest release of Velvet has this, given some of the results I've seen, and I'm working on making this a standard part of euler.

A more reasonable avenue for statistical development, at least in de novo assembly, is regarding repeat coverage. All short read assemblers resolve repeats by using ends of mate-pairs that span the repeat.

So, if a genome has:

ABCDpqrEFGHIJKpqrLMNpqrSTUV

and reads are 3 characters long, then the
mate-pairs BCD---EFG, IJK---LMN, and LMN---STU are required to determine whether the genome is
ABCDpqrEFGHIJKpqrLMNpqrSTUV
versus
ABCDpqrLMNpqrEFGHIJKpqrSTUV
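
The toy example can be checked mechanically. The following is a minimal sketch (the helper name `is_consistent` is made up for illustration): a mate-pair is treated as a left read, a gap of known length, and a right read, and a candidate genome is consistent with the pair if that pattern occurs somewhere in it.

```python
def is_consistent(genome, left, right, gap):
    """Return True if `left`, then `gap` unknown characters, then `right`
    occurs somewhere in `genome`."""
    start = genome.find(left)
    while start != -1:
        right_start = start + len(left) + gap
        if genome.startswith(right, right_start):
            return True
        start = genome.find(left, start + 1)
    return False

genome_a = "ABCDpqrEFGHIJKpqrLMNpqrSTUV"
genome_b = "ABCDpqrLMNpqrEFGHIJKpqrSTUV"

# The three mate-pairs from the example; each spans one copy of "pqr".
mate_pairs = [("BCD", "EFG"), ("IJK", "LMN"), ("LMN", "STU")]

for left, right in mate_pairs:
    print(left, right,
          is_consistent(genome_a, left, right, gap=3),
          is_consistent(genome_b, left, right, gap=3))
```

All three pairs are consistent with the first arrangement and none with the second, which is why these spanning pairs resolve the repeat.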

So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
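
One hedged way to get a feel for this question is simulation. The sketch below is not a solution from the thread; it assumes clones start uniformly at random on a linear genome, that every clone has exactly length L, that the N reads form N/2 mate-pairs, and that a repeat copy counts as spanned when one read lies entirely before it and its mate entirely after it. All function and parameter names are hypothetical. The closed-form function additionally treats the repeat copies as independent, which is only an approximation.

```python
import random

def simulate_all_spanned(genome_size, repeat_starts, repeat_len,
                         clone_len, read_len, n_reads,
                         trials=200, seed=0):
    """Monte Carlo estimate of P(every repeat copy is spanned by a mate-pair)."""
    rng = random.Random(seed)
    n_pairs = n_reads // 2
    max_start = genome_size - clone_len      # last valid clone start
    successes = 0
    for _ in range(trials):
        spanned = [False] * len(repeat_starts)
        for _ in range(n_pairs):
            s = rng.randint(0, max_start)    # uniform clone start (assumption)
            for i, p in enumerate(repeat_starts):
                left_before = s + read_len <= p
                right_after = s + clone_len - read_len >= p + repeat_len
                if left_before and right_after:
                    spanned[i] = True
        if all(spanned):
            successes += 1
    return successes / trials

def approx_all_spanned(genome_size, multiplicity, repeat_len,
                       clone_len, read_len, n_reads):
    """Closed-form approximation that treats repeat copies as independent."""
    window = clone_len - 2 * read_len - repeat_len + 1   # spanning clone starts
    if window <= 0:
        return 0.0
    p_hit = window / (genome_size - clone_len + 1)
    p_copy_missed = (1.0 - p_hit) ** (n_reads // 2)
    return (1.0 - p_copy_missed) ** multiplicity

# Small assumed example: G = 10,000, three repeat copies of length 50,
# clones of 300 with 35-base reads, 400 reads (200 mate-pairs).
print(simulate_all_spanned(10_000, [2_000, 5_000, 8_000], 50, 300, 35, 400))
print(approx_all_spanned(10_000, 3, 50, 300, 35, 400))
```

Comparing the simulated value with the approximation for different clone lengths shows how quickly the probability collapses once L approaches r + 2l, the shortest clone that can span a repeat at all.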
Old 04-01-2010, 06:29 PM   #5
mandova
Member
 
Location: Shanghai, China

Join Date: Mar 2010
Posts: 19
as of today

Quote:
Originally Posted by mchaisso
So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
Has this been solved as of today?