  • nullabee
    Junior Member
    • Feb 2009
    • 3

    Where to start

    Hi All.

    I'm a mathematician hoping to do a PhD on the data analysis (statistics) of NGS data at the University of Ghent (Roche / Illumina).
    Unfortunately, the project has not been specified yet, so it is not clear to me what kind of data I will be presented with (ChIP, de novo, ...).

    This also means that, to this day, I have no data to work on, nor a clear view of what will be expected.
    I've simply been reading up on NGS and statistics (finding strangely few articles linking them). On top of that, I am quite new to biotechnology, so it is not easy to find a focus.

    So here's my question: I would like to prepare myself somewhat for when the 'real' questions come (I expect these within the next few months), so I'd like to try out some data analysis. Do any of you have pointers on:
    * which type of analysis would be a good starter?
    * where could I find sample data (ideally with a matching article on how somebody else analysed it)?
    * what are the statistical challenges brought on by NGS (as opposed to classical sequencing), apart from sheer volume?
    * which 'general' statistical subjects would be a good read (books/subjects welcome), e.g.: would bootstrap do me any good (and why)?

    Thanks in advance for any suggestions!
  • mchaisso
    Member
    • Apr 2008
    • 84

    #2
    research topics for nextgen sequencing

    The field of RNA-seq data analysis is somewhat young... I think many people intend to use some of the statistics developed for microarrays to detect differential expression. Some data were presented in recent Nature Methods and Genome Research papers, and I believe some are posted at the Short Read Archive. Furthermore, ABI is pretty open about sharing data from their research labs.

    If you are open to combinatorial problems and not just statistics, there are some problems related to de novo fragment assembly that we could discuss.

    cheers,
    -mark


    • nullabee
      Junior Member
      • Feb 2009
      • 3

      #3
      Hello Mark,

      Thanks for the input. For me, ABI is not an option (UGent has only recently acquired the FLX and the GA). Some of my colleagues who have a history in microarrays are indeed hoping to extend their findings to NGS, but I prefer not to swim in the same lanes.

      I know by now (I am 'only the statistician') that the first runs on our FLX have been amplicon sequencing runs, and some de novo runs (bacteria) are soon to come, so for now I'm trying to get a picture of how these are processed nowadays. I'm hoping that once I understand the current mechanisms, I will at least be able to find some articles describing their statistics (e.g. I have not been able to find a single article giving a proper explanation for the 'habit' of having a coverage of 20, though everybody seems to agree that this works).

      But yes, I am also interested in combinatorics, and always open to suggestions.

      Nick.


      • mchaisso
        Member
        • Apr 2008
        • 84

        #4
        NextGen Coverage vs. Lander Waterman Statistics

        The fact that people use at least 20X coverage points out some of the difficulties in accurate statistics for sequencing. Say the FLX sequencer were only producing 100-base reads, and coverage is 20X. In de novo assembly, most (all) assemblers have a minimum overlap length (either explicitly stated or as a k-value in a de Bruijn graph). With, say, k=25, a quarter of each read is needed just to detect an overlap, so the effective coverage is 25% less. Still, at 15X coverage, the Lander-Waterman statistics dictate that there will be one contig, yet when mapping reads back to the genome there are still usually gaps. This is worse with Illumina GAI sequencers, where I have found that 80X coverage with 35-base reads finally begins to overcome sample bias and get rid of gaps in assembly.
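
To put rough numbers on that argument, here is a minimal sketch of the Lander-Waterman estimate for the expected number of contigs, N * exp(-c * sigma), where sigma = 1 - (minimum overlap / read length). The 5 Mb genome size is an assumption chosen for illustration (a typical bacterial genome), and the function name is hypothetical.

```python
import math

def expected_contigs(genome_size, read_length, coverage, min_overlap):
    """Lander-Waterman expected number of contigs: N * exp(-c * sigma)."""
    n_reads = coverage * genome_size / read_length
    # sigma is the fraction of a read that is not needed to detect an overlap
    sigma = 1.0 - min_overlap / read_length
    return n_reads * math.exp(-coverage * sigma)

# Assumed 5 Mb genome, 100-base reads, 20X coverage, minimum overlap k = 25.
# Effective coverage is 20 * 0.75 = 15X, as in the post above.
print(expected_contigs(5_000_000, 100, 20, 25))
```

Under these assumptions the estimate is about 0.3, i.e. the idealized model predicts essentially a single contig. The gaps seen in practice come from effects the model ignores, such as non-uniform sampling.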

        I'm not saying this is an open field for research -- rather something to steer clear of. 20X coverage seems to compensate for amplification bias in 454 sequencing, which is difficult to model. In Illumina sequencing projects, this will probably be overcome by adding scaffolding methods to assemblers. I imagine the latest release of Velvet has this, given some of the results I've seen, and I'm working on making this a standard part of euler.

        A more reasonable avenue for statistical development, at least in de novo assembly, is repeat coverage. All short read assemblers resolve repeats by using ends of mate-pairs that span the repeat.

        So, if a genome has:

        ABCDpqrEFGHIJKpqrLMNpqrSTUV

        and reads are 3 characters long, then the
        mate-pairs BCD---EFG, IJK---LMN, and LMN---STU are required to resolve whether the genome is
        ABCDpqrEFGHIJKpqrLMNpqrSTUV
        versus
        ABCDpqrLMNpqrEFGHIJKpqrSTUV
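
The toy example can be checked with a short sketch: both arrangements produce exactly the same set of 3-character reads, so reads alone cannot tell them apart, while the three mate-pairs are consistent with only the first arrangement. The helper names are hypothetical, and the "---" in each mate-pair is assumed to mean a gap of exactly 3 characters.

```python
def kmers(genome, k=3):
    """All length-k substrings of the genome, sorted (the read 'spectrum')."""
    return sorted(genome[i:i + k] for i in range(len(genome) - k + 1))

def supports(genome, left, right, gap=3):
    """True if `right` starts exactly `gap` characters after `left` ends."""
    start = genome.find(left)
    return start != -1 and genome.find(right) == start + len(left) + gap

genome_a = "ABCDpqrEFGHIJKpqrLMNpqrSTUV"
genome_b = "ABCDpqrLMNpqrEFGHIJKpqrSTUV"
mate_pairs = [("BCD", "EFG"), ("IJK", "LMN"), ("LMN", "STU")]

same_reads = kmers(genome_a) == kmers(genome_b)
consistent_a = [supports(genome_a, left, right) for left, right in mate_pairs]
consistent_b = [supports(genome_b, left, right) for left, right in mate_pairs]

print(same_reads)    # True: identical 3-mer content
print(consistent_a)  # [True, True, True]
print(consistent_b)  # [False, False, False]
```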

        So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
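
As a starting point only, here is a rough sketch of one way to approximate that probability. It assumes clone start positions are uniform over the genome, that a repeat copy counts as "spanned" when one read lies entirely on each side of it, that N reads form N/2 mate-pairs, and that repeat copies can be treated independently. It ignores sampling bias and the requirement that the flanking reads map uniquely, so it is an optimistic toy model, not a solution. All function names are hypothetical.

```python
import random

def prob_all_spanned(G, r, m, L, l, N):
    """Approximate P(every one of m repeat copies is spanned by a mate-pair)."""
    pairs = N // 2                   # two reads per clone
    window = L - 2 * l - r + 1       # clone starts placing one read on each side
    if window <= 0 or G < L:
        return 0.0
    p_hit = min(1.0, window / (G - L + 1))
    p_miss = (1.0 - p_hit) ** pairs  # no clone spans a given repeat copy
    return (1.0 - p_miss) ** m       # treats repeat copies as independent

def simulate(G, r, m, L, l, N, trials=1000, seed=0):
    """Monte Carlo check with m evenly spaced repeat copies (an assumption)."""
    rng = random.Random(seed)
    repeat_starts = [(i + 1) * G // (m + 1) for i in range(m)]
    hits = 0
    for _ in range(trials):
        spanned = set()
        for _ in range(N // 2):
            s = rng.randrange(G - L + 1)  # clone covers [s, s + L)
            for p in repeat_starts:
                # left read ends before the repeat, right read starts after it
                if s + l <= p and s + L - l >= p + r:
                    spanned.add(p)
        hits += len(spanned) == m
    return hits / trials

# Toy parameters, chosen only for illustration.
print(prob_all_spanned(G=10_000, r=50, m=3, L=300, l=30, N=400))
```

Comparing `prob_all_spanned` against `simulate` for the same parameters gives a quick sanity check on the independence assumption. A realistic treatment would also need the clone-length distribution and the amplification bias mentioned above.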


        • mandova
          Member
          • Mar 2010
          • 19

          #5
          as of today

          Originally posted by mchaisso
          So, the question is, given a genome size G, repeat length r, repeat multiplicity m, clone length L, read length l, and number of reads, N, what is the probability that mate-pairs span all repeats?
          solved as of today?
