Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • reference-free SNP discovery

    Dear all,

    I'm aware there are several similar questions posted already (some almost a bit too old regarding the fast growing possibilities in this field), but I'm wondering how you would solve my specific case in the most efficient way:

    I have Illumina short reads from which I want to call SNPs WITHOUT
    using a reference genome. What I have are reads that are defined by a specific restriction enzyme site in the genome of several individuals per population. And I have several populations. These defined loci are in average 25 times replicated per individual (25 reads per locus/ind.), what allows me to first find SNPs within an individual (heterozygote positions), then compare all individuals belonging to the same population (looking for WITHIN population SNPs) and ultimatively compare populations between each other (3 "hierarchical" steps). If possible I'd like to do this SNP-calling quality aware. One of the problems I see is to get consensus sequences for an individual without a reference. How I imagine this should be done by a program is to make stacks of reads that belong to the same locus in the genome (as I said, about 25 reads per locus in average). Since there will be heterozygous single nucleotides already within an individual, when collapsing these stacks to a consensus sequence, one should maybe use the ambiguity code for polymorphic sites.

    Do you have suggestions (i.e. programs or a pipeline) for how to do this? Especially making such stacks and then get a consensus sequence without a reference would help a lot. Once I've done that for every individual, I could then again make stacks from the individual consensus sequences per population and compare these among the populations.

    Thank you a lot for your help,

    Marius
    Last edited by Marius; 12-22-2010, 02:34 PM.

  • #2
    To do SNP calling, the standard procedure is to map reads to a reference genome. Then you look at your pileup (i.e. the base frequencies and associated quality scores for every position) and find regions where allele frequencies are least divergent. Illumina's CASAVA uses a fancy nearest-neighbor SNP caller, SOAPsnp uses a bayesian algorithm, and I'm sure there are many, many other methods.

    The standard way to SNPcall, because you don't have a reference sequence, is to generate one. You do this by feeding trimmed, high-quality-only reads into a de-novo assembler such as Velvet or ABYSS.

    For SNPcalls, contig length isn't really your end goal. Your goal for the assembly should be to have a high percentage of your reads to actually map to your de novo genome.

    It is okay if your de novo genome has 1000s of contigs.

    If you are dealing with RNA, then mapping partial reads plays a role for a minority of SNPs (close to intron junctions, etc). So you might need to use a Bowtie/Cufflinks, SOAP or whatever to map partially.

    Good luck.

    Comment


    • #3
      Awesome,
      thanks a lot for this straight forward answer. So in your opinion, what I would have to do is:
      Take all reads (all individuals, all populations) and sort these only for high quality ones (i.e. Phred >20, no Ns etc.). And then I could take all these reads to create my contigs (I expect around 40'000 contigs). Since I have reads of individuals that belong to quite different populations (which might already have diverged quite a bit, also in the genome), I would have to include all individuals to build these contigs I guess.

      There is one aspect I'm not really sure yet. Lets say I have a heterozygote read, which has a SNP somewhere when comparing the different individuals (or even a multiple allele position), i.e.

      Read1 (i.e. Ind.2, Pop1): ..AGGGTGGACT...
      Read2 (i.e. Ind.4, Pop2): ..AGGGGGGACT..
      Read3 (i.e. Ind.1, Pop3): ..AGGGAGGACT..

      Let's say all these reads are of high-quality, so the polymorphic site is a true multi-allel SNP position. What would the contig (reference-sequence) look like, which is basically the consensus sequence of these 3 reads I quess? Best would probably be: ..AGGGNGGACT..
      And, when I then would do SNPcalling (or consensus calling first for every individual), is this always in relation to this reference-contig or not? Because, I don't want to do SNPcalling relative to the reference, I only need the reference to assure I compare the individual pileups of the same locus among the individuals and populations later on. So the contig-seuqence shouldn't influence my individual consensus/SNP calling!
      I.e. I know from SAMtools, that consensus-calling/SNP-calling is only possible relative to the reference sequence...
      Which assembler and consensus-calling program would be best for this?

      Comment


      • #4
        Dear Marius,

        At INRIA, France we developped an algorithm, called kisSnp that compares two sets of raw reads. It detects from these sets SNP polymorphism.

        We have a public validated Java version here: http://alcovna.genouest.org/kissnp/ and a lighter C version, not yet fully validated but that you could test if you're interested.

        Pierre

        Comment


        • #5
          Ironically, but perhaps not, you might want to to check out a program out of William Cresko's lab called, Stacks.

          Comment


          • #6
            Thanks for the link.

            I take the opportunity of this "up" to inform you that a new version of kisSnp is available: http://alcovna.genouest.org/kissnp-page/

            Pierre

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM
            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            30 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            32 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            28 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            53 views
            0 likes
            Last Post seqadmin  
            Working...
            X