
  • reference-free SNP discovery

    Dear all,

    I'm aware that several similar questions have already been posted (some of them rather old, given how fast the possibilities in this field are growing), but I'm wondering how you would solve my specific case most efficiently:

    I have Illumina short reads from which I want to call SNPs WITHOUT using a reference genome. My reads come from loci defined by a specific restriction enzyme site in the genome, sequenced in several individuals per population, and I have several populations. Each locus is covered on average 25 times per individual (25 reads per locus per individual), which allows me to first find SNPs within an individual (heterozygous positions), then compare all individuals belonging to the same population (looking for WITHIN-population SNPs), and ultimately compare populations with each other (3 "hierarchical" steps). If possible, I'd like this SNP calling to be quality-aware. One problem I see is how to get consensus sequences for an individual without a reference. I imagine a program would do this by making stacks of reads that belong to the same locus in the genome (as I said, about 25 reads per locus on average). Since there will already be heterozygous single nucleotides within an individual, collapsing these stacks into a consensus sequence should probably use ambiguity codes at polymorphic sites.

    Do you have suggestions (e.g. programs or a pipeline) for how to do this? Help with making such stacks and then getting a consensus sequence without a reference would be especially welcome. Once I've done that for every individual, I could again make stacks from the individual consensus sequences per population and compare these among the populations.
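A minimal sketch of the stack-collapsing idea described above, assuming the reads of one locus have already been grouped and all have the same length. The function name `stack_consensus` and the `min_fraction` cutoff are hypothetical choices for illustration, and base qualities are ignored here, so this is not the quality-aware version the question asks for.

```python
from collections import Counter

# IUPAC ambiguity codes keyed by the set of bases observed at a site.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def stack_consensus(reads, min_fraction=0.2):
    """Collapse equal-length reads from one locus into an IUPAC consensus.

    A base is kept at a site only if it makes up at least `min_fraction`
    of the called (A/C/G/T) bases there. The cutoff is an assumed, crude
    guard against sequencing errors, not a statistical genotype model.
    """
    if len({len(r) for r in reads}) != 1:
        raise ValueError("a stack needs at least one read, all of equal length")
    consensus = []
    for column in zip(*reads):
        counts = Counter(b for b in column if b in "ACGT")
        total = sum(counts.values())
        kept = {b for b, n in counts.items() if total and n / total >= min_fraction}
        # Sites with no usable base (or no base above the cutoff) become N.
        consensus.append(IUPAC.get(frozenset(kept), "N"))
    return "".join(consensus)


# 25 reads: two alleles (T/G at one site) plus one read with an error.
stack = ["AGGGTGGACT"] * 12 + ["AGGGGGGACT"] * 12 + ["AGGGTGCACT"]
print(stack_consensus(stack))  # -> AGGGKGGACT
```

The heterozygous site is written as K (G or T), while the single erroneous C falls below the cutoff and is dropped.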

    Thank you a lot for your help,

    Marius
    Last edited by Marius; 12-22-2010, 02:34 PM.

  • #2
    To do SNP calling, the standard procedure is to map reads to a reference genome. Then you look at your pileup (i.e. the base frequencies and associated quality scores for every position) and find positions where the observed allele frequencies diverge from the reference base. Illumina's CASAVA uses a fancy nearest-neighbor SNP caller, SOAPsnp uses a Bayesian algorithm, and I'm sure there are many, many other methods.
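To make the pileup idea concrete, here is a rough sketch, assuming gap-free alignments given as (start, sequence) pairs. The function names and the `min_depth` and `min_alt_fraction` thresholds are hypothetical, quality scores are left out for brevity, and this simple frequency cutoff is not how CASAVA or SOAPsnp make their calls.

```python
from collections import Counter


def pileup(aligned_reads, ref_length):
    """Count bases per reference position.

    `aligned_reads` is a list of (start, sequence) pairs with 0-based
    start positions and no gaps (an assumed, simplified alignment format).
    """
    columns = [Counter() for _ in range(ref_length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_length and base in "ACGT":
                columns[pos][base] += 1
    return columns


def candidate_snps(reference, columns, min_depth=8, min_alt_fraction=0.2):
    """List (position, ref_base, alt_base, alt_fraction) for candidate sites."""
    hits = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything about this site
        for base, n in sorted(counts.items()):
            if base != reference[pos] and n / depth >= min_alt_fraction:
                hits.append((pos, reference[pos], base, n / depth))
    return hits
```

A real caller would weight each base by its quality score rather than counting all bases equally.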

    Because you don't have a reference sequence, the standard way to call SNPs is to generate one. You do this by feeding trimmed, high-quality-only reads into a de novo assembler such as Velvet or ABySS.

    For SNP calling, contig length isn't really your end goal. Your goal for the assembly should be to have a high percentage of your reads actually map to your de novo genome.

    It is okay if your de novo genome has thousands of contigs.

    If you are dealing with RNA, then mapping partial reads plays a role for a minority of SNPs (close to intron junctions, etc.). So you might need to use Bowtie/Cufflinks, SOAP, or a similar tool to map reads partially.

    Good luck.



    • #3
      Awesome,
      thanks a lot for this straightforward answer. So in your opinion, what I would have to do is:
      Take all reads (all individuals, all populations) and keep only the high-quality ones (e.g. Phred > 20, no Ns, etc.). Then I could use all these reads to create my contigs (I expect around 40,000 contigs). Since I have reads from individuals that belong to quite different populations (which might already have diverged quite a bit, also in the genome), I guess I would have to include all individuals to build these contigs.
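The read filter described above could be sketched as follows. This assumes that "Phred > 20" means every base must score above 20 (a mean-quality rule would be another reading) and that qualities use the Phred+33 encoding; older Illumina pipelines wrote Phred+64, so `offset` may need changing. All function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_filter(sequence, quality_string, min_phred=20, offset=33):
    """Keep a read only if it has no N and every base scores above `min_phred`."""
    if "N" in sequence.upper():
        return False
    return all(q > min_phred for q in phred_scores(quality_string, offset))


def filter_fastq_records(records, min_phred=20, offset=33):
    """Yield the (name, sequence, quality) tuples that pass the filter."""
    for name, sequence, quality in records:
        if passes_filter(sequence, quality, min_phred, offset):
            yield name, sequence, quality
```

Requiring every base to pass is strict; trimming low-quality read ends first would discard fewer reads.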

      There is one aspect I'm not really sure about yet. Let's say a locus has a SNP at some position when comparing the different individuals (or even a multi-allelic position), e.g.:

      Read1 (e.g. Ind.2, Pop1): ..AGGGTGGACT..
      Read2 (e.g. Ind.4, Pop2): ..AGGGGGGACT..
      Read3 (e.g. Ind.1, Pop3): ..AGGGAGGACT..

      Let's say all these reads are of high quality, so the polymorphic site is a true multi-allelic SNP position. What would the contig (reference sequence) look like, which is basically the consensus sequence of these 3 reads, I guess? Best would probably be: ..AGGGNGGACT..
      And when I then do SNP calling (or consensus calling first for every individual), is this always relative to this reference contig or not? I don't want to do SNP calling relative to the reference; I only need the reference to ensure that I compare the pileups of the same locus among individuals and populations later on. So the contig sequence shouldn't influence my individual consensus/SNP calling!
      As far as I know from SAMtools, consensus calling/SNP calling is only possible relative to the reference sequence...
      Which assembler and consensus-calling program would be best for this?
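One way to picture the reference-independent comparison asked about here: the contig only supplies coordinates, and each individual's alleles are called from its own reads. The sketch below is a simplified illustration under that assumption, with hypothetical function names and a hypothetical `min_fraction` cutoff; it is not how SAMtools works.

```python
from collections import Counter


def genotype(bases, min_fraction=0.2):
    """Call one individual's alleles at one site from its own reads only."""
    counts = Counter(b for b in bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return frozenset()  # no usable coverage at this site
    return frozenset(b for b, n in counts.items() if n / total >= min_fraction)


def polymorphic_sites(columns_by_individual, min_fraction=0.2):
    """Return {position: {individual: alleles}} for sites that vary.

    `columns_by_individual` maps an individual to a list of base strings,
    one string per contig position (that individual's pileup column).
    The contig base itself is never consulted.
    """
    if not columns_by_individual:
        return {}
    n_sites = min(len(cols) for cols in columns_by_individual.values())
    result = {}
    for pos in range(n_sites):
        calls = {ind: genotype(cols[pos], min_fraction)
                 for ind, cols in columns_by_individual.items()}
        observed = {alleles for alleles in calls.values() if alleles}
        # Variable if individuals differ or any individual is heterozygous.
        if len(observed) > 1 or any(len(a) > 1 for a in observed):
            result[pos] = calls
    return result
```

With the three example reads above, only the fifth shown base would be reported, with T, G, and A assigned to the respective individuals regardless of what the contig carries at that position.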



      • #4
        Dear Marius,

        At INRIA, France, we developed an algorithm called kisSnp that compares two sets of raw reads and detects SNP polymorphisms from these sets.

        We have a public, validated Java version here: http://alcovna.genouest.org/kissnp/ and a lighter C version that is not yet fully validated but that you could test if you're interested.

        Pierre



        • #5
          Ironically (or perhaps not), you might want to check out a program from William Cresko's lab called Stacks.



          • #6
            Thanks for the link.

            I'll take the opportunity of this "up" to let you know that a new version of kisSnp is available: http://alcovna.genouest.org/kissnp-page/

            Pierre
