SEQanswers (
-   De novo discovery (
-   -   reference-free SNP discovery (

Marius 12-20-2010 02:48 PM

reference-free SNP discovery
Dear all,

I'm aware there are several similar questions posted already (some almost a bit too old regarding the fast growing possibilities in this field), but I'm wondering how you would solve my specific case in the most efficient way:

I have Illumina short reads from which I want to call SNPs WITHOUT
using a reference genome. What I have are reads that are defined by a specific restriction enzyme site in the genome of several individuals per population. And I have several populations. These defined loci are in average 25 times replicated per individual (25 reads per locus/ind.), what allows me to first find SNPs within an individual (heterozygote positions), then compare all individuals belonging to the same population (looking for WITHIN population SNPs) and ultimatively compare populations between each other (3 "hierarchical" steps). If possible I'd like to do this SNP-calling quality aware. One of the problems I see is to get consensus sequences for an individual without a reference. How I imagine this should be done by a program is to make stacks of reads that belong to the same locus in the genome (as I said, about 25 reads per locus in average). Since there will be heterozygous single nucleotides already within an individual, when collapsing these stacks to a consensus sequence, one should maybe use the ambiguity code for polymorphic sites.

Do you have suggestions (i.e. programs or a pipeline) for how to do this? Especially making such stacks and then get a consensus sequence without a reference would help a lot. Once I've done that for every individual, I could then again make stacks from the individual consensus sequences per population and compare these among the populations.

Thank you a lot for your help,


Awesome 12-22-2010 04:17 PM

To do SNP calling, the standard procedure is to map reads to a reference genome. Then you look at your pileup (i.e. the base frequencies and associated quality scores for every position) and find regions where allele frequencies are least divergent. Illumina's CASAVA uses a fancy nearest-neighbor SNP caller, SOAPsnp uses a bayesian algorithm, and I'm sure there are many, many other methods.

The standard way to SNPcall, because you don't have a reference sequence, is to generate one. You do this by feeding trimmed, high-quality-only reads into a de-novo assembler such as Velvet or ABYSS.

For SNPcalls, contig length isn't really your end goal. Your goal for the assembly should be to have a high percentage of your reads to actually map to your de novo genome.

It is okay if your de novo genome has 1000s of contigs.

If you are dealing with RNA, then mapping partial reads plays a role for a minority of SNPs (close to intron junctions, etc). So you might need to use a Bowtie/Cufflinks, SOAP or whatever to map partially.

Good luck.

Marius 12-23-2010 12:17 AM

thanks a lot for this straight forward answer. So in your opinion, what I would have to do is:
Take all reads (all individuals, all populations) and sort these only for high quality ones (i.e. Phred >20, no Ns etc.). And then I could take all these reads to create my contigs (I expect around 40'000 contigs). Since I have reads of individuals that belong to quite different populations (which might already have diverged quite a bit, also in the genome), I would have to include all individuals to build these contigs I guess.

There is one aspect I'm not really sure yet. Lets say I have a heterozygote read, which has a SNP somewhere when comparing the different individuals (or even a multiple allele position), i.e.

Read1 (i.e. Ind.2, Pop1): ..AGGGTGGACT...
Read2 (i.e. Ind.4, Pop2): ..AGGGGGGACT..
Read3 (i.e. Ind.1, Pop3): ..AGGGAGGACT..

Let's say all these reads are of high-quality, so the polymorphic site is a true multi-allel SNP position. What would the contig (reference-sequence) look like, which is basically the consensus sequence of these 3 reads I quess? Best would probably be: ..AGGGNGGACT..
And, when I then would do SNPcalling (or consensus calling first for every individual), is this always in relation to this reference-contig or not? Because, I don't want to do SNPcalling relative to the reference, I only need the reference to assure I compare the individual pileups of the same locus among the individuals and populations later on. So the contig-seuqence shouldn't influence my individual consensus/SNP calling!
I.e. I know from SAMtools, that consensus-calling/SNP-calling is only possible relative to the reference sequence...
Which assembler and consensus-calling program would be best for this?

pierre350d 02-07-2011 01:28 AM

Dear Marius,

At INRIA, France we developped an algorithm, called kisSnp that compares two sets of raw reads. It detects from these sets SNP polymorphism.

We have a public validated Java version here: and a lighter C version, not yet fully validated but that you could test if you're interested.


vinchenz 03-30-2011 10:59 AM

Ironically, but perhaps not, you might want to to check out a program out of William Cresko's lab called, Stacks.

pierre350d 03-30-2011 12:23 PM

Thanks for the link.

I take the opportunity of this "up" to inform you that a new version of kisSnp is available:


All times are GMT -8. The time now is 05:55 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.