SEQanswers

Old 12-20-2010, 01:48 PM   #1
Marius
8armed
 
Location: Germany

Join Date: Dec 2010
Posts: 28
reference-free SNP discovery

Dear all,

I'm aware that several similar questions have been posted already (some of them rather old, given how fast the possibilities in this field are growing), but I'm wondering how you would solve my specific case in the most efficient way:

I have Illumina short reads from which I want to call SNPs WITHOUT using a reference genome. The reads are defined by a specific restriction enzyme site in the genome, and I have them for several individuals per population and for several populations. Each of these loci is covered on average 25 times per individual (25 reads per locus per individual). This allows me to first find SNPs within an individual (heterozygous positions), then compare all individuals belonging to the same population (looking for WITHIN-population SNPs) and ultimately compare the populations with each other (3 "hierarchical" steps). If possible, I'd like the SNP calling to be quality-aware.

One of the problems I see is how to get consensus sequences for an individual without a reference. The way I imagine a program should do this is to build stacks of reads that belong to the same locus in the genome (as I said, about 25 reads per locus on average). Since there will already be heterozygous positions within an individual, one should probably use the ambiguity code for polymorphic sites when collapsing these stacks into a consensus sequence.

Do you have suggestions (i.e. programs or a pipeline) for how to do this? Especially building such stacks and then getting a consensus sequence without a reference would help a lot. Once I've done that for every individual, I could then build stacks again from the individual consensus sequences per population and compare these among the populations.
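
To make the stacking idea concrete, here is a minimal Python sketch. It is only a toy under stated assumptions: all reads of one individual have the same length and start at the restriction site, reads of one locus differ from the most abundant sequence by at most `max_mismatches`, and a second base counts as a heterozygous allele once it reaches `min_fraction` of the stack depth. The function names and both thresholds are made up for illustration and are not taken from any existing program.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_stacks(reads, max_mismatches=1):
    """Greedily group reads into stacks (putative loci).

    Distinct sequences are visited from most to least abundant. Each one
    joins the first stack whose seed is within max_mismatches, otherwise
    it seeds a new stack. Returns a list of (seed, Counter(seq -> count)).
    """
    stacks = []
    for seq, n in Counter(reads).most_common():
        for seed, members in stacks:
            if len(seed) == len(seq) and hamming(seed, seq) <= max_mismatches:
                members[seq] += n
                break
        else:
            stacks.append((seq, Counter({seq: n})))
    return stacks


def variable_sites(members, min_fraction=0.2):
    """Return {position: {base: count}} for candidate heterozygous columns.

    A column is reported when its second most frequent base reaches
    min_fraction of the stack depth (hypothetical cut-off).
    """
    depth = sum(members.values())
    length = len(next(iter(members)))
    sites = {}
    for pos in range(length):
        column = Counter()
        for seq, n in members.items():
            column[seq[pos]] += n
        ranked = column.most_common(2)
        if len(ranked) == 2 and ranked[1][1] / depth >= min_fraction:
            sites[pos] = dict(column)
    return sites


# Toy data: one homozygous locus and one locus with an A/T heterozygote
# plus a single read carrying a likely sequencing error.
reads = (["TTGGCCAA"] * 25 + ["ACGTACGT"] * 13
         + ["ACGTTCGT"] * 11 + ["ACGAACGT"] * 1)
stacks = build_stacks(reads)
for seed, members in stacks:
    print(seed, sum(members.values()), variable_sites(members))
```

With real data the single error read would need quality scores to be judged properly, and paralogous loci can end up in the same stack, so a depth or mismatch cut-off alone is probably not enough.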

Thank you a lot for your help,

Marius

Last edited by Marius; 12-22-2010 at 01:34 PM.
Old 12-22-2010, 03:17 PM   #2
Awesome
Junior Member
 
Location: california

Join Date: Aug 2009
Posts: 7

To do SNP calling, the standard procedure is to map reads to a reference genome. Then you look at your pileup (i.e. the base frequencies and associated quality scores for every position) and find positions where the observed allele frequencies diverge from the reference base. Illumina's CASAVA uses a fancy nearest-neighbor SNP caller, SOAPsnp uses a Bayesian algorithm, and I'm sure there are many, many other methods.
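
As a rough illustration of what looking at one pileup position means, here is a small Python sketch. It is a plain frequency threshold, not the CASAVA or SOAPsnp algorithm, and every cut-off in it (`min_qual`, `min_depth`, `min_alt_fraction`) as well as the function name is an assumption chosen for the example.

```python
from collections import Counter


def call_column(bases, quals, min_qual=20, min_depth=8, min_alt_fraction=0.2):
    """Call the allele(s) at one pileup column.

    bases: string of observed bases at this position, one per read.
    quals: list of Phred scores, same order as bases.
    Returns (alleles, counts), where alleles is e.g. "A" or "AG", or None
    if too few high-quality bases remain to make a call.
    """
    kept = [b for b, q in zip(bases, quals) if q >= min_qual and b in "ACGT"]
    counts = Counter(kept)
    if len(kept) < min_depth:
        return None, counts
    ranked = counts.most_common(2)
    alleles = [ranked[0][0]]
    # Report a second allele only if it is frequent enough.
    if len(ranked) == 2 and ranked[1][1] / len(kept) >= min_alt_fraction:
        alleles.append(ranked[1][0])
    return "".join(sorted(alleles)), counts


# 12 reads cover the position; the last base (T) has Phred 5 and is ignored.
alleles, counts = call_column("AAAAAGGGGGAT", [30] * 11 + [5])
print(alleles, dict(counts))  # AG {'A': 6, 'G': 5}
```

A position would then be reported as a SNP when the called alleles differ from the reference base at that position. Real callers replace the fixed fractions with a statistical model of sequencing error.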

The standard way to call SNPs if you don't have a reference sequence is to generate one. You do this by feeding trimmed, high-quality-only reads into a de novo assembler such as Velvet or ABySS.

For SNP calls, contig length isn't really your end goal. Your goal for the assembly should be to have a high percentage of your reads actually map to your de novo genome.

It is okay if your de novo genome has 1000s of contigs.

If you are dealing with RNA, then mapping partial reads plays a role for a minority of SNPs (close to intron junctions, etc.). So you might need to use Bowtie/Cufflinks, SOAP or whatever to map partially.

Good luck.
Old 12-22-2010, 11:17 PM   #3
Marius
8armed
 
Location: Germany

Join Date: Dec 2010
Posts: 28

Awesome,
thanks a lot for this straightforward answer. So in your opinion, what I would have to do is:
Take all reads (all individuals, all populations) and keep only the high-quality ones (i.e. Phred > 20, no Ns etc.). Then I could use all these reads to build my contigs (I expect around 40,000 contigs). Since I have reads from individuals that belong to quite different populations (which might already have diverged quite a bit, also in the genome), I guess I would have to include all individuals when building these contigs.
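
The filtering step could look roughly like this in Python. This is a sketch with assumptions: the FASTQ has exactly four lines per record, the quality offset is 33 (older Illumina pipelines used 64, so `offset` may need changing), and a read is kept only if every base reaches `min_phred`. The helper names and the example file name are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header, seq, qual


def passes(seq, qual, min_phred=20, offset=33):
    """True if the read contains no N and every base has Phred >= min_phred."""
    if "N" in seq.upper():
        return False
    return all(ord(c) - offset >= min_phred for c in qual)


# Three toy records: r1 is clean, r2 contains an N, r3 has one Phred-2 base.
data = "@r1\nACGT\n+\nIIII\n@r2\nACNT\n+\nIIII\n@r3\nACGT\n+\nII#I\n"
kept = [h for h, s, q in read_fastq(io.StringIO(data)) if passes(s, q)]
print(kept)  # ['@r1']
```

For a real file one would pass `open("reads.fastq")` instead of the `io.StringIO` object. Requiring every base to pass is strict; trimming low-quality 3' ends first may keep more reads.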

There is one aspect I'm not really sure about yet. Let's say I have a locus with a SNP somewhere when comparing the different individuals (or even a multi-allelic position), i.e.

Read1 (i.e. Ind.2, Pop1): ..AGGGTGGACT..
Read2 (i.e. Ind.4, Pop2): ..AGGGGGGACT..
Read3 (i.e. Ind.1, Pop3): ..AGGGAGGACT..

Let's say all these reads are of high quality, so the polymorphic site is a true multi-allelic SNP position. What would the contig (reference sequence) look like, which is basically the consensus sequence of these 3 reads, I guess? Best would probably be: ..AGGGNGGACT..
And when I then do SNP calling (or consensus calling first for every individual), is this always relative to this reference contig or not? I don't want to do SNP calling relative to the reference; I only need the reference to make sure that I compare pileups of the same locus among the individuals and populations later on. So the contig sequence shouldn't influence my individual consensus/SNP calling!
I know from SAMtools, for example, that consensus calling/SNP calling is only possible relative to the reference sequence...
Which assembler and consensus-calling program would be best for this?
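
For the three example reads, a column-wise consensus with IUPAC ambiguity codes can be sketched in Python as below. Note that the specific IUPAC code for A/G/T is D, whereas N stands for any of the four bases. The sketch assumes the reads are already aligned, equally long and contain only A, C, G, T in upper case. It also shows that polymorphic columns can be found by comparing the sequences with each other, without any reference. The function names are made up for illustration.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of observed bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(seqs):
    """Column-wise consensus of equal-length sequences using IUPAC codes."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    return "".join(IUPAC[frozenset(col)] for col in zip(*seqs))


def polymorphic_columns(seqs):
    """0-based positions where the sequences disagree with each other."""
    return [i for i, col in enumerate(zip(*seqs)) if len(set(col)) > 1]


reads = ["AGGGTGGACT",   # Read1 (Ind.2, Pop1)
         "AGGGGGGACT",   # Read2 (Ind.4, Pop2)
         "AGGGAGGACT"]   # Read3 (Ind.1, Pop3)
print(iupac_consensus(reads))      # AGGGDGGACT
print(polymorphic_columns(reads))  # [4]
```

Whether an assembler would actually emit an ambiguity code in the contig is a separate matter and depends on the program; the sketch only shows what such a consensus would look like.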
Old 02-07-2011, 12:28 AM   #4
pierre350d
Junior Member
 
Location: rennes, france

Join Date: Nov 2008
Posts: 7

Dear Marius,

At INRIA, France, we developed an algorithm called kisSnp that compares two sets of raw reads. It detects SNPs from these sets.

We have a public validated Java version here: http://alcovna.genouest.org/kissnp/ and a lighter C version, not yet fully validated but that you could test if you're interested.

Pierre
Old 03-30-2011, 09:59 AM   #5
vinchenz
Junior Member
 
Location: Indiana

Join Date: Mar 2011
Posts: 1

Ironically, but perhaps not, you might want to check out a program out of William Cresko's lab called Stacks.
Old 03-30-2011, 11:23 AM   #6
pierre350d
Junior Member
 
Location: rennes, france

Join Date: Nov 2008
Posts: 7

Thanks for the link.

I take the opportunity of this "up" to inform you that a new version of kisSnp is available: http://alcovna.genouest.org/kissnp-page/

Pierre