SEQanswers

Old 03-25-2010, 06:28 AM   #21
MattB
Member
 
Location: Norway

Join Date: Aug 2008
Posts: 35

Good points there. In our case, we were not too concerned about rare alleles; in fact, we planned to avoid those SNPs!

I don't think I've really mentioned it above, but in our case the goal was to find 3000 or so polymorphic SNPs throughout the genome for subsequent genotyping in a linkage mapping study. Therefore, the strategy I describe is related to this goal:

-20 individuals represented the subsequent mapping population
-transcriptome sequencing gave us good depth and some annotation info
-pooling enabled us to run only 2 lanes for cost efficiency

All these things probably need to be modified like lletourn suggests if you have different goals in mind.
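
A rough way to judge whether a pooled design like this can pick up alleles at a given frequency is a simple binomial calculation. The sketch below is only an illustration under assumptions that are not from the thread: every read is an independent draw from the pool, all individuals contribute equally, and there is no expression bias or sequencing error. The function name `prob_minor_seen`, the two-read threshold and the depth of 30 are hypothetical.

```python
from math import comb

def prob_minor_seen(depth, freq, min_reads=2):
    """Probability that a minor allele at pool frequency `freq` shows up
    in at least `min_reads` of `depth` reads (binomial model, assumed)."""
    # Sum the probabilities of seeing fewer than `min_reads` minor-allele reads.
    p_fewer = sum(
        comb(depth, k) * freq**k * (1 - freq) ** (depth - k)
        for k in range(min_reads)
    )
    return 1 - p_fewer

# Illustrative values only: 30 reads over a site, common vs. rare allele.
common = prob_minor_seen(30, 0.25)
rare = prob_minor_seen(30, 0.025)
print(f"common allele: {common:.3f}, rare allele: {rare:.3f}")
```

If the 20 animals are diploid (an assumption), a single copy of an allele in the pool corresponds to a frequency of 1/40 = 0.025, which is the "rare" case above. Real transcriptome data will deviate from this model because depth varies with expression level.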
Old 03-25-2010, 07:10 AM   #22
pfranchini
Member
 
Location: Cape Town

Join Date: May 2009
Posts: 19

Thanks for the info.

Actually, we were thinking of sequencing the transcriptome to obtain higher coverage and find as many SNPs as possible, to use them in genotyping and linkage map studies.
We are more interested in the most common SNPs than in rare ones; for this reason we thought of using many animals and keeping costs down by using as few lanes as possible of a single Illumina run.
We just have a preliminary transcriptome obtained from three Illumina lanes (single and paired short reads of about 40-45 bp) for a total of 18 different but closely related animals. The coverage of the contigs file we built with Velvet is around 30X. In theory, should these data be a good starting point for detecting SNPs?
P
Old 03-25-2010, 07:32 AM   #23
lletourn
Member
 
Location: Montreal

Join Date: Oct 2009
Posts: 63

Sure, the only disadvantage of transcriptomics is uniformity. Some transcripts will be more highly expressed than others, and expression is tissue- and time-specific, so SNP discovery will be biased towards specific transcripts.

If you know that what you are looking for is in a moderately to highly expressed transcript, this is a great and cheap way of getting the answer.

If all you know is that it's expressed, but you don't know in which tissue or whether it's highly expressed, a reduced genome approach targeting exons might be better... if you know the genome, which I guess is not your case.

ABySS does a good job at assembling RNA; I don't know about Velvet. There are special considerations when assembling RNA because alternative splice sites confuse assemblers if they don't know they're there.
Old 03-25-2010, 09:48 AM   #24
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

Just to note: another strategy that has been employed for this problem is to pool genomic DNA from multiple individuals, digest with a restriction enzyme, size select and make libraries from that defined fraction. I believe the recent chicken paper in Nature did this and definitely it has been published for other species (cows & pigs?).

By performing the restriction digestion & size selection you essentially create a smaller genome, which can then be sequenced exhaustively.
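
To make the idea concrete, here is a toy in silico version of digestion followed by size selection. It is a minimal sketch, not any published protocol: the recognition site `GATC`, the cut offset, the size window and the short made-up sequence are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

def digest(sequence, site, cut_offset=0):
    """Cut `sequence` at every occurrence of `site`.
    The cut falls `cut_offset` bases into the recognition site."""
    # A lookahead finds every occurrence, including overlapping ones.
    pattern = f"(?={re.escape(site)})"
    cuts = [m.start() + cut_offset for m in re.finditer(pattern, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length is inside the selected window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]

# Made-up toy "genome" with three GATC sites.
genome = "AAAGATCCCCCCGATCTTGATCGGGGGGGGGG"
fragments = digest(genome, "GATC")
selected = size_select(fragments, 5, 9)

retained = sum(len(f) for f in selected) / len(genome)
print(selected, f"fraction retained: {retained:.2f}")
```

Only the fragments inside the size window are kept, so the same sequencing effort is spread over a smaller, reproducible fraction of the genome.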
Old 03-26-2010, 03:02 AM   #25
pfranchini
Member
 
Location: Cape Town

Join Date: May 2009
Posts: 19

We did think about reducing the genome with techniques like the one you suggested. We are leaning towards the transcriptome because we already have some sets of Illumina short reads of about 40 bp, and we would like to improve our de novo transcriptome assembly and, with the additional data and depth of coverage, obtain more reliable SNP detection.

Regarding lletourn's suggestions, I tried some preliminary SNP detection analyses with MAQ and SAMtools and got some results. The only problem in my sample is the relatedness between individuals (19 animals, but 16 of them are very closely related, being siblings), and I think it could affect the analyses. What do you think? As suggested by MattB, 20 animals are sufficient to start, but for genotyping and linkage map studies the individuals should be unrelated. Am I correct? What do you think regarding sample composition?
Thanks for all suggestions and comments!
Old 03-26-2010, 04:47 AM   #26
MattB
Member
 
Location: Norway

Join Date: Aug 2008
Posts: 35

Well, you'll definitely pick up polymorphic SNPs in the 16 animal family...and I'd suspect they'll probably be polymorphic in other individuals/populations unless there are dramatic genetic differences between them and 'other' animals (and avoid the very low MAF SNPs if mapping is your goal).

The best thing to do is probably select a small number of your discovered SNPs, and test them in a validation panel of other individuals. This will tell you if your assembly and SNP discovery is doing a good job.
Old 03-26-2010, 05:32 AM   #27
lletourn
Member
 
Location: Montreal

Join Date: Oct 2009
Posts: 63

Quote:
Originally Posted by MattB View Post
The best thing to do is probably select a small number of your discovered SNPs, and test them in a validation panel of other individuals. This will tell you if your assembly and SNP discovery is doing a good job.
This is a very good approach to take for data validation. It tests your discovery pipeline at the same time and if you build a good panel, it's cheaper in the end to run on many of your individuals.
Old 03-30-2010, 07:40 AM   #28
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
SNPs and de novo assembly

I'm working on a similar dataset that I've inherited - 454 transcriptome runs from 30 pooled individuals. I've been wondering how assemblers handle major frequency SNPs. Would contigs be split in two at the site of a SNP that's present in a roughly 50:50 ratio in your reads, or would one or the other variant be selected as the representative base at that position?

Does anyone have experience with different assemblers and how they handle polymorphisms when constructing contigs? So far I have assemblies from Newbler and CLC for this dataset and from ABySS and CLC for an Illumina dataset, but I'm not really sure how to compare the different assemblers.
Old 03-30-2010, 08:14 AM   #29
MattB
Member
 
Location: Norway

Join Date: Aug 2008
Posts: 35

Quote:
Does anyone have experience with different assemblers and how they handle polymorphisms when constructing contigs? So far I have assemblies from Newbler and CLC for this dataset and from ABySS and CLC for an Illumina dataset, but I'm not really sure how to compare the different assemblers.
I think overall the de novo assemblers handle SNPs quite OK (at least in my experience with CLC, ABySS and SOAPdenovo). I think SOAPdenovo chooses one of the SNP alleles at random for the contig sequence (i.e. the consensus), but others may use the 'major' allele. Using CLC, I realigned my reads back to my de novo reference and used the SNP detector ('find variants'). This actually allows you to replace the SNP allele in your reference with the 'major' SNP allele if it wasn't already there.

I can't really comment on how the de novo assemblers compare in terms of SNP handling performance, but those mentioned above have worked OK for me with 50:50 SNPs.
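
The "replace with the major allele" step can be pictured with a small sketch. This is not how CLC implements it; it is a toy illustration that assumes you already have, for each position of interest, the string of read bases aligned there. The names `major_allele` and `update_reference` and the example data are hypothetical.

```python
from collections import Counter

def major_allele(bases):
    """Return the most frequent base among the reads at one position.
    Ties go to the base seen first (Counter.most_common ordering)."""
    return Counter(bases).most_common(1)[0][0]

def update_reference(reference, pileup):
    """Write the major allele into the reference at each covered position.
    `pileup` maps a 0-based position to the read bases aligned there."""
    seq = list(reference)
    for pos, bases in pileup.items():
        if bases:  # skip positions with no aligned reads
            seq[pos] = major_allele(bases)
    return "".join(seq)

# Made-up example: the contig carries G at position 2, but A is the major allele.
reference = "ACGTACGT"
pileup = {2: "GGGAAAAA", 5: "CCCC"}
updated = update_reference(reference, pileup)
print(updated)
```

Note that at a true 50:50 site the choice is arbitrary, so which allele ends up in the consensus says little about the SNP itself.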

Matt
Old 03-30-2010, 08:22 AM   #30
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83

Quote:
Originally Posted by MattB View Post
I think overall the de novo assemblers handle SNPs quite OK (at least from my experience with CLC; Abyss and SOAPdenovo)....

Matt
Did you change any settings for ABySS or SOAPdenovo to specifically help handle SNPs or go with defaults?

Do you have any feel for how they cope with things that aren't 50:50? In a pool of individuals where you have lower frequency alleles coming from high quality reads do you know if this will cause contig splitting or is it that a consensus base will be called?
Old 03-30-2010, 08:33 AM   #31
MattB
Member
 
Location: Norway

Join Date: Aug 2008
Posts: 35

There is a setting in SOAPdenovo that I thought had some influence on this, used when you run 'SOAPdenovo contig' separately.

-M mergeLevel(default 1,min 0, max 3): the strength of merging similar sequences during contiging

However, when I experimented with different values it made no difference on the contig assembly results....not sure if it did anything with the 'consensus' base, probably not.

If you search for 'bubbles' in the Abyss, Velvet and CLC documentation you will find a lot more detail on how they deal with SNPs.
Old 04-01-2010, 01:35 PM   #32
Boonie
Junior Member
 
Location: Memphis

Join Date: Mar 2009
Posts: 6
A 454 - SSAHA approach

Just to throw in on the conversation, I pooled genomic DNA from 18 individuals, cut with a 4 base cutter, and sequenced a 15bp size fraction with two full runs of 454 reads (250bp). I assembled them with gsAssembler, which produced an average of 20 reads per contig. Then I mapped the individual reads back to the contig consensus sequences using SSAHA2 and used the SSAHA_pipeline to call SNPs. It worked pretty well - I wound up with about 8000 SNPs I could believe in, and the validation rate was about 95%. The predicted allele frequency was strongly correlated (>0.8) with the real allele frequency in the donors. My goal was just basic SNP discovery in a novel species and it fit the bill.

Caveats - Beware of minor allele freqs near 0.5 which could arise from alignment of reads from duplicated loci; Screen out short tandem repeats because STR allelic differences in the alignment can cause false positive SNPs; Loci with only 4 mapped reads (minimum 2 reads per allele) may be useful but don't count on them.
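
Two of these caveats can be turned into simple flags on per-site read counts. The sketch below is only one possible reading of them: the 0.45 cutoff for "near 0.5" and the depth threshold of 5 are assumptions, the function name `snp_flags` is hypothetical, and screening for short tandem repeats is not covered.

```python
def snp_flags(ref_reads, alt_reads, min_per_allele=2, min_depth=5,
              balanced_cutoff=0.45):
    """Flag a candidate SNP from the read counts supporting each allele.
    Thresholds are illustrative assumptions, not values from the thread."""
    minor = min(ref_reads, alt_reads)
    depth = ref_reads + alt_reads
    if minor < min_per_allele:
        # Fewer than the minimum reads for one allele: not a usable call.
        return ["too_few_reads_per_allele"]
    flags = []
    if depth < min_depth:
        # e.g. only 4 mapped reads, 2 per allele: usable but weak.
        flags.append("low_depth")
    if minor / depth >= balanced_cutoff:
        # Minor allele frequency near 0.5 may come from duplicated loci.
        flags.append("possible_duplicated_locus")
    return flags

print(snp_flags(15, 5))   # well supported, unbalanced
print(snp_flags(2, 2))    # minimal support, perfectly balanced
print(snp_flags(10, 1))   # only one read for the minor allele
```

A flagged site is not necessarily wrong; the flags just mark SNPs that deserve a closer look before they go onto a validation panel.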
Old 09-29-2010, 12:38 AM   #33
pierre350d
Junior Member
 
Location: rennes, france

Join Date: Nov 2008
Posts: 7

A piece of information:

We developed a tool called kisSnp that takes two sets of unassembled raw short reads and compares them to find SNPs between the two sets.
It outputs the SNPs with small flanking regions.
It uses little memory and runs quickly.

All info and download can be found on the dedicated website: http://alcovna.genouest.org/

Enjoy ! (remarks and comments are welcome)
Old 09-29-2010, 04:42 AM   #34
lletourn
Member
 
Location: Montreal

Join Date: Oct 2009
Posts: 63

I checked your site quickly; it's very interesting.

I do have a question though: without a reference, won't you be missing all the homozygous variations?

Also, you need long enough reads to generate flanks, no? Anything shorter than 50 or even 75 wouldn't be long enough.

Or am I missing something?
Old 09-29-2010, 08:07 AM   #35
pierre350d
Junior Member
 
Location: rennes, france

Join Date: Nov 2008
Posts: 7

With the current version we detect only SNPs between individuals. One compares two sets of reads, focusing on small substitutions that may be those SNPs.

We are currently working on an intra-individual version that will make it possible to detect the heterozygous SNPs of one individual.

This may be done without a reference genome, if the coverage is sufficient.
Reads of length 50 to 75 are indeed long enough.

Pierre
Old 12-06-2010, 12:40 PM   #36
ybfu
Junior Member
 
Location: Saskatoon

Join Date: Apr 2010
Posts: 2
DIAL by Dr. Ratan for SNP without reference genome

Hi, Everyone:

I am trying to use DIAL without success, for unknown reasons, even when following the exact instructions. So I am wondering if anyone in our community is using DIAL to get SNPs and could share some experience. I contacted Dr. Ratan at Penn State, but got no response. Any comments on DIAL?

I have a 454 sequencing run of 8 barcoded samples and got an individual .sff file for each. When I run DIAL, adding each .sff file, it sometimes works and sometimes does not. I tested it with the supplied data and it worked for Add but not for Update (it returns to the $ prompt without an error, but when I check with ps there is no such task).
Old 12-06-2010, 01:21 PM   #37
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83

What version of newbler are you using? I tried DIAL and it would very specifically only work with v2.0 and nothing later.
Old 12-06-2010, 01:30 PM   #38
ybfu
Junior Member
 
Location: Saskatoon

Join Date: Apr 2010
Posts: 2

I did give it a try with version 2.0 by changing the newbler path in my .profile. What I got when I ran DIAL add is: Errors: unable to open sff file. SRR000375.sff (which is one of the test sff files).

Last edited by ybfu; 12-06-2010 at 01:53 PM.
Old 01-12-2016, 06:15 AM   #39
arthurmelo
Member
 
Location: Durham, NH, US

Join Date: Jul 2012
Posts: 19

Hi everybody, I would like to introduce and share GBS-SNP-CROP: a reference-optional pipeline for SNP discovery and plant germplasm characterization using variable length, paired-end genotyping-by-sequencing data.
Recently published in BMC Bioinformatics, this methodology could be useful for population genomic studies in model and non-model organisms, whether or not a reference genome is available.

Please see the GBS-SNP-CROP GitHub page for more details and UserManual:
https://github.com/halelab/GBS-SNP-CROP.git

Best regards,
Arthur Melo
Tags
de novo, illumina, snp, snp discovery, solexa

All times are GMT -8.