SEQanswers

lletourn 11-04-2009 06:20 AM

Snp discovery without a reference
 
I have paired-end (76 bp) output from a GA on which I would like to try SNP discovery. The hiccup is that there is no reference genome for my species.

Does anyone have any ideas, or know any tool that could do this?

Most of the tools that do SNP discovery well work on a pre-aligned dataset. If I were to assemble the data, is there something that could convert ace -> (SNP discovery tool format) to do the work?

Thanks

MattB 11-06-2009 06:49 PM

Hi,

You could do a de novo assembly with one of several tools, such as SOAPdenovo, Velvet, ABySS or MIRA, and then use the contigs as a reference (separately or joined into a single sequence) to align your reads back to. Mosaik will output an assembly in GigaBayes format for SNP discovery. I have also used SOAP to align my reads back to contigs generated by SOAPdenovo, and then used MapView to view the alignment and find SNPs.

Commercial software like SeqMan NGen will do de novo assembly and SNP detection together, from what I understand.

Matt

lletourn 11-09-2009 04:21 AM

I thought about this and, without any good reason, I wondered whether some kind of 'bias' would be added to the results, since the reads used to build the assembly would be aligned back to it.

Can't hurt to try though (except for a few lost CPU hours :-) )

thanks

MattB 11-09-2009 04:28 AM

I can't think of any reason why this wouldn't work myself... but I stand to be corrected ;) In fact, I think it makes for an interesting comparison between the de novo assembly program and its parameters and the corresponding parameters in the reference-guided assembler.

Matt

Nick Miller 11-13-2009 08:43 AM

I am in the middle of trying this approach for SNP discovery. My starting material was normalized cDNA from several individuals. I used SSAKE for the assembly and MAQ to look for SNPs. I am hoping to test some of the putative SNPs soon.

bioenvisage 11-16-2009 01:57 AM

Hi,

Why can't you try using ESTs as the reference for aligning?

lletourn 11-16-2009 04:47 AM

There are no ESTs for my fungal genome, as far as I know.

I tried MattB's approach and it seemed to work well. I have somewhat more SNPs than would be expected, but the lab will validate a few as QC.

MattB 11-16-2009 04:52 AM

I'd be suspicious about SNPs only found on the last one or two bases of your reads (I posted a separate thread on this), as they could well be remnants of adaptor sequence (adaptor trimming won't work when only one or few bases of adaptor are present on the ends of your reads).
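
The read-end check described above could be sketched roughly as follows. This is only an illustration: the `SupportingBase` record, the `is_read_end_only` function and the 2-base window are hypothetical names and values, not part of any tool mentioned in this thread, and positions are assumed to be given in the orientation in which the read was sequenced.

```python
from dataclasses import dataclass


@dataclass
class SupportingBase:
    """One read base supporting a candidate SNP (hypothetical record)."""
    pos_in_read: int   # 0-based offset of the SNP base within the read
    read_length: int   # length of the read in bases


def is_read_end_only(support, end_window=2):
    """Return True if every supporting base lies within the last
    `end_window` bases of its read (possible adaptor remnant)."""
    if not support:
        return False
    for base in support:
        dist_from_end = base.read_length - 1 - base.pos_in_read
        if dist_from_end >= end_window:
            return False  # at least one read supports the SNP away from the end
    return True


# Candidate seen only on the last two bases of 76 bp reads: suspicious.
suspicious = [SupportingBase(75, 76), SupportingBase(74, 76)]
# Candidate also seen mid-read: keep it.
trusted = [SupportingBase(75, 76), SupportingBase(30, 76)]

print(is_read_end_only(suspicious))
print(is_read_end_only(trusted))
```

Candidates flagged this way are not necessarily wrong; they are just the ones worth a second look before validation.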

Boonie 11-18-2009 12:29 PM

Is there a need to obtain flanking sequence to design a genotyping assay? If so, how will you get sufficient flanking sequence if you are mapping short reads to the contig consensus seqs (assuming no reference genome)?

MattB 11-18-2009 10:18 PM

Boonie, it depends on the type of genotyping assay (i.e. number of SNPs) that you are interested in. For the Illumina Infinium iSelect assay, Illumina specifies a minimum of 50 bp on EITHER side of the SNP for probe design, so short contigs in theory aren't such a problem (although it would be nice to have 50 bp on both sides so Illumina can pick the 'best' probe). For other genotyping applications like Sequenom iPLEX, you will need more flanking sequence on both sides.
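
A quick way to screen candidate SNPs against a flanking-sequence requirement like the one above might look like this. It is a minimal sketch: the function names are made up for illustration, the 50 bp default is taken from the figure quoted in the post, and it only looks at contig length, not at other SNPs or low-quality bases inside the flank.

```python
def flank_lengths(contig_length, snp_pos):
    """Return (left, right) flank lengths for a 0-based SNP position."""
    if not 0 <= snp_pos < contig_length:
        raise ValueError("SNP position is outside the contig")
    return snp_pos, contig_length - snp_pos - 1


def has_enough_flank(contig_length, snp_pos, min_flank=50, both_sides=False):
    """Check whether one side (default) or both sides have >= min_flank bases."""
    left, right = flank_lengths(contig_length, snp_pos)
    if both_sides:
        return left >= min_flank and right >= min_flank
    return left >= min_flank or right >= min_flank


# An 80 bp contig with a SNP at position 20: only the right flank is long enough.
print(flank_lengths(80, 20))
print(has_enough_flank(80, 20))
print(has_enough_flank(80, 20, both_sides=True))
```

Setting `both_sides=True` corresponds to the stricter case of assays that need flanking sequence on each side of the SNP.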

little_beetle 03-02-2010 04:38 PM

This is great, MattB.
I am trying to develop SNPs from a de novo assembled EST library.
How did you join the contigs into a single sequence? Did you put them together in some sort of order, or simply join all the contig sequences?
Thanks.

drio 03-02-2010 06:31 PM

Once you have your de novo assembly, treat it as your reference (as MattB says above). After that, remap the reads back to the "new" reference and pile up the alignments. Finally, you can set up your filters to try to get the best SNPs possible.
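
The filtering step could look something like the sketch below. It works on a toy, already-parsed pileup (one tuple per position) rather than on the output format of any particular tool, and the thresholds for depth, alternate read count and alternate fraction are placeholder assumptions that would need tuning for real data.

```python
from collections import Counter


def call_snps(pileup, min_depth=8, min_alt_reads=3, min_alt_fraction=0.2):
    """Filter pileup columns into candidate SNPs.

    `pileup` is an iterable of (contig, position, ref_base, observed_bases).
    A column is reported when it is deep enough and its most common
    non-reference base passes both the count and the fraction threshold.
    """
    calls = []
    for contig, pos, ref_base, bases in pileup:
        depth = len(bases)
        if depth < min_depth:
            continue  # too shallow to trust
        alt_counts = Counter(b for b in bases if b != ref_base)
        if not alt_counts:
            continue  # every read agrees with the contig consensus
        alt_base, alt_reads = alt_counts.most_common(1)[0]
        if alt_reads >= min_alt_reads and alt_reads / depth >= min_alt_fraction:
            calls.append((contig, pos, ref_base, alt_base, depth, alt_reads))
    return calls


toy_pileup = [
    ("contig_1", 101, "A", list("AAAAAGGGGG")),  # 5 of 10 reads show G
    ("contig_1", 250, "C", list("CCCCCCCCCT")),  # a single T, likely an error
    ("contig_2", 17, "T", list("TTG")),          # depth too low
]

for call in call_snps(toy_pileup):
    print(call)
```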

Let us know how it goes.

MattB 03-02-2010 10:44 PM

We just joined the contigs in the order they were output by the de novo assembler, so essentially at random. Since I posted that, however, I have been using the CLC NGS Cell software to perform de novo assembly, reference-guided alignment and SNP detection on the contigs separately... ;)
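
For anyone who does need a single joined sequence, one possible way to do it while keeping track of where each contig ended up is sketched below. The run of N bases between contigs is an added assumption (the post does not say whether a spacer was used); it is there so that reads cannot align across an artificial junction. The function names are illustrative only.

```python
from bisect import bisect_right


def join_contigs(contigs, spacer_length=100):
    """Join (name, sequence) pairs in the given order, separated by N spacers.

    Returns the joined sequence and a list of (start_offset, name, length)
    records for mapping positions back to the original contigs.
    """
    spacer = "N" * spacer_length
    offsets, pieces, cursor = [], [], 0
    for name, seq in contigs:
        offsets.append((cursor, name, len(seq)))
        pieces.append(seq)
        cursor += len(seq) + spacer_length
    return spacer.join(pieces), offsets


def to_contig_coords(offsets, joined_pos):
    """Translate a 0-based position in the joined sequence to (contig, position)."""
    starts = [start for start, _, _ in offsets]
    idx = bisect_right(starts, joined_pos) - 1
    if idx < 0:
        raise ValueError("position is before the first contig")
    start, name, length = offsets[idx]
    local = joined_pos - start
    if local >= length:
        raise ValueError("position falls in a spacer, not in a contig")
    return name, local


joined, offsets = join_contigs([("c1", "ACGT"), ("c2", "GGCC")], spacer_length=3)
print(joined)
print(to_contig_coords(offsets, 8))
```

A SNP called at a position in the joined sequence can then be reported against the contig it actually came from.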

So naturally if the alignment/SNP detection software can handle thousands of separate contigs, then this is probably preferable, and makes life easier if you are BLASTing your assembled ESTs...

Matt

pfranchini 03-25-2010 02:09 AM

Hi, we are starting a project aiming to detect SNPs in a species without a reference genome.
I have also thought of assembling my short reads de novo and using the resulting contigs as a reference.
From your experience, what is the best NGS technology for an approach like this? We are deciding between 454 Titanium and Solexa (75 bp reads).
Also, how many individuals are necessary for reliable SNP detection?
Thanks for your help!
P

lletourn 03-25-2010 04:42 AM

We worked with hybrid assemblies, using the longer 454 PE reads to build bigger scaffolds (we used 8k because our lab had trouble with the 20k protocol) and Illumina 76 bp short-insert PE reads for greater depth of coverage (we didn't use the 5k long inserts, again because the lab had some trouble in the past).

We used wgs-celera to assemble, remapped the reads, and used samtools to call the SNPs.

It worked rather well. The drawback is the cost, since you need double the number of libraries.

lletourn 03-25-2010 04:47 AM

Quote:

Originally Posted by pfranchini (Post 16026)
how many individuals are necessary for reliable SNP detection?
P

I'm still not sure about the right answer here. For mapping to a ref, to eliminate many of the false positives, I would say go as high as 25x-30x coverage (that is for hets; for homozygous sites, lower coverage would still be good).

Starting from an assembly that won't be perfect, I don't really know, but it should probably be around the same.

Actually, you could use only one individual for the 454 run, and use all the individuals (separately) for the alignment part.

Use individual A 454 PE + individual A GA PE to assemble.
Use all individuals on that assembly to find SNPs.
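
To get a feel for why hets need more depth, here is a rough binomial sketch. It assumes each read at a heterozygous site shows the alternate allele with probability 0.5, ignores sequencing and mapping errors, and treats the depth as exact rather than variable; the requirement of at least 3 supporting reads is an arbitrary example threshold.

```python
from math import comb


def prob_at_least(k, depth, p):
    """Probability of seeing at least k successes in `depth` binomial trials."""
    return sum(comb(depth, i) * p**i * (1 - p) ** (depth - i)
               for i in range(k, depth + 1))


# Chance of seeing the alternate allele of a het in at least 3 reads.
for depth in (8, 15, 30):
    print(depth, round(prob_at_least(3, depth, 0.5), 4))
```

Under these assumptions the detection probability climbs quickly with depth, and the extra coverage mostly buys room for stricter filters against false positives.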

MattB 03-25-2010 05:06 AM

We will be using paired-end 75bp Illumina reads for our next project, since we believe the higher sequence output will outweigh the longer read lengths of 454. Ultimately, if you are just trying to identify SNPs more or less at random then you don't necessarily need big contigs, just enough to have sufficient flanking sequence.

Depth will of course be related to what you originally sequence, but I'd suggest transcriptome or reduced representation library sequencing to ensure adequate depth without resorting to huge amounts of sequencing.

We have used 10-20 pooled individuals. I think it is reasonably important that these individuals are representative of any downstream SNP genotyping that you have in mind (if that is what you plan to do).

lletourn 03-25-2010 05:11 AM

I agree it depends on what you want.

In our case we wanted the assembly (we're working on finishing... the painful part), but if the only things of interest are SNPs, long PE reads aren't necessary, like you mentioned.

The transcriptome is fine for exonic SNPs, but if you're looking at regulatory or other non-exonic SNPs, it's not really an option.

MattB 03-25-2010 05:15 AM

yep, agree with lletourn that the optimal strategy very much depends on what type of SNPs you want to find and what you want to do with them afterwards ;)

lletourn 03-25-2010 05:16 AM

Quote:

Originally Posted by MattB (Post 16040)
We have used 10-20 pooled individuals

Again it depends (I hate that sentence and it keeps cropping up).

The more individuals are pooled, the fewer rare SNPs you'll see, unless you have higher coverage.

But you'll see more of the 'frequent' SNPs in your population.

If you want 'all' the SNPs between a ref and an individual, with coverage around 30x you probably won't get false negatives using GA.

But if you have 2 individuals pooled, your reads are spread between them, so you'll miss rarer SNPs.

So if you want population genetics, pool away;
if you want a specific mutation for a phenotype (say ENU-induced), don't pool. (This is extreme since you know only one individual has the mutation, but the same goes for rare diseases.)
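
The pooling trade-off can be illustrated with a back-of-the-envelope calculation. The sketch below assumes diploid individuals contributing equally to the pool, an allele present as a single copy in the whole pool, no sequencing errors, and an arbitrary requirement of at least 2 supporting reads with a 95% detection target. All names and thresholds are illustrative.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_reads=2):
    """Probability that at least `min_reads` of `depth` reads carry the allele."""
    miss = sum(comb(depth, i) * allele_fraction**i
               * (1 - allele_fraction) ** (depth - i)
               for i in range(min_reads))
    return 1.0 - miss


def min_depth_for_pool(n_individuals, target=0.95, min_reads=2, max_depth=5000):
    """Smallest depth at which a single-copy allele in the pool is detected
    with probability >= target, or None if max_depth is not enough."""
    allele_fraction = 1.0 / (2 * n_individuals)  # one copy among 2N chromosomes
    for depth in range(min_reads, max_depth + 1):
        if detection_probability(depth, allele_fraction, min_reads) >= target:
            return depth
    return None


for n in (1, 2, 10, 20):
    print(n, min_depth_for_pool(n))
```

The required depth grows with pool size, which is the point made above: pooling is fine for common variants, but a rare allele gets diluted unless coverage is raised accordingly.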

BTW, I never thanked you for the first reply...thanks :-)


All times are GMT -8.