SEQanswers

Old 02-22-2011, 04:20 PM   #1
sulicon
Member
 
Location: Los Angeles

Join Date: Aug 2010
Posts: 41
How to validate SNPs and indels using just contigs?

Hi all,
We have observed some SNP sites and Indels by aligning our contigs to the reference genome. We want to distinguish real variants from those due to sequencing errors.

I have found several tools for variant detection, but all of them work with the raw reads. I think a lot of information would be lost if assembled contigs were used for this task... However, we performed a de novo assembly of the 454 reads due to the design of this specific project. It would be great if we could estimate the "confidence" of the variants we found after assembly. Is there any variant validation method that works with contigs?

Or can I verify this by an independent analysis performed on the raw reads? I am afraid the variants identified before and after assembly would be somewhat inconsistent...

And is there a significant improvement if the variants are identified using raw reads rather than assembled contigs?

Thanks in advance.

Last edited by sulicon; 02-23-2011 at 10:37 AM.
Old 02-22-2011, 07:55 PM   #2
ketan_bnf
Member
 
Location: India

Join Date: Oct 2010
Posts: 59

Hi sulicon,

Are you mapping contigs to the reference seq using gsMapper?

If your contigs are longer than 2000 bp, gsMapper will not consider them for mapping to the reference sequence; I got that error when aligning contigs with gsMapper. So it is better to map the reads to the reference sequence.

gsMapper also outputs HCDiff.txt, which contains high-confidence SNP sites and indels.

Last edited by ketan_bnf; 02-22-2011 at 08:00 PM.
Old 02-22-2011, 09:17 PM   #3
sulicon
Member
 
Location: Los Angeles

Join Date: Aug 2010
Posts: 41

Quote:
Originally Posted by ketan_bnf View Post
Hi sulicon,

Are you mapping contigs to the reference seq using gsMapper?

If your contigs are longer than 2000 bp, gsMapper will not consider them for mapping to the reference sequence; I got that error when aligning contigs with gsMapper. So it is better to map the reads to the reference sequence.

gsMapper also outputs HCDiff.txt, which contains high-confidence SNP sites and indels.

No. I used gsAssembler for the de novo assembly. Then I BLATed the contigs against the reference genome to get the structure of the genes. I observed some SNPs/indels in the alignments, but had no confidence in the results...
Old 02-22-2011, 11:30 PM   #4
ketan_bnf
Member
 
Location: India

Join Date: Oct 2010
Posts: 59

If you want to find SNPs, you should map the reads to the reference sequence using gsMapper, or map the contigs to the reference sequence using BWA (http://bio-bwa.sourceforge.net/), get the output in SAM format, and extract SNPs using SAMtools or MagicViewer.
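
A minimal sketch of the align-then-call workflow described above, written as a Python helper that only builds the shell commands and does not run them. It assumes current BWA, SAMtools and BCFtools releases (the subcommands available in 2011 differed), and the file names and the use of BCFtools for the calling step are placeholders chosen for illustration, not something stated in this thread.

```python
import shlex


def build_pipeline(ref="ref.fa", contigs="contigs.fa", prefix="contigs_vs_ref"):
    """Return shell commands for an align-then-call workflow (nothing is executed).

    All file names are hypothetical placeholders.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf"
    q = shlex.quote  # protect paths that contain spaces or shell characters
    return [
        f"bwa index {q(ref)}",                          # index the reference
        f"bwa mem {q(ref)} {q(contigs)} > {q(sam)}",    # align, SAM on stdout
        f"samtools sort -o {q(bam)} {q(sam)}",          # coordinate-sort to BAM
        f"samtools index {q(bam)}",                     # index the sorted BAM
        # pile up against the reference, then keep variant sites only
        f"bcftools mpileup -f {q(ref)} {q(bam)} | bcftools call -mv -o {q(vcf)}",
    ]


if __name__ == "__main__":
    for cmd in build_pipeline():
        print(cmd)
```

The same commands apply if the raw reads are aligned instead of the contigs; only the input FASTA/FASTQ changes.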

You can also annotate those SNPs further using the Variant Effect Predictor on Ensembl.
Old 02-23-2011, 03:08 AM   #5
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628

Quote:
Originally Posted by sulicon View Post
No. I used gsAssembler for the de novo assembly. Then I BLATed the contigs against the reference genome to get the structure of the genes. I observed some SNPs/indels in the alignments, but had no confidence in the results...

Why don't you just map your reads with gsMapper onto your reference sequence? As you have already seen some SNPs from BLATing your contigs, you know where to look for them ...
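
To illustrate "knowing where to look": once the reads are mapped, each candidate site from the contig alignment can be checked for read support. The sketch below is one possible way to summarise that evidence; the input (the base each overlapping read shows at the site) is a hypothetical structure you would extract from a pileup yourself, and nothing here reflects how gsMapper scores its calls.

```python
from collections import Counter


def allele_support(read_bases, ref_base):
    """Summarise read evidence at one candidate site.

    read_bases: iterable of the base each overlapping read shows at the site
                (hypothetical input, e.g. taken from a read pileup).
    ref_base:   the reference base at that site.
    """
    counts = Counter(b.upper() for b in read_bases)
    depth = sum(counts.values())
    non_ref = {b: n for b, n in counts.items() if b != ref_base.upper()}
    if depth == 0 or not non_ref:
        return {"depth": depth, "alt": None, "alt_reads": 0, "alt_fraction": 0.0}
    # report the most frequent non-reference base
    alt, alt_reads = max(non_ref.items(), key=lambda kv: kv[1])
    return {
        "depth": depth,
        "alt": alt,
        "alt_reads": alt_reads,
        "alt_fraction": alt_reads / depth,
    }


if __name__ == "__main__":
    # made-up example: six reads cover the site, four of them show G
    print(allele_support("AAGGGG", "A"))
```

A site seen in the contig alignment but supported by only one or two reads would deserve more suspicion than one supported by most of the overlapping reads.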

Sven
Old 02-23-2011, 04:54 AM   #6
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

If you can generate a SAM/BAM file from your alignment, then the various SNP callers which work on that format should allow you to estimate confidence in the calls.

However, they will be relying on the quality scores generated by the base caller. I've recently run into a situation on another platform (SOLiD) in another setting (RNA-Seq) in which systematic errors were reinforced, and so some of my very confident calls from the SNP caller were bogus. In the end, nothing beats verifying at least a sample of your variant calls experimentally -- which is how I discovered the trouble in my data.
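
For readers unfamiliar with how quality scores feed into call confidence, here is a small sketch of the standard Phred relationship, Q = -10 * log10(p). The last function is a deliberate simplification, not the model of any particular SNP caller: it multiplies per-base error probabilities as if errors were independent, which is exactly the assumption that systematic errors violate.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)


def prob_all_wrong(quals):
    """Probability that every supporting base is wrong, ASSUMING independent errors.

    With systematic (correlated) errors this number is far too optimistic.
    """
    p = 1.0
    for q in quals:
        p *= phred_to_error_prob(q)
    return p


if __name__ == "__main__":
    print(phred_to_error_prob(20))       # error probability for one Q20 base
    print(prob_all_wrong([20, 20, 20]))  # three Q20 bases agreeing, if independent
```

Three agreeing Q20 bases look very convincing under independence, but if the same systematic error hits all three reads the real confidence is no better than that of a single base.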
Old 02-23-2011, 10:42 AM   #7
sulicon
Member
 
Location: Los Angeles

Join Date: Aug 2010
Posts: 41

Thanks Ketan and Sven. I have already assembled the contigs with Newbler and performed a lot of subsequent analysis. It would be better if I didn't need to assemble the reads again... Maybe I have to map the reads to the reference sequence, just for the purpose of SNP detection.
Old 02-23-2011, 10:55 AM   #8
sulicon
Member
 
Location: Los Angeles

Join Date: Aug 2010
Posts: 41

@krobison

Thanks. Could I generate SAM/BAM files from the BLAT alignment between the contigs and the human genome?

I think there would be some information lost if I worked on the contigs instead of the reads. However, compared with aligning raw reads, I guess the de novo assembler has already considered the alignment between reads, and I could provide a quality file for the contigs. But I don't know whether the SNP callers would recognize that the sequencing error rate rises in homopolymer regions. I will give it a try.
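
One way to account for the homopolymer concern regardless of which caller is used is to flag candidate sites that fall inside a homopolymer run of the reference and treat them with extra caution. The sketch below assumes 0-based positions on a reference string; the threshold of 4 bases is an arbitrary illustration, not a recommended value, and for indels you may also want to check the neighbouring positions.

```python
def homopolymer_run_length(seq, pos):
    """Length of the run of identical bases that contains 0-based index pos."""
    base = seq[pos].upper()
    left = pos
    while left > 0 and seq[left - 1].upper() == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1].upper() == base:
        right += 1
    return right - left + 1


def flag_homopolymer_sites(seq, positions, min_run=4):
    """Return the candidate positions lying in a run of at least min_run bases.

    min_run=4 is an arbitrary, illustrative threshold.
    """
    return [p for p in positions if homopolymer_run_length(seq, p) >= min_run]


if __name__ == "__main__":
    reference = "ACGTTTTTGCA"  # made-up sequence with a run of five T's
    print(flag_homopolymer_sites(reference, [1, 5, 9]))
```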

We would perform some experiments for variant validation if I could find interesting candidates.
Old 02-24-2011, 07:20 AM   #9
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 415

I would definitely just use gsMapper to align to a reference genome. There are some nice output files with confidence information; I think they are called HCDiff.txt or similar.
We have been doing this and the validation quite a lot of late, and the 454 data is very nice for SNP calling, even at low coverage, which has really surprised us after our fun with Illumina-predicted SNPs at low coverage.
Old 02-25-2011, 04:27 AM   #10
Giulietta
Junior Member
 
Location: UK

Join Date: Nov 2010
Posts: 8

Just a word about the Ensembl Variant Effect Predictor: if you have chromosome and base pair positions (on the reference assembly), you can enter any alleles found at that position as your input. The output will tell you which dbSNP IDs map to the same position. In this way, you can see whether there is a known dbSNP ID for the allele/alternate nucleotide you have found. More is here:

http://www.ensembl.org/info/website/upload/var.html

It also accepts VCF format:

http://www.1000genomes.org/wiki/Anal...mat-version-40

The tool itself is here, as both a script and a web interface:

http://www.ensembl.org/tools.html
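
If your variants are only a list of positions and alleles, a minimal VCF file is easy to produce by hand. The sketch below writes the eight fixed VCF columns and fills everything unknown with "."; the function name and the example variant are made up for illustration, and you should check the tool's documentation for any extra requirements before uploading.

```python
def to_vcf_lines(variants):
    """Format (chrom, pos, ref, alt) tuples as minimal VCF 4.0 text lines.

    pos is 1-based; ID, QUAL, FILTER and INFO are unknown here and written as '.'.
    """
    lines = [
        "##fileformat=VCFv4.0",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for chrom, pos, ref, alt in variants:
        fields = [str(chrom), str(pos), ".", ref, alt, ".", ".", "."]
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    # hypothetical variant: A>G at position 12345 on chromosome 1
    print("\n".join(to_vcf_lines([("1", 12345, "A", "G")])))
```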

Hope that helps.
All times are GMT -8.