SEQanswers

Old 09-08-2012, 11:45 PM   #1
tahamasoodi
Success
 
Location: India

Join Date: May 2012
Posts: 130
Default Illumina final result analysis

Hi all,

I received the final Illumina results as an xlsx file containing around 3,768,494 SNPs, 10,557 nsSNPs, 535,826 indels, 474 coding indels and much more. Since the number is enormous, how can I find which SNPs are significant? Is there any software for analysing this?

Thanks.
Old 09-12-2012, 03:07 PM   #2
swNGS
Member
 
Location: SW UK

Join Date: Nov 2011
Posts: 83
Default

That's a really unhelpful sequencing core that you have there....
You might find you get a better response if you are more specific about what you are trying to do...
Old 09-12-2012, 04:30 PM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

There are programs that you can feed SNP data to, and they will at least tell you what amino acid changes the SNPs make.

Off the top of my head, there's some Ensembl variant predictor, a program called SnpEff, and a program called ANNOVAR. I use ANNOVAR on mouse SNPs, and it seems to work fine.
Old 09-12-2012, 11:24 PM   #4
tahamasoodi
Success
 
Location: India

Join Date: May 2012
Posts: 130
Default

Hi swbarnes2,
Thanks for your response. I tried SnpEff, but it accepts VCF format input files while my data is in an xlsx file. When I tried annovar, any command starting with annovar.pl gives the error message "command not found". I tried many things but failed.

Last edited by tahamasoodi; 11-03-2012 at 05:56 AM.
Old 09-12-2012, 11:44 PM   #5
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

Are these SNPs annotated in any way (e.g. allele frequencies in the 1000 Genomes Project or the Exome Sequencing Project, SIFT prediction values, conservation score, amino acid change, gene affected)?
If yes, then that's something to start with:
Filter out all common variants.
If there's a special region you are interested in, take out only those SNPs.

If not, get an annotation program running (I recommend annovar as well; it needs a certain format for your input file, but since it is text-based you should be able to create that from the Excel file).

If you can't get it done, you also might have a look here:
http://snp.gs.washington.edu/SeattleSeqAnnotation/

Hope that helps
Old 09-13-2012, 03:11 AM   #6
tahamasoodi
Success
 
Location: India

Join Date: May 2012
Posts: 130
Default

Thanks Peter,

The Excel file contains a number of fields, as given below. I want to know the significant SNPs in the whole genome. Can I do it in Excel itself, or do I have to use a tool for it? I tried to use annovar but I am getting an error.

Regards,

#chr_name chr_start chr_end ref_base alt_base hom_het snp_quality tot_depth
chr10 61373 61373 A - hom 189 28
chr10 62082 62082 G T het 52 33
chr10 65878 65878 C G hom 31 3


alt_depth region gene
28 intergenic NONE(dist=NONE),TUBB8(dist=31455)
11 intergenic NONE(dist=NONE),TUBB8(dist=30746)
3 intergenic NONE(dist=NONE),TUBB8(dist=26950)

dbSNP135_full dbSNP135_common 1000G_2011Oct_allele_freq
rs9329307 . .
rs2271275 rs2271275 0.55
rs6901 rs6901 0.73

annotation
TUBB8:NM_177987:exon4:c.A314G:p.H105R,
ADARB2:NM_018702:exon9:c.G1876A:p.A626T,
PITRM1:NM_001242307:exon27:c.A3113G:p.Q1038R,PITRM1:NM_014889:exon27:c.A3110G:p.Q1037R,PITRM1:NM_001242309:exon24:c.A2816G:p.Q939R,
Old 09-13-2012, 03:18 AM   #7
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

what do you mean by significant SNPs?

It seems that your SNPs are already annotated.
So, in case you are searching for the cause of a rare disease, you could limit yourself to SNPs that have an allele frequency < 0.01 in 1000G_2011Oct_allele_freq, have no entry in the dbSNP135_common field, and are possibly deleterious (in your case this is stated in the annotation part, e.g. p.H105R).

You could do that in Excel, but again,
if you do not specify your problem we cannot specify the solution
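
If you would rather script that filter than click through Excel, here is a minimal Python sketch. It assumes the sheet has been saved as tab-delimited text, that "." means "no entry", that a missing 1000 Genomes frequency counts as rare, and that the annotation column carries the protein change in the p.H105R form. The demo rows are pieced together from values in this thread for illustration only; they are not real records.

```python
import csv
import io
import re

# Matches protein changes such as p.H105R and captures both amino acids.
PROTEIN_CHANGE = re.compile(r"p\.([A-Z])\d+([A-Z])")

def changes_amino_acid(annotation):
    """True if any listed isoform reports a different amino acid."""
    return any(ref != alt for ref, alt in PROTEIN_CHANGE.findall(annotation))

def is_rare_candidate(row, max_freq=0.01):
    """Apply the three criteria: rare, not in dbSNP common, protein-changing."""
    freq = row["1000G_2011Oct_allele_freq"].strip()
    common = row["dbSNP135_common"].strip()
    # Assumption: an empty or "." frequency means the variant was not seen.
    rare = freq in ("", ".") or float(freq) < max_freq
    not_common = common in ("", ".")
    return rare and not_common and changes_amino_acid(row["annotation"])

def filter_rows(handle):
    """Read tab-delimited rows with a header line and keep the candidates."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if is_rare_candidate(row)]

# Illustrative input only (three of the columns, two rows).
demo = io.StringIO(
    "dbSNP135_common\t1000G_2011Oct_allele_freq\tannotation\n"
    ".\t.\tTUBB8:NM_177987:exon4:c.A314G:p.H105R,\n"
    "rs2271275\t0.55\tADARB2:NM_018702:exon9:c.G1876A:p.A626T,\n"
)
kept = filter_rows(demo)
for row in kept:
    print(row["annotation"])
```

The 0.01 cutoff is just the one suggested above; pass a different max_freq if your study design calls for it.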
Old 09-13-2012, 03:51 AM   #8
tahamasoodi
Success
 
Location: India

Join Date: May 2012
Posts: 130
Default

Actually, I have whole-genome data for around 80 CRC patient samples and an equal number of controls. I got around 3,768,494 SNPs, 10,557 nsSNPs, 535,826 indels and 474 coding indels for one case sample, and almost similar figures for the controls. Now I want to know which SNPs/indels are responsible for the disease by filtering this huge number of SNPs. How can I set the filtering criteria? Can you give a full description of the annotation fields?

Last edited by tahamasoodi; 09-13-2012 at 04:13 AM.
Old 09-13-2012, 04:12 AM   #9
xied75
Senior Member
 
Location: Oxford

Join Date: Feb 2012
Posts: 129
Default

I was just guessing that he might be feeding the Excel file directly to whatever programs you have mentioned, rather than creating new text files in a format that these programs can read. (But if I'm wrong, then ignore this.)
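
For what it's worth, here is a rough Python sketch of that conversion. It assumes the sheet has first been saved from Excel as tab-delimited text and that the first five columns are chr_name, chr_start, chr_end, ref_base and alt_base, as in the header posted above. ANNOVAR's simple text input expects chromosome, start, end, reference allele and alternative allele as its first five columns, so the sketch just keeps those. The function name is made up for illustration.

```python
import csv
import io

def to_annovar_input(src, dst):
    """Copy the first five columns (chr, start, end, ref, alt) of each data row."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    written = 0
    for row in reader:
        # Skip blank lines and the header line, which starts with '#'.
        if not row or row[0].startswith("#"):
            continue
        writer.writerow(row[:5])
        written += 1
    return written

# In-memory demonstration using two rows from the thread.
demo = io.StringIO(
    "#chr_name\tchr_start\tchr_end\tref_base\talt_base\thom_het\n"
    "chr10\t62082\t62082\tG\tT\thet\n"
    "chr10\t65878\t65878\tC\tG\thom\n"
)
out = io.StringIO()
n_written = to_annovar_input(demo, out)
print(out.getvalue(), end="")
```

With real files you would pass two open file handles instead of the StringIO objects.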

Best,

dong
Old 09-13-2012, 04:26 AM   #10
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

So you've got 160 Excel files, each having about 4 million entries?

I guess you'll need some programming here...
I don't know of any program that could compute the significance of particular SNPs based on whether they show up in a significant portion of samples. Maybe someone else can help here...

What you might do is filter out the synonymous SNPs and the SNPs with higher allele frequencies just by using an Excel filter, but for 160 huge Excel files that may not be what you want.
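
To give an idea of the kind of programming I mean, here is a minimal Python sketch: count how many cases and how many controls carry a given variant, then run a two-sided Fisher's exact test on the resulting 2x2 table. Fisher's test is my choice for illustration, not something your sequencing core prescribed, and the function names and counts are hypothetical. With millions of variants tested you would also need a multiple-testing correction, which is not shown.

```python
from math import comb

def carrier_table(variant, case_samples, control_samples):
    """Count carriers; each sample is a set of variant keys, e.g. (chr, pos, ref, alt)."""
    a = sum(variant in s for s in case_samples)
    c = sum(variant in s for s in control_samples)
    return a, len(case_samples) - a, c, len(control_samples) - c

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is as likely as, or less likely than, the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical example: 3 case samples and 3 control samples.
variant = ("chr10", 62082, "G", "T")
cases = [{variant}, {variant}, set()]
controls = [set(), set(), {variant}]
table = carrier_table(variant, cases, controls)
p_value = fisher_exact_two_sided(*table)
print(table, p_value)
```

math.comb needs Python 3.8 or later.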

Since I am in a good mood today, I'm going to explain the flags to you:

chr_name: name of the chromosome
chr_start: SNP position (starting point for indels)
chr_end: SNP position (end point for indels)
ref_base: human reference base at that exact position
alt_base: base detected in your sample at that position
hom_het: whether the mutation showed up homozygous or heterozygous
snp_quality: a quality value indicating how likely it is that your SNP is real rather than a sequencing artifact (no idea about the scale they use for assigning the SNP quality value)
tot_depth: sequencing depth at that position (i.e. how many reads cover this position)
alt_depth: number of sequencing reads at that position that show the mutated allele
region: shows whether the mutation lies within a gene/exon/intron or elsewhere
gene: gene affected
dbSNP135_full: dbSNP version 135 reference
dbSNP135_common: dbSNP version 135 reference in case that SNP has an allele frequency >1%
1000G_2011Oct_allele_freq: allele frequency determined by the 1000 Genomes Project (October 2011 version)
annotation: nomenclature for the mutation; c.XXX is the cDNA position in the NM_xxx isoform and p.xxx is the protein substitution nomenclature for that mutation

Since I did not create the files I cannot guarantee that this is absolutely true, but these are the most likely explanations.
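
If you end up scripting this, the annotation field can be split into its parts. Below is a minimal Python sketch that assumes the colon-separated layout gene:transcript:exon:c.change:p.change, with several isoforms separated by commas and a trailing comma at the end. That layout is my reading of your example rows, so check it against your own files; the function name is made up.

```python
def parse_annotation(field):
    """Split an annotation string into one dict per isoform."""
    records = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue  # the trailing comma leaves an empty piece
        parts = entry.split(":")
        if len(parts) != 5:
            continue  # skip anything that does not match the assumed layout
        gene, transcript, exon, cdna, protein = parts
        records.append({
            "gene": gene,
            "transcript": transcript,
            "exon": exon,
            "cdna": cdna,
            "protein": protein,
        })
    return records

example = ("PITRM1:NM_001242307:exon27:c.A3113G:p.Q1038R,"
           "PITRM1:NM_014889:exon27:c.A3110G:p.Q1037R,")
records = parse_annotation(example)
for rec in records:
    print(rec["gene"], rec["transcript"], rec["protein"])
```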

Best regards,
Peter
Old 09-13-2012, 04:27 AM   #11
ulz_peter
Senior Member
 
Location: Graz, Austria

Join Date: Feb 2010
Posts: 219
Default

Quote:
Originally Posted by xied75 View Post
I was just guessing that he might be feeding the Excel file directly to whatever programs you have mentioned, rather than creating new text files in a format that these programs can read. (But if I'm wrong, then ignore this.)

Best,

dong
That's what I am guessing too, however his files seem to be annotated already...
Old 09-13-2012, 04:40 AM   #12
tahamasoodi
Success
 
Location: India

Join Date: May 2012
Posts: 130
Default

If I select the particular genes involved in CRC, I think then excel filter can help in screening the deleterious SNPs.
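
The same gene-based selection could also be scripted, which may be easier than repeating an Excel filter on 160 files. A minimal Python sketch follows, assuming the region and gene columns look like the rows I posted above. The gene set here is only a placeholder for whatever list is actually chosen, and the function names are made up.

```python
import re

# Placeholder gene set for illustration; replace it with the genes you select.
GENES_OF_INTEREST = {"APC", "KRAS", "TP53"}

def genes_in_field(gene_field):
    """Extract gene names, dropping '(dist=...)' notes and NONE entries."""
    cleaned = re.sub(r"\([^)]*\)", "", gene_field)
    return {name.strip() for name in cleaned.split(",")
            if name.strip() and name.strip() != "NONE"}

def in_gene_list(row, genes=GENES_OF_INTEREST):
    """Keep rows that are not intergenic and hit one of the chosen genes."""
    if row["region"] == "intergenic":
        return False
    return bool(genes_in_field(row["gene"]) & genes)

# Illustrative rows only; the second one is hypothetical.
rows = [
    {"region": "intergenic", "gene": "NONE(dist=NONE),TUBB8(dist=31455)"},
    {"region": "exonic", "gene": "KRAS"},
]
selected = [row for row in rows if in_gene_list(row)]
print(selected)
```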
Old 09-13-2012, 09:23 AM   #13
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

There is no perfect algorithm that goes from primary amino acid change -> functional effect. So you'll want to use a combination of approaches, like PolyPhen-2, pathway analysis, comparison to the 1000 Genomes SNP set, and so on.