  • Illumina final result analysis

    Hi all,

    I received the final Illumina results as an xlsx file containing around 3,768,494 SNPs, 10,557 nsSNPs, 535,826 indels, 474 coding indels and much more. How can I find which SNPs are significant, given that their number is enormous? Is there any software for analysing this?

    Thanks.
    Thanks,

  • #2
    That's a really unhelpful sequencing core that you have there....
    You might find you get a better response if you are more specific about what you are trying to do...


    • #3
      There are programs that you can feed SNP data, and they will at least tell you what amino acid changes the SNPs make.

      Off the top of my head, there's Ensembl's Variant Effect Predictor, a program called SnpEff, and a program called ANNOVAR. I use ANNOVAR on mouse SNPs; it seems to work fine.


      • #4
        Hi swbarnes2,
        Thanks for your response. I tried SnpEff, but it accepts VCF format input files while my data is in an xlsx file. When I tried ANNOVAR, any command starting with annovar.pl gives the error message "command not found". I tried many things but failed.
        Last edited by tahamasoodi; 11-03-2012, 04:56 AM.
        Thanks,


        • #5
          Are these SNPs annotated in any way (e.g. allele frequencies in the 1000 Genomes Project or the Exome Sequencing Project, SIFT prediction values, conservation score, amino acid change, gene affected)?
          If yes, then that's something to start with:
          Filter out all common variants.
          If there's a special region you are interested in, take only those SNPs.

          If not, get an annotation program running (I recommend ANNOVAR as well; it needs a certain input format, but since that format is text-based you should be able to create it from the Excel file).
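
          A minimal sketch of that conversion, assuming the Excel sheet has first been saved as tab-delimited text and uses the column names posted later in this thread (chr_name, chr_start, chr_end, ref_base, alt_base). The function name is made up for illustration. ANNOVAR's plain-text input expects chromosome, start, end, reference allele and alternative allele as its first five columns, so the sketch simply picks those out:

```python
import csv
import io

# Assumed column names, taken from the header posted later in this thread.
WANTED = ["chr_name", "chr_start", "chr_end", "ref_base", "alt_base"]


def to_annovar_lines(tsv_text):
    """Return 'chr start end ref alt' lines from tab-delimited text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # The posted header starts with '#', so strip it before matching names.
    header = [name.strip().lstrip("#") for name in next(reader)]
    positions = [header.index(name) for name in WANTED]
    lines = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        lines.append("\t".join(row[i].strip() for i in positions))
    return lines


# Two rows copied from the sample data in this thread.
example = (
    "#chr_name\tchr_start\tchr_end\tref_base\talt_base\thom_het\n"
    "chr10\t61373\t61373\tA\t-\thom\n"
    "chr10\t62082\t62082\tG\tT\thet\n"
)
for line in to_annovar_lines(example):
    print(line)
```

          For a real file you would read the exported text from disk and write the returned lines to a new file; check the result against the ANNOVAR documentation before running it.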

          If you can't get it done, you also might have a look here:


          Hope that helps


          • #6
            Thanks Peter,

            The Excel file contains a number of fields, shown below. I want to find the significant SNPs in the whole genome. Can I do it in Excel itself, or do I have to use a tool for it? I tried to use ANNOVAR but I am getting an error.

            Regards,

            #chr_name chr_start chr_end ref_base alt_base hom_het snp_quality tot_depth
            chr10 61373 61373 A - hom 189 28
            chr10 62082 62082 G T het 52 33
            chr10 65878 65878 C G hom 31 3


            alt_depth region gene
            28 intergenic NONE(dist=NONE),TUBB8(dist=31455)
            11 intergenic NONE(dist=NONE),TUBB8(dist=30746)
            3 intergenic NONE(dist=NONE),TUBB8(dist=26950)

            dbSNP135_full dbSNP135_common 1000G_2011Oct_allele_freq
            rs9329307 . .
            rs2271275 rs2271275 0.55
            rs6901 rs6901 0.73

            annotation
            TUBB8:NM_177987:exon4:c.A314G:p.H105R,
            ADARB2:NM_018702:exon9:c.G1876A:p.A626T,
            PITRM1:NM_001242307:exon27:c.A3113G:p.Q1038R,PITRM1:NM_014889:exon27:c.A3110G:p.Q1037R,PITRM1:NM_001242309:exon24:c.A2816G:p.Q939R,
            Thanks,


            • #7
              What do you mean by significant SNPs?

              It seems that your SNPs are already annotated.
              So, if you are searching for the cause of a rare disease, you could limit yourself to SNPs that have an allele frequency < 0.01 in 1000G_2011Oct_allele_freq, have no entry in the dbSNP135_common field, and are possibly deleterious (in your case the amino acid change is stated in the annotation field, e.g. p.H105R).

              You could do that in Excel, but again,
              if you do not specify your problem we cannot specify the solution
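
              Outside Excel, the same filter could be sketched roughly as below. This is only an illustration of the criteria above, assuming each row is a dict keyed by the column names from the posted file and that "." marks an empty field (as in the posted sample). The function name and the two example rows are made up.

```python
def is_rare_candidate(row, max_freq=0.01):
    """Apply the three criteria: rare in 1000G, not in dbSNP common, annotated."""
    freq = row["1000G_2011Oct_allele_freq"].strip()
    # Assumption: "." or empty means the variant was not seen in 1000 Genomes.
    rare = freq in ("", ".") or float(freq) < max_freq
    not_common = row["dbSNP135_common"].strip() in ("", ".")
    # Assumption: a non-empty annotation field means a coding change is reported.
    has_aa_change = row.get("annotation", "").strip() not in ("", ".")
    return rare and not_common and has_aa_change


# Hypothetical rows for illustration only (not real variants).
rows = [
    {"dbSNP135_full": "rsA", "dbSNP135_common": "rsA",
     "1000G_2011Oct_allele_freq": "0.55",
     "annotation": "GENE1:NM_000001:exon2:c.A10G:p.T4A,"},
    {"dbSNP135_full": ".", "dbSNP135_common": ".",
     "1000G_2011Oct_allele_freq": ".",
     "annotation": "GENE2:NM_000002:exon5:c.G100A:p.V34M,"},
]
kept = [row for row in rows if is_rare_candidate(row)]
print(len(kept), "of", len(rows), "rows kept")
```

              Whether a missing 1000 Genomes frequency should count as "rare" is a judgment call; the sketch keeps such variants.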


              • #8
                Actually, I have whole-genome data for around 80 CRC patient samples and an equal number of controls, and I got around 3,768,494 SNPs, 10,557 nsSNPs, 535,826 indels and 474 coding indels for one case sample, with similar figures for the controls. Now I want to find out which SNPs/indels are responsible for the disease by filtering this huge number of variants. How should I set the filtering criteria? Can you give a full description of the annotation fields?
                Last edited by tahamasoodi; 09-13-2012, 03:13 AM.
                Thanks,


                • #9
                  I was just guessing that he might be feeding the Excel file directly to whatever programs you have mentioned, rather than creating new text files in a format that these programs can read. (But if I'm wrong, then ignore this.)

                  Best,

                  dong


                  • #10
                    So you've got 160 Excel files, each with about 4 million entries?

                    I guess you'll need some programming here...
                    I don't know of any program that computes the significance of SNPs based on whether they show up in a significant portion of samples. Maybe someone else can help here...

                    What you might do is filter out the synonymous SNPs and the SNPs with higher allele frequencies just by using an Excel filter, but for 160 huge Excel files that may not be what you want.
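
                    As a rough idea of what that programming could look like for a single variant: count how many cases and how many controls carry it, then compare the two groups with Fisher's exact test. The sketch below is one possible approach under those assumptions, not an established pipeline; the function name and the counts are made up.

```python
from math import comb


def fisher_exact_two_sided(case_with, case_without, ctrl_with, ctrl_without):
    """Two-sided Fisher's exact test p-value for a 2x2 carrier table."""
    cases = case_with + case_without
    controls = ctrl_with + ctrl_without
    carriers = case_with + ctrl_with
    total_tables = comb(cases + controls, carriers)

    def prob(x):
        # Hypergeometric probability of x carriers among the cases.
        return comb(cases, x) * comb(controls, carriers - x) / total_tables

    observed = prob(case_with)
    lowest = max(0, carriers - controls)
    highest = min(cases, carriers)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


# Made-up counts: variant seen in 12 of 80 cases and 2 of 80 controls.
p_value = fisher_exact_two_sided(12, 68, 2, 78)
print(f"p = {p_value:.4g}")
```

                    With millions of variants tested, raw p-values would need a multiple-testing correction, and 80 cases versus 80 controls gives limited power, so treat this as a starting point only.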

                    Since I am in a good mood today, I'm going to explain the fields to you:

                    chr_name: Name of the chromosome
                    chr_start: SNP position (start point for indels)
                    chr_end: SNP position (end point for indels)
                    ref_base: human reference base at that exact position
                    alt_base: base detected in your sample at that position
                    hom_het: whether the variant is homozygous or heterozygous
                    snp_quality: a quality value indicating how likely it is that your SNP is real rather than a sequencing artifact (no idea which scale they use for the SNP quality value)
                    tot_depth: Sequencing depth at that position (i.e.: how many reads cover this position)
                    alt_depth: sequencing reads at that position that show the mutated allele
                    region: Obviously shows if that mutation lies within a gene/exon/intron or elsewhere
                    gene: gene affected
                    dbSNP135_full: dbSNP version 135 reference
                    dbSNP135_common: dbSNP version 135 reference in case that SNP has an allele frequency >1%
                    1000G_2011Oct_allele_freq: Allele frequency determined by the 1000Genomes (October 2011 version) project
                    annotation: nomenclature for the mutation: c.XXX is the cDNA change in the NM_xxx isoform and p.xxx is the resulting protein substitution

                    Since I did not create the files I cannot guarantee that this is absolutely true, but these are the most likely explanations.
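
                    To make the annotation field easier to filter on, it can be split into its parts. A small sketch, assuming the colon-separated layout that ANNOVAR normally writes (gene:transcript:exon:c.change:p.change, with several transcripts separated by commas); the function name is made up, and the example string uses the PITRM1 entries from this thread written in that layout.

```python
def parse_annotation(field):
    """Split an annotation string into one dict per transcript entry."""
    records = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue  # the field ends with a trailing comma
        parts = entry.split(":")
        record = {"gene": parts[0], "transcript": None, "exon": None,
                  "cdna": None, "protein": None}
        for part in parts[1:]:
            if part.startswith("NM_"):
                record["transcript"] = part
            elif part.startswith("exon"):
                record["exon"] = part
            elif part.startswith("c."):
                record["cdna"] = part
            elif part.startswith("p."):
                record["protein"] = part
        records.append(record)
    return records


example_field = (
    "PITRM1:NM_001242307:exon27:c.A3113G:p.Q1038R,"
    "PITRM1:NM_014889:exon27:c.A3110G:p.Q1037R,"
)
for record in parse_annotation(example_field):
    print(record["gene"], record["transcript"], record["protein"])
```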

                    Best regards,
                    Peter


                    • #11
                      Originally posted by xied75
                      I was just guessing that he might be feeding the Excel file directly to whatever programs you have mentioned, rather than creating new text files in a format that these programs can read. (But if I'm wrong, then ignore this.)

                      Best,

                      dong
                      That's what I am guessing too, however his files seem to be annotated already...


                      • #12
                        If I select the particular genes involved in CRC, then I think an Excel filter can help in screening for deleterious SNPs.
                        Thanks,


                        • #13
                          There is no perfect algorithm that goes from primary amino acid change -> functional effect. So you'll want to use a combination of approaches, such as PolyPhen-2, pathway analysis and comparison to the 1000 Genomes SNP set.
