  • Tool to estimate genetic heterogeneity using SNP

    Hello everyone,

    I have a question related to using SNPs as markers of population heterogeneity.

    We have four fly strains that are supposed to have the same genetic background. To check whether this is the case, we performed SNP calling on the whole transcriptome, on four samples for each of the four strains. Then we compared four samples, one from each strain, using a Venn diagram, and the results were discouraging: only half of the SNPs (~10000) were shared by all four strains, and the rest were strain-specific or shared by two or three strains (attached below). Then we took the four samples for one strain and did the same, and the picture was quite similar (attached below). In my opinion, the fact that intra-strain variability is the same as inter-strain variability indicates that the background in all strains is more or less the same. The question is: are there any tools that would statistically reinforce this observation?

    I have found some R packages for population genetics (DEMEtics, popgen, genetics, pegas) that work with SNPs, but they need allele frequencies, which are basically unavailable in my case. Another constraint (or not?) is that one sample in my experiment contains pooled RNA from 60 flies, whereas the listed R tools treat one individual as a sample.

    I would highly appreciate any advice!

    Best regards,

    Nina.

  • #2
    Just a question from the technical/analytical side (I like to check the technical things before going too deep into biological reasoning): Did you look at the quality of the data underlying the calls for the non-shared SNPs (and set the filters appropriately)? For example, if there is a trend toward low coverage for the non-shared SNPs (given that this is RNA-Seq), or they fall in problematic regions (splice junctions), this may explain the observed heterogeneity...
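The coverage check suggested here could be sketched roughly as follows. This is only an illustration: the input layout (a dict mapping each sample to its called sites and their read depths) and the function name `median_depth_shared_vs_not` are assumptions, not the output of any particular SNP caller.

```python
from statistics import median


def median_depth_shared_vs_not(calls):
    """Compare read depth at SNPs called in every sample vs. only in some.

    calls: dict mapping sample name -> dict mapping site -> read depth,
           listing only the sites where a SNP was called in that sample.
    Returns (median depth of shared SNPs, median depth of non-shared SNPs);
    either value is None if that group is empty.
    """
    n_samples = len(calls)

    # Collect, for every site, the depths from the samples that called it.
    depths_per_site = {}
    for sites in calls.values():
        for site, depth in sites.items():
            depths_per_site.setdefault(site, []).append(depth)

    shared = [d for ds in depths_per_site.values() if len(ds) == n_samples for d in ds]
    non_shared = [d for ds in depths_per_site.values() if len(ds) < n_samples for d in ds]

    return (
        median(shared) if shared else None,
        median(non_shared) if non_shared else None,
    )
```

If the non-shared SNPs sit much closer to the calling threshold than the shared ones, the apparent heterogeneity is probably at least partly a coverage artifact.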



    • #3
      Hi sarvidsson, thank you for the reply!

      We used paired-end RNA-seq on mRNA (so there should not be problems related to splice junctions, should there?), with 90 million reads per sample. For SNP calling we used RNAseqmut (https://github.com/davidliwei/rnaseqmut), which, aside from the SNP, its coordinate, etc., reports the number of reads supporting the reference and the changed nucleotide, for forward and reverse reads (4 columns in all). We made the initial SNP calls using a threshold of 20 reads (for the total number of reads, summed over all four columns). Then we selected only those SNPs that had at least 50 reads and >40% of nucleotides changed. The Venn diagrams are plotted using these lists of SNPs.
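A minimal sketch of the second filtering step described above, assuming each candidate SNP carries four read counts (reference forward/reverse, changed forward/reverse). The `SnpCounts` record and its field names are hypothetical and do not reproduce RNAseqmut's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpCounts:
    # Hypothetical record; field names are illustrative only.
    chrom: str
    pos: int
    ref_fwd: int
    ref_rev: int
    alt_fwd: int
    alt_rev: int

    @property
    def depth(self) -> int:
        # Total number of reads, summed over all four columns.
        return self.ref_fwd + self.ref_rev + self.alt_fwd + self.alt_rev

    @property
    def alt_fraction(self) -> float:
        # Fraction of reads carrying the changed nucleotide.
        alt = self.alt_fwd + self.alt_rev
        return alt / self.depth if self.depth else 0.0


def passes_filter(snp, min_depth=50, min_alt_fraction=0.40):
    """Keep SNPs with at least min_depth reads and more than min_alt_fraction changed."""
    return snp.depth >= min_depth and snp.alt_fraction > min_alt_fraction
```

Note that a filter like this is applied per sample, so the same site can pass in one sample and fail in another purely because of expression level.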

      Now we are checking whether there is a difference in coverage between shared and non-shared SNPs.



      • #4
        It might be useful to generate a kmer frequency histogram. That's what I typically use to estimate heterozygosity for a novel organism without a reference. When you have a reference, there are theoretically much better ways to calculate heterozygosity, but I don't know of any specific tools for that purpose.



        • #5
          It sounds like you used very strict parameters to call a SNP if I understood you correctly (read depth of 50 or greater, SNP seen in 40% of reads). My question is if you required those levels in all samples. What if the SNP has a read depth of 80 in sample 1 and 40 in sample 2? If you require a read depth of 50 then the output would say the SNP is in sample 1 and not in sample 2. Yet saying the SNP is not in sample 2 would be clearly wrong.
          Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



          • #6
            Brian Bushnell, could you please give me a clue as to what a kmer frequency histogram is? I am new to this field and don't know what it is. We searched for SNPs against the reference assembly of the fruit fly genome, rather than directly comparing one transcriptome to another. We were thinking about what to do with heterozygous SNPs and decided to leave them out of the analysis. The reason is that one RNA-seq sample in our case consists of the transcriptomes of 60 flies, and, for example, 40% changed nucleotides may mean not only heterozygosity, but also that 40% of those 60 flies have this SNP and 60% do not. I am not sure that we can discriminate between these cases from our data.

            SNPsaurus, yep, that proved to be true. I checked in IGV whether SNPs that are called in only one strain really exist in the other strains, and - tadamm! - they really exist, but the coverage is lower than 50 reads and the percentage of changed nucleotides is lower than 40%. So we decided to lower the coverage threshold to 20 and to raise the threshold for the % of changed nucleotides to 80%, to eliminate these ambiguous heterozygotes from the analysis. Let's see what it gives.

            Sarvidsson, thank you for the idea about coverage! Sometimes obvious things are like the glasses on your head that you cannot find yourself.
            Last edited by NinaG; 03-01-2015, 02:59 AM.



            • #7
              Originally posted by NinaG
              Brian Bushnell, could you please give me a clue as to what a kmer frequency histogram is?
              Actually, never mind... I forgot this was RNA-seq data. Due to the highly variable coverage a kmer frequency histogram won't be informative. It's useful in DNA experiments to determine ploidy and het rates. Essentially, it's a graph indicating the coverage distribution of the unique portions of the genome, so you get one peak for homozygous areas, one peak for heterozygous areas, one peak for 2-copy repeats, etc. But it doesn't work without relatively flat read coverage.
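For readers unfamiliar with the term, a kmer frequency histogram can be sketched in a few lines. This is a toy illustration only: the function name is made up, and unlike real k-mer counters it does not merge a k-mer with its reverse complement or handle large inputs efficiently.

```python
from collections import Counter


def kmer_frequency_histogram(reads, k):
    """Return a histogram of k-mer multiplicities.

    reads: iterable of sequence strings.
    k:     k-mer length.
    The result maps "number of occurrences" to "how many distinct k-mers
    were seen that many times".
    """
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    # Count how many distinct k-mers share each multiplicity.
    return Counter(kmer_counts.values())
```

With evenly covered DNA reads, plotting this histogram gives the peaks described above; with RNA-seq the multiplicities mostly track expression level, which is why the peaks are not interpretable.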



              • #8
                It sounds like what you really want is to identify high-quality SNPs in each sample, and then use much more relaxed parameters to call them in the other samples. Lowering the thresholds for all samples will help, but you will still have a number of SNPs that happen to be just above the threshold in one sample and below it in the others.

                One way to do it is to generate a high threshold list and a low threshold list. Then for each high-threshold SNP check for presence in the low threshold list.
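The two-list idea could look roughly like this. It is a sketch under assumptions: each call set is represented as a Python set of (chromosome, position, changed nucleotide) tuples, and the function name `presence_matrix` is hypothetical.

```python
def presence_matrix(high_calls, low_calls):
    """Check every high-threshold SNP against each sample's low-threshold list.

    high_calls: dict mapping sample -> set of SNPs passing the strict filter.
    low_calls:  dict mapping sample -> set of SNPs passing the relaxed filter.
    Returns a dict mapping SNP -> {sample: True/False}, where True means the
    SNP is present in that sample's relaxed list.
    """
    # A SNP becomes a candidate if it is high-confidence in at least one sample.
    candidates = set().union(*high_calls.values())

    return {
        snp: {sample: snp in low_calls[sample] for sample in low_calls}
        for snp in candidates
    }
```

SNPs that only ever appear in a relaxed list are never promoted to candidates, so lowering the second threshold does not add low-quality sites to the comparison.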



                • #9
                  There is no simple solution to variant calling and genotyping in low-coverage loci using RNA-Seq data - shotgun DNA and a proper multi-sample caller would give you much more robust data.

                  You should genotype all variant loci (from any sample) in all samples; however, this isn't possible with RNAseqmut. You need to know whether the "missing call" is a "weak" heterozygote (with variant allele coverage below your threshold, e.g. because the SNP allele has a lower expression than the reference allele or the gene has a lower expression in that sample), a homozygous variant with low expression, or a homozygous reference with high expression.
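To make the distinction between these cases concrete, here is a rough sketch that labels one site in one sample from its raw read counts. The labels, the thresholds, and the function name `classify_site` are illustrative assumptions, not the behaviour of any existing caller; a proper multi-sample genotyper would use a statistical model rather than hard cutoffs.

```python
def classify_site(ref_reads, alt_reads, min_depth=20, min_alt_fraction=0.40):
    """Label a variant locus in a sample where no SNP was called.

    ref_reads: reads supporting the reference nucleotide.
    alt_reads: reads supporting the changed nucleotide.
    """
    depth = ref_reads + alt_reads
    if depth == 0:
        # Gene not expressed here: the genotype is simply unknown.
        return "no_coverage"

    if depth < min_depth:
        # Too few reads to trust either way, but record what was seen.
        return "low_coverage_variant" if alt_reads > 0 else "low_coverage_reference"

    if alt_reads / depth >= min_alt_fraction:
        return "variant"
    if alt_reads > 0:
        # Variant allele present but below the fraction threshold.
        return "weak_variant"
    return "reference"
```

Only the "reference" label supports the claim that a SNP is truly absent from a sample; the other labels mean the data cannot rule it out. A handful of variant reads can also be sequencing errors, so "weak_variant" should be read as "unresolved", not as a call.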



                  • #10
                    Brian Bushnell, thank you for the explanation. Indeed, coverage is far from flat.

                    SNPsaurus, thank you for the suggestion! We generated low-threshold lists for all samples and then applied filtering with stricter parameters.

                    Sarvidsson, you are right, but the purpose of the experiment was not SNP discovery per se; we just had RNA-seq data and tried to use what we have to estimate the difference between strains. And it proved to be not that easy. We have come to the same idea as you wrote: pool all SNPs from the four samples of one strain into one list, remove duplicated SNPs, and consider that list representative of the SNP diversity in that strain. This approach eliminates the problem of uneven coverage: if a SNP is called in even just one of the four samples, it will be present in the list for the inter-strain comparison. We used a 20-read cutoff for coverage and required 90% of nucleotides changed to keep heterozygotes from confusing the picture. Now the SNP distribution still does not look equal, but it is consistent with the experimental data.
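The pooling step can be sketched as a set union per strain. The pairwise Jaccard index computed afterwards is not something used in this thread; it is added here only as one simple, assumed way to put a number on the overlap between strains. All function names are hypothetical.

```python
from itertools import combinations


def pool_by_strain(sample_calls, sample_to_strain):
    """Union the SNP sets of all samples belonging to the same strain.

    sample_calls:     dict mapping sample -> set of SNPs called in it.
    sample_to_strain: dict mapping sample -> strain name.
    """
    pooled = {}
    for sample, snps in sample_calls.items():
        # Sets drop duplicates, so a SNP called in several samples counts once.
        pooled.setdefault(sample_to_strain[sample], set()).update(snps)
    return pooled


def jaccard(a, b):
    """Shared SNPs divided by all SNPs seen in either set (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_jaccard(pooled):
    """Jaccard index for every pair of strains, keyed by (strain1, strain2)."""
    return {
        (s1, s2): jaccard(pooled[s1], pooled[s2])
        for s1, s2 in combinations(sorted(pooled), 2)
    }
```

The same `jaccard` function can be applied to pairs of samples within one strain, which gives an intra-strain baseline to compare the inter-strain values against.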

                    I think for the moment the results are satisfactory. Thank you all for your ideas and support!

                    Best,

                    Nina.
