Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • diploid SNP caller on polyploid plant

    Hi,

    How would one interpret SNP calling results when applying reads from a polyploid plant, using diploid caller like samtools or default diploid option of GATK?
    I'm able to get SNP calls, but do the callers just take the two most common alleles (ref and next most common alt)?
    What should I be most careful about when interpreting the results, and especially the confidence values?

    Thanks

  • #2
    Sounds like a PhD project in its own right (i.e. a hard problem).

    Comment


    • #3
      Actually maubp your spot on - it was my Msc project and I'm going back to turn it into part of my Phd. I'm still trying to make sense of some of the data - and exactly what the callers do.

      What's the plant? And the ploidy? I was working with apple which is actually diploid but highly paralogous.

      Samtools does what you say (takes the two most common alleles).
      GATK (UnifiedGenotyper) does something similar but extended (now) to arbitary ploidy. To understand what GATK does read the suplimentary info in the DePristo paper which is really good.

      see response from Geraldine to my thread on GATK forum..



      The link to the Depristo paper is in there too.

      Freebayes from what I can see calls all possible genotypes for arbitary ploidy.

      Freebayes is really good I'd try that and set the ploidy setting appropriate to your organism.. and call as many samples as you have at one time (which will take AGES.. ten days for 14 apple cultivars). And make sure you set the mutation rate to the appropriate value for your organism as well as it makes a big difference to the results.

      Comment


      • #4
        Thanks bambam for the information.

        The plant I'm working on is a Nicotiana species (close relative to N. tabacco), and it's an allotetraploid, genome size between 2-3Gb, and probably 19 xsomes.
        I'm actually working with RNAseq data to call variants, and that's another layer of complexity I have to keep in mind.

        Freebayes looks good, I'll give that a go. Other software I came across that tackles SNP discovery in polyploids is UNEAK (developed for the plant Switchgrass (Tetraploid or Octoploid), and QualitySNPng, though I'm not sure of their usability yet.

        Do you think it is OK to treat SNP candidates called by Samtools/GATK as would be in bi-allelic models (i.e. do quality filtering etc to call/propose the 'major' variant, but keeping in mind the possibility of other alleles)? I am not so concerned with specific likelihood metrics as long as 'good' GQ or PL are also a good measure of a polyploid case. For example, would a GQ of say 90 also be a good confidence for the polyploid case.
        I realize this comes down to understanding the models applied in the various software, but having it explained in more biological context would help as I am having a hard time understanding the equations.

        Comment


        • #5
          Benth (Nicotiana benthamiana)?

          Comment


          • #6
            Yes it is Nicotiana benthamiana

            Comment


            • #7
              Hi Kennels

              I had a hard time understanding equations too. But I think I worked out a way to understand it by combining

              - samtools technical document
              - a paper by Li
              - The suplimentary info in the De Pristo paper

              It is the last of these which is key to understanding this topic without a maths degree. NOTE don't bother with the paper itself just the suplimentary info. I'll dig out my thesis and post to yo the links and how to fit it together.

              Right now I'm dealing with a mini emergency of fried hard drive.

              Also a couple of papers relating to what it actually means to have SNP in paralogous region.. which might be what you were originally asking? (well for polyploid)

              Like I say I was actually dealing with paralogy not true polyploid but I found the SNP sites by Samtools to be good... this is something to think about.. can you separate out calling SNP loci from calling genotypes. I think what you should see is that the bi-allelic model should call the loci fine but get the genotype wrong (does that make see to you?)

              When I mentioned I still had data to analyse this is what I was talking about.. comparing separately the loci and the gentypes called by biallelic caller (samtools) vs freebayes. I'm expecting to see Samtools mostly find the same sites but call different genotypes

              Comment


              • #8
                Originally posted by whataBamBam View Post
                Hi Kennels

                I had a hard time understanding equations too. But I think I worked out a way to understand it by combining

                - samtools technical document
                - a paper by Li
                - The suplimentary info in the De Pristo paper

                It is the last of these which is key to understanding this topic without a maths degree. NOTE don't bother with the paper itself just the suplimentary info. I'll dig out my thesis and post to yo the links and how to fit it together.

                Right now I'm dealing with a mini emergency of fried hard drive.

                Also a couple of papers relating to what it actually means to have SNP in paralogous region.. which might be what you were originally asking? (well for polyploid)

                Like I say I was actually dealing with paralogy not true polyploid but I found the SNP sites by Samtools to be good... this is something to think about.. can you separate out calling SNP loci from calling genotypes. I think what you should see is that the bi-allelic model should call the loci fine but get the genotype wrong (does that make see to you?)

                When I mentioned I still had data to analyse this is what I was talking about.. comparing separately the loci and the genotypes called by biallelic caller (samtools) vs freebayes. I'm expecting to see Samtools mostly find the same sites but call different genotypes
                Thanks bambam for your input and advice.
                And you actually helped phrase my question better.

                Yes I am more interested in identifying the SNV sites rather than assigning a genotype, but I was concerned about the confidence of such sites in the first instance. After some more thought I can probably rely on the QUAL and DP4/AD fields (ref vs alt read depths), and then manually look at the actual read mappings at these sites for the genes I am interested in. Furthering that I can then decide on how many variants/paralogs that could possibly exist at the site, and by extension the potential number of paralogous transcripts.

                I did try samtools, gatk and freebayes now, and you are right that there is good agreement, especially between samtools and freebayes, in identifying the sites (gatk was much more stringent, and so less SNVs).

                I do have the papers you describe above, so no worries, but if you have a summary on 'how to fit it together', that would be great
                hope you got a handle on your fried hard drive.

                Comment


                • #9
                  Yeah I think it's all under control now.

                  Exactly a diploid caller cannot (even in principle) give you the correct genotype because your picking between a different set of possible genotypes in the first place.

                  The situation is described here

                  http://bioinformatics.oxfordjournals.../27/3/303.long

                  in the context of SNP array. For diploid there are three genotypes, (AA, AB, BB) and with one paraologous region you get 5 (AAAA, AABB, BBBB, ABBB, AAAB).

                  Incidentally.. going off on this SNP array tangent.. these guys

                  http://www.plosone.org/article/info%...l.pone.0028334

                  show the same in maize that we see in apple.. 4 types of cluster that I call 'Mendelian', 'Shifted', 'Stretched' and 'Polyploid Mendelian' depending on how the genotype varies in the paralogous region.

                  But in terms of identifying the site the Samtools / GATK default mode seems to work fine as you also observe. I think this is what we should expect - so long as the alternative allele is the same in the two paralogous regions (sorry I keep talking about paralogy and you have a true polyploid but the principle is the same right?) - i.e. we don't see the genotype ABAC.. that would complicate things much further! And then we would be in trouble using a standard SNP caller... but we don't see this in nature right?? We wouldn't expect to see that.

                  So as long as we are only dealing with Reference allele and one alternative allele it doesn't matter how many copies you have.. so long as your only interested in identifying sites

                  We also tried to design a filter based on read depth - but again our goal was slightly different (I think) since it was paralogy which was the headache and we were basically trying to avoid calling false positive snps as a result of the paralogy.. i.e. reads from a paralogous region piling up and differences between the two regions appearing to be snps.

                  This didn't work - in any case there was no mean difference in read depth for verified false snps and verified true snps - but as I say out goal was subtly different.. QUAL did work a bit better as a predictor of this.

                  In terms of getting a handle on the equations I'd be happy to get a second opinion on whether this makes sense to you but this is how I get my head round it..

                  Paper by Li (if we are talking about the same one) states that Samtools and GATK statistical models are the same except for a couple of minor differences. GATK paper explains (in the supplimentary info) in plain english how GATK works... which means if GATK and Samtools are the same it also explains how Samtools works! Which means you don't have to follow the entirely mathematical description of how Samtools works - apart from the bits where it differs from GATK. So I then went back to Samtools technical doc (not Li paper but the online doc for Samtools if you know the one that I mean) and just concentrated on understanding the bit that made it different to GATK.. which luckily was relatively comprehensible. I don't remember right now the difference but I wrote it in my thesis. It's pretty minor though.

                  Then there is a fourth paper that I will dig out which describes in the intro the way the 3 callers are partitioning probabilities over genotytpes.. which pretty much confirms that GATK and Samtools are similar with GATK applying some pre filtering heuristics and that Freebayes is completely different from the others. Then there is Freebayes paper whcih is also fairly impenetrable. I don't really understand how Freebayes works sorry. All I understand is that it is fundementally different, and from using it I'd say it's pretty good. And this fourth paper that I will find the link for states that it partitions over all possible genotypes.

                  Does that mean that even if you were to see ABAC then Freebayes could call that??

                  Incidentally I'll be keen to try out this UNEAK caller you mentioned.

                  Comment


                  • #10
                    Originally posted by Kennels View Post
                    Thanks bambam for the information.

                    The plant I'm working on is a Nicotiana species (close relative to N. tabacco), and it's an allotetraploid, genome size between 2-3Gb, and probably 19 xsomes.
                    Hi,

                    I came to this post just now: I'm working on benthamiana too, calling variants and had the same question in my head. Since you asked this quite a time ago, maybe could you add up how you finally made the allotetraploid call? 'cause I'm struggling with it right now, and trying different pipelines (mpileup and gatk, at the moment). Do you have any suggestions?

                    Help would be very appreciated!

                    Comment


                    • #11
                      freebayes has a ploidy parameter and can call higher ploidy samples.
                      Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      9 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      49 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      67 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X