SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Non-model polyploid plant Illumina RNA-seq: Paired/single ends? Fragment size range? squirrel24 Illumina/Solexa 2 10-25-2012 05:16 PM
Samtools SNP Caller versus MAQ SNP caller BertieWooster Bioinformatics 5 05-18-2012 01:41 PM
Single nucleotide polymorphism (SNP) discovery in polyploid ramouz87 Bioinformatics 1 07-09-2010 07:25 AM
SNP Caller for BAM Files AnamikaDarwin Bioinformatics 0 01-22-2010 05:53 AM

Reply
 
Thread Tools
Old 10-20-2013, 06:47 PM   #1
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149
Default diploid SNP caller on polyploid plant

Hi,

How would one interpret SNP calling results when applying reads from a polyploid plant, using diploid caller like samtools or default diploid option of GATK?
I'm able to get SNP calls, but do the callers just take the two most common alleles (ref and next most common alt)?
What should I be most careful about when interpreting the results, and especially the confidence values?

Thanks
Kennels is offline   Reply With Quote
Old 10-21-2013, 12:00 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Sounds like a PhD project in its own right (i.e. a hard problem).
maubp is offline   Reply With Quote
Old 10-21-2013, 01:02 AM   #3
whataBamBam
Member
 
Location: Italy

Join Date: May 2013
Posts: 27
Default

Actually maubp your spot on - it was my Msc project and I'm going back to turn it into part of my Phd. I'm still trying to make sense of some of the data - and exactly what the callers do.

What's the plant? And the ploidy? I was working with apple which is actually diploid but highly paralogous.

Samtools does what you say (takes the two most common alleles).
GATK (UnifiedGenotyper) does something similar but extended (now) to arbitary ploidy. To understand what GATK does read the suplimentary info in the DePristo paper which is really good.

see response from Geraldine to my thread on GATK forum..

http://gatkforums.broadinstitute.org...092/gatk-paper

The link to the Depristo paper is in there too.

Freebayes from what I can see calls all possible genotypes for arbitary ploidy.

Freebayes is really good I'd try that and set the ploidy setting appropriate to your organism.. and call as many samples as you have at one time (which will take AGES.. ten days for 14 apple cultivars). And make sure you set the mutation rate to the appropriate value for your organism as well as it makes a big difference to the results.
whataBamBam is offline   Reply With Quote
Old 10-21-2013, 03:06 PM   #4
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149
Default

Thanks bambam for the information.

The plant I'm working on is a Nicotiana species (close relative to N. tabacco), and it's an allotetraploid, genome size between 2-3Gb, and probably 19 xsomes.
I'm actually working with RNAseq data to call variants, and that's another layer of complexity I have to keep in mind.

Freebayes looks good, I'll give that a go. Other software I came across that tackles SNP discovery in polyploids is UNEAK (developed for the plant Switchgrass (Tetraploid or Octoploid), and QualitySNPng, though I'm not sure of their usability yet.

Do you think it is OK to treat SNP candidates called by Samtools/GATK as would be in bi-allelic models (i.e. do quality filtering etc to call/propose the 'major' variant, but keeping in mind the possibility of other alleles)? I am not so concerned with specific likelihood metrics as long as 'good' GQ or PL are also a good measure of a polyploid case. For example, would a GQ of say 90 also be a good confidence for the polyploid case.
I realize this comes down to understanding the models applied in the various software, but having it explained in more biological context would help as I am having a hard time understanding the equations.
Kennels is offline   Reply With Quote
Old 10-21-2013, 03:09 PM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Benth (Nicotiana benthamiana)?
maubp is offline   Reply With Quote
Old 10-21-2013, 03:13 PM   #6
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149
Default

Yes it is Nicotiana benthamiana
Kennels is offline   Reply With Quote
Old 10-22-2013, 04:43 PM   #7
whataBamBam
Member
 
Location: Italy

Join Date: May 2013
Posts: 27
Default

Hi Kennels

I had a hard time understanding equations too. But I think I worked out a way to understand it by combining

- samtools technical document
- a paper by Li
- The suplimentary info in the De Pristo paper

It is the last of these which is key to understanding this topic without a maths degree. NOTE don't bother with the paper itself just the suplimentary info. I'll dig out my thesis and post to yo the links and how to fit it together.

Right now I'm dealing with a mini emergency of fried hard drive.

Also a couple of papers relating to what it actually means to have SNP in paralogous region.. which might be what you were originally asking? (well for polyploid)

Like I say I was actually dealing with paralogy not true polyploid but I found the SNP sites by Samtools to be good... this is something to think about.. can you separate out calling SNP loci from calling genotypes. I think what you should see is that the bi-allelic model should call the loci fine but get the genotype wrong (does that make see to you?)

When I mentioned I still had data to analyse this is what I was talking about.. comparing separately the loci and the gentypes called by biallelic caller (samtools) vs freebayes. I'm expecting to see Samtools mostly find the same sites but call different genotypes
whataBamBam is offline   Reply With Quote
Old 10-22-2013, 09:51 PM   #8
Kennels
Senior Member
 
Location: Sydney

Join Date: Feb 2011
Posts: 149
Default

Quote:
Originally Posted by whataBamBam View Post
Hi Kennels

I had a hard time understanding equations too. But I think I worked out a way to understand it by combining

- samtools technical document
- a paper by Li
- The suplimentary info in the De Pristo paper

It is the last of these which is key to understanding this topic without a maths degree. NOTE don't bother with the paper itself just the suplimentary info. I'll dig out my thesis and post to yo the links and how to fit it together.

Right now I'm dealing with a mini emergency of fried hard drive.

Also a couple of papers relating to what it actually means to have SNP in paralogous region.. which might be what you were originally asking? (well for polyploid)

Like I say I was actually dealing with paralogy not true polyploid but I found the SNP sites by Samtools to be good... this is something to think about.. can you separate out calling SNP loci from calling genotypes. I think what you should see is that the bi-allelic model should call the loci fine but get the genotype wrong (does that make see to you?)

When I mentioned I still had data to analyse this is what I was talking about.. comparing separately the loci and the genotypes called by biallelic caller (samtools) vs freebayes. I'm expecting to see Samtools mostly find the same sites but call different genotypes
Thanks bambam for your input and advice.
And you actually helped phrase my question better.

Yes I am more interested in identifying the SNV sites rather than assigning a genotype, but I was concerned about the confidence of such sites in the first instance. After some more thought I can probably rely on the QUAL and DP4/AD fields (ref vs alt read depths), and then manually look at the actual read mappings at these sites for the genes I am interested in. Furthering that I can then decide on how many variants/paralogs that could possibly exist at the site, and by extension the potential number of paralogous transcripts.

I did try samtools, gatk and freebayes now, and you are right that there is good agreement, especially between samtools and freebayes, in identifying the sites (gatk was much more stringent, and so less SNVs).

I do have the papers you describe above, so no worries, but if you have a summary on 'how to fit it together', that would be great
hope you got a handle on your fried hard drive.
Kennels is offline   Reply With Quote
Old 10-22-2013, 11:07 PM   #9
whataBamBam
Member
 
Location: Italy

Join Date: May 2013
Posts: 27
Default

Yeah I think it's all under control now.

Exactly a diploid caller cannot (even in principle) give you the correct genotype because your picking between a different set of possible genotypes in the first place.

The situation is described here

http://bioinformatics.oxfordjournals.../27/3/303.long

in the context of SNP array. For diploid there are three genotypes, (AA, AB, BB) and with one paraologous region you get 5 (AAAA, AABB, BBBB, ABBB, AAAB).

Incidentally.. going off on this SNP array tangent.. these guys

http://www.plosone.org/article/info%...l.pone.0028334

show the same in maize that we see in apple.. 4 types of cluster that I call 'Mendelian', 'Shifted', 'Stretched' and 'Polyploid Mendelian' depending on how the genotype varies in the paralogous region.

But in terms of identifying the site the Samtools / GATK default mode seems to work fine as you also observe. I think this is what we should expect - so long as the alternative allele is the same in the two paralogous regions (sorry I keep talking about paralogy and you have a true polyploid but the principle is the same right?) - i.e. we don't see the genotype ABAC.. that would complicate things much further! And then we would be in trouble using a standard SNP caller... but we don't see this in nature right?? We wouldn't expect to see that.

So as long as we are only dealing with Reference allele and one alternative allele it doesn't matter how many copies you have.. so long as your only interested in identifying sites

We also tried to design a filter based on read depth - but again our goal was slightly different (I think) since it was paralogy which was the headache and we were basically trying to avoid calling false positive snps as a result of the paralogy.. i.e. reads from a paralogous region piling up and differences between the two regions appearing to be snps.

This didn't work - in any case there was no mean difference in read depth for verified false snps and verified true snps - but as I say out goal was subtly different.. QUAL did work a bit better as a predictor of this.

In terms of getting a handle on the equations I'd be happy to get a second opinion on whether this makes sense to you but this is how I get my head round it..

Paper by Li (if we are talking about the same one) states that Samtools and GATK statistical models are the same except for a couple of minor differences. GATK paper explains (in the supplimentary info) in plain english how GATK works... which means if GATK and Samtools are the same it also explains how Samtools works! Which means you don't have to follow the entirely mathematical description of how Samtools works - apart from the bits where it differs from GATK. So I then went back to Samtools technical doc (not Li paper but the online doc for Samtools if you know the one that I mean) and just concentrated on understanding the bit that made it different to GATK.. which luckily was relatively comprehensible. I don't remember right now the difference but I wrote it in my thesis. It's pretty minor though.

Then there is a fourth paper that I will dig out which describes in the intro the way the 3 callers are partitioning probabilities over genotytpes.. which pretty much confirms that GATK and Samtools are similar with GATK applying some pre filtering heuristics and that Freebayes is completely different from the others. Then there is Freebayes paper whcih is also fairly impenetrable. I don't really understand how Freebayes works sorry. All I understand is that it is fundementally different, and from using it I'd say it's pretty good. And this fourth paper that I will find the link for states that it partitions over all possible genotypes.

Does that mean that even if you were to see ABAC then Freebayes could call that??

Incidentally I'll be keen to try out this UNEAK caller you mentioned.
whataBamBam is offline   Reply With Quote
Old 11-10-2016, 04:13 AM   #10
Macspider
Member
 
Location: Vienna

Join Date: Feb 2016
Posts: 35
Default

Quote:
Originally Posted by Kennels View Post
Thanks bambam for the information.

The plant I'm working on is a Nicotiana species (close relative to N. tabacco), and it's an allotetraploid, genome size between 2-3Gb, and probably 19 xsomes.
Hi,

I came to this post just now: I'm working on benthamiana too, calling variants and had the same question in my head. Since you asked this quite a time ago, maybe could you add up how you finally made the allotetraploid call? 'cause I'm struggling with it right now, and trying different pipelines (mpileup and gatk, at the moment). Do you have any suggestions?

Help would be very appreciated!
Macspider is offline   Reply With Quote
Old 11-10-2016, 06:48 PM   #11
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

freebayes has a ploidy parameter and can call higher ploidy samples.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
SNPsaurus is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:15 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO