SEQanswers

Old 02-27-2015, 06:30 AM   #1
NinaG
Junior Member
 
Location: Pushchino, Russia

Join Date: Nov 2014
Posts: 7
Tool to estimate genetic heterogeneity using SNP

Hello everyone,

I have a question related to using SNPs as markers of population heterogeneity.

We have four fly strains that are supposed to have the same genetic background. To check whether this is the case, we performed SNP calling on the whole transcriptome, using four samples for each of the four strains. Then we compared four samples - one from each strain - using a Venn diagram. The results were discouraging: only half of the SNPs (~10000) were shared by all four strains, and the rest were strain-specific or shared by two or three strains (attached below). Then we took the four samples from one strain and did the same, and the picture was quite similar (attached below). In my opinion, the fact that intra-strain variability is the same as inter-strain variability indicates that the background in all strains is more or less the same. The question is: are there any tools that would statistically support this observation?

I have found some R packages for population genetics (DEMEtics, popgen, genetics, pegas) that work with SNPs, but they need allele frequencies, which I essentially do not have. Another constraint (or not?) is that one sample in my experiment contains pooled RNA from 60 flies, whereas the listed R tools treat one individual as a sample.

I would highly appreciate any advice!

Best regards,

Nina.
Attached Images
File Type: png Inter-strain comparison.png (96.9 KB, 8 views)
File Type: png Intra-strain comparison.png (97.0 KB, 7 views)
Old 02-27-2015, 06:43 AM   #2
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137

Just a question from the technical/analytical side (I like to check the technical things before going too deep into biological reasoning): did you look at the quality of the data underlying the calls for the non-shared SNPs (and set the filters appropriately)? For example, if there is a trend towards low coverage for the non-shared SNPs (given that this is RNA-Seq), or they fall in problematic regions (splice junctions), this may explain the observed heterogeneity...
Old 02-28-2015, 09:43 PM   #3
NinaG
Junior Member
 
Location: Pushchino, Russia

Join Date: Nov 2014
Posts: 7

Hi sarvidsson, thank you for the reply!

We used paired-end RNAseq on mRNA (there should not be any problems related to splice junctions, should there?), with 90 million reads per sample. For SNP calling we used RNAseqmut (https://github.com/davidliwei/rnaseqmut), which, aside from the SNP, its coordinate, etc., reports the number of reads supporting the reference and the changed nucleotide, for forward and reverse reads (4 columns in total). We made the initial SNP calls using a threshold of 20 reads (total number of reads, summed over all four columns). Then we selected only those SNPs that had at least 50 reads and >40% of nucleotides changed. The Venn diagrams are plotted from these SNP lists.

Now we are checking whether there is a difference in coverage between shared and non-shared SNPs.
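
For readers following along, the filtering described in this post could be sketched roughly as below. This is only an illustration: the `SnpCall` record and its field names are hypothetical and do not reflect the actual RNAseqmut output layout. The only thing taken from the post is the idea of four read counts (reference/changed, forward/reverse) and the 50-read / >40% cutoffs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpCall:
    """Hypothetical record: one candidate SNP with four read counts."""
    chrom: str
    pos: int
    ref_fwd: int
    ref_rev: int
    alt_fwd: int
    alt_rev: int

    @property
    def depth(self) -> int:
        # Total reads, summed over all four count columns.
        return self.ref_fwd + self.ref_rev + self.alt_fwd + self.alt_rev

    @property
    def alt_fraction(self) -> float:
        # Fraction of reads carrying the changed nucleotide.
        alt = self.alt_fwd + self.alt_rev
        return alt / self.depth if self.depth else 0.0


def passes(call: SnpCall, min_depth: int = 50, min_alt_fraction: float = 0.40) -> bool:
    """Keep a call with at least min_depth reads and more than min_alt_fraction changed."""
    return call.depth >= min_depth and call.alt_fraction > min_alt_fraction


calls = [
    SnpCall("2L", 1000, 20, 25, 22, 23),  # depth 90, 50% changed: kept
    SnpCall("2L", 2000, 10, 10, 10, 10),  # depth 40: fails the depth cutoff
    SnpCall("3R", 3000, 40, 40, 10, 10),  # depth 100, 20% changed: fails the fraction cutoff
]
kept = [(c.chrom, c.pos) for c in calls if passes(c)]
print(kept)
```

Note that the second call is dropped purely because of depth, even though half of its reads carry the change. That matters when the same cutoff is applied independently to every sample.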
Old 02-28-2015, 09:56 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

It might be useful to generate a kmer frequency histogram. That's what I typically use to estimate heterozygosity for a novel organism without a reference. When you have a reference, there are theoretically much better ways to calculate heterozygosity, but I don't know of any specific tools for that purpose.
Old 02-28-2015, 09:57 PM   #5
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 489

It sounds like you used very strict parameters to call a SNP if I understood you correctly (read depth of 50 or greater, SNP seen in 40% of reads). My question is if you required those levels in all samples. What if the SNP has a read depth of 80 in sample 1 and 40 in sample 2? If you require a read depth of 50 then the output would say the SNP is in sample 1 and not in sample 2. Yet saying the SNP is not in sample 2 would be clearly wrong.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 03-01-2015, 01:56 AM   #6
NinaG
Junior Member
 
Location: Pushchino, Russia

Join Date: Nov 2014
Posts: 7

Brian Bushnell, could you please explain what a kmer frequency histogram is? I am new to this field and don't know what it is. We searched for SNPs against the reference assembly of the fruit fly genome, rather than directly comparing one transcriptome to another. We were thinking about what to do with heterozygous SNPs and decided to leave them out of the analysis. The reason is that one RNAseq sample in our case consists of the transcriptomes of 60 flies, so, for example, 40% changed nucleotides may mean not only heterozygosity but also that 40% of those 60 flies carry the SNP and 60% do not. I am not sure that we can discriminate between these cases from our data.

SNPsaurus, yep, that proved to be true. Using IGV, I checked whether SNPs that are called in only one strain really exist in the other strains, and - tadamm! - they really exist, but coverage is lower than 50 reads and the percentage of changed nucleotides is lower than 40%. So we decided to lower the coverage threshold to 20 reads and to raise the threshold for the percentage of changed nucleotides to 80%, to eliminate these ambiguous heterozygotes from the analysis. Let's see what it gives.

Sarvidsson, thank you for the idea about coverage! Sometimes obvious things are like glasses on your head, which you cannot find yourself.

Last edited by NinaG; 03-01-2015 at 01:59 AM.
Old 03-01-2015, 09:07 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by NinaG
Brian Bushnell, could you please explain what a kmer frequency histogram is?
Actually, never mind... I forgot this was RNA-seq data. Due to the highly variable coverage a kmer frequency histogram won't be informative. It's useful in DNA experiments to determine ploidy and het rates. Essentially, it's a graph indicating the coverage distribution of the unique portions of the genome, so you get one peak for homozygous areas, one peak for heterozygous areas, one peak for 2-copy repeats, etc. But it doesn't work without relatively flat read coverage.
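
To make the term concrete, here is a minimal sketch of how a kmer frequency histogram can be computed from raw reads. It is a toy illustration only: the function name is made up, reverse-complement (canonical) kmers are ignored, and real tools use far more memory-efficient counting. As noted above, the result is only informative when coverage is roughly flat, which is not the case for RNA-seq.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return {occurrence_count: number_of_distinct_kmers_with_that_count}."""
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Histogram over the counts themselves: how many kmers were seen once, twice, ...
    return Counter(kmer_counts.values())


reads = ["ACGTAC", "CGTACG", "ACGTAC"]
hist = kmer_histogram(reads, 4)
for count in sorted(hist):
    print(count, hist[count])
```

With real DNA data, plotting the number of distinct kmers against the occurrence count gives the peaks described above (heterozygous regions at roughly half the coverage of homozygous ones).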
Old 03-01-2015, 10:29 AM   #8
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 489

It sounds like what you really want is to identify high-quality SNPs in each sample, and then use much more relaxed parameters to call them in the other samples. Lowering the thresholds for all samples will help, but you will still have a number of SNPs that happen to be just above the threshold in one sample and below it in the others.

One way to do it is to generate a high threshold list and a low threshold list. Then for each high-threshold SNP check for presence in the low threshold list.
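
A minimal sketch of this two-list idea, assuming each sample's calls have already been reduced to sets of (chromosome, position) keys under a strict and a relaxed filter. The function and variable names are hypothetical, not part of any tool mentioned in this thread.

```python
def rescue_calls(strict, relaxed):
    """For each sample, report the SNPs that pass the strict filter in at
    least one sample AND pass the relaxed filter in this sample.

    strict, relaxed: dict mapping sample name -> set of SNP keys.
    """
    # Every SNP that is confidently called somewhere.
    confident = set().union(*strict.values())
    # A confident SNP counts as present wherever the relaxed filter sees it.
    return {sample: confident & relaxed[sample] for sample in relaxed}


strict = {
    "s1": {("2L", 1000)},
    "s2": {("3R", 3000)},
}
relaxed = {
    "s1": {("2L", 1000), ("X", 500)},
    "s2": {("2L", 1000), ("3R", 3000)},
}
presence = rescue_calls(strict, relaxed)
print(presence)
```

Here the SNP at 2L:1000 is only strict-quality in s1, but it is still counted as present in s2 because it passes the relaxed filter there. The SNP at X:500 is dropped because no sample supports it at the strict level.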
Old 03-02-2015, 12:33 AM   #9
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137

There is no simple solution to variant calling and genotyping in low-coverage loci using RNA-Seq data - shotgun DNA and a proper multi-sample caller would give you much more robust data.

You should genotype all variant loci (from any sample) in all samples - however, this isn't possible with RNAseqmut. You need to know whether the "missing call" is a "weak" heterozygote (with variant allele coverage below your threshold - e.g. the SNP allele has lower expression than the reference allele, or the gene has lower expression in that sample), a homozygous variant with low expression, or a homozygous reference with high expression.
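
As a rough sketch of what distinguishing these cases could look like, assume you can obtain reference and variant read counts at every candidate locus in every sample (for example from a pileup). The function name, labels, and the `low` cutoff below are illustrative assumptions, not the behaviour of any particular caller.

```python
def classify_locus(ref_reads, alt_reads, min_depth=20, low=0.1, high=0.9):
    """Label one locus in one sample from its read counts.

    Returns "no_call" when coverage is too low to say anything, so that
    "not called" is never silently treated as "reference".
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "no_call"      # low expression: genotype unknown
    frac = alt_reads / depth
    if frac >= high:
        return "variant"      # essentially all reads carry the change
    if frac <= low:
        return "reference"    # well covered, change essentially absent
    return "mixed"            # heterozygote, or a mixed pool of individuals


print(classify_locus(5, 3))    # too few reads
print(classify_locus(0, 40))
print(classify_locus(60, 0))
print(classify_locus(30, 20))
```

The key design choice is the explicit "no_call" state: a Venn diagram built only from positive calls cannot tell "no_call" apart from "reference".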
Old 03-03-2015, 11:51 PM   #10
NinaG
Junior Member
 
Location: Pushchino, Russia

Join Date: Nov 2014
Posts: 7

Brian Bushnell, thank you for the explanation. Indeed, coverage is far from flat.

SNPsaurus, thank you for the suggestion! We generated low-threshold lists for all samples and then applied filtering with stricter parameters.

Sarvidsson, you are right, but the purpose of the experiment was not SNP discovery per se; we simply had RNAseq data and tried to use what we had to estimate the difference between strains. It turned out not to be that easy. We have come to the same idea as you described: pool all SNPs from the four samples of one strain into one list, remove duplicated SNPs, and treat that list as representative of the SNP diversity in that strain. This approach eliminates the problem of uneven coverage: if a SNP is called in even one of the four samples, it will be present in the list used for inter-strain comparison. We used a coverage cutoff of 20 reads and 90% changed nucleotides to keep heterozygotes from contaminating the lists. The SNP distribution still does not look equal, but it is consistent with the experimental data.

I think for the moment the results are satisfactory. Thank you all for your ideas and support!

Best,

Nina.

Tags
population genomics, rnaseq, snp analysis
