SEQanswers

Old 08-23-2018, 07:01 AM   #1
robins91
Junior Member
 
Location: Ohio

Join Date: May 2018
Posts: 7
Low Coverage RAD-Seq Data

Hello,

I ran 48 samples from museum specimens on the MiSeq platform. Unfortunately, each sample has very low coverage (~0.0035x). One contributing factor appears to be that the average read length is 38 bp, which may be due to the large amount of shearing present in museum samples.

I was able to extract ~60 SNPs, but cannot produce any supported trees or STRUCTURE results. I know the low coverage is to blame, but I am not sure what minimum coverage would be usable for an analysis. I typically see papers use coverages of >5x. Papers with lower coverage (the lowest I have seen is 0.5x) typically mention that they would like to acquire further coverage.

However, getting that amount of coverage for one sample would require an entire lane of the HiSeq platform, which would provide 30x more coverage than the MiSeq. If I went for 0.5x coverage, I could run 10 samples.

Is there a hard rule suggesting the minimum amount of coverage for a phylogenetic analysis? Is running these samples on a HiSeq worth the cost?

EDIT: I do have a reference genome for my species.
Old 08-23-2018, 10:52 PM   #2
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 451

You have to be a little careful thinking about read coverage with RAD data...after all, it is meant to sample the genome at a small number of loci. So a "good" RAD sample might get 20X read depth at 10,000 loci, but over a 100 Mb genome that would be less than 1X read depth on average.

How many loci are you trying to sequence in each sample (or, what is the genome and what enzymes did you use)? 48 samples in a MiSeq is probably not enough reads, especially for a museum sample.

Here's another thing to check--what is the alignment rate of the reads to the reference? If it is low, then there may be DNA of bacteria in the sample. We see old samples that sometimes only have traces of what they should have, and lots of DNA of things that have been eating the museum sample.

A MiSeq run is actually often the same cost as a HiSeq, depending on the read lengths. At the University of Oregon facility ( https://gc3f.uoregon.edu/illumina-sequencing ) a lane of single-end 75 bp reads is $1,193 for outside users, and you get hundreds of millions of reads. That's cheaper than any v3 MiSeq run (there, at least). If you have samples with lots of sheared DNA that is getting trimmed short or samples with contaminating species, then brute force sequencing is the way to go. It might still be a failure, so I would look carefully at the reads you do have to check whether it is worth brute forcing it or not. You also need to take extra care that any structure you see comes from the biology of the samples rather than from biases among them.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 08-24-2018, 07:38 AM   #3
robins91
Junior Member
 
Location: Ohio

Join Date: May 2018
Posts: 7

I am using the prairie vole genome (~2.3 Gbp), and I used the EcoRI enzyme. My alignment rate varies from 12 to 80%.

Do you have any suggestions on what I should look for? I have tried filtering the SNPs for quality, but few of the loci are shared between the samples. I am trying to judge how many samples I could send for a useful analysis. What do you think would be an adequate number of reads?
Old 08-24-2018, 10:44 AM   #4
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 451

Was this regular RAD-Seq or ddRAD? There are probably 500,000 EcoRI sites in a 2 Gb genome, so with 48 samples and 25 million reads, even with perfect data you would get only about one read per site, unless you were sampling a subset of them (like EcoRI-EcoRI size selected ddRAD fragments).

The variability in alignment rate is worrisome. It sounds like you have some samples that are pretty pure (the 80% mappers) and some that are mostly something other than vole.

How are you filtering right now? Are you filtering for SNPs that are present in all or most samples? I'd count how many samples have alignment rates above 50%, and try to capture SNPs shared by those samples. So if 15 samples align well, just ask that the SNP locus is sequenced in 30% of the samples. In vcftools, you would do --max-missing .3 (it is opposite of how it sounds, so that allows SNPs with as low as 30% representation). Are you filtering by read depth as well?
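
To make the direction of that threshold concrete, here is a minimal Python sketch of the same idea. It is not vcftools code; the function names and the toy genotype lists are made up for illustration, and the exact behavior of vcftools at the boundary value is not assumed here.

```python
MISSING = {"./.", ".|.", "."}

def call_rate(genotypes):
    """Fraction of samples that have a genotype call at one site."""
    called = sum(1 for g in genotypes if g not in MISSING)
    return called / len(genotypes)

def keep_site(genotypes, min_call_rate=0.3):
    # Same sense as --max-missing: the value is the minimum fraction of
    # samples that must be genotyped (0 keeps everything, 1 allows no
    # missing data), not the maximum fraction allowed to be missing.
    return call_rate(genotypes) >= min_call_rate

# Toy sites with 10 samples each (hypothetical data).
site_a = ["0/1", "./.", "0/0", "1/1", "./.", "./.", "0/0", "./.", "./.", "./."]
site_b = ["0/1"] + ["./."] * 9

print(keep_site(site_a), keep_site(site_b))
```

The first toy site is called in 4 of 10 samples and is kept; the second is called in 1 of 10 and is dropped.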
Old 08-24-2018, 11:14 AM   #5
robins91
Junior Member
 
Location: Ohio

Join Date: May 2018
Posts: 7

I performed regular RAD-Seq.

Yes, the alignment values worry me as well. Some of these samples are over 100 yrs old and have faced severe degradation.

My initial filtering is to only use reads >20 bp to prevent ambiguous alignments, which doesn't eliminate much. Similarly, I marked and deleted duplicates, but few duplicates occurred. Lastly, I filtered for quality based on the GATK recommendations. I have also run analyses with all of these filters relaxed. Still, I could not get much more information.

I have tried running analyses with all available SNPs, SNPs shared by at least 4 samples, and SNPs shared by more samples. No SNPs are present in all samples, and each locus is shared by less than 15% of the samples. I have tried removing samples with low alignment rates as well. The previously mentioned 60 SNPs come from SNPs shared by at least 10 samples, where only 22 samples had >10 SNPs.

Regardless of how much I filter or how little I filter, few SNPs are shared between the samples, and the analyses cannot resolve structure or a tree.
Old 08-24-2018, 11:40 AM   #6
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 451

I think the problem is that you are sampling 500,000 sites, which turns into 1 million possible RAD tags (each cut site has two tags that are sequenced), with an average of 500,000 reads per sample. So at best you will have 1 read at half the RAD loci. Two samples will have 1/4 of the sites in common; 20 samples will have (1/2)^20 of them, which is effectively 0 sites in common.
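
The sharing argument can be sketched in a few lines of Python. This is only a rough model: it assumes reads land on tags uniformly and independently across samples, which real libraries (especially degraded museum samples) will not do, and the function names are made up for illustration.

```python
import math

def fraction_covered(reads, tags):
    """Poisson estimate of the fraction of tags hit by at least one read."""
    return 1.0 - math.exp(-reads / tags)

def fraction_shared(p, n_samples):
    """Fraction of tags covered in every one of n samples,
    if each sample independently covers a fraction p of the tags."""
    return p ** n_samples

tags = 1_000_000    # 2 tags per cut site x 500,000 EcoRI sites
reads = 500_000     # average reads per sample

best_case = reads / tags                 # every read on a different tag
poisson = fraction_covered(reads, tags)  # reads scattered at random

for n in (2, 10, 20):
    print(n,
          round(tags * fraction_shared(best_case, n)),
          round(tags * fraction_shared(poisson, n)))
```

In the best case this gives about 250,000 shared tags for two samples and about one shared tag for 20 samples. With random scatter, a sample covers closer to 39% of the tags than 50%, so the shared fraction drops off even faster.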

A read depth of 1 isn't really enough to get a genotype. GATK might be throwing out many of the loci for SNP quality at that level. You certainly won't genotype both alleles at heterozygous loci.

The problem is that this library may not even get good results with a HiSeq lane. 48 samples will get 5-10 million reads each. With 1 million EcoRI sites, that is getting 5-10X read depth (more likely 2-3X with usual attrition) which is low and that would be for the best samples and best loci within those samples.

Do you still have DNA? I would re-do these as SbfI libraries (10-fold reduction in site number) and sequence on a HiSeq to get good read depth per locus.
Old 08-24-2018, 11:55 AM   #7
robins91
Junior Member
 
Location: Ohio

Join Date: May 2018
Posts: 7

I am slightly confused. A HiSeq lane will provide at least 300 million reads, which for 48 samples would be about 6.2 million reads per sample, but given my read length of 38 bp that would only provide about 0.1x coverage per sample. [Coverage = No. Reads x Read Length / Genome size]
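
As a quick check of that formula, here is a short Python sketch using the figures from this post. The lane yield is taken as 300 million reads, which is what 6.2 million reads per sample across 48 samples implies; the function name is just for illustration.

```python
def genome_coverage(n_reads, read_length_bp, genome_size_bp):
    """Genome-wide average coverage = reads x read length / genome size."""
    return n_reads * read_length_bp / genome_size_bp

lane_reads = 300_000_000        # assumed yield of one HiSeq lane
n_samples = 48
read_length_bp = 38             # average read length after trimming
genome_size_bp = 2_300_000_000  # prairie vole genome, ~2.3 Gbp

reads_per_sample = lane_reads / n_samples
coverage = genome_coverage(reads_per_sample, read_length_bp, genome_size_bp)
print(f"{reads_per_sample:,.0f} reads per sample, {coverage:.2f}x average coverage")
```

This reproduces the roughly 0.1x figure. It is a genome-wide average, so it says nothing about how the reads are distributed across the genome.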

I am not sure how you are getting your numbers. Do you think it would be worth sending maybe 10 samples out, which would fit into the goals of my project?

I do still have DNA. So you think preparing the libraries with SbfI instead of EcoRI would work better?
Old 08-24-2018, 12:09 PM   #8
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 451

You can't think about coverage that way with RAD-Seq. You know that coverage will be zero across most of the genome, but then you will get high read depth at the RAD loci at the EcoRI sites. So I think about how many cut sites there might be (say, 500,000) and how that will turn into 1 million RAD loci (2 per cut site) and then how many reads would be needed per sample (20 reads desired per locus x 1 million loci = 20 million reads needed, then multiply by 2 to allow some samples to sequence less well so really you would need 40 million reads per sample). 40 million reads would be 40M x 38 bp = 1.5 Gbp or .6X coverage, but again, .6X coverage would be high coverage at a million locations and zero coverage almost everywhere so it is a poor way to represent what you are trying to do.
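
The per-locus budgeting described above can be written out as a small Python sketch. The default values are the ones from this post (20 reads per locus, a 2x safety margin, 2 tags per cut site); the function name and parameter names are made up for illustration.

```python
def reads_needed(cut_sites, depth_per_locus=20, safety_factor=2, tags_per_site=2):
    """Reads per sample needed to reach a target depth at every RAD locus."""
    loci = cut_sites * tags_per_site
    return loci * depth_per_locus * safety_factor

reads = reads_needed(500_000)    # EcoRI estimate from this thread
bases = reads * 38               # 38 bp average read length
print(reads, bases, bases / 2_300_000_000)

# An enzyme with roughly 10-fold fewer sites needs 10-fold fewer reads.
print(reads_needed(50_000))
```

With 500,000 cut sites this gives 40 million reads per sample, about 1.5 Gbp of sequence, or roughly 0.66x of a 2.3 Gbp genome when expressed as a genome-wide average.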

If you have a reference, you should count the EcoRI sites and SbfI sites to see how the number of loci would change, and if that would work for the amount of sequencing you can afford to do.
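
A minimal Python sketch of that count is below. It assumes a plain-text FASTA file; the path in the commented usage line is hypothetical, and the helper names are made up for illustration. The recognition sequences used are GAATTC for EcoRI and CCTGCAGG for SbfI. Both are palindromic, so scanning one strand is enough.

```python
SITES = {"EcoRI": "GAATTC", "SbfI": "CCTGCAGG"}

def read_fasta(path):
    """Yield (name, sequence) pairs from a plain-text FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                # Upper-case so soft-masked (lower-case) sequence is counted too.
                chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def count_sites(sequences, sites=SITES):
    """Count occurrences of each recognition sequence across all sequences."""
    counts = {enzyme: 0 for enzyme in sites}
    for seq in sequences:
        for enzyme, motif in sites.items():
            counts[enzyme] += seq.count(motif)
    return counts

# Toy sequences so the sketch runs without a reference file.
demo = ["AAGAATTCTTCCTGCAGGAAGAATTC", "GAATTC"]
print(count_sites(demo))

# With a real reference (hypothetical path):
# counts = count_sites(seq for _, seq in read_fasta("reference.fa"))
```

Each cut site gives two RAD tags, so the expected number of loci is about twice the site count. Sites in regions missing from the assembly will not be counted, so treat the result as an estimate.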
Old 08-24-2018, 12:25 PM   #9
robins91
Junior Member
 
Location: Ohio

Join Date: May 2018
Posts: 7

Okay. However, my samples had very little duplication (<1%) and few loci are shared. Do you think further sequencing will increase those numbers?

Unfortunately, we don't have the ability to prepare the libraries again. Our decision is to send the samples out for further sequencing or not at all. Do you not think that the current libraries are salvageable?
Old 08-24-2018, 12:47 PM   #10
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 451

I'm not sure what is being measured by the duplication level...I thought you were marking PCR duplicates.

If you have 1 million EcoRI sites, then I'd pick the best 15 samples (high alignment rate, longest fragment lengths) and send them to Oregon for single-end 75 bp sequencing. You might consider analyzing the results with a RAD-oriented pipeline like Stacks or pyRAD.

Good luck!
Old 08-24-2018, 01:14 PM   #11
robins91
Junior Member
 
Location: Ohio

Join Date: May 2018
Posts: 7

Thank you for all the help!

Tags
coverage;


All times are GMT -8.