
  • Low Coverage RAD-Seq Data

    Hello,

    I ran 48 samples from museum specimens on the MiSeq platform. Unfortunately, each sample has very low coverage (~0.0035x). One contributing factor appears to be that the average read length is 38 bp, which may be due to the large amount of shearing present in museum samples.

    I was able to extract ~60 SNPs, but cannot produce any supported trees or STRUCTURE results. I know the low coverage is to blame, but I am not sure what minimum coverage would be usable for an analysis. I typically see papers use coverages of >5x. Papers with lower coverage (the lowest I have seen is 0.5x) typically mention that they would like to acquire further coverage.

    However, to get that amount of coverage for one sample would require an entire lane of the HiSeq platform, which would provide 30x the coverage of the MiSeq. If I went for 0.5x coverage, I could run 10 samples.

    Is there a hard rule suggesting the minimum amount of coverage for a phylogenetic analysis? Is running these samples on a HiSeq worth the cost?

    EDIT: I do have a reference genome for my species.

  • #2
    You have to be a little careful thinking about read coverage with RAD data...after all, it is meant to sample the genome at a small number of loci. So a "good" RAD sample might get 20X read depth at 10,000 loci, but over a 100 Mb genome that would be less than 1X read depth on average.
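    The arithmetic in the paragraph above can be sketched in Python. This is only an illustration: the 100 bp read length is an assumption (the post does not state one), and `genome_wide_coverage` is a hypothetical helper name.

```python
def genome_wide_coverage(depth_per_locus, n_loci, read_length_bp, genome_size_bp):
    """Average coverage over the whole genome implied by reads piled up at RAD loci."""
    total_bases = depth_per_locus * n_loci * read_length_bp
    return total_bases / genome_size_bp

# 20X read depth at 10,000 loci over a 100 Mb genome.
# The 100 bp read length is assumed for illustration only.
coverage = genome_wide_coverage(20, 10_000, 100, 100_000_000)
print(f"Genome-wide average: {coverage:.2f}X")
```

    Under that assumption the genome-wide average stays well below 1X even though every sampled locus is covered 20 times.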

    How many loci are you trying to sequence in each sample (or, what is the genome and what enzymes did you use)? 48 samples in a MiSeq is probably not enough reads, especially for a museum sample.

    Here's another thing to check--what is the alignment rate of the reads to the reference? If it is low, then there may be DNA of bacteria in the sample. We see old samples that sometimes only have traces of what they should have, and lots of DNA of things that have been eating the museum sample.

    A MiSeq run is actually often the same cost as a HiSeq, depending on the read lengths. At the University of Oregon facility ( https://gc3f.uoregon.edu/illumina-sequencing ) a lane of single-end 75 bp reads is $1,193 for outside users, and you get hundreds of millions of reads. That's cheaper than any v3 MiSeq run (there, at least). If you have samples with lots of sheared DNA that is getting trimmed short or samples with contaminating species, then brute force sequencing is the way to go. It might still be a failure, so I would look carefully at the reads you do have to check whether it is worth brute forcing it or not, and you need to take extra care that any structure you see is not from biases of the samples rather than the biology of the samples.
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com


    • #3
      I am using the prairie vole genome (~2.3 Gbp), and I used the EcoRI enzyme. My alignment rate varies from 12 to 80%.

      Do you have any suggestions on what I should look for? I have tried filtering the SNPs for quality, but few of the loci are shared between the samples. I am trying to judge how many samples I could send for a useful analysis. What do you think would be an adequate amount of reads?


      • #4
        Was this regular RAD-Seq or ddRAD? There are probably 500,000 EcoRI sites in a 2 Gb genome, so if you had 48 samples and 25 million reads, even with perfect data you would get only about one read per site, unless you were sampling a subset of them (like EcoRI-EcoRI size selected ddRAD fragments).

        The variability in alignment rate is worrisome. It sounds like you have some samples that are pretty pure (the 80% mappers) and some that are mostly something other than vole.

        How are you filtering right now? Are you filtering for SNPs that are present in all or most samples? I'd count how many samples have alignment rates above 50%, and try to capture SNPs shared by those samples. So if 15 samples align well, just ask that the SNP locus is sequenced in 30% of the samples. In vcftools, you would do --max-missing .3 (it is opposite of how it sounds, so that allows SNPs with as low as 30% representation). Are you filtering by read depth as well?
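        The threshold logic described above can be sketched in Python. This only mimics the behaviour as described in the post (keep a site when at least 30% of samples have a called genotype); it is not the vcftools implementation, and `call_rate` and `keep_site` are hypothetical helper names. Missing genotypes are represented as `None`.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype at one site."""
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)

def keep_site(genotypes, min_call_rate=0.3):
    """Keep the site if at least min_call_rate of the samples are genotyped."""
    return call_rate(genotypes) >= min_call_rate

# 48 samples: one site called in 15 samples, another called in only 10.
site_a = ["0/1"] * 15 + [None] * 33
site_b = ["0/0"] * 10 + [None] * 38
print(keep_site(site_a), keep_site(site_b))
```

        With 15 of 48 samples called the site sits just above the 0.3 threshold, while a site called in 10 samples falls below it.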

        • #5
          I performed regular RAD-Seq.

          Yes, the alignment values worry me as well. Some of these samples are over 100 yrs old and have faced severe degradation.

          My initial filtering is to only use reads >20 bp to prevent ambiguous alignments, which doesn't eliminate much. Similarly, I marked and deleted duplicates, but few duplicates occurred. Lastly, I filtered for quality based on the GATK recommendations. I have also run analyses relaxing all of these filtering steps. Still, I could not get much more information.

          I have tried running analyses with all available SNPs, SNPs shared by at least 4 samples, and SNPs shared by more samples. No SNPs are present in all samples, and each locus is shared by less than 15% of the samples. I have tried removing samples with low alignment rates as well. The previously mentioned 60 SNPs come from SNPs shared by at least 10 samples, where only 22 samples had >10 SNPs.

          Regardless of how much I filter or how little I filter, few SNPs are shared between the samples, and the analyses cannot resolve structure or a tree.


          • #6
            I think the problem is that you are sampling 500,000 sites, which turns into 1 million possible RAD tags (each cut site has two tags that are sequenced), with an average of 500,000 reads per sample. So at best you will have 1 read at half the RAD loci. Two samples will have 1/4 of the sites in common, and 20 samples will have (1/2)^20 of them, which is essentially 0 sites in common.
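            The sharing estimate above can be sketched in Python. It uses the same simplification as the post: every read lands on a different tag, and each sample hits tags independently. `expected_shared_tags` is a hypothetical helper name.

```python
def expected_shared_tags(n_tags, reads_per_sample, n_samples):
    """Expected number of tags sequenced in every one of n_samples samples."""
    # Best case: each read covers a distinct tag, so this is the
    # fraction of tags seen in a single sample (capped at 1).
    p_covered = min(1.0, reads_per_sample / n_tags)
    return n_tags * p_covered ** n_samples

N_TAGS = 1_000_000   # 500,000 cut sites x 2 tags per site
READS = 500_000      # average reads per sample

for n in (1, 2, 10, 20):
    print(n, round(expected_shared_tags(N_TAGS, READS, n), 2))
```

            Under these assumptions the expected overlap halves with every sample added, so it drops from 250,000 tags for two samples to about one tag for 20 samples.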

            A read depth of 1 isn't really enough to get a genotype. GATK might be throwing out many of the loci for SNP quality at that level. You certainly won't genotype both alleles at heterozygous loci.

            The problem is that this library may not even get good results with a HiSeq lane. 48 samples will get 5-10 million reads each. With 1 million EcoRI sites, that is getting 5-10X read depth (more likely 2-3X with usual attrition) which is low and that would be for the best samples and best loci within those samples.

            Do you still have DNA? I would re-do these as SbfI libraries (10-fold reduction in site number) and sequence on a HiSeq to get good read depth per locus.

            • #7
              I am slightly confused. A HiSeq will provide at least 300 million reads, which for 48 samples would be 6.2 million reads per sample, but with my read length of 38 bp that would only provide 0.1x coverage per sample. [Coverage = No. Reads x Read Length / Genome size]
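              The bracketed formula can be checked with a few lines of Python. The genome size of 2.3 Gbp is taken from post #3 (read here as gigabases, which is an assumption about the intended unit), and `whole_genome_coverage` is a hypothetical helper name.

```python
def whole_genome_coverage(n_reads, read_length_bp, genome_size_bp):
    """Coverage = number of reads x read length / genome size."""
    return n_reads * read_length_bp / genome_size_bp

# 6.2 million reads per sample, 38 bp reads, ~2.3 Gbp genome (assumed unit).
per_sample = whole_genome_coverage(6_200_000, 38, 2_300_000_000)
print(f"{per_sample:.2f}x per sample")
```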

              I am not sure how you are getting your numbers. Do you think it would be worth sending maybe 10 samples out, which would fit into the goals of my project?

              I do still have DNA. So you think preparing the libraries with SbfI instead of EcoRI would work better?


              • #8
                You can't think about coverage that way with RAD-Seq. You know that coverage will be zero across most of the genome, but then you will get high read depth at the RAD loci at the EcoRI sites. So I think about how many cut sites there might be (say, 500,000) and how that will turn into 1 million RAD loci (2 per cut site) and then how many reads would be needed per sample (20 reads desired per locus x 1 million loci = 20 million reads needed, then multiply by 2 to allow some samples to sequence less well so really you would need 40 million reads per sample). 40 million reads would be 40M x 38 bp = 1.5 Gbp or .6X coverage, but again, .6X coverage would be high coverage at a million locations and zero coverage almost everywhere so it is a poor way to represent what you are trying to do.
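                The read budget worked through above can be written out as a short Python sketch. The 2.3 Gbp genome size is carried over from earlier in the thread, and `reads_needed` is a hypothetical helper name; the factor of 2 is the allowance for uneven samples mentioned in the post.

```python
def reads_needed(cut_sites, depth_per_locus, tags_per_site=2, safety_factor=2):
    """Reads per sample needed to reach the desired depth at every RAD locus."""
    n_loci = cut_sites * tags_per_site
    return n_loci * depth_per_locus * safety_factor

READ_LENGTH_BP = 38
GENOME_SIZE_BP = 2_300_000_000  # ~2.3 Gbp, from earlier in the thread

needed = reads_needed(cut_sites=500_000, depth_per_locus=20)
total_bases = needed * READ_LENGTH_BP
print(f"{needed:,} reads per sample")
print(f"{total_bases / 1e9:.2f} Gbp, {total_bases / GENOME_SIZE_BP:.2f}X genome-wide")
```

                The genome-wide figure comes out below 1X even though the target is 20 reads at each locus, which is why whole-genome coverage is a poor summary for RAD data.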

                If you have a reference, you should count the EcoRI sites and SbfI sites to see how the number of loci would change, and if that would work for the amount of sequencing you can afford to do.
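                One way to do that count is sketched below in Python. It is a minimal illustration that only counts exact recognition sequences (GAATTC for EcoRI, CCTGCAGG for SbfI) in a FASTA file. Both motifs are palindromic, so scanning one strand is enough. `count_sites` is a hypothetical helper name, and the demo sequence is made up.

```python
import io

# Recognition sequences; both are palindromic, so one strand suffices.
SITES = {"EcoRI": "GAATTC", "SbfI": "CCTGCAGG"}

def count_sites(fasta_handle, sites=SITES):
    """Count occurrences of each recognition sequence across all FASTA records."""
    counts = {name: 0 for name in sites}
    chunks = []

    def flush():
        # Join the lines of one record so sites spanning line breaks are found.
        seq = "".join(chunks).upper()
        for name, motif in sites.items():
            counts[name] += seq.count(motif)
        chunks.clear()

    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            flush()
        elif line:
            chunks.append(line)
    flush()
    return counts

# Made-up two-record FASTA; the second EcoRI site spans a line break.
demo = io.StringIO(">demo1\nACGGAATTCTTGAAT\nTCAACCTGCAGGTT\n>demo2\ngaattc\n")
print(count_sites(demo))
```

                For a real reference, pass an open file handle instead of the in-memory demo. Each record is held in memory in full, which should be fine chromosome by chromosome for a genome of this size.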

                • #9
                  Okay. However, my samples had very little duplication (<1%) and few loci are shared. Do you think further sequencing will increase those numbers?

                  Unfortunately, we don't have the ability to prepare the libraries again. Our decision is to send the samples out for further sequencing or not at all. Do you not think that the current libraries are salvageable?


                  • #10
                    I'm not sure what is being measured by the duplication level...I thought you were marking PCR duplicates.

                    If you have 1 million EcoRI sites, then I'd pick the best 15 samples (high alignment rate, longest fragment lengths) and send them to Oregon for single-end 75 bp sequencing. You might consider analyzing the results with a RAD-oriented pipeline like Stacks or pyRAD.

                    Good luck!

                    • #11
                      Thank you for all the help!


                      • #12
                        Hello,
                        I did ddRAD-seq on 7 Indian cattle, each sequenced in an individual lane in order to get as much genome coverage as possible. The overall alignment rate was good, around 92%, but the uniquely aligned reads were too few, about 21.5%.
                        What might be the reason? Please give your valuable suggestions.
