  • ddRADSeq data analysis for population structure (Stacks, dDocent...)

    Hi all,

    I'm working on a marine species with no reference genome, following the ddRAD protocol of Peterson et al. with a 400-600 bp size selection. The first pilot library was sequenced on a MiSeq with 250 bp paired-end reads (the plan is to sequence the main library with 300 bp PE reads). The main goal is to study population structure using the SNPs found, so I tried the Stacks pipeline and it gave pretty good results. Unfortunately, Stacks can't take full advantage of long MiSeq reads, which overlap and can be merged. At the moment I'm trying the dDocent pipeline but am having a lot of trouble getting it to work. I was wondering if anyone has used it and how helpful it was. I would also be grateful if anyone could recommend any other pipeline that would suit population genetics purposes.

    Many thanks,

    Irina
    Last edited by yarinka; 08-18-2014, 12:50 AM.
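
As a quick check of the overlap point raised in the post above, here is a minimal sketch of the arithmetic. It assumes the 400-600 bp size selection refers to insert length without adapters (if the window includes adapters, the inserts are shorter and the overlaps larger). The function name `expected_overlap` is made up for illustration.

```python
def expected_overlap(insert_len: int, read_len: int) -> int:
    """Number of bases covered by both reads of a pair.

    Each read covers at most the whole insert, so the covered length per
    read is min(read_len, insert_len). A negative result is the size of
    the unsequenced gap between the two reads.
    """
    return 2 * min(read_len, insert_len) - insert_len


# Read lengths and insert sizes taken from the post; everything else is illustrative.
for read_len in (250, 300):
    for insert_len in (400, 500, 600):
        ov = expected_overlap(insert_len, read_len)
        status = "overlap" if ov > 0 else "no overlap"
        print(f"{read_len} bp reads, {insert_len} bp insert: {ov:+d} bp ({status})")
```

Under that assumption, 250 bp reads only overlap for inserts shorter than 500 bp, while 300 bp reads overlap for every insert shorter than 600 bp, so the planned 300 bp run should produce far more mergeable pairs.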

  • #2
    paired end ddRAD with Stacks

    Hi Irina,

    Unfortunately I don't have any dDocent pipeline experience I can pass along to you, but I do have a question about the pipeline you used for your data in Stacks. I just got back a paired end ddRAD data set and am trying to analyze it in Stacks to find SNPs and look at population structure (I am also working on a nonmodel marine species with no reference genome). However, I am having trouble figuring out how to fully incorporate the paired end reads into the Stacks pipeline. Any information you can give about your Stacks pipeline and any modifications you had to do for paired end reads would be greatly appreciated. Thank you in advance for your help!

    Cheers,
    Lani



    • #3
      Hi Lani,

      After a pretty painful experience with dDocent I had to abandon it. I tried pyRAD, which some people in our lab find useful, but it didn't work as well for me. At the moment I'm trying out another approach from the paper by Hohenlohe et al. 2013 (http://www.ncbi.nlm.nih.gov/pubmed/23432212), where they use the Stacks pipeline but export the tags using export_sql, collate the paired-end tags (as far as I understand) and assemble them in CAP3/Velvet. They then use this assembly as a reference, mapping reads with Bowtie and calling variants with SAMtools. This pipeline seems pretty promising to me, and it allows me to filter contaminated reads with Kraken, which is very awkward to do when using dDocent or pyRAD. I would be glad to hear about your experience with this approach if you decide to use it in the end.

      Cheers,

      Irina
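
The "collate paired-end tags" step described in the post above can be pictured roughly as follows. This is only a sketch of the idea, not the code from Hohenlohe et al. or from Stacks: the `read_to_locus` mapping is a hypothetical stand-in for whatever the tag export provides, and the function name is invented for illustration.

```python
from collections import defaultdict


def collate_pairs_by_locus(read_to_locus, reverse_reads):
    """Group reverse reads under the locus their forward mate was assigned to.

    read_to_locus: dict of read ID -> locus ID (hypothetical input format)
    reverse_reads: iterable of (read ID, sequence) tuples for the reverse reads
    """
    by_locus = defaultdict(list)
    for read_id, seq in reverse_reads:
        locus = read_to_locus.get(read_id)
        if locus is not None:  # skip mates whose forward read joined no locus
            by_locus[locus].append(seq)
    return dict(by_locus)


# Toy data: r4 has no locus assignment and is dropped.
read_to_locus = {"r1": 1, "r2": 1, "r3": 2}
reverse_reads = [("r1", "ACGT"), ("r2", "ACGA"), ("r3", "TTGC"), ("r4", "GGGG")]
print(collate_pairs_by_locus(read_to_locus, reverse_reads))
# {1: ['ACGT', 'ACGA'], 2: ['TTGC']}
```

Each per-locus group of reverse reads would then be handed to an assembler such as CAP3 or Velvet to build one contig per locus.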



      • #4
        Hi Irina,

        Thanks for the info (and the quick response). So you weren't able to find a way to do the whole pipeline just in Stacks? In your original post you said you "tried Stacks pipeline and gave pretty good results". Which pipeline/components of Stacks did you use to get those results?

        Thanks again!
        Lani



        • #5
          Hi Lani,

          By "the Stacks pipeline" I meant the standard Stacks workflow: denovo_map, then populations, then STRUCTURE.

          Cheers,

          Irina



          • #6
            Thank you for the info!



            • #7
              I just stumbled across this thread and thought I'd contribute a bit and maybe help, if it is still needed. I haven't played a lot with dDocent, but I really like what it does in principle. The pipeline from Hohenlohe is essentially doing something very similar. Stacks does have limitations, such as the overlapping read pair issue already mentioned. Stacks also cannot really use paired-end data without a reference genome; in order to use the read pairs for clustering, the user must concatenate the forward and reverse reads together (and they must be the same length), which eliminates the paired-end information.
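
To make the concatenation workaround concrete, here is a minimal sketch. It is not part of Stacks; the function names, the fixed trim lengths, and the choice to reverse-complement the second read are all assumptions for illustration.

```python
# Translation table for complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def concatenate_pair(fwd, rev, fwd_len, rev_len):
    """Trim both reads to fixed lengths and join them into one pseudo-read.

    Returns None if either read is shorter than its trim length, so every
    pseudo-read that comes out has exactly fwd_len + rev_len bases.
    """
    if len(fwd) < fwd_len or len(rev) < rev_len:
        return None
    return fwd[:fwd_len] + reverse_complement(rev[:rev_len])


print(concatenate_pair("ACGTAC", "TTGGCC", 4, 4))  # ACGTCCAA
print(concatenate_pair("ACG", "TTGGCC", 4, 4))     # None
```

The pseudo-read no longer records which bases came from which mate or how far apart they were, which is the loss of paired-end information mentioned above.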

              pyRAD is nice because it gives you intuitive nucleotide alignments as output, which is what you need for phylogenetic analyses, but is very slow due to the clustering it does.

              Here is how I would analyze the data described above:
              1. Remove PCR clones and parse data (Stacks clone_filter and process_radtags are great for this).
              2. Quality trim the parsed reads (I'm partial to Trimmomatic, but lots of options are out there).
              3. In the case of long reads that can potentially overlap, use a program to make this happen (PEAR pops up first on a Google search, but there are others).
              4. Using the raw reads and overlapping longer reads you can do a couple of things:
              4a. Cluster the reads based on similarity, which is what dDocent does using Rainbow and CD-HIT (after it eliminates reads that only show up a few times).
              4b. Assemble the reads using an assembler (many options are out there).
              This gives you a "reference" to map to. Note this isn't a reference genome, as your contigs will all be relatively short (a few hundred bp) and will have no real coordinates in a genomic sense.
              5. Map your reads from individual samples back to your "reference" using BWA/Bowtie/etc., which gives you a SAM output.
              6. You can proceed with SNP calling and downstream analyses in a couple of ways:
              6a. Use Stacks ref_map.pl wrapper, which takes your SAM outputs and runs pstacks, cstacks, and sstacks (or run these individually yourself). Then use populations to do as you please.
              6b. Use SAMtools to create BAM files, and then call SNPs (FreeBayes, GATK, SAMtools, etc.), which gives you a VCF file.
              If you are familiar with Stacks and can use the output it gives you, then 6a is good. 6b is a standard mapping and SNP calling pipeline, and as such it gives you standard output file types, which can be used in numerous programs downstream, depending on your study goals.
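
For step 3 above, the idea behind merging an overlapping read pair can be sketched as below. This is a toy illustration only: it is not how PEAR works (PEAR scores candidate overlaps statistically and uses base qualities), it ignores quality values, and it does not handle inserts shorter than the read length. The function names and default thresholds are assumptions.

```python
# Translation table for complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair whose 3' ends overlap; return None if no overlap fits.

    The reverse read is reverse-complemented, then candidate overlaps are
    tried from longest to shortest. The first one with few enough
    mismatches wins, and forward-read bases are kept inside the overlap.
    """
    rc = reverse_complement(rev)
    for ov in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(fwd[-ov:], rc[:ov]))
        if mismatches <= max_mismatch_frac * ov:
            return fwd + rc[ov:]
    return None


# Toy 30 bp fragment sequenced with 20 bp reads from each end (10 bp overlap).
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
fwd = fragment[:20]
rev = reverse_complement(fragment[-20:])
print(merge_pair(fwd, rev) == fragment)  # True
```

Pairs that merge can be carried forward as single long reads, and pairs that return None stay as separate forward and reverse reads for the clustering or assembly in step 4.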

              While not a programmer, I've picked up enough Python coding knowledge to make tools and pipeline some of the steps above, which you are free to use if you think they will be helpful (https://github.com/darencard). Hope this helps!



              • #8
                Hey Daren!! This looks like a pretty good idea you have worked out.


                • #9
                  Help with dDocent

                  I'm sorry to hear that people have had a hard time with the pipeline. It has been updated in the last few weeks to make it more compatible and to make installation problems easier to diagnose. Using Rainbow to assemble is more sophisticated than the approach suggested in this thread.

                  dDocent is designed with non-model, marine organisms in mind. Please don't hesitate to contact me with questions.

                  [email protected]
