**lugleason** · 09-17-2014, 04:03 PM

paired end ddRAD with Stacks

Hi Irina,

Unfortunately I don't have any dDocent pipeline experience I can pass along to you, but I do have a question about the pipeline you used for your data in Stacks. I just got back a paired end ddRAD data set and am trying to analyze it in Stacks to find SNPs and look at population structure (I am also working on a nonmodel marine species with no reference genome). However, I am having trouble figuring out how to fully incorporate the paired end reads into the Stacks pipeline. Any information you can give about your Stacks pipeline and any modifications you had to do for paired end reads would be greatly appreciated. Thank you in advance for your help!

Cheers,
Lani

**yarinka** · 09-17-2014, 04:17 PM

Hi Lani,

After a pretty painful experience with dDocent I had to abandon it. I tried pyRAD, which some people in our lab find useful, but it didn't work as well for me. At the moment I'm trying out another approach from the paper by Hohenlohe et al. 2013 (http://www.ncbi.nlm.nih.gov/pubmed/23432212), where they use the Stacks pipeline but export the tags using export_sql, collate the paired-end tags (as far as I understand) and assemble them in CAP3/Velvet. They use this assembly as a reference to map reads with Bowtie and call variants with SAMtools. This pipeline seems pretty promising to me, and it lets me filter contaminated reads with KRAKEN, which is very difficult to do when using dDocent or pyRAD. I would be glad to hear about your experience with this approach if you decide to use it in the end.
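The collation step can be sketched roughly as below. This is only an illustration, assuming you already have a mapping from read IDs to catalog locus IDs (for example, parsed from the exported tags) and the reverse reads keyed by the same IDs. The function and variable names are hypothetical and are not part of Stacks or of the scripts used in the paper.

```python
from collections import defaultdict


def collate_reverse_reads(read_to_locus, reverse_reads):
    """Group reverse reads by the locus their forward mate was assigned to.

    read_to_locus: dict mapping read ID -> locus ID (hypothetical input).
    reverse_reads: dict mapping read ID -> reverse-read sequence.
    Returns a dict mapping locus ID -> list of reverse-read sequences.
    """
    per_locus = defaultdict(list)
    for read_id, seq in reverse_reads.items():
        locus = read_to_locus.get(read_id)
        if locus is None:
            # The forward mate was not placed in any locus; skip this pair.
            continue
        per_locus[locus].append(seq)
    return dict(per_locus)


def to_fasta(locus_id, seqs):
    """Format one locus's reverse reads as FASTA text for an assembler."""
    return "".join(
        f">locus{locus_id}_read{i}\n{s}\n" for i, s in enumerate(seqs, 1)
    )
```

Each per-locus FASTA could then be handed to an assembler such as CAP3 or Velvet, one locus at a time.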

Cheers,

Irina

**lugleason** · 09-18-2014, 07:48 AM

Hi Irina,

Thanks for the info (and the quick response). So you weren't able to find a way to do the whole pipeline just in Stacks? In your original post you said you "tried Stacks pipeline and gave pretty good results"-what pipeline/components of Stacks did you use to get these results you're referring to?

Thanks again!
Lani

**yarinka** · 09-18-2014, 10:39 PM

Hi Lani,

By the Stacks pipeline I meant the standard Stacks flow: denovo_map, then populations, then structure.

Cheers,

Irina

**lugleason** · 09-22-2014, 08:40 AM

Thank you for the info!

**dcard** · 12-10-2014, 07:28 PM

I just stumbled across this thread and thought I'd contribute a bit and maybe help, if it is still needed. I haven't played a lot with dDocent, but I really like what it does in principle. The pipeline from Hohenlohe is essentially doing something very similar. Stacks does have limitations, such as the overlapping read pairs issue already mentioned. Stacks also cannot really use paired-end data without a reference genome; in order to use the read pairs for clustering, the user must concatenate the forward and reverse reads together (and they must be the same length), which eliminates the paired-end information.
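To make the concatenation workaround concrete, here is a minimal sketch. It assumes the reverse read should be reverse-complemented so both halves sit on the same strand (my assumption, not something Stacks requires), and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(comp)[::-1]


def concatenate_pair(fwd_seq, fwd_qual, rev_seq, rev_qual, expected_len=None):
    """Join a forward and a reverse read into one pseudo-read.

    expected_len, if given, enforces that every concatenated read has the
    same total length, as noted above.
    """
    if len(fwd_seq) != len(fwd_qual) or len(rev_seq) != len(rev_qual):
        raise ValueError("sequence and quality strings differ in length")
    seq = fwd_seq + reverse_complement(rev_seq)
    # Qualities of the reverse read are reversed to follow the flipped bases.
    qual = fwd_qual + rev_qual[::-1]
    if expected_len is not None and len(seq) != expected_len:
        raise ValueError(f"read is {len(seq)} bp, expected {expected_len}")
    return seq, qual
```

Note that after this step the clustering sees a single long read, so insert size and pairing are no longer available to downstream tools.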

pyRAD is nice because it gives you intuitive nucleotide alignments as output, which is what you need for phylogenetic analyses, but is very slow due to the clustering it does.

Here is how I would analyze the data described below:
1. Remove PCR clones and parse data (Stacks clone_filter and process_radtags are great for this).
2. Quality trim the parsed reads (I'm partial to Trimmomatic, but lots of options are out there).
3. In the case of long reads that can potentially overlap, use a program to make this happen (PEAR pops up first on a Google search, but there are others).
4. Using the raw reads and overlapping longer reads you can do a couple of things:
4a. Cluster the reads based on similarity, which is what dDocent does using Rainbow and CD-HIT (after it eliminates reads that only show up a few times).
4b. Assemble the reads using an assembler (many options are out there).
This gives you a "reference" to map to. Note this isn't a reference genome, as your contigs will all be relatively short (a few hundred bp) and will have no real coordinates in a genomic sense.
5. Map your reads from individual samples back to your "reference" using BWA/Bowtie/etc., which gives you a SAM output.
6. You can proceed with SNP calling and downstream analyses in a couple of ways:
6a. Use Stacks ref_map.pl wrapper, which takes your SAM outputs and runs pstacks, cstacks, and sstacks (or run these individually yourself). Then use populations to do as you please.
6b. Use SAMtools to create BAM files, and then call SNPs (FreeBayes, GATK, SAMtools, etc.), which gives you a VCF file.
If you are familiar with Stacks and can use the output it gives you, then 6a is good. 6b is a standard mapping and SNP calling pipeline, and as such it gives you standard output file types, which can be used in numerous programs downstream, depending on your study goals.
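As a rough sketch of steps 5 and 6b, the helper below only builds the shell commands for one sample, so it can be inspected as a dry run before anything is executed. The file names and the function name are hypothetical, BWA is picked arbitrarily from the mappers listed above, and the calling step uses bcftools, where the SAMtools variant caller now lives.

```python
import shlex


def mapping_and_calling_commands(reference, sample, fastq_r1, fastq_r2):
    """Return shell command strings to map one sample and call SNPs."""
    q = shlex.quote
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf"
    return [
        # Index the de novo "reference" (only needed once per reference).
        f"bwa index {q(reference)}",
        # Step 5: map the read pairs and pipe the SAM into a sorted BAM.
        f"bwa mem {q(reference)} {q(fastq_r1)} {q(fastq_r2)}"
        f" | samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
        # Step 6b: pile up reads and call variants into a VCF file.
        f"bcftools mpileup -f {q(reference)} {q(bam)}"
        f" | bcftools call -mv -o {q(vcf)}",
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in mapping_and_calling_commands(
        "contigs.fa", "sample1", "sample1_R1.fq", "sample1_R2.fq"
    ):
        print(cmd)
```

Once the printed commands look right, each string could be run with `subprocess.run(cmd, shell=True, check=True)`, looping over samples.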

While not a programmer, I've picked up enough Python coding knowledge to make tools and pipeline some of the steps above, which you are free to use if you think they will be helpful (https://github.com/darencard). Hope this helps!

**etwatson** · 02-26-2015, 05:55 PM

Hey Daren!! This looks like a pretty good idea you have worked out.

**jpuritz** · 03-23-2015, 02:01 PM

Help with dDocent

I'm sorry to hear that people have had a hard time with the pipeline. It has been updated in the last few weeks to make it more compatible and to make installation problems easier to diagnose. Using rainbow to assemble is more sophisticated than the approach suggested in this thread.

dDocent is designed with non-model, marine organisms in mind. Please don't hesitate to contact me with questions.


ddRADSeq data analysis for population structure (Stacks, dDocent...)
