SEQanswers

08-18-2014, 12:42 AM   #1
yarinka
Junior Member
 
Location: New Zealand

Join Date: Dec 2013
Posts: 5
ddRADSeq data analysis for population structure (Stacks, dDocent...)

Hi all,

I'm working on a marine species with no reference genome, following the ddRAD protocol of Peterson et al. with a size selection of 400-600 bp. The first pilot library was sequenced on a MiSeq with 250 bp paired-end reads (the plan is to sequence the main library with 300 bp PE reads). The main goal is to study population structure using the SNPs found, so I tried the Stacks pipeline and it gave pretty good results. Unfortunately, Stacks can't take full advantage of long MiSeq reads that overlap and can be merged. At the moment I'm trying the dDocent pipeline but am having a lot of trouble getting it to work. I was wondering whether anyone has used it and how helpful it was. I would also be grateful if anyone could recommend another pipeline suited to population genetics.

Many thanks,

Irina

Last edited by yarinka; 08-18-2014 at 12:50 AM.
09-17-2014, 04:03 PM   #2
lugleason
Junior Member
 
Location: La Jolla, CA, USA

Join Date: Sep 2014
Posts: 3
paired end ddRAD with Stacks

Hi Irina,

Unfortunately I don't have any dDocent pipeline experience I can pass along to you, but I do have a question about the pipeline you used for your data in Stacks. I just got back a paired end ddRAD data set and am trying to analyze it in Stacks to find SNPs and look at population structure (I am also working on a nonmodel marine species with no reference genome). However, I am having trouble figuring out how to fully incorporate the paired end reads into the Stacks pipeline. Any information you can give about your Stacks pipeline and any modifications you had to do for paired end reads would be greatly appreciated. Thank you in advance for your help!

Cheers,
Lani
09-17-2014, 04:17 PM   #3
yarinka
Junior Member
 
Location: New Zealand

Join Date: Dec 2013
Posts: 5

Hi Lani,

After a pretty painful experience with dDocent I had to abandon it. I tried pyRAD, which some people in our lab find useful, but it didn't work as well for me. At the moment I'm trying another approach, from the paper by Hohenlohe et al. 2013 (http://www.ncbi.nlm.nih.gov/pubmed/23432212), where they use the Stacks pipeline but export the tags using export_sql, collate the paired-end reads for each tag (as far as I understand) and assemble them in CAP3/Velvet. They then use this assembly as a reference, mapping reads to it with Bowtie and calling variants with SAMtools. This pipeline seems pretty promising to me, and it lets me filter contaminated reads with KRAKEN, which is very difficult to do when using dDocent or pyRAD. I would be glad to hear about your experience with this approach if you decide to use it in the end.
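
A rough Python sketch of the collation step described above: grouping the paired-end (reverse) reads by the locus their forward mate was assigned to, so that each group can be passed to an assembler. This is only an illustration, not the code from the paper. The names `read_to_locus`, `collate_pairs_by_locus` and `to_fasta` are hypothetical, and the read-to-locus mapping is assumed to have been parsed already from whatever Stacks exports.

```python
from collections import defaultdict


def collate_pairs_by_locus(read_to_locus, paired_reads):
    """Group reverse reads by the locus ID of their forward mate.

    read_to_locus: dict of read name -> locus ID (assumed input, e.g.
                   parsed from a Stacks export; the format is not shown here).
    paired_reads:  iterable of (read name, sequence) for the reverse reads.
    """
    by_locus = defaultdict(list)
    for name, seq in paired_reads:
        locus = read_to_locus.get(name)
        if locus is not None:  # skip mates whose forward read has no locus
            by_locus[locus].append((name, seq))
    return dict(by_locus)


def to_fasta(records):
    """Format (name, sequence) records as FASTA text for an assembler."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

Each per-locus FASTA string could then be written to its own file and assembled separately, which is presumably where CAP3 or Velvet would come in.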

Cheers,

Irina
09-18-2014, 07:48 AM   #4
lugleason
Junior Member
 
Location: La Jolla, CA, USA

Join Date: Sep 2014
Posts: 3

Hi Irina,

Thanks for the info (and the quick response). So you weren't able to find a way to do the whole pipeline just in Stacks? In your original post you said that the Stacks pipeline gave pretty good results. Which pipeline or components of Stacks did you use to get those results?

Thanks again!
Lani
09-18-2014, 10:39 PM   #5
yarinka
Junior Member
 
Location: New Zealand

Join Date: Dec 2013
Posts: 5

Hi Lani,

By the Stacks pipeline I meant the standard Stacks workflow: denovo_map, then populations, then structure.

Cheers,

Irina
09-22-2014, 08:40 AM   #6
lugleason
Junior Member
 
Location: La Jolla, CA, USA

Join Date: Sep 2014
Posts: 3

Thank you for the info!
12-10-2014, 06:28 PM   #7
dcard
Junior Member
 
Location: Arlington, TX

Join Date: Mar 2013
Posts: 5

I just stumbled across this thread and thought I'd contribute a bit and maybe help, if it is still needed. I haven't played a lot with dDocent, but I really like what it does in principle. The pipeline from Hohenlohe does something very similar. Stacks does have limitations, such as the one already mentioned with overlapping read pairs. Stacks also cannot really use paired-end data without a reference genome: to use the read pairs for clustering, the user must concatenate the forward and reverse reads (and they must all be the same length), which eliminates the paired-end information.
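
To make the concatenation workaround concrete, here is a minimal Python sketch. It assumes both reads are first trimmed to a common length and that the reverse read is reverse-complemented before joining; whether to reverse-complement is a choice made for this example, not something Stacks prescribes. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def concatenate_pair(forward, reverse, length):
    """Trim both reads to `length` and join them into one pseudo-read.

    Returns None if either read is too short, so that every
    concatenated read ends up with the same length (2 * length).
    """
    if len(forward) < length or len(reverse) < length:
        return None
    return forward[:length] + reverse_complement(reverse[:length])
```

Once joined this way, a clustering program sees one long sequence per fragment and no longer knows which bases came from which read, which is the loss of paired-end information mentioned above.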

pyRAD is nice because it gives you intuitive nucleotide alignments as output, which is what you need for phylogenetic analyses, but it is very slow due to the clustering it does.

Here is how I would analyze the data described above:
1. Remove PCR clones and parse data (Stacks clone_filter and process_radtags are great for this).
2. Quality trim the parsed reads (I'm partial to Trimmomatic, but lots of options are out there).
3. In the case of long reads that can potentially overlap, use a program to make this happen (PEAR pops up first on a Google search, but there are others).
4. Using the raw reads and overlapping longer reads you can do a couple of things:
4a. Cluster the reads based on similarity, which is what dDocent does using Rainbow and CD-HIT (after it eliminates reads that only show up a few times).
4b. Assemble the reads using an assembler (many options are out there).
This gives you a "reference" to map to. Note this isn't a reference genome, as your contigs will all be relatively short (a few hundred bp) and will have no real coordinates in a genomic sense.
5. Map your reads from individual samples back to your "reference" using BWA/Bowtie/etc., which gives you a SAM output.
6. You can proceed with SNP calling and downstream analyses in a couple of ways:
6a. Use Stacks ref_map.pl wrapper, which takes your SAM outputs and runs pstacks, cstacks, and sstacks (or run these individually yourself). Then use populations to do as you please.
6b. Use SAMtools to create BAM files, and then call SNPs (FreeBayes, GATK, SAMtools, etc.), which gives you a VCF file.
If you are familiar with Stacks and can use the output it gives you, then 6a is good. 6b is a standard mapping and SNP calling pipeline, and as such it gives you standard output file types, which can be used in numerous programs downstream, depending on your study goals.
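
As an illustration of step 3, this Python sketch merges one overlapping read pair by looking for the longest overlap between the end of the forward read and the start of the reverse-complemented reverse read. It is a deliberately naive stand-in and not the algorithm PEAR or any other merger uses: it ignores base qualities and simply keeps the forward read's bases in the overlap. The function names and the `min_overlap` and `max_mismatch_frac` parameters are assumptions made for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def merge_pair(forward, reverse, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair if the reads overlap; return None otherwise.

    Tries the longest candidate overlap first and accepts the first one
    whose mismatch count is within max_mismatch_frac of its length.
    """
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    for olap in range(longest, min_overlap - 1, -1):
        tail = forward[-olap:]   # end of the forward read
        head = rc[:olap]         # start of the reverse read, same strand
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olap:
            # keep the forward read and append the non-overlapping remainder
            return forward + rc[olap:]
    return None
```

Pairs that come back as None do not overlap under these settings and would be kept as ordinary paired reads for the clustering or assembly in step 4.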

Although I'm not a programmer, I've picked up enough Python to write tools that pipeline some of the steps above, and you are free to use them if you think they would be helpful (https://github.com/darencard). Hope this helps!
02-26-2015, 04:55 PM   #8
etwatson
Member
 
Location: Los Angeles, CA

Join Date: Jun 2012
Posts: 18

Hey Daren!! This looks like a pretty good idea you have worked out.
03-23-2015, 02:01 PM   #9
jpuritz
Junior Member
 
Location: Hawaii

Join Date: Sep 2010
Posts: 3
Help with dDocent

I'm sorry to hear that people have had a hard time with the pipeline. It was updated in the last few weeks to make it more compatible and to make installation problems easier to diagnose. Using rainbow for assembly is more sophisticated than the approach suggested in this thread.

dDocent is designed with non-model, marine organisms in mind. Please don't hesitate to contact me with questions.

jpuritz@gmail.com

Tags
de novo assembly, miseq, stacks
