SEQanswers

04-13-2014, 09:11 AM   #1
mharve9
Junior Member
 
Location: Louisiana, USA

Join Date: Apr 2014
Posts: 2
Pipeline for obtaining phased haplotype sequences w/out references

Hello,

I'm working on a pipeline for obtaining phased haplotype sequences from diploid organisms. The input data are Illumina reads from reduced representation libraries, and the goal is to use the phased sequences to estimate gene trees for coalescent analyses. I am working with non-model species, so I have neither a reference genome nor reference panels for phasing SNPs. I've worked up a pipeline (see below), but given the proliferation of tools out there, I was hoping to get feedback on whether better alternatives exist to the tools I've selected. Pipeline:

1) Demultiplex and clean reads (Casava, Illumiprocessor)
2) De novo assembly (ABySS)
3) Map contigs to reference sequences of interest (in some cases we have a set of reference loci we are interested in recovering; Python scripts already written for this step)
4) Map reads to consensus (BWA)
5) Call SNPs and phase using read information (GATK)
6) Output phased haplotype sequences (custom Python scripts?)

In addition to advice on alternative tools, I would appreciate any input on step (6) above. Are there any tools that can do this? From what I can tell, samtools can output sequences from VCF files of phased SNPs, but these will just contain ambiguity codes rather than two phased haplotype sequences. I don't think GATK has this functionality yet. Will I just have to write a script to take the SNP phasing information from GATK's phased VCF and add it back into the consensus sequences?

Thanks,

Mike

06-05-2015, 09:07 AM   #2
Sett
Junior Member
 
Location: italy

Join Date: Jun 2015
Posts: 2

I'm also trying to obtain two different consensus sequences from a phased VCF for a diploid organism. Were you able to find an efficient solution?
Thanks

06-05-2015, 12:50 PM   #3
mharve9
Junior Member
 
Location: Louisiana, USA

Join Date: Apr 2014
Posts: 2

Sett,

I ended up writing a python script to do this. It's available at:

https://github.com/mgharvey/misc_pyt...nps_to_seqs.py

Essentially, it takes a phased VCF file output by GATK (but see details below) and inserts the SNPs into reference sequences in FASTA format for the relevant loci (e.g. those used as the index for mapping reads initially). For each diploid individual, two sequences are output, one per allele. The script starts inserting SNPs at the beginning of the reference for each locus/contig and keeps the phase of subsequent SNPs that GATK phased successfully. If some SNPs from a locus were not successfully phased, it inserts the appropriate IUPAC ambiguity codes (unless you use the --resolve flag to force arbitrary phasing).
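
To make the insertion logic described above concrete, here is a minimal Python sketch. It is not the actual add_phased_snps_to_seqs.py code: the function name `build_haplotypes`, the `(pos, allele1, allele2, phased)` tuple layout, and the rule that the first heterozygous site of a locus fixes the orientation are all assumptions made for illustration.

```python
# Two-base IUPAC ambiguity codes, keyed by the unordered pair of alleles.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def build_haplotypes(ref_seq, snps, resolve=False):
    """Return two haplotype strings for one locus (hypothetical helper).

    snps: iterable of (pos, allele1, allele2, phased); pos is 1-based as in
    a VCF, and phased is True when the site was phased relative to the
    previous heterozygous site.
    """
    hap1, hap2 = list(ref_seq), list(ref_seq)
    seen_het = False
    for pos, a1, a2, phased in sorted(snps):
        i = pos - 1  # convert to 0-based string index
        is_het = a1 != a2
        # Homozygous sites need no phase; the first het site sets the
        # orientation; resolve=True forces an arbitrary phase.
        if not is_het or phased or not seen_het or resolve:
            hap1[i], hap2[i] = a1, a2
        else:
            # Unphased het site: write the same ambiguity code to both.
            hap1[i] = hap2[i] = IUPAC[frozenset((a1, a2))]
        seen_het = seen_het or is_het
    return "".join(hap1), "".join(hap2)


# Toy locus: first het site, one phased het site, one unphased het site.
toy_snps = [(2, "C", "T", False), (5, "A", "G", True), (8, "T", "C", False)]
hap1, hap2 = build_haplotypes("ACGTACGTAC", toy_snps)
res1, res2 = build_haplotypes("ACGTACGTAC", toy_snps, resolve=True)
```

The sketch handles SNPs only (no indels) and treats a locus as a single phase block, which is a simplification.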

The input I use (and that the script expects) is actually a phased SNP table, which can be output using the GATK VariantsToTable tool. To obtain this, after phasing I make a separate VCF file for each sample in my VCF using the SelectVariants tool (using the -sn flag to output a single individual). I then run the VariantsToTable tool with the following flags, which determine which data columns are written to the table: -F CHROM -F POS -F QUAL -GF GT -GF DP -GF HP -GF AD. The full command would be:

java -Xmx2g -jar GenomeAnalysisTK.jar \
-T VariantsToTable \
-R Xenops_minutus_All_to_probes.fasta \
-V Xenops_minutus_XM2-phased.vcf \
-F CHROM -F POS -F QUAL -GF GT -GF DP -GF HP -GF AD \
-o Xenops_minutus_XM2-phased-table.txt

I then pass this table to the Python script, using the command:

python add_phased_snps_to_seqs.py REF_FASTA PHASED_TABLE OUT_FILE

Replace the arguments in caps with your reference FASTA file, phased table of SNPs, and desired output file location/name, respectively.
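
If you want to inspect the table before running the script, a rough Python sketch for reading it follows. It assumes the table is tab-delimited with a header row, that the genotype column is named `<sample>.GT`, and that GT holds the allele bases separated by `/` or `|`. Treating `|` as the phase marker is also an assumption: depending on the GATK version, phase may instead be recorded in the HP field, which this sketch ignores. The function name `read_phased_table` and the toy rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_phased_table(handle):
    """Group SNPs by locus from a VariantsToTable-style table (sketch).

    Returns {CHROM: [(pos, allele1, allele2, phased), ...]}.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    # Assumed column naming: the per-sample genotype column ends in ".GT".
    gt_col = next(name for name in reader.fieldnames if name.endswith(".GT"))
    snps = defaultdict(list)
    for row in reader:
        gt = row[gt_col]
        phased = "|" in gt  # assumption: "|" marks a phased genotype
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        snps[row["CHROM"]].append(
            (int(row["POS"]), alleles[0], alleles[1], phased)
        )
    return dict(snps)


# Made-up rows for illustration only, not real output.
example = io.StringIO(
    "CHROM\tPOS\tQUAL\tXM2.GT\tXM2.DP\n"
    "locus1\t12\t50.0\tA/G\t20\n"
    "locus1\t40\t61.2\tC|T\t18\n"
    "locus2\t7\t33.1\t./.\t0\n"
)
snps_by_locus = read_phased_table(example)
```

Opening a real file would look like `with open(path, newline="") as handle: read_phased_table(handle)`.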

This script is still in draft stage. Let me know if you use it and have issues.

Mike

06-08-2015, 12:34 AM   #4
Sett
Junior Member
 
Location: italy

Join Date: Jun 2015
Posts: 2

Thank you Mike! I'll try your script.

Tags
gatk, haplotypes, phasing, pipeline, sequences

All times are GMT -8.