SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Restriction-site Associated (RAD) services Patricia Dorn General 5 05-27-2014 10:54 AM
2b-RAD? does anyone test it? Marianna85 Illumina/Solexa 1 09-12-2013 05:24 AM
RAD-tag QTL annotation pfranchini Bioinformatics 0 02-04-2013 02:51 AM
RAD-tag or GBS? mcurto Illumina/Solexa 0 09-11-2012 02:55 PM
Are RAD data strand-specific? litc Bioinformatics 2 09-02-2012 05:41 AM

Reply
 
Thread Tools
Old 05-19-2013, 02:13 PM   #1
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20
Default 2b-RAD and GATK

Hello all,

I have some Type 2B RAD data for many individuals from several populations of my non-model species. I have made a reference to map the tags to by extracting all potential RAD sites from the available genome. I am now trying to discover SNPs among the individuals.

I've previously used samtools mpileup and a home-made haplotype caller, and have also tried STACKS though my conclusion is that Stacks' SNPs do not agree with the other two methods', even when accounting for a weird indexing issue that appears to be going on. I am skeptical that Stacks is appropriate for Type2B RAD.

I would ideally like to run the same SNP analysis using GATK, then find the intersection of SNPs called using both mpileup and GATK. However, I can't get GATK to run!

I have trimmed, filtered, and mapped (bowtie1) reads with added read groups in individual.sorted.bam format. I would like to run the GATK UnifiedGenotyper on a single individual as a first pass, then refine that SNP list using BQSR, VQSR and multi-sample UnifiedGenotyper SNP identification.

However, here is the problem: even on our available supercomputer and even using -nt and -nct, this single individual will take 4.9 weeks to process!! I am surprised at this. The individual in question contains 6,925,188 36-b reads, mapped to a reference made of every possible RAD site in the genome, 1,624,953 36-b contigs.

A collaborator suggested that GATK massively increases run time with increasing numbers of reference contigs, so I went back to my .sam files and deleted any reference RAD site from my reference.fasta that was not seen more than 100 times among all my individuals. This reduced my reference from the 1.6 million potential tags to 95,000 tags that are actually seen in my data. Still, this did not solve (or even appreciably decrease) the amount of time the UnifiedGenotyper predicts.

Does anyone have any ideas about what is going on here, or, better, how to overcome this problem? I would really like to use GATK!

Thank you!
rcapper is offline   Reply With Quote
Old 05-19-2013, 05:28 PM   #2
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 501
Default

Have you tried running it against the reference without extracting potential RAD sites? I've done large populations using novoalign against a reference (RAD or nextRAD) and the time is trivial. I also routinely take a population, identify the tags in the population (you can decide to include only predominant tags or all tags), and then align those tags against each other to determine alleles, and then count those alleles in each sample.

Sorry to not address your question, but these are paths that work for me for RAD and nextRAD. I assume 2b-RAD would behave in similar ways.
SNPsaurus is offline   Reply With Quote
Old 05-19-2013, 06:38 PM   #3
rcapper
Member
 
Location: Austin, TX

Join Date: Sep 2011
Posts: 20
Default

Interesting. I had previously resisted mapping to the whole genome (12.5k contigs) because I felt like there may be increased mapping error around the restriction site. However, I don't have any data about this yet, just a gut feeling that feeding the mapper known sites selected for their restriction sites would improve this. But... on the other hand, I filter and trim my raw reads and insist that each one contains the RAD site, so maybe I'm just being cautious for no reason.

Anyway -- I did what you suggested and mapped an individual to the whole genome, then ran the UnifiedGenotyper on that guy. We're down to 51 minutes! Looks like the reference database size does indeed make a huge time difference (12.5 contigs/1 hour vs 95k contigs/4.9 weeks...)

Another idea that was suggested to me was to extract the RAD sites from the genome, then concatenate them into artificial chromosomes, potentially using strings of 1000 N's to separate each tag. I think this is a good idea, but more complicated than just mapping to the genome.

Can you think of any reason why mapping to the genome instead of to the sites themselves would be better or worse? I can think of pros and cons for each side, but I am not convinced of either way yet. Obviously, though, the computational time alone will make up my mind
rcapper is offline   Reply With Quote
Old 05-19-2013, 08:08 PM   #4
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 501
Default

To me, the biggest source of error when mapping to a whole genome is from alignment to duplicate genes/genomic regions. Sometimes a tag will align to multiple locations. One is true, and the others spurious. But a sequencing error may be enough to shift the tag from one location to the other. Or, because few genomes are "complete", your tag may map to the duplicate region rather than its real locus.

But these are not serious issues. A small loss of tags from tossing out ones that map to multiple locations is just not an issue when you have 10k (or 100k) markers to play with. And in the second scenario, the usual information desired is comparing your samples to each other. So if all the samples map to a spurious locus because of missing sequence in the reference, they will be true in comparison to each other. Mapping issues might be more of a problem with 2b-RAD with its short sequence length, but I bet it will be OK.
SNPsaurus is offline   Reply With Quote
Reply

Tags
2b-rad, gatk, gatk slow, gatk unifiedgenotyper, rad

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:50 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO