Hello all,
I have some Type 2B RAD data for many individuals from several populations of my non-model species. I built a reference for mapping the tags by extracting all potential RAD sites from the available genome. I am now trying to discover SNPs among the individuals.
I've previously used samtools mpileup and a home-made haplotype caller, and have also tried Stacks. My conclusion is that the SNPs Stacks calls do not agree with those from the other two methods, even after accounting for an odd indexing issue that appears to be going on. I am skeptical that Stacks is appropriate for Type 2B RAD.
I would ideally like to run the same SNP analysis using GATK, then find the intersection of SNPs called using both mpileup and GATK. However, I can't get GATK to run!
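To make the intersection step concrete, here is a minimal Python sketch of what I have in mind. It assumes both callers write standard tab-delimited VCF text and that a SNP counts as shared only when contig, position, REF and ALT all match. The function names `vcf_sites` and `shared_sites` are hypothetical, chosen for illustration; they are not part of samtools or GATK.

```python
def vcf_sites(vcf_lines):
    """Collect (contig, position, ref, alt) tuples from VCF text lines."""
    sites = set()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic records list ALT alleles separated by commas;
        # treat each allele as its own site (an assumption of this sketch).
        for alt in alts.split(","):
            sites.add((chrom, pos, ref, alt))
    return sites


def shared_sites(vcf_a_lines, vcf_b_lines):
    """Return the sorted sites present in both call sets."""
    return sorted(vcf_sites(vcf_a_lines) & vcf_sites(vcf_b_lines))
```

In practice each argument would be an open file handle for one caller's VCF; any off-by-one position differences between tools would need to be reconciled before comparing.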
I have trimmed and filtered the reads, mapped them with bowtie1, and added read groups; each individual is stored as individual.sorted.bam. I would like to run the GATK UnifiedGenotyper on a single individual as a first pass, then refine that SNP list using BQSR, VQSR and multi-sample UnifiedGenotyper SNP identification.
However, here is the problem: even on our available supercomputer, and even using -nt and -nct, this single individual is predicted to take 4.9 weeks to process! I am surprised at this. The individual in question contains 6,925,188 36-bp reads, mapped to a reference made of every possible RAD site in the genome: 1,624,953 36-bp contigs.
A collaborator suggested that GATK massively increases run time with increasing numbers of reference contigs, so I went back to my .sam files and deleted any reference RAD site from my reference.fasta that was not seen more than 100 times among all my individuals. This reduced my reference from the 1.6 million potential tags to 95,000 tags that are actually seen in my data. Still, this did not solve (or even appreciably decrease) the amount of time the UnifiedGenotyper predicts.
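For reference, the reference-trimming step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the exact script I ran: it assumes plain-text SAM input, counts only mapped reads per reference name, and keeps a FASTA record only if its tag was seen more than `min_hits` times across all individuals. The names `count_hits` and `filter_fasta` are hypothetical.

```python
from collections import Counter


def count_hits(sam_lines, counts=None):
    """Tally mapped reads per reference name from SAM text lines."""
    counts = Counter() if counts is None else counts
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete alignment record
        flag, rname = int(fields[1]), fields[2]
        # FLAG bit 0x4 marks an unmapped read; RNAME "*" means no reference.
        if flag & 0x4 or rname == "*":
            continue
        counts[rname] += 1
    return counts


def filter_fasta(fasta_lines, counts, min_hits=100):
    """Yield FASTA lines for records seen more than min_hits times."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = counts.get(name, 0) > min_hits
        if keep:
            yield line
```

Passing the same `counts` object to `count_hits` once per individual's .sam file accumulates the totals across individuals before filtering.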
Does anyone have any ideas about what is going on here, or, better, how to overcome this problem? I would really like to use GATK!
Thank you!