  • rcapper
    Member
    • Sep 2011
    • 20

    2b-RAD and GATK

    Hello all,

    I have some Type 2B RAD (2b-RAD) data for many individuals from several populations of my non-model species. I built a mapping reference by extracting all potential RAD sites from the available genome, and I am now trying to discover SNPs among the individuals.

    I've previously used samtools mpileup and a home-made haplotype caller, and have also tried Stacks. My conclusion is that the SNPs Stacks calls do not agree with those from the other two methods, even after accounting for a weird indexing issue that appears to be going on. I am skeptical that Stacks is appropriate for 2b-RAD.

    I would ideally like to run the same SNP analysis using GATK, then find the intersection of SNPs called using both mpileup and GATK. However, I can't get GATK to run!
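
    A minimal sketch of that intersection step, assuming both callers write standard VCF and that two calls count as "the same SNP" when chromosome, position, REF and ALT all match. The function names `read_snps` and `shared_snps` and the toy records are hypothetical, not part of either tool.

```python
def read_snps(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys for single-base substitutions."""
    snps = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for allele in alt.split(","):
            if len(ref) == 1 and len(allele) == 1 and allele != ".":
                snps.add((chrom, int(pos), ref.upper(), allele.upper()))
    return snps


def shared_snps(vcf_a_lines, vcf_b_lines):
    """Return the sorted SNP keys present in both call sets."""
    return sorted(read_snps(vcf_a_lines) & read_snps(vcf_b_lines))


# Toy records standing in for the mpileup and GATK outputs (made-up data).
mpileup_calls = [
    "##fileformat=VCFv4.1",
    "tag1\t12\t.\tA\tG\t50\t.\t.",
    "tag2\t20\t.\tC\tT\t40\t.\t.",
]
gatk_calls = [
    "##fileformat=VCFv4.1",
    "tag1\t12\t.\tA\tG\t99\t.\t.",
    "tag3\t5\t.\tG\tA\t60\t.\t.",
]
print(shared_snps(mpileup_calls, gatk_calls))
```

    With real files, the lines would come from `open("calls.vcf")`; indels are deliberately ignored here because the two callers may represent them differently.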

    I have trimmed, filtered, and mapped (bowtie1) reads with added read groups in individual.sorted.bam format. I would like to run the GATK UnifiedGenotyper on a single individual as a first pass, then refine that SNP list using BQSR, VQSR and multi-sample UnifiedGenotyper SNP identification.

    However, here is the problem: even on our available supercomputer, and even using -nt and -nct, this single individual will take 4.9 weeks to process! I am surprised at this. The individual in question contains 6,925,188 36-bp reads, mapped to a reference made of every possible RAD site in the genome: 1,624,953 36-bp contigs.

    A collaborator suggested that GATK massively increases run time with increasing numbers of reference contigs, so I went back to my .sam files and deleted any reference RAD site from my reference.fasta that was not seen more than 100 times among all my individuals. This reduced my reference from the 1.6 million potential tags to 95,000 tags that are actually seen in my data. Still, this did not solve (or even appreciably decrease) the amount of time the UnifiedGenotyper predicts.
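
    The reference-pruning step described above could be sketched roughly as follows. This assumes plain-text SAM input and a FASTA whose header names match the SAM RNAME column; `count_hits`, `filter_fasta` and the `threshold` parameter are hypothetical names, and the toy data use a threshold of 1 rather than 100.

```python
from collections import Counter


def count_hits(sam_lines, counts=None):
    """Count mapped reads per reference name across SAM lines."""
    counts = Counter() if counts is None else counts
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete alignment record
        flag, rname = int(fields[1]), fields[2]
        if flag & 4 or rname == "*":
            continue  # read is unmapped
        counts[rname] += 1
    return counts


def filter_fasta(fasta_lines, counts, threshold=100):
    """Yield FASTA lines for contigs seen more than `threshold` times."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = counts[name] > threshold
        if keep:
            yield line


# Toy example: two reads hit tagA, one hits tagB, one is unmapped.
sam = [
    "@SQ\tSN:tagA\tLN:36",
    "r1\t0\ttagA\t1\t255\t36M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\ttagA\t1\t255\t36M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\ttagB\t1\t255\t36M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
fasta = [">tagA", "ACGT", ">tagB", "GGCC", ">tagC", "TTAA"]
hits = count_hits(sam)
print(list(filter_fasta(fasta, hits, threshold=1)))
```

    Passing the same `counts` object to `count_hits` once per individual accumulates totals over all samples before the FASTA is filtered.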

    Does anyone have any ideas about what is going on here, or, better, how to overcome this problem? I would really like to use GATK!

    Thank you!
  • SNPsaurus
    Registered Vendor
    • May 2013
    • 525

    #2
    Have you tried running it against the reference without extracting potential RAD sites? I've done large populations using novoalign against a reference (RAD or nextRAD) and the time is trivial. I also routinely take a population, identify the tags in the population (you can decide to include only predominant tags or all tags), and then align those tags against each other to determine alleles, and then count those alleles in each sample.

    Sorry to not address your question, but these are paths that work for me for RAD and nextRAD. I assume 2b-RAD would behave in similar ways.
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com


    • rcapper
      Member
      • Sep 2011
      • 20

      #3
      Interesting. I had previously resisted mapping to the whole genome (12.5k contigs) because I felt like there may be increased mapping error around the restriction site. However, I don't have any data about this yet, just a gut feeling that feeding the mapper known sites selected for their restriction sites would improve this. But... on the other hand, I filter and trim my raw reads and insist that each one contains the RAD site, so maybe I'm just being cautious for no reason.

      Anyway -- I did what you suggested and mapped an individual to the whole genome, then ran the UnifiedGenotyper on that guy. We're down to 51 minutes! Looks like the reference database size does indeed make a huge time difference (12.5k contigs/1 hour vs 95k contigs/4.9 weeks...).

      Another idea that was suggested to me was to extract the RAD sites from the genome, then concatenate them into artificial chromosomes, potentially using strings of 1000 N's to separate each tag. I think this is a good idea, but more complicated than just mapping to the genome.
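
      A rough sketch of that concatenation idea, assuming the tags are already available as (name, sequence) pairs. The function name `concatenate_tags`, the `artificial_chr` naming and the `tags_per_chrom` parameter are hypothetical; the toy call uses a 3-N spacer instead of 1000 so the output is readable. The returned index records where each tag landed, so SNP positions can be translated back to tag coordinates.

```python
def concatenate_tags(tags, spacer_len=1000, tags_per_chrom=10000):
    """Join tags into artificial chromosomes separated by runs of N.

    Returns (chroms, index): chroms is a list of (name, sequence) pairs and
    index maps tag name -> (chrom name, start, end), 0-based half-open.
    """
    spacer = "N" * spacer_len
    chroms, index = [], {}
    parts, offset = [], 0
    for name, seq in tags:
        chrom_name = f"artificial_chr{len(chroms) + 1}"
        if parts:
            # Separate this tag from the previous one with the N spacer.
            parts.append(spacer)
            offset += spacer_len
        index[name] = (chrom_name, offset, offset + len(seq))
        parts.append(seq)
        offset += len(seq)
        # parts alternates tag, spacer, tag, ... so this counts the tags.
        if (len(parts) + 1) // 2 == tags_per_chrom:
            chroms.append((chrom_name, "".join(parts)))
            parts, offset = [], 0
    if parts:
        chroms.append((f"artificial_chr{len(chroms) + 1}", "".join(parts)))
    return chroms, index


tags = [("a", "ACGT"), ("b", "GGCC"), ("c", "TTAA")]
chroms, index = concatenate_tags(tags, spacer_len=3, tags_per_chrom=2)
for chrom_name, seq in chroms:
    print(f">{chrom_name}\n{seq}")
```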

      Can you think of any reason why mapping to the genome instead of to the sites themselves would be better or worse? I can think of pros and cons for each side, but I am not convinced either way yet. Obviously, though, the computational time alone will make up my mind.


      • SNPsaurus
        Registered Vendor
        • May 2013
        • 525

        #4
        To me, the biggest source of error when mapping to a whole genome is from alignment to duplicate genes/genomic regions. Sometimes a tag will align to multiple locations. One is true, and the others spurious. But a sequencing error may be enough to shift the tag from one location to the other. Or, because few genomes are "complete", your tag may map to the duplicate region rather than its real locus.

        But these are not serious issues. A small loss of tags from tossing out ones that map to multiple locations is just not an issue when you have 10k (or 100k) markers to play with. And in the second scenario, the usual information desired is comparing your samples to each other. So if all the samples map to a spurious locus because of missing sequence in the reference, they will be true in comparison to each other. Mapping issues might be more of a problem with 2b-RAD with its short sequence length, but I bet it will be OK.
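
        Tossing out tags that map to multiple locations could look something like the sketch below. It assumes single-end reads and an aligner run that reports every valid alignment as its own SAM record (for bowtie1, the `-a` option); with default single-hit reporting this filter would see nothing to remove. The function name `uniquely_mapped` and the toy records are hypothetical. bowtie1's `-m 1` option is an alternative that suppresses such reads at alignment time.

```python
from collections import Counter


def uniquely_mapped(sam_lines):
    """Keep alignment records for reads with exactly one mapped location."""
    records = [ln for ln in sam_lines if ln.strip() and not ln.startswith("@")]
    # Bit 0x4 of the FLAG field marks an unmapped read.
    mapped = [ln for ln in records if not int(ln.split("\t")[1]) & 4]
    per_read = Counter(ln.split("\t")[0] for ln in mapped)
    return [ln for ln in mapped if per_read[ln.split("\t")[0]] == 1]


# Toy records: r1 maps once, r2 maps to two loci, r3 is unmapped.
sam = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t255\t36M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t500\t255\t36M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr7\t900\t255\t36M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = uniquely_mapped(sam)
print([ln.split("\t")[0] for ln in kept])
```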
