  • 2b-RAD and GATK

    Hello all,

    I have some Type 2B RAD data for many individuals from several populations of my non-model species. I have made a reference to map the tags to by extracting all potential RAD sites from the available genome. I am now trying to discover SNPs among the individuals.

    I've previously used samtools mpileup and a home-made haplotype caller, and I have also tried Stacks. My conclusion is that the SNPs Stacks calls do not agree with those from the other two methods, even after accounting for a weird indexing issue that appears to be going on. I am skeptical that Stacks is appropriate for Type 2B RAD.

    I would ideally like to run the same SNP analysis using GATK, then find the intersection of SNPs called using both mpileup and GATK. However, I can't get GATK to run!
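
    The intersection step described above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: it assumes both callers write standard tab-separated VCF text, and the function names `read_snp_sites` and `shared_snps` are made up for the example.

```python
def read_snp_sites(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys for single-base substitutions."""
    bases = {"A", "C", "G", "T"}
    sites = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        ref, alt = fields[3].upper(), fields[4].upper()
        for allele in alt.split(","):
            # keep only simple SNPs; indels and symbolic alleles are ignored
            if ref in bases and allele in bases:
                sites.add((chrom, pos, ref, allele))
    return sites


def shared_snps(vcf_a_lines, vcf_b_lines):
    """Return the sorted SNP keys present in both call sets."""
    return sorted(read_snp_sites(vcf_a_lines) & read_snp_sites(vcf_b_lines))
```

    Each argument can be an open file handle, e.g. `shared_snps(open("mpileup.vcf"), open("gatk.vcf"))` (file names hypothetical). Matching on position and alleles, rather than position alone, avoids counting a site as shared when the two callers report different alternate bases.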

    I have trimmed, filtered, and mapped (bowtie1) reads with added read groups in individual.sorted.bam format. I would like to run the GATK UnifiedGenotyper on a single individual as a first pass, then refine that SNP list using BQSR, VQSR and multi-sample UnifiedGenotyper SNP identification.

    However, here is the problem: even on our available supercomputer, and even using -nt and -nct, GATK estimates that this single individual will take 4.9 weeks to process!! I am surprised at this. The individual in question has 6,925,188 36-bp reads, mapped to a reference made of every possible RAD site in the genome: 1,624,953 36-bp contigs.

    A collaborator suggested that GATK run time increases massively with the number of reference contigs, so I went back to my .sam files and deleted from my reference.fasta any reference RAD site that was not seen more than 100 times among all my individuals. This reduced my reference from the 1.6 million potential tags to 95,000 tags that are actually seen in my data. Still, this did not fix (or even appreciably reduce) the run time the UnifiedGenotyper predicts.
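
    For anyone wanting to reproduce that reference-trimming step, a minimal sketch is below. It assumes single-end SAM text and a FASTA whose record names match the SAM reference names; the names `count_hits`, `filter_fasta` and `min_hits` are hypothetical and this is not the exact script used for the numbers above.

```python
from collections import Counter


def count_hits(sam_lines):
    """Count mapped reads per reference name across SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete alignment record
        flag, rname = int(fields[1]), fields[2]
        if flag & 4 or rname == "*":
            continue  # read is unmapped
        counts[rname] += 1
    return counts


def filter_fasta(fasta_lines, counts, min_hits=100):
    """Keep FASTA records whose name was hit more than min_hits times."""
    kept, keeping = [], False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            keeping = bool(parts) and counts.get(parts[0], 0) > min_hits
        if keeping and line:
            kept.append(line)
    return kept
```

    To pool counts over all individuals, the per-file `Counter` objects can simply be summed before calling `filter_fasta`.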

    Does anyone have any ideas about what is going on here, or, better, how to overcome this problem? I would really like to use GATK!

    Thank you!

  • #2
    Have you tried running it against the reference without extracting potential RAD sites? I've done large populations using novoalign against a reference (RAD or nextRAD) and the time is trivial. I also routinely take a population, identify the tags in the population (you can decide to include only predominant tags or all tags), and then align those tags against each other to determine alleles, and then count those alleles in each sample.

    Sorry to not address your question, but these are paths that work for me for RAD and nextRAD. I assume 2b-RAD would behave in similar ways.
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

    • #3
      Interesting. I had previously resisted mapping to the whole genome (12.5k contigs) because I felt like there may be increased mapping error around the restriction site. However, I don't have any data about this yet, just a gut feeling that feeding the mapper known sites selected for their restriction sites would improve this. But... on the other hand, I filter and trim my raw reads and insist that each one contains the RAD site, so maybe I'm just being cautious for no reason.

      Anyway -- I did what you suggested and mapped an individual to the whole genome, then ran the UnifiedGenotyper on that guy. We're down to 51 minutes! Looks like the reference database size does indeed make a huge time difference (12.5k contigs/~1 hour vs 95k contigs/4.9 weeks...)

      Another idea that was suggested to me was to extract the RAD sites from the genome, then concatenate them into artificial chromosomes, potentially using strings of 1000 N's to separate each tag. I think this is a good idea, but more complicated than just mapping to the genome.
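
      A rough sketch of that concatenation idea, for illustration only: tags are joined into a small number of artificial chromosomes separated by runs of N, and an index records where each tag landed so calls can be translated back to tag coordinates. The function name, the `pseudo_chr` naming and the default of 10,000 tags per chromosome are assumptions; only the 1000-N spacer comes from the suggestion above.

```python
def build_pseudo_chromosomes(tags, tags_per_chrom=10000, spacer_len=1000):
    """Join (name, sequence) tags into N-separated artificial chromosomes.

    Returns (chroms, index): chroms is a list of (chrom_name, sequence) and
    index maps each tag name to (chrom_name, start, end), 0-based, end-exclusive.
    """
    spacer = "N" * spacer_len
    chroms, index = [], {}
    for i in range(0, len(tags), tags_per_chrom):
        chrom_name = f"pseudo_chr{i // tags_per_chrom + 1}"
        pieces, offset = [], 0
        for name, seq in tags[i:i + tags_per_chrom]:
            if pieces:
                # separate this tag from the previous one with a run of Ns
                pieces.append(spacer)
                offset += spacer_len
            pieces.append(seq)
            index[name] = (chrom_name, offset, offset + len(seq))
            offset += len(seq)
        chroms.append((chrom_name, "".join(pieces)))
    return chroms, index
```

      With 95,000 tags and these defaults this would give 10 contigs instead of 95,000, at the cost of having to carry the index around to convert positions back to individual tags.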

      Can you think of any reason why mapping to the genome instead of to the sites themselves would be better or worse? I can think of pros and cons for each side, but I am not convinced either way yet. Obviously, though, the computational time alone will make up my mind.

      • #4
        To me, the biggest source of error when mapping to a whole genome is from alignment to duplicate genes/genomic regions. Sometimes a tag will align to multiple locations. One is true, and the others spurious. But a sequencing error may be enough to shift the tag from one location to the other. Or, because few genomes are "complete", your tag may map to the duplicate region rather than its real locus.

        But these are not serious issues. A small loss of tags from tossing out ones that map to multiple locations is just not an issue when you have 10k (or 100k) markers to play with. And in the second scenario, the usual information desired is comparing your samples to each other. So if all the samples map to a spurious locus because of missing sequence in the reference, they will be true in comparison to each other. Mapping issues might be more of a problem with 2b-RAD with its short sequence length, but I bet it will be OK.
        Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
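
        Tossing out tags that map to multiple locations, as described in the post above, might look something like the sketch below. It assumes single-end reads and an aligner configured to report every alignment of a read as its own SAM record; many aligners instead report one alignment and signal ambiguity through MAPQ or optional tags, in which case a different filter is needed. The function name `unique_alignments` is hypothetical.

```python
from collections import defaultdict


def unique_alignments(sam_lines):
    """Keep header lines plus reads that have exactly one mapped record."""
    header, by_read = [], defaultdict(list)
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            header.append(line)
            continue
        fields = line.split("\t")
        if len(fields) < 4 or int(fields[1]) & 4:
            continue  # incomplete record or unmapped read
        by_read[fields[0]].append(line)
    # reads with more than one mapped record are treated as ambiguous
    kept = [records[0] for records in by_read.values() if len(records) == 1]
    return header + kept
```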
