  • SNP flanking regions with "pseudo"-reference & viewing all alignments around SNPs

    Hi everyone,

    Sorry for the long post but I could really use your help! I'm trying to verify flanking regions around SNPs and I'm worried about how my "pseudo"-reference may affect my results...

    Here is basically what I did:
    - Illumina sequencing of 300 libraries with methods similar to genotyping-by-sequencing (like RAD-tag libraries without shearing)
    - Created contigs of sequences within individuals (cap3) and compared across 300 individuals
    - Selected contigs found in at least 5 individuals and mapped to a draft reference genome of sister species (bwa-mem)
    - Used SNP calls between my contigs and the reference and inserted differences to create a "pseudo"-reference for my species (GATK AlternateReferenceMaker)

    Then I did SNP calling from raw sequences using GATK best practices (bwa-mem alignment of each individual to pseudo-reference, realignment, g.vcf, Haplotype Caller)
    I did hard filtering for quality, coverage, and missing data across SNPs
    I have sub-selected SNPs with minor allele frequencies > 0.25 to use in designing a SNP array for parentage (~800 SNPs)
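
To make the minor allele frequency cut-off concrete, here is a minimal sketch of that sub-selection step. It assumes biallelic sites and VCF-style GT strings already parsed into a dictionary; the function names `minor_allele_frequency` and `select_by_maf` and the input layout are hypothetical and not part of GATK or any other tool mentioned in this thread.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the MAF for one biallelic site, or None if nothing was called.

    `genotypes` is a list of VCF-style GT strings such as "0/1", "1|1" or "./.".
    Assumption: only alleles "0" and "1" occur (biallelic site).
    """
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # skip missing calls
                counts[allele] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    ref_freq = counts["0"] / total
    return min(ref_freq, 1.0 - ref_freq)


def select_by_maf(sites, threshold=0.25):
    """Keep the IDs of sites whose MAF is strictly greater than `threshold`."""
    kept = []
    for site_id, genotypes in sites.items():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and maf > threshold:
            kept.append(site_id)
    return kept
```

Sites with heavy missing data can still pass this filter on very few called alleles, so it makes sense to apply it after the missing-data filter rather than before.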

    When I try to select the flanking regions around these SNPs using bedtools, it pulls the flanking regions from the pseudo-reference and many of them have a lot of N's... so my questions are:

    1. How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling process?

    2. What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?

    3. Finally, the sister species and my pseudo-reference are not indexed in the same way. I'm guessing I should've sorted them before indexing or specified the indexing to use when using the GATK AlternateReferenceMaker. Can I convert the indexing?

    Thanks for all your thoughts! Do you see any other flaws in this analysis?
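
One way to start on question 1 is to measure how much of each flank is N, and then run the same check on the matching window of the sister-species draft. The sketch below is only an illustration in plain Python: `flank_report` is a hypothetical helper, the flank length of 50 is an arbitrary choice, and the sequence is assumed to be already loaded as a string for the contig that carries the SNP.

```python
def flank_report(sequence, snp_pos, flank=50):
    """Return the left flank, right flank, and fraction of N bases in both.

    `snp_pos` is 1-based, as in a VCF. Flanks are truncated at contig ends.
    """
    idx = snp_pos - 1  # convert to a 0-based string index
    if not 0 <= idx < len(sequence):
        raise ValueError("SNP position is outside the sequence")
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    combined = (left + right).upper()  # count soft-masked 'n' as well
    n_fraction = combined.count("N") / len(combined) if combined else 0.0
    return left, right, n_fraction
```

If the same window in the sister-species draft already shows a similar N fraction, the N's are probably assembly gaps carried over from that draft rather than something introduced while building the pseudo-reference. This comparison assumes the coordinates still line up, which may not hold if indels were inserted into the pseudo-reference.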

  • #2
    I'm very confused. What are you trying to accomplish? And what organism are you working with? - and, does it have a reference genome?



    • #3
      Originally posted by Brian Bushnell
      I'm very confused. What are you trying to accomplish? And what organism are you working with? - and, does it have a reference genome?
      My species does not have a reference genome. There is a draft genome of a closely related species that I am using as a starting point. I used GATK alternate reference maker to incorporate the differences in my sequence contigs (SNPs etc) into a new reference genome that incorporates the structure of the sister species with the SNPs from my species.

      My end goal is two-fold:

      1. Identify SNPs across a large number of individuals to look for population structuring and kinship between individuals with known and unknown pedigree data.

      2. Identify highly heterozygous SNPs to create an array to perform SNP genotyping at 150 SNPs for more than 1500 individuals.

      Hopefully this explains my questions more. Thanks for any thoughts.
