SEQanswers

Old 10-29-2015, 09:14 AM   #1
fox454
Junior Member
 
Location: Washington, DC

Join Date: Sep 2011
Posts: 2
SNP flanking regions with "pseudo"-reference & viewing all alignments around SNPs

Hi everyone,

Sorry for the long post but I could really use your help! I'm trying to verify flanking regions around SNPs and worried about how my "pseudo"-reference may affect my results...

Here is basically what I did:
- Illumina sequencing of 300 libraries with methods similar to genotype-by-sequencing (like RAD-tag libraries without shearing)
- Created contigs of sequences within individuals (cap3) and compared across 300 individuals
- Selected contigs found in at least 5 individuals and mapped to a draft reference genome of sister species (bwa-mem)
- Used SNP calls between my contigs and the reference and inserted differences to create a "pseudo"-reference for my species (GATK AlternateReferenceMaker)

Then I did SNP calling from the raw sequences using GATK best practices (bwa-mem alignment of each individual to the pseudo-reference, realignment, g.vcf, HaplotypeCaller).
I did hard filtering for quality, coverage, and missing data across SNPs.
I then sub-selected SNPs with minor allele frequencies > 0.25 to use in designing a SNP array for parentage (~800 SNPs).
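
For readers following along, the minor allele frequency cut-off described above can be sketched roughly as below. This is only an illustration, not the filtering actually run for this post: it assumes biallelic SNPs and VCF-style genotype strings, and the function names and the `snps` dictionary layout are hypothetical.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the minor allele frequency of one biallelic SNP.

    `genotypes` is a list of VCF-style genotype strings such as "0/1".
    Missing alleles (".") are ignored. Assumes the site is biallelic.
    """
    counts = Counter()
    for gt in genotypes:
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[allele] += 1
    total = sum(counts.values())
    # A site with no calls, or with only one allele observed, has MAF 0.
    if total == 0 or len(counts) < 2:
        return 0.0
    return min(counts.values()) / total


def select_high_maf(snps, threshold=0.25):
    """Return the IDs of SNPs whose MAF is strictly above `threshold`.

    `snps` maps a (hypothetical) SNP ID to its list of genotype strings.
    """
    return [snp_id for snp_id, gts in snps.items()
            if minor_allele_frequency(gts) > threshold]
```

For example, `select_high_maf({"snp1": ["0/0", "0/1", "1/1"], "snp2": ["0/0", "0/0", "0/1"]})` keeps only `snp1`.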

When I try to select the flanking regions around these SNPs using bedtools, it pulls the flanking regions from the pseudo-reference and many of them have a lot of N's... so my questions are:

1. How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling process?

2. What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?

3. Finally, the sister species and my pseudo-reference are not indexed in the same way. I'm guessing I should've sorted them before indexing or specified the indexing to use when using the GATK AlternateReferenceMaker. Can I convert the indexing?
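
One possible starting point for questions 1 and 2, offered as a sketch rather than a tested workflow: compute the flank window around each SNP, measure how much of the extracted flank is N, and build a region string that can be passed to `samtools view` on a coordinate-sorted, indexed BAM to pull each individual's reads over that window. The function names, the flank size, and the contig names below are all hypothetical.

```python
def flank_window(pos, flank, contig_length):
    """Return the 1-based inclusive (start, end) window around a SNP.

    The window extends `flank` bases to each side of `pos` and is
    clipped so it never runs off either end of the contig.
    """
    start = max(1, pos - flank)
    end = min(contig_length, pos + flank)
    return start, end


def samtools_region(contig, pos, flank, contig_length):
    """Format the window as a 'contig:start-end' region string."""
    start, end = flank_window(pos, flank, contig_length)
    return f"{contig}:{start}-{end}"


def n_fraction(seq):
    """Return the fraction of bases in `seq` that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


if __name__ == "__main__":
    # Hypothetical SNP at position 120 on a 1000 bp contig, 50 bp flanks.
    region = samtools_region("scaffold_1", 120, 50, 1000)
    print(region)
    # The region string could then be used as, for example:
    #   samtools view individual.bam scaffold_1:70-170
    print(n_fraction("ACGNNnTA"))
```

Flanks with a high N fraction could then be checked against the same window in the sister-species draft to see whether the N's were already present there before the pseudo-reference was built.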

Thanks for all your thoughts! Do you see any other flaws in this analysis?
Old 10-30-2015, 12:08 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I'm very confused. What are you trying to accomplish? And what organism are you working with? - and, does it have a reference genome?
Old 10-30-2015, 09:46 AM   #3
fox454
Junior Member
 
Location: Washington, DC

Join Date: Sep 2011
Posts: 2

Quote:
Originally Posted by Brian Bushnell View Post
I'm very confused. What are you trying to accomplish? And what organism are you working with? - and, does it have a reference genome?
My species does not have a reference genome. There is a draft genome of a closely related species that I am using as a starting point. I used GATK alternate reference maker to incorporate the differences in my sequence contigs (SNPs etc) into a new reference genome that incorporates the structure of the sister species with the SNPs from my species.
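
To make the idea concrete, here is a toy illustration of substituting SNP alleles into a reference sequence. It is a conceptual sketch only and is not how GATK implements its alternate reference maker; the function name and the `snps` layout are hypothetical, and it handles single-base substitutions only.

```python
def apply_snps(reference, snps):
    """Return a copy of `reference` with SNP alleles substituted in.

    `snps` maps a 1-based position to a (ref_allele, alt_allele) pair.
    Raises ValueError if the reference base at a position does not match
    the expected ref allele, which would suggest a coordinate mismatch.
    """
    seq = list(reference)
    for pos, (ref, alt) in snps.items():
        if seq[pos - 1].upper() != ref.upper():
            raise ValueError(
                f"position {pos}: expected {ref}, found {seq[pos - 1]}"
            )
        seq[pos - 1] = alt
    return "".join(seq)


if __name__ == "__main__":
    print(apply_snps("ACGTACGT", {2: ("C", "T"), 8: ("T", "A")}))
```

In this toy version the output has the same length as the input, so positions in the new sequence still line up with positions in the original one.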

My end goal is two-fold:

1. Identify SNPs across a large number of individuals to look for population structuring and kinship between individuals with known and unknown pedigree data.

2. Identify highly heterozygous SNPs to create an array to perform SNP genotyping at 150 SNPs for more than 1500 individuals.
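
As a rough sketch of how goal 2 might be approached, SNPs can be ranked by expected heterozygosity (2pq for a biallelic site under Hardy-Weinberg assumptions) and the top candidates kept. The function names and the input layout are hypothetical, and a real array design would also need to consider flank quality and spacing between SNPs.

```python
def expected_heterozygosity(maf):
    """Expected heterozygosity 2pq for a biallelic SNP with minor allele frequency `maf`."""
    return 2 * maf * (1 - maf)


def top_snps(maf_by_snp, k):
    """Return the IDs of the `k` SNPs with the highest expected heterozygosity.

    `maf_by_snp` maps a (hypothetical) SNP ID to its minor allele frequency.
    """
    ranked = sorted(
        maf_by_snp,
        key=lambda snp_id: expected_heterozygosity(maf_by_snp[snp_id]),
        reverse=True,
    )
    return ranked[:k]
```

For the array described here, `k` would be 150, with `maf_by_snp` built from the filtered SNP set.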

Hopefully this explains my questions more. Thanks for any thoughts.

Tags
non-model organism, reference genome, snp region

All times are GMT -8.