SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   SNP flanking regions with "pseduo"-reference & viewing all alignments around SNPs (http://seqanswers.com/forums/showthread.php?t=63835)

fox454 10-29-2015 09:14 AM

SNP flanking regions with "pseduo"-reference & viewing all alignments around SNPs
 
Hi everyone,

Sorry for the long post but I could really use your help! I'm trying to verify flanking regions around SNPs and worried about how my "pseudo"-reference may affect my results...

Here is basically what I did:
- Illumina sequencing of 300 libraries with methods similar to genotype-by-sequencing (like RAD-tag libraries without shearing)
- Created contigs of sequences within individuals (cap3) and compared across 300 individuals
- Selected contigs found in at least 5 individuals and mapped to a draft reference genome of sister species (bwa-mem)
- Used SNP calls between my contigs and the reference and inserted differences to create a "pseudo"-reference for my species (GATK AlternateReferenceMaker)

Then I have did SNP calling from raw sequences using GATK best practices (bwa-mem alignment of each individual to pseudo-reference, realignment, g.vcf, Haplotype Caller)
I did hard filtering for quality, coverage, and missing data across SNPs
I have sub-selected SNPs with minor allele frequencies > 0.25 to use in designing a SNP array for parentage (~800 SNPs)

When I try to select the flanking regions around these SNPs using bedtools, it pulls the flanking regions from the pseudo-reference and many of them have a lot of N's... so my questions are:

1. How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling proceess?

2. What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?

3. Finally, the sister species and my pseudo-reference are not indexed in the same way. I'm guessing I should've sorted them before indexing or specified the indexing to use when using the GATK AlternateReferenceMaker. Can I convert the indexing?

Thanks for all your thoughts! Do you see any other flaws in this analysis?

Brian Bushnell 10-30-2015 12:08 AM

I'm very confused. What are you trying to accomplish? And what organism are you working with? - and, does it have a reference genome?

fox454 10-30-2015 09:46 AM

Quote:

Originally Posted by Brian Bushnell (Post 183539)
I'm very confused. What are you trying to accomplish? And what organism are you working with? - and, does it have a reference genome?

My species does not have a reference genome. There is a draft genome of a closely related species that I am using as a starting point. I used GATK alternate reference maker to incorporate the differences in my sequence contigs (SNPs etc) into a new reference genome that incorporates the structure of the sister species with the SNPs from my species.

My end goal is two-fold:

1. Identify SNPs across a large number of individuals to look for population structuring and kinship between individuals with known and unknown pedigree data.

2. Identify highly heterozygous SNPs to create an array to perform SNP genotyping at 150 SNPs for more than 1500 individuals.

Hopefully this explains my questions more. Thanks for any thoughts.


All times are GMT -8. The time now is 09:34 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.