Hi everyone,
Sorry for the long post but I could really use your help! I'm trying to verify flanking regions around SNPs and worried about how my "pseudo"-reference may affect my results...
Here is basically what I did:
- Illumina sequencing of 300 libraries with methods similar to genotype-by-sequencing (like RAD-tag libraries without shearing)
- Created contigs of sequences within individuals (cap3) and compared across 300 individuals
- Selected contigs found in at least 5 individuals and mapped to a draft reference genome of sister species (bwa-mem)
- Used SNP calls between my contigs and the reference and inserted differences to create a "pseudo"-reference for my species (GATK AlternateReferenceMaker)
Then I have did SNP calling from raw sequences using GATK best practices (bwa-mem alignment of each individual to pseudo-reference, realignment, g.vcf, Haplotype Caller)
I did hard filtering for quality, coverage, and missing data across SNPs
I have sub-selected SNPs with minor allele frequencies > 0.25 to use in designing a SNP array for parentage (~800 SNPs)
When I try to select the flanking regions around these SNPs using bedtools, it pulls the flanking regions from the pseudo-reference and many of them have a lot of N's... so my questions are:
1. How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling proceess?
2. What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?
3. Finally, the sister species and my pseudo-reference are not indexed in the same way. I'm guessing I should've sorted them before indexing or specified the indexing to use when using the GATK AlternateReferenceMaker. Can I convert the indexing?
Thanks for all your thoughts! Do you see any other flaws in this analysis?
Sorry for the long post but I could really use your help! I'm trying to verify flanking regions around SNPs and worried about how my "pseudo"-reference may affect my results...
Here is basically what I did:
- Illumina sequencing of 300 libraries with methods similar to genotype-by-sequencing (like RAD-tag libraries without shearing)
- Created contigs of sequences within individuals (cap3) and compared across 300 individuals
- Selected contigs found in at least 5 individuals and mapped to a draft reference genome of sister species (bwa-mem)
- Used SNP calls between my contigs and the reference and inserted differences to create a "pseudo"-reference for my species (GATK AlternateReferenceMaker)
Then I have did SNP calling from raw sequences using GATK best practices (bwa-mem alignment of each individual to pseudo-reference, realignment, g.vcf, Haplotype Caller)
I did hard filtering for quality, coverage, and missing data across SNPs
I have sub-selected SNPs with minor allele frequencies > 0.25 to use in designing a SNP array for parentage (~800 SNPs)
When I try to select the flanking regions around these SNPs using bedtools, it pulls the flanking regions from the pseudo-reference and many of them have a lot of N's... so my questions are:
1. How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling proceess?
2. What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?
3. Finally, the sister species and my pseudo-reference are not indexed in the same way. I'm guessing I should've sorted them before indexing or specified the indexing to use when using the GATK AlternateReferenceMaker. Can I convert the indexing?
Thanks for all your thoughts! Do you see any other flaws in this analysis?
Comment