Dear all,
I'm aware that several similar questions have been posted already (some of them rather dated, given how fast the possibilities in this field are growing), but I'm wondering how you would solve my specific case in the most efficient way:
I have Illumina short reads from which I want to call SNPs WITHOUT using a reference genome. The reads are defined by a specific restriction enzyme site in the genome, and I have them for several individuals per population and for several populations. Each of these loci is covered about 25 times per individual on average (25 reads per locus per individual), which allows me to first find SNPs within an individual (heterozygous positions), then compare all individuals belonging to the same population (looking for WITHIN-population SNPs), and ultimately compare the populations with each other (3 "hierarchical" steps). If possible, I'd like the SNP calling to be quality-aware.
One of the problems I see is how to get consensus sequences for an individual without a reference. The way I imagine a program doing this is to make stacks of reads that belong to the same locus in the genome (as I said, about 25 reads per locus on average). Since there will already be heterozygous single nucleotides within an individual, one should perhaps use the (IUPAC) ambiguity codes for polymorphic sites when collapsing these stacks to a consensus sequence.
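To make the stacking and collapsing idea concrete, here is a minimal Python sketch of what I have in mind. It is only a toy under several assumptions: all reads of a locus start at the restriction site and have the same length, reads are grouped greedily by Hamming distance to the first read of a stack, and a base counts as an allele if it makes up at least a given fraction of the high-quality calls at that position. The function names and the thresholds (`max_mismatches`, `min_qual`, `min_fraction`) are made up for illustration and are not taken from any existing program.

```python
from collections import Counter

# IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_stacks(reads, max_mismatches=2):
    """Greedy grouping (hypothetical, for illustration only).

    `reads` is a list of (sequence, qualities) pairs. Each read joins the
    first stack whose seed read (its first member) differs from it by at
    most `max_mismatches` positions; otherwise it starts a new stack.
    """
    stacks = []
    for seq, quals in reads:
        for stack in stacks:
            seed = stack[0][0]
            if len(seed) == len(seq) and hamming(seed, seq) <= max_mismatches:
                stack.append((seq, quals))
                break
        else:
            stacks.append([(seq, quals)])
    return stacks


def consensus(stack, min_qual=20, min_fraction=0.25):
    """Collapse one stack into a consensus with ambiguity codes.

    Base calls with quality below `min_qual` are ignored. A base is kept as
    an allele if it accounts for at least `min_fraction` of the remaining
    calls at that position. Positions with no usable calls, or with more
    than two alleles, are written as N.
    """
    out = []
    for i in range(len(stack[0][0])):
        counts = Counter(
            seq[i] for seq, quals in stack
            if quals[i] >= min_qual and seq[i] in "ACGT"
        )
        total = sum(counts.values())
        if total == 0:
            out.append("N")
            continue
        alleles = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
        out.append(IUPAC.get(alleles, "N"))
    return "".join(out)


if __name__ == "__main__":
    q = [30, 30, 30, 30]
    toy_reads = [("ACGT", q)] * 3 + [("ACAT", q)] * 3 + [("TTTT", q)] * 2
    for stack in build_stacks(toy_reads):
        print(len(stack), consensus(stack))
```

In this toy input the first six reads end up in one stack with a heterozygous third position (G/A, written as R), and the two TTTT reads form a second stack. The same two functions could in principle be applied again to the per-individual consensus sequences, although ambiguity codes would then need to be treated as matches rather than mismatches, which this sketch does not do.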
Do you have suggestions (e.g. programs or a pipeline) for how to do this? In particular, help with making such stacks and then getting a consensus sequence without a reference would be very welcome. Once I've done that for every individual, I could then make stacks again from the individual consensus sequences per population and compare these among the populations.
Thank you very much for your help,
Marius