Hierarchical reference-free SNP calling

Marius

8armed

Join Date: Dec 2010

Posts: 30
#1

12-20-2010, 07:13 AM

Dear all,

I'm aware that several similar questions have already been posted (some of them rather old, given how fast the possibilities in this field are growing), but I'm wondering how you would solve my specific case most efficiently:

I have Illumina short reads from which I want to call SNPs WITHOUT using a reference genome. What I have are reads from loci defined by a specific restriction enzyme site in the genome, for several individuals per population, and I have several populations. Each of these loci is covered on average 25 times per individual (25 reads per locus per individual), which allows me to first find SNPs within an individual (heterozygous positions), then compare all individuals belonging to the same population (looking for WITHIN-population SNPs), and ultimately compare the populations with each other (3 "hierarchical" steps). If possible, I'd like the SNP calling to be quality-aware.

One of the problems I see is getting consensus sequences for an individual without a reference. The way I imagine a program doing this is to build stacks of reads that belong to the same locus in the genome (as I said, about 25 reads per locus on average). Since there will already be heterozygous positions within an individual, the ambiguity codes should probably be used for polymorphic sites when these stacks are collapsed into a consensus sequence.
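To make the collapsing step concrete, here is a minimal Python sketch of what such a quality-aware consensus could look like. It assumes the reads of one stack are already grouped and have equal length; the function name `stack_consensus` and the thresholds `min_qual` and `min_frac` are hypothetical choices for illustration, not taken from any existing tool.

```python
from collections import Counter

# IUPAC ambiguity codes, keyed by the set of bases observed at a site.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def stack_consensus(reads, quals, min_qual=20, min_frac=0.2):
    """Collapse one stack of equal-length reads into an IUPAC consensus.

    reads: list of sequences (str) belonging to the same locus.
    quals: list of per-base Phred scores, one list per read.
    min_qual, min_frac: hypothetical thresholds; a base call is counted
    only if its quality is >= min_qual, and a base is kept at a site only
    if it makes up at least min_frac of the counted calls there.
    """
    consensus = []
    for pos in range(len(reads[0])):
        counts = Counter(
            read[pos]
            for read, qual in zip(reads, quals)
            if qual[pos] >= min_qual and read[pos] in "ACGT"
        )
        total = sum(counts.values())
        if total == 0:
            # No trustworthy base call at this position.
            consensus.append("N")
            continue
        kept = frozenset(b for b, n in counts.items() if n / total >= min_frac)
        # Fall back to N if the threshold removed every base.
        consensus.append(IUPAC.get(kept, "N"))
    return "".join(consensus)
```

For a stack of 12 reads `ACGT` and 13 reads `ACAT`, all with Phred 30, this returns `ACRT`: the third site is reported as heterozygous G/A, while a single low-quality mismatch in an otherwise uniform stack is ignored.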

Do you have suggestions (e.g. programs or a pipeline) for how to do this? Especially building such stacks and then getting a consensus sequence without a reference would help a lot. Once I've done that for every individual, I could again build stacks from the individual consensus sequences per population and compare these among the populations.
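The stacking step itself could be sketched roughly as follows. This is only an illustration under strong assumptions: reads of the same locus have the same length and start at the restriction site, and two reads belong to the same locus if they differ at no more than `max_mismatches` positions. The names `build_stacks` and `max_mismatches` are hypothetical, and the greedy approach shown is quadratic in the number of distinct sequences, so it would not scale to a full data set as written.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_stacks(reads, max_mismatches=1):
    """Greedily group reads into stacks (putative loci).

    Distinct sequences are processed from most to least abundant. Each
    sequence joins the first stack whose seed sequence has the same length
    and differs at no more than max_mismatches positions; otherwise it
    seeds a new stack. Returns a list of (seed, members) tuples.
    """
    stacks = []
    for seq, n in Counter(reads).most_common():
        for seed, members in stacks:
            if len(seed) == len(seq) and hamming(seed, seq) <= max_mismatches:
                members.extend([seq] * n)
                break
        else:
            # No existing stack is close enough: start a new one.
            stacks.append((seq, [seq] * n))
    return stacks
```

With a threshold of one mismatch, the two alleles of a heterozygous locus end up in the same stack, whereas paralogous or repetitive regions may be merged by mistake, so the threshold would need to be chosen with care.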

Thank you very much for your help,

Marius

Last edited by Marius; 12-20-2010, 02:11 PM.
Marius

8armed

Join Date: Dec 2010

Posts: 30
#2

12-27-2010, 09:38 AM

It seems the best approach would be a de novo assembly (contigs) of all my reads, pooled across all individuals. I expect about 40,000 loci, so I'd get about 40,000 contigs, which I would then use as a reference for consensus calling in each individual, since I have many replicate reads per locus per individual to check for heterozygous positions.

What would be the best assembler to do that? Nice features to have would be:
- I'd like to regard only good-quality reads as "true" reads and use only these to build the contigs (so there should be some kind of quality filtering before contig building).
- Once I have these 40,000 contigs (I guess I would simply number them 1 to 40,000 or so), I'd like to align the reads of every single individual to them and call the individual consensus sequence for every locus (which can be heterozygous). When building the contigs I will have very many reads for every locus (the sum of all replicates of all individuals for that locus), and these will already contain different alleles (SNPs). So the contig sequences (the consensus of all these biological replicates) should carry an "N" at these positions, which will allow all variants to align correctly to the locus later on, when I call the individual consensus sequences.
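
The two wishes above can be sketched in Python as follows. This is a minimal illustration, not an assembler: it assumes the reads of one locus are already grouped and of equal length, and that quality strings are Phred-encoded with an offset of 33 (older Illumina pipelines use an offset of 64, so `offset` would need to be changed accordingly). The names `filter_reads`, `masked_reference`, `min_mean_qual` and `min_minor_count` are hypothetical.

```python
from collections import Counter


def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (offset is an assumption)."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def filter_reads(records, min_mean_qual=25, offset=33):
    """Keep the sequences whose mean quality is at least min_mean_qual.

    records: iterable of (sequence, quality_string) tuples.
    """
    return [
        seq for seq, qual in records
        if qual and mean_phred(qual, offset) >= min_mean_qual
    ]


def masked_reference(seqs, min_minor_count=2):
    """Build one reference sequence from the pooled reads of a locus.

    Each position gets the most common base, or "N" if a second base is
    seen at least min_minor_count times (a putative SNP), so that reads
    carrying either allele can later align to the locus.
    """
    reference = []
    for column in zip(*seqs):
        counts = Counter(b for b in column if b in "ACGT").most_common()
        if not counts:
            reference.append("N")
        elif len(counts) > 1 and counts[1][1] >= min_minor_count:
            reference.append("N")
        else:
            reference.append(counts[0][0])
    return "".join(reference)
```

Requiring the minor base to appear at least `min_minor_count` times is one simple way to keep isolated sequencing errors from masking a position; whether a given aligner treats "N" in the reference as a match or as a mismatch depends on the aligner and would need to be checked.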