Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hierarchical reference-free SNP calling

    Dear all,

    I'm aware there are several similar questions posted already (some almost a bit too old regarding the fast growing possibilities in this field), but I'm wondering how you would solve my specific case in the most efficient way:

    I have Illumina short reads from which I want to call SNPs WITHOUT
    using a reference genome. What I have are reads that are defined by a specific restriction enzyme site in the genome of several individuals per population. And I have several populations. These defined loci are in average 25 times replicated per individual (25 reads per locus/ind.), what allows me to first find SNPs within an individual (heterozygote positions), then compare all individuals belonging to the same population (looking for WITHIN population SNPs) and ultimatively compare populations between each other (3 "hierarchical" steps). If possible I'd like to do this SNP-calling quality aware. One of the problems I see is to get consensus sequences for an individual without a reference. How I imagine this should be done by a program is to make stacks of reads that belong to the same locus in the genome (as I said, about 25 reads per locus in average). Since there will be heterozygous single nucleotides already within an individual, when collapsing these stacks to a consensus sequence, one should maybe use the ambiguity code for polymorphic sites.

    Do you have suggestions (i.e. programs or a pipeline) for how to do this? Especially making such stacks and then get a consensus sequence without a reference would help a lot. Once I've done that for every individual, I could then again make stacks from the individual consensus sequences per population and compare these among the populations.

    Thank you a lot for cour help,

    Marius
    Last edited by Marius; 12-20-2010, 02:11 PM.

  • #2
    As it seems, what would be best is to do a denovo Assembly (contigs) with all (of all the individuals) my reads (I expect about 40'000 loci), so I'd get about 40'000 contigs, and then use these as a reference to do consensus calling for each individual, since I have many replicates for each locus per individual to check for heterozygot positions.

    What would be the best assembler to do that? Nice features to have would be:
    -I'd like only to regard good quality reads as "true" reads and only use these to build these contigs (so there should be some kind of quality filtering before contig-building).
    -Once I have these 40'000 contigs (I guess I would name them simply with the numbers 1 to 40'000 or so), I'd like to use these to align the reads of every single individual to these contigs to call for the individual consensus-sequences for every locus (can be heterozygous). Therefore, when building the contigs I will have many many reads for every locus (thze sum of all replicates of all the individuals of one locus), which will have different alleles (SNPs) already. So the contig-sequences (concensus of all these biol. replicates) should make a "N" at these positions, which will allow all variants to align to this locus later on correctly when I call for the individual-consensus sequences.

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin




      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
      04-22-2024, 07:01 AM
    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, Yesterday, 11:49 AM
    0 responses
    15 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-24-2024, 08:47 AM
    0 responses
    16 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    61 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    60 views
    0 likes
    Last Post seqadmin  
    Working...
    X