  • Determine similarity from NGS data

    Hi,

    I have 250bp paired-end sequencing data (Illumina MiSeq, k-mer coverage ~18) of four E. coli strains. What I want to determine is how similar the strains are.

    I imagine this could be done on the basis of the raw data alone, thus without trying to assemble the individual genomes (at least for a first rough approximation of similarity). Can anyone suggest to me what would be the best approach for this?

    Another option would be to take the largest scaffold currently available for one strain, map the reads of each strain onto it, and compare. The data are all from the same sequencing run, and the strains cannot be told apart by eye on the basis of the FASTQ quality metrics. I think it is reasonable to assume that technical errors are equally distributed across the strains. Thus, after trimming and quality filtering with the same settings, dissimilarities can be assessed. To determine the number of SNPs, however, I would need to take the sequencing error rate into account (0.80%, http://bmcgenomics.biomedcentral.com...71-2164-13-341). Since many sequencing errors are discarded during assembly, I don't know how to disentangle the true SNPs from sequencing errors. Any suggestions on how to tackle this issue are appreciated.

  • #2
    You may be able to do this by finding strain-specific k-mers: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4005670/
    The BBMap suite has k-mer identification programs you can use, in addition to the programs in the paper above.
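
To illustrate the k-mer idea, here is a minimal Python sketch, not the method from the linked paper or from BBMap. It collects canonical k-mers per strain from the raw reads, drops k-mers seen fewer than `min_count` times (an assumed, crude way to suppress sequencing errors), and then reports pairwise Jaccard similarity and strain-specific k-mers. The function names, `k=21`, and `min_count=2` are all illustrative assumptions; for real MiSeq data a dedicated k-mer counter would be far more practical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def solid_kmers(reads, k=21, min_count=2):
    """Set of canonical k-mers seen at least min_count times in the reads.

    min_count is an assumed error filter: k-mers created by a sequencing
    error tend to be rare, so low-count k-mers are discarded.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _VALID:  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def strain_specific(kmer_sets):
    """For each strain, the k-mers found in no other strain."""
    result = {}
    for name, kmers in kmer_sets.items():
        others = set().union(*(s for n, s in kmer_sets.items() if n != name))
        result[name] = kmers - others
    return result
```

With one list of read sequences per strain, `jaccard(solid_kmers(reads_a), solid_kmers(reads_b))` gives a rough, assembly-free similarity for each pair. At ~18x coverage some true k-mers will fall below the count threshold, so the values are only a first approximation.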

    • #3
      You can substantially reduce the sequencing error rate by error-correcting the data, for example with Tadpole. 18x is pretty low for good error correction or assembly, though. If the reads mostly overlap, you can also achieve some degree of error correction by merging them, e.g. with BBMerge.

      I would probably assemble each strain (using adapter-trimmed, error-corrected, and, if they mostly overlap, merged reads), and then do all 16 mappings of reads to assemblies (four read sets against four assemblies) to estimate SNP rates from the pairwise error rates. For example, if strain 1 has a 0.1% substitution rate when mapped to its own assembly and strain 2 has a 0.7% substitution rate when mapped to strain 1's assembly, then there is probably a 0.6% SNP rate between strain 1 and strain 2.
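
The subtraction in that example can be sketched in Python as follows. This is only an illustration of the arithmetic: it assumes you have already extracted a substitution rate for each of the 16 (reads, assembly) pairs from your mapping output, and the function name and dictionary layout are hypothetical. The baseline is taken to be the self-mapping rate of the assembly's own strain, as in the example above.

```python
def estimate_snp_rates(sub_rate):
    """Estimate pairwise SNP rates from mapping substitution rates.

    sub_rate maps (reads_strain, assembly_strain) -> substitution rate
    as a fraction (0.007 means 0.7%). The self-mapping rate of each
    assembly is treated as the error baseline (an assumption) and is
    subtracted from every cross-strain rate against that assembly.
    """
    estimates = {}
    for (reads, assembly), rate in sub_rate.items():
        if reads == assembly:
            continue  # self-mappings only provide the baseline
        baseline = sub_rate[(assembly, assembly)]
        # Clamp at zero: noise can push a cross rate below the baseline.
        estimates[(reads, assembly)] = max(0.0, rate - baseline)
    return estimates


# The numbers from the example above: 0.1% self rate, 0.7% cross rate.
rates = {
    ("strain1", "strain1"): 0.001,
    ("strain2", "strain1"): 0.007,
}
snp = estimate_snp_rates(rates)
print(f"{snp[('strain2', 'strain1')]:.2%}")
```

Each pair of strains gets two estimates this way (A reads on B's assembly and B reads on A's assembly). If the two disagree strongly, that may point to differences in assembly quality rather than real divergence.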
      Last edited by Brian Bushnell; 03-15-2016, 09:25 PM.
