  • Better accuracy in assembling, and SNP calling in, low-complexity sequence regions?

    Greetings,

    I'm looking for some advice on how to improve assembly and variant calling from 100 bp Illumina reads in genes with low-complexity regions (imperfect repeat sequences).

    I am working on comparative genomics with a number of very AT-rich genomes (about 80% AT, in a variety of Plasmodium species). I am also doing some population genetics on these genomes and need an accurate set of SNPs (indels would be nice, too).

    Mapping, de novo assembly, and SNP/indel calling all struggle with low-complexity regions (I'm using Velvet for de novo assembly, BWA for mapping, and SAMtools/bcftools for variant calling). Velvet gets these regions right about 50% of the time (checked against Sanger sequencing); BWA can't map them at all.

    I tried masking the regions in the genome using DUST, but it only finds small regions; the low-complexity regions are easier to find using the protein sequence.

    Any advice on how to mask these regions or (even better) include them in the analyses and get them right would be appreciated.
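To make the masking idea concrete, here is a minimal sketch of one possible approach: score sliding windows by the Shannon entropy of their overlapping triplets and soft-mask (lowercase) windows that score low. This is not the DUST algorithm, only a simplified stand-in in the same spirit. The function names, the 30 bp window, and the 1.5-bit threshold are all assumptions chosen for illustration; in an ~80% AT genome the threshold would need tuning against regions already confirmed by Sanger sequencing.

```python
import math
from collections import Counter


def kmer_entropy(window, k=3):
    """Shannon entropy (bits) of the overlapping k-mers in one window."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def low_complexity_intervals(seq, window=30, max_entropy=1.5, k=3):
    """Return merged 0-based, half-open (start, end) intervals covered by
    windows whose k-mer entropy is at or below max_entropy.

    window and max_entropy are illustrative defaults, not tuned values.
    """
    seq = seq.upper()
    intervals = []
    for start in range(len(seq) - window + 1):
        if kmer_entropy(seq[start:start + window], k) <= max_entropy:
            end = start + window
            if intervals and start <= intervals[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
            else:
                intervals.append((start, end))
    return intervals


def soft_mask(seq, intervals):
    """Lowercase the bases inside the given intervals (soft masking)."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = chars[i].lower()
    return "".join(chars)
```

The intervals could be written out as a BED file, or passed to a downstream filter to flag variants that fall in masked sequence rather than dropping the regions from the reference entirely.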
    Last edited by Genomics101; 12-20-2013, 05:05 AM. Reason: Spelling

  • #2
    Hi,

    I've had similar problems with variable number tandem repeat (VNTR) regions in a high-G+C bacterium. I've been using BWA as an aligner and SolSNP as a SNP/variant caller. For de novo assembly, I've had some success combining different sequencing technologies (454 and PE Illumina) and then performing a hybrid assembly using MIRA. The longer 454 reads are able to span some of these regions, although there are still plenty of sections that cannot be assembled.

    For mapping, have you tried any aligners besides BWA? I was thinking of trying bfast to see whether it copes with these regions any better. Failing that, can you increase the stringency of the SAMtools SNP caller? At the very least, you should be able to remove the low-confidence SNPs from the analysis.
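As a rough illustration of removing low-confidence SNPs, the sketch below filters VCF records in plain Python by the QUAL column and, optionally, by overlap with masked intervals. It is not the SAMtools/bcftools filter itself. The function names, the QUAL cutoff of 30, and the `masked` dictionary layout are assumptions made for this example.

```python
def parse_vcf_line(line):
    """Extract (chrom, 1-based pos, qual or None) from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, qual = fields[0], int(fields[1]), fields[5]
    return chrom, pos, (None if qual == "." else float(qual))


def filter_variants(vcf_lines, min_qual=30.0, masked=None):
    """Yield header lines unchanged, plus records that pass both checks.

    min_qual is an illustrative cutoff, not a recommended value.
    masked maps chromosome name -> list of 0-based, half-open (start, end)
    intervals; this layout is a hypothetical convention for the example.
    """
    masked = masked or {}
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos, qual = parse_vcf_line(line)
        if qual is None or qual < min_qual:
            continue  # missing or low QUAL: treat as low confidence
        # VCF POS is 1-based, so convert before testing the intervals.
        if any(start <= pos - 1 < end for start, end in masked.get(chrom, [])):
            continue  # variant lies inside a masked region
        yield line
```

Usage would look like `kept = list(filter_variants(open("calls.vcf"), min_qual=50))`, where `calls.vcf` is a placeholder file name. Comparing the kept and dropped calls against Sanger-confirmed sites is one way to choose the cutoff.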



    • #3
      Have a look at:


      Regards,

      S.
