  • Problem with repetitive assembly

    Hi all

    I'm currently working on a de novo assembly in a region of the Capsella rubella genome, which has so far been quite unsuccessful.

    I'm working with 150-bp paired-end Illumina reads, and the individual is of the same genotype as the C. rubella reference.

    The reads are mapped using bwa-mem. I have filtered my SAM files for mismatches, allowing up to 10% (15 per 150-bp read) because of the difficulty of mapping to repetitive regions. The alignments are then passed through picard SamToFastq, and subsequently I trim the reads.
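A minimal sketch of the kind of mismatch filter described above, not the poster's actual script. It assumes the filter is applied to plain-text SAM records and uses the `NM:i` tag (edit distance, which counts indels as well as mismatches) as a stand-in for the mismatch count. The function names and the `max_percent` parameter are hypothetical.

```python
def edit_distance(fields):
    """Return the NM:i value from the optional SAM fields, or None if absent."""
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def passes_filter(sam_line, max_percent=10):
    """Keep header lines and mapped reads whose NM is at most max_percent of read length."""
    if sam_line.startswith("@"):
        return True  # header lines are always kept
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:
        return False  # read is unmapped
    seq = fields[9]
    nm = edit_distance(fields)
    if nm is None or seq == "*":
        return False  # cannot evaluate the threshold without NM and the sequence
    # Integer comparison avoids floating-point rounding at the boundary.
    return nm * 100 <= max_percent * len(seq)
```

With `max_percent=10`, a 150-bp read is kept when its NM value is 15 or lower. Lowering `max_percent` gives the stricter filtering suggested later in the thread.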

    I have used SPAdes on these reads and am able to reconstruct 19,000 of the 30,000 bp in this region. Given that these reads come from the same genotype as the reference, I was expecting the mapping and assembly to be more straightforward.
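One way to arrive at a "bases reconstructed" figure like the one above is to merge the reference intervals that the contigs align to and sum their lengths. The post does not say how the number was obtained, so this is only a sketch under that assumption; `covered_bases` is a hypothetical helper and intervals are taken to be half-open `(start, end)` pairs in reference coordinates.

```python
def covered_bases(intervals, region_start, region_end):
    """Count bases of [region_start, region_end) covered by at least one interval."""
    # Clip every interval to the region and drop those that fall outside it.
    clipped = sorted(
        (max(start, region_start), min(end, region_end))
        for start, end in intervals
        if start < region_end and end > region_start
    )
    total = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # A gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total
```

Overlapping contig alignments are counted once, so the result is the covered length of the region and not the summed contig length.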

    Any suggestions would be most appreciated.

  • #2
    Please don't cross-post on here and biostars.


    • #3
      Thanks for the note. I've removed the biostars post.


      • #4
        Hi aupadhyaya,

        Assembling repetitive regions is always an awful task, especially with only paired-end reads, but we need more information to figure out your problem.

        First, what is the size of the repeat unit(s)? Are the repeats divergent or identical? Are they arranged in tandem? Is it a kind of transposon island?

        You also say that you are attempting a de novo assembly, but based on a reference of the same genotype... that's not very clear to me. Do you expect some variation compared to the reference sequence? What is the purpose of re-sequencing/re-assembling the reference genotype?

        By the way, it is generally better to reduce the number of allowed mismatches when mapping to repetitive regions, so that reads target the correct repeat copy more specifically (if the repeats are divergent, of course).

        And the proper assembly of repetitive regions other than microsatellites generally requires mate-pair (or PacBio) reads.

        seb.


        • #5
          There are a few types of repeats according to RepeatMasker, some of which are identical and arranged nearly in tandem. I'm not too sure what a transposon island refers to, so I can't say much about that.

          The reason I'm assembling the same genotype again is essentially as a sanity check for assembly of the region. I'm trying to assemble the region for another individual with very little success and wanted to see if the reference region could be done.

          What you say about mismatches makes sense, but for some reason the best result, i.e. the longest contigs, comes from allowing some mismatches. I'm not too sure what to make of that.

          If it helps, this is the region I'm looking at (available on jbrowse): Capsella rubella scaffold_2 7900000-7930000
          Last edited by aupadhyaya; 01-16-2015, 12:59 AM.


          • #6
            Hmm, are you sure it is the Capsella rubella scaffold_2 7900000-7930000 region? It appears that this region is not repeated at all, except for a 200-bp microsatellite at position 7910000, at least in the reference sequence available in GenBank (accession KB870806.1). There are indeed some loci that are repeated elsewhere in the genome of C. rubella, but with no more than 90% similarity, which shouldn't be a problem for the assembly.

            Longer contigs don't mean a better assembly! If you increase the number of mismatches allowed for the assembly, you should expect more assembly errors, especially at the repeated loci.


            • #7
              I'm sure this is the region. In terms of repetition, I'm a bit confused: there doesn't seem to be any C. rubella-specific annotation, but using A. thaliana repeats as a guide, around 13% of this region is annotated as repetitive (mostly as retroelements).

              You're of course right about length not equaling quality! I have checked these contigs for accuracy with a first-pass blast, and they do look like good matches.


              • #8
                Hi aupadhyaya,

                Indeed, these repetitive regions are probably retroelements. But if you blast the region against itself, there is no repetition.
                So I don't understand why you are not able to reconstruct this region. Did you sequence only this region (from a BAC) or the whole genome?
                My best guess is that these retroelements are located elsewhere in the genome with very high similarity, creating several assembly paths that assemblers cannot resolve with paired-end reads only. You should definitely produce 4-5 kbp mate-pair sequences.
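As a rough stand-in for blasting the region against itself, one can measure what fraction of its k-mers occur more than once within the region. This is a sketch, not what was run in this thread: it swaps in exact k-mer matching for BLAST, so it only detects identical repeats of at least `k` bases, and the function names and default `k=31` are assumptions.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def repeated_fraction(seq, k=31):
    """Fraction of k-mers in seq that occur more than once on either strand."""
    seq = seq.upper()
    kmers = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing N or other ambiguity codes.
        if all(base in "ACGT" for base in kmer):
            kmers.append(canonical(kmer))
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / len(kmers)
```

A value near zero for the 30-kbp region would agree with the self-blast result. To test the guess about similar copies elsewhere in the genome, the same counting would have to be done against k-mers from the whole reference, which this sketch does not do.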


                • #9
                  The sequencing is whole-genome. I'm going to see if I can do some mate-pair sequencing to get around this issue.
