  • aupadhyaya
    Junior Member
    • Jan 2015
    • 5

    Problem with repetitive assembly

    Hi all

    I'm currently working on a de novo assembly in a region of the Capsella rubella genome, which has so far been quite unsuccessful.

    I'm working with 150-bp paired-end Illumina reads, and the individual is of the same genotype as the C. rubella reference.

    The reads are mapped using bwa-mem. I have filtered my SAM files for mismatches, allowing up to 10% (15) due to the difficulty of mapping to repetitive regions. The alignments are then taken through picard SamToFastq and subsequently I trim the reads.
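
For readers who want to reproduce the mismatch filter described above, here is a minimal sketch in plain Python. It assumes the filter is applied on the `NM` tag that bwa-mem writes to each record; note that `NM` is the edit distance, so it counts indels as well as mismatches. The function names and the threshold default are illustrative only, and the poster's actual filtering command is not shown in the thread.

```python
def edit_distance(sam_line):
    """Return the NM (edit distance) tag of a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def filter_by_mismatches(sam_lines, max_nm=15):
    """Keep header lines and mapped records whose NM tag is <= max_nm.

    max_nm=15 mirrors the 10% of 150 bp mentioned in the post (assumption:
    the threshold is applied per read, on edit distance).
    """
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass through unchanged
            kept.append(line)
            continue
        flag = int(line.split("\t")[1])
        if flag & 4:  # 0x4 = read unmapped
            continue
        nm = edit_distance(line)
        if nm is not None and nm <= max_nm:
            kept.append(line)
    return kept
```

Lowering `max_nm` makes the selection stricter, which is the direction suggested later in the thread for divergent repeats.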

    I have used SPAdes on these reads, and am able to reconstruct 19000/30000 bp for this region. Given that these reads come from the same genotype as the reference, I was expecting the mapping and assembly to be more straightforward.

    Any suggestions would be most appreciated
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    Please don't cross-post on here and biostars.


    • aupadhyaya
      Junior Member
      • Jan 2015
      • 5

      #3
      Thanks for the note. I've removed the biostars post.


      • seb.lees
        Member
        • Sep 2012
        • 12

        #4
        Hi aupadhyaya,

        Assembling repetitive regions is always an awful task, especially with only paired-end reads, but we need more information to properly figure out your problem.

        First, what is the size of the repeat unit(s)? Are the repeats divergent or identical? Are they arranged in tandem? Is it a kind of transposon island?

        You also say that you are attempting a de novo assembly, but based on a reference of the same genotype, which is not very clear to me. Do you expect some variation compared to the reference sequence? What is the purpose of re-sequencing and re-assembling the reference genotype?

        By the way, it is generally better to reduce the number of allowed mismatches when mapping to repetitive regions, so that reads target the correct repeat copy more specifically (provided the repeats are divergent, of course).

        Also, proper assembly of repetitive regions other than microsatellites generally requires mate-pair (or PacBio) reads.

        seb.


        • aupadhyaya
          Junior Member
          • Jan 2015
          • 5

          #5
          There are a few types of repeats according to repeatmasker, some of which are identical and arranged nearly in tandem. I'm not too sure what a transposon island refers to, so I can't say much about that.

          The reason I'm assembling the same genotype again is essentially as a sanity check for assembly of the region. I'm trying to assemble the region for another individual with very little success and wanted to see if the reference region could be done.

          What you say about mismatches makes sense, but for some reason the best result, i.e. the longest contigs, comes from allowing some mismatches. I'm not too sure what to make of that.

          If it helps, this is the region I'm looking at (available on jbrowse) Capsella rubella scaffold_2 7900000-7930000
          Last edited by aupadhyaya; 01-16-2015, 12:59 AM.


          • seb.lees
            Member
            • Sep 2012
            • 12

            #6
            Hmm, are you sure it is the Capsella rubella scaffold_2 7900000-7930000 region? It appears that this region is not repeated at all, except for a 200-bp microsatellite at position 7910000, at least in the reference sequence available in GenBank (accession KB870806.1). There are indeed some loci that are repeated elsewhere in the C. rubella genome, but with no more than 90% similarity, which shouldn't be a problem for the assembly.

            Longer contigs don't mean a better assembly! If you increase the number of allowed mismatches for the assembly, you should expect more assembly errors, especially at the repeated loci.


            • aupadhyaya
              Junior Member
              • Jan 2015
              • 5

              #7
              I'm sure this is the region. In terms of repetition, I'm a bit confused! There doesn't seem to be any C. rubella-specific annotation, but using A. thaliana repeats as a guide, around 13% of this region is annotated as repetitive (mostly as retroelements).
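
A figure such as "around 13% of the region is repetitive" can be checked from a list of annotated repeat intervals. The sketch below is one way to do that, assuming half-open `(start, end)` coordinates such as those from a BED-style export of RepeatMasker hits. The function names are hypothetical, and overlapping annotations are merged so that no base is counted twice.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block (if any) and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def repeat_fraction(intervals, region_start, region_end):
    """Fraction of [region_start, region_end) covered by the given intervals."""
    clipped = [
        (max(s, region_start), min(e, region_end))
        for s, e in intervals
        if s < region_end and e > region_start  # keep only intervals touching the region
    ]
    return merged_length(clipped) / (region_end - region_start)
```

For the region discussed here, the call would look like `repeat_fraction(hits, 7900000, 7930000)`, where `hits` is whatever list of repeat intervals the annotation provides.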

              You're of course right about length not equaling quality! I have checked these contigs for accuracy on a first-pass basis through blast and they do look like good matches.


              • seb.lees
                Member
                • Sep 2012
                • 12

                #8
                Hi aupadhyaya,

                Indeed, these repetitive regions are probably retroelements. But if you BLAST the region against itself, there is no repetition.
                So I don't understand why you are not able to reconstruct this region. Did you sequence only this region (from a BAC) or the whole genome?
                My best guess is that these retroelements are also present elsewhere in the genome with very high similarity, creating several assembly paths that assemblers cannot resolve with paired-end reads alone. You should definitely produce 4-5 kbp mate-pair sequences.
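
As a rough stand-in for comparing the region against itself, the sketch below lists exact k-mers that occur more than once within a sequence. This is much cruder than BLAST: it only finds exact, same-strand matches, and the function name and the default `k` are arbitrary choices for illustration. It can still give a quick indication of whether a region contains internal repeats.

```python
from collections import defaultdict


def repeated_kmers(seq, k=20):
    """Map each k-mer occurring more than once in seq to its 0-based start positions."""
    positions = defaultdict(list)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    # Keep only k-mers seen at two or more positions (forward strand only).
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}
```

An empty result on the extracted region would be consistent with the observation above that the region is not repeated within itself. Running the same idea against the rest of the genome is what would reveal highly similar copies elsewhere.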


                • aupadhyaya
                  Junior Member
                  • Jan 2015
                  • 5

                  #9
                  The sequencing is genomic. I'm going to see if I can do some mate pair sequencing to get around this issue.
