SEQanswers

01-15-2015, 05:53 AM   #1
aupadhyaya
Junior Member
 
Location: Germany

Join Date: Jan 2015
Posts: 5
Problem with repetitive assembly

Hi all

I'm currently working on a de novo assembly in a region of the Capsella rubella genome, which has so far been quite unsuccessful.

I'm working with 150-bp paired-end Illumina reads, and the individual is of the same genotype as the C. rubella reference.

The reads are mapped with bwa mem. I then filter the SAM files by mismatches, allowing up to 10% (15 per 150-bp read) because of the difficulty of mapping to repetitive regions. The filtered alignments are converted back to FASTQ with Picard SamToFastq, and I subsequently trim the reads.
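For readers following along, the mismatch filter described above might look roughly like the sketch below. It is not the poster's actual script: it assumes an uncompressed SAM file read line by line, and it uses the `NM` tag (edit distance, which counts indels as well as mismatches) as a stand-in for the mismatch count. The function names and the `max_nm` parameter are made up for illustration.

```python
def edit_distance(sam_line):
    """Return the NM tag value of a SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:          # optional fields start at column 12
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def filter_by_mismatches(sam_lines, max_nm=15):
    """Yield header lines plus mapped alignments whose NM value is <= max_nm.

    max_nm=15 mirrors the 10% threshold for 150-bp reads mentioned in the
    post; NM is used here as an approximation of the mismatch count.
    """
    for line in sam_lines:
        if line.startswith("@"):     # header lines pass through unchanged
            yield line
            continue
        flag = int(line.split("\t")[1])
        if flag & 4:                 # 0x4 = read unmapped
            continue
        nm = edit_distance(line)
        if nm is not None and nm <= max_nm:
            yield line
```

Note that filtering reads one at a time like this can leave a read without its mate, which matters if the output is later converted back to paired FASTQ.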

I have used SPAdes on these reads and am able to reconstruct only 19,000 of the 30,000 bp in this region. Given that these reads come from the same genotype as the reference, I was expecting the mapping and assembly to be more straightforward.

Any suggestions would be most appreciated
01-15-2015, 06:04 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Please don't cross-post on here and biostars.
01-15-2015, 06:10 AM   #3
aupadhyaya
Junior Member
 
Location: Germany

Join Date: Jan 2015
Posts: 5

Thanks for the note. I've removed the biostars post.
01-15-2015, 11:19 PM   #4
seb.lees
Member
 
Location: France, Poitiers

Join Date: Sep 2012
Posts: 12

Hi aupadhyaya,

Assembly of repetitive regions is always an awful task, especially with only paired-end reads, but we need more information to figure out your problem properly.

First, what is (are) the size(s) of the repeat unit(s)? Are the repeats divergent or identical? Are they arranged in tandem? Is it a kind of transposon island?

You also say that you are attempting a de novo assembly, but based on a reference of the same genotype... that's not very clear to me. Do you expect some variation compared to the reference sequence? What is the purpose of re-sequencing/re-assembling the reference genotype?

By the way, in general it is better to reduce the number of allowed mismatches when mapping to repetitive regions, to target the correct repeat more specifically (if the repeats are divergent, of course).

And the proper assembly of repetitive regions other than microsatellites generally requires mate-pair (or PacBio) reads.

seb.
01-15-2015, 11:48 PM   #5
aupadhyaya
Junior Member
 
Location: Germany

Join Date: Jan 2015
Posts: 5

There are a few types of repeats according to RepeatMasker, some of which are identical and arranged nearly in tandem. I'm not sure what a transposon island refers to, so I can't say much about that.

The reason I'm assembling the same genotype again is essentially as a sanity check for assembly of the region. I'm trying to assemble the region for another individual with very little success and wanted to see if the reference region could be done.

What you say about mismatches makes sense, but for some reason the best result, i.e. the longest contigs, comes from allowing some mismatches. I'm not sure what to make of that.

If it helps, this is the region I'm looking at (available on JBrowse): Capsella rubella scaffold_2 7900000-7930000

Last edited by aupadhyaya; 01-15-2015 at 11:59 PM.
01-16-2015, 12:58 AM   #6
seb.lees
Member
 
Location: France, Poitiers

Join Date: Sep 2012
Posts: 12

Mmmh, are you sure it is the Capsella rubella scaffold_2 7900000-7930000 region? It appears that this region is not repeated at all, except for a 200-bp microsatellite at position 7910000, at least in the reference sequence available in GenBank (accession KB870806.1). There are indeed some loci that are repeated elsewhere in the genome of C. rubella, but with no more than 90% similarity, which shouldn't be a problem for the assembly.

Longer contigs don't mean a better assembly! If you increase the number of allowed mismatches for the assembly, you should expect more assembly errors, especially at the repeated loci.
01-16-2015, 01:22 AM   #7
aupadhyaya
Junior Member
 
Location: Germany

Join Date: Jan 2015
Posts: 5

I'm sure this is the region. In terms of repetition, I'm a bit confused! There doesn't seem to be a C. rubella-specific annotation, but using A. thaliana repeats as a guide, around 13% of this region is annotated as repetitive (mostly as retroelements).

You're of course right about length not equaling quality! I have checked these contigs for accuracy on a first-pass basis with BLAST, and they do look like good matches.
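One way to turn such a first-pass BLAST check into a number is to measure how much of the reference region is covered by high-identity hits. The sketch below assumes the contigs were used as queries against the 30-kbp reference region and that BLAST was run with the standard 12-column tabular output (`-outfmt 6`); the function names and the `min_identity` threshold are illustrative choices, not anything reported in this thread.

```python
def subject_intervals(blast_lines, min_identity=99.0):
    """Yield (sstart, send) for tabular BLAST hits at or above min_identity.

    Column layout assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if float(cols[2]) >= min_identity:
            yield int(cols[8]), int(cols[9])


def covered_bases(intervals):
    """Count positions covered by 1-based inclusive intervals, merging overlaps."""
    total = 0
    current_end = 0
    # Minus-strand hits have sstart > send, so normalise before sorting.
    for start, end in sorted((min(a, b), max(a, b)) for a, b in intervals):
        if end <= current_end:       # fully inside what is already counted
            continue
        total += end - max(start, current_end + 1) + 1
        current_end = end
    return total
```

Dividing `covered_bases(...)` by the region length gives a coverage fraction comparable to the 19,000/30,000 figure from the first post, but restricted to hits that pass the identity threshold.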
01-19-2015, 01:03 AM   #8
seb.lees
Member
 
Location: France, Poitiers

Join Date: Sep 2012
Posts: 12

Hi aupadhyaya,

Indeed, these repetitive regions are probably retroelements. But if you BLAST the region against itself, there is no repetition.
So I don't understand why you are not able to reconstruct this region. Did you sequence only this region (from a BAC) or the whole genome?
My best guess is that these retroelements are located elsewhere in the genome with very high similarity, creating several assembly paths that assemblers cannot resolve with paired-end reads alone. You should definitely produce 4-5 kbp mate-pair sequences.
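As a lightweight stand-in for BLASTing the region against itself, one can look for k-mers that occur more than once in the sequence. This is only a sketch of that idea, not what was run in this thread: it finds exact repeats only (BLAST also tolerates divergence), it treats a k-mer and its reverse complement as the same, and the function names and the default `k=31` are arbitrary choices for illustration.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def repeated_kmers(seq, k=31):
    """Map each canonical k-mer seen more than once to its 0-based start positions.

    The canonical form is the lexicographically smaller of a k-mer and its
    reverse complement, so inverted repeats are detected as well.
    """
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:              # skip windows containing ambiguous bases
            continue
        canonical = min(kmer, reverse_complement(kmer))
        positions[canonical].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}
```

An empty result for the 30-kbp region would agree with the self-BLAST observation. Testing the guess about near-identical copies elsewhere would need the same k-mers counted across the rest of the genome, which this sketch does not do.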
01-20-2015, 03:23 AM   #9
aupadhyaya
Junior Member
 
Location: Germany

Join Date: Jan 2015
Posts: 5

The sequencing is genomic. I'm going to see if I can do some mate pair sequencing to get around this issue.