SEQanswers

Old 04-29-2011, 06:11 AM   #1
skingan
Member
 
Location: San Mateo, CA

Join Date: Feb 2010
Posts: 17
Targeted de novo assembly

Hello,

I have an interesting problem and am looking for some advice.

I have whole-genome resequencing paired-end (PE) Illumina data and I am interested in doing a de novo assembly of three particular genes. Thus far, we have done assemblies with bwa against the genome sequence of a closely related species (4-5% divergence). However, the genes I am interested in are newly inserted in our species, so they are absent from our current assembly. These genes are also rapidly evolving, and I expect a lot of structural rearrangements relative to the gene sequences I already have. My plan had been this:

1. Filter my reads for quality using the FASTX-Toolkit and build a BLAST database of the reads.
2. BLAST the reads against a reference sequence of my genes to identify the subset of reads that map to this region (and their mates).
3. Do a de novo assembly of those reads (we have used SOAPdenovo in our lab; other suggestions?).

However, simply building the BLAST database of the reads is taking more than 12 hours, and I imagine the BLAST search itself will be even slower. Is there a better way to pull down reads that map to my genes of interest? Should I just do a bwa alignment using my three genes as a reference instead of BLAST?

Thanks!
Sarah Kingan
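
[Editor's sketch] The bwa route asked about above could be followed by a small filtering step like the one below. This is only a minimal sketch under assumptions not stated in the thread: the reads were aligned to the three genes with an aligner that writes SAM, mates share a read name in the SAM file, and FASTQ headers may carry /1 or /2 suffixes. The function names mapped_pair_names and filter_fastq are made up for illustration.

```python
def mapped_pair_names(sam_lines):
    """Return names of read pairs in which at least one mate aligned."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if not flag & 0x4:                # 0x4 set means this read is unmapped
            names.add(qname)              # mates share QNAME, so both are kept
    return names


def filter_fastq(fastq_lines, keep):
    """Yield 4-line FASTQ records whose read name is in `keep`."""
    record = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:              # header, sequence, '+', qualities
            name = record[0][1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]          # assumed mate suffix convention
            if name in keep:
                yield "".join(record)
            record = []
```

Running filter_fastq once per mate file with the same name set would give a paired subset that can be handed to the assembler.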
Old 05-02-2011, 12:17 AM   #2
Thorondor
Member
 
Location: Heidelberg

Join Date: Feb 2011
Posts: 69

So you have reference sequences for the genes you are looking for; what % identity do you expect? BLASTing all your reads against your reference genes does not seem to be the smartest way. ;-) Using bwa or vmatch might be a lot faster, but of course your results depend on your sequence identity.
Old 05-02-2011, 07:35 AM   #3
skingan
Member
 
Location: San Mateo, CA

Join Date: Feb 2010
Posts: 17

Hi Thorondor,
The % identity should be very high, <3% divergence for the orthologous sequences. The problem is that there are many repeat elements in and around the genes, so the structure is not conserved. Right now I am pulling the reads that align to the flanking sequence in my bwa alignment and will do a de novo assembly of those reads using MIRA. MIRA claims to be good at assembling repetitive sequence and difficult-to-align regions.

I may then do another iteration. Using bwa, I will map the previously unmapped reads to the contig(s) I built with MIRA, then pull the singletons whose mates mapped and do another de novo assembly with MIRA.

Sarah
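
[Editor's sketch] The "pull the singletons whose mates mapped" step of the iteration could look roughly like this. It is a sketch only, assuming the re-alignment to the contigs is available as SAM text with standard flag bits; the function name half_mapped_pairs is hypothetical, and "singleton" is interpreted here as a pair in which exactly one mate aligned.

```python
def half_mapped_pairs(sam_lines):
    """Return names of pairs in which exactly one of the two mates aligned."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):              # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if not flag & 0x1:                    # 0x1: read is part of a pair
            continue
        if flag & (0x100 | 0x800):            # skip secondary/supplementary records
            continue
        read_unmapped = bool(flag & 0x4)      # 0x4: this read is unmapped
        mate_unmapped = bool(flag & 0x8)      # 0x8: its mate is unmapped
        if read_unmapped != mate_unmapped:    # exactly one end anchored
            names.add(fields[0])
    return names
```

The unmapped mates of these pairs are the reads most likely to reach beyond the current contig ends, so feeding both mates into the next assembly round is one way to extend the contigs.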
Old 05-02-2011, 08:10 AM   #4
MadsAlbertsen
Member
 
Location: Denmark

Join Date: Aug 2010
Posts: 26

Quote:
Originally Posted by skingan View Post
I may then do another iteration. Using bwa, I will map the previously unmapped reads to the contig(s) I built with MIRA, then pull the singletons whose mates mapped and do another de novo assembly with MIRA.
I've been doing this to manually close gaps that can't be assembled using various short-read assemblers, and it generally works great if you restrict the new de novo assembly to the regions where you have "problems".

E.g., use only the reads in the vicinity of where you expect your gene to be.

We normally use CLC as it is extremely fast and memory efficient (and expensive...). However, most assemblers should be able to handle the repeats if they only have to be resolved locally. In my experience the problem arises when you have the same repeat regions in multiple areas of the genome, and that is solved by doing the local assembly.

rgds
Mads
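
[Editor's sketch] Restricting the assembly to reads "in the vicinity" of the expected gene location could be approximated as below. This is a simplified sketch, not the procedure used in this post: it assumes a SAM alignment against the reference, uses only the leftmost alignment position (POS) rather than the full aligned span, and the function name reads_in_window and its parameters are hypothetical.

```python
def reads_in_window(sam_lines, ref_name, start, end):
    """Return names of reads whose alignment starts inside ref_name:start-end.

    Coordinates are 1-based and inclusive, as in the SAM POS field.
    """
    names = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:                    # unmapped reads carry no usable position
            continue
        if rname == ref_name and start <= pos <= end:
            names.add(fields[0])          # the mate shares this name
    return names
```

Because only read names are collected, mates that fall outside the window or failed to align can still be pulled from the FASTQ files by name, which keeps the pairing information for the local assembly.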

Last edited by MadsAlbertsen; 05-02-2011 at 08:11 AM. Reason: Clarifying..
Old 05-02-2011, 08:21 AM   #5
Thorondor
Member
 
Location: Heidelberg

Join Date: Feb 2011
Posts: 69

Well, if your genes of interest are not well covered, you might also take a look at LOCAS for your assembly:
http://ab.inf.uni-tuebingen.de/software/locas/
Old 06-20-2011, 07:35 PM   #6
shiva
Junior Member
 
Location: melbourne

Join Date: Aug 2010
Posts: 2

Quote:
Originally Posted by skingan View Post
Hi Thorondor,
The % identity should be very high, <3% divergence for the orthologous sequences. The problem is that there are many repeat elements in and around the genes, so the structure is not conserved. Right now I am pulling the reads that align to the flanking sequence in my bwa alignment and will do a de novo assembly of those reads using MIRA. MIRA claims to be good at assembling repetitive sequence and difficult-to-align regions.

I may then do another iteration. Using bwa, I will map the previously unmapped reads to the contig(s) I built with MIRA, then pull the singletons whose mates mapped and do another de novo assembly with MIRA.

Sarah
I'm performing a similar analysis. I did find that targeted de novo assembly deals with short tandem repeats very nicely. I'm now wondering if there is software that integrates the results from targeted de novo assembly with the reference genome, so that I can still use samtools, for example, for SNP calling. Thanks in advance for any information.

Sue
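
[Editor's sketch] No tool for this is named in the thread. One possible workaround, offered purely as an assumption, is to append the targeted contigs to the reference FASTA as extra sequences, realign the reads to that augmented reference, and then call SNPs as usual. The helper below only does the FASTA merging; the name append_contigs and the "targeted_" prefix are hypothetical.

```python
def append_contigs(reference_lines, contig_lines, prefix="targeted_"):
    """Return FASTA text: the reference followed by renamed contigs.

    Contig headers get `prefix` so they cannot collide with the names
    of existing reference sequences.
    """
    out = [line.rstrip("\n") for line in reference_lines if line.strip()]
    for line in contig_lines:
        line = line.rstrip("\n")
        if not line:                      # drop blank lines
            continue
        if line.startswith(">"):          # FASTA header: add the prefix
            line = ">" + prefix + line[1:]
        out.append(line)
    return "\n".join(out) + "\n"
```

Note that reads may then split between the original locus and the new contig wherever the two overlap, so mapping qualities and calls in those regions would need a careful look.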

Last edited by shiva; 06-20-2011 at 08:37 PM.
Old 06-20-2011, 11:10 PM   #7
pierre350d
Junior Member
 
Location: rennes, france

Join Date: Nov 2008
Posts: 7

Dear all,

I'd like to mention a tool called Mapsembler. It takes some sequence fragments and a set of (Illumina) reads. It tries to reconstruct each sequence fragment using the reads (allowing some substitutions), and each fragment it reconstructs is then extended left and right by targeted assembly.

The output may be either a FASTA file (contigs containing the sequence) or a graph that shows indels, SNPs, or more complex events like gene fusion, exon skipping...

The tool and documentation are accessible here: http://alcovna.genouest.org/mapsembler/

Any comments / feedback welcome.

Pierre
Old 07-12-2011, 07:39 AM   #8
salmonella
Junior Member
 
Location: texas

Join Date: Feb 2011
Posts: 5
Mapsembler usage

Pierre,
Mapsembler sounds like it may work for one of my projects.
Have you used it before?
Can I import the output into a viewer so that I can see how it attempted to assemble the sequences around the 'starter'?

Thanks
Old 09-12-2011, 06:58 AM   #9
pierre350d
Junior Member
 
Location: rennes, france

Join Date: Nov 2008
Posts: 7

Dear salmonella,

Sorry for the late answer...

I'm one of the authors of Mapsembler.
The output of Mapsembler, when run with the graph output option, can be viewed in any viewer able to handle the XGMML format (I'm using Cytoscape) or the .graphml format (I'm using Gephi).

Pierre
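
[Editor's sketch] Before loading a .graphml file into a viewer, a quick sanity check of its size can be done with the Python standard library. This sketch assumes the file uses the standard GraphML XML namespace; nothing here is specific to Mapsembler, and the function name graph_summary is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace (assumed to be the one used by the file).
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def graph_summary(graphml_text):
    """Return (number of nodes, number of edges) found in GraphML text."""
    root = ET.fromstring(graphml_text)
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    return len(nodes), len(edges)
```

If both counts come back as zero for a non-empty file, the namespace assumption is probably wrong for that file.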

Tags
assembly, blast, bwa, denovo, illumina

All times are GMT -8.