SEQanswers





Old 04-28-2011, 01:52 AM   #1
PHSchi
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 12
disentangling target genome and endosymbiont at read level

Hi!

Recently I got my hands on data where a single GAIIx lane was sequenced from genomic DNA: >6 GB of output, all high quality. However, the GC plot of the reads shows two distinct peaks (the larger at 37% -> target genome, the smaller at around 50%). Seeing this and knowing the source of the DNA, the second peak seems to come from an endosymbiont (or bacterial contamination). When I assemble with Velvet (already tested cc=50 and large k-mers) or Ray, I get a genome of around 2 Mb (far too small) with a bad CEGMA result and none of the sequences that should be in there, although BLAST does give hits for the right organism. The question is: how to separate the endosymbiont from the target, possibly at read level?

Any help highly appreciated.
Old 04-28-2011, 04:44 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

I'd try splitting the reads by GC, and assembling the high GC and low GC pools separately.
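
A minimal sketch of that idea in plain Python, for illustration only. The function names and the 0.44 cutoff are assumptions (the cutoff is simply a value between the 37% and ~50% peaks mentioned above); choose the real cutoff from your own GC histogram.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def split_by_gc(records, cutoff=0.44):
    """Split (name, sequence) records into a low-GC and a high-GC pool.

    cutoff is a hypothetical value lying between the two GC peaks.
    """
    low, high = [], []
    for name, seq in records:
        (high if gc_fraction(seq) >= cutoff else low).append((name, seq))
    return low, high


reads = [
    ("r1", "ATATTTAGCA"),  # 2 of 10 bases are G/C -> low pool
    ("r2", "GCGCATGCCG"),  # 8 of 10 bases are G/C -> high pool
    ("r3", "ACGTNNACGT"),  # N is ignored: 4 of 8 -> high pool
]
low, high = split_by_gc(reads)
```

Each pool would then be written out and assembled on its own. Reads near the cutoff can land in the wrong pool, since both genomes contribute reads across a range of GC values.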
Old 04-28-2011, 04:49 AM   #3
PHSchi
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 12
my first thought, but

Hi!

That was my first thought as well, but then I lose all the high-GC reads from the target genome, i.e. they will end up in the other pool. The GC content is a distribution, so the target genome has to have regions with ~50% GC as well, whereas the possible endosymbiont has to have reads of lower GC too. I could take all the reads above a certain GC level and hope to assemble a bacterium out of those, then remove all the reads that went into the bacterial genome from the complete read set and hope to end up with a more or less pure target genome. But is this sensible and feasible?
Old 12-06-2011, 02:38 PM   #4
Rockx
Junior Member
 
Location: Sydney

Join Date: Dec 2011
Posts: 7

I am also having the same problem.

What programs should I be using to separate reads based on GC content? A search of these forums only revealed replies such as "there are many programs that do this" but with no examples.

Any help would be appreciated, cheers!
Old 12-12-2011, 10:35 AM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

What is your favourite scripting language? BioPerl, Biopython, etc. would all make it easy to write a quick script to filter FASTQ on GC content. Also, do you have paired-end data? If so, presumably you would want to filter at the pair level. That makes things a little more complicated...
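
As a rough illustration of pair-level filtering, here is a dependency-free Python sketch (Biopython's FASTQ parser would do the reading just as well). It assumes two mate files with records in the same order and simple 4-line FASTQ records; the function names and the 0.44 cutoff are hypothetical. The pool is decided from the GC of both mates combined, so a pair is never split across pools.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def gc_fraction(seq):
    """Fraction of G/C among A, C, G, T bases; 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def split_pairs_by_gc(fq1, fq2, cutoff=0.44):
    """Assign each read pair to 'low' or 'high' using the GC of both mates combined."""
    pools = {"low": [], "high": []}
    for rec1, rec2 in zip_longest(read_fastq(fq1), read_fastq(fq2)):
        if rec1 is None or rec2 is None:
            raise ValueError("mate files have different numbers of records")
        key = "high" if gc_fraction(rec1[1] + rec2[1]) >= cutoff else "low"
        pools[key].append((rec1, rec2))
    return pools


# Tiny in-memory example standing in for two mate files.
mate1 = io.StringIO("@p1/1\nATATATAT\n+\nIIIIIIII\n@p2/1\nGCGCGCGC\n+\nIIIIIIII\n")
mate2 = io.StringIO("@p1/2\nGCGCATAT\n+\nIIIIIIII\n@p2/2\nATGCGCGC\n+\nIIIIIIII\n")
pools = split_pairs_by_gc(mate1, mate2)
```

With real data you would pass open file handles instead of `io.StringIO` objects and write each pool to its own pair of FASTQ files rather than keeping everything in memory.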
Old 12-12-2011, 10:38 AM   #6
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

The other way to split is to align all the reads to your target organism stringently, and then run Velvet on only the unmapped reads. Then, either figure out what your mystery contaminant is from the Velvet contigs, or include the Velvet contigs in your reference alongside your desired organism, so that the reads will align to those instead of being forced somewhere in your target organism's genome.
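
A small sketch of the "keep only the unmapped reads" step, assuming the aligner wrote standard SAM text. It relies only on the SAM format itself (FLAG bit 0x4 marks an unmapped read); the function name is made up for this example. In practice `samtools view -f 4` performs the same selection.

```python
def unmapped_reads_as_fastq(sam_lines):
    """Yield FASTQ records for reads flagged unmapped (FLAG bit 0x4) in SAM text."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) lines
        if flag & 0x4:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


sam = [
    "@SQ\tSN:target_contig1\tLN:1000\n",
    "read1\t0\ttarget_contig1\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCCGGCC\tIIIIIIII\n",
]
fastq = list(unmapped_reads_as_fastq(sam))
```

This treats every read independently. With paired-end data one mate can map while the other does not, so you would need to decide whether to keep such pairs together before handing the output to the assembler.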
Old 12-12-2011, 01:55 PM   #7
Rockx
Junior Member
 
Location: Sydney

Join Date: Dec 2011
Posts: 7

Thanks maubp and swbarnes2. Indeed, I did end up editing the DynamicTrim Perl script to include an option for GC trimming. This deals with paired-end data fine.

swbarnes2, thanks for this tip. However, I'm unable to do this as I am assembling de novo. Makes things a bit tougher.
Old 12-12-2011, 02:11 PM   #8
koadman
Member
 
Location: Sydney, Australia

Join Date: May 2010
Posts: 65

You could try running an assembly pipeline designed explicitly to deal with mixes of organisms. metAMOS seems to be one such option:

https://github.com/treangen/metAMOS/wiki

It uses metagenome taxonomy analysis to figure out which organism each scaffold group comes from and creates a separate assembly fasta file for each organism. Looks like it's under very active development at the moment.
Old 12-12-2011, 06:10 PM   #9
polyatail
Member
 
Location: New York, NY

Join Date: Dec 2010
Posts: 25

There are a number of tools out there that attempt to cluster or classify reads or contigs by sequence-intrinsic properties (e.g. k-mers, protein domains). Check out TETRA, WebCarma, TACOA or PhyloPythia.
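
To give a feel for what a k-mer composition signal looks like, here is a toy Python sketch that builds tetranucleotide frequency profiles and compares them with a Pearson correlation. This is not how any of the tools named above work internally; the function names and the example sequences are invented, and reverse complements are not merged, which a real tool would normally handle.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetranucleotides over A, C, G, T, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(seq):
    """Relative frequency of each tetranucleotide; 4-mers containing N are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


# Invented toy "contigs": two AT-rich, one GC-rich.
at_rich_1 = tetra_profile("ATATTTAATTATAAATTTATATAATTAA")
at_rich_2 = tetra_profile("AATTATATTTAAATATTATAATTTAATA")
gc_rich = tetra_profile("GCGGCCGCGCCGGCGCGGCCGCGCCGCG")

same_like = pearson(at_rich_1, at_rich_2)  # similar composition
different = pearson(at_rich_1, gc_rich)    # dissimilar composition
```

Profiles like these are only meaningful on sequences much longer than a single short read, which is one reason composition-based binning is usually applied to contigs.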
Old 12-12-2011, 07:54 PM   #10
koadman
Member
 
Location: Sydney, Australia

Join Date: May 2010
Posts: 65

The authors of PhyloPythia have an interesting comparison of the nucleotide composition-based methods to a sequence identity/homology-based method (MEGAN) in the 50 pages of supplemental material for this 1.5 page paper:
http://www.nature.com/nmeth/journal/...h0311-191.html

I didn't notice whether they ran a nucleotide or amino acid BLAST search for MEGAN, but either way, it seems that using homology information gives pretty darn good results compared to the composition methods (among which PhyloPythiaS seems to be superior).
All times are GMT -8.