SEQanswers

Old 09-25-2009, 01:58 AM   #1
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Identifying contamination

I've got an interesting problem and wondered if anyone else had any thoughts about how I can approach this.

I've got some Illumina data from a run which should have contained human sequence but appears to have been contaminated with some other sequence of unknown origin. We're pretty sure the samples haven't been mixed up since some of the affected lanes were barcoded and the barcodes are present. The problem now is to try to identify the source of the contamination.

The sequences we produced are very diverse, with little or no duplication of reads, so this isn't just primers or plasmid DNA.

So far I've ruled out:

Human
Mouse
Rat
Any other vertebrate species the lab concerned work on
E. coli

...and now I'm stuck!

If you had 30 million+ reads of unknown origin (or origins), how would you try to find where they'd come from?
Old 09-25-2009, 05:06 AM   #2
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415

MEGAN?

http://www-ab.informatik.uni-tuebing...software/megan

(or any other metagenomics analysis pipeline suitable for Illumina reads)
Old 09-25-2009, 05:15 AM   #3
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Quote:
Originally Posted by flxlex View Post
MEGAN?
Very interesting! I'd not seen that before and I can think of uses for it! However, it does require a separate BLAST step before you can do any of that analysis, and I don't really have the resources to BLAST this number of sequences. I may try something along those lines with a smaller random selection of sequences though.

As an aside, I've been thinking about putting together a database of potentially contaminating sequences which you could map a next-gen dataset against as a QC measure. It would include all of the primers used for library prep, E. coli sequence, various families of repeats and other stuff we regularly see turning up in our libraries. Has anyone tried this before?
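
A minimal sketch of the kind of contaminant screen described above, assuming exact k-mer matching rather than a real short-read mapper. The function names, the choice of k and the toy sequence in the usage lines are all hypothetical, and only the forward strand is checked, so treat it as an illustration of the idea and not as a working QC tool.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(contaminants, k=20):
    """Map each k-mer to the names of the contaminant sequences containing it.

    `contaminants` is a dict of {name: sequence}; in practice it would hold
    primers, E. coli sequence, repeats and so on (not supplied here).
    """
    index = {}
    for name, seq in contaminants.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index


def screen_read(read, index, k=20):
    """Return the names of contaminants sharing at least one exact k-mer with the read."""
    hits = set()
    for km in kmers(read, k):
        hits |= index.get(km, set())
    return hits


if __name__ == "__main__":
    # Made-up sequence, used only to show the calls; not a real primer.
    toy_db = {"toy_primer": "ACGTACGTTAGCTAGCTAGGATCCGATCGA"}
    idx = build_index(toy_db, k=10)
    print(screen_read("TTTTTAGCTAGCTAGGATCCTTTT", idx, k=10))
```

Counting how many reads hit each named entry gives a quick per-lane contamination summary. A real screen would also need to handle the reverse strand and sequencing errors, which exact k-mer lookup does not.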
Old 09-25-2009, 05:29 AM   #4
severin
Genome Informatics Facility
 
Location: Iowa @isugif

Join Date: Sep 2009
Posts: 105
Align then blast

There is usually quite a bit of overlap between sequences. In your case you do not have a reference genome, and therefore no genes against which to align the reads. However, may I propose another way of using the coverage depth to recover a gene: align the sequence reads to each other, requiring some arbitrarily high identity, say 80%. For highly expressed genes there will be enough coverage depth to recover the exon, giving longer sequences to use in a BLAST search for your organism (provided it is a single organism).

Let us know how it goes. Good luck!

The alignment might look something like this
__________________
_____________________ ___________________
_____________________
_______________________
____________________ __________________

providing a sequence that is quite a bit longer.
_______________________________________________
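
The read-to-read merging pictured above can be sketched as a greedy overlap merge. This is a toy illustration under stated assumptions: it requires exact suffix/prefix overlaps (not the ~80% identity suggested in the post), it compares every pair of sequences and so only suits a few thousand reads, and the function names and `min_len` default are hypothetical. A real assembler would be used for a full lane of data.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_merge(reads, min_len=10):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs  # nothing left to merge
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


if __name__ == "__main__":
    # Three made-up overlapping reads covering one 30-base toy sequence.
    reads = ["ACGTTGCATGCCGAT", "TGCCGATAGCTTAGG", "AGCTTAGGCTAACGT"]
    print(greedy_merge(reads, min_len=5))
```

The longer merged sequences, not the raw reads, are what would then go into the BLAST search.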

Last edited by severin; 09-25-2009 at 05:39 AM.
Old 09-25-2009, 05:31 AM   #5
francesco.vezzi
Member
 
Location: Udine (Italy)

Join Date: Jan 2009
Posts: 50

Hi,
are you simply looking at the short sequences and trying to work out where they come from? I think a good approach would be to try a de novo assembly of the short reads and BLAST the output against NCBI or against your own database of suspected contamination sources.

The advantage of performing a de novo assembly is that you can BLAST longer sequences, and you will probably discard a lot of reads that contain only errors.
Old 09-25-2009, 05:39 AM   #6
severin
Genome Informatics Facility
 
Location: Iowa @isugif

Join Date: Sep 2009
Posts: 105

@francesco.vezzi

Exactly.
Old 09-25-2009, 05:48 AM   #7
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178

Quote:
Originally Posted by simonandrews View Post
Very interesting! I'd not seen that before and I can think of uses for it! However it does require a separate blast step before you can do any of that analysis and I don't really have the resources to blast this number of sequences.
You don't have to BLAST the entire set to get a good picture of the source of the contamination. Select a random set of ~300,000 (1% of your total). That should provide enough information.
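
One way to draw such a random subset without loading all 30 million reads into memory is reservoir sampling. The sketch below assumes plain four-line FASTQ records; the function names are hypothetical, and the sample is written out as FASTA because that is what a BLAST search would take as input.

```python
import random


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec  # replace with probability k / (i + 1)
    return sample


def write_fasta(records, handle):
    """Write sampled records as FASTA, dropping the leading '@' of the FASTQ header."""
    for header, seq, _qual in records:
        handle.write(">" + header[1:] + "\n" + seq + "\n")
```

Usage would be along the lines of `write_fasta(reservoir_sample(read_fastq(open("lane.fastq")), 300000), open("subset.fa", "w"))`, where both file names are placeholders.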
Old 09-25-2009, 12:39 PM   #8
rglover
rg
 
Location: uk

Join Date: Dec 2008
Posts: 51

Quote:
Originally Posted by francesco.vezzi View Post
Hi,
are you simply looking at the short sequences and trying to work out where they come from? I think a good approach would be to try a de novo assembly of the short reads and BLAST the output against NCBI or against your own database of suspected contamination sources.

The advantage of performing a de novo assembly is that you can BLAST longer sequences, and you will probably discard a lot of reads that contain only errors.
This is exactly the approach (CLCbio de novo assembly combined with blast and MEGAN) that we use to characterise metagenomics datasets. Contamination is easy to identify when you use MEGAN to visualise the blast results - despite us working on plant pathogens/environmental samples, we always seem to get some good quality human contaminating sequences...
Old 09-25-2009, 01:24 PM   #9
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

I ended up doing a de novo assembly with Velvet (which was much easier and quicker than I thought it would be). I got several contigs of over 1 kb in length. Blasting these gave a few high-identity (though not identical) hits to a bacterial genome, so I guess that something similar to that was the main contaminant. Interestingly, I've still got a couple of contigs of 5+ kb which don't appear anywhere in EMBL, so the mystery isn't completely solved - but things are a lot clearer than they were.
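
Picking out the longer contigs for the BLAST step can be sketched as below. This assumes the assembler's output is an ordinary multi-line FASTA file; the function names and the 1,000-base default cut-off are illustrative choices, not anything prescribed by Velvet.

```python
def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA stream, joining wrapped lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_contigs(records, min_len=1000):
    """Keep contigs of at least min_len bases, longest first."""
    kept = [(name, seq) for name, seq in records if len(seq) >= min_len]
    return sorted(kept, key=lambda rec: len(rec[1]), reverse=True)
```

Sorting longest first means the most informative contigs are searched first, which matters when BLAST capacity is limited.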

Thanks for the suggestions.
Old 09-25-2009, 01:26 PM   #10
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

Quote:
Originally Posted by kmcarr View Post
You don't have to BLAST the entire set to get a good picture of the source of the contamination. Select a random set of ~300,000 (1% of your total). That should provide enough information.
I think you overestimate the compute power I currently have available to me. 300,000 blasts is not something I generally do just before I go home on a Friday.
Old 08-04-2011, 04:58 AM   #11
Chuckytah
Member
 
Location: Barcelos, Braga, Portugal

Join Date: Mar 2011
Posts: 65

Has anyone used MEGAN before? Are there any tutorials?
Old 09-30-2011, 01:04 PM   #12
rmc7777
Junior Member
 
Location: Colorado

Join Date: Oct 2009
Posts: 2
Megan Blast

We BLAST 100K 50 bp reads against the NCBI nucleotide database (14M sequences) on a Cray XT6m supercomputer in 24-hour runs, followed by MEGAN analysis for metagenomic data. The 100K sample represents 2.5% of the total read population (4M reads). The taxonomic distribution of species shown in MEGAN generally agrees with the taxa expected by the PIs. We plan on making 168-hour (1 week) runs to sample 17.5% of the total read population, to see whether the taxon distribution changes substantially. I think it's an open question whether BLASTing a very small subset of reads yields an accurate estimate of the taxa represented in a metagenomic sample. Another question is whether you would miss reads from rare species in the sample.
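
The rare-species question can be made a little more concrete with a back-of-envelope calculation. The sketch below assumes reads are drawn independently and that every read from a taxon is correctly assigned by BLAST, neither of which is guaranteed in practice; the function names are hypothetical. Under those assumptions, the chance of seeing at least one read from a taxon that makes up a fraction f of the library in a sample of n reads is 1 - (1 - f)^n.

```python
import math


def p_detect(fraction, n_reads):
    """Probability that at least one of n_reads comes from a taxon
    making up `fraction` of the library (independent draws assumed)."""
    return 1.0 - (1.0 - fraction) ** n_reads


def reads_needed(fraction, confidence=0.95):
    """Smallest sample size giving at least `confidence` chance of
    seeing the taxon at least once, under the same assumption."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


if __name__ == "__main__":
    # A taxon at 1 read in 100,000, sampled with 100K reads.
    print(p_detect(1e-5, 100_000))
    print(reads_needed(1e-5, 0.95))
```

For a taxon present at 1 read in 100,000, a 100K sample has roughly a 63% chance of containing it at all, and about 300,000 reads are needed to reach 95%. Seeing a single read is also a much weaker result than estimating that taxon's abundance accurately.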

R
Tags
contamination, identify, search, sequence
