**flxlex** · 09-25-2009, 04:06 AM

MEGAN?

http://www-ab.informatik.uni-tuebingen.de/software/megan

(or any other metagenomics analysis pipeline suitable for Illumina reads)

**simonandrews** · 09-25-2009, 04:15 AM

Originally posted by flxlex

MEGAN?

Very interesting! I'd not seen that before and I can think of uses for it! However, it does require a separate BLAST step before you can do any of that analysis, and I don't really have the resources to BLAST this number of sequences. I may try something along those lines with a smaller random selection of sequences though.

As an aside, I've been thinking about putting together a database of potentially contaminating sequences which you could map a next-gen dataset against as a QC measure. It would include all of the primers used for library prep, E. coli sequence, various families of repeats and other stuff we regularly see turning up in our libraries. Has anyone tried this before?
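A contaminant screen of this kind can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: it looks for exact shared k-mers rather than running a real read mapper, and the function names, the value of `k` and the `toy_primer` sequence are all made up for the example, not taken from any existing tool or primer set.

```python
from collections import Counter


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def build_kmer_index(contaminants, k=20):
    """Map every k-mer (both strands) to the name of a contaminant containing it."""
    index = {}
    for name, seq in contaminants.items():
        for strand in (seq.upper(), reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                # Keep the first contaminant seen for a shared k-mer.
                index.setdefault(strand[i:i + k], name)
    return index


def screen_reads(reads, index, k=20):
    """Count reads sharing at least one exact k-mer with a contaminant."""
    hits = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            name = index.get(read[i:i + k])
            if name is not None:
                hits[name] += 1
                break  # count each read only once
        else:
            hits["no_hit"] += 1
    return hits


# Hypothetical contaminant set; a real one would hold primers, E. coli, repeats, etc.
contaminants = {"toy_primer": "ACGTTGCAAGCTTGGCATCC"}
index = build_kmer_index(contaminants, k=8)
print(screen_reads(["TTTTTGCAAGCTTGGCAAAA", "GGGGGGGGGGGGGGGG"], index, k=8))
```

Exact k-mer matching will miss reads with sequencing errors inside every k-mer, so the fraction reported is a lower bound; a proper aligner would be the obvious replacement for anything beyond a quick look.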

**severin** · 09-25-2009, 04:29 AM

Align then blast

There is usually quite a bit of overlap between reads. In your case you do not have a reference genome, and therefore no genes to align the reads against. However, may I propose another way of using the coverage depth to recover a gene: align the reads to each other, requiring some arbitrarily high identity, say 80%. For genes that are highly expressed there will be enough coverage depth to recover the exon, giving longer sequences to BLAST when trying to identify your organism (provided it is a single organism).

Let us know how it goes. Good luck!

The alignment might look something like this:
__________________
_____________________ ___________________
_____________________
_______________________
____________________ __________________

providing a sequence that is quite a bit longer.
_______________________________________________
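The idea of stitching overlapping reads into one longer sequence can be sketched as below. Note the assumptions: this toy version merges on exact suffix-to-prefix overlaps rather than the ~80% identity alignment suggested in the post, it compares all pairs (so it only suits tiny inputs), and the function names and `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_merge(reads, min_overlap=10):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing left overlaps by at least min_overlap
        merged = seqs[i] + seqs[j][size:]
        seqs = [s for n, s in enumerate(seqs) if n not in (i, j)] + [merged]
    return seqs


# Three overlapping fragments of one made-up sequence.
source = "ATGGCGTACCTTGAGCATTCGGATCCAGTTAACGGCTA"
reads = [source[0:16], source[10:26], source[20:38]]
print(greedy_merge(reads, min_overlap=5))
```

A real assembler handles sequencing errors, reverse-complement reads and repeats, none of which this sketch attempts.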

**francesco.vezzi** · 09-25-2009, 04:31 AM

Hi,
are you simply looking at short sequences and trying to work out where they come from? I think a good approach would be to try a de novo assembly of the short reads and BLAST the output against NCBI or against your own database of suspected sources of contamination.

The advantage of performing a de novo assembly is that you can BLAST longer sequences, and you will probably discard a lot of reads that contain only errors.

**severin** · 09-25-2009, 04:39 AM

@francesco.vezzi

Exactly.

**kmcarr** · 09-25-2009, 04:48 AM

Originally posted by simonandrews

Very interesting! I'd not seen that before and I can think of uses for it! However, it does require a separate BLAST step before you can do any of that analysis, and I don't really have the resources to BLAST this number of sequences.

You don't have to BLAST the entire set to get a good picture of the source of the contamination. Select a random set of ~300,000 (1% of your total). That should provide enough information.
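Drawing such a random subset without loading every read into memory can be done with reservoir sampling. The sketch below is one possible way to do it, assuming a plain four-lines-per-record FASTQ file; the function names and the `reads.fastq` path in the usage note are placeholders, not part of any pipeline mentioned in this thread.

```python
import random


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def reservoir_sample(records, n, seed=None):
    """Return n records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for count, record in enumerate(records, start=1):
        if len(sample) < n:
            sample.append(record)
        else:
            # Replace a stored record with probability n / count.
            slot = rng.randrange(count)
            if slot < n:
                sample[slot] = record
    return sample
```

Usage would look something like `with open("reads.fastq") as fh: subset = reservoir_sample(read_fastq(fh), 300_000)`, after which the sequences can be written out as FASTA for BLAST.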

**rglover** · 09-25-2009, 11:39 AM

Originally posted by francesco.vezzi

Hi,
are you simply looking at short sequences and trying to work out where they come from? I think a good approach would be to try a de novo assembly of the short reads and BLAST the output against NCBI or against your own database of suspected sources of contamination.

The advantage of performing a de novo assembly is that you can BLAST longer sequences, and you will probably discard a lot of reads that contain only errors.

This is exactly the approach (CLCbio de novo assembly combined with blast and MEGAN) that we use to characterise metagenomics datasets. Contamination is easy to identify when you use MEGAN to visualise the blast results - despite us working on plant pathogens/environmental samples, we always seem to get some good quality human contaminating sequences...

**simonandrews** · 09-25-2009, 12:24 PM

I ended up doing a de novo assembly with Velvet (which was much easier and quicker than I thought it would be). I got several contigs of over 1kb in length. BLASTing these gave a few high-identity (though not identical) hits to a bacterial genome, so I guess that something similar to that was the main contaminant. Interestingly, I've still got a couple of contigs of 5+kb which don't appear anywhere in EMBL, so the mystery isn't completely solved, but things are a lot clearer than they were.

Thanks for the suggestions.

**simonandrews** · 09-25-2009, 12:26 PM

Originally posted by kmcarr

You don't have to BLAST the entire set to get a good picture of the source of the contamination. Select a random set of ~300,000 (1% of your total). That should provide enough information.

I think you overestimate the compute power I currently have available to me. 300,000 BLASTs is not something I generally do just before I go home on a Friday.

**Chuckytah** · 08-04-2011, 03:58 AM

Has anyone used MEGAN before? Are there any tutorials?

**rmc7777** · 09-30-2011, 12:04 PM

Megan Blast

We BLAST 100K 50bp reads against the NCBI nucleotide database (14M sequences) on a Cray XT6m supercomputer in 24-hour runs, followed by MEGAN analysis for metagenomic data. The 100K sample represents 2.5% of the total read population (4M reads). The taxonomic distribution shown in MEGAN generally agrees with the taxa expected by the PIs. We plan on making 168-hour runs (1 week) to sample 17.5% of the total read population, to see if the taxon distribution changes substantially. I think it's an open question whether BLASTing a very small subset of reads yields an accurate estimate of the taxa represented in a metagenomic sample. Another question is whether you would miss reads from rare species in the sample.

R
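The rare-species question can at least be framed with some back-of-the-envelope arithmetic. The sketch below assumes reads are drawn independently and that a taxon contributes a fixed fraction of all reads; it only asks whether at least one read from that taxon lands in the subsample, not whether a 50bp read then gets a confident BLAST hit. The function names and the example abundance of 1 read in 100,000 are illustrative assumptions, not figures from the post.

```python
import math


def sampled_fraction(n_sampled, n_total):
    """Fraction of the read population covered by the subsample."""
    return n_sampled / n_total


def detection_probability(abundance, n_sampled):
    """P(at least one sampled read comes from a taxon making up `abundance` of reads)."""
    return 1.0 - (1.0 - abundance) ** n_sampled


def reads_needed(abundance, confidence=0.95):
    """Smallest subsample size giving at least `confidence` chance of one such read."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - abundance))


print(sampled_fraction(100_000, 4_000_000))   # the 24-hour run
print(sampled_fraction(700_000, 4_000_000))   # the planned 168-hour run
print(detection_probability(1e-5, 100_000))   # hypothetical taxon at 1 read in 100,000
print(reads_needed(1e-5))
```

Under these assumptions a taxon at 1 read in 100,000 has roughly a 63% chance of appearing even once in a 100K subsample, which suggests that very rare members could easily be missed at that depth.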

Identifying contamination