SEQanswers

Old 10-26-2015, 01:44 AM   #1
Tuibian
Junior Member
 
Location: Denmark

Join Date: Sep 2014
Posts: 3
Illumina MiSeq, too many OTUs

Dear All,
I am pretty new to the Illumina MiSeq platform.
The results I got are demultiplexed paired-end reads. They have been quality-filtered at a quality score of 25.
What I have done is remove both the forward and reverse primers using QIIME, merge the reads, and then quality-filter them (-fastq_maxee 1.0) using usearch. However, when I pick OTUs, there are too many of them.
I am wondering if anybody who has done similar work in the past can give me some suggestions.

Best regards,
Zhigang

Last edited by Tuibian; 10-26-2015 at 01:45 AM. Reason: spelling mistake
Old 10-26-2015, 10:30 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

It's common to get too many clusters (note that clusters are NOT the same as OTUs) if the reads were merged incorrectly, the clustering was done incorrectly, or the data is low quality. It would be helpful to post things like the merge rate, a complete FastQC output, information on how sequencing was done, read length, target amplicon insert size, how the merging was performed, how the clustering was performed, etc. Also, how many clusters you got, how many you expect, and the size distribution.
Old 10-27-2015, 02:57 AM   #3
Tuibian

Hi Brian,
Thanks for your kind response.
When I got my data, I did the following steps:
1. Remove the primers. I got two FASTQ files for each sample (R1 and R2) and used cutadapt to remove both the forward and reverse primers. The amplicon length is around 298 bp. The read length of the Illumina MiSeq output is around 250 bp.
2. Merge the paired-end reads. I used usearch to merge the reads, and more than 90% of the reads were merged.
3. Quality filtering. I again used usearch for this step, with a maximum expected error of 1.0 and a minimum length of 200 bp. After this step, it seems most of the reads are still kept.
4. OTU picking. I did open-reference OTU picking in QIIME. The output is an incredibly large number of OTUs for each sample: basically, more than ten thousand OTUs per sample. That's ridiculous.
Can you give me some suggestions? I'd be really thankful for your help.

Best regards,
Zhigang
Old 10-27-2015, 01:57 PM   #4
Brian Bushnell

1) Should be fine.
2) Usearch is not the best tool for this. I highly recommend BBMerge, as Usearch (like all other read mergers I've tested) yields a very high false-positive merge rate - particularly with the default settings - which results in extra clusters.
3) After merging, you should do quality and length filtering. A maximum expected error rate of 1 is very low, but if you still retain most of your data, maybe it's OK. Too low a value will cause bias, though, so I advise caution. Read merging tools do not need the reads filtered to Q25 prior to merging. As for length filtering, you need to look at your length distribution and decide which bounds contain real amplicon pairs, versus off-target or chimeric pairs. For example, if 95% of your merged reads are in the range of 280-310bp, then a pair that merged to a length of 400bp is very suspicious and likely to be garbage.
4) There are various methods of clustering. I don't know what the best is, and I have not used Qiime. I'm not sure why 10000 would be considered a lot, though, unless you are dealing with a low-complexity community. For a soil community, having 10k species in a sample would not be surprising, if you have enough reads so that the low-abundance organisms are seen. What kind of metagenome is it?
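
For readers unsure what a "maximum expected error" of 1 means in point 3, here is a minimal Python sketch. It assumes Phred+33 quality encoding and the usual definition of expected errors as the sum of per-base error probabilities; the function names are made up for illustration and are not part of usearch or BBMap.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities for one FASTQ quality string.

    Assumes Phred+33 encoding: Q = ord(char) - 33, p_error = 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string)


def passes_maxee(quality_string, max_ee=1.0):
    """True if the read would survive a maximum-expected-error filter."""
    return expected_errors(quality_string) <= max_ee


# A 300 bp merged read at uniform Q20 ('5') has 300 * 0.01 = 3 expected errors,
# so it fails a cutoff of 1.0, while the same read at Q30 ('?') passes.
q20_read = "5" * 300
q30_read = "?" * 300
```

This shows why a cutoff of 1 is strict for reads of this length: every base would need to average better than roughly Q25 for a 300 bp read to pass.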

In summary, try merging like this:
bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq strict minoverlap0=20

Then generate a length histogram:
readlength.sh in=merged.fq

Then length-filter like this:
reformat.sh in=merged.fq out=filtered.fq minlen=X maxlen=Y

...where X and Y are numbers that you decide on based on the length histogram, where the goal is to eliminate chimeras.
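
One possible way to turn a list of merged read lengths into candidate values for X and Y is sketched below in Python. The quantile rule (keep the central 95% of reads) is only an assumption for illustration, loosely based on the 95% example above; the bounds should still be checked against the histogram by eye. The function name and the `keep_fraction` parameter are hypothetical.

```python
def central_length_bounds(lengths, keep_fraction=0.95):
    """Return (min_len, max_len) spanning the central keep_fraction of reads.

    Trims an equal share of reads from the short and long tails.
    This is a heuristic, not something BBMap does for you.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths)
    n = len(ordered)
    tail = (1.0 - keep_fraction) / 2.0
    trim = int(tail * n)  # number of reads dropped from each tail
    return ordered[trim], ordered[n - 1 - trim]


# Toy data: most reads near the expected amplicon size, a few outliers.
example_lengths = [150] * 2 + [290] * 10 + [298] * 80 + [305] * 6 + [400] * 2
min_len, max_len = central_length_bounds(example_lengths)
```

With the toy data, the 150 bp and 400 bp outliers fall outside the returned bounds, which is the behaviour wanted from minlen and maxlen.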

Then, optionally quality-filter (on the merged reads, not on the raw reads):
reformat.sh in=filtered.fq out=qfiltered.fq minavgquality=20

...where 20 is a number I picked arbitrarily, corresponding to a 1% expected error rate. But I recommend you tune that number so that you don't lose too many reads. What is too many? Hard to say... maybe 5%? Remember, the more you lose the more bias you'll get, but the easier clustering will be.
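
To tune that threshold, it helps to see how many reads each candidate value would discard. The Python sketch below assumes Phred+33 encoding and assumes the average quality is taken over error probabilities and then converted back to a Phred score; whether reformat.sh computes its average exactly this way is an assumption here, so treat the numbers as approximate. Function names are hypothetical.

```python
import math


def mean_quality(quality_string, phred_offset=33):
    """Average quality of a read, averaged in probability space.

    Q20 corresponds to a 1% error probability, Q30 to 0.1%, and so on.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def fraction_lost(quality_strings, min_avg_quality):
    """Fraction of reads whose average quality is below the threshold."""
    lost = sum(1 for q in quality_strings if mean_quality(q) < min_avg_quality)
    return lost / len(quality_strings)


# Toy data: 19 reads at Q30 ('?') and one read at Q10 ('+').
toy_reads = ["?" * 50] * 19 + ["+" * 50]
loss_at_q20 = fraction_lost(toy_reads, 20)
```

Running `fraction_lost` over a range of thresholds on the real merged reads gives a quick picture of where the loss starts to exceed a few percent.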

Finally, cluster the reads. You can do that with Qiime, or try a different tool, like Dedupe:
dedupe.sh in=x.fq am ac fo c pc rnc=f mcs=3 mo=270 s=1 pto cc qin=33 pattern=cluster_%.fq

Here, "mo" means "Minimum overlap" and should be set to the lower length limit of your input sequences, whatever that happens to be. You can alternately set "mop=100" (min overlap percent = 100) meaning reads will have to overlap along their entire length to cluster, which is probably a good idea for amplicon sequencing unless you have staggered inline adapters of different lengths. "mcs=3" sets a minimum cluster size of 3; clusters with fewer than 3 reads will be ignored. It's best to empirically determine this cutoff, but below a certain size you cannot tell whether a cluster comes from a low-abundance organism or from high-error-rate reads. "s=1" allows at most one mismatch between reads for clustering.
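
To make the effect of the mismatch limit and the minimum cluster size concrete, here is a toy Python sketch. It is a simple greedy, seed-based clustering of equal-length reads written for illustration only; it is not how Dedupe works internally, and the function and parameter names are hypothetical.

```python
def mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def greedy_clusters(reads, max_mismatches=1, min_cluster_size=3):
    """Assign each read to the first cluster whose seed is close enough.

    The first read of each cluster acts as its seed. Clusters smaller than
    min_cluster_size are dropped at the end.
    """
    clusters = []
    for read in reads:
        for cluster in clusters:
            seed = cluster[0]
            if len(seed) == len(read) and mismatches(seed, read) <= max_mismatches:
                cluster.append(read)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([read])
    return [c for c in clusters if len(c) >= min_cluster_size]


toy_reads = ["ACGTACGT"] * 3 + ["ACGTACGA"] + ["TTTTCCCC"] * 2 + ["GGGGAAAA"]
kept = greedy_clusters(toy_reads)
```

In the toy data, the read with a single mismatch joins the large cluster, while the clusters of size 2 and 1 are discarded by the size cutoff.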

This will give you one file per cluster, containing all the reads in the cluster. You can then generate a consensus of each cluster to get candidate OTUs. Again, I don't know if this will do better clustering than Qiime, since I have not tested Qiime. It's just an alternative if you think Qiime is giving incorrect results.
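
A consensus for one cluster can be sketched as a per-position majority vote. The Python example below assumes all reads in the cluster have the same length and are already aligned position by position, which is a simplification; real clusters with length differences would need an alignment step first. The function name is hypothetical.

```python
from collections import Counter


def consensus(reads):
    """Majority base at each position of equal-length, pre-aligned reads.

    Ties are broken in favour of the base seen first at that position.
    """
    if not reads:
        raise ValueError("empty cluster")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


cluster = ["ACGT", "ACGA", "ACGT"]
candidate_otu = consensus(cluster)
```

Each cluster file would be read into a list of sequences and passed through `consensus` to yield one candidate OTU sequence per cluster.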

These programs are all in the BBMap package.
Old 10-28-2015, 12:20 AM   #5
Tuibian

Hi Brian,

Thank you so much for your kind suggestions.

I work with rumen liquid samples from dairy cows, so it is a very complex microbial community.

With respect to the workflow you mentioned above, it looks very nice, but I think I will test QIIME first. I also got some suggestions from other people who use QIIME. Since I have done most of the work in QIIME and have a little experience now, I am going to continue with it.

Best regards,
Zhigang

Tags
bacterial dna, illumina miseq 250bp, otus, quality filtering
