
**Brian Bushnell** · 10-26-2015, 09:30 AM

It's common to get too many clusters (note that clusters are NOT the same as OTUs) if the reads were merged incorrectly, the clustering was done incorrectly, or the data is low quality. It would be helpful to post things like the merge rate, a complete FastQC output, information on how sequencing was done, read length, target amplicon insert size, how the merging was performed, how the clustering was performed, etc. Also, how many clusters you got, how many you expect, and the size distribution.

**Tuibian** · 10-27-2015, 01:57 AM

Hi Brian,
Thanks for your kind response.
When I got my data, I did the following steps:
1. Remove the primers. I got two fastq files for each sample (R1 and R2) and used cutadapt to remove both the forward and reverse primers. The amplicon length is around 298 bp. The read length of the Illumina MiSeq output is around 250 bp.
2. Merge the paired-end reads. I used Usearch to merge the reads, and more than 90% of the reads merged.
3. Quality filtering. I again used Usearch for this step, with a maximum expected error of 1.0 and a minimum length of 200 bp. After this step, it seems most of the reads are still kept.
4. OTU picking. I did open-reference OTU picking in QIIME. The output is an incredibly huge number of OTUs for each sample: basically, more than ten thousand OTUs in each sample. That's ridiculous.
Can you give me some suggestions? I'm really thankful for your help.

Best regards,
Zhigang

**Brian Bushnell** · 10-27-2015, 12:57 PM

1) Should be fine.
2) Usearch is not the best tool for this. I highly recommend BBMerge, as Usearch (like all other read mergers I've tested) yields a very high false-positive merge rate - particularly with the default settings - which results in extra clusters.
3) After merging, you should do quality and length filtering. A maximum of 1 expected error per read is very low, but if you still retain most of your data, maybe it's OK. Too low a value will cause bias, though, so I advise caution. Read merging tools do not need the reads filtered to Q25 prior to merging. As for length filtering, you need to look at your length distribution and decide which bounds contain real amplicon pairs, versus off-target or chimeric pairs. For example, if 95% of your merged reads are in the range of 280-310bp, then a pair that merged to a length of 400bp is very suspicious and likely to be garbage.
4) There are various methods of clustering. I don't know what the best is, and I have not used Qiime. I'm not sure why 10000 would be considered a lot, though, unless you are dealing with a low-complexity community. For a soil community, having 10k species in a sample would not be surprising, if you have enough reads so that the low-abundance organisms are seen. What kind of metagenome is it?

In summary, try merging like this:
bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq strict minoverlap0=20

Then generate a length histogram:
readlength.sh in=merged.fq

Then length-filter like this:
reformat.sh in=merged.fq out=filtered.fq minlen=X maxlen=Y

...where X and Y are numbers that you decide on based on the length histogram, where the goal is to eliminate chimeras.
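As a rough illustration of picking X and Y from the length distribution, here is a minimal Python sketch. It assumes you have already collected the merged read lengths into a list (for example, by parsing the histogram or the FASTQ yourself); the function name `length_bounds` and the 95% default are hypothetical choices for this example, not part of BBTools.

```python
import math

def length_bounds(lengths, keep_fraction=0.95):
    """Return (minlen, maxlen) of the narrowest length window that
    still contains at least keep_fraction of the reads."""
    if not lengths:
        raise ValueError("no read lengths given")
    ordered = sorted(lengths)
    n = len(ordered)
    # Number of reads the window must contain (small epsilon guards
    # against floating-point noise in keep_fraction * n).
    need = max(1, math.ceil(keep_fraction * n - 1e-9))
    best = (ordered[0], ordered[need - 1])
    # Slide a window of `need` consecutive sorted lengths and keep the narrowest.
    for i in range(n - need + 1):
        lo, hi = ordered[i], ordered[i + need - 1]
        if hi - lo < best[1] - best[0]:
            best = (lo, hi)
    return best

# Toy data: most reads near 295-300 bp, a few short and long outliers.
lengths = [295] * 50 + [298] * 40 + [300] * 5 + [400] * 3 + [150] * 2
minlen, maxlen = length_bounds(lengths)
print(f"minlen={minlen} maxlen={maxlen}")
```

The resulting pair is only a starting point; it is still worth looking at the histogram by eye before committing to the bounds.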

Then, optionally quality-filter (on the merged reads, not on the raw reads):
reformat.sh in=filtered.fq out=qfiltered.fq minavgquality=20

...where 20 is a number I picked arbitrarily, corresponding to a 1% expected error rate. But I recommend you tune that number so that you don't lose too many reads. What is too many? Hard to say... maybe 5%? Remember, the more you lose, the more bias you'll get, but the easier clustering will be.
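The relationship between a quality score and an error rate can be sketched in a few lines of Python. This assumes Phred+33 quality strings and uses a plain arithmetic mean of the per-base scores, which may not match how any particular tool averages quality; the function names are made up for this example.

```python
from statistics import mean

def phred_to_error_prob(q):
    """Phred score Q corresponds to error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def read_quals(qual_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) into scores."""
    return [ord(c) - offset for c in qual_string]

def expected_errors(qual_string, offset=33):
    """Sum of per-base error probabilities for one read."""
    return sum(phred_to_error_prob(q) for q in read_quals(qual_string, offset))

def fraction_lost(qual_strings, min_avg_q, offset=33):
    """Fraction of reads whose mean quality falls below min_avg_q."""
    lost = sum(1 for s in qual_strings if mean(read_quals(s, offset)) < min_avg_q)
    return lost / len(qual_strings)

# Q20 is a 1% per-base error probability.
print(phred_to_error_prob(20))
# Try several thresholds and see how much data each one would discard.
quals = ["5555", "++++", "IIII", "5555"]  # toy reads at Q20, Q10, Q40, Q20
for threshold in (10, 20, 30):
    print(threshold, fraction_lost(quals, threshold))
```

Scanning a few thresholds like this makes it easier to find one that stays under whatever loss you consider acceptable.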

Finally, cluster the reads. You can do that with Qiime, or try a different tool, like Dedupe:
dedupe.sh in=x.fq am ac fo c pc rnc=f mcs=3 mo=270 s=1 pto cc qin=33 pattern=cluster_%.fq

Here, "mo" means "Minimum overlap" and should be set to the lower length limit of your input sequences, whatever that happens to be. You can alternately set "mop=100" (min overlap percent = 100) meaning reads will have to overlap along their entire length to cluster, which is probably a good idea for amplicon sequencing unless you have staggered inline adapters of different lengths. "mcs=3" sets a minimum cluster size of 3; clusters with fewer than 3 reads will be ignored; it's best to empirically determine this cutoff, but clusters below a certain size are indistinguishable between low-abundance organisms and high-error-rate reads. "s=1" allows at most one mismatch between reads for clustering.
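To make the effect of a mismatch limit and a minimum cluster size concrete, here is a toy Python sketch. It is a generic greedy, abundance-sorted clustering of equal-length reads written for illustration only; it is not how Dedupe works internally, and `cluster_reads` and its parameters are hypothetical names.

```python
from collections import Counter

def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, max_mismatches=1, min_cluster_size=3):
    """Greedy clustering: the most abundant sequences become centroids,
    and each remaining sequence joins the first centroid of the same
    length within max_mismatches. Small clusters are dropped."""
    clusters = []  # list of (centroid, members)
    for seq, count in Counter(reads).most_common():
        for centroid, members in clusters:
            if len(seq) == len(centroid) and mismatches(seq, centroid) <= max_mismatches:
                members.extend([seq] * count)
                break
        else:
            clusters.append((seq, [seq] * count))
    # Clusters below the cutoff cannot be told apart from error-driven noise.
    return [(c, m) for c, m in clusters if len(m) >= min_cluster_size]

reads = ["ACGTACGT"] * 5 + ["ACGTACGA"] + ["TTTTCCCC"] * 2 + ["GGGGAAAA"] * 3
for centroid, members in cluster_reads(reads):
    print(centroid, len(members))
```

In this toy input the single-mismatch read is absorbed into the abundant cluster, while the two-read cluster is discarded by the size cutoff.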

This will give you one file per cluster, containing all the reads in the cluster. You can then generate a consensus of each cluster to get candidate OTUs. Again, I don't know if this will do better clustering than Qiime, since I have not tested Qiime; it's just an alternative if you think Qiime is giving incorrect results.
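One simple way to build a consensus from a cluster is a per-position majority vote. The Python sketch below assumes the reads in a cluster all have the same length and are already in register (no indels), which is a simplification; real clusters may need alignment first. The function name `consensus` is just for this example.

```python
from collections import Counter

def consensus(reads):
    """Majority-vote consensus of equal-length, pre-aligned reads.
    Ties go to the base seen first at that position."""
    if not reads:
        raise ValueError("empty cluster")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must all have the same length")
    # zip(*reads) walks the reads column by column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

cluster = ["ACGT", "ACGA", "ACGT"]
print(consensus(cluster))
```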

These programs are all in the BBMap package.

**Tuibian** · 10-27-2015, 11:20 PM

Hi Brian,

Thank you so much for your kind suggestions.

I work with rumen liquid samples from dairy cows, so there is a very complex microbial community.

The workflow you described above looks very nice, but I think I will test QIIME first. I have also gotten some suggestions from other people who use QIIME. Since I have done most of the work in QIIME and have a little experience with it now, I am going to continue.

Best regards,
Zhigang

Illumina MiSeq, too many OTUs
