  • Illumina MiSeq, too many OTUs

    Dear all,
    I am pretty new to the Illumina MiSeq platform.
    The results I received are demultiplexed paired-end reads that have already been quality-filtered at a quality score of 25.
    So far I have removed both the forward and reverse primers using QIIME, merged the reads, and then quality-filtered them (-fastq_maxee 1.0) using usearch. However, when I pick OTUs, I get too many of them.
    I am wondering whether anybody who has done similar work can give me some suggestions.

    Best regards,
    Zhigang
    Last edited by Tuibian; 10-26-2015, 12:45 AM. Reason: spelling mistake

  • #2
    It's common to get too many clusters (note that clusters are NOT the same as OTUs) if the reads were merged incorrectly, the clustering was done incorrectly, or the data is low quality. It would be helpful to post things like the merge rate, a complete FastQC output, information on how sequencing was done, read length, target amplicon insert size, how the merging was performed, how the clustering was performed, etc. Also, how many clusters you got, how many you expect, and the size distribution.

    • #3
      Hi Brian,
      Thanks for your kind response.
      When I got my data, I did the following steps:
      1. Remove the primers. I got two fastq files for each sample (R1 and R2) and used cutadapt to remove both the forward and reverse primers. The amplicon length is around 298 bp. The read length of the Illumina MiSeq output is around 250 bp.
      2. Merge the paired-end reads. I used usearch to merge the reads, and more than 90% of the reads merged.
      3. Quality filtering. I again used usearch for this step, with a maximum expected error of 1.0 (-fastq_maxee 1.0) and a minimum length of 200 bp. After this step, it seems most of the reads are still kept.
      4. OTU picking. I did open-reference OTU picking in QIIME. The output is an incredibly large number of OTUs for each sample: basically, more than ten thousand OTUs per sample. That's ridiculous.
      Can you give me some suggestions? I would be really thankful for your help.

      Best regards,
      Zhigang

      • #4
        1) Should be fine.
        2) Usearch is not the best tool for this. I highly recommend BBMerge, as Usearch (like all other read mergers I've tested) yields a very high false-positive merge rate - particularly with the default settings - which results in extra clusters.
        3) After merging, you should do quality and length filtering. A maximum expected error rate of 1 is very low, but if you still retain most of your data, maybe it's OK. Too low a value will cause bias, though, so I advise caution. Read merging tools do not need the reads filtered to Q25 prior to merging. As for length filtering, you need to look at your length distribution and decide which bounds contain real amplicon pairs, versus off-target or chimeric pairs. For example, if 95% of your merged reads are in the range of 280-310bp, then a pair that merged to a length of 400bp is very suspicious and likely to be garbage.
        4) There are various methods of clustering. I don't know what the best is, and I have not used Qiime. I'm not sure why 10000 would be considered a lot, though, unless you are dealing with a low-complexity community. For a soil community, having 10k species in a sample would not be surprising, if you have enough reads so that the low-abundance organisms are seen. What kind of metagenome is it?
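
For readers unfamiliar with the expected-error filter discussed in point 3, here is a minimal Python sketch of how an expected-error value can be computed from a read's quality string. It assumes Phred+33 encoded qualities; the function name `expected_errors` is made up for illustration and is not part of usearch or BBTools.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum the per-base error probabilities of one read.

    Assumes Phred-scaled qualities encoded as ASCII with the given offset
    (33 is an assumption here; check your own files).
    """
    total = 0.0
    for ch in quality_string:
        q = ord(ch) - phred_offset      # Phred score of this base
        total += 10 ** (-q / 10)        # probability that this base is wrong
    return total


# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
high_quality_read = "I" * 250
mediocre_read = "5" * 250
print(expected_errors(high_quality_read))
print(expected_errors(mediocre_read))
```

Under this definition, a 250 bp read at a uniform Q20 carries about 2.5 expected errors, which is why a threshold of 1 is strict for reads of this length.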

        In summary, try merging like this:
        bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq strict minoverlap0=20

        Then generate a length histogram:
        readlength.sh in=merged.fq

        Then length-filter like this:
        reformat.sh in=merged.fq out=filtered.fq minlen=X maxlen=Y

        ...where X and Y are numbers that you decide on based on the length histogram, where the goal is to eliminate chimeras.
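
One possible way to pick starting values for X and Y, sketched in Python: take the bounds that contain the central bulk of the merged read lengths. The 95% fraction and the helper name `central_length_bounds` are assumptions for illustration only; the histogram itself should still be inspected by eye.

```python
def central_length_bounds(lengths, keep_fraction=0.95):
    """Return (min_len, max_len) enclosing the central keep_fraction of reads.

    Trims an equal share of reads from the short and long tails.
    """
    if not lengths:
        raise ValueError("no read lengths supplied")
    ordered = sorted(lengths)
    n = len(ordered)
    drop = int(n * (1 - keep_fraction) / 2)   # reads trimmed from each tail
    return ordered[drop], ordered[n - 1 - drop]


# Hypothetical merged-read lengths: mostly ~295 bp plus a few outliers.
example_lengths = [295] * 96 + [150, 151, 400, 401]
min_len, max_len = central_length_bounds(example_lengths)
print(min_len, max_len)
```

The resulting pair could then be tried as minlen and maxlen in the reformat.sh command above, and widened if it cuts into the main peak.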

        Then, optionally quality-filter (on the merged reads, not on the raw reads):
        reformat.sh in=filtered.fq out=qfiltered.fq minavgquality=20

        ...where 20 is a number I picked arbitrarily, corresponding to a 1% expected error rate. But I recommend you tune that number so that you don't lose too many reads. What is too many? Hard to say... maybe 5%? Remember, the more you lose, the more bias you'll get, but the easier clustering will be.
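
A small Python sketch of the tuning idea above: given per-read average qualities (however they were obtained), report what fraction of reads a candidate threshold would discard. The function name and the example numbers are hypothetical.

```python
def fraction_discarded(read_avg_qualities, min_avg_quality):
    """Fraction of reads whose average quality is below the threshold."""
    if not read_avg_qualities:
        raise ValueError("no reads supplied")
    dropped = sum(1 for q in read_avg_qualities if q < min_avg_quality)
    return dropped / len(read_avg_qualities)


# Made-up average qualities for 100 merged reads.
avg_quals = [35] * 90 + [22] * 6 + [15] * 4
for threshold in (20, 25):
    print(threshold, fraction_discarded(avg_quals, threshold))
```

With these made-up numbers, a threshold of 20 stays under the rough 5% guideline while 25 does not.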

        Finally, cluster the reads. You can do that with Qiime, or try a different tool, like Dedupe:
        dedupe.sh in=x.fq am ac fo c pc rnc=f mcs=3 mo=270 s=1 pto cc qin=33 pattern=cluster_%.fq

        Here, "mo" means "Minimum overlap" and should be set to the lower length limit of your input sequences, whatever that happens to be. You can alternately set "mop=100" (min overlap percent = 100) meaning reads will have to overlap along their entire length to cluster, which is probably a good idea for amplicon sequencing unless you have staggered inline adapters of different lengths. "mcs=3" sets a minimum cluster size of 3; clusters with fewer than 3 reads will be ignored; it's best to empirically determine this cutoff, but clusters below a certain size are indistinguishable between low-abundance organisms and high-error-rate reads. "s=1" allows at most one mismatch between reads for clustering.
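
To make the "s=1" and "mcs=3" ideas concrete, here is an illustrative Python sketch. It is not how Dedupe is implemented; it only shows counting mismatches between two equal-length reads and dropping clusters below a minimum size. Both function names are hypothetical.

```python
def mismatches(read_a, read_b):
    """Count positions at which two equal-length reads differ."""
    if len(read_a) != len(read_b):
        raise ValueError("reads must have the same length")
    return sum(1 for a, b in zip(read_a, read_b) if a != b)


def keep_large_clusters(clusters, min_size=3):
    """Keep only clusters containing at least min_size reads."""
    return [cluster for cluster in clusters if len(cluster) >= min_size]


# Two reads differing at one position would be within a 1-mismatch limit.
print(mismatches("ACGTACGT", "ACGTACGA") <= 1)

# A singleton cluster is dropped at a minimum cluster size of 3.
clusters = [["r1", "r2", "r3"], ["r4"]]
print(keep_large_clusters(clusters))
```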

        This will give you one file per cluster, containing all the reads in the cluster. You can then generate a consensus of each cluster to get candidate OTUs. Again, I don't know if this will do better clustering than Qiime, since I have not tested Qiime, it's just an alternative if you think Qiime is giving incorrect results.
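
A minimal Python sketch of a per-cluster consensus, assuming all reads in the cluster have the same length and orientation (a simplification; real reads may need to be aligned first). The function name `majority_consensus` is hypothetical.

```python
from collections import Counter


def majority_consensus(reads):
    """Return the most common base at each position across the reads."""
    if not reads:
        raise ValueError("cluster is empty")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) yields one tuple of bases per alignment column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


cluster = ["ACGT", "ACGA", "ACGT"]
print(majority_consensus(cluster))
```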

        These programs are all in the BBMap package.

        • #5
          Hi Brian,

          Thank you so much for your kind suggestions.

          I work with rumen liquid samples from dairy cows, so it is a very complex microbial community.

          The workflow you describe above looks very nice, but I think I will try QIIME first. I have also received some suggestions from other people who use QIIME. Since I have done most of my work in QIIME and have a little experience with it now, I am going to continue with it.

          Best regards,
          Zhigang
