Hello,
I have several metagenomic datasets (from soil, animal faeces and river water samples), all generated on Illumina HiSeq, MiSeq and NovaSeq instruments. In some of them, FastQC reports substantial sequence duplication (for example, peaks at duplication levels from 9 up to >500, with only 22% of sequences remaining if de-duplicated). fastp detects a lower but still notable level of duplication (e.g. 9%). I am unsure whether I should de-duplicate these files before analysis. I contacted the administrators at the European Nucleotide Archive (ENA/MGnify) and they were not concerned about the duplication. Still, is it not unlikely that diverse microbial communities would generate reads that are 100% identical? Is de-duplication a standard step before metagenomic sequence analyses? Or would it be better to keep the duplicates for the initial steps and remove them only for certain downstream analyses?
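For context, by de-duplication I mean removing reads whose sequences are 100% identical, which is what FastQC and fastp report. A minimal Python sketch of that idea, using made-up reads (in practice I would of course use a dedicated tool such as fastp's deduplication option rather than a script like this):

```python
def dedup_exact(records):
    """Keep only the first occurrence of each identical read sequence."""
    seen = set()
    kept = []
    for name, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq, qual))
    return kept

# Hypothetical FASTQ-like records: (name, sequence, quality)
reads = [
    ("@r1", "ACGTACGT", "IIIIIIII"),
    ("@r2", "ACGTACGT", "IIIIIIII"),  # exact duplicate of r1
    ("@r3", "TTGGCCAA", "IIIIIIII"),
]
unique = dedup_exact(reads)
print(len(unique))  # 2 of the 3 reads are kept
```

My worry is exactly this: the above discards r2 even though, in a diverse community, two independent template molecules could in principle yield the same read.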
Thank you,
Alex