Hi there,
I recently received RNA-Seq data from a number of gut meta-transcriptomics samples, with 100 million 2x100 bp Illumina HiSeq reads per sample. As the biopsy samples are mainly human tissue, we used Microbe Enrich to deplete human RNA, in addition to Microbe Express to deplete rRNA. Due to the small cDNA volumes, we had to amplify the cDNA using GenomiPhi before sending it off to the Illumina sequencing provider. Having now checked the sequences, I noticed that there are surprisingly low proportions of rRNA (5% following ribopicker-standalone-0.4.3/ribopicker.pl -c 50 -i 70 -l 30 -f sample1.fastq -dbs rrnadb) and manageable proportions of human mRNA (28% following bowtie2 -t --qc-filter --very-fast-local -x Homo_sapiens.GRCh37.67.cDNA_ncRNA -1 sample1.1.fq -2 sample1.2.fq).
However, running FastQC for quality control reveals very high sequence duplication levels: the majority of samples fail the Sequence Duplication Levels module with >60% Total Duplicate Percentage. There are two typical duplication scenarios. First,
>>Sequence Duplication Levels fail
#Total Duplicate Percentage 82.04422080939267
#Duplication Level Relative count
1 100.0
2 16.117142135824288
3 8.134309517798537
4 5.544054531683918
5 4.385256248422116
6 3.500378692249432
7 2.910881090633678
8 2.5334511486998235
9 2.3024488765463267
10++ 107.03357737944964
>>END_MODULE
>>Overrepresented sequences warn
#Sequence Count Percentage Possible Source
CCATAATAACTTTTGCTGAAGATGATGAGCTAGCTAAAAAGGCTATCGAG 269367 0.24784156137009358 No Hit
ATGGATAGAGAAACCGGCCGTTCAAGAGGCTTCGGTTTCGTTGAGCTGAG 192996 0.17757345917719164 No Hit
ATGGTGAGCATCTGCGTTAGAAACACAGGATAAAAAGATTTTTTTTTTTT 174958 0.16097689729695483 No Hit
TGTGTGTCTAATTTTTAAGATTAATTAATTAATTGTTATTTGCAATTCTT 164622 0.15146685939950902 No Hit
TGTGGAGGCTACTGAAAATTTCAGTGGGACGAAACTATTTTTGTGCTGAC 163800 0.15071054640108597 No Hit
These sequences mainly hit Parabacteroides distasonis ATCC 8503, Bacteroides vulgatus ATCC 8482 and Homo sapiens BAC clone RP11-327N17. The two bacterial strains are very likely inhabitants of this environment.
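To compare duplication profiles like the one above across many samples, the module can be pulled out of each FastQC text report. The following is only a minimal sketch: it assumes the module layout exactly as pasted in this post (a "#Total Duplicate Percentage" line followed by two-column level rows), which may differ in other FastQC versions. The function name and the EXAMPLE string are made up for illustration.

```python
# Sketch: extract the "Sequence Duplication Levels" module from FastQC text output.
# Assumes the layout shown in this post; other FastQC versions may differ.

EXAMPLE = """>>Sequence Duplication Levels fail
#Total Duplicate Percentage 82.04422080939267
#Duplication Level Relative count
1 100.0
2 16.117142135824288
10++ 107.03357737944964
>>END_MODULE
>>Overrepresented sequences warn
"""


def parse_duplication_module(report_text):
    """Return (total_duplicate_percentage, {level: relative_count})."""
    total = None
    levels = {}
    in_module = False
    for line in report_text.splitlines():
        line = line.strip()
        if line.startswith(">>Sequence Duplication Levels"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            if in_module:
                break  # stop at the end of the module we care about
        elif in_module and line.startswith("#Total Duplicate Percentage"):
            total = float(line.split()[-1])
        elif in_module and line and not line.startswith("#"):
            # Data rows: duplication level, then relative count
            level, value = line.split()[:2]
            levels[level] = float(value)
    return total, levels


if __name__ == "__main__":
    total, levels = parse_duplication_module(EXAMPLE)
    print(total, levels)
```

Keeping the level labels as strings avoids having to special-case the "10++" bin.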
The second duplication type is
>>Sequence Duplication Levels fail
#Total Duplicate Percentage 83.36543150690594
#Duplication Level Relative count
1 100.0
2 31.920496894409936
3 25.20248447204969
4 22.56645962732919
5 21.696894409937887
6 19.985093167701862
7 17.77639751552795
8 15.91055900621118
9 14.596273291925465
10++ 227.23975155279504
>>END_MODULE
>>Overrepresented sequences pass
>>END_MODULE
>>Kmer Content warn
#Sequence Count Obs/Exp Overall Obs/Exp Max Max Obs/Exp Position
TTTTT 30440525 3.3208988 4.2889795 95-97
AAAAA 31526075 3.3173122 3.9691818 95-97
Note that there are no overrepresented sequences here, just a generally high duplication rate at lower flow cycles, which could possibly be due to the high k-mer content.
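Since the enriched k-mers are AAAAA and TTTTT near positions 95-97, one quick check is how many reads end in a long A or T homopolymer run. The sketch below is an illustration only: the function names and the min_len=10 threshold are arbitrary assumptions, and the input is taken to be an iterable of plain read sequences (FASTQ parsing is left out).

```python
# Sketch: estimate the fraction of reads ending in a poly-A or poly-T run.
# Function names and the default threshold are hypothetical choices.

def trailing_run(seq, base):
    """Length of the run of `base` at the 3' end of `seq`."""
    n = 0
    for ch in reversed(seq.upper()):
        if ch != base:
            break
        n += 1
    return n


def fraction_with_homopolymer_tail(seqs, bases="AT", min_len=10):
    """Fraction of sequences whose 3' end is a run of >= min_len of any base in `bases`."""
    total = 0
    hits = 0
    for seq in seqs:
        total += 1
        if any(trailing_run(seq, b) >= min_len for b in bases):
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    reads = ["ACGTACGT" + "T" * 12, "ACGTACGTACGT", "GGCC" + "A" * 10]
    print(fraction_with_homopolymer_tail(reads))
```

A high fraction would point towards poly-A/poly-T tails rather than duplicated fragments as the source of the k-mer warning, but it would not by itself explain the duplication levels.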
Could these be due to some sequencing or library prep artefact, for instance the use of GenomiPhi cDNA amplification? Could the Parabacteroides strain simply be very abundant and highly expressed in these samples, or are the duplicated sequences TOO freakishly identical to be biologically possible? Could it be contamination?
What would you recommend I do before de novo assembling the meta-transcriptomes and carrying out DESeq differential gene expression analysis on these samples?
a) leave as is and carry on as if nothing were wrong
b) remove exact duplicates, e.g. with fastx_collapser from the FASTX-Toolkit, and carry on from there. If so, is it safe to use the duplication numbers in the subsequent analysis?
c) something else?
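To make option b) concrete, here is a minimal sketch of collapsing exact duplicates while keeping the counts. It is not fastx_collapser itself: it is an illustrative stand-in that treats a read pair as a duplicate only when both mates are identical, which is an assumption about how paired-end data might be handled. The function names are hypothetical, and holding everything in memory would only be practical for a subsample, not for 100 million pairs.

```python
# Sketch: collapse exact duplicate read pairs and keep their counts.
# Not fastx_collapser; an in-memory illustration suited to subsamples only.
from collections import Counter


def collapse_pairs(pairs):
    """Count identical (mate1, mate2) sequence pairs."""
    counts = Counter()
    for r1, r2 in pairs:
        counts[(r1.upper(), r2.upper())] += 1
    return counts


def duplicate_fraction(counts):
    """Fraction of pairs that would be removed by keeping one copy of each."""
    total = sum(counts.values())
    return 1 - len(counts) / total if total else 0.0


if __name__ == "__main__":
    pairs = [("ACGT", "TTGA"), ("ACGT", "TTGA"), ("ACGT", "CCCC")]
    counts = collapse_pairs(pairs)
    for (r1, r2), n in counts.most_common():
        print(n, r1, r2)
    print(duplicate_fraction(counts))
```

Keeping the per-pair counts alongside the collapsed reads leaves the choice open between analysing deduplicated data and the original abundances.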
Many thanks for any advice on this!
Regards,
Marcus