Dear SEQanswers,
My name is Alex SMITH and I've recently started a RNA-seq bioinformatics post-doc at the Malaghan Institute of medical Research. In order to practice with the tools I'll be using, I decided to try and map the RNA-seq reads from the Mouse Dendritic Cell dataset (GSE29209 ; GSM722533) that was used as an example in the original Trinity paper "Full-length transcriptome assembly from RNA-Seq data without a reference genome" (2011) to the latest mouse genome. I downloaded the paired-end read file (52.6 M reads) and split it into 2 separate end files. However, before attempting the mapping, given that the reads were generated using Illumina technology, I decided to run them through FASTQC to get a feel for them. I was very surprised when FASTQC reported very high levels of read duplication - for example, for each end file, the 4 most duplicated reads accounted for almost 3% of all reads (each representing more than 35k reads), and the total sequence duplication level reported is >=70% in both cases.
I realise that FASTQC is not the best software for getting an idea of sequence duplication, given that it does not take paired ends into account and limits itself to unique sequences from the 50 first nucleotides in the 200 000 first reads, as is known to give such results, but as I am not very experienced with RNA-seq these results worried me. I tried to find out what these 4 most duplicated reads corresponded to by blasting them against the Mouse genome, the whole of the nr database (temporary report link expires on 10-25 05:00 am), the 92 common ERCC RNA-seq spike-in control sequences, and against whatever Illumina adaptors, primers, barcodes etc that I could find. However, I have come up completely blank! Looking at the sequences of these heavily-replicated 50nt read parts, I also noticed that there were very few "double nucleotides", which one might expect in any given sequence. I've attached the two tables of over-represented sequences, and copied these sequences below (they are different, and not mirrors, for ends 1 and 2):
Ends1:
TCTAGAGTACAGTGACGAGTGACGATACACGCATACGACTGACGCCGTAC
CACGTCACGTGTACGTAGTACGTACGCATACACGCATGTACGTATATAGT
AGATCTCATATCGTCGCTCGTCATGCGTGTATGCGTCTGCATACGGCGCA
GTGCAGTGCGCACATATCACATGCTATGCGTGTATGACAGTCGTATACTG
Ends2:
TGTCGATTATCGCACTGGTGCGAATGGATACGCGACATCTATCTGATGAC
CACTATAGCGATAGACAAGCATGCGCTGCGTCGACTCAGATGAGTGCACG
GCGATCGCTCTATCTGCTCATCTGCACTGCATATGAGCACGCTACTGCTA
ATAGCGCAGAGCGTGATCATGACTATACATGATCTGTGTGCAGCACATGT
I have attached the FASTQC duplicate sequence graphs and the per-base sequence quality box plots (for end 1) as well. Please note that the 4 first over-represented sequences did not seem to correspond to any particular quality distribution (i.e. were not all low-quality). Obviously the 5th for each was!
Searching on SEQanswers, I found these interesting threads, but was not able to find a consensus interpretation:
http://seqanswers.com/forums/showthread.php?t=24094
http://seqanswers.com/forums/showthread.php?t=28607
http://seqanswers.com/forums/showthread.php?t=30397
http://seqanswers.com/forums/showthread.php?t=24040
A blog post on FASTQC duplicate sequences (pointed to by one of the threads) was interesting as well:
http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/
I have no idea on how to interpret the strong presence of these peculiar sequences other than some problem in the library preparation, which I would find surprising given that this dataset was used as an example in a paper. Bottom line, I don't know what the best way to deal with them would be: keep them (as the mapping results should not be impacted), or remove them (only losing about 3% total reads)? Should I just go ahead and do the mapping, then use Picard tools to look at library diversity (but then these reads shouldn't map anyway)? Maybe in practice it doesn't change anything but I would like to be sure I understand what I'm doing (or not doing)!
Thank you in advance for any help or enlightenment you can bring and always, thanks for reading!
-- Alex
My name is Alex SMITH and I've recently started a RNA-seq bioinformatics post-doc at the Malaghan Institute of medical Research. In order to practice with the tools I'll be using, I decided to try and map the RNA-seq reads from the Mouse Dendritic Cell dataset (GSE29209 ; GSM722533) that was used as an example in the original Trinity paper "Full-length transcriptome assembly from RNA-Seq data without a reference genome" (2011) to the latest mouse genome. I downloaded the paired-end read file (52.6 M reads) and split it into 2 separate end files. However, before attempting the mapping, given that the reads were generated using Illumina technology, I decided to run them through FASTQC to get a feel for them. I was very surprised when FASTQC reported very high levels of read duplication - for example, for each end file, the 4 most duplicated reads accounted for almost 3% of all reads (each representing more than 35k reads), and the total sequence duplication level reported is >=70% in both cases.
I realise that FASTQC is not the best software for getting an idea of sequence duplication, given that it does not take paired ends into account and limits itself to unique sequences from the 50 first nucleotides in the 200 000 first reads, as is known to give such results, but as I am not very experienced with RNA-seq these results worried me. I tried to find out what these 4 most duplicated reads corresponded to by blasting them against the Mouse genome, the whole of the nr database (temporary report link expires on 10-25 05:00 am), the 92 common ERCC RNA-seq spike-in control sequences, and against whatever Illumina adaptors, primers, barcodes etc that I could find. However, I have come up completely blank! Looking at the sequences of these heavily-replicated 50nt read parts, I also noticed that there were very few "double nucleotides", which one might expect in any given sequence. I've attached the two tables of over-represented sequences, and copied these sequences below (they are different, and not mirrors, for ends 1 and 2):
Ends1:
TCTAGAGTACAGTGACGAGTGACGATACACGCATACGACTGACGCCGTAC
CACGTCACGTGTACGTAGTACGTACGCATACACGCATGTACGTATATAGT
AGATCTCATATCGTCGCTCGTCATGCGTGTATGCGTCTGCATACGGCGCA
GTGCAGTGCGCACATATCACATGCTATGCGTGTATGACAGTCGTATACTG
Ends2:
TGTCGATTATCGCACTGGTGCGAATGGATACGCGACATCTATCTGATGAC
CACTATAGCGATAGACAAGCATGCGCTGCGTCGACTCAGATGAGTGCACG
GCGATCGCTCTATCTGCTCATCTGCACTGCATATGAGCACGCTACTGCTA
ATAGCGCAGAGCGTGATCATGACTATACATGATCTGTGTGCAGCACATGT
I have attached the FASTQC duplicate sequence graphs and the per-base sequence quality box plots (for end 1) as well. Please note that the 4 first over-represented sequences did not seem to correspond to any particular quality distribution (i.e. were not all low-quality). Obviously the 5th for each was!
Searching on SEQanswers, I found these interesting threads, but was not able to find a consensus interpretation:
http://seqanswers.com/forums/showthread.php?t=24094
http://seqanswers.com/forums/showthread.php?t=28607
http://seqanswers.com/forums/showthread.php?t=30397
http://seqanswers.com/forums/showthread.php?t=24040
A blog post on FASTQC duplicate sequences (pointed to by one of the threads) was interesting as well:
http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/
I have no idea on how to interpret the strong presence of these peculiar sequences other than some problem in the library preparation, which I would find surprising given that this dataset was used as an example in a paper. Bottom line, I don't know what the best way to deal with them would be: keep them (as the mapping results should not be impacted), or remove them (only losing about 3% total reads)? Should I just go ahead and do the mapping, then use Picard tools to look at library diversity (but then these reads shouldn't map anyway)? Maybe in practice it doesn't change anything but I would like to be sure I understand what I'm doing (or not doing)!
Thank you in advance for any help or enlightenment you can bring and always, thanks for reading!
-- Alex
Comment