Hi All,
We are working with a de novo transcriptome assembly of Illumina HiSeq data (20x 100 bp paired-end, stranded libraries). Raw reads underwent standard trimming and were assembled in Trinity using mostly default settings, with the appropriate RF flag for stranded data.
However, our libraries appear to be less 'stranded' than we had hoped. When I searched our assembly for about 15 common qPCR reference genes, in every case the assembly contained a strong hit in the expected orientation as well as a near-identical contig in reverse complement.
By mapping each library to contigs from the combined transcriptome assembly, we estimated that 13-25% of the reads in our 'stranded' libraries map in the reverse orientation. This won't be a perfect estimate, because some forward- and reverse-strand transcripts will overlap and we don't have a reference genome.
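For anyone wanting to reproduce this kind of estimate, here is a minimal sketch of the counting step, not necessarily the exact procedure we used. It assumes the alignments are available as SAM text, that each contig represents the sense strand of its transcript, and that the library follows the RF convention (read 1 aligns to the reverse strand of the contig, read 2 to the forward strand). The function name `reverse_fraction` is made up for illustration.

```python
def reverse_fraction(sam_lines):
    """Return the fraction of paired, primary alignments whose orientation
    contradicts an RF-stranded library (assumption: contigs are sense-strand)."""
    expected = 0
    unexpected = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        if not flag & 0x1:
            continue  # only paired reads carry read1/read2 information
        is_reverse = bool(flag & 0x10)
        is_read1 = bool(flag & 0x40)
        is_read2 = bool(flag & 0x80)
        if is_read1 == is_read2:
            continue  # ambiguous mate assignment
        # RF convention: read 1 on the reverse strand, read 2 on the forward strand.
        if is_read1 == is_reverse:
            expected += 1
        else:
            unexpected += 1
    total = expected + unexpected
    return unexpected / total if total else 0.0
```

Called on the lines of a SAM file (for example `reverse_fraction(open("library1.sam"))`, where the file name is a placeholder), it returns a value between 0 and 1. Contigs assembled in antisense orientation would inflate the number, which is one reason to treat it as a rough estimate only.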
We have another transcriptome for a related species (same treatments), where the 'strandedness' appears more efficient (estimated 5-15% reads mapping to reverse strand).
My questions are:
Has anyone come across this problem in their own data, and what might cause low efficiency in a stranded protocol?
Can anyone suggest an approach for redundancy removal that would also recognize reverse complement contigs? Programs such as CD-HIT don't seem to search in reverse complement.
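To make the second question concrete, this is roughly the kind of strand-agnostic comparison I have in mind. It is only a sketch under my own assumptions: it compares two contigs by their canonical k-mers (each k-mer is replaced by the lexicographically smaller of itself and its reverse complement), so a contig and its reverse complement score as identical. The function names and the default `k=25` are arbitrary choices for illustration, and an all-against-all comparison like this would not scale to a full assembly without indexing.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k=25):
    """Return the set of strand-independent (canonical) k-mers of seq."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Keep whichever orientation sorts first, so both strands agree.
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def shared_fraction(seq_a, seq_b, k=25):
    """Fraction of the smaller canonical k-mer set that is shared by both
    sequences, regardless of the orientation of either contig."""
    kmers_a = canonical_kmers(seq_a, k)
    kmers_b = canonical_kmers(seq_b, k)
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))
```

A pair of contigs with a `shared_fraction` near 1 would be a candidate redundant pair even if one is the reverse complement of the other. The threshold for calling two contigs redundant would still need to be chosen.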
Thanks in advance for your thoughts!