Hello,
I have paired-end 100 bp reads generated on a BGISEQ-500. The sequencing center did some adapter trimming before delivering the data, but a fraction of reads still appear to contain putative adapter sequence. I am basing this on the 'overrepresented kmers' reported by FastQC at the starts and ends of both forward and reverse reads. I've attached an image of this module's output for one particular sample.
The overrepresented kmers at the 3' end match the beginning of the 3' adapter sequence, which makes sense. I assume this happens when the insert size is less than the read length, so the read runs into the adapter on the other side of the genomic fragment.
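To make the read-through reasoning concrete, here is a minimal sketch of the arithmetic, assuming the read starts at the first base of the insert and that no trimming has been applied. The function name is just for illustration.

```python
def adapter_bases_in_read(insert_size: int, read_length: int = 100) -> int:
    """Number of trailing read bases expected to come from the 3' adapter.

    Assumes the read begins at the first base of the insert, so any bases
    beyond the insert length are adapter read-through.
    """
    return max(0, read_length - insert_size)


# An 80 bp insert sequenced with a 100 bp read leaves 20 adapter bases at the 3' end.
print(adapter_bases_in_read(80))
# An insert longer than the read gives no read-through.
print(adapter_bases_in_read(150))
```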
What is confusing me is that the overrepresented kmers at the 5' end of reads contain what looks like a partial copy of the 5' adapter sequence, but degraded at the 3' end, which I wouldn't expect, and with one base position variable. I wouldn't expect sequencing error to explain this either, as the quality scores are generally very high at the start of the reads.
Here is the 5' adapter sequence provided by the sequencing center:
5' adapter AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG. The underlined part is what is appearing in fragments at the start of reads, and the position in bold is variable among these. Libraries were prepared by the sequencing center, and the sequencing technology is still a bit unclear to me, so I'm not sure whether this is a true artefact. Has anyone seen this pattern in data from BGI before? I may just leave the data as is and proceed with mapping, as these reads are a small fraction overall, but I'm trying to understand what might be going on.
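For anyone who wants to check their own data, below is a rough sketch of how the fraction of affected reads could be estimated: scan the first few bases of each read for any kmer of the adapter, tolerating one mismatch to allow for the variable position. The kmer length, window size, and function names are my own assumptions, not anything FastQC or the sequencing center uses.

```python
ADAPTER_5P = "AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def adapter_hit_at_start(read, adapter=ADAPTER_5P, k=12, window=20, max_mismatch=1):
    """Look for an adapter kmer within the first `window` bases of a read.

    Returns (read_offset, adapter_offset, mismatches) for the hit with the
    fewest mismatches, or None if no kmer matches within `max_mismatch`.
    """
    best = None
    head = read[:window]
    for i in range(len(head) - k + 1):
        kmer = head[i:i + k]
        for j in range(len(adapter) - k + 1):
            mm = hamming(kmer, adapter[j:j + k])
            if mm <= max_mismatch and (best is None or mm < best[2]):
                best = (i, j, mm)
    return best


# Synthetic read that starts with an internal stretch of the adapter.
read = ADAPTER_5P[5:25] + "ACGT" * 20
print(adapter_hit_at_start(read))
```

Running this over the sequence lines of a FASTQ file and counting the non-None results would give the fraction of reads carrying adapter-like sequence at the 5' end, and the adapter offsets would show which part of the adapter is involved.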
Thanks