#1 |
Member
Location: Japan Join Date: Feb 2009
Posts: 10
|
In my NGS data analysis, before mapping, I trimmed low-quality bases (<Q20) from the 3' end of each read until a high-quality (≥Q20) base appeared. I then plotted the distribution of read lengths and got a strange periodic pattern. Please see the attached graph.
In the graph, the length distributions from different lanes or tiles are drawn in different colors. The read frequencies oscillate with a period of 5 bp. I also see this kind of length distribution in our other RNA-seq and genome sequencing datasets, and in RNA-seq data from SRA as well. Does anyone know why such a periodic length distribution appears after trimming? Thanks in advance.
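For anyone who wants to reproduce the plot, here is a minimal sketch of the trimming and length tally described above. It is not the exact script used for the attached graph: the function names are made up for illustration, and the `offset=33` default assumes Phred+33 quality strings (older Illumina pipelines wrote Phred+64, in which case pass `offset=64`).

```python
from collections import Counter

def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end until a base with quality >= min_q is reached.

    `offset` is the ASCII offset of the quality encoding (assumed Phred+33 here).
    """
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def length_histogram(records, min_q=20, offset=33):
    """Count trimmed read lengths over an iterable of (sequence, quality) pairs."""
    counts = Counter()
    for seq, qual in records:
        trimmed_seq, _ = trim_3prime(seq, qual, min_q, offset)
        counts[len(trimmed_seq)] += 1
    return counts
```

Calling `length_histogram` once per lane or tile gives one curve per color, as in the graph.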
#2 |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
|
I can't remember the details, but I do recall hearing once that something in the Illumina quality scoring algorithms creates these 5 bp cycles.
#3 |
Member
Location: Japan Join Date: Feb 2009
Posts: 10
|
Thank you, kmcarr.
Do you mean that this strange distribution is caused by the base-calling algorithm in the Illumina pipeline? Can we just ignore the length distribution after trimming low-quality bases, and not worry about it?
#4 |
Senior Member
Location: Cambridge, UK Join Date: Sep 2009
Posts: 625
|
Funny that you mention this; I did something quite similar recently.
I wanted to find out whether the increase in sequencing errors towards later sequencing cycles (which is equivalent to a drop in Phred quality) can be described by some kind of mathematical formula. I used a couple of sequence files to determine the position at which poor quality starts in each read. A read was counted as poor quality if it contained at least a certain number of low-quality basecalls in total (for the attached figure, at least 8 quality values below 30). I tried various thresholds (qualities 10, 15, 20, 30), but the graph does not change much.
Interestingly, the number of poor-quality starting positions did not increase steadily towards later cycles, as I had expected, and I also saw a periodicity of (you might have guessed) 5 bp. This does indeed seem to be a feature of the Illumina pipeline algorithms. It looks artefactual and I found it slightly worrying, but I don't think one can do much about it, as it is present in all samples irrespective of their origin.
This led me to the conclusion that the increased error rate one sees towards the end of longer reads is not chemistry or run-time related, but seems to be largely the cumulative effect of these spikes of low-quality basecalls, which are introduced into the reads with a periodicity of 5 bp. Quite odd, isn't it?
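A rough sketch of this kind of tally, in case someone wants to try it on their own data. This is not the script behind the attached figure. It assumes Phred+33 quality strings and takes the "starting position" to be the cycle of the first below-threshold basecall in a read that has at least `min_low` such basecalls; both are assumptions, and all function names are hypothetical.

```python
from collections import Counter

def poor_quality_start(qual, threshold=30, min_low=8, offset=33):
    """Return the 1-based cycle of the first basecall below `threshold`,
    or None if the read has fewer than `min_low` such basecalls."""
    low = [i + 1 for i, ch in enumerate(qual) if ord(ch) - offset < threshold]
    if len(low) < min_low:
        return None
    return low[0]

def start_positions(quals, **kwargs):
    """Count poor-quality starting positions over an iterable of quality strings."""
    counts = Counter()
    for qual in quals:
        pos = poor_quality_start(qual, **kwargs)
        if pos is not None:
            counts[pos] += 1
    return counts

def fold_by_period(counts, period=5):
    """Sum counts by position modulo `period`; a strong period-5 signal
    shows up as one residue class dominating the others."""
    folded = Counter()
    for pos, n in counts.items():
        folded[pos % period] += n
    return folded
```

Folding the counts modulo 5 is only a quick way to check for the periodicity without plotting; it says nothing about where in the pipeline the pattern comes from.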
#5 |
Member
Location: Japan Join Date: Feb 2009
Posts: 10
|
Thank you, fkrueger.
I agree, it is odd. I hope this hidden bias will be fixed in the near future.