The meaning of B in Illumina 1.5 pipeline data?

maasha

Senior Member

Join Date: Apr 2009
Posts: 153

05-10-2011, 04:49 AM

Hello all,

I have observed an anomaly in all Illumina 1.5 pipeline data. I use Biopieces (www.biopieces.org) for trimming my data - and trim_seq basically removes residues whose quality score falls below a given threshold from the ends. When I plot the length distribution after trimming I observe a peak at every 5 residues.

Code:

read_fastq -n 1000000 -i in.fastq | trim_seq | plot_lendist -k SEQ_LEN -x


                                Length Distribution

  12000 ++----------------------------------------------------------------++
        |                                                                  |
        |                                                                  |
  10000 ++                                                               **+
        |                                                                **|
        |                                                                **|
   8000 ++                                                               **+
        |                                                                **|
   6000 ++                                                          **   **+
        |                                                           **   **|
        |                                                           **   **|
   4000 ++                                                    *    ***   **+
        |                                                     *    ***  ***|
        |                                              **   ***  *****  ***|
   2000 ++                                      **    ***  **** ***********+
        |                                **    ***  ***********************|
        |**        *****  ****   ******************************************|
      0 +******************************************************************+
         +      +     +      +      +     +      +     +      +     +      +
         0      5     10     15     20    25     30    35     40    45     50
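For readers without Biopieces at hand, the kind of end trimming described above can be sketched in Python. This is only an illustration and not trim_seq's actual implementation: the Phred+64 offset is an assumption about Illumina 1.5 data, and the threshold of 20 and the function names are hypothetical.

```python
from collections import Counter

PHRED_OFFSET = 64  # assumption: Illumina 1.5 data uses Phred+64 encoding
MIN_QUAL = 20      # hypothetical threshold, not trim_seq's documented default


def trim_ends(seq, scores, min_qual=MIN_QUAL, offset=PHRED_OFFSET):
    """Strip residues scoring below min_qual from both ends of a read."""
    quals = [ord(c) - offset for c in scores]
    start, end = 0, len(quals)
    # Walk inwards from the 5' end while the score is below the threshold.
    while start < end and quals[start] < min_qual:
        start += 1
    # Walk inwards from the 3' end while the score is below the threshold.
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return seq[start:end], scores[start:end]


def length_distribution(records):
    """Tally trimmed read lengths for an iterable of (seq, scores) pairs."""
    return Counter(len(trim_ends(seq, scores)[0]) for seq, scores in records)


if __name__ == "__main__":
    reads = [
        ("ACGTACGTAC", "hhhhhhhhhh"),  # high scores throughout: nothing trimmed
        ("ACGTACGTAC", "hhhhhBBBBB"),  # trailing run of B is removed
        ("ACGTACGTAC", "BBhhhhhhBB"),  # low-quality residues on both ends
    ]
    for length, count in sorted(length_distribution(reads).items()):
        print(length, count)
```

Plotting the resulting counts against length gives the same kind of histogram as plot_lendist, so any periodicity in where the low scores start shows up as peaks.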

Forensics indicate that the presence of B in the quality scores is the problem. If I remove all records containing any B then the anomaly disappears:

Code:

read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES -i | trim_seq | plot_lendist -k SEQ_LEN -x

                                Length Distribution

  4000 ++-----------------------------------------------------------------++
       |                                                                 **|
  3500 ++                                                                **+
       |                                                                 **|
  3000 ++                                                                **+
       |                                                                 **|
  2500 ++                                                                **+
       |                                                                 **|
  2000 ++                                                                **+
       |                                                                 **|
       |                                                                 **|
  1500 ++                                                                **+
       |                                                                 **|
  1000 ++                                                                **+
       |                                                                 **|
   500 ++                                                                **+
       |                                                                ***|
     0 ++------+------+-----+------+------+-----+------+------+-----+******+
        +      +      +     +      +      +     +      +      +     +      +
        0      5      10    15     20     25    30     35     40    45     50

And for all records with B's:

Code:

read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES | trim_seq | plot_lendist -k SEQ_LEN -x


                                Length Distribution

  7000 ++-----------------------------------------------------------------++
       |                                                            **     |
  6000 ++                                                           **   **+
       |                                                            **   **|
       |                                                            **   **|
  5000 ++                                                           **   **+
       |                                                     **     **   **|
  4000 ++                                                    **    ***   **+
       |                                                     **    ***  ***|
       |                                                     **    ***  ***|
  3000 ++                                              *    ***  *****  ***+
       |                                               *    ***  ***** ****|
  2000 ++                                       **   ***    *** ***********+
       |                                 **    ***   ***   ****************|
       |                                ***    ***  ***********************|
  1000 +**           **     **   ***   ************************************+
       |**         ********************************************************|
     0 +*******************************************************************+
        +      +      +     +      +      +     +      +      +     +      +
        0      5      10    15     20     25    30     35     40    45     50
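The split used in the two commands above (records with and without a B in the scores) can be mimicked without Biopieces. The following is a rough sketch that assumes a plain 4-lines-per-record FASTQ file; read_fastq and split_on_b are hypothetical helper names, not Biopieces functions.

```python
import io


def read_fastq(handle):
    """Yield (name, seq, scores) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        scores = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, scores


def split_on_b(records):
    """Partition records by whether the quality string contains a 'B'."""
    with_b, without_b = [], []
    for rec in records:
        (with_b if "B" in rec[2] else without_b).append(rec)
    return with_b, without_b


if __name__ == "__main__":
    demo = "@r1\nACGT\n+\nhhhh\n@r2\nACGT\n+\nhhBB\n"
    with_b, without_b = split_on_b(read_fastq(io.StringIO(demo)))
    total = len(with_b) + len(without_b)
    print("records with B:", len(with_b), "of", total)
```

The same two lists also give the fraction of records that would be lost by filtering on B.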

Now, according to Wikipedia:

FASTQ format - Wikipedia

http://en.wikipedia.org/wiki/FASTQ_format#Encoding

and the docs I have been able to find (page 32):

CASAVA1.7_User_Guide_15011196_A

http://www.scribd.com/doc/48889532/CASAVA1-7-User-Guide-15011196-A

B (Q2) is used as an indicator that the quality of a sequence residue is substandard, but it does not really carry a quality score. trim_seq will regard B as Q2 and discard the residue, and to the best of my understanding that is OK.
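To make the B = Q2 relationship concrete: B is ASCII 66, so with a Phred+64 offset it decodes to a score of 2. A minimal sketch follows, where the offset is an assumption about the 1.5 encoding and both function names are hypothetical. The second helper measures how many B characters sit at the 3' end of a read, which is where a threshold-based trimmer would start cutting.

```python
PHRED_OFFSET = 64  # assumption: Illumina 1.5 data uses Phred+64 encoding


def decode_scores(scores, offset=PHRED_OFFSET):
    """Convert an ASCII quality string into a list of integer scores."""
    return [ord(c) - offset for c in scores]


def trailing_b_run(scores):
    """Length of the run of 'B' characters at the 3' end of a quality string."""
    return len(scores) - len(scores.rstrip("B"))


if __name__ == "__main__":
    print(decode_scores("B"))          # the score that B decodes to
    print(trailing_b_run("hhhhhBBB"))  # number of residues in the trailing B run
```

Tabulating trailing_b_run over all reads would show directly whether the B runs themselves start at preferred positions, independent of any trimming tool.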

But I don't understand the cyclic behaviour I observe. 10-20% of all records contain a B, so I will lose a lot of data by filtering those reads.

Anyone?

(and why do Illumina keep changing FASTQ encoding?)

Cheers,

Martin

Tags: biopieces, fastq, illumina, quality score

maasha

#2

06-01-2011, 05:04 AM

Here is the answer from Illumina tech support:

In looking at the graphs, I noticed the cyclic pattern you described is present in all three graphs. It is, however, more subtle in the non-multiplexed sample. The cyclic nature is a result of several factors:

The neighborhood analysis aspect of Illumina Q-scoring gives the scores a 5-cycle periodicity. This has been noted in the past.
The other aspect is the quality of the sample. In certain cases this cyclic pattern may be more pronounced in some samples. The use of the Trim_Seq option may also result in a heightened presentation of the cycles, especially in cases of more aggressive trimming.