
  • The meaning of B in Illumina 1.5 pipeline data?

    Hello all,


    I have observed an anomaly in all Illumina 1.5 pipeline data. I use Biopieces (www.biopieces.org) for trimming my data; trim_seq basically removes residues below a given quality threshold from the ends. When I plot the length distribution after trimming, I observe a peak at every fifth residue.


    Code:
    read_fastq -n 1000000 -i in.fastq | trim_seq | plot_lendist -k SEQ_LEN -x
    
    
                                    Length Distribution
    
      12000 ++----------------------------------------------------------------++
            |                                                                  |
            |                                                                  |
      10000 ++                                                               **+
            |                                                                **|
            |                                                                **|
       8000 ++                                                               **+
            |                                                                **|
       6000 ++                                                          **   **+
            |                                                           **   **|
            |                                                           **   **|
       4000 ++                                                    *    ***   **+
            |                                                     *    ***  ***|
            |                                              **   ***  *****  ***|
       2000 ++                                      **    ***  **** ***********+
            |                                **    ***  ***********************|
            |**        *****  ****   ******************************************|
          0 +******************************************************************+
             +      +     +      +      +     +      +     +      +     +      +
             0      5     10     15     20    25     30    35     40    45     50

    Forensics indicate that the presence of B in the quality scores is the problem. If I remove all records containing any B then the anomaly disappears:


    Code:
    read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES -i | trim_seq | plot_lendist -k SEQ_LEN -x
    
                                    Length Distribution
    
      4000 ++-----------------------------------------------------------------++
           |                                                                 **|
      3500 ++                                                                **+
           |                                                                 **|
      3000 ++                                                                **+
           |                                                                 **|
      2500 ++                                                                **+
           |                                                                 **|
      2000 ++                                                                **+
           |                                                                 **|
           |                                                                 **|
      1500 ++                                                                **+
           |                                                                 **|
      1000 ++                                                                **+
           |                                                                 **|
       500 ++                                                                **+
           |                                                                ***|
         0 ++------+------+-----+------+------+-----+------+------+-----+******+
            +      +      +     +      +      +     +      +      +     +      +
            0      5      10    15     20     25    30     35     40    45     50
    And for all records with B's:

    Code:
     read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES | trim_seq | plot_lendist -k SEQ_LEN -x
    
    
                                    Length Distribution
    
      7000 ++-----------------------------------------------------------------++
           |                                                            **     |
      6000 ++                                                           **   **+
           |                                                            **   **|
           |                                                            **   **|
      5000 ++                                                           **   **+
           |                                                     **     **   **|
      4000 ++                                                    **    ***   **+
           |                                                     **    ***  ***|
           |                                                     **    ***  ***|
      3000 ++                                              *    ***  *****  ***+
           |                                               *    ***  ***** ****|
      2000 ++                                       **   ***    *** ***********+
           |                                 **    ***   ***   ****************|
           |                                ***    ***  ***********************|
      1000 +**           **     **   ***   ************************************+
           |**         ********************************************************|
         0 +*******************************************************************+
            +      +      +     +      +      +     +      +      +     +      +
            0      5      10    15     20     25    30     35     40    45     50

    Now, according to Wikipedia and the docs I have been able to find (page 32):



    B (Q2) is used as an indicator that the quality of a sequence residue is substandard, but it does not really represent a quality score. trim_seq will regard B as Q2 and discard the residue, and to the best of my understanding that is OK.
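To make the B-to-Q2 mapping concrete, here is a minimal Python sketch. It assumes the Phred+64 encoding used by the Illumina 1.5 pipeline, where the score is the ASCII code minus 64, so B (ASCII 66) decodes to 2. The helper names and the threshold of 20 are made up for illustration, and trim_ends is only a simple end trimmer in the spirit of trim_seq, not Biopieces' actual implementation.

```python
from __future__ import annotations

PHRED_OFFSET = 64  # assumption: Illumina 1.3-1.7 style FASTQ, score = ord(char) - 64


def decode_scores(scores: str, offset: int = PHRED_OFFSET) -> list[int]:
    """Convert an ASCII quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in scores]


def trim_ends(seq: str, scores: str, min_q: int = 20) -> tuple[str, str]:
    """Strip residues scoring below min_q from both ends (hypothetical helper)."""
    quals = decode_scores(scores)
    start, stop = 0, len(quals)
    # Walk inwards from the 5' end until a residue meets the threshold.
    while start < stop and quals[start] < min_q:
        start += 1
    # Walk inwards from the 3' end until a residue meets the threshold.
    while stop > start and quals[stop - 1] < min_q:
        stop -= 1
    return seq[start:stop], scores[start:stop]
```

For example, trim_ends("ACGTA", "hhhBB") keeps "ACG" with scores "hhh", because the two trailing B's decode to Q2 and fall below the threshold.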

    But I don't understand the cyclic behaviour I observe. 10-20% of all records contain a B, so I will lose a lot of data by filtering out those reads.
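A rough way to put numbers on that trade-off is sketched below in Python. It assumes the B's mostly occur as a run at the 3' end of a read, which is how the Illumina 1.5+ documentation describes the indicator, and it takes the quality lines as a plain list of strings. Both function names are hypothetical.

```python
from __future__ import annotations


def trailing_b_run(scores: str) -> int:
    """Length of the run of 'B' characters at the 3' end of a quality string."""
    return len(scores) - len(scores.rstrip("B"))


def b_summary(score_lines: list[str]) -> dict[str, float]:
    """Compare the cost of dropping whole reads with the cost of trimming B tails."""
    total = len(score_lines)
    residues = sum(len(s) for s in score_lines)
    with_b = sum("B" in s for s in score_lines)
    tail_b = sum(trailing_b_run(s) for s in score_lines)
    return {
        # Fraction of reads lost if every record containing a B is discarded.
        "reads_with_b": with_b / total if total else 0.0,
        # Fraction of residues lost if only the trailing B run is trimmed.
        "residues_in_tail_b": tail_b / residues if residues else 0.0,
    }
```

If the second number is much smaller than the first, trimming the B tail keeps most of the data that filtering whole records would throw away.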

    Anyone?

    (and why does Illumina keep changing the FASTQ encoding?)




    Cheers,



    Martin

  • #2
    Here is the answer from Illumina tech support:

    In looking at the graphs, I noticed the cyclic pattern you described is present in all three graphs. It is, however, more subtle in the non-multiplexed sample. The cyclic nature is a result of several factors:

    The neighborhood analysis aspect of Illumina Q-scoring gives the scores a 5-cycle periodicity. This has been noted in the past.
    The other aspect is the quality of the sample: the cyclic pattern may be more pronounced in some samples. Using trim_seq may also accentuate the cycles, especially with more aggressive trimming.
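To check the 5-cycle periodicity in numbers rather than by eye, the trimmed read lengths can be binned by their remainder modulo 5. The Python sketch below is one way to do that. The function name is hypothetical, and the optional full_length argument (an assumption, for example 50 for the reads plotted above) excludes untrimmed reads so they do not dominate one bin.

```python
from __future__ import annotations

from collections import Counter


def counts_by_phase(
    lengths: list[int], period: int = 5, full_length: int | None = None
) -> list[int]:
    """Count trimmed read lengths by their remainder modulo `period`."""
    # Optionally skip reads that were not trimmed at all.
    kept = [n for n in lengths if full_length is None or n != full_length]
    phase = Counter(n % period for n in kept)
    return [phase.get(p, 0) for p in range(period)]
```

A roughly flat list would argue against a 5-residue cycle, while one or two dominant phases would be consistent with the peaks seen in the plots.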
