  • paired end quality scores

    I'm new to bioinformatics and just got a paired-end (120 bp) set of Illumina sequences. The sequence_2 file looks like what I'm used to, but the sequence_1 file looks like this:


    @GRC13_0025_FC:8:1:8024:1022#0/1
    NAGTGAGTAGTCAAAAGAATAGTTCTATCCGACTTAACCAAAGCTAACATCTTCTGAACATCAATCCGTGCAGCAGGATCCATTCCAGCAGTTGGTTCATCCAAAAGAATCACTTTACTCC
    +GRC13_0025_FC:8:1:8024:1022#0/1
    BQNNMMLMRLY[Y[[WVOQQWJWXXYVYYYMMTQOMNOQNVOTVTXXWWW_____TY[[YRRWWMJRRRTWPTRTPTVTV_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @GRC13_0025_FC:8:1:8161:1016#0/1
    NTCTTCATCGTCAGGCACTGGAAAGTGATTATGCGTCATCTCATCTTCATGAATGGATTGATCTGATTTTCGGATATAAACAGAATGGAGAAGAGGCAGTGAAGGAAAGAAGCAATTATTT
    +GRC13_0025_FC:8:1:8161:1016#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @GRC13_0025_FC:8:1:8180:1012#0/1
    NTCATTTAGAAAAAATAGAAACGATATTGAAGATAAAGTACGAATAATTATAGACCTGACAGTTGATGAGGTAGAAAGTGTAAAAATTAGATCTGAGAAAATTCAAGTAGATGGGCATTGA
    +GRC13_0025_FC:8:1:8180:1012#0/1
    BUUUUXXXXXbbQQQQQQ___b____b_________b__________b__b___b__b___bbbb_b_______b___bbbb___b_______T___b__QQ_____Z_Z___________


    I have been looking at RAD sequencing data and interpreting the 2nd line as FASTQ scores, but if that holds for these reads, the quality is really poor. Is that true, or do paired-end reads have a different file format? Thanks for any advice!

  • #2
    Hi Oregon,

    The 4th line of each FastQ record holds the base-call qualities for the bases in line 2. Paired-end data is normally arranged so that both files list the reads in the same order, so the Nth record in one file is the mate of the Nth record in the other.
    The quality score 'B' (Phred score 2) is not a generic low quality value but has a special meaning which was already discussed in a previous thread:



    Perhaps it is related to this new 'feature' of Pipeline 1.3+? See SLIDE 17 in http://docs.google.com/fileview?id=0...NTUyNDE3&hl=en. Here is the text of the slide:

    "The Read Segment Quality Control Indicator: At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads. A quality score of 2, encoded as a "B", is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis. Some reads will end with a run of B (or Q2) basecalls, but there will never be an isolated Q2 basecall."
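To make the 'B' indicator concrete, here is a minimal Python sketch of how a quality string like the ones above could be decoded and how a trailing run of 'B' could be clipped. It assumes the Phred+64 encoding used by Illumina pipeline 1.5 to 1.7 (which is what makes 'B' equal to Q2); the function names `decode_quality` and `trim_b_tail` are made up for this illustration and are not part of any Illumina tool.

```python
# Assumption: qualities are Phred+64 encoded (Illumina 1.5-1.7 style),
# so 'B' (ASCII 66) decodes to a score of 2.
PHRED_OFFSET = 64


def decode_quality(qual, offset=PHRED_OFFSET):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in qual]


def trim_b_tail(seq, qual):
    """Drop the trailing run of 'B' qualities and the matching bases.

    Only the end of the read is clipped; a 'B' elsewhere is left alone.
    """
    keep = len(qual.rstrip("B"))
    return seq[:keep], qual[:keep]


if __name__ == "__main__":
    print(decode_quality("B_b"))            # [2, 31, 34]
    print(trim_b_tail("ACGTAC", "bb_BBB"))  # ('ACG', 'bb_')
```

A read whose quality line is nothing but 'B', like the second record in the question, comes back empty from `trim_b_tail`, which matches the slide's advice not to use the flagged region downstream.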
    Also, looking only at the first few hundred lines of a FastQ file can give you a misleading impression, since a file can contain more than 100 million lines. A quality control tool such as FastQC might give you a better picture of your sequencing data as a whole.
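If you want a quick number for the whole file rather than the first screenful, something along these lines could work. It is only a rough sketch and no substitute for FastQC: it assumes plain 4-line FASTQ records (no wrapped sequence lines) and treats a trailing run of 'B' as the flagged region; the function name `b_tail_summary` and the file name in the usage note are hypothetical.

```python
def b_tail_summary(handle):
    """Stream through a FASTQ file and count reads and bases flagged by a 'B' tail.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    reads = flagged_reads = bases = flagged_bases = 0
    while True:
        header = handle.readline()
        if not header:          # end of file
            break
        if not header.strip():  # skip stray blank lines between records
            continue
        handle.readline()                  # sequence line (not needed here)
        handle.readline()                  # '+' separator line
        qual = handle.readline().strip()   # quality line
        tail = len(qual) - len(qual.rstrip("B"))
        reads += 1
        bases += len(qual)
        flagged_bases += tail
        if tail:
            flagged_reads += 1
    return {
        "reads": reads,
        "flagged_reads": flagged_reads,
        "bases": bases,
        "flagged_bases": flagged_bases,
    }
```

Usage would look like `with open("sequence_1.txt") as fh: print(b_tail_summary(fh))`. Because it reads one record at a time, memory use stays flat even for files with hundreds of millions of lines.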

    Best wishes

    • #3
      thanks!

      Thanks for the reply. I ran two FASTQ analysis programs and both show terrible quality scores, so I guess I will get in touch with our sequencing people and see what is going on.
