  • xiangwulu
    Member
    • Apr 2014
    • 18

    All sequence bases have the same quality score.

    Hi all,
    I am doing some analysis on the dataset here:



    Some basic info for the data without looking into above link:
    ----
    Illumina Genome Analyzer IIx paired end sequencing
    shotgun sequencing
    WGS
    Pseudomonas fluorescens
    Paired-end
    ----

    When I search for 'Genome Analyzer IIx', I can find the quality encoding information. I have seen that the quality scores for all bases are '?', e.g.

    @ERR1363506.14 226/1
    GTCCACTACAGGTCGAAGCCGAAGGCGACGAGTTGCGTGTTTACGCGCCCAATCGTTTTGTTCTCGACTGGGTCAACGAGAAGTACCTGAGCCGCGTGCT
    +
    ????????????????????????????????????????????????????????????????????????????????????????????????????

    My question is:
    Is it normal to have an identical quality score for all bases?
    When I analyze the data, some bioinformatics tools report errors saying that they cannot detect the quality offset or quality encoding. Is the above the cause of these errors?

    Thanks.
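
The observation in the post above can be checked mechanically by collecting the distinct characters that appear on the quality lines. The following is a minimal sketch, assuming strict four-line FASTQ records; the function name `distinct_quality_chars` is hypothetical, and the sample record is the one quoted in the post (the quality string is rebuilt as one '?' per base).

```python
from typing import Iterable, Set


def distinct_quality_chars(lines: Iterable[str]) -> Set[str]:
    """Return the set of characters seen on quality lines.

    Assumes strict four-line FASTQ records: header, sequence, '+', quality.
    """
    seen: Set[str] = set()
    for index, line in enumerate(lines):
        if index % 4 == 3:  # every fourth line is the quality string
            seen.update(line.rstrip("\n"))
    return seen


sequence = (
    "GTCCACTACAGGTCGAAGCCGAAGGCGACGAGTTGCGTGTTTACGCGCCCAATCGTTTTGTTCTCGAC"
    "TGGGTCAACGAGAAGTACCTGAGCCGCGTGCT"
)
record = [
    "@ERR1363506.14 226/1\n",
    sequence + "\n",
    "+\n",
    "?" * len(sequence) + "\n",  # one '?' per base, as in the post
]

chars = distinct_quality_chars(record)
print(sorted(chars), [ord(c) for c in sorted(chars)])
```

If the whole file yields a single character, a tool that guesses the encoding from the observed range of quality characters has very little to work with, which may explain the detection errors.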
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    This is an odd dataset.

    First of all there are three files for a PE dataset (I thought one was a file for the barcode/tags, but that does not appear to be the case). The fastq headers are non-standard and then there is that issue of every Q-score set to ? for the entire dataset in all three files.

    You should try to find out more information (directly from the submitter, if you can) before spending time analyzing this data.


    • xiangwulu
      Member
      • Apr 2014
      • 18

      #3
      Originally posted by GenoMax
      This is an odd dataset.

      First of all there are three files for a PE dataset (I thought one was a file for the barcode/tags, but that does not appear to be the case). The fastq headers are non-standard and then there is that issue of every Q-score set to ? for the entire dataset in all three files.

      You should try to find out more information (directly from the submitter, if you can) before spending time analyzing this data.

      Thanks for your answer.

      This data can be found on DRASearch, NCBI SRA, and EBI.
      All of these sources show the same strange quality values.
      I wasn't able to find the contact info of the submitter, but I emailed EBI help and got the following reply:

      CRAM files are compressed NGS read files. The sequences can be retrieved by using the reference, but quality scores are quantised into a smaller range in order to use less space. It looks like the compression on this CRAM file is such that all quality scores average into the same value. These are probably low-value quality scores, or the quality scores were not available in the first place.
      I would just leave the data, or set the --offset =33 for the tool, just to pass the analysis.
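
To see what an offset of 33 implies for this data, the quality character can be decoded by hand. This is a minimal sketch, assuming the usual Phred convention that the score is the character's ASCII code minus the offset; the helper names `decode_quality` and `error_probability` are hypothetical.

```python
def decode_quality(char: str, offset: int) -> int:
    """Convert one FASTQ quality character to its numeric score."""
    return ord(char) - offset


def error_probability(phred: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


q33 = decode_quality("?", 33)  # 63 - 33 = 30
q64 = decode_quality("?", 64)  # 63 - 64 = -1, not a valid Phred score
print(q33, q64, error_probability(q33))
```

Under offset 33 every base reads as Q30, whereas offset 64 would give a negative value, so 33 looks like the only sensible Phred interpretation here. As far as I know, the older Solexa scale with offset 64 did allow small negative scores, which may be part of why automatic detection struggles on a file like this.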


      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Ok. So we have an explanation for the Q-scores but what about the presence of 3 files, all of which have the same length sequence data?

        Edit: I think the third file likely contains single reads whose mate was discarded during trimming. You can check that possibility by seeing whether the headers in it are absent from the _1 and _2 files.
        Last edited by GenoMax; 06-24-2016, 07:37 AM.
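
The header check suggested in the edit above could be sketched as follows. This assumes four-line FASTQ records and that the read ID is the first whitespace-delimited token after '@' (e.g. `ERR1363506.14` in `@ERR1363506.14 226/1`), which is an assumption about this dataset's non-standard headers. The names `read_ids`, `orphans`, and `toy_record` are hypothetical.

```python
from typing import Iterable, List, Set


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect read IDs from the header lines of a four-line FASTQ stream."""
    ids: Set[str] = set()
    for index, line in enumerate(lines):
        if index % 4 == 0 and line.startswith("@"):
            tokens = line[1:].split()
            if tokens:  # skip a bare '@' header
                ids.add(tokens[0])
    return ids


def orphans(single: Set[str], mate1: Set[str], mate2: Set[str]) -> Set[str]:
    """IDs in the third file that appear in neither the _1 nor the _2 file."""
    return single - (mate1 | mate2)


def toy_record(name: str) -> List[str]:
    """Build a tiny made-up FASTQ record for demonstration only."""
    return [f"@{name} 1/1\n", "ACGT\n", "+\n", "????\n"]


# Toy data standing in for the _1, _2, and third files.
mate1 = read_ids(toy_record("r.1") + toy_record("r.2"))
mate2 = read_ids(toy_record("r.1") + toy_record("r.2"))
single = read_ids(toy_record("r.3"))

print(orphans(single, mate1, mate2))
```

On real data, the same functions can be fed open file handles, for example `read_ids(open(path))` for each of the three files. If `orphans` returns every ID in the third file, none of its reads has a partner in the paired files.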


        • xiangwulu
          Member
          • Apr 2014
          • 18

          #5
          Originally posted by GenoMax
          Ok. So we have an explanation for the Q-scores but what about the presence of 3 files, all of which have the same length sequence data?
          Usually, when splitting the .sra files of paired-end reads with fastq-dump from the SRA Toolkit, the --split-3 parameter is used to do this:

          Legacy 3-file splitting for mate-pairs: First 2 biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only 1 biological read is dumpable - it is placed in *.fastq.

          So the smaller file, usually called the unmapped sequences, contains the reads whose mate could not be found.


          SRA Tools: ncbi/sra-tools on GitHub.
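
The routing rule in the quoted help text can be illustrated with a small sketch. This is not the SRA Toolkit's code, only a toy model of the description above; the function name `route_spot` and the file-naming pattern built from the accession are assumptions for illustration.

```python
from typing import Dict, List


def route_spot(accession: str, reads: List[str]) -> Dict[str, str]:
    """Map output file name -> read, following the quoted --split-3 rule.

    `reads` holds the biological reads of one spot that satisfied the
    dumping conditions.
    """
    if len(reads) >= 2:
        # The first two dumpable reads go to the paired files.
        return {
            f"{accession}_1.fastq": reads[0],
            f"{accession}_2.fastq": reads[1],
        }
    if len(reads) == 1:
        # A lone dumpable read goes to the third, unsuffixed file.
        return {f"{accession}.fastq": reads[0]}
    return {}


paired = route_spot("ERR1363506", ["ACGT", "TTGA"])
lonely = route_spot("ERR1363506", ["ACGT"])
print(sorted(paired), sorted(lonely))
```

Under this model the third file only ever receives reads from spots with a single dumpable read, which matches the idea that it holds reads without a mate.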


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            See the edit I just made to the post above.


            • xiangwulu
              Member
              • Apr 2014
              • 18

              #7
              Originally posted by GenoMax
              See the edit I just made to the post above.
              Saw it.
              I think there is no trimming involved at/before that stage. The third file is a collection of unloved ones.
