
  • Illumina FastQ Question

    Hi,

    I am new to working with Illumina data and am running into inconsistencies in which programs will recognize my file. For example, bowtie works just fine with the file, but Galaxy, the FastX Toolkit and Stampy all give me error messages about line 4, which contains the quality information. I have read lots of posts about converting to Sanger format and about the variety of Illumina-generated FastQ formats, but have yet to identify the issue. Some of the data is below.

    Any thoughts or recommendations?

    @HWUSI-EAS-100R_0003:8:1:1034:19859#NCCCCC/1
    GGTATGTTAACATTCANTGAGCTATACACTTAAGATTTGTGCACTTTATCATAGTAAGTTATTTGTCAGTTTGAA
    +
    ddadcdeeeedddd^bBa]\ZX[^Xdddaddca^a\dddab^J]`[Y^`^^cLa^c`a`YJWOSSRb\_BBBBBBB
    @HWUSI-EAS-100R_0003:8:1:1034:9367#NCCCTC/1
    CACTGAATGACATGGGACTGTTTGGACAAAACGTGCTATACCTCTACCTCGGGAGGGCCGGTTACACCATACATG
    +
    eceeeeedefcdfcfdcc^a`c^_`_Y^^\M^][^baYa\a^K^^T[]Y^BBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS-100R_0003:8:1:1035:9515#NCTATG/1
    CGCCGTTTCCCAGTAGGTCTCCTAGAACACTATTTCAATGATGAACAAGGCGAGATCGAGCTCATTGAGACCTCG
    +
    feefffffdfffedcdcbe\bdaa^T^^YaT^_\Sa^\^^YY^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

    EDIT: Just noticed that each read has 75 bp but 76 ASCII quality characters. Am I missing something here?

    Many thanks,

    Scott
    Last edited by shandley; 11-18-2010, 08:24 AM.

  • #2
    Hi Scott,

    Regarding FastQ files, it is normally a good idea to run FastQC on the data. It can tell you a lot about the sequencing data.


    I guess most programs will complain about the space in the sequence line, for which there is nevertheless a quality value. If you delete this space, FastQC should no longer complain.

    Your data appears to be in Illumina 1.3+ encoding (Phred64 scale). However, you need to find out why the sequence and quality lines are of different lengths, and why there is a space in the sequence in the first place...
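
As a rough illustration of what "Phred64 scale" means, the following Python sketch guesses the ASCII offset from the lowest quality character and converts a Phred64 string to the Sanger (Phred33) encoding. The function names are hypothetical, and the cutoff is only a heuristic: a Sanger file containing nothing but high-quality bases could be misclassified.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest character seen.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in the
    +64 encodings, so seeing one implies a +33 (Sanger) file.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


def to_phred(quality, offset):
    """Decode a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def phred64_to_sanger(quality):
    """Re-encode a Phred64 quality string as Phred33 (shift by 64 - 33 = 31)."""
    return "".join(chr(ord(ch) - 31) for ch in quality)


if __name__ == "__main__":
    quals = ["ddadcdeeeedddd^bBa"]      # prefix of a quality line from the post
    offset = guess_offset(quals)
    print(offset)                        # 64
    print(to_phred("dB", offset))        # [36, 2]
    print(phred64_to_sanger("dB"))       # E#
```

Here 'd' decodes to Q36 and 'B' to Q2 under the +64 offset, which is consistent with FastQC reporting Illumina 1.3+ encoding.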


    • #3
      Thanks fkrueger,

      FastQC recognizes and works with the data as Illumina 1.3+ format. When I run bowtie I turn the --phred64-quals option on.

      The Galaxy complaint is specifically that there are 76 quality scores and only 75 nucleotides ... I guess this is where I will direct my attention and try to sort it all out. If anyone has any suggestions or has run into a similar problem please let me know!

      Scott


      • #4
        Originally posted by fkrueger
        I guess most programs will complain about the space in the sequence line, ...
        There is no space in the sequence - it is just the forum rendering it badly, trying to help it line wrap. If the OP had used the [ code ] tag I think it would have looked like this:
        Code:
        @HWUSI-EAS-100R_0003:8:1:1034:19859#NCCCCC/1
        GGTATGTTAACATTCANTGAGCTATACACTTAAGATTTGTGCACTTTATCATAGTAAGTTATTTGTCAGTTTGAA
        +
        ddadcdeeeedddd^bBa]\ZX[^Xdddaddca^a\dddab^J]`[Y^`^^cLa^c`a`YJWOSSRb\_BBBBBBB
        @HWUSI-EAS-100R_0003:8:1:1034:9367#NCCCTC/1
        CACTGAATGACATGGGACTGTTTGGACAAAACGTGCTATACCTCTACCTCGGGAGGGCCGGTTACACCATACATG
        +
        eceeeeedefcdfcfdcc^a`c^_`_Y^^\M^][^baYa\a^K^^T[]Y^BBBBBBBBBBBBBBBBBBBBBBBBBB
        @HWUSI-EAS-100R_0003:8:1:1035:9515#NCTATG/1
        CGCCGTTTCCCAGTAGGTCTCCTAGAACACTATTTCAATGATGAACAAGGCGAGATCGAGCTCATTGAGACCTCG
        +
        feefffffdfffedcdcbe\bdaa^T^^YaT^_\Sa^\^^YY^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
        This is not valid FASTQ since the quality strings are one character longer than the sequences. Tools which ignore the quality strings may accept this as input anyway if they don't check the length.
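
The length check that stricter tools perform can be sketched in a few lines of Python. This is a minimal sketch that assumes plain four-line records with no line wrapping; the function name `find_length_mismatches` is hypothetical and not taken from any of the tools mentioned in this thread.

```python
import io


def find_length_mismatches(handle):
    """Yield (read_name, seq_len, qual_len) for records whose lengths differ.

    Assumes simple four-line FASTQ records (header, sequence, '+', quality).
    """
    while True:
        header = handle.readline()
        if not header:
            break                      # end of file
        if not header.strip():
            continue                   # skip stray blank lines
        seq = handle.readline().rstrip("\r\n")
        handle.readline()              # the '+' separator line
        qual = handle.readline().rstrip("\r\n")
        if len(seq) != len(qual):
            yield header.rstrip("\r\n")[1:], len(seq), len(qual)


if __name__ == "__main__":
    # Toy data: the first record has one quality character too many.
    toy = io.StringIO("@r1\nACGT\n+\nIIIII\n@r2\nACGT\n+\nIIII\n")
    for name, n_seq, n_qual in find_length_mismatches(toy):
        print(name, n_seq, n_qual)     # r1 4 5
```

Passing an open file handle instead of the `io.StringIO` object would report every affected read in a real file.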


        • #5
          Thanks maubp,

          The data was provided by a collaborating lab. I have written to them to ask if they have any insight as to why there is an extra character in the quality string. I am tempted to just eliminate the last value. The quality is low here anyway. But this does assume that the extra character was added at the end of the string rather than at the beginning or in the middle.
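
The workaround described here could look something like the following Python sketch. It carries the same assumption stated above, namely that the surplus character sits at the end of the quality string, and it refuses to touch records with any other kind of mismatch. The function name is hypothetical.

```python
def drop_extra_trailing_quality(seq, qual):
    """Return a quality string matching the sequence length.

    ASSUMPTION: when the quality string is exactly one character longer
    than the sequence, the extra character is the last one.
    """
    if len(qual) == len(seq):
        return qual                    # already consistent, leave untouched
    if len(qual) == len(seq) + 1:
        return qual[:-1]               # drop the final quality character
    raise ValueError(
        f"unexpected lengths: {len(seq)} bases, {len(qual)} qualities"
    )


if __name__ == "__main__":
    print(drop_extra_trailing_quality("ACGT", "ffeBB"))   # ffeB
```

If the extra character was actually inserted earlier in the string, every quality value after it would be shifted by one base, which this approach cannot detect.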

          Thanks again


          • #6
            Originally posted by shandley
            The data was provided by a collaborating lab. I have written them to ask if they have any insight as to why there is an extra character in the quality string.
            Probably a bug in their filtering script or something
            Originally posted by shandley
            I am tempted to just eliminate the last value. The quality is low here anyways. But this does assume that the extra character was entered at the end of the string instead of the beginning or middle.
            Yeah - hard to say where the error is.

            P.S. In case you haven't seen this before, I think the trailing BBBB... qualities are Illumina's Read Segment Quality Control Indicator, and can be stripped off, see:
            Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
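
Stripping the trailing run of 'B' qualities can be sketched in Python as below. This assumes the sequence and quality strings are already the same length and that the data uses the Phred64 encoding, where 'B' corresponds to Q2; the function name is hypothetical.

```python
def trim_trailing_b(seq, qual):
    """Remove the trailing run of 'B' qualities and the matching bases.

    Only the run at the 3' end is removed; a 'B' inside the read is kept.
    Assumes len(seq) == len(qual).
    """
    keep = len(qual.rstrip("B"))       # number of positions left after trimming
    return seq[:keep], qual[:keep]


if __name__ == "__main__":
    print(trim_trailing_b("ACGTACGT", "ffeeBBBB"))   # ('ACGT', 'ffee')
    print(trim_trailing_b("ACGTAC", "fBfeBB"))       # ('ACGT', 'fBfe')
```

A read consisting entirely of 'B' qualities comes back empty, so downstream steps may need a minimum-length filter.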


            • #7
              Ahhh ... I thought those trailing BBBBBs looked suspicious. Based on glancing at the overall run data in FastQC, the quality drops off dramatically in the last 10 nt or so. I think I will just eliminate the trailing quality score and run with it.

              Thanks for all of the rapid and insightful responses.


              • #8
                There is always the trade-off between trimming and losing information, or keeping everything and adding noise. Is it really an accepted convention now to trim all trailing B's from Illumina data? Or do aligners and variant callers that use quality values take care of the odd low-quality bases?
                --
                bioinfosm


                • #9
                  I am only going to trim the final B to resolve the conflict I am having (at least until the sequencing center gets back to me and explains the extra quality value). I would actually prefer to include as much data as possible and tune out the low-quality stuff using assemblers. My thought on this is that 'low-quality' data actually implies there is some accurate data there. I hate to lose any of the precious information, particularly with Illumina short-read data.


                  • #10
                    I think people running out of memory with Illumina assemblies tend to be much more ruthless with their quality trimming. Read errors push up the unique k-mer count, and therefore the memory requirements for something like Velvet. As a result, I think the optimal trimming strategy will vary according to the amount and nature of your data, your assembler, and your computer resources.

                    Or in other words, YMMV (your mileage may vary)
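
The effect of read errors on the unique k-mer count can be seen in a toy Python example. The reads below are made up for illustration and the function name is hypothetical; real assemblers use far more compact data structures than a plain set.

```python
def unique_kmers(reads, k):
    """Return the set of distinct k-mers found across all reads."""
    return {
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    }


if __name__ == "__main__":
    k = 4
    clean = ["ACGTACGTAC"] * 3
    # Same three reads, but the last one carries a single C->T error.
    noisy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTATGTAC"]
    print(len(unique_kmers(clean, k)))   # 4
    print(len(unique_kmers(noisy, k)))   # 8
```

A single erroneous base can create up to k new k-mers, which is why error-rich read ends inflate the k-mer set that a de Bruijn graph assembler has to hold in memory.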


                    • #11
                      Certainly. Depending on the application, one can get deep coverage and then trim low-Q bases to end up with really good quality data. Though I have not seen anything showing how helpful that is in the end.
                      --
                      bioinfosm
