
  • QC Filter Flag:

    Hi,

    Could somebody help me understand this?

    In my paired-end data (raw FASTQ), I found some sequences with a bad QC filter flag (N = good, Y = bad):

    cat R1.fastq | grep :Y | wc -l
    701834
    cat R2.fastq | grep :Y | wc -l
    701834

    @HWI-1KL114:350CGTACXX:4:1101:2523:1980 1:Y:0:CGATGT
    @HWI-1KL114:350CGTACXX:4:1101:6456:1995 1:Y:0:CGATGT


    How do I remove these?

    I tried using fastx_toolkit

    fastq_quality_filter -i R1.fastq -o Test_R1.fastq

    It gives me this error:
    fastq_quality_filter: Invalid quality score value (char '#' ord 35 quality value -29) on line 80


    @HWI-1KL114:350CGTACXX:5:1101:1983:1985 2:N:0:TGACCA
    CTGGCTTCTTACTCCGTTCAGTCTGAGCTTGGAGATTATAACCCGGGAAC
    +
    =B@DDEFDFH>DFEIJEHFGHCHHG@H@FH9ECGCFFEAFFD?DHG#### [Line 80]


    Could somebody please help me with this.
    Thank you for your help in advance !

    regards
    CN
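The count in the question above greps every line of the file for `:Y`. A minimal sketch of a stricter tally follows, assuming standard 4-line FASTQ records (no wrapped sequences) and CASAVA 1.8 style headers in which the filter flag is the second colon-separated item after the space. The function name `count_filter_flags` is hypothetical, not part of any toolkit.

```shell
# Tally the filter flag (N = passed, Y = failed) from header lines only.
# Assumes each record is exactly 4 lines and the header looks like
#   @instrument:...:x:y read:FLAG:control:index
count_filter_flags() {
    awk 'NR % 4 == 1 {
             split($2, f, ":")   # $2 is e.g. "1:Y:0:CGATGT"
             n[f[2]]++           # f[2] is the filter flag
         }
         END { printf "N=%d Y=%d\n", n["N"], n["Y"] }' "$1"
}
```

Calling `count_filter_flags R1.fastq` prints both counts on one line, so the R1 and R2 totals can be compared directly.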

  • #2
    This was a problem when Illumina first released CASAVA 1.8 (or was it 1.7, I can't remember). The default behavior, with no way to bypass it, was to mix passed (:N:) and failed (:Y:) reads in the output file. Here is the recommendation from Illumina on how to filter failed reads from the file:

    Code:
    grep -A3 '^@.* [^:]*:N:[^:]*:' your_input_file | grep -ve '^--$' > your_output_file
    The first grep statement searches for headers with :N: in the appropriate place and prints that line plus the 3 following lines (sequence, quality header and quality string). The second grep statement removes the '--' lines which the first grep inserts between blocks of matches.
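The same record-by-record selection can be sketched with awk, which avoids the '--' separator clean-up. This is an alternative illustration and not Illumina's recommendation; it assumes 4-line FASTQ records and the header layout shown in the question, and `filter_passed` is a hypothetical name.

```shell
# Print only records whose header carries the N (passed) filter flag.
# On each header line (every 4th line starting at line 1) decide whether
# to keep the record; the bare pattern `keep` then prints all 4 lines.
filter_passed() {
    awk 'NR % 4 == 1 { keep = ($2 ~ /^[^:]*:N:/) } keep' "$1"
}
```

For example, `filter_passed R1.fastq > R1.passed.fastq`. Because the decision is made on line position rather than on a leading '@', a quality string that happens to begin with '@' is never mistaken for a header.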



    • #3
      Thank you Kmcarr !!

      It removed all the sequences with the bad QC filter flag.

      #First, i filtered:
      grep -A3 '^@.* [^:]*:N:[^:]*:' Embryo_R1.fastq | grep -ve '^--$' > Emb_R1.fastq


      #Check if BAD flag
      cat Emb_R1.fastq | grep :Y | wc -l
      0 [None]

      Then I tried to filter these raw reads using fastq_quality_filter:

      fastq_quality_filter -i Emb_R1.fastq -o Test.fastq
      fastq_quality_filter: Invalid quality score value (char '#' ord 35 quality value -29) on line 12

      How can I solve this error?

      Thank you !



      • #4


        Check this thread:
        Quality filtering and PCR duplicate removal
        Krishna



        • #5
          Originally posted by Chirag

          Then I tried to filter these raw reads using fastq_quality_filter:

          fastq_quality_filter -i Emb_R1.fastq -o Test.fastq
          fastq_quality_filter: Invalid quality score value (char '#' ord 35 quality value -29) on line 12

          How can I solve this error?

          Thank you !
          This has to do with the way the quality scores are encoded in your FASTQ file, that is, whether the character offset is phred+33 or phred+64. (Check out the Wikipedia article for a detailed explanation.) The Fastx toolkit programs default to the assumption that the encoding is phred+64, but Illumina now uses phred+33. That is why '#' (ASCII 35) is reported as quality value -29, i.e. 35 - 64. You need to tell fastq_quality_filter to use 33.

          Code:
          fastq_quality_filter -Q33 -i Emb_R1.fastq -o Test.fastq
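The offset arithmetic behind the error message can be checked by hand. The sketch below is illustrative only; `phred_score` is a hypothetical helper, not part of the Fastx toolkit. It converts one quality character to its score under a given offset.

```shell
# Convert a single quality character to a Phred score for a given offset.
#   $1: one quality character, e.g. '#'
#   $2: encoding offset, 33 or 64
phred_score() {
    code=$(printf '%d' "'$1")   # a leading quote makes printf emit the character code
    echo $(( $code - $2 ))
}

phred_score '#' 64   # -29, the impossible value reported in the error
phred_score '#' 33   # 2, a valid low-quality score under phred+33
```

Under phred+64 no character below '@' (ASCII 64) can occur, so seeing '#' in the quality lines is itself a sign that the file uses phred+33.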



          • #6
            Thank you very much !!!
