Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Software to filter errors in fastq files?

    Hello community,
    In the most recent data I have from an Illumina GAII the individual flowcell lane seq.txt files have a few random errors in the fastq format as determined using the FastX toolkit. Can someone recommend software or a script which will run thru a fastQ format file and discard or parse out the reads that have inappropriate formats or quality values.

  • #2
    What specific binary are you running within FastX and what's the exact command? What kind of "random errors" the tool is reporting?
    -drd

    Comment


    • #3
      Drio,
      As background I am running multiplexed tagged SE 76nt sequencing runs. For this run there are about 50 million reads being generated per lane. As a first step before parsing the the tagged reads, I run the individual lane data thru the FastX tool kit to get an overview of the run. I am getting the following error messages when running fastx_quality_stats:

      fastx_quality_stats: Error: invalid quality score data on line 112913096 (quality_tok = "\KJY\a\\Y_a\_aaa^cca_a_cRK_^^J\\\^\_cc^aa_aYccc084#0/1"
      sbberes@computer1:~/Desktop/M59_GAS/ln2$

      fastx_quality_stats: Error: invalid quality score data on line 74052832 (quality_tok = ""
      sbberes@computer1:~/Desktop/M59_GAS/ln3$

      fastx_quality_stats: Error: invalid quality score data on line 48710608 (quality_tok = "TTTCTNATTGTCGTGAAGGCATGACAACAGTATCAAGTCAAAAAGTCGGAGGCGGAGTTAATGTTTTTAGAGTTCC"
      sbberes@computer1:~/Desktop/M59_GAS/ln4$

      fastx_quality_stats: Error: invalid quality score data on line 69545528 (quality_tok = "__faad^dcXOcZfdaYa_adccSY`__YadaaYdd^aWSaaSWWM^ggdgB"
      sbberes@computer1:~/Desktop/M59_GAS/ln5$

      fastx_quality_stats: Error: invalid quality score data on line 99033924 (quality_tok = "gYgagf_cfffdffc`W]`BB"
      sbberes@computer1:~/Desktop/M59_GAS/ln6$

      fastx_quality_stats: Error: invalid quality score data on line 22453904 (quality_tok = "``````U[[`BBBBBBBBBBBBBBBBBBBBBBBBd^dcB"
      sbberes@computer1:~/Desktop/M59_GAS/ln7$

      I figure the easiest thing to do is just throw out the individual reads that have either inappropriate format or quality value designations.

      Comment


      • #4
        You left out the command line you used - in particular what FASTQ variant is the data, and what did you tell fastx to treat it as (with the -Q option).

        There are two interesting errors there, an empty quality line, and what could be a sequence line being treated as a quality line. Have you looked at these regions of the file to see if there is anything obviously wrong?

        Comment


        • #5
          All,
          First, thanks for the feedback. The command line is just: fastx_quality_stats -i input-file -o output-file. The data is in standard fastq with illumina v 1.3 quality scores. I have the latest version of the standalone, not running under galaxy, version of the fastx tool kit. I cannot find in the documentation or using -h (help) any -Q parameter for this command. No, I have not gone to these regions of the file to look at the data. This is becoming an occurrence that comes up often enough in the data that I am getting that I was looking for an automated/scripted solution to run just as a matter of course. I am not a programmer by training, so before going to the trouble of writing a script myself I wanted to see if it had already been done for me by someone else.
          SBB

          Here are the first few line of the file:
          @HWUSI-EAS528_16:1:1:1002:940#0/1
          NNNGCTTGAAAAAATTGTTGGTTTAGCACTTGAAGATNNNNNTGNGNTANAGGGAGTACNNNGTNNNTNNNNNGNN
          +HWUSI-EAS528_16:1:1:1002:940#0/1
          BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
          @HWUSI-EAS528_16:1:1:1040:934#0/1
          NGACTTCGAAAATCTCATCTGCATCTGATTATCAGGTTCTTTTTTTGAAAAAAACACCCAAAGTTATCTCTTATGT
          +HWUSI-EAS528_16:1:1:1040:934#0/1
          BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
          @HWUSI-EAS528_16:1:1:1094:930#0/1
          NGTACTTCTATGGAATTTTTTTGTTTAATTATGGTCACAACGGCTCTATTTTAATGCCCAAGGATGTCGTTGAACA
          +HWUSI-EAS528_16:1:1:1094:930#0/1
          BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

          Comment


          • #6
            Originally posted by sbberes View Post
            All,
            First, thanks for the feedback. The command line is just: fastx_quality_stats -i input-file -o output-file. The data is in standard fastq with illumina v 1.3 quality scores. I have the latest version of the standalone, not running under galaxy, version of the fastx tool kit. I cannot find in the documentation or using -h (help) any -Q parameter for this command.
            It isn't documented, yet
            See http://seqanswers.com/forums/showthread.php?t=7596
            Originally posted by sbberes View Post
            No, I have not gone to these regions of the file to look at the data.
            I think you will have to in order to solve this - or switch to a different FASTQ tool: If you try using EMBOSS seqret, BioPerl, Biopython etc to parse the FASTQ do you get a more helpful error message?
            Originally posted by sbberes View Post
            This is becoming an occurrence that comes up often enough in the data that I am getting that I was looking for an automated/scripted solution to run just as a matter of course. I am not a programmer by training, so before going to the trouble of writing a script myself I wanted to see if it had already been done for me by someone else.
            SBB

            Here are the first few line of the file:
            Code:
            @HWUSI-EAS528_16:1:1:1002:940#0/1
            NNNGCTTGAAAAAATTGTTGGTTTAGCACTTGAAGATNNNNNTGNGNTANAGGGAGTACNNNGTNNNTNNNNNGNN
            +HWUSI-EAS528_16:1:1:1002:940#0/1
            BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
            @HWUSI-EAS528_16:1:1:1040:934#0/1
            NGACTTCGAAAATCTCATCTGCATCTGATTATCAGGTTCTTTTTTTGAAAAAAACACCCAAAGTTATCTCTTATGT
            +HWUSI-EAS528_16:1:1:1040:934#0/1
            BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
            @HWUSI-EAS528_16:1:1:1094:930#0/1
            NGTACTTCTATGGAATTTTTTTGTTTAATTATGGTCACAACGGCTCTATTTTAATGCCCAAGGATGTCGTTGAACA
            +HWUSI-EAS528_16:1:1:1094:930#0/1
            BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
            I added [ code ] and [ /code ] tags round the example to stop the forum displaying it wrong. This can be done with the # icon on the advanced reply box.

            Comment


            • #7
              fastq_cleaner.pl

              All,
              A colleague at the Broad, Brian Haas (a good guy), provided me with a perl script which he generated for cleaning up fastq files. It steps thru the input file parsing the reads into those which conform to fastq format and those which do not, provides a count of each, and outputs the reads in a good and a bad file. it has solved my problem.

              Comment


              • #8
                If it's useful to anyone this is a small script I knocked up when we had to process some fastq files which were corrupted during an FTP transfer. You can pipe data through it and it does some basic sanity checks to ensure that the file looks like valid fastq data. It will remove any entries which look broken and leave you just the good stuff.

                Code:
                #!/usr/bin/perl
                use warnings;
                use strict;
                
                while (<>) {
                
                  unless (/^\@/) {
                    warn "$_ should have had an \@ at the start and it didn't\n";
                    next;
                  }
                  my $id1 = $_;
                  my $seq = <>;
                  my $id2 = <>;
                  my $qual = <>;
                
                  if ($seq =~/^[@+]/) {
                    warn "Sequence '$seq' looked like an id";
                    next;
                  }
                  if ($qual =~/^[@+]/) {
                    warn "Quality '$qual' looked like an id";
                    next;
                  }
                  if ($id2 !~ /^\+/) {
                    warn "Midline '$id2' didn't start with a +";
                    next;
                  }
                
                  if ($qual =~ /[GATCN]{20,}/) {
                    warn "Quality '$qual' looked like sequence";
                    next;
                  }
                
                  if (length($seq) != length($qual)) {
                    warn "Seq $seq and Qual $qual weren't the same length";
                    next;
                  }
                
                  print $id1,$seq,$id2,$qual;
                
                
                }

                Comment


                • #9
                  Simon,
                  Thanks for the script, corruption of the files during transfer is exactly what I think probably occurred.
                  SBB

                  Comment


                  • #10
                    Im trying to use this perl script but I don't know who to pipe my input file through this and get an output file with the good fastq file.
                    Thanks,
                    Tatiana


                    Originally posted by simonandrews View Post
                    If it's useful to anyone this is a small script I knocked up when we had to process some fastq files which were corrupted during an FTP transfer. You can pipe data through it and it does some basic sanity checks to ensure that the file looks like valid fastq data. It will remove any entries which look broken and leave you just the good stuff.

                    Code:
                    #!/usr/bin/perl
                    use warnings;
                    use strict;
                    
                    while (<>) {
                    
                      unless (/^\@/) {
                        warn "$_ should have had an \@ at the start and it didn't\n";
                        next;
                      }
                      my $id1 = $_;
                      my $seq = <>;
                      my $id2 = <>;
                      my $qual = <>;
                    
                      if ($seq =~/^[@+]/) {
                        warn "Sequence '$seq' looked like an id";
                        next;
                      }
                      if ($qual =~/^[@+]/) {
                        warn "Quality '$qual' looked like an id";
                        next;
                      }
                      if ($id2 !~ /^\+/) {
                        warn "Midline '$id2' didn't start with a +";
                        next;
                      }
                    
                      if ($qual =~ /[GATCN]{20,}/) {
                        warn "Quality '$qual' looked like sequence";
                        next;
                      }
                    
                      if (length($seq) != length($qual)) {
                        warn "Seq $seq and Qual $qual weren't the same length";
                        next;
                      }
                    
                      print $id1,$seq,$id2,$qual;
                    
                    
                    }

                    Comment


                    • #11
                      Originally posted by tarias View Post
                      Im trying to use this perl script but I don't know who to pipe my input file through this and get an output file with the good fastq file.
                      Thanks,
                      Tatiana
                      If you save that script to a file called filter.pl and give it execute permissions (chmod 755 filter.pl) then you can either pass your data in as a filename, or via a pipe (which you'll need to do if you have a compressed file for example).

                      To get the output simply redirect the output of the program to a file.

                      So to process a normal fastq file you'd do

                      ./filter.pl starting_file.fq > filtered_file.fq

                      For a compressed file you could do:

                      zcat starting_file.fq.gz | ./filter.pl > filtered_file.fq

                      Hope this helps

                      Comment


                      • #12
                        I did both and getting erros dump in the screen.

                        mu-166002esktop pireslab$ ./filter.pl TA16.txt > filtered.txt
                        Seq AAGGGGATACATTAGTTACATTTATATATGAAAAATCGAGATCCGGTGATATAACCCAAGGTCTTCCAAAAGTAGA
                        and Qual hhhhhghhhhechhhhhhgghhhghhhhhhhchgeehh_hfghhhhhhhhhedhfhachdcfKdacfccceheWhc_fcf
                        weren't the same length at ./filter.pl line 36, <> line 4.
                        Seq TTCTATCTCAAACAATAGCCTATCAACATCAAGTTCATCTCCTAAATCAGCTAGTTCAAGAAGAGGTCTAGAATTC
                        and Qual hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhhhhhhhhfhhhhhghachehfcdhaaf_fa_bdeb\Z\[^
                        weren't the same length at ./filter.pl line 36, <> line 8.
                        Seq GGATANAGACATGGCAAAAACTCCATCTGAGGAAGAACTGTATCACCATTTGTTGAATGGGGTAACTTGGACTCCG
                        and Qual hhhhhffffB`b``bhhhhhhhhhhhhhhhhhhghhfghhhhfhfgggfgfehhghghgadghhdfffhahdfdddcffb
                        weren't the same length at ./filter.pl line 36, <> line 12.
                        Seq ATTTANTCAACTTCCTGGTGATTCTCCACCACTTTATGTATTTAAATCAAGCTTCTTACAAAGTGATTCATCCTGG
                        and Qual hhhhhffffBdeeeehhhhhhhhhhhhhhhhhhhhghhhhhhgghhhhhhhhhhhhhhhhhgfgghcbhgfgghchhchh
                        weren't the same length at ./filter.pl line 36, <> line 16.
                        Seq GGATCNTCACAATCAAGTATAAGGCTGATGGTCAGGTTGAGAGAAAGAAGTCAAGACTGGTTGCAAGAGGTTTTAC
                        and Qual hhhhhffffBeeeeehhhhhhghhhhhhhghhcghhhchhhhhcfhhgghhhghgff_eefWfdagfhhghghfchffd_
                        weren't the same length at ./filter.pl line 36, <> line 20.
                        Seq GAATCATCATTGAGTTTTGGGTTACTTAATGTATTTTAAAGTTTGGACAGAATATGTGTGATCGAGTGTTGAAAAT
                        and Qual ggggggdgggggcggggggggggaggggggcffgggdeggggcgfgggggfadf_faccfegegfafdafcfcfffffff
                        weren't the same length at ./filter.pl line 36, <> line 24.
                        Seq AATACAGGATTTCCAATCCTAGCAGGAAAAGGAGGGAAACGGATACTCAATTTAAAAGTGAGTAAACAGAATTCCA
                        and Qual hhhhhhhhhhhhhhhhhhghhhhhhghhhhfhhhhhhhhhhhghhhfhghhhhfhhhegeghdhhhhfhhceb`da_cac
                        weren't the same length at ./filter.pl line 36, <> line 28.
                        Seq AGAGANACCAGCATCCTCCTCCCCATCGTCCTCCTCCCCATCGTCCTCCTCCCCATCATGATTTTCATCTTCAACC
                        and Qual hhhhgdeeeBbbba`hhhhhhhhhhhhhhhhhfhfghhghhhhefgdegegffbghbe`cc_caRaaaaaddccdfcddd
                        weren't the same length at ./filter.pl line 36, <> line 32.
                        Seq GTAAGNTCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA
                        and Qual dc_f^]UZZBc^cacffffffffffffffffffffffffffffffffffffffffffcfffffaffffeccacBBBBBBB
                        weren't the same length at ./filter.pl line 36, <> line 36.
                        Seq TGAAACTGGATCGCTTGGCTCGGTGGAAGCGAGGCTCAGTTGTTGATCGGTGGAAGCGAGGCTCGGTTGGTGATCG
                        and Qual hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhhhhghhhhghhghhhhfhhfcggadhgdfbeeaeee`dcdbd
                        weren't the same length at ./filter.pl line 36, <> line 40.
                        Seq CTTCCTTGTTTGCCTTACAGAAGAACAAGCTATCATCTGCAAATAGAAAATGAGAGATTCCAGGGCAAGCTCTTGC
                        and Qual hhhhhhhhhhhhhhhhhhhhhhhhhhhhghhhhhhghhhhhhhdhdefcfbbb``fchahgghehfdaahggddefdccf






                        Originally posted by simonandrews View Post
                        If you save that script to a file called filter.pl and give it execute permissions (chmod 755 filter.pl) then you can either pass your data in as a filename, or via a pipe (which you'll need to do if you have a compressed file for example).

                        To get the output simply redirect the output of the program to a file.

                        So to process a normal fastq file you'd do

                        ./filter.pl starting_file.fq > filtered_file.fq

                        For a compressed file you could do:

                        zcat starting_file.fq.gz | ./filter.pl > filtered_file.fq

                        Hope this helps

                        Comment


                        • #13
                          You should post the first few lines of TA16.txt ... maybe we can see some problem ..

                          Comment


                          • #14
                            Hi all i had this error message while using tophat. please how can i solve this error problem, because look very stock at the this point. and i am new to tophat, linux etc.
                            i am using ubuntu on a vitural machine ubuntu, the fastq file is of phred scale with Q30 as score.
                            kindly help
                            thanks
                            [2013-07-20 04:11:39] Checking for Bowtie
                            Bowtie version: 2.1.0.0
                            [2013-07-20 04:11:39] Checking for Samtools
                            Samtools version: 0.1.19.0
                            [2013-07-20 04:11:39] Checking for Bowtie index files (genome)..
                            [2013-07-20 04:11:39] Checking for reference FASTA file
                            Warning: Could not find FASTA file seq.fa
                            [2013-07-20 04:11:39] Reconstituting reference FASTA file from Bowtie index
                            Executing: /usr/bin/bowtie2-inspect seq > ./tophat_out/tmp/seq.fa
                            [2013-07-20 04:11:45] Generating SAM header for seq
                            format: fastq
                            quality scale: phred33 (default)
                            [2013-07-20 04:11:51] Preparing reads
                            left reads: min. length=100, max. length=100, 63588062 kept reads (424 discarded)
                            right reads: min. length=100, max. length=100, 63000645 kept reads (587841 discarded)
                            [2013-07-20 05:17:15] Mapping left_kept_reads to genome seq with Bowtie2
                            [2013-07-20 13:05:46] Mapping left_kept_reads_seg1 to genome seq with Bowtie2 (1/4)
                            [2013-07-20 13:59:24] Mapping left_kept_reads_seg2 to genome seq with Bowtie2 (2/4)
                            [2013-07-20 15:06:18] Mapping left_kept_reads_seg3 to genome seq with Bowtie2 (3/4)
                            [2013-07-20 15:56:59] Mapping left_kept_reads_seg4 to genome seq with Bowtie2 (4/4)
                            [2013-07-20 16:49:52] Mapping right_kept_reads to genome seq with Bowtie2
                            [2013-07-20 23:01:03] Mapping right_kept_reads_seg1 to genome seq with Bowtie2 (1/4)
                            [FAILED]
                            Error running bowtie:
                            Saw ASCII character -93 but expected 33-based Phred qual.
                            terminate called after throwing an instance of 'int'

                            Comment


                            • #15
                              Software to filter errors in fastq files?

                              Q30 isn't a quality scale.

                              You need to know what version of the Illumina software/quality encoding was used. The current version is Illumina 1.9. Illumina has been using the Phred+33 scale since version 1.8, but earlier versions used different quality encoding scales.

                              See

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              29 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X