Seqanswers Leaderboard Ad

**drio** · 11-10-2010, 07:00 AM

What specific binary are you running within FastX and what's the exact command? What kind of "random errors" the tool is reporting?

**sbberes** · 11-10-2010, 07:18 AM

Drio,
As background I am running multiplexed tagged SE 76nt sequencing runs. For this run there are about 50 million reads being generated per lane. As a first step before parsing the the tagged reads, I run the individual lane data thru the FastX tool kit to get an overview of the run. I am getting the following error messages when running fastx_quality_stats:

fastx_quality_stats: Error: invalid quality score data on line 112913096 (quality_tok = "\KJY\a\\Y_a\_aaa^cca_a_cRK_^^J\\\^\_cc^aa_aYccc084#0/1"
sbberes@computer1:~/Desktop/M59_GAS/ln2$

fastx_quality_stats: Error: invalid quality score data on line 74052832 (quality_tok = ""
sbberes@computer1:~/Desktop/M59_GAS/ln3$

fastx_quality_stats: Error: invalid quality score data on line 48710608 (quality_tok = "TTTCTNATTGTCGTGAAGGCATGACAACAGTATCAAGTCAAAAAGTCGGAGGCGGAGTTAATGTTTTTAGAGTTCC"
sbberes@computer1:~/Desktop/M59_GAS/ln4$

fastx_quality_stats: Error: invalid quality score data on line 69545528 (quality_tok = "__faad^dcXOcZfdaYa_adccSY`__YadaaYdd^aWSaaSWWM^ggdgB"
sbberes@computer1:~/Desktop/M59_GAS/ln5$

fastx_quality_stats: Error: invalid quality score data on line 99033924 (quality_tok = "gYgagf_cfffdffc`W]`BB"
sbberes@computer1:~/Desktop/M59_GAS/ln6$

fastx_quality_stats: Error: invalid quality score data on line 22453904 (quality_tok = "``````U[[`BBBBBBBBBBBBBBBBBBBBBBBBd^dcB"
sbberes@computer1:~/Desktop/M59_GAS/ln7$

I figure the easiest thing to do is just throw out the individual reads that have either inappropriate format or quality value designations.

**maubp** · 11-10-2010, 07:42 AM

You left out the command line you used - in particular what FASTQ variant is the data, and what did you tell fastx to treat it as (with the -Q option).

There are two interesting errors there, an empty quality line, and what could be a sequence line being treated as a quality line. Have you looked at these regions of the file to see if there is anything obviously wrong?

**sbberes** · 11-10-2010, 09:53 AM

All,
First, thanks for the feedback. The command line is just: fastx_quality_stats -i input-file -o output-file. The data is in standard fastq with illumina v 1.3 quality scores. I have the latest version of the standalone, not running under galaxy, version of the fastx tool kit. I cannot find in the documentation or using -h (help) any -Q parameter for this command. No, I have not gone to these regions of the file to look at the data. This is becoming an occurrence that comes up often enough in the data that I am getting that I was looking for an automated/scripted solution to run just as a matter of course. I am not a programmer by training, so before going to the trouble of writing a script myself I wanted to see if it had already been done for me by someone else.
SBB

Here are the first few line of the file:
@HWUSI-EAS528_16:1:1:1002:940#0/1
NNNGCTTGAAAAAATTGTTGGTTTAGCACTTGAAGATNNNNNTGNGNTANAGGGAGTACNNNGTNNNTNNNNNGNN
+HWUSI-EAS528_16:1:1:1002:940#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS528_16:1:1:1040:934#0/1
NGACTTCGAAAATCTCATCTGCATCTGATTATCAGGTTCTTTTTTTGAAAAAAACACCCAAAGTTATCTCTTATGT
+HWUSI-EAS528_16:1:1:1040:934#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS528_16:1:1:1094:930#0/1
NGTACTTCTATGGAATTTTTTTGTTTAATTATGGTCACAACGGCTCTATTTTAATGCCCAAGGATGTCGTTGAACA
+HWUSI-EAS528_16:1:1:1094:930#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

**maubp** · 11-10-2010, 02:56 PM

Originally posted by sbberes View Post

All,
First, thanks for the feedback. The command line is just: fastx_quality_stats -i input-file -o output-file. The data is in standard fastq with illumina v 1.3 quality scores. I have the latest version of the standalone, not running under galaxy, version of the fastx tool kit. I cannot find in the documentation or using -h (help) any -Q parameter for this command.

It isn't documented, yet

See http://seqanswers.com/forums/showthread.php?t=7596

Originally posted by sbberes View Post

No, I have not gone to these regions of the file to look at the data.

I think you will have to in order to solve this - or switch to a different FASTQ tool: If you try using EMBOSS seqret, BioPerl, Biopython etc to parse the FASTQ do you get a more helpful error message?

Originally posted by sbberes View Post

This is becoming an occurrence that comes up often enough in the data that I am getting that I was looking for an automated/scripted solution to run just as a matter of course. I am not a programmer by training, so before going to the trouble of writing a script myself I wanted to see if it had already been done for me by someone else.
SBB

Here are the first few line of the file:

Code:

@HWUSI-EAS528_16:1:1:1002:940#0/1
NNNGCTTGAAAAAATTGTTGGTTTAGCACTTGAAGATNNNNNTGNGNTANAGGGAGTACNNNGTNNNTNNNNNGNN
+HWUSI-EAS528_16:1:1:1002:940#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS528_16:1:1:1040:934#0/1
NGACTTCGAAAATCTCATCTGCATCTGATTATCAGGTTCTTTTTTTGAAAAAAACACCCAAAGTTATCTCTTATGT
+HWUSI-EAS528_16:1:1:1040:934#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS528_16:1:1:1094:930#0/1
NGTACTTCTATGGAATTTTTTTGTTTAATTATGGTCACAACGGCTCTATTTTAATGCCCAAGGATGTCGTTGAACA
+HWUSI-EAS528_16:1:1:1094:930#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

I added [ code ] and [ /code ] tags round the example to stop the forum displaying it wrong. This can be done with the # icon on the advanced reply box.

**sbberes** · 11-11-2010, 05:02 AM

fastq_cleaner.pl

All,
A colleague at the Broad, Brian Haas (a good guy), provided me with a perl script which he generated for cleaning up fastq files. It steps thru the input file parsing the reads into those which conform to fastq format and those which do not, provides a count of each, and outputs the reads in a good and a bad file. it has solved my problem.

**simonandrews** · 11-12-2010, 12:26 AM

If it's useful to anyone this is a small script I knocked up when we had to process some fastq files which were corrupted during an FTP transfer. You can pipe data through it and it does some basic sanity checks to ensure that the file looks like valid fastq data. It will remove any entries which look broken and leave you just the good stuff.

Code:

#!/usr/bin/perl
use warnings;
use strict;

while (<>) {

  unless (/^\@/) {
    warn "$_ should have had an \@ at the start and it didn't\n";
    next;
  }
  my $id1 = $_;
  my $seq = <>;
  my $id2 = <>;
  my $qual = <>;

  if ($seq =~/^[@+]/) {
    warn "Sequence '$seq' looked like an id";
    next;
  }
  if ($qual =~/^[@+]/) {
    warn "Quality '$qual' looked like an id";
    next;
  }
  if ($id2 !~ /^\+/) {
    warn "Midline '$id2' didn't start with a +";
    next;
  }

  if ($qual =~ /[GATCN]{20,}/) {
    warn "Quality '$qual' looked like sequence";
    next;
  }

  if (length($seq) != length($qual)) {
    warn "Seq $seq and Qual $qual weren't the same length";
    next;
  }

  print $id1,$seq,$id2,$qual;


}

**sbberes** · 11-12-2010, 04:18 AM

Simon,
Thanks for the script, corruption of the files during transfer is exactly what I think probably occurred.
SBB

**tarias** · 09-07-2012, 09:09 AM

Im trying to use this perl script but I don't know who to pipe my input file through this and get an output file with the good fastq file.
Thanks,
Tatiana

Originally posted by simonandrews View Post

If it's useful to anyone this is a small script I knocked up when we had to process some fastq files which were corrupted during an FTP transfer. You can pipe data through it and it does some basic sanity checks to ensure that the file looks like valid fastq data. It will remove any entries which look broken and leave you just the good stuff.

Code:

#!/usr/bin/perl
use warnings;
use strict;

while (<>) {

  unless (/^\@/) {
    warn "$_ should have had an \@ at the start and it didn't\n";
    next;
  }
  my $id1 = $_;
  my $seq = <>;
  my $id2 = <>;
  my $qual = <>;

  if ($seq =~/^[@+]/) {
    warn "Sequence '$seq' looked like an id";
    next;
  }
  if ($qual =~/^[@+]/) {
    warn "Quality '$qual' looked like an id";
    next;
  }
  if ($id2 !~ /^\+/) {
    warn "Midline '$id2' didn't start with a +";
    next;
  }

  if ($qual =~ /[GATCN]{20,}/) {
    warn "Quality '$qual' looked like sequence";
    next;
  }

  if (length($seq) != length($qual)) {
    warn "Seq $seq and Qual $qual weren't the same length";
    next;
  }

  print $id1,$seq,$id2,$qual;


}

**simonandrews** · 09-07-2012, 09:40 AM

Originally posted by tarias View Post

Im trying to use this perl script but I don't know who to pipe my input file through this and get an output file with the good fastq file.
Thanks,
Tatiana

If you save that script to a file called filter.pl and give it execute permissions (chmod 755 filter.pl) then you can either pass your data in as a filename, or via a pipe (which you'll need to do if you have a compressed file for example).

To get the output simply redirect the output of the program to a file.

So to process a normal fastq file you'd do

./filter.pl starting_file.fq > filtered_file.fq

For a compressed file you could do:

zcat starting_file.fq.gz | ./filter.pl > filtered_file.fq

Hope this helps

**tarias** · 09-07-2012, 10:09 AM

I did both and getting erros dump in the screen.

mu-166002

esktop pireslab$ ./filter.pl TA16.txt > filtered.txt
Seq AAGGGGATACATTAGTTACATTTATATATGAAAAATCGAGATCCGGTGATATAACCCAAGGTCTTCCAAAAGTAGA
and Qual hhhhhghhhhechhhhhhgghhhghhhhhhhchgeehh_hfghhhhhhhhhedhfhachdcfKdacfccceheWhc_fcf
weren't the same length at ./filter.pl line 36, <> line 4.
Seq TTCTATCTCAAACAATAGCCTATCAACATCAAGTTCATCTCCTAAATCAGCTAGTTCAAGAAGAGGTCTAGAATTC
and Qual hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhhhhhhhhfhhhhhghachehfcdhaaf_fa_bdeb\Z\[^
weren't the same length at ./filter.pl line 36, <> line 8.
Seq GGATANAGACATGGCAAAAACTCCATCTGAGGAAGAACTGTATCACCATTTGTTGAATGGGGTAACTTGGACTCCG
and Qual hhhhhffffB`b``bhhhhhhhhhhhhhhhhhhghhfghhhhfhfgggfgfehhghghgadghhdfffhahdfdddcffb
weren't the same length at ./filter.pl line 36, <> line 12.
Seq ATTTANTCAACTTCCTGGTGATTCTCCACCACTTTATGTATTTAAATCAAGCTTCTTACAAAGTGATTCATCCTGG
and Qual hhhhhffffBdeeeehhhhhhhhhhhhhhhhhhhhghhhhhhgghhhhhhhhhhhhhhhhhgfgghcbhgfgghchhchh
weren't the same length at ./filter.pl line 36, <> line 16.
Seq GGATCNTCACAATCAAGTATAAGGCTGATGGTCAGGTTGAGAGAAAGAAGTCAAGACTGGTTGCAAGAGGTTTTAC
and Qual hhhhhffffBeeeeehhhhhhghhhhhhhghhcghhhchhhhhcfhhgghhhghgff_eefWfdagfhhghghfchffd_
weren't the same length at ./filter.pl line 36, <> line 20.
Seq GAATCATCATTGAGTTTTGGGTTACTTAATGTATTTTAAAGTTTGGACAGAATATGTGTGATCGAGTGTTGAAAAT
and Qual ggggggdgggggcggggggggggaggggggcffgggdeggggcgfgggggfadf_faccfegegfafdafcfcfffffff
weren't the same length at ./filter.pl line 36, <> line 24.
Seq AATACAGGATTTCCAATCCTAGCAGGAAAAGGAGGGAAACGGATACTCAATTTAAAAGTGAGTAAACAGAATTCCA
and Qual hhhhhhhhhhhhhhhhhhghhhhhhghhhhfhhhhhhhhhhhghhhfhghhhhfhhhegeghdhhhhfhhceb`da_cac
weren't the same length at ./filter.pl line 36, <> line 28.
Seq AGAGANACCAGCATCCTCCTCCCCATCGTCCTCCTCCCCATCGTCCTCCTCCCCATCATGATTTTCATCTTCAACC
and Qual hhhhgdeeeBbbba`hhhhhhhhhhhhhhhhhfhfghhghhhhefgdegegffbghbe`cc_caRaaaaaddccdfcddd
weren't the same length at ./filter.pl line 36, <> line 32.
Seq GTAAGNTCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAA
and Qual dc_f^]UZZBc^cacffffffffffffffffffffffffffffffffffffffffffcfffffaffffeccacBBBBBBB
weren't the same length at ./filter.pl line 36, <> line 36.
Seq TGAAACTGGATCGCTTGGCTCGGTGGAAGCGAGGCTCAGTTGTTGATCGGTGGAAGCGAGGCTCGGTTGGTGATCG
and Qual hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhhhhghhhhghhghhhhfhhfcggadhgdfbeeaeee`dcdbd
weren't the same length at ./filter.pl line 36, <> line 40.
Seq CTTCCTTGTTTGCCTTACAGAAGAACAAGCTATCATCTGCAAATAGAAAATGAGAGATTCCAGGGCAAGCTCTTGC
and Qual hhhhhhhhhhhhhhhhhhhhhhhhhhhhghhhhhhghhhhhhhdhdefcfbbb``fchahgghehfdaahggddefdccf

Originally posted by simonandrews View Post

If you save that script to a file called filter.pl and give it execute permissions (chmod 755 filter.pl) then you can either pass your data in as a filename, or via a pipe (which you'll need to do if you have a compressed file for example).

To get the output simply redirect the output of the program to a file.

So to process a normal fastq file you'd do

./filter.pl starting_file.fq > filtered_file.fq

For a compressed file you could do:

zcat starting_file.fq.gz | ./filter.pl > filtered_file.fq

Hope this helps

**sklages** · 09-10-2012, 10:30 AM

You should post the first few lines of TA16.txt ... maybe we can see some problem ..

**aforntacc** · 07-25-2013, 12:57 AM

Hi all i had this error message while using tophat. please how can i solve this error problem, because look very stock at the this point. and i am new to tophat, linux etc.
i am using ubuntu on a vitural machine ubuntu, the fastq file is of phred scale with Q30 as score.
kindly help
thanks
[2013-07-20 04:11:39] Checking for Bowtie
Bowtie version: 2.1.0.0
[2013-07-20 04:11:39] Checking for Samtools
Samtools version: 0.1.19.0
[2013-07-20 04:11:39] Checking for Bowtie index files (genome)..
[2013-07-20 04:11:39] Checking for reference FASTA file
Warning: Could not find FASTA file seq.fa
[2013-07-20 04:11:39] Reconstituting reference FASTA file from Bowtie index
Executing: /usr/bin/bowtie2-inspect seq > ./tophat_out/tmp/seq.fa
[2013-07-20 04:11:45] Generating SAM header for seq
format: fastq
quality scale: phred33 (default)
[2013-07-20 04:11:51] Preparing reads
left reads: min. length=100, max. length=100, 63588062 kept reads (424 discarded)
right reads: min. length=100, max. length=100, 63000645 kept reads (587841 discarded)
[2013-07-20 05:17:15] Mapping left_kept_reads to genome seq with Bowtie2
[2013-07-20 13:05:46] Mapping left_kept_reads_seg1 to genome seq with Bowtie2 (1/4)
[2013-07-20 13:59:24] Mapping left_kept_reads_seg2 to genome seq with Bowtie2 (2/4)
[2013-07-20 15:06:18] Mapping left_kept_reads_seg3 to genome seq with Bowtie2 (3/4)
[2013-07-20 15:56:59] Mapping left_kept_reads_seg4 to genome seq with Bowtie2 (4/4)
[2013-07-20 16:49:52] Mapping right_kept_reads to genome seq with Bowtie2
[2013-07-20 23:01:03] Mapping right_kept_reads_seg1 to genome seq with Bowtie2 (1/4)
[FAILED]
Error running bowtie:
Saw ASCII character -93 but expected 33-based Phred qual.
terminate called after throwing an instance of 'int'

**mastal** · 07-25-2013, 01:12 AM

Software to filter errors in fastq files?

Q30 isn't a quality scale.

You need to know what version of the Illumina software/quality encoding was used. The current version is Illumina 1.9. Illumina has been using the Phred+33 scale since version 1.8, but earlier versions used different quality encoding scales.

See

FASTQ format - Wikipedia

http://en.wikipedia.org/wiki/FASTQ_format

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 29 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Software to filter errors in fastq files?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News