
**seq_GA** · 06-21-2010, 10:05 PM

Hi,
I am trying to use FASTX for Illumina reads. Can you please tell me which version to download from http://codex.cshl.org/labmembers/gor.../download.html and the steps involved? I assume it's better to download fastx_toolkit-0.0.11.tar.bz2 and libgtextutils-0.5.tar.bz2, as they are the latest versions.
I am interested in plotting the A/T/G/C distribution. Thanks.

**seq_GA** · 06-21-2010, 11:42 PM

Hi,
I am still trying to use FASTX for Illumina reads. My objective is to draw quality box plots and the nucleotide distribution.
1. I downloaded the precompiled Linux (64-bit) binaries from http://hannonlab.cshl.edu/fastx_toolkit/download.html

2. I tried the commands below, as shown on the website, and I am getting the following error:

Code:

a) $ ./fastx_quality_stats -i /solexa/data/s_5_1_sequence.txt -o /solexa/data/s_5_1_stats.txt

b) $ sh fastx_nucleotide_distribution_graph.sh -i /solexa/data/s_5_1_stats.txt -o /solexa/data/s_5_1_distri.png -t "My Library"


 line 0: undefined variable: vertical

         line 0: undefined variable: invert


gnuplot> set style histogram rowstacked 
                   ^
         line 0: expecting 'data', 'function', 'line', 'fill' or 'arrow'


gnuplot> set style data histograms 
                                   ^
         line 0: expecting 'lines', 'points', 'linespoints', 'dots', 'impulses',
        'yerrorbars', 'xerrorbars', 'xyerrorbars', 'steps', 'fsteps',
        'histeps', 'filledcurves', 'boxes', 'boxerrorbars', 'boxxyerrorbars',
        'vectors', 'financebars', 'candlesticks', 'errorlines', 'xerrorlines',
        'yerrorlines', 'xyerrorlines', 'pm3d'

         line 0: undefined function: xtic

When I open the stats file produced in the first step, I do not see any error. Here is the head of the stats file:

Code:

 head s_5_1_stats.txt 
column  count   min     max     sum     mean    Q1      med     Q3      IQR     lW      rW      A_Count C_Count G_Count T_Count N_Count  Max_count
1       34831800        2       35      1167973413      33.53   34      34      34      0       34      34      9717028 8225908 8411883  8370604 106377  34831800
2       34831800        2       35      1166982783      33.50   34      34      34      0       34      34      9722780 7405910 8637580  9065309 221     34831800
3       34831800        2       35      1166744486      33.50   34      34      34      0       34      34      9327988 7829625 8363455  9310732 0       34831800
4       34831800        2       35      1166560776      33.49   34      34      34      0       34      34      9337611 7742389 8855551  8896249 0       34831800
5       34831800        2       35      1165922020      33.47   34      34      34      0       34      34      9507407 7760322 8371000  9193071 0       34831800
6       34831800        2       35      1166684479      33.49   34      34      34      0       34      34      9311222 8311632 8348617  8860329 0       34831800
7       34831800        2       35      1166443105      33.49   34      34      34      0       34      34      9280435 8093156 8125314  9332895 0       34831800
8       34831800        2       35      1166644279      33.49   34      34      34      0       34      34      9166913 8244889 7982211  9437787 0       34831800
9       34831800        2       35      1166241221      33.48   34      34      34      0       34      34      9114496 8321642 7959958  9435588 116     34831800

Can I start using solexa fastq data directly? Thx.
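
As a rough cross-check that does not depend on gnuplot, the per-column base counts in a stats file like the one above can be turned into percentages with a short script. This is only a sketch: it assumes whitespace-separated columns with the header names shown in the listing, and the function name `base_percentages` is made up for illustration, not part of FASTX.

```python
# Sketch: per-position base percentages from fastx_quality_stats output.
# Assumes whitespace-separated columns and the header names shown above.

def base_percentages(stats_text):
    """Return a list of (column, {base: percent}) tuples."""
    lines = [ln for ln in stats_text.splitlines() if ln.strip()]
    header = lines[0].split()
    results = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # A_Count, C_Count, G_Count, T_Count and N_Count hold the base counts.
        counts = {b: int(row[b + "_Count"]) for b in "ACGTN"}
        total = sum(counts.values())
        percents = {b: 100.0 * n / total for b, n in counts.items()}
        results.append((int(row["column"]), percents))
    return results


# First data row of the stats file posted above.
example = """column count min max sum mean Q1 med Q3 IQR lW rW A_Count C_Count G_Count T_Count N_Count Max_count
1 34831800 2 35 1167973413 33.53 34 34 34 0 34 34 9717028 8225908 8411883 8370604 106377 34831800
"""

for column, percents in base_percentages(example):
    print(column, {b: round(p, 2) for b, p in percents.items()})
```

Feeding the whole stats file through the same function gives one row of percentages per cycle, which can then be plotted with whatever tool is at hand.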

**kmcarr** · 06-22-2010, 05:26 AM

Originally posted by Lspoor:

I'm pretty new to NGS. I have Illumina GAII 36bp paired reads for several bacterial genomes. The sequencing was carried out in one run, using 2 lanes.
I have been using FASTX toolkit to produce quality statistics of both sets of reads of each isolate from the Solexa fastq files. From that output, a boxplot and nucleotide distribution graph for each set of reads of each isolate has been produced which has prompted 2 main questions:

1. In the boxplot, it plots the median quality scores against nucleotide position. For both sets of reads for all isolates for all 36bp, the median score is 34. Is it normal to get this much consistency? I had been told that the median score tends to tail off a bit lower towards the 3' end of the read.

2. For each 36bp position of the reads the graph shows the ACGT nucleotide distribution. According to the nucleotide distribution graph, for each set of reads for every isolate, the first 3 nucleotide positions are skewed in comparison to the rest of the read. I believe it should be fairly constant, reflecting all the reads covering the whole genome.
Is this due to adaptor contamination? Again, is it normal to get this sort of consistency?

The bacteria are closely related and the sequencing was carried out in one run. Consequently I have trimmed the first 3 nucleotides from each read prior to assembly against a reference genome.

I'd be grateful if anyone can explain what I'm seeing here?

I don't know that I can explain it but I can reassure you that what you are seeing is not out of the ordinary.

1. In a decent run you will not see significant decrease in the median Q score over only 36 cycles. In recent versions of their base caller Illumina caps the Q score at 34 and it seems the majority of its bases meet this level. (I think they're being a little generous with themselves but that's just one man's opinion.)

2. I very often see slight deviations in the base call composition over the first 2-3 cycles, so you are not alone. There are two basic possibilities: either the abnormal distribution is a true representation of the DNA sequence, meaning that the fragmentation or selection of fragments for library preparation is not strictly random, or the observation is due to some artifact of the sequencing or data analysis. I think the first possibility can be ruled out because I observed the bias in libraries fragmented both by nebulizer and by Covaris. It seems highly unlikely that two completely different fragmentation methods would produce the same non-random result. Also, I found that the degree of bias in the first couple of cycles dropped dramatically in the shift from pipeline v1.3 to v1.4 (or maybe it was 1.4 to 1.5).

I used to remove bases from the first 2 cycles as you did but I stopped doing this after the change in the pipeline mitigated the problem. Do you know which pipeline (or RTA) version was used to call the bases?

**Mark Ott** · 08-30-2010, 04:10 AM

Hello, as a new user of FASTX I ran into the same error message mentioned above by seq_GA.
Does anyone know how to resolve this issue?
Many thanks!

a) $ ./fastx_quality_stats -i /solexa/data/s_5_1_sequence.txt -o /solexa/data/s_5_1_stats.txt

b) $ sh fastx_nucleotide_distribution_graph.sh -i /solexa/data/s_5_1_stats.txt -o /solexa/data/s_5_1_distri.png -t "My Library"

line 0: undefined variable: vertical

line 0: undefined variable: invert

gnuplot> set style histogram rowstacked
^
line 0: expecting 'data', 'function', 'line', 'fill' or 'arrow'

gnuplot> set style data histograms
^
line 0: expecting 'lines', 'points', 'linespoints', 'dots', 'impulses',
'yerrorbars', 'xerrorbars', 'xyerrorbars', 'steps', 'fsteps',
'histeps', 'filledcurves', 'boxes', 'boxerrorbars', 'boxxyerrorbars',
'vectors', 'financebars', 'candlesticks', 'errorlines', 'xerrorlines',
'yerrorlines', 'xyerrorlines', 'pm3d'

line 0: undefined function: xtic
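
One thing worth ruling out, since nobody in the thread confirms the cause: messages like these are what gnuplot prints when a script uses syntax its parser does not know, which would fit an installed gnuplot that predates the histogram plot style. The sketch below assumes that style first appeared in gnuplot 4.2 and that `gnuplot --version` prints text such as `gnuplot 4.2 patchlevel 6`; both should be checked against the gnuplot documentation. The function names are hypothetical helpers.

```python
import re

# Assumption: "set style histogram" is only understood by gnuplot >= 4.2.
MIN_VERSION = (4, 2)


def parse_gnuplot_version(version_output):
    """Extract (major, minor) from text such as 'gnuplot 4.2 patchlevel 6'."""
    match = re.search(r"gnuplot\s+(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version_output)
    return int(match.group(1)), int(match.group(2))


def has_histogram_style(version_output):
    """True if the reported version is at least the assumed minimum."""
    return parse_gnuplot_version(version_output) >= MIN_VERSION


# Strings of the kind printed by `gnuplot --version` (illustrative values).
for text in ("gnuplot 4.0 patchlevel 0", "gnuplot 4.2 patchlevel 6"):
    print(text, "->", has_histogram_style(text))
```

If the installed version turns out to be older than the assumed minimum, upgrading gnuplot would be the first thing to try before suspecting the stats file.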

**upendra_35** · 02-24-2011, 07:08 PM

I don't know much about the errors above, but what I can suggest is to run fastx_quality_stats on the command line in Unix and then upload the resulting txt file to Galaxy (http://main.g2.bx.psu.edu/) via Get Data ---> Upload File. You can then generate the quality score boxplot (Graph/Display Data ---> Boxplot) and visualize the nucleotide distribution graph in the same manner. I used the default parameters for the boxplot.

Good luck

**ulz_peter** · 02-24-2011, 11:58 PM

Hi all,

Why not try the great FastQC program? It comes with a nice GUI and great visualization.
Worth a try...

**ulz_peter** · 02-25-2011, 12:31 AM

Forgot to post the link:

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

**upendra_35** · 02-25-2011, 05:50 PM

I have tried FastQC and it is quite good, as it does most of the quality scoring. The only quality check that is missing is the nucleotide distribution across each read, which you can only do in either FASTX or Galaxy.

**flobpf** · 03-04-2011, 01:22 PM

Maybe not...

Originally posted by upendra_35:

I have tried FastQC and it is quite good, as it does most of the quality scoring. The only quality check that is missing is the nucleotide distribution across each read, which you can only do in either FASTX or Galaxy.

Maybe this feature was not present in the old version, but at least the one I am using gives a plot showing %A, %T, %C and %G at each position across all the reads. The image is named "per_base_sequence_content.png".

**son_nexg** · 09-12-2011, 08:35 PM

Hi guys,

I want to use 'fastx_quality_stats' for my Illumina data, but it gives the following error:

fastx_quality_stats -i /PATH/Sample_XX.fastq -o /PATH/Sample_xx.fastq.stats
fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4

This is what my FASTQ file looks like:

head -4 /PATH/Sample_xx.fastq
@HWI-ST212:160:AB01NFABXX:4:1101:1355:2086 1:N:0:ATCACG
CGGGAAGTATGTACACGGGGTACGTGCCAAGCATCCTCGCGCGACCCCGAGAGCCTGGGGAGCGGGGGCTTGCCGGCCGTCGCACTCATTTACCCGGAGAC
+
HHHHHHHHHHHHHHHHHHHHHHFHHHHHHBHFHHHGHFHHHHHHHHHHHHHHHHHHHHHHEHHHHHHFEH?EEBFEDEE<>>5:ADC>CFE2CDEAD####

I'd appreciate any help I can get with this.

Thanks a lot!!

**sklages** · 09-13-2011, 03:18 AM

This is the new CASAVA 1.8 FASTQ output; you should use a newer version of FastQC to read these files. (There is another thread [1] with a link to a development snapshot of FastQC that reads v1.8 FASTQ files correctly.)

hth, Sven

[1]= http://seqanswers.com/forums/showthr...ghlight=fastqc
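
For anyone unsure which kind of file they have, the read header is the quickest clue. The sketch below checks whether a header line follows the CASAVA 1.8 layout (instrument, run, flowcell, lane, tile, x, y, then read number, filter flag, control bits and index after a space). The regular expression and the name `parse_casava18_header` are illustrative assumptions rather than anything shipped with FASTX or FastQC.

```python
import re

# Assumed CASAVA 1.8 layout:
# @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
CASAVA_18 = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+) "
    r"(?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$"
)


def parse_casava18_header(line):
    """Return the header fields as a dict, or None if the layout differs."""
    match = CASAVA_18.match(line.strip())
    return match.groupdict() if match else None


# Header taken from the post above.
fields = parse_casava18_header(
    "@HWI-ST212:160:AB01NFABXX:4:1101:1355:2086 1:N:0:ATCACG"
)
print(fields)
```

A header that does not match (the function returns `None`) most likely comes from an older pipeline version, where the quality encoding may differ as well.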

**flobpf** · 09-13-2011, 04:41 AM

Originally posted by son_nexg:

Hi guys,

I want to use 'fastx_quality_stats' for my Illumina data, but it gives the following error:

fastx_quality_stats -i /PATH/Sample_XX.fastq -o /PATH/Sample_xx.fastq.stats
fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4

I'd appreciate any help I can get with this.

Thanks a lot!!

Seems like you may have Sanger quality encoding (### at the end of sequence in FASTQ). Try this:

fastx_quality_stats -i /PATH/Sample_XX.fastq -Q 33 -o /PATH/Sample_xx.fastq.stats
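
The numbers in the error message are consistent with an offset mismatch: '#' is ASCII 35, and 35 - 64 = -29, which suggests the tool was subtracting 64 (the older Illumina offset), while 35 - 33 = 2 is a valid score under the Sanger offset. A minimal sketch of that arithmetic follows; `decode_quality` and `decode_string` are hypothetical helpers, not FASTX functions.

```python
def decode_quality(char, offset):
    """Convert one FASTQ quality character to a numeric score."""
    return ord(char) - offset


def decode_string(quality, offset):
    """Decode a whole quality string with the given ASCII offset."""
    return [decode_quality(c, offset) for c in quality]


print(decode_quality("#", 64))  # -29, the value reported in the error
print(decode_quality("#", 33))  # 2
print(decode_quality("H", 33))  # 39

# The tail of the quality line from the post, decoded with offset 33.
print(decode_string("CDEAD####", 33))
```

A negative score under one offset and a sensible range under the other is usually enough to tell which encoding a file uses.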

**son_nexg** · 09-13-2011, 03:23 PM

Originally posted by flobpf:

Seems like you may have Sanger quality encoding (### at the end of sequence in FASTQ). Try this:

fastx_quality_stats -i /PATH/Sample_XX.fastq -Q 33 -o /PATH/Sample_xx.fastq.stats

Thanks a lot, the -Q option worked for me!!

I am now on to the next step, generating the nucleotide distribution graph with:

fastx_nucleotide_distribution_graph.sh \
-i /PATH/Sample_xx.fastq.stats \
-o /PATH/Sample_xx.fastq.stats_nuc.png \
-t Sample_xx

But I am getting the following error:

line 0: undefined variable: vertical

line 0: undefined variable: invert

gnuplot> set style histogram rowstacked
^
line 0: expecting 'data', 'function', 'line', 'fill' or 'arrow'

gnuplot> set style data histograms
^
line 0: expecting 'lines', 'points', 'linespoints', 'dots', 'impulses',
'yerrorbars', 'xerrorbars', 'xyerrorbars', 'steps', 'fsteps',
'histeps', 'filledcurves', 'boxes', 'boxerrorbars', 'boxxyerrorbars',
'vectors', 'financebars', 'candlesticks', 'errorlines', 'xerrorlines',
'yerrorlines', 'xyerrorlines', 'pm3d'

line 0: undefined function: xtic

These are the first 4 lines of my stats file:

column count min max sum mean Q1 med Q3 IQR lW rW A_Count C_Count G_Count T_Count N_Count Max_count
1 12192845 2 40 446955197 36.66 38 39 39 1 37 40 1290565 4112034 5828477 943901 17868 12192845
2 12192845 2 40 441775514 36.23 37 39 39 2 34 40 5245648 1771654 1927735 3247808 0 12192845
3 12192845 2 40 441662699 36.22 37 39 39 2 34 40 2061208 2422450 2202622 5506565 0 12192845

Any guesses what might be happening with the input format?

Cheers!!

**flobpf** · 09-13-2011, 03:27 PM

Sorry, I don't use FASTX for this step. I found that FastQC gives better pictures.

Newbie questions regarding Illumina read quality statistics using FASTX toolkit
