
  • Newbie questions regarding Illumina read quality statistics using FASTX toolkit

    I'm pretty new to NGS. I have Illumina GAII 36bp paired reads for several bacterial genomes. The sequencing was carried out in one run, using 2 lanes.
    I have been using FASTX toolkit to produce quality statistics of both sets of reads of each isolate from the Solexa fastq files. From that output, a boxplot and nucleotide distribution graph for each set of reads of each isolate has been produced which has prompted 2 main questions:

    1. In the boxplot, it plots the median quality scores against nucleotide position. For both sets of reads for all isolates for all 36bp, the median score is 34. Is it normal to get this much consistency? I had been told that the median score tends to tail off a bit lower towards the 3' end of the read.

    2. For each of the 36 positions in the reads, the graph shows the ACGT nucleotide distribution. According to the nucleotide distribution graph, for each set of reads for every isolate, the first 3 nucleotide positions are skewed in comparison to the rest of the read. I believe it should be fairly constant, reflecting all the reads covering the whole genome.
    Is this due to adaptor contamination? Again, is it normal to get this sort of consistency?

    The bacteria are closely related and the sequencing was carried out in one run. Consequently I have trimmed the first 3 nucleotides from each read prior to assembly against a reference genome.

    I'd be grateful if anyone can explain what I'm seeing here?

  • #2
    Hi,
    I am trying to use FASTX for Illumina reads. Can you please tell me which version to download from http://codex.cshl.org/labmembers/gor.../download.html and the steps involved? I assume that it's better to download fastx_toolkit-0.0.11.tar.bz2 and libgtextutils-0.5.tar.bz2, as they are the latest versions.
    I am interested in plotting the ACGT distribution. Thanks.



    • #3
      Hi,
      I am still trying to use FASTX for Illumina reads; my objective is to draw boxplots and nucleotide distribution graphs.
      1. I downloaded the precompiled binaries for Linux (64-bit) from http://hannonlab.cshl.edu/fastx_toolkit/download.html

      2. I am trying the commands from the website, as below, and I get the following error:

      Code:
      a)   $ ./fastx_quality_stats -i /solexa/data/s_5_1_sequence.txt  -o /solexa/data/s_5_1_stats.txt 
      
      b)  $ sh fastx_nucleotide_distribution_graph.sh   -i /solexa/data/s_5_1_stats.txt -o /solexa/data/s_5_1_distri.png -t "My Library"
      
      
       line 0: undefined variable: vertical
      
               line 0: undefined variable: invert
      
      
      gnuplot> set style histogram rowstacked 
                         ^
               line 0: expecting 'data', 'function', 'line', 'fill' or 'arrow'
      
      
      gnuplot> set style data histograms 
                                         ^
               line 0: expecting 'lines', 'points', 'linespoints', 'dots', 'impulses',
              'yerrorbars', 'xerrorbars', 'xyerrorbars', 'steps', 'fsteps',
              'histeps', 'filledcurves', 'boxes', 'boxerrorbars', 'boxxyerrorbars',
              'vectors', 'financebars', 'candlesticks', 'errorlines', 'xerrorlines',
              'yerrorlines', 'xyerrorlines', 'pm3d'
      
               line 0: undefined function: xtic

      When I open the stats file produced by the first step, I do not see any errors. Here is the head of the stats file:

      Code:
       head s_5_1_stats.txt 
      column  count   min     max     sum     mean    Q1      med     Q3      IQR     lW      rW      A_Count C_Count G_Count T_Count N_Count  Max_count
      1       34831800        2       35      1167973413      33.53   34      34      34      0       34      34      9717028 8225908 8411883  8370604 106377  34831800
      2       34831800        2       35      1166982783      33.50   34      34      34      0       34      34      9722780 7405910 8637580  9065309 221     34831800
      3       34831800        2       35      1166744486      33.50   34      34      34      0       34      34      9327988 7829625 8363455  9310732 0       34831800
      4       34831800        2       35      1166560776      33.49   34      34      34      0       34      34      9337611 7742389 8855551  8896249 0       34831800
      5       34831800        2       35      1165922020      33.47   34      34      34      0       34      34      9507407 7760322 8371000  9193071 0       34831800
      6       34831800        2       35      1166684479      33.49   34      34      34      0       34      34      9311222 8311632 8348617  8860329 0       34831800
      7       34831800        2       35      1166443105      33.49   34      34      34      0       34      34      9280435 8093156 8125314  9332895 0       34831800
      8       34831800        2       35      1166644279      33.49   34      34      34      0       34      34      9166913 8244889 7982211  9437787 0       34831800
      9       34831800        2       35      1166241221      33.48   34      34      34      0       34      34      9114496 8321642 7959958  9435588 116     34831800
      Can I start using solexa fastq data directly? Thx.
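
For reference, the per-column counts in a stats file like the one above can be turned into percentages without any plotting tool. The following is a minimal Python sketch, not part of FASTX: the function name `base_percentages` and the `SAMPLE` text are made up for illustration, and the column names are taken from the header shown in the post.

```python
def base_percentages(stats_text):
    """Return {position: {base: percent}} from fastx_quality_stats-style text.

    Assumes the whitespace-separated layout shown in the post, with a header
    row naming the columns (count, A_Count, C_Count, G_Count, T_Count, N_Count).
    """
    lines = [ln for ln in stats_text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        total = int(row["count"])
        result[int(row["column"])] = {
            base: 100.0 * int(row[base + "_Count"]) / total
            for base in "ACGTN"
        }
    return result


# First two data rows copied from the stats file quoted in the post.
SAMPLE = """\
column count min max sum mean Q1 med Q3 IQR lW rW A_Count C_Count G_Count T_Count N_Count Max_count
1 34831800 2 35 1167973413 33.53 34 34 34 0 34 34 9717028 8225908 8411883 8370604 106377 34831800
2 34831800 2 35 1166982783 33.50 34 34 34 0 34 34 9722780 7405910 8637580 9065309 221 34831800
"""

if __name__ == "__main__":
    for pos, pct in base_percentages(SAMPLE).items():
        print(pos, " ".join(f"{b}={pct[b]:.1f}%" for b in "ACGTN"))
```

Comparing the first few positions against the later ones this way gives a quick numeric view of the skew discussed in this thread, independent of whether the graph script runs.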



      • #4
        In reply to Lspoor:
        I don't know that I can explain it but I can reassure you that what you are seeing is not out of the ordinary.

        1. In a decent run you will not see significant decrease in the median Q score over only 36 cycles. In recent versions of their base caller Illumina caps the Q score at 34 and it seems the majority of its bases meet this level. (I think they're being a little generous with themselves but that's just one man's opinion.)

        2. I very often see slight deviations in the base call composition over the first 2-3 cycles, so you are not alone. There are two basic possibilities: either the abnormal distribution is a true representation of the DNA sequence, meaning that the fragmentation or selection of fragments for library preparation is not strictly random; or the observation is due to some artifact of the sequencing or data analysis. I think the first possibility can be ruled out because I observed the bias in libraries fragmented both by nebulizer and by Covaris. It seems highly unlikely that two completely different fragmentation methods would produce the same non-random result. Also, I found that the degree of bias in the first couple of cycles reduced dramatically in the shift from pipeline v1.3 to pipeline 1.4 (or maybe it was 1.4 to 1.5).

        I used to remove bases from the first 2 cycles as you did but I stopped doing this after the change in the pipeline mitigated the problem. Do you know which pipeline (or RTA) version was used to call the bases?
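
For anyone who still wants to drop the first cycles, FASTX's own fastx_trimmer does this job (its -f option sets the first base to keep). Purely as an illustration of what the trimming amounts to, here is a minimal Python sketch; `read_fastq` and `trim_leading_bases` are hypothetical helper names, and the sketch assumes plain 4-line FASTQ records.

```python
def read_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples from FASTQ lines.

    Assumes each record occupies exactly four lines (no wrapped sequences).
    """
    it = (ln.rstrip("\n") for ln in lines)
    # zip over the same iterator four times groups the lines in fours.
    yield from zip(it, it, it, it)


def trim_leading_bases(record, n):
    """Drop the first n bases and their quality characters from one record."""
    header, seq, sep, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return header, seq[n:], sep, qual[n:]


if __name__ == "__main__":
    # Made-up record for illustration only.
    example = ["@read1\n", "ACGTACGT\n", "+\n", "hhhhhhhh\n"]
    for rec in read_fastq(example):
        print("\n".join(trim_leading_bases(rec, 3)))
```

The important detail is that the sequence and quality strings are cut at the same position, so the record stays valid.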



        • #5
          Hello, as a new user of FASTX I ran into the same error message that seq_GA reported above.
          Does anyone know how to resolve this issue?
          Many thanks!


          a) $ ./fastx_quality_stats -i /solexa/data/s_5_1_sequence.txt -o /solexa/data/s_5_1_stats.txt

          b) $ sh fastx_nucleotide_distribution_graph.sh -i /solexa/data/s_5_1_stats.txt -o /solexa/data/s_5_1_distri.png -t "My Library"


          line 0: undefined variable: vertical

          line 0: undefined variable: invert


          gnuplot> set style histogram rowstacked
          ^
          line 0: expecting 'data', 'function', 'line', 'fill' or 'arrow'


          gnuplot> set style data histograms
          ^
          line 0: expecting 'lines', 'points', 'linespoints', 'dots', 'impulses',
          'yerrorbars', 'xerrorbars', 'xyerrorbars', 'steps', 'fsteps',
          'histeps', 'filledcurves', 'boxes', 'boxerrorbars', 'boxxyerrorbars',
          'vectors', 'financebars', 'candlesticks', 'errorlines', 'xerrorlines',
          'yerrorlines', 'xyerrorlines', 'pm3d'

          line 0: undefined function: xtic
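
One possible explanation, which nobody in this thread confirms: messages such as "expecting 'data', 'function', 'line', 'fill' or 'arrow'" after `set style histogram` look like what an older gnuplot prints when it does not know the histogram plot style. As far as I know that style arrived in gnuplot 4.2, so checking the installed version is a cheap first step. Below is a minimal Python sketch under that assumption; the function names and the (4, 2) threshold are illustrative, not taken from the FASTX documentation.

```python
import re
import subprocess


def parse_gnuplot_version(version_text):
    """Extract (major, minor) from text like 'gnuplot 4.0 patchlevel 0'."""
    match = re.search(r"gnuplot\s+(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"cannot find a version in {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def supports_histograms(version_text, minimum=(4, 2)):
    """True if the reported version is at least `minimum`.

    The default of (4, 2) is an assumption about when the histogram
    style was introduced; adjust it if that turns out to be wrong.
    """
    return parse_gnuplot_version(version_text) >= minimum


def installed_gnuplot_version_text():
    """Return the output of `gnuplot --version` (requires gnuplot on PATH)."""
    done = subprocess.run(["gnuplot", "--version"],
                          capture_output=True, text=True, check=True)
    return done.stdout.strip()
```

Calling `supports_histograms(installed_gnuplot_version_text())` on the affected machine would show whether an upgrade of gnuplot is worth trying before looking for problems in the stats file.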



          • #6
            In reply to Mark Ott:
            I don't know much about the errors above, but what I can suggest is to run fastx_quality_stats on the command line in Unix and then upload the resulting txt file to Galaxy (http://main.g2.bx.psu.edu/) via Get Data ---> Upload File. You can then generate the quality score boxplot (Graph/Display Data ---> Boxplot) and visualize the nucleotide distribution graph in the same manner. I used the default parameters for the boxplot.

            Good luck



            • #7
              Hi all,

              Why not try the great FastQC program? It comes with a nice GUI and great visualization.
              Worth a try...



              • #8
                Forgot to post the link:



                • #9
                  I have tried FastQC and it is quite good, as it does most of the quality scoring. The only quality check that is missing is the nucleotide distribution across each read, which you can only do in either FASTX or Galaxy.



                  • #10
                    Maybe not...

                    In reply to upendra_35:
                    Maybe this feature was not present in the old version, but at least the one I am using gives a plot showing %A, %T, %C, %G at each position across all the reads. The image is named "per_base_sequence_content.png".



                    • #11
                      Hi guys,

                      I want to use 'fastx_quality_stats' for my illumina data but it gives the following error:

                      fastx_quality_stats -i /PATH/Sample_XX.fastq -o /PATH/Sample_xx.fastq.stats
                      fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4



                      This is what my FASTQ file looks like:

                      head -4 /PATH/Sample_xx.fastq
                      @HWI-ST212:160:AB01NFABXX:4:1101:1355:2086 1:N:0:ATCACG
                      CGGGAAGTATGTACACGGGGTACGTGCCAAGCATCCTCGCGCGACCCCGAGAGCCTGGGGAGCGGGGGCTTGCCGGCCGTCGCACTCATTTACCCGGAGAC
                      +
                      HHHHHHHHHHHHHHHHHHHHHHFHHHHHHBHFHHHGHFHHHHHHHHHHHHHHHHHHHHHHEHHHHHHFEH?EEBFEDEE<>>5:ADC>CFE2CDEAD####


                      I'd appreciate any help I can get with this.

                      Thanks a lot!!



                      • #12
                        In reply to son_nexg:
                        This is the new CASAVA 1.8 FASTQ output; you should use a newer version of FastQC to read these files. (Another thread [1] links to a development snapshot of FastQC that reads v1.8 FASTQ files correctly.)

                        hth, Sven

                        [1]= http://seqanswers.com/forums/showthr...ghlight=fastqc



                        • #13
                          In reply to son_nexg:
                          Seems like you may have Sanger quality encoding (note the run of # characters at the end of the quality line in the FASTQ). Try this:

                          fastx_quality_stats -i /PATH/Sample_XX.fastq -Q 33 -o /PATH/Sample_xx.fastq.stats
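
The arithmetic behind the error message can be checked by hand: '#' has ASCII code 35, and 35 - 64 = -29, which is the "quality value -29" reported when an offset of 64 is assumed. With an offset of 33 the same character decodes to 2. The Python sketch below illustrates this; `phred_scores` and `guess_offset` are hypothetical helper names, and the thresholds in `guess_offset` are a rough heuristic of mine, not how FASTX decides.

```python
def phred_scores(quality_string, offset):
    """Decode a FASTQ quality string into integer scores for a given offset."""
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Heuristically guess the ASCII offset (33 or 64), or None if unclear.

    Assumption: characters below ';' (ASCII 59) cannot occur with offset 64,
    and characters above 'J' (ASCII 74) are not expected with offset 33.
    """
    codes = [ord(ch) for ch in quality_string]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None


if __name__ == "__main__":
    tail = "CDEAD####"  # end of the quality line quoted in the thread
    print(phred_scores("#", 64))   # [-29], the value in the error message
    print(phred_scores("#", 33))   # [2]
    print(guess_offset(tail))      # 33
```

This is why adding -Q 33 makes fastx_quality_stats accept the file: the scores are then decoded with the offset the file actually uses.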



                          • #14
                            In reply to flobpf:
                            Thanks a lot, the -Q option worked for me!!

                            I am now on to the next step, generating the nucleotide distribution graph with:

                            fastx_nucleotide_distribution_graph.sh \
                            -i /PATH/Sample_xx.fastq.stats \
                            -o /PATH/Sample_xx.fastq.stats_nuc.png \
                            -t Sample_xx


                            But I am getting the following error:

                            line 0: undefined variable: vertical

                            line 0: undefined variable: invert


                            gnuplot> set style histogram rowstacked
                            ^
                            line 0: expecting 'data', 'function', 'line', 'fill' or 'arrow'


                            gnuplot> set style data histograms
                            ^
                            line 0: expecting 'lines', 'points', 'linespoints', 'dots', 'impulses',
                            'yerrorbars', 'xerrorbars', 'xyerrorbars', 'steps', 'fsteps',
                            'histeps', 'filledcurves', 'boxes', 'boxerrorbars', 'boxxyerrorbars',
                            'vectors', 'financebars', 'candlesticks', 'errorlines', 'xerrorlines',
                            'yerrorlines', 'xyerrorlines', 'pm3d'

                            line 0: undefined function: xtic


                            These are the first 4 lines of my stats file:

                            column count min max sum mean Q1 med Q3 IQR lW rW A_Count C_Count G_Count T_Count N_Count Max_count
                            1 12192845 2 40 446955197 36.66 38 39 39 1 37 40 1290565 4112034 5828477 943901 17868 12192845
                            2 12192845 2 40 441775514 36.23 37 39 39 2 34 40 5245648 1771654 1927735 3247808 0 12192845
                            3 12192845 2 40 441662699 36.22 37 39 39 2 34 40 2061208 2422450 2202622 5506565 0 12192845


                            Any guesses what might be happening with the input format?

                            Cheers!!
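
A quick way to rule out the input format is to check that the stats rows are internally consistent. In both stats files posted in this thread, the A, C, G, T and N counts of each row add up to the "count" column, so a sketch like the following can flag a malformed row. It is only an illustration: `check_stats_rows` is a made-up name and the consistency rule is inferred from the rows shown here, not from the FASTX documentation.

```python
def check_stats_rows(stats_text):
    """Return a list of problems found in fastx_quality_stats-style text.

    An empty list means every row has as many fields as the header and
    its A/C/G/T/N counts sum to the value in the 'count' column.
    """
    lines = [ln for ln in stats_text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    problems = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            problems.append(
                f"row {fields[0]}: expected {len(header)} fields, got {len(fields)}")
            continue
        row = dict(zip(header, fields))
        base_total = sum(int(row[b + "_Count"]) for b in "ACGTN")
        if base_total != int(row["count"]):
            problems.append(
                f"row {row['column']}: base counts sum to {base_total}, "
                f"count is {row['count']}")
    return problems


# The three data rows copied from the post above.
SAMPLE = """\
column count min max sum mean Q1 med Q3 IQR lW rW A_Count C_Count G_Count T_Count N_Count Max_count
1 12192845 2 40 446955197 36.66 38 39 39 1 37 40 1290565 4112034 5828477 943901 17868 12192845
2 12192845 2 40 441775514 36.23 37 39 39 2 34 40 5245648 1771654 1927735 3247808 0 12192845
3 12192845 2 40 441662699 36.22 37 39 39 2 34 40 2061208 2422450 2202622 5506565 0 12192845
"""

if __name__ == "__main__":
    print(check_stats_rows(SAMPLE) or "stats rows look consistent")
```

The rows posted above pass this check, which suggests the stats file itself is fine and the errors come from the plotting step rather than from the input format.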



                            • #15
                              Sorry, I don't use FASTX for this step. I find that FastQC gives better pictures.
