SEQanswers

Old 12-04-2011, 03:45 PM   #1
brachysclereid
Member
 
Location: California

Join Date: Feb 2011
Posts: 32
Ideas on collecting quality scores per base in an Illumina FASTQ file

Hi,

I am trying to make per-base quality plots like the ones FastQC produces, because I would like to customize the reporting. The summary stats in FastQC's text export are difficult to work with in R. It would be better to have a list of the quality scores and let R do the work/stats. Does anyone know of a program that will generate a raw list of quality scores per base from a fastq file? If not, I think it should be pretty easy to write a Perl script for this. I thought it would be worth asking...

thanks
Old 12-04-2011, 04:16 PM   #2
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 823

Do you want something more than what fastx_quality_stats from the FASTX-Toolkit can provide?
Code:
usage: fastx_quality_stats [-h] [-N] [-i INFILE] [-o OUTFILE]
...
   [-N]         = New output format (with more information per nucleotide/cycle).
...
The *NEW* output format:
        cycle (previously called 'column') = cycle number
        max-count
        For each nucleotide in the cycle (ALL/A/C/G/T/N):
                count   = number of bases found in this column.
                min     = Lowest quality score value found in this column.
                max     = Highest quality score value found in this column.
                sum     = Sum of quality score values for this column.
                mean    = Mean quality score value for this column.
                Q1      = 1st quartile quality score.
                med     = Median quality score.
                Q3      = 3rd quartile quality score.
                IQR     = Inter-Quartile range (Q3-Q1).
                lW      = 'Left-Whisker' value (for boxplotting).
                rW      = 'Right-Whisker' value (for boxplotting).
Old 12-04-2011, 05:46 PM   #3
dgtnk
Junior Member
 
Location: Shanghai

Join Date: Nov 2011
Posts: 4

I agree with gringer.

fastx_quality_stats from the FASTX-Toolkit works well. It will not give you the raw list of quality scores, but it does report the quartiles of the quality scores at each read position, which you can use for boxplots in R.
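
For anyone who wants to check what such per-position quartiles look like, here is a minimal Python sketch. It is not how fastx_quality_stats itself computes them: the function name `per_position_quartiles` is made up for illustration, the input is assumed to be already-decoded Phred scores (one list per read), and the "inclusive" quartile method of Python's `statistics.quantiles` is an arbitrary choice that may differ slightly from other tools.

```python
from statistics import quantiles

def per_position_quartiles(reads_scores):
    """Return {position: (Q1, median, Q3)} for lists of per-read Phred scores."""
    columns = {}
    for scores in reads_scores:
        # Positions are 1-based, like sequencing cycles.
        for position, score in enumerate(scores, start=1):
            columns.setdefault(position, []).append(score)

    summary = {}
    for position in sorted(columns):
        values = columns[position]
        if len(values) < 2:
            # A single observation is its own quartile.
            summary[position] = (values[0],) * 3
        else:
            summary[position] = tuple(quantiles(values, n=4, method="inclusive"))
    return summary

# Five toy reads; the last one is shorter than the others.
reads = [[10, 30], [20, 30], [30, 30], [40, 30], [50]]
summary = per_position_quartiles(reads)
for position, (q1, med, q3) in summary.items():
    print(position, q1, med, q3)
```

Collecting scores by position (rather than by read) is what makes reads of different lengths harmless here: short reads simply contribute nothing to the later positions.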
Old 12-04-2011, 06:00 PM   #4
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 163

Try using QualityScore in ShortRead.
Old 12-04-2011, 11:17 PM   #5
Blahah404
Member
 
Location: Cambridge, UK

Join Date: Dec 2011
Posts: 48

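You can easily extract a .qual file containing per-base quality scores from a fastq file, for example using Biopython:
Code:
#!/usr/bin/env python

"""Usage: fastq2qual.py filename
    where filename is the name of a .fastq file, given without the extension
    will produce: filename.qual
"""

import sys
from Bio import SeqIO

# Base name of the input file, e.g. "reads" for reads.fastq
file_name = sys.argv[1]

# "fastq" assumes Sanger-style (Phred+33) qualities; older Illumina
# (1.3-1.7) files need "fastq-illumina" as the input format instead.
SeqIO.convert(file_name + ".fastq", "fastq", file_name + ".qual", "qual")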
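
If Biopython is not available, roughly the same per-base scores can be pulled out with plain Python. The following is only a sketch under a few assumptions: records are exactly four lines each (no wrapped sequence lines), the quality offset is 33 unless you pass another one, and the function names `read_fastq_qualities` and `write_quality_table` are invented for this example. It writes one tab-separated row of scores per read instead of a .qual file.

```python
import io

def read_fastq_qualities(handle, offset=33):
    """Yield one list of Phred scores per read from a 4-line-per-record FASTQ."""
    for line_number, line in enumerate(handle):
        # The quality string is the 4th line of every record (index 3, 7, 11, ...).
        if line_number % 4 == 3:
            yield [ord(char) - offset for char in line.rstrip("\r\n")]

def write_quality_table(fastq_handle, out_handle, offset=33):
    """Write one tab-separated row of scores per read; return the number of reads."""
    reads = 0
    for scores in read_fastq_qualities(fastq_handle, offset):
        out_handle.write("\t".join(map(str, scores)) + "\n")
        reads += 1
    return reads

# Tiny in-memory demo; with real data, pass open("reads.fastq") instead.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACG\n+\n!5I\n")
table = io.StringIO()
n_reads = write_quality_table(demo, table)
print(table.getvalue(), end="")
```

Because it streams the file line by line, memory use stays flat no matter how large the fastq file is.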
Old 12-05-2011, 03:29 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

Quote:
Originally Posted by Dario1984 View Post
Try using QualityScore in ShortRead.
+1

If you want to use R for the plotting and analysis, why not use R to read the FASTQ files as well?
Old 12-05-2011, 05:27 AM   #7
kwyattm
Junior Member
 
Location: Johns Hopkins

Join Date: Jul 2011
Posts: 7

The way I handled this was to write a Perl script that 1) converts qseq to fastq, 2) trims adaptors, and 3) writes the quality score data to a text file. The text file is then imported into R and graphed. I even get the graphs written to a PDF and e-mailed to me when everything is done!
Old 12-05-2011, 05:33 AM   #8
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 823

qseq -> fastq is already done in CASAVA, most likely including the removal of any adaptor sequences. CASAVA 1.8+ processes the intensity files directly into fastq:

http://seqanswers.com/forums/showthread.php?t=13147
Old 12-05-2011, 05:36 AM   #9
kwyattm
Junior Member
 
Location: Johns Hopkins

Join Date: Jul 2011
Posts: 7
Yep!

Quote:
Originally Posted by gringer View Post
qseq -> fastq is already done in CASAVA, most likely including the removal of any adaptor sequences. CASAVA 1.8+ processes the intensity files directly into fastq:

http://seqanswers.com/forums/showthread.php?t=13147
Thanks, gringer! Yeah, I knew about this; it's just an old script. Just passing along the information I had!
Old 12-05-2011, 06:57 AM   #10
brachysclereid
Member
 
Location: California

Join Date: Feb 2011
Posts: 32
Ideas on q scores

Thanks!

I used the biopython suggestion and now have the .qual files. This is what I wanted.

kwyattm,
Is there a tool that will take a random sample of the .qual file in R for the purpose of plotting? I am curious about what you are using to make the plots.

Thanks again!
Old 12-05-2011, 07:44 AM   #11
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 823

Quote:
I used the biopython suggestion and now have the .qual files. This is what I wanted.
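Just as a word of caution, you need to make sure the quality offset (base) is correct. Different sequencers have in the past used different offsets / ASCII values to represent the same qualities: Sanger-style and Illumina 1.8+ files use Phred+33, while Illumina 1.3 to 1.7 files use Phred+64.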
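
A quick way to sanity-check the offset is to look at the lowest ASCII value that occurs in the quality strings. The sketch below is only a heuristic, and the function name `guess_quality_offset` is made up for this example: characters below ASCII 59 can only come from a Phred+33 file, but a Phred+33 file that happens to contain nothing below Q26 would be wrongly reported as offset 64, so treat the answer as a hint rather than proof.

```python
def guess_quality_offset(quality_strings):
    """Guess 33 or 64 from the lowest quality character seen; None if no data."""
    lowest = min(
        (ord(char) for quality in quality_strings for char in quality),
        default=None,
    )
    if lowest is None:
        return None
    # ASCII 59 (';') is the lowest character used by the old +64 encodings,
    # so anything below it implies a Phred+33 file.
    if lowest < 59:
        return 33
    return 64

print(guess_quality_offset(["IIII", "!5I"]))   # contains '!' (ASCII 33)
print(guess_quality_offset(["hhhh", "B@hh"]))  # lowest is '@' (ASCII 64)
```

Running it over the first few thousand reads is usually enough, since low-quality bases tend to show up early in any real run.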

Quote:
Is there a tool that will take a random sample of the .qual file in R for the purpose of plotting?
You can randomly sample data in R with the 'sample' function, but boxplot should be able to cope with the full dataset. The FASTX-Toolkit also has a tool for plotting quality statistics (fastq_quality_boxplot_graph), in case you want something ready-made.
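
If the file is too large to load before sampling, the sampling can also be done while the file is being read. Here is a minimal sketch of reservoir sampling (the classic "Algorithm R") in Python; it is an alternative to R's 'sample', not what R does internally, and the function name `reservoir_sample` and its parameters are invented for illustration.

```python
import random

def reservoir_sample(items, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample

# Demo on plain numbers; in practice `items` could be the lines of a .qual file.
picked = reservoir_sample(range(1000), 10, seed=1)
print(len(picked))
```

Only k items are ever held in memory, and passing a seed makes the sample reproducible between runs.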
Old 12-05-2011, 01:00 PM   #12
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 163

Since he is working in R, it seems much more straightforward to read it in R.

e.g.
Code:
library(ShortRead)
fastqs <- readFastq("/path/to/fastqs")
qualities <- quality(fastqs)