SEQanswers

Old 11-18-2010, 07:16 AM   #1
shandley
Member
 
Location: Saint Louis, MO

Join Date: Sep 2010
Posts: 58
Illumina FastQ Question

Hi,

I am new to working with Illumina data and am running into inconsistencies in which programs will recognize my file. For example, bowtie works just fine with the file; however, Galaxy, the FastX Toolkit and Stampy all give me error messages about line 4, which contains the quality information. I have read lots of posts about converting to Sanger and about the variety of Illumina-generated FastQ formats, but have yet to identify the issue. Some of the data is below.

Any thoughts or recommendations?

@HWUSI-EAS-100R_0003:8:1:1034:19859#NCCCCC/1
GGTATGTTAACATTCANTGAGCTATACACTTAAGATTTGTGCACTTTATCATAGTAAGTTATTTGTCAGTTTGAA
+
ddadcdeeeedddd^bBa]\ZX[^Xdddaddca^a\dddab^J]`[Y^`^^cLa^c`a`YJWOSSRb\_BBBBBBB
@HWUSI-EAS-100R_0003:8:1:1034:9367#NCCCTC/1
CACTGAATGACATGGGACTGTTTGGACAAAACGTGCTATACCTCTACCTCGGGAGGGCCGGTTACACCATACATG
+
eceeeeedefcdfcfdcc^a`c^_`_Y^^\M^][^baYa\a^K^^T[]Y^BBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS-100R_0003:8:1:1035:9515#NCTATG/1
CGCCGTTTCCCAGTAGGTCTCCTAGAACACTATTTCAATGATGAACAAGGCGAGATCGAGCTCATTGAGACCTCG
+
feefffffdfffedcdcbe\bdaa^T^^YaT^_\Sa^\^^YY^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

EDIT: Just noticed that there are 75 bp but 76 ASCII quality scores. Am I missing something here?

Many thanks,

Scott

Last edited by shandley; 11-18-2010 at 07:24 AM.
Old 11-18-2010, 07:48 AM   #2
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625

Hi Scott,

Regarding FastQ files, it is normally a good idea to run FastQC on the data; it might tell you a lot about the sequencing data.


I guess most programs will complain about the space in the sequence line, for which there is a quality value though. If you delete this space then FastQC wouldn't complain anymore.

Apparently your data is in Illumina 1.3+ encoding (Phred64 scale); however, you need to find out why the sequence line and the quality line are of different lengths, and why there is a space in the sequence in the first place...
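
For anyone unsure what the Phred64 scale means in practice: in Illumina 1.3+ encoding each quality character is the Phred score plus 64, so decoding is just a subtraction. A minimal Python sketch (the function name `decode_phred64` is made up for illustration, not taken from any tool):

```python
def decode_phred64(quality):
    """Decode an Illumina 1.3+ (Phred+64) quality string into integer scores."""
    return [ord(ch) - 64 for ch in quality]

# 'f' and 'e' are high scores; 'B' decodes to 2.
print(decode_phred64("feB"))  # [38, 37, 2]
```

Sanger-style FASTQ uses an offset of 33 instead, which is why tools need to be told which encoding a file uses.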
Old 11-18-2010, 07:56 AM   #3
shandley
Member
 
Location: Saint Louis, MO

Join Date: Sep 2010
Posts: 58

Thanks fkrueger,

FastQC recognizes and works with the data as Illumina 1.3+ format. When I run bowtie I turn the --phred64-quals option on.

The Galaxy complaint is specifically that there are 76 quality scores and only 75 nucleotides ... I guess this is where I will direct my attention and try to sort it all out. If anyone has any suggestions or has run into a similar problem please let me know!

Scott
Old 11-18-2010, 08:36 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

Quote:
Originally Posted by fkrueger View Post
I guess most programs will complain about the space in the sequence line, ...
There is no space in the sequence - it is just the forum rendering it badly, trying to help it line wrap. If the OP had used the [ code ] tag I think it would have looked like this:
Code:
@HWUSI-EAS-100R_0003:8:1:1034:19859#NCCCCC/1
GGTATGTTAACATTCANTGAGCTATACACTTAAGATTTGTGCACTTTATCATAGTAAGTTATTTGTCAGTTTGAA
+
ddadcdeeeedddd^bBa]\ZX[^Xdddaddca^a\dddab^J]`[Y^`^^cLa^c`a`YJWOSSRb\_BBBBBBB
@HWUSI-EAS-100R_0003:8:1:1034:9367#NCCCTC/1
CACTGAATGACATGGGACTGTTTGGACAAAACGTGCTATACCTCTACCTCGGGAGGGCCGGTTACACCATACATG
+
eceeeeedefcdfcfdcc^a`c^_`_Y^^\M^][^baYa\a^K^^T[]Y^BBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS-100R_0003:8:1:1035:9515#NCTATG/1
CGCCGTTTCCCAGTAGGTCTCCTAGAACACTATTTCAATGATGAACAAGGCGAGATCGAGCTCATTGAGACCTCG
+
feefffffdfffedcdcbe\bdaa^T^^YaT^_\Sa^\^^YY^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
This is not valid FASTQ since the quality strings are one character longer than the sequences. Tools which ignore the quality strings may accept this as input anyway if they don't check the length.
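
To see how many records are affected, a length check along these lines could help. This is only a sketch: it assumes plain four-line FASTQ records with no line wrapping, and the function name `find_length_mismatches` is hypothetical.

```python
import io

def find_length_mismatches(handle):
    """Yield (read_name, seq_len, qual_len) for records whose lengths differ.

    Assumes plain four-line FASTQ records (no wrapped sequence lines).
    """
    while True:
        header = handle.readline()
        if header == "":          # end of file
            break
        if not header.strip():    # skip stray blank lines
            continue
        seq = handle.readline().rstrip("\r\n")
        handle.readline()         # the '+' separator line
        qual = handle.readline().rstrip("\r\n")
        if len(seq) != len(qual):
            yield header.strip()[1:], len(seq), len(qual)

# Toy data, not the reads from this thread: r2 has one extra quality character.
toy = io.StringIO("@r1\nACGT\n+\nhhhh\n@r2\nACGT\n+\nhhhhB\n")
print(list(find_length_mismatches(toy)))  # [('r2', 4, 5)]
```

On a real file you would pass an open file handle instead of the `io.StringIO` toy input.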
Old 11-18-2010, 08:53 AM   #5
shandley
Member
 
Location: Saint Louis, MO

Join Date: Sep 2010
Posts: 58

Thanks maubp,

The data was provided by a collaborating lab. I have written them to ask if they have any insight as to why there is an extra character in the quality string. I am tempted to just eliminate the last value. The quality is low here anyways. But this does assume that the extra character was entered at the end of the string instead of the beginning or middle.

Thanks again
Old 11-18-2010, 08:57 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

Quote:
Originally Posted by shandley View Post
The data was provided by a collaborating lab. I have written them to ask if they have any insight as to why there is an extra character in the quality string.
Probably a bug in their filtering script or something
Quote:
Originally Posted by shandley View Post
I am tempted to just eliminate the last value. The quality is low here anyways. But this does assume that the extra character was entered at the end of the string instead of the beginning or middle.
Yeah - hard to say where the error is.

P.S. If you hadn't seen this before, I think the trailing BBBB... qualities are Illumina's Read Segment Quality Control Indicator, and can be stripped off, see:
http://seqanswers.com/forums/showpos...91&postcount=3
http://news.open-bio.org/news/2010/0...q2-trim-fastq/
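
As a rough illustration of what stripping those trailing B qualities amounts to, here is a minimal Python sketch. It assumes Phred64 data and a record whose sequence and quality strings are already the same length; `trim_trailing_b` is a made-up name, not the script from the links above.

```python
def trim_trailing_b(seq, qual):
    """Remove the trailing run of 'B' qualities and the bases they cover."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    keep = len(qual.rstrip("B"))
    return seq[:keep], qual[:keep]

# Only the run at the 3' end is removed; a 'B' in the middle is left alone.
print(trim_trailing_b("ACGTACGT", "hhhgfBBB"))  # ('ACGTA', 'hhhgf')
```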
Old 11-18-2010, 09:07 AM   #7
shandley
Member
 
Location: Saint Louis, MO

Join Date: Sep 2010
Posts: 58

Ahhh ... I thought those trailing BBBBBs were suspicious. Based on glancing at the overall run data in FastQC, the quality drops off dramatically in the last 10 nt or so. I think I will just eliminate the trailing quality score and run with it.

Thanks for all of the rapid and insightful responses.
Old 11-18-2010, 10:50 AM   #8
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

There is always the trade-off between trimming and losing information, or including everything and adding noise. Is it really an accepted convention now to trim all trailing B's from Illumina data? Or do aligners and variant callers that use quality values take care of the odd low-quality bases?
Old 11-18-2010, 11:03 AM   #9
shandley
Member
 
Location: Saint Louis, MO

Join Date: Sep 2010
Posts: 58

I am only going to trim the final B to alleviate the conflict I am having (at least until the sequencing center gets back to me and explains the extra quality value). I would actually prefer to include as much data as possible and tune out the low-quality stuff using assemblers. My thought on this is that 'low-quality' data actually implies there is some accurate data there. I hate to lose any of the precious information, particularly with Illumina short-read data.
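
A minimal sketch of that workaround in Python, under the unconfirmed assumption that the stray character really is the last one in the quality string (`drop_extra_quality` is a hypothetical helper name):

```python
def drop_extra_quality(seq, qual):
    """Drop the last quality character when qual is exactly one longer than seq.

    Assumes the stray character sits at the end of the quality string,
    which has not been confirmed by the sequencing center.
    """
    if len(qual) == len(seq) + 1:
        return qual[:-1]
    return qual

print(drop_extra_quality("ACGT", "hhhhB"))  # hhhh
```

Records whose lengths already match, or differ by more than one, are returned unchanged so they can be inspected separately.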
Old 11-18-2010, 12:40 PM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

I think people running out of memory with Illumina assemblies tend to be much more ruthless with their quality trimming. Read errors push up the unique kmer count, and therefore the memory requirements for something like velvet. As a result, I think the optimal trimming strategy will vary according to the amount and nature of your data, and your assembler and computer resources.
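
A toy Python illustration of the k-mer point, using invented reads rather than real data: a single base error adds k-mers that would not otherwise exist. This only counts distinct k-mers and is not how velvet is implemented; `unique_kmers` is a made-up helper.

```python
def unique_kmers(reads, k):
    """Return the set of distinct k-mers found across all reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers

clean = ["ACGTACGTAC", "ACGTACGTAC"]
noisy = ["ACGTACGTAC", "ACGTACGTAT"]  # second read has an error in its last base
print(len(unique_kmers(clean, 4)), len(unique_kmers(noisy, 4)))  # 4 5
```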

Or in other words, YMMV (your mileage may vary)
Old 11-18-2010, 02:29 PM   #11
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Certainly. Depending on the application, one can get deep coverage and then trim low-quality bases to end up with really good quality data, though I have not seen anything that shows how helpful that is in the end.

Tags
bowtie, fastq, galaxy, illumina, stampy

All times are GMT -8.