
**fkrueger** · 11-18-2010, 08:48 AM

Hi Scott,

Regarding FastQ files, it is normally a good idea to run FastQC on the data; it might tell you a lot about the sequencing data.

I guess most programs will complain about the space in the sequence line, for which there is a quality value though. If you delete this space then FastQC wouldn't complain anymore.

Apparently your data is in Illumina 1.3+ encoding (Phred64 scale). However, you need to find out why the sequence lines and the quality lines are of different lengths, and why there is a space in the sequence in the first place...
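
For readers unfamiliar with the Phred64 scale mentioned above, here is a minimal Python sketch of how such a quality string decodes: each character's ASCII code minus 64 gives the score. The function name `phred64_to_scores` is made up for illustration and is not part of FastQC or any other tool.

```python
def phred64_to_scores(quality: str) -> list[int]:
    """Decode an Illumina 1.3+ (Phred+64) quality string into integer scores."""
    # Each quality character encodes one score as chr(score + 64).
    return [ord(ch) - 64 for ch in quality]


# 'd' is ASCII 100 and 'B' is ASCII 66.
print(phred64_to_scores("dB"))  # [36, 2]
```

Under this encoding the character `B` corresponds to a score of 2.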

**shandley** · 11-18-2010, 08:56 AM

Thanks fkrueger,

FastQC recognizes and works with the data as Illumina 1.3+ format. When I run bowtie I turn the --phred64-quals option on.

The Galaxy complaint is specifically that there are 76 quality scores and only 75 nucleotides ... I guess this is where I will direct my attention and try to sort it all out. If anyone has any suggestions or has run into a similar problem please let me know!

Scott

**maubp** · 11-18-2010, 09:36 AM

Originally posted by fkrueger

I guess most programs will complain about the space in the sequence line, ...

There is no space in the sequence - it is just the forum rendering it badly, trying to help it line wrap. If the OP had used the [ code ] tag I think it would have looked like this:

```
@HWUSI-EAS-100R_0003:8:1:1034:19859#NCCCCC/1
GGTATGTTAACATTCANTGAGCTATACACTTAAGATTTGTGCACTTTATCATAGTAAGTTATTTGTCAGTTTGAA
+
ddadcdeeeedddd^bBa]\ZX[^Xdddaddca^a\dddab^J]`[Y^`^^cLa^c`a`YJWOSSRb\_BBBBBBB
@HWUSI-EAS-100R_0003:8:1:1034:9367#NCCCTC/1
CACTGAATGACATGGGACTGTTTGGACAAAACGTGCTATACCTCTACCTCGGGAGGGCCGGTTACACCATACATG
+
eceeeeedefcdfcfdcc^a`c^_`_Y^^\M^][^baYa\a^K^^T[]Y^BBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS-100R_0003:8:1:1035:9515#NCTATG/1
CGCCGTTTCCCAGTAGGTCTCCTAGAACACTATTTCAATGATGAACAAGGCGAGATCGAGCTCATTGAGACCTCG
+
feefffffdfffedcdcbe\bdaa^T^^YaT^_\Sa^\^^YY^^^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
```

This is not valid FASTQ since the quality strings are one character longer than the sequences. Tools which ignore the quality strings may accept this as input anyway if they don't check the length.
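
A quick way to see how widespread the mismatch is would be to scan the file and report every record whose sequence and quality lengths differ. The sketch below is only an illustration: it assumes plain four-line FASTQ records (no line-wrapped sequences), and the function name `mismatched_records` is made up.

```python
from typing import Iterable, Iterator


def mismatched_records(lines: Iterable[str]) -> Iterator[tuple[str, int, int]]:
    """Yield (read_id, seq_len, qual_len) for records whose lengths differ.

    Assumes strict four-line FASTQ records: header, sequence, '+', quality.
    """
    stripped = [line.rstrip("\r\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"record starting at line {i + 1} is not 4-line FASTQ")
        if len(seq) != len(qual):
            yield header[1:], len(seq), len(qual)


# Toy input: the second record has one quality character too many.
example = ["@read1", "ACGT", "+", "dddd", "@read2", "ACGT", "+", "ddddB"]
print(list(mismatched_records(example)))  # [('read2', 4, 5)]
```

For a real file, something like `mismatched_records(open("reads.fastq"))` would work, where `reads.fastq` is a placeholder file name.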

**shandley** · 11-18-2010, 09:53 AM

Thanks maubp,

The data was provided by a collaborating lab. I have written them to ask if they have any insight as to why there is an extra character in the quality string. I am tempted to just eliminate the last value. The quality is low here anyways. But this does assume that the extra character was entered at the end of the string instead of the beginning or middle.

Thanks again

**maubp** · 11-18-2010, 09:57 AM

Originally posted by shandley

The data was provided by a collaborating lab. I have written them to ask if they have any insight as to why there is an extra character in the quality string.

Probably a bug in their filtering script or something

Originally posted by shandley

I am tempted to just eliminate the last value. The quality is low here anyways. But this does assume that the extra character was entered at the end of the string instead of the beginning or middle.

Yeah - hard to say where the error is.

P.S. If you hadn't seen this before, I think the trailing BBBB... qualities are Illumina's Read Segment Quality Control Indicator, and can be stripped off, see:

Illumina FASTQ Quality Scores - Missing Value - SEQanswers
http://seqanswers.com/forums/showpost.php?p=17491&postcount=3

OBF » Illumina FASTQ files – Read Segment Quality Control Indicator
http://news.open-bio.org/news/2010/04/illumina-q2-trim-fastq/
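
As a rough sketch of what stripping the trailing `B` run could look like in Python: the helper below removes the run of `B` characters at the end of a quality string and cuts the sequence to the same length. The name `trim_q2_tail` is made up for illustration, and the sketch assumes the sequence and quality strings already have equal lengths, which is not yet true for the data in this thread.

```python
def trim_q2_tail(seq: str, qual: str) -> tuple[str, str]:
    """Remove the trailing run of 'B' qualities and the matching bases."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ; fix that first")
    # Only the run of 'B' at the very end is removed; internal 'B's are kept.
    kept = len(qual.rstrip("B"))
    return seq[:kept], qual[:kept]


print(trim_q2_tail("ACGTACGT", "ddBdaBBB"))  # ('ACGTA', 'ddBda')
```

A read whose qualities are all `B` comes back as two empty strings, so downstream code would need to decide whether to drop such reads.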

**shandley** · 11-18-2010, 10:07 AM

Ahhh ... I thought those trailing BBBBB were suspicious. Based on glancing at the overall run data in FastQC, the quality drops off dramatically in the last 10 nt's or so. I think I will just eliminate the trailing quality score and run with it.

Thanks for all of the rapid and insightful responses.

**bioinfosm** · 11-18-2010, 11:50 AM

There is always the trade-off between trimming and losing information, or including everything and adding noise. Is it really an accepted convention now to trim all trailing B's from Illumina data? Or do aligners and variant callers that use quality values take care of the odd low-quality bases?

**shandley** · 11-18-2010, 12:03 PM

I am only going to trim the final B to alleviate the conflict I am having (at least until the sequencing center gets back to me and explains the extra quality value). I would actually prefer to include as much data as possible and tune out the low quality stuff using assemblers. My thought on this is that 'low-quality' data actually implies there is some accurate data there. I hate to lose any of the precious information, particularly with Illumina short-read data.
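
A minimal Python sketch of that stopgap, for anyone wanting to try the same thing: it drops the last quality character only when the quality string is exactly one character longer than the sequence. The function name `drop_extra_quality` is made up, and the sketch rests on the unverified assumption discussed earlier in the thread, namely that the extra character sits at the end of the string.

```python
def drop_extra_quality(seq: str, qual: str) -> str:
    """Return a quality string matching len(seq), dropping one trailing char.

    ASSUMPTION: the surplus character is the last one. Nothing in the data
    confirms this; it could equally sit at the start or in the middle.
    """
    extra = len(qual) - len(seq)
    if extra == 0:
        return qual
    if extra == 1:
        return qual[:-1]
    raise ValueError(f"unexpected length difference of {extra}")


print(drop_extra_quality("ACGT", "dddBB"))  # dddB
```

Any other length difference raises an error rather than guessing.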

**maubp** · 11-18-2010, 01:40 PM

I think people running out of memory with Illumina assemblies tend to be much more ruthless with their quality trimming. Read errors push up the unique kmer count, and therefore the memory requirements for something like velvet. As a result, I think the optimal trimming strategy will vary according to the amount and nature of your data, and your assembler and computer resources.
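
To illustrate the point about read errors inflating the unique k-mer count, here is a toy Python sketch. It is a simplification: it counts k-mers on one strand only, whereas a real assembler's bookkeeping is more involved, and the function name `unique_kmers` is made up for illustration.

```python
def unique_kmers(reads: list[str], k: int) -> int:
    """Count distinct k-mers across all reads (forward strand only)."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return len(kmers)


clean = ["ACGTACGTGA", "ACGTACGTGA", "ACGTACGTGA"]
noisy = ["ACGTACGTGA", "ACGTACGTGA", "ACGTACCTGA"]  # one substitution error

print(unique_kmers(clean, 4))  # 6
print(unique_kmers(noisy, 4))  # 10
```

A single substitution in one read adds four new 4-mers here, since every k-mer overlapping the error is new.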

Or in other words, YMMV (your mileage may vary)

**bioinfosm** · 11-18-2010, 03:29 PM

Certainly. Depending on the application, one can get deep coverage and then trim the low-Q bases to end up with really good quality data. Though I have not seen anything that shows how helpful that is in the end.

Illumina FastQ Question