
**maubp** · 07-21-2011, 10:12 AM

You can't have spaces in the name.

In FASTQ, as in FASTA, the first word is by convention the name/identifier, and anything after the first whitespace is a comment or description.

[I don't know why Bowtie might be inconsistent in this regard - are you sure these are really spaces in both cases?]

**ashish** · 07-21-2011, 10:25 AM

> You can't have spaces in the name.

I don't think this is true. The Wikipedia article on FASTQ gives examples where spaces are used. In fact, my FASTQ files are from CASAVA 1.8, which standardizes on using a space.

However, the SAM format specification clearly states that spaces are disallowed. Thus, any tool transferring read names from FASTQ files to SAM files needs to specify a name conversion technique.

**maubp** · 07-21-2011, 10:29 AM

> Originally posted by ashish:
>
> > You can't have spaces in the name.
>
> I don't think this is true. The Wikipedia article on FASTQ gives examples where spaces are used. In fact, my FASTQ files are from CASAVA 1.8, which standardizes on using a space.

You CAN have spaces in the @ line (and + line) of FASTQ, just like you can in the > line of FASTA.

My point is the space acts as a delimiter, the name/identifier is the first WORD of that string.

> Originally posted by ashish:
>
> However, the SAM format specification clearly states that spaces are disallowed. Thus, any tool transferring read names from FASTQ files to SAM files needs to specify a name conversion technique.

If you regard the whole string after the @ as the name in FASTQ, then yes. All the tools I've worked with take the first word.
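As a minimal sketch of the "first word" convention described in this post, the following Python splits a FASTQ `@` line at the first run of whitespace. The function name `split_fastq_header` is hypothetical, and this only illustrates the convention; it is not how Bowtie or any other specific tool is implemented.

```python
def split_fastq_header(line):
    """Split a FASTQ '@' title line into (identifier, description)."""
    if not line.startswith("@"):
        raise ValueError("FASTQ title lines must start with '@'")
    # Drop the leading '@' and the line ending.
    title = line[1:].rstrip("\r\n")
    # Split once on whitespace: the first word is the identifier,
    # the remainder (possibly empty) is the description/comment.
    parts = title.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


# Example using the header style shown later in this thread.
name, desc = split_fastq_header(
    "@HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0:\n"
)
```

Under this convention only `name` would be carried into the SAM QNAME column, which avoids the space entirely.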

**ashish** · 07-22-2011, 10:34 AM

> Originally posted by maubp:
>
> My point is the space acts as a delimiter, the name/identifier is the first WORD of that string.

I was using "name" to refer to the entire @ line. I haven't seen a specification of FASTQ that defines more structure within the line.

> Originally posted by maubp:
>
> All the tools I've worked with take the first word.

Bowtie, it seems, does not always take just the first word. Here's an example showing that the entire line, including the space, is retained:

$ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061" SL6140.fastq
@HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0:

$ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061" SL6140.sam
HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0: 4 * 0 0 * * 0 0 CNNNCAGTGAAAATTAAATTTGCCCCAAGGAACTCC <###<><6<<AAAAAAAAAAAAAAAAAAAAAAAAAA XM:i:0

And here's an example from the same files, showing that only the first word is retained:

$ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067" SL6140.fastq
@HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067 1:Y:0:

$ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067" SL6140.sam
HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067 16 chr9 3034652 255 36M * 0 0 AGTGGACATTTCTAAATTTTCCACCTTTTTCAGNNT 9:83@:@@@@@@@@@@3::::99999)+(.+,-##> XA:i:2 MD:Z:33T0T1 NM:i:2

**maubp** · 07-22-2011, 11:50 AM

> Originally posted by ashish:
>
> I was using "name" to refer to the entire @ line. I haven't seen a specification of FASTQ that defines more structure within the line.

We tried to make the "first word is the identifier" point clear here:

http://dx.doi.org/10.1093/nar/gkp1137

Likewise I thought the Wikipedia page was fairly clear:

http://en.wikipedia.org/wiki/FASTQ_format

**maubp** · 07-22-2011, 11:54 AM

> Originally posted by ashish:
>
> Here's an example ...

That is strange and looks like a bug in bowtie to me.

Try piping those grep results into hexdump to double-check that it is a space (chr 32, 0x20), and not some other non-printing character.
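For readers without hexdump to hand, the same check can be sketched in Python. The function name `suspicious_characters` is hypothetical; it simply reports every character in a line that is neither a plain space (0x20) nor printable ASCII, so a tab or a non-breaking space masquerading as a space would show up.

```python
def suspicious_characters(line):
    """Return (index, hex code) for each character outside 0x20-0x7E."""
    found = []
    for index, char in enumerate(line.rstrip("\r\n")):
        code = ord(char)
        # 0x20 is an ordinary space; 0x21-0x7E is printable ASCII.
        if not 0x20 <= code <= 0x7E:
            found.append((index, hex(code)))
    return found


# An empty list means the separator really is an ordinary space.
clean = suspicious_characters("read1 1:Y:0:\n")
with_tab = suspicious_characters("read1\t1:Y:0:\n")
```

An empty result for both the FASTQ and SAM lines would support the conclusion that the separator is a genuine space in both files.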

**chadn737** · 07-22-2011, 12:19 PM

> Originally posted by ashish:
>
> I was using "name" to refer to the entire @ line. I haven't seen a specification of FASTQ that defines more structure within the line.
>
> Bowtie, it seems, does not always take just the first word. Here's an example showing that the entire line, including the space, is retained:
>
> $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061" SL6140.fastq
> @HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0:
>
> $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061" SL6140.sam
> HWUSI-EAS1758R:20:70KK0AAXX:4:1:7887:1061 1:Y:0: 4 * 0 0 * * 0 0 CNNNCAGTGAAAATTAAATTTGCCCCAAGGAACTCC <###<><6<<AAAAAAAAAAAAAAAAAAAAAAAAAA XM:i:0
>
> And here's an example from the same files, showing that only the first word is retained:
>
> $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067" SL6140.fastq
> @HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067 1:Y:0:
>
> $ grep "HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067" SL6140.sam
> HWUSI-EAS1758R:20:70KK0AAXX:4:1:13505:1067 16 chr9 3034652 255 36M * 0 0 AGTGGACATTTCTAAATTTTCCACCTTTTTCAGNNT 9:83@:@@@@@@@@@@3::::99999)+(.+,-##> XA:i:2 MD:Z:33T0T1 NM:i:2

I don't know if this has anything to do with it, but in the two examples you have just given, the read whose name is unmodified (space retained) is an unaligned read (FLAG 4), whereas the read whose name is modified (first word only) is clearly an aligned read (FLAG 16). Is this true for all the examples you see?
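One way to test this hypothesis across a whole file is to tabulate SAM records by whether they are unmapped (FLAG bit 0x4) and whether the first tab-separated column still contains a space. The sketch below assumes tab-delimited SAM lines; the function name `tally_names_by_alignment` is hypothetical, and the file name in the usage comment is taken from the thread.

```python
from collections import Counter


def tally_names_by_alignment(sam_lines):
    """Count records keyed by (is_unmapped, qname_contains_space)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines such as @HD and @SQ
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # not a usable alignment record
        qname, flag = fields[0], int(fields[1])
        is_unmapped = bool(flag & 0x4)  # bit 0x4: segment unmapped
        counts[(is_unmapped, " " in qname)] += 1
    return counts


# Usage on a real file (path assumed from the thread):
#     with open("SL6140.sam") as handle:
#         print(tally_names_by_alignment(handle))
```

If the pattern holds, all counts would fall in the `(True, True)` and `(False, False)` cells, meaning spaces survive only in unaligned records.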

**ashish** · 07-22-2011, 12:23 PM

> Originally posted by maubp:
>
> Try piping those grep results into hexdump to double-check that it is a space (chr 32, 0x20), and not some other non-printing character.

Good idea. I did that and confirmed that they are always space characters in both the input FASTQ file and the output SAM file.

**maubp** · 07-22-2011, 12:33 PM

> Originally posted by chadn737:
>
> I don't know if this has anything to do with it, but in the two examples you have just given, the read whose name is unmodified is an unaligned read, whereas the read whose name is modified is clearly an aligned read. Is this true for all the examples you see?

If you're right, that should make it much easier to trace the bug inside Bowtie. Well spotted.


Bowtie changes read names in SAM output
