**Brian Bushnell** · 04-24-2014, 05:17 PM

The SAM specification requires read 1 and read 2 to have identical names, so the /1 and /2 will be dropped. BBMap has a flag that will retain them ("keepnames=t"), but bear in mind that the result will be an improper SAM file, so some downstream programs may not be able to process it.

I'm just guessing, but was Tophat perhaps treating the reads you renamed as single-ended?

FYI, those names look odd to me. I'm used to seeing something like this:

Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /1:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /2:CGATGT

...well, that may not be exactly right, but anyway, there's normally a space before the "/1", and everything before the space is identical for both reads (whereas in your case, they are not identical). If the space is missing, and the reads are interleaved rather than in 2 files, BBTools at least will not recognize them as paired (though obviously it would if they were in two files). Were they processed in some way (like removing all whitespace) before you got them?
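To make the comparison above concrete, here is a minimal Python sketch. It assumes a tool takes everything before the first whitespace as the read ID and drops a trailing /1 or /2 from it; the names `pair_key` and `looks_paired` are made up for illustration and are not taken from BBTools or Tophat, whose actual logic may differ.

```python
def pair_key(header: str) -> str:
    """Return the part of a FASTQ header used to match mates (illustrative only)."""
    name = header[1:] if header.startswith("@") else header
    # Everything before the first whitespace is treated as the read ID.
    parts = name.split(None, 1)
    read_id = parts[0] if parts else ""
    # Drop an old-style /1 or /2 suffix attached directly to the ID.
    if read_id.endswith(("/1", "/2")):
        read_id = read_id[:-2]
    return read_id


def looks_paired(header1: str, header2: str) -> bool:
    """True when both headers reduce to the same ID."""
    return pair_key(header1) == pair_key(header2)


# With a space before "/1", the IDs match.
with_space = looks_paired(
    "@HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /1:CGATGT",
    "@HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /2:CGATGT",
)
# Without the space, the whole line becomes the ID and the mates differ.
without_space = looks_paired(
    "@HWI-ST507:215:C18H0ACXX:2:1101:1831:1943/1:CGATGT",
    "@HWI-ST507:215:C18H0ACXX:2:1101:1831:1943/2:CGATGT",
)
print(with_space, without_space)
```

This only matters when pairing has to be inferred from names, as with interleaved input; with two separate files a tool can pair reads by position instead.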

P.S. It would be helpful if you gave your Tophat command line or explained whether these reads are in two files or interleaved in one file.

**wisense** · 04-25-2014, 08:19 PM

Hi, Brian
In my case, Read1 and Read2 are in two files, and this is my Tophat command line:

Code:

tophat -o tophat_out -G zv9.gtf bowtie_index/zv9 read1.fq.gz read2.fq.gz

As I understand it, after I rename read 2 in the alignment, read 1 and read 2 will have identical names, so Tophat will treat them as paired reads. If I don't rename read 2, Tophat will fail to treat them as paired reads.

If Tophat were treating these reads as single-end reads, the FPKMs of most genes should in theory increase, but in my case over 90% of genes have an FPKM of 0.

**Brian Bushnell** · 04-25-2014, 11:49 PM

Actually, one of my fellow researchers just mentioned to me that he was having trouble with Illumina reads using /1 and /2 without a leading space. Perhaps it's a new Illumina software version. What can I say, other than shame on Illumina... but it's very strange that Tophat does not recognize 2 files as being paired - BBMap certainly will, regardless of the read names:

bbmap.sh in1=read1.fq.gz in2=read2.fq.gz ref=zv9.fa out=mapped.sam maxindel=200000

**kmcarr** · 04-26-2014, 07:38 AM

Originally posted by Brian Bushnell

Actually, one of my fellow researchers just mentioned to me that he was having trouble with Illumina reads using /1 and /2 without a leading space. Perhaps it's a new Illumina software version. What can I say, other than shame on Illumina...

Illumina has not changed the format of its read names in years, since the release of CASAVA v1.8 in May 2011. The standard format is:

Code:

Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 2:1:0:CGATGT

The definition is:

Code:

@<instrument>:<run_number>:<flowcell_ID>:<lane>:<tile>:<x-pos>:<y-pos><space><read>:<is_filtered>:<control_number>:<index_sequence>

Notice that definition lines have two parts separated by a <space>. The first part is the ID, the second is additional information. Only the ID part is required to be identical for TopHat (and other software) to recognize the reads as mates.
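As a rough illustration of the two-part layout, the Python sketch below splits a definition line at the space and then splits the second part on ':'. The `IlluminaHeader` type and `parse_header` function are hypothetical names for this example, and it assumes exactly four description fields as in the definition above.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    read_id: str         # part before the space; must match between mates
    read: str            # 1 or 2
    is_filtered: str
    control_number: str
    index_sequence: str


def parse_header(line: str) -> IlluminaHeader:
    """Split a definition line into the ID and the four description fields."""
    body = line[1:] if line.startswith("@") else line
    read_id, sep, description = body.partition(" ")
    if not sep:
        raise ValueError("no space in definition line; the whole line is the ID")
    fields = description.split(":")
    if len(fields) != 4:
        raise ValueError(f"expected 4 description fields, got {len(fields)}")
    return IlluminaHeader(read_id, *fields)


r1 = parse_header("@HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 1:1:0:CGATGT")
r2 = parse_header("@HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 2:1:0:CGATGT")
# Mates share the ID and differ only in the <read> field.
print(r1.read_id == r2.read_id, r1.read, r2.read)
```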

Originally posted by wisense

In my opinion, after I rename the read 2 in the alignment, the read 1 and read 2 will have the identical name, so Tophat will treat these reads as paired reads.

No, after you renamed them they still did not have identical names:

Code:

Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:2:1:0:CGATGT

What apparently has happened is that somewhere along the line someone replaced the required <space> with a ':' in the definition line. This means the entire line is now considered the ID, not just the first part. Most cases I have seen of software having trouble recognizing Illumina IDs for read pairs have been caused by someone intentionally altering the IDs.
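If that is what happened, one possible repair is to put the space back. The Python sketch below is only an illustration under the assumption that every header has exactly 11 colon-separated fields (7 for the ID, 4 for the description); `restore_space` is a made-up helper name, and headers with an empty or missing index sequence would need different handling.

```python
def restore_space(line: str) -> str:
    """Turn '<7 ID fields>:<4 description fields>' back into '<ID> <description>'."""
    if " " in line:
        return line  # already in the standard two-part form
    fields = line.split(":")
    if len(fields) != 11:
        raise ValueError(f"expected 11 colon-separated fields, got {len(fields)}")
    # Rejoin the first 7 fields as the ID and the last 4 as the description.
    return ":".join(fields[:7]) + " " + ":".join(fields[7:])


fixed1 = restore_space("@HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT")
fixed2 = restore_space("@HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:2:1:0:CGATGT")
# After the repair, the part before the space is the same for both mates.
print(fixed1.split(" ")[0] == fixed2.split(" ")[0])
```

Applied to every fourth line of both FASTQ files, this would leave the ID part identical between mates while keeping the read number in the description.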

**wisense** · 04-27-2014, 04:23 PM

Hi, kmcarr
Thanks for your reply.
What I did was this:

Code:

Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT

I changed the "2" in the read name field of read 2 to "1", so read 1 and read 2 share an identical ID. Sorry for the ambiguous wording earlier.

Tophat/Cufflinks is sensitive to the read name in the gene expression estimation?
