SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
gene and isoform expression level estimation with Cufflinks Jane M RNA Sequencing 1 09-01-2011 12:47 AM
help with differential gene expression with cufflinks and tophat waterboy Bioinformatics 1 11-28-2010 10:51 AM
Tophat/cufflinks; a way to assign read to a gene poisson200 Bioinformatics 2 07-27-2010 02:09 AM
Cufflinks and gene expression McBryan Bioinformatics 4 06-08-2010 11:13 AM

Reply
 
Thread Tools
Old 04-24-2014, 05:33 PM   #1
wisense
Member
 
Location: Shanghai,China

Join Date: Sep 2011
Posts: 30
Question Tophat/Cufflinks is sensitive to the read name in the gene expression estimation?

Hi, all
Recently, I have found that tophat/cufflinks is sensitive to the read name in the alignment.
In a project, I have a sequencing paired-end reads, of which the read names are as follows:
Code:
Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT/1
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:2:1:0:CGATGT/2
After mapping these reads to the reference by tophat, the read names in the alignment were like:
Code:
Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:2:1:0:CGATGT
Obiously, they were different. So when I only used the proper paired reads in the alignment as the input for Cufflinks, I got over 90% genes of 0 FPKM.
After manually changing the name of Read 2 to the same as Read 1, I got a proper result with over 70% genes expressed.

It seems that Tophat/Cufflinks is sensitive to the read name. Dose anyone have experience in this condition?

Last edited by wisense; 04-24-2014 at 05:39 PM. Reason: some syntax error
wisense is offline   Reply With Quote
Old 04-24-2014, 06:17 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The sam specification requires read 1 and read 2 to have identical names, so the /1 and /2 will be dropped. BBMap has a flag that will retain them ("keepnames=t"), but bear in mind that the result will be an improper sam file, so some downstream programs may not be able to process it.

I'm just guessing, but probably, Tophat was treating the reads you renamed as single-ended?

FYI, those names look odd to me. I'm used to seeing something like this:

Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /1:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /2:CGATGT

...well, that may not be exactly right, but anyway, there's normally a space before the "/1", and everything before the space is identical for both reads (whereas in your case, they are not identical). If the space is missing, and the reads are interleaved rather than in 2 files, BBTools at least will not recognize them as paired (though obviously it would if they were in two files). Were they processed in some way (like removing all whitespace) before you got them?

P.S. It would be helpful if you gave your Tophat command line or explained whether these reads are in two files or interleaved in one file.

Last edited by Brian Bushnell; 04-24-2014 at 06:20 PM.
Brian Bushnell is offline   Reply With Quote
Old 04-25-2014, 09:19 PM   #3
wisense
Member
 
Location: Shanghai,China

Join Date: Sep 2011
Posts: 30
Default

Quote:
Originally Posted by Brian Bushnell View Post
The sam specification requires read 1 and read 2 to have identical names, so the /1 and /2 will be dropped. BBMap has a flag that will retain them ("keepnames=t"), but bear in mind that the result will be an improper sam file, so some downstream programs may not be able to process it.

I'm just guessing, but probably, Tophat was treating the reads you renamed as single-ended?

FYI, those names look odd to me. I'm used to seeing something like this:

Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /1:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 /2:CGATGT

...well, that may not be exactly right, but anyway, there's normally a space before the "/1", and everything before the space is identical for both reads (whereas in your case, they are not identical). If the space is missing, and the reads are interleaved rather than in 2 files, BBTools at least will not recognize them as paired (though obviously it would if they were in two files). Were they processed in some way (like removing all whitespace) before you got them?

P.S. It would be helpful if you gave your Tophat command line or explained whether these reads are in two files or interleaved in one file.
Hi, Brian
In my case, Read1 and Read2 are in two files, and this is my Tophat command line:
Code:
tophat -o tophat_out -G zv9.gtf bowtie_index/zv9 read1.fq.gz read2.fq.gz
In my opinion, after I rename the read 2 in the alignment, the read 1 and read 2 will have the identical name, so Tophat will treat these reads as paired reads. If I don't rename the read 2, Tophat will fail to treat these reads as paired reads.

If Tophat treat these reads as single-end reads, theoretically the FPKMs of most genes will increase, but in my case, over 90% genes with FPKM equal to 0.
wisense is offline   Reply With Quote
Old 04-26-2014, 12:49 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Actually, one of my fellow researchers just mentioned to me that he was having trouble with Illumina reads using /1 and /2 without a leading space. Perhaps it's a new Illumina software version. What can I say, other than shame on Illumina... but it's very strange that Tophat does not recognize 2 files as being paired - BBMap certainly will, regardless of the read names:

bbmap.sh in1=read1.fq.gz in2=read2.fq.gz ref=zv9.fa out=mapped.sam maxindel=200000
Brian Bushnell is offline   Reply With Quote
Old 04-26-2014, 08:38 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,154
Default

Quote:
Originally Posted by Brian Bushnell View Post
Actually, one of my fellow researchers just mentioned to me that he was having trouble with Illumina reads using /1 and /2 without a leading space. Perhaps it's a new Illumina software version. What can I say, other than shame on Illumina...
Illumina has not changed the format of its read names in years, since the release of v1.8 in May 2011. The standard format is:

Code:
Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 2:1:0:CGATGT
The definition is:

Code:
@<instrument>:<run_number>:<flowcell_ID>:<lane>:<tile>:<x-pos>:<y-pos><space><read>:<is_filtered>:<control_number>:<index_sequence>
Notice that definition lines have two parts separated by a <space>. The first part is the ID, the second is additional information. Only the ID part is required to be identical for TopHat (and other software) to recognize the reads as mates.

Quote:
Originally Posted by wisense
In my opinion, after I rename the read 2 in the alignment, the read 1 and read 2 will have the identical name, so Tophat will treat these reads as paired reads.
No, after you renamed them they still did not have identical names:

Code:
Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:2:1:0:CGATGT
What apparently has happened is the somewhere along the line someone replaced the required <space> with a ':' in the definition line. This means the entire line is now considered the ID, not just the first part. Most cases I have seen with software having trouble recognizing Illumina IDs for read pairs have been caused by someone has intentionally altering the ID's.
kmcarr is offline   Reply With Quote
Old 04-27-2014, 05:23 PM   #6
wisense
Member
 
Location: Shanghai,China

Join Date: Sep 2011
Posts: 30
Default

Quote:
Originally Posted by kmcarr View Post
Illumina has not changed the format of its read names in years, since the release of v1.8 in May 2011. The standard format is:

Code:
Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943 2:1:0:CGATGT
The definition is:

Code:
@<instrument>:<run_number>:<flowcell_ID>:<lane>:<tile>:<x-pos>:<y-pos><space><read>:<is_filtered>:<control_number>:<index_sequence>
Notice that definition lines have two parts separated by a <space>. The first part is the ID, the second is additional information. Only the ID part is required to be identical for TopHat (and other software) to recognize the reads as mates.



No, after you renamed them they still did not have identical names:

Code:
Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:2:1:0:CGATGT
What apparently has happened is the somewhere along the line someone replaced the required <space> with a ':' in the definition line. This means the entire line is now considered the ID, not just the first part. Most cases I have seen with software having trouble recognizing Illumina IDs for read pairs have been caused by someone has intentionally altering the ID's.
Hi, kmcarr
Thanks for your reply.
What I have done was like:
Code:
Read 1: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT
Read 2: @HWI-ST507:215:C18H0ACXX:2:1101:1831:1943:1:1:0:CGATGT
I have changed the "2" in the read name field of read 2 to "1", so read 1 and read 2 share the identical ID, sorry for the ambigous expression before.
wisense is offline   Reply With Quote
Reply

Tags
cufflinks, ngs, sequencing, softerware, tophat

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:37 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO